Monday, June 6, 2011

12-05-08 - 64 Bit Multiply

Something that I've found myself wanting to do a lot recently is multiply two 32 bit numbers, and then
take the top 32 bit dword from the 64 bit result. In C this looks like :

U32 mul32by32returnHi(U32 a,U32 b)
U64 product = (U64) a * b;
U32 ret = (U32)(product >> 32);
return ret;

That works fine and all, but the C compiler doesn't understand that you're doing something very simple. It
generates absolutely disgusting code. In particular, it actually promotes a & b to 64 bit, and then calls
a function called "_allmul" in the Windows NT RTL. This allmul function does a bunch of carries and such to
make 64-by-64 bit multiply work via the 32+32->64 multiply instruction in x86. You wind up with a function
that takes 60 clocks when it could take 6 clocks :

U32 mul32by32returnHi(U32 a,U32 b)
mov eax, a
mul b
mov eax,edx

Now, that's nice and all, the problem is that tiny inline asm functions just don't work in MSVC these days.
Calling a big chunk of ASM is fine, but using a tiny bit of ASM causes the optimizer to disable
all its really fancy optimizations, and you wind up with a function that's slower overall.

What I'd like is a way to get at some of the x86 capabilities that you can't easily get from C. The other
thing I often find myself wanting is a way to get at the carry flag, or something like "sbb" or "adc".
The main uses of those are Terje's cool tricks to
get the sign of an int or
add with saturation

It would be awesome if you could define your own little mini assembly functions that acted just like the
built-in "intrinsics". It would be totally fine if you had to add a bunch of extra data to tell the compiler
about dependencies or side effects or whatever, if you could just make little assembly widgets that worked
as neatly as their own intrinsics it would be totally awesome for little ops.

ADDENDUM : urg ; so I tested Won's suggestion of Int32x32To64 and found it was doing the right thing (actually
UInt32x32To64 - and BTW that's just a macro that's identical to my first code snippet - it doesn't seem to be
a magic intrinsic, though it does have platform switches so you should probably use that macro). That confused
me and didn't match what I was seeing, so I did some more tests...

It turns out the first code snippet of mul32by32returnHi actually *will* compile to the right thing - but only
if you call it from simple functions. If you call it from a complex function the compiler gets confused, but if
you call it from a little tight test loop function it does the right thing. URG.

Here's what it does if I try to use it in my StringHash interpolation search test code :

int start = mul32by32returnHi( query.hash, indexrange );
004087A5 xor eax,eax
004087A7 push eax
004087A8 mov eax,dword ptr [esp+38h]
004087AC push eax
004087AD push 0
004087AF push ebx
004087B0 mov dword ptr [esi],ebx
004087B2 call _allmul (40EC00h)
004087B7 mov ecx,dword ptr [esp+14h]

You can see it's extending the dwords to 64 bits, pushing them all, calling the function, then
grabbing just one dword from the results. Ugh.

And here's what it does in a tight little test loop :

U32 dw = *dwp++;
hash = mul32by32returnHi(hash,dw) ^ dw;
00401120 mov eax,ecx
00401122 mul eax,edx
00401124 add esi,4
00401127 xor edx,ecx
00401129 mov ecx,dword ptr [esi]

Which not only is correct use of 'mul' but also has got crazy reordering of the loop which makes it a lot
faster than my inline ASM version.

Saturday, June 4, 2011

06-04-11 - Keep Case

I've been meaning to do this for a long time and finally got off my ass.

TR (text replace) and zren (rename) in ChukSH now support "keep case".

Keep case is pretty much what you always want when you do text replacement (especially in source code), and everybody should copy me. For example
when I do a find-replace from "lzp1f" -> "lzp1g" what I want is :

lzp1f -> lzp1g (lower->lower)
LZP1F -> LZP1G (upper->upper)
Lzp1f -> Lzp1g (first cap->first cap)
Lzp1G -> Lzp1G (mixed -> mixed)

The kernel that does this is matchpat in cblib which will handle rename masks like : "poop*face" -> "shit*butt" with
keep case option or not.

In a mixed-wild-literal renaming spec like that, the "keep case" applies only to the literal parts. That is,
"poop -> shit" and "face -> butt" will be applied with keep-case independently , the "*" part will just get copied.

eg :

Poop3your3FACE -> Shit3your3BUTT

Also, because keep-case is applied to an entire chunk of literals, it can behave somewhat unexpectedly on file
renames. For example if you rename

src\lzp* -> src\zzh*

the keep-case will apply to the whole chunk "src\lzp" , so if you have a file like "src\LZP" that will be
considered "mixed case" not "all upper". Sometimes my intuition expects the rename to work on the file part,
not the full path. (todo : add an option to separate the case-keeping units by path delims)

The way I handle "mixed case" is I leave it up to the user to provide the mixed case version they want. It's
pretty impossible to get it right automatically. So the replacement text should be provided in the ideal
mixed case capitalization. eg. to change "HelpSystem" to "QueryManager" you need to give me "QueryManager"
as the target string, capitalized that way. All mixed case source occurances of "HelpSystem" will be changed
to the same output, eg.

helpsystem -> querymanager
Helpsystem -> Querymanager
HelpSystem -> QueryManager
HelpsYstem -> QueryManager
heLpsYsTem -> QueryManager
HeLPSYsteM -> QueryManager

you get it.

The code is trivial of course, but here it is for your copy-pasting pleasure. I want this in my dev-studio
find/replace-in-files please !

// strcpy "putString" to "into"
// but change its case to match the case in src
// putString should be mixed case , the way you want it to be if src is mixed case
void strcpyKeepCase(
char * into,
const char * putString,
const char * src,
int srcLen);

void strcpyKeepCase(
char * into,
const char * putString,
const char * src,
int srcLen)
// okay, I have a match
// what type of case is "src"
// all lower
// all upper
// first upper
// mixed

int numLower = 0;
int numUpper = 0;

for(int i=0;i<srcLen;i++)
ASSERT( src[i] != 0 );
if ( isalpha(src[i]) )
if ( isupper(src[i]) ) numUpper++;
else numLower++;

// non-alpha :
if ( numLower+numUpper == 0 )
else if ( numLower == 0 )
// all upper :
while( *putString )
*into++ = toupper( *putString ); putString++;
*into = 0;
else if ( numUpper == 0 )
// all lower :
while( *putString )
*into++ = tolower( *putString ); putString++;
*into = 0;
else if ( numUpper == 1 && isalpha(src[0]) && isupper(src[0]) )
// first upper then low

if( *putString ) //&& isalpha(*putString) )
*into++ = toupper( *putString ); putString++;
while( *putString )
*into++ = tolower( *putString ); putString++;
*into = 0;

// just copy putString - it should be mixed

ADDENDUM : on a roll with knocking off stuff I've been meaning to do for a while ...

ChukSH now also contains "fixhtmlpre.exe" which fixes any less-than
signs that are found within a PRE chunk.

Hmm .. something lingering annoying going on here. Does blogger convert and-l-t into less thans?