Monday, June 6, 2011

12-05-08 - 64 Bit Multiply

Something that I've found myself wanting to do a lot recently is multiply two 32 bit numbers, and then
take the top 32 bit dword from the 64 bit result. In C this looks like :

U32 mul32by32returnHi(U32 a,U32 b)
U64 product = (U64) a * b;
U32 ret = (U32)(product >> 32);
return ret;

That works fine and all, but the C compiler doesn't understand that you're doing something very simple. It
generates absolutely disgusting code. In particular, it actually promotes a & b to 64 bit, and then calls
a function called "_allmul" in the Windows NT RTL. This allmul function does a bunch of carries and such to
make 64-by-64 bit multiply work via the 32+32->64 multiply instruction in x86. You wind up with a function
that takes 60 clocks when it could take 6 clocks :

U32 mul32by32returnHi(U32 a,U32 b)
mov eax, a
mul b
mov eax,edx

Now, that's nice and all, the problem is that tiny inline asm functions just don't work in MSVC these days.
Calling a big chunk of ASM is fine, but using a tiny bit of ASM causes the optimizer to disable
all its really fancy optimizations, and you wind up with a function that's slower overall.

What I'd like is a way to get at some of the x86 capabilities that you can't easily get from C. The other
thing I often find myself wanting is a way to get at the carry flag, or something like "sbb" or "adc".
The main uses of those are Terje's cool tricks to
get the sign of an int or
add with saturation

It would be awesome if you could define your own little mini assembly functions that acted just like the
built-in "intrinsics". It would be totally fine if you had to add a bunch of extra data to tell the compiler
about dependencies or side effects or whatever, if you could just make little assembly widgets that worked
as neatly as their own intrinsics it would be totally awesome for little ops.

ADDENDUM : urg ; so I tested Won's suggestion of Int32x32To64 and found it was doing the right thing (actually
UInt32x32To64 - and BTW that's just a macro that's identical to my first code snippet - it doesn't seem to be
a magic intrinsic, though it does have platform switches so you should probably use that macro). That confused
me and didn't match what I was seeing, so I did some more tests...

It turns out the first code snippet of mul32by32returnHi actually *will* compile to the right thing - but only
if you call it from simple functions. If you call it from a complex function the compiler gets confused, but if
you call it from a little tight test loop function it does the right thing. URG.

Here's what it does if I try to use it in my StringHash interpolation search test code :

int start = mul32by32returnHi( query.hash, indexrange );
004087A5 xor eax,eax
004087A7 push eax
004087A8 mov eax,dword ptr [esp+38h]
004087AC push eax
004087AD push 0
004087AF push ebx
004087B0 mov dword ptr [esi],ebx
004087B2 call _allmul (40EC00h)
004087B7 mov ecx,dword ptr [esp+14h]

You can see it's extending the dwords to 64 bits, pushing them all, calling the function, then
grabbing just one dword from the results. Ugh.

And here's what it does in a tight little test loop :

U32 dw = *dwp++;
hash = mul32by32returnHi(hash,dw) ^ dw;
00401120 mov eax,ecx
00401122 mul eax,edx
00401124 add esi,4
00401127 xor edx,ecx
00401129 mov ecx,dword ptr [esi]

Which not only is correct use of 'mul' but also has got crazy reordering of the loop which makes it a lot
faster than my inline ASM version.

No comments:

Post a Comment