Saturday, January 17, 2009

01-17-09 - Float to Int



A while ago the
Some Assembly Required blog
wrote some good notes about float-to-int. I posted some notes there but I thought I'd try to summarize my thoughts coherently.



What I'd like is a pretty fast float-to-int (ftoi) conversion. The most useful variants are "truncate" (like C, fractions go
towards zero), and "round" , that is, fractions go towards the nearest int. We'd like both to be available all the time,
and both to be fast. So I want ftoi_trunc and ftoi_round.



First let me say that I hate the FPU control word with a passion. I've had so many bugs because of that fucker over the years. I write some
code and test it and everything is fine, and then we actually put in some game and all hell breaks loose. WTF happened?
Oh well, I tested it with the default word setup, and now it's running with the FPU set to single precision. The other
classic bug is people changing the rounding mode. D3D used to be really bad about this (you could use FPU_PRESERVE but it
was a pretty big hit back in the old days with software T&L, not a big deal any more). Or even worse is people who write code intentionally
designed to work with the FPU in a non-standard rounding mode (like round to nearest). Then if you call other code that's meant for the
normal rounding mode, it fails.



Ok, rant over. Don't mess with the FPU control word.



That means the classic /QIfist really doesn't do that much for us. Yes, it makes ftoi_trunc faster :



int ftoi_trunc(float f) { return (int) f; }



that's fast enough with /QIfist, but you still need round :



int ftoi_round(float f) { return ( f >= 0.f ) ? (int)(f + 0.5f) : (int)(f - 0.5f); }



note that a lot of people just do + 0.5 to round - that's wrong, for negatives you need to go the other way, because
the C truncation is *toward zero* not *down*.



Even if you could speed up the round case, I really don't like using compiler options for crucial functionality. I like
to make little code snippets that work the way I want regardless of the compiler settings. In particular if I make some
code that relies on ftoi being fast I don't want to use C casts and hope they set the compiler right. I want the code
to enforce its correctness.



Fortunately the xs routines at stereopsis by Sree Kotay are
really good. The key piece is a fast ftoi_round (which I have slightly rejiggered to use the union method of aliasing) :



union DoubleAnd64
{
uint64 i;
double d;
};

static const double floatutil_xs_doublemagic = (6755399441055744.0); // 2^52 * 1.5

inline int ftoi_round(const float val)
{
DoubleAnd64 dunion;
dunion.d = val + floatutil_xs_doublemagic;
return (int) dunion.i; // just cast to grab the bottom bits
}



in my tests this runs at almost exactly the same speed as FISTp (both around 7 clocks), and it always works regardless of
the FPU control word setting or the compiler options.



Note that this is a "banker's round" not a normal arithmetic rounding where 0.5 always goes up or down - 0.5 goes to the
nearest *even* value. So 2.5 goes to 2.0 and 3.5 goes to 4.0 ; eg. 0.5's go up half the time and down half the time.
To be more precise, ftoi_round will actually round the same way that bits that drop out of the bottom of the FPU registers
during addition round. We can see that's why making a banker_round routine was so easy, because that's what the FPU
addition does.



But, we have a problem. We need a truncate (ftoi_trunc). Sree provides one, but it uses a conditional, so it's slow
(around 11 clocks in my tests). A better way to get the truncate is to use the SSE intrinsinc :




inline int ftoi_trunc(const float f)
{
return _mm_cvtt_ss2si( _mm_set_ss( f ) );
}




Note that the similar _mm_cvt_ss2si (one t) conversion does banker rounding, but the "magic number" xs method is faster because it pipelines better,
and because I'm building for x86 so the cvt stuff has to move the value from FPU to SSE. If you were building with arch:sse
and all that, then obviously you should just use the cvt's. (but then you dramatically change the behavior of your floating
point code by making it run through float temporaries all the time, instead of implicit long doubles like in x86).



So, that's the system I have now and I'm pretty happy with it. SSE for trunc, magic number for round, and no reliance on
the FPU rounding mode or compiler settings, and they're both fast.



For completeness I'll include my versions of the alternatives :



#include < xmmintrin.h >

typedef unsigned __int64 uint64;

union DoubleAnd64
{
uint64 i;
double d;
};

static const double floatutil_xs_doublemagic = (6755399441055744.0); // 2^52 * 1.5
static const double floatutil_xs_doublemagicdelta = (1.5e-8); //almost .5f = .5f + 1e^(number of exp bit)
static const double floatutil_xs_doublemagicroundeps = (0.5f - floatutil_xs_doublemagicdelta); //almost .5f = .5f - 1e^(number of exp bit)

// ftoi_round : *banker* rounding!
inline int ftoi_round(const double val)
{
DoubleAnd64 dunion;
dunion.d = val + floatutil_xs_doublemagic;
return (int) dunion.i; // just cast to grab the bottom bits
}

inline int ftoi_trunc(const float f)
{
return _mm_cvtt_ss2si( _mm_set_ss( f ) );
}

inline int ftoi_round_sse(const float f)
{
return _mm_cvt_ss2si( _mm_set_ss( f ) );
}

inline int ftoi_floor(const double val)
{
return ftoi_round(val - floatutil_xs_doublemagicroundeps);
}

inline int ftoi_ceil(const double val)
{
return ftoi_round(val + floatutil_xs_doublemagicroundeps);
}

// ftoi_trunc_xs = Sree's truncate
inline int ftoi_trunc_xs(const double val)
{
return (val<0) ? ftoi_round(val+floatutil_xs_doublemagicroundeps) :
ftoi_round(val-floatutil_xs_doublemagicroundeps);
}





BTW note the ceil and floor from Sree's XS stuff which are both quite handy and hard to do any other way. Note you might think
that you can easily make ceil and floor yourself from the C-style trunc, but that's not true, remember floor is *down* even on negatives.
In fact Sree's truncate is literally saying "is it negative ? then ceil, else floor".



Finally : if you're on a console where you have read-modify-write aliasing stall problems the union magic number trick is probably not good
for you. But really on a console you're locked into a specific CPU that you completely control, so you should just directly use the
right intrinsic for you.



AND : regardless of what you do, please make an ftoi() function of some kind and call that for your conversions, don't just cast.
That way it's clear where where you're converting, it's easy to see and search for, it's easy to change the method, and if you use ftoi_trunc
and ftoi_round like me it makes it clear what you wanted.



ASIDE :
in fact I'm starting to think *all* casts should go through little helper functions to make them very obvious and clear.
Two widgets I'm using to do casts are :



// same_size_bit_cast casts the bits in memory
// eg. it's not a value cast
template < typename t_to, typename t_fm >
t_to same_size_bit_cast( const t_fm & from )
{
COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
return *( (const t_to *) &from );
}

// check_value_cast just does a static_cast and makes sure you didn't wreck the value
// eg. for numeric casting to make sure the value fits in the new type
template < typename t_to, typename t_fm >
t_to check_value_cast( const t_fm & from )
{
t_to to = static_cast< t_to >(from);
ASSERT( static_cast< t_fm >(to) == from );
return to;
}

intptr_t pointer_to_int(void * ptr)
{
return (intptr_t) ptr;
}

void * int_to_pointer(intptr_t i)
{
return (void *) i;
}





ADDENDUM :

I'm told this version of same_size_bit_cast is better on various platforms. That's why we want it gathered up in one place so it
can be easily changed ;)





// same_size_bit_cast casts the bits in memory
// eg. it's not a value cast
template < typename t_to, typename t_fm >
t_to same_size_bit_cast( const t_fm & from )
{
COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );

union
{
t_fm fm;
t_to to;
} temp;

temp.fm = from;
return temp.to;
}



No comments:

Post a Comment