Sunday, January 25, 2009

01-25-09 - Low Level Threading Junk Part 2



Okay, we're starting to learn some primitive threading junk. So let's try to use it to do something useful.



Non-trivial statics in C++ are hideously non-thread-safe. That should be extremely obvious but if you don't see it, read here :
The Old New Thing C++ scoped static initialization is not thread-safe, on purpose! .
Obviously plain C static initialization of integral types is fine (BTW floats are NOT fine - beware using const floats in C++). Beware all things like this :





Object & Instance() { static Object s; return s; }

void func(int i)
{
    static Object once;
    once.DoStuff(i);
}





Not thread safe. For the most part, we should just not use them. It's very nice to do minimal work in CINIT. Much brouhaha has been made about the "thread safe singleton"
(see all these) :



opbarnes.com Code Snippet Released - Thread-safe Singleton (for Windows platforms)

Dr. Dobb's Singleton Creation the Thread-safe Way October 1, 1999 - BROKEN

Dr. Dobb's C++ and The Perils of Double-Checked Locking Part I July 1, 2004

Double-checked locking - Wikipedia, the free encyclopedia




For my money this is a little silly because there's a really damn trivial solution to this. The main point of the old C++ initialize-on-demand Singleton pattern (like Instance() above) is to make it
work during CINIT with complex initialization order dependencies. Now, CINIT we know is run entirely on one thread before any threads are made. (you are fucking nuts if you're making threads during
CINIT). That means we know any Singletons that get made during CINIT don't have to worry about threading. That means you can use a trivial singleton pattern, and just make sure it gets used in
CINIT :



static Object * s_instance = NULL; // okay cuz its a basic type

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        s_instance = new Object;
    }
    return s_instance;
}

static volatile Object * s_forceUse = GetSingleton();



The forceUse thing just makes sure our GetSingleton is called during CINIT, so that we can be sure we're all made by the time threads come around. The GetSingleton() that we have here is NOT thread
safe, but it doesn't need to be ! (BTW I assume that "CreateThread" acts as a MemoryBarrier (??))



Okay, that's nice and easy. What if you want a Singleton that doesn't usually get made at all, so you don't want to just make it during CINIT ? (you actually want it to get made on use). Okay, now we
have to worry about thread-safing the construction.



The easiest case is when you know your Singleton won't get used during CINIT. In that case you can just use a critical section :





static CCriticalSection s_crit; // CRITICAL_SECTION wrapper class will construct during CINIT
static Object * s_instance = NULL;

// broken version :

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        s_crit.Lock();

        s_instance = new Object;

        s_crit.Unlock();
    }
    return s_instance;
}



Okay, that was easy, but it's horribly broken. First of all, there's no guarantee that Object is only made once. The thread can switch while the instruction pointer is between the first if() and the
critsec lock. If that happens, some other thread can get in and make s_instance, and then when we come back to execute we run through and make it again. (If you like, you could say we put the critsec in the wrong place - we could fix it by
moving the critsec outside the if). Even aside from that, the line that assigns s_instance is all wrong, because the pointer to s_instance is not necessarily written atomically, and it might be written before
the stuff inside Object is written. What did we learn in Part 1?



// good version :

static CCriticalSection s_crit; // CRITICAL_SECTION wrapper class will construct during CINIT
static Object * volatile s_instance = NULL;

Object * GetSingleton()
{
    if ( ! s_instance ) // must be memory_order_consume
    {
        s_crit.Lock();

        if ( ! s_instance )
        {
            Object * pNew = new Object;

            InterlockedExchangeRelease( &s_instance , pNew );
        }

        s_crit.Unlock();
    }
    return s_instance;
}



This is a "double-checked lock" that works. The purpose of the double-check is to avoid taking the critsec when we can avoid doing so. If instance is NULL we take the critsec, then we have to check again
to make sure it's still NULL, then we rock on and make sure that the Object's memory is flushed before s_instance is set, by using a "Release" memory barrier. Also, using Interlocked ensures the pointer is
written atomically.



ADDENDUM : I need to add some more notes about this. See comments for now.



Okay, that's all good, but it doesn't work if you now ever try to use this Singleton from CINIT. The problem is that s_crit might not be constructed yet. There's a pretty easy solution to that - just check
if s_crit has been initialized, and if it hasn't, then don't use it. That works because we know CINIT is single threaded, so you can do something like :



class Object { };
static CRITICAL_SECTION s_crit = { 0 };
AT_STARTUP( InitializeCriticalSection(&s_crit); );
static Object * s_instance = NULL;

Object * GetSingleton()
{
    if ( ! s_instance )
    {
        if ( s_crit.DebugInfo == 0 ) // DebugInfo will always be nonzero for an initialized critsec
        {
            // must be in CINIT !
            s_instance = new Object;
        }
        else
        {
            EnterCriticalSection(&s_crit);

            if ( ! s_instance )
            {
                Object * pNew = new Object;

                InterlockedExchangeRelease( &s_instance , pNew );
            }

            LeaveCriticalSection(&s_crit);
        }
    }
    return s_instance;
}



This actually works pretty dandy. Note that in CINIT you might take the upper or lower paths - you have no way of knowing if GetSingleton() will be called before or after the critsec is initialized. But that's fine,
it works either way, by design. Note that we are crucially relying here on the fact that all non-trivial CINIT work is done after all the simple-type zeroing.



Okay, so that's all fine, but this last thing was pretty ugly. Wouldn't it be nice to have a critical section type of mutex object that can be statically initialized, so that we don't have to worry about CINIT
mumbo jumbo ?



Well, yes it would, and it's pretty trivial to make a standard simple Spin Lock thing :



typedef LONG SpinMutex; // initialize me with = 0

void SpinLock(SpinMutex volatile * mut)
{
    // 0 is unlocked, 1 is locked
    COMPILER_ASSERT( sizeof(SpinMutex) == sizeof(LONG) );
    int spins = 0;
    while( InterlockedCompareExchange((LONG *)mut,1,0) == 1 ) // Acquire
    {
        ++spins;
        if ( spins > CB_SPIN_COUNT )
        {
            Sleep(0);
        }
    }
    // *mut should now be 1
}

void SpinUnlock(SpinMutex volatile * mut)
{
    // *mut should now be 1
    InterlockedDecrement((LONG *)mut); // Release
    // *mut should now be 0
    // if there was a higher priority thread stalled on us, wake it up !
}



Note that SpinLock will deadlock you if you try to recurse. Okay, now because this thing is static initializable we can use it to make a Singleton and be safe for CINIT :




static SpinMutex s_mutex = 0;
static Object * volatile s_instance = NULL;

Object * GetSingleton()
{
if ( ! s_instance )
{
SpinLock(&s_mutex);

if ( ! s_instance )
{
Object * pNew = new Object;

InterlockedExchangeRelease( &s_instance , pNew );
}

SpinUnlock(&s_mutex);
}
return s_instance;
}



And that works too.



Now, people get tempted to use these simple SpinLocks all over. *DON'T DO IT*. It looks like it should be faster than CRITICAL_SECTION
but in fact it's way way slower in bad cases.



First of all, what is a CRITICAL_SECTION ? See :
Matt Pietrek - Critical Sections Under Windows .




It starts out as a simple spin lock like the above. It does the same kind of thing to busy-spin the processor first. But then it doesn't just Sleep. This kind of
spin-and-sleep is a really really bad thing to do in heavy threading scenarios. If you have lots of threads contending over one resource, especially with very different
execution patterns, the spin-and-sleep can essentially serialize them (or worse). They can get stuck in loops and fail to make any progress.



Once CRITICAL_SECTION sees contention, it creates a kernel Event to wait on, and puts the thread into an alertable wait. The Windows scheduler has lots of good mojo for
dealing with events and threads in alertable waits - it's generally the best way to do thread sleeping and wakeups. For example, it has mojo to deal with the bad
cases of heavy contention by doing things like randomly choosing one of the threads that is waiting on a critsec to wake up, rather than just waking up the next one
(this prevents a few runaway threads from killing the system).



One important note : you should almost always use "InitializeCriticalSectionAndSpinCount" these days to give your crit secs a spinner. That's because
basically every machine is multicore now - more about this in a moment.



About performance : (from one of the MSDN articles) :



* MemoryBarrier was measured as taking 20-90 cycles.
* InterlockedIncrement was measured as taking 36-90 cycles.
* Acquiring or releasing a critical section was measured as taking 40-100 cycles.



which clearly shows that the idea that a critical section is "too slow" is nonsense. I've said this already but let me emphasize :



Most of the time (no contention) a Crit Sec *is* just an InterlockedIncrement
(so how could you beat it by using InterlockedIncrement instead?)

When there is contention and the Crit Sec does something more serious (use an Event) it's slow
but that is exactly what you want !!



In fact, the big win from "lock free" is not that InterlockedIncrement or whatever is so much faster - it's that you can sometimes
do just one interlocked op, unlike crit sec which requires two (one for Enter and one for Leave).



An interesting alternative is this QLock thing by Vladimir Kliatchko :

Dr. Dobb's Developing Lightweight, Statically Initializable C++ Mutexes May 1, 2007 (QLock) ; it's quite a
bit like the MS critsec, in that he uses a simple interlocked op to check for contention, and if it's not free then he makes an Event and goes into a wait state
and all that. The code is there on Dr. Dobbs, but you have to do the thing where you hit the "Download" link in the top bar and then navigate around to try to find it.
It's here in MUTEX.LST .




Ok, now a bit more InitializeCriticalSectionAndSpinCount. The reason you want a Spin is because basically every new machine these days
is "multiprocessor" (multicore). That's a big difference from threading on a single core.



If you're threading on a single core - memory will never change underneath you while you are executing. You can get swapped out and
memory can change and you can get swapped in, so it looks like memory changed underneath you, but it doesn't happen in real time.
With multiple cores memory can be touched by some other core that is running along at full speed.



Often this doesn't affect the way you write threading code, but it does affect performance. It means you can have contention on a
much finer scale. In a normal single-proc OS environment you get large time slices, so other threads aren't frequently randomly poking at
you. With real multicore the other guy can be poking at you all the time. A good range is to spin something like 1000-5000 times;
1000 spins is on the order of a microsecond.



What the spin count does is let you stay in the same thread and avoid an OS thread switch if some other processor is holding the lock.
Note that if it's a thread on the same processor - spinning is totally pointless (in fact it's harmful).
