Sunday, January 25, 2009

01-25-09 - Low Level Threading Junk Part 1



Okay, this is definitely one of those posts where I'm no expert by a long shot so I'll probably write some things that are wrong and
you should correct me. By "low level" I mean directly accessing shared variables and worrying about what's going on
with the memory, as opposed to just using the language/OS constructs for safe sharing.



Let me say up front that writing lots of low-level thread-sharing code is a very very bad idea and should not be in 99% of
the cases. Just use CriticalSection and once in a while use Interlocked and don't worry about the tiny inefficiency; if you
do things right they won't be a speed hit. Trying to get things right in lots of low level threading code is a recipe for
huge bugs.



I'm going to assume you know about race conditions and basic threading primitives and such. If you're hazy, this is a pretty good introduction :
Concurrency What Every Dev Must Know About Multithreaded Apps .
I'm also going to assume that you know how to do simple safe thread interaction stuff using Interlocked, but maybe you don't know exactly
what's going on with that.



First of all, let me try to list the various issues we're dealing with :



1. Multiple threads running on one core (shared cache). This means you can get swapped in and out; this is actually the *easy* part but
it's what most people think of as threading.



2. Multiple cores/processors running multiple threads, possibly with incoherent caches. We have #1 to deal with, but also now the memory views
of various threads can be different. The memories sync up eventually, but it's almost like they're talking through a delated communication channel.



3. CPU OOP instruction reorder buffers and cache gather/reorder buffers. Instructions (even in ASM) may not execute in the order you wrote, and even
if they do exectue in that order, memory reads & writes may not happen in the order of the instructions (because of cache line straddle issues, write
gather buffers, etc.)



4. Single CISC CPU instructions being broken into pieces (such as unaligned accesses). Single ASM instructions may not be single operations; things like
"inc" become "load, add, store" and there's an opportunity for a thread to interleave in there. Even apparently atomic ops like just a "load" can become
multiple instructions if the load is unaligned, and that creates a chance for another thread to poke in the gap.



5. The compiler/optimizer reordering operations. Obviously things don't necessarily happen in the order that you write them in your C program.



6. The compiler/optimizer caching values or eliminating ops



I think that's the meat of it. One thing that sort of mucks this all up is that x86 and MSVC >= 2005 are sort of special cases which are much simpler
than most other compilers & platforms. Unfortunately most devs and authors are working with x86 and MSVC 2005+ which means they do lazy/incorrect things
that happen to work in that case. Also I'm going to be talking about C++ but there are actually much better memory model controls now in Java and C#.
I'm going to try to describe things that are safe in general, not just safe on x86/Windows, but I will use the x86/Windows functions as an example.



Almost every single page I've read on this stuff get its wrong. Even by the experts. I always see stuff like
this .
Where they implement a lock-free fancy doohicky, and then come back later and admit that oh, it doesn't actually work. For example
this Julian Bucknall guy has a lot of long articles about lock free stuff,
and then every 3rd article he comes back and goes "oh BTW the last article was wrong, oops". BTW never try to use any of the lock free stuff from
a place like "codeproject.com" or anybody's random blog.



I've read a lot of stuff like :




Unfortunately, Matt's answer features what's called double-checked locking which isn't supported by the C/C++ memory model.




To my mind, that's a really silly thing to say. Basically C/C++ doesn't *have* a memory model. Modern C# and Java *do* which means that the language
natively supports the ideas of memory access ordering and such. With C/C++ you basically have zero gaurantees of anything. That means you have to do
everything manually. But when you do things manually you can of course do whatever you want. For example "double checked locking" is just fine in C,
but you have to manually control your memory model in the right way. (I really don't even like the term "memory model" that much; it's really an "execution
model" because it includes things like the concept of what's atomic and what can be reordered).



Some things I'm going to talk about : how lock free is really like spin locking, how critical sections work, why you should spincount critical
sections now, what kind of threading stuff you can do without critsecs, etc.



Something I am NOT going to talk about is the exact details of the memory model on x86/windows, because I don't think you should be writing code for a
specific processor memory model. Try to write code that will always work. x86/windows has strong constraints (stores are not reordered past stores, etc. etc.
but forget you know that and don't rely on it).




Let's look at a simple example and see if we can get it right.



Thread A is trying to pass some work to thread B. Thread B sits and waits for the work then does it. Our stupid example looks like :



// global variables :
MyStruct * g_work = NULL;


Thread A :

int main(int argc, char ** argv)
{
g_work = new MyStruct( argc, argv );

....
}


void ThreadB()
{

while( g_work == NULL )
{
}

g_work->DoStuff();
}



Okay, now let's go through it and fix it.



First of all, this line should give you the heebee jeebees :


g_work = new MyStruct( argc, argv );


It's doing a bunch of work and assigning to a shared variable. There are no gaurantees about what order that gets written to memory, so g_work could be assigned
before the struct is set up, then ThreadB could start poking into it while I'm still constructing it. We want to release a full object to g_work that we're all
done with. We can start trying to fix it by doing :


1. MyStruct * temp = new MyStruct( argc, argv );
2. g_work = temp;


that's good, but again you cannot assume anything about the memory model in C or the order of operations. In particular, we need to make sure that the
writes to memory done by line 1 are actually finished before line 2 executes.



In Windows we do that by calling
MemoryBarrier() :


MyStruct * temp = new MyStruct( argc, argv );
MemoryBarrier();
g_work = temp;


MemoryBarrier is an intrinsic in MSVC ; it actually does two things. 1. It emits an instruction that causes the processor to force a sync point (this also actually
does two things : 1A : flushes caches and write gather buffers and 1B. puts a fence in the reorder buffer so the processor can't speculate ahead). 2. the MemoryBarrier
instrinsic also acts as a compiler optimizer barrier - so that the MSVC compiler won't move work before MemoryBarrier ahead of MemoryBarrier.



MemoryBarrier is a full memory fence, it creates an ordering point. In general if you just write memory operations :


A
B
C


you can't say what order they actually happen in. If another thread is watching that spot it might see C,A,B or whatever. With MemoryBarrier :


A
B
MemoryBarrier
C


You get an order constraint : C is always after {AB} , so it might be ABC or BAC.



Another digression about the compiler optimizer fence : in windows you can also control just the compiler optimization with
_ReadWriteBarrier (and _ReadBarrier and _WriteBarrier). This doesn't generate a memory
fence to the processor, it's just a command to the optimizer to not move memory reads or writes across a specific line. I haven't seen a case where I would actually
want to use this without also generating a memory barrier (??). Another thing I'm not sure about - it seems that if you manually output a fence instruction with __asm,
the compiler automatically treats that as a ReadWriteBarrier (??).



Alright, so we're getting close, we've made a work struct and forced it to flush out before becoming visible to the other thread :


MyStruct * temp = new MyStruct( argc, argv );
MemoryBarrier();
g_work = temp; <-


What about this last line? It looks inocuous, but it holds many traps. Assignment is atomic - but only if g_work is a 32 bit pointer on 32-bit x86, or a 64 bit pointer on
64-bit x86. Also since g_work is just a variable it could get optimized out or deferred or just stored in local cache and not flushed out to the bus, etc.



One thing we can do is use "volatile". I hesitate to even talk about volatile because it's not well defined by C and it means different things depending on platform
and compiler. (In particular, MS has added lots of threading usefulness to volatile, but nobody else does what they do, so don't use it!). What we want "volatile" for
here is to force the compiler to actually generate a memory store for g_work. To my mind I like to think that volatile means "don't put me in registers - always read or
write memory". (again on x86 volatile means extra things, for example volatile memory accesses won't get reordered, but don't rely on that!).
Note you might also have to make sure g_work is aligned unless you are sure the compiler is doing that.



One thing to be careful with about volatile is how you put it on a pointer. Remember to read pointer adjective from right to left in C:


volatile char * var;

// this is a non-volatile pointer to a volatile char !! probably not what you meant

char * volatile var;

// this is a volatile pointer to chars - probably what you meant
// (though of course you have to make sure the char memory accesses are synchronized right)




Note that volatile is a pretty big performance hit. I actually think most of the time you should just not use "volatile" at all, because it's too variable
in its meaning, and instead you should manually specify the operations that need to be sync'ed :



On Windows there's a clean way to do this :



MyStruct * temp = new MyStruct( argc, argv );
InterlockedExchangePointer( &g_work, temp );



The Interlocked functions are guaranteed to be atomic. Now we don't have to just hope that the code we wrote actually translated into an atomic op.
The Interlocked functions also automatically generate memory barriers and optimizer barriers
(!! NOTE : only true on Windows, NOT true on Xenon !!). Thus the InterlockedExchangePointer forces MyStruct to get
written to temp first.



Let me just briefly mention that this full MemoryBarrier is *very* expensive and you can get away with less. In particular, something you will see
is "Acquire" and "Release". The heuristic rule of thumb is that you use "Acquire" to read shared conditions and "Release" to write them. More
formally, "Acquire" is a starting memory barrier - it means any memory op done after the Acquire will not move before it (but ones done before can move
after). "Release" is a finishing memory barrier - memory ops done before it will not move after (but ones done after can be done before).



So if you have :


A
B
C - Release
D


The actual order could be {A B C D} or {B A C D} or {A B D C} but never {A C B D}. Obviously Acquire and Release are a slightly weaker constraint than a
full barrier so they give the processor more room to wiggle, which is good for it.



So lets do another example of this simple thread passing (stolen from comp.programming.threads) :



int a;
int b;
LONG valid = 0;
CRITICAL_SECTION mtx;

void function1 () { // called from many threads

EnterCriticalSection(&mtx);
if (! valid) {
a = something;
b = something;
InterlockedExchangeAddRelease(&valid, 1);
// writes to memory 'a' and 'b' will be done before valid is set to 1
}
do_other_stuff();
LeaveCriticalSection(&mtx);

}

int function2 () { // called from many threads

if (InterlockedExchangeAddAcquire(&valid, 0)) {
return a + b;
} else {
return 0;
}

}

int function3 () { // broken

if ( valid ) {
return a + b;
} else {
return 0;
}

}




Now it's easy to fall into a trap of thinking that because we did the "Release" that function3 is okay. I mean, by the time function3 sees valid get
set to 1, 'a' and 'b' will already be set, so function 3 is right, okay? Well, sort of. That would be true *if* function 3 was in assembly so the
compiler couldn't reorder anything, and if the chip couldn't reorder memory ops (or if we rely on the x86 constraint of read ordering). You should
know the actual execution of function 3 could go something like :



fetch a
fetch b
add b to a
test valid
set conditional a to 0
return a



which is now reading a and b before 'valid'. Acquire stops this.



Some good links on this basic memory barrier stuff : (read Kang Su in particular)



Kang Su's Blog volatile, acquirerelease, memory fences, and VC2005

Memory barrier - Wikipedia, the free encyclopedia

Acquire and Release Semantics

The Old New Thing Acquire and release sound like bass fishing terms, but they also apply to memory models

Memory ordering, barriers, acquire and release semantics niallryan.com

Memory barrier + volatile (C++) - comp.programming.threads Google Groups




BTW note that volatile does some magic goodness in VC >= 2005 , but *not* on Xenon even with that compiler version.
In summary :



ReadBarrier/WriteBarrier - just prevents compiler reordering

MemoryBarrier() - CPU memory load/store barrier

Interlocked & volatile
Interlocked functions on Windows automatically generate a compiler & a CPU barrier
Interlocked on Xbox 360 does not generate a CPU barrier - you need to do it

special volatile thing in MSVC ?
volatile automatically does the right thing on VC >= 2005
not for XBox though





ADDENDUM : lol the new Dr. Dobbs has a good article by Herb called
volatile vs volatile . It covers a lot of this same territory.



And some more links that are basically on the same topic : (the thing with the valid flag we've done here is called the "publication safety pattern")



Bartosz Milewski - Publication Safety on Multicore

Hans Boehm - Why atomics have integrated ordering constraints

Bartosz Milewski - C Atomics and Memory Ordering

No comments:

Post a Comment