Wednesday, January 21, 2009

01-21-09 - Towards Lock-Free Threading

Currently Oodle threads communicate through plain old mutex locking, but I'd like to go towards lock-free communication eventually.
Now, lock-free coding doesn't have the normal deadlock problems, but it has a whole new host of much scarier and harder to debug
thread timing problems, because you don't have the simplicity of mutex blocking out concurrency during shared data access.

It occurs to me there's a pretty simple way to make the transition and sort of have a middle ground.

Start by writing your threading using plain old mutexes and a locked "communication region". The communication area can only be accessed
while the mutex is held. This is just the standard easy old way :

{Thread A} <---> {Communication Region} <---> {Thread B}

Thread A - lock mutex on CR
change values ;
--- Thread B lock mutex
read values

Now find yourself a lockfree stack (aka singly linked list LIFO). The good old "SList" in Win32 is one fine choice. Now basically
pretend that Thread A and Thread B are like over a network, and send messages to each other via the lock-free stacks.

To keep the code the same, they both get copies of the Communication Region :

{Thread A | Communication Region} <---> {Communication Region | Thread B}

and they send messages via two stacks :

{Thread A | Communication Region} --- stack AtoB --> {Communication Region | Thread B}
{Thread A | Communication Region} <-- stack BtoA --- {Communication Region | Thread B}

The messages you send across the stacks apply the deltas to the communication regions to keep them in sync, just like networked game.

So for example, if Thread A is the "main thread" and Thread B is a "job doer thread" , a common case might be that the
communication region is a list of pending jobs and a list of completed jobs.

Main Thread
{Pending Jobs}
{Completed Jobs}

Request Job :
add to my {Pending Jobs}
push message on stackAtoB

Worker Thread
{Pending Jobs}
{Completed Jobs}

Sit and pop jobs off stackAtoB
put in pending jobs list
work on pending jobs
put result on completed jobs list
push completion message on stackBtoA

The nice thing is that Main Thread can at any time poke around in his own Pending and Completed list to see if various jobs
are still pending or done yet awaiting examination.

Obviously if you were architecting for lock-free from the start you wouldn't do things exactly like this, but I like the ability to
start with a simple old mutex-based system and debug it and make sure everything is solid before I start fucking around with lock-free.
This way 99% of the code is identical, but it still just talks to a "Communication Region".


I should note that this is really neither new nor interesting. This is basically what every SPU programmer does. SPU "threads" get a
copy of a little piece of data to work on, they do work on their own copy, and then they send it back to the main thread. They don't
page pieces back to main memory as they go.

While the SPU thread is working, the main thread can either not look at the "communication region", or it can look at it but know that it
might be getting old data. For many applications that's fine. For example, if the SPU is running the animation system and you want the
main thread to query some bone positions to detect if your punch hit somebody - you can just go ahead and grab the bone positions without
locking, and you might get new ones or you might get last frame's and who cares. (a better example is purely visual things like particle

Now I should also note that "lock free" is a bit of a false grail. The performance difference of locks vs. no locks is very small. That
is, whether you use "CriticalSection" or "InterlockedExchange" is not a big difference. The big difference comes from the communication
model. Having lots of threads contending over one resource is slow, whether that resource is "lock free" or locked. Obviously holding locks
for a long time is bad, but you can implement a "lock free" model using locks and its plenty fast.

That is, this kind of model :

[ xxxxxxxxxx giant shared data xxxxxxxxxx ]

| | | |
| | | |
| | | |

[thread A] [thread B] [thread C] [thread D]

is slow regardless of whether you use locks or not. And this kind of model :

[data A ] [data B ] [data C ] [data D ]

| / | / | / |
| / | / | / |
| / | / | / |

[thread A] [thread B] [thread C] [thread D]

is fast regardless of whether you use locks or not. Okay I've probably made this way more wrong and confusing now.


Let me try to express it another way. The "message passing" model that I described is basically a way of doing a large atomic memory write.
The message that you pass can contain various fields, and it is processed synchronously by the receiver. That makes common unsafe lock-free
methods safe. Let me try to make this clear with an example :

You want Thread B to do some work and set a flag when it's done. Thread A is waiting to see that flag get set and then will process the work.
So you have a communication region like :

// globals :
bool isDone;
int workParams[32];
int workResults[32];

Now a lot of people try to do lock-free work passing trivially by going :

Thread A :

isDone = false;
// set up workParams
// tell Thread B to start

... my stuff ...

if ( isDone )
// use workResults

Thread B :

// read workParams

// fill out workResults

isDone = true;

Now it is possible to make code like this work, but it's processor & compiler dependent and can be very tricky and causes bugs.
(I think I'll write some about this in a new post, see later). (the problem is that the reads & writes of isDone and the params and
results don't all happen together and in-order). Instead we can just pass the object :

struct ThreadMessage
bool isDone;
int workParams[32];
int workResults[32];

Thread A :
ThreadMessage myLocalCopy; // in TLS or stack
// fill out myLocalCopy
// push myLocalCopy to thread message queue to thread B

... my stuff ...

if ( pop my message queue )
myLocalCopy = copy from queue
// use myLocalCopy.workResults

Thread B :
ThreadMessage myLocalCopy; // in TLS or stack
myLocalCopy = pop message queue

// read myLocalCopy.workParams

// fill out myLocalCopy.workResults

push myLocalCopy to queue for thread S

Okay. Basically we have taken the separate variables and linked them together, so that as far as our thread is concerned they get written
and read in one big chunk. That is, we move the shared data from one large consistent state to another.

No comments:

Post a Comment