Friday, June 1, 2012

06-01-12 - On C++ Atomic Fences Part 3

Finally, a small note of confusion. There are cases where I think "what I need here is a #StoreLoad",
but a seq_cst fence doesn't work, while changing the load to an RMW does. Let's try to look into
one of those cases a bit.



I will use an example that I used before, a very simple "futex" (not really) waitset based exchange-mutex.
I'm gonna use the exact same code here as I did there, but I'm putting the futex_system ops into
the mutex to make it all a bit leaner :



struct futex_mutex3
{
    std::atomic<int> m_state; // mutex locked flag

    // waitset :
    HANDLE m_handle; // actual wait event
    atomic<int> m_count; // waiter count

    /*************/

    futex_mutex3() : m_state(0), m_count(0)
    {
        m_handle = CreateEvent(NULL,0,0,NULL);
    }
    ~futex_mutex3()
    {
        CloseHandle(m_handle);
    }

    void lock()
    {
        // try to grab the mutex by exchanging in a 1
        // if it returns 0, we got the lock
        if ( m_state($).exchange(1,rl::mo_acquire) )
        {
            // prepare_wait :
            // add one to waiter count
            m_count($).fetch_add(1,mo_acquire);

            // (*1)

            // double check :
            while ( m_state($).exchange(1,rl::mo_acquire) )
            {
                // wait :
                WaitForSingleObject(m_handle, INFINITE);
            }

            // retire_wait :
            m_count($).fetch_add(-1,mo_relaxed);
        }
    }

    void unlock()
    {
        m_state($).store(0,rl::mo_release);

        //notify_one :

        // need #StoreLoad before loading m_count

        // (*2)

        int c = m_count($).load(mo_acquire);

        if ( c > 0 )
        {
            SetEvent(m_handle);
        }
    }
};



The code as written does not work. The problem is the interaction of spots *1 and *2.



(mutual exclusion is guaranteed by this code through our actions on m_state, which also provide the necessary acquire & release
to make the mutex valid. So when I say it "doesn't work", I mean the waitset interaction with the mutex doesn't work, eg.
we deadlock by failing to wake a waiter; it's a "missed wakeup" problem).



In brief, the problem is that the unlocker at *2 can load a waiter count of 0 (and thus not signal), even though the waiter
has passed point *1 (and thus count should not be zero).



The bad execution goes like this :



1. thread 0 holds the lock on the mutex

thread 1 calls lock()

2. thread 1 tries to lock the mutex, sees m_state=1 and goes into prepare_wait
3. thread 1 does m_count ++
4. thread 1 tries the exchange again, sees m_state=1 and goes into the wait

thread 0 calls unlock()

5. thread 0 stores a 0 to m_state
6. thread 0 loads m_count and gets 0 (out of date value)



Now, you might think the problem is that the load can act like it hoists above the store.
That is, we know the store happens after the exchange (#4), because the exchange didn't see a zero. Therefore #3 (the inc to count)
must already have happened. But the load at #6 is seeing the value from before #3; sure, that's allowed, the load has no ordering
constraint that stops it from moving back in time.



So clearly we want a #StoreLoad between 5 and 6 to prevent that load from backing up. You cannot express that in C++0x, and that's
what I meant in this original blog post when I said that the C++0x seq_cst fence is not really a StoreLoad and there's no way to
really express this kind of StoreLoad in C++0x.

Specifically, just adding a seq_cst fence here where you want a StoreLoad does not work :



void unlock()
{
    m_state($).store(0,rl::mo_release);

    //notify_one :

    // need #StoreLoad before loading m_count

    std::atomic_thread_fence(mo_seq_cst,$);

    int c = m_count($).load(mo_acquire);

    if ( c > 0 )
    {
        SetEvent(m_handle);
    }
}



The problem, as I understand it, is that a single fence in C++0x doesn't really do what you want. I kind of got at this in Part 2
as well: you can't just use a fence as a way to publish and have relaxed loads receive it. You need another fence in the
receiving thread, so that the fences can "synchronize with" each other. Also, if you go back to Part 1 and look at most of the rules
about fences, they only provide ordering if they have the right kind of object to connect through; you need something to carry a transitive
relationship.
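
To make that shape concrete, here is a minimal sketch of the fence-pairing pattern in the same notation (the variables "guard" and "payload" are hypothetical, just for illustration) :



thread W :

payload($).store(1,mo_relaxed);
std::atomic_thread_fence(mo_release,$);   // fence F1
guard($).store(1,mo_relaxed);

thread R :

if ( guard($).load(mo_relaxed) )          // reads the store done after F1
{
    std::atomic_thread_fence(mo_acquire,$); // fence F2 ; F1 now synchronizes-with F2
    int p = payload($).load(mo_relaxed);    // must see 1
}



The store and load of "guard" are the "object to connect through" : the release fence attaches to the store sequenced after it, the acquire fence to the load sequenced before it, and that pair carries the transitive relationship.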



Note : I believe that on every real CPU, putting a MemoryBarrier() there where you want the #StoreLoad would make this code work.
This example is actually very similar to the Peterson lock we saw in part 2.
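
For concreteness, here's a sketch of that unlock; this is the "real CPU" version, not portable C++0x (MemoryBarrier() is the Win32 full barrier, which really is a #StoreLoad on x86) :

void unlock()
{
    m_state($).store(0,rl::mo_release);

    // full hardware barrier ; keeps the m_count load from passing the m_state store :
    MemoryBarrier();

    int c = m_count($).load(mo_relaxed);

    if ( c > 0 )
    {
        SetEvent(m_handle);
    }
}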



Boiled down, the problem is like this :



A & B initially zero

thread 0 :
A = 1
#StoreLoad
load B

thread 1 :
B = 1
#StoreLoad
load A



It should not be possible for both threads to load 0; one or both threads should see a 1. Now, if you
make both #StoreLoads into atomic_thread_fence(seq_cst), then it works - but not because the fence is a #StoreLoad.
It works because the two seq_cst fences must have a definite order against each other in the total order S, and that order
provides a reference point for all the other ops to be "happens before/after" each other.
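
To spell out the case analysis : both fences appear in S, so one of them is first. Say thread 0's fence precedes thread 1's fence in S. Thread 0's store of A is sequenced before its fence, and thread 1's load of A is sequenced after its fence, so the seq_cst fence rules require that load to observe A == 1 (or something later in A's modification order). If instead thread 1's fence is first in S, the symmetric argument forces thread 0's load of B to see 1. Either way, at least one load returns 1.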



To be very clear, this code works :



thread 0:

A($).store(1,mo_relaxed);
std::atomic_thread_fence(mo_seq_cst,$);
r1($) = B($).load(mo_relaxed);

thread 1:

B($).store(1,mo_relaxed);
std::atomic_thread_fence(mo_seq_cst,$);
r2($) = A($).load(mo_relaxed);

after :

r1+r2 == 1 or 2 (never 0)



(BTW this is actually the classic WFMO case in semi-disguise; you're waiting on the two conditions A and B to
become true; if two separate threads set the conditions, then at least one should see the joint A && B be true,
but that only works with the appropriate #StoreLoad; see this blog post).
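
Spelled out in that WFMO framing, it's the same code in different clothes (a sketch; assume some waiter needs the joint condition A && B and either setter may do the wakeup) :

thread 0 :

A($).store(1,mo_relaxed);                // set my condition
std::atomic_thread_fence(mo_seq_cst,$);  // the #StoreLoad
if ( B($).load(mo_relaxed) )
{
    // joint condition A && B is visible ; this thread does the wakeup
}

thread 1 :

B($).store(1,mo_relaxed);
std::atomic_thread_fence(mo_seq_cst,$);
if ( A($).load(mo_relaxed) )
{
    // joint condition A && B is visible ; this thread does the wakeup
}

At least one of the two threads must enter its if-block, so the wakeup cannot be missed.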



But what if we remove the need for one of the threads to have a seq_cst fence :



thread 0:

A($).store(1,mo_relaxed);
std::atomic_thread_fence(mo_seq_cst,$); // #StoreLoad ?
r1($) = B($).load(mo_relaxed);

thread 1:

B($).store(1,mo_relaxed);
// load A will be after store B via StoreStore release ordering
r2($) = A($).fetch_add(0,mo_acq_rel);

after :

r1+r2 == 1 or 2 (never 0)



Here thread 1 uses an RMW to ensure StoreLoad ordering, so we get rid of the fence.
This code no longer works. Now the B.load in thread 0 can hoist above the B.store in thread 1. The reason
is that the seq_cst fence in thread 0 is not acting as a StoreLoad any more because it has nothing to synchronize against
in the other thread.
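
One bad execution that C++0x permits here, spelled out :

thread 1 : B($).store(1,mo_relaxed);
thread 1 : r2 = A($).fetch_add(0,mo_acq_rel);      // reads 0 ; A=1 is not yet in A's modification order
thread 0 : A($).store(1,mo_relaxed);
thread 0 : std::atomic_thread_fence(mo_seq_cst,$); // no fence in thread 1 to pair with
thread 0 : r1 = B($).load(mo_relaxed);             // allowed to read the stale 0

Nothing synchronizes the two threads, so no rule forbids r1 == r2 == 0.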



Take away : atomic_thread_fence(mo_seq_cst) does NOT really act as a #StoreLoad. It can be used in spots where
you need a #StoreLoad, but only in the right situation, eg. if it has the right other stuff to synchronize with.



So, getting back to our futex_mutex example: you can use a seq_cst fence at (*2) to act as your StoreLoad, but
only if you also have one at (*1) for it to synchronize with :



struct futex_mutex3
{
    std::atomic<int> m_state; // mutex locked flag

    // waitset :
    HANDLE m_handle; // actual wait event
    atomic<int> m_count; // waiter count

    /*************/

    futex_mutex3() : m_state(0), m_count(0)
    {
        m_handle = CreateEvent(NULL,0,0,NULL);
    }
    ~futex_mutex3()
    {
        CloseHandle(m_handle);
    }

    void lock()
    {
        // try to grab the mutex by exchanging in a 1
        // if it returns 0, we got the lock
        if ( m_state($).exchange(1,rl::mo_acquire) )
        {
            // prepare_wait :
            // add one to waiter count
            m_count($).fetch_add(1,mo_relaxed);

            // (*1)
            std::atomic_thread_fence(mo_seq_cst,$);

            // double check :
            while ( m_state($).exchange(1,rl::mo_acquire) )
            {
                // wait :
                WaitForSingleObject(m_handle, INFINITE);
            }

            // retire_wait :
            m_count($).fetch_add(-1,mo_relaxed);
        }
    }

    void unlock()
    {
        m_state($).store(0,rl::mo_release);

        //notify_one :

        // (*2)
        // need #StoreLoad before loading m_count

        std::atomic_thread_fence(mo_seq_cst,$);

        int c = m_count($).load(mo_acquire);

        if ( c > 0 )
        {
            SetEvent(m_handle);
        }
    }
};



That is, the two fences allow you to set up a transitive relationship - like the m_count.load in unlock() is definitely
after the fence in unlock, the fence in unlock is after the fence in lock, and the fence in lock is after the m_count.fetch_add ;
therefore the m_count.load must see count > 0.



Or, alternatively, if you leave the fence out of lock(), you can fix it just in unlock by changing either the store
or the load into an RMW :



variant "apple" :

void unlock()
{
// release is for the mutex innards
m_state($).exchange(0,rl::mo_acq_rel);

// #StoreLoad is achieved as a #LoadLoad on m_state

int c = m_count($).load(mo_relaxed);

if ( c > 0 )
{
SetEvent(m_handle);
}
}

variant "banana" :

void unlock()
{
// release is for the mutex innards
m_state($).store(0,rl::mo_release);

// #StoreLoad is achieved as a #StoreStore on m_count :

int c = m_count($).fetch_add(0,mo_release); //(*3)

if ( c > 0 )
{
SetEvent(m_handle);
}
}



(variant "apple" only works if the double check on m_state in lock is acq_rel).



(variant "banana" appears to work if *3 is only mo_relaxed, which is a bit of a mystery! We'll leave that as
an exercise for the reader). (update : it doesn't actually work, see comments).

Thursday, May 31, 2012

05-31-12 - On C++ Atomic Fences Part 2

Last post we talked about what C++0x fences are. Now we'll look at what they are not.



(NOTE : I am talking about what a C++0x fence is guaranteed to be by the standard; for now we
will not concern ourselves with the fact that current implementations of a C++0x fence always actually issue a CPU
memory barrier which is somewhat stronger than what C++0x promises; I have no doubt that CPUs and compilers
will change in the future to be more aggressive about allowing relaxed memory ordering).



The C++0x fence is not visible to other threads unless they specifically schedule against the fence.
That maybe sounds obvious if you are thinking in terms of the C++0x definitions, but it's not true of
a real CPU fence.



A very common example in old-school code is to use a CPU barrier for publication. Something like this :



Set up stuff
Barrier
Publish stuff

other threads :

if stuff is published
then reading stuff should see set up



this does *not* work if "Barrier" is a C++0x fence. In particular we can construct this simple example :



atomic<int> x; // stuff
atomic<int> y; // publication flag

x & y initially 0

thread 0 :

// set up stuff :
x($).store(1,mo_relaxed);
// barrier :
std::atomic_thread_fence(mo_seq_cst,$);
// publish :
y($).store(1,mo_relaxed);

thread 1 :

// wait for publication :
rl::backoff bo;
while ( y($).load(mo_relaxed) == 0 )
bo.yield($);

// read it :
int xx = x($).load(mo_relaxed);
RL_ASSERT( xx == 1 );



this does not work (xx can be zero). To make it work in C++0x you need something that can synchronize with the
fence, for example this would work :



thread 1 :

// wait for publication :
rl::backoff bo;
while ( y($).load(mo_relaxed) == 0 )
bo.yield($);

// added -
std::atomic_thread_fence(mo_acquire,$);

// read it :
int xx = x($).load(mo_relaxed);
RL_ASSERT( xx == 1 );



Why does it work on real CPUs then? (assuming the "relaxed" loads are at least C++ volatile, so they go to memory
and all that, but are not otherwise ordered). On all current real CPUs, a memory barrier sends a message to all other CPUs
which creates a sync point for all of them (not exactly true, but effectively true).
When thread 1 sees that y is non-zero, the store to y on thread 0
must have happened, which means the barrier must have happened, so x must be set up; and our load of x occurs after the
barrier, so it must see the set-up value. That is, the barrier forms a sync point on *all* threads, not just the
originator, and you don't necessarily need your own fence to tap into that sync; all you need is a way
of connecting a "happens before/after" to it. In this case we can say :



thread 1 reads x after
thread 1 loads y with value == 1 after
thread 0 stores y with value == 1 after
thread 0 does a barrier after
thread 0 stores x





(aside : this is of course a terrible way to do publication in the modern world; you should just use a store-release /
load-acquire pair to publish and consume; alternatively, if you publish a pointer, then you don't need any synchronization
at all, because causality and load-consume take care of it for you).
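
For reference, the modern shape of that publication, in the same notation :

thread 0 :

x($).store(1,mo_relaxed);   // set up stuff
y($).store(1,mo_release);   // publish ; the release carries the setup with it

thread 1 :

rl::backoff bo;
while ( y($).load(mo_acquire) == 0 )  // consume ; acquire synchronizes-with the release
    bo.yield($);

int xx = x($).load(mo_relaxed);
RL_ASSERT( xx == 1 );  // guaranteed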



Okay. We can see the exact same thing in a more complex example.
This is Dmitriy's example of a Peterson Mutex. Here's my test harness :





struct test7 : rl::test_suite<test7, 2>
{
    // the mutex :
    std::atomic<int> m_flag[2];
    std::atomic<int> m_turn;
    // something protected by the mutex :
    std::atomic<int> m_data;

    void before()
    {
        m_flag[0]($).store(0);
        m_flag[1]($).store(0);
        m_turn($).store(0);
        m_data($).store(0);
    }

    void after()
    {
        int d = m_data($).load();
        RL_ASSERT( d == 42 );
    }

    void lock(int tidx)
    {
        int other = tidx^1;

        m_flag[tidx]($).store(1,std::mo_relaxed);
        m_turn($).store(other,std::mo_release);

        // ** Barrier here **

        rl::backoff bo;
        for(;;)
        {
            int f = m_flag[other]($).load(std::mo_acquire);
            int t = m_turn($).load(std::mo_relaxed);
            if ( f && t == other )
            {
                bo.yield($);
                continue;
            }
            break;
        }
    }

    void unlock(unsigned tidx)
    {
        m_flag[tidx]($).store(0,std::mo_release);
    }

    void thread(unsigned tidx)
    {
        lock(tidx);

        // do m_data += 7; m_data *= 2;
        int d = m_data($).load(std::mo_relaxed);
        m_data($).store(d+7,std::mo_relaxed);
        d = m_data($).load(std::mo_relaxed);
        m_data($).store(d*2,std::mo_relaxed);

        unlock(tidx);
    }
};



This is an example where if "Barrier" is an actual CPU barrier, this code works. But if "Barrier" acts like
a C++0x seq_cst fence (and no more), then it doesn't work. (it doesn't work in the sense that it doesn't
actually provide mutual exclusion, eg. the code doesn't do what you expect it to).



The failure is the same kind of thing as the first trivial example; all current CPU's have ordering relations that
are stronger than C++0x. In particular the bad execution case that's possible in C++0x (when barrier is a
seq_cst fence) goes like this :



thread 0 : starts lock
thread 0 : flag[0] = 1
thread 0 : turn = 1

thread 1 : starts lock
thread 1 : flag[1] = 1
thread 1 : turn = 0 ; (overwrites the previous turn=1)
thread 1 : barrier
thread 1 : load flag[0] ; sees 0 - old value (*!)

thread 0 : barrier
thread 0 : load flag[1] ; sees 1
thread 0 : load turn ; sees 0



(*!) is the problem point. For the code to work we must load a 1 there (which thread 0 set already). On a
normal CPU you could say :



load flag[0] is after
barrier, which is after
store to turn = 0, which is after
store to turn = 1, which is after
store flag[0] = 1



therefore load flag[0] must see a 1. (I know the store to turn of 0 is after the store of 1 because the
later load of turn on thread 0 sees a 0).



So part of the problem is that C++0x doesn't give you any ordering from stores in the linear "modification order" of an atomic object;
observing that your store came later in that order doesn't create any "happens after" relationship by itself.
So the easy fix to ensure the "is after" relationship above actually happens is to change the store to turn
into an RMW :



void lock(int tidx)
{
    int other = tidx^1;

    m_flag[tidx]($).store(1,std::mo_relaxed);

    m_turn($).exchange(other,std::mo_acq_rel); // changed to RMW

    rl::backoff bo;
    for(;;)
    {
        int f = m_flag[other]($).load(std::mo_acquire);
        int t = m_turn($).load(std::mo_relaxed);
        if ( f && t == other )
        {
            bo.yield($);
            continue;
        }
        break;
    }
}



and that works (it's the same thing described in Peterson's lock with C++0x atomics - Just Software Solutions, among other places).



Some links on the topic :



Subtle difference between C++0x MM and other MMs

Subtle difference between C++0x MM and other MMs - Page 3

std::atomic_thread_fence - cppreference.com

Relacy finds fence bugs in spsc

Re: questions about memory_order_seq_cst fence 2

Re: questions about memory_order_seq_cst fence 1

Implementing Dekker's algorithm with Fences - Just Software Solutions

C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups 2

C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups 1