Thursday, May 31, 2012

05-31-12 - On C++ Atomic Fences Part 2

Last post we talked about what C++0x fences are. Now we'll look at what they are not.



(NOTE : I am talking about what a C++0x fence is guaranteed to be by the standard; at the moment we
will not concern ourselves with the fact that at the moment a C++0x fence always actually issues a CPU
memory barrier which is somewhat stronger than what C++0x promises; I have no doubt that CPUs and compilers
will change in the future to be more aggressive about allowing relaxed memory ordering).



The C++0x fence is not visible to other threads unless they specifically schedule against the fence.
That maybe sounds obvious if you are thinking in terms of the C++0x definitions, but it's not true of
a real CPU fence.



A very common example in old-school code is to use a CPU barrier for publication. Something like this :



Set up stuff
Barrier
Publish stuff

other threads :

if stuff is published
then reading stuff should see set up



this does *not* work if "Barrier" is a C++0x fence. In particular we can construct this simple example :



atomic<int> x; // stuff
atomic<int> y; // publication flag

x & y initially 0

thread 0 :

// set up stuff :
x($).store(1,mo_relaxed);
// barrier :
std::atomic_thread_fence(mo_seq_cst,$);
// publish :
y($).store(1,mo_relaxed);

thread 1 :

// wait for publication :
rl::backoff bo;
while ( y($).load(mo_relaxed) == 0 )
bo.yield($);

// read it :
int xx = x($).load(mo_relaxed);
RL_ASSERT( xx == 1 );



this does not work (xx can be zero). To make it work in C++0x you need something that can synchronize with the
fence, for example this would work :



thread 1 :

// wait for publication :
rl::backoff bo;
while ( y($).load(mo_relaxed) == 0 )
bo.yield($);

// added -
std::atomic_thread_fence(mo_acquire,$);

// read it :
int xx = x($).load(mo_relaxed);
RL_ASSERT( xx == 1 );



Why does it work on real CPU's then? (assuming the "relaxed" loads are at least C++ volatile so they go to memory
and all that, but not otherwise ordered). On all current real CPU's a memory barrier sends a message to all other CPU's
which creates a sync point for all of them (not exactly true, but effectively true).
When thread 1 sees that y is non-zero, then the store to y on thread 0
must have happened, which means the barrier must have happened, so x must be set up, and our load of x occurs after the
barrier, so it must see the set up value. That is, the barrier forms a sync point on *all* threads, not just the
originator, and you don't necessarily need your own fence to tack into that sync, all you need to do is have a way
of connecting a "happens before/after" to it. In this case we can say :



thread 1 reads x after
thread 1 loads y with value == 1 after
thread 0 stores y with value == 1 after
thread 0 does a barrier after
thread 0 stores x





(aside : this is of course a terrible way to do publication in the modern world, you should just use a store-release and
load-acquire pair to do publish and consume; alternatively if you publish a pointer, then you don't need any synchronization
at all, because causality and load-consume takes care of it for you).



Okay. We can see the exact same thing in a more complex example.
This is
Dmitriy's example
Peterson Mutex
. Here's my test harness :





struct test7 : rl::test_suite<test7, 2>
{
// the mutex :
std::atomic<int> m_flag[2];
std::atomic<int> m_turn;
// something protected by the mutex :
std::atomic<int> m_data;

void before()
{
m_flag[0]($).store(0);
m_flag[1]($).store(0);
m_turn($).store(0);
m_data($).store(0);
}

void after()
{
int d = m_data($).load();
RL_ASSERT( d == 42 );
}

void lock(int tidx)
{
int other = tidx^1;

m_flag[tidx]($).store(1,std::mo_relaxed);
m_turn($).store(other,std::mo_release);

// ** Barrier here **

rl::backoff bo;
for(;;)
{
int f = m_flag[other]($).load(std::mo_acquire);
int t = m_turn($).load(std::mo_relaxed);
if ( f && t == other )
{
bo.yield($);
continue;
}
break;
}

}

void unlock(unsigned tidx)
{
m_flag[tidx]($).store(0,std::mo_release);
}

void thread(unsigned tidx)
{
lock(tidx);

// do m_data += 7; m_data *= 2;
int d = m_data($).load(std::mo_relaxed);
m_data($).store(d+7,std::mo_relaxed);
d = m_data($).load(std::mo_relaxed);
m_data($).store(d*2,std::mo_relaxed);

unlock(tidx);
}
};



This is an example where if "Barrier" is an actual CPU barrier, this code works. But if "Barrier" acts like
a C++0x seq_cst fence (and no more), then it doesn't work. (it doesn't work in the sense that it doesn't
actually provide mutual exclusion, eg. the code doesn't do what you expect it to).



The failure is the same kind of thing as the first trivial example; all current CPU's have ordering relations that
are stronger than C++0x. In particular the bad execution case that's possible in C++0x (when barrier is a
seq_cst fence) goes like this :



thread 0 : starts lock
thread 0 : flag[0] = 1
thread 0 : turn = 1

thread 1 : starts lock
thread 1 : flag[1] = 1
thread 1 : turn = 0 ; (overwrites the previous turn=1)
thread 1 : barrier
thread 1 : load flag[0] ; sees 0 - old value (*!)

thread 0 : barrier
thread 0 : load flag[1] ; sees 1
thread 0 : load turn ; sees 0



(*!) is the problem point. For the code to work we must load a 1 there (which thread 0 set already). On a
normal CPU you could say :



load flag[0] is after
barrier, which is after
store to turn = 0, which is after
store to turn = 1, which is after
store flag[0] = 1



therefore load flag[0] must see a 1. (I know the store to turn of 0 is after the store of 1 because the
later load of turn on thread 0 sees a 0).



So part of the problem is C++0x doesn't count stores in the linear "modification order" of an atomic object.
So the easy fix to ensure the "is after" relationship above actually happens is to change the store to turn
into an RMW :



void lock(int tidx)
{
int other = tidx^1;

m_flag[tidx]($).store(1,std::mo_relaxed);

m_turn($).exchange(other,std::mo_acq_rel); // changed to RMW

rl::backoff bo;
for(;;)
{
int f = m_flag[other]($).load(std::mo_acquire);
int t = m_turn($).load(std::mo_relaxed);
if ( f && t == other )
{
bo.yield($);
continue;
}
break;
}

}



and that works (it's the same thing described here :
Peterson's lock with C++0x atomics - Just Software Solutions
among other places).



Some links on the topic :



Subtle difference between C++0x MM and other MMs

Subtle difference between C++0x MM and other MMs - Page 3

stdatomic_thread_fence - cppreference.com

Relacy finds fence bugs in spsc

Re questions about memory_order_seq_cst fence 2

Re questions about memory_order_seq_cst fence 1

Implementing Dekker's algorithm with Fences Just Software Solutions - Custom Software Development and Website Development in

C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups 2

C++0x sequentially consistent atomic operations - comp.programming.threads Google Groups 1