Sunday, January 25, 2009

01-25-09 - Low Level Threading Junk Part 2



Okay, we're starting to learn some primitive threading junk. So let's try to use it to do something useful.



Non-trivial statics in C++ are hideously non-thread-safe. That should be extremely obvious but if you don't see it, read here :
The Old New Thing C++ scoped static initialization is not thread-safe, on purpose! .
Obviously plain C static initialization of integral types is fine (BTW floats are NOT fine - beware using const floats in C++). Beware all things like this :





Object & Instance() { static Object s; return s; }

void func(int i)
{
static Object once;
once.DoStuff(i);
}





Not thread safe. For the most part, we should just not use them. It's very nice to do minimal work in CINIT. Much brouhaha has been made about the "thread safe singleton"
(see all these) :



opbarnes.com Code Snippet Released - Thread-safe Singleton (for Windows platforms)

Dr. Dobb's Singleton Creation the Thread-safe Way October 1, 1999 - BROKEN

Dr. Dobb's C++ and The Perils of Double-Checked Locking Part I July 1, 2004

Double-checked locking - Wikipedia, the free encyclopedia




For my money this is a little silly because there's a really damn trivial solution to this. The main point of the old C++ initialize-on-demand Singleton pattern (like Instance() above) is to make it
work during CINIT with complex initialization order dependencies. Now, CINIT we know is run entirely on one thread before any threads are made. (you are fucking nuts if you're making threads during
CINIT). That means we know any Singletons that get made during CINIT don't have to worry about threading. That means you can use a trivial singleton pattern, and just make sure it gets used in
CINIT :



static Object * s_instance = NULL; // okay cuz its a basic type

Object * GetSingleton()
{
if ( ! s_instance )
{
s_instance = new Object;
}
return s_instance;
}

static volatile Object * s_forceUse = GetSingleton();



The forceUse thing just makes sure our GetSingleton is called during cinit, so that we can be sure that we're all made by the time threads come around. The GetSingleton() that we have here is NOT thread
safe, but it doesn't need to be ! (btw I assume that "CreateThread" acts as a MemoryBarrier (??))



Okay, that's nice and easy. What if you want a Singleton that doesn't usually get made at all, so you don't want to just make it during CINIT ? (you actually want it to get made on use). Okay, now we
have to worry about thread-safing the construction.



The easiest way is you know your Singleton won't get used during CINIT. In that case you can just use Critical Section :





static CCriticalSection s_crit; // CRITICAL_SECTION wrapper class will construct during CINIT
static Object * s_instance = NULL;

// broken version :

Object * GetSingleton()
{
if ( ! s_instance )
{
s_crit.Lock();

s_instance = new Object;

s_crit.Unlock();
}
return s_instance;
}



Okay, that was easy, but it's horribly broken. First of all, there's no guarantee that Object is only made once. The thread can switch while the instruction pointer is between the first if() and the
critsec lock. If that happens, some other thread can get in and make s_instance, and then when we come back to execute we run through and make it again. (If you like, you could say we put the critsec in the wrong place - we could fix it by
moving the critsec out of the if). Even aside from that, the line that assigns s_instance is all wrong because the pointer to s_instance is not necessarily being written atomically, and it might be written before
the stuff inside Object is written. What did we learn in Part 1?



// good version :

static CCriticalSection s_crit; // CRITICAL_SECTION wrapper class will construct during CINIT
static Object * volatile s_instance = NULL;

Object * GetSingleton()
{
if ( ! s_instance ) // must be memory_order_consume
{
s_crit.Lock();

if ( ! s_instance )
{
Object * pNew = new Object;

InterlockedExchangeRelease( &s_instance , pNew );
}

s_crit.Unlock();
}
return s_instance;
}



This is a "double-checked lock" that works. The purpose of the double-check is to avoid taking the critsec when we can avoid doing so. If instance is NULL we take the critsec, then we have to check again
to make sure it's still null, then we rock on and make sure that the Object's memory is flushed before s_instance is set, by using a "Release" memory barrier. Also using Interlocked ensures the pointer is
written atomically.
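
(InterlockedExchangeRelease here is a wrapper, not a stock Win32 export; a minimal sketch of what such a wrapper could be - on Windows/x86 a plain InterlockedExchangePointer is already a full barrier, which is stronger than the release semantics we actually need :)

// hypothetical wrapper : plain InterlockedExchangePointer is a full barrier on Windows/x86,
// which includes the release ordering we want here
#define InterlockedExchangeRelease( pDest , value ) \
    InterlockedExchangePointer( (PVOID volatile *)(pDest) , (PVOID)(value) )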



ADDENDUM : I need to add some more notes about this. See comments for now.



Okay, that's all good, but it doesn't work if you now ever try to use this Singleton from CINIT. The problem is that s_crit might not be constructed yet. There's a pretty easy solution to that - just check
if s_crit has been initialized, and if it hasn't, then don't use it. That works because we know CINIT is single threaded, so you can do something like :



class Object { } ;
static CRITICAL_SECTION s_crit = { 0 };
AT_STARTUP( InitializeCriticalSection(&s_crit); );
static Object * s_instance = NULL;

Object * GetSingleton()
{
if ( ! s_instance )
{
if ( s_crit.DebugInfo == 0 ) // DebugInfo will always be nonzero for an initialized critsec
{
// must be in CINIT !
s_instance = new Object;
}
else
{
EnterCriticalSection(&s_crit);

if ( ! s_instance )
{
Object * pNew = new Object;

InterlockedExchangeRelease( &s_instance , pNew );
}

LeaveCriticalSection(&s_crit);
}
}
return s_instance;
}



This actually works pretty dandy. Note that in CINIT you might take the upper or lower paths - you have no way of knowing if GetSingleton() will be called before or after the critsec is initialized. But that's fine,
it works either way, by design. Note that we are crucially relying here on the fact that all non-trivial CINIT work is done after all the simple-type zeroing.



Okay, so that's all fine, but this last thing was pretty ugly. Wouldn't it be nice to have a critical section type of mutex object that can be statically initialized so that we don't have to worry about CINIT
mumbo jumbo ?



Well, yes it would, and it's pretty trivial to make a standard simple Spin Lock thing :



typedef LONG SpinMutex; // initialize me with = 0

void SpinLock(SpinMutex volatile * mut)
{
// 0 is unlocked, 1 is locked
COMPILER_ASSERT( sizeof(SpinMutex) == sizeof(LONG) );
int spins = 0;
while( InterlockedCompareExchange((LONG *)mut,1,0) == 1 ) // Acquire
{
++spins;
if ( spins > CB_SPIN_COUNT )
{
Sleep(0);
}
}
// *mut should now be 1
}

void SpinUnlock(SpinMutex volatile * mut)
{
// *mut should now be 1
InterlockedDecrement((LONG *)mut); // Release
// *mut should now be 0
// if there was a higher priority thread stalled on us, wake it up !
}



Note that SpinLock will deadlock you if you try to recurse. Okay, now because this thing is static initializable we can use it to make a Singleton and be safe for CINIT :




static SpinMutex s_mutex = 0;
static Object * volatile s_instance = NULL;

Object * GetSingleton()
{
if ( ! s_instance )
{
SpinLock(&s_mutex);

if ( ! s_instance )
{
Object * pNew = new Object;

InterlockedExchangeRelease( &s_instance , pNew );
}

SpinUnlock(&s_mutex);
}
return s_instance;
}



And that works too.



Now, people get tempted to use these simple SpinLocks all over. *DON'T DO IT*. It looks like it should be faster than CRITICAL_SECTION
but in fact it's way way slower in bad cases.



First of all, what is a CRITICAL_SECTION ? See :
Matt Pietrek - Critical Sections Under Windows .




It starts out as a simple spin lock like the above. It does the same kind of thing to busy-spin the processor first. But then it doesn't just Sleep. This kind of
spin-and-sleep is a really really bad thing to do in heavy threading scenarios. If you have lots of threads contending over one resource, especially with very different
execution patterns, the spin-and-sleep can essentially serialize them (or worse). They can get stuck in loops and fail to make any progress.



Once CRITICAL_SECTION sees contention, it creates a kernel Event to wait on, and puts the thread into an alertable wait. The Windows scheduler has lots of good mojo for
dealing with events and threads in alertable waits - it's the best way to do thread sleeping and wake-ups generally. For example, it has mojo to deal with the bad
cases of heavy contention by doing things like randomly choosing one of the threads that is waiting on a critsec to wake up, rather than just waking up the next one
(this prevents a few runaway threads from killing the system).



One important note : you should almost always use "InitializeCriticalSectionAndSpinCount" these days to give your crit secs a spinner. More about why in a moment.
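
For reference, here's a minimal sketch of a critsec wrapper that sets the spin count up front (CCriticalSection matches the wrapper name used above; CB_SPIN_COUNT is my own constant - tune it, more on values below) :

#include <windows.h>

#define CB_SPIN_COUNT 1000 // see the spin count discussion below

class CCriticalSection
{
public:
    CCriticalSection()  { InitializeCriticalSectionAndSpinCount(&m_cs, CB_SPIN_COUNT); }
    ~CCriticalSection() { DeleteCriticalSection(&m_cs); }

    void Lock()   { EnterCriticalSection(&m_cs); }
    void Unlock() { LeaveCriticalSection(&m_cs); }

private:
    CRITICAL_SECTION m_cs;
};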



About performance : (from one of the MSDN articles) :



* MemoryBarrier was measured as taking 20-90 cycles.
* InterlockedIncrement was measured as taking 36-90 cycles.
* Acquiring or releasing a critical section was measured as taking 40-100 cycles.



which clearly shows that the idea that a critical section is "too slow" is nonsense. I've said this already but let me emphasize :



Most of the time (no contention) a Crit Sec *is* just an InterlockedIncrement
(so how could you beat it by using InterlockedIncrement instead?)

When there is contention and the Crit Sec does something more serious (use an Event) it's slow
but that is exactly what you want !!



In fact, the big win from "lock free" is not that InterlockedIncrement or whatever is so much faster - it's that you can sometimes
do just one interlocked op, unlike crit sec which requires two (one for Enter and one for Leave).



An interesting alternative is this QLock thing by Vladimir Kliatchko :

Dr. Dobb's Developing Lightweight, Statically Initializable C++ Mutexes May 1, 2007 (QLock) ; it's quite a
bit like the MS critsec, in that he uses a simple interlocked op to check for contention, and if it's not free then he makes an Event and goes into a wait state
and all that. The code is there on Dr. Dobbs but you have to do the thing where you hit the "Download" in the top bar and then navigate around to try to find it.
It's here in MUTEX.LST.




Ok, now a bit more about InitializeCriticalSectionAndSpinCount. The reason you want a Spin is because basically every new machine these days
is "multiprocessor" (multicore). That's a big difference from threading on a single core.



If you're threading on a single core - memory will never change underneath you while you are executing. You can get swapped out and
memory can change and you can get swapped in, so it looks like memory changed underneath you, but it doesn't happen in real time.
With multiple cores memory can be touched by some other core that is running along at full speed.



Often this doesn't affect the way you write threading code, but it does affect performance issues. It means you can have contention on a
way finer scale. In a normal OS single proc environment, you get large time slices so other threads aren't frequently randomly poking at
you. With real multicore the other guy can be poking at you all the time. A good range is to spin something like 1000-5000 times.
1000 spins is on the order of a microsecond.



What the spin count does is let you stay in the same thread and avoid an OS thread switch if some other processor is holding the lock.
Note that if it's a thread on the same processor - spinning is totally pointless (in fact it's harmful).

01-25-09 - Low Level Threading Junk Part 1



Okay, this is definitely one of those posts where I'm no expert by a long shot so I'll probably write some things that are wrong and
you should correct me. By "low level" I mean directly accessing shared variables and worrying about what's going on
with the memory, as opposed to just using the language/OS constructs for safe sharing.



Let me say up front that writing lots of low-level thread-sharing code is a very very bad idea and should not be done in 99% of
the cases. Just use CriticalSection and once in a while use Interlocked and don't worry about the tiny inefficiency; if you
do things right they won't be a speed hit. Trying to get things right in lots of low level threading code is a recipe for
huge bugs.



I'm going to assume you know about race conditions and basic threading primitives and such. If you're hazy, this is a pretty good introduction :
Concurrency What Every Dev Must Know About Multithreaded Apps .
I'm also going to assume that you know how to do simple safe thread interaction stuff using Interlocked, but maybe you don't know exactly
what's going on with that.



First of all, let me try to list the various issues we're dealing with :



1. Multiple threads running on one core (shared cache). This means you can get swapped in and out; this is actually the *easy* part but
it's what most people think of as threading.



2. Multiple cores/processors running multiple threads, possibly with incoherent caches. We have #1 to deal with, but also now the memory views
of various threads can be different. The memories sync up eventually, but it's almost like they're talking through a delayed communication channel.



3. CPU out-of-order instruction reorder buffers and cache gather/reorder buffers. Instructions (even in ASM) may not execute in the order you wrote, and even
if they do execute in that order, memory reads & writes may not happen in the order of the instructions (because of cache line straddle issues, write
gather buffers, etc.)



4. Single CISC CPU instructions being broken into pieces (such as unaligned accesses). Single ASM instructions may not be single operations; things like
"inc" become "load, add, store" and there's an opportunity for a thread to interleave in there. Even apparently atomic ops like just a "load" can become
multiple instructions if the load is unaligned, and that creates a chance for another thread to poke in the gap.



5. The compiler/optimizer reordering operations. Obviously things don't necessarily happen in the order that you write them in your C program.



6. The compiler/optimizer caching values or eliminating ops.



I think that's the meat of it. One thing that sort of mucks this all up is that x86 and MSVC >= 2005 are sort of special cases which are much simpler
than most other compilers & platforms. Unfortunately most devs and authors are working with x86 and MSVC 2005+ which means they do lazy/incorrect things
that happen to work in that case. Also I'm going to be talking about C++ but there are actually much better memory model controls now in Java and C#.
I'm going to try to describe things that are safe in general, not just safe on x86/Windows, but I will use the x86/Windows functions as an example.



Almost every single page I've read on this stuff gets it wrong. Even the ones by experts. I always see stuff like
this .
Where they implement a lock-free fancy doohicky, and then come back later and admit that oh, it doesn't actually work. For example
this Julian Bucknall guy has a lot of long articles about lock free stuff,
and then every 3rd article he comes back and goes "oh BTW the last article was wrong, oops". BTW never try to use any of the lock free stuff from
a place like "codeproject.com" or anybody's random blog.



I've read a lot of stuff like :




Unfortunately, Matt's answer features what's called double-checked locking which isn't supported by the C/C++ memory model.




To my mind, that's a really silly thing to say. Basically C/C++ doesn't *have* a memory model. Modern C# and Java *do*, which means that the language
natively supports the ideas of memory access ordering and such. With C/C++ you basically have zero guarantees of anything. That means you have to do
everything manually. But when you do things manually you can of course do whatever you want. For example "double checked locking" is just fine in C,
but you have to manually control your memory model in the right way. (I really don't even like the term "memory model" that much; it's really an "execution
model" because it includes things like the concept of what's atomic and what can be reordered).



Some things I'm going to talk about : how lock free is really like spin locking, how critical sections work, why you should spincount critical
sections now, what kind of threading stuff you can do without critsecs, etc.



Something I am NOT going to talk about is the exact details of the memory model on x86/windows, because I don't think you should be writing code for a
specific processor memory model. Try to write code that will always work. x86/windows has strong constraints (stores are not reordered past stores, etc.),
but forget you know that and don't rely on it.




Let's look at a simple example and see if we can get it right.



Thread A is trying to pass some work to thread B. Thread B sits and waits for the work then does it. Our stupid example looks like :



// global variables :
MyStruct * g_work = NULL;


Thread A :

int main(int argc, char ** argv)
{
g_work = new MyStruct( argc, argv );

....
}


void ThreadB()
{

while( g_work == NULL )
{
}

g_work->DoStuff();
}



Okay, now let's go through it and fix it.



First of all, this line should give you the heebie-jeebies :


g_work = new MyStruct( argc, argv );


It's doing a bunch of work and assigning to a shared variable. There are no guarantees about what order that gets written to memory, so g_work could be assigned
before the struct is set up, then ThreadB could start poking into it while I'm still constructing it. We want to release a full object to g_work that we're all
done with. We can start trying to fix it by doing :


1. MyStruct * temp = new MyStruct( argc, argv );
2. g_work = temp;


that's good, but again you cannot assume anything about the memory model in C or the order of operations. In particular, we need to make sure that the
writes to memory done by line 1 are actually finished before line 2 executes.



In Windows we do that by calling
MemoryBarrier() :


MyStruct * temp = new MyStruct( argc, argv );
MemoryBarrier();
g_work = temp;


MemoryBarrier is an intrinsic in MSVC ; it actually does two things. 1. It emits an instruction that causes the processor to force a sync point (this also actually
does two things : 1A : flushes caches and write gather buffers and 1B. puts a fence in the reorder buffer so the processor can't speculate ahead). 2. the MemoryBarrier
intrinsic also acts as a compiler optimizer barrier - so that the MSVC compiler won't move work before MemoryBarrier ahead of MemoryBarrier.



MemoryBarrier is a full memory fence, it creates an ordering point. In general if you just write memory operations :


A
B
C


you can't say what order they actually happen in. If another thread is watching that spot it might see C,A,B or whatever. With MemoryBarrier :


A
B
MemoryBarrier
C


You get an order constraint : C is always after {AB} , so it might be ABC or BAC.



Another digression about the compiler optimizer fence : in windows you can also control just the compiler optimization with
_ReadWriteBarrier (and _ReadBarrier and _WriteBarrier). This doesn't generate a memory
fence to the processor, it's just a command to the optimizer to not move memory reads or writes across a specific line. I haven't seen a case where I would actually
want to use this without also generating a memory barrier (??). Another thing I'm not sure about - it seems that if you manually output a fence instruction with __asm,
the compiler automatically treats that as a ReadWriteBarrier (??).
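
For what it's worth, a minimal sketch of the compiler-only fence (my own illustration; g_data and g_flag are hypothetical) :

#include <intrin.h>

extern int g_data;
extern int g_flag;

void publish_with_compiler_fence_only()
{
    g_data = 42;          // fill out the data
    _ReadWriteBarrier();  // optimizer barrier only - the compiler won't move the write of
                          // g_data past this line, but the CPU still can, so on its own
                          // this is NOT enough for cross-thread publication
    g_flag = 1;
}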



Alright, so we're getting close, we've made a work struct and forced it to flush out before becoming visible to the other thread :


MyStruct * temp = new MyStruct( argc, argv );
MemoryBarrier();
g_work = temp; <-


What about this last line? It looks innocuous, but it holds many traps. Assignment is atomic - but only if g_work is a 32 bit pointer on 32-bit x86, or a 64 bit pointer on
64-bit x86. Also since g_work is just a variable it could get optimized out or deferred or just stored in local cache and not flushed out to the bus, etc.



One thing we can do is use "volatile". I hesitate to even talk about volatile because it's not well defined by C and it means different things depending on platform
and compiler. (In particular, MS has added lots of threading usefulness to volatile, but nobody else does what they do, so don't use it!). What we want "volatile" for
here is to force the compiler to actually generate a memory store for g_work. To my mind I like to think that volatile means "don't put me in registers - always read or
write memory". (again on x86 volatile means extra things, for example volatile memory accesses won't get reordered, but don't rely on that!).
Note you might also have to make sure g_work is aligned unless you are sure the compiler is doing that.



One thing to be careful with about volatile is how you put it on a pointer. Remember to read pointer adjectives from right to left in C:


volatile char * var;

// this is a non-volatile pointer to a volatile char !! probably not what you meant

char * volatile var;

// this is a volatile pointer to chars - probably what you meant
// (though of course you have to make sure the char memory accesses are synchronized right)




Note that volatile is a pretty big performance hit. I actually think most of the time you should just not use "volatile" at all, because it's too variable
in its meaning, and instead you should manually specify the operations that need to be sync'ed :



On Windows there's a clean way to do this :



MyStruct * temp = new MyStruct( argc, argv );
InterlockedExchangePointer( &g_work, temp );



The Interlocked functions are guaranteed to be atomic. Now we don't have to just hope that the code we wrote actually translated into an atomic op.
The Interlocked functions also automatically generate memory barriers and optimizer barriers
(!! NOTE : only true on Windows, NOT true on Xenon !!). Thus the InterlockedExchangePointer ensures all the writes that fill out MyStruct (through temp) are
done before g_work is set.



Let me just briefly mention that this full MemoryBarrier is *very* expensive and you can get away with less. In particular, something you will see
is "Acquire" and "Release". The heuristic rule of thumb is that you use "Acquire" to read shared conditions and "Release" to write them. More
formally, "Acquire" is a starting memory barrier - it means any memory op done after the Acquire will not move before it (but ones done before can move
after). "Release" is a finishing memory barrier - memory ops done before it will not move after (but ones done after can be done before).



So if you have :


A
B
C - Release
D


The actual order could be {A B C D} or {B A C D} or {A B D C} but never {A C B D}. Obviously Acquire and Release are a slightly weaker constraint than a
full barrier so they give the processor more room to wiggle, which is good for it.



So let's do another example of this simple thread passing (stolen from comp.programming.threads) :



int a;
int b;
LONG valid = 0;
CRITICAL_SECTION mtx;

void function1 () { // called from many threads

EnterCriticalSection(&mtx);
if (! valid) {
a = something;
b = something;
InterlockedExchangeAddRelease(&valid, 1);
// writes to memory 'a' and 'b' will be done before valid is set to 1
}
do_other_stuff();
LeaveCriticalSection(&mtx);

}

int function2 () { // called from many threads

if (InterlockedExchangeAddAcquire(&valid, 0)) {
return a + b;
} else {
return 0;
}

}

int function3 () { // broken

if ( valid ) {
return a + b;
} else {
return 0;
}

}




Now it's easy to fall into a trap of thinking that because we did the "Release" that function3 is okay. I mean, by the time function3 sees valid get
set to 1, 'a' and 'b' will already be set, so function 3 is right, okay? Well, sort of. That would be true *if* function 3 was in assembly so the
compiler couldn't reorder anything, and if the chip couldn't reorder memory ops (or if we rely on the x86 constraint of read ordering). You should
know the actual execution of function 3 could go something like :



fetch a
fetch b
add b to a
test valid
set conditional a to 0
return a



which is now reading a and b before 'valid'. Acquire stops this.



Some good links on this basic memory barrier stuff : (read Kang Su in particular)



Kang Su's Blog volatile, acquirerelease, memory fences, and VC2005

Memory barrier - Wikipedia, the free encyclopedia

Acquire and Release Semantics

The Old New Thing Acquire and release sound like bass fishing terms, but they also apply to memory models

Memory ordering, barriers, acquire and release semantics niallryan.com

Memory barrier + volatile (C++) - comp.programming.threads Google Groups




BTW note that volatile does some magic goodness in VC >= 2005 , but *not* on Xenon even with that compiler version.
In summary :



ReadBarrier/WriteBarrier - just prevents compiler reordering

MemoryBarrier() - CPU memory load/store barrier

Interlocked & volatile
Interlocked functions on Windows automatically generate a compiler & a CPU barrier
Interlocked on Xbox 360 does not generate a CPU barrier - you need to do it

special volatile thing in MSVC ?
volatile automatically does the right thing on VC >= 2005
not for XBox though





ADDENDUM : lol the new Dr. Dobbs has a good article by Herb called
volatile vs volatile . It covers a lot of this same territory.



And some more links that are basically on the same topic : (the thing with the valid flag we've done here is called the "publication safety pattern")



Bartosz Milewski - Publication Safety on Multicore

Hans Boehm - Why atomics have integrated ordering constraints

Bartosz Milewski - C Atomics and Memory Ordering

Saturday, January 24, 2009

01-24-09 - linkies



Complexification Gallery - wow really gorgeous images. All made algorithmically with
Processing, and most driven by semi-physical-mathematical models. Lots of applets to play with. This is seriously fucking amazing and inspiring.



Killzone 2 making of video
is pretty awesome.



Mischief and Ivo Beltchev
have some crazy debugger database plugin thing for the string-CRC model. IMO this is fucking nuts. But I do love me
some autoexp.dat



We were talking about this the other day and I realized I've forgotten what "Koenig lookup" is and why you need it.
Basically it just means that functions are looked up in the namespace of their arguments. So :



namespace my_ns
{
class Vec3 { ... };

void func( Vec3 & x )
{
...
}

};

int main()
{
my_ns::Vec3 v;
func(v);
}



works, even though it's calling a func() that's not in the global namespace.



If you think about it a second this is nice, because it means that non-member functions on a class act like they are in
the same namespace as that class. This is pretty crucial for non-member operators; it would be syntactically horrific to
have to call the operators in the right namespace. But it's nice for other stuff too, it means you don't need to jam everything into
member functions in order to make it possible to hand a class out to another namespace.





namespace my_ns
{
class Vec3 { ... };

bool operator == (const Vec3 &a , const Vec3 & b) { ... }
};


int main()
{
my_ns::Vec3 a,b;

if ( a == b ) // ***
{
...
}
}


At the line marked *** we're calling my_ns::operator == (Vec3,Vec3) - but how did we get to call a function in my_ns when
we're not in that namespace? Koenig lookup.



Now, this really becomes crucial when you start doing generic programming and using templates and namespaces. The reason is
your containers in the STL are in std:: namespace. You are passing in objects that are in your namespace. Obviously
the STL containers and algorithms need to get access to the operations in your namespace. The only way they can do that
is Koenig lookup - they use the namespace of the type they are operating on. For example to use std::sort and make use of your " operator < "
it needs to get to your namespace.
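
A minimal sketch of that (the names here are just for illustration) :

#include <algorithm>
#include <vector>

namespace my_ns
{
    struct Vec3 { float x,y,z; };

    // non-member operator < lives in my_ns, right next to Vec3
    inline bool operator < (const Vec3 & a, const Vec3 & b) { return a.x < b.x; }
}

int main()
{
    std::vector<my_ns::Vec3> v(10);

    // std::sort is in namespace std, but it finds my_ns::operator <
    // via Koenig (argument dependent) lookup on Vec3
    std::sort( v.begin(), v.end() );

    return 0;
}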



See Herb's good articles :



What's In a Class - The Interface Principle

Namespaces and the Interface Principle

Argument dependent name lookup - Wikipedia, the free encyclopedia




Not exactly related but other good C++ stuff :
Koenig has a Blog .



And if you want to get really befuddled about C++ name lookup you can mix this in :
Why Not Specialize Function Templates .



Finally, I've had "dos here" for a long time, but of course it really should be "tcc here". Just follow
these registry instructions
but for the "command" put :



"C:\Program Files\JPSoft\TCCLE9\tcc.exe" /IS cdd %1



Wednesday, January 21, 2009

01-21-09 - Towards Lock-Free Threading



Currently Oodle threads communicate through plain old mutex locking, but I'd like to go towards lock-free communication eventually.
Now, lock-free coding doesn't have the normal deadlock problems, but it has a whole new host of much scarier and harder to debug
thread timing problems, because you don't have the simplicity of mutex blocking out concurrency during shared data access.



It occurs to me there's a pretty simple way to make the transition and sort of have a middle ground.



Start by writing your threading using plain old mutexes and a locked "communication region". The communication area can only be accessed
while the mutex is held. This is just the standard easy old way :



{Thread A} <---> {Communication Region} <---> {Thread B}

Thread A - lock mutex on CR
change values ;
unlock
--- Thread B lock mutex
read values
unlock



Now find yourself a lockfree stack (aka singly linked list LIFO). The good old "SList" in Win32 is one fine choice. Now basically
pretend that Thread A and Thread B are like over a network, and send messages to each other via the lock-free stacks.
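
For reference, a minimal sketch of the Win32 SList usage (the Message struct here is just a hypothetical payload) :

#include <windows.h>

// the SLIST_ENTRY must be the first member so we can cast back from it,
// and nodes must be aligned to MEMORY_ALLOCATION_ALIGNMENT
struct __declspec(align(MEMORY_ALLOCATION_ALIGNMENT)) Message
{
    SLIST_ENTRY link;
    int         payload;
};

SLIST_HEADER g_stackAtoB;

void InitQueue()         { InitializeSListHead(&g_stackAtoB); }

void Push(Message * msg) { InterlockedPushEntrySList(&g_stackAtoB, &msg->link); }

Message * Pop()          // returns NULL if the stack is empty
{
    return (Message *) InterlockedPopEntrySList(&g_stackAtoB);
}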



To keep the code the same, they both get copies of the Communication Region :



{Thread A | Communication Region} <---> {Communication Region | Thread B}

and they send messages via two stacks :

{Thread A | Communication Region} --- stack AtoB --> {Communication Region | Thread B}
{Thread A | Communication Region} <-- stack BtoA --- {Communication Region | Thread B}



The messages you send across the stacks apply the deltas to the communication regions to keep them in sync, just like networked game.



So for example, if Thread A is the "main thread" and Thread B is a "job doer thread" , a common case might be that the
communication region is a list of pending jobs and a list of completed jobs.



Main Thread
{Pending Jobs}
{Completed Jobs}

Request Job :
add to my {Pending Jobs}
push message on stackAtoB


Worker Thread
{Pending Jobs}
{Completed Jobs}

Sit and pop jobs off stackAtoB
put in pending jobs list
work on pending jobs
put result on completed jobs list
push completion message on stackBtoA



The nice thing is that Main Thread can at any time poke around in his own Pending and Completed list to see if various jobs
are still pending or done yet awaiting examination.



Obviously if you were architecting for lock-free from the start you wouldn't do things exactly like this, but I like the ability to
start with a simple old mutex-based system and debug it and make sure everything is solid before I start fucking around with lock-free.
This way 99% of the code is identical, but it still just talks to a "Communication Region".




ADDENDUM :



I should note that this is really neither new nor interesting. This is basically what every SPU programmer does. SPU "threads" get a
copy of a little piece of data to work on, they do work on their own copy, and then they send it back to the main thread. They don't
page pieces back to main memory as they go.



While the SPU thread is working, the main thread can either not look at the "communication region", or it can look at it but know that it
might be getting old data. For many applications that's fine. For example, if the SPU is running the animation system and you want the
main thread to query some bone positions to detect if your punch hit somebody - you can just go ahead and grab the bone positions without
locking, and you might get new ones or you might get last frame's and who cares. (a better example is purely visual things like particle
systems)



Now I should also note that "lock free" is a bit of a false grail. The performance difference of locks vs. no locks is very small. That
is, whether you use "CriticalSection" or "InterlockedExchange" is not a big difference. The big difference comes from the communication
model. Having lots of threads contending over one resource is slow, whether that resource is "lock free" or locked. Obviously holding locks
for a long time is bad, but you can implement a "lock free" model using locks and it's plenty fast.



That is, this kind of model :



[ xxxxxxxxxx giant shared data xxxxxxxxxx ]

| | | |
| | | |
| | | |

[thread A] [thread B] [thread C] [thread D]



is slow regardless of whether you use locks or not. And this kind of model :



[data A ] [data B ] [data C ] [data D ]

| / | / | / |
| / | / | / |
| / | / | / |

[thread A] [thread B] [thread C] [thread D]



is fast regardless of whether you use locks or not. Okay I've probably made this way more wrong and confusing now.




ADDENDUM #2 :



Let me try to express it another way. The "message passing" model that I described is basically a way of doing a large atomic memory write.
The message that you pass can contain various fields, and it is processed synchronously by the receiver. That makes common unsafe lock-free
methods safe. Let me try to make this clear with an example :



You want Thread B to do some work and set a flag when it's done. Thread A is waiting to see that flag get set and then will process the work.
So you have a communication region like :



// globals :
bool isDone;
int workParams[32];
int workResults[32];



Now a lot of people try to do lock-free work passing trivially by going :



Thread A :

isDone = false;
// set up workParams
// tell Thread B to start

... my stuff ...

if ( isDone )
{
// use workResults
}

Thread B :

// read workParams

// fill out workResults

isDone = true;



Now it is possible to make code like this work, but it's processor & compiler dependent and can be very tricky and causes bugs.
(I think I'll write some about this in a new post, see later). (the problem is that the reads & writes of isDone and the params and
results don't all happen together and in-order). Instead we can just pass the object :



struct ThreadMessage
{
LinkNode link; // linkage so the message can go on the queue
bool isDone;
int workParams[32];
int workResults[32];
};

Thread A :
ThreadMessage myLocalCopy; // in TLS or stack
// fill out myLocalCopy
// push myLocalCopy to thread message queue to thread B

... my stuff ...

if ( pop my message queue )
{
myLocalCopy = copy from queue
// use myLocalCopy.workResults
}

Thread B :
ThreadMessage myLocalCopy; // in TLS or stack
myLocalCopy = pop message queue

// read myLocalCopy.workParams

// fill out myLocalCopy.workResults

push myLocalCopy to queue for thread A



Okay. Basically we have taken the separate variables and linked them together, so that as far as our thread is concerned they get written
and read in one big chunk. That is, we move the shared data from one large consistent state to another.

Tuesday, January 20, 2009

01-20-09 - Laptops Part 3



A little more reportage from the laptop front. Holy christ looking for laptops sucks so bad. They're basically
all the same inside, but of course people can't just name them clearly based on what they're like. You have to
look into every single fucking model in great detail to see what's actually going on with it. Some general
thoughts :



802.11n (aka "Wireless N") seems cool; I wasn't really aware of that. The Intel Wifi 5100 and 5300 seem like the way to go. A lot
of newer laptops don't have 802.11n which seems pretty dumb.



As much as I hate it, Vista-64 seems like the way to go for the future. All the rotten programmers out there are gonna bloat things up
to all hell in the next few years, and I imagine you'll need 8 GB of RAM to run anything in 2010. Hell photoshop
already wants 2GB+ for itself. A 32 bit OS locks you into 3 GB which is gonna suck in 5 years.



XP can definitely be found for laptops, but it does greatly reduce your choices. The Dell business line offers
"XP downgrades" on most laptops. Lenovo Thinkpads and Toshiba Satellites also have some XP options. Oddly it
seems that all of the Mini (10" or less) lappies are on XP.



LED backlighting is brighter, lighter weight, thinner, and uses less power (than the older CCFL backlighting).
That all sounds good, *however* I have
seen a lot of reports of problems with LED screens. Unfortunately it seems to be one of those things where they
vary by screen manufacturer (LG or Samsung or whoever), and every laptop brand sources their screens from various
manufacturers, so you never know what you're going to get. The problems I've seen reported are edge bleeding,
flickering, brightness variation, and "chunky" appearance. I think it's worth the gamble going with LED now, but
you might have to fight your manufacturer for RMA if you get a bad one.



SSD's are clearly awesome for laptops but don't seem quite ready for prime time. Again as with the LED screens,
not all SSD's are created equal, and it's hard to tell what brand you're going to get when you buy a laptop
(mainly some of the off brands have very very poor write performance).
Also, there appear to still be major problems with the bios/driver interaction with SSD's - I've widely seen
reports that they completely stall the CPU during heavy write activity, indicating a failure to do asynchronous
writes. Oddly the SSD upgrade options for laptops seem uniformly overpriced (just buy your own and stick it in
yourself) - but there are some laptops with SSD's included by default that are moderately priced.



As for graphics - I've seen a lot of reports of heat problems with NV 8000 series of mobile chips (See previous
laptop post for links). Apparently the 9000
series is better. The ATI chips seem to have less of a problem with heat, however there are

many reports
of problems
with ATI mobile drivers under Vista. Yay. The Intel integrated chips are the best choice if all you care about is
battery life.



The NV9400 is an integrated part (integrated with the south bridge / memory controller, all that nForce whatever).
It's supposed to be pretty good for power and is a good DX10.1 part. It's what's in the MacBook Pro for example.
BTW
it's
a real LOL
to see the tech writers speculate about Apple shunting OS work off to CUDA. Buy the marketing nonsense
much? BTW also lots of laptop specs lie about the "graphics memory" with NV 9 chips. NV 9 is a shared-memory model
chip because it's built into the memory controller, kind of like the XBoxes, so the "1 Gig" of graphics memory
they're claiming just means that the NV 9 is using 1 G of your main memory.



Another interesting new thing in graphics is the
Thinkpad T400 Switchable Graphics .
The T400 has an ATI HD3470 for games and an Intel integrated X4500 for battery life. Under Vista apparently you can
switch with just a mouse click - no reboot. Under XP you have to reboot. This is kind of cool, but it's also scary as
hell because apparently it uses a hacked-together FrankenDriver. Personally I think just going with the NV9400 is a
better choice. (also, the MacBook Pro has a 9400 -> 9600 switch option; that's weird because the 9400 is good enough IMO.
anyway apparently you have to log out and log back in to switch which is kind of lame; it's not quite a reboot, but it's
not a mouse click either).



There are a lot of shitty screens out there. I ranted before about the prevalence of 1280x800 15" screens. Oddly,
there are also 1920x1200 15" screens now, which is equally insane in the opposite direction. (I assume those laptops
come with a complimentary magnifying glass). Many laptop manufacturers now seem determined to fuck up the screens
for no reason by taking glossy screens and then PUTTING AN EXTRA GLOSSY FILM ON ALREADY GLOSSY SCREENS. Apple for
example does this with the new Macbooks. The manufacturers usually call them "edgeless" or "infinity" screens or
something like that, but they should call them "nerfed" or "bricked" screens because they make your laptop useless
outside of pitch dark.
BTW it's super LOL that the
MacBook website shows the pictures of the screen with
giant reflections across it; the web site design people must have thought it was cool looking and intentional so they're
trying to show off the sweet glare. (the MBP is available with matte I gather, but the MB is not).
BTW also this is not a question of taste - yes, matte vs. glossy is a reasonable choice that could go either way,
but "glossy with extra gloss layer" is not a reasonable choice.



On the plus side, matte screens definitely can be found (or just plain glossy without the extra stupid layer if you
like glossy). With quite a lot of searching I found a bunch of 14" and 15"
laptops with matte WXGA+ (1440x900) screens. Something I could not find was non-widescreen (1400x1050). Everything is
widescreen now. (you can find 1680x1050 aka WSXGA+ in 15" if you want a bit more res)



Intel Core 2 Duo is the way to go, but there's a little trap. Some of them are 25W and some are 35W. Obviously if
you care about heat and battery you want the 25W. It seems the secret code here is a "P" in the name and ignore the
number, so P8400 or P9500 is good. Like everything else Intel has decided to jump into fucking retarded obfuscated
naming land. Oh yes, of course I want a Penryn on Montevina that's a 9500 (beware, some Penryns on Montevina are 35W).
Pretty good guide to Intel nomenclature .



The really thin & light laptops are crazy overpriced right now for no reason. It's the same fricking parts, in fact
often they use *cheaper* parts like older CPU's and lesser GPU's, but then they charge almost a $1000 markup for being
small. For example the Dell Latitude E4300 is around $1700 but a more capable and identical (but larger) Latitude E6400
is around $900. This applies obviously to stuff like the Air and the Envy, but also to pretty much every good Sony
Vaio. If you want the super light laptop you have to either shell out, or go off-brand, or wait.



There are countless things to verify that many lappies fail on : heat, noise, touchpad works decently, hinges okay,
screen latch (!), solid case, gigabit ethernet, eSATA.



Finally, www.notebookcheck.net was my savior in finding laptop information.
They actually have all the models gathered and a pretty clear simple list of specs with each one, so you can decode the
laptop model jungle. Notebookcheck reviews actually have noise volume measurements and exact temperature measurements.
Booya. They do it in Celsius. 30 C is okay. 35 C or more is uncomfortable.




Now the specific models :




Dell Latitude E6400
is one of my favorites. Available in 14.1" with WXGA+ (1440x900) matte LED.
It's a metal case. It's reasonable light
and small and has decent battery life. The specs are not top of the line but they're fine.
It also has no huge "style" markup price. Price range $800-1100. The E5400 is a good budget choice,
it's just a bit heavier and bigger and slightly downgraded components, but can be had for $600.
(these have no top-end graphics options, they're for non-gamers)



Thinkpad T400 is probably
what I'd buy for myself right now if I was buying something. 14.1" with WXGA+ (1440x900) matte LED.
Very good battery life. Good build quality and keyboard, though I have seen reports that they are flimsier
and the keyboard is worse than the older Thinkpads like the T61 line. T400 is around $1200 well equipped.
BTW the Thinkpad R400 and Ideapad Y430 are both very similar to the T400, but not really much cheaper, so just go with the T400.



Dell Studio 15
is a pretty solid middle of the road choice for a 15" lappy. 1440x900 LED , but sadly only glossy. It's
rather heavy, battery not great. I rather like the sloping keyboard design of these things, much more ergonomic than the
big brick style of the Thinkpads.



HP dv5t is a 15" with 1680x1050.
Sadly only glossy and offers the retarded "infinity display" which is awesome
if you like seeing reflections of the sky.
The HP dv4 and dv5 are the cheapest (major brand) laptops with high end GPUs and decent specs. Battery life
is notoriously poor for HP's. They use the dedicated NV chips, not the integrated.



The Toshiba Satellites seem okay but have nothing in particular to recommend them over one of these.



The Acer Aspire line is pretty amazing if you want a sub-$500 laptop. The components are basically fine, the only
thing sucky about them is shitty plastic build quality. If you're not a style snob, they could be a great choice.



The Sony
Vaio Z is a very good ultraportable 13", but expensive ($1699). The cheaper Vaio's like the NR,CS,NZ are not
competitive in build quality or specs with the laptops above.



The new "unibody" MacBook (not pro) is actually a decent value; I mean, it's not a good value, but it's not awful.
Unfortunately it's only available in awful-gloss.



Lastly, laptop pricing is fucking awful. Because you're locked into the manufacturers, they do weird shit with sales and coupons.
The prices can fluctuate wildly from one day to the next. Chances are one of the laptops mentioned here is available with a close to 50%
discount right now. So if you are not locked into a particular model and can wait a bit, do so.

Monday, January 19, 2009

01-19-09 - swap, templates, and junk



I keep running into annoying problems.



One is having macro conflicts. I like to just have stuff like



#define MIN(a,b) ( (a) < (b) ? (a) : (b) )



for myself, which is fine and all, until you try to mix multiple codebases that all have their
own idea of what that macro should be.



(obviously MIN is not actually a big deal because everybody agrees on what that should be; a bigger
issue is like if people are doing a #define of malloc or SWAP or Log).



The cleanest thing I've come up with is :



1. All #defines in headers must be "namespaced" , like CB_MIN instead of just MIN.

2. Never #define anything that looks like it's in global namespace (eg. no #defining malloc)

3. Make a header called "use_cb.h" which does
#define MIN CB_MIN
#define malloc cb::malloc

etc.



Basically this is a hacky way of "namespacing" the macros and then "using the namespace" in the CPP
files. Not awesome.
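
A minimal sketch of what that pair of headers might look like (cb_base.h and use_cb.h are hypothetical names) :

// cb_base.h : everything namespaced, safe to include from any header
#define CB_MIN(a,b)  ( (a) < (b) ? (a) : (b) )
#define CB_MAX(a,b)  ( (a) > (b) ? (a) : (b) )

// use_cb.h : include only at the top of your own .cpp files, never in shared headers
#define MIN  CB_MIN
#define MAX  CB_MAX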




The other issue I'm running into is generic functions and template specialization.



Say you're working with some codebase, let's call it "stupidlib". They have their own ::swap
template :



template < typename Type > inline void stupidSwap(Type &a,Type &b)
{
Type c = a; a = b; b = c;
}



And then they also override it for some special cases :



template < >
inline void stupidSwap<stupidObject>(stupidObject &a,stupidObject &b)
{
a.Swap(b);
}



Now, you are trying to munge stupidlib into a codebase that just uses the normal STL. It's
putting objects in containers, and you want it to use std::swap correctly, but when you call
std::swap on a stupidObject - it doesn't see that a nice swap override has been made.



So far as I can tell you cannot fix this. It is impossible to make std::swap redirect to
stupidSwap for the cases where stupidSwap has been overridden. You have to do it manually for
each object type, and any time somebody changes stupidlib your code can silently break.
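
The manual fix looks something like this (a sketch; stupidObject and its member Swap come from the hypothetical stupidlib above, and <utility> or <algorithm> must already be included for std::swap) :

namespace std
{
    // explicit specialization of std::swap for stupidObject ;
    // this has to be repeated by hand for every type that stupidlib specializes
    template < >
    inline void swap<stupidObject>(stupidObject & a, stupidObject & b)
    {
        a.Swap(b);
    }
}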



Now there is a pretty easy solution to this : never define a generic operation which already exists
in the STL - just override the STL.



The problem is that people like to replace bits of the STL all the time. Hell I do it, because the STL
headers are so retardedly bloated they kill compile times, so I try to use them minimally.



But that "solution" isn't really a solution anyway - it's just a convention to use the STL defs as bases
for overloading when they exist. When you're trying to overload a generic operation that isn't defined
in the STL you go back to having the same problem - two codebases can define that operation and various
specializations, and you can't mix them.



Generic/template programming simply does not work well when mixing codebases.

Saturday, January 17, 2009

01-17-09 - Float to Int



A while ago the
Some Assembly Required blog
wrote some good notes about float-to-int. I posted some notes there but I thought I'd try to summarize my thoughts coherently.



What I'd like is a pretty fast float-to-int (ftoi) conversion. The most useful variants are "truncate" (like C, fractions go
towards zero), and "round" , that is, fractions go towards the nearest int. We'd like both to be available all the time,
and both to be fast. So I want ftoi_trunc and ftoi_round.



First let me say that I hate the FPU control word with a passion. I've had so many bugs because of that fucker over the years. I write some
code and test it and everything is fine, and then we actually put it in some game and all hell breaks loose. WTF happened?
Oh well, I tested it with the default word setup, and now it's running with the FPU set to single precision. The other
classic bug is people changing the rounding mode. D3D used to be really bad about this (you could use FPU_PRESERVE but it
was a pretty big hit back in the old days with software T&L, not a big deal any more). Or even worse is people who write code intentionally
designed to work with the FPU in a non-standard rounding mode (like round to nearest). Then if you call other code that's meant for the
normal rounding mode, it fails.



Ok, rant over. Don't mess with the FPU control word.



That means the classic /QIfist really doesn't do that much for us. Yes, it makes ftoi_trunc faster :



int ftoi_trunc(float f) { return (int) f; }



that's fast enough with /QIfist, but you still need round :



int ftoi_round(float f) { return ( f >= 0.f ) ? (int)(f + 0.5f) : (int)(f - 0.5f); }



note that a lot of people just do + 0.5 to round - that's wrong, for negatives you need to go the other way, because
the C truncation is *toward zero* not *down*.
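
A quick worked example of why (just arithmetic) :

// round(-1.2) should be -1
//
// naive +0.5       : (int)(-1.2f + 0.5f) = (int)(-0.7f) =  0   <- wrong, truncation goes toward zero
// the branch above : (int)(-1.2f - 0.5f) = (int)(-1.7f) = -1   <- right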



Even if you could speed up the round case, I really don't like using compiler options for crucial functionality. I like
to make little code snippets that work the way I want regardless of the compiler settings. In particular if I make some
code that relies on ftoi being fast I don't want to use C casts and hope they set the compiler right. I want the code
to enforce its correctness.



Fortunately the xs routines at stereopsis by Sree Kotay are
really good. The key piece is a fast ftoi_round (which I have slightly rejiggered to use the union method of aliasing) :



union DoubleAnd64
{
uint64 i;
double d;
};

static const double floatutil_xs_doublemagic = (6755399441055744.0); // 2^52 * 1.5

inline int ftoi_round(const float val)
{
DoubleAnd64 dunion;
dunion.d = val + floatutil_xs_doublemagic;
return (int) dunion.i; // just cast to grab the bottom bits
}



in my tests this runs at almost exactly the same speed as FISTp (both around 7 clocks), and it always works regardless of
the FPU control word setting or the compiler options.



Note that this is a "banker's round" not a normal arithmetic rounding where 0.5 always goes up or down - 0.5 goes to the
nearest *even* value. So 2.5 goes to 2.0 and 3.5 goes to 4.0 ; eg. 0.5's go up half the time and down half the time.
To be more precise, ftoi_round will actually round the same way that bits that drop out of the bottom of the FPU registers
during addition round. We can see that's why making a banker_round routine was so easy, because that's what the FPU
addition does.



But, we have a problem. We need a truncate (ftoi_trunc). Sree provides one, but it uses a conditional, so it's slow
(around 11 clocks in my tests). A better way to get the truncate is to use the SSE intrinsic :




inline int ftoi_trunc(const float f)
{
return _mm_cvtt_ss2si( _mm_set_ss( f ) );
}




Note that the similar _mm_cvt_ss2si (one t) conversion does banker rounding, but the "magic number" xs method is faster because it pipelines better,
and because I'm building for x86 so the cvt stuff has to move the value from FPU to SSE. If you were building with arch:sse
and all that, then obviously you should just use the cvt's. (but then you dramatically change the behavior of your floating
point code by making it run through float temporaries all the time, instead of implicit long doubles like in x86).



So, that's the system I have now and I'm pretty happy with it. SSE for trunc, magic number for round, and no reliance on
the FPU rounding mode or compiler settings, and they're both fast.



For completeness I'll include my versions of the alternatives :



#include <xmmintrin.h>

typedef unsigned __int64 uint64;

union DoubleAnd64
{
uint64 i;
double d;
};

static const double floatutil_xs_doublemagic = (6755399441055744.0); // 2^52 * 1.5
static const double floatutil_xs_doublemagicdelta = (1.5e-8); //almost .5f = .5f + 1e^(number of exp bit)
static const double floatutil_xs_doublemagicroundeps = (0.5f - floatutil_xs_doublemagicdelta); //almost .5f = .5f - 1e^(number of exp bit)

// ftoi_round : *banker* rounding!
inline int ftoi_round(const double val)
{
DoubleAnd64 dunion;
dunion.d = val + floatutil_xs_doublemagic;
return (int) dunion.i; // just cast to grab the bottom bits
}

inline int ftoi_trunc(const float f)
{
return _mm_cvtt_ss2si( _mm_set_ss( f ) );
}

inline int ftoi_round_sse(const float f)
{
return _mm_cvt_ss2si( _mm_set_ss( f ) );
}

inline int ftoi_floor(const double val)
{
return ftoi_round(val - floatutil_xs_doublemagicroundeps);
}

inline int ftoi_ceil(const double val)
{
return ftoi_round(val + floatutil_xs_doublemagicroundeps);
}

// ftoi_trunc_xs = Sree's truncate
inline int ftoi_trunc_xs(const double val)
{
return (val<0) ? ftoi_round(val+floatutil_xs_doublemagicroundeps) :
ftoi_round(val-floatutil_xs_doublemagicroundeps);
}





BTW note the ceil and floor from Sree's XS stuff which are both quite handy and hard to do any other way. Note you might think
that you can easily make ceil and floor yourself from the C-style trunc, but that's not true, remember floor is *down* even on negatives.
In fact Sree's truncate is literally saying "is it negative ? then ceil, else floor".



Finally : if you're on a console where you have read-modify-write aliasing stall problems the union magic number trick is probably not good
for you. But really on a console you're locked into a specific CPU that you completely control, so you should just directly use the
right intrinsic for you.



AND : regardless of what you do, please make an ftoi() function of some kind and call that for your conversions, don't just cast.
That way it's clear where you're converting, it's easy to see and search for, it's easy to change the method, and if you use ftoi_trunc
and ftoi_round like me it makes it clear what you wanted.



ASIDE :
in fact I'm starting to think *all* casts should go through little helper functions to make them very obvious and clear.
Two widgets I'm using to do casts are :



// same_size_bit_cast casts the bits in memory
// eg. it's not a value cast
template < typename t_to, typename t_fm >
t_to same_size_bit_cast( const t_fm & from )
{
COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );
return *( (const t_to *) &from );
}

// check_value_cast just does a static_cast and makes sure you didn't wreck the value
// eg. for numeric casting to make sure the value fits in the new type
template < typename t_to, typename t_fm >
t_to check_value_cast( const t_fm & from )
{
t_to to = static_cast< t_to >(from);
ASSERT( static_cast< t_fm >(to) == from );
return to;
}

intptr_t pointer_to_int(void * ptr)
{
return (intptr_t) ptr;
}

void * int_to_pointer(intptr_t i)
{
return (void *) i;
}





ADDENDUM :

I'm told this version of same_size_bit_cast is better on various platforms. That's why we want it gathered up in one place so it
can be easily changed ;)





// same_size_bit_cast casts the bits in memory
// eg. it's not a value cast
template < typename t_to, typename t_fm >
t_to same_size_bit_cast( const t_fm & from )
{
COMPILER_ASSERT( sizeof(t_to) == sizeof(t_fm) );

union
{
t_fm fm;
t_to to;
} temp;

temp.fm = from;
return temp.to;
}



Friday, January 16, 2009

01-16-09 - Virtual Memory



So I just had kind of a weird issue that took me a while to figure out and I thought I'd write up
what I learned so I have it somewhere.
(BTW I wrote some stuff last year about
VirtualAlloc and the zeroer.)



The problem was this Oodle bundler app I'm working on was running out of memory at around 1.4 GB of
memory use. I've got 3 GB in my machine, I'm not dumb, etc. I looked into some things - possible
virtual address space fragmentation? No. Eventually by trying various allocation patterns I
figured it out :



dwAllocationGranularity



On Windows XP all calls to VirtualAlloc get rounded up to the next multiple of 64k. Pages are 4k - and pages
will actually be allocated to your process on 4k granularity - but the virtual address space is reserved in 64k
chunks. I don't know if there's any fundamental good reason for this or if it's just a simplification for
them to write a faster/smaller allocator because it only deals with big aligned chunks.
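
(You can query both numbers at runtime with GetSystemInfo instead of hard-coding the 64k - a quick sketch :)


// ( needs <windows.h> )
SYSTEM_INFO si;
GetSystemInfo( &si );
// si.dwPageSize              : 4096  - the hardware page size
// si.dwAllocationGranularity : 65536 - the VirtualAlloc reservation granularity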



Anyway, my app happened to be allocating a ton of memory that was (64k + 4k) bytes (there was a texture that
was exactly 64k bytes, and then a bit of header puts you into the next page, so the whole chunk was 68k). With
VirtualAlloc that actually reserves two 64k chunks of address space, so you are wasting almost 50% of your virtual address space.



NOTE : that blank space you didn't get in the next page is just *gone*. If you do a VirtualQuery it tells you
that your region is 68k bytes - not 128k. If you try to do a VirtualAlloc and specify an address in that range,
it will fail. If you do all the 68k allocs you can until VirtualAlloc returns NULL, and then try some more 4k
allocs - they will all fail. VirtualAlloc will never give you back the 60k bytes wasted on granularity.
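
One obvious mitigation, if you really must make lots of little VirtualAlloc calls, is to round each request up to the granularity yourself, so the address space you were going to lose anyway is at least usable by you. A trivial sketch (assumes the granularity is a power of two) :


// round a VirtualAlloc request up to the 64k allocation granularity :
inline size_t RoundUpToGranularity( size_t size, size_t granularity = 64*1024 )
{
	return ( size + (granularity-1) ) & ~( granularity-1 );
}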



The weird thing is there doesn't seem to be any counter for this.
Here's what the TaskMgr & ProcExp counters actually mean :



TaskMgr "Mem Usage" = Procexp "Working Set"



This is the amount of memory whose pages are actually allocated to your app. That means the pages have actually been touched! Note that pages
from an allocated range may not all be assigned.



For example, if you VirtualAlloc a 128 MB range, but then only go and touch 64k of it - your "Mem Usage" will show 64k. Those pointer touches
are essentially page faults which pull pages for you from the global zero'ed pool. The key thing that you may not be aware of is that
even when you COMMIT the memory you have not actually got those pages yet - they are given to you on demand in a kind of "COW" pattern.



TaskMgr "VM Size" = Procexp "Private Bytes"



This is pretty simple - it's just the amount of virtual address space that's COMMITed for your app. This is what counts toward the total "Commit Charge"
in the TaskMgr Performance view.



ProcExp "Virtual Size" =



This one had me confused a bit and seems to be undocumented anywhere. I tested and figured out that this is the amount of virtual address
space RESERVED by your app, which is always >= COMMIT. BTW I'm not really sure why you would ever reserve mem and not
commit it, or who exactly is doing that, maybe someone can fill in that gap.



Thus :



2GB >= "Virtual Size" >= "Private Bytes" >= "Working Set".



Okay, that's all cool. But none of those counters shows that you have actually taken all 2 GB of your address space
through the VirtualAlloc granularity.



ADDENDUM : while I'm explaining mysteriously named counters, the "Page File Usage History" in the Performance tab of task manager
has absolutely nothing to do with the page file. It's just your total "Commit Charge" (which, recall, is the same thing as the "VM Size" or
"Private Bytes"). Total Commit Charge is technically limited by the size of physical ram + the size of the paging file.
(which BTW, should be zero - Windows runs much better with no paging file).




To be super clear I'll show you some code and what the numbers are at each step :



// (lprintf and uint32 are cblib-isms ; plain printf and unsigned int work just as well)
#include <windows.h>   // VirtualAlloc , VirtualFree
#include <conio.h>     // getch
#include <vector>
using std::vector;

int main(int argc,char *argv[])
{

lprintf("UseAllMemory...\n");

vector<void *> mems;

#define MALLOC_SIZE ((1<<16) + 4096)

lprintf("reserving:\n");

uint32 total = 0;

for(;;)
{
void * ptr = VirtualAlloc( NULL, MALLOC_SIZE , MEM_RESERVE, PAGE_READWRITE );

if ( ! ptr )
{
break;
}

total += MALLOC_SIZE;
mems.push_back(ptr);

lprintf("%d\r",total);
}
lprintf("%u\n",total);

lprintf("press a key :\n");
getch();



This does a bunch of VirtualAlloc reserves with a stupid size. It prints :




UseAllMemory...

reserving:

1136463872

press a key :





The ProcExp Performance tab shows :




Private Bytes : 2,372 K

Virtual Size : 1,116,736 K

Working Set : 916 K





Note we only got around 1.1 GB. If you change MALLOC_SIZE to be a clean power of two you should get all 2 GB.



Okay, so let's do the next part :




lprintf("comitting:\n");

for(int i=0;i < mems.size();i++)
{
VirtualAlloc( mems[i], MALLOC_SIZE, MEM_COMMIT, PAGE_READWRITE );
}

lprintf("press a key :\n");
getch();



We committed it so we now see :




Private Bytes : 1,112,200 K

Virtual Size : 1,116,736 K

Working Set : 2,948 K





(Our working set also grew - not sure why that happened, did Windows just alloc a whole bunch? It would appear
so. It looks like roughly 128 bytes are needed for each commit).



Now let's actually make that memory get assigned to us. Note that it is implicitly zero'ed, so you can read
from it any time and pull a zero.





lprintf("touching:\n");

for(int i=0;i < mems.size();i++)
{
*( (char *) mems[i] ) = 1;
}

lprintf("press a key :\n");
getch();



We now see :




Private Bytes : 1,112,200 K

Virtual Size : 1,116,736 K

Working Set : 68,296 K





Note that the Working Set is still way smaller than the Private Bytes because we have only actually been
given one 4k page from each of the chunks that we allocated.



And wrap up :



lprintf("freeing:\n");

while( ! mems.empty() )
{
VirtualFree( mems.back(), 0, MEM_RELEASE );

mems.pop_back();
}

lprintf("UseAllMemory done.\n");

return 0;
}






For background now you can go read some good links about Windows Virtual memory :



Page table - Wikipedia - good intro/background

RAM, Virtual Memory, Pagefile and all that stuff

PAE and 3GB and AWE oh my...

Mark's Blog : Pushing the Limits of Windows Virtual Memory

Managing Virtual Memory in Win32

Chuck Walbourn Gamasutra 64 bit gaming

Brian Dessent - Re question high virtual memory usage

Tom's Hardware - My graphics card stole my memory !




I'm assuming you all basically know about virtual memory and so on. It kind of just hit me for the first time, however, that our problem now
(in 32 bit apps) is the amount of virtual address space. Most of us have 3 or 4 GB of physical RAM for the first time in history, so you actually
cannot use all your physical RAM - and in fact you'd be lucky to even use 2 GB of virtual address space.



Some issues you may not be aware of :



By default Windows apps get 2 GB of address space for user data and 2 GB is reserved for mapping to the kernel's memory. You can change that
by putting /3GB in your boot.ini , and you must also set the LARGEADDRESSAWARE option in your linker. I tried this and it in fact worked just
fine. On my 3 GB work system I was able to allocate 2.6 GB to my app. HOWEVER I was also able to easily crash my app by making the kernel run
out of memory. /3GB means the kernel only gets 1 GB of address space and apparently something that I do requires a lot of kernel address space.



If you're running graphics, the AGP window is mirrored into your app's virtual address space. My card has 256MB and it's all mirrored, so as soon
as I init D3D my memory use goes down by 256MB (well, actually more because of course D3D and the driver take memory too). There are 1GB cards out
there now, but mapping that whole video mem seems insane, so they must not do that. Somebody who knows more about this should fill me in.




This is not even addressing the issue of the "memory hole" that device mapping to 32 bits may give you.
Note that PAE could be used to map your devices above 4G so that you can get to the full 4G of memory, if you also
turn that on in the BIOS, and your device drivers support it; apparently it's not recommended.



There's also the Address Windowing Extensions (AWE) stuff. I can't imagine a reason why any normal person would
want to use that. If you're running on a 64-bit OS, just build 64-bit apps.



VirtualQuery tells me something about what's going on with granularity. It may not be obvious from the docs,
but you can call VirtualQuery with *ANY* pointer. You can call VirtualQuery( rand() ) if you want to. It
doesn't have to be a pointer to the base of an allocation range. From that pointer it gives you back the base
of the allocation. My guess is that they do this by stepping back through buckets of size 64k. To cover 2 GB of address space
you need 32k chunks of 64k bytes. Each chunk has something like a MEMORY_BASIC_INFORMATION, which is about 32
bytes. To hold 32k of those would take 1 MB. This is just pure guessing.
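
The call itself is trivial (a sketch ; "anyOldPointer" really can be any address at all) :


// ( needs <windows.h> )
MEMORY_BASIC_INFORMATION mbi;
if ( VirtualQuery( anyOldPointer, &mbi, sizeof(mbi) ) )
{
	// mbi.AllocationBase : base of the reserved region containing the pointer
	// mbi.RegionSize     : size of the run of pages with identical state/protection
	// mbi.State          : MEM_FREE / MEM_RESERVE / MEM_COMMIT
}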



SetSystemFileCacheSize is interesting to me but I haven't explored it.



Oh, some people apparently have problems with DLL's that load at fixed addresses fragmenting virtual memory. DLL's can be
built with a fixed preferred base address (it's a linker option). This is naughty but some people do it. This could make
it impossible for you to get a nice big 1.5 GB virtual alloc or something. Apparently you can see the base address
in the DLL using "dumpbin.exe" and you can modify it using "rebase.exe".



ADDENDUM : I found a bunch of links about /3GB and problems with Exchange Server fragmenting virtual address space. Most interestingly to me these links also
have a lot of hints about the way the kernel manages the PTE's (Page Table Entries). The crashes I was getting with /3GB were most surely running
out of PTE's ; apparently you can tell the OS to make more room for PTE's with the /USERVA flag. Read here :



The number of free page table entries is low, which can cause system instability

How to Configure the Paged Address Pool and System Page Table Entry Memory Areas

Exchange Server memory management with 3GB, USERVA and PAE

Clint Huffman's Windows Performance Blog Free System Page Table Entries (PTEs)





I found this GameFest talk by Chuck Walbourn :

Why Your Windows Game Won't Run In 2,147,352,576 Bytes
that covers some of these same issues. In particular he goes into detail about the AGP
and memory mirroring and all that. Also in Vista with the new WDDM apparently you can make video-memory only resources that don't take any app virtual
address space, so that's a pretty huge win.



BTW to be clear - the real virtual address pressure is in the tools. For Oodle, my problem is that to set up the paging for a region, I want
to load the whole region, and it can easily be > 2 GB of content. Once I build the bundles and make paging units, then you page them in and out
and you have nice low memory use. It just makes the tools much simpler if they can load the whole world and not worry about it. Obviously that
will require 64 bit for big levels.



I'm starting to think of the PC platform as just a "big console". For a console you have maybe 10 GB of data, and you are paging that through
256 MB or 512 MB of memory. You have to be careful about memory use and paging units and so on. In the past we thought of the PC as "so much
bigger" where you can be looser and not worry about hitting limits, but really the 2 GB Virtual Address Space limit is not much bigger (and
in practice it's more like 1.5 GB). So you should think of the PC as having a "small" 1 GB of memory, and you're paging 20 GB of data through it.

01-16-09 - Automated NFL Bettor



I was thinking about some NFL betting this weekend and realized I never wrote about my automated NFL bettor and thought I
should remedy that.



I was working on the Netflix prize stuff
(see my post here then
here ) and it occurred to me that it's a very
similar problem.



If you think about it in terms of the Netflix sparse matrix way of thinking, you have 32 teams, so a 32 x 32 matrix, and each game that gets played
fills out a slot in the matrix with the score delta of that game. The matrix is very sparse - there are lots of blank spots
where teams haven't played each other (it's less than half full), and the challenge for an automatic bettor is to fill in
the blanks - aka making a prediction for teams that haven't played yet.



Now, it should be obvious that you could use one of the Netflix methods right off the bat. For example "Collaborative Filtering"
can be directly applied - it just means using the point deltas of teams I've already played to predict the delta vs. a new team.
So eg. if team A beat team B by +3, and team B lost to team C by -7, then we predict team A will lose to team C by -4.
You can also obviously use an SVD approach. Take the matrix of scores with blanks. Use SVD to find a low-rank generator that
approximates the existing values, and makes predictions for the blanks.
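
For concreteness, here's a toy sketch (mine, not a real predictor) of that collaborative filtering idea - predict the delta for A vs. C by averaging over every intermediary B that both teams have played :


// delta[i][j] = observed score delta when i played j ; played[i][j] = whether they played
double PredictDelta( const double delta[32][32], const bool played[32][32], int A, int C )
{
	double sum = 0.0; int count = 0;
	for(int B=0;B<32;B++)
	{
		if ( B == A || B == C ) continue;
		if ( played[A][B] && played[B][C] )
		{
			// eg. A beat B by +3 and B lost to C by -7 contributes a prediction of -4
			sum += delta[A][B] + delta[B][C];
			count++;
		}
	}
	return count ? ( sum / count ) : 0.0;
}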



Both of these techniques "work" but are terrible. The problem is the NFL scores have lots of randomness. That is, we fundamentally
assume that if team A and team B play over and over the average score delta will converge to some number, but in the short term the
actual score is drawn from some random source centered on the average. We generally assume that it's Gaussian, but in fact in the NFL
it's very non-Gaussian, there are weird peaks at +3/-3 and +7/-7 and a hole at 0 (BTW this is where the "Wong Teaser" comes from -
it exploits the fact that oddsmakers use Gaussian models but the distribution is not actually Gaussian).



So, we really need to use some kind of Bayesian or Hidden Model approach. That is, the score that we actually observed cannot be
taken as the average expected score of those teams. Instead we must find a model for each team which maximizes the likelihood
that we would observe the actually seen scores given that model ; this is an ML (maximum likelihood) approach, and it's crucial that you also include
a prior for regularization. Let me say that again in a more Bayesian way -



We saw some set of scores {S} ; there is some team model {M} and the model tells us the expected average score between teams; the
model is our predictor. There's some probability that the Model M creates scores S : P(S|M) ; but that is not the right thing to
maximize ! We want to find the model that maximizes P(M|S) which is :



P(M|S) = P(S|M) * P(M) / P(S)



Now P(S|M) is known, it's given by our model, but the inclusion of P(M) here is really crucial - Not all models are equally likely!



In particular, you need to make more extreme models less likely or you will get bad overfitting. The data is way too sparse and too
random. The prior model should know that all the teams are really very close in strength and extreme score differences are not
very reflective of future performance, especially when they're anomalies.



Most of the existing automatic NFL predictors use some kind of "power ranking". This just means assigning a single value to
each team and predicting the score of A vs. B to be power(A) - power(B). You can fit these trivially using least squares and you
have plenty of data, but that still sucks because of the variance issue. The people who do this well use various
forms of prior conditioning and regularization in the least squares fit. For example they use things like fitting to the square roots, which reduces the
importance of large differences.
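
As a sketch of what that looks like (my toy version, not anyone's published ranker) : a regularized least-squares power fit done with gradient descent, where lambda pulls all the powers toward zero, eg. toward the prior belief that all teams are roughly equal :


#include <vector>

struct Game { int home; int away; double delta; }; // delta = home score - away score

// one gradient-descent sweep over the season, minimizing
//   sum over games ( power[home] - power[away] - delta )^2  +  lambda * sum power[i]^2
void FitPowerRanking( const std::vector<Game> & games, double power[32],
                      double lambda, double step )
{
	for(size_t g=0;g<games.size();g++)
	{
		const Game & G = games[g];
		double err = ( power[G.home] - power[G.away] ) - G.delta;
		power[G.home] -= step * (  err + lambda * power[G.home] );
		power[G.away] -= step * ( -err + lambda * power[G.away] );
	}
}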



Dr K's Automatic Betting Talk (PPT) has a good summary of the classic ways of doing
regularized power rankings.



Anyway I got into that and started doing some reading and realized I wasn't going to beat the stuff that people are already doing.
There's also just really not enough data to make good fits in the NFL. I mean the "Power Ranking" is a really pathetic model if
you think about it; it gives each team a single "strength" number, so it has absolutely no way to model the style that the team plays
and what kinds of strengths and weaknesses it might have. An obvious idea would be to try to fit something like 5 numbers -
{rushing offence, rushing defence, passing offence, passing defence, home field advantage} - but you don't have nearly enough information
to fit that.



If you look at The Prediction Tracker he's got lots of predictions aggregated
from around the net, including some that are formulated with good mathematical models. However, not one of them
can reliably beat the spread.



For example :
Dr K's preds and history : he has a 134-125 record this year. That's actually
remarkable, most people do worse. Assume a very good 105 vig (often called a 105 bet in the NFL because you're putting up 105 to win 100),
so a bet pays +1 on a win and -1.05 on a loss; then his total profit is
134 - 125*1.05 = 2.75 units. Yay, he actually made money. If he bet $105 on every game all season he made $275 on the year.
(BTW a lot of suckers make bets at 110 vigs which is much worse and pretty much impossible to beat).



But that is assuming that you make every bet. If you look at the automatic predictions, often they predict a score delta that's
very close to the spread. Clearly you shouldn't make that bet, because the vig will mean you lose money on average. You should
only place a bet when your confidence is greater than the vig. To break even you have to win 105/205 of the time = 51.2 %
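
Just to make the arithmetic explicit (a trivial sketch) :


// expected profit per bet when you put up 105 to win 100 :
//   E(p) = p * 100 - (1-p) * 105
// E(105/205) = 0 , the 51.2% breakeven ; E(0.55) = 7.75
double ExpectedProfitPerBet( double winProb )
{
	return winProb * 100.0 - ( 1.0 - winProb ) * 105.0;
}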



So, I thought - is there any way to formulate a *strategy* for betting that guarantees a profit? My idea was to go to
The Prediction Tracker and grab all the predictions from various
sources and create a decision tree that tells me whether to bet or not.



You can think about a decision tree as a sort of BSP tree on the space of the parameters. Each parameter (in this case each
predictor) is a coordinate axis, and together they define a large dimensional space. In this case the axes are {LINE, SAG, PFZ, ARGH, KAM ... } -
it's about a 30 dimensional space. The value of each predictor is the coordinate in that dimension, so each game specifies a point in
this big space. The decision tree cuts up the space into regions of {BET} and {NO BET}, basically a 1 or a 0. These spatial regions
can capture a lot of rules. For example, it can capture rules like :



If abs( (average pred) - (line) ) < 1.0
then don't bet

If ( (average pred) - (pred sdev) ) <= line - 1.0 && ( (average pred) + (pred sdev) ) >= line + 1.0
then don't bet

Else
bet



These are some rules that seem intuitive to me - if the pred is too close to the line, don't bet, or if the sdev of the various preds
is too wide and near the line then don't bet. But the point is not to hard code rules like that, the decision tree finds them automatically
and finds the right dividing values.
It could also come up with rules that you may not expect, like it might find things like :



if SAG > 10 and L2 > 8 and LINE < 7 then
do bet



We can train the decision tree based on past performance if we have past scores and past lines. We can make an optimal in-hindsight
decision tree. Basically what you do is for every game in the past you make a point in space from all the parameters on that game;
you give this point either a category of {bet} if the prediction was right, or {no bet} if the prediction was wrong. Then you try to put
down splitting planes to chop up the space to separate the {bet} points from the {no bet} points. You hope that they are clustered together
in pockets. BTW it's important to use the right cost function for tree building here because of the vig - it's not just the number of points
miscategorized, there's a penalty of 105 for putting a {no bet} into {bet}, but only a penalty of 100 for putting a {bet} into {no bet}.
That is, it's better to abstain from possible bets once in a while than it is to make a bet you shouldn't have. In fact I found better trees
if I pretended I was betting at 110, eg. make the tree artificially more conservative.
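
Concretely, the miscategorization cost I'm describing looks something like this sketch when you score a candidate leaf (the numbers come straight from the vig) :


// cost of labelling a leaf {BET} or {NO BET} given the training games that land in it :
// a losing bet we *do* make costs 105 ; a winning bet we *skip* only costs the 100 we didn't win
double LeafCost( bool leafSaysBet, int gamesPredictorGotRight, int gamesPredictorGotWrong )
{
	if ( leafSaysBet ) return 105.0 * gamesPredictorGotWrong;
	else               return 100.0 * gamesPredictorGotRight;
}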



In practice again you can't actually do this directly because you don't have enough history and the space is too big and sparse,
but you can fix that in this case by collapsing dimensions.
For example you can eliminate most of the predictors because only a few of them actually help, so instead of 30 dimensions you have 4.
You can also make optimized linear combinations of predictors, and then make the decision tree on the combined predictors.
Another option would be to use PCA to collapse the dimensions rigorously (note that we don't have the sparseness problem because we're
doing PCA on all the prediction values, and we have a pred for each axis for every game each week).



Of course in testing your DT you need to do holdouts. eg. if you have 4 seasons of past data, hold out 100 games randomly chosen. Train the
DT on the other games, and then test it on the 100 that you held out. That way you're not testing on the data you trained on.



With this approach I was able to find a meta-betting strategy that had a 55% success rate on the random holdouts tested. That's well over
the 51.22% needed to break even. If you bet $105 with a 55% success rate you expect $7.75 of profit from each bet.



That's pretty good, but man sports betting sucks as a way to actually make money. You're putting up huge sums at near coinflips to make very
small profits.

Tuesday, January 13, 2009

01-13-09 - Strings



I just had another idea for strings that I think is rather appealing. I've ranted here before about refcounted strings and the
suckitude of std::string and bstring and so on. Anyway, here's my new idea :



Mutable strings are basically a vector< char > like std::string or whatever. They go through a custom allocator which *never frees*.
What that means is you can always just take a c_str char * off the string and hold onto it forever.



Thus the readable string is just char *, and you can store those in your hashes or whatever. Mutable string is a String thingy that
supports operator += and whatnot, but you just hold those temporarily to do edits and then grab the char * out.



So the usage is that you always just pass around char *'s , your objects all store char *'s, nobody ever worries about
who owns it and whether to free it, you can pass it across threads and not worry. To make strings you put a String on the stack
and munge it all you want, then pull the char * out and rock with that.



Obviously this wastes memory, BUT in typical gamedev usage I think the waste is usually microscopic. I almost always just read
const strings out of config files and then never edit them.



One exception that I'd like to handle better is frequently mutated strings. For example, you might have something in a spawner that
does something like this :



for(int variant=0;variant < numVariants;variant++)
{
// char name[80];
// sprintf(name,"spawnmobj_%d",variant);
String name("spawnmobj_");
name += variant;

Spawn(name);
}



I don't love making names programmatically like this, but lots of people do it and it's quite convenient, so it
should be supported.
With the model that I have proposed here, this would do allocs every time you spawn and memory use would increase forever. One way to fix
this is to use a global string pool and merge duplicates at the time they are converted to char *. That way you don't ever increase your
memory use when you make strings you made before - only when you make new strings.



With the string pool model, the basic op becomes :



const char * StringPool::GetPermanentPointer( String & str );



in which the 'str' is added to the pool (or an existing one is found), and the char * you get back will never go away.
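
A minimal sketch of a pool that would do that (my guess at an implementation, using std::string in place of the String class above) :


#include <set>
#include <string>

class StringPool
{
	std::set< std::string > m_pool; // node-based and never erased, so the c_str()'s stay valid forever
public:
	const char * GetPermanentPointer( const std::string & str )
	{
		// insert() hands back the existing element if an equal string is already pooled,
		// so duplicates get merged and memory only grows for genuinely new strings
		return m_pool.insert( str ).first->c_str();
	}
};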



ADDENDUM : to be clear, this is not intended as an optimization at all, it's simply a way to make the programming easy
without being too retarded about crazy memory use. (eg. not just making String a char [512])