Friday, January 16, 2009

01-16-09 - Virtual Memory

So I just had kind of a weird issue that took me a while to figure out and I thought I'd write up
what I learned so I have it somewhere.
(BTW I wrote some stuff last year about
VirtualAlloc and the zeroer.)

The problem was this Oodle bundler app I'm working on was running out of memory at around 1.4 GB of
memory use. I've got 3 GB in my machine, I'm not dumb, etc. I looked into some things - possible
virtual address space fragmentation? No. Eventually by trying various allocation patterns I
figured it out :


On Windows XP the address space taken by each VirtualAlloc reservation gets rounded up to the next multiple of 64k. Pages are 4k - and pages
will actually be allocated to your process on 4k granularity - but the virtual address space is reserved in 64k
chunks. I don't know if there's any fundamental good reason for this or if it's just a simplification that lets
them write a faster/smaller allocator because it only deals with big aligned chunks.

Anyway, my app happened to be allocating a ton of chunks that were (64k + 4k) bytes (there was a texture that
was exactly 64k bytes, and then a bit of header puts you into the next page, so the whole chunk was 68k). With
VirtualAlloc each of those actually takes two 64k chunks of address space (128k), so you are wasting 60k out of
every 128k - almost 50% of your virtual address space.
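
To put numbers on that, here is a small sketch of the rounding arithmetic. The constants come straight from the text (64k granularity, 68k chunks, 2 GB of user address space); the function names are made up for illustration and none of this calls into Windows.

```cpp
#include <cstdint>

// Values taken from the text: 64k allocation granularity, (64k + 4k) chunks,
// and the default 2 GB of user address space for a 32 bit app.
constexpr std::uint64_t kGranularity = 64 * 1024;
constexpr std::uint64_t kChunkSize   = 64 * 1024 + 4096;
constexpr std::uint64_t kUserSpace   = 2ull * 1024 * 1024 * 1024;

// Round a size up to the next multiple of the granularity (granularity must be non-zero).
constexpr std::uint64_t RoundUpToGranularity(std::uint64_t size, std::uint64_t granularity)
{
    return ((size + granularity - 1) / granularity) * granularity;
}

// Address space that is consumed by one allocation but can never be used.
constexpr std::uint64_t WastedBytes(std::uint64_t size, std::uint64_t granularity)
{
    return RoundUpToGranularity(size, granularity) - size;
}

// Upper bound on the number of chunks, assuming nothing else lives in the address space.
constexpr std::uint64_t MaxChunks(std::uint64_t space, std::uint64_t size, std::uint64_t granularity)
{
    return space / RoundUpToGranularity(size, granularity);
}
```

With these inputs each 68k chunk consumes 128k, 60k of it is lost, and at most 16384 chunks fit in 2 GB. That is about 1,114,112 K of usable reservations, before counting whatever the exe, DLLs, heaps and stacks already occupy.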

NOTE : that blank space you didn't get in the next page is just *gone*. If you do a VirtualQuery it tells you
that your region is 68k bytes - not 128k. If you try to do a VirtualAlloc and specify an address in that range,
it will fail. If you do all the 68k allocs you can until VirtualAlloc returns NULL, and then try some more 4k
allocs - they will all fail. VirtualAlloc will never give you back the 60k bytes wasted on granularity.

The weird thing is there doesn't seem to be any counter for this.
Here is what the TaskMgr & ProcExp readings mean :

TaskMgr "Mem Usage" = Procexp "Working Set"

This is the amount of memory whose pages are actually allocated to your app. That means the pages have actually been touched! Note that pages
from an allocated range may not all be assigned.

For example, if you VirtualAlloc a 128 MB range , but then only go and touch 64k of it - your "Mem Usage" will show 64k. Those pointer touches
are essentially page faults which pull pages for you from the global zero'ed pool. The key thing that you may not be aware of is that
even when you COMMIT the memory you have not actually got those pages yet - they are given to you on demand in a kind of "COW" pattern.

TaskMgr "VM Size" = Procexp "Private Bytes"

This is pretty simple - it's just the amount of private virtual address space that's COMMITed for your app. Summed over all processes (plus
whatever the kernel commits), this is what makes up the total "Commit Charge" in the TaskMgr Performance view.

ProcExp "Virtual Size"

This one had me confused a bit and doesn't seem to be documented anywhere. I tested and figured out that this is the amount of virtual address
space RESERVED by your app, which is always >= the amount COMMITed. BTW I'm not really sure why you would ever reserve mem and not
commit it, or who exactly is doing that; maybe someone can fill in that gap.

Thus :

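2GB >= "Virtual Size" >= "Private Bytes" >= "Working Set".

(Roughly. Working Set also counts shared pages such as mapped DLL code, so for a process that has committed very little
it can come out bigger than Private Bytes.)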

Okay, that's all cool. But none of those counters shows that you have actually taken all 2 GB of your address space
through the VirtualAlloc granularity.
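
Since no counter reports this, one way to measure it yourself would be to walk the address space with VirtualQuery, merge the regions that share an AllocationBase, and add up the gap between the end of each allocation and the next 64k boundary. What follows is only a rough sketch of that idea, not something from the original test app: `Allocation`, `GranularitySlack` and `CollectAllocations` are names invented here, the granularity is passed in by the caller (GetSystemInfo reports the real value in dwAllocationGranularity), and the Windows-only walk is guarded so the arithmetic part compiles anywhere.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#endif

// One allocation as handed out by VirtualAlloc: its base and total extent in bytes.
struct Allocation
{
    std::uintptr_t base;
    std::size_t    size;
};

// Sum of the bytes between the end of each allocation and the next granularity boundary.
// That space is not reported as reserved, but no new allocation can start inside it.
std::uint64_t GranularitySlack(const std::vector<Allocation> & allocs, std::uint64_t granularity)
{
    std::uint64_t slack = 0;
    for (const Allocation & a : allocs)
    {
        const std::uint64_t end = static_cast<std::uint64_t>(a.base) + a.size;
        const std::uint64_t rem = end % granularity;
        if (rem != 0)
            slack += granularity - rem;
    }
    return slack;
}

#ifdef _WIN32
// Walk the user address space; consecutive regions with the same AllocationBase
// belong to one VirtualAlloc reservation, so their sizes are merged.
std::vector<Allocation> CollectAllocations()
{
    std::vector<Allocation> out;
    MEMORY_BASIC_INFORMATION mbi;
    const char * p = nullptr;
    while (VirtualQuery(p, &mbi, sizeof(mbi)) == sizeof(mbi))
    {
        if (mbi.State != MEM_FREE)
        {
            const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(mbi.AllocationBase);
            if (!out.empty() && out.back().base == base)
                out.back().size += mbi.RegionSize;
            else
                out.push_back({ base, mbi.RegionSize });
        }
        p = static_cast<const char *>(mbi.BaseAddress) + mbi.RegionSize;
    }
    return out;
}
#endif
```

On Windows you would call `GranularitySlack(CollectAllocations(), 65536)` and compare the result against "Virtual Size"; for the 68k test case each chunk should contribute 60k of slack.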

ADDENDUM : while I'm explaining mysteriously named counters, the "Page File Usage History" in the Performance tab of Task Manager
has absolutely nothing to do with the page file. It's just the system-wide total "Commit Charge" (recall that each process's share of it
is its "VM Size" or "Private Bytes"). Total Commit Charge is technically limited by the size of physical RAM + the size of the paging file
(which BTW, should be zero - Windows runs much better with no paging file).

To be super clear I'll show you some code and what the numbers are at each step :

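int main(int argc,char *argv[])
{
vector<void *> mems;

#define MALLOC_SIZE ((1<<16) + 4096)

uint32 total = 0;

for(;;) // reserve only (no commit) until the address space runs out
{
    void * ptr = VirtualAlloc( NULL, MALLOC_SIZE, MEM_RESERVE, PAGE_READWRITE );

    if ( ! ptr )
        break;

    total += MALLOC_SIZE;
    mems.push_back(ptr);
}

lprintf("press a key :\n");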

This does a bunch of VirtualAlloc reserves with a stupid size. It prints :




press a key :

The ProcExp Performance tab shows :

Private Bytes : 2,372 K

Virtual Size : 1,116,736 K

Working Set : 916 K

Note we only got around 1.1 GB. If you change MALLOC_SIZE to be a clean power of two you should get all 2 GB.

Okay, so let's do the next part :


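for(int i=0;i < mems.size();i++)
    VirtualAlloc( mems[i], MALLOC_SIZE, MEM_COMMIT, PAGE_READWRITE ); // commit the chunk we reserved earlier

lprintf("press a key :\n");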

We committed it so we now see :

Private Bytes : 1,112,200 K

Virtual Size : 1,116,736 K

Working Set : 2,948 K

(Our working set also grew - not sure why that happened, did Windows just alloc a whole bunch? It would appear
so. It looks like roughly 128 bytes are needed for each commit).

Now let's actually make that memory get assigned to us. Note that it is implicitly zero'ed, so you can read
from it any time and pull a zero.


for(int i=0;i < mems.size();i++)
    *( (char *) mems[i] ) = 1;

lprintf("press a key :\n");

We now see :

Private Bytes : 1,112,200 K

Virtual Size : 1,116,736 K

Working Set : 68,296 K

Note that the Working Set is still way smaller than the Private Bytes because we have only actually been
given one 4k page from each of the chunks that we allocated.
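
The three ProcExp snapshots are enough to sanity check the story with a bit of arithmetic. The sketch below just plugs in the readings from the text; the constant and function names are made up for illustration, and the interpretation (the Private Bytes growth is all 68k commits, the Working Set growth after touching is all 4k pages) is an assumption.

```cpp
#include <cstdint>

// Counter readings in K, copied from the three ProcExp snapshots in the text.
constexpr std::uint64_t kPrivateAfterReserveK = 2372;
constexpr std::uint64_t kPrivateAfterCommitK  = 1112200;
constexpr std::uint64_t kWorkingAfterReserveK = 916;
constexpr std::uint64_t kWorkingAfterCommitK  = 2948;
constexpr std::uint64_t kWorkingAfterTouchK   = 68296;

constexpr std::uint64_t kChunkK = 68; // (64k + 4k) per allocation
constexpr std::uint64_t kPageK  = 4;  // page size

// Number of chunks, inferred from how much Private Bytes grew on commit.
constexpr std::uint64_t ChunkCount()
{
    return (kPrivateAfterCommitK - kPrivateAfterReserveK) / kChunkK;
}

// Working Set growth per commit, in bytes (integer division, so rounded down).
constexpr std::uint64_t WorkingSetBytesPerCommit()
{
    return (kWorkingAfterCommitK - kWorkingAfterReserveK) * 1024 / ChunkCount();
}

// Pages that showed up in the Working Set after touching one byte per chunk.
constexpr std::uint64_t PagesFaultedIn()
{
    return (kWorkingAfterTouchK - kWorkingAfterCommitK) / kPageK;
}
```

This works out to 16321 chunks (the Private Bytes growth divides by 68 exactly), about 127 bytes of Working Set per commit, and 16337 pages faulted in. That is one page per chunk plus a handful of extras, which fits the one-4k-page-per-chunk explanation.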

And wrap up :


while( ! mems.empty() )
{
    VirtualFree( mems.back(), 0, MEM_RELEASE );
    mems.pop_back(); // without this the loop never terminates
}

lprintf("UseAllMemory done.\n");

return 0;
}

For background now you can go read some good links about Windows Virtual memory :

Page table - Wikipedia - good intro/background

RAM, Virtual Memory, Pagefile and all that stuff

PAE and 3GB and AWE oh my...

Mark's Blog : Pushing the Limits of Windows Virtual Memory

Managing Virtual Memory in Win32

Chuck Walbourn Gamasutra 64 bit gaming

Brian Dessent - Re question high virtual memory usage

Tom's Hardware - My graphics card stole my memory !

I'm assuming you all basically know about virtual memory and so on. It kind of just hit me for the first time, however, that our problem now
(in 32 bit apps) is the amount of virtual address space. Most of us have 3 or 4 GB of physical RAM for the first time in history, so you actually
cannot use all your physical RAM - and in fact you'd be lucky to even use 2 GB of virtual address space.

Some issues you may not be aware of :

By default Windows apps get 2 GB of address space for user data and 2 GB is reserved for mapping to the kernel's memory. You can change that
by putting /3GB in your boot.ini , and you must also set the LARGEADDRESSAWARE option in your linker. I tried this and it in fact worked just
fine. On my 3 GB work system I was able to allocate 2.6 GB to my app. HOWEVER I was also able to easily crash my app by making the kernel run
out of memory. /3GB means the kernel only gets 1 GB of address space and apparently something that I do requires a lot of kernel address space.
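
The LARGEADDRESSAWARE option ends up as a single bit in the exe's header, so it is easy to check whether a build actually has it. Here is a minimal sketch of such a check, assuming the standard PE layout (the 4-byte offset of the PE signature at 0x3C, and the COFF Characteristics field 22 bytes past the signature). `IsLargeAddressAware` and `ReadLE` are hypothetical helpers written for this post, and the caller is expected to load the file bytes itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// IMAGE_FILE_LARGE_ADDRESS_AWARE bit in the COFF header's Characteristics field.
constexpr std::uint16_t kLargeAddressAware = 0x0020;

// Read a little-endian value of 1 to 4 bytes; the caller guarantees the range is in bounds.
static std::uint32_t ReadLE(const std::vector<std::uint8_t> & b, std::size_t off, int bytes)
{
    std::uint32_t v = 0;
    for (int i = 0; i < bytes; i++)
        v |= static_cast<std::uint32_t>(b[off + i]) << (8 * i);
    return v;
}

// Returns true if `image` looks like a PE file and has the large-address-aware bit set.
bool IsLargeAddressAware(const std::vector<std::uint8_t> & image)
{
    if (image.size() < 0x40 || image[0] != 'M' || image[1] != 'Z')
        return false;                                   // no DOS header

    const std::size_t pe = ReadLE(image, 0x3C, 4);      // e_lfanew : offset of the PE signature
    if (pe > image.size() || image.size() - pe < 24)
        return false;                                   // signature + COFF header don't fit

    if (ReadLE(image, pe, 4) != 0x00004550)             // "PE\0\0"
        return false;

    const std::uint32_t characteristics = ReadLE(image, pe + 22, 2);
    return (characteristics & kLargeAddressAware) != 0;
}
```

Even with the bit set, the app only sees more than 2 GB if the OS is also booted with /3GB, as described above.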

If you're running graphics, the AGP window is mirrored into your app's virtual address space. My card has 256MB and it's all mirrored, so as soon
as I init D3D my memory use goes down by 256MB (well, actually more because of course D3D and the driver take memory too). There are 1GB cards out
there now, but mapping that whole video mem seems insane, so they must not do that. Somebody who knows more about this should fill me in.

This is not even addressing the issue of the "memory hole" that device mapping to 32 bits may give you.
Note that PAE could be used to map your devices above 4G so that you can get to the full 4G of memory, if you also
turn that on in the BIOS, and your device drivers support it; apparently it's not recommended.

There's also the Address Windowing Extensions (AWE) stuff. I can't imagine a reason why any normal person would
want to use that. If you're running on a 64-bit OS, just build 64-bit apps.

VirtualQuery tells me something about what's going on with granularity. It may not be obvious from the docs,
but you can call VirtualQuery with *ANY* pointer. You can call VirtualQuery( rand() ) if you want to. It
doesn't have to be a pointer to the base of an allocation range. From that pointer it gives you back the base
of the allocation. My guess is that they do this by stepping back through buckets of size 64k. To cover 2G of
address space you need 32k chunks of 64k bytes. Each chunk has something like MEMORY_BASIC_INFORMATION, which is about 32
bytes. To hold 32k of those would take 1 MB. This is just pure guessing.
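
Just to spell out the arithmetic behind that guess, here it is as code. This only illustrates the guessed bucket table; it says nothing about how the kernel really tracks allocations, and the 32-byte entry size is the rough figure from the text. All names are made up.

```cpp
#include <cstdint>

constexpr std::uint64_t kBucketSize = 64 * 1024;                 // one bucket per 64k of address space
constexpr std::uint64_t kUserSpace  = 2ull * 1024 * 1024 * 1024; // 2 GB of user address space
constexpr std::uint64_t kEntryBytes = 32;                        // rough per-bucket record size (assumed)

// Number of 64k buckets needed to cover the user address space.
constexpr std::uint64_t BucketCount() { return kUserSpace / kBucketSize; }

// Total size of a flat table with one record per bucket.
constexpr std::uint64_t TableBytes() { return BucketCount() * kEntryBytes; }

// Bucket that an arbitrary address falls in, and where that bucket starts.
constexpr std::uint64_t BucketIndex(std::uint64_t address) { return address / kBucketSize; }
constexpr std::uint64_t BucketBase(std::uint64_t address)  { return BucketIndex(address) * kBucketSize; }
```

That gives 32768 buckets and a 1 MB table, and any pointer maps to its bucket with one divide, which is what would make a "query any address" lookup cheap under this guess.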

SetSystemFileCacheSize is interesting to me but I haven't explored it.

Oh, some people apparently have problems with DLL's that load to fixed addresses fragmenting virtual memory. It's
an option in the DLL loader to specify a fixed virtual address. This is naughty but some people do it. This could make
it impossible for you to get a nice big 1.5 GB virtual alloc or something. Apparently you can see the fixed address
in the DLL using "dumpbin.exe" and you can modify it using "rebase.exe"

ADDENDUM : I found a bunch of links about /3GB and problems with Exchange Server fragmenting virtual address space. Most interestingly to me these links also
have a lot of hints about the way the kernel manages the PTE's (Page Table Entries). The crashes I was getting with /3GB were most surely running
out of PTE's ; apparently you can tell the OS to make more room for PTE's with the /USERVA flag. Read here :

The number of free page table entries is low, which can cause system instability

How to Configure the Paged Address Pool and System Page Table Entry Memory Areas

Exchange Server memory management with 3GB, USERVA and PAE

Clint Huffman's Windows Performance Blog Free System Page Table Entries (PTEs)

I found this GameFest talk by Chuck Walbourn :

Why Your Windows Game Won't Run In 2,147,352,576 Bytes

that covers some of these same issues. In particular he goes into detail about the AGP
and memory mirroring and all that. Also in Vista with the new WDDM apparently you can make video-memory only resources that don't take any app virtual
address space, so that's a pretty huge win.

BTW to be clear - the real virtual address pressure is in the tools. For Oodle, my problem is that to set up the paging for a region, I want
to load the whole region, and it can easily be > 2 GB of content. Once I build the bundles and make paging units, then you page them in and out
and you have nice low memory use. It just makes the tools much simpler if they can load the whole world and not worry about it. Obviously that
will require 64 bit for big levels.

I'm starting to think of the PC platform as just a "big console". For a console you have maybe 10 GB of data, and you are paging that through
256 MB or 512 MB of memory. You have to be careful about memory use and paging units and so on. In the past we thought of the PC as "so much
bigger" where you can be looser and not worry about hitting limits, but really the 2 GB Virtual Address Space limit is not much bigger (and
in practice it's more like 1.5 GB). So you should think of the PC as having a "small" 1 GB of memory, and you're paging 20 GB of data through it.
