The GC: Concurrency

The tracing Gc's (see previous posts) biggest disadvantage is the unpredictability of the garbage collection: it can happen at any time and the time it takes varies. Because normal execution is halted while it runs, this can cause serious problems for applications that are required to have low latency.

Incremental Gc

One trick to improve the latency of the gc is to make the garbage collection incremental. Instead of doing everything in one big garbage collection, tiny chunks of garbage collection are interleaved with normal execution.
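To make that concrete, here is a minimal sketch in C of a single marking increment with a fixed work budget; the worklist representation, the budget of 64 objects and all names are assumptions made for illustration, not any particular collector's API.

#include <stddef.h>

#define MARK_BUDGET 64              /* objects scanned per increment */

typedef struct Object {
    int marked;                     /* reachable: queued or already scanned */
    size_t nrefs;
    struct Object **refs;           /* outgoing pointers */
} Object;

static Object *worklist[1024];      /* capacity checks omitted for brevity */
static size_t nwork;

static void push(Object *o)
{
    if (o && !o->marked) {
        o->marked = 1;              /* reachable: queue it for scanning */
        worklist[nwork++] = o;
    }
}

/* Scan at most MARK_BUDGET objects, then hand control back to the
 * program. Returns 1 while marking work remains, 0 when done. */
int gc_mark_step(void)
{
    for (int budget = MARK_BUDGET; budget > 0 && nwork > 0; budget--) {
        Object *obj = worklist[--nwork];
        for (size_t i = 0; i < obj->nrefs; i++)
            push(obj->refs[i]);
    }
    return nwork > 0;
}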

Generations

When using a generational gc, a garbage collector which groups objects by their age, garbage collection can be done on each separate generation. Objects that have survived a few collections, and settle in the older generations, tend to remain persistent (those are the objects that stay alive for the duration of the program). Therefore the older generations don't need a lot of garbage collections, whereas the youngest generations typically contain a lot of garbage and require frequent garbage collections.
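A minimal sketch of the bookkeeping this takes, assuming a simple object header with an age counter and an arbitrary promotion threshold of three survived collections (the names and the threshold are illustrative):

#define PROMOTE_AGE 3     /* survived collections before promotion */

typedef struct ObjHeader {
    unsigned age;         /* collections this object has survived */
    unsigned generation;  /* 0 = young, 1 = old */
} ObjHeader;

/* Called for every object that survives a young-generation collection. */
void on_survive(ObjHeader *h)
{
    if (h->generation == 0 && ++h->age >= PROMOTE_AGE)
        h->generation = 1;   /* settles in the old generation, which is
                                collected far less frequently */
}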

Granularity

Very many small garbage collections can have worse performance than one big garbage collection, while very big garbage collections that happen very infrequently are really bad for latency. In the extreme there could even be 100MB of allocated memory for 1MB of live memory at a given moment, because garbage isn't collected frequently enough. Having a lot of garbage between live data also gets in the way of locality of reference (used memory that sits close together fits in one load of the L2-cache, which is a lot faster than normal RAM).

Although a figure like 100MB sounds ridiculous, take for instance a program that reads through a text file and prints every line:

while there-is-a-new-line
   string line = get-line-from-file
   print line
end

Every line read goes on the heap as a string. After the line string goes out of scope it becomes unreferenced from the stack (part of the root-set) and is therefore garbage. When reading a big file the heap fills up pretty fast. When doing a similar operation on something much faster than a file, like memory, it gets even worse.
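Rendered in C against a hypothetical garbage-collected string API (gc_readline and the collector behind it are assumptions; the pseudocode above is language-neutral):

#include <stdio.h>

/* Allocates the line on the managed heap; returns NULL at end of file. */
extern char *gc_readline(FILE *f);

void print_file(FILE *f)
{
    char *line;
    while ((line = gc_readline(f)) != NULL)
        puts(line);    /* after this iteration nothing references the
                          line anymore: every line read becomes garbage */
}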

When doing a lot of small garbage collections, small pieces of memory are copied over short distances each time, which is less efficient than copying bigger pieces. But having to copy pieces over such a great distance that they don't fit in the L2-cache slows things down instead of speeding them up.

Trace-only garbage collections

When a lot of memory is allocated in a short time, the garbage collector could get suspicious and fire off a full collection, which could be a waste of time when all the allocated memory turns out to still be live. Moving memory, marking freespaces and updating pointers take up most of the time of a whole garbage collection. Only tracing what is garbage and what isn't, and deciding afterwards what to do about it, could therefore provide a gain in performance.

Not only does tracing down garbage take less time, it is also easier to do while the rest of the program is running.
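A sketch of what such a decision could look like, assuming a hypothetical collector API and an arbitrary threshold of 25% garbage:

#include <stddef.h>

extern void   gc_trace(void);          /* mark phase only */
extern size_t gc_heap_bytes(void);     /* total allocated bytes */
extern size_t gc_garbage_bytes(void);  /* bytes found unmarked by the trace */
extern void   gc_compact(void);        /* move memory, update pointers */

void gc_maybe_collect(void)
{
    gc_trace();                        /* the cheap part: find the garbage */
    if (gc_garbage_bytes() * 4 >= gc_heap_bytes())
        gc_compact();                  /* the expensive part, only when at
                                          least a quarter of the heap is garbage */
}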

Internal data garbage collections

Most gc's keep internal data structures optimized for quick access and modification, for things like tracking types, freeblocks, blocks, unmanaged pointers and others. An example is a list containing pointers to unmanaged pointers, to be able to link into managed memory from unmanaged memory. This list is accessed a lot when in unmanaged code, which requires all pointers on its unmanaged stack to be registered on that list.

A chained-block stack would perform best. The whole stack consists of a few chained blocks, where the last block is pointed to from the stack structure, which itself points to the previous block, and so on all the way back to the first block. Adding a pointer-pointer to the stack is as easy as incrementing the pointer count and putting it in the last block, possibly creating a new block when the current one is full.

Removing a pointer-pointer works best by searching from the end to the beginning for the pointer and simply NULL-ing it, because the most frequent use of the stack-list is registering pointers from the unmanaged stack and deregistering them shortly afterwards. However, when a global unmanaged pointer is registered in between, blocks can end up nearly empty: the stack registrations around it get NULL-ed out again, while that one single pointer-pointer to the global pointer, which won't be deregistered any time soon, keeps its block alive.

The algorithm could be changed to compact on every operation, or to track freespaces in the stack-list, but it can be done a lot more simply: just collecting the garbage and compacting the stack-list every once in a while works just as well.
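A sketch in C of such a chained-block stack-list, with registration, NULL-ing removal and a compacting pass; the block size and all names are made up for illustration:

#include <stdlib.h>

#define BLOCK_SIZE 64

typedef struct Block {
    struct Block *prev;           /* chain back to the previous block */
    void **slots[BLOCK_SIZE];     /* registered pointer-pointers */
} Block;

typedef struct {
    Block *last;                  /* the stack points at the last block */
    int count;                    /* used slots in the last block */
} PtrStack;

/* Register: put the pointer-pointer in the last block, creating a
 * fresh block when the current one is full. */
void ptrstack_push(PtrStack *s, void **pp)
{
    if (!s->last || s->count == BLOCK_SIZE) {
        Block *b = malloc(sizeof *b);
        b->prev = s->last;
        s->last = b;
        s->count = 0;
    }
    s->last->slots[s->count++] = pp;
}

/* Deregister: search from end to beginning and simply NULL the slot,
 * so the most recently registered pointers are found first. */
void ptrstack_remove(PtrStack *s, void **pp)
{
    int n = s->count;
    for (Block *b = s->last; b; b = b->prev, n = BLOCK_SIZE)
        for (int i = n - 1; i >= 0; i--)
            if (b->slots[i] == pp) {
                b->slots[i] = NULL;
                return;
            }
}

/* The internal collection: rewrite all surviving entries into fresh
 * blocks, oldest first, then free the old chain. */
static void refill(PtrStack *out, Block *b, int used)
{
    if (!b)
        return;
    refill(out, b->prev, BLOCK_SIZE);    /* older blocks first */
    for (int i = 0; i < used; i++)
        if (b->slots[i])
            ptrstack_push(out, b->slots[i]);
}

void ptrstack_compact(PtrStack *s)
{
    PtrStack out = { NULL, 0 };
    refill(&out, s->last, s->count);
    for (Block *b = s->last; b; ) {      /* free the old blocks */
        Block *p = b->prev;
        free(b);
        b = p;
    }
    *s = out;
}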

It would be tempting to perform these garbage collections on the internal structures during the normal garbage collection, but this would hurt latency. Instead, scheduling them somewhere in between would be best.

Concurrency

When a garbage collection is in progress, all other threads are usually stopped to prevent the program itself (the mutator) from changing memory and corrupting the managed heap. It is possible, though, to make most parts of the garbage collection concurrent with normal application execution.

Tracing

When, during the tracing of garbage, a pointer is changed in an object which already is black (all pointers in the object have been checked), the object pointed to can remain white (no black object appears to point to it) and will be scheduled for garbage collection even though it is still reachable.

This condition is rare, but it can happen, and it would be fatal. Other conditions, like an allocation during the trace, are easily solved by marking the object black when it is allocated.

Write barrier

An easy way to cope with these conditions is to detect every write of a pointer into an object that is already marked black, and to act upon it. This is called a write barrier.

The problem with a write barrier is that it is very hard to implement for languages like C, and it causes a performance hit, for each pointer write would require a few extra opcodes, which is in most cases unacceptable. When you are dealing with a JIT-ed (Just In Time compiled) language, however, you can emit these extra opcodes to detect writes into black objects only while a Gc is in progress.
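Written out as a C helper, here is a sketch of what those extra opcodes amount to; the shading scheme (re-queueing the white target, as in Dijkstra's incremental collector) and all names are assumptions made for illustration:

enum { WHITE, GREY, BLACK };            /* GREY: reachable, not yet scanned */

typedef struct Object {
    int colour;
    /* ... payload ... */
} Object;

extern int  gc_in_progress;             /* set while a trace is running */
extern void gc_push_grey(Object *o);    /* queue an object for scanning */

/* Emitted instead of a plain store: holder->field = value */
void gc_write_barrier(Object *holder, Object **field, Object *value)
{
    *field = value;
    /* A black holder now pointing at a white object is exactly the
     * fatal condition above: shade the target so the trace finds it. */
    if (gc_in_progress && holder->colour == BLACK &&
        value && value->colour == WHITE) {
        value->colour = GREY;
        gc_push_grey(value);
    }
}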

Compacting and pointer changes

After the garbage has been traced down, it is time for the gc to compact the memory by moving live objects into the freed garbage, and changing pointers to these objects accordingly. During the trace the gc would already have built a translation table for pointers (usually offset-based, for efficiency), and it walks its way through the heap compacting and changing pointers in one go.

If the application were running at that time, it would encounter problems when dereferencing pointers, for a pointer could still refer to the old location of an object that has already been moved.
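An offset-based translation table could look something like this sketch; the region layout and all names are assumptions:

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uintptr_t old_start, old_end;   /* address range that has been moved */
    ptrdiff_t delta;                /* offset the whole range slid by */
} Translation;

extern Translation gc_table[];      /* built during the trace */
extern size_t      gc_table_len;

/* Map a pointer into an already-compacted region to its new location. */
void *gc_translate(void *p)
{
    uintptr_t a = (uintptr_t)p;
    for (size_t i = 0; i < gc_table_len; i++)
        if (a >= gc_table[i].old_start && a < gc_table[i].old_end)
            return (void *)(a + gc_table[i].delta);
    return p;                       /* not moved (yet): still valid */
}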

Read barrier

To combat this a read barrier would be required, which is the same as a write barrier but for normal pointer dereferencing. This too requires opcodes to be added on each dereference, which is easy for a JIT and unacceptable for C. The added opcodes simply look at where the garbage collection is at that moment, and convert the pointer just as the Gc itself would, using the Gc's translation table.
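As a C helper, again an illustrative sketch, building on the gc_translate lookup above:

extern int   gc_compacting;        /* set while compaction is running */
extern void *gc_translate(void *p);

/* Emitted instead of a plain load: value = *slot */
void *gc_read_barrier(void **slot)
{
    void *p = *slot;
    if (gc_compacting)
        p = gc_translate(p);       /* follow the object to its new home */
    return p;
}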

Concurrency outside collection

Making a Gc work concurrently outside a garbage collection is relatively easy, for no real collecting is happening at that time and a few rw-locks would do the trick.
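A sketch of that idea with POSIX rw-locks, where mutator threads share the lock and a collection takes it exclusively; the gc_* and heap_* names are made up:

#include <pthread.h>
#include <stddef.h>

static pthread_rwlock_t gc_lock = PTHREAD_RWLOCK_INITIALIZER;

extern void *heap_alloc(size_t size);   /* must still lock its own free-list */
extern void  run_collection(void);

/* Many mutator threads may allocate at the same time... */
void *gc_alloc(size_t size)
{
    pthread_rwlock_rdlock(&gc_lock);
    void *p = heap_alloc(size);
    pthread_rwlock_unlock(&gc_lock);
    return p;
}

/* ...but a collection shuts them all out while it runs. */
void gc_collect(void)
{
    pthread_rwlock_wrlock(&gc_lock);
    run_collection();
    pthread_rwlock_unlock(&gc_lock);
}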

Conclusion

Concurrency in a Gc is a must for some applications, and generally a good idea anyway. Better latency can be gained from fine-grained incremental garbage collections. Real concurrency can be gained by using read and write barriers to allow execution during the garbage collection itself, which is only feasible when working with a JIT.

Dual Cores, Traffic and Gaming

Dual cores are the new thing in the processor business. They claim to be perfect for people who want to do different things at the same time, but in reality they are just a cheap solution to deliver more.

The truth is that two 500MHz processors just don't deliver the calculation power of one 1000MHz processor, given of course that they are of the same architecture. This is due to the fact that a lot of additional logic is required to let two processors work together without getting in each other's way. When two processors both want to write something different to the same spot of memory, you would get unpredictable results. There are all kinds of ways to solve this, for instance by locking a region of memory for an amount of time, making the other processor wait for the first one to finish. But this all creates a lot of overhead and complexity.

You could compare it with driving a car. Driving a car is relatively easy: you just have to steer around some static obstacles, no big deal. When you know the way you could even do it with your eyes closed. That is, unless there are other people driving cars. Then you are keeping an eye not just on the road, but on the other people on it. This not only lowers your maximum speed and decreases efficiency – you can't just drive full speed over a junction – it also increases the complexity and the likelihood of errors.

The same thing goes for dual-core processors (and hyper-threaded and normal multi-processor platforms). Admittedly the comparison isn't entirely valid, because having two processors doesn't mean that you have to do two things at the same time. The issue is that what you normally would do gets done by only one processor, and you are therefore wasting the other's capability.

Games are a good example of this. Games tend to be single-threaded, which gives the best performance, for no processor time is wasted on multithreading, and it is the easiest thing to do, for multithreading is rather complicated. Complicated enough that there have been a few lengthy discussions on the mono mailing lists about how to lock a simple object.

Because we are getting dual-core, and probably 'more'-core, processors later on, for the companies are too lazy to make decent processors[1], games should become multithreaded to exploit the full capability of the machine they run on.

Although it makes creating performant applications a lot more difficult, it will surely benefit distributed computing, for the step from different processors to different machines is smaller than the step from one processor to more.

[1] Natively threaded CPUs like dual-core/multi-processor/hyper-threaded processors are ideal for server applications where multiple short-lived requests have to be resolved. Switching and allocating threads on a software-threaded processor would create too much overhead for such a simple request.