Kaja has set up a wiki for Paradox to document his Compiler + JIT + Interpreter in the making.
I added quite a bit of documentation about my Garbage Collector, IMGC, there too, for it will be used by Paradox.
Whilst tracing down garbage during the garbage collection, as described in earlier posts, the Gc has to find and mark an object pointed to from another object.
This can lead to problems when pointers are allowed to point anywhere inside an object, or when the object information bits can’t be found from a pointer to the start of the object.
In both cases the Gc should somehow translate a pointer (e.g. 0x1234) to the head of the object it points into (e.g. 0x1200). The problem here is how to efficiently get from the one to the other.
This problem can be circumvented by disallowing pointers into the inside of an object and instead requiring all pointers to point to the start of the object. This only works when the header of the object has a fixed size; otherwise it is still impossible to work back to the start of the header, which contains the flags you want to alter when tracing garbage.
Pointers into the inside of a managed object aren’t required and could easily be removed. They can be very useful in some situations, though. One of these is inlining a privately referenced instance, of which it is known that it won’t be modified, into the containing class, which results in pointers into the inside of the container object. It only increases performance in some situations, but it is useful to have, and it is optional.
A bigger issue is pointers that point to the start of an object which has a dynamically sized header in front of it. Either the header shouldn’t be of dynamic size but of a fixed (big) size, or pointers should point to the header of the object instead of to the start of the object. In the latter case each pointer dereference requires an ‘add’ instruction to advance the address to the start of the object, plus the additional logic required to obtain the size of the header, which can be significant depending on the implementation. These extra opcodes for each pointer dereference could not easily be optimized away, even when using a JIT, because of multi-threading, and certainly could not be avoided when using C. The fixed header size isn’t nice either, for it would probably end up being at least 7 bytes on a 64bit platform (flags + pointer).
Oh, and there is also the Java way, which uses a handle for each object. This basically means that an integer pointer isn’t an int* but an int** in Java, where the pointer in between contains the meta-data about the object. This is even less ideal, for it requires an extra dereference for each dereference and it results in twice the number of pointers to track.
As far as I am concerned this problem can’t be avoided when dealing with a Gc that is required to interoperate with non-managed code. With a JIT it could be done, but this requires a trade-off between runtime performance and garbage collection performance. And I personally prefer a somewhat slower garbage collection over a slower runtime.
Before tracing, the garbage collection usually walks through the heap to mark all objects white and to put those in the root-set on the gray-list. During this first walk through the heap a temporary look-up structure could be built to allow efficient look-up during the trace.
When only allowing pointers to the start of the object this would be rather easy to do. A modulus-based bucket-like implementation, a hash table without the hash part, would do. Because it uses a modulus, this would give a nice even distribution of entries throughout the table.
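A minimal sketch of such a modulus-based bucket table, in Python for brevity. The class name, the method names and the bucket count of 61 are all assumptions made for illustration; an odd (here prime) bucket count is chosen so that aligned addresses still spread over every bucket.

```python
class StartPointerTable:
    """Bucket table keyed by address modulo the bucket count (no hash function)."""

    def __init__(self, bucket_count=61):
        # Assumption: 61 is an arbitrary prime, so 8- or 16-byte aligned
        # addresses do not all land in the same few buckets.
        self.buckets = [[] for _ in range(bucket_count)]

    def add(self, start, header):
        # Called once per object during the initial walk through the heap.
        self.buckets[start % len(self.buckets)].append((start, header))

    def find(self, pointer):
        # Only exact start addresses can be found; a pointer into the
        # middle of an object ends up in an unrelated bucket.
        for start, header in self.buckets[pointer % len(self.buckets)]:
            if start == pointer:
                return header
        return None


table = StartPointerTable()
table.add(0x1200, "header of object A")
table.add(0x1300, "header of object B")
print(table.find(0x1200))
print(table.find(0x1234))
```

The second look-up fails because 0x1234 is not the start of an object, which is exactly the limitation of this structure.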
When allowing pointers in the middle of an object it becomes more tricky, because you can’t precompute the pointers to put in the table. You need to have a structure which allows you to look to the left and find the object the pointer points into. Although most pointers still point to the start of the object, the case that they could point inside the object has to be considered.
Instead of modulus-based buckets, division-based buckets should be used. This creates a problem, for the distribution of blocks between the buckets is likely to vary heavily. One whole bucket could be empty when there is a big allocation in the memory it represents, whereas almost all small objects, which account for most of the pointers, could be concentrated in just two buckets.
To compensate for this a bucket-tree could be used. Basically, when a bucket contains a lot of elements it splits itself into a new set of buckets. This way only a minimal number of buckets is wasted on really big allocations, and a lot of buckets can be allocated for small allocations. The look-up time would be a bit worse, but that is negligible compared to the huge look-up time a normal search through the heap would require.
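A rough sketch of the flat, division-based variant with the "look to the left" step. The splitting of crowded buckets into a bucket-tree is left out to keep it short. All names here are hypothetical, and the sketch assumes objects are added in address order, as they would be during a heap walk.

```python
class DivisionBuckets:
    """Buckets indexed by (address - heap_start) // bucket_size."""

    def __init__(self, heap_start, heap_size, bucket_size):
        self.heap_start = heap_start
        self.bucket_size = bucket_size
        count = (heap_size + bucket_size - 1) // bucket_size
        self.buckets = [[] for _ in range(count)]

    def _index(self, address):
        return (address - self.heap_start) // self.bucket_size

    def add(self, start, size):
        # Assumption: objects arrive in address order, so every bucket
        # stays sorted by start address without extra work.
        self.buckets[self._index(start)].append((start, size))

    def find(self, pointer):
        """Return the start of the object containing pointer, or None."""
        index = self._index(pointer)
        if not 0 <= index < len(self.buckets):
            return None
        # Look to the left: first within the pointer's own bucket, then in
        # the buckets before it (needed for objects spanning several buckets).
        while index >= 0:
            for start, size in reversed(self.buckets[index]):
                if start <= pointer:
                    # Nearest object to the left; it either contains the
                    # pointer or the pointer lies in a gap.
                    return start if pointer < start + size else None
            index -= 1
        return None


heap = DivisionBuckets(heap_start=0x1000, heap_size=0x1000, bucket_size=0x100)
heap.add(0x1200, 0x80)
heap.add(0x1280, 0x300)   # a big allocation that spans several buckets
print(hex(heap.find(0x1234)))
print(hex(heap.find(0x1400)))
```

The big allocation leaves the buckets for 0x1300 and 0x1400 empty, which is the uneven distribution described above.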
When there are only one or two pointers in the whole heap that point inside another object, they could be treated separately instead, and the rest could keep using the modulus-based buckets.
One way to give those pointers special treatment is to mark them, and to resolve which object they point to during the initial walk through the heap, where finding the objects pointed to is a lot easier (the pointers that need to be looked up could be sorted by address and be checked during the walk, which is very efficient).
Another way would be to simulate pointers pointing inside another object by making the pointer consist of an object pointer and an offset from there, which could be cached by the JIT, or expanded before the garbage collection.
Yet another way would be to walk the modulus-based buckets by hand when the pointer isn’t found.
The best method would depend on the implementation of the rest of the Gc.
I’m still uncertain about whether the distribution in a modulus-based bucket table can be trusted at all times. Putting in a bucket-tree by default, for the worst case, would be nice, just to be sure.
When we put pointers to all allocated blocks, sorted, somewhere in memory, one could use binary search, or another search algorithm for sorted lists, to find the pointer. When adding shortcuts every ~64K this could be quite efficient. I’ve implemented this method for my Garbage Collector for now, for it is amongst the easiest to implement and I suspect the fastest. Actually, I hope it is the fastest.
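The binary search can be sketched with Python’s standard bisect module. This is not the IMGC code. The function name and the two parallel lists are assumptions for illustration, and the ~64K shortcuts are left out.

```python
import bisect


def find_object_start(starts, sizes, pointer):
    """starts: sorted block start addresses; sizes: the matching block sizes.

    Returns the start of the block that contains pointer, or None.
    """
    # Index of the last block whose start is <= pointer.
    i = bisect.bisect_right(starts, pointer) - 1
    if i < 0:
        return None          # pointer lies before the first block
    if pointer < starts[i] + sizes[i]:
        return starts[i]     # pointer is inside (or at the start of) block i
    return None              # pointer lies in a gap between blocks


starts = [0x1000, 0x1200, 0x1300]
sizes = [0x100, 0x80, 0x40]
print(hex(find_object_start(starts, sizes, 0x1234)))
```

Because the same search handles pointers to the start and pointers into the middle of a block, no special case is needed for either.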
I’ll do some tests with different algorithms on a dump of a ‘model’ application’s managed heap, once I’ve finished the Gc sufficiently to make an application work with it.
The biggest disadvantage of tracing Gc’s (see previous posts) is the unpredictability of the garbage collection. It can happen at any time and the time it takes varies. Since normal execution is halted meanwhile, this can cause serious problems for applications which are required to have a low latency.
One trick to improve the latency of the gc is to make the garbage collection incremental. Instead of doing everything in one big garbage collection, tiny chunks of execution are interleaved with garbage collection.
When using a generational gc, a garbage collector which groups objects by their age, garbage collection can be done on each separate generation. Objects that have survived a few collections, and settle in older generations, tend to remain persistent (those are the objects that remain global for the duration of the program). Therefore the older generations don’t need a lot of garbage collections, whereas the youngest generations typically contain a lot of garbage and require frequent garbage collections.
A great many small garbage collections could perform worse than one big garbage collection. Very big garbage collections which are very infrequent, however, are really bad for latency. In extreme cases there could even be 100MB of allocated memory for 1MB of used memory at some moment, because garbage isn’t collected frequently enough. Having a lot of garbage between live data also gets in the way of locality of reference (used memory that lies close together gets loaded into the L2-cache in one go, which is a lot faster than normal RAM).
Although a figure like 100MB sounds ridiculous, take for instance a program that reads through a text file and prints every line:
    while there-is-a-new-line
        string line = get-line-from-file
        print line
    end
Every line read would go on the heap as a string. After the line string goes out of scope it becomes unreferenced from the stack (part of the root-set) and therefore garbage. When reading a big file the heap fills up pretty fast. When doing a similar operation on data from memory, which is a lot faster than a file, it could get even worse.
When doing a lot of garbage collections, small pieces of memory are copied each time over short distances, which is less efficient than copying bigger pieces. But having to copy pieces over a very great distance, which doesn’t fit in the L2-cache, really slows things down instead of speeding them up.
When a lot of memory is allocated in a short time the garbage collector could get suspicious and fire away a full collection, which could be a waste of time when all the allocated memory turns out to still be live. Moving memory, marking free spaces and updating pointers take most of the time of the whole garbage collection. Only tracing what is garbage and what isn’t, and deciding what to do after that, could provide a gain in performance.
Not only does tracing down garbage take less time, it also is easier to do while the rest of the program is running.
Most gc’s keep internal data structures optimized for quick access and modification for things like tracking types, freeblocks, blocks, unmanaged pointers and others. An example could be a list which contains the pointers to unmanaged pointers, to be able to link into managed memory from unmanaged memory. This list would be accessed a lot when in unmanaged code, which is required to register all pointers on its unmanaged stack on that list.
A chained-block stack would perform best. The whole stack consists of a few chained blocks, where the last block is pointed to from the stack structure and itself points to the previous block, and so on all the way to the first block. Adding a pointer-pointer to the stack would be as easy as incrementing the pointer count and putting it in the last block, possibly creating a new block when the current block is full.
Removing a pointer-pointer would work best by searching from end to begin for the pointer and simply NULL-ing it, because the most frequent use of the stack-list would be registering pointers from the unmanaged stack and deregistering them shortly afterwards. However, when another global unmanaged pointer is registered, there could be a few blocks allocated for the stack of which only the very last contains that one single pointer-pointer to the global pointer, which won’t be deregistered soon.
The algorithm could be changed to compact on every operation, or to track free spaces in the stack-list, but it can be done a lot easier. Simply collecting the garbage and compacting the stack-list would work just as well.
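A small sketch of such a chained-block stack, with removal by NULL-ing and a separate compaction step. The class and method names are made up for illustration, plain integers stand in for the pointer-pointers, and the block capacity is kept tiny so the example stays readable.

```python
class Block:
    def __init__(self, previous, capacity):
        self.previous = previous          # link to the block before this one
        self.slots = [None] * capacity
        self.used = 0                     # number of slots handed out so far


class ChainedBlockStack:
    def __init__(self, block_capacity=4):
        self.block_capacity = block_capacity
        self.last = Block(None, block_capacity)

    def push(self, entry):
        # Start a new block when the current one is full.
        if self.last.used == self.block_capacity:
            self.last = Block(self.last, self.block_capacity)
        self.last.slots[self.last.used] = entry
        self.last.used += 1

    def remove(self, entry):
        # Search from end to begin; recently pushed entries are found first.
        block = self.last
        while block is not None:
            for i in range(block.used - 1, -1, -1):
                if block.slots[i] == entry:
                    block.slots[i] = None     # just NULL the slot
                    return True
            block = block.previous
        return False

    def entries(self):
        """All live entries, oldest first."""
        blocks, block = [], self.last
        while block is not None:
            blocks.append(block)
            block = block.previous
        return [slot for b in reversed(blocks)
                for slot in b.slots[:b.used] if slot is not None]

    def block_count(self):
        count, block = 0, self.last
        while block is not None:
            count += 1
            block = block.previous
        return count

    def compact(self):
        # "Garbage collect" the stack-list itself: rebuild it from live entries.
        live = self.entries()
        self.last = Block(None, self.block_capacity)
        for entry in live:
            self.push(entry)


stack = ChainedBlockStack(block_capacity=2)
for pointer in (10, 20, 30, 40, 99):      # 99 plays the long-lived global
    stack.push(pointer)
for pointer in (10, 20, 30, 40):
    stack.remove(pointer)
print(stack.block_count(), stack.entries())
stack.compact()
print(stack.block_count(), stack.entries())
```

Before compaction three blocks are kept alive by a single entry; afterwards one block is enough.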
It would be tempting to perform these garbage collections for internal structures during the normal garbage collection, but this would hurt latency. Instead scheduling them somewhere in between would be best.
When a garbage collection is in progress all other threads are usually stopped to prevent the program itself (the mutator) from changing memory and corrupting the managed heap. It is possible, though, to make most parts of the garbage collection concurrent with normal application execution.
When, while tracing down garbage, a pointer is changed in an object which already is black (all pointers in the object have been checked), the object pointed to could remain white (no black object seems to point to it) and would be scheduled for garbage collection.
This condition is rare, but it could happen, and it could be fatal. Other conditions like an allocation during trace could be easily solved by marking the object black when allocated.
An easy way to cope with these conditions is to detect any writing to pointers in objects that already are marked black, and to act upon it. This is called a write barrier.
The problem with a write barrier is that it is very hard to implement for languages like C, and that it causes a performance hit, for each pointer dereference + write would require a few extra opcodes, which would in most cases be unacceptable. When you are dealing with a JIT-ed (Just In Time compiled) language, however, you can emit these extra opcodes to detect writes into black objects only when a Gc is in progress.
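To make the idea concrete, here is a toy write barrier in Python. It is only a sketch: the names Obj, Collector and write_ref are hypothetical, and shading the newly referenced white object gray is just one possible way of "acting upon" the write.

```python
WHITE, GRAY, BLACK = "white", "gray", "black"


class Obj:
    def __init__(self, name):
        self.name = name
        self.color = WHITE
        self.fields = {}


class Collector:
    def __init__(self):
        self.tracing = False   # True while garbage is being traced
        self.gray = []         # objects that still need their pointers checked

    def write_ref(self, holder, field, target):
        """Every pointer store goes through here instead of a plain assignment."""
        # The barrier: a black holder will not be scanned again, so a white
        # target stored into it has to be queued for tracing right away.
        if (self.tracing and holder.color == BLACK
                and target is not None and target.color == WHITE):
            target.color = GRAY
            self.gray.append(target)
        holder.fields[field] = target


gc_state = Collector()
a, b = Obj("a"), Obj("b")
a.color = BLACK                 # a has already been fully scanned
gc_state.tracing = True
gc_state.write_ref(a, "child", b)
print(b.color)
```

With a JIT the check inside write_ref would only be emitted while a collection is in progress, so that normal execution pays nothing for it.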
After the garbage has been traced down it is time for the gc to compact the memory by moving live objects into the space freed from garbage, and changing pointers to these objects accordingly. While tracing, the gc would already have made a translation table for pointers (usually offset-based for efficiency), and it would walk its way through the heap, compacting and changing pointers in one go.
When the application is running at that time it would run into problems when dereferencing pointers.
To combat this a read barrier would be required, which is the same as a write barrier, but for normal pointer dereferencing. This too requires opcodes to be added to each dereference, which would be easy for a JIT and unacceptable for C. The added opcodes would simply look at where the garbage collection is at the moment, and convert the pointer as if they were the Gc itself, using the Gc’s translation table.
Making a Gc work concurrently outside a garbage collection is relatively easy, for no real collection is happening at that time and a few rw-locks would do the trick.
Concurrency in a Gc is a must for some applications, and generally a good idea anyway. A better latency can be gained from fine-grained incremental garbage collections. Real concurrency can be gained by using read and write barriers to allow execution during the garbage collection, which is only feasible when working with a JIT.
Most Garbage Collectors are tracing collectors. Instead of using Reference Counting, they track down unused objects (garbage) by tracing from the root set of objects and marking all objects found. Those that weren’t marked have to be garbage.
The root-set consists of all objects that are definitely live. It typically consists of the objects linked from the stack and registers of each thread. Objects vital to the Gc and globals are in the root-set too.
An easy way to represent tracing through objects is the tri-color method. There are three colors an object can have: white, gray and black. All white objects haven’t been seen to be referenced yet from black objects, which are sure to be non-garbage. The gray objects are those that are referenced by black objects but haven’t had all of their references checked yet. Sounds confusing? A more practical explanation:
An implementation of the Tri-Color algorithm can be accomplished by putting a mark on each object which either can be 1 or 0. 1 representing a black object, and 0 representing a white object. A stack list can be used to keep track of the gray objects:
NB do not confuse this stack, which is a normal Last In First Out list with the call stack.
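The marking described here, with one mark bit per object and a stack for the gray objects, could look roughly like this. It is a sketch in Python with made-up names (Obj, trace), not code from an actual collector.

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.marked = False   # False = white; True = black, or gray while on the stack
        self.refs = []        # the pointers this object contains


def trace(heap, roots):
    """Mark everything reachable from roots and return the unmarked objects."""
    for obj in heap:
        obj.marked = False                # everything starts out white
    gray = []                             # a plain LIFO stack of gray objects
    for root in roots:
        if not root.marked:
            root.marked = True
            gray.append(root)
    while gray:
        obj = gray.pop()                  # obj turns black once its refs are checked
        for child in obj.refs:
            if not child.marked:          # white: seen for the first time
                child.marked = True
                gray.append(child)
    return [obj for obj in heap if not obj.marked]


a, b, c, d, e = (Obj(name) for name in "abcde")
a.refs = [b]
b.refs = [c]
d.refs = [e]
e.refs = [d]                              # d and e only reference each other
garbage = trace([a, b, c, d, e], roots=[a])
print([obj.name for obj in garbage])
```

Note that d and e form a cycle, yet they are still found to be garbage because nothing reachable from the root-set points to them.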
Studies have shown that in almost all applications most allocations have a lifetime of less than a few milliseconds, while the rest of the allocations have a far greater lifespan. Collecting every few milliseconds would cause unnecessary tracing through the long-living objects, which still take up most of the memory, and when collecting only every few seconds the uncollected short-lived objects would have spammed the heap and would make the garbage collector locality-of-reference unfriendly (everything worked on wouldn’t fit in one L2-cache load and would therefore be a lot slower).
The older objects, which in comparison with the newer objects are highly unlikely to be garbage, could be searched less frequently. This could be done by separating the heap into generations. Each generation is basically a piece of the heap. A typical garbage collection would only target the youngest few generations. Every time an object survives a garbage collection it is moved into an older generation, until it finally becomes part of the oldest generation, which is garbage collected only very infrequently.
The duration of the garbage collections would be hard to predict and could vary a lot. Although generation-based gc’s could increase the number of garbage collections and decrease the total time in comparison with one big collection, applications that use a garbage collector cannot guarantee (practically) uninterrupted execution.
Tracing gc’s have been shown to easily outperform reference counting gc’s. The prerequisite of knowing all pointers in an object can be met easily for interpreted and JIT-ted languages using the gc, and can even be managed in C with some effort. The real problem remains the arbitrary and unpredictable length of garbage collections. There are ways to make tracing gc’s have a better latency, which I’ll discuss in coming posts.
Most Garbage Collectors aren’t limited to just tracing the garbage (unused objects) and scheduling it for deletion, but also manage memory as a whole including allocating it.
The kernel of an Operating System takes care of mapping physical memory (RAM/swap memory/IO memory/caches) to the virtual memory available to a process. A process is allocated a few segments of memory. The one that is important to memory allocators is the Data Segment.
The Data Segment is the segment which contains the memory which is used for random access by the program. The size of the Data Segment can be changed using the sbrk C function, which asks the kernel for more memory to be mapped for the process and then returns a pointer to the start of the newly allocated space (providing a negative value will result in decreasing the data segment size).
A non-Gc memory manager/allocator is malloc and its friends. Malloc, like almost all other allocators, uses sbrk to obtain memory for the Data Segment to allocate memory from. It prefixes and suffixes all blocks of free or allocated memory with size fields so it can walk through the whole heap. In each freeblock it puts a pointer to the next and previous freeblock, which makes freeblocks easy to insert and find.
Freeing a block is simply adding it to the freeblock chain and setting a bit to flag that it is a free block. Allocating a block is simply walking the freeblock chain until a freeblock is found that can hold the required size, writing new sizes and flags for the remaining freeblock, and updating the pointers in the previous and next block of the freeblock chain to match the new offset of the freeblock.
The big problem with malloc is fragmentation. sbrk may have to be called to enlarge the heap to create a freeblock big enough for an allocation, although there is enough free space for that allocation; it just is scattered and not contiguous.
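A heavily simplified model of a first-fit freeblock chain shows how this fragmentation comes about. This is a sketch only: a Python list of (offset, size) pairs stands in for the in-heap size fields and next/previous pointers, freeblocks are never coalesced, and all names are hypothetical.

```python
class FreeListAllocator:
    def __init__(self, heap_size):
        # The freeblock chain, in chain order: (offset, size) pairs.
        self.free_blocks = [(0, heap_size)]

    def allocate(self, size):
        # First fit: walk the chain until a freeblock is big enough.
        for i, (offset, block_size) in enumerate(self.free_blocks):
            if block_size >= size:
                if block_size == size:
                    del self.free_blocks[i]
                else:
                    # Shrink the freeblock; its offset moves past the allocation.
                    self.free_blocks[i] = (offset + size, block_size - size)
                return offset
        return None   # a real malloc would now have to grow the heap

    def free(self, offset, size):
        # Simply add the block to the chain again (no coalescing).
        self.free_blocks.append((offset, size))


heap = FreeListAllocator(100)
a = heap.allocate(30)
b = heap.allocate(30)
c = heap.allocate(30)
heap.free(a, 30)
heap.free(c, 30)
free_total = sum(size for _, size in heap.free_blocks)
print(free_total, heap.allocate(50))
```

Here 70 units are free in total, but the request for 50 still fails because no single freeblock is larger than 30.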
Modern implementations of malloc feature tricks to prefer perfect fits and to automatically coalesce freeblocks to create bigger blocks, which all makes malloc a whole lot slower than it should be, with still quite a lot of memory fragmentation.
Most Gc’s implement their own memory allocator routines instead of using malloc, usually even overriding the old malloc. This is because Gc’s usually know more about the requirements of the application than malloc does.
It gets even better when the Gc knows all pointers to the managed heap. In this case a Gc could decide to move an object, update the pointers that linked to it, and save a lot of space otherwise lost to fragmentation. This is easily implemented when dealing with an interpreted language which uses a Gc, for it knows all types at runtime. It can even be done in C, but that requires all pointers into the managed heap, both inside and outside the managed heap, to be registered.
The difficult part is to find all pointers pointing to a certain object. This usually requires the whole heap to be walked, once or twice. Waiting until a reasonable number of objects is subject to move, and then doing one big move of all those objects, possibly together with other operations requiring heap walks, as one big “garbage collection” would be way more efficient than doing this all ‘live’. In the first pass each object subject to move could be tagged with its new location, and on the second pass all pointers would be checked and adjusted accordingly.
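The two passes can be sketched on a toy heap as follows. The dictionary layout, the function name compact and the forward table are all assumptions for illustration; the sketch also assumes that live objects only point to live objects, as is the case after tracing.

```python
def compact(heap, roots):
    """heap: list of dicts with keys addr, size, live, refs, sorted by addr.
    roots: list of addresses. Returns the compacted heap and adjusted roots."""
    # Pass 1: tag every live object with its new location.
    forward = {}
    next_free = heap[0]["addr"] if heap else 0
    for obj in heap:
        if obj["live"]:
            forward[obj["addr"]] = next_free
            next_free += obj["size"]

    # Pass 2: move the objects and adjust every pointer via the table.
    new_heap = [
        {
            "addr": forward[obj["addr"]],
            "size": obj["size"],
            "live": True,
            "refs": [forward[ref] for ref in obj["refs"]],
        }
        for obj in heap
        if obj["live"]
    ]
    new_roots = [forward[root] for root in roots]
    return new_heap, new_roots


heap = [
    {"addr": 0, "size": 16, "live": True, "refs": [48]},
    {"addr": 16, "size": 32, "live": False, "refs": []},   # garbage
    {"addr": 48, "size": 8, "live": True, "refs": [0]},
]
new_heap, new_roots = compact(heap, roots=[0])
print([(obj["addr"], obj["refs"]) for obj in new_heap])
```

The object at 48 slides down into the hole left by the garbage, and the pointer to it is rewritten in the same run.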
The big problem with being able to move managed memory is keeping track of all pointers into the managed memory. If you wanted to store a file handle in managed memory, you would have a hard time when the internals of file access keep pointers to your file handle.
Fixed allocations, which remain stationary would solve the issue but this could, sadly enough, result in fragmentation of your heap. When all objects can be moved it is very easy to move, but when there are fixed objects somewhere in the heap, complicated ‘fitting’ algorithms must be used which decrease performance.
Fixing memory temporarily, or preventing a GC temporarily, is a much cleaner solution to the problem. In some cases this just isn’t possible, and big blocks with their own malloc-style allocator could be used to handle unmanaged memory so that it won’t interfere too much with the normal managed memory. On most *nix systems mmap could be used to map additional usable memory into the virtual memory space of the process. This is slower than sbrk and creates some extra overhead, but it will avoid fragmentation in the managed heap due to fixed blocks.
It is only useful for a Gc to implement its own malloc when it can move its objects, or when there is a clear advantage to implementing the malloc close to the Gc, for instance to more easily walk through all objects in the heap without having to implement a walk algorithm for each malloc used across different platforms.
The simplest form of Garbage Collector is the Reference Counter Garbage Collector.
Each time an object is referenced (a pointer to it is made) the reference count on the object is incremented. If the pointer isn’t used anymore the reference count is decremented. If the reference count hits 0 the object is freed.
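In its barest form that could be sketched like this. The names Counted, add_ref and on_free are made up for the example (free_ref follows the name used further on), and a real implementation would do this in the runtime rather than in Python.

```python
class Counted:
    def __init__(self, name, on_free):
        self.name = name
        self.count = 0          # number of references currently held
        self.on_free = on_free  # called when the count drops to 0


def add_ref(obj):
    """Call whenever a new pointer to obj is made."""
    obj.count += 1
    return obj


def free_ref(obj):
    """Call whenever a pointer to obj is no longer used."""
    obj.count -= 1
    if obj.count == 0:
        obj.on_free(obj.name)   # nobody references it anymore: free it


freed = []
text = Counted("text", freed.append)
first = add_ref(text)
second = add_ref(text)
free_ref(first)
print(freed)        # still referenced through second
free_ref(second)
print(freed)
```

Forgetting a single free_ref call keeps the count above 0 forever, so the object is never freed.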
Python uses a Reference Counter.
Instead of the malloc/free pair required for each allocation without a Gc, it now requires a free_ref for each pointer/reference to an allocation. When someone forgets to release their reference the object persists.
Circular references can be fixed by adding a trace algorithm. This algorithm will trace through all objects from the root objects (pointers on the stack, etc.) and mark them. The objects that aren’t marked aren’t accessible and therefore are subject to Garbage Collection.
The problem is that a RefCounter Gc usually doesn’t keep track of which pointers are in objects, and that there could be pointers outside the managed heap which only use free_ref. Python struggled with this problem, and it implemented a trace algorithm that depends on the programmers of non-Python modules (of which the pointers aren’t known) to implement the list of references for objects that also contain objects. By traversing links that way circular dependencies can be found without hurting unmanaged pointers, but this has a high performance cost for Python, and it is still advised to avoid circular references.
RefCounter Gc’s are easy to implement, and a good idea for scripting languages which know all references, so that circular references can be dealt with accordingly. But they still have a lot of disadvantages which other kinds of Gc’s solve a lot better, and more elegantly.
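The effect can be observed from within Python itself using the standard gc and weakref modules. The sketch below assumes CPython; the Node class and the cycle_demo function are made up for the demonstration, and the automatic cycle detector is switched off for a moment so that it can’t run in between.

```python
import gc
import weakref


class Node:
    def __init__(self):
        self.other = None


def cycle_demo():
    was_enabled = gc.isenabled()
    gc.disable()                      # keep the cycle detector out of the way
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a       # a circular reference
        probe = weakref.ref(a)        # lets us see whether a still exists
        del a, b                      # drop our own references
        alive_after_del = probe() is not None
        gc.collect()                  # run the tracing cycle detector
        alive_after_collect = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_after_del, alive_after_collect


print(cycle_demo())
```

Reference counting alone leaves the pair alive after the del, because each object still holds a reference to the other; only the tracing pass gets rid of them.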
I’ve been working on Garbage Collectors recently, and I’ll share some of my experiences with them on my blog.
First off, what is a Garbage Collector?
A Garbage Collector, a.k.a. GC, is an algorithm that keeps track of your objects in memory and gets rid of objects that aren’t used (referenced) anymore, a.k.a. garbage collecting. This can only be accomplished when the GC knows all pointers to the objects it manages (the objects on the managed part of the heap). GC’s are therefore almost always used with an interpreted/JIT-ted language, like Java, Python or C#.
Since the first GC’s (reference counters) they have gained a lot more responsibility and perform a lot more tasks; collecting the garbage is just one of the things they do.
I’ll discuss the various kinds and aspects of Garbage Collectors with their advantages and disadvantages in coming blog posts.
You can read all articles about the gc in its category.