Memory Alignment Havoc

Memory alignment is one of the most important concepts when working with memory at a low level. The bliss of the x86 processor, which is very forgiving about alignment, left me unaware of it for years. That was until Kaja mentioned it, and I decided to rebuild my Gc.

Like err.. what you talking ’bout?

Data should be aligned in memory. Why? Because the processor doesn't see memory the way most programmers do. To the processor, memory isn't an array of bytes, but rather an array of blocks. In the case of the x86 processor a block is usually 4 bytes.

Given a simple piece of code like ‘(*i)++‘, the processor will find the 4-byte block in which the 4-byte integer i is located, increment it and store it back.

Well.. it does this when i fits exactly in one block. When i is partially in one block and the rest is in a second block (it is unaligned), the processor has to fetch both blocks, reconstruct the integer, increment it, split it, and store it back.

This is a lot of extra work: the processor spends several times the effort on a simple increment. In practice the difference is only about 5%, though, because requesting 2 blocks instead of 1 doesn't differ that much.

Because this means a lot of extra work for the processor, and even more transistors, some processor manufacturers decided to support unaligned data barely or not at all.

Summarized:

  • Some processors will just cope with unaligned data, although it takes a bit more time. (x86)
  • Others will throw a hardware exception and let the OS fix it up, which takes a hell of a lot more time.
  • A bunch will get unstable, and the process crashes, or even the whole processor does.
  • And some will silently ignore the unaligned address and use the nearest aligned one instead, which leads to silent corruption and hard-to-trace bugs.

Bliss of C

Maybe this will shock you; at least it shocked me. Luckily you don't need to worry a lot when you aren't writing specific low-level applications like memory managers or compilers. Your C compiler will take care of alignment for you.
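You can see the compiler at work by looking at the padding it inserts into a struct. A minimal sketch (the exact sizes and offsets depend on your platform and compiler):

#include <stdio.h>
#include <stddef.h>

/* the compiler inserts 3 padding bytes after 'flag' so that
   'value' starts on a 4-byte boundary */
struct example {
    char flag;   /* offset 0 */
    int  value;  /* offset 4, not 1 */
};

int main(void)
{
    printf("sizeof: %zu\n", sizeof(struct example));                  /* typically 8, not 5 */
    printf("offset of value: %zu\n", offsetof(struct example, value)); /* typically 4 */
    return 0;
}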

Gimmi more

Kaja wrote an article about memory alignment.
IBM has got an excellent article on it too.

The GC: Tracing, reversing pointer to object

The problem

While tracing down garbage during the garbage collection, as described in earlier posts, it is required to find and mark an object pointed to from another object.

This can lead to problems when pointers are allowed to point anywhere inside an object, or when the object information can't be found from a pointer to the start of the object.

In both cases the Gc should somehow translate a pointer (e.g. 0x1234) to the head of the object it points into (e.g. 0x1200). The problem here is how to efficiently get 0x1200 from 0x1234.

Avoiding the problem

This problem can be circumvented by disallowing pointers into the middle of an object and instead requiring all pointers to point to the start of the object. This only works when the header of the object has a fixed size; otherwise it is still impossible to work back to the start of the header, which contains the flags you want to alter when tracing garbage.

Pointers into a managed object aren't required and could easily be removed. They can be very useful in some situations, though. One of these is inlining a private referenced instance, of which it is known that it won't be modified, into the containing class, which results in pointers into the container class. It only increases performance in some situations, but it is useful to have, and optional.

A bigger issue are pointers that point to the start of an object with a header of dynamic size in front of it. Either the header shouldn't be of dynamic size but of a fixed (big) size, or pointers should point to the header of an object instead of to its start. In the latter case each pointer dereference requires an ‘add’ to increase the address to match the start of the object, plus the additional logic required to obtain the size of the header, which can be significant depending on the implementation. These extra opcodes for each pointer dereference can't easily be optimized away, even with a JIT, because of multi-threading, and they certainly can't be avoided when using C. The fixed header size isn't nice either, for it probably would end up being at least 7 bytes on a 64bit platform (flags + pointer).

Oh, and there is also the Java way, which uses a handle for each object. This basically means that an integer pointer isn't an int*, but an int** in Java, where the pointer in between contains the meta-data about the object. This is even less ideal, for it requires an extra dereference for each dereference and it results in twice the number of pointers to track.

As far as I'm concerned this problem can't be avoided when dealing with a Gc that is required to interop with non-managed code. With a JIT it could be done, but that requires a runtime performance/garbage collection performance trade-off. And I personally prefer a bit slower garbage collection over a slower runtime.

Look-up structure

Before tracing, the garbage collector usually walks through the heap to mark all objects white and put those in the root-set in the gray-list. During this first walk through the heap a temporary look-up structure could be built to allow efficient look-ups during the trace.

When only allowing pointers to the start of an object this is rather easy to do. A modulus-based bucket implementation, a hash table without the hash part, would do. Because it uses a modulus it gives a nice, even distribution of entries throughout the table.

When allowing pointers into the middle of an object it becomes more tricky, because you can't precompute the pointers to put in the table. You need a structure which allows you to look to the left and find the object the pointer points into. Although most pointers still point to the start of an object, the case that they point inside one can't be ignored.

Instead of modulus-based buckets, division-based buckets should be used. This creates a problem, for the distribution of blocks between the buckets is likely to vary heavily. One whole bucket could be empty because there is one big allocation in the memory it represents, while almost all small objects, which represent most of the pointers, could be concentrated in just two buckets.

Bucket-tree

To compensate for this a bucket-tree could be used. Basically, when a bucket contains a lot of elements it splits itself into a new set of buckets. This way a minimal number of buckets is wasted on really big allocations, and a lot of buckets can be allocated for small allocations. The look-up time would be a bit worse, but that is negligible compared to the huge look-up time a normal search through the heap would require.

Special treatment for pointers pointing inside objects

When there are only one or two pointers in the whole heap that point inside another object, they could be treated separately instead, and the rest could keep using the modulus-based buckets.

One way to give those pointers special treatment is marking them and resolving which object they point into during the initial walk through the heap, where finding the objects pointed to is a lot easier. (The pointers to be looked up could be sorted by address and be checked during the walk, which is very efficient.)

Another way would be to simulate pointers into another object by making the pointer consist of an object pointer and an offset from there, which could be cached by the JIT, or expanded before the garbage collection.

Yet another way would be to walk the modulus-based buckets by hand when the pointer isn’t found.

The best method would depend on the implementation of the rest of the Gc.

About modulus-based buckets

I'm still uncertain whether the distribution in modulus-based buckets can be trusted at all times. Falling back to a bucket-tree by default, for the worst case, would be nice, just to be sure.

Update: some more possibilities:

Binary search

When we store pointers to all allocated blocks, sorted, somewhere in memory, one could use binary search, or another search algorithm for sorted lists, to find the pointer. When adding shortcuts every ~64K this could be quite efficient. I've implemented this method for my Garbage Collector for now, for it is among the easiest to implement and I suspect the fastest. Actually, I hope it is the fastest.
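A sketch of such a look-up in C, assuming a sorted array of block start addresses (all names here are invented): binary search for the last block that starts at or before the pointer.

#include <stddef.h>
#include <stdint.h>

/* blocks[] holds the start address of every allocated block, sorted
   ascending; returns the start of the block the pointer falls in, or
   NULL. The caller still has to check ptr against the block's size. */
void *find_block(void **blocks, size_t nblocks, void *ptr)
{
    size_t lo = 0, hi = nblocks;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if ((uintptr_t)blocks[mid] <= (uintptr_t)ptr)
            lo = mid + 1;   /* block starts at or before ptr: look right */
        else
            hi = mid;       /* block starts after ptr: look left */
    }
    return lo > 0 ? blocks[lo - 1] : NULL;
}

The ~64K shortcuts mentioned above would simply narrow lo and hi before the loop starts.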

I'll do some tests with different algorithms on a dump of a ‘modal’ application's managed heap once I've finished the Gc sufficiently to make an application work with it.

The GC: Concurrency

The tracing Gc's (see previous posts) biggest disadvantage is the unpredictability of the garbage collection. It can happen at any time and the time it takes varies. Because normal execution is halted in the meantime, this can cause serious problems for applications which are required to have a low latency.

Incremental Gc

One trick to improve the latency of the gc is to make the garbage collection incremental. Instead of doing everything in one big garbage collection, tiny chunks of garbage collection are interleaved with normal execution.

Generations

When using a generations gc, a garbage collector which groups objects by their age, garbage collection can be done on each separate generation. Objects that have survived a few collections, and settle in older generations, tend to remain persistent (those are the objects that stay around for the duration of the program). Therefore the older generations don't need a lot of garbage collections, whereas the youngest generations typically contain a lot of garbage and require frequent garbage collections.

Granularity

Very many small garbage collections can perform worse than one big garbage collection, although very big, very infrequent garbage collections are really bad for latency. In the extreme there could even be 100MB of allocated memory for 1MB of used memory at a given moment, because garbage isn't collected frequently enough. Having a lot of garbage between live data also gets in the way of locality of reference (used memory close together fits in one load of the L2-cache, which is a lot faster than normal RAM).

Although a figure like 100MB sounds ridiculous, take for instance a program that reads through a text file and prints every line:

while there-is-a-new-line
   string line = get-line-from-file
   print line
end

Every line read goes on the heap as a string. After the line string goes out of scope it becomes unreferenced from the stack (part of the root-set) and therefore garbage. When reading a big file the heap fills up pretty fast. When doing a similar operation from memory, which is a lot faster than a file, it can get even worse.

When doing a lot of garbage collects, small pieces of memory are copied each time over short distances, which is less efficient than copying bigger pieces; but having to copy pieces over a very great distance, which doesn't fit in the L2-cache, really slows things down instead of speeding them up.

Trace-only garbage collections

When a lot of memory is allocated in a short time the garbage collector could get suspicious and fire away a full collection, which could be a waste of time when all allocated memory turns out to still be live. Moving memory, marking freespaces and updating pointers take most of the time of a whole garbage collection. Only tracing what is garbage and what isn't, and deciding afterwards what to do, could therefore provide a gain in performance.

Not only does tracing down garbage take less time, it is also easier to do while the rest of the program is running.

Internal data garbage collections

Most gc's keep internal data structures optimized for quick access and modification for things like tracking types, freeblocks, blocks, unmanaged pointers and others. An example is a list which contains pointers to unmanaged pointers, to be able to link into managed memory from unmanaged memory. This list is accessed a lot when in unmanaged code, which requires all pointers on its unmanaged stack to be registered on that list.

A chained-block stack would perform best. The whole stack consists of a few chained blocks, where the last block is pointed to from the stack structure, which itself points to the previous block, and so on all the way to the first block. Adding a pointer-pointer to the stack is as easy as incrementing the pointer count and putting it in the last block, possibly creating a new block when the current block is full.

Removing a pointer-pointer works best by searching from end to begin for the pointer and simply NULL-ing it, for the most frequent use of the stack-list is registering pointers from the unmanaged stack and deregistering them shortly afterwards. However, when a global unmanaged pointer is registered there could be a few blocks registered after it, where only the very last block contains that one single pointer-pointer to the global pointer, which won't be deregistered soon.

The algorithm could be changed to compact on every operation, or to track freespaces in the stack-list, but it can be done a lot more easily: simply collecting the garbage and compacting the stack-list works just as well.
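A rough sketch of such a chained-block stack, with invented names and error handling omitted:

#include <stdlib.h>

#define SLOTS_PER_BLOCK 64

struct ptr_block {
    void **slots[SLOTS_PER_BLOCK];   /* registered pointer-pointers */
    struct ptr_block *prev;          /* link to the previous block */
};

struct ptr_stack {
    struct ptr_block *last;          /* the block currently being filled */
    int used;                        /* slots used in the last block */
};

/* registering: bump the count, grow with a new block when full */
void ptr_stack_push(struct ptr_stack *s, void **pp)
{
    if (s->last == NULL || s->used == SLOTS_PER_BLOCK) {
        struct ptr_block *b = calloc(1, sizeof *b);
        b->prev = s->last;
        s->last = b;
        s->used = 0;
    }
    s->last->slots[s->used++] = pp;
}

/* deregistering: search from the end, where the most recently
   registered (stack-like) entries live, and simply NULL the slot */
void ptr_stack_remove(struct ptr_stack *s, void **pp)
{
    int i = s->used;
    for (struct ptr_block *b = s->last; b != NULL; b = b->prev) {
        while (i-- > 0)
            if (b->slots[i] == pp) { b->slots[i] = NULL; return; }
        i = SLOTS_PER_BLOCK;         /* previous blocks are full */
    }
}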

It would be tempting to perform these garbage collections on internal structures during the normal garbage collection, but this would hurt latency. Instead, scheduling them somewhere in between would be best.

Concurrency

When a garbage collection is in progress all other threads are usually stopped to prevent the program itself (the mutator) from changing memory and corrupting the managed heap. It is, however, possible to make most parts of the garbage collection concurrent with normal application execution.

Tracing

When, during the trace, a pointer is changed in an object which is already black (all pointers in the object have been checked), the object pointed to will remain white (no black object appears to point to it) and will be scheduled for garbage collection.

This condition is rare, but it can happen, and it can be fatal. Other conditions, like an allocation during the trace, can easily be solved by marking the object black when it is allocated.

Write barrier

An easy way to cope with these conditions is to detect any write of a pointer into an object that is already marked black, and to act upon it. This is called a write barrier.

The problem with a write barrier is that it is very hard to implement for languages like C, and it causes a performance hit, for each pointer write would require a few extra opcodes, which is in most cases unacceptable. When you are dealing with a JIT-ed language (Just In Time compiled), however, you can emit these extra opcodes to detect writes into black objects only while a Gc is in progress.
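Conceptually the emitted check could look like this C sketch; the helper names are made up, and a JIT would only generate the call to it while a trace is in progress:

/* hypothetical hooks into the collector */
extern int gc_trace_in_progress;
extern int gc_is_black(void *obj);
extern void gc_mark_gray(void *obj);

/* store 'value' into the pointer field 'field' of 'obj'; when obj was
   already scanned (black), re-gray it so its references get rechecked */
static inline void write_ref(void *obj, void **field, void *value)
{
    *field = value;
    if (gc_trace_in_progress && gc_is_black(obj))
        gc_mark_gray(obj);
}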

Compacting and Pointer change

After garbage has been traced down it is time for the gc to compact the memory by moving live objects into the freed garbage, and changing pointers to these objects accordingly. During the trace the gc would already have built a translation table for pointers (usually offset based, for efficiency), and it walks its way through the heap compacting and changing pointers in one go.

When the application is running at that time, it would encounter problems when dereferencing pointers.

Read barrier

To combat this a read barrier is required, which is the same as a write barrier, but for normal pointer dereferencing. This too requires opcodes to be added on each dereference, which is easy for a JIT and unacceptable for C. The added opcodes simply look at where the garbage collection is at that moment, and convert the pointer as if they were the Gc itself, using the Gc's translation table.

Concurrency outside collection

Making a Gc work concurrently outside a garbage collection is relatively easy, for no real collection is happening at that time and a few rw-locks will do the trick.

Conclusion

Concurrency in a Gc is a must for some applications, and generally a good idea anyway. Better latency can be gained from fine-grained incremental garbage collections. Real concurrency can be gained by using read and write barriers to allow execution during the garbage collection, which is only feasible when working with a JIT.

The GC: Tracing Garbage

Most Garbage Collectors are tracing. Instead of using Reference Counting, they track down unused objects (garbage) by tracing from the root set of objects and marking all objects found. Those that weren't marked have to be garbage.

The Root-set

The root-set consists of all objects that are definitely live. It typically contains the objects linked from the stack and registers of each thread. Objects vital to the Gc and globals are in the root-set too.

Tri-Color algorithm

An easy way to represent tracing through objects is the tri-color method. There are three colors an object can have: white, gray and black. White objects haven't yet been found to be referenced from black objects, which are certain to be non-garbage. The gray objects are those that are referenced from black objects but haven't had all of their references checked yet. Sounds confusing? A more practical explanation:

  1. Colour all objects white.
  2. Colour the root-set objects gray.
  3. (Repeat this as long as there are gray coloured objects.) Pick a gray object. Colour all objects referenced from that gray object gray too (except those that are black), and colour the object itself black.
  4. All black objects are now reachable, and all white objects are unreachable. Free those that are white!

Pop-mark-push referenced

An implementation of the Tri-Color algorithm can be accomplished by putting a mark on each object which can be either 1 or 0, 1 representing a black object and 0 a white object. A stack list can be used to keep track of the gray objects (a C sketch follows the list):

NB: do not confuse this stack, which is a normal Last In First Out list, with the call stack.

  1. Mark all objects 0 (white).
  2. Push all objects in the root-set on the gray-stack.
  3. (While there are objects in the gray-stack:) Pop an object from the gray-stack. Mark it 1 (black) and push all objects referenced from it that aren't marked 1 (thus white) on the gray-stack.
  4. All objects still marked 0 (white) can be freed.
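A minimal sketch of this loop in C, assuming a hypothetical object layout with a mark bit and an array of references (the gray-stack helpers are assumed to exist):

#include <stddef.h>

struct object {
    int mark;                /* 0 = white, 1 = black */
    size_t nrefs;
    struct object **refs;    /* references to other objects */
};

/* the gray-stack: a plain LIFO list, not the call stack */
extern void gray_push(struct object *o);
extern struct object *gray_pop(void);
extern int gray_empty(void);

void trace(struct object **roots, size_t nroots)
{
    size_t i;
    for (i = 0; i < nroots; i++) {   /* step 2: the root-set goes gray */
        roots[i]->mark = 1;
        gray_push(roots[i]);
    }
    while (!gray_empty()) {          /* step 3: pop, mark, push referenced */
        struct object *o = gray_pop();
        for (i = 0; i < o->nrefs; i++) {
            if (o->refs[i]->mark == 0) {
                o->refs[i]->mark = 1;   /* marking on push keeps an object
                                           from entering the stack twice */
                gray_push(o->refs[i]);
            }
        }
    }
    /* step 4, freeing the objects still marked 0, is the sweep */
}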

Generations

Studies have shown that in almost all applications most allocations have a lifetime of less than a few milliseconds, while the rest of the allocations have a far greater lifespan. Collecting every few milliseconds would cause unnecessary tracing through the long-living objects, which still take up most of the memory; collecting every few seconds would let the uncollected short-lived objects swamp the heap and make the garbage collector locality-of-reference unfriendly (everything worked on wouldn't fit in one L2-cache load and would therefore be a lot slower).

The older objects, which compared with the newer objects are highly unlikely to be garbage, can be searched less frequently. This can be done by separating the heap into generations. Each generation is basically a piece of the heap. A typical garbage collection only targets the youngest few generations. Every time an object survives a garbage collection it is moved into an older generation, until it finally becomes part of the oldest generation, which is only garbage collected very infrequently.

Unpredictable garbage collection duration

The duration of a garbage collection is hard to predict and can vary a lot. Although generation-based gc's can increase the number of garbage collections and decrease the total time compared with one big collection, applications that use a garbage collector cannot (practically) guarantee uninterrupted execution.

Advantages

  • No extra fields are required to keep track of the reference count.
  • No registration of references to objects is needed, because the gc traces them down itself.
  • Hasn't got any problem with circular references.
  • Outperforms a reference counting Gc. The registration of each reference in the reference counting Gc takes more time than tracing down that reference, and a tracing gc doesn't do unnecessary checking of older objects when generations are used.

Disadvantages

  • Way harder to implement.
  • All pointers inside an object should be known.
  • Worse latency than a reference counter. Although it outperforms the refcounter, a garbage collection usually requires the whole program to be halted, which is a really bad thing for some applications.

Conclusion

Tracing gc's have shown to easily outperform reference counting gc's. The prerequisite of knowing all pointers in an object is solved easily for interpreted and jit-ted languages using the gc, and can be managed even in C with some effort. The real problem remains the arbitrary and unpredictable length of garbage collections. There are ways to give tracing gc's better latency, which I'll discuss in coming posts.

The GC: As A Managed Memory Allocator

Most Garbage Collectors aren't limited to just tracing the garbage (unused objects) and scheduling it for deletion; they also manage memory as a whole, including allocating it.

Data segment

The kernel of an Operating System takes care of mapping physical memory (RAM/swap memory/IO memory/caches) to the virtual memory available to a process. A process is allocated a few segments of memory. The one that is important to memory allocators is the Data Segment.

The Data Segment is the segment which contains memory used for random access by the program. The size of the Data Segment can be changed using the sbrk or brk C function, which asks the kernel for more memory to be mapped for the process and then returns a pointer to the start of the newly allocated space (providing a negative value results in decreasing the data segment size).

Malloc

A non-Gc memory manager/allocator is malloc and its friends. Malloc uses sbrk to obtain memory for the Data Segment to allocate memory from, as do almost all other allocators. It prefixes and suffixes all blocks of free or allocated memory with size fields so it can walk through the whole heap. In each freeblock it puts a pointer to the next and previous freeblock, which makes freeblocks easy to insert and find.

Freeing a block is simply adding it to the freeblock chain and setting a bit to flag that it is a free block. Allocating a block is simply walking the freeblock chain until a freeblock is found that can hold the required size, adding new sizes and flags to the new freeblock, and updating the pointers of the previous and next blocks in the freeblock chain to match the new offset of the freeblock.
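As a sketch, the block layout and the first-fit walk could look like this; it is a simplification, and real mallocs differ in the details:

#include <stddef.h>

/* every block, free or allocated, starts with its size so the heap
   can be walked; the low bit of the size doubles as the free flag,
   and a copy of the size sits at the end of the block as a suffix */
struct block_header {
    size_t size;
};

/* a free block reuses its (unused) data area for the chain pointers */
struct free_block {
    struct block_header header;
    struct free_block *next;   /* next block in the freeblock chain */
    struct free_block *prev;   /* previous block in the chain */
};

/* first fit: walk the chain for the first block that is big enough */
struct free_block *find_fit(struct free_block *chain, size_t size)
{
    struct free_block *b;
    for (b = chain; b != NULL; b = b->next)
        if ((b->header.size & ~(size_t)1) >= size)
            return b;
    return NULL;
}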

The big problem with malloc is fragmentation. sbrk may have to be called to enlarge the heap to create a freeblock big enough for an allocation, even though there is enough free space for that allocation in total, scattered and not contiguous.

Modern implementations of malloc feature tricks to prefer perfect fits and automatically coalesce freeblocks into bigger blocks, all of which makes malloc a whole lot slower than it should be, with still quite a lot of memory fragmentation.

Gc managed pointers and malloc

Most Gc's implement their own memory allocator routines instead of using malloc, usually even overriding the old malloc. This is because Gc's usually know more about the requirements of the application than malloc does.

It gets even better when the Gc knows all pointers to the managed heap. In this case a Gc can decide to move an object, update the pointers that link to it, and save a lot of space lost to fragmentation. This is easily implemented when dealing with an interpreted language which uses a Gc, for it knows all types at runtime. It can even be done in C, but that requires all pointers, both inside and outside the managed heap, that point into the managed heap to be registered.

The difficult part is finding all pointers pointing to a certain object. This usually requires the whole heap to be walked, once or twice. Waiting until a reasonable number of objects are subject to a move, and then doing one big move of all those objects (together with other operations requiring heap walks anyway) as one big “garbage collection” is way more efficient than doing it all ‘live’. In the first pass each object subject to a move is tagged with its new location, and in the second pass all pointers are checked and adjusted accordingly.
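A sketch of those two passes, with every helper name invented:

#include <stddef.h>

struct obj_header {
    size_t size;
    void *forward;   /* new location, or NULL when the object stays put */
};

/* assumed to exist elsewhere in the Gc */
extern struct obj_header *heap_first(void);
extern struct obj_header *heap_next(struct obj_header *o);
extern int should_move(struct obj_header *o);
extern void *pick_destination(struct obj_header *o);
extern void **first_ref(struct obj_header *o);
extern void **next_ref(struct obj_header *o, void **field);
extern struct obj_header *header_of(void *obj);

/* pass 1: tag each object that will move with its new address */
void plan_moves(void)
{
    struct obj_header *o;
    for (o = heap_first(); o != NULL; o = heap_next(o))
        o->forward = should_move(o) ? pick_destination(o) : NULL;
}

/* pass 2: adjust every pointer whose target carries a forwarding
   address; after this the objects themselves can be moved */
void fix_pointers(void)
{
    struct obj_header *o;
    void **f;
    for (o = heap_first(); o != NULL; o = heap_next(o))
        for (f = first_ref(o); f != NULL; f = next_ref(o, f))
            if (header_of(*f)->forward != NULL)
                *f = header_of(*f)->forward;
}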

Unmanaged code interop

The big problem of being able to move managed memory is keeping track of all pointers into the managed memory. If you wanted to store a file handle in managed memory you would have a hard time when the internals of file access keep pointers to your file handle.

Fixed allocations, which remain stationary, would solve the issue, but this can, sadly enough, result in fragmentation of your heap. When all objects can be moved, moving is very easy; but when there are fixed objects somewhere in the heap, complicated ‘fitting’ algorithms must be used, which decrease performance.

Fixing memory temporarily, or preventing a GC temporarily, is a much cleaner solution to the problem. In some cases this just isn't possible, and big blocks with their own malloc-style allocator could be used to handle unmanaged memory so that it won't interfere too much with the normal managed memory. On most *nix systems mmap can be used to map additional usable memory into the virtual memory space of the process, which is slower than sbrk and creates some extra overhead, but it avoids fragmentation of the managed heap due to fixed blocks.

When

It is only useful for a Gc to implement its own malloc when it can move its objects, or when there is a clear advantage to implementing the malloc close to the Gc, for instance to more easily walk through all objects in the heap without having to implement a walk algorithm for each malloc used across different platforms.

The GC: Reference Counter

The simplest form of Garbage Collector is the Reference Counter Garbage Collector.

Each time an object is referenced (a pointer to it is made) the reference count on the object is incremented. When the pointer isn't used anymore the reference count is decreased. When the reference count hits 0 the object is freed.

Python uses a Reference Counter.
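In C the bookkeeping boils down to something like this sketch:

#include <stdlib.h>

/* every managed object is prefixed with a count */
struct rc_object {
    int refcount;
    /* object data follows */
};

void add_ref(struct rc_object *o)
{
    o->refcount++;
}

void free_ref(struct rc_object *o)
{
    if (--o->refcount == 0)
        free(o);   /* nothing references it anymore */
}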

Advantages

  • A reference counter does its tracking at runtime. There is no big garbage collection, and the execution of the program won't be interrupted for a while when the garbage is collected.
  • It’s very easy to implement.

Disadvantages

  • Instead of the malloc/free pair required for each allocation without a Gc, it now requires an add_ref/free_ref pair for each pointer/reference to an allocation. When someone forgets to release a reference the object persists.
  • Circular references will never be collected. An example of a circular reference is a list that contains a reference to itself as an item. When the list is released its count remains 1, because inside the list it refers to itself, and it will therefore never be collected.
  • A RefCounter GC doesn't keep track of pointers into the managed heap, and therefore can't move any object, because that would break pointers. This makes the fragmentation of a RefCounter GC as bad as malloc's fragmentation.

Circular references

Circular references can be fixed by adding a trace algorithm. This algorithm traces through all objects from the root objects (pointers on the stack, etc.) and marks them. The objects that aren't marked aren't reachable and are therefore subject to Garbage Collection.

The problem is that a RefCounter Gc usually doesn't keep track of which pointers are in objects, and that there can be pointers outside the managed heap which only use add_ref and free_ref. Python struggled with this problem, and it implemented a trace algorithm that depends on the programmers of non-Python modules (whose pointers aren't known) to expose the references of objects that contain other objects. By traversing links that way circular dependencies can be found without hurting unmanaged pointers, but this has a significant performance hit on Python, and it is still advised to avoid circular references.

Conclusion

RefCounter Gc's are easy to implement, and a good idea for scripting languages which know all references, so that circular references can be dealt with accordingly. But they still have a lot of disadvantages which other kinds of Gc's solve a lot better, and more elegantly.

The GC: Introduction

I've been working on Garbage Collectors recently, and I'll share some of my experiences with them with you on my blog.

First off, what is a Garbage Collector?

A Garbage Collector, a.k.a. GC, is an algorithm that keeps track of your objects in memory and gets rid of objects that aren't used (referenced) anymore, a.k.a. collecting the garbage. This can only be accomplished when the GC knows all pointers to the managed objects (objects on the managed part of the heap), and GC's are therefore almost always used with an interpreted/jit-ted language, like Java, Python or C#.

Since the first GC's (reference counters) they have gained a lot more responsibility and perform a lot more tasks; collecting the garbage is just one of the things they do.

I'll discuss the various kinds and aspects of Garbage Collectors, with their advantages and disadvantages, in coming blog posts.

You can read all articles about the gc in its category.

Using template variables in extension tags with Mediawiki

MediaWiki, the wiki used by wikipedia, is a powerful wiki.

It allows you to include self-made templates to which you can pass variables. A very useful template on wikipedia is the stub template. When you add {{stub}} at the start of an article it inserts text explaining to the reader that the article isn't finished yet, but rather a stub. With {{Album_infobox|Name=Sehnsucht|Cover=Sehnsucht_cover.jpg|…}} an infobox is inserted, used at wikipedia for all albums.

When the built-in features don't suffice you don't necessarily have to hack MediaWiki itself to add features. You can add an extension which hooks into a part of the parser to create your own logic for your own custom tags.

I created a wiki to store the lyrics which I already had on a forum, and found especially templates useful.

However, there was a slight annoyance when working with templates. The problem is that in several cases there is a link in a template which can be ambiguous (there are two songs with the same name), but you don't want an ugly suffix after the name (like “Deliverance (Album of Opeth)” instead of “Deliverance”). To solve this you have to pass both the name of the album and the name of the page of the album, but this is annoying to do every time.

MediaWiki knows no logic, so I thought to solve this by adding some basic logic in a custom tag, which uses the same text as the page name when no custom text is specified and otherwise uses the custom text. But that just wouldn't work.

MediaWiki, when parsing a page, first escapes all nowiki elements, extensions, comments, links and so on with a unique ID. Then it parses the markup and replaces variables with their values. After that it replaces the unique IDs back with the proper content, which in this case included the output of my extension. This was flawed, because the variables weren't replaced before the extension tag was escaped out, and also because the wiki-link ([[pagename]]) wasn't replaced, for the extension was still escaped at that point. So I ended up with [[{{{album}}}]] on my page instead of a Sehnsucht link.

Trying to fetch template variables inside the extension wouldn't work either, because the extension is evaluated at the moment it is already included in the main page. Avoiding this by disabling the cache would require parsing the whole template each time it is viewed, which is very expensive (each link on a page in a wiki has to be looked up when parsed to see whether it exists).

So I googled around a bit and found that Mikesch Nepomuk submitted a patch on one of the mailing lists to fix this by escaping extensions separately, after the variables were expanded. Apparently it hadn't been approved, or at least it wasn't included in my release yet, so I tried to apply it, but it failed; the patch was a bit too old. So I applied it by hand instead of with GNU patch, and with some tinkering I got it to work.

You can download the patch for the 1.5 beta4 here, if you are interested.

And I thought I would never touch PHP again.

Dual Cores, Traffic and Gaming

Dual cores are the new thing in the processor business. They claim to be perfect for people who want to do different things at the same time, but in reality they are just a cheap solution to deliver more.

The truth is that two 500 MHz processors just don't deliver the calculation power of one 1000 MHz processor, given of course that they are of the same architecture. This is due to the fact that a lot of additional logic is required to let two processors work together without getting in each other's way. When two processors both want to write a different thing to the same spot of memory, you would get unpredictable results. There are all kinds of ways to solve this, for instance by locking a region of memory for an amount of time, making the other processor wait for the first one to finish. But this all creates a lot of overhead and complexity.

You could compare it with driving a car. Driving a car is relatively easy: you just have to steer around some static obstacles, no big deal. When you know the way you could even do it with your eyes closed. That is, unless there are other people driving cars. When driving a car you are keeping an eye not just on the road, but on the other people on the road. This not only slows down your maximum speed and decreases efficiency – you can't just drive full speed over a junction – but it also increases the complexity and the likelihood of errors.

The same goes for dual core processors (or even hyper-threaded and normal multi-processor platforms). The comparison isn't entirely valid, because having two processors doesn't mean that you have to do two things at the same time. The issue is, though, that what you normally do would only be done by one processor, so you are wasting the other's capability.

A good example of this are games. Games tend to be single threaded, which gives the best performance, for no processor time is wasted on multithreading, and it is the easiest thing to do, for multithreading is rather complicated. Complicated enough that there have been a few lengthy discussions on the mono mailing lists about how to lock a simple object.

Because we are getting dual core, and probably ‘more’-core, processors later on, for the companies are too lazy to make decent processors [1], games should become multithreaded to exploit the full capability of the machine they run on.

Although it makes creating performance applications a lot more difficult, it will surely benefit distributed computing, for the change from different processors to different machines is smaller than from 1 processor to more.

[1] Natively threaded CPUs like dualcore/multi-processor/hyper-threaded processors are ideal for server applications, where multiple short-living requests have to be resolved. Switching and allocating on a software-threaded processor would create too much overhead for such a simple request.

Practically Reversing CRC

The algorithm described in my previous post, on how to compute a patch for a message so that it computes to a wanted crc, has a useful application when you want to find possible texts that compute to your own provided crc.

A little util I wrote, reversecrc (http://w-nz.com/projects/reversecrc/), lists the (endless) list of possible original texts. It has been useful for the Freelancer modding community, for in that game Crc is used a lot and was an obstacle.

Although it is almost impossible to recover a specific 20 character string, because it would be insane to check all the texts that could result in the hash, using one of the provided texts will do too, as it results in the same hash anyway.

Subversion repository: https://cvs.codeyard.net/svn/icrc

Reversing CRC

Cyclic Redundancy Code

CRC is a hash which is frequently used as a checksum for data, for instance in archives. Who hasn't had bad CRC errors when opening corrupted zips? CRC is a very old algorithm, and over time it changed a lot from the original idea.

The original idea behind CRC was representing the data that you want the hash of as a big number and dividing it by a certain number called the polynomial, taking the remainder of the division as the hash. For instance: 23 % 3 = 2 (% is the modulus operator, which gives the remainder of a division).

The initial problem was that dividing is a rather intensive operation. They wanted to simplify CRC to make it easier to implement in hardware and make it faster. They did this by getting rid of the carry used in the subtraction of the division:

Normal subtraction (binary): 10101 - 01100 = 01001
Without carry: 10101 - 01100 = 11001

Subtraction without a carry is basically an eXclusive bitwise OR (XOR), which returns 1 only when one of the operands is 1 and the other 0.

The resulting algorithm was faster but still worked bit by bit, which isn't really what a computer likes. A computer works best with one to four bytes. To make CRC faster they cached 8 operations at a time by precomputing the result for every possible starting byte and putting it in a table, called a XOR table.

The required code for the CRC calculation itself now became very simple:

hash = (hash >> 8) ^ table[data ^ (0xff & hash)]
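Put together, a table-driven CRC32 might look like this C sketch, using the common reflected polynomial 0xEDB88320:

#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];

/* precompute the XOR table: run the bit-by-bit algorithm eight
   times for every possible starting byte */
static void crc32_init(void)
{
    uint32_t i, r;
    int bit;
    for (i = 0; i < 256; i++) {
        r = i;
        for (bit = 0; bit < 8; bit++)
            r = (r & 1) ? (r >> 1) ^ 0xEDB88320u : r >> 1;
        crc_table[i] = r;
    }
}

/* a byte at a time, exactly the step shown above */
static uint32_t crc32_update(uint32_t hash, const uint8_t *data, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        hash = (hash >> 8) ^ crc_table[0xff & (hash ^ data[i])];
    return hash;
}

/* the conventional presentation starts at 0xFFFFFFFF and inverts
   the result at the end */
uint32_t crc32(const uint8_t *data, size_t len)
{
    return crc32_update(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}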

They changed the CRC algorithm once more by making it reflected. This means that the input data is reversed bitwise: 011101011 <-> 110101110. This was done because most of the hardware chips at the time reversed data bitwise. Because it was too much work to reflect each byte of incoming data, they changed the algorithm that generates the Crc table to create a table which has the effect of reflecting the data.

This is, by the way, not totally correct; the result was still different for a reflected than for a non-reflected algorithm, for they wouldn't cache the whole piece of data to reverse it, but did it per byte at calculation time.

At this moment CRC barely resembles the original idea of a modulus.

Reversing CRC

First off, credits for the original idea of CRC patching go to anarchriz.

CRC is cryptographically a very weak algorithm. It can be easily reversed, for it has the property that with 4 bytes appended to the current computed hash you can get every required hash. You can change the whole message and add 4 patch bytes to patch the hash to the one you like.

The ability to patch a CRC also makes it possible to very efficiently generate all possible source data for a checksum. Although it still is a brute-force method, you get 4 bytes for free, and patching is faster than calculating.

Patching is basically walking back the way CRC works. Crc basically takes the hash, moves it 1 byte to the right (dropping one byte) and XORs it with a table entry. The nice thing about normal CRC is that the first byte of a table entry is unique to that entry.

Because the first byte of an entry is unique to that entry, and it ends up in the hash XORed with 0 (for that is what is shifted in from the right), you can work back the whole entry that was used.

For instance:

My hash is 0x12345678; this means that it was XORed with the entry in the CRC table that starts with 0x12. When you XOR the hash with that full entry, the next byte too is left XORed with only the start of a table entry.

When reversing the current hash you know what will be XORed onto the patch you'll give. XORing this with your wanted hash is enough.

The resulting algorithm is surprisingly simple:

– Put the current hash, byte-wise reversed, at the start of a buffer. Put the wanted hash, byte-wise reversed, directly after it in the same buffer.
– Look up the entry in the table that starts with byte 7 of the buffer. XOR this value into the buffer at position 4, and XOR the entry number into position 3. Repeat this 4 times, with all positions decremented by one each time (thus looking up bytes 7,6,5,4, XORing the values at positions 4,3,2,1 and the entry numbers at positions 3,2,1,0).

When you’ve done this the required patch bytes are the first 4 bytes in the buffer.
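A C sketch of this buffer walk, reusing crc_table from the earlier listing. Note that cur and want here are internal register values: with the conventional CRC32 presentation (initial value 0xFFFFFFFF, final inversion) you pass the bitwise inverses of the reported hashes.

#include <stdint.h>

/* compute the 4 bytes to append so the register goes from cur to want */
void crc32_patch(uint32_t cur, uint32_t want, uint8_t patch[4])
{
    uint8_t buf[8];
    int pos, i, t;
    for (i = 0; i < 4; i++) {
        buf[i]     = (uint8_t)(cur  >> (8 * i));  /* current hash, byte-wise reversed */
        buf[i + 4] = (uint8_t)(want >> (8 * i));  /* wanted hash behind it */
    }
    for (pos = 7; pos >= 4; pos--) {
        t = 0;     /* find the entry whose first (top) byte matches;
                      unique for plain CRC32 */
        while ((uint8_t)(crc_table[t] >> 24) != buf[pos])
            t++;
        for (i = 0; i < 4; i++)                   /* XOR the value in, 3 lower */
            buf[pos - 3 + i] ^= (uint8_t)(crc_table[t] >> (8 * i));
        buf[pos - 4] ^= (uint8_t)t;               /* XOR the entry number in */
    }
    for (i = 0; i < 4; i++)                       /* the patch bytes */
        patch[i] = buf[i];
}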

Some Crc variants tend to have duplicates in the crc-table, which means there can be more than one original table entry. You should just branch and try all of them.

I’ve made a simple python script to work with crc32 and to patch a hash.

You can download it Here. And there is a C implementation here.

Update Fixed a bug in the example.

Update Andrea Pannitti wrote an implementation in java.

Update I found a great article by Stigge, Plötz, Müller and Redlich about reversing CRC. They pretty much nailed it by bringing it down to elementary algebra. They explain an algorithm which is able to patch data at any given point to adjust the CRC to any desired value.

rel=”nofollow”

Quite some time ago google started honouring the rel="nofollow" attribute-value pair in html a-tags, which should prevent spammers from gaining a high page rank by spamming blogs with comments.

It is useless.

Spamming costs almost nothing; if there is the slightest amount of gain in it for the spammers, they will keep doing it. Of all those hundreds of thousands of people visiting blogs and seeing comment spam, a few will still follow the link; those few are enough for the spammers.

One could argue that the websites the comment spam points to now don't get a really high pagerank anymore. This is only partially true, for only a select number of people install the code to add the rel="nofollow" attribute to links that can be spammed. Even though the highest ranking blogs have installed it, the tons of small blogs are still enough to raise the page rank enormously.

In my opinion search engines should just ignore links which are considered spam.

Function Recursion Overhead

In courses I followed at university and in most books about programming, recursion is praised as a good way to solve some problems.

An example of recursion (not-existing-api used for the sake of simplicity):

int GetFolderSize(Folder folder)
{
 int sum = 0;
 foreach(File file in folder.files)
 {
  sum += file.Size;
 }
 foreach(Folder subFolder in folder.folders)
 {
  sum += GetFolderSize(subFolder);
 }
 return sum;
}

This example function calculates the combined size of all files in a folder and those inside any subfolder.

It does its job and it is very clear how it works, but it is inefficient. Take another way to write the algorithm:

int GetFolderSize(Folder folder)
{
 Stack<Folder> stack = new Stack<Folder>();
 int sum = 0;
 stack.Push(folder);
 while(stack.Count > 0)
 {
  Folder current = stack.Pop(); // 'folder' would shadow the parameter
  foreach(File file in current.files)
  {
   sum += file.Size;
  }
  foreach(Folder subFolder in current.folders)
  {
   stack.Push(subFolder);
  }
 }
 return sum;
}

This version is harder to understand. Basically it maintains a stack of folders for which the total size of the files hasn't yet been calculated. While going through each of the folders on the stack it adds the sub-folders it finds and removes the ones processed.

The latter method is way more efficient.

For each function call in the first (recursive) version of the algorithm a new sum instance, a subFolder instance and an enumerator instance over folder.folders are required, and these stack up. When 5 levels deep in the first algorithm, you already use more than double the amount of memory ever required by the second algorithm.

Additionally, a function call itself requires memory of its own, which, depending on the language used, can be quite significant. Debugging also gets really hard when you have a stack trace of hundreds of functions.

Using your own stack instead of polluting the call stack (the thing where function calls are kept) sounds great. There is only one little problem: it can get pretty complex.

Take for instance a program that puts the files of folders and their subfolders in a file in that same folder, for instance a playlist generator. This requires the algorithm to detect when all child items of a certain item have been processed. The problem is that the parent item is already gone from the stack when the child item is processed. Solving it requires some additional fields and a few function pointers, and it can get a mess.

The best way to get it to work is to mimic what a normal recursive method would have done, which involves a lot of extra helper classes, which all depend on how the original recursive algorithm worked. In a language with a Garbage Collector (which automatically frees unused classes) it is manageable, but in a language without one, like C(++), trouble doubles, for you also need to monitor the lifetime of the helper classes.

I've noticed that it is very tempting to use recursion and that there are almost no occasions where something like your own stack is used, especially in the complex cases. A shame, it is challenging :-P.

Shiftings in Architectures

Cell, the hype

Sony is working on a new kind of processor which they call Cell. A single Cell processor is said to have 1/75th of the power of the Blue Gene/L supercomputer in teraflops, and that for 400 dollars.

The new Cell processor is hyped, a lot. People are shouting that the performance of the PlayStation 3, which will feature a Cell processor, will blow away that of the competitors: the XBox and the Revolution, both using PowerPC processors that are claimed to be slow.

Teraflops

So what are these teraflops? Why aren't they talking about hertz anymore?

A TeraFLOPS stands for 1,000,000,000,000 Floating point Operations Per Second. (TeraFLOPS on wikipedia).

A normal computer has a few GigaFLOPS (one TeraFLOPS is 1000 GigaFLOPS).

The Cell processor in the PS3 will have 1.8 TeraFLOPS. It is very easy to believe that the PS3 will therefore inherently be about 500 times faster than a normal computer. But that is false.

The performance of a computer isn't all about how fast it can calculate with floating point numbers; it depends on a lot more than that. Simply doubling memory performance (not specifically memory bandwidth, but memory performance as a whole) would have a bigger impact than doubling the amount of TeraFLOPS, even for applications that rely heavily on floating point operations.

So why have a lot of TeraFLOPS then? It is quite simple: the new platforms haven't got separate GPUs (Graphical Processing Units); their task is now integrated into the CPU. And what was their task? Calculating, a lot, with floating points.
The amount of TeraFLOPS on the new platforms is the sum of the TeraFLOPS of both the GPU and the CPU. A GPU needs to calculate a lot with numbers for 3d rendering.

Your normal computer already has a lot more FLOPS in its GPU than the few GigaFLOPS in its CPU. Although the leap to over one TeraFLOPS is certainly impressive.

Marketing

Supercomputers need to do a lot of FLOPS too; they calculate models that involve a lot of floating point math. So why put the Cell processor in the Ps3 instead of directly into some supercomputer? Simple: marketing. Almost everyone that is superficially following the computer news is telling me over and over again that the Cell processor is the most incredible thing there is. It is hyped.

Linux on Ps3

Also, the Cell processor uses a whole new architecture which still has to be tested a lot to mature. Sony will let you install Linux on your Ps3? Why? Simple: because they want the Linux community to add decent support for the Cell architecture.

New Architectures & CLI

I guess it is a matter of time before more new architectures come. A problem with a lot of new architectures is that there isn't much support for them yet. The solution? CLIs like Java and .Net could provide it.

Microsoft is rumoured to have MSIL running (semi-)natively on their XBox360, and making Longhorn rely more and more on .Net (the on-top applications, not the kernel, of course) means a lot less architecture dependency.

I wonder what will happen, but one thing is sure.. things are shifting.

Python Url Encoding

I looked in the Python Library Reference for a function to encode special characters in strings so they can be put in Urls (and Uri's), but found none.

Maybe I missed it?

Anyways, here is an implementation I wrote based on this article:

HexCharacters = "0123456789abcdef"

def UrlEncode(s):
    r = ''
    for c in s:
        o = ord(c)
        # keep digits (48-57), lowercase (97-122), uppercase (65-90)
        # and the punctuation $ - _ . + ! * ' ( ) ,
        if (o >= 48 and o <= 57) or \
            (o >= 97 and o <= 122) or \
            (o >= 65 and o <= 90) or \
            o == 36 or o == 45 or o == 95 or \
            o == 46 or o == 43 or o == 33 or \
            o == 42 or o == 39 or o == 40 or \
            o == 41 or o == 44:
            r += c
        else:
            # everything else gets percent-escaped
            r += '%' + CleanCharHex(c)
    return r

def CleanCharHex(c):
    o = ord(c)
    r = HexCharacters[o / 16]   # high nibble (Python 2 integer division)
    r += HexCharacters[o % 16]  # low nibble
    return r

Note: I used the character numbers instead of the characters themselves to compare with, so I could use greater-than and less-than for the alphanumeric ranges. Maybe testing with a bitmask would be more efficient.

Almost every time I work with Python I have to write my own something-to-hex function, because the built-in hex adds the ‘0x’ in front.

Either I can't search, or Python hasn't got enough batteries included. I guess the first; if not, I'll submit my batteries.

BSD

BSD is Unix.. said to be the professional cousin of Linux.

This piece of propaganda for BSD against Linux gives as a reason that Linux is bad because it is said to be just hacked together: the people who develop Linux are just people from the community who put a little bit of time into it to get a feature (which they probably want themselves) added, and don't really care about making it perfect.

They took some of the sarcastic ‘todo’ comments in the kernel as an example, claiming that if that stuff is in the kernel, Linux can't be trusted at all.

But why hasn't BSD got the widespread hardware support Linux has? They blame the big companies, like IBM, for instance. I just wonder whether their 60 dedicated BSD programmers could code all the hardware drivers that the thousands of contributors to Linux have coded, even if those were as bad as they claim.

I bet that *BSD has got more than enough of those comments in its code too. If it hasn't, they are just hiding the truth and hiding points of improvement, for these TODOs and FIXMEs get fixed in the end. And even if they got rid of all the TODOs and FIXMEs before releasing any of it, they would waste a lot of time in which the code could have been used already (less efficiently, though.. but it usually doesn't make such a big difference).

Services, a different approach

A Server is a computer (or a group of them) that offers services.

Such services could be a webmarket, a website, a game server and so on.

These services all tend to be implemented quite differently, hardly interacting at all. All these services require a different approach, a different way to manage them. This creates both overhead and chaos.

I'm working on a small framework intended to host services, which is very flexible.

Framework
A service isn't a single process as it usually is, although it could be put in its own separate process. A service is rather just an instance of its service type somewhere in the framework, meaning you can have multiple instances of one service, with different configurations, scattered across processes and even machines as you like.

You can edit the settings and manage the services using the same API (and likely the same configuration tool).

A service isn't aware of where it is hosted; from its perspective it is just in a framework with other services and service modules, with which it can always interact in the same way.

You as administrator, however, are aware of the services and their allocation. You can allocate a service (even at runtime) to different locations, which don't necessarily have to be processes, but can also be lightweight processes as found in .net, like appdomains, or processes on different machines.

Having a different process for each service decreases performance but increases stability, and vice versa.

Interaction
Services shouldn’t be restricted to their own process but should also be able to interact with other services by using their exposed interfaces.

An example could be a webserver service which you provide with the name of the logger service you want used for that webserver. The webserver could just interact by casting an object in the public interface dictionary of the logger to a log interface. The framework takes care of the required underlying communication, like remoting, in case they are hosted in different processes or even on different machines.

A tighter type of interaction would be modules. A module of a webserver could be a webmarket. The webmarket would just access the object of the webserver which handles hooking onto a certain URI, add its hooks and await a visit. Other services could do the same, but a module is specifically bound to the service it is attached to.

Furthermore, modules would be an easier way for administrators to manage extra functionality which is more module-like, instead of filling in the instance paths to use in the settings of a certain service.

Portability
I'm still in doubt whether I should make it cross-language. It would be a good thing, but it would certainly create havoc to make everything run smoothly and alike in all languages, not to mention the overhead of loading each language's runtime.
I`m still in doubt whether I should make it cross-language. It would be a good thing but it would certainly create havoc to make everything run smooth and alike on all languages. Not even to mention the overhead to load each languages’ runtime.