The Cuckoo Hashtable

A Hashtable algorithm is a specific algorithm to implement key-value pair datastructure with efficient by-key look-ups using hashing of the keys. A hashtable contains a list of buckets. In a simple implementation, the i-th bucket, contains the key-value pairs in a list of which the key has a hash that is i modulo the amount of buckets. The hash-table would increase the amount of buckets if any bucket contains more than a fixed amount of pairs. This results in a constant-time look-up, but an insertion might invoke a expensive rebuild.

There are a few variations on Hash Tables. I’d like to share a really smart one: the Cuckoo Hashtable.

The Cuckoo hashtable expects two different hash-functions for the keys. Instead of storing a list of pairs in each bucket, the Cuckoo Hash table stores a single pair in each bucket. When inserting a pair, it is inserted in one of the two possible locations. If it happens to be occupied, the old pair is replaced. Then the replaced pair is inserted at its other possible location, potentially kicking out another pair. This step is repeated, until there is no displaced pair or a loop is detected. When a loop is detected, the hash-functions can be changed, if the buckets are for the most part empty -or- when the table is almost full, the amount of buckets can be increased. It can be shown that an insertion has a amortized constant time. (The buckets can be resized in-place, and each entry is repositioned as if it was displaced by another.)

Don’t rely on parsing

Most applications store their settings (espacially in *nix) in text configuration files. These text files need to be parsed every time the application starts.

Parsing is the action of (usualy streaming) dividing a certain piece of data into understandable parts. This usualy comes down to looking at the text file character by character and deciding what should be done. Usualy this is done by maintaining a state which contains the data collected sofar. And with a bit more complicated files this even means having a stack of states and complicated actions when a certain state is left.

The problems:

  • Parsing is slow, very slow
  • Formats required to be parsed contain overhead, a lot of overhead

But it certainly has got advantages:

  • Humans can easily edit it, you don’t need to rely on configuration tools
  • (Usualy) makes a configuration format more extensible by nature (adding one new field in the average programmer’s binary format would break it)

Now, there are attempts to help improve speed. This by standardizing the format, which makes the amount of oddities to expect less, which ultimately makes the whole parsing slightly faster. This at cost of the easiness it can be edited.

A good example would be Xml. Xml is damned ugly. Xml is too strict. And Xml still takes a hell of a lot of time to parse.

Yaml looked like a decent alternative: easy to edit, looks nice. But then I encountered this:

%YAML 1.1
---
!!map {
? !!str "sequence"
: !!seq [
!!str "one", !!str "two"
],
? !!str "mapping"
: !!map {
? !!str "sky" : !!str "blue",
? !!str "sea" : !!str "green",
}
}

Ugly…

So what to use instead?

Use binary configuration files, which are easy to load and save for the application. And create a parser to parse the configuration file and save it to the binary format! In other words: serialize the usefull data from the parsed document and only parse again when it is required.

When you only parse stuff when it has changed by the user than it doesn’t really matter how long it takes to parse. Which can get rid of the really ugly stuff and let us just have a very loose kind of format without the ugly rules and regulations.