The Cuckoo Hashtable

A hash table is a data structure that implements key-value pairs with efficient by-key look-ups using hashing of the keys. A hash table contains a list of buckets. In a simple implementation, the i-th bucket contains a list of the key-value pairs whose keys have a hash that is i modulo the number of buckets. The hash table would increase the number of buckets if any bucket contains more than a fixed number of pairs. This results in constant-time look-ups, but an insertion might trigger an expensive rebuild.

There are a few variations on hash tables. I’d like to share a really smart one: the Cuckoo Hashtable.

The Cuckoo hashtable uses two different hash functions for the keys. Instead of storing a list of pairs in each bucket, the Cuckoo hashtable stores a single pair in each bucket. When inserting a pair, it is inserted in one of its two possible locations. If that location happens to be occupied, the old pair is kicked out. The displaced pair is then inserted at its other possible location, potentially kicking out yet another pair. This step is repeated until there is no displaced pair left, or a loop is detected. When a loop is detected, the hash functions can be changed if the buckets are for the most part empty, or, when the table is almost full, the number of buckets can be increased. It can be shown that an insertion takes amortized constant time. (The buckets can be resized in place, and each entry is repositioned as if it were displaced by another.)
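To make the dance concrete, here’s a toy Python sketch. The choice of hash functions and the loop detection are simplified assumptions; a real implementation would pick fresh random hash functions on each rebuild.

class CuckooHash:
    """Toy cuckoo hashtable: a single (key, value) pair per bucket."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.buckets = [None] * capacity

    def _h1(self, key):
        return hash(key) % self.capacity

    def _h2(self, key):
        # A second, independent-ish hash function (illustrative choice).
        return hash((0x9E3779B9, key)) % self.capacity

    def get(self, key):
        for i in (self._h1(key), self._h2(key)):
            if self.buckets[i] and self.buckets[i][0] == key:
                return self.buckets[i][1]
        raise KeyError(key)

    def insert(self, key, value):
        for i in (self._h1(key), self._h2(key)):
            if self.buckets[i] and self.buckets[i][0] == key:
                self.buckets[i] = (key, value)  # key already present: overwrite
                return
        pair = (key, value)
        i = self._h1(key)
        for _ in range(self.capacity):  # displacement bound doubles as loop detection
            if self.buckets[i] is None:
                self.buckets[i] = pair
                return
            pair, self.buckets[i] = self.buckets[i], pair  # kick out the old pair
            h1, h2 = self._h1(pair[0]), self._h2(pair[0])
            i = h2 if i == h1 else h1  # try the displaced pair's other location
        self._grow()  # probable loop: rebuild with more buckets
        self.insert(*pair)

    def _grow(self):
        pairs = [p for p in self.buckets if p]
        self.capacity *= 2
        self.buckets = [None] * self.capacity
        for k, v in pairs:
            self.insert(k, v)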

If I owned a botnet…

…and didn’t want to lose it if my control servers got shut down, I’d let every orphaned zombie randomly connect to hosts in a given IP range, and challenge them to give a preimage of a hardcoded hash. [Detail: add a salt to prevent replay attacks.] With a sufficiently safe, or rather sufficiently large, range, it also might be helpful to allow zombies to forward still-orphaned zombies.

iPhone WiFi weirdness

I’m having trouble with the wireless on my 3G iPhone. Either an action works perfectly (for instance downloading e-mail or visiting a website), or it will just time out. I just jailbroke it and installed sshd. ssh-ing to it via 3G miraculously works perfectly. However, when connecting to it via WiFi, it’ll simply refuse data. That is, until you let the iPhone fetch data itself, for instance by visiting a website. Then suddenly the ssh packets come through again. In the background (with screen) I’m now running a very frequent ping to my local gateway, which makes ssh work perfectly again via WiFi. A very unsatisfying hack.

It seems like an aggressive power save on the WiFi.

iPhone

On the day of release, I bought an iPhone. We all know the great features, so let’s talk about the annoyances.

It crashes and gets sluggish a lot. Especially Safari crashes a lot, and the contact list at one moment took 10 seconds to become responsive. Over time, random crashes appeared in all applications. Rebooting it actually seems to have solved this (for now).

WiFi is unreliable. Sitting right next to the router, it’s still a game of chance whether it will load a website properly or just time out. This often leaves me disabling WiFi and using HSDPA (our 3G) instead, which has terrible coverage. Outside, reception is great, but even inside a building near a window it only shows one or two “pips”. This, again, often leaves me disabling 3G in favor of the slower but more reliable GPRS.

I use iTunes on Windows in VMware to sync my iPhone, which works great for music. For the other stuff, like contacts and photos, it just does not work. Syncing with Google contacts or the Windows address book just doesn’t work, and there isn’t even any support for calendar syncing on Windows. I still have to look into a free Exchange-compatible server. (Anyone?)

T-Mobile in the Netherlands seems to have trouble with the administration of all new subscriptions. Billing information and visual voicemail still don’t work.

(Nevertheless, it’s a great toy)

git’s versus svn’s storage efficiency

At Codeyard we maintain a git and a subversion repository (which are synced with each other) for each of the >115 projects. The following graph plots, on logarithmic axes, each project’s whole server-side subversion repository size horizontally against its git repository size vertically:

To make more sense of the logarithmic nature of the graph, I’ve added three lines. The first (solid black) indicates the points at which both sizes are equal. The second, coarsely dashed line indicates the points at which the subversion repository is twice as large as the git repository. And lastly, the third, finely dashed line indicates the points at which the subversion repository is five times as large as the git repository.
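As an aside, here is a minimal matplotlib sketch of how such a plot with ratio lines could be reproduced; the repository sizes below are made-up placeholders, not the actual Codeyard data.

import matplotlib.pyplot as plt
import numpy as np

# Made-up placeholder data: (svn_size, git_size) in bytes per project.
sizes = [(150e3, 180e3), (2e6, 1.5e6), (20e6, 4e6), (80e6, 60e6)]
svn, git = zip(*sizes)

plt.scatter(svn, git)
plt.xscale("log")
plt.yscale("log")

# On a log-log plot, every fixed ratio svn = k * git is a straight line.
xs = np.logspace(4, 9, 100)
for k, style in [(1, "k-"), (2, "k--"), (5, "k:")]:
    plt.plot(xs, xs / k, style, label="svn = %dx git" % k)

plt.xlabel("subversion repository size (bytes)")
plt.ylabel("git repository size (bytes)")
plt.legend()
plt.show()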

All projects for which git is less storage efficient are smaller than 100 KB. The projects for which git is most storage efficient (up to even 6 times, for a certain C# project) are all of medium size (10–100 MB) and code-heavy. For the other projects, which are blob-heavy (e.g. images), git and subversion are close (git beats svn by ~20%).

One notable disadvantage of huge git repositories (someone committed a livecd image) is an apparent [tex]$\geq 2N$[/tex] memory usage of git repack, even if I tell it not to with --window-memory.

“waiting for x server to begin accepting connections”

Try DisallowTCP = false under [Security] in /etc/X11/Sessions/Gnome/custom.conf, or similar for other display managers. Obviously this isn’t a very desirable solution; emerge --emptytree gnome might do the trick too.
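That is, the relevant bit of the config file would look something like:

[Security]
DisallowTCP=false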

(and obviously this might be just one of the many underlying causes for the very generic symptom of X not accepting connections)

cyv: syncing git and svn

For codeyard I’m developing cyv, a (still quite specific) utility (written in Python!) to keep svn and git repos in sync. On the server side, at least. First, let me explain what exactly is synced.

When someone commits to the svn repo, the git repo is synced with git-svn. You can just clone the git repo and git pull, instead of having to use git-svn yourself.

When pushing commits to the git repo on a branch that came from the svn repo, the commits are git-svn dcommit-ed. If that fails (svn doesn’t handle merges that well), it’ll revert the updates, and the successful part of the dcommit will come back on the next post-commit-triggered fetch from the svn repo. The user will then have to git pull and fix the commits locally: the git manner.

If pushes don’t involve the svn-backed branches, they won’t have any unusual side effects. This allows for pushing and pulling topic branches separately from svn, and pushing them, when mature enough, into subversion without ever having to hassle (as much) with git-svn.

An obvious huge advantage is that a git clone of the git repo is a hell of a lot faster than a git svn clone. A second big advantage is that someone can choose to use either git or svn himself without excluding the other. This is of special concern to codeyard, where projects should be accessible to everyone: beginners and advanced users alike. If instead we offered fully separate git repositories, the projects that prefer git would become inaccessible to most. And if we didn’t offer git repos, people would set them up themselves elsewhere, for they really don’t want to bother themselves with git-svn.

cyv contains some neat features. One I want to highlight is the cyv-layout file, which you can place in the root of the svn repo. It tells cyv how the repository is laid out. E.g.:

trunk:trunk
branches/*: branches/*
tags/*: releases/*
some-git-branch: some/path/in/the/svn/repo

Another one is a wrapper around git-shell, to give different users per-repository permissions depending on their ssh public key.
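A common way to implement such a wrapper (this is a hypothetical sketch, not necessarily cyv’s actual code) is a forced command in authorized_keys that checks the requested repository before handing over to git-shell:

#!/usr/bin/env python
# Hypothetical git-shell wrapper with per-repository, per-user permissions.
# Invoked via command="git-wrapper <user>" in authorized_keys, so the user
# name is tied to the ssh public key that was used to log in.
import os
import sys

PERMISSIONS = {"alice": {"project1.git", "project2.git"}}  # assumed table

user = sys.argv[1]
cmd = os.environ.get("SSH_ORIGINAL_COMMAND", "")

# git over ssh sends e.g.: git-upload-pack 'project1.git'
parts = cmd.split("'")
if (len(parts) == 3
        and parts[0].strip() in ("git-upload-pack", "git-receive-pack")
        and parts[1].strip("/") in PERMISSIONS.get(user, set())):
    os.execvp("git-shell", ["git-shell", "-c", cmd])
sys.exit("access denied")

The matching authorized_keys line would then force this wrapper, with the user name as its argument, for each public key.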

To reiterate, cyv is still quite specific to the needs of codeyard. (If you’re a codeyard participant: be patient, it’ll be up mid-June.) However, if you’re interested, I’ll be glad to hear from you.

Benchmarking CouchDB

I’ve written a small benchmark for CouchDB to test its document creation performance. A script creates [tex]$N$[/tex] documents in total, using bulk update to create [tex]$B$[/tex] at the same time, with [tex]$T$[/tex] concurrent threads. The following graph shows the time it takes to create a number of documents against that number of documents, for different values of [tex]$B$[/tex] with [tex]$T=1$[/tex].

And for [tex]$T=2$[/tex] (two concurrent threads, tested on a dual-core machine):

The values of [tex]$B$[/tex] are 1, 2, 4, 5, 8, 11, 16, 22, 32, 45, 64, 90, 128, 181, 256, 362, 512, 724 and 1024.

As you can see, a higher value of [tex]$B$[/tex] shifts the graph to the right, which means more [tex]$N$[/tex] in the same time. Bulk update really does make a difference. Or rather: non-bulk update really sucks. Adding threads also helps a bit, but not as much as expected.

There are some more interesting graphs to plot ([tex]$B$[/tex] against [tex]$\overline{N \over \Delta T}$[/tex]). More graphs tomorrow.

(For those interested, the raw data from which these graphs were plotted.)
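For those who want to reproduce it, here is a minimal sketch of such a benchmark, assuming a local CouchDB with a database called bench; the actual benchmark script may differ.

import json
import time
import threading
import urllib.request

DB = "http://localhost:5984/bench"  # assumed local test database

def create_docs(n, b):
    # Create n empty documents, b per _bulk_docs request.
    for i in range(0, n, b):
        batch = json.dumps({"docs": [{} for _ in range(min(b, n - i))]})
        req = urllib.request.Request(DB + "/_bulk_docs", data=batch.encode(),
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()

def bench(n, b, t):
    # t threads, each creating n/t documents in batches of b.
    threads = [threading.Thread(target=create_docs, args=(n // t, b))
               for _ in range(t)]
    start = time.time()
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return n / (time.time() - start)  # documents per second

print(bench(n=10000, b=100, t=2))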

CouchDB document creation performance

CouchDB is a non-relational database which uses MapReduce-inspired views to query data. There are lots of cool things to tell about its design, but I’d rather talk about its performance.

Today I’ve been busy hacking together a little script to import all e-mails of a long e-mail thread into a CouchDB database, to write views that extract all kinds of statistics. I had already imported these e-mails into a MySQL database a few months ago, but was quite disappointed by the (performance) limitations of SQL. The e-mail thread contains over 20,000 messages, which weren’t a real problem for MySQL. When importing, however, couchdb was adding them at a rate of only a few dozen per second, with a lot of (seek) noise from my HDD.

So I decided to do a simple benchmark. First off, a simple script (ser.py) that adds empty documents sequentially. It averages 16 per second. It occurred to me that couchdb waits for an fsync before sending a response, and that asynchronously the performance would be way better. One simple modification to the script later (par.py), it still averaged 16 creations per second.
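For the curious, here is a minimal sketch of what the sequential version boils down to, assuming a local CouchDB and a bench database; the real ser.py may differ.

import time
import urllib.request

DB = "http://localhost:5984/bench"  # assumed local test database

start = time.time()
for _ in range(100):
    # POSTing an empty JSON object creates one document per request.
    req = urllib.request.Request(DB, data=b"{}",
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).read()
print(100 / (time.time() - start), "creations/sec")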

I guess (for I haven’t yet figured out how to get strace to tell me) that it’s the fsync after each document creation that causes the mess. couchdb itself doesn’t write or seek a lot, but my journaling filesystem (XFS) does on an fsync.

Can anyone test it on a different filesystem?

Update: Around 17/sec with reiserfs.

Update: I had some trouble with the bulk update feature, so I switched from svn to the 0.7.2 release. I got about 600/sec, which dropped to a steady-ish 350/sec, when using sequential bulk updates of 100 docs. Two bulk updates in parallel yielded about 950/sec initially, dropping to 550/sec after a while. Three parallel updates yielded similar performance.

Interrupting a select without a timeout

select is a POSIX syscall which allows you to wait on several different file descriptors (including sockets) for the events that they won’t block on write, won’t block on read, or are in error. This syscall is very convenient when you’re writing a server.

When I want to shut down an instance of the server, I have to interrupt the select. I have yet to find a satisfying way of doing this. At the moment I create a pair of linked sockets with socketpair. I include one of them among the sockets the select call watches for readable data. To interrupt, I simply write some data to the other socket, which causes data to be available on the first socket, which in turn interrupts the select.
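In Python, the trick looks something like this minimal sketch:

import select
import socket
import threading
import time

# A pair of connected sockets: r goes into the select read-set,
# w is the handle another thread uses to interrupt the select.
r, w = socket.socketpair()

def shutdown_later():
    time.sleep(1)
    w.send(b"x")  # any byte makes r readable, waking up the select

threading.Thread(target=shutdown_later).start()

# Blocks without a timeout until a watched socket becomes readable.
readable, _, _ = select.select([r], [], [])
if r in readable:
    r.recv(1)  # drain the wake-up byte
    print("select interrupted, shutting down")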

There must be a more elegant solution.