git’s versus svn’s storage efficiency

At Codeyard we maintain both a git and a subversion repository, synced with each other, for each of our more than 115 projects. The following graph plots these repositories on logarithmic axes: the size of the whole server-side subversion repository horizontally, and the size of the git repository vertically:

To make more sense of the logarithmic nature of the graph, I’ve added three lines. The first (solid black) marks the points at which both sizes are equal. The second, coarsely dashed line marks the points at which the subversion repository is twice as large as the git repository. Lastly, the third, finely dashed line marks the points at which the subversion repository is five times as large as the git repository.
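Because both axes are logarithmic, a constant size ratio shows up as a straight line parallel to the diagonal, shifted by the base-10 logarithm of that ratio. A minimal Python sketch of that relationship follows; the function names and the band labels are mine, purely for illustration, and no Codeyard data is involved:

```python
import math

def log_offset(svn_bytes, git_bytes):
    """Distance of a point from the 'equal size' diagonal on log10 axes.

    Positive when the subversion repository is the larger one.
    """
    return math.log10(svn_bytes) - math.log10(git_bytes)

def band(svn_bytes, git_bytes):
    """Name the region between the three reference lines a point falls in."""
    ratio = svn_bytes / git_bytes
    if ratio < 1:
        return "git larger"
    if ratio < 2:
        return "1x to 2x"
    if ratio < 5:
        return "2x to 5x"
    return "5x or more"
```

The "twice as large" line therefore sits at an offset of log10(2), roughly 0.3, from the diagonal, and the "five times" line at log10(5), roughly 0.7, regardless of the absolute repository size.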

All projects for which git is less storage efficient are smaller than 100 KB. The projects for which git is most storage efficient (up to 6 times for a certain C# project) are all of medium size (10–100 MB) and code-heavy. For the other projects, which are blob-heavy (e.g. images), git and subversion are close (git beats svn by ~20%).

One notable disadvantage of huge git repositories (someone committed a livecd image) is that git repack appears to use [tex]\geq 2N[/tex] memory, even if I tell it not to with --window-memory.
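For reference, this is roughly what such a repack invocation looks like when driven from Python. The 256m limit and the function names are placeholders of mine, not the values used at Codeyard:

```python
import subprocess

def repack_command(window_memory="256m"):
    """Build a full repack command with a cap on the delta window memory."""
    # -a: pack everything into a single pack
    # -d: remove packs made redundant by the new one
    return ["git", "repack", "-a", "-d", f"--window-memory={window_memory}"]

def repack(repo_dir, window_memory="256m"):
    """Run the repack inside repo_dir; raises CalledProcessError on failure."""
    subprocess.run(repack_command(window_memory), cwd=repo_dir, check=True)
```

Note that --window-memory only limits the delta search window, which may be why it does not help with a single huge blob.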

cyv: syncing git and svn

For Codeyard I’m developing cyv, which is a (still quite specific) util (written in Python!) to keep svn and git repos in sync. On the server side, at least. First, let me explain what exactly is synced.

When someone commits to an svn repo, the git repo is synced with git-svn. You can just clone the git repo and git pull instead of having to use git-svn yourself.
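The svn-to-git direction can be sketched as a subversion post-commit hook that runs git svn fetch in the matching git repo. This is an illustration under my own assumptions, not cyv’s code: the /srv/git location, the naming scheme and the function names are all hypothetical.

```python
import subprocess

def mirror_for(svn_repo_path, mirror_root="/srv/git"):
    """Hypothetical mapping: the git mirror has the same name as the svn repo."""
    name = svn_repo_path.rstrip("/").split("/")[-1]
    return f"{mirror_root}/{name}"

def fetch_command():
    """Pull new svn revisions into a git-svn enabled repository."""
    return ["git", "svn", "fetch"]

def main(argv):
    # Subversion calls post-commit with the repository path as first argument.
    svn_repo_path = argv[1]
    subprocess.run(fetch_command(), cwd=mirror_for(svn_repo_path), check=True)
```

A hook script would call main(sys.argv). Updating the published git branches to the freshly fetched revisions is left out here.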

When commits are pushed to the git repo on a branch that came from the svn repo, they are git-svn dcommit-ed. If that fails (svn doesn’t do merges that well), cyv reverts the updates, and the git repo receives the successful part of the dcommit on the next fetch from the svn repo, triggered by the post-commit hook. The user then has to git pull and fix the commits locally: the git manner.
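The "dcommit, and revert on failure" idea can be sketched like this. It is not cyv’s actual code: the function name is made up, the command runner is injectable only so the logic can be tried without a repository, and a real implementation also has to deal with the working tree that git svn dcommit expects.

```python
import subprocess

def dcommit_or_revert(branch, old_rev, run=subprocess.check_call):
    """Try to dcommit; on failure point the branch back at old_rev.

    old_rev is the value the branch had before the push, which git hands
    to its server-side hooks. Returns True if the dcommit succeeded.
    """
    try:
        run(["git", "svn", "dcommit"])
    except subprocess.CalledProcessError:
        # Undo the push; whatever part did reach svn comes back
        # with the next fetch from the svn repo.
        run(["git", "update-ref", f"refs/heads/{branch}", old_rev])
        return False
    return True
```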

If pushes don’t involve the svn-backed branches, they won’t have any unusual side effects. This allows topic branches to be pushed and pulled separately from svn and, when mature enough, pushed into subversion without ever having to hassle (as much) with git-svn.

An obvious huge advantage is that a git clone of the git repo is a hell of a lot faster than a git svn clone. A second big advantage is that everyone can choose to use either git or svn without excluding the other. This is of special concern to Codeyard, where projects should be accessible to everyone, beginners and advanced users alike. If we instead offered fully separate git repositories, the projects that prefer git would become inaccessible to most. And if we didn’t offer git repos, people would set them up themselves elsewhere, for they really don’t want to bother with git-svn.

cyv contains some neat features. One I want to highlight is the cyv-layout file, which you can place in the root of the svn repo. It tells cyv how the repository is laid out. E.g.:

trunk: trunk
branches/*: branches/*
tags/*: releases/*
some-git-branch: some/path/in/the/svn/repo
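
To give an idea of how such a file could be read, here is a small Python sketch. It is not cyv’s parser: the function names are mine, and it assumes nothing more than "left: right" pairs with at most one trailing * wildcard, as in the example above.

```python
def parse_layout(text):
    """Return a list of (left, right) pairs from cyv-layout style text."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        left, sep, right = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in layout line: {line!r}")
        pairs.append((left.strip(), right.strip()))
    return pairs

def match(pattern, name):
    """Return what '*' matched, '' for an exact match, or None."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        if name.startswith(prefix) and len(name) > len(prefix):
            return name[len(prefix):]
        return None
    return "" if name == pattern else None

def translate(pairs, name):
    """Map a left-hand name to its right-hand counterpart, or None."""
    for left, right in pairs:
        star = match(left, name)
        if star is not None:
            return right.replace("*", star)
    return None
```

With the layout above, translate(pairs, "tags/v1.0") gives "releases/v1.0".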

Another one is a wrapper around git-shell that provides per-repository permissions for different users, depending on their SSH public key.
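
The usual way to build such a wrapper is to give every key in authorized_keys a forced command="..." option that names the user, and to have the wrapper inspect SSH_ORIGINAL_COMMAND before handing over to git-shell. A minimal sketch of that idea; the permission table, user names and function names are hypothetical, and this is not cyv’s implementation:

```python
import os
import shlex

# Which right each git server command needs.
NEEDED = {
    "git-upload-pack": "read",      # clone, fetch, pull
    "git-upload-archive": "read",   # git archive --remote
    "git-receive-pack": "write",    # push
}

# Hypothetical permission table: user -> repository -> rights.
PERMISSIONS = {
    "alice": {"project.git": {"read", "write"}},
    "bob": {"project.git": {"read"}},
}

def check(user, original_command, permissions=PERMISSIONS):
    """Return True if user may run the git command that ssh asked for."""
    try:
        parts = shlex.split(original_command)
    except ValueError:  # unbalanced quotes and the like
        return False
    if len(parts) != 2 or parts[0] not in NEEDED:
        return False
    verb, repo = parts
    return NEEDED[verb] in permissions.get(user, {}).get(repo, set())

def main(user):
    """Entry point for the forced command; user comes from authorized_keys."""
    command = os.environ.get("SSH_ORIGINAL_COMMAND", "")
    if not check(user, command):
        raise SystemExit("access denied")
    # Let git-shell do the actual work.
    os.execvp("git-shell", ["git-shell", "-c", command])
```

Only exact repository names from the table are accepted, so anything unexpected (other commands, path tricks, unknown users) is simply denied.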

To reiterate, cyv is still quite specific to the needs of Codeyard. (If you’re a Codeyard participant: be patient, it’ll be up mid-June.) However, if you’re interested, I’ll be glad to hear from you.