git’s versus svn’s storage efficiency

At Codeyard we maintain a git and a subversion repository (synced with each other) for each of the >115 projects. The following graph plots the repositories on logarithmic axes, with the size of the whole server-side subversion repository on the horizontal axis and the size of the git repository on the vertical axis:

To make more sense of the logarithmic nature of the graph, I’ve added three lines. The first (solid black) marks the points at which both sizes are equal. The second (coarsely dashed) marks the points at which the subversion repository is twice as large as the git repository. And lastly, the third (finely dashed) marks the points at which the subversion repository is five times as large as the git repository.
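A rough sketch of what those three lines mean in numbers: on log-log axes, the line "svn is k times as large as git" is the equal-size line shifted by log10(k) decades. The function names and the example sizes below are made up for illustration; they are not measurements from the graph.

```python
import math

# Guide lines from the graph: svn size equal to, 2x and 5x the git size.
GUIDE_FACTORS = (1, 2, 5)


def guide_offset(factor):
    """Distance in decades between the equal-size line and the line for `factor`."""
    return math.log10(factor)


def band(svn_bytes, git_bytes):
    """Say between which guide lines a repository falls."""
    ratio = svn_bytes / git_bytes
    if ratio < 1:
        return "git larger than svn"
    if ratio < 2:
        return "svn up to 2x larger"
    if ratio < 5:
        return "svn 2x to 5x larger"
    return "svn 5x or more larger"


# Hypothetical sizes in bytes, not taken from the real repositories.
print(band(50_000, 80_000))
print(band(60_000_000, 10_000_000))
print([round(guide_offset(f), 2) for f in GUIDE_FACTORS])
```

The 2x and 5x lines sit roughly 0.3 and 0.7 decades away from the equal-size line, which is why they look like evenly sloped parallels rather than diverging rays.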

All projects for which git is less storage efficient are smaller than 100 KB. The projects for which git is most storage efficient (up to even 6 times for a certain C# project) are all of medium size (10–100 MB) and code-heavy. For the other projects, which are blob-heavy (e.g. images), git and subversion are close (git beats svn by ~20%).

One notable disadvantage of huge git repositories (someone committed a livecd image) is that git repack appears to use ≥ 2N memory, even if I tell it not to with --window-memory.

cyv: syncing git and svn

For Codeyard I’m developing cyv, which is a (still quite specific) util (written in Python!) to keep svn and git repos in sync. On the server side, at least. First, let me explain what exactly is synced.

When someone commits to an svn repo, the git repo is synced with git-svn. You can just clone the git repo and git pull instead of having to use git-svn yourself.

When commits are pushed to the git repo on a branch that came from the svn repo, the commits are git-svn dcommit-ed. If the dcommit fails (svn doesn’t do merges that well), the updates to the git repo are reverted, and the git repo will receive the successful part of the dcommit on the next fetch from the svn repo, which is triggered by svn’s post-commit hook. The user will have to git pull and fix the commits locally: the git manner.

If pushes don’t involve the svn-backed branches, they won’t have any unusual side effects. This allows for pushing and pulling topic branches separately from svn and pushing them, when mature enough, into subversion without ever having to hassle (as much) with git-svn.

An obvious huge advantage is that a git clone of the git repo is a hell of a lot faster than a git svn clone. A second big advantage is that everyone can choose to use either git or svn without excluding the other. This is of special concern to Codeyard, where projects should be accessible to everyone: beginners and advanced. If instead we offered fully separated git repositories, the projects that prefer git would become inaccessible for most. And if we didn’t offer git repos, people would set them up themselves elsewhere, for they really don’t want to bother themselves with git-svn.

cyv contains some neat features. One I want to highlight is the cyv-layout file, which you can place in the root of the svn repo. It tells cyv how the repository is laid out. E.g.:

trunk:trunk
branches/*: branches/*
tags/*: releases/*
some-git-branch: some/path/in/the/svn/repo
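To make the format a bit more concrete, here is a minimal sketch of how such a file could be read. It is not cyv’s actual code: the function names are hypothetical, and I’m assuming each line reads "git branch: svn path", that whitespace around the colon is optional, and that a trailing * stands for exactly one path component.

```python
def parse_layout(text):
    """Parse 'git-branch: svn/path' lines into a list of (branch, path) pairs."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        branch, sep, path = line.partition(":")
        if not sep:
            raise ValueError("missing ':' in layout line: %r" % raw)
        rules.append((branch.strip(), path.strip()))
    return rules


def branch_for(rules, svn_path):
    """Return the git branch an svn path maps to, or None if no rule matches."""
    for branch, path in rules:
        if path.endswith("/*"):
            prefix = path[:-1]  # keep the trailing slash, e.g. 'releases/'
            if not svn_path.startswith(prefix):
                continue
            rest = svn_path[len(prefix):]
            if not rest or "/" in rest:
                continue  # assumed: '*' matches exactly one path component
            if branch.endswith("/*"):
                return branch[:-1] + rest
            return branch
        if svn_path == path:
            return branch
    return None


LAYOUT = """\
trunk:trunk
branches/*: branches/*
tags/*: releases/*
some-git-branch: some/path/in/the/svn/repo
"""

rules = parse_layout(LAYOUT)
print(branch_for(rules, "releases/1.0"))
```

With the example layout, the svn directory releases/1.0 ends up as the git branch tags/1.0, and a path that no rule covers yields None.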

Another one is a wrapper around git-shell that gives different users per-repository permissions depending on their ssh public key.
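The kind of check such a wrapper has to make can be sketched as follows. This is an assumption about the general approach, not cyv’s implementation: I’m assuming the ssh key has already been mapped to a user name, that the requested command arrives as a string (as sshd provides in SSH_ORIGINAL_COMMAND for forced commands), and that permissions live in a simple made-up table.

```python
import shlex

# The commands a git client runs on the server over ssh.
READ_COMMANDS = {"git-upload-pack", "git-upload-archive"}   # clone, fetch, archive
WRITE_COMMANDS = {"git-receive-pack"}                       # push


def is_allowed(user, original_command, permissions):
    """Decide whether `user` may run the requested git command.

    `permissions` is a hypothetical table mapping (user, repo) to a string
    containing 'r' and/or 'w'.
    """
    try:
        parts = shlex.split(original_command)
    except ValueError:
        return False  # unbalanced quotes and the like
    if len(parts) != 2:
        return False  # expect exactly: <command> <repository>
    command, repo = parts
    if command in READ_COMMANDS:
        needed = "r"
    elif command in WRITE_COMMANDS:
        needed = "w"
    else:
        return False  # anything that is not a git transport command
    return needed in permissions.get((user, repo), "")


# Made-up users and repository names, purely for illustration.
PERMISSIONS = {
    ("alice", "project.git"): "rw",
    ("bob", "project.git"): "r",
}

print(is_allowed("bob", "git-upload-pack 'project.git'", PERMISSIONS))
print(is_allowed("bob", "git-receive-pack 'project.git'", PERMISSIONS))
```

Only when the check passes would the wrapper hand the command on to git-shell; everything else gets refused before git ever sees it.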

To reiterate, cyv is still quite specific to the needs of Codeyard. (If you’re a Codeyard participant: be patient, it’ll be up mid-June.) However, if you’re interested, I’ll be glad to hear from you.

Upgrading wordpress with git

I didn’t like upgrading wordpress much. Every time I did it, I needed to re-apply all my little tweaks to the new wordpress. It took too much time.

I tried running diff -uNr on the version I was running and the newer version, and then applying the resulting diff to my current installation, but it seems wordpress has been backporting changes, so I got conflicts, quite a lot of them.

Because I was quite tired of porting my changes, I’ve tried git, the Source Code Management tool used by the Linux kernel, to do it for me:

I did this all in the parent directory of the root of blog.w-nz.com. This folder contains:

  • htdocs current installation (2.1.2)
  • 2.1.2 the unmodified wordpress
  • 2.2.0 the new wordpress I want to upgrade to

First, I created an empty git repository:

mkdir git; cd git; git init-db; cd ..

Then I copied over the unmodified version of wordpress I was running, and committed it:

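cp -R 2.1.2/* git/ # -R before the operands also works with non-GNU cp
cd git
git add . # stage everything that was just copied
git commit -a -s # -s adds a Signed-off-by line
cd ..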

Then I copied over my current installation:

cp -R htdocs/* git/
cd git # git status has to run inside the repository
git status # let's see what changed

There are lots of files, like uploads, that I want git to ignore, so I edited .gitignore to make git ignore them. There weren’t any files I had added myself though; otherwise I’d have had to run git add to let git know.

And let’s commit my changes:

git commit -a -s

Now, let’s go back to the original commit (the clean 2.1.2 wordpress) and start a branch from there:

git checkout HEAD^ # HEAD^ means parent commit of HEAD: the previous commit
git checkout -b tmp # create a new branch tmp from here

Now I’m in a branch without my own changes, which was forked from the master branch. Let’s apply the new wordpress on this branch:

cd ..
cp -R 2.2.0/* git/ # -R before the operands also works with non-GNU cp
cd git
git status # see what changed

git status showed me that there are a few new files in wordpress 2.2.0. I git-add-ed all of these new files and then committed it all:

git commit -a -s

Now I’ve got two branches:

  • master which contains wordpress 2.1.2 with my own changes on top as a commit
  • tmp which is forked from the wordpress 2.1.2 from the master branch without my own changes but with the 2.2.0 changes on top

What I want to do is to reapply the 2.2.0 changes on top of my current changes’ commit instead of on top of the 2.1.2 commit. To do this, git has a very powerful util called git-rebase:

git rebase master

This will search down the tree until the point where the current branch (tmp) forked from the target branch (master). Then it will re-apply all commits made on the current branch since that point on top of the latest commit of the target branch.
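The search-and-replay described above can be sketched as a toy model of the commit graph. This is only an illustration of the idea, not how git implements it: commits are plain labels, history is assumed to be linear, and the actual re-applying of file changes (where the conflicts come from) is left out.

```python
def ancestors(parents, commit):
    """Yield `commit` and then its ancestors, newest first (linear history only)."""
    while commit is not None:
        yield commit
        commit = parents[commit]


def rebase(parents, branch_tip, target_tip):
    """Replay the commits unique to the branch on top of the target's tip.

    Returns a new parents map and the new tip of the branch.
    """
    target_history = set(ancestors(parents, target_tip))
    to_replay = []
    for commit in ancestors(parents, branch_tip):
        if commit in target_history:
            break  # reached the fork point
        to_replay.append(commit)

    new_parents = dict(parents)
    tip = target_tip
    for commit in reversed(to_replay):  # oldest first
        rewritten = commit + "'"        # a replayed commit is a new commit
        new_parents[rewritten] = tip
        tip = rewritten
    return new_parents, tip


# The situation from this post: both branches start at the clean 2.1.2 commit.
parents = {
    "2.1.2": None,
    "my-changes": "2.1.2",  # tip of master
    "2.2.0": "2.1.2",       # tip of tmp
}

new_parents, tmp_tip = rebase(parents, "2.2.0", "my-changes")
print(list(ancestors(new_parents, tmp_tip)))
```

After the rebase, the history of tmp reads 2.2.0', my-changes, 2.1.2: the upgrade now sits on top of my own changes instead of next to them.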

Just as with diff/patch, I got merge conflicts. git rebase lets me know this and git status shows me which files are affected. The difference with the diff/patch approach is that there are way fewer merge conflicts (git is smarter), and that the conflicts are way easier to identify because they’re marked inline in the original files. Not to mention that if I had fucked up, I’d always have a way back.

After I fixed the merge conflicts, I ran git update-index on each conflicted file (to tell git it’s resolved; git add does the same) and then ran git rebase --continue.

Now I’ve got my updated wordpress in the git folder. Then I backed up the current installation, copied over the files from git, visited wp-admin/upgrade.php and I’m done :).

By the way: “I didn’t say Subversion doesn’t work. Subversion users are just ugly and stupid.” — Linus on this Google tech talk.

Sidenote, I switched from Hashcash to Akismet. Hashcash didn’t work anymore and Akismet theoretically should be the best solution because it isn’t based on security by obscurity.