Stupid PHP (1) (Strings are faster than Arrays)

When I slowly build a big string out of little bits, the worst thing to do in most languages is to just use string concatenation:

foreach ($little_bits as $little_bit) {
    $str .= $little_bit;
}

Why? Every time a little bit is added to str, a new string must be allocated, big enough to contain str plus the new little bit, and then the contents of str must be copied into it. Most languages therefore have constructs to build strings like this efficiently instead of concatenating: StringBuilder in C#, StringIO in Python.

But no, PHP has to be stupid: there is no such construct and you’ll end up using concatenation. So I thought I’d be smart and make use of PHP arrays and implode. Arrays are made for having elements added and removed all the time, so they are properly buffered and should be great at absorbing lots of small elements. And when I want to pack it all into one big string, I can use PHP’s built-in implode function.

I wanted to try it out and created two scripts: a.php concatenates a little (10-byte) string one million times, and b.php appends it to an array that many times and implodes the array afterwards. And because I’m also interested in the cost of implode itself, there is a script c.php that is identical to b.php but doesn’t implode afterwards. These are the results:

a.php (concat) 0.320s
b.php (array append and implode) 0.814s
c.php (array append) 0.732s

Indeed, string concatenation with all its allocation and copying is actually faster than plain simple array appending. PHP is stupid.
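For what it’s worth, the experiment is easy to redo in other languages. Here is a rough Python analogue of a.php and b.php; it’s a sketch only, since absolute timings, and even which method wins, depend heavily on the language (CPython, for one, special-cases += on strings):

```python
# Rough analogue of a.php (concatenation) vs b.php (append + join).
import time

LITTLE_BIT = "0123456789"  # a 10-byte string
N = 1_000_000

t0 = time.perf_counter()
s = ""
for _ in range(N):
    s += LITTLE_BIT            # a.php: plain concatenation
concat_time = time.perf_counter() - t0

t0 = time.perf_counter()
parts = []
for _ in range(N):
    parts.append(LITTLE_BIT)   # b.php: append to an array...
joined = "".join(parts)        # ...then join (PHP's implode)
join_time = time.perf_counter() - t0

assert s == joined
print(f"concat {concat_time:.3f}s, append+join {join_time:.3f}s")
```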

The (or at least a better) Solution to Spam

There is no easy way to distinguish between a human and a spambot: it’s an arms race in which we’ll always be behind. I’m talking about spam in general here, not only e-mail spam but also spam on, for instance, Wikipedia or on blog posts. Even if we had a perfect solution to test whether there is a human behind something, we would still have to combat cheap labour in India: real people spamming “by hand”.

I think the solution is a web of trust similar to that of PGP. An identity (not necessarily a person) publishes whom she trusts (not to be spammy or phishy) and whom she distrusts. Ideally everyone would be in one big web, and only someone whom my blog trusts via-via may post.
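As a toy illustration of the via-via rule (the names and the trust graph here are made up), deciding who may post boils down to a reachability check over the published trust declarations:

```python
# Toy web of trust: I accept a comment only if the commenter is reachable
# from me by following trust edges (breadth-first search).
from collections import deque

trust = {
    "me":      {"alice"},
    "alice":   {"bob"},
    "bob":     set(),
    "spammer": {"me"},  # the spammer trusting me doesn't help the spammer
}

def trusts_via_via(who, whom):
    """Is `whom` reachable from `who` in the trust graph?"""
    queue, seen = deque([who]), {who}
    while queue:
        n = queue.popleft()
        if n == whom:
            return True
        for m in trust.get(n, ()):
            if m not in seen:
                seen.add(m)
                queue.append(m)
    return False

print(trusts_via_via("me", "bob"))      # True: me -> alice -> bob
print(trusts_via_via("me", "spammer"))  # False: no path from me
```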

Obviously, one may still gather trust from people over time, enter the web of trust, and then start spamming with that identity. But then that identity will be marked untrusted by people, and the people who initially marked the identity as trusted will themselves become less trusted. Also, there are far more sophisticated measures of establishment in the web of trust conceivable than just being trusted via-via by a single identity.

There is no way to prevent spam perfectly, but the amount of work that has to go into making an identity trusted and established in the web is several orders of magnitude greater than with any other protection we have. The big problem: we don’t have such a ubiquitous web of trust yet. (Yes, it’ll be in SINP if I ever get around to working on it.)

Section 202c of the German computer crime laws

This section has come into effect over the weekend. It makes it illegal to create, possess, obtain, provide access to, yield, distribute or otherwise allow access to lots of widespread tools that can be used to breach security. Take for instance nmap.

This law not only impedes our freedom (of speech) and our research, decreases security and allows for misuse; more importantly, it won’t even stop the real criminals.

Stefan of the Month of PHP Bugs Project writes:

The law does not affect our freedom of speech to report and inform about security vulnerabilities and how to exploit them.

We are just not allowed to create/distribute/use software that could be used as “hacking tools”.

Where would they draw the line between reporting and informing about a vulnerability and how to exploit it, and the actual source code to do it? Would pseudocode be illegal? Would literate code be illegal? Also, there would be no way for security researchers to try out their work.

What will happen in the worst case, if similar laws are accepted and enforced in other countries, is that vendors will rather use these laws to cover up vulnerabilities than fix them. That there are lots of ready-to-use exploits around is a good thing: it’s a very strong incentive for security.

This law won’t change the fact that there will always be a hole in any piece of software that someone can find on his own. Nor is there any way to stop the real criminals from creating and distributing tools underground. Right now everyone at least knows what kind of tools are around and what to expect.

Wacken

I’ll be up early tomorrow, Saturday, to take the train with lots of friends to the little town of Wacken in Germany. (Actually, there is no train station in Wacken, so we’ll have to arrange a cab somehow for the last dozen kilometers.)

Sunday morning the campsite of the Wacken Open Air festival opens. The festival itself starts on Thursday. I hope to be home again the first Sunday of August.

linux 2.6.22-bw-r1

There isn’t a tree that contains reiser4, suspend2 and the gentoo patches — so I created one. This revision adds the -stable patches and the new gentoo patches.

One big patch: bw-r1-for-2.6.22.diff.bz2
Patches broken out: bw-r1-for-2.6.22.tar.bz2

To apply the one big patch, use: bzcat bw-r1-for-2.6.22.diff.bz2 | patch -p1 inside a vanilla 2.6.22.

“Nothing to hide”

In this short essay, written for a symposium in the San Diego Law Review, Professor Daniel Solove examines the “nothing to hide” argument. When asked about government surveillance and data mining, many people respond by declaring: “I’ve got nothing to hide.” According to the “nothing to hide” argument, there is no threat to privacy unless the government uncovers unlawful activity, in which case a person has no legitimate justification to claim that it remain private. The “nothing to hide” argument and its variants are quite prevalent, and thus are worth addressing. In this essay, Solove critiques the “nothing to hide” argument and exposes its faulty underpinnings.

“I’ve Got Nothing to Hide” and Other Misunderstandings of Privacy

Not only is its subject very relevant, the essay is also very well written and a pleasure to read.

Online Cycle Detection in Directed Graphs

A short while ago I came across quite an interesting problem: design a data structure (and algorithms) to maintain a directed acyclic graph.

There has to be only one operation: adding a link between two given nodes. This operation must detect and deny any new link that would create a cycle. For simplicity, nodes are identified by sequential ids starting with 0.

An example:

addLink 0, 1 -> True
addLink 1, 2 -> True
addLink 2, 0 -> False # Fails because it would create a cycle

addLink 0, 1 -> True
addLink 1, 2 -> True
addLink 0, 2 -> True # This isn't a real cycle, so it's perfectly fine

It’s rather trivial to create an [tex]\mathcal{O}\left(n+\ell\right)[/tex] algorithm (where [tex]n[/tex] is the number of nodes and [tex]\ell[/tex] the number of links). I conjecture there also exists an [tex]\mathcal{O}\left(\log n\right)[/tex] algorithm.
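A minimal sketch of that trivial approach in Python (add_link standing in for addLink): keep adjacency lists and, before accepting a new link a → b, check with a depth-first search whether a is already reachable from b.

```python
# Online cycle detection: deny any link that would close a cycle.
from collections import defaultdict

class DAG:
    def __init__(self):
        self.out = defaultdict(set)  # node -> set of successor nodes

    def _reachable(self, frm, to):
        # Iterative DFS: is `to` reachable from `frm`?
        stack, seen = [frm], set()
        while stack:
            n = stack.pop()
            if n == to:
                return True
            if n in seen:
                continue
            seen.add(n)
            stack.extend(self.out[n])
        return False

    def add_link(self, a, b):
        # If b already reaches a, adding a -> b would create a cycle.
        if a == b or self._reachable(b, a):
            return False
        self.out[a].add(b)
        return True

d = DAG()
print(d.add_link(0, 1), d.add_link(1, 2), d.add_link(2, 0))  # True True False
```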

Graduated

Today I received a long-anticipated phone call from my mentor, who took away the doubt and let me know that I passed my VWO exams, a matriculation exam, at the Stedelijk Gymnasium Nijmegen.

With mixed feelings I look back at seven long years. It certainly shouldn’t have taken seven years, nor six, not even three. But especially in my last few years I came to appreciate the tons of very different people with whom I spent my days: a diversity I won’t find to that extent among the students of Maths and Physics, which I intend to study next year at the Radboud University in Nijmegen.

For one thing, I certainly wouldn’t have wanted to miss these last few years. Some say they’re the best of your life. I do really look forward to university, though.

For those interested, I’ve planned an “Examenfeest” with some others in a nice club in Nijmegen. Send me an e-mail if you happen to be in the neighborhood and care to join.

Syslets and other untrusted kernelspace programming

In the wake of creating true async IO in the linux kernel, Ingo worked on so-called syslets. A syslet is nothing more than one asynchronous system call that chains together a few arbitrary system calls. An example syslet would be:

1. Read from file descriptor 123
2. If 1 returned an error, break
3. Write what was read to file descriptor 321
4. If 3 succeeded, jump back to 1; otherwise break

This is represented by a few structs linked together, of which the first is passed to the kernel with the syslet system call. That system call returns practically immediately, and the application can eventually be notified of (or wait on) the syslet’s completion. This could save an enormous number of system calls, and thus way fewer context switches, which is very good for performance.
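To make the idea concrete, here is a purely conceptual model in Python. The Atom fields and run_syslet interpreter are my invention for illustration, not the real syslet ABI: the read/write chain above becomes a list of atoms that the “kernel” runs in one go.

```python
# Conceptual model of a syslet: a chain of "atoms", each one system call
# plus flags describing what to do with its result.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Atom:
    call: Callable[[], int]                # stands in for a system call
    stop_on_error: bool = False            # break the chain on failure
    jump_on_success: Optional[int] = None  # atom index to jump to on success

def run_syslet(atoms):
    """Run the whole chain, as the kernel would, without returning to userspace."""
    i = 0
    while i < len(atoms):
        a = atoms[i]
        ret = a.call()
        if ret < 0 and a.stop_on_error:
            break
        if ret >= 0 and a.jump_on_success is not None:
            i = a.jump_on_success
            continue
        i += 1

# Toy "file descriptors": copy all chunks from src to dst in one "syslet".
src, dst, buf = [b"hello ", b"world"], [], []

def sys_read():
    if not src:
        return -1          # "end of file" as an error return
    buf.append(src.pop(0))
    return len(buf[-1])

def sys_write():
    dst.append(buf.pop(0))
    return len(dst[-1])

run_syslet([
    Atom(sys_read, stop_on_error=True),    # 1/2: read; break on error
    Atom(sys_write, jump_on_success=0),    # 3/4: write; loop on success
])
print(b"".join(dst))  # b'hello world'
```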

Syslets are, in a very primitive manner, the dawn of kernelspace scripting. But hey, wasn’t kernelside scripting the thing that all linux devs dreaded? Wasn’t it Linus who joked that he would be in a mental institute whilst releasing Linux 3 with a VB message pump? Yes, they’re afraid of putting untrusted programs/scripts in kernelspace, and they’ll barely acknowledge that syslets are a first step.

The problem with the current full-featured scripting languages is that they are, well, full-featuredness gone wrong: they’re bloated and not really secure. In kernelspace you can’t allow any memory to be accessed except the script’s own, not to mention that the usual restrictions on resources and virtual memory won’t help you there. Most scripting languages weren’t designed with these restrictions in mind, and most have evil functions in dark corners of the standard library that allow you to do really evil stuff with memory.

As far as I know, only .Net (and thus Mono) has a quite rigorous trust framework built in. But .Net is bloated and proprietary, and Mono still isn’t (and probably never will be) feature complete, while being very bloated itself.

What we need is a very simple safe language, with a compiler daemon running as a system service, through which untrusted userspace programs can get scripts running in the kernel. I’m tempted to use one of the brainf*ck JITs; they’re small enough to review thoroughly :).

A kernelspace interpreter would do too, though, as a PoC.

Linux 2.6.21-bw-r1

There isn’t a tree that contains reiser4, suspend2 and the gentoo patches — so I created one. This revision adds the -stable patches, new gentoo patches and some powertop patches.

One big patch: bw-r1-for-2.6.21.diff.bz2
Patches broken out: bw-r1-for-2.6.21.tar.bz2

To apply the one big patch, use: bzcat bw-r1-for-2.6.21.diff.bz2 | patch -p1 inside a vanilla 2.6.21.

Upgrading wordpress with git

I didn’t like upgrading wordpress much. Every time I did it, I needed to re-apply all my little tweaks to the new wordpress. It took too much time.

I tried running diff -uNr on the version I was running and the newer version, and then applying the resulting diff to my current installation, but it seems wordpress has been backporting changes, so I got conflicts, quite a lot of them.

Because I was quite tired of porting my changes by hand, I tried git, the source code management tool used by the linux kernel, to do it for me.

I did this all in the parent directory of the root of blog.w-nz.com. This folder contains:

  • htdocs: the current installation (2.1.2)
  • 2.1.2: the unmodified wordpress
  • 2.2.0: the new wordpress I want to upgrade to

First, I created an empty git repository:

mkdir git; cd git; git init-db; cd ..

Then I copied over the unmodified version of wordpress I was running, and committed it:

cp -R 2.1.2/* git
cd git
git add *
git commit -a -s
cd ..

Then I copied over my current installation:

cp -R htdocs/* git
cd git
git status # let's see what changed

There are lots of files, like uploads, that I want git to ignore, so I edited .gitignore accordingly. There weren’t any files I had added myself; otherwise I’d have had to run git add to let git know about them.

And let’s commit my changes:

git commit -a -s

Now, let’s go back to the original commit, the clean 2.1.2 wordpress, and start a branch from there:

git checkout HEAD^ # HEAD^ means parent commit of HEAD: the previous commit
git checkout -b tmp # create a new branch tmp from here

Now I’m in a branch without my own changes, forked from the master branch. Let’s apply the new wordpress on this branch:

cd ..
cp -R 2.2.0/* git
cd git
git status # see what changed

git status showed me that there are a few new files in wordpress 2.2.0; I ran git add on each of them and then committed it all:

git commit -a -s

Now I’ve got two branches:

  • master which contains wordpress 2.1.2 with my own changes on top as a commit
  • tmp which is forked from the wordpress 2.1.2 from the master branch without my own changes but with the 2.2.0 changes on top

What I want to do is re-apply the 2.2.0 changes on top of the commit with my current changes instead of on top of the 2.1.2 commit. For this, git has a very powerful tool called git-rebase:

git rebase master

This will search down the tree to the point where the current branch (tmp) forked from the target branch (master), and then re-apply all commits in between on top of the latest commit of the target branch.

Just as with diff/patch, I got a merge conflict. git rebase let me know, and git status showed me which files were conflicted. The difference with the diff/patch approach is that there are way fewer merge conflicts (git is smarter) and that the conflicts are way easier to identify: they’re inline in the original files. Not to mention that if I’d fucked up, I’d always have a way back.

After I fixed the merge conflicts, I ran git update-index on each conflicted file (to tell git it’s resolved) and then git rebase --continue.

Now I’ve got my updated wordpress in the git folder. I backed up the current installation, copied the files over from git, visited wp-admin/upgrade.php, and was done :).
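For reference, here is the whole dance condensed into one runnable sketch, with throw-away stand-in files instead of the real wordpress trees (this toy run happens to rebase without conflicts):

```shell
set -e
work=$(mktemp -d); cd "$work"
mkdir 2.1.2 2.2.0 htdocs
echo "core 2.1.2"  > 2.1.2/core.php
echo "stock theme" > 2.1.2/theme.php
cp 2.1.2/* htdocs/; echo "my tweaked theme" > htdocs/theme.php  # my install
cp 2.1.2/* 2.2.0/;  echo "core 2.2.0"       > 2.2.0/core.php    # upstream

mkdir git; cd git; git init -q
git config user.name test; git config user.email test@example.invalid
cp ../2.1.2/* .; git add .; git commit -qm "vanilla 2.1.2"
cp ../htdocs/* .; git commit -qam "my tweaks"
base=$(git symbolic-ref --short HEAD)   # "master" on 2007-era git
git checkout -q -b tmp HEAD^            # branch off the clean 2.1.2
cp ../2.2.0/* .; git commit -qam "vanilla 2.2.0"
git rebase -q "$base"                   # replay 2.2.0 on top of my tweaks
cat core.php theme.php                  # 2.2.0 core plus my theme tweak
```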

By the way: “I didn’t say Subversion doesn’t work. Subversion users are just ugly and stupid.” — Linus on this Google tech talk.

Sidenote, I switched from Hashcash to Akismet. Hashcash didn’t work anymore and Akismet theoretically should be the best solution because it isn’t based on security by obscurity.

The Future of the Internet

One of the prominent people behind the current internet discusses its history (telephony, wire oriented), its present (IP, endpoint oriented) and its future (?, data oriented) at google tech talks.

A short synopsis: the internet is having trouble at the moment because it was designed in a time when the problem was different. These days most traffic is duplicate data, which is a tremendous waste. Also, connecting to the internet (getting an address), and consequently keeping everything in sync, is hard. Van suggests and predicts a data-oriented internet: a bit like a secure P2P bittorrent network, but at the IP level instead of on top of IP.

It’s a very interesting talk, worth watching.

linux 2.6.21-bw

There aren’t official reiser4 patches for .20 or .21. There are quite a few branches that contain support for reiser4, but those are highly unstable (for instance the -mm tree). Also, because I wanted to give stacked git a shot, I started my own kernel tree:

One big bzip2-ed diff: bw-for-2.6.21.diff.bz2.
bzip2-ed tar with the separate patches: bw-for-2.6.21.tar.bz2.

This release contains reiser4, suspend2, the gentoo patches and a few patches to get everything working together nicely.

Virtual package for your python application

When you’ve got a big python application, you’ll usually split it up into modules. One big annoyance I’ve had is that a module inside a directory cannot (easily) import a module higher up in the tree. E.g.: drawers/gtk.py cannot import state/bla.py.

This is usually solved by making the application a package, which allows import myapp.drawers.gtk from everywhere inside your application. To make it a package, though, you need to add its parent directory to the sys.path list. Unfortunately, that also exposes every other subdirectory of that parent directory as a package.

However, when the package module (e.g. myapp) is already loaded, the path from which myapp was loaded is used to find the submodules (e.g. myapp.drawers.gtk) and sys.path isn’t consulted at all. So, here is the trick:

import sys
import os.path

# The directory containing this file: the root of the application.
p = os.path.dirname(__file__)
# Temporarily add the application's parent directory to the search path,
sys.path.append(os.path.abspath(os.path.join(p, "..")))
# import the application's root directory itself as a package,
__import__(os.path.basename(p))
# and restore the search path again.
sys.path.pop()

Note that this script doesn’t work when executed directly, because the __file__ attribute is only available when the script is loaded as a module.

Save this script as loader.py in the root of your application and import loader from the main script of your app; you’ll then be able to import modules as myapp.a.module, where myapp is the name of your application’s root directory.
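A self-contained demonstration of the trick (the myapp/drawers layout and module names are just an example): build a throw-away application tree on disk and show that importing loader makes myapp.drawers.gtk importable.

```python
# Demo: write a tiny application tree to a temp dir, then use the loader
# trick to import a submodule through the virtual "myapp" package.
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()
app = os.path.join(root, "myapp")
os.makedirs(os.path.join(app, "drawers"))

# loader.py: essentially the trick from the post.
with open(os.path.join(app, "loader.py"), "w") as f:
    f.write(
        "import sys\n"
        "import os.path\n"
        "p = os.path.dirname(__file__)\n"
        "sys.path.append(os.path.abspath(os.path.join(p, '..')))\n"
        "__import__(os.path.basename(p))\n"
        "sys.path.pop()\n"
    )
# Package markers plus a module to import from "higher up" in the tree.
open(os.path.join(app, "__init__.py"), "w").close()
open(os.path.join(app, "drawers", "__init__.py"), "w").close()
with open(os.path.join(app, "drawers", "gtk.py"), "w") as f:
    f.write("NAME = 'drawer'\n")

# The main script of the app lives in its root, so that directory would be
# on sys.path already; emulate that here.
sys.path.insert(0, app)
import loader  # importing it runs the trick
gtk = importlib.import_module("myapp.drawers.gtk")
print(gtk.NAME)  # drawer
```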