Spam, spam and more spam

I noticed I had an enormous amount of spam in my moderation queue.

The plugin I used to protect myself from spam, wp-hashcash, seemed to have been outsmarted by spammers.

A download of the newest version did the trick.

If anyone experiences problems with posting comments, please mail me.

Update I: It seems some spam got through even this version. I’d better get to making my own custom changes to wp-hashcash.

Update II: I changed the secret codes in the plugin, and I broke it for a while. Either one of those could have resulted in the fortunate (hopefully not temporary) stop of the spam.

Update III: According to Elliot Back, the creator of hashcash, the spammers brute-force the secret value. Changing it is usually enough to keep them at bay for a while. He’s working on a newer version which features bigger, and thus harder to brute-force, values. I just hope they won’t suck up too much of my bandwidth.

Update IV: Unfortunately there seems to be a lot of computing power, or a hack, behind the breaking of the hashcash security -_-; I keep getting spam :-/

Strange pagerank

Google uses PageRank to give a site a ranking of importance. PageRank is strange.

One thing to note is that PageRank is more or less logarithmic. A PageRank of 2 is a lot better than just double a PageRank of 1.
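To get a feel for what that means, here is a tiny sketch; Google doesn’t publish the scale, so the base is an arbitrary assumption:

# if the toolbar PageRank is roughly the logarithm of an underlying score,
# the score behaves something like base ** pagerank; the base is unknown,
# 8 is picked arbitrarily here
base = 8
print(base ** 5 / base ** 3)  # a PR 5 page scores roughly 64 times a PR 3 page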

By far the most visited site on my server, this blog, has got a PageRank of 3.

Other sites on my server which get some visitors, like w-nz.com and board.w-nz.com, have got a PageRank of 2.

What caught my attention is that the page xr12.com, which basically is a filler page containing a link to the xr12 wiki, has got a PageRank of 5. That is the same PageRank as a big site like Newgrounds!

Maybe Google values xr12.com highly because it is about one topic and is the only site about that topic, whereas this blog has got tons of links about practically everything from all kinds of sources.

PageRank itself isn’t the factor Google sorts by, the context is, although PageRank is still an indicator. Maybe Google values a few very specific links above tons of links about very different topics.

Using template variables in extension tags with MediaWiki

MediaWiki, the wiki used by Wikipedia, is a powerful wiki.

It allows you to include self-made templates to which you can pass variables. A very useful template on Wikipedia is the stub template. When you add {{stub}} at the start of an article, it inserts text explaining to the reader that the article isn’t finished yet, but rather a stub. With {{Album_infobox|Name=Sehnsucht|Cover=Sehnsucht_cover.jpg|…}} an infobox is inserted, which Wikipedia uses for all albums.

When the built-in features don’t suffice, you don’t necessarily have to hack MediaWiki itself to add features. You can add an extension which hooks into a part of the parser to create your own logic for your own custom tags.

I created a wiki to store the lyrics I already had on a forum, and found templates especially useful.

However, there was a slight annoyance when working with templates. In several cases a link in a template can be ambiguous (there are two songs with the same name), but you don’t want an ugly suffix after the name (like “Deliverance (Album of Opeth)” instead of “Deliverance”). To solve this you have to pass both the name of the album and the name of the album’s page, which is annoying to do every time.
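To illustrate the verbose workaround (the template and parameter names here are made up), every song page has to spell out both values:

{{Album link|album=Deliverance|page=Deliverance (Album of Opeth)}}

while the template itself contains little more than a piped link:

[[{{{page}}}|{{{album}}}]]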

MediaWiki knows no logic, so I thought I’d solve this with a custom tag containing some basic logic: use the page name as the link text when no custom text is specified, and otherwise use the custom text. But that just wouldn’t work.

When parsing a page, MediaWiki first escapes all nowiki elements, extensions, comments, links and so on, replacing each with a unique ID. Then it parses the formatting and replaces variables with their values. After that it substitutes the unique IDs back with the proper content, which in this case included the output of my extension. That was flawed: the variables weren’t replaced before the extension tag was escaped out, and the wiki link ([[pagename]]) wasn’t processed either, because the extension was still escaped at that point. So I ended up with [[{{{album}}}]] on my page instead of a Sehnsucht link.

Trying to fetch template variables inside the extension wouldn’t work either, because the extension is evaluated at the moment the template is already included in the main page. Avoiding that by disabling the cache would require parsing the whole template every time it is viewed, which is expensive. (Each link on a wiki page has to be looked up during parsing to see whether it exists.)

So I googled around a bit and found that Mikesch Nepomuk had submitted a patch on one of the mailing lists to fix this by escaping extensions separately, after the variables are expanded. Apparently it hadn’t been approved, or at least it wasn’t included in my release yet, so I tried to apply it, but it failed: the patch was a bit too old. So I applied it by hand instead of with GNU patch, and with some tinkering I got it to work.

You can download the patch for the 1.5 beta4 here, if you are interested.

And I thought I would never touch PHP again.

Dual Cores, Traffic and Gaming

Dual cores are the new thing in the processor business. They claim to be perfect for people who want to do different things at the same time, but in reality they are just a cheap way to deliver more.

The truth is that two 500 MHz processors just don’t deliver the computing power of one 1000 MHz processor, given of course that they are of the same architecture. This is because a lot of additional logic is required to let two processors work together without getting in each other’s way. When two processors both want to write a different thing to the same spot of memory, you would get unpredictable results. There are all kinds of ways to solve this, for instance by locking a region of memory for a while, making the other processor wait for the first one to finish. But this all creates a lot of overhead and complexity.
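As a sketch of the kind of coordination this takes on the software side (Python here, just to keep it short): two workers updating one shared counter have to take a lock around every update, or updates can get lost.

import threading

counter = 0
lock = threading.Lock()

def work():
    global counter
    for _ in range(100000):
        with lock:          # only one thread may do the read-modify-write at a time
            counter += 1

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)              # 200000: no update was lost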

You could compare it with driving a car. Driving a car is relatively easy: you just have to steer around some static obstacles, no big deal. When you know the way you could even do it with your eyes closed. That is, unless there are other people driving cars. Then you are keeping an eye not just on the road, but on the other people on the road too. This not only lowers your maximum speed and efficiency, you can’t just drive full speed over a junction, it also increases the complexity and the likelihood of errors.

The same goes for dual-core processors (and for hyper-threaded and ordinary multi-processor platforms). The comparison isn’t entirely valid, though, because having two processors doesn’t mean you have to do two things at the same time. The issue is that what you would normally do is done by only one processor, so you are wasting the other’s capacity.

A good example of this is games. Games tend to be single-threaded, which gives the best performance, since no processor time is wasted on multithreading, and it is the easiest thing to do, for multithreading is rather complicated. Complicated enough that there have been a few lengthy discussions on the Mono mailing lists about how to lock a simple object.

Because we are getting dual-core, and probably ‘more’-core, processors later on, for the companies are too lazy to make decent processors [1], games should become multithreaded to exploit the full capability of the machines they run on.

Although it makes writing high-performance applications a lot more difficult, it will surely benefit distributed computing, for the step from different processors to different machines is smaller than the step from one processor to more.

[1] Natively threaded CPUs, like dual-core, multi-processor and hyper-threaded systems, are ideal for server applications, where lots of short-lived requests have to be handled. Switching and allocating threads on a software-threaded processor would create too much overhead for such simple requests.

Practically Reversing CRC

The algorithm described in my previous post, which computes a patch that makes a message hash to a wanted CRC, has a useful application when you want to list possible texts that compute to a CRC you provide.

A little util I wrote, reversecrc (http://w-nz.com/projects/reversecrc/), lists the (endless) series of possible original texts. It has been useful for the Freelancer modding community, for in that game CRC is used a lot and was an obstacle.

Although it is almost impossible to recover a specific 20-character string, because it would be insane to check all the texts that could result in the hash, using one of the texts the tool provides works just as well, as it results in the same hash anyway.

Subversion repository: https://cvs.codeyard.net/svn/icrc

Opeth’s Ghost Reveries

Sweden’s OPETH and Jens Bogren have finished the recording and mixing of the group’s latest opus, “Ghost Reveries”. The album was mastered on Thursday (June 16) at the Cutting Room facilities in Stockholm. The total running time will be around 65 minutes.

The two released tracks, Ghost of Perdition and The Grand Conjuration, sound promising.

The album will be for sale on my birthday :-), the 30th of August.

And they will perform it during their European tour, which I’ll visit on the 11th of September at the 013 in Tilburg.

I’m thrilled.

Reversing CRC

Cyclic Redundancy Code

CRC is a hash which is frequently used as a checksum for data, for instance in archives. Who hasn’t seen bad CRC errors when opening corrupted zips? CRC is a very old algorithm, and over time it has changed a lot from the original idea.

The original idea behind CRC was to represent the data you want the hash of as one big number and divide it by a certain number, called the polynomial, taking the remainder of the division as the hash. For instance: 23 % 3 = 2 (% is the modulus operator, which gives the remainder of a division).

The initial problem was that division is a rather expensive operation. They wanted to simplify CRC to make it easier to implement in hardware and to make it faster. They did this by getting rid of the carry used in the subtraction of the division:

Normal subtraction (binary): 10101 - 01100 = 01001
Without carry: 10101 - 01100 = 11001

Subtraction without a carry is basically a bitwise eXclusive OR (XOR), which returns 1 only where one of the operands is 1 and the other is 0.

The resulting algorithm was faster, but it still worked bit by bit, which isn’t really what a computer likes. A computer works best with one to four bytes at a time. To make CRC faster, they cached 8 of these bit operations at a time by precomputing the result for every possible starting byte and putting the results in a table, called an XOR table.

The required code for the CRC calculation itself now became very simple:

hash = (hash >> 8 ) ^ table[data ^ (0xff & hash)]

They changed the CRC algorithm once more by making it reflected. This means that the input data is reversed bitwise: 011101011 <-> 110101110. This was done because most of the hardware chips at the time reversed data bitwise. Because it was too much work to reflect each byte of incoming data, they changed the algorithm that generates the CRC table so that the table has the effect of reflecting the data for them.

This is, by the way, not totally correct: the result still differs between the reflected and the non-reflected algorithm, for they wouldn’t buffer the whole piece of data to reverse it, but reflected it per byte during the calculation.

At this moment CRC barely resembles the original idea of a modulus.
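Putting the pieces together, a table-driven, reflected CRC-32 comes down to something like this (a Python sketch using the common 0xEDB88320 polynomial and the usual 0xFFFFFFFF initial value and final XOR; zlib is only pulled in to check the result):

import zlib

def make_table():
    # precompute the effect of 8 bit-steps for every possible byte
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table

TABLE = make_table()

def crc32(data):
    hash = 0xFFFFFFFF                  # the usual initial value
    for byte in data:
        # the per-byte update from above
        hash = (hash >> 8) ^ TABLE[(hash ^ byte) & 0xFF]
    return hash ^ 0xFFFFFFFF           # the usual final XOR

assert crc32(b"123456789") == zlib.crc32(b"123456789")   # both 0xCBF43926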

Reversing CRC

First off, credits for the original idea of CRC patching go to anarchriz.

CRC is a cryptographically very weak algorithm. It can easily be reversed, for it has the property that with 4 bytes appended to a message you can steer the computed hash to any value you want. You can change the whole message and add 4 patch bytes to patch the hash to the one you like.

The ability to patch a CRC also makes it possible to very efficiently generate all possible source data for a checksum. It is still a brute-force method, but you get 4 bytes for free, and patching is faster than calculating.

Patching is basically walking back the way CRC works. A CRC step takes the hash, shifts it one byte to the right (dropping one byte) and XORs it with a table entry. The nice thing about normal CRC is that the first byte of a table entry is unique to that entry.

Because the first byte of the entry is unique to that entry, and because it ends up in the hash XORed with 0 (for that is what is shifted in from the right), you can work back the whole entry that was used.

For instance:

My hash is 0x12345678; this means it was XORed with the entry in the CRC table that starts with 0x12. When you XOR the hash with that full entry, the next byte is again nothing but the start of the table entry that was used one step earlier.

When you run the current hash backwards like this, you know what will be XORed onto the patch bytes you supply. XORing that with your wanted hash is all that is needed.

The resulting algorithm is surprisingly simple:

– Put the current hash, byte-wise reversed, at the start of a buffer. Put the wanted hash, byte-wise reversed, right after it in the same buffer.
– Look up the table entry that starts with byte 7 of the buffer. XOR that entry onto the buffer at position 4, and XOR the entry’s index onto position 3. Repeat this 4 times, decrementing every position by one each time (thus looking at positions 7, 6, 5, 4, XORing the entries at 4, 3, 2, 1 and the indices at 3, 2, 1, 0).

When you’ve done this the required patch bytes are the first 4 bytes in the buffer.
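A Python sketch of exactly these steps (this is not the downloadable script itself; the table is the standard reflected CRC-32 table, and the 0xFFFFFFFF initial value and final XOR of CRC-32 are undone explicitly so that we work on the internal register):

import zlib

def make_table():
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table

TABLE = make_table()
FIRST = {entry >> 24: index for index, entry in enumerate(TABLE)}  # leading byte -> entry index

def crc32_patch(data, wanted):
    # returns 4 bytes p such that zlib.crc32(data + p) == wanted
    buf = bytearray(8)
    buf[0:4] = (zlib.crc32(data) ^ 0xFFFFFFFF).to_bytes(4, 'little')  # current hash, byte-wise reversed
    buf[4:8] = (wanted ^ 0xFFFFFFFF).to_bytes(4, 'little')            # wanted hash, byte-wise reversed
    for p in (7, 6, 5, 4):
        index = FIRST[buf[p]]            # the entry whose leading byte is byte p of the buffer
        for k in range(4):               # XOR that entry onto positions p-3 .. p
            buf[p - 3 + k] ^= (TABLE[index] >> (8 * k)) & 0xFF
        buf[p - 4] ^= index              # XOR the entry's index one position lower
    return bytes(buf[0:4])               # the patch is the first 4 bytes of the buffer

patch = crc32_patch(b"any message you like", 0x12345678)
assert zlib.crc32(b"any message you like" + patch) == 0x12345678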

Some CRC variants have duplicate first bytes in their table, which means there can be more than one possible table entry at a given step. You should just branch and try all of them.

I’ve made a simple Python script to work with CRC-32 and to patch a hash.

You can download it here. And there is a C implementation here.

Update: Fixed a bug in the example.

Update: Andrea Pannitti wrote an implementation in Java.

Update: I found a great article by Stigge, Plötz, Müller and Redlich about reversing CRC. They pretty much nailed it by bringing it down to elementary algebra. They explain an algorithm which can patch data at any given point to adjust the CRC to any desired value.

Linux Mount Security

With the Linux setuid attribute you can have a file executed as its owner, even when another user runs it. This feature has traditionally been used for system tools in Linux which require root access to run, but which must also be runnable by normal users.

It came to mind that a floppy with the ext2 filesystem could contain files owned by root with this setuid attribute set, which in theory would allow anyone who is allowed to mount floppies, or other media with a filesystem that supports this attribute, to run a program with root access.

On my system I have this entry in my /etc/fstab, which allows a user to mount the floppy:

/dev/floppy/0 /mnt/floppy auto noauto,user 0 0

I made a simple C program which would show the contents of /etc/shadow, the file that contains the users’ password hashes, and chmodded it accordingly. (chmod = showshadow; chmod go+x showshadow; chmod u+rs showshadow)

I ran my program, and it seemed to work! The contents of the /etc/shadow file came streaming across my console.

Euphorically, I went to another Linux computer and tried the same trick.

darkshines@darkshines-one /mnt/floppy $ ./showshadow
bash: ./showshadow: Permission denied

Disappointed but relieved, it seemed that Linux already had some precautions against a root setuid-ed executable.

I copied the contents of the folder, preserving permissions, to another folder outside /mnt/floppy, and there it all worked again. I couldn’t do that with a normal user account though, for a normal user can’t preserve the owner when copying a file.

I wondered how Linux would secure it, and tried to run the program while it was unchmodded.

darkshines@darkshines-one /mnt/floppy $ ./showshadow.unchmodded
bash: ./showshadow.unchmodded: Permission denied

The warning comes from bash, which can’t execute the program at all. (Note that it isn’t the program failing to access /etc/shadow.) Even after recompiling it on the floppy itself, it seems that Linux prevents any program from being executed on a user-mounted filesystem.

I reckon that security precaution is a bit too strict. Copying the file from the medium to a normal folder and then executing it is still possible, so I find it a bit strange that nothing on the user-mounted medium can be executed at all.
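If I read the mount(8) documentation right, this is simply the user option implying noexec and nosuid. Since later options override earlier implied ones, an fstab line along these lines should give executables back while still blocking the setuid trick:

/dev/floppy/0 /mnt/floppy auto noauto,user,exec,nosuid 0 0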

This could cause trouble with network mounts, where a user has a share on a server for his files and that share is the only space he can access on a terminal, while it still has to be a user mount for security reasons.

London Bombings

One thing that amazed me about the bombings is that the people didn’t panic, but remained calm.

Even though the people of London and the rest of England were hurt, they weren’t terrorized.

Although it might sound harsh, the terrorists have lost this fight: they gained nothing by these bombings.

Update: although it is a bit inconvenient, this post is my 100th blog post. And currently the 100th comment has been made too.

Welcome to 213.133.112.101

When you read this, this page has been served from my virtual server.

This also means that luckily the whole web content transfer has been a success :-D.

Sadly things haven’t gone as smoothly with the qmail configuration for my mail accounts.

So please mail me at bas.westerbaan@gmail.com instead of my @w-nz.com account.

Despite the minor setback with the mail (which is giving me headaches), it has been rather fun to do, and I learned a lot. (Like that putting a restriction on the amount of memory PHP may use isn’t there just to annoy the customer, but to save the server from some terrible PHP scripts.)

When everything is working fine I’ll post some more on setting up your own server.

I desperately need some sleep.

Update: it seems there were quite some bugs in the transfer after all: FTP only recursing to a certain depth; .htaccess files not being transferred; permissions changing; PHP sucking up too much memory (again); not enough sleep 😛
But it all seems to be working right now.

rel=”nofollow”

Quite some time ago Google started honouring the rel="nofollow" attribute-value pair on HTML a (link) tags, which should prevent spammers from gaining a high PageRank by spamming blogs with comments.
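A spammed comment link then goes out looking something like this (example.com standing in for the spammer’s site), telling the search engine not to count it:

<a href="http://example.com/" rel="nofollow">some spammy link</a>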

It is useless.

Spamming costs almost nothing; if there is the slightest gain in it for the spammers, they will keep doing it. Of all the hundreds of thousands of people visiting blogs and seeing comment spam, a few will still follow the link, and those few are enough for the spammers.

One could argue that the websites the comment spam points to won’t get a really high PageRank anymore. This is only partially true, for only some people install the code that adds the rel="nofollow" attribute to links that can be spammed. Even though the highest-ranking blogs have already installed it, the tons of small blogs are still enough to raise the PageRank enormously.

In my opinion search engines should just ignore links which are considered spam.

Function Recursion Overhead

In courses I followed at university and in most books about programming, recursion is praised as a good way to solve some problems.

An example of recursion (a non-existing API is used for the sake of simplicity):

int GetFolderSize(Folder folder)
{
 int sum = 0;
 foreach(File file in folder.files)
 {
  sum += file.Size;
 }
 foreach(Folder subFolder in folder.folders)
 {
  sum += GetFolderSize(subFolder);
 }
 return sum;
}

This example function calculates the combined size of all files in a folder and those inside any subfolder.

It does its job and it is very clear how it works, but it is inefficient. Take another way to write the algorithm:

int GetFolderSize(Folder folder)
{
 Stack<Folder> stack = new Stack<Folder>();
 int sum = 0;
 stack.Push(folder);
 while(stack.Count > 0)
 {
  Folder current = stack.Pop();
  foreach(File file in current.files)
  {
   sum += file.Size;
  }
  foreach(Folder subFolder in current.folders)
  {
   stack.Push(subFolder);
  }
 }
 return sum;
}

This version is harder to understand. Basically it maintains a stack of folders for which the total size of the contained files has not yet been counted. While working through the folders on the stack, it pushes any new subfolders it encounters and pops the ones it has processed.

The latter method is way more efficient.

For each call in the first (recursive) version of the algorithm, a new sum, a subFolder and an enumerator over folder.folders are required, and these stack up. When you are 5 levels deep in the first algorithm, you already use more than double the amount of memory the second algorithm ever requires.

Additionally, a function call itself requires memory of its own, which, depending on the language used, can be quite significant. Debugging also gets really hard when you have a stack trace of hundreds of functions.

Using your own stack instead of polluting the call stack (the place where function calls are kept) sounds great. There is only one little problem: it can get pretty complex.

Take for instance a program that writes the files of a folder and all its subfolders to a file in that same folder, say a playlist generator. This requires the algorithm to detect when all child items of a certain item have been processed. The problem is that the parent item is already gone from the stack by the time its children are processed. Getting this to work takes some additional fields and a few function pointers, and it can get messy; a sketch of the basic trick follows below.
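For the simple version of that problem, a marker on the stack is enough. A minimal Python sketch (the Folder objects with .name, .files and .folders are hypothetical) that only handles a folder once all of its subfolders have been handled:

def write_playlists(root):
    stack = [(root, False)]              # (folder, children_done)
    while stack:
        folder, children_done = stack.pop()
        if not children_done:
            # revisit this folder after its subfolders have been processed
            stack.append((folder, True))
            for sub in folder.folders:
                stack.append((sub, False))
        else:
            # every subfolder has been handled by now; emit this folder's playlist
            print(folder.name, [f.name for f in folder.files])

That is about as far as the simple trick goes; anything richer quickly needs the extra bookkeeping mentioned above.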

The best way to get it to work is to mimic what a normal recursive method would have done, which involves a lot of extra helper classes, all depending on how the original recursive algorithm worked. In a language with a garbage collector (which automatically frees unused objects) it is manageable, but in a language without one, like C(++), the trouble doubles, because you also need to manage the lifetime of the helper classes.

I’ve noticed that it is very tempting to use recursion, and that there are almost no occasions where something like your own stack is used, especially in the complex cases. A shame, for it is challenging :-P.

Bye Bye Spam

I just installed Hash Cash, which is an anti spam plugin for WordPress.

Hash Cash protects this blog from spam by requiring the client to execute JavaScript which calculates a checksum of the content from a seed that is very hard to extract.

Since I installed it I haven’t got any spam comments :-).

The downside is that it keeps anyone who hasn’t got a JavaScript-enabled browser from posting a comment.

Now I still need some good means to combat trackback spam; just putting trackbacks under moderation isn’t good enough, for they keep coming.