The algorithm described in my previous post on how to compute a patch for a message to get the wanted crc computed from that text has a usefull application when wanting to get the possible texts that could compute to your own provided crc.
A little util I wrote, reversecrc (http://w-nz.com/projects/reversecrc/), lists (the endless) list of possible original texts, has been usefull for the Freelancer modding community, for in that game Crc is used a lot and was an obstacle.
Although it is almost impossible to reverse a 20 character string because it would be insane to check all the texts that could result in the hash, using one of the provided texts would do too as it results in the same hash anyways.
Subversion repositry: https://cvs.codeyard.net/svn/icrc
Well, I’m posting a comment here since, even though this was 2 years ago, I’ve seen other comments still being posted on your previous post about reversing CRC.
Recently I came across the issue of trying to revert CRC-32s. The game Trespasser uses CRC-32 hashes for its sound tables, among other things. Sound names apparently could have up to 32 characters (the longest I’ve seen has 31, I think), while the CRC-32 hashes take up only 4 bytes (32 bits) each. But in the final release of the game, not all the entries in the sound tables were used in the game levels, so there are several sounds there whose original names are unknown because no level mentions them. That’s when I started looking at the CRC-32 algorithm to see if there could be a way of retrieving the original information. You can follow the developments here:
http://trescom.3dactionplanet.gamespy.com/board/viewtopic.php?t=313
Of course, I arrived at pretty much the same results as everyone else, as I found out when I searched for other alternatives. But in this particular case, I’m not interested in modifying a string and adding 4 bytes that preserve the CRC-32 value, like other people do; I want to find the original text string, if possible. (Yes, other text strings will also give the same CRC-32 value, but where’s the fun in that?) I don’t know the length of the original string, but I either know or can guess some portions of it. It’s somewhat similar to your reversecrc code, but with those factors taken into account…
I’ve also noticed some interesting things… for example, if you have two similar strings with only a few characters’ difference, you an notice it when attempting to reverse the CRC-32. For example, if you have a string ending in “1” and another ending in “2”, the XOR of the 4-byte sequences calculated by the reverse algorithm will be 00 00 00 03, which means the last characters in both strings are different by two bits (the same XOR of 0x31 and 0x32, the ASCII codes for “1” and “2”, but also 0x61 and 0x62, or “a” and “b”). And if you have a string ending in “cmnt” (short for “cement”) and another in “dirt”, the XOR of the 4-byte sequences will be 07 04 1c 00. Still, you don’t know which is which, but a third string, for example, one ending in “metl” (short for “metal”) will have a XOR sequence of 09 0c 06 18 with “dirt” and of 0e 08 1a 18 with “cmnt”, so you can then identify all three. You can also add padding characters (asterisks, letters, digits, whatever) at the end and find out, for example, that a set of strings are similar but end in “dirt****”, “metl****” and “cmnt****” (with “****” standing for a set of 4 unknown characters which are the same for all three strings, meaning that the difference is not in the last characters in this case). I have to analyze the math behind all this yet, but those are the results from direct observation, and can be used to further help in finding the original string.
Also, I just read here:
http://sar.informatik.hu-berlin.de/research/publications/#SAR-PR-2006-05
about the possibility of the calculated 4 bytes not being necessarily sequential, and in fact I had been considering the same possibility for implementation. It will require the use of some sort of masks (possibly using something like “00101101” to indicate an 8-character string where the characters flagged “0” are to remain untouched and the ones flagged “1” are the ones to be either randomized or directly calculated with the 4-byte algorithm), but I laready have an idea of how to do it…
Well, I made a version in C++ with a GUI. First I used a recursive function to go through the character loop, then I tried a non-recursive approach which ended up very similar to your code (though first I considered using a single loop and modulo arithmetic, but it would have needed 64-bit or larger integers to be able to loop more than 5 characters).
http://rapidshare.com/files/49564875/wxTresCRC6.zip (recursive)
http://rapidshare.com/files/49569878/wxTresCRC7.zip (non-recursive)
Interestingly, the recursive version runs about 20% faster than the non-recursive one (though it could be needing some optimization yet).
I must admit I cheated a bit: since Trespasser basically is using a limited character set that includes lowercase letters, numbers and punctuation characters, instead of checking if the calculated 4-byte sequence only contained characters from the user-defined character set, I made this quick comparison:
if ((crc2 & 0xA0A0A0A0)== 0x20202020) {
to just see if the 4 bytes were in the range 0x20-0x3f or 0x60-0x7f.
Feel free to examine and modify the program if you want to (though I’m not done with it yet).
I posted a link to your blog at
http://en.wikibooks.org/wiki/Reverse_Engineering/Code_Hashing
.
I hope you don’t mind.
Well, I just finished a couple generic versions of my little program, and posted a description of its workings on the Trespasser forums:
http://trescom.3dactionplanet.gamespy.com/board/viewtopic.php?t=5638
I also included a link to this entry in your blog, hope you don’t mind, either…
And just when I thought I had done all I could with that program, I’ve had another idea which might work, but I’ll have to give it a try first… involving concatenated CRC calculations to find not 4 but 8 characters at a time. Of ccourse, I think it’s possible it would take just the same amount of time as looping the same number of characters, so that wouldn’t be much of an improvemnt…
Nice work :).