Tickle is a small Python serializer like pickle. It aims, however, at generating smaller output:
>>> import pickle, StringIO
>>> from tickle import tickle
>>> len(tickle('hello'))
7
>>> s = StringIO.StringIO()
>>> pickle.dump('hello', s)
>>> len(s.getvalue())
13
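Most of those 13 bytes are protocol overhead rather than payload, by the way. CPython 2's default text protocol produces a stream like this for the same string (the memo index may vary):
>>> pickle.dumps('hello')
"S'hello'\np0\n."
The S opcode, the quotes, the memo PUT (p0), the newlines and the trailing STOP (.) all wrap around the five payload bytes.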
Though the difference is quite small, this alone is already useful for serializing small objects, for instance for RPC. Usually, however, you already know what kind of data to expect and don't care about the type information at all. You can tell Tickle that by specifying a template:
>>> obj = []
>>> for i in xrange(100):
...     obj.append((i, str(i)))
...
>>> len(tickle(obj))
629
>>> len(tickle(obj, template=(tuple,
...     ((tuple, ((int,), (str,))),) * 100)))
390
(Instead of the *100, an iterator could be constructed, but that would clutter the example even more than it already is.) In comparison:
>>> s = StringIO.StringIO(); pickle.dump(obj, s)
>>> len(s.getvalue())
1680
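Coming back to the parenthetical about the *100: one way to build the same template programmatically is a plain generator expression. This constructs a tuple identical in value to the hand-written one, so the result is unchanged; whether Tickle would also accept a lazy iterator here I won't claim.
>>> element = (tuple, ((int,), (str,)))
>>> template = (tuple, tuple(element for _ in xrange(100)))
>>> len(tickle(obj, template=template))
390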
One big disadvantage of Tickle is speed. Pickle has a nice C implementation (cPickle), which is quite fast. Psyco helps a bit, but not really enough for really big objects. Pickle is also a bit smarter: it builds a lookup table of the objects it has already seen, to avoid serializing duplicate data. However, in the situations where Tickle will be used (by me at least) that isn't too big of an issue.
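To make the lookup-table point concrete, here is a tiny sketch (not from the original post); the exact numbers it prints depend on the Python version:
import pickle

shared = 'x' * 100
one = pickle.dumps(shared, 2)           # the string on its own
many = pickle.dumps([shared] * 10, 2)   # ten references to the same object

# Thanks to pickle's memo, the repeated string is written once and then
# referenced, so 'many' is far smaller than ten times 'one'.
print len(one), len(many)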
You can significantly improve the size of the pickle stream by using the newer binary protocols. Using your example:
>>> obj = []
>>> for i in xrange(100):
...     obj.append((i, str(i)))
...
>>> len(pickle.dumps(obj))
2178
>>> len(pickle.dumps(obj, 2))
1098
Anyway, I saw that you used pickle.HIGHEST_PROTOCOL in your tickle module, so you probably already knew this. Yet you can improve the size even more by using the (undocumented) 'fast' attribute, which turns off memoization:
>>> from StringIO import StringIO
>>> s = StringIO()
>>> p = pickle.Pickler(s, 2)
>>> p.fast = True
>>> p.dump(obj)
>>> len(s.getvalue())
696
However, by disabling memoization, pickle will enter an infinite loop if it encounters a cyclic object, e.g. L = []; L.append(L). Now, if you really care about size, nothing stops you from using the gzip module.
Overall, there are a few neat things about your serialization protocol. I like how you organized the code, i.e. each type has its own class, instead of one big serializer class. I am sure there are some drawbacks to this approach, but it is neat nevertheless.
I forgot to add HIGHEST_PROTOCOL. Tickle itself would also enter an infinite loop on cyclic references, so .fast = True would be fair for a comparison (thanks for noticing it!). I wrote Tickle for an RPC module which relies on small messages with a predefined structure, for which Tickle with its templates is suited much better than pickle, and which is easier to maintain and handle than using pack.
gzip doesn't make a whole lot of difference on small RPC packets, I guess. Maybe delta compression would, though (which in this case is just a fancy word for persisting the state of the compressor between messages). I'll write two backends, one Tickle and one 'GzPickle', to test it.
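Roughly what I mean by persisting the compressor state, as a sketch (the class name and the length-prefix framing are made up for illustration; nothing here is part of Tickle):
import struct
import zlib

class DeltaCompressor(object):
    def __init__(self):
        self._c = zlib.compressobj()
        self._d = zlib.decompressobj()

    def pack(self, payload):
        # Z_SYNC_FLUSH emits everything buffered so far without resetting
        # the compressor's dictionary, which is what keeps the gain across
        # messages.
        data = self._c.compress(payload) + self._c.flush(zlib.Z_SYNC_FLUSH)
        return struct.pack('!I', len(data)) + data

    def unpack(self, message):
        (length,) = struct.unpack('!I', message[:4])
        return self._d.decompress(message[4:4 + length])
In practice each peer would keep its own compressor for outgoing messages and its own decompressor for incoming ones; bundling both in one class is just to keep the sketch short.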
To make writing templates more convenient, I'll write a pack-like format string that is converted magically by (un)tickle into an appropriate tuple. As long as it is written in Python itself it's suboptimal anyway; using classes really doesn't make that much of a difference compared to a table of functions in Python. I should write a C version of Tickle if it ends up being useful.
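Something along these lines is what I have in mind; the little format language ('i' for int, 's' for str, parentheses for tuples, '*N' for repetition) is invented for illustration and is not an existing Tickle feature:
import re

_TOKEN = re.compile(r'\(|\)|\*\d+|[is]')

def parse_template(fmt):
    tokens = _TOKEN.findall(fmt)
    pos = [0]

    def parse_item():
        tok = tokens[pos[0]]; pos[0] += 1
        if tok == 'i':
            item = (int,)
        elif tok == 's':
            item = (str,)
        elif tok == '(':
            children = []
            while tokens[pos[0]] != ')':
                children.extend(parse_item())
            pos[0] += 1                     # skip the ')'
            item = (tuple, tuple(children))
        else:
            raise ValueError('unexpected token %r' % tok)
        # an optional '*N' repeats the item N times
        if pos[0] < len(tokens) and tokens[pos[0]].startswith('*'):
            count = int(tokens[pos[0]][1:]); pos[0] += 1
            return (item,) * count
        return (item,)

    result = parse_item()
    return result[0] if len(result) == 1 else result

# parse_template('((is)*100)') yields the same nested tuple as the
# hand-written template in the first example above.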
I was curious to see if compressing was worthwhile. So, I did some tests.
import pickle, tickle
from StringIO import StringIO
from gzip import GzipFile

def make_pickle_dump(fast=False):
    def dump(obj, file):
        p = pickle.Pickler(file, 2)
        p.fast = fast
        p.dump(obj)
    return dump

pickle_dump = make_pickle_dump()
pickle_dump_fast = make_pickle_dump(fast=True)
pickle_load = pickle.load

def tickle_dump(obj, file):
    file.write(tickle.tickle(obj))

def tickle_load(file):
    return tickle.untickle(file.read())

def dump(obj, dump_method):
    s = StringIO()
    gz = GzipFile(fileobj=s, mode="wb")
    dump_method(obj, gz)
    gz.close()
    return s.getvalue()

def load(data, load_method):
    s = StringIO(data)
    gz = GzipFile(fileobj=s, mode="rb")
    return load_method(gz)
Again, using your example ‘obj’:
>>> len(dump(obj, pickle_dump))
717
>>> len(dump(obj, pickle_dump_fast))
342
>>> len(dump(obj, tickle_dump))
386
I am quite surprised to see how well pickle streams, generated using
the ‘fast’ attribute, compress. However for tickle streams, the gain
is minimal, which, I presume, means that your protocol is compact.
(tickle accepts a file parameter, named stream, and untickle accepts both a stream and a str.) Now I wonder how it depends on the size:
elements    pickle    pickle (fast)    tickle
       1        37               31        27
       2        48               38        33
       4        68               50        43
       8        99               71        62
      16       153              103        91
      32       254              150       147
      64       467              235       253
     128       952              431       489
     256      2005              834       974
     512      3959             1739      1782
Again with the same test object, but with a different number of elements. The first column is the number of elements, the second the size of pickle, the third of pickle with the fast attribute, and the last of Tickle.
Maybe writing a dialect of Tickle that is very compressible would be nice. I guess using one-byte markers to mark the beginnings and ends of data would compress very well. E.g.: [here starts a tuple] [an int] 5 [an int] 5 … [here ends the tuple], etc. Basically repr, but with binary data and no extra spaces.
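A toy version of that marker idea, just to pin it down (the tag bytes and the NUL terminators are made up here; this is not Tickle's actual format):
# One-byte tags for "tuple start", "tuple end", "int" and "str"; the int
# payload is rendered as text and NUL-terminated, purely for illustration.
TUPLE_BEGIN, TUPLE_END, INT, STR = '(', ')', 'i', 's'

def encode(obj):
    if isinstance(obj, tuple):
        return TUPLE_BEGIN + ''.join(encode(x) for x in obj) + TUPLE_END
    elif isinstance(obj, int):
        return INT + str(obj) + '\0'
    elif isinstance(obj, str):
        return STR + obj + '\0'
    raise TypeError('unsupported type: %r' % type(obj))

# encode((5, 5, 5)) -> '(i5\x00i5\x00i5\x00)'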
Oh wait, looking even further, Tickle wins again. I guess Tickle's encoding of big strings and big ints compresses better than pickle's:
elements    pickle    pickle (fast)    tickle
    1024      5514             3351      3367
    2048     10918             6910      6414
    4096     20763            12792     11833
    8192     39492            23893     22166
   16384     77385            46063     44242
   32768    152429            89529     86426
Or that pickle at least has a very compressible tuple format.
The format you are describing is the one currently used by pickle. In protocol 0, the text protocol, a tuple of three 5s is serialized as:
(I5
I5
I5
t.
Or in the newer binary protocols:
(K\x05K\x05K\x05t.
(where K is the opcode for 1-byte integers)
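If you want to poke at these opcode streams yourself, the pickletools module in the standard library (added in Python 2.3) prints an annotated disassembly:
>>> import pickle, pickletools
>>> pickletools.dis(pickle.dumps((5, 5, 5)))      # text protocol
>>> pickletools.dis(pickle.dumps((5, 5, 5), 2))   # protocol 2
Each call writes the opcode names, their arguments and their byte offsets to stdout.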
If you are curious, I have an article about how pickle works. It doesn't cover the details of the newer protocols, but it should give you a good start for learning them afterward. Anyway, I am currently working on the next version of the pickle protocol for Python 3K, so if you are interested in helping, just send me an email.