Tickle is a small Python serializer like pickle. It aims, however, at generating smaller output:
>>> import pickle, StringIO
>>> from tickle import tickle
>>> len(tickle('hello'))
7
>>> s = StringIO.StringIO()
>>> pickle.dump('hello', s)
>>> len(s.getvalue())
13
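Most of those 13 bytes are protocol overhead rather than payload, by the way. CPython 2's default text protocol produces a stream like this for the same string (the memo index may vary):
>>> pickle.dumps('hello')
"S'hello'\np0\n."
The S opcode, the quotes, the memo PUT (p0), the newlines and the trailing STOP (.) all wrap around the five payload bytes.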
Though the difference is quite small, this alone is already useful for serializing small objects, for instance for RPC. Usually, however, you already know what kind of data to expect and don't care about the type information at all. You can tell Tickle that by specifying a template:
>>> obj = []
>>> for i in xrange(100):
...     obj.append((i, str(i)))
...
>>> len(tickle(obj))
629
>>> len(tickle(obj, template=(tuple,
...     ((tuple, ((int,), (str,))),) * 100)))
390
(Instead of the *100, an iterator could be constructed, but that would clutter the example even more than it already is.) In comparison:
>>> s = StringIO.StringIO(); pickle.dump(obj, s)
>>> len(s.getvalue())
1680
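Coming back to the parenthetical about the *100: one way to build the same template programmatically is a plain generator expression. This constructs a tuple identical in value to the hand-written one, so the result is unchanged; whether Tickle would also accept a lazy iterator here I won't claim.
>>> element = (tuple, ((int,), (str,)))
>>> template = (tuple, tuple(element for _ in xrange(100)))
>>> len(tickle(obj, template=template))
390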
One big disadvantage of Tickle is speed. Pickle has a nice C implementation (cPickle), which is quite fast. Psyco helps a bit, but not really enough for really big objects. Pickle is also a bit smarter: it builds a lookup table of the objects it has already seen, to avoid serializing duplicate data. However, in the situations where Tickle will be used (by me at least) that isn't too big of an issue.
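To make the lookup-table point concrete, here is a tiny sketch (not from the original post); the exact numbers it prints depend on the Python version:
import pickle

shared = 'x' * 100
one = pickle.dumps(shared, 2)           # the string on its own
many = pickle.dumps([shared] * 10, 2)   # ten references to the same object

# Thanks to pickle's memo, the repeated string is written once and then
# referenced, so 'many' is far smaller than ten times 'one'.
print len(one), len(many)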
You can significantly improve the size of the pickle stream by using the newer binary protocols. Using your example:
>>> obj = []
>>> for i in xrange(100):
...     obj.append((i, str(i)))
...
>>> len(pickle.dumps(obj))
2178
>>> len(pickle.dumps(obj, 2))
1098
Anyway, I saw that you used pickle.HIGHEST_PROTOCOL in your tickle module, so you probably already knew this. Yet you can improve the size even more by using the (undocumented) 'fast' attribute, which turns off memoization:
>>> from StringIO import StringIO
>>> s = StringIO()
>>> p = pickle.Pickler(s, 2)
>>> p.fast = True
>>> p.dump(obj)
>>> len(s.getvalue())
696
However, by disabling memoization, pickle will enter an infinite loop if it encounters a cyclic object, e.g. L = []; L.append(L). Now, if you really care about size, nothing stops you from using the gzip module.
Overall, there are a few neat things about your serialization protocol. I like how you organized the code, i.e. each type has its own class, instead of one big serializer class. I am sure there are some drawbacks to this approach, but it is neat nevertheless.
I forgot to add HIGHEST_PROTOCOL. Tickle itself would also enter an infinite loop on cyclic references, so .fast = True would be fair for a comparison (thanks for noticing it!). I wrote Tickle for an RPC module which relies on small messages with a predefined structure, for which Tickle with its templates is suited much better than pickle, and which is easier to maintain and handle than using pack.
gzip doesn't make a whole lot of difference on small RPC packets, I guess. Maybe delta compression would, though (which in this case is just a fancy word for persisting the state of the compressor between messages). I'll write two backends, one Tickle and one 'GzPickle', to test it.
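Roughly what I mean by persisting the compressor state, as a sketch (the class name and the length-prefix framing are made up for illustration; nothing here is part of Tickle):
import struct
import zlib

class DeltaCompressor(object):
    def __init__(self):
        self._c = zlib.compressobj()
        self._d = zlib.decompressobj()

    def pack(self, payload):
        # Z_SYNC_FLUSH emits everything buffered so far without resetting
        # the compressor's dictionary, which is what keeps the gain across
        # messages.
        data = self._c.compress(payload) + self._c.flush(zlib.Z_SYNC_FLUSH)
        return struct.pack('!I', len(data)) + data

    def unpack(self, message):
        (length,) = struct.unpack('!I', message[:4])
        return self._d.decompress(message[4:4 + length])
In practice each peer would keep its own compressor for outgoing messages and its own decompressor for incoming ones; bundling both in one class is just to keep the sketch short.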
To make writing templates more convenient, I'll write a pack-like format string that is converted magically by (un)tickle into an appropriate tuple. As long as it is written in Python itself it's suboptimal anyway; using classes really doesn't make that much of a difference compared to a table of functions in Python. I should write a C version of Tickle if it ends up being useful.
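Something along these lines is what I have in mind; the little format language ('i' for int, 's' for str, parentheses for tuples, '*N' for repetition) is invented for illustration and is not an existing Tickle feature:
import re

_TOKEN = re.compile(r'\(|\)|\*\d+|[is]')

def parse_template(fmt):
    tokens = _TOKEN.findall(fmt)
    pos = [0]

    def parse_item():
        tok = tokens[pos[0]]; pos[0] += 1
        if tok == 'i':
            item = (int,)
        elif tok == 's':
            item = (str,)
        elif tok == '(':
            children = []
            while tokens[pos[0]] != ')':
                children.extend(parse_item())
            pos[0] += 1                     # skip the ')'
            item = (tuple, tuple(children))
        else:
            raise ValueError('unexpected token %r' % tok)
        # an optional '*N' repeats the item N times
        if pos[0] < len(tokens) and tokens[pos[0]].startswith('*'):
            count = int(tokens[pos[0]][1:]); pos[0] += 1
            return (item,) * count
        return (item,)

    result = parse_item()
    return result[0] if len(result) == 1 else result

# parse_template('((is)*100)') yields the same nested tuple as the
# hand-written template in the first example above.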
I was curious to see if compressing was worthwhile. So, I did some tests.
import pickle, tickle
from StringIO import StringIO
from gzip import GzipFile

def make_pickle_dump(fast=False):
    def dump(obj, file):
        p = pickle.Pickler(file, 2)
        p.fast = fast
        p.dump(obj)
    return dump

pickle_dump = make_pickle_dump()
pickle_dump_fast = make_pickle_dump(fast=True)
pickle_load = pickle.load

def tickle_dump(obj, file):
    file.write(tickle.tickle(obj))

def tickle_load(file):
    return tickle.untickle(file.read())

def dump(obj, dump_method):
    s = StringIO()
    gz = GzipFile(fileobj=s, mode="wb")
    dump_method(obj, gz)
    gz.close()
    return s.getvalue()

def load(data, load_method):
    s = StringIO(data)
    gz = GzipFile(fileobj=s, mode="rb")
    return load_method(gz)
Again, using your example ‘obj’:
>>> len(dump(obj, pickle_dump))
717
>>> len(dump(obj, pickle_dump_fast))
342
>>> len(dump(obj, tickle_dump))
386
I am quite surprised to see how well pickle streams, generated using
the ‘fast’ attribute, compress. However for tickle streams, the gain
is minimal, which, I presume, means that your protocol is compact.
(tickle accepts a file parameter, named stream, and untickle accepts both a stream and a str.) Now I wonder how it depends on the size:
elements    pickle    pickle (fast)    tickle
       1        37               31        27
       2        48               38        33
       4        68               50        43
       8        99               71        62
      16       153              103        91
      32       254              150       147
      64       467              235       253
     128       952              431       489
     256      2005              834       974
     512      3959             1739      1782
Again with the same test object, but with a different number of elements. The first column is the number of elements, the second the size of pickle, the third of pickle with the fast attribute, and the last of Tickle.
Maybe writing a dialect of Tickle that is very compressible would be nice. I guess using one-byte markers to mark the beginnings and ends of data would compress very well. E.g.: [here starts a tuple] [an int] 5 [an int] 5 … [here ends the tuple], etc. Basically repr, but with binary data and no extra spaces.
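A toy version of that marker idea, just to pin it down (the tag bytes and the NUL terminators are made up here; this is not Tickle's actual format):
# One-byte tags for "tuple start", "tuple end", "int" and "str"; the int
# payload is rendered as text and NUL-terminated, purely for illustration.
TUPLE_BEGIN, TUPLE_END, INT, STR = '(', ')', 'i', 's'

def encode(obj):
    if isinstance(obj, tuple):
        return TUPLE_BEGIN + ''.join(encode(x) for x in obj) + TUPLE_END
    elif isinstance(obj, int):
        return INT + str(obj) + '\0'
    elif isinstance(obj, str):
        return STR + obj + '\0'
    raise TypeError('unsupported type: %r' % type(obj))

# encode((5, 5, 5)) -> '(i5\x00i5\x00i5\x00)'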
Oh wait, looking even further, Tickle wins again. I guess Tickle's encoding of big strings and big ints compresses better than pickle's:
elements    pickle    pickle (fast)    tickle
    1024      5514             3351      3367
    2048     10918             6910      6414
    4096     20763            12792     11833
    8192     39492            23893     22166
   16384     77385            46063     44242
   32768    152429            89529     86426
Or that pickle at least has a very compressible tuple format.
The format you are describing is the one currently used by pickle. In protocol 0, the text protocol, a tuple of three 5s is serialized as:
(I5
I5
I5
t.
Or in the newer binary protocols:
(K\x05K\x05K\x05t.
(where K is the opcode for 1-byte integers)
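If you want to poke at these opcode streams yourself, the pickletools module in the standard library (added in Python 2.3) prints an annotated disassembly:
>>> import pickle, pickletools
>>> pickletools.dis(pickle.dumps((5, 5, 5)))      # text protocol
>>> pickletools.dis(pickle.dumps((5, 5, 5), 2))   # protocol 2
Each call writes the opcode names, their arguments and their byte offsets to stdout.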
If you are curious, I have an article about how pickle works. It doesn't cover the details of the newer protocols, but it should give you a good start for learning them afterward. Anyway, I am currently working on the next version of the pickle protocol for Python 3K, so if you are interested in helping, just send me an email.