Tickle

Tickle is a small Python serializer, much like pickle, but it aims at generating smaller output:

>>> len(tickle('hello'))
7
>>> s = StringIO.StringIO()
>>> pickle.dump('hello', s)
>>> len(s.getvalue())
13

Though the difference is, and remains, quite small, it already helps when serializing many small things, for instance in RPC. Moreover, you usually already know what kind of data to expect, so you don’t really care about the type information. It can be left out by specifying a template:

>>> obj = []
>>> for i in xrange(100):
...     obj.append((i, str(i)))
...
>>> len(tickle(obj))
629
>>> len(tickle(obj, template=(tuple,
...     ((tuple, ((int,), (str,))),) * 100)))
390

(Instead of the *100, an iterator could be constructed, but that would clutter the example even more than it already is.) In comparison:

>>> s = StringIO.StringIO(); pickle.dump(obj, s)
>>> len(s.getvalue())
1680

One big disadvantage of Tickle is speed. Pickle has a nice C implementation, which is quite fast; Psyco helps a bit, but not really enough for really big things. Moreover, pickle is a bit smarter: it builds a lookup table (the memo) for objects it has already seen, to avoid writing duplicate data. However, in the situations where Tickle will be used (by me, at least) that isn’t too big of an issue.
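
For instance, pickle’s memo stores an object that appears twice only once and refers back to it, while two equal but distinct objects are written out in full:

>>> import pickle
>>> big = 'x' * 1000
>>> len(pickle.dumps([big, big], 2)) < len(pickle.dumps([big, 'x' * 1000], 2))
True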

You can download tickle.py via gitweb.

6 thoughts on “Tickle”

  1. You can significantly reduce the size of the pickle stream by using the newer binary protocols. Using your example:

    >>> obj = []
    >>> for i in xrange(100):
    ...     obj.append((i, str(i)))
    ...
    >>> len(pickle.dumps(obj))
    2178
    >>> len(pickle.dumps(obj, 2))
    1098

    Anyway, I saw that you used pickle.HIGHEST_PROTOCOL in your tickle module, so you probably already knew this. Still, you can shrink the output even more with the (undocumented) ‘fast’ attribute, which turns off memoization.

    >>> s = StringIO()
    >>> p = pickle.Pickler(s, 2)
    >>> p.fast = True
    >>> p.dump(obj)
    >>> len(s.getvalue())
    696

    However, by disabling memoization, pickle will enter an infinite loop if it encounters a cyclic object, e.g. L = []; L.append(L). And if you really care about size, nothing stops you from using the gzip module.
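
    In fact, it is exactly the memo that makes a cyclic object round-trip at all:

    >>> L = []
    >>> L.append(L)                # a list that contains itself
    >>> data = pickle.dumps(L, 2)  # works: the memo turns the cycle into a back-reference
    >>> L2 = pickle.loads(data)
    >>> L2[0] is L2
    True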

    Overall, there are a few neat things about your serialization protocol. I like how you organized the code, i.e. each type has its own class instead of one big serializer class. I am sure there are some drawbacks to this approach, but it is neat nevertheless.

  2. I forgot to add HIGHEST_PROTOCOL. Tickle itself would also enter an infinite loop on cyclic references, so .fast=True would be fair for the comparison (thanks for noticing it!). I wrote Tickle for an RPC module that relies on small messages with a predefined structure, for which Tickle with its templates is much better suited than pickle, and it is easier to maintain and handle than using pack.

    gzip doesn’t make a whole lot of difference on small RPC packets, I guess. Maybe delta compression would, though (which in this case is just a fancy word for persisting the state of the compressor between messages). I’ll write two backends, one Tickle and one ‘GzPickle’, to test it.
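
    Something like this is what I have in mind for the delta compression (made-up messages; the receiving side keeps a matching zlib.decompressobj() alive):

    import zlib

    compressor = zlib.compressobj()

    def compress_message(payload):
        # Z_SYNC_FLUSH pushes out everything produced so far without closing
        # the stream, so later messages can back-reference earlier ones
        # through the shared compressor state.
        return compressor.compress(payload) + compressor.flush(zlib.Z_SYNC_FLUSH)

    first = compress_message('ping 1')
    second = compress_message('ping 2')
    # the second message should come out smaller, since most of it is
    # already in the compressor's window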

    To make writing templates more convenient I’ll add a pack-like format string that is magically converted by (un)tickle to the appropriate tuple.
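
    Something along these lines, completely hypothetical (the codes and names are made up):

    # hypothetical codes: 'i' -> int, 's' -> str; a format like 'is'
    # describes one (int, str) pair, repeated `count` times
    _CODES = {'i': (int,), 's': (str,)}

    def template_from_format(fmt, count=1):
        inner = tuple(_CODES[c] for c in fmt)
        return (tuple, ((tuple, inner),) * count)

    # template_from_format('is', 100) gives the same tuple as the template
    # written out by hand in the post above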

    As long as it is written in Python itself it’s suboptimal anyway; using classes really doesn’t make that much of a difference compared to using a table of functions. I should write a C version of Tickle if it ends up being useful.

  3. I was curious to see whether compressing was worthwhile, so I did some tests.

    import pickle, tickle
    from StringIO import StringIO
    from gzip import GzipFile

    def make_pickle_dump(fast=False):
      def dump(obj, file):
        p = pickle.Pickler(file, 2)
        p.fast = fast
        p.dump(obj)
      return dump

    pickle_dump = make_pickle_dump()
    pickle_dump_fast = make_pickle_dump(fast=True)
    pickle_load = pickle.load

    def tickle_dump(obj, file):
      file.write(tickle.tickle(obj))

    def tickle_load(file):
      return tickle.untickle(file.read())

    def dump(obj, dump_method):
      # serialize obj through a gzip wrapper and return the compressed bytes
      s = StringIO()
      gz = GzipFile(fileobj=s, mode="wb")
      dump_method(obj, gz)
      gz.close()
      return s.getvalue()

    def load(data, load_method):
      # the reverse: wrap the compressed bytes in a GzipFile and deserialize
      s = StringIO(data)
      gz = GzipFile(fileobj=s, mode="rb")
      return load_method(gz)

    Again, using your example ‘obj’:

    >>> len(dump(obj, pickle_dump))
    717
    >>> len(dump(obj, pickle_dump_fast))
    342
    >>> len(dump(obj, tickle_dump))
    386

    I am quite surprised to see how well pickle streams, generated using
    the ‘fast’ attribute, compress. However for tickle streams, the gain
    is minimal, which, I presume, means that your protocol is compact.

  4. (tickle accepts a file parameter (named stream), and untickle accepts both a stream and a str.)
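
    So the gzip helpers from the previous comment could also stream directly; roughly:

    import tickle

    def tickle_dump(obj, file):
        # write straight into the (gzip) file object via the stream parameter
        tickle.tickle(obj, stream=file)

    def tickle_load(file):
        # untickle accepts a stream directly
        return tickle.untickle(file)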

    Now I wonder how it depends on the size:

    elements  pickle  pickle(fast)  tickle
    1         37      31            27
    2         48      38            33
    4         68      50            43
    8         99      71            62
    16        153     103           91
    32        254     150           147
    64        467     235           253
    128       952     431           489
    256       2005    834           974
    512       3959    1739          1782

    Again with the same test object, but with a varying number of elements. The first column is the number of elements, the second the size of pickle, the third pickle with the fast attribute, and the last Tickle (all in bytes).
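
    (Roughly, the loop behind these numbers is something like this, reusing the dump helpers from the previous comment:)

    for n in (1, 2, 4, 8, 16, 32, 64, 128, 256, 512):
        obj = [(i, str(i)) for i in xrange(n)]
        sizes = [len(dump(obj, d)) for d in (pickle_dump, pickle_dump_fast, tickle_dump)]
        print n, sizes[0], sizes[1], sizes[2]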

    Maybe writing a dialect of Tickle that compresses very well would be nice. I guess using one-byte markers to mark the beginning and end of data would be very compressible. E.g.: [here starts a tuple] [an int] 5 [an int] 5 … [here ends the tuple], etc. Basically repr, but with binary data and no extra spaces.

  5. Oh wait, looking even further, Tickle wins again. I guess Tickle’s big strings and big ints compress better than pickle’s:

    elements  pickle  pickle(fast)  tickle
    1024      5514    3351          3367
    2048      10918   6910          6414
    4096      20763   12792         11833
    8192      39492   23893         22166
    16384     77385   46063         44242
    32768     152429  89529         86426

    Or at least that pickle has a very compressible tuple format.

  6. “Maybe writing a dialect of Tickle that compresses very well would be nice. I guess using one-byte markers to mark the beginning and end of data would be very compressible. E.g.: [here starts a tuple] [an int] 5 [an int] 5 … [here ends the tuple], etc. Basically repr, but with binary data and no extra spaces.”

    The format you are describing is the one currently used by pickle. In the old text protocol (protocol 0), a tuple of three 5s is serialized as:
    (I5
    I5
    I5
    t.

    Or in the newer binary protocols:
    (K\x05K\x05K\x05t.
    (where K is the opcode for 1-byte integers)
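
    (You can inspect these opcodes yourself with the pickletools module from the standard library:)

    import pickle, pickletools

    # prints one line per opcode: MARK, three BININT1 (the 'K' above),
    # TUPLE, a memo BINPUT and STOP
    pickletools.dis(pickle.dumps((5, 5, 5), 1))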

    If you are curious, I have an article about how pickle works. It doesn’t cover the details of the newer protocols, but it should give you a good start for learning them afterward. Anyway, I am currently working on the next version of the pickle protocol for Python 3K, so if you are interested in helping, just send me an email.
