serialization – Intrepid Blog

Tickle is a small Python serializer like Pickle. However, it aims at generating smaller output:

    >>> len(tickle('hello'))
    7
    >>> s = StringIO.StringIO()
    >>> pickle.dump('hello', s)
    >>> len(s.getvalue())
    13

Though the difference is and remains quite small, this alone is useful when serializing small things, for instance for RPC. Usually, however, you already know what kind of data to expect, so you don’t need the type information in the output at all. You can tell Tickle to leave it out by specifying a template:

    >>> obj = []
    >>> for i in xrange(100): obj.append((i, str(i)))
    ...
    >>> len(tickle(obj))
    629
    >>> len(tickle(obj, template=(tuple, \
    ...     ((tuple,((int,), (str,))),)*100)))
    390

(Instead of the *100, an iterator could be constructed, but that would clutter the example even more than it already is.) In comparison:

    >>> s = StringIO.StringIO(); pickle.dump(obj, s)
    >>> len(s.getvalue())
    1680
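The idea behind a template, leaving out what both sides already know, can be illustrated without Tickle. The sketch below is not how Tickle works internally. It uses the standard struct module (Python 3, unlike the session above) with a made-up fixed layout, and the helper names pack_pairs and unpack_pairs are hypothetical:

```python
import pickle
import struct

pairs = [(i, str(i)) for i in range(100)]

# Self-describing output: pickle stores type information for every element.
pickled = pickle.dumps(pairs)


def pack_pairs(items):
    """Pack (int, str) pairs using a layout both sides agreed on beforehand.

    Assumed layout per pair: unsigned 16-bit int, 1-byte string length,
    then the ASCII bytes of the string. No type tags are written.
    """
    out = bytearray()
    for number, text in items:
        raw = text.encode("ascii")
        out += struct.pack("<HB", number, len(raw)) + raw
    return bytes(out)


def unpack_pairs(data, count):
    """Read back `count` pairs written by pack_pairs."""
    items, offset = [], 0
    for _ in range(count):
        number, length = struct.unpack_from("<HB", data, offset)
        offset += 3  # size of the "<HB" header
        text = data[offset:offset + length].decode("ascii")
        offset += length
        items.append((number, text))
    return items


packed = pack_pairs(pairs)
```

The packed form is smaller than the pickled one because the reader already knows the shape of the data, which is the same trade-off a template makes: less flexibility in exchange for fewer bytes.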

One big disadvantage of Tickle is speed. Pickle has a nice C implementation, which is quite fast. Psyco helps a bit, but not enough for really big things. Pickle is also a bit smarter: it builds a lookup table (LUT) of instances to avoid storing duplicate data. However, in the situations where Tickle will be used (by me at least), that isn’t too big an issue.
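To see that lookup table at work, here is a small sketch using only the standard pickle module (Python 3). The variable names are made up for the illustration. An object that is referenced twice is stored once and comes back as a single object:

```python
import pickle

shared = ["some", "shared", "data"]

# The same list object appears twice in the container.
container = [shared, shared]
restored = pickle.loads(pickle.dumps(container))

# After the round trip both slots still point at one and the same object.
same_object = restored[0] is restored[1]

# Two equal but distinct lists have to be written out separately,
# so the output grows compared to the shared case.
size_shared = len(pickle.dumps([shared, shared]))
size_copies = len(pickle.dumps([shared, list(shared)]))
```

A serializer without such a table would write the list twice and hand back two independent copies.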

You can download tickle.py via gitweb.

Yay, I am your average developer and I made yet another program with some kind of data that I want to dump on the hard drive and grab again later. Let’s use serialization!

Serialization was meant to be a tool that lets developers pick up their data from memory, squish it a bit and drop it onto the hard drive, sparing them hours of work writing a custom serializer to an existing format or, even worse, a home-made format. However, there are a lot of reasons not to serialize in .NET.

Ok, so I’ll spend endless hours writing my own algorithms to save my data, producing thousands of lines of code that are almost identical, but just not identical enough to prevent copy-and-paste bugs. After admiring my very own labor, I’ll spend twice as much time debugging the code.

… mm… isn’t there another way to serialize, one that doesn’t have too many downsides?
Ok, what are the demands:

It shouldn’t affect the way you design classes.
Thus:
It should not be based on public fields or properties.
This is probably the foremost cause of type design restrictions due to serialization support. While dropping public fields as the basis of serialization lets you protect certain fields, it also prevents those fields from being reached by any kind of automated serialization. This is only a minor setback, however: you’ll need to create Serialize and Deserialize methods to control serialization. Part of the current .NET serialization also relies on this principle, but it needs some modification:
The actual implementation of the (de)serialization algorithm should only optionally be in the type declaration.
The original .NET serialization restricts the user to serializing solely his/her own types. If string were not serializable, you wouldn’t be able to tell some kind of serialization handler: “hey, here is a type, serialize it”, because this handler would decide that you are a moron trying to serialize a string, which is not serializable because it has neither a Serialize nor a Deserialize method. You would be forced to write your own string serialization algorithm in every single type you want to serialize that contains a string. While this is doable, imagine that the string is replaced by a complex data structure containing inter-references… You would still be spending a lot of time creating a serialization algorithm for someone else’s type. The most obvious solution to this problem is to allow some kind of TypeSerializer which can be ‘registered’ with a serialization provider. The only drawback of this solution is that you can’t access any private fields, so you aren’t able to truly serialize every type, and there are scenarios imaginable where serialization is impossible. There is no easy solution to this. Luckily, this should be a rare event.
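As a rough sketch of the ‘registered’ TypeSerializer idea (in Python 3, to stay with the language used elsewhere on this blog, rather than .NET): every class and method name below is hypothetical and only illustrates the shape of the design. The string serializer lives outside the str type, and the Note serializer hands its string fields back to the provider instead of knowing how strings are written.

```python
class SerializationProvider:
    """Hypothetical provider that looks up a serializer by type."""

    def __init__(self):
        self._serializers = {}

    def register(self, target_type, serializer):
        self._serializers[target_type] = serializer

    def serialize(self, obj):
        serializer = self._serializers.get(type(obj))
        if serializer is None:
            raise TypeError("no serializer registered for %s" % type(obj).__name__)
        # The provider passes itself along so nested fields can be delegated.
        return serializer.serialize(obj, self)


class Utf8StringSerializer:
    """Serializer for a type we do not own (str): length prefix + UTF-8 bytes."""

    def serialize(self, value, provider):
        raw = value.encode("utf-8")
        return len(raw).to_bytes(4, "big") + raw


class Note:
    def __init__(self, title, body):
        self.title = title
        self.body = body


class NoteSerializer:
    """Serializer for our own type; string fields are delegated to the provider."""

    def serialize(self, note, provider):
        return provider.serialize(note.title) + provider.serialize(note.body)


provider = SerializationProvider()
provider.register(str, Utf8StringSerializer())
provider.register(Note, NoteSerializer())

data = provider.serialize(Note("hi", "there"))
```

Note that NoteSerializer only touches public attributes, which is exactly the drawback mentioned above: an external serializer cannot reach private state.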
You should be able to handle reference types as reference types.
The most fundamental flaw in the current .NET serialization, in my opinion, is that an object referenced from multiple locations cannot be serialized only once; in other words, ‘reference equals’ is not preserved. Programmers have been known to use ‘IDs’ to preserve some kind of referencing ability. This seems like a nice, simple solution, but it drains processor time every time an ID needs to be resolved.
There should be a Serialization and Deserialization Host/Provider.
Such a provider has three main functions that justify its existence:
- Storing type serializers
  All the different type serializers could also be stored in a static list, but this would enforce the ‘use’ of every single one of them (during lookup). You don’t want to know how to deserialize meat when you are a vegetarian!
- Storing serialized signatures together with their (deserialized) objects
  The only way to preserve references while still being able to serialize on demand is to keep track of the reference-type objects that have already been serialized or deserialized. This could also be managed in a static list, but the same rule applies: an ice cream shop doesn’t need to know what kinds of meat have already been serialized.
- Providing Serialize and Deserialize methods to the Serialize and Deserialize algorithms by being an argument.
  The provider should automatically redirect the serialization request for a field inside a Serialize method to the Serialize method of that field’s type, passing itself along as the serialization provider to the next ones in the chain. While this improves user comfort, it also gives the serializer the ability to store the created ‘serialized object’ or, in the case of a deserializer, the deserialized object. There is a catch to this system: when objects can reference each other, this could cause an endless chain (until the stack runs out). When the pork meat hasn’t finished serializing its fields, the olives don’t know that it is busy and will invoke a second serialization of the pork meat, causing a second serialization of the olives, and so on. This is easily solved by a not so elegant use of a Register method inside the Serialize or Deserialize method. This method adds the not yet completely (de)serialized object to the pairs of referenced objects and serialized objects, allowing any object further down the graph to use the correct reference.
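The pork and olives problem, and the Register trick that solves it, can be sketched as follows. This is a minimal Python 3 illustration under my own assumptions, not a .NET implementation: the classes Ingredient, SerializationProvider and DeserializationProvider and the record layout (a name plus the index of the referenced object) are all hypothetical.

```python
class SerializationProvider:
    def __init__(self):
        self.records = []   # serialized records; the list index is the reference id
        self._ids = {}      # id(obj) -> index (objects must stay alive meanwhile)

    def register(self, obj):
        """Reserve a slot for obj before its fields are serialized."""
        index = len(self.records)
        self.records.append(None)
        self._ids[id(obj)] = index
        return index

    def serialize(self, obj):
        """Return the reference id of obj, serializing it on first sight only."""
        if id(obj) in self._ids:
            return self._ids[id(obj)]
        return obj.serialize(self)


class DeserializationProvider:
    def __init__(self, records):
        self.records = records
        self._objects = {}  # index -> (possibly unfinished) object

    def register(self, index, obj):
        self._objects[index] = obj

    def deserialize(self, index):
        if index in self._objects:
            return self._objects[index]
        return Ingredient.deserialize(index, self)


class Ingredient:
    def __init__(self, name):
        self.name = name
        self.goes_with = None  # reference to another Ingredient

    def serialize(self, provider):
        # Register first, so anything further down the graph that points
        # back at us gets our index instead of starting a second run.
        index = provider.register(self)
        partner = None
        if self.goes_with is not None:
            partner = provider.serialize(self.goes_with)
        provider.records[index] = (self.name, partner)
        return index

    @classmethod
    def deserialize(cls, index, provider):
        name, partner = provider.records[index]
        obj = cls(name)
        provider.register(index, obj)  # again: register before following references
        if partner is not None:
            obj.goes_with = provider.deserialize(partner)
        return obj


pork, olives = Ingredient("pork"), Ingredient("olives")
pork.goes_with, olives.goes_with = olives, pork  # a cycle

writer = SerializationProvider()
root = writer.serialize(pork)
restored = DeserializationProvider(writer.records).deserialize(root)
```

Without the two register calls, serializing pork would recurse into olives, back into pork, and so on until the stack runs out. With them, the cycle is written as two records and restored as two objects that point at each other.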
It should be able to cope with any changes.
This is a hard one, and it is a problem that bugs all areas of development. There is no easy solution to this. The only way to handle it is to write some sort of conversion for older files, or to have different serialized object types for the same (sometimes changed) type. For example, you would have one serialized type of salad mix containing the amount of salad and another one which also contains the amount of tomatoes.
It should get rid of the string key-value pairs used by the Microsoft serialization.
Strings are slow and interpreting them is even slower. It’s like describing the shapes of the figures in your bank account with metaphors. There is one good thing about key-value pairs, and that is that they are unordered. This barely manages to hide the fact that it’s too slow. An alternative could look like this:
- A Guid recognized by the deserialization provider as a specific version of the serialized form of a specific type. This would take care of any problems with extensibility of the serialized type, because you would simply copy and paste the algorithm, kick it a bit (to fit your demands) and supply it with a new Guid, while preserving the old algorithm and possibly adding a friendly obsolete exception. Any type which supplies serialization and deserialization to itself or to another type should include a list of accepted Guids.
- A list of referenced serialized objects which could contain the fields of the type. How this is used is up to the creator of the serialization and deserialization algorithm.
- A byte array containing any ‘personal’ data of a type whose data isn’t distributed among fields (like natives).
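The three parts above could be sketched like this, using the salad mix as the type that changed over time. It is a Python 3 illustration under assumptions of my own: the SerializedObject record, the two Guid constants (arbitrary placeholder values), the byte layouts and the function names are all hypothetical.

```python
import struct
import uuid
from collections import namedtuple

# Guid + referenced serialized objects + raw 'personal' bytes.
SerializedObject = namedtuple("SerializedObject", "format_id references raw")

# One Guid per version of the serialized form (placeholder values).
SALAD_MIX_V1 = uuid.UUID("11111111-1111-1111-1111-111111111111")  # salad only
SALAD_MIX_V2 = uuid.UUID("22222222-2222-2222-2222-222222222222")  # salad + tomatoes


class SaladMix:
    def __init__(self, salad, tomatoes=0):
        self.salad = salad
        self.tomatoes = tomatoes


def read_v1(record):
    """Old algorithm, kept around so older files still load."""
    (salad,) = struct.unpack("<I", record.raw)
    return SaladMix(salad)


def read_v2(record):
    salad, tomatoes = struct.unpack("<II", record.raw)
    return SaladMix(salad, tomatoes)


# The list of accepted Guids for SaladMix, with the algorithm for each.
DESERIALIZERS = {SALAD_MIX_V1: read_v1, SALAD_MIX_V2: read_v2}


def write(mix):
    """Always write the newest format; SaladMix has no referenced objects."""
    raw = struct.pack("<II", mix.salad, mix.tomatoes)
    return SerializedObject(SALAD_MIX_V2, [], raw)


def read(record):
    try:
        reader = DESERIALIZERS[record.format_id]
    except KeyError:
        raise ValueError("unknown serialized format: %s" % record.format_id)
    return reader(record)


old_file = SerializedObject(SALAD_MIX_V1, [], struct.pack("<I", 3))
from_old = read(old_file)
round_trip = read(write(SaladMix(5, 2)))
```

Picking the algorithm is a single dictionary lookup on the Guid, and no field names are stored or parsed, which is the point of dropping the string key-value pairs.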
It should be secure.
Memory is pretty safe, thanks to access restrictions and the fact that only the application controlling the memory really knows what a byte means. The hard disk, or even worse the internet, is no match for the protection memory offers. There are a few ways to protect data on your hard disk. One of those is access policy. However, this is somewhat impractical and can’t be applied to internet traffic. Maybe the best solution is encryption. It could be applied to the whole file, which is inefficient but effective, or the sensitive data inside a serialized file could be stored in the ‘raw’ data and encrypted by the type serializer.

I think that there could be an implementation of serialization able to meet these demands. If it does meet them, there would be little objection left to using it, and it would be preferable to even the most optimized hand-crafted data ‘dumpers’. I will do some more research, and there may be a follow-up, more concrete article with some closer-to-code talk.

P.S. As you may have noticed, the posts now specify an author, and you may also have noticed that this post wasn’t written by the usual author. I’ve joined Bas Westerbaan in writing posts for Intrepid Blog.