Unicode to ASCII (1)

When I want to generate usernames from real names, which can contain non-ascii characters, you can’t simply ignore the unicode characters. For instance, danielle@blaat.org is the right e-mail address for Daniëlle, danille@blaat.org isn’t.

There’s trick. Unicode has got a single code for ë itself, but it has also got a code which (simplified) adds ¨ on top of the previous character. The unicode standard defines a normal form in which (at least) all such characters, which can be, are represented using such modifiers. If you then simply ignore the non-ascii representable codes, you’ll get the desired result.

In python: unicodedata.normalize('NFKD', txt).encode('ASCII', 'ignore').

However, this isn’t the right solution. For instance, in german, one prefers ue as a replacement of ü over u.

GStreamer: accurate duration

When decoding, for instance, a variable-bitrate MP3, gstreamer reported durations are, to say the least, estimates. I’ve tried to get a better result in a few ways. First off, some files yield a duration tag, but even if you’re lucky and it is there, there are no guaranties about precision. After that I tried seeking to the end (GST_SEEK_END) of the stream and querying the position, which gstreamer didn’t like. Finally, routing the audio into a fakesink, waiting for the end of stream and then querying for the position gives the right result. It’s not the prettiest method, but it works.

This is a Python script that prints the duration of a media to stdout.

Benchmarking CouchDB (1)

I’ve written a small benchmark for couchdb to test it’s document creation performance. A script creates [tex]$N$[/tex] documents in total using bulk update to create [tex]$B$[/tex] at the same time with [tex]$T$[/tex] concurrent threads. The following graph show the time it takes to create an amount of documents against that amount of document for different values of [tex]$B$[/tex] with [tex]$T=1$[/tex].

And for [tex]$T=2[/tex] (two concurrent threads. Tested on a dual core machine)

The values of B are 1, 2, 4, 5, 8, 11, 16, 22, 32, 45, 64, 90, 128, 181, 256, 362, 512, 724 and 1024

As you can see, a higher value of [tex]$B$[/tex] causes the graph to shift to the right which means more [tex]$N[/tex] for the same time. Bulk update really does make a difference. Or non-bulk-update really sucks. Also adding threads does help a bit, but not as much as expected.

There are some more interesting graphs to plot ([tex]$B$[/tex] against [tex]$\overline {N \over \Delta T} $[/tex]). More graphs tomorrow.

(For those interested, the raw data from which these graphs were plotted.)

CouchDB document creation performance

CouchDB is a non-relational database which uses MapReduce inspired views to query data. There are lots of cool things to tell about its design, but I rather want to talk about its performance.

Today I’ve been busy hacking together a little script to import all e-mails of a long e-mail thread into a couchdb database to write views to extract all kinds of statistics. I already imported these e-mails into a MySQL database a few months ago, but was quite disappointed by the (performance) limitations of SQL. The e-mail thread contains over 20,000 messages which weren’t a real problem for MySQL. When importing, however, couchdb was adding them at a rate of only a few dozen per second with a lot of (seek)noise of my HDD.

So I decided to do a simple benchmark. First of, a simple script (ser.py) that adds empty documents sequentially. It’s averaging 16 per second. It occurred to me that couchdb waits for a fsync before sending a response and that asynchronously the performance would be way better. A simple modification to the script later (par.py) it still averaged 16 creations per second.

I guess, for I haven’t yet figured out how to let straces tell me, that it’s the fsync after each object creation which causes the mess. couchdb itself doesn’t write or seek a lot, but my journaling filesystem (XFS) does on a fsync.

Can anyone test it on a different filesystem?

Update Around 17/sec with reiserfs.

Update I had some trouble with the bulk update feature. I switched from svn to the 0.7.2 release. I got about 600/sec, which dropped to a steady-ish 350/sec when using sequential bulkupdates of 100 docs. Two bulk updates in parallel yield about 950/sec initially, dropping to 550/sec after a while. Three parallel updates yield similar performance.

Tickle

Tickle is a small Python serializer like Pickle. It however aims at generating smaller output:

>>> len(tickle('hello'))
7
>>> s = StringIO.StringIO()
>>> pickle.dump('hello', s)
>>> len(s.getvalue())
13

Though the difference is and remains quite small, this alone is useful for serialization of small things in the case of for instance RPC. However, usually you already know what kind of data to expect and you don’t really bother about the type information. This can be done by specifying a template:

>>> obj = []
>>> for i in xrange(100):
       obj.append((i, str(i)))
>>> len(tickle(obj))
629
>>> len(tickle(obj, template=(tuple, \
   ((tuple,((int,), (str,))),)*100)))
390

(Instead the *100 an iterator could be constructed, but that would clutter the example even more than it already is.) In comparison:

>>> s = StringIO.StringIO(); pickle.dump(obj, s)
>>> len(s.getvalue())
1680

One big disadvantage of Tickle is speed. Pickle has got a nice C implementation, which is quite fast. Psyco helps a bit but not really enough for really big things. Even more so pickle is a bit smarter: it builds a LUT for instances to avoid duplicate data. However, in the situations where Tickle will be used (by me at least) that isn’t too big of an issue.

You can download tickle.py via gitweb.

Virtual packages in python

When writing an application in python, it can be very convenient to be able to import a module from the top of your module tree. Eg. import myapp.config instead of (python 2.5 only) relative imports: import ....config. To do this one would have to make myapp a package. The normal way to do this is to put your application directory (which now has to be named myapp) somewhere in python’s package search path. This isn’t all to convenient.

The solution: manually set up your package module:

import os
import sys
import imp
def setup_virtual_package(name, path=os.curdir):
    """ Sets up a package at the given path with a given
     name """
    modulePath = os.path.abspath(path)
    f, fn, suffix = imp.find_module('__init__',
         [modulePath])
    imp.load_module(name, f, fn, suffix)
    sys.modules[name].__path__ = [modulePath]

Now import myapp.something works like a charm.

Virtual package for your python application

When you’ve got a big python application, you’ll usually split it up in modules. One big annoyance I’ve had is that a module inside a directory cannot (easily) import a module higher up in the tree. Eg: drawers/gtk.py cannot import state/bla.py.

This is usually solved by making the application a package. This allows for import myapp.drawers.gtk from everywhere inside your application. To make it a package though, you need to add the parent directory in the sys.path list. But unfortunately this also includes all other subdirectories of the parent directory as packages.

However, when the package module (eg: myapp) was already loaded, then the path from which myapp was loaded is used to find the submodules (eg: myapp.drawers.gtk) and sys.path isn’t looked at, at all. So, here is the trick:

import sys
import os.path

p = os.path.dirname(__file__)
sys.path.append(os.path.abspath(p+"/.."))
__import__(os.path.basename(p))
sys.path.pop()

Note that this script doesn’t work when directly executed, because the __file__ attribute is only available when loaded as a module.

Save this script as loader.py in the root of your application. import loader from the main script in your app, and you’ll be able to import modules by myapp.a.module, where myapp is the root directory name of your application.

Python Url Encoding

I looked in the Python Library Reference for a function to encode special characters in strings to be able to be put in Urls (and Uri’s), but found none.

Maybe I missed it?

Anyways, here is an implementation I wrote based on this article:

HexCharacters = "0123456789abcdef"

def UrlEncode(s):
    r = ''
    for c in s:
        o = ord(c)
        if (o >= 48 and o <= 57) or \
            (o >= 97 and o <= 122) or \
            (o >= 65 and o <= 90) or \
            o == 36 or o == 45 or o == 95 or \
            o == 46 or o == 43 or o == 33 or \
            o == 42 or o == 39 or o == 40 or \
            o == 41 or o == 44:
            r += c
        else:
            r += '%' + CleanCharHex(c)
    return r

def CleanCharHex(c):
    o = ord(c)
    r = HexCharacters[o / 16]
    r += HexCharacters[o % 16]
    return r

note: I used the character numbers instead the characters to compare with so I could do greater than and lesser than for the alphanumeric numbers. Maybe testing with a bitmask would be more efficient.

I have to write almost everytime I work with python my own something to hex function which doesn’t add the ‘0x’ in front the built-in hex does.

Either I can’t search or Python hasn’t got enough batteries included. I guess the first, if not I`ll submit my batteries.

Plugins, the web and python

Plugins for web applications seems to be discussed a lot more recently by bloggers (Zef did for instance).

Plugins usualy come in 2 shapes: hacks and modules.

The key difference between a hack and a module is that a hack changes existing behaviour and a module adds behaviour or maybe extends some. In Zef`s example he is talking about possibly adding a calander to google`s gmail – that is definitely a module.

When changing the whole interface by for instance being able to plan appointments inside other module`s (like viewing email) that would require changing those modules and basicly make it a hack.

Modules are easy to make and maintain for they form seperate entitites which don’t often conflict. Most web applications already support module-based plugins like IPB. Some more functionality can be added to modules by letting modules hook onto certain events in the other application, but this has its limits.

Where hacks are also widely used, although these are very hard to maintain and usualy conflict which eachother.

So what is the best way to write a plugin based application?

I was thinking about that when I was working on a computer version of the popular card game Magic. There are already a lot of computer version which let players play the game but the players have to apply the rules themselves.. the computer doesn’t do that for them for there are for every card a lot of exceptions and new rules are added every month. To make such a game plugins would be required.

The way I solved this is by putting everything in the game in modules/plugins. The only thing the game does is saying ‘start’ to all plugins. One plugin responds to that and that may for instance be the UI plugin which shows the main UI form and then sais ‘postgui-start’, where another plugin may respond that extends the UI form for some new features in the game and then calls ‘postseries4gui-start’. A card itself is represented by a big hashtable containing all info of that card dynamicly, including functions. Letting one create-card attack another basicly is just a big dynamicly getting properties and calling functions which all are overloaded for new rules which seems to work pretty fine.

Guess what I used to develop this highly dynamic interaction which eachother?

Python!

Python is the perfect language for plugins.

Now.. why don’t we already got really nice web applications allowing this kind of highly dynamic plugins? Web applications are although using sessions, primarily stateless.. On every page view the whole script is loaded in the memory, and executed again and then unloaded. Nothing keeps persistant. Using plugins generates a lot of overhead on espacially loading stuff for every plugin must be able to hook onto and interact with everything. Doing this with the current model webservers just don’t work.

I already made some suggestions how to solve this at the server side, but changing this would take a lot of time.

So to get back to how to get this done -now- and not later is to keep working with a modulair way exposing the most used hooks and data access.

A suggestion for gmail:

Allow SOAP based access to gmail`s features. Exactly the same amount as access as the user has got.
Then allow adding new menu items and hooks on which a call is made to your specified url, which then can work with gmail on the client-access level using SOAP calls.

Best would be for google to just expose the api of gmail and give a limited functionality dev download so people can make modules in the source and send them to google back. If they like it they’ll use it.. I guess google will come up with something like that once. Or at least I hope it.

Python Html Document Abstraction

Python is great!

>>> d = document()
>>> d.html.body.h1.value = "My Site!"
>>> d.html.body.p.value = "Welcome to this python generated site"
>>> str(d)
'<?xml version="1.0" encoding="UTF-8"?>< !DOCTYPE html PUBLIC "-//W3C//DTD XHTML
 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<body><p>Welcome to this python generated site</p>
<h1>My site!</h1></body><head><title>
</title></head></html>'

(Ignore the added slashes and the additional line breaks caused by wordpress)

By overloading the __get/set/delattr__ functions a html document can be represented like a real python object model.

I’ve just experimented a little bit with python code to ultimately go to write a framework to write nice dynamic python webbased applications in.

Although it appears that the names of the objects (html, body, p, etc) are the tag names, they aren’t. They are the identifiers of the tags.. in case the tag isn’t set by yourself but just created for it didn’t existed it uses as tag its alleged id.

The default created object when no object exists already with that id is a tag. This abstract document won’t be limited to tags. I’ve just made a styleTag class which allows:

d.html.head.style.body["font-family"] = 'verdana'

which is basicly the same as

d.html.head.style.["body"].font-family = 'verdana'

In contrary to the normal tag class where an item is an attribute, this is different in the style tag for CSS got a lot of characters which python doesn’t like (like #).

Being able to manipulate a style sheet that easily allows every custom tag (maybe a datetimepickercontrol) to set its own style information by just using simple python code.

For the styletag isn’t bound to putting its emitted css in the emitted-html string itself in case it is emit-ed in a specific context like a webserver, it can even create a seperate css for this purpose.

Python allows much more dynamic features in a dynamic framework like this than any other language, I`m quite enthousiastic about it and am playing with new idea’s like a little child :-).

All kinds of idea’s would be welcome..

Just wondering whether such a thing has already been written for Python.. anyone knows?

The dynamic of python

Python is dynamic (duh)

Dynamic typing
You don’t need to specify the type of an object anywhere.

def PrintSomething(something):
    print something

Everything is an object
Everything, yes, everything in Python is an object. Even an integer, a method, a class, etc.
Every object has got a type (__class__), which actually can be changed at runtime, which usually results in conflicts about the allocated size for the PyObject.

No typing needed
Python doesn’t know interfaces, for if a certain object wants to expose certain behaviour it just implements the required methods.
When you want a object to be comparible in python you don’t inherit something like IComparible like in .Net, but you just create the __cmp__ method.

Fields and methods can be changed (and are the same)
A method and a field in python are 2 the same things. They both rely in the object’s dictionary (__dict__) or of the object’s type’s dictionary (__class__.__dict__).
This allows you to use an object which implements __call__ instead of a method.
For __dict__ is writeable you can change / create attributes (members) at runtime:

>>> sys.__dict__['foo'] = "bar"
>>> sys.foo
'bar'

A class member function is a normal method, wrapped

>>> class exampleclass:
	value = ""
	def examplefunction(self):
		print self.value
>>> def examplereplacement(self):
	print "replacement!"
	print self.value
>>> instance = exampleclass()
>>> instance.examplefunction()
>>> instance.value = "foobar"
>>> instance.examplefunction()
foobar
>>> instance.examplefunction = examplereplacement
>>> instance.examplefunction()
Traceback (most recent call last):
  File "<pyshell#15>", line 1, in -toplevel-
    instance.examplefunction()
TypeError: examplereplacement() takes exactly 1 argument (0 given)
>>> instance.__class__.examplefunction = examplereplacement
>>> instance.examplefunction()
Traceback (most recent call last):
  File "<pyshell#17>", line 1, in -toplevel-
    instance.examplefunction()
TypeError: examplereplacement() takes exactly 1 argument (0 given)
>>> instance.__class__.__dict__['examplefunction'] = examplereplacement
>>> instance.examplefunction()
Traceback (most recent call last):
  File "<pyshell#19>", line 1, in -toplevel-
    instance.examplefunction()
TypeError: examplereplacement() takes exactly 1 argument (0 given)
>>> instance = exampleclass()
>>> instance.__class__.__dict__['examplefunction'] = examplereplacement
>>> instance.examplefunction()
replacement!

The bound methods seem to work a bit more tricky than they appear to work. Usually editing member functions via the __class__‘s __dict__ will result in a proper replacement.

These were just a few examples of what Python has to offer. These features aren’t limited to Python. Iron Python, an implementation of Python in .Net is capable of letting usual .Net objects to be manipulated in similar ways with Python via Iron Python. Although this doesn’t actually change the .Net objects, changing the wrapper resulting in the required similar behaviour is good enough.

The author of Iron Python has been recruited by Microsoft and is now working on making the CLR more dynamic.

I’d love a static dynamic language.

Base64 encoding/decoding algorithm

I’ve made some python functions to encode/decode base64. I’ve been trying to develop my own algorithm for base64 for the email protection script which can be found here.

Python again has proved itself again to be a great language for quickly developing stuff.

def tobase64(s, padd = False):
    b64s = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    b64p = "="
    ret = ""
    left = 0
    for i in range(0, len(s)):
        if left == 0:
            ret += b64s[ord(s[i]) >> 2]
            left = 2
        else:
            if left == 6:
                ret += b64s[ord(s[i - 1]) & 63]
                ret += b64s[ord(s[i]) >> 2]
                left = 2
            else:
                index1 = ord(s[i - 1]) & (2 ** left - 1)
                index2 = ord(s[i]) >> (left + 2)
                index = (index1 << (6 - left)) | index2
                ret += b64s[index]
                left += 2
    if left != 0:
        ret += b64s[(ord(s[len(s) - 1]) & (2 ** left - 1)) << (6 - left)]
    if(padd):
        for i in range(0, (4 - len(ret) % 4) % 4):
            ret += b64p
    return ret
def frombase64(s):
    b64s = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    b64p = "="
    ret = ""
    s2 = s.replace(b64p, "")
    left = 0
    for i in range(0, len(s2)):
        if left == 0:
            left = 6
        else:
            value1 = b64s.index(s2[i - 1]) & (2 ** left - 1)
            value2 = b64s.index(s2[i]) >> (left - 2)
            value = (value1 << (8 - left)) | value2
            ret += chr(value)
            left -= 2
    return ret

The algorithm doesn’t automaticly add the required =‘s while encoding, nor does it require while deencoding.

RGB to Hex (and why the python interactive mode is so damned handy)

Update 2010-02-01 Thanks to Sameer for pointing out the short way of doing this: return “#%02X%02X%02X” % (r,g,b)

def tohex(r,g,b):
	hexchars = "0123456789ABCDEF"
	return "#" + hexchars[r / 16] + hexchars[r % 16] + hexchars[g / 16] + hexchars[g % 16] + hexchars[b / 16] + hexchars[b % 16]

Python is very convenient when you need to program a simple algorithm. I programmed this RGB to Hex converter in less than 1 minute, including using the function for a few RGB values I needed to convert.

Usualy I just grab a pen and a paper and do the calculations myself for getting either the calculator of windows itself to show up and actually calculate stuff properly (looking to screen, writing down, forgetting to press C, starting again…), or writing a program in a language like C# would just take too much time.

I’ve been using Idle for quite some time as a very good replacement for both my calculator and my pen and paper.