Python HTML Document Abstraction

Python is great!

>>> d = document()
>>> d.html.body.h1.value = "My Site!"
>>> d.html.body.p.value = "Welcome to this python generated site"
>>> str(d)
'<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML
 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<body><p>Welcome to this python generated site</p>
<h1>My Site!</h1></body><head><title>
</title></head></html>'

By overloading the __getattr__, __setattr__ and __delattr__ methods, an HTML document can be represented as a real Python object model.
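
The same kind of attribute overloading can be sketched in PHP 5 (the language most of the posts here use) via its __get/__set magic methods. This is purely an illustration of the mechanism, not the actual Python implementation:

<?php
// Illustration only: accessing an undefined member creates a child
// "tag" under that identifier, so a document can be built up through
// plain member access, much like the Python document above.
class Tag {
	private $children = array();
	public $value = '';

	public function __get($id){
		if (!isset($this->children[$id])){
			$this->children[$id] = new Tag();   // the id doubles as the tag name
		}
		return $this->children[$id];
	}

	public function __set($id, $v){
		$this->__get($id)->value = $v;
	}
}

$d = new Tag();
$d->html->body->h1->value = "My Site!";
?>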

I've been experimenting a bit with Python code, with the ultimate goal of writing a framework for building nice dynamic Python web-based applications.

Although it appears that the names of the objects (html, body, p, etc.) are tag names, they aren't: they are the identifiers of the tags. When a tag isn't set by yourself but is created automatically because it didn't exist yet, its identifier is used as its tag name.

The object created by default, when no object with that id exists yet, is a tag. But this abstract document won't be limited to tags; I've just made a styleTag class which allows:

d.html.head.style.body["font-family"] = 'verdana'

which is basically the same as

d.html.head.style["body"].font-family = 'verdana'

Contrary to the normal tag class, where an item is simply an attribute, the style tag treats the two differently, for CSS uses a lot of characters which Python doesn't like (such as # and -).

Being able to manipulate a style sheet that easily allows every custom tag (a datetimepicker control, for example) to set its own style information using simple Python code.

Since the styleTag isn't bound to putting its emitted CSS in the emitted HTML string itself, it can even create a separate CSS file for this purpose when it is emitted in a specific context, like a web server.

Python allows far more dynamic features in a framework like this than any other language. I'm quite enthusiastic about it and am playing with new ideas like a little child :-).

All kinds of ideas are welcome.

I'm also wondering whether such a thing has already been written for Python. Anyone know?

Why PHP sucks (and I still use it)

  • PHP is slow. Not just slow, but really slow. A simple benchmark runs in 1 millisecond in C and takes 2 milliseconds in .NET. Python takes 600 milliseconds, for instead of being a native or JIT-ted language it is an interpreted language. But PHP, even though it is also an interpreted language and hasn't got the enormous object overhead Python has, still takes 12000 milliseconds; that is 12 seconds.
  • Stupid workarounds. Since PHP itself is rather slow, you have to rely as much as you can on built-in function calls instead of doing anything yourself. It is for instance faster to load a list by deserializing a string than by reading line by line through a file, although the line-by-line read would be the faster of the two if both methods were implemented natively. Another little issue here is that many of the very quick functions in PHP are only available in the newer versions (e.g. file_put_contents), which forces you to add an if statement with a home-made implementation of the function, usually slower by a factor of 10. Alternatively you can choose a way which doesn't exactly reimplement the missing function but still does the job for the circumstance, sometimes even better: when you want to write data to a file in a streaming fashion and file_put_contents isn't available, don't rewrite file_put_contents, but call fwrite a few times instead, which avoids having the whole file in memory as one string (see the sketch after this list).
  • Stateless. Although HTTP is a stateless protocol, that doesn't mean a server-side scripting framework has to be stateless too. PHP attempts to become stateful with its session implementation, serializing a session array to the hard disk for every session, but this isn't very efficient. Even one global array that persists between requests would yield an enormous performance boost: not only because it needs no file reads, writes, serialization and deserialization (and optionally queries, when you don't like the PHP session system), but also because it would let you keep that little bit of important cache between requests that otherwise has to be read, deserialized, serialized and written again for every mere page view!
    Allowing such a persistent array, however, poses a security risk in the way PHP works at the moment. They should add contexts, allowing one Apache instance running mod_php to execute files in different contexts, each with its own settings (and persistent data).
  • There is no satisfying solution in PHP. For every single common issue in PHP there is no simple solution that works perfectly or even reasonably well.
    Take templates, for instance. There are basically a few ways to handle templates in PHP. Usually it comes down to either caching .php scripts which are then executed (as Smarty does), or using a class for every major template section, where every function fetches a template bit. In both methods, executing PHP code is required just to replace a certain tag with its replacement. PHP has been designed to do a lot more than that, and carries a lot of interpretation overhead. Using str_replace is a lot faster than an inline PHP block, or even in-string PHP variables ("Example: {$example}"). The second way, using classes and functions, is even worse, for the whole class and all its functions first need to be loaded into memory; it is basically a lot slower.
    The proper way to use templates would be to stream through the input, replacing tags with their values while outputting the result. But this isn't feasible, for PHP is so slow that loading the whole template into one string is faster anyway.
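
To make that file_put_contents workaround concrete, here is a minimal sketch, assuming PHP 4 as the older target (write_streaming and $chunks are illustrative names, not anything from the post itself):

<?php
// Naive workaround: reimplement the missing PHP 5 function on PHP 4.
// It works, but a home-made version is much slower than the native
// call, and it still needs the whole string in memory at once.
if (!function_exists('file_put_contents')){
	function file_put_contents($filename, $data){
		$fp = fopen($filename, 'wb');
		$bytes = fwrite($fp, $data);
		fclose($fp);
		return $bytes;
	}
}

// Better when output is produced piece by piece: skip the emulation
// and stream the pieces out with fwrite, so the file contents never
// have to exist as one big string.
function write_streaming($filename, $chunks){
	$fp = fopen($filename, 'wb');
	foreach ($chunks as $chunk){
		fwrite($fp, $chunk);
	}
	fclose($fp);
}
?>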

The only reason I still use PHP is that it is simply the best-supported server-side scripting language there is.

Skins and performance in PHP

There are several ways to use skins in PHP; I've put some of them through a performance test.

Basically you can use either evaluated PHP, or a string that undergoes str_replace's.

Evaluating PHP from a file seems to be faster than replacing tags in a string, for PHP streams through the file during execution instead of handling one big string. The difference is minimal though (15% in my tests).

When the PHP code is placed in a string instead of in a file, though, which has to be done when a template is cached in a database or generated by compiling from another format, it is significantly slower than using str_replace's on a normal string (600%!). This is because the original source code, the intermediate code and the return value of the code all take up a lot of memory.

Either use cached PHP files, or a string with tags (instead of PHP code) in a database; never the other way around, which happens very often.
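
A minimal sketch of the two variants compared above (the tag syntax and file names are assumptions for the example):

<?php
// Variant 1: an evaluated PHP template cached as a normal .php file;
// PHP streams through the file while executing it.
$title = 'My page';
$body = 'Hello!';
include './skins/default/page.php';   // a file mixing HTML with small PHP echo blocks

// Variant 2: a plain string template with tags, e.g. fetched from a
// database, filled in with str_replace; no PHP evaluation involved.
$template = '<h1>{title}</h1><div>{body}</div>';
echo str_replace(array('{title}', '{body}'),
                 array($title, $body),
                 $template);
?>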

Caching in PHP

It is useful to cache certain things between PHP script executions.
Some boards written in PHP cache the forum structure, so that a difficult query doesn't have to be run every time a guest views the board.

There are a few ways to cache data:

  • PHP script. Data is stored as a normal PHP file, which is included during execution.
  • Serialized object in a file. Data is serialized and dumped to a file, which is read on every page view.
  • Database storage. Data is serialized, stored in a database, and queried on every page view.

There is a persistent myth that using a database would be way slower than a normal PHP file.

I've run a few tests caching a ~16 kB PHP array; the results:

Serialized object stored in file: 0.0015ms
Object in PHP script: 0.0121ms
Serialized object stored in mySQL database: 0.0015ms

It seems to be quicker to use a serialized array in a file as a configuration file than a config.php script!
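
A minimal sketch of the serialized-file variant (the file name and array contents are just assumptions):

<?php
$configfile = './cache/config.dat';

// writing the cache: serialize the array and dump it to a file
$config = array('sitename' => 'Example', 'perpage' => 25);
$fp = fopen($configfile, 'wb');
fwrite($fp, serialize($config));
fclose($fp);

// a later page view: one file read and one unserialize, with no
// PHP parsing of a config.php script involved
$config = unserialize(file_get_contents($configfile));
echo $config['sitename'];
?>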

Although databases are just as quick as normal files, I favor them, for they are much more scalable.

PHP Security Consortium

The PHPSC is a site that maintains a lot of resources on PHP security.

For all those starting out with, or occasionally using, PHP this is a must-read.

I'd also advise people who want to know whether their site is safe enough to try playing the other side and do some hacking themselves: hackthissite.org. It is easier than you might think.

Torrent sharing p2p network

In my previous post I discussed Exeem. Exeem is (or actually will be, for it hasn't been launched yet, just announced) a p2p network for sharing, rating and commenting on torrents.

What is a torrent? A torrent is a small file used by the BitTorrent p2p file redistribution system to identify a certain file, or files, that you can download. You first need the torrent for a file or folder before you can download it.

The major problem with this is that it is impossible to use a BitTorrent client itself to search for the downloads you want; therefore a lot of sites have been created over time containing huge searchable collections of torrents. One of these sites was suprnova.org, which was recently taken down due to legal issues.

As I elaborated in my previous post, Exeem will probably suck. So someone will have to do it right by making an alternative.

What issues would need to be solved to create such a p2p torrent sharing network?

  • No centralized client list. Most p2p networks were shut down because they had a centralized tracker to which a client connected to receive the file list and all the users available for a certain file. Instead of a centralized server, every single client should tell other clients who else is in the network and which files there are. Given a built-in list of IPs, a client can update that list by querying those IPs for better IPs. By rating an IP by uptime and connection bandwidth (see the sketch after this list), a big, changing group of frequently online users could provide the other IPs and forward search queries for the rest.
  • Searching. How do we handle a search query? At this point our client is connected to a few big clients in its neighbourhood which are frequently online; let's call them supernodes for now. When we send them a search query, they look in their cache to see whether they already have the result; if not, they look through their own torrents to see whether one of those matches the query, and if none does, they just forward the query to another, slightly smaller supernode. The problem with this method is that one query can travel a huge number of nodes, and when you are connected with good bandwidth you end up doing nothing but passing queries through to other nodes. To solve this, the query feedback (sent when a result is found) should contain the source, along with the estimated number of different search queries the node that had the result can answer. That way a client can form a shortcut when it finds a node which either has a lot of searchable files or an enormous cache, and which of course is also online often and has decent bandwidth.
  • Rating. Alongside every torrent you download or expose for upload, there would be a metadata file containing a description, a rating and comments on the torrent itself. The problem with this system is that descriptions and ratings can change, and it is very hard to keep every instance of a torrent on the whole network synchronized. It is possible to send a message through the network to the original node you received the file from with the new comment, or you could search for the torrent again by unique id and message the nodes found to have the torrent too. All these methods still involve a lot of passing messages around.
  • Privacy
  • Client-side 'hacking'. If everyone used the default client, which automatically selects supernodes and lets people pass queries through, everything would work fine. The big problem is that people would quite possibly start using rogue client applications which just leech from the network. Incorporating methods to get rid of leechers works as long as most people still use the default client, but when people massively switch to rogue clients, the network won't block the leechers anymore; it will rather do away with itself, for everyone is leeching. This is the major threat to a p2p network like this, which relies heavily on everyone helping the others, whether they like it or not, by proxying, caching and passing through various queries to maintain privacy and decentralization.
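
As a toy illustration of the node-rating idea from the first point: score known peers by uptime and bandwidth, and keep the best ones as supernode candidates. The weights and the peer fields are assumptions, not part of any real protocol.

<?php
// rate a peer: hours online plus a bandwidth bonus
function rate_peer($peer){
	return $peer['uptime_hours'] + $peer['bandwidth_kbps'] / 100;
}

function cmp_peers($a, $b){
	$ra = rate_peer($a);
	$rb = rate_peer($b);
	if ($ra == $rb) return 0;
	return ($ra > $rb) ? -1 : 1;   // higher rating sorts first
}

$peers = array(
	array('ip' => '10.0.0.1', 'uptime_hours' => 40, 'bandwidth_kbps' => 2000),
	array('ip' => '10.0.0.2', 'uptime_hours' => 2,  'bandwidth_kbps' => 8000),
	array('ip' => '10.0.0.3', 'uptime_hours' => 90, 'bandwidth_kbps' => 500),
);

usort($peers, 'cmp_peers');
$supernodes = array_slice($peers, 0, 2);   // keep the two best-rated peers
?>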

I'd be rather interested to see how Exeem will address these issues. I guess they'll just rule out client-side hacking by incorporating various encryption tricks into their protocol.

Update on the anti-email-harvester mailto links

In the previous post I described a simple though effective method to defeat the ever-cleverer spam email harvester bots.

I've made a little update to the algorithm: it now uses only one number per character, and uses a cascading incremental XOR transform.

Python code for the algorithm itself:

def alphaicx(s):
    # Cascading incremental XOR: each character is XORed with a value
    # derived from the previous transformed (output) character.
    ret = ""
    cascvalue = 0
    for i in range(0, len(s)):
        ret = ret + chr(ord(s[i]) ^ cascvalue)
        cascvalue = (ord(ret[i]) + 1) % 255
    return ret

def betaicx(s):
    # The mirror transform: ret[i] ^ cascvalue equals the input
    # character again, so here the cascading value is derived from the
    # previous *input* character. This makes alphaicx and betaicx each
    # other's inverse.
    ret = ""
    cascvalue = 0
    for i in range(0, len(s)):
        ret = ret + chr(ord(s[i]) ^ cascvalue)
        cascvalue = ((ord(ret[i]) ^ cascvalue) + 1) % 255
    return ret

I designed the algorithm in Python. Python is great for that kind of stuff.

As you can see there are two functions: what you encode with alphaicx you can decode with betaicx, and vice versa. betaicx produces tougher output, though. This encryption is pretty lousy, but hard enough to stop spam bots.

I've ported betaicx to PHP, and alphaicx to JavaScript. The running example (quite useful to see it in action) has been updated.

The PHP/Javascript code for the function:

function JSBotProtect($text){
	// betaicx in PHP: encode the text as a comma-separated list of
	// character codes, starting with a dummy 0.
	$cxred = "0";
	$cascval = 0;
	for($i = 0; $i < strlen($text); $i++){
		$value = (ord($text[$i]) ^ $cascval);
		$cxred .= "," . $value;
		// the next cascading value derives from the original character
		$cascval = (($value ^ $cascval) + 1) % 255;
	}
	// alphaicx in JavaScript: the browser decodes and writes the text,
	// while harvester bots only see the encoded character codes.
	return <<<EOF
<script type="text/javascript">var cxred=String.fromCharCode({$cxred});
var uncxred=""; var cascval=0;for(i=1;i<cxred.length;i++)
{uncxred+=String.fromCharCode(cxred.charCodeAt(i)^cascval);
cascval=((uncxred.charCodeAt(i-1))+1)%255;}document.write(uncxred);</script>
EOF;
}

Next I'll work on more compact storage for the encoded string. Probably just plain hex, or Base64 if I can get it working.

Markup? Nah, wysiwyg!

When you post something on a forum, a guestbook, or anything else on the web that supports some kind of formatting, it works with some sort of BB-code.
BB-code is just like HTML formatting, with '[]'s instead of '<>'s, and a lot fewer features. It is hard to type and doesn't look neat.

But who cares? Everyone uses BB-code, almost everyone knows BB-code, and most people don't find it hard to type, for they are simply too used to it!

There are alternatives to BB-code, like Textile and Markdown, which use a more convenient syntax.
Personally I don't like using them, for I'm not used to their syntax.

However, why would we want to write markup at all? Why not just use WYSIWYG? And I don't mean a Java applet, but rather a standard for browsers: a new tag, "<input type="formatted" name="example" />", which would act for the server as a normal input field, returning the formatted text as HTML.

The problem is that it would be very hard to get every browser to support such a new tag. Most browsers, I guess, would be quite willing to comply; but browsers like Internet Explorer wouldn't. They don't even comply with the simplest of CSS at the moment, which gives website developers a headache.

Rich Client Side Framework

On several blogs the idea of a rich JavaScript UI has come up, for example on ZefHemel.com: Rich Web UI: Search As You Type.

I guess this is thanks to Google, which made a neat webmail interface for Gmail, and Google Suggest with its find-as-you-type.

The demands on JavaScript keep growing. People want to build better web UIs and features with JavaScript, although JavaScript is definitely not designed for this stuff.

Using Flash or Java is overkill, but using JavaScript is especially awkward, for JavaScript isn't handled consistently across browsers, and isn't as quick and maintainable as it could be.

I guess it is time to extend HTML itself with a more advanced scripting system: Java-like preferably, but supported directly by the browser, and aimed less at custom drawing than at using an API provided by the browser.

I'm currently experimenting with Microsoft .NET assemblies which are downloaded in slimmed form as part of a web page and executed with very limited access. This works neatly, although it is still overkill (a .dll is about 20 kB, even if it holds only one line of code).

Just a thought.

Parsing $_SERVER['PATH_INFO']

The PHP global variable $_SERVER['PATH_INFO'] contains the path suffixed to a PHP script. If I were to call the URL:

http://domain.ext/path/to/script.php/foo/bar.htm?a=b&c=d

Then $_SERVER['PATH_INFO'] would contain:

/foo/bar.htm

Traditionally, $_GET variables are used for certain parameters, like which page to display:

http://domain.ext/page.php?page=about.htm

This method is easy to program, but it not only looks strange, it is also very search engine unfriendly. Most search engines ignore the query string (the part of the URL after the ?), and would therefore index the first page.php?page=x they find and ignore the rest.
Some search engines, like Google, do not ignore the query string, but they give a page that serves different content without using a query string a much higher ranking.

Parsing $_SERVER['PATH_INFO'] is relatively easy; this code does most of the work just fine:

if (!isset($_SERVER['PATH_INFO'])){
	$pathbits = array('');
}else{
	$pathbits = explode("/", $_SERVER['PATH_INFO']);
}

Because PATH_INFO starts with a /, the first element of $pathbits is always an empty string; when no path info was provided at all, $pathbits is simply array(''). The interesting parts therefore start at $pathbits[1].

Here is a quite simple example which parses the path info to decide which file to include:

<?php
if (!isset($_SERVER['PATH_INFO'])){
	$pathbits = array('');
}else{
	$pathbits = explode("/", $_SERVER['PATH_INFO']);
}
if (!isset($pathbits[1]) || $pathbits[1] == ""){
	$page = "default";
}else{
	// basename() strips any directory components, so tricks with
	// ../ can't escape the ./pages/ directory
	$page = basename($pathbits[1]);
}
$file = "./pages/{$page}.php";
if (!is_file($file)){
	echo "File not found";
}else{
	require $file;
}
?>

Modular Server

(Understanding URIs)

“A common mistake, responsible for many HTTP implementations problems, is to think this is equivalent to a filename within a computer system. This is wrong. URIs have, conceptually, nothing to do with a file system. One should remember that at all times when dealing with the World Wide Web.”

So why do most HTTP servers still heavily rely on the filesystem for dictating URLs?

This not only tends to create uncool URIs, but it also makes file-based dynamic content seem logical (more on that in my previous post).

An HTTP server, in fact every server, should be nothing more than a wrapper for modules which handle clients' requests.

The server would limit itself to a very select set of functions:

  • Handling connections
  • Exposing an API for the protocol which the modules can use
  • Hosting modules

On startup, the server loads modules and binds them to certain URLs. Modules remain persistent in memory and are simply signalled that a request has been made, with the module being passed an API to handle the request.
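
A rough sketch of what that module contract could look like (PHP 5 syntax purely for illustration; the interface and class names are made up):

<?php
// The contract a module implements: it stays loaded in memory and is
// handed every request that falls under the URL it was bound to.
interface RequestModule {
	public function handleRequest($path, $query, $response);
}

class BlogModule implements RequestModule {
	private $cache = array();   // persists between requests

	public function handleRequest($path, $query, $response){
		$response->write("<h1>blog: " . htmlspecialchars($path) . "</h1>");
	}
}

// On startup the server binds modules to URL prefixes; for every
// request it only has to find the matching prefix and signal the module.
$bindings = array('/blog' => new BlogModule());
?>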

I'll be busy exploring the possibilities of implementing this; there will be more about it later.

Why server side scripts aren’t that scalable at all

There are a lot of different types of server-side scripts; PHP and Python are two examples that come up below.

Practically all of these can be used via CGI, but most of them have grown tighter integrations with popular web servers like the Apache HTTP Server. Usually the tight integrations for Apache come in the form of modules, like mod_python, which embeds the general-purpose Python scripting language for server-side scripting.

When you, as a client, request a page powered by a server-side script via CGI (like, at the time of writing, this blog), the web server starts the responsible interpreter, providing it with arguments like the POST and GET variables. The interpreter, in this case PHP, then executes the script, which produces a stream to return to the client.

This works pretty well for little, not too demanding scripts.
Problems arise, however, when a lot of people use the scripts, or when the script itself is quite demanding.

An example is this blog. It uses a MySQL database to store its posts and comments. Every time the index page is requested, it makes a new connection to the MySQL server and sends queries for the categories, links, latest posts, latest comments, and so on. Creating a MySQL connection takes time, sending queries takes time, processing queries takes time on the MySQL server, retrieving the results takes time, processing the results takes time; all this just to produce the same content for the index page over and over again, for there are at least 100 times more visits than updates to this blog.

Some blogs build their content in advance. When you view posts on these weblogs, the posts are not generated for your request but are cached (usually as normal .html files on the web server). The control panel for posting new blog items is still written in a server-side script and uses some sort of database; when you are finished creating new posts, the script rebuilds the cached pages from the database, which significantly reduces the server load.
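
A minimal sketch of that rebuild step (the table layout, markup and file names are assumptions):

<?php
// Re-render the index to a static .html file after a post is saved;
// ordinary page views then never touch PHP or the database.
function rebuild_index($db){
	$result = mysql_query("SELECT title, body FROM posts ORDER BY id DESC LIMIT 10", $db);
	$html = "<html><body>";
	while ($row = mysql_fetch_assoc($result)){
		$html .= "<h2>" . htmlspecialchars($row['title']) . "</h2>";
		$html .= "<p>" . htmlspecialchars($row['body']) . "</p>";
	}
	$html .= "</body></html>";
	$fp = fopen("./cache/index.html", "wb");
	fwrite($fp, $html);
	fclose($fp);
}
?>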

The downside of these types of blogs is that they can contain only a very limited number of dynamic features, for those would require scripts. Another downside is that it is very hard to host such a blog on multiple servers at the same time, which is usually what happens when a site gets so popular that one server cannot handle the demand. This is possible for the database-powered blog, for it stores its data on one centralized database server; then again, a database-powered blog will most likely need more than one server quite soon, for it puts a far greater strain on the server than a caching weblog does. It could still be done by setting up the blog so that it stores its cached files on a centralized server too (by normal file sharing). However, I would be surprised if a blog using cached pages ever reached the limits of its server's capacity.

The problem gets bigger when you are dealing with more dynamic server-side software, like forums. A forum requires real queries, like the query which retrieves the posts. There is no use caching those, for visitors of forums tend to post (yeah, I know, it's strange), which would require the caches to be updated (an even bigger strain than just using the database anyway).

A forum, however, still has a lot of content that is quite static and would be served a lot faster if it could be cached: the templates, the category structure, the help pages, statistics like the user count, and the sessions, which are very dynamic but are queried on every page view. These things are usually requested every single time you download a page.

Some forums use a cache consisting of a table in the database which contains all the cached stuff, serialized so it can be used immediately. But this still means about 10 kB transferred from the database every time someone views your page!

This problem grows even bigger when you are developing more demanding server-side projects, like a browser-based online game.

I started developing an online game as a hobby project in PHP, but I soon switched to writing my own web server in C#, which caches everything in the web server's memory and makes the rewritten parts 3 times faster.

I figure it would be great if HTTP servers and server-side scripting were less aimed at handling just one request, with more support for inter-request caching. There is limited support for caching in JSP and ASP.NET, but it is used quite rarely, for JSP and ASP.NET still focus on the single request. A server-side script should not be loaded on request, but should rather already be loaded, able to cache objects: a provider of MySQL connections which just recycles a connection, a class with all the commonly used functions already loaded, and of course things like templates, sessions and other cacheable data.

The problem with caching in memory is that memory can't be shared between multiple servers, so it isn't really scalable. If you ran multiple servers and stored sessions in memory, it is quite possible that members would get a "session not found" error when clicking a link that happens to be served by the other server. A possible solution is to redirect people who access "domain.ext" to a specific mirror ("s12.domain.ext"); this avoids session loss. Cache coherence can then be handled by adding a 'cacheversion' value in a table on the shared database server, changed every time something changes for which a cache exists. Requesting just this very small number is enough to check whether the cache should be rebuilt because another server changed something.
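
A minimal sketch of that 'cacheversion' check (the table, the persistent $memcache array and rebuild_cache are all assumptions for the example):

<?php
// $memcache is assumed to survive between requests in a hypothetical
// persistent environment. Before trusting it, fetch one tiny number
// from the shared database to see whether another server changed
// something for which this cache exists.
function get_cached($db, &$memcache){
	$result = mysql_query("SELECT version FROM cacheversion", $db);
	$row = mysql_fetch_assoc($result);
	if (!isset($memcache['version']) || $memcache['version'] != $row['version']){
		$memcache['data'] = rebuild_cache($db);   // hypothetical rebuild helper
		$memcache['version'] = $row['version'];
	}
	return $memcache['data'];
}
?>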

Just a thought…