Stupid PHP (1) (Strings are faster than Arrays)

When I slowly build a big string out of little bits, the worst thing to do in most languages is to just use string concatenation:

for(something) {
 str .= little_bit;
}

Why? Everytime a little bit is added to str, there must be a new string allocated big enough to contain str and the new little bit. Then the actual str must be copied in. In most languages there are constructs to efficiently build strings like this instead of concatenating. StringBuffer in C#. StringIO in Python.

But no, PHP has to be stupid. There is no nice construct and you’ll end up using concatenation. So, I thought to be smart and make use of PHP array’s and implode. Arrays are here for having elements added and removed all the time so they are properly buffered and should be great at having lots of small elements added. And when I want to pack it all into one big string, I can use PHP’s builtin implode function.

I wanted to try it out and created two scripts: a.php concats a little (10byte) string one million times and b.php appends it to an array and then implodes it. And because I’m also interested in the performance of implode I got a script c.php that’s identical to b.php but doesn’t implode afterwards. These are the results:

a.php (concat) 0.320s
b.php (array append and implode) 0.814s
c.php (array append) 0.732s

Indeed, string concatenation with all its allocation and copying is actually faster than plain simple array appending. PHP is stupid.

md5(microtime())

Don’t use md5(microtime()). You might think it’s more secure than md5(rand()), but it isn’t.

With a decent amount of tries and a method of syncing (like a clock on your website) one can predict the result of microtime() to the millisecond. This only leaves about a 1000 different possible return values for microtime() to be guessed. That isn’t safe.

Just stick with md5(rand()), and if you’re lucky and rand() is backed by /dev/random you won’t even need the md5(). In both cases it will be quite a lot more secure than using microtime().

PHP’s hidden treasures

I’ve complained a lot when I worked with PHP about PHP’s terribly inefficient design; I’ve complained about it, just because it was PHP. The things I missed most in PHP, it seems, were actually there all along!

Behold shared memory and yet more IPC.

One reason PHP really sucked was that you need to load small data from and to databases or files if you want to share it between page views. The whole concept of handling a request per page view is ridiculous too, IMO.

With those two libraries however, PHP pages could be way more efficient.

I wonder why the big PHP software haven’t used it. Lets hope it isn’t portability.

Using template variables in extension tags with Mediawiki

Mediawiki, the wiki used by wikipedia, is a powerfull wiki.

It allows you to include self made templates which you can pass variables too. A very usefull template on wikipedia is the stub template. When you add {{stub}} at the start of an article it will insert text explaining the user that the article isn’t finished yet, but rather a stub. With {{Album_infobox|Name=Sehnsucht|Cover=Sehnsucht_cover.jpg|…}} an infobox is inserted used at wikipedia for all albums.

When the features of wikipedia don’t suffice you don’t neccessarily have to hack wikipedia to add features. You can add an extension which hooks into a part of the parser to create your own logic for your own custom tags.

I created a wiki to store the lyrics which I already got on a forum and found espacially templates usefull.

However, there was a slight annoyance when working with templates. The problem is that in several cases there is a link in a template which can be ambiguous (there are two songs called the same) but you don’t want an ugly suffix after the name (like “Deliverance (Album of Opeth)” instead of “Deliverance”). To solve this you have to pass both the name of the album and the name of the page of the album, but this is annoying to do everytime.

MediaWiki knows no logic so I thought to solve this by adding some basic logic in a custom tag which either uses the same text as the page name when no custom text is specified and otherwise uses the custom text. But that just wouldn’t work.

MediaWiki, when parsing a page, first escapes all nowiki elements, extensions, comments, links and so on by a unique ID. Then parsed the format and replaced variables with their values. After that it just replaced the unique id’s back with the proper content, which in this case was the output of my extension too, which was flawed because the variables weren’t replaced before the extension tag was escaped out and also because the wiki-link ([[pagename]]) wasn’t replaced because the extension was still escaped. So I ended up with having [[{{{album}}}]] on my page instead of a Sehnsucht link.

Trying to fetch template variables inside the extension just wouldn’t work because the extension was evaluated at the moment it was already included in the main page. Avoiding this by disabling the cache would require parsing the whole template each time it is viewed which is very extensive. (each link on a page in a wiki has to be looked up when parsed to see whether it exists)

So I googled a bit around and found that Mikesch Nepomuk submitted a patch on one of the mailinglists to fix this by escaping extensions seperately after the veriables were expanded. Apparently it hadn’t been approved or maybe because it wasn’t included in my release yet so I tried to apply it but it failed. The patch was a bit too old. So I did it by hand instead of by GNUpatch and with some tinkering I got it to work.

You can download the patch for the 1.5 beta4 here, if you are interested.

And I thought I would never touch PHP again.

Why Php sucks (and I still use it)

  • Php is slow, not just slow, but really slow. A simple benchmark runs in 1 milisecond in C. It takes 2 miliseconds for .Net. Python takes 600 miliseconds for instead of a native assembly language or jit-ted language it is an interpreted language. But Php, even though it also is an interpreted language and hasn’t got the enourmous object overhead Python has got it still takes 12000 miliseconds, that are 12 seconds.
  • Stupid work arounds, for Php itself is rather slow you got to rely as much as you can on function calls instead of doing anything yourself. It is for instance faster to load a list by deserializing a string than just reading line by line through a file although the deserialization would be way faster if both methods would be implemented nativly. Another little issue here is that most very quick functions in Php are only available in the newer versions (eg. file_set_contents), this requires you to add an if statement with a home made implementation of the function which usualy is slower by a factor of 10. You can choose to use an alternative way which doesn’t exactly implements the functions of the function you require to use but still does the job for the circumstance (better) (eg. not rewriting file_put_contents when it isn’t available when you want to streamingly write data to a file but rather call fwrite a few times which gets rid of having the whole file in the memory in a string at a time.
  • State less Although the Http protocol is a stateless protocol that doesn’t have to mean that a server side scripting framework should be stateless too. Although Php attempts to become statefull by using a session implementation by serializing a session array on the harddisk for every session this isn’t very efficient. Even one global array that persists between requests would result in such a performance boost. Not only for it doesn’t require file reads, writes, serialization and deserialization (and optionaly queries when you don’t like the php session system), but also for it would allow you to store that little bit of important cache between sessions that otherwise would have needed to be read, deserialized, serialized and written again, for every mere page view!
    Allowing such a persistant array however poses a security risk in the way Php works at the moment. They should add contexts to allow one instance of apache on which mod_php runs to execute files in different context, each with its own settings (and persistant data).
  • There is no satisfying solution in Php For every single common issue in Php there is no simple solution, that works perfectly or even reasonbly well.
    Take for instance templates. There are basicly a few ways to handle templates in Php. Usualy it comes down to either caching .php-scripts which are than executed as smarty does, or using a class for every major template section where every function fetches a template bit. In both these methods executing Php code is required to just only replace a certain tag with an replacement. Php has been designed to do a lot more than that and contains a lot of overhead during interpretation. Using str_replace‘s is a lot faster than a php block inline or even using instring php variables ("Example: {$example}"). The second way using classes and functions is even worse for the whole class and all functions first need to be loaded in the memory and basicly are a lot slower.
    The proper way to use templates is streaminly inputting and replacing tags with their values and outputting it. This isn’t possible for php is slow and loading the whole template in a string is even faster.

The only reason why I still use Php is for it is just the number one supported server side scripting language.

Why is world wide web not fair?

Skins and performance in PHP

There are several ways to use skins in PHP, I’ve put some through a performance test.

Basicly you can use either evaluated PHP or a string that will undergo str_replace’s.

When evaluating PHP in a file it seems to be faster than replacing tags in a string. This for PHP streams through the file during execution instead of handling one big string. The difference is minimal though (15% in my tests).

Although when the PHP code is placed in a string instead of in a file which has to be done in case a string is cached in a database or is generated from compiling from another format it is significantly slower than using str_replace’s on a normal string (600%!), this is because the original sourcecode, the intermediate code and the return from the code all take a lot of memory.

Either use cached PHP files or a string with tags instead of PHP code in a database, never the otherway around (what happens very often).

Caching in PHP

It is usefull to cache certain things between Php script executions.
Some boards written in Php cache the forum architecture so a difficult query hasn’t got to be run every time a guest views the board.

There are a few ways to cache data:

  • Php script. Data will be stored as a normal php file which will be included during execution
  • Serialized object in file. Data will be serialized and dumped to a file which will be read every page view
  • Database storage. Data will be serialized and stored in a database and queried every page view.

There are a lot of myths about using a database would be way slower than a normal php file.

I’ve run a few tests caching a ~16kB php array, the results:

Serialized object stored in file: 0.0015ms
Object in PHP script: 0.0121ms
Serialized object stored in mySQL database: 0.0015ms

It seems to be quicker to use a serialized array in a file as configuration file than a config.php php script!

Databases although just as quick as normal files are favored by me for they are much more scalable.

PHP Security Consortium

The PHPSC is a site managing a lot of resources on PHP security.

For all those starting or sometimes using PHP this is a must read.

Also I’d advice for people who want to know whether there site is safe enough is to try to play the other site by trying out hacking yourself: hackthissite.org. It is easier than you might have thought.

Parsing $_SERVER[‘PATH_INFO’]

The PHP global variable $_SERVER['PATH_INFO'] contains the path suffixed to a PHP script, if I would call the URL:

http://domain.ext/path/to/script.php/foo/bar.htm?a=b&c=d

Then $_SERVER['PATH_INFO'] would contain:

/foo/bar.htm

Traditionaly the $_GET variables are used for certain parameters like a page to display:

http://domain.ext/page.php?page=about.htm

This method is easy to program, but not only looks strange, but also is very search engine unfriendly. Most searchengines ignore the QueryString (the part of the URL after the ?). And therefor would index the first page.php?page=x they would find and ignore the rest.
Some searchengines like Google do not ignore the query string, but would give a page without using a querystring for different content a way higher ranking.

Parsing the $_SERVER['PATH_INFO'] is relatively easy, this code would do most of the stuff just fine:

if (!isset($_SERVER['PATH_INFO'])){
	$pathbits= array('');
}else{
	$pathbits = explode("/",  $_SERVER['PATH_INFO']);
}

The $pathbits array would always contain / as first element if a path info was provided, otherwise it will be an empty array.

Here is a quite simple example which parses the path info to decide which file to include:

<?php
if (!isset($_SERVER['PATH_INFO'])){
	$pathbits= array('');
}else{
	$pathbits = explode("/",  $_SERVER['PATH_INFO']);
}
if (!isset($pathbits[1]) || $pathbits[1] == ""){
	$page = "default"
}else{
	$page = basename($pathbits[1]);
}
$file = "./pages/{$page}.php";
if (!is_file($file)){
	echo "File not found";
}else{
	require $file;
}
?>

Modular Server

(Understanding URIs)

“A common mistake, responsible for many HTTP implementations problems, is to think this is equivalent to a filename within a computer system. This is wrong. URIs have, conceptually, nothing to do with a file system. One should remember that at all times when dealing with the World Wide Web.”

So why do most HTTP servers still heavily rely on the filesystem dictating URL’s?

This not only tends to create Uncool URI’s, but also makes it seem logical for filebased dynamic content (more on that in my previous post).

A http server, actually every server should be nothing more then a wrapper for modules which handle requests of clients.

The server would only limit itself to a very selected amount of functions

  • Handling connections
  • Exposing an API for the protocol which the modules can us
  • Hosting modules

On startup the server loads modules and binds them to certain URL. Modules remain persistant in the memory and are just signalled a request is made passing the module an API to handle the request.

I’ll be busy exploring the posibilities to implement this.. there will be more about this.

Why server side scripts aren’t that scalable at all

There are a lot of different types of server scripts, a few examples:

Practicly all of these were to be used via CGI, but most of them created tighter intergrations for popular webservers like for the Apache Http Server. Usualy the tight implementations for apache are in the form of modules like mod_python to use the general purpose python scripting language as a server side scripts.

When you, as client, request a page powered by a server side script via CGI (like for instance this blog at the time of writing), the webserver starts the responsible interpreter providing it with some arguments like the POST, GET variables, which then are processed by the PHP interpreter executing the php script, which provides a stream to return to the client.

This works pretty well with little, not too demanding, scripts.
However, problems arise when a lot of people use the scripts or when the script itself is quite demanding.

An example could be this blog, this blog uses a mySQL database to store its posts and comments. Every time the index page is requested it makes a new connection to the mySQL database server and send a query for the categories, links, latest posts, latest comments, etc. Creating a mySQL connection takes time, sending queries takes time, processing queries takes time for the mySQL server, retreiving the results takes time, processing the results take time, this all just to produce the same content for the index page over and over again, for there are at least 100 times more visits than updates to this blog.

Some blogs build there content. When you are viewing posts on these webblogs these posts are not generated for your request but are cached (usualy as normal .html files on the webserver). The control panel to post new blog items is however written in a server side script, and utilizes some sort of database. When you are finished creating new posts the php script will rebuild the cached pages from the database which will significantly reduce the server stress.

The downside of these types of blogs is that they contain only a very limited amount of dynamic features, for that would require scripts. Another downside of these blogs is that it is very hard to get such a blog hosted by multiple servers at the same time which usualy happens when a site is very popular and one server can not handle the demand. This is possible for the database powered blog for it stores its data on one centralized datbase server. However, a database powered blog will most likely require more than one server quite soon for it is a far greater strain on the server then a caching weblog. However, it is possible to be done by setting up the blog in such a manner that it stores its cached files on a centralized server too (by normal filesharing). However, I would be suprised when a server using cached pages will ever reach the limit of its server capacities.

The problem gets bigger when you are dealing with more dynamic server side software like forums. A forum requires some queries. Like a query which retreives the posts, it has no use to cache them for visitors of forums tend to post (yeah.. I know, it’s strange), which would require caches to be updated (which would be an even bigger strain then just using the database anyway).

A forum however still has got a lot of stuff that is quite static and would run a lot faster if it could be cached. This would be stuff like the templates, the categories architecture, the help pages, the statistics like user count, and the sessions, which are very dynamic but are queried by every page view. These things usualy are requested in every time you download a page.

Some forums use a cache which consists out of a table in the database which contains all cached stuff serialized so it can be immediately used. But this still means about 10 Kb transfered from the databaseevery time someone views your page!

This problem even grows bigger when you are developing even more demanding server side projects like a browser based online game.

I started developing an online game as just a hobby project in PHP, but I soon switched to making my own webserver in C# which caches all the stuff in the memory of the webserver which makes the rewritten stuff 3 times faster.

I figured it would be great if http servers and server side would be less aimed to just handling one request but create more support for inter-request caching. There is limited support for caching in JSP and ASP.net but this is used quite rarely for JSP and ASP.net still focus on just requesting. A server side script should not be loaded on request but rather be loaded already in the form of being able to cache objects like a provider of mySQL connections, which just recycles a connection; a function class which contains all the common used functions already loaded; and off course stuff like templates, sessions, and other cachable things.

The problem about caching in the memory is that memory can’t be shared between multiple servers, so it isn’t realy scalable. When you would multiple servers and store sessions in the memory it is quite possible that members would just get a “session not found” error when clicking a page which is served by the other server. A possible solution to this problem could be to redirect people when they access “domain.ext”, to a mirror (“s12.domain.ext”). This would avoid session loss. When adding a ‘cacheversion’ value in a table on the shared database server which is changed everytime something is changed for which a cache is created. Requesting just this very small number would be enough to check whether the cache should be rebuild for another server has changed something.

Just a thought…