Modular Server

(Understanding URIs)

“A common mistake, responsible for many HTTP implementations problems, is to think this is equivalent to a filename within a computer system. This is wrong. URIs have, conceptually, nothing to do with a file system. One should remember that at all times when dealing with the World Wide Web.”

So why do most HTTP servers still heavily rely on the filesystem dictating URL’s?

This not only tends to create Uncool URI’s, but also makes it seem logical for filebased dynamic content (more on that in my previous post).

A http server, actually every server should be nothing more then a wrapper for modules which handle requests of clients.

The server would only limit itself to a very selected amount of functions

  • Handling connections
  • Exposing an API for the protocol which the modules can us
  • Hosting modules

On startup the server loads modules and binds them to certain URL. Modules remain persistant in the memory and are just signalled a request is made passing the module an API to handle the request.

I’ll be busy exploring the posibilities to implement this.. there will be more about this.

Why server side scripts aren’t that scalable at all

There are a lot of different types of server scripts, a few examples:

Practicly all of these were to be used via CGI, but most of them created tighter intergrations for popular webservers like for the Apache Http Server. Usualy the tight implementations for apache are in the form of modules like mod_python to use the general purpose python scripting language as a server side scripts.

When you, as client, request a page powered by a server side script via CGI (like for instance this blog at the time of writing), the webserver starts the responsible interpreter providing it with some arguments like the POST, GET variables, which then are processed by the PHP interpreter executing the php script, which provides a stream to return to the client.

This works pretty well with little, not too demanding, scripts.
However, problems arise when a lot of people use the scripts or when the script itself is quite demanding.

An example could be this blog, this blog uses a mySQL database to store its posts and comments. Every time the index page is requested it makes a new connection to the mySQL database server and send a query for the categories, links, latest posts, latest comments, etc. Creating a mySQL connection takes time, sending queries takes time, processing queries takes time for the mySQL server, retreiving the results takes time, processing the results take time, this all just to produce the same content for the index page over and over again, for there are at least 100 times more visits than updates to this blog.

Some blogs build there content. When you are viewing posts on these webblogs these posts are not generated for your request but are cached (usualy as normal .html files on the webserver). The control panel to post new blog items is however written in a server side script, and utilizes some sort of database. When you are finished creating new posts the php script will rebuild the cached pages from the database which will significantly reduce the server stress.

The downside of these types of blogs is that they contain only a very limited amount of dynamic features, for that would require scripts. Another downside of these blogs is that it is very hard to get such a blog hosted by multiple servers at the same time which usualy happens when a site is very popular and one server can not handle the demand. This is possible for the database powered blog for it stores its data on one centralized datbase server. However, a database powered blog will most likely require more than one server quite soon for it is a far greater strain on the server then a caching weblog. However, it is possible to be done by setting up the blog in such a manner that it stores its cached files on a centralized server too (by normal filesharing). However, I would be suprised when a server using cached pages will ever reach the limit of its server capacities.

The problem gets bigger when you are dealing with more dynamic server side software like forums. A forum requires some queries. Like a query which retreives the posts, it has no use to cache them for visitors of forums tend to post (yeah.. I know, it’s strange), which would require caches to be updated (which would be an even bigger strain then just using the database anyway).

A forum however still has got a lot of stuff that is quite static and would run a lot faster if it could be cached. This would be stuff like the templates, the categories architecture, the help pages, the statistics like user count, and the sessions, which are very dynamic but are queried by every page view. These things usualy are requested in every time you download a page.

Some forums use a cache which consists out of a table in the database which contains all cached stuff serialized so it can be immediately used. But this still means about 10 Kb transfered from the databaseevery time someone views your page!

This problem even grows bigger when you are developing even more demanding server side projects like a browser based online game.

I started developing an online game as just a hobby project in PHP, but I soon switched to making my own webserver in C# which caches all the stuff in the memory of the webserver which makes the rewritten stuff 3 times faster.

I figured it would be great if http servers and server side would be less aimed to just handling one request but create more support for inter-request caching. There is limited support for caching in JSP and ASP.net but this is used quite rarely for JSP and ASP.net still focus on just requesting. A server side script should not be loaded on request but rather be loaded already in the form of being able to cache objects like a provider of mySQL connections, which just recycles a connection; a function class which contains all the common used functions already loaded; and off course stuff like templates, sessions, and other cachable things.

The problem about caching in the memory is that memory can’t be shared between multiple servers, so it isn’t realy scalable. When you would multiple servers and store sessions in the memory it is quite possible that members would just get a “session not found” error when clicking a page which is served by the other server. A possible solution to this problem could be to redirect people when they access “domain.ext”, to a mirror (“s12.domain.ext”). This would avoid session loss. When adding a ‘cacheversion’ value in a table on the shared database server which is changed everytime something is changed for which a cache is created. Requesting just this very small number would be enough to check whether the cache should be rebuild for another server has changed something.

Just a thought…