Don’t rely on parsing

Most applications store their settings (espacially in *nix) in text configuration files. These text files need to be parsed every time the application starts.

Parsing is the action of (usualy streaming) dividing a certain piece of data into understandable parts. This usualy comes down to looking at the text file character by character and deciding what should be done. Usualy this is done by maintaining a state which contains the data collected sofar. And with a bit more complicated files this even means having a stack of states and complicated actions when a certain state is left.

The problems:

  • Parsing is slow, very slow
  • Formats required to be parsed contain overhead, a lot of overhead

But it certainly has got advantages:

  • Humans can easily edit it, you don’t need to rely on configuration tools
  • (Usualy) makes a configuration format more extensible by nature (adding one new field in the average programmer’s binary format would break it)

Now, there are attempts to help improve speed. This by standardizing the format, which makes the amount of oddities to expect less, which ultimately makes the whole parsing slightly faster. This at cost of the easiness it can be edited.

A good example would be Xml. Xml is damned ugly. Xml is too strict. And Xml still takes a hell of a lot of time to parse.

Yaml looked like a decent alternative: easy to edit, looks nice. But then I encountered this:

%YAML 1.1
---
!!map {
? !!str "sequence"
: !!seq [
!!str "one", !!str "two"
],
? !!str "mapping"
: !!map {
? !!str "sky" : !!str "blue",
? !!str "sea" : !!str "green",
}
}

Ugly…

So what to use instead?

Use binary configuration files, which are easy to load and save for the application. And create a parser to parse the configuration file and save it to the binary format! In other words: serialize the usefull data from the parsed document and only parse again when it is required.

When you only parse stuff when it has changed by the user than it doesn’t really matter how long it takes to parse. Which can get rid of the really ugly stuff and let us just have a very loose kind of format without the ugly rules and regulations.

OO Stated Stackbased Parsing

Every parsable format (like INI, XML) consists out of certain area’s. You can be parsing the section name at one moment, or be parsing a comment when parsing an INI. These certain area’s where you can parse result in a parse state. In every state you expect something else, and you gather other kinds of information.

When you are parsing in a certain state you can find that the state has changed (the parser found a new xml node in a xml file), then the old state is pushed on the state stack.

In certain circumstances you need to have the ability to fall back to a previous state, this can happen when you are parsing a apparently new section name and suddenly there is a comment character instead of a sectionname end character. In this case you need to be able to fall back on the previous section you were parsing. Although when you successfully have parsed the sectionname you want the old sectionstate removed from the stack (and the data of it emited).
Continue reading OO Stated Stackbased Parsing