Online Cycle Detection in Directed Graphs

A short while ago I came across a quite interesting problem: design a data structure (and algorithms) to maintain a Directed Acyclic Graph.

There is only one operation: adding a link between two given nodes. This operation must detect and deny any new link that would create a cycle. For simplicity, nodes are identified by sequential ids starting at 0.

An example:

addLink 0, 1 -> True
addLink 1, 2 -> True
addLink 2, 0 -> False # Fails because it would create a cycle

addLink 0, 1 -> True
addLink 1, 2 -> True
addLink 0, 2 -> True # This isn't a real cycle, so it's perfectly fine

It’s rather trivial to create an \mathcal{O}\left(n+\ell\right) algorithm (where n is the number of nodes and \ell the number of links). I conjecture there exists an \mathcal{O}\left(\log n\right) algorithm.
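The linear-time approach can be sketched as follows (in Java; the class and method names are just for illustration): before adding a link from → to, search for a path from to back to from. If one exists, the new link would close a cycle, so it is refused.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Dag {
    private final List<Set<Integer>> adj = new ArrayList<>();

    // Grow the adjacency list so that `id` is a valid node.
    private void ensureNode(int id) {
        while (adj.size() <= id) adj.add(new HashSet<>());
    }

    // Adds a link and returns true, or returns false (adding nothing)
    // if the link would create a cycle.
    boolean addLink(int from, int to) {
        ensureNode(Math.max(from, to));
        if (reaches(to, from)) return false; // a path to -> ... -> from exists
        adj.get(from).add(to);
        return true;
    }

    // Iterative depth-first search: is `target` reachable from `start`?
    private boolean reaches(int start, int target) {
        Deque<Integer> stack = new ArrayDeque<>();
        Set<Integer> seen = new HashSet<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            int node = stack.pop();
            if (node == target) return true;
            if (seen.add(node)) {
                for (int next : adj.get(node)) stack.push(next);
            }
        }
        return false;
    }
}
```

With this check, addLink(0, 1) and addLink(1, 2) succeed, addLink(2, 0) is refused, and addLink(0, 2) is accepted, matching the examples above. Each insertion costs a full traversal in the worst case, hence the linear bound.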

Function Recursion Overhead

In the courses I followed at university, and in most books about programming, recursion is praised as a good way to solve certain problems.

An example of recursion (a nonexistent API is used for the sake of simplicity):

int GetFolderSize(Folder folder)
{
 int sum = 0;
 foreach(File file in folder.files)
 {
  sum += file.Size;
 }
 foreach(Folder subFolder in folder.folders)
 {
  sum += GetFolderSize(subFolder);
 }
 return sum;
}

This example function calculates the combined size of all files in a folder and those inside any subfolder.

It does its job and it is very clear how it works, but it is inefficient. Take another way to write the algorithm:

int GetFolderSize(Folder root)
{
 Stack<Folder> stack = new Stack<Folder>();
 int sum = 0;
 stack.Push(root);
 while(stack.Count > 0)
 {
  Folder folder = stack.Pop();
  foreach(File file in folder.files)
  {
   sum += file.Size;
  }
  foreach(Folder subFolder in folder.folders)
  {
   stack.Push(subFolder);
  }
 }
 return sum;
}

This version is harder to understand. Basically, it maintains a stack of folders whose files' total size has not yet been counted. As it processes each folder popped from the stack, it pushes that folder's subfolders and discards the one just processed.

The latter method is way more efficient.

For each call in the first (recursive) version of the algorithm, a new sum variable, a subFolder variable and an enumerator over folder.folders are required, and these stack up. Five levels deep, the first algorithm already uses more than double the memory the second algorithm ever requires.

Additionally, a function call itself requires extra memory of its own, which, depending on the language used, can be quite significant. Debugging also gets really hard when your stack trace is hundreds of calls deep.
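The call-stack cost is easy to see empirically. This small probe (a sketch; the exact number it reports depends on the runtime and its configured stack size) recurses until the runtime gives up:

```java
class DepthProbe {
    static int depth = 0;

    // Recurse with a tiny frame until the call stack is exhausted.
    static void recurse() {
        depth++;
        recurse();
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            // Even a frame this small exhausts the stack after a finite
            // (typically tens of thousands) number of calls.
            System.out.println("Stack overflowed after " + depth + " calls");
        }
    }
}
```

An explicit stack on the heap has no such hard depth limit; it is bounded only by available memory.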

Using your own stack instead of polluting the call stack (the structure where function calls are kept) sounds great. There is only one little problem: it can get pretty complex.

Take for instance a program that puts the files of a folder and its subfolders into a file in that same folder, for instance a playlist generator. This requires the algorithm to detect when all child items of a certain item have been processed. The problem is that the parent item is already gone from the stack by the time a child item is processed. Getting this to work requires some additional fields and a few function pointers, and it can become a mess.
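One way out (a sketch in Java, using a hypothetical Folder type) is to push an explicit frame per folder that remembers how many of its children have been handled. The parent frame is re-pushed after each child, so we know exactly when its last child finishes:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a real file-system folder.
class Folder {
    final String name;
    final List<Folder> subFolders = new ArrayList<>();
    Folder(String name) { this.name = name; }
}

class PostOrderWalk {
    // A frame remembers which child of `folder` to descend into next.
    private static class Frame {
        final Folder folder;
        final int nextChild;
        Frame(Folder folder, int nextChild) {
            this.folder = folder;
            this.nextChild = nextChild;
        }
    }

    // Visits every folder only after all of its subfolders: the point
    // where a playlist for that folder could safely be written.
    static List<String> visit(Folder root) {
        List<String> finished = new ArrayList<>();
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame(root, 0));
        while (!stack.isEmpty()) {
            Frame f = stack.pop();
            if (f.nextChild < f.folder.subFolders.size()) {
                // Come back to this folder once the next child is done.
                stack.push(new Frame(f.folder, f.nextChild + 1));
                stack.push(new Frame(f.folder.subFolders.get(f.nextChild), 0));
            } else {
                finished.add(f.folder.name); // all children processed
            }
        }
        return finished;
    }
}
```

For a root folder with subfolders a and b, where a contains c, this yields c, a, b, root: every parent comes out strictly after its children, which is exactly the ordering the playlist generator needs.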

The best way to get it to work is to mimic what a normal recursive method would have done, which involves a lot of extra helper classes, all depending on how the original recursive algorithm worked. In a language with a garbage collector (which automatically frees unused objects) it is manageable, but in a language without one, like C(++), the trouble doubles: you also need to manage the lifetime of the helper classes.

I have noticed that it is very tempting to use recursion and that there are almost no occasions where something like your own stack is used, especially in the complex cases. A shame, for it is challenging :-P.

Enter The Unkown: Algorithms A Programmer Has Got To Know

I don’t like to link to other blogs or articles, lest I be just a lame copier, but I’d like to point my select few readers to Enter The Unkown, the weblog of Kaja Fumei, where he will point out some algorithms a programmer should know. First in the list are hash tables.

You don’t need to know the internals behind these kinds of features a language library exposes, but it helps a lot if you know how they work and what their weak and strong points are.
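Hash tables are a good example of this. The sketch below (in Java) shows their well-known weak point: if every key hashes to the same value, a HashMap still answers correctly, but each lookup degenerates into a search through one overfull bucket instead of a near-constant-time jump.

```java
import java.util.HashMap;
import java.util.Map;

class HashDemo {
    // A key with a deliberately constant hash code: every entry collides.
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int hashCode() { return 42; } // worst case: all collide
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            map.put(new BadKey(i), i);
        }
        // Still correct, but every get() now searches a single giant
        // bucket rather than landing directly on the right entry.
        System.out.println(map.get(new BadKey(1234)));
    }
}
```

Knowing this is what tells you when a hash table is the right tool, and when a poorly chosen hash function will quietly erase its main advantage.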