MMC - Morphing Match Chain

Introduction
MMC (Morphing Match Chain) was invented in November 2010, in an attempt to produce an improved search algorithm for LZ compression, starting with a simple Hash Chain methodology. It was discussed in this forum thread and declared as a novel algorithm. (see also Charles Bloom's analysis of MMC)

Later, Piotr Tarsa underlined that this structure can be described as a left-leaning Radix tree. This is pretty close from describing MMC. However, there are some significant differences too : MMC organizes data as it is being searched into, while Radix sort them on insertion. Moreover, similar to Hash Chain, old data is deleted from MMC for free, while Radix requires a specific algorithm for this purpose. This makes for some pretty large property differences, that will be put to the test below.

MMC is Open-Source algorithm, released under LGPL license. It is available on Google Code.

Context :
The basic objective is to find the best (=longest) match in a given search window size.
As everyone probably know, a proven method is to hash the minimal valid sequence (minmatch) into a first table, and then link all positions with same hash within a second table, which has as many positions as the search window size.
This is called "Hash Chain", and works reasonably well. Indeed, search speed is tremendously improved compared to "naive" search (blindly comparing all previous positions).

Now, the problem is : with increasing search window size, the number of positions within the chain might become so large that it slows down considerably. This has nothing to do with eventual collisions (two different sequences with same hash), as this can be made reasonably low with increased hash size. This is a valid behavior of the algorithm : there are many positions which share the same prefix, that's all.

Therefore, one could wonder : quite a pity that, if i already have a match of say 4 bytes, i would prefer to ask the match search function for a position with at least 5 common bytes, in order to improve on previous result, and not for another position with just 4 bytes.

That, i figure out, could be achieved with nearly the same structure.

Basic idea :
I will start with a simple example.
Let's imagine we have a chain of positions which share a common hash. Now, what we want to result with is a chain of MinMatch common bytes; in effect, we want to remove hash collisions.

This can be achieved with a simple "double chain" methodology. 
Just to make everything clear, here is a picture of how it works :


Just use the first "Hash Chain" as usual, looking for a sequence of at least MinMatch Bytes. When you do find it, link it on the second chain, and continue searching with the second chain.
This will reduce the number of positions to compare, as there is no more collision to fear. As another small speed advantage, there is no need to compare the first Minmatch Bytes anymore, as they are guaranteed to be strictly equivalent, so just compare starting from MinMatch+1 character.

Generalization :
Now, the same idea could be used to reach higher levels.
Just start with a chain of positions with a guaranteed common prefix of N elements.
In order to reduce the number of useless comparisons, it is interesting to reach a "higher level", with a chain of at least N+1 common prefix elements.

A naive strategy would be to create a table for each prefix depth. This obviously would cost more and more memory, on top of making the chain update process more and more expensive. So this is a dead end.

The clever idea here is to continue working with just 2 tables. This can be done to reach unlimited prefix depth level. 

Just replace in previous picture the words "hash" with "prefix N", and "minmatch" with "prefix N+1". And we have our generalization.

For each position, we keep 2 choices : "bad" means this position has not improved over the guaranteed prefix, and therefore we should continue on this chain. "good" means this position has at least "N+1" common elements, so we use this opportunity to reach the higher chain. By applying this principle recursively, we create more and more chains with more and more common prefix elements, making further researches faster.


Here is a graphical attempt at explaining the algorithm :



Note that the match chain is being modified as we are searching into it, hence the name "morphing match chain". The good point is that there is no "setup time" here, it's all done on the fly while searching.

Blind insertion :
The whole idea may look simple, but there are some hard points during implementation.
Mostly, the described behavior is guaranteed as long as the chain of prefix N has already been fully sorted and linked to valid chains of prefix N+1. But it's not the case.
Since elements are "blindly inserted" into the chain, there is no guarantee that they point towards an N+1 sequence. When that happens, it is necessary 
to continue updating chain N, up to the point of finding a valid link to level N+1.
Dealing with this complex "unsecured" situation allows greater agility though. Adding positions in any chain is a no brainer, and requires no further operation. They'll get sorted later on, if ever needed.



Multi-level promotion :
The picture above explains how a gateway to level N+1 becomes an element of chain N+1 when a new gateway is created. This operation is called "promotion".
In the picture, promotions are always "one level higher", which is simple to explain.
It is however possible to build a stronger algorithm which would allow positions to be promoted by several levels on a single pass. For this to happen, we need to handle both a "virtual level" and a "current level". Virtual level is always higher or equal to current level. When conditions are met, a position can be promoted directly to a virtual level, which can be many levels above current one. This is called "multi-level promotion", and makes the search algorithm much faster.

Secondary promotions :
Inspired by Piotr comments, the idea is to promote elements as we are testing them, even though we are not looking for them. It requires a more complex data structure. However, in a single pass, many more elements get promoted, making future searches faster. 
The algorithm is more complex though, which makes it worthwhile only on larger window size (described here). It is implemented into the open-source version of MMC, but disabled into LZ4 HC

Stream acceleration :
Although not specific to MMC, it is worthwhile to note that dedicated optimizations designed to detect and search long streams of identical characters or repetitive sequences can be very easily integrated into MMC. It produces some interesting speed boost when such streams are present in the file to be compressed.

Level generalization :
While in the previous explanation a level means one byte, it is possible to create levels of any size, while keeping MMC algorithm. For example, the Open Source implementation hosted on Google Code starts with a first level which is MINMATCH bytes, then all others levels are 1 byte.

Eugene Shelwien has created a version of MMC where each level is 4 bits (called a nibble).
A level can be any size, in bits or in bytes, and you can even mix different level sizes together, either mapped (statically) or dynamically discovered.

Comparison with BST :

This was studied and reported in this post.
BST (Binary Search Tree) is a very well known search structure, in use within several top-of-the-line compressors, especially 7zip or FreeArc.
To summarize, BST searches faster than MMC, which is expected since BST searches into an already sorted data structure.
However, when insertion time is taken in consideration, the picture changes considerably : BST is several times slower than MMC, due to insertion cost. As explained above, inserting data into MMC is almost free : positions are just blindly inserted, and will be sorted later on only if needed. In contrast, BST needs to sort data on insertion, which costs as much CPU cycles as searching a best match length. Balanced BST also require specific functions to remove elements from the tree, while this operation is free for MMC, due to its Hash Chain roots.
As a conclusion, MMC looks a better choice for greedy and lazy match strategies, while BST is likely to take the lead with Optimal parsing.

Experimental results:
We are comparing 2 versions : "Hash Chain", and "Morphing Match Chain". All search algorithms find the best match in the window, resulting in exactly the same compression rate. The results below measure the number of chain comparisons to compress the entire file. Less is better.

Window Search 64K :

AlgoHash Chain
MMC
Calgary71.6M
4.84M
Firefox261M
34.0M
Enwik8355M
116.M

Window Search 512K :

AlgoHash Chain
MMC
Calgary300M
9.13M
Firefox1.26G
70.7M
Enwik81.89G
367.M

Window Search 4M :

AlgoHash Chain
MMC
Calgary368M
11.6M
Firefox4.15G
129.M
Enwik88.99G
612.M

Results are conform to expectation : the larger the search window, the more advantageous the morphing match chain methodology becomes.
Note also that benefits vary greatly depending on file type.

Obviously, a search with MMC is a bit more complex, due to changes into chain table, so there is a cost for it. But that's not terrible. To provide an idea, compressing an uncompressible file results in a less than 10% speed loss with MMC compared with classic single Hash Chain methodology.

Comments
There is no claim here that this structure is the best.
Certainly it is a fair improvement over simple Hash chain methodology. It is likely that more complex structures may provide better results, especially at very large search depths. All it takes is a comparison test.

Therefore, the next sensible step in evaluation is to test MMC search algorithm on larger window sizes (64MB+)



[Update] : Charles Bloom made a pretty thorough analysis of multiple match search algorithms, available at its blog site.
The conclusion is that MMC is indeed not suitable for Optimal parsing scenario.