Saturday, December 4, 2010

Multiple level promotions

One of the main advantages of Morphing Match Chain (MMC) method (explained here) is that you can add elements into the chain without actually sorting them. They will get sorted later on, and only if necessary.
As a consequence, many elements will never get sorted, because they are never accessed.
In a "greedy match" search algorithm, it happens very often, and the shorter the search window, the more probable elements will reach their end of life before being searched.

This strategy is key in avoiding many useless comparison operations.

Now it has a drawback : all elements are added into the "baseline" hash chain. They will get sorted later on if the hash chain is called. But my current implementation of MMC is limited to a maximum of one promotion per pass. As a consequence, all these unsorted positions tend to massively get promoted to next level together. Only elements which does not reach the minimum length are left out, but the point is : these "good" elements could have reached immediately higher level, they are artificially limited to current level + 1 due to implementation.

Willing to measure this effect, i made a simple modification to LZ4 to count the highest potential level of each element, and compare it to the current level limitation.
I was expecting some missed opportunities, in the range of 8 levels typically.

This was incorrect.
I found missed promotion opportunities in the range of hundreds level.
Granted, highest numbers are typically the result of very repetitive patterns, which can be found in binary files (and i witnessed many strange repetition length, such as prime numbers 3, 5, 7, or large one such as 120). In this case, missed opportunities can reach several thousand levels.
But outside of these (not so rare) corner cases, i still have witnessed missed opportunities in the range of a hundred levels with file such as enwik8, which does not feature any massive repetition pattern, just some english words and group of words which are more frequents.

This suggests that some sensible gain can be achieved through multi-level promotion.
Sorting elements on a multi-level scheme will not make the current search faster, but will make the next search using same hash much more efficient.

Unfortunately, this also requires complete rewriting of MMC algorithm. The basics behind the search method remain one and the same, but the algorithm needs to be completely different, to track multiple levels in parallel. Quite some work in the waiting...

Thursday, December 2, 2010

Fusion searches - segments & sequences fully integrated

Finally, after spending quite some time at testing a few different directions (more on this in later notes), i finally had enough time and focus to generalize Fusion for any stream of any character.
I opted for the easy way, by extending the table listing existing segments within range. I had another scheme in mind, with the advantage of being memory-less, albeit at unknown performance cost. Maybe for another version.

Nevertheless. I end up with some usable results, and they  are quite promising.
As a usual "raw" efficiency measure, i counted the number of comparisons here below :

                       MMC  Segment  Fusion   Improvement
Calgary 64K Searches  20.3M  6.55M    5.54M      18%
Firefox 64K Searches  70.3M  47.1M    41.9M      12%
Enwik8  64K Searches   188M   187M     187M       0%

Calgary 512K Searches 34.2M  14.6M    10.9M      34%
Firefox 512K Searches  153M   121M    92.1M      31%
Enwik8  512K Searches  435M   435M     434M       0%

Calgary 4M Searches   38.7M  18.5M    14.2M      30%
Firefox 4M Searches    271M   308M     182M      69%
Enwik8  4M Searches    753M   758M     751M       1%

Now, this is satisfactory. Note how Firefox greatly benefits from fusion search support for "non-zero" bytes. Calgary, which mostly contains streams of zero, achieves a little gain compared to Fusion0. Enwik8 hardly contains any stream at all, and therefore benefits the least.

But the tendency is what matters here. The larger the search window, the better the benefit. Firefox is quite impressive at 4M, keeping in consideration that about 65M comparisons (more than one third of total) are just collisions.

And finally, results are now always better than with good old MMC alone, whatever the file and search window size. Objective reached.