Saturday, November 20, 2010

First results of fusion search

As stated in an earlier post, segment search provides a nice boost to compression speed when compressed files contain some long stream of identical characters. However, this benefit does not ramp up with search depth, as there is no gradual elimination of bad choices, like MMC is doing for normal sequences.

Therefore, the ideal solution would be to "merge" both algorithms. But this is easier said than done.

Finally, after much time at correcting many small inefficiencies and inaccuracies, a first version of this merged algorithm is able to provide some figure, in order to verify the claim.
The current version is only limited to handling streams of zeros. When a segment is detected, it is fully populated with pointers to previous best possible match, then the search continue using the normal MMC algorithm.

This approach results in better elimination of bad positions, and therefore less comparison checks. On the other hand it requires much heavier updates on chain table.

So where does that lead us to ? Here are some extracted statistics, comparing new fusion algorithm with current 0.5 (segment) version of LZ4HC :

                       MMC  Segment  Fusion/0  Improvement
Calgary 64K Searches  20.3M  6.55M    5.62M       17%
Firefox 64K Searches  70.3M  47.1M    45.3M        4%
Enwik8  64K Searches   188M   187M     187M        0%

Calgary 512K Searches 34.2M  14.6M    11.1M       32%
Firefox 512K Searches  153M   121M     115M        5%
Enwik8  512K Searches  435M   435M     435M        0%

Calgary 4M Searches   38.7M  18.5M    14.4M       28%
Firefox 4M Searches    271M   308M     295M        4%
Enwik8  4M Searches    753M   758M     758M        0%

There is definitely some improvement, but not that large.
It seems to get even better with larger window size (64K->512K), but the tendency is surprisingly reversed at largest tested size (4M).

There are a number of potential explanations for that.
To begin with, the new Fusion algorithm only deals with streams of zero. The Segment algorithm is still being used for all other streams. As a consequence, there is still room for improvement, as can be witnessed in Firefox at 4M range : we still have more comparisons than with MMC alone. So there are probably some tangible benefits to expect with generalizing Fusion to all byte streams.

Then, at 4M, collision is really large for Firefox, reaching 64M, which is more than 20% of comparisons. It is also more than 1M for Calgary (about 7%). This amount is not reduced by Fusion, therefore its relative benefit importance is reduced by increased share of collision.
The easier solution is to increase hash table, to reduce collision effect. However, as i want directly comparable results, hash table has to remain the same for all search depth and all search algorithms.
Maybe i'll change that policy, to allow adapted hash size per search depth.

Calgary is also a strange beast : its size is less than 4M, so it does not fully benefit from largest window size. Still, the small difference between 512K and 4M is quite unexpected. This observation is valid whatever the search algorithm, so there is definitely something at stake. Maybe the fact that calgary.tar contains some very different file type, which individual size is less than 512K, could explain this behavior.

Next logical step seems to be generalization of Fusion, and measure potential improvements.

Friday, November 19, 2010

Compressing the web

I just had a very pleasant time reading Frederic Filloux 's post on his wish list for a future word processor. In a nutshell, what he is looking for is a context-aware spell checker, so that it does not need to constantly tell the dictionary about new words and surnames. It goes even farther : citing a famous example from Google's Wave, the sentence "Icland is an icland" get automatically corrected into "Iceland is an island", the most probable spelling.

We can easily get a glance of this feature by just mis-spelling a word into Google search engine : although the search is performed, a correction is also automatically proposed.

Now just a word on how this correction is proposed : this is no longer about reference into dictionary. No, Google has enough memory to download (almost) the whole web, study it, and get meaningful statistics out of it. Its engine is not instructed with English lessons, it learns it by itself, by just continuously reading English texts. And it does the same for any other language.
This does not even stop there, it can guess equivalence, correct spelling and grammatical rules out of it. Moreover, it can extract from the current context enough information to propose the proper correction out of several possibilities.

So where does that lead us ?
Guessing the proper next sequence from the current context and from what has been learned with past history, doesn't that sound like compression ?

It must be underlined here that "past history" refers to "the web", at large. Therefore, digging into such a massive amount of data and still generating an answer in a fraction of a second is quite a feat, and should inspire compression algorithms (or the other way round).

As stated in a previous post, deeply looking into a dictionary of size N requires 4xN or 8xN memory. Obviously you can't do that on a dictionary which is "Internet" size, so you don't even think of making a search. Creating a statistical structure out of this huge source is likely to result in much more compact dataset, and as a consequence usable & fast search algorithm.

Looking for a compression algorithm working with stats rather than dictionary ? look no further :  think about PPM / CCM.

Monday, November 15, 2010

Unramping benefits

One of the pretty nice effects of segment detection, and one of the main reason for the speed improvements behind latest LZ4HC versions, is that it translates into so many less comparisons to do.
Or does it ?
To verify this claim, i've measured the number of comparisons for different search algorithms (Simple Hach Chain, MMC, and MMC + Segment detection) and for different search windows.
And here are the results
(btw, no way to build simple tables ? that's plain crazy)

                      Hash   MMC  Segment  Improvement 
Calgary 64K Searches  71,6M 20,3M 6,55M      x3.10
Firefox 64K Searches  261M  70,3M 47,1M      x1.49
Enwik8  64K Searches  355M  188M  187M       x1.01

Calgary 512K Searches 300M  34,2M 14,6M      x2.34
Firefox 512K Searches 1,26G 153M  121M       x1.26
Enwik8  512K Searches 1,89G 435M  435M       x1.00

Calgary 4M Searches   368M  38,7M 18,5M      x2.09
Firefox 4M Searches   4,15G 271M  308M       x0.88
Enwik8  4M Searches   8,99G 753M  758M       x0.99

Tendancy is clear : the larger the search window, the less benefits segments brings. Indeed, it even ends up with a net loss for Firefox, at a mere 4M window. Let's imagine for 16M....
And it's pretty normal : segment search algorithm is much closer from simple Hash chain than from MMC, as there is no gradual elimination of positions.

To counter this effect, the best would be a perfect merge between the two search algorithms. They currently work together nicely, but they are still completely separated.

In ideal scenario, segment search would provide the equivalent of "guaranteed prefix", and then forward result to MMC algorithm.
Unfortunately, it's not possible with current data structure, as data pointers for segments would risk being erased in the process.
To counter this effect, i would need an additionnal buffer, storing information i don't want to lose with MMC. Something complex to optimize properly.

Sunday, November 14, 2010

Fast note forwarding

 While i've not yet filled this blog with real posts right now, there are a number of them already available on my old web site . Match searches has been my subject of interest lately.

Unfortunately, this old website is redacted in french (it is one of the reasons i'm opening this new blog).
You can, however, use the little shortcut to "english translation" to have the website automatically translated by google. Obviously, this will not result in a very clean translation, but it should remain fairly understandable.

I intend to switch on this blog pretty soon, so for past knowledge on what happened up to now, head there :

LZ4 - New version

LZ4 "HC", a high compression derivative of LZ4, has been released in version 0.5 today.
It features an improved compression speed, thanks to segment search algorithm.
This new version is so good that it displaces the older "OC" (Optimal) version, which was a compromise with less compression for better speed.

You can download it here.