Thursday, November 25, 2010

The cost of History

well well, another idea proved fruitless.
After a quick note on cuckoo hashing , i tried to apply the concept on a compressor using Etincelle's principles, betting that the superior occupation rate of cuckoo's hash table would result in better compression rate.
There was, indeed a gain, but a small one.

In fact, my initial diagnostic was simply wrong : i assumed that Etincelle's tables were incompletely used. It proved incorrect : they are fully used, fully cranked i should say. And that is necessary for the scheme to be efficient.

Indeed, the real question is : which slot should be eliminated when a new entrant gets in. With Etincelle, i'm using pure randomness to free a slot for the new occupant. This obviously means there are some collisions. With Cuckoo hash, the table gets filled more before the first collision occurs. But in the end, both tables are completely filled, therefore collisions are happening at each update.

The only benefit of the new hash function is that i can select which slot will be freed. In my test, betting that the least recently used slot is probably the most unworthy did provide a small compression boost.
But not enough for the cost of it.

I now remember that a pending question i had when designing Etincelle was : why keeping the whole history into memory ? The larger it is, the more it costs to reference into it. If it was possible to only keep a useful part of history, references would become so much smaller, compression would therefore improve.

Alas, you can't guess which part of history is going to become useful. As a consequence, the usual way for compressors has always been to keep everything into memory. When reference size matters, they are reduced thanks to context filtering.

Trying to improve on this situation, i tried my hand at an "explicit history manager", which would tell the decoder which position to keep in memory for later use.
My first implementation proved disastrous. Although references would become much smaller, this was not enough to offset the cost of instructions passed to decoder.
Later, i tried a new implementation, much more dynamic and (alas) much more complex. This time, there was a net gain. Not a large one, but definitely a gain.
From a speed/ratio perspective however, this was really not worth it; so, outside of a pure compression exercise, it proved a wrong path.

So my conclusion, at that time, was i'd better try to find some heuristics which would, on average, correctly guess which part of history to keep in memory.

I never found the correct formula.

It seems this idea would be remotely cousin to another concept used in some vastly different compression algorithms, called SSE (Secondary Symbol Estimation). To summarize this concept, it states that, according to additional statistics (such as latest compression ratio, local guess success, etc.), current guess probabilities will be affected. For example, if the algorithm tries to compress some noise, SSE will simply "take over" past statistics and tell that final estimation is completely random, providing a flat probability for all symbols in alphabet.

This is the kind of thing i would like to mimic : get from a bunch of easy statistics an estimation of current position "worthwhileness" to be stored into memory, for future reference.

Apparently, this still deserves to be properly implemented.

Wednesday, November 24, 2010

CPU cache

 Looking for performances on modern CPU architecture ? You can't avoid taking care of memory cache effects.

The logic for cache is pretty simple : access to main memory is (relatively) slow, therefore CPU keeps a part of memory next to it, for faster interaction.

At the lowest part of this strategy, you could consider registers as being "level 0 cache". Registers are within the CPU, and some of them are even electable for some comparison or mathematical operations, even if limited. As an example, the old "saturn" processor, used in several HP handheld calculators, had 4 main registers and 4 store registers. Data in these registers are accessed swiftly and can be manipulated easily. But this is not transparent : you have to explicitly load and save data into them, therefore the term "cache" is quite dubious here.

At the next level, all modern processors do feature nowadays an L1 cache. Not only PC ones, but even low-power CPU for embedded market, such as ARM, do feature an L1 cache. This time, this is truly a copy of main memory, an (almost) transparent mechanism which simply ensure that recently accessed data remain close to processors.
L1 caches are typically very fast, with access times in the range of 5 cycles and better. They are however relatively small, storing just a few kilobytes (Core 2 Duo for example feature 2x32KB of L1 cache per core). L1 caches are typically considered part of the CPU.

Going up another level, most PC processors do also feature an L2 cache. This time, we can really say that data sit "close to" processor, as L2 caches are rarely "inside" the CPU. They are however sometimes part of the same package.
L2 caches are slower (in the range of 20 cycles access time), and larger (between 0.25 and 8MB typically). So data that do not fit into L1 is likely to be found in this second larger cache.

A few processors do also feature an L3 cache, but i won't enumerate on that. Suffice to say that it is an even larger and slower cache.

Then you have the main memory. Nowadays, this memory tend to be quite large, at several GB per machine. Therefore, it is several order of magnitude larger than cache, but also much slower. Performance figures do vary a lot depending on architecture, but we can waver a typical 100-150 cycles number for access time.

So here you have the basic principles : very recently accessed data is re-accessed swiftly thanks to L1 cache (5 cycles), then a bit less recent data is found into L2 cache (20 cycles) then main memory (150 cycles). As a consequence, making sure that wanted data remains as much as possible into memory cache makes a terrific difference on performance.

The rule seems simple enough as it is, but there is more to it. Understanding how exactly work a memory cache is key to craft performances. So in a later note, we'll study its in depth mechanism.

Monday, November 22, 2010

Cuckoo's Hash

 After reading a short rant from Charles Bloom, i spent some time reading and studying cuckoo hashing, a relatively new hash methodology, which gets its name from the bird : as you may know, a cuckoo is genetically programmed to hatch into another bird's nest, and when it gets out of its shell, its first act in life is to push other eggs and little baby birds out of the nest, ensuring all future food for its own benefit.

Cuckoo hashing share the same behavior :
Assuming you have 2 hash functions, with relatively low probabilities to produce the same hash from different sequences, you start by hashing your sequence with first hash. If the slot is already occupied, you take it. The previous slot occupant is then re-hashed, using the second hash function. It therefore finds another place for its use. However, if the second place is already taken, then the process starts again : the occupants gets removed, then re-hashed using first hash function, and so on.
In the end, there is a pretty high chance that, with enough shuffling, all elements will find their own place.

Searching into a cuckoo hash table is pretty straightforward : you calculate both hashes, using the 2 hash functions, and then check both positions for sequence presence. It can't be anywhere else, so you're sure to find it in maximum 2 probes if it exists.

This looks interesting, but there is a catch : you can get into a situation where there are infinite loops, such as in the picture W & H, which point to each other.

To counter this risk, you have to respect a "load limit". In the original cuckoo hash implementation, this limit is pretty low, at 50%. However, since then, a simple improvement has been found, using multi-slots bucket or more hash functions. With many more choices to choose from, the load limit rocket to the roof, reaching extreme performances, as high as 99.7% (for 2 Hashes and 8 slots per bucket, for example).

So where does that lead us to ?

When reading these properties, i had one target in mind : Etincelle, my current flagship compressor. To describe its characteristics swiftly, Etincelle has compression performances in the range of the best Zip, but much much faster (calculated at 80MB/s for enwik8 on my Core 2).

Its speed is mostly due to its very fast match search algorithm, which however misses many opportunities due to its very light probing. Etincelle uses this inefficiency to its advantage, by referencing data with very little indexes. This works even at fairly large distances (you can instruct Etincelle to use a 1GB dictionary for example) but is only as efficient as the table fill-rate. I therefore use over-booked hash tables to ensure index efficiency.
Cuckoo hashes with high load limit could help here in making this scheme even more efficient.
Testing this, however, would require some fairly large code changes, to the point of changing Etincelle into something else, obviously a new compressor. The speed is also poised to be slower, has measured by Dr Askitis's hash comparison thesis.

Another potential use of Cuckoo Hash table is a more classical objective within LZ4 : as measured in a previous post, collision rate do reach inconvenient level with Firefox at 4M search depth, totalizing 60 millions comparisons, which is more than 20% of all comparisons. This is obviously a lot, in spite of MMC automatic handling of this situation, albeit in a chain-like fashion. Here also, cuckoo hash could help to bring down these collisions to zero, but at a cost. It remains to be tested if the extra cost and complexity remains worthwhile.

Sunday, November 21, 2010

LZ4 : New version

Yesterday's experiments provided some appreciable gains to LZ4, and therefore it seems valid to share them.

The new version of LZ4HC is, on average, 10% faster.

On top of that, i found an optimization in the decoder, which is valid for any LZ4 version.
As a consequence, both the "Fast" and "High Compression" versions get updated with a nice boost to decompression speed. LZ4 already was the fastest decoder, this makes it even better.

You can download both version at their new LZ4 homepage.

Compression RatioSpeedDecoding
LZ4 "Ultra Fast"2.062227 MB/s780 MB/s
LZ4HC "High Compression"2.34531.0 MB/s830 MB/s