Tuesday, August 25, 2015

Fuzz testing Zstandard

An advanced issue that any production-grade codec must face is the ability to deal with erroneous data.

Such a requirement tends to come at a second development stage, since it's already difficult enough to make an algorithm work under "normal conditions". Before reaching erroneous data, there is already a large number of valid edge cases to deal with properly.

Erroneous input is nonetheless important, not least because it can degenerate into a full program crash if not properly taken care of. At a more advanced level, it can even serve as an attack vector, trying to push some executable code into unauthorized memory segments. Even without reaching that point, the mere prospect of crashing a system with a predictable pattern is nuisance enough.

Such problems can be partially mitigated using stringent unit tests. But that's more easily said than done. Not only is it painful to build and maintain a thorough and hopefully complete list of unit tests for each function, it's also of little use in predicting unexpected behavior resulting from an improbable chain of events at different stages in the program.

Hence the idea to find such bugs at "system level". The system's input will be fed with a set of data, and the results will be observed. If you create the test set manually, you will likely test some important, visible and expected use cases, which is still a pretty good start. But some less obvious interaction patterns will be missed.

That's where the realm of Fuzz Testing starts. The main idea is that randomness will do a better job at finding stupid forgotten edge cases, which are good candidates to crash a program. And it works pretty well. But how to set up "random" ?

In fact, even "random" must be defined within some limits. For example, if you only feed a lossless compression algorithm with some random input, it will simply not be able to compress it, meaning you will always test the same code path. 

The way I've dealt with this issue for lz4 and zstd is to create programs able to generate "random compressible data", with some programmable characteristics (compressibility, symbol variation, reproducible by seed). And it helped a lot to test valid code paths.
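To make the idea concrete, here is a minimal sketch of such a generator. It is not the generator shipped with lz4 or zstd: the names `prng_next` and `gen_compressible`, the xorshift PRNG and the segment length of 1 to 16 bytes are all assumptions chosen for illustration. It exposes the three characteristics mentioned above: a match percentage (compressibility), an alphabet size (symbol variation) and a seed (reproducibility).

```c
#include <stddef.h>
#include <stdint.h>

/* Small deterministic PRNG (xorshift32): a given seed always produces
 * the same sequence. The state must be non-zero. */
static uint32_t prng_next(uint32_t* state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill dst with pseudo-random, partially repetitive data.
 * matchPercent : chance (0-100) that the next segment repeats earlier bytes
 * nbSymbols    : alphabet size used for fresh literals (1-256)
 * Both parameter names are hypothetical. */
static void gen_compressible(uint8_t* dst, size_t size, uint32_t seed,
                             unsigned matchPercent, unsigned nbSymbols)
{
    uint32_t rng = seed ? seed : 1;   /* xorshift cannot start from 0 */
    size_t pos = 0;
    while (pos < size) {
        size_t length = 1 + (prng_next(&rng) % 16);
        if (length > size - pos) length = size - pos;
        if (pos > 0 && (prng_next(&rng) % 100) < matchPercent) {
            /* repeat bytes found `offset` positions earlier (LZ-style copy) */
            size_t offset = 1 + (prng_next(&rng) % pos);
            for (size_t i = 0; i < length; i++)
                dst[pos + i] = dst[pos + i - offset];
        } else {
            /* fresh literals drawn from a restricted alphabet */
            for (size_t i = 0; i < length; i++)
                dst[pos + i] = (uint8_t)(prng_next(&rng) % nbSymbols);
        }
        pos += length;
    }
}
```

Calling `gen_compressible(buf, size, 42, 50, 16)` twice yields identical buffers, which is what makes a failing test reproducible from its seed alone.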

The decompression side is more concerned with resistance to invalid input. But even with random parameters, there is a need to target interesting properties to test. Typically, a valid decompression stage is first run, to serve as a model. Then some "credible" failure scenarios are built from it. The zstd fuzzer tool typically tests : truncated input, too small destination buffer, and noisy source created from a valid one with some random changes, in order to bypass overly simple screening stages.
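The three scenarios can be sketched as a small harness. This is an illustration under assumptions, not the actual zstd fuzzer: the `decoder_fn` signature, the guard-byte check and the toy decoder (format : one length byte followed by that many bytes) are all hypothetical stand-ins.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoder signature: returns bytes written, or < 0 on error. */
typedef long (*decoder_fn)(void* dst, size_t dstCapacity,
                           const void* src, size_t srcSize);

static uint32_t next_rand(uint32_t* s)   /* simple LCG, reproducible by seed */
{
    *s = *s * 1664525u + 1013904223u;
    return *s >> 8;
}

/* Overwrite nbChanges randomly chosen bytes of buf. */
static void add_noise(uint8_t* buf, size_t size, uint32_t seed, unsigned nbChanges)
{
    if (size == 0) return;
    for (unsigned i = 0; i < nbChanges; i++)
        buf[next_rand(&seed) % size] = (uint8_t)next_rand(&seed);
}

/* Toy stand-in for a real decoder: [1 byte length N][N bytes]. */
static long toy_decode(void* dst, size_t dstCapacity, const void* src, size_t srcSize)
{
    const uint8_t* in = (const uint8_t*)src;
    if (srcSize < 1) return -1;
    size_t n = in[0];
    if (n > srcSize - 1) return -1;     /* truncated input */
    if (n > dstCapacity) return -1;     /* destination too small */
    memcpy(dst, in + 1, n);
    return (long)n;
}

#define GUARD 0xA5

/* Run one decode; dst must hold capacity+1 bytes (the last one is a guard). */
static int check(decoder_fn decode, uint8_t* dst, size_t capacity,
                 const uint8_t* src, size_t srcSize)
{
    dst[capacity] = GUARD;
    long r = decode(dst, capacity, src, srcSize);
    if (r > (long)capacity) return 0;   /* claims more than it may write */
    return dst[capacity] == GUARD;      /* guard byte must be intact */
}

/* Returns 1 if the decoder survived all three scenarios.
 * dst needs originalSize+1 bytes, scratch needs validSize bytes. */
static int fuzz_one(decoder_fn decode, const uint8_t* valid, size_t validSize,
                    size_t originalSize, uint32_t seed,
                    uint8_t* dst, uint8_t* scratch)
{
    uint32_t rng = seed;
    /* 1. truncated input */
    size_t cut = validSize ? next_rand(&rng) % validSize : 0;
    if (!check(decode, dst, originalSize, valid, cut)) return 0;
    /* 2. too small destination buffer */
    size_t smallCap = originalSize ? next_rand(&rng) % originalSize : 0;
    if (!check(decode, dst, smallCap, valid, validSize)) return 0;
    /* 3. noisy source, built from the valid one */
    memcpy(scratch, valid, validSize);
    add_noise(scratch, validSize, next_rand(&rng), 3);
    return check(decode, dst, originalSize, scratch, validSize);
}
```

The guard byte only detects writes just past the destination buffer; reads beyond the source would need an external memory checker to be noticed.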

All these tests were extremely useful to strengthen the reliability of the code. But the idea that "random" was in fact defined within some limits makes it clear that some other code paths, outside of the limits of "random", may still fail if properly triggered.

But how to find them ? As stated earlier, brute force is not a good approach. There are too many similar cases which would be trivially reduced to a single code path. For example, the compressed format of zstd includes an initial 4-byte identifier. A dumb random input would therefore have a 1 in 4 billion chance (1 in 2^32) to pass this early screening, leaving little energy to test the rest of the code.

For a long time, I believed it was necessary to know one's code in detail to create a useful fuzzer tool. Thanks to a kind notification from Vitaly Magerya, it seems this is no longer the only solution. I discovered earlier today American Fuzzy Lop. No, not the rabbit; this test tool, by Michał Zalewski.

It's relatively easy to set up (for Unix programmers). Build, installation and usage follow clean conventions, and the Readme is a fairly good read, easy to follow. With just a few initial test cases to provide, a special compilation stage and a command line, the tool is ready to go.

American Fuzzy Lop, testing zstd decoder

It displays a simple live board in text mode, which successfully captures the mind. One can see, or rather guess, how the genetic algorithm tries to create new test cases. It basically starts from the initially provided set of tests, and creates new ones by modifying them using simple transformations. It analyzes the results, which are relatively precise thanks to special instrumentation installed in the target binary during the compilation stage. From them, it deduces which code path was triggered and whether a new one has been found. It then generates new test cases built on top of "promising" previous ones, restarts, ad infinitum.

This is simple and brilliant. Most importantly, it is generic, meaning no special knowledge of zstd was required for it to test thoroughly the algorithm and its associated source code.

There are obviously limits, for example the amount of memory that can be spent on each test. Therefore, successfully resisting for hours the tricky tests created by this fuzzer tool is not the same as "bug free", but it's a damn good step in this direction, and would at least deserve the term "robust".

Anyway, the result of all these tests, using internal and external fuzzer tools, is a first release of Zstandard. It's not yet "format stable", meaning specifically that the current format is not guaranteed to remain unmodified in the future (this stage is planned to be reached in early 2016 only). But it's already quite robust. So if you wanted to test the algorithm in your application, now seems a good time, even in a production environment.

Wednesday, August 19, 2015

Accessing unaligned memory

Thanks to Herman Brule, I recently received access to real ARM hardware systems, in order to test C code and tune it for performance. It proved a great experience, with lots of lessons learned.

It started with the finding that xxhash speed was rubbish on ARM systems. To investigate, 2 systems were benchmarked : first, an ARMv6-J, and then an ARMv7-A.

This was an unwelcome surprise, and among the multiple potential reasons, accessing unaligned data turned out to be the most critical one.

Since my latest blog entry on this issue, I converted unaligned-access code to the QEMU-promoted solution using `memcpy()`. Compared with the earlier method (`pack` statement), the `memcpy()` version has a big advantage : it's highly portable. It's also supposed to be correctly optimized by the compiler, to end up as a trivial `unaligned load` instruction on CPU architectures which support this feature.

Well, "supposed to" is really the right phrase. It turns out this is not true in a number of cases. While direct benchmark tests were initially my main investigation tool, I was pointed towards the godbolt online assembly generator, which became an invaluable asset to properly understand what was going on at assembly level.

Thanks to these new tools, the issue could be summarized into a selection between 3 possibilities to access unaligned memory :

1. Using `memcpy()` : this is the most portable and safe one.
It's also efficient in a large number of situations. For example, on all tested targets, clang translates `memcpy()` into a single `load` instruction when hardware supports it. gcc is also good on most targets tested (x86, x64, arm64, ppc), with just 32-bit ARM standing out.
The issue here is that your mileage will vary depending on the specific compiler / target combination. And it's difficult, if not impossible, to test and check all possible combinations. But at least, `memcpy()` is a good generic backup, a safe harbour to be compared to.

2. `pack` statement : the problem is that it's a compiler-specific extension. It tends to be present on most compilers, but with multiple different, and incompatible, semantics. Therefore, it's a pain for portability and maintenance.

That being said, in a number of cases where `memcpy()` doesn't produce optimal code, `pack` tends to do a better job. So it's possible to special-case these situations, and leave the rest to `memcpy()`.

The most important use case was gcc with ARMv7, basically the most important 32-bit ARM version nowadays (included in the current crop of smartphones and tablets).
Here, using `pack` for unaligned memory improved performance from 120 MB/s to 765 MB/s compared to `memcpy()`. That's definitely too large a difference to ignore.

Unfortunately, on gcc with ARMv6, this solution was still as bad as `memcpy()`.

3. direct `u32` access : the only solution I could find for gcc on ARMv6.
This solution is not recommended, as it basically "lies" to the compiler by pretending data is properly aligned, thus generating a fast `load` instruction. It works when the target CPU supports unaligned memory access in hardware, and the compiler does not risk generating opcodes which are only compatible with strictly-aligned memory accesses.
This is exactly the situation of ARMv6.
Don't use it for ARMv7 though : although the hardware supports unaligned loads, the compiler can also emit a load-multiple instruction, which is a strict-align-only opcode. So the resulting binary would crash.

In this case too, the performance gain is too large to be neglected : on unaligned memory access, read speed went up from 75 MB/s to 390 MB/s compared to `memcpy()` or `pack`. That's more than 5 times faster.
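For reference, the three access methods can be sketched as follows. This is a minimal illustration, not the xxHash source: the function names are hypothetical, and the packed variant uses the GCC / clang attribute spelling, which other compilers express differently.

```c
#include <stdint.h>
#include <string.h>

/* Method 1: memcpy(). Portable and safe; a good compiler reduces it to a
 * single load instruction when the hardware supports unaligned access. */
static uint32_t read32_memcpy(const void* ptr)
{
    uint32_t value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

/* Method 2: packed type. Compiler-specific extension; this spelling is the
 * GCC / clang one. It tells the compiler the field may be unaligned. */
typedef struct { uint32_t v; } __attribute__((packed)) unalign32;

static uint32_t read32_packed(const void* ptr)
{
    return ((const unalign32*)ptr)->v;
}

/* Method 3: direct access. This "lies" to the compiler about alignment and
 * is undefined behavior in standard C. Only an option when the compiler /
 * target pair is known to emit nothing but unaligned-capable loads. */
static uint32_t read32_direct(const void* ptr)
{
    return *(const uint32_t*)ptr;
}
```

All three return the same value on a target where they are valid; what changes is the instruction sequence the compiler generates, which is why each compiler / target pair has to be checked individually.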

So there you have it, a complex setup, which tries to select the best possible method depending on compiler and target. Current findings can be summarized as below :

Better unaligned read method :
| compiler  | x86/x64 | ARMv7  | ARMv6  | ARM64  |  PPC   |
| GCC 4.8   | memcpy  | packed | direct | memcpy | memcpy |
| clang 3.6 | memcpy  | memcpy | memcpy | memcpy |   ?    |
| icc 13    | packed  | N/A    | N/A    | N/A    | N/A    |
The good news is that there is a safe default method, which tends to work well in a majority of situations. Now, it's only a matter of special-casing specific combinations, to use an alternate method.

Of course, a better solution would be for all compilers, and gcc specifically, to properly translate `memcpy()` into efficient assembly for all targets. But that's wishful thinking, clearly outside of our responsibility. Even if it does improve some day, we nonetheless need an efficient solution now, for the current crop of compilers.

The new unaligned memory access design is currently available within xxHash source code on github, dev branch.

Summary of gains on tested platforms :
compiled with gcc v4.7.4
| program            | platform|  before  |  after   | 
| xxhash32 unaligned |  ARMv6  |  75 MB/s | 390 MB/s |
| xxhash32 unaligned |  ARMv7  | 122 MB/s | 765 MB/s |
| lz4 compression    |  ARMv6  |  13 MB/s |  18 MB/s |
| lz4 compression    |  ARMv7  |  33 MB/s |  49 MB/s |

Thursday, July 30, 2015

Huffman revisited - Part 3 - Depth limited tree

 A secondary issue that most real-world Huffman implementations must deal with is tree depth limitation.

Huffman construction doesn't limit the depth. If it did, it would no longer be "optimal". Granted, the maximum depth of a Huffman tree is bounded by the Fibonacci series, but that leaves ample room for larger depth than wanted.
Why limit Huffman tree depth ? Fast Huffman decoders use lookup tables. It's possible to use multiple table levels to mitigate the memory cost, but a very fast decoder such as Huff0 goes for a single table, both for simplicity and speed. In which case the table size is a direct function of the tree depth : 2^depth cells.
For the benefit of speed and memory management, a limit had to be selected : it's 8 KB for the decoding table, which nicely fits into Intel's L1 cache, and leaves some room to combine it with other tables if need be. Since the latest decoding table uses 2 bytes per cell, it translates into 4K cells, hence a maximum tree depth of 12 bits.
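The arithmetic above (8 KB budget, 2 bytes per cell, hence 12 bits) can be checked with a few lines. The helper name `max_table_log` is hypothetical and only serves this illustration.

```c
#include <stddef.h>

/* Largest tableLog such that a single-level table of (1 << tableLog) cells,
 * each cellSize bytes wide, still fits within budgetBytes.
 * Capped at 30 to keep the shift well-defined. */
static unsigned max_table_log(size_t budgetBytes, size_t cellSize)
{
    unsigned tableLog = 0;
    while (tableLog < 30 && (cellSize << (tableLog + 1)) <= budgetBytes)
        tableLog++;
    return tableLog;
}
```

With `max_table_log(8192, 2)` the result is 12. Moving to 4-byte cells under the same budget would leave only 11 bits.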
12 bits for compressing literals is generally too little, at least according to optimal Huffman construction. Creating a depth-limited tree is therefore a practical issue to solve. The question is : how to achieve this objective with minimum impact on compression ratio, and how to do it fast ?
Depth-limited Huffman trees have been studied since the 1960's, so there is ample literature available. What's more surprising is how complex the proposed solutions can be, and how many decades were necessary to converge towards an optimal solution.
Edit : in below paragraph, n is the alphabet size, and D is the maximum tree Depth.
It started with Karp, in 1961 (Minimum-redundancy coding for the discrete noiseless channel), proposing a solution in exponential time. Then Gilbert, in 1971 (Codes based on inaccurate source probabilities), still in exponential time. Hu and Tan, in 1972 (Path length of binary search trees), followed with a solution in O(n.D.2^D). Finally, a solution in polynomial time was proposed by Garey in 1974 (Optimal binary search trees with restricted maximal depth), but still O(n^2.D) time and using O(n^2.D) space. In 1987, Larmore proposed an improved solution using O(n^(3/2).D.log^(1/2) n) time and space (Height restricted optimal binary trees). The breakthrough happened in 1990 (A fast algorithm for optimal length-limited Huffman codes), when Larmore and Hirschberg proposed the Package-Merge algorithm, a completely different kind of solution using only O(n.D) time and O(n) space. It became a classic, and was refined a few times over the next decades, with the notable contribution of Mordecai Golin in 2008 (A Dynamic Programming Approach To Length-Limited Huffman Coding).

Most of these papers are plain difficult to read, and it's usually harder than necessary to develop a working solution just by reading them (at least, I couldn't). Honorable mention for Mordecai Golin, who proposes a relatively straightforward graph-traversal formulation. Alas, it was still too much CPU workload for my taste.
In practice, most fast Huffman implementations don't bother with them. Sure, when optimal compression is required, the Package-Merge algorithm is preferred, but in most circumstances, being optimal is not really the point. After all, Huffman is already a trade-off between optimality and speed. By following this logic, we don't want to sacrifice everything for an optimal solution, we just need a good enough one, fast and light.
That's why you'll find some cheap heuristics in many Huffman implementations. A simple one : start with a classic Huffman tree, flatten all leaves beyond maximum depth to maxBits, then demote enough shallower leaves towards maxBits to bring the sum of 2^-length over all symbols back to one. It's fast, it's certainly not optimal, but in practice, the difference is small and barely noticeable. Only when the tree depth is very constrained does it make a visible difference (see great comments from Charles Bloom on his blog).
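The condition that the heuristic has to restore is that the sum of 2^-length over all symbols equals one. A small integer-only check, with a hypothetical helper name, might look like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of 2^-nbBits[i], expressed in units of 2^-maxBits so that it stays
 * an integer. A complete prefix code gives exactly (1 << maxBits).
 * Assumes every length is between 1 and maxBits, and maxBits < 32. */
static uint32_t kraft_sum(const uint8_t* nbBits, size_t nbSymbols, unsigned maxBits)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nbSymbols; i++)
        sum += 1u << (maxBits - nbBits[i]);
    return sum;
}
```

For lengths {1, 2, 3, 3} and maxBits = 3 the sum is 8, a complete code. After flattening a deeper tree to {1, 2, 3, 3, 3} it becomes 9 : one unit too many, which is what the demotion step must absorb.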
Nonetheless, for huff0, I was willing to find a solution a bit better than the cheap heuristic, closer to optimal. 12 bits is not exactly "very constrained", so the pressure is not high, but it's still constrained enough that the depth-limited algorithm is going to be necessary in most circumstances. So better have a good one.
I started by making some simple observations : after completing a Huffman tree, all symbols are sorted in decreasing count order. That means that the number of bits required to represent each symbol can only increase along this order, never decrease. That means the only thing I need to track is the border decision (from 5 to 6 bits, from 6 to 7 bits, etc.).
Example Huffman Distribution
So now, the algorithm will concentrate on moving the arrows.
The first part is the same as the cheap heuristic : flatten everything that needs more than maxBits. This will create a "debt" : a symbol requiring maxBits+1 bits creates a debt of 1/2=0.5 when pushed to maxBits. A symbol requiring maxBits+2 creates a debt of 3/4=0.75, and so on. What may not be totally obvious is that the sum of these fractional debts is necessarily an integer number. This is a consequence of starting from a solved Huffman tree, and can be proven by simple induction : if the Huffman tree natural length is maxBits+1, then the number of elements at maxBits+1 is necessarily even, otherwise the sum of probabilities can't be equal to one. The debt's sum is therefore necessarily a multiple of 2 * 0.5 = 1, hence an integer number. Rinse and repeat for maxBits+2 and further depth.
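The flattening step and its debt can be sketched with integer arithmetic only, by first counting in units of 2^-largestBits and converting at the end. This is an illustration of the reasoning above, not the code of `HUF_setMaxHeight()`; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp every length above maxBits down to maxBits, and return the
 * resulting debt in units of 2^-maxBits (so 0.5 + 0.5 returns 1).
 * Assumes nbBits comes from a complete Huffman tree, and that
 * (largest length - maxBits) is below 32. */
static uint32_t flatten_and_count_debt(uint8_t* nbBits, size_t nbSymbols,
                                       unsigned maxBits)
{
    unsigned largest = 0;
    for (size_t i = 0; i < nbSymbols; i++)
        if (nbBits[i] > largest) largest = nbBits[i];
    if (largest <= maxBits) return 0;          /* nothing to flatten */

    const unsigned extra = largest - maxBits;
    uint32_t debt = 0;                         /* in units of 2^-largest */
    for (size_t i = 0; i < nbSymbols; i++) {
        if (nbBits[i] > maxBits) {
            /* new weight 2^-maxBits minus old weight 2^-nbBits[i] */
            debt += (1u << extra) - (1u << (largest - nbBits[i]));
            nbBits[i] = (uint8_t)maxBits;
        }
    }
    return debt >> extra;                      /* back to units of 2^-maxBits */
}
```

With lengths {1, 2, 3, 4, 5, 5} and maxBits = 3, the debts are 0.5 + 0.75 + 0.75, and the function returns 2.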
So now we have a debt to repay. Each time you demote a symbol from maxBits-1 to maxBits, you repay 1 debt. Since the symbols are already sorted in decreasing frequency, it's easy to just grab the last maxBits-1 ones and demote them to maxBits until the debt is repaid. This is in essence what the cheap heuristic does.
But one must note that demoting a symbol from maxBits-2 to maxBits-1 repays not 1 but 2 debts. Demoting from maxBits-3 to maxBits-2 repays 4 debts. And so on. So now the question becomes : is it preferable to demote a single maxBits-2 symbol or 2 maxBits-1 symbols ?
The answer to this question is trivial since we deal with integer numbers of bits : just compare the sum of occurrences of the 2 maxBits-1 symbols with the occurrences of the maxBits-2 one. Whichever is smallest costs fewer bits to demote. Proceed.
This approach can be scaled. Need to repay 16 debts ? A single symbol at maxBits-5 might be enough, or 2 at maxBits-4. Recursively, each maxBits-4 symbol might be better replaced by 2 maxBits-3 ones, and so on. The best solution will show up through a simple recursive algorithm.
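The comparison for a debt of 2 can be written down directly. The sketch below assumes symbols sorted by decreasing count (so lengths never decrease along the array) and maxBits of at least 2; the function names are hypothetical and the tie-breaking rule is an arbitrary choice.

```c
#include <stddef.h>
#include <stdint.h>

/* Index of the last (least frequent) symbol using `bits` bits, or -1. */
static long last_with(const uint8_t* nbBits, size_t n, unsigned bits)
{
    for (size_t i = n; i > 0; i--)
        if (nbBits[i - 1] == bits) return (long)(i - 1);
    return -1;
}

/* Repay a debt of 2 by demoting either one symbol from maxBits-2 or the
 * two least frequent symbols at maxBits-1, whichever costs fewer bits.
 * Returns the number of symbols demoted (0 if neither option exists). */
static int repay_two_debts(const uint32_t* count, uint8_t* nbBits,
                           size_t n, unsigned maxBits)
{
    long single = last_with(nbBits, n, maxBits - 2);
    long b = last_with(nbBits, n, maxBits - 1);
    long a = (b > 0 && nbBits[b - 1] == maxBits - 1) ? b - 1 : -1;

    if (single < 0 && a < 0) return 0;
    /* each demotion costs one extra bit per occurrence of the symbol */
    if (single >= 0 && (a < 0 || count[single] < count[a] + count[b])) {
        nbBits[single]++;
        return 1;
    }
    nbBits[a]++;
    nbBits[b]++;
    return 2;
}
```

With counts {50, 20, 9, 8, 3, 2, 1}, lengths {2, 2, 3, 3, 4, 4, 4} and maxBits = 4 (values made up for the example), the pair costs 9 + 8 = 17 bits against 20 for the single symbol, so the pair is demoted.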
Sometimes, it might be better to overshoot : if you have to repay a debt of 7, which formula is better ? 4+2+1, or 8-1 ? (the -1 can be achieved by promoting the best maxBits symbol to maxBits-1). In theory, you would have to compare both and select the better one. Doing so leads to an optimal algorithm. In practice though, the positive debt repayment (4+2+1) is most likely the better one, since some rare twisted distribution is required for the overshoot solution to win.
The algorithm becomes a bit more complex when some bit ranks are missing. For example, you need to repay a debt of 2, but there is no symbol left at maxBits-2. In such a case, you can still default to maxBits-1, but maybe there is no symbol left there either. In which case, you are forced to overshoot (maxBits-3) and promote enough elements to get the debt back to zero.
On average, the fast variant of this algorithm remains very fast. Its CPU cost is unnoticeable, compared to the decoding cost itself, and the final compression ratio is barely affected (<0.1%) compared to unconstrained tree depth. So that's mission accomplished.
The fast variant of the algorithm is available in open source and can be grabbed at github, under the function name HUF_setMaxHeight().

Wednesday, July 29, 2015

Huffman revisited - Part 2 : the Decoder

The first attempt to decompress the Huffman bitStream created by a huff0 version modified to use the FSE bitStream ended in brutal disenchantment. While the decoding itself worked fine, the resulting speed was a mere 180 MB/s.
OK, in absolute terms, it looks like a reasonable speed, but keep in mind this is far off the objective of beating FSE (which decodes at 475 MB/s on the same system), and even worse than reference zlib huffman. Some generic attempts at improving speed barely changed this, moving up just above 190 MB/s.
This was a disappointment, and a clear proof that the bitStream alone wasn't enough to explain FSE speed. So what could produce such a large difference ?
Let's look at the code. The critical section of FSE decoding loop looks like this :
    DInfo = table[state];
    nbBits = DInfo.nbBits;
    symbol = DInfo.symbol;
    lowBits = FSE_readBits(bitD, nbBits);
    state = DInfo.newState + lowBits;
    return symbol;
while for Huff0, it would look like this :
    symbol = tableSymbols[state];
    nbBits = tableNbBits[symbol];
    lowBits = FSE_readBits(bitD, nbBits);
    state = ((state << nbBits) & mask) + lowBits;
    return symbol;
There are some similarities, but also some visible differences. First, Huff0 creates 2 decoding tables, one to determine the symbol being decoded, the other one to determine how many bits are read. This is a good design for memory space : the larger table is tableSymbols, as its size primarily depends on 1<<maxNbBits. The second table, tableNbBits, is much smaller : its size only depends on nbSymbols. This construction allows using only 1 byte per cell. It favorably compares to 4 bytes per cell for FSE. This memory advantage can be used either as a net space saver, or as a way to boost accuracy, by increasing maxNbBits.
The cost for it is that there are 2 interdependent operations : first decode the state to get the symbol, then use the symbol to get nbBits.
This interdependence is likely the bottleneck. When trying to design high performance computation loops, there are 3 major rules to keep in mind :
  • Ensure hot data is already in the cache.
  • Avoid badly predictable branches (predictable ones are fine)
  • For modern OoO (Out of Order) CPU : keep their multiple execution units busy by feeding them with independent (parallelizable) operations.
This list is given in priority order. It makes no sense to try optimizing your code for OoO operations if the CPU has to wait for data from main memory, as the latency cost is much higher than any CPU operation. If your code is full of badly predictable branches, resulting in branch flush penalties, this is also a much larger problem than having some idle execution units. So you can only get to the third set of optimization after properly solving the previous ones.
This is exactly the situation where Huff0 is, with a fully branchless bitstream and data tables entirely within L1 cache. So the next performance boost will likely be found in OoO operations.
In order to avoid the dependency between symbol first, then nbBits, let's try a different table design, where nbBits is directly stored alongside symbol, in the state table. This doubles the memory cost, hence reducing the memory advantage enjoyed by Huffman compared to FSE. But let's see where it goes :
    symbol = table[state].symbol;
    nbBits = table[state].nbBits;
    lowBits = FSE_readBits(bitD, nbBits);
    state = ((state << nbBits) & mask) + lowBits;
    return symbol;
This simple change alone is enough to boost the speed to 250 MB/s. Still quite far from the 475 MB/s enjoyed by FSE on the same system, but nonetheless a nice performance boost. More critically, it underlines that the diagnosis was correct : untangling operation dependencies frees up CPU OoO execution units, so they can do more work within each cycle.
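To see the combined-cell design working end to end, here is a small self-contained sketch. It is not the huff0 code : the table is built from canonical code lengths, the bit reader is a naive MSB-first one rather than the branchless FSE bitStream, and every name (`DCell`, `build_table`, `decode_stream`, `MAX_BITS`) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BITS 3   /* hypothetical table log, tiny for the example */

/* symbol and nbBits stored side by side: one lookup gives both */
typedef struct { uint8_t symbol; uint8_t nbBits; } DCell;

/* Fill a single-level table from code lengths forming a complete
 * canonical prefix code (shorter codes first, then by symbol value). */
static void build_table(DCell* table, const uint8_t* lengths, int nbSymbols)
{
    size_t pos = 0;
    for (int len = 1; len <= MAX_BITS; len++) {
        for (int s = 0; s < nbSymbols; s++) {
            if (lengths[s] != len) continue;
            size_t span = (size_t)1 << (MAX_BITS - len);
            for (size_t i = 0; i < span; i++) {
                table[pos + i].symbol = (uint8_t)s;
                table[pos + i].nbBits = (uint8_t)len;
            }
            pos += span;
        }
    }
}

typedef struct { const uint8_t* src; size_t bitPos; size_t nbBitsTotal; } BitReader;

/* Naive MSB-first reader; bits past the end of the input read as 0. */
static unsigned read_bits(BitReader* br, unsigned n)
{
    unsigned v = 0;
    for (unsigned i = 0; i < n; i++) {
        unsigned bit = 0;
        if (br->bitPos < br->nbBitsTotal)
            bit = (br->src[br->bitPos >> 3] >> (7 - (br->bitPos & 7))) & 1;
        br->bitPos++;
        v = (v << 1) | bit;
    }
    return v;
}

/* Decode nbSymbols symbols; state is a MAX_BITS-wide window on the stream. */
static void decode_stream(uint8_t* dst, size_t nbSymbols, const DCell* table,
                          const uint8_t* src, size_t srcSize)
{
    BitReader br = { src, 0, srcSize * 8 };
    const unsigned mask = (1u << MAX_BITS) - 1;
    unsigned state = read_bits(&br, MAX_BITS);
    for (size_t i = 0; i < nbSymbols; i++) {
        const DCell cell = table[state];      /* single table access */
        dst[i] = cell.symbol;
        state = ((state << cell.nbBits) & mask) + read_bits(&br, cell.nbBits);
    }
}
```

With lengths {1, 2, 3, 3}, the codes are 0, 10, 110 and 111, so the bytes 0x5B 0x80 carry the symbols 0, 1, 2, 3, 0.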
So let's ramp up the concept. We have removed one operation dependency. Is there another one ?
Yes. When looking at the main decoding loop from a higher perspective, we can see there are 4 decoding operations per loop. But each decoding operation must wait for the previous one to be completed, because in order to know how to read the bitStream for symbol 2, we first need to know how many bits were consumed by symbol 1.
Compare with how FSE works : since state values are separated from the bitStream, it's possible to decode symbol1 and symbol2, and retrieve their respective nbBits, in any order, without any dependency. Only later operations, retrieving lowBits from the bitStream to calculate the next state values, introduce some ordering dependency (and even this one can be partially unordered).
The main idea is this one : to decode faster, it's necessary to retrieve several symbols in parallel, without dependency. So let's create a compressed data flow which makes such operation possible.
Re-using FSE principles "as is" to design a faster Huffman decoding is an obvious choice, but it predictably results in about the same speed. As stated previously, it's not interesting to design a new Huffman encoder/decoder if it just ends up being as fast as FSE. If that is the outcome, then let's simply use FSE instead.
Fortunately, we already know that compression can be faster. So let's concentrate on the decoding side. Since it seems impossible to decode the next symbol without first decoding the previous one from the same bitStream, let's design multiple bitStreams.
The new design is a bit more complex. Compression side is affected : in order to create multiple bitStreams, one solution is to scan input data block multiple times. It proved efficient enough to not bother with a different design. On top of that, a jumptable is required at the beginning of the block, to let the decoder know where each bitStream starts.
Huff0 bitStream design
Within each bitStream, it's still necessary to decode the first symbol to read the second. But each bitStream is independent, so it's possible to decode up to 4 symbols in parallel.
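A jump table of this kind can be parsed in a few lines. The layout below is purely an assumption for illustration (three little-endian 16-bit sizes, the fourth stream taking whatever remains) and is not claimed to be the actual huff0 block format; `Stream` and `locate_streams` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { const uint8_t* start; size_t size; } Stream;

/* Locate 4 independent bitStreams inside a block.
 * Assumed layout: [size0:2][size1:2][size2:2][stream0][stream1][stream2][stream3]
 * Returns 0 on success, -1 if the header is inconsistent with blockSize. */
static int locate_streams(Stream out[4], const uint8_t* block, size_t blockSize)
{
    const size_t headerSize = 6;
    if (blockSize < headerSize) return -1;

    const uint8_t* p = block + headerSize;
    size_t remaining = blockSize - headerSize;
    for (int i = 0; i < 3; i++) {
        size_t s = (size_t)block[2 * i] | ((size_t)block[2 * i + 1] << 8);
        if (s > remaining) return -1;    /* corrupted jump table */
        out[i].start = p;
        out[i].size = s;
        p += s;
        remaining -= s;
    }
    out[3].start = p;                    /* last stream: everything left */
    out[3].size = remaining;
    return 0;
}
```

Once the four start positions are known, the decoder can keep one bit reader per stream and decode one symbol from each in turn, so that consecutive table lookups no longer depend on each other.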
This proved a design win. The new huff0 decompresses at 600 MB/s while preserving the compression speed of 500 MB/s. This compares favorably to FSE or zlib's huffman, as detailed below :
|        | compression | decompression |
| huff0  | 500 MB/s    | 600 MB/s      |
| FSE    | 320 MB/s    | 475 MB/s      |
| zlib-h | 250 MB/s    | 250 MB/s      |
With that part solved, it was possible to check that there is no visible compression difference between FSE and Huff0 on Literals data. To be more precise, compression is slightly worse, but header size is slightly better (huffman headers are simpler to describe). On average, both effects compensate.
The resulting code is open sourced and currently available at : https://github.com/Cyan4973/FiniteStateEntropy (dev branch)
The new API mimics its FSE counterpart, and provides only the higher (simpler) prototypes for now :
size_t HUF_compress (void* dst, size_t dstSize, 
               const void* src, size_t srcSize);
size_t HUF_decompress(void* dst,  size_t maxDstSize,
                const void* cSrc, size_t cSrcSize);
For the time being, both FSE and huff0 are available within the same library, and even within the same file. The reasoning is that they share the same bitStream code. Obviously, many design choices will have the opportunity to be challenged and improved in the near future.
Having created a new competitor to FSE, it was only logical to check how it would behave within Zstandard. It's almost a drop-in replacement for literal compression.
| Zstandard           | previous | Huff0 literals |
| compression speed   | 200 MB/s | 240 MB/s       |
| decompression speed | 540 MB/s | 620 MB/s       |
A nice speed boost with no impact on compression ratio. Overall, a fairly positive outcome.

Tuesday, July 28, 2015

Huffman revisited - part 1

Huffman compression has been a well known entropic compression technique since the 1950's. It's optimal, in the sense there is no better construction if one accepts the limitation of using an integer number of bits per symbol, a constraint that can severely limit its compression capability in presence of high probability symbols.
Huffman compression is very popular, and quite rightly so, thanks to its simplicity and clarity. (It's also patent-free, which helps.) For a long time, it remained the entropic compressor of choice due to its excellent speed / efficiency trade off.
Today, we can use more powerful entropic compressors such as Arithmetic Coding or the newer ANS based Finite State Entropy, which are able to grab fractional bits, hence ensuring a better compression ratio, closer to the Shannon Limit.
The Shannon Limit must be considered like the speed of light, as a hard wall that cannot be crossed. Anytime someone claims the contrary, they are either hiding some cost portions (such as headers, or the decoder itself), or solving a different problem, entangling modeling and entropy. As long as entropy alone is considered, there is simply no way to beat the Shannon Limit. You can just get closer to it.
This leads us to a simple question : are there situations where Huffman compression is good enough, meaning that it is so close to Shannon limit that there is very little gain remaining, if any ?
The answer to this question is yes.
Let's forget some curious corner cases where symbol frequencies are clean powers of 2. Of course, in such a case, Huffman compression would be optimal, but this is way too specific to consider.
Let's therefore imagine a more "natural" situation where all symbol frequencies are randomly scattered along the probability axis, with the sole condition that the sum of all probabilities must be equal to 1.
A simple observation : the more numerous the symbols, the more likely each symbol probability is going to be small (since their total sum must be equal to 1).
This is an important observation. When the probability of a symbol is small, its deviation from the nearest power of 2 is also small. At some point, this deviation becomes negligible.
(Edit : okay, it's a bit more complex than that. The power of low probability symbols also comes from their combinatorial effects : they help the huffman tree to be more balanced. But that part is more complex to analyze, so just take my word for it.)
Huffman deviation from Shannon optimal
Therefore, if we are in a situation where no symbol gets a large probability (all below 10%), Huffman compression is likely to provide a "good enough" compression result, meaning close enough to the hard "Shannon limit" so that it doesn't matter to get even closer to it.
In a compression algorithm such as Zstandard, the literals are symbols which belong to this category. They are basically the "rest" from LZ compression, which couldn't be identified as part of repeated sequences. They can be any byte value from 0 to 255, which means every symbol gets an average probability of 0.4% (1/256). Of course, there are some large differences between most common and less common ones, especially on text files. But in practice, most probabilities remain small, so Huffman deviation should be negligible.
In Zstandard, all symbols are compressed using Finite State Entropy, which is very fast and performs fractional bit compression. We are saying that, for literals, fractional bit makes little difference, so Huffman can be "good enough". So could we use Huffman instead of FSE for such symbols ?
This would only make sense if Huffman compression could bring some kind of advantage to the table, for example speed, and/or memory usage. Alas, currently known versions of Huffman perform worse than Finite State Entropy. The zlib reference version, which is pretty good, maxes out at 250-300 MB/s, which isn't close to FSE results. My own, older, version of Huffman, huff0, is not even as good as the zlib one.
But it's not game over. After all, analysing the FSE algorithm in detail, there is no reason for it to be faster than Huffman, since their complexities are similar. A fast, modern, Huffman compressor should reach equivalent speed, if not better on the compression side (due to an additional operation required by FSE to provide fractional bits).
Part of the reason why FSE is fast is that it uses some clever bitStream techniques, combining multiple symbols into branchless writes, a trick which is not strictly tied to FSE and can be used in a different context. So the idea was to re-use the bitStream interface, and combine it with a Huffman compressor.
huff0 was refurbished and improved to employ FSE bitStream. In order to preserve code compatibility, I kept FSE design of compressing and decompressing in reverse directions, which is not strictly necessary for Huffman. I could test though that it does not make any noticeable difference for Huffman compression, making this feature a non-event as long as it remains hidden within block API level.
Moving huff0 to this new bitStream proved extremely easy. And the result was very rewarding. With little effort, I could make it reach 500 MB/s compression speed, way better than any other huffman compressor I'm aware of, and more critically way better than FSE compression, making it a replacement candidate.
With such a great result at hand, I confidently proceeded to implement huffman decompression based on the same design. I was in for a nasty surprise ...

Friday, May 29, 2015

Changing course

Ever since I started programming a few years ago, and selected data compression as my little hobby and obsession, I have remained a part-time, amateur programmer.

My real-life job at the time was Telecommunication Project Manager, later morphed into Marketing Product Manager. It may sound foreign to programming, but not really : within every new product, every innovation sees the contribution of multiple programming teams temporarily sharing some common objectives.

I therefore started programming with a good excuse : I was convinced that it helped me understand and communicate with programming teams, hence making me a better product manager. And, from what I can see today, I would say it worked reasonably well. But let's be fair, the real objective was to entertain my brain, simply because I liked programming, and compression.

But as you can guess, with just a few evenings and week-ends to spare, progress has been slow. Even more so since LZ4 became a "production-ready" source code, requiring a lot of maintenance and care, hence taking a sizable share of available time, and limiting further "research" activity.

That couldn't last. With a baby soon to come, it became clear that I would either have to stop, by starvation of free time, or eventually make programming my full-time activity.

I was lucky enough to receive offers from several companies at this exact moment, while I was pondering my choices for the future. This acted for me as a signal, a perfect opportunity to change course.

Starting June 1st, I'll become a full-time employee at Facebook, Infrastructure division. In the short term, it may translate into somewhat reduced freedom to communicate, but over the long term, it's the better choice to continue working in my field of choice, data compression.

I've selected Facebook for several reasons, not least because they are very keen to let my work in data compression continue in Open Source mode. That's a great point in their favor.
Of course, I guess you are also aware this team has developed an impressive set of tools, processes and mindset, to safely develop and deploy highly advanced software around the planet. So it's the kind of place where a lot of important practices can be learned. It's also an ideal crossover for my dual background in programming and telecommunication.

I'll need to ask for a few formal authorizations before being able to write again in this blog, but I'm optimistic about the outcome. And with programming now my primary activity, I should gradually find more and more time to do what I like, improving current compression algorithms and code base, and plausibly, in the future, find some time to research and deliver some new ones.

Exciting times ahead...

Tuesday, April 7, 2015

Sampling, or a faster LZ4

 Quite some time ago, I experimented with some unusual sampling methods, in an attempt to find a better way to compress data with LZ4.

The main idea was as follows : the LZ4 hash table gets filled pretty quickly, due to its small size. It becomes the dominant limitation, both for compression ratio and speed. In many cases, a hash cell is overwritten many times before being actually useful (i.e. producing a match). So, could there be a better way to update the hash table, one which would update it less often, but in the end, update it more efficiently (i.e. limit the waste from over-writing) ?

It turned out my expectations were too optimistic. Any time I tried to reduce the update rate, it would result in a correspondingly reduced compression ratio. With that experiment failed, I settled for an "optimal" sampling pattern, which became the core of LZ4.

Recently, I've revisited this method. After all, getting a lower compression ratio at a faster speed is not necessarily a bad outcome. It depends on user expectations. So maybe, if users were allowed to select their own "optimal" speed/compression ratio, they might actually prefer another trade-off than the default one.

Enter LZ4_compress_fast(). It's a new function, available only in the developer branch for the time being, which takes a single new parameter : int acceleration.

The concept is fairly simple : the higher the value of acceleration, the faster the compression. Correspondingly, the compression ratio decreases. It can be tuned pretty finely, each acceleration level providing a small 3-4% speed boost, meaning one can select a preferred speed range quite precisely.
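To make the speed/ratio trade-off more concrete, here is a toy model of the sampling idea, written in C. It is not LZ4's actual match finder : the names (toy_scan, toy_stats), the 4096-cell table and the rule "advance by `acceleration` positions after a miss" are all simplifying assumptions for illustration. It only counts how many positions get hashed (a proxy for work done) and how many bytes end up covered by matches (a proxy for compression ratio).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_LOG   12                    /* small table, on purpose */
#define TOY_TABLE_SIZE (1 << TOY_HASH_LOG)

typedef struct {
    size_t probes;        /* positions hashed and stored into the table */
    size_t matchedBytes;  /* bytes covered by the matches found */
} toy_stats;

static uint32_t toy_read32(const uint8_t* p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);                 /* safe unaligned read */
    return v;
}

static uint32_t toy_hash(uint32_t v)
{
    return (v * 2654435761U) >> (32 - TOY_HASH_LOG);
}

/* Hypothetical scanner : after a miss, skip `acceleration` positions. */
toy_stats toy_scan(const uint8_t* src, size_t len, int acceleration)
{
    int32_t table[TOY_TABLE_SIZE];
    toy_stats stats = { 0, 0 };
    size_t const step = (acceleration < 1) ? 1 : (size_t)acceleration;
    size_t ip = 0;
    int i;

    assert(len <= INT32_MAX);                /* positions are stored as int32_t */
    for (i = 0; i < TOY_TABLE_SIZE; i++) table[i] = -1;   /* -1 = empty cell */

    while (ip + 4 <= len) {
        uint32_t const seq = toy_read32(src + ip);
        uint32_t const h = toy_hash(seq);
        int32_t const candidate = table[h];
        table[h] = (int32_t)ip;              /* overwrite the cell */
        stats.probes++;

        if (candidate >= 0 && toy_read32(src + candidate) == seq) {
            size_t mlen = 4;                 /* extend the match forward */
            while (ip + mlen < len && src[candidate + mlen] == src[ip + mlen]) mlen++;
            stats.matchedBytes += mlen;
            ip += mlen;
        } else {
            ip += step;                      /* miss : sample less often */
        }
    }
    return stats;
}
```

On 256 bytes that never repeat, this toy hashes 253 positions with acceleration 1 but only 64 with acceleration 4 : that is where the speed comes from. On 100 identical bytes, it covers 99 bytes with acceleration 1 and 96 with acceleration 4, because the first match is discovered later : that is where the ratio goes.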

In order to get a taste of this new parameter, a few limited tests were run on the same corpus using different acceleration values. Here are some early results :

                    Compression   Decompression   Ratio
memcpy                4200 MB/s      4200 MB/s    1.000
LZ4 fast 50           1080 MB/s      2650 MB/s    1.375
LZ4 fast 17            680 MB/s      2220 MB/s    1.607
LZ4 fast 5             475 MB/s      1920 MB/s    1.886
LZ4 default            385 MB/s      1850 MB/s    2.101

Silesia Corpus in single-thread mode, Core i5-4300U @1.9GHz, compiled with GCC v4.8.2 on Linux Mint 64-bit v17.

These figures give a hint of the relatively wide range of newly accessible speed/compression trade-offs.

The new function prototype is currently only accessible within the "dev" branch. It's still considered experimental, but may find its way into the next release, r129, depending on user feedback.

Having a parameter to accelerate, rather than strengthen, compression is an unusual concept, so it's not yet clear if it's a very good one. What do you think ? Is a faster and programmable version, trading compression ratio for more speed, a good idea to fit into the LZ4 API ?

Edit : LZ4_compress_fast() is released as part of LZ4 r129.