RealTime Data Compression: Finite State Entropy

Monday, December 16, 2013

Finite State Entropy - A new breed of entropy coder

In compression theory, the entropy encoding stage is typically the last stage of a compression algorithm, the one where the gains from the model are realized.

The purpose of the entropy stage is to reduce a set of flags/symbols to their optimal size given their probability. As a simple example, if a flag has a 50% chance of being set, you want to encode it using 1 bit. If a symbol has a 25% probability of having value X, you want to encode it using 2 bits, and so on.
The optimal size to encode a probability is proven, and known as the Shannon limit. You can't beat that limit, you can only get close to it.

A solution to this problem has been worked on for decades, starting with Claude Shannon's own work, which was efficient but not optimal. The optimal solution was ultimately found by David A. Huffman, then a student of Robert Fano at MIT, almost by chance. His version became immensely popular, not least because he could prove that his construction method was optimal : there was no way to build a better distribution.

Or so it was thought.
There was still a problem with Huffman encoding (and all previous ones) : a hidden assumption is that a symbol must be encoded using an integer number of bits. To say it simply, you can't go lower than 1 bit.
It seems reasonable, but that's not even close to Shannon's limit. An event which has a 90% probability of happening, for example, should be encoded using about 0.15 bits. You can't do that using Huffman trees.
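
To make these numbers concrete, here is a minimal C sketch (not part of FSE) that computes the ideal cost -log2(p) of an event, and the average cost per symbol of a whole distribution. The function names are made up for this illustration. The logarithm is a hand-rolled, bit-by-bit approximation so that the snippet builds without linking the math library; `log2()` from `<math.h>` would normally do the job.

```c
#include <assert.h>
#include <stddef.h>

/* Base-2 logarithm built from multiplications and comparisons only.
 * Illustrative helper: it is accurate to roughly 1e-10, which is plenty here. */
static double log2_bitwise(double x)
{
    double result = 0.0, bit = 0.5;
    int i;
    assert(x > 0.0);
    while (x < 1.0)  { x *= 2.0; result -= 1.0; }   /* integer part, x too small */
    while (x >= 2.0) { x *= 0.5; result += 1.0; }   /* integer part, x too large */
    for (i = 0; i < 40; i++) {                      /* fractional part, one bit per squaring */
        x *= x;
        if (x >= 2.0) { x *= 0.5; result += bit; }
        bit *= 0.5;
    }
    return result;
}

/* Ideal cost, in bits, of an event of probability p (0 < p <= 1). */
double shannon_bits(double p)
{
    return -log2_bitwise(p);
}

/* Average ideal cost, in bits per symbol, of a distribution summing to 1. */
double entropy_bits(const double *prob, size_t n)
{
    double sum = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        if (prob[i] > 0.0) sum += prob[i] * shannon_bits(prob[i]);
    return sum;
}
```

`shannon_bits(0.5)` gives 1 bit and `shannon_bits(0.25)` gives 2 bits, as in the examples above, while `shannon_bits(0.9)` gives roughly 0.152 bits. For a two-symbol source with probabilities 0.9 and 0.1, `entropy_bits` returns about 0.469 bits per symbol, whereas a Huffman code has to spend a full bit on each symbol.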

A solution to this problem was found almost 30 years later, by Jorma Rissanen, under the name of arithmetic coding. Explaining how it works is outside of the scope of this blog post, since it's complex and would require a few chapters; I invite you to read the Wikipedia page if you want to learn more about it. For the purpose of this presentation, it's enough to say that Arithmetic encoding, and its little brother Range encoding, solved the fractional bit issue of Huffman, and with only some minimal losses to complain about due to rounding, get closer to the Shannon limit. So close in fact that entropy encoding is, since then, considered a "solved problem".

Which is terrible because it gives the feeling that nothing better can be invented.

Well, there is more to this story. Of course, there is still a little problem with arithmetic encoders : they require arithmetic operations, such as multiplications, and divisions, and strictly defined rounding errors.

This is a serious requirement for a CPU, especially in the 80's. Moreover, some lawyer-happy companies such as IBM grabbed this opportunity to flood the field with multiple dubious patents on minor implementation details, making clear that anyone trying to use the method would face expensive litigation. Considering this environment, the method was barely used for the next few decades, Huffman remaining the algorithm of choice for the entropy stage.

Even today, with most of the patent issues cleared, modern CPUs will still take a hit due to the divisions. Optimized versions can sometimes do away with the division during the encoding stage, but not the decoding stage (with the exception of binary arithmetic coding, which is however limited to 0/1 symbols).
As a consequence, arithmetic encoders are noticeably slower than Huffman ones. For low-end or even "retro" CPUs, they are simply out of reach.

It's been a long-time objective of mine to bring arithmetic-level compression performance to vintage (or retro) CPUs. Consider it a challenge. I've tested several variants, for example a mix of Huffman and Binary Arithmetic, which was free of divisions, but alas still needed multiplications, and required more registers to operate, which was overkill for weak CPUs.

So I've been reading with a keen eye the ANS theory, from Jarek Duda, which I felt was heading in the same direction. If you are able to fully understand his paper, you are better than me, because quite frankly, most of the wording used in his document is way out of my reach. (Note : Jarek pointed to an updated version of his paper, which should be easier to understand). Fortunately, it nonetheless resonated, because I was working on something very similar, and Jarek's document provided the last elements required to make it work.

And here is the result today, the Finite State Entropy coder, which is proposed in a BSD-license package at Github.

In a nutshell, this coder provides the same level of performance as Arithmetic coder, but only requires additions, masks, and shifts.
The speed of this implementation is fairly good, and even on modern high-end CPU, it can prove a valuable replacement to standard Huffman implementations.
Compared to zlib's Huffman entropy coder, it manages to outperform its compression ratio while besting it on speed, especially decoding speed.

Benchmark platform : Core i5-3340M (2.7GHz), Windows Seven 64-bit
Benchmarked file : win98-lz4-run
Algorithm   Ratio   Compression   Decompression
FSE         2.688   290 MS/s      415 MS/s
zlib        2.660   200 MS/s      220 MS/s

Benchmarked file : proba70.bin
Algorithm   Ratio   Compression   Decompression
FSE         6.316   300 MS/s      420 MS/s
zlib        5.575   250 MS/s      270 MS/s

Benchmarked file : proba90.bin
Algorithm   Ratio   Compression   Decompression
FSE         15.21   300 MS/s      420 MS/s
zlib        7.175   250 MS/s      285 MS/s

As could be guessed, the higher the compression ratio, the more efficient FSE becomes compared to Huffman, since Huffman can't break the "1 bit per symbol" limit.
FSE speed is also very stable, under all probabilities.

I'm quite pleased with the result, especially considering that, since the invention of arithmetic coding in the 70's, little new has been brought to this field.

-----------------------------------------------------

A little bit of History :

Jarek Duda's ANS theory was first published in 2007, and the paper has received many revisions since then. Back in 2007, only Matt Mahoney had enough skill and willpower to sift through the complex theory, and morph it into working code. However, Matt concentrated on the only use case of interest to him, the Binary version, called ABS, limited to a 0/1 alphabet. This decision put his implementation in direct competition with the Binary Arithmetic Coder, which is very fast, efficient, and flexible. Basically, a losing ground for ANS. As a consequence, ANS theory looked uncompetitive, and slumbered during the next few years.

FSE work re-instates ANS as a competitive algorithm for multi-symbol alphabets (>2), positioning it as a viable alternative to block-based Huffman.

Thanks to promising early results from FSE, Jarek turned his attention back to multi-symbol alphabets. As we were chatting about perspectives and limitations of ANS, I underlined the requirement of a decoding table as a memory cost, and suggested a solution in the making to limit that issue (which ultimately failed). But Jarek took up the challenge, and successfully created a new variant. He then published an updated version of his paper. The new method would be called rANS. He would later invent the terms tANS and rANS to distinguish the different methods.

rANS was later adapted by Fabian Giesen and Charles Bloom, producing some very promising implementations, notably vector-oriented code by Fabian.

But as said, this is not the direction selected for FSE, created before Jarek's paper revision. FSE is a finite state machine, created precisely to avoid any kind of multiplication, with an eye on low-power CPU requirements. It's interesting to note such radically different implementations can emerge from a common starting theory.
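
To give a feel for what "finite state machine" means here, below is a toy table-based ANS coder in C. It is a sketch written for this page, not the FSE source: the names (`TansTables`, `tans_build`, ...), the 16-state table, and the bit stack that stores one bit per byte are all assumptions chosen for readability. The symbol spreading simply steps through the table with an odd stride, which is enough to visit every cell of a power-of-2 table exactly once; the spreading rule actually used by FSE may differ.

```c
#include <assert.h>
#include <stddef.h>

enum { TABLE_LOG = 4, TABLE_SIZE = 1 << TABLE_LOG, MAX_SYMBOLS = 8 };

typedef struct {
    unsigned char  symbol;    /* symbol emitted when the decoder is in this state */
    unsigned char  nbBits;    /* number of fresh bits to read afterwards */
    unsigned short newState;  /* base of the next state, before adding those bits */
} DecodeEntry;

typedef struct {
    DecodeEntry    decode[TABLE_SIZE];
    unsigned short encode[TABLE_SIZE];   /* (symbol, rank) -> table index */
    unsigned       count[MAX_SYMBOLS];   /* normalized frequencies */
    unsigned       start[MAX_SYMBOLS];   /* cumulated frequencies */
} TansTables;

typedef struct { unsigned char bit[1024]; size_t size; } BitStack;  /* one bit per byte */

static unsigned floor_log2(unsigned v) { unsigned r = 0; while (v >>= 1) r++; return r; }

/* Every count must be >= 1 and the counts must sum to TABLE_SIZE. */
void tans_build(TansTables *t, const unsigned *count, unsigned nbSymbols)
{
    unsigned char spread[TABLE_SIZE];
    unsigned next[MAX_SYMBOLS];
    const unsigned step = (TABLE_SIZE >> 1) + (TABLE_SIZE >> 3) + 3;  /* odd stride */
    unsigned pos = 0, total = 0, s, i, x;

    assert(nbSymbols <= MAX_SYMBOLS);
    for (s = 0; s < nbSymbols; s++) {
        t->count[s] = next[s] = count[s];
        t->start[s] = total;
        total += count[s];
        for (i = 0; i < count[s]; i++) {             /* scatter the symbol over the table */
            spread[pos] = (unsigned char)s;
            pos = (pos + step) & (TABLE_SIZE - 1);
        }
    }
    assert(total == (unsigned)TABLE_SIZE);
    for (x = 0; x < (unsigned)TABLE_SIZE; x++) {
        const unsigned sym = spread[x];
        const unsigned xp = next[sym]++;             /* in [count, 2*count) */
        const unsigned nbBits = TABLE_LOG - floor_log2(xp);
        t->decode[x].symbol   = (unsigned char)sym;
        t->decode[x].nbBits   = (unsigned char)nbBits;
        t->decode[x].newState = (unsigned short)((xp << nbBits) - TABLE_SIZE);
        t->encode[t->start[sym] + xp - count[sym]] = (unsigned short)x;
    }
}

/* Encodes src from last symbol to first; returns the final state, in [L, 2L). */
unsigned tans_encode(const TansTables *t, const unsigned char *src, size_t n, BitStack *bs)
{
    unsigned state = TABLE_SIZE;
    assert(n * TABLE_LOG <= sizeof bs->bit);
    bs->size = 0;
    while (n-- > 0) {
        const unsigned s = src[n];
        unsigned nbBits = 0, b;
        while ((state >> nbBits) >= (t->count[s] << 1)) nbBits++;
        for (b = 0; b < nbBits; b++)                 /* flush the low bits, LSB first */
            bs->bit[bs->size++] = (unsigned char)((state >> b) & 1);
        state = TABLE_SIZE + t->encode[t->start[s] + (state >> nbBits) - t->count[s]];
    }
    return state;
}

/* Decodes n symbols: one table lookup, a few bit reads and one addition per symbol. */
void tans_decode(const TansTables *t, unsigned state, BitStack *bs, unsigned char *dst, size_t n)
{
    size_t i;
    state -= TABLE_SIZE;
    for (i = 0; i < n; i++) {
        const DecodeEntry e = t->decode[state];
        unsigned rest = 0, b;
        dst[i] = e.symbol;
        for (b = 0; b < e.nbBits; b++)               /* bits come back in reverse order */
            rest = (rest << 1) | bs->bit[--bs->size];
        state = e.newState + rest;
    }
}
```

With counts {12, 3, 1} (so symbol 0 has probability 3/4), encoding ten times symbol 0 pushes only 4 bits in this toy, that is 0.4 bit per symbol, against an ideal cost of about 0.415. The state carries the fractional bits from one symbol to the next, and neither loop needs a multiplication or a division.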

For the time being, FSE is still considered beta stuff, so please consider this release for testing purposes or private development environments.

Explaining how and why it works is pretty complex, and will require a few more posts, but bear with me, they will come in this blog.

Hopefully, with Jarek's document and these implementations now published, it will be harder this time for big corporations to confiscate an innovation from the public domain.

-----------------------------------------------------

List of Blog posts explaining FSE algorithm :

http://fastcompression.blogspot.com/2014/01/fse-decoding-how-it-works.html
http://fastcompression.blogspot.com/2014/01/huffman-comparison-with-fse.html
http://fastcompression.blogspot.com/2014/02/a-comparison-of-arithmetic-encoding.html
http://fastcompression.blogspot.com/2014/02/fse-defining-optimal-subranges.html
http://fastcompression.blogspot.com/2014/02/fse-distributing-symbol-values.html
http://fastcompression.blogspot.com/2014/02/fse-decoding-wrap-up.html
http://fastcompression.blogspot.com/2014/02/fse-encoding-how-it-works.html
http://fastcompression.blogspot.com/2014/02/fse-encoding-part-2.html
http://fastcompression.blogspot.com/2014/02/fse-tricks-memory-efficient-subrange.html
http://fastcompression.blogspot.com/2014/04/taking-advantage-of-unequalities-to.html

[Edit] : replaced huff0 by zlib on the comparison table
[Edit] : added entry on rANS variants by Fabian & Charles
[Edit] : added list of blog entries

40 comments:

Jarek Duda, December 17, 2013 at 5:23 PM
Thanks, Yann - indeed while ABS for "bit-wise adaptive" is practically equivalent to arithmetic coding, ANS for fixed probabilities ("block-based") allows encoding a symbol from a large alphabet with a single table lookup - it should be faster than Huffman, while having precision like arithmetic.

The number of states should be a few times larger than alphabet size to work in deltaH~0.001 bits/symbol regime.
You could make it faster by putting also the bitwise operations into the table, like the whole "(symbol,state)->(bit sequence, new state)" rules ... but at cost of using lower number of states/alphabet to remain in L1 cache.
Generally there are lots of options to optimize among ...

Another advantage of ANS is that we can slightly perturb the initialization procedure using a pseudo-random number generator initialized with a cryptographic key (and e.g. also the block number): for example choosing between the lowest weight symbol and the second best one.
This way we get simultaneously a decent encryption for free.

The advantage of Huffman is that we can adaptively modify the tree on the run, but we could also do it in ANS - switch some appearances between symbols and renumber the following ones ... but it would be costly for a large table.

About materials, the recent paper ( http://arxiv.org/abs/1311.2540 ) should be more readable and here is a poster gathering basic information: https://dl.dropboxusercontent.com/u/12405967/poster.pdf .

Cyan, December 18, 2013 at 11:18 AM
> Decoding Huffman is moving on the tree, which has "the size of alphabet" leaves - how you can manage without having this tree stored in memory?

Indeed, this is the minimum required. But it's much smaller than a full decode table, which would read the bitstream and directly give the symbol.
On the other hand, for FSE, the full decode table seems mandatory.

> If you want low memory ANS, see the poster - switching between 5 automata having 5 states allows for deltaH~0.01 bits/symbol, about 20 automata of 16 states for deltaH~0.001.

I'm not sure I understand this part.
The poster seems to concentrate on the ABS variant.
I understand the concept of switching between different state tables,
but I don't see how it would allow building a lower-memory version of ANS.

Currently, the default setting of FSE is to use a single 12-bit table to decode symbols which can have up to 256 values.
Suppose I'm using this construction to encode.
How would you propose to decode it using less memory than FSE does now ?

Jarek Duda, December 20, 2013 at 10:14 PM
>(...)while a tree will cost between 32 bytes and 512 bytes, depending on alphabet size.
32-512 bytes to quickly decode a 256 size alphabet?
I have looked at an article about Huffman decoding ( http://www.commsp.ee.ic.ac.uk/~tania/teaching/DIP%202012/huffman_1.pdf ), and I don't see how you would do it.

For ANS decoding of 256 size alphabet, the reasonable minimum is 512 states - you need 1 byte for the symbol plus at most 10 bits for the new state (8 if no symbol is above 1/4 probability, 6 if not above 1/16,...) - about 1kB would be enough.

Jarek Duda, December 20, 2013 at 11:41 PM
Still, if something was compressed using Huffman for a 256 size alphabet, I don't think you could decompress it using less than 1kB memory?
And generally in situation when the memory is that important, you could use arithmetic or ABS instead...

For ANS with very small number of states, the way we spread symbols becomes really essential. I see you use a bit different method than mine - while checking all possible spreads for a few cases, the method from the last paper gave usually the optimal spread ...

Cyan, December 20, 2013 at 11:54 PM
> Still, if something was compressed using Huffman for a 256 size alphabet, I don't think you could decompress it using less than 1kB memory?

511 bytes, to be exact

> And generally in situation when the memory is that important, you could use arithmetic or ABS instead...

It really misses the point :
it's certainly possible to build a low-memory entropy coder *by design*.
What I'm trying to underline is that there is no need to make such a trade off on the compression side when using Huffman. It can be figured out on the decompression side as needed.
Oh well, at least I tried to explain.

> the way we spread symbols becomes really essential. I see you use a bit different method than mine

Indeed. I was expecting to detail that point in a later post.

An optimal distribution can be built, and that's what my initial algorithm was doing. But it was slow. Moreover, it required some divisions and a few multiplications to init critical variables, not a lot, but still not ideal for low-power CPUs.

Initially I thought that it was critical for the performance of the algorithm to have a perfect distribution, but then I started experimenting, and understood that no, it was just a matter of efficiency, a classic compression trade-off.

The method currently coded into FSE has been selected because :
1) It's damn fast.
2) no division, no multiplication, only add and masks
3) It's still very close to optimal. I typically get a 0.01% difference between "optimal distribution" and this method.
4) It's not adapted to very small number of states. Below 64 states, forget it, it's not designed for such a use case.

Jarek Duda, December 21, 2013 at 8:00 PM
I understand the advantage of Huffman decoder flexibility, but I don't see how to decode a 256 size alphabet in 511 bytes...
This cost becomes essential for history/context dependent probability estimation (in contrast to Huffman, with ANS it makes sense - especially for static compression like that of DNA) - the memory requirement has to be multiplied by the number of distinguished histories/contexts (separate coding tables).
The precision of symbol spread is nearly negligible for a large number of states (we could use this freedom for simultaneous encryption or maybe a different purpose?), however for this 1kB case: 256 size alphabet on 512 states, it will probably matter.
The main cost in a precise spread is finding the symbol with the minimum weight - it is a linear cost if we just check them one by one, but it can be reduced to log(n) by using a heap.

Jarek Duda, December 25, 2013 at 6:43 PM
Hi Yann,
I was thinking about your decoding hot loop - you access the data stream table every step ( bitStream = *(U32*)ip ), which seems costly - maybe it would be faster to use a 64 (or 32) bit "buffer" there?
Something like this (uses ip table >4 times less frequently):

state = decodeTable[state].newState + (buffer & ((1<<nbBits)-1));
buffer = buffer >> nbBits;
if ((bitCount-=nbBits)<=16) {buffer = buffer | ((U64)*(U32*)ip << bitCount); bitCount+=32; ip-=4;}

The "16" is the maximum number of bits that can be used in a single step (FSE_MEMORY_USAGE). For 32 bit buffer this 16 is important and you read 2 bytes when needed (bitCount+=16; ip-=2)

Also, bitor ("|") could be a bit faster than "+", and are you sure that "mask[nbBits]" is faster than "(1<<nbBits)-1" ?
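
The buffering idea above can be tried in isolation with a small stand-alone bit reader. This is only a sketch under a few assumptions: the names `BitReader`, `br_init` and `br_read` are made up, the stream is read forward (the loop discussed here walks it backward with ip -= 4), bytes are assembled explicitly in little-endian order instead of casting the pointer, and the caller must guarantee that 4 readable bytes remain whenever a refill happens.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    const unsigned char *ip;  /* next 4 bytes to load */
    uint64_t buffer;          /* unread bits, lowest bit first */
    unsigned bitCount;        /* number of valid bits in buffer */
} BitReader;

/* Portable little-endian 32-bit load (no pointer cast, no alignment issue). */
static uint32_t load32_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Touch the input only when 16 bits or fewer are left in the buffer. */
static void br_refill(BitReader *br)
{
    if (br->bitCount <= 16) {
        br->buffer |= (uint64_t)load32_le(br->ip) << br->bitCount;
        br->bitCount += 32;
        br->ip += 4;
    }
}

void br_init(BitReader *br, const unsigned char *src)
{
    br->ip = src;
    br->buffer = 0;
    br->bitCount = 0;
    br_refill(br);
}

/* Returns the next nbBits bits (0 <= nbBits <= 16), then refills if needed. */
uint32_t br_read(BitReader *br, unsigned nbBits)
{
    uint32_t value;
    assert(nbBits <= 16);
    value = (uint32_t)br->buffer & ((1u << nbBits) - 1u);
    br->buffer >>= nbBits;
    br->bitCount -= nbBits;
    br_refill(br);
    return value;
}
```

Because at least 17 bits are always available after a refill, a read of up to 16 bits never runs dry, and the input pointer is dereferenced once per 32 bits consumed rather than once per symbol.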

Jarek Duda, December 25, 2013 at 6:52 PM
It keeps removing one line:

state = decodeTable[state].newState + (buffer & mask[nbBits]);
buffer = buffer >> nbBits;
if ((bitCount-=nbBits)<=16) {buffer = buffer | ((U64)*(U32*)ip << bitCount); bitCount+=32; ip-=4;}

Unknown, December 27, 2013 at 6:00 PM
This is a very interesting technique. It seems to have the potential to outperform my SSE-aided range decoder in my current compressor, at least when the number of contexts is small. Alternating between many decode tables is probably not a good idea with this technique.
Which is faster also depends on how well this method behaves when you interleave multiple independent decodes.
Right now I'm more interested in getting the implementation of the single decode as fast/simple as possible.

Here are some of my findings so far:
Disclaimer: this is tested on an i7 laptop with MSVC 2012. Your mileage may vary.

Baseline decode loop: 190 MB/s
    while (op < oend)
    {
        U32 rest;
        const int nbBits = decodeTable[state].nbBits;
        *op++ = decodeTable[state].symbol;
        bitCount -= nbBits;
        rest = (bitStream >> bitCount) & mask[nbBits];
        {   /* renormalize: step back by whole bytes and reload 32 bits */
            int nbBytes = (32-bitCount) >> 3;
            ip -= nbBytes;
            bitStream = *(U32*)ip;
            bitCount += nbBytes*8;
        }
        state = decodeTable[state].newState + rest;
    }

Don't precompute mask: 207 MB/s
    rest = (bitStream >> bitCount) & mask[nbBits];
->
    rest = (bitStream >> bitCount) & ((1<<nbBits)-1);

Use two shifts to isolate bits and reorder to save a sub: 226 MB/s
    bitCount -= nbBits;
    rest = (bitStream >> bitCount) & ((1<<nbBits)-1);
->
    rest = (bitStream << (32 - bitCount)) >> (32 - nbBits);
    bitCount -= nbBits;
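
The two forms can be checked against each other outside of the decoder. The sketch below uses hypothetical function names and takes bitCount as its value before the subtraction. It also makes the preconditions explicit: in C both shift counts must stay below 32, which means nbBits must be at least 1 for the two-shift form.

```c
#include <assert.h>
#include <stdint.h>

/* Form 1: shift the wanted bits down, then mask them. */
uint32_t take_bits_mask(uint32_t bitStream, unsigned bitCount, unsigned nbBits)
{
    assert(nbBits >= 1 && nbBits <= 31 && nbBits <= bitCount && bitCount <= 32);
    return (bitStream >> (bitCount - nbBits)) & ((1u << nbBits) - 1u);
}

/* Form 2: push the already consumed bits out at the top, then shift down. */
uint32_t take_bits_shifts(uint32_t bitStream, unsigned bitCount, unsigned nbBits)
{
    assert(nbBits >= 1 && nbBits <= 31 && nbBits <= bitCount && bitCount <= 32);
    return (uint32_t)(bitStream << (32 - bitCount)) >> (32 - nbBits);
}
```

Both functions return bits [bitCount - nbBits, bitCount) of bitStream. For example, with bitStream = 0xDEADBEEF, bitCount = 32 and nbBits = 8, each of them returns 0xDE.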

We now have (32 - bitCount) twice in the same code segment, so let's invert it and keep track of 32-bitCount instead.
This also requires us to flip a few signs, as we are now counting in the opposite direction.

Inverted bit counter: 229 MB/s
    bitCount = 32 - bitCount;   /* now counts consumed bits */
    while (op < oend)
    {
        U32 rest;
        const int nbBits = decodeTable[state].nbBits;
        *op++ = decodeTable[state].symbol;
        rest = (bitStream << bitCount) >> (32 - nbBits);
        bitCount += nbBits;
        {
            int nbBytes = bitCount >> 3;
            bitCount -= nbBytes*8;
            ip -= nbBytes;
            bitStream = *(U32*)ip;
        }
        state = decodeTable[state].newState + rest;
    }

We can now simplify the 'renormalization' code a bit more.

Simpler bitCount renormalization: 242 MB/s
    bitCount -= nbBytes*8;
->
    bitCount &= 7;

If we negate nbBits (and change to signed char) in the decode table, we get:
    rest = (bitStream << bitCount) >> (32 - nbBits);
    bitCount += nbBits;
->
    rest = (bitStream << bitCount) >> (32 + nbBits);
    bitCount -= nbBits;

On x86, the second operand of a 32-bit shift is taken modulo 32, so we can omit the +32. You probably want a define for that.
Modulo 32 shift trick: 252 MB/s
    rest = (bitStream << bitCount) >> (32 + nbBits);
->
    rest = (bitStream << bitCount) >> nbBits;
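
One caveat for anyone reusing this trick: in C, shifting by a negative count, or by a count equal to or larger than the operand width, is undefined behavior, so the version without +32 relies on what the compiler and the CPU happen to do. A possible portable spelling, sketched below with hypothetical function names, makes the wrap-around explicit with a mask; compilers targeting x86 can typically drop that mask, though this is not guaranteed. The sketch assumes nbBits is between 1 and 31: states that read 0 bits, which can occur when a symbol's probability exceeds 1/2, would need separate handling in every variant shown here, since they lead to a shift by 32.

```c
#include <assert.h>
#include <stdint.h>

/* Reference form, with the explicit (32 - nbBits). */
uint32_t rest_reference(uint32_t bitStream, unsigned consumed, int nbBits)
{
    assert(nbBits >= 1 && nbBits <= 31 && consumed <= 31);
    return (uint32_t)(bitStream << consumed) >> (32 - nbBits);
}

/* Negated form: the table is assumed to store negNbBits = -nbBits.
 * Converting to unsigned and masking with 31 yields 32 - nbBits. */
uint32_t rest_negated(uint32_t bitStream, unsigned consumed, int negNbBits)
{
    assert(negNbBits <= -1 && negNbBits >= -31 && consumed <= 31);
    return (uint32_t)(bitStream << consumed) >> ((unsigned)negNbBits & 31u);
}
```

For instance, with bitStream = 0xDEADBEEF, 4 bits already consumed and 8 bits to read, both `rest_reference(..., 4, 8)` and `rest_negated(..., 4, -8)` return 0xEA.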

I'm interested to hear if you see similar improvements.
-Rune

Unknown, December 27, 2013 at 6:10 PM
This leaves us at: (still assuming nbBits is negated)
    bitCount = 32 - bitCount;   /* counts consumed bits */
    while (op < oend)
    {
        U32 rest;
        const int nbBits = decodeTable[state].nbBits;   /* stored negated */
        *op++ = decodeTable[state].symbol;
    #if USE_MODULO32_TRICK
        rest = (bitStream << bitCount) >> (nbBits);
    #else
        rest = (bitStream << bitCount) >> (32 + nbBits);
    #endif
        bitCount -= nbBits;
        {   /* renormalize: step back by whole bytes, keep the leftover bit offset */
            int nbBytes = bitCount >> 3;
            ip -= nbBytes;
            bitStream = *(U32*)ip;
            bitCount &= 7;
        }
        state = decodeTable[state].newState + rest;
    }

Jarek Duda, December 27, 2013 at 6:31 PM
hi rune stubbe,
You use both ip and op tables once per step - using 64 bits variables as buffers, you could use op table once per 8 steps and ip table could be safely used once per 4 steps - wouldn't it be much faster?

Just refill buffer and flush symbols e.g. every 4 steps:

symbols = symbols | decodeTable[state].symbol << ... ;
state = decodeTable[state].newState + (buffer & ((1 << nbBits)-1));
buffer = buffer >> nbBits; bitCount -= nbBits;

Unknown, December 28, 2013 at 1:53 PM
Jarek: Thank you. I think ANS/ABS is the most interesting thing I have seen in lossless data compression in years :)

You are of course correct. Skipping renormalizations is already a win for range coding. I'm sure we would see similar improvements here. I'm not sure the same is true for output (op), as you are actually executing more instructions to pack the bytes into a dword/qword.

A quick and dirty test seems to indicate that unrolling the code by 2x and doing the ip update only once improves performance to about 263 MB/s (~4% faster). It is actually a little disappointing. I'm guessing part of the explanation is that the ip code is not on the most critical path and thus already partially overlapped with the rest of the decode. I'm guessing the improvement would be larger if we were doing enough independent decodes to saturate the execution units of the CPU.

I might be wrong, but I think Yann's intention with FSE is for it to be an example of ANS and a good starting point for your own implementation. If that is the case then I think it makes sense to keep the implementation as clean and simple as possible, so it is easy to integrate into your own project, where you can make all of the trade-offs with the proper context of the rest of the compressor.

On the other hand, if the intention is to make it as fast as possible, then it makes sense to interleave multiple independent decodes, find a good compromise between maximum symbol length, number of states and renormalization frequency, etc.

Jarek Duda, December 28, 2013 at 2:28 PM
Thanks rune stubbe. It can also be used in lossy compression - it would perfectly fit into e.g. JPEG and generally in nearly every place where Huffman (or Elias, Golomb, Tunstall, ..., sometimes arithmetic) is currently used.

The 4% improvement is indeed a bit disappointing for 2x unrolling. For a 64 bit buffer, we could write 8 successive steps in the loop: with a single use of *op and 2 uses of *ip - it should give another few percent.
Sure, finally it should turn into multiple separate decoders, encoders ... and initializations: symbol spreading procedures (it can be more precise, and can use a PRNG initialized with a key to encrypt the message simultaneously, practically for free).

I think both intentions are important, but with a good implementation it can be directly (or with small modifications) inserted into specific compressor - I think it's the main goal now, especially that Yann has already put it into Zhuff.

Thanks,
Jarek

cbloom, January 1, 2014 at 8:22 PM
Very interesting - can't wait to see more exposition on how this works.

How is it different than just doing a state-based arithmetic coder with only a few fractional bits? The decoder pseudo-code above looks very much like a "deferred summation" arithmetic decoder, or a Howard-Vitter state based arithmetic coder.

Jarek, January 1, 2014 at 11:08 PM
Howard-Vitter "quasi arithmetic coding" is only for binary alphabet.
While the state of arithmetic coding is two numbers defining the range, the advantage of the ANS approach is that the state is only a single natural number (containing log(state) bits of information).
Thanks to that, ANS allows constructing relatively small automata also for a given probability distribution on a large alphabet.

You can find the basic ideas and difference from arithmetic in the poster: https://dl.dropboxusercontent.com/u/12405967/poster.pdf