Friday, May 13, 2016

Finalizing a compression format

With Zstandard v1.0 looming ahead, the last major item for zstd to settle is an extended set of features for its frame encapsulation layer.

Quick overview of the design : data compressed by zstd is cut into blocks. A compressed block has a maximum content size (128 KB), so obviously if input data is larger than this, it will have to occupy multiple blocks.
The frame layer organizes these blocks into a single content. It also provides the decoder with a set of properties that the encoder pledges to respect. These properties allow a decoder to prepare required resources, such as allocating enough memory.

The current frame layer only stores 1 identifier and 2 parameters :
  • frame Id : It simply tells which frame and compression formats to expect next. This is currently used to automatically detect legacy formats (v0.5.x, v0.4.x, etc.) and select the right decoder for them. It occupies the first 4 bytes of a frame.
  • windowLog : This is the maximum search distance that will be used by the encoder. It is also the maximum block size, when (1<<windowLog) < MaxBlockSize (== 128 KB). This is enough for a decoder to guarantee successful decoding operation using a limited buffer budget, whatever the real content size is (endless streaming included).
  • contentSize : This is the amount of data to decode within this frame. This information is optional. It can be used to allocate the exact amount of memory for the object to decode.

This information may seem redundant.
Indeed, in a few situations, it is : when contentSize < (1<<windowLog), it's enough to allocate contentSize bytes for decoding, and windowLog is just redundant.
But for all other situations, windowLog is useful : either contentSize is unknown (it wasn't known at the beginning of the frame and was only discovered on frame termination), or windowLog defines a smaller memory budget than contentSize, in which case, it can be used to limit memory budget.
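
To make the interplay between the two parameters concrete, here is a minimal sketch of how a decoder could derive its memory budget from them. The function names and the CONTENT_SIZE_UNKNOWN marker are hypothetical (they are not part of zstd's API), and windowLog is assumed to be smaller than 64.

```c
#include <stdint.h>

#define MAX_BLOCK_SIZE ((uint64_t)128 * 1024)   /* 128 KB, as stated above */
#define CONTENT_SIZE_UNKNOWN UINT64_MAX          /* hypothetical "field absent" marker */

/* Window size announced by the frame header. */
static uint64_t window_size(unsigned windowLog)
{
    return (uint64_t)1 << windowLog;
}

/* Largest block the encoder may emit within this frame. */
static uint64_t max_block_size(unsigned windowLog)
{
    uint64_t const w = window_size(windowLog);
    return (w < MAX_BLOCK_SIZE) ? w : MAX_BLOCK_SIZE;
}

/* Number of bytes of history the decoder must be able to hold. */
static uint64_t decoder_budget(unsigned windowLog, uint64_t contentSize)
{
    uint64_t const w = window_size(windowLog);
    if (contentSize != CONTENT_SIZE_UNKNOWN && contentSize < w)
        return contentSize;   /* whole content fits : windowLog is redundant */
    return w;                 /* unknown or larger content : windowLog caps the budget */
}
```

With windowLog = 20, a 1000-byte frame only needs 1000 bytes, while an endless stream is capped at 1 MB.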

That's all there is for v0.6.x. Arguably, that's a pretty small list.

The intention is to create a more feature complete frame format for v1.0.
Here is a list of features considered, in priority order :
  • Content Checksum : objective is to validate that decoded content is correct.
  • Dictionary ID : objective is to confirm or detect dictionary mismatch, for files which require a dictionary for correct decompression. Without it, a wrong dictionary could be picked, resulting in silent corruption (or an error).
  • Custom content, aka skippable frames : the objective is to allow users to embed custom elements (comments, indexes, etc.) within a file consisting of multiple concatenated frames.
  • Custom window sizes, including non power of 2 : extend current windowLog scheme, to allow more precise choices.
  • Header checksum : validate that header information is not accidentally distorted.
Each of these bullet points introduces its own set of questions, which are detailed below :

Content checksum
The goal of this field is obvious : validate that decoded content is correct. But there are many little details to select.

Content checksum only protects against accidental errors (transmission, storage, bugs, etc). It's not an electronic "signature".

1) Should it be enabled or disabled by default (field == 0) ?

Suggestion : disabled by default
Reasoning : There are already a lot of checksums around, in storage, in transmission, etc. Consequently, errors are now pretty rare, and when they happen, they tend to be "large" rather than sparse. Also, zstd is likely to detect errors just by parsing the compressed input anyway.

2) Which algorithm ? Should it be selectable ?

Suggestion : xxh64, additional header bit reserved in case of additional checksum, but just a single one defined in v1.
Reasoning : we have transitioned to a 64-bits world. 64-bits checksums are faster to generate than 32-bits ones on such systems. So let's use the faster ones.
xxh64 also has excellent distribution properties, and is highly portable (no dependency on hardware capability). It can be run in 32-bits mode if need be.

3) How many bits for the checksum ?

Current format defines the "frame end mark" as a 3-bytes field, the same size as a block header, which is no accident : it makes parsing easier. This field has a 2-bits header, hence 22 bits free, which can be used for a content checksum. This wouldn't increase the frame size.

22-bits means there is a 1 in 4 million chance of collision in case of error. Or said differently, there are 4194303 chances out of 4194304 to detect a decoding error (on top of all the syntax verifications which are inherent to the format itself). That's more than 99.9999 %. Good enough in my view.
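
As an illustration of the arithmetic, here is a small sketch that truncates a 64-bit hash (for example an xxh64 result, computed elsewhere) to 22 bits and packs it into a 3-bytes field next to a 2-bits header. The exact bit layout and byte order used here are assumptions made for the example, not the layout of the zstd format.

```c
#include <stdint.h>

#define CHECKSUM_BITS 22u
#define CHECKSUM_MASK ((1u << CHECKSUM_BITS) - 1u)   /* 0x3FFFFF : 4194304 possible values */

/* Keep the low 22 bits of a 64-bit hash. */
static uint32_t checksum22(uint64_t hash64)
{
    return (uint32_t)(hash64 & CHECKSUM_MASK);
}

/* Pack a 2-bit type and a 22-bit checksum into 3 bytes.
 * Assumed layout : type in the top 2 bits of the first byte, big-endian order. */
static void pack_end_mark(uint8_t out[3], unsigned type2, uint32_t check22)
{
    uint32_t const v = ((uint32_t)(type2 & 3u) << CHECKSUM_BITS) | (check22 & CHECKSUM_MASK);
    out[0] = (uint8_t)(v >> 16);
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)v;
}

/* Recover the 22-bit checksum from the 3-byte field. */
static uint32_t unpack_checksum(const uint8_t in[3])
{
    uint32_t const v = ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | (uint32_t)in[2];
    return v & CHECKSUM_MASK;
}
```

The decoder would compare unpack_checksum() with checksum22() of the hash of the regenerated content.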

Dictionary ID

Data compressed using a dictionary needs the exact same one to be regenerated. But no check is performed on the dictionary itself. In case of wrong dictionary selection, it can result in a data corruption scenario.

The corruption is likely to be detected by parsing the compressed format (or thanks to the previously described optional content checksum field).
But an even better outcome would be to detect such a mismatch immediately, before starting decompression, and with a clearer error message/id than "corruption", which is too generic.

For that, it would be enough to embed a "Dictionary ID" into the frame.
The Dictionary ID would simply be a random value stored inside the dictionary (or an assigned one, provided the user has a way to make sure the same value is not re-used multiple times). A comparison between the ID in the frame and the ID in the dictionary will be enough to detect the mismatch.

A simple question is : how long should this ID be ? 1, 2, 4 bytes ?
In my view, 4 bytes is enough for a random-based ID, since it makes the probability of collision very low. But that's still 4 more bytes to fit into the frame header. In some ways it can be considered an efficiency issue.
Maybe some people will prefer 2 bytes ? or maybe even 1 byte (notably for manually assigned ID values) ? or maybe even 0 bytes ?

It's unclear, and I guess multiple scenarios will have different answers.
So maybe a good solution would be to support all 4 possibilities in the format, and default to 4-bytes ID when using dictionary compression.
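
Here is a minimal sketch of what such a check could look like with a variable-size field. The names, the little-endian byte order, and the convention that a 0-byte field means "no ID recorded" are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { DICT_OK, DICT_MISMATCH, DICT_BAD_FIELD } dict_status;

/* Read a dictionary ID stored on 0, 1, 2 or 4 bytes (little-endian assumed).
 * Returns 0 on success, -1 if the field size is not supported. */
static int read_dict_id(const uint8_t* src, size_t fieldSize, uint32_t* id)
{
    if (fieldSize != 0 && fieldSize != 1 && fieldSize != 2 && fieldSize != 4)
        return -1;
    uint32_t v = 0;
    for (size_t i = 0; i < fieldSize; i++)
        v |= (uint32_t)src[i] << (8 * i);
    *id = v;
    return 0;
}

/* Compare the ID found in the frame with the ID of the loaded dictionary,
 * before any decompression starts. */
static dict_status check_dict(const uint8_t* field, size_t fieldSize, uint32_t loadedDictId)
{
    uint32_t frameId;
    if (read_dict_id(field, fieldSize, &frameId) != 0) return DICT_BAD_FIELD;
    if (fieldSize == 0) return DICT_OK;   /* no ID in the frame : nothing to compare */
    return (frameId == loadedDictId) ? DICT_OK : DICT_MISMATCH;
}
```

A mismatch gets its own status code, which is the whole point : the caller can report "wrong dictionary" instead of a generic corruption error.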

Note that if saving headers is important for your scenario, it's also possible to use frame-less block format ( ZSTD_compressBlock(), ZSTD_decompressBlock() ), which will remove any frame header, saving 12+ bytes in the process. It looks like a small saving, but when the corpus consists of lots of small messages of ~50 bytes each, it makes quite a difference. The application will have to save metadata on its own (what's the correct dictionary, compression size, decompressed size, etc.).

Custom content

Embedding custom content can be useful for a lot of unforeseen applications.
For example, it could contain a custom index into compressed content, or a file descriptor, or just some user comment.

The only thing that a standard decoder can do is skip this section. Dealing with its content is within application-specific realm.

The lz4 frame format already defines such container, as skippable frames. It looks good enough, so let's re-use the same definition.
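
For reference, here is a minimal sketch of how a decoder could step over such a frame. It relies on my reading of the lz4 frame specification : a 4-bytes little-endian magic number in the range 0x184D2A50 to 0x184D2A5F, followed by a 4-bytes little-endian size, followed by the user data. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value, whatever the host endianness. */
static uint32_t read_le32(const uint8_t* p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the total size of the skippable frame starting at `src`
 * (8-byte header + user data), or 0 if `src` does not start with
 * a complete skippable frame. */
static size_t skippable_frame_size(const uint8_t* src, size_t srcSize)
{
    if (srcSize < 8) return 0;
    if ((read_le32(src) & 0xFFFFFFF0u) != 0x184D2A50u) return 0;   /* not a skippable magic */
    uint32_t const userSize = read_le32(src + 4);
    if (userSize > srcSize - 8) return 0;                          /* truncated input */
    return 8 + (size_t)userSize;
}
```

A standard decoder just advances by the returned amount; an application aware of the custom content reads the user data at src + 8.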

Custom window sizes

The current frame format allows defining window sizes from 4 KB to 128 MB, all intermediate sizes being strict powers of 2 (8 KB, 16 KB, etc.). It works fine, but maybe some users would find its granularity or limits insufficient.
There are 2 parts to consider :

- Allowing larger sizes : the current implementation will have trouble handling window sizes > 256 MB. That being said, it's an implementation issue, not a format issue. An improved version could likely work with larger sizes (at the cost of some complexity).
From a frame format perspective, allowing larger sizes can be as easy as keeping a reserved bit for later.

- Non-power of 2 sizes : Good news is, the internals within zstd are not tied to a specific power of 2, so the problem is limited to sending more precise window sizes. This requires more header bits.
Maybe an unsigned 32-bits value would be good enough for such use.
Note that it doesn't make sense to specify a larger window size than content size. Such case should be automatically avoided by the encoder. As to the decoder, it's unclear how it should react : stop and issue an error ? proceed with allocating the larger window size ? or use the smaller content size, and issue an error if the content ends up larger than that ?
Anyway, in many cases, what the user is likely to want is simply enough size for the frame content. In which case, a simple "refer to frame content size" is probably the better solution, with no additional field needed.

Header Checksum

The intention is to catch errors in the frame header before they translate into larger problems for the decoder. Note that only errors can be caught this way : intentional data tampering can simply rebuild the checksum, hence remain undetected.

Suggestion : this is not necessary.

While transmission errors used to be more common a few decades ago, they are much less of a threat today, or they tend to garble some large sections (not just a few bits).
An erroneous header can nonetheless be detected just by parsing it, considering the number of reserved bits and forbidden values. They must all be validated.
The nail in the coffin is that we no longer trust headers, as they can be abused by remote attackers to deliver an exploit. And that's an area where the header checksum is simply useless. Every field must be validated, and all accepted values must have controllable effects (for example, if the attacker intentionally requests a lot of memory, the decoder shall put a high limit to the accepted amount, and check the allocation result).
So we already are highly protected against errors, by design, because we must be protected against intentional attacks.

Future features : forward and backward compatibility

It's also important to design from day 1 a header format able to safely accommodate future features, with regards to version discrepancy.

The basic idea is to keep a number of reserved bits for these features, set to 0 while waiting for some future definition.

It seems also interesting to split these reserved bits into 2 categories :
- Optional and skippable features : these are features which a decoder can safely ignore, without jeopardizing decompression result. For example, a purely informational signal with no impact on decompression.
- Future features, disabled by default (0): these features can have unpredictable impact on compression format, such as : adding a new field costing a few more bytes. A non-compatible decoder cannot take the risk to proceed with decompression. It will stop on detecting such a reserved bit set to 1 and give an error message.
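
The two categories translate into two different decoder behaviors, sketched below. The bit positions are entirely hypothetical (no layout has been decided at this point); only the logic matters.

```c
#include <stdint.h>

/* Hypothetical descriptor byte layout, for illustration only. */
#define OPTIONAL_BITS_MASK 0x03u   /* skippable features : safe to ignore */
#define STRICT_BITS_MASK   0x0Cu   /* future features : must be 0 for this decoder */

typedef enum { HDR_OK, HDR_UNSUPPORTED } hdr_status;

static hdr_status check_reserved_bits(uint8_t descriptor)
{
    if (descriptor & STRICT_BITS_MASK)
        return HDR_UNSUPPORTED;   /* unknown format change : refuse to decode */
    /* bits under OPTIONAL_BITS_MASK are deliberately not examined */
    return HDR_OK;
}
```

An older decoder keeps working when a newer encoder sets an optional bit, and fails cleanly when it sets a strict one.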

While it's great to keep room for the future, it should not take too much of a toll in the present. So only a few bits will be reserved. If more are needed, it simply means another frame format is necessary. It's enough in such case to use a different frame identifier (first 4 bytes of a frame).

Sunday, April 3, 2016

Working with streaming

Streaming is an advanced and very nice processing mode that a few codecs offer to deal with small data segments. This is great in communication scenarios. For lossless data compression, it makes it possible to send tiny packets, in order to create a low-latency interaction, while preserving strong compression capabilities, by using previously sent data to compress following packets.

Ideally, on the encoding side, the user should be able to send any amount of data, from the smallest possible (1 byte) to much larger ones (~~MB). It's up to the encoder to decide how to deal with this. It may group several small fields into a single packet, or conversely break larger ones into multiple packets. In order to avoid any unwanted delay, a "flush" command shall be available, so that the user can decide it's time to send buffered data.

On the other side, a compatible decoder shall be able to cope with whatever data was sent by the encoder. This obviously requires a bit of coordination, a set of shared rules.

The zip format defines a maximum copy distance (32 KB). Data is sent as a set of blocks, but there is no maximum block size (except non-compressed blocks, which must be <= 64 KB).
A compatible zip decoder must be able to cope with these conditions. It must keep up to 32 KB of previously received data, and be able to break decoding operation in the middle of a block, should it receive a block way too large to fit into its memory buffer.
Thankfully, once this capability is achieved, it's possible to decode with a buffer size of 32 KB + maximum chunk size, with "chunk size" being the maximum size the decoder can decode from a single block. In general, it's a bit more than that, in order to ease a few side-effects, but we won't go into details.

The main take-away is : buffer size is a consequence of maximum copy distance, plus a reasonable amount of data to be decoded in a single pass.

zstd's proposition is to reverse the logic : the size of the decoder buffer is set and announced in its frame header. The decoder can safely allocate the requested amount of memory. It's up to the encoder to respect this condition (otherwise, compressed data is considered corrupted).

In current version of the format, this buffer size can vary from 4 KB to 128 MB. It's a pretty wide range, and crucially, it includes possibilities for small memory footprint. A decoder which can only handle small buffer sizes can immediately detect and discard frames which ask for more than its capabilities.

Once the buffer size is settled, data is sent as "blocks". Each block has a maximum size of 128 KB. So, in theory, a block could be larger than the agreed decoder buffer. What would happen in such case ?

Following zip example, one solution would be for the decoder to be able to stop (and then resume) decoding operation in the middle of a block. This obviously increases decoder complexity. But the benefit is that the only condition the compressor has to respect is a max copy distance <= buffer size.

On the decoder side though, it's only one side of the problem. There's no point having a very small decoding buffer if some other memory budget dwarfs it.

The decoding tables are not especially large : they use 5 KB by default, and could be reduced to half, or possibly a quarter of that (but with impact on compression ratio). Not a big budget.

The real issue is the size of the incoming compressed block. A compressed block must be smaller than its original size, otherwise it will be transmitted in uncompressed format. That still makes it possible to have a (128 KB - 1) block size. This is extremely large compared to a 4 KB buffer.

Zip's solution is that it's not necessary to receive the entire compressed block in memory in order to start decompressing it. This is possible because all symbols are entangled in a single bitstream, which is read in forward direction. So input buffer can be a fraction of a block. It simply stops when there is no more information available.

This will be difficult to imitate for zstd : it has multiple independent bitstreams (between 2 and 5) read in backwards direction.

The backward direction is unusual, and a direct consequence of using ANS entropy : encoding and decoding must be done in reverse direction. FSE solution is to write forward and read backward.
It could have been a different choice : write backward, read forward, as suggested by Fabian Giesen. But it makes the encoder's API more complex : the destination buffer would be filled from the end, instead of the beginning. From a user perspective, it breaks a few common assumptions, and becomes a good recipe for confusion.
Alternatively, the end result could be memmove() to the beginning of the buffer, with a small but noticeable speed cost.

But even that wouldn't solve the multiple bitstreams design, which is key to zstd's speed advantage. zstd is fast because it manages to keep multiple cpu execution units busy. This is achieved by reducing or eliminating dependencies between operations. At some point, it implies bitstream independence.

In a zstd block, literals are encoded first, followed by LZ symbols. Bitstreams are not entangled : each one occupies its own memory segment.
Considering this setup, it's required to access the full content block to start decoding it (well, more precisely, a few little things could be started in parallel, but it's damn complex and not worth the point here).

Barring any last-minute breakthrough on this topic, this direction is a dead-end : any compressed block must be received entirely before starting its decompression.
As a consequence, since small decoding buffer is a consequence of constrained memory budget, it looks logical that the size of incoming compressed blocks should be limited too, to preserve memory.

The limit size of a compressed block could be a dedicated parameter, but it would add complexity. A fairly natural assumption would be that a compressed block should be no larger than the decoding buffer. So let's use that.
(PS : another potential candidate would be cBlockSize <= bufferSize/2 , but even such a simple division by 2 looks like a recipe for future confusion).

So now, the encoder side enforces a maximum block size no larger than the decoding buffer. Fair enough. Multiple smaller blocks also means multiple headers, so it could impact compression efficiency. Thankfully, zstd includes both a "default statistics" mode and an experimental "repeat statistics" mode, which can be used to reduce header size to zero, and provide some answer to this issue.

But there is more to it.
Problem is, amount of data previously sent can be any size. The encoder may arbitrarily receive a "flush" order at any time. So each received block can be any size (up to maximum), and not necessarily fill the buffer.
Hence, what happens when we get closer to buffer's end ?

Presuming the decoder doesn't have the capability to stop decompression in the middle of a block, the next block shall not cross the limit of the decoder buffer. Hence, if there are 2.5 KB left in decoder buffer before reaching its end, the next block maximum size must be 2.5 KB.

It becomes a new condition for the encoder to respect : keep track of decoder buffer fill level, ensure to never cross the limit, stop at exact end of the buffer, and then restart from zero.
It looks complex, but the compressor knows the size of the decoder buffer : it was specified at the beginning of the frame. So it is manageable.
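
To show that it is indeed manageable, here is a minimal sketch of the bookkeeping the encoder would need under this rule. The structure and function names are hypothetical; the caller is assumed to never emit a block larger than the returned limit.

```c
#include <stddef.h>

typedef struct {
    size_t bufferSize;   /* decoder buffer size announced in the frame header */
    size_t fill;         /* bytes already written into the decoder buffer */
} fill_tracker;

/* Largest block the encoder may emit next without crossing the buffer end. */
static size_t next_block_limit(const fill_tracker* t, size_t maxBlockSize)
{
    size_t const room = t->bufferSize - t->fill;
    return (room < maxBlockSize) ? room : maxBlockSize;
}

/* Record an emitted block; restart from zero on reaching the exact end. */
static void account_block(fill_tracker* t, size_t blockSize)
{
    t->fill += blockSize;
    if (t->fill == t->bufferSize) t->fill = 0;
}
```

With an 8 KB buffer already holding 5.5 KB, next_block_limit() returns 2.5 KB, matching the example above.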

But is that desirable ?
From an encoder perspective, it seems better to get free of such restriction, just accept the block size and copy distance limits, and then let the decoder deal with it, even if it requires a complex capability of "stop and resume" in the middle of a block.
From a decoder perspective, it looks better to only handle full blocks, and require the encoder to pay attention to never break this assumption.

Classical transfer of complexity.
It makes for an interesting design choice. And as v1.0 gets nearer, one will have to be selected.

Edit : And the final choice is :

Well, a decision was necessary, so here it is :

The selected design only imposes a distance limit and a maximum block size on the encoder, both values being equal, and provided in the frame header.
The encoder doesn't need to track the "fill level" of the decoder buffer.

As stated above, a compliant decoder using the exact buffer size should have the capability to break decompression operation in the middle of a block, in order to reach the exact end of the buffer, and restart from the beginning.

However, there is a trick ...
Should the decoder not have this capability, it's enough to extend the size of the buffer by the size of a single block (so it's basically 2x bigger for "small" buffer values (<= 128 KB) ). In which case, the decoder can safely decode every block in a single step, without breaking decoding operation in the middle.
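
The resulting buffer size is easy to compute. A minimal sketch, with a hypothetical function name, assuming the window size is (1<<windowLog) and the block size is capped at 128 KB :

```c
#include <stddef.h>

#define MAX_BLOCK_SIZE ((size_t)128 * 1024)   /* 128 KB block size cap */

/* Buffer size for a decoder which cannot stop in the middle of a block :
 * the announced window, plus room for one full block. */
static size_t simple_decoder_buffer_size(unsigned windowLog)
{
    size_t const windowSize = (size_t)1 << windowLog;
    size_t const blockSize  = (windowSize < MAX_BLOCK_SIZE) ? windowSize : MAX_BLOCK_SIZE;
    return windowSize + blockSize;
}
```

A 4 KB window needs 8 KB, a 128 KB window needs 256 KB, and a 1 MB window only needs 1 MB + 128 KB.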

Requiring more memory to safely decompress is an "implementation detail", and doesn't impact the spec, which is the real point here.
Thanks to this trick, it's possible to immediately target final spec, and update the decoder implementation later on, as a memory optimization. Therefore, it won't delay v1.0.

Friday, February 5, 2016

Compressing small data

Data compression is primarily seen as a file compression algorithm. After all, the main objective is to save storage space, isn't it ?
With this background in mind, it's also logical to focus on bigger files. Good compression achieved on a single large archive is worth the savings for countless smaller ones.

However, this is no longer where the bulk of compression happens. Today, compression is everywhere, embedded within systems, achieving its space and transmission savings without user intervention or awareness. The key to these invisible gains is to remain below the end-user perception threshold. To achieve this objective, it's not possible to wait for some large amount of data to process. Instead, data is processed in small amounts.

This would be all good and well if it wasn't for a simple observation : the smaller the amount to compress, the worse the compression ratio.
The reason is pretty simple : data compression works by finding redundancy within the processed source. When a new source starts, there is not yet any redundancy to build upon. And it takes time for any algorithm to achieve meaningful outcome.

Therefore, as the issue comes from starting from a blank history, what about starting from an already populated history ?

Streaming to the rescue

A first solution is streaming : data is cut into smaller blocks, but each block can make reference to previously sent ones. And it works quite well. In spite of some minor losses at block borders, most of the compression opportunities of a single large data source are preserved, but now with the advantage to process, send, and receive tiny blocks on the fly, making the experience smooth.

However, this scenario only works with serial data, a communication channel for example, where order is known and preserved.

For a large category of applications, such as database and storage, this cannot work : data must remain accessible in a random fashion, with no known "a priori" order. Reaching a specific block sector should not require decoding all preceding ones just to rebuild the dynamic context.

For such use case, a common work-around is to create some "not too small blocks". Say there are many records of a few hundred bytes each. Group them in packs of at least 16 KB. Now this achieves some nice middle-ground between not-too-poor compression ratio and good enough random access capability.
This is still not ideal though, since it's required to decompress a full block just to get a single random record out of it. Therefore, each application will settle for its own middle ground, using block sizes of 4 KB, 16 KB or even 128 KB, depending on usage pattern.

Dictionary compression

Preserving both random access at record level and a good compression ratio is hard. But it's achievable too, using a dictionary. To summarize, it's a kind of common prefix, shared by all compressed objects. It makes every compression and decompression operation start from the same populated history.

Dictionary compression has the great property to be compatible with random access. Even for communication scenarios, it can prove easier to manage at scale than "per-connection streaming", since instead of storing one different context per connection, there is always the same context to start from when compressing or decompressing any new data block.

A good dictionary can compress small records into tiny compressed blobs. Sometimes, the current record can be found "as is" entirely within the dictionary, reducing it to a single reference. More likely, some critical redundant elements will be detected (header, footer, keywords) leaving only variable ones to be described (ID fields, date, etc.).
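
To give an intuition of why this works, here is a toy sketch (not zstd's actual match finder, and far slower) which measures how much of the beginning of a record can be replaced by a single reference into the dictionary. The function name is hypothetical.

```c
#include <stddef.h>

/* Length of the longest prefix of `rec` that appears somewhere in `dict`.
 * Brute force : try every starting position in the dictionary. */
static size_t longest_prefix_in_dict(const char* dict, size_t dictLen,
                                     const char* rec, size_t recLen)
{
    size_t best = 0;
    for (size_t start = 0; start < dictLen; start++) {
        size_t len = 0;
        while (start + len < dictLen && len < recLen && dict[start + len] == rec[len])
            len++;
        if (len > best) best = len;
    }
    return best;
}
```

With a dictionary containing the usual keys of a JSON message, the fixed parts of each record (field names, separators) are found in the dictionary, and only the variable values remain to be described.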

For this situation to work properly, the dictionary needs to be tuned for the underlying structure of objects to compress. There is no such thing as a "universal dictionary". One must be created and used for a target data type.

Fortunately, this condition can be met quite often.
Just created some new protocol for a transaction engine or an online game ? It's likely based on a few common important messages and keywords (even binary ones). Have some event or log records ? There is likely a grammar for them (json, xml maybe). The same can be said of digital resources, be it html files, css stylesheets, javascript programs, etc.
If you know what you are going to compress, you can create a dictionary for it.

The key is, since it's not possible to create a meaningful "universal dictionary", one must create one dictionary per resource type.

Example of a structured JSON message

How to create a dictionary from a training set ? Well, even though one could be tempted to manually create one, by compacting all keywords and repeatable sequences into a file, this can be a tedious task. Moreover, there is always a chance that the dictionary will have to be updated regularly due to moving conditions.
This is why, starting from v0.5, zstd offers a dictionary builder capability.

Using the builder, it's possible to quickly create a dictionary from a list of samples. The process is relatively fast (a matter of seconds), which makes it possible to generate and update multiple dictionaries for multiple targets.

But what good can dictionary compression achieve ?
To answer this question, a few tests were run on some typical samples. A flow of JSON records from a probe, some Mercurial log events, and a collection of large JSON documents, provided by @KryzFr.

Collection Name      Direct             Dictionary          Gain     Average size   Range
Small JSON records   x1.331 - x1.366    x5.860 - x6.830     ~ x4.7   300            200 - 400
Mercurial events     x2.322 - x2.538    x3.377 - x4.462     ~ x1.5   1.5 KB         20 - 200 KB
Large JSON docs      x3.813 - x4.043    x8.935 - x13.366    ~ x2.8   6 KB           800 - 20 KB

These compression gains are achieved without any speed loss, and even feature faster decompression processing. As one can see, it's no "small improvement". This method can achieve transformative gains, especially for very small records.

Large documents will benefit proportionally less, since dictionary gains are mostly effective in the first few KB. Then there is enough history to build upon, and the compression algorithm can rely on it to compress the rest of the file.

Dictionary compression will work if there is some correlation in a family of small data (common keywords and structure). Hence, deploying one dictionary per type of data will provide the greater benefits.

Anyway, if you are in a situation where compressing small data can be useful for your use case (databases and contextless communication scenarios come to mind, but there are likely other ones), you are welcome to have a look at this new open source tool and compression methodology and report your experience or feature requests.

Zstd is now getting closer to v1.0 release : it's a good time to provide feedback and integrate it into the final specification.

Wednesday, October 14, 2015

Huffman revisited part 5 : combining multi-streams with multi-symbols

In the previous article, a method to create a fast multi-symbols Huffman decoder was described. The research was using single bitstream encoding, for simplicity. However, earlier investigation proved that using multiple bitstreams was good for speed on modern OoO (Out of Order) cpus, such as Intel's Core. So it seems only logical to combine both ideas and see where it leads.

The previous multi-streams format produced an entangled output, where each stream contributes regularly to 1-in-4 symbols, as shown below :

Multi-Streams single-symbol entangled output pattern

This pattern is very predictable, therefore decoding operations can be done in no particular order, as each stream knows at which position to write its next symbol.
This critical property is lost with multi-symbols decoding operations :

Multi-Streams multi-symbols entangled output pattern (example)

It's no longer clear where next symbols must be written. Hence, parallel-streams decoding becomes synchronization-dependent, nullifying multi-streams speed advantage.

There are several solutions to this problem :
- On the decoder side, reproduce regular output pattern, by breaking multi-symbols sequence into several single-symbol write operations. It works, but costs performance, since a single decode now produces multiple writes (or worse, introduces an unpredictable branch) and each stream requires its own tracking pointer.
- On the encoder side, take into consideration the decoder's natural pattern, by grouping symbols exactly the same way they will be regenerated. This works too, and is the fastest method from a decoder perspective, introducing just some non-negligible complexity on the encoder side.

Ultimately, none of these solutions looked particularly attractive. I was especially worried about introducing a "rigid format", specifically built for a single efficient way to decode. For example, taking into consideration the way symbols will be grouped during decoding ties the format to a specific table depth.
An algorithm created for a large number of platforms cannot accept such rigidity. Maybe some implementations will prefer single-symbol decoding, maybe other ones will select a custom amount of memory for decoding tables. Such flexibility must be possible.

Final choice was to remove entanglement. And the new output pattern becomes :

Multi-Streams multi-symbols segment output pattern (example)

With 4 separate segments being decoded in parallel, the design looks a lot like classical multi-threading, at micro-op level. Which would be a fair enough description.

It looks simpler, but from a coding perspective, it's not.
The first issue is that each segment has its own tracking pointer during decoding operation. It increases the number of required registers from 1 to 4. Not a huge deal when registers are plentiful, but that's not always the case (x86 32-bits mode notably).
The second, more important issue is that each segment gets decoded at its own speed, meaning some of them will be finished before other ones. Previously, entanglement ensured that all streams would finish together, with just a small tail to take care of. This is now more complex : we don't know which segment will finish first, and the "tail" sequence is now spread over multiple streams, of unpredictable length.

These complexities will cost a bit of performance, but we get serious benefits in exchange :
- Multi-streams operation is optional : platforms may decide to decode segments serially, one after another, or 2 by 2, depending on their optimal capabilities.
- Single-symbol and multi-symbols decoding strategies are compatible.
- Decoding table depth can be any size, including "frugal" ones trading cpu operations for memory space.
In essence, it's open to a lot more trade-offs.

These new properties introduce a new API requirement : regenerated size must be known, exactly, to start decoding operation (previously, upper regenerated size limit was enough). This is required to guess where each segment starts before even finishing previous ones.
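
Here is a minimal sketch of how segment starts could be derived from the regenerated size. It assumes 4 segments of equal size (rounded up), the last one taking whatever remains; this splitting rule is an assumption for the example, the text above only requires that the rule depends on the exact regenerated size.

```c
#include <stddef.h>

#define NB_STREAMS 4

/* Fill start[0..NB_STREAMS] with the output offset of each segment.
 * start[NB_STREAMS] is the end of the last segment, equal to regenSize. */
static void segment_starts(size_t regenSize, size_t start[NB_STREAMS + 1])
{
    size_t const segmentSize = (regenSize + NB_STREAMS - 1) / NB_STREAMS;
    for (int i = 0; i < NB_STREAMS; i++) {
        size_t const s = (size_t)i * segmentSize;
        start[i] = (s < regenSize) ? s : regenSize;   /* clamp for tiny inputs */
    }
    start[NB_STREAMS] = regenSize;
}
```

Each stream then writes into its own range [start[i], start[i+1]), which is why an upper bound of the regenerated size is no longer enough.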

So, what kind of performance does this new design deliver ? Here is an example, based on generic samples :

Decoding speed, multi-streams, 32 KB blocks

The picture looks similar to previous "single-stream" measurements, although featuring much higher speeds. Single-symbol variant wins when compression is very poor. Quite quickly though, double-symbols variant dominates the region where Huffman compression makes most sense (underlined in red boxes). Quad-symbols performance catches up when distribution becomes more favorable, and clearly dominates later on, but that's a region where Huffman is no longer an optimal choice for entropy compression.

Still, by providing speed in the range of 800-900 MB/s, the new multi-symbol decoder delivers sensible improvements over previous version. Job done ?

Let's dig a little deeper. You may have noticed that measurements were produced on block sizes of 32 KB, which is a nice "average" situation. However, in many compressors such as zstd, blocks of symbols are the product of (LZ) transformation, and their size can vary, a lot. Is the above conclusion still valid when block size changes ?

Let's test this hypothesis in both directions, by measuring 128 KB and 8 KB block sizes. Results become :

Decoding speed, multi-streams, 128 KB blocks

Decoding speed, multi-streams, 8 KB blocks

While the general picture may look similar, some differences can indeed be spotted.

First, 128 KB blocks are remarkably faster than 8 KB ones. This is a natural consequence of table construction times, which remain constant whatever the size of blocks. Hence, their relative impact is inversely proportional to block size.
At 128 KB, symbol decoding dominates. It makes the quad-symbols version slightly better compared to double-symbols. Not necessarily enough, but still an alternative to consider when the right conditions are met.
At 8 KB, the reverse situation happens : quad-symbols is definitely out of the equation, due to its larger table construction time. Single-symbol relative performance is now better, taking the top spot when compression ratio is low enough.

With so many parameters, it can seem difficult to guess which version will perform best on a given compressed block, since it also depends on the content to decode. Fortunately, you won't have to.
huff0's solution is to propose a single decoder (HUF_decompress()) which makes this selection transparently. Given a set of heuristic values (table construction time, raw decoding speed, quantized compression ratio), it will automatically select the decoding algorithm it believes is the best fit for the job.

Decoding speed, auto-mode, 32 KB blocks
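The kind of selection described above can be sketched as a simple cost model : a fixed table construction cost, plus a decoding cost proportional to block size. This is only an illustration, not the actual logic inside HUF_decompress(). The types `DecoderKind` and `DecoderCost`, the function `select_decoder` and the cost units are all hypothetical, and the caller is assumed to provide the costs matching the block's quantized compression ratio.

```c
#include <stddef.h>

typedef enum { DECODER_SINGLE = 0, DECODER_DOUBLE = 1, DECODER_QUAD = 2 } DecoderKind;

typedef struct {
    unsigned tableCost;   /* fixed cost of building the lookup table */
    unsigned costPerKB;   /* decoding cost for each KB of regenerated data */
} DecoderCost;

/* Pick the variant with the lowest estimated total time.
 * costs[] holds one entry per variant, in DecoderKind order. */
static DecoderKind select_decoder(const DecoderCost costs[3], size_t regenSize)
{
    DecoderKind best = DECODER_SINGLE;
    unsigned long long bestTime = ~0ULL;
    int k;
    for (k = 0; k < 3; k++) {
        unsigned long long const t = costs[k].tableCost
            + (unsigned long long)costs[k].costPerKB * regenSize / 1024;
        if (t < bestTime) { bestTime = t; best = (DecoderKind)k; }
    }
    return best;
}
```

The fixed term explains the measurements above : on small blocks it dominates and favors the variant with the cheapest table, while on large blocks the per-byte term takes over.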

Ultimately, it's just a matter of faster speed, since all versions are compatible and produce valid results. And should you not like its default choices, you can still manually override which version you prefer or want to test.

As usual, the result of this investigation is made available as open source software, at github, under a BSD license. If you are used to previous versions of fse, note that the directory and file structures have been changed a bit. In an attempt to provide a clearer interface, huff0 gets its own file and header from now on.

Huffman revisited, Part 4 : Multi-bytes decoding

 In most Huffman implementations I'm aware of, decoding symbols is achieved in a serial fashion, one-symbol-after-another.

Decoding fast is not that trivial, but it has been well studied already. Eventually, one symbol per decoding operation becomes the upper limit.

Consider how a fast Huffman decoder works : all possible bit combinations are pre-calculated into a table, of predefined maximum depth. For each bit combination, it's a simple table lookup to get the symbol decoded and the number of bits to consume.

Huffman Table lookup (example)
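As a minimal sketch of this lookup, consider a toy prefix code (A = 0, B = 10, C = 11) and a table of depth 2. The cell layout, the names `Cell`, `lut` and `decode_single`, and the simplistic 32-bit bit container are assumptions made for this example only.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t symbol; uint8_t nbBits; } Cell;

#define DEPTH 2
/* One cell per possible 2-bit combination. Toy code : A = 0, B = 10, C = 11 */
static const Cell lut[1 << DEPTH] = {
    {'A', 1},   /* 00 : A, second bit belongs to the next symbol */
    {'A', 1},   /* 01 : A, second bit belongs to the next symbol */
    {'B', 2},   /* 10 */
    {'C', 2}    /* 11 */
};

/* bits : the bitstream, left-aligned in a 32-bit container (first bit = MSB) */
static void decode_single(uint32_t bits, uint8_t* dst, size_t nbSymbols)
{
    size_t n;
    for (n = 0; n < nbSymbols; n++) {
        Cell const c = lut[bits >> (32 - DEPTH)];  /* peek DEPTH bits */
        dst[n] = c.symbol;                         /* one symbol per lookup */
        bits <<= c.nbBits;                         /* consume only its bits */
    }
}
```

For instance, the bit sequence `0 10 0 11`, left-aligned as `0x4C000000`, decodes into "ABAC" in four lookups.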

More complex schemes may break the decoding into 2 steps, most notably in an attempt to reduce look-up table sizes and still manage to decode symbols which exceed table depth. But it doesn't change the whole picture : that's still a number of operations to decode a single symbol.

In an attempt to extract more speed from decoding operation, I was curious to investigate if it would be possible to decode more than one symbol per lookup.

Intuitively, that sounds plausible. In a large Huffman decoding table, there is ample room for some bit sequences to represent 2 or more unequivocal symbols. For example, if one symbol is dominant, it only needs 1 bit. So, with only 2 bits, 1 of the 4 possible combinations (25%) means "decode 2 dominant symbols in a row", in a single decode operation.

This can be visualized in the example below :

Example of small single-symbol decoding table

which can be transformed into :

Example of multi-symbols decoding table

In some ways, it can look reminiscent of Tunstall codes, since we basically try to fit as many symbols as possible into a given depth. But it's not : we don't guarantee reading the entire depth each time, the number of bits read is still variable, just more regular. And there is no "order 1 correlation" : probabilities remain the same per symbol, without depending on prior prefix.

Even with the above table available, there is still the question of using it efficiently. It doesn't do any good if a single decoding step is now a lot more complex in order to potentially decode multiple symbols. As an example of what not to do, a straightforward approach would be to start decoding the first symbol, then figure out if there is some place left for another one, proceed with the second symbol, then test for a 3rd one, etc. Each of these tests becomes an unpredictable branch, destroying performance in the process.

The breakthrough came by observing an LZ decompression process such as lz4 : it's insanely fast, because it decodes matches, i.e. runs of symbols, as a single copy operation.
This is in essence what we should do here : copy a sequence of multiple symbols, and then decide how many symbols there really are. It will avoid branches.
On current generation CPUs, copying 2 or 4 bytes is not much slower than copying a single byte, so the strategy is effective. Overwriting the same position is also not an issue thanks to modern cache structure.
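Here is a minimal sketch of this "copy first, decide later" principle, using a toy prefix code (A = 0, B = 10, C = 11) and a table of depth 2. The 4-byte cell layout and the names `Cell2`, `lut2` and `decode_double` are assumptions for this example, not the actual huff0 structures.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 4-byte cell : up to 2 decoded symbols per lookup */
typedef struct {
    uint8_t sequence[2];  /* decoded symbols; 2nd one is filler when length == 1 */
    uint8_t nbBits;       /* bits consumed by the whole sequence */
    uint8_t length;       /* number of valid symbols : 1 or 2 */
} Cell2;

#define DEPTH 2
static const Cell2 lut2[1 << DEPTH] = {
    { {'A', 'A'}, 2, 2 },   /* 00 : two dominant symbols at once */
    { {'A',  0 }, 1, 1 },   /* 01 : only the first A fits within the depth */
    { {'B',  0 }, 2, 1 },   /* 10 */
    { {'C',  0 }, 2, 1 }    /* 11 */
};

/* bits : the bitstream, left-aligned in a 32-bit container (first bit = MSB) */
static void decode_double(uint32_t bits, uint8_t* dst, size_t regenSize)
{
    size_t out = 0;
    while (out + 2 <= regenSize) {
        Cell2 const c = lut2[bits >> (32 - DEPTH)];
        memcpy(dst + out, c.sequence, 2);  /* always copy 2 bytes, no test */
        out += c.length;                   /* then decide how many were real */
        bits <<= c.nbBits;
    }
    if (out < regenSize)   /* tail : one symbol left, only the first byte counts */
        dst[out] = lut2[bits >> (32 - DEPTH)].sequence[0];
}
```

When a cell holds a single symbol, the filler byte is simply overwritten by the next copy. The loop stops one symbol early so that the 2-byte copy never writes past the end of the buffer, which is also why the exact regenerated size is needed.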

With this principle settled, it now requires an adapted lookup table structure to work with. I finally settled on these :
Huffman lookup cell structure

The double-symbols structure may seem unambitious : after all, it is only able to store up to 2 symbols into the `sequence` field. But in fact, tests will show it's a good trade-off, since most of the time, 2 symbols is what can reasonably be stored within the table lookup depth.

Some quick maths : the depth of a lookup table is necessarily limited, in order to fit into memory cache where access times are best. An Intel CPU's L1 data cache is typically 32 KB (potentially shared due to hyper-threading). Since no reasonable OS is single-threaded anymore, let's not use the entire cache : half seems good enough, that's 16 KB. Since a single cell for double-symbols is now 4 bytes (incidentally, the same size as FSE decoder), that means 4K cells, hence a maximum depth of 12 bits. Within 12 bits, it's unlikely to get more than 2 symbols at a time. But this conclusion entirely depends on alphabet distribution.
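The same arithmetic can be written as a small helper, which makes it easy to replay with another cache budget or cell size. The name `max_table_depth` is hypothetical; it just computes the largest depth such that `(1 << depth)` cells fit in the budget.

```c
#include <stddef.h>

/* Largest depth such that (1 << depth) cells of cellSize bytes fit in budgetBytes.
 * Assumes at least one cell fits. */
static unsigned max_table_depth(size_t budgetBytes, size_t cellSize)
{
    size_t const nbCells = budgetBytes / cellSize;
    unsigned depth = 0;
    while ((nbCells >> (depth + 1)) != 0) depth++;   /* floor(log2(nbCells)) */
    return depth;
}
```

With the figures above, `max_table_depth(16 * 1024, 4)` gives 4096 cells, hence 12 bits. Any larger cell necessarily loses depth within the same budget.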

This limitation must be balanced with increased complexity for table lookup construction. The quad-symbols one is significantly slower, due to more fine-tuned decisions and the recursive nature of the algorithm, potentially defeating inlining optimizations. The graph below shows the relative speed of each construction algorithm (the right side, in grey, is provided for information, since if the target distribution falls into this category, Huffman entropy is no longer a recommended choice).

Lookup table construction speed

The important part is roughly underlined in red boxes, showing areas which are relevant for some typical LZ symbols. The single-symbol lut construction is always faster, significantly. To make sense, slower table construction must be compensated by improved symbol decoding speed. Which, fortunately, is the case.

Decoding speed, at 32 KB block

As suspected, the "potentially faster" quad-symbols variant is hampered by its slower construction time. It manages to become competitive in the "length & offset" area, but since it costs 50% more memory, it needs to be unquestionably better to justify that cost. Which is the case as alphabet distribution becomes more squeezed. By that time though, it becomes questionable if Huffman is still a reasonable choice for the selected alphabet, since its compression power will start to wane significantly against more precise methods such as FSE.
The "double-symbols" variant, on the other hand, takes off relatively fast and dominates the distribution region where Huffman makes most sense, making it a prime contender for an upgrade.

By moving from a 260 MB/s baseline to a faster 350-450 MB/s region, the new decoding algorithm provides sizable gains, but we still have not reached the level of the previous multi-stream variant, which gets closer to 600 MB/s. The logical next step is to combine both ideas, creating a multi-streams multi-symbols variant. A challenge which proved more involved than it sounds. But that's for another post ...

Tuesday, August 25, 2015

Fuzz testing Zstandard

An advanced issue that any production-grade codec must face is the ability to deal with erroneous data.

Such a requirement tends to come at a second development stage, since it's already difficult enough to make an algorithm work under "normal conditions". Before reaching erroneous data, there is already a large number of valid edge cases to properly deal with.

Erroneous input is nonetheless important, not least because it can degenerate into a full program crash if not properly taken care of. At a more advanced level, it can even serve as an attack vector, trying to push some executable code into unauthorized memory segments. Even without reaching that point, just the prospect of crashing a system with a predictable pattern is a good enough nuisance.

Such problems can be partially mitigated using stringent unit tests. But that's more easily said than done. Not only is it painful to build and maintain a thorough and hopefully complete list of unit tests for each function, it's also useless at predicting some unexpected behavior resulting from an improbable chain of events at different stages in the program.

Hence the idea to find such bugs at "system level". The system's input will be fed with a set of data, and the results will be observed. If you create the test set manually, you will likely test some important, visible and expected use cases, which is still a pretty good start. But some less obvious interaction patterns will be missed.

That's where the realm of Fuzz Testing starts. The main idea is that random will do a better job at finding stupid forgotten edge cases, which are good candidates to crash a program. And it works pretty well. But how to set up "random" ?

In fact, even "random" must be defined within some limits. For example, if you only feed a lossless compression algorithm with some random input, it will simply not be able to compress it, meaning you will always test the same code path. 

The way I've dealt with this issue for lz4 or zstd is to create programs able to generate "random compressible data", with some programmable characteristics (compressibility, symbol variation, reproducible by seed). And it helped a lot to test valid code paths.
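To give an idea of what such a generator can look like, here is a minimal sketch. It is not the generator shipped with lz4 or zstd : the names `next_rand` and `gen_compressible`, the xorshift generator and the single `matchPercent` knob are assumptions chosen to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

/* xorshift32 : small reproducible generator, state must be non-zero */
static uint32_t next_rand(uint32_t* state)
{
    uint32_t x = *state;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return *state = x;
}

/* Fill dst with pseudo-random data of tunable compressibility.
 * matchPercent : 0 = pure noise, 100 = extremely repetitive.
 * The same seed always produces the same content. */
static void gen_compressible(uint8_t* dst, size_t size,
                             unsigned matchPercent, uint32_t seed)
{
    uint32_t state = seed ? seed : 1;
    size_t pos = 0;
    while (pos < size) {
        if (pos > 0 && (next_rand(&state) % 100) < matchPercent) {
            /* "match" : repeat a short run taken from earlier output */
            size_t const offset = 1 + next_rand(&state) % pos;
            size_t length = 4 + next_rand(&state) % 28;
            while (length-- && pos < size) { dst[pos] = dst[pos - offset]; pos++; }
        } else {
            /* "literal" : one random byte */
            dst[pos++] = (uint8_t)(next_rand(&state) >> 24);
        }
    }
}
```

Reproducibility by seed is what makes a failing case worth anything : the exact same input can be regenerated later to debug it.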

The decompression side is more concerned with resistance to invalid input. But even with random parameters, there is a need to target interesting properties to test. Typically, a valid decompression stage is first run, to serve as a model. Then some "credible" failure scenarios are built from it. Zstd fuzzer tool typically tests : truncated input, too small destination buffer, and noisy source created from a valid one with some random changes, in order to bypass too simple screening stages.
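The "noisy source" scenario can be sketched as follows, assuming the buffer starts as a copy of a valid compressed input. The names `fuzz_rand` and `add_noise` are hypothetical and this is not the code of the zstd fuzzer tool. Truncated input needs no helper at all : it is enough to pass a smaller size to the decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* xorshift32 : small reproducible generator, state must be non-zero */
static uint32_t fuzz_rand(uint32_t* state)
{
    uint32_t x = *state;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return *state = x;
}

/* Flip one random bit in roughly noisePercent% of the bytes of buf.
 * Most of the valid structure survives, so the corrupted input
 * still gets past simple screening stages. */
static void add_noise(uint8_t* buf, size_t size,
                      unsigned noisePercent, uint32_t seed)
{
    uint32_t state = seed ? seed : 1;
    size_t i;
    for (i = 0; i < size; i++) {
        if (fuzz_rand(&state) % 100 < noisePercent)
            buf[i] ^= (uint8_t)(1u << (fuzz_rand(&state) % 8));
    }
}
```

The decoder is then expected to either return an error code or regenerate some data, but never to read or write outside of its buffers.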

All these tests were extremely useful to strengthen the reliability of the code. But the idea that "random" was in fact defined within some limits makes it clear that maybe some other code paths, outside of the limits of "random", may still fail if properly triggered.

But how to find them ? As stated earlier, brute force is not a good approach. There are too many similar cases which would be trivially reduced to a single code path. For example, the compressed format of zstd includes an initial 4-byte identifier. A dumb random input would therefore have a 1 in 4 billion chance to pass such early screening, leaving little energy to test the rest of the code.

For a long time, I believed it was necessary to know one's code in detail to create a useful fuzzer tool. Thanks to a kind notification from Vitaly Magerya, it seems this is no longer the only solution. I discovered earlier today the American Fuzzy Lop. No, not the rabbit; this test tool, by Michał Zalewski.

It's relatively easy to set up (for Unix programmers). Build, install and usage follow clean conventions, and the Readme is a fairly good read, easy to follow. With just a few initial test cases to provide, a special compilation stage and a command line, the tool is ready to go.

American Fuzzy Lop, testing zstd decoder

It displays a simple live board in text mode, which successfully captures the mind. One can see, or rather guess, how the genetic algorithm tries to create new use cases. It basically starts from the initially provided set of tests, and creates new ones by modifying them using simple transformations. It analyzes the results, which are relatively precise thanks to special instrumentation installed in the target binary during the compilation stage. It deduces from them the triggered code path and whether it has found a new one. It then generates new test cases built on top of "promising" previous ones, restarts, ad infinitum.

This is simple and brilliant. Most importantly, it is generic, meaning no special knowledge of zstd was required for it to thoroughly test the algorithm and its associated source code.

There are obviously limits. For example, the amount of memory that can be spent for each test. Therefore, successfully resisting for hours the tricky tests created by this fuzzer tool is not the same as "bug free", but it's a damn good step in this direction, and would at least deserve the term "robust".

Anyway, the result of all these tests, using internal and external fuzzer tools, is a first release of Zstandard. It's not yet "format stable", meaning specifically that the current format is not guaranteed to remain unmodified in the future (that stage is planned for early 2016). But it's already quite robust. So if you wanted to test the algorithm in your application, now seems a good time, even in a production environment.

[Edit] : If you're interested in fuzz testing, I recommend reading an excellent follow up by Maciej Adamczyk, which goes into great detail on how to do your own fuzz testing for your project.

Wednesday, August 19, 2015

Accessing unaligned memory

Thanks to Herman Brule, I recently received access to real ARM hardware systems, in order to test C code and tune it for performance. It proved a great experience, with lots of lessons learned.

It started with the finding that xxhash speed was rubbish on ARM systems. To this end, 2 systems were benchmarked : first, an ARMv6-J, and then an ARMv7-A.

This was an unwelcome surprise, and among the multiple potential reasons, accessing unaligned data turned out to be the most critical one.

Since my latest blog entry on this issue, I converted unaligned-access code to the QEMU-promoted solution using `memcpy()`. Compared with the earlier method (`pack` statement), the `memcpy()` version has a big advantage : it's highly portable. It's also supposed to be correctly optimized by the compiler, ending up as a trivial `unaligned load` instruction on CPU architectures which support this feature.

Well, supposed to is really the right word. It turns out, this is not true in a number of cases. While direct benchmark tests were initially my only investigation tool, I was pointed towards the godbolt online assembly generator, which became an invaluable asset to properly understand what was going on at assembly level.

Thanks to these new tools, the issue could be summarized into a selection between 3 possibilities to access unaligned memory :

1. Using `memcpy()` : this is the most portable and safe one.
It's also efficient in a large number of situations. For example, on all tested targets, clang translates `memcpy()` into a single `load` instruction when hardware supports it. gcc is also good on most targets tested (x86, x64, arm64, ppc), with just 32-bit ARM standing out.
The issue here is that your mileage will vary depending on specific compiler / targets. And it's difficult, if not impossible, to test and check all possible combinations. But at least, `memcpy()` is a good generic backup, a safe harbour to be compared to.

2. `pack` statement : the problem is that it's a compiler-specific extension. It tends to be present on most compilers, but with multiple different, and incompatible, semantics. Therefore, it's a pain for portability and maintenance.

That being said, in a number of cases where `memcpy()` doesn't produce optimal code, `pack` tends to do a better job. So it's possible to `special case` these situations, and leave the rest to `memcpy`.

The most important use case was gcc with ARMv7, basically the most important 32-bit ARM version nowadays (included in the current crop of smartphones and tablets).
Here, using `pack` for unaligned memory improved performance from 120 MB/s to 765 MB/s compared to `memcpy()`. That's definitely too large a difference to be ignored.

Unfortunately, on gcc with ARMv6, this solution was still as bad as `memcpy()`.

3. direct `u32` access : the only solution I could find for gcc on ARMv6.
This solution is not recommended, as it basically "lies" to the compiler by pretending data is properly aligned, thus generating a fast `load` instruction. It works when the target cpu is hardware compatible with unaligned memory access, and when the compiler does not risk generating opcodes which are only compatible with strictly-aligned memory accesses.
This is exactly the situation of ARMv6.
Don't use it for ARMv7 though : although this cpu is compatible with unaligned loads, the compiler can also issue a load-multiple instruction, which is a strict-align only opcode. So the resulting binary would crash.

In this case too, the performance gain is too large to be neglected : on unaligned memory access, read speed went up from 75 MB/s to 390 MB/s compared to `memcpy()` or `pack`. That's more than 5 times faster.

So there you have it, a complex setup, which tries to select the best possible method depending on compiler and target. Current findings can be summarized as below :

Better unaligned read method :
| compiler  | x86/x64 | ARMv7  | ARMv6  | ARM64  |  PPC   |
| GCC 4.8   | memcpy  | packed | direct | memcpy | memcpy |
| clang 3.6 | memcpy  | memcpy | memcpy | memcpy |   ?    |
| icc 13    | packed  | N/A    | N/A    | N/A    | N/A    |
The good news is that there is a safe default method, which tends to work well in a majority of situations. Now, it's only a matter of special-casing specific combinations, to use an alternate method.
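As an illustration, the three access methods and the special-casing from the table can be sketched as below. This is a simplified sketch, not the actual xxHash source : the macro name `UNALIGNED_READ_METHOD` is hypothetical, the packed variant assumes the gcc-style `__attribute__((packed))` syntax, and the compiler detection only covers the combinations listed in the table.

```c
#include <stdint.h>
#include <string.h>

/* UNALIGNED_READ_METHOD (hypothetical name) :
 * 0 = memcpy(), safe default
 * 1 = packed attribute, compiler-specific extension
 * 2 = direct access, only where hardware and compiler are known to cope */
#ifndef UNALIGNED_READ_METHOD
#  if defined(__GNUC__) && !defined(__clang__) && defined(__ARM_ARCH_6J__)
#    define UNALIGNED_READ_METHOD 2
#  elif defined(__INTEL_COMPILER) || \
       (defined(__GNUC__) && !defined(__clang__) && defined(__ARM_ARCH_7A__))
#    define UNALIGNED_READ_METHOD 1
#  else
#    define UNALIGNED_READ_METHOD 0
#  endif
#endif

#if UNALIGNED_READ_METHOD == 2
/* "lies" to the compiler by pretending ptr is aligned */
static uint32_t read32(const void* ptr) { return *(const uint32_t*)ptr; }
#elif UNALIGNED_READ_METHOD == 1
/* gcc-style syntax; other compilers spell this differently */
typedef union { uint32_t u32; } __attribute__((packed)) unalign32;
static uint32_t read32(const void* ptr) { return ((const unalign32*)ptr)->u32; }
#else
/* portable : the compiler is expected to turn this into a single load */
static uint32_t read32(const void* ptr)
{
    uint32_t value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}
#endif
```

Keeping the macro overridable from the command line makes it easy to benchmark each method on a new compiler / target combination, and to fall back to `memcpy()` whenever in doubt.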

Of course, a better solution would be for all compilers, and gcc specifically, to properly translate `memcpy()` into efficient assembly for all targets. But that's wishful thinking, clearly outside of our responsibility. Even if it does improve some day, we nonetheless need an efficient solution now, for current crop of compilers.

The new unaligned memory access design is currently available within xxHash source code on github, dev branch.

Summary of gains on tested platforms :
compiled with gcc v4.7.4
| program            | platform|  before  |  after   | 
| xxhash32 unaligned |  ARMv6  |  75 MB/s | 390 MB/s |
| xxhash32 unaligned |  ARMv7  | 122 MB/s | 765 MB/s |
| lz4 compression    |  ARMv6  |  13 MB/s |  18 MB/s |
| lz4 compression    |  ARMv7  |  33 MB/s |  49 MB/s |
[Edit] : apparently, this issue will help improve GCC for the better