Monday, May 28, 2012

Members properties

 After spending some time on expected properties at streaming level, let's now get to the core of the objective, regarding the compressed data parameters.

As stated previously, a compressed stream consists of several members, the most important ones being compressed data sets. Each member starts with a header, in order to identify its content. And each header starts with a magic number, a kind of 'ID tag'.

We'll focus here on "LZ4 compressed data set". The stream design above allows adding any future compression algorithm at a later stage.

And let's take as an example the old legacy framing format, as defined in lz4demo.

1) There is a magic number, which is 0x184C2102, in little endian format.
2) There are no explicit parameters. In fact, all parameters are implicit.
They are :
- The compressed data set is cut into blocks of 8MB
- Each block starts with a field giving its size (therefore, the compressed size)
- Blocks are independent
- The original data size is not stored. It will be known on decoding completion
- There is no checksum
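
To make this layout concrete, here is a minimal sketch that walks a legacy member and returns the compressed payload of each block, without decoding anything. The magic number comes from the list above; the width of the size field (4 bytes, little endian) is an assumption of this sketch, and `split_legacy_blocks` is a hypothetical helper name, not something taken from lz4demo.

```python
import struct

LEGACY_MAGIC = 0x184C2102  # magic number of the legacy format, stored little endian


def split_legacy_blocks(stream: bytes) -> list:
    """Return the compressed payload of each block, without decoding it.

    Assumption: every block is prefixed by a 4-byte little-endian field
    giving its compressed size.
    """
    if len(stream) < 4:
        raise ValueError("stream too short to hold a magic number")
    (magic,) = struct.unpack_from("<I", stream, 0)
    if magic != LEGACY_MAGIC:
        raise ValueError("not a legacy member")

    blocks = []
    pos = 4
    while pos < len(stream):
        if pos + 4 > len(stream):
            raise ValueError("truncated block size field")
        (compressed_size,) = struct.unpack_from("<I", stream, pos)
        pos += 4
        if pos + compressed_size > len(stream):
            raise ValueError("truncated block")
        blocks.append(stream[pos:pos + compressed_size])
        pos += compressed_size
    return blocks
```

Note that nothing in this walk reveals the original data size, which is exactly the limitation listed above: it is only known once every block has been decoded.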

Well, even with such limitations, the format nonetheless works perfectly fine. It's just a little too restricted to become a "generic format", and therefore, the objective of the specification is to provide more room for parameter selection.

We have already established in previous blog posts that allowing checksum for Error detection is an important selectable feature.
Another important one is the ability to select the block size, since it directly controls the amount of buffer memory necessary on the decoding side.

Let's now study and establish potential needs for a few other properties :
  • Source data size
    The original size of source data is not an absolute necessity : it's always possible to decode without it, as long as buffer sizes are properly described.

    But it is nonetheless useful. For example, thanks to this information, the number of blocks within the current member can be calculated beforehand. Moreover the amount of data to decode from the last block is known.
    Or, if there is a single block, the exact amount of memory can be allocated, instead of the block maximum size.
    It is also useful to display the processing position (yep, we decoded 80MB, but does that represent 10% or 90% of the stream to decode ?)

    However, there are also circumstances in which this data is not known. For example, if the input was piped to the compressing process, then the size will be known only on hitting its end. This might be too late to "retrofit" the output.
    Another situation is when several compressed data sets are appended into a single stream : then the "source data size" field only applies to the current data set, but the total size is not known.

    Therefore, since it is useful but not compulsory, this information shall be present, but as an option only.
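
As a small illustration of what the source size buys us, the sketch below derives the number of blocks, the size of the last block, and a progress percentage. The function names are hypothetical and only serve this example; fixed-size blocks are assumed.

```python
def block_layout(source_size: int, block_size: int):
    """Return (number_of_blocks, size_of_last_block) for one member."""
    if source_size < 0 or block_size <= 0:
        raise ValueError("sizes must be positive")
    if source_size == 0:
        return 0, 0
    full_blocks, remainder = divmod(source_size, block_size)
    if remainder == 0:
        # The last block happens to be exactly full-sized.
        return full_blocks, block_size
    return full_blocks + 1, remainder


def progress_percent(decoded_bytes: int, source_size: int) -> float:
    """Share of the member already decoded, in percent."""
    return 100.0 * decoded_bytes / source_size
```

For instance, 20MB of source data cut into 8MB blocks gives 3 blocks, the last one holding 4MB, so a single 4MB buffer is enough for that final block.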

  • Uncompressed blocks
    A simple but important feature, since it avoids the bandwidth overhead and CPU consumption of the compression format when it's useless.
    This could be done very cheaply, by accepting that, if the size of the compressed block is the same as the defined one, then it's necessarily uncompressed.

    This suggestion looks simple enough for most blocks, except for the last one, whose size is unknown (but capped).
    Therefore, it would be necessary to know the size of the last block to compare it to the compressed one, and therefore determine if the block is compressed or not.

    Another idea would be : let's give up this complexity, the last block is always compressed, even if compression is either useless or detrimental.
    Actually, it's not a good idea to "not handle the last block", since there is a disastrous corner case : suppose that the compressed size of the last block is exactly the size of an uncompressed full block : then the decoder will assume it's uncompressed, leading to data corruption.

    This corner case can be avoided by enforcing a simple rule : a compressed block is necessarily smaller than its original size. Therefore, as the last block has a size <= block size, its compressed size is necessarily < block size. Hence, if the stored size of this last block is the maximum size, then we are in the specific but valid corner case where the last block is exactly the maximum size of a block, and is not compressible.

    OK, enough of corner cases, let's now be in the normal situation where the last block size is a fraction of the maximum block size. How could we know it is uncompressed ?

    This problem could be mitigated by inserting a piece of information telling that we are dealing with the last block. For example, knowing the original size of the source data is enough for this need.

    But it's not always available. As said previously, this is just an option, since in some streaming mode, this information is unknown. Therefore we need another indicator.

    It could be something as simple as a bit, which simply tells that there is another block to follow, and as a consequence, the current block is full sized. As a bonus, this mechanism also protects against "silent block truncation" (when the compressed stream is cut exactly at the border between 2 blocks).
    On reaching the last block, we need another piece of information : either the uncompressed size of the block, or whether the block is compressed. The latter seems more compact.
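
The reasoning above can be summed up in a few lines. This is only a sketch under stated assumptions : `more_follows` stands for the "another block follows" bit, `last_block_compressed` for the extra piece of information carried by the last block, and both names, as well as the two functions, are hypothetical. No bit layout is implied here.

```python
def choose_payload(raw: bytes, compressed: bytes):
    """Encoder side: keep the compressed form only if it is strictly smaller.

    Returns (payload, is_compressed).
    """
    if len(compressed) < len(raw):
        return compressed, True
    return raw, False


def is_stored_raw(stored_size: int, block_size: int, more_follows: bool,
                  last_block_compressed: bool = True) -> bool:
    """Decoder side: tell whether a block was stored uncompressed."""
    if stored_size > block_size:
        raise ValueError("a stored block can never exceed the block size")
    if stored_size == block_size:
        # A compressed block is strictly smaller than its original,
        # so a block stored at maximum size is necessarily raw.
        return True
    if more_follows:
        # Not the last block, hence full-sized, yet stored in fewer bytes.
        return False
    # Last block, smaller than the maximum: only the explicit flag can tell.
    return not last_block_compressed
```

The first function enforces the "strictly smaller" rule, which is what makes the size comparison in the second one safe, including in the corner case of a full-sized, incompressible last block.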

  • Zero-filled blocks
    This idea was actually proposed by Charles Bloom : it's not rare, for a section of input data, to be filled with zeros.
    The idea would be to "mark" such blocks with a specific prefix, such as "0".
    For such a situation to have a reasonable chance of happening, the block size must be small enough. For example, this will probably almost never happen with the lz4demo block size (8MB), while it is going to be much more frequent with very small blocks, such as 4KB ones.
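
A quick way to see the influence of block size is to count zero-filled blocks for the same input at different granularities. The helpers below are hypothetical names for this sketch only; how such a block would actually be marked in the stream is left open.

```python
def is_zero_filled(block: bytes) -> bool:
    """True when every byte of the block is zero (an empty block also qualifies)."""
    return block.count(0) == len(block)


def count_zero_blocks(data: bytes, block_size: int) -> int:
    """Number of blocks of `block_size` bytes that contain only zeros."""
    return sum(
        is_zero_filled(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    )
```

With a 16KB input whose first byte is non-zero and the rest is zeros, 4KB blocks yield 3 zero-filled blocks out of 4, while a single 16KB block yields none.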

  • Error correction
    While error detection has been much talked about, nothing has been said up to now about error correction.
    That's both because this feature is much more complex to provide and of questionable usefulness.

    Error correction is mostly useful in situations when there is no way to "resend" erroneous data. This applies to real-time codecs (such as voice or video) and stored archives.
    The difficulty in both cases is that erroneous data tends to be "bursty". For example, when a storage sector fails, we don't lose just a few bytes, but an entire sector size, typically 4KB. Same for transmission, where the most common error is a missing packet.
    Dealing with large bursts of errors requires some specific techniques, which unfortunately cost much processing power and memory. As a consequence, the CPU and memory budget for error correction is way beyond that of LZ4, which makes the association a questionable choice.

    Therefore, it seems this feature is not "generic enough" to deserve a place in the generic framing format specification. Obviously, forking is always possible, and even recommended, to support specific features.

  • Allow multi-threaded compression and decompression
    Multi-threaded compression is easily achievable thanks to the division of input data into "blocks".

    Multi-threaded decoding is also possible if those blocks are "independent".
    Both modes shall be possible, and selectable.
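
Here is a minimal sketch of the independent-blocks mode. Since LZ4 is not part of the Python standard library, `zlib` is used purely as a stand-in compressor; the point is the block splitting and the thread pool, not the algorithm. Function names and the default worker count are assumptions of this example.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_blocks(data: bytes, block_size: int) -> list:
    """Cut the input into fixed-size blocks; only the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_blocks(data: bytes, block_size: int, workers: int = 4) -> list:
    """Compress every block independently, several at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so block order is preserved.
        return list(pool.map(zlib.compress, split_blocks(data, block_size)))


def decompress_blocks(blocks: list, workers: int = 4) -> bytes:
    """Decode independent blocks in parallel and reassemble the original data."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blocks))
```

Parallel decoding works here only because no block refers to data from a previous one. With dependent blocks, compression could still be split across threads, but decoding would have to proceed in order.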

  • Variable block sizes
    This one is tricky : up to now, we have been talking about "fixed size" blocks only, with only the last block of a compressed data set having an unknown (but capped) size.
    The idea here would be to authorize blocks of arbitrary size, instead of fixed ones.

    The benefits are two-fold :
    • Separate data on "natural boundaries", in order to improve compression ratio and speed
    • Allow data insertion of any size

      The first point is simple to argue with : such benefit only occurs with very-high ratio (and slow) compression algorithms, such as CM, which "learn" the data behavior through statistics. There is no tangible benefit in trying to do the same for LZ4.

      The second benefit is more interesting, since it authorizes some flexibility in archive management.
      Actually, this is mitigated by the possibility to concatenate compressed data sets (or "members") together in a single stream or file.
      Inserting data could therefore be realized by cutting the initial member into 2 parts, inserting the new member, and concatenating the 3 members together.
      As a consequence, it seems the format already supports such scenario, without needing variable block sizes.

  • Partial extraction and Quick Jump Table
    Another potential idea is that, within a member, one could need to only extract a specific portion.
    It's always possible to decode the full data set and get to the required location, but sometimes this might be overkill. For example, one may need a few MB of data which happen to be a few GB away from the starting point.

    However, the idea to decode just the necessary part introduces a few restrictions :
    • First, the input media should be seekable. It makes little sense to partially decode a piped stream, since the decoding process is likely faster than the pipe itself.
    • Second, the compressed data shall be cut into independent blocks. Otherwise, it would be necessary to decode, and therefore read, all previous blocks
    • Third, to avoid decoding "too much data", the blocks shall be small enough, with a corresponding impact on compression ratio (the smaller the block, the lower the compression ratio).
    • Fourth, since the i/o process is likely slower than LZ4 decoding, there is a benefit only if it is possible to quick-jump to the right location immediately.
      This can be achieved thanks to a table at the beginning of the compressed file. Such a table can only be filled after compression, and therefore is incompatible with non-seekable output.
    • Fifth, such "table" mechanism at member level would be useless in members appending scenarios.

      That is quite a lot of restrictions, for the benefit of a rarely requested feature.
      So probably this capability shall be left to a dedicated framing format.
      Moreover, should the input stream be seekable, it's still possible to "hop" over blocks without reading/decoding them. This is still slower than a direct jump, but still a sensible potential speed improvement.
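
The "hop" can be sketched as follows, assuming each block is prefixed by a 4-byte little-endian compressed size (as in the legacy walk, and again an assumption of this example) and that the file object is positioned at the first block. `seek_to_block` is a hypothetical helper.

```python
import io
import struct


def seek_to_block(stream, index: int) -> None:
    """Position `stream` at the size field of block number `index`.

    Earlier blocks are skipped with seek(): only their 4-byte size
    fields are read, their payloads are never read nor decoded.
    """
    for _ in range(index):
        header = stream.read(4)
        if len(header) < 4:
            raise EOFError("stream ends before the requested block")
        (compressed_size,) = struct.unpack("<I", header)
        stream.seek(compressed_size, io.SEEK_CUR)
```

With fixed-size independent blocks, the block holding a given uncompressed offset is simply `offset // block_size`, so reaching it costs one small read and one seek per skipped block. That is slower than a jump table, but much cheaper than decoding everything in between.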

  • Error detection algorithm
    As a quick follow up of selecting-checksum-algorithm, one could note that I had not specified a preferred checksum algorithm, only a preferred checksum size (32-bits).
    Although at this stage I'm somewhat inclined to use xxhash-strong, due to its speed and very good distribution property, there is still a chance that the algorithm might be found unsuitable at a later stage. Therefore, some provision should be left to allow another algorithm to take over later on if need be.

    Pushing the idea a bit further, one could think "let the user select its own checksum algorithm". While the idea may sound nice, it goes against the principle of interoperability, which is exactly what this framing format tries to achieve. Therefore, only clearly defined checksum algorithms shall be allowed.

I believe this post went through most foreseeable requirements for the LZ4 framing format.
So now seems a reasonable time to start a header specification.
