Sunday, April 8, 2012

A file container format for LZ4

 With increased usage of LZ4 come new use cases, and with them new requirements.

The recent ports of LZ4 to Python and Perl are two excellent examples, demultiplying the possible usages by making the algorithm available to a broader range of applications, including easy-to-deploy scripts.

In this context, a simple question is asked : how can one be sure that a file compressed with, for example, the Python port of LZ4, will be decodable by the C# port of LZ4 ?

Which is indeed a very good question. And the sorry answer is that : there is no such guarantee. At least, not yet.

LZ4 has been developed as an in-memory compression algorithm : it takes a memory buffer as an input, and provides the compressed result into an output memory buffer. Whatever is done before or after that operation is beyond its area of responsibility. The fact that these buffers may be read, or written, to disk is just one of many possibilities.

This is a correct positioning. There are too many possible usages for LZ4, and a lot of them will never bother with on-disk files or network streaming.

Nevertheless, it happens that one natural usage of compression algorithm is file compression, and it applies to LZ4 too. In this case, the compressed result is written to a media, and may be decoded later, possibly by a completely different system. To ensure this operation will be always possible, even when using any other version of LZ4 written in any language, it is necessary to define a file container format.

To be more complete, a container format has been defined into lz4demo, but it was never meant to become a reference. It is too limited for this ambition. Therefore, a new container format specification is necessary.

The ideas in this post will be largely borrowed from this older one, where a container format was discussed and defined for Zhuff. Most of the points are still applicable for LZ4.

We'll start by defining the objectives. What are the responsibilities of a file container format ? Well, mostly, it shall ensure that the original file will be restored entirely and without errors. This can be translated into the following technical missions :
  • Detect valid containers from invalid input (compulsory)
    • Detect / control errors (recommended)
      • Correct errors (optional)
  • Help the decoder to organize its memory buffers 
    • buffer sizes, and number of buffers (compulsory)
    • provide user control over buffer sizes (optional)
  • Organize data into independent chunks for multi-threaded operations (optional)
  • Allow to quick-jump to a specific data segment within the container (optional)
  • Be compatible with non-seekable streams, typically pipe mode (compulsory)
  • Enforce alignment rules (optional)
  • Allow simplified concatenation of compressed streams (optional)

This is already quite a few missions. But most of them will be possible, as we'll see later.

Another important thing to define is what's not within this format missions :
  • Save the file name (debatable)
  • Save file attributes and path
  • Group several files (or directory) into a single compressed stream
  • Provide optional user space, typically for comments (debatable)
  • Split the compressed stream into several output files
In my view, these elements should be part of an archive format.
While the file container format takes care of regenerating original content, the archive will also take care of directory structure. It will be able to save file names, and re-organize them in a better way to help compression. On decoding, it will regenerate the entire tree structure, placing each file at its correct location, with sometimes also original attributes.
An archive format therefore occupy a higher layer in the structure. And it typically depends itself on such container formats.

Last but not least, we'll make the definition of "file" a bit broader, Unix-like, in order to also encompass "pipe'd data". The "file container format" then becomes a "stream container format", with files being one of the possible streams.

All these elements can be provided in a relatively compact and efficient header.
But first, let's discuss the most important features.

[Note] Considering the comments, it seems the word "container" does not properly define the objective of this specification. Another word shall be found. "Framing format" is the current candidate.

5 comments:

  1. Forwarding an idea proposed by Steven Pigeon (http://www.stevenpigeon.org/) : use IFF (Interchange File Format) or MPEG4-Part14 (MPEG-4 Part 14) as a source of inspiration for a container format (http://en.wikipedia.org/wiki/Interchange_File_Format).

    My comment : it looks broader in term of scope. The main idea seems to append meta-data sharing a common prolog convention into a file. Each meta-data can be, basically, anything. One major property is that a reader can skip over metatdata it can't decode to reach next ones.

    ReplyDelete
  2. Containter format is imho something different - it describes
    stream types, space allocation for each stream etc - something
    like movie formats with their audio/video interleaving.

    And what you describe seems like a lower-level thing - a stream format.
    Which is normally not necessary at all:
    - stream parsing and decompression are usually different things,
    so there's enough space for MT;
    - the best way to validate the stream is to try decoding a block
    and see if there'd be an error. Stuff like signatures and
    comp/uncomp lens is hardly helpful at all - unpacked size may be
    intentionally set to lower value to cause a buffer overflow etc,
    so there'd be more checks and no gain.
    - useless overhead is useless

    Then, there're also codecs with blockwise optimization, where
    block headers actually contain information which is necessary
    for block decoding (eg. deflate).
    But its a different thing - there're no signatures or sizes
    in deflate's block headers (aside from "stored" blocks).

    As to buffer allocation for decoding, I've seen 3 methods
    1. Store uncompressed size in a header and hope that
    nobody would hack it, or at least it would be possible
    to stop decoding at end of buffer (but note that mp3 players
    are usually expected to keep playing after encountering a
    broken frame).
    2. Do overflow checks once per token - precompute the
    maximum length of LZ token's code and use unchecked
    decoder if there's still enough data, then switch to
    a checked version near end of buffer. (LZMA does this)
    3. Use coroutines/state machines to work with provided
    input/output buffers of any size.
    3a. Bound checks can be replaced with guard pages, hardware breakpoints,
    segment limits or any other implicit checks supported by hardware.

    So I just think that a correctly implemented codec doesn't
    need any redundant information - it would just return an error
    when there's an error.

    But these higher level formats are much more tricky.
    Sequential stream interleaving requires knowing precisely
    how interleaved codecs work, or there'd be a significant
    redundancy (which seems common though).
    And archive formats are even more troublesome - there's
    also an interleaving issue, and eg. folder's timestamps
    can be only restored after all processing within the folder
    is finished.

    ReplyDelete
  3. Thanks for your insightful comment Shelwien.

    I agree that what is needed here is of "lower level" than Container (at least as defined by IFF). Probably the naming should be changed to avoid confusion.
    The equivalent is named "framing format" in Snappy convention. See http://code.google.com/p/snappy/source/browse/trunk/framing_format.txt.

    I also agree that signature is useless, or only useful together with an encryption layer, which is not part of the objective.

    And finally, i also agree that the decoder must be able to handle errors, especially bad input values.

    However, i expect the framing format to be useful. At the very least, it shall be a published format, which ensures that programs using the specification will be compatible with each other, on any machine (little/big endian). This is the primary objective.

    Having some basic error detection algorithm looks also useful, imho. But this is very different from detecting tampered (intentionally modified) data. Basic error detection only target transmission or storage errors (ie random ones) .

    Such error could, indeed, trigger an error within the decoder.
    But for Byte-aligned LZ77, the situation is different than for more complex format. A modified literal will remain unnoticed for example. Even an offset error can be unnoticeable if it happens after 64KB.

    This is different from Arithmetic coders, or even Huffman, when an error anywhere in the stream has devastating avalanche consequences for the rest of the stream, and therefore stands little chance of remaining unnoticed.

    ReplyDelete
  4. This seems to be mostly addressing the issue of storing the compressed data as a file/directory structure, but what if you're just passing the compressed blob between processes or over the network. You don't actually need a complicated container, only the stream size so it can be decompressed on the other side. Snappy does this, it has a stream header containing size, and it also has a separate framing standard for more complicated cases: https://code.google.com/p/snappy/source/browse/trunk/framing_format.txt

    ReplyDelete
  5. I think there is a misunderstanding.
    File/directory structure is exactly what's this specification is not about. You may read again the chapter starting with : "Another important thing to define is what's not within this format missions".

    Passing the compressed blob between processes or over the network is indeed part of the objective. It is included in the property called "Be compatible with non-seekable streams", which requires some special handling since input size is unknown, and output cannot accept a "second pass" correction or optimization.

    Now, i still believe that for a common framing format to be usable for both file and data streams, it is beneficial to consider more than just sending the size (although it is perfectly possible to reduce the format options to only output this). If you look at the snappy frame format, you'll see that it defines some provisions to handle memory buffers (in a fixed way, with chunks being guaranteed to be <=32KB), and to detect error (using masked CRC32).

    These questions are the very same we are talking about here, alongside a few other options which might, or might not, end in the specification.

    ReplyDelete