Tuesday, April 9, 2013

LZ4 Framing format : Final specifications


The LZ4 Framing Format specification has progressed quite a bit since the last post, taking into consideration most issues raised by commenters. It has now reached version 1.4.1 (see edits below), which looks stable enough to start implementing the next version of LZ4.


As a consequence, barring any last-minute important item raised by contributors, the currently published specification will be used in upcoming LZ4 releases.

[Edit] : and last-minute change there is. Following a suggestion by Takayuki Matsuoka, the header checksum is now slightly different, in an effort to be more friendly to read-only media, hopefully improving clarity in the process. The specification version is now raised to v1.3.

[Edit 2] : A first version of LZ4c, implementing the above specification, is available at Google Code.

[Edit 3] : Following recommendations from Mark Adler, version v1.4 re-introduces the stream checksum. It's not correct to assume that the block checksum makes the stream checksum redundant : block checksums only validate that each block is free of errors, while the stream checksum verifies that all blocks are present and in the correct order. Finally, the stream checksum also validates the encoding/decoding stage itself.
v1.4 also introduces the definition of "skippable chunks", which can encapsulate user-defined data of any kind and be integrated into a flow of LZ4 streams.

[Edit 4] : Changed naming convention in v1.4.1 from "streaming" to "framing".

36 comments:

  1. Could there be some sort of delta compression based on LZ4?

    Something like Cloudflare's Railgun, which utilizes differential compression to achieve minimal IO.

    ReplyDelete
    Replies
    1. Well, when the streaming interface is ready, that's exactly the kind of use case it's built for.

      Delete
  2. Hi, here are conventional questions:

    (1) Concatenation: Is concatenating more than 2 streams possible ?

    (2) Concatenation: How should we implement concatenation-checking code ?
    A straightforward method looks like this:

    if (isEos) {
        uint32_t probe = read32(input_stream);
        if (ARCHIVE_MAGICNUMBER == probe) goto start_decoding;
        unread32(input_stream, probe);
        return;
    }

    But this code may read past the end of the stream data.

    (3) Concatenation: To prevent accidental concatenation, is an additional guard word after EoS recommended ?
    Multiple LZ4 streams without padding, which may cause accidental concatenation:

    individual but concatenated: MAGIC ... EoS, MAGIC ... EoS

    A GUARD word (GUARD != MAGIC) prevents this accident:

    individually guarded : MAGIC ... EoS, GUARD, MAGIC ... EoS, GUARD

    (4) Descriptor Flags: Is the byte order BC, FLG ?
    In the "Stream Descriptor" table, it looks like a 2-byte (16-bit) little-endian value.

    (5) Version Number: Shall the decoder check FLG.VersionNumber first (after the Magic Number) ?

    (6) Block checksum flag: Could we insert "If this flag is set," before the first sentence ?

    ReplyDelete
    Replies
    1. Takayuki> Is concatenating more than 2 streams possible ?

      Yes, there's no limit.

      Delete
    2. Takayuki> How should we implement concatenation checking code ?

      The straightforward method you describe looks good to me.

      Delete
    3. Takayuki> To prevent accidental concatenation, is additional guard word after EoS recommended ?

      I'm not sure. What is an "accidental concatenation" ? How could it happen ?

      Delete
    4. Takayuki> Descriptor Flags: Is byte order BC, FLG ?

      No, it is FLG - BC.
      You are right, this should be more clearly described.

      Delete
    5. Takayuki> Shall decoder check FLG.VersionNumber first (after Magic Number) ?

      Yes, exactly

      Delete
    6. Takayuki> (6) Block checksum flag

      Yes, good proposal, done

      Delete
    7. Thanks for tons of answers !

      @Yann > I'm not sure. What is an "accidental concatenation" ? How could it happen ?

      In my mind, an LZ4 stream has an API that looks like this:
      int LZ4_stream_uncompress(const char* source, char* dest, int destSize);

      This is a counterpart of LZ4_uncompress().
      "source" points to the whole stream, which has a MAGIC and an EoS.
      Since an LZ4 stream has an EoS, there is no argument like "sourceSize".

      On the other hand, suppose someone makes his own in-house LZ4 stream archiver.
      This archiver just concatenates LZ4 stream files into an archive file with no padding,
      and outputs tuples { filename, offset_in_archive } to a separate info file.

      Here, in the archive file, since there is no padding, magic numbers and EoS
      marks are always adjacent to each other.

      It looks like { MAGIC, LZ4 file1, EoS, MAGIC, LZ4 file2, EoS, .. }.

      So, after loading the archive & info is completed, he decompresses the first
      file with LZ4_stream_uncompress().
      He wants to decompress only the first file, but after the EoS of the first file,
      there is a MAGIC number. So concatenation will occur, and this process
      will continue to the end of the archive.

      Of course this will not happen if he sets an appropriate destSize, etc.
      But this kind of "fill a buffer from several files" will not work:

      size_t s = 32 * 1024*1024;
      std::vector<char> d(s);
      for (size_t r = 0; r < s; ) {
          src = getFilePtr(...);
          r += LZ4_stream_uncompress(src, d.data() + r, s - r);
      }

      So I think a MAGIC after EoS is dangerous.

      Delete
    8. One way to solve this would be to introduce another Magic Number, called as you propose "Guard", which would instruct the decoder to stop there. By the way, this is how the decoder will behave anyway : after the end of stream mark, it will try to decode the next magic number, and fail, stopping the decoding process.

      On the other hand, I'm not sure this issue is related to this specification.

      Here, we are just introducing a "stream specification". Creating an "archive format" on top of this must be dealt within another specification. I believe it will be up to the archive specification to precisely define how compressed streams are appended and how they should be interpreted.

      Delete
    9. > On the other hand, i'm not sure this issue is related to this specification.

      Agreed. My question was too implementation-centric.
      Thanks for the precise answer !

      Delete
  3. Skippable chunks: The proposed magic word is 0x184D2A50 and the following 15 values, which correspond to an LZ4 compressed block (highest bit is zero) with a size of 407'710'288 (+ up to 15) bytes.

    While it is unlikely that the valid block sizes will ever be expanded to include this value, I would still prefer to play it safe and use a value that is as large as possible, instead of one 'in the middle' of the available range. My proposal would be to use 0x7fffffff: it is the largest compressed block size that can be expressed in 32 bits, so even if valid block sizes are expanded beyond our expectations, an uncompressed block would probably be used instead.

    David

    PS: I realize that this is only a single value instead of 16 values. I suppose the reasoning to allow 16 values was to allow multiple 'streams'. Skippable chunks are a black-box for LZ4, what is stored inside is user defined. So if users ever need multiple 'streams' they can do it themselves inside the skippable chunks.

    ReplyDelete
    Replies
    1. Hi David

      Skippable chunk's magic number cannot be confused with block sizes.

      If the preceding stream is a Legacy one (now deprecated, therefore not advised), the legacy stream stops as soon as a block size exceeds its authorized maximum, i.e. LZ4_compressBound(LEGACY_BLOCKSIZE), a value slightly above 8MB. As a consequence, any Magic Number with a value > 16MB will stop the Legacy stream and trigger the stream selector.

      If the preceding stream is a fully defined LZ4S stream according to the above specification, the end of stream is explicit ("0000"), and most likely followed by a checksum. As a consequence, the next field is necessarily a Magic Number.

      Note there is currently no way to define large block sizes of >400MB, but this is a capability which may be defined in a later version of the specification.

      PS : Your understanding of skippable chunks usage is correct.

      Rgds

      Delete
    2. Thanks Yann!

      "Skippable chunks allow the integration of user-defined data into a flow of concatenated streams." I did not read this carefully enough and thought skippable chunks are allowed inside a stream, in the place of a data block. If they are not, your magic number makes perfectly sense :).

      I feel like your spec could still be more explicit about where exactly skippable chunks are allowed. E.g. is a stream allowed to start with a skippable chunk (which can make file-type detection by magic bytes more complicated), or do we need to insert a 0-byte stream before the skippable chunk, etc.

      You could also change the structure of your spec to make it more hierarchical, e.g. by adding hierarchical section numbers. At the moment the titles 'Data Blocks' and 'Skippable Chunks' look visually identical and are on consecutive pages. This conveys the message that they are 'on the same level' and describe items in the same place.

      Delete
    3. They are all very good comments. I'll update the document to take them into consideration.

      Delete
  4. I am curious as to why there are only two bits for the version number?

    ReplyDelete
    Replies
    1. The main ideas are :
      - It's likely we'll never need more than one or two versions.
      - In case we nonetheless reach 4 versions : the 4th one will reserve a few more bits, to keep further expansion possible.

      Delete
  5. For stream sizes, it seems a representation as a compressed integer (7-bit varint or zigzag encoded) would allow large stream sizes while minimizing the bloat for smaller streams. This is the technique I usually use.

    ReplyDelete
    Replies
    1. Indeed, Mark Adler was also in favor of this representation, while some other contributors were in favor of simplicity, arguing that LZ4 is not about efficiency anyway, but speed and simplicity.

      A direct 8-byte field looked simpler at the time this specification was written.

      Delete
    2. Simplicity is in the eye of the beholder. The stream size fields are useful in that they allow us to allocate exactly the correct size of output buffer; if using a compressed integer, it has no real negative impact on size. This is useful for the applications I am working on now, specifically using C#. At any rate, I include a 32-bit decoder here in C#; a 64-bit one is nearly identical:

      internal static uint Read7BitEncodedUInt32Core(BinaryReader reader)
      {
          uint result = 0;
          int shift = 0;
          do
          {
              byte next = reader.ReadByte();
              result |= (uint)(next & 0x7f) << shift;
              if ((next & 0x80) == 0)
                  return result;
              shift += 7;
          }
          while (shift <= 32);
          throw new FormatException("Top bit set in byte " + (shift / 7 + 1) +
              " in Read7BitEncodedUInt32Core.");
      }

      Delete
    3. Does it sometimes happen that some messages (streams) have length <= 127 ?

      If that's rare, would it be better to start directly at 2 bytes ? (allowing sizes <= 32767)

      Delete
    4. I cannot speak for others, but I have been using it to compress data going to databases, and for compression of real-time messages. In both cases, many can be small. Perhaps I am wrong, but I thought LZ4 was about both speed and efficiency.

      Delete
    5. OK, I've been pondering the alternatives a bit. So, to sum up, you would like a format which minimizes header size, in order to send small and very small packets with it.

      Reducing the Stream Size field is one way to get closer to that. But there might be other questions around it. For example, what about the 4-byte block size field ? Is it too long too ? What about the 4-byte Magic Number ? What about the 1-byte header checksum ?

      Also, one thing to keep in mind : if sending multiple streams, they will not be correlated, meaning it's impossible to use previous data to compress the next streams. Whereas, when sending one stream with multiple blocks (of variable size), it's possible to use previous blocks within the same stream to compress the next blocks.

      Which makes me wonder : do you want to send multiple very small independent streams ? Or multiple small blocks belonging to the same stream ?

      Delete
    6. In my case, it is multiple independent streams. It may be silly to use a streaming format for small messages. Perhaps small streams are not important, and I should only compress the whole message and forget about a streaming format (as now). Then we have to consider what to do with larger messages... it would be nice to have one format that works well for both tiny and large streams.

      Delete
    7. I will give you specific examples. Using the C# safe version of your LZ4 compression, a client of mine has reduced the size of his database enormously, with no negative performance impact (thank you very much!). We had to use the safe version (no pointers or unsafe code) as sometimes it is executed directly in SQL server, which has those limitations. This particular database only compresses one column in one table. It is a column containing a character string that is the formatted version of a receipt (a different representation would have allowed better compression, but this is the most reliable format for them). Mostly they are tiny, and sometimes longer. There are enormous numbers of these strings, and they are the only reason the database became large.

      A second example is real-time messages between two servers. These are binary-encoded messages specific to the application, containing some strings. Most are tiny, but there are occasional giant ones that contain database data. The giant ones require a streaming format, and they need to be broken up. I am in the process of figuring out what to do with these.

      Delete
  6. OK, I feel your use cases are perfectly valid.
    The current streaming format is most likely fine for large string messages.
    For very small ones though, there are 2 sub-cases.

    For real-time messages between servers, you may benefit from the "inter-dependent blocks" feature. In essence, you don't send a stream per packet; you start the link with a stream, and then you send "data blocks". Not only are the headers much smaller, but more importantly, each data block benefits from the previously sent ones, improving compression. For small data blocks, this feature produces huge compression improvements.

    However, for your column example, you probably want to decode each field individually, in which case it's not desirable to link data blocks together. Here, the streaming format might not be your best choice.
    That's the reason the LZ4 compression format and the LZ4 streaming format are kept separate. In several circumstances, a custom format on top of the LZ4 "core" will prove better tuned.

    ReplyDelete
    Replies
    1. Hello Yann,

      Your comments are helpful. I suppose I will add a marker to indicate which format is used for a message. I will read about the inter-dependent data blocks. It makes sense. My limit is about 3MB so I also have to figure out how to break messages up when they go over the limit as well. Thanks, Frank

      Delete
  7. Hi,

    Is there an ETA on when streaming interface for LZ4 will be available in a library, not merely in a CLI tool?

    Thanks,
    Alexander.

    ReplyDelete
    Replies
    1. LZ4 streaming interface in a lib is a primary objective, but I can't state a date. Real life is taking away too much time currently.

      Delete
  8. Hi, could you please provide C# example code for LZ4 ?
    Something like using (GZipStream compressionStream = new GZipStream(compressedFileStream, CompressionMode.Compress)) for gzip.
    How can we compress files using LZ4 ?

    ReplyDelete
  9. Sorry, I do not provide, and therefore do not support, a C# version.
    Your questions will likely be better answered by the authors of the LZ4 C# ports themselves :

    C# : by Milosz Krajewski, at http://lz4net.codeplex.com

    C# streaming : by Phill Djonov, at https://github.com/pdjonov/Lz4Stream

    ReplyDelete
  10. Hi Yann,

    any chance to re-implement the streaming API in a way that allows using different input buffers, not just rewinding the same one?

    I've been trying to get that to work by having a function that copies out the 64KB chunk preceding nextBlock, and then copies it back into the new buffer, adjusting base and bufferStart to point to new_buffer + 64KB, but that's not really right.

    The input buffer sliding code seems fairly unintuitive - what does 'base' really mean?

    ReplyDelete
    Replies
    1. Yes,

      and indeed I'm currently working on it, right now.
      But it's fairly complex, and not completed at this stage.

      For an early look, you can have a glance at the "streaming" branch on github : https://github.com/Cyan4973/lz4/tree/streaming

      The decompression functions are ready, and testable.

      However the compression functions are much more difficult to complete...

      Delete
  11. Could you please clarify the HC part for us non-English people.

    According to the Spec.

    One-byte checksum of all descriptor fields, including optional ones when present.
    The byte is second byte of xxh32() : { (xxh32()>>8) & 0xFF } ,
    using zero as a seed,
    and the full Frame Descriptor as an input (including optional fields when they are present).
    A different checksum indicates an error in the descriptor.

    I don't understand the -> "The byte is second byte of xxh32() : { (xxh32()>>8) & 0xFF }" part of the spec.

    I'm assuming I get:
    SPEC CODE - BYTE - BINARY
    FLG - 100 - 01100100
    BD - 112 - 01110000
    HC - 65533

    What should the xxhash be ?
    xxhash(100 + 112) = xxhash(212) ?
    or
    xxhash("100" + "112") = xxhash("100112") ?

    Thank you.

    ReplyDelete
    Replies
    1. (XXH32()>>8) & 0xFF
      is just a way to tell, using code, the same thing as :
      The byte is second byte of XXH32()

      As for XXH32(), its arguments are :
      1- pointer
      2- length
      3- seed
      The pointer : where the header starts, therefore the position of byte FLG
      length : 2 in your example (FLG & BD)
      seed : 0, as stated by the spec.

      Is it clearer ?

      Delete