Tuesday, April 9, 2013

LZ4 Frame format : Final specifications

[Edit] : the specification linked from this blog post is quite old by now. Prefer consulting the up-to-date version, stored directly in the project's repository, at https://github.com/lz4/lz4/tree/master/doc .


The LZ4 Framing Format specification has progressed quite a bit since the last post, taking into consideration most issues raised by commenters. It has now reached version 1.5 (see edits below), which looks stable enough.


As a consequence, save any last-minute important item raised by contributors, the currently published specification will be used in upcoming LZ4 releases.

[Edit] : and last-minute change there is. Following a suggestion by Takayuki Matsuoka, the header checksum is now slightly different, in an effort to be more friendly to read-only media, hopefully improving clarity in the process. The specification version is now raised to v1.3.

[Edit 2] : A first version of LZ4c, implementing the above specification, is available at Google Code.

[Edit 3] : Following recommendations from Mark Adler, version v1.4 re-introduces the frame content checksum. It's not correct to assume that block checksums make the frame content checksum redundant : block checksums only validate that each block has no error, while the frame content checksum verifies that all blocks are present and in the correct order. Finally, the frame content checksum also validates the encoding & decoding stages.
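
To illustrate the difference, here is a minimal decoder-side sketch in C, using the xxHash library. It is an illustration only, not the reference implementation : frame parsing is assumed to be already done, and the helper's parameter layout is purely hypothetical.

#include "lz4.h"
#include "xxhash.h"

/* Verifies one frame whose blocks have already been parsed.
   blockChecksum[i] and contentChecksum are the values read from the frame.
   Returns the total decoded size, or -1 on error. */
int verify_frame(const char* const* cBlock, const int* cSize,
                 const unsigned* blockChecksum, int nbBlocks,
                 unsigned contentChecksum, char* dst, int dstCapacity)
{
    XXH32_state_t* const ctx = XXH32_createState();
    int total = 0;
    if (ctx == NULL) return -1;
    XXH32_reset(ctx, 0);
    for (int i = 0; i < nbBlocks; i++) {
        /* Block checksum : detects corruption inside this block only */
        if (XXH32(cBlock[i], (size_t)cSize[i], 0) != blockChecksum[i]) { XXH32_freeState(ctx); return -1; }
        int const dSize = LZ4_decompress_safe(cBlock[i], dst + total, cSize[i], dstCapacity - total);
        if (dSize < 0) { XXH32_freeState(ctx); return -1; }
        /* Content checksum : hashes the decoded data across all blocks,
           so a missing or re-ordered block is detected here */
        XXH32_update(ctx, dst + total, (size_t)dSize);
        total += dSize;
    }
    unsigned const checksum = XXH32_digest(ctx);
    XXH32_freeState(ctx);
    return (checksum == contentChecksum) ? total : -1;
}
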
v1.4 also introduces the definition of "skippable frames", which can be used to encapsulate any kind of user-defined data into a flow of appended LZ4 frames.

[Edit 4] : Changed naming convention in v1.4.1 to "frame".

[Edit 5] : v1.5 removed Dict_ID from the specification

54 comments:

  1. Could there be some sort of delta compression based on lz4?

    Something like Cloudflare's Railgun, which utilizes differential compression to achieve minimal IO.

    Replies
    1. Well, once the streaming interface is ready, that's exactly the kind of use case it's built for.

  2. Hi, here are conventional questions:

    (1) Concatenation: Is concatenating more than 2 streams possible ?

    (2) Concatenation: How should we implement concatenation-checking code ?
    A straightforward method looks like this:

    if (isEos) {
        uint32_t probe = read32(input_stream);
        if (ARCHIVE_MAGICNUMBER == probe) goto start_decoding;
        unread32(input_stream, probe);
        return;
    }

    But this code may read past the end of the stream data.

    (3) Concatenation: To prevent accidental concatenation, is an additional guard word after EoS recommended ?
    Multiple LZ4 streams without padding may cause accidental concatenation:

    individual but concatenated: MAGIC ... EoS, MAGIC ... EoS

    A GUARD word (GUARD != MAGIC) prevents this accident:

    individual with guard : MAGIC ... EoS, GUARD, MAGIC .. EoS, GUARD

    (4) Descriptor Flags: Is the byte order BC, FLG ?
    In the "Stream Descriptor" table, it looks like a 2-byte (16-bit) little-endian value.

    (5) Version Number: Shall the decoder check FLG.VersionNumber first (after the Magic Number) ?

    (6) Block checksum flag: Could we insert "If this flag is set," before the first sentence ?

    Replies
    1. Takayuki> Is concatenating more than 2 streams possible ?

      Yes, there's no limit.

    2. Takayuki> How should we implement concatenation checking code ?

      The straightforward method you describe looks good to me.

    3. Takayuki> To prevent accidental concatenation, is additional guard word after EoS recommended ?

      I'm not sure. What is an "accidental concatenation" ? How could it happen ?

    4. Takayuki> Descriptor Flags: Is byte order BC, FLG ?

      No, it is FLG - BC.
      You are right, this should be more clearly described.

    5. Takayuki> Shall decoder check FLG.VersionNumber first (after Magic Number) ?

      Yes, exactly

    6. Takayuki> (6) Block checksum flag

      Yes, good proposal, done

    7. Thanks for tons of answers !

      @Yann > I'm not sure. What is an "accidental concatenation" ? How could it happen ?

      In my mind, the LZ4 stream API looks like this:
      int LZ4_stream_uncompress(const char* source, char* dest, int destSize);

      This is a counterpart of LZ4_uncompress().
      "source" points to the whole stream, which has MAGIC and EoS.
      Since the LZ4 stream has an EoS, there is no argument like "sourceSize".

      On the other hand, suppose someone writes his own in-house LZ4 stream archiver.
      This archiver just concatenates LZ4 stream files into an archive file with no padding,
      and outputs a tuple { filename, offset_in_archive } to a separate info file.

      Here, in the archive file, since there is no padding, a magic number and an EoS
      mark are always adjacent to each other.

      It looks like { MAGIC, LZ4 file1, EoS, MAGIC, LZ4 file2, EoS, .. }.

      So, after loading the archive & info is completed, he decompresses the first
      file with LZ4_stream_uncompress().
      He wants to decompress only the first file, but right after the EoS of the first file
      there is a MAGIC number. So concatenation will occur, and this process
      will continue to the end of the archive.

      Of course this will not happen if he sets an appropriate destSize, etc.
      But this kind of "fill a buffer with several files" pattern will not work:

      size_t s = 32 * 1024 * 1024;
      std::vector<char> d(s);
      for (size_t r = 0; r < s; ) {
          const char* src = getFilePtr(...);
          r += LZ4_stream_uncompress(src, d.data() + r, s - r);
      }

      So I think MAGIC after EoS is dangerous.

    8. One way to solve this would be to introduce another Magic Number, called as you propose "Guard", which would instruct the decoder to stop there. By the way, this is how the decoder will behave anyway : after the end of stream mark, it will try to decode the next magic number, and fail, stopping the decoding process.

      On the other hand, I'm not sure this issue is related to this specification.

      Here, we are just introducing a "stream specification". Creating an "archive format" on top of this must be dealt with in another specification. I believe it will be up to the archive specification to precisely define how compressed streams are appended and how they should be interpreted.

    9. > On the other hand, i'm not sure this issue is related to this specification.

      Agreed. My question was too implementation-centric.
      Thanks for the precise answers !

  3. Skippable chunks: The proposed magic word is 0x184D2A50 and the following 15 values, which correspond to an LZ4 compressed block (highest bit is zero) with a size of 407'710'288 (+ up to 15) bytes.

    While it is unlikely that the valid block sizes will ever be expanded to include this value, I would still prefer to play it safe and use a value that is as large as possible instead of using one 'in the middle' of the available range. My proposal would be to use 0x7fffffff: It is the largest compressed block size that can be expressed with 32 bits, so even if valid block sizes will be expanded beyond our expectations, an uncompressed block would probably be used instead.

    David

    PS: I realize that this is only a single value instead of 16 values. I suppose the reasoning to allow 16 values was to allow multiple 'streams'. Skippable chunks are a black-box for LZ4, what is stored inside is user defined. So if users ever need multiple 'streams' they can do it themselves inside the skippable chunks.

    Replies
    1. Hi David

      Skippable chunk's magic number cannot be confused with block sizes.

      If the preceding stream is a Legacy one (which is now deprecated, therefore not advised), the legacy stream will stop as soon as a block size is beyond its authorized limit, which means above LZ4_compressBound(LEGACY_BLOCKSIZE), a value slightly above 8 MB. As a consequence, any Magic Number with a value > 16 MB will stop the Legacy stream and trigger the stream selector.
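
      For illustration, a minimal sketch of that test (LEGACY_BLOCKSIZE = 8 MB is an assumption matching the command-line tool, not a value from this specification) :

      #include "lz4.h"

      #define LEGACY_BLOCKSIZE (8 * 1024 * 1024)

      /* Returns 1 if `word` cannot be a legacy block size, i.e. it must be the magic number of the next stream */
      static int ends_legacy_stream(unsigned word)
      {
          return word > (unsigned)LZ4_compressBound(LEGACY_BLOCKSIZE);
      }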

      If the preceding stream is a fully defined LZ4S stream according to the above specification, the end of stream is explicit ("0000"), and most likely followed by a checksum. As a consequence, the next field is necessarily a Magic Number.

      Note there is currently no way to define large block sizes of > 400 MB, but this is a capability which may be defined in a later version of the specification.

      PS : Your understanding of skippable chunks usage is correct.

      Rgds

    2. Thanks Yann!

      "Skippable chunks allow the integration of user-defined data into a flow of concatenated streams." I did not read this carefully enough and thought skippable chunks are allowed inside a stream, in the place of a data block. If they are not, your magic number makes perfectly sense :).

      I feel like your spec could still be more explicit about where exactly skippable chunks are allowed. E.g. is a stream allowed to start with a skippable chunk (can make file type detection by magic bytes more complicated) or do we need to insert a 0-byte stream before the skippable chunk, etc.

      You could also change structure of your spec to make it more hierarchical, e.g. by adding hierarchical section numbers. At the moment the titles 'Data Blocks' and 'Skippable Chunks' look visually identical and are on consecutive pages. This conveys the message that they are 'on the same level' and describe items in the same place.

    3. They are all very good comments. I'll update the document to take them into consideration.

  4. I am curious as to why there are only two bits for the version number?

    Replies
    1. The main ideas are :
      - It's likely we'll never need more than one or two versions.
      - In case we nonetheless reach 4 versions : the 4th one will reserve a few more bits to make expandability still possible.

  5. For stream sizes, it seems a representation as a compressed integer (7-bit or zigzag encoded) would allow large stream sizes while minimizing the bloat for smaller streams. This is the technique I usually use.

    Replies
    1. Indeed, Mark Adler was also in favor of the same representation, while some other contributors were in favor of simplicity, arguing that LZ4 is not about efficiency anyway, but about speed and simplicity.

      A direct 8-byte field looked simpler at the time this specification was written.

    2. Simplicity is in the eye of the beholder. The stream size field is useful in that it allows us to allocate an output buffer of exactly the correct size; using a compressed integer has no real negative impact on size. This is useful for the applications I am working on now, specifically using C#. At any rate, I include a 32-bit decoder here in C#; a 64-bit one is nearly identical:
      internal static uint Read7BitEncodedUInt32Core(BinaryReader reader)
      {
          uint result = 0;
          int shift = 0;
          do
          {
              byte next = reader.ReadByte();
              result |= (uint)(next & 0x7f) << shift;
              if ((next & 0x80) == 0)
                  return result;
              shift += 7;
          }
          while (shift <= 32);
          throw new FormatException("Top bit set in byte " + (shift / 7 + 1) +
              " in Read7BitEncodedUInt32Core.");
      }

    3. Does it sometimes happen that some messages (streams) have a length <= 127 ?

      If that's rare, would it be better to start directly at 2 bytes ? (allowing sizes <= 32767)

    4. I cannot speak for others, but I have been using it to compress data going to databases, and for compression of real-time messages. In both cases, many can be small. Perhaps I am wrong, but I thought LZ4 was about both speed and efficiency.

    5. OK, I've been pondering a bit about the best alternative. So, to sum up, you would like a format which minimizes header size, in order to send small and very small packets with it.

      Reducing the Stream Size field is one way to get closer to that. But there are other questions too. For example, what about the 4-byte block size field ? Is it too long as well ? What about the 4-byte Magic Number ? What about the 1-byte header checksum ?

      Also, one thing to keep in mind : if you send multiple streams, they will not be correlated, meaning it's impossible to use previous data to compress subsequent streams. Whereas, when sending one stream with multiple blocks (of variable size), it's possible to use previous blocks within the same stream to compress the next blocks.

      Which makes me wonder : do you want to send multiple very small independent streams, or multiple small blocks belonging to the same stream ?

    6. In my case, it is multiple independent streams. It may be silly to use a streaming format for small messages. Perhaps small streams are not important, and I should only compress the whole message and forget about a streaming format (as now). Then we have to consider what to do with larger messages... it would be nice to have one format that works well for both tiny and large streams.

    7. I will give you specific examples. Using the C# safe version of your LZ4 compression, a client of mine has reduced the size of his database enormously, with no negative performance impact (thank you very much!). We had to use the safe version (no pointers or unsafe code) as sometimes it is executed directly in SQL server, which has those limitations. This particular database only compresses one column in one table. It is a column containing a character string that is the formatted version of a receipt (a different representation would have allowed better compression, but this is the most reliable format for them). Mostly they are tiny, and sometimes longer. There are enormous numbers of these strings, and they are the only reason the database became large.

      A second example is some real-time messages between two servers. These are binary encoded messages specific to the application, containing some strings. Most are tiny, but there are occasional giant ones that contain database data. The giant ones require a streaming format, and they need to be broken up. I am in the process of figuring out what to do with these.

  6. OK, I feel your use cases are perfectly valid.
    The current streaming format is most likely fine for large string messages.
    For very small ones though, there are 2 sub-cases.

    For real-time messages between servers, you may benefit from the "inter-dependent blocks" feature (see the sketch at the end of this comment). In essence, you don't send a stream per packet; you start the link with a stream, and then you send "data blocks". Not only is the header overhead much lower, but more importantly, each data block benefits from previously sent ones in order to improve compression. For small data blocks, this feature produces huge compression improvements.

    However, for your column example, you probably want to decode each field individually. In that case, it's not desirable to link data blocks together. Here, the streaming format might not be your best choice.
    It's the reason the LZ4 compression format and the LZ4 streaming format are kept separate. In several circumstances, a custom format on top of the LZ4 "core" will prove better tuned.
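
    To illustrate the first case, here is a minimal sketch in C. It uses streaming entry points as named in the current library (LZ4_createStream() / LZ4_compress_fast_continue()), plus a hypothetical send() callback ; it is an illustration, not the reference implementation.

    #include <stdlib.h>
    #include "lz4.h"

    #define BLOCK_SIZE (64 * 1024)

    /* Compresses a contiguous message as a sequence of linked blocks ;
       each block may reference up to 64 KB of previously sent data.
       Returns 0 on success, -1 on error. */
    int send_linked_blocks(const char* src, int srcSize,
                           int (*send)(const char* cData, int cSize))
    {
        LZ4_stream_t* const stream = LZ4_createStream();
        char* const cBuf = (char*)malloc((size_t)LZ4_compressBound(BLOCK_SIZE));
        int ret = (stream && cBuf) ? 0 : -1;

        for (int offset = 0; ret == 0 && offset < srcSize; offset += BLOCK_SIZE) {
            int const blockSize = (srcSize - offset < BLOCK_SIZE) ? (srcSize - offset) : BLOCK_SIZE;
            /* the stream state remembers previous blocks, so matches may reach back into them */
            int const cSize = LZ4_compress_fast_continue(stream, src + offset, cBuf,
                                                         blockSize, LZ4_compressBound(BLOCK_SIZE), 1);
            if (cSize <= 0 || send(cBuf, cSize) != 0) ret = -1;
        }
        free(cBuf);
        if (stream) LZ4_freeStream(stream);
        return ret;
    }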

    Replies
    1. Hello Yann,

      Your comments are helpful. I suppose I will add a marker to indicate which format is used for a message. I will read about the inter-dependent data blocks. It makes sense. My limit is about 3MB so I also have to figure out how to break messages up when they go over the limit as well. Thanks, Frank

  7. Hi,

    Is there an ETA on when the streaming interface for LZ4 will be available in a library, not merely in a CLI tool?

    Thanks,
    Alexander.

    Replies
    1. The LZ4 streaming interface in a library is a primary objective, but I can't state a date. Real life is taking away too much time currently.

  8. Hi, could you please provide C# example code for this LZ4 ?
    Something like using (GZipStream compressionStream = new GZipStream(compressedFileStream, CompressionMode.Compress)) for GZip.
    How can we compress files using LZ4 ?

  9. Sorry, I do not provide, and therefore do not support, a C# version.
    Your questions will likely be better answered by the authors of the C# LZ4 ports themselves :

    C# : by Milosz Krajewski, at http://lz4net.codeplex.com

    C# streaming : by Phill Djonov, at https://github.com/pdjonov/Lz4Stream

  10. Hi Yann,

    any chance to re-implement the streaming API in a way that allows using different input buffers, not just rewinding the same one?

    I've been trying to get that to work by having a function that copies out the 64 KB chunk at nextBlock - 64 KB, and then copies it back into the new buffer, adjusting base and bufferStart to point to new_buffer + 64 KB, but that's not really right.

    The input buffer sliding code seems fairly unintuitive - what does 'base' really mean?

    Replies
    1. Yes,

      and indeed I'm currently working on it, right now.
      But it's fairly complex, and not completed at this stage.

      For an early look, you can have a glance at the "streaming" branch on github : https://github.com/Cyan4973/lz4/tree/streaming

      The decompression functions are ready, and testable.

      However the compression functions are much more difficult to complete...

  11. Could you please clarify the HC part for us non-English people.

    According to the Spec.

    One-byte checksum of all descriptor fields, including optional ones when present.
    The byte is second byte of xxh32() : { (xxh32()>>8) & 0xFF } ,
    using zero as a seed,
    and the full Frame Descriptor as an input (including optional fields when they are present).
    A different checksum indicates an error in the descriptor.

    I don't understand the "The byte is second byte of xxh32() : { (xxh32()>>8) & 0xFF }" part of the spec.

    Assuming the following, I get:
    SPEC CODE - BYTE - BINARY
    FLG - 100 - 01100100
    BD - 112 - 01110000
    HC - 65533

    What should the xxhash be?
    xxhash(100 + 112) = xxhash(212)?
    or
    xxhash("100" + "112") = xxhash("100112")?

    Thank you.

    Replies
    1. (XXH32()>>8) & 0xFF
      is just a way of saying, in code, the same thing as :
      "The byte is the second byte of XXH32()".

      As for XXH32(), its arguments are :
      1- pointer
      2- length
      3- seed
      The pointer : where the header starts, therefore the position of byte FLG
      length : 2 in your example (FLG & BD)
      seed : 0, as stated by the spec.

      Is it clearer ?
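
      A minimal illustration in C, using the xxHash library (the helper function itself is hypothetical, not taken from the reference sources) :

      #include <stddef.h>
      #include "xxhash.h"

      /* descriptor points at FLG ; length covers all descriptor fields present (2 in your example) */
      unsigned char header_checksum(const void* descriptor, size_t length)
      {
          unsigned const h = XXH32(descriptor, length, 0);   /* seed = 0, as stated by the spec */
          return (unsigned char)((h >> 8) & 0xFF);           /* keep the second byte of the hash */
      }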

  12. Can we use it for real time video streaming?

    Replies
    1. Yes, of course, you can.
      But be aware that LZ4 is a lossless compression algorithm. It regenerates the original data exactly as it was provided. As a consequence, its compression performance is an order of magnitude less than that of dedicated video compression algorithms.

  13. I am using lz4 to compress an HTTP body, and the body may be very large. I want to use a restricted buffer size. Is it better to use the streaming API or the frame API to accomplish this?

    Replies
    1. The frame API can be used to bound maximum memory usage on the decoder side, whatever the real body size is.

      Your choices will be limited though.
      I would recommend a 64 KB block size.
      If blocks are dependent, each block can use up to 64 KB of previous data to compress better, so the decoder will need space for 2 blocks at all times, hence 128 KB.
      If you want to reduce that amount, you can use independent blocks. In that case, the decoder will only need a buffer of 64 KB, but it will have an impact on compression ratio (if the body is > 64 KB).

      The current frame format version is unfortunately limited to these values. If you need smaller buffer sizes, you will have to create your own frame format.
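
      As an illustration, here is a minimal sketch using the lz4frame API (function names as in the current library) ; the consume() callback is hypothetical, and error handling is reduced to the minimum :

      #include "lz4frame.h"

      #define OUT_CHUNK (64 * 1024)

      /* Feeds one network read through the decoder, handing decoded pieces to consume().
         Returns 0 on success, -1 on error. */
      int decode_chunk(LZ4F_dctx* dctx, const char* src, size_t srcSize,
                       int (*consume)(const char* data, size_t size))
      {
          static char out[OUT_CHUNK];     /* fixed decoder-side buffer, independent of the body size */
          size_t pos = 0;
          while (pos < srcSize) {
              size_t dstSize  = sizeof(out);
              size_t readSize = srcSize - pos;
              size_t const hint = LZ4F_decompress(dctx, out, &dstSize, src + pos, &readSize, NULL);
              if (LZ4F_isError(hint)) return -1;
              if (dstSize > 0 && consume(out, dstSize) != 0) return -1;
              pos += readSize;            /* readSize now holds the number of input bytes consumed */
          }
          return 0;
      }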

  14. The streaming format spec says that "The content checksum is the result of xxh32() hash function on digesting the original (decoded) data as input". However, it says that the block checksum is calculated by using the xxHash-32 algorithm on the raw (compressed) data block. If the content checksum is on the uncompressed data and the block checksum is on the compressed data, how can the block checksum be cumulative with the content checksum ?

    Replies
    1. It is not.
      Content checksum and Block checksum are 2 different checksums. They check different things.

    2. The spec says, "Block checksum is cumulative with Content checksum" under the block checksum description. Is that a spec issue then ?

    3. Maybe it's a wording issue.
      It simply means that both checksums can be used at the same time.

      In that case, both checksums must be validated.
      That's where the "cumulative" effect comes in.

    4. Thank you for the clarification

  15. I wish you could upload the source of the "LZ4 Command Line Utility for Windows", or even explain how to use the functions in the LZ4HC algorithm source code. I'm planning on using the LZ4HC algorithm on an ARM Cortex-M3 processor, but I'm stuck since there are no documents explaining how to do this. The amount of SRAM is limited in this kind of system, and since the amount of data I'm talking about is pretty large, I can't compress the whole data at once; I guess I have to do this on little chunks of data, block by block (for example every 512 or 4096 bytes of data).
    Could you help me ?

    Replies
    1. See answer in the LZ4 discussion group :
      https://groups.google.com/forum/#!topic/lz4c/ZAPKWUQOIH8

  16. Hi Cyan,

    I have 2 questions.
    1. Is the block size field included in the calculation of the block checksum in the frame format?

    2. If I implement only the block format, will the whole input file be compressed and represented as one output compressed block, or can it be formatted as multiple output compressed blocks?
    If the output is multiple compressed blocks, then how do I find where the next block starts? Is it required to add the block size at the start of each block, as it is in the frame format?

    Replies
    1. 1) Nope. If I remember correctly, only the compressed block itself is used to calculate the hash.
      2) If you have a format using multiple blocks, you need a way to know the size of each block. You need either the exact compressed size, or the exact decompressed size. And depending on which one you know, the decompression function needed is different. The simplest solution is to know the compressed size.
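
      As an illustration of that simplest solution, here is a sketch of a private convention (not part of the specification) where each block is preceded by its compressed size, stored on 4 bytes, little-endian :

      #include <stddef.h>
      #include <stdint.h>
      #include "lz4.h"

      /* Reads one size-prefixed block from `in` and decodes it into `dst`.
         Returns the decompressed size, or -1 on error ; *consumed receives the number of input bytes read. */
      int read_block(const uint8_t* in, size_t inSize, char* dst, int dstCapacity, size_t* consumed)
      {
          if (inSize < 4) return -1;
          uint32_t const cSize = (uint32_t)in[0] | ((uint32_t)in[1] << 8)
                               | ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
          if ((size_t)cSize + 4 > inSize) return -1;
          int const dSize = LZ4_decompress_safe((const char*)in + 4, dst, (int)cSize, dstCapacity);
          if (dSize < 0) return -1;
          *consumed = 4 + (size_t)cSize;
          return dSize;
      }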

    2. Yann, thanks for the quick response.
      Regarding the 2nd question : nowhere in the block format specification is it written that the compressed block size should be added at the start of the compressed block data.
      I need to implement LZ4 in an ASIC, and the LZ4 compressed data will be output from the ASIC IO pins. If there are multiple output compressed blocks (which will be transferred one after another in series through the IO bus), what do the specifications say about how to pass the block size information to the other system (decoder)?
      What will be the output format?
      Will it be the same as the data blocks of the frame format specification?
      I hope I am able to explain my doubt.

    3. Continuing...
      Since resources are very limited in an ASIC (say a memory buffer of a few MBs, or even KBs),
      what would you recommend as the best way to implement LZ4?
      Consider that the input file size can range from KBs to GBs, and the decoder can be a third-party system, so the requirement is to implement only the standard algorithm.

    4. Use the LZ4 Frame format.
      The Block format is merely an internal component of the Frame format. It makes sense to use it in a private environment, for which the frame specification is overkill.
      But if you need any kind of interoperability, or any advanced mechanism such as streaming, the Frame format is more appropriate.
      You can decrease the internal block size of the frame to 64 KB, which should be easier to handle for hardware implementations. Even then, 64 KB is a maximum, so any implementation is free to split input into blocks smaller than that.
      Blocks can be independent or linked. Independent blocks are simplest. Linked blocks are more efficient though.
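
      As an illustration, here is a minimal sketch using the lz4frame API (names as in the current library). It only shows the relevant settings (64 KB independent blocks) ; a memory-constrained implementation would rather use the streaming variants (LZ4F_compressBegin() / LZ4F_compressUpdate() / LZ4F_compressEnd()) instead of compressing the whole input at once :

      #include <string.h>
      #include "lz4frame.h"

      /* Compresses `src` into a standard LZ4 frame made of independent 64 KB blocks.
         dstCapacity should be at least LZ4F_compressFrameBound(srcSize, &prefs).
         Returns the frame size, or 0 on error. */
      size_t compress_64kb_frame(const void* src, size_t srcSize, void* dst, size_t dstCapacity)
      {
          LZ4F_preferences_t prefs;
          memset(&prefs, 0, sizeof(prefs));
          prefs.frameInfo.blockSizeID = LZ4F_max64KB;           /* 64 KB maximum block size */
          prefs.frameInfo.blockMode   = LZ4F_blockIndependent;  /* simplest for constrained decoders */
          size_t const frameSize = LZ4F_compressFrame(dst, dstCapacity, src, srcSize, &prefs);
          return LZ4F_isError(frameSize) ? 0 : frameSize;
      }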

    5. Yann, You have written this comment 6 years ago in this blog:
      "Your choices will be limited though.
      I would recommend a 64 KB block size.
      If blocks are dependents, each block can use up to 64KB of previous block to better compress, the decoder will need space for 2 blocks at all times, hence 128 KB."

      I am not able to understand why there is a need for a 128 KB buffer (for the compressor or decompressor), since the history window size is 64 KB. If the current position is halfway through the current block, then I only need to look from halfway through the previous block up to the current position for a match.
      Am I correct?

      Sorry if you find these questions silly. I am really thankful for your active responses.

    6. This comment only applies to linked blocks.
      If you go for independent blocks, this is much simpler, you only need X KB for the largest block size you wish your implementation to handle.

      If you go for linked blocks, this is more complex.
      I would recommend you get a firm understanding of streaming of independent blocks first. Then you can add more complexity.
