Tuesday, April 9, 2013

LZ4 Framing format : Final specifications


The LZ4 Framing Format specification has progressed quite a bit since the last post, taking into consideration most issues raised by commenters. It has now reached version 1.4.1 (see edits below), which looks stable enough to start implementing the next version of LZ4.


As a consequence, barring any last-minute important item raised by contributors, the currently published specification will be used in upcoming LZ4 releases.

[Edit] : and last-minute change there is. Following a suggestion by Takayuki Matsuoka, the header checksum is now slightly different, in an effort to be more friendly to read-only media, hopefully improving clarity in the process. The specification version is now raised to v1.3.

[Edit 2] : A first version of LZ4c, implementing the above specification, is available at Google Code.

[Edit 3] : Following recommendations from Mark Adler, version v1.4 re-introduces the stream checksum. It's not correct to assume that the block checksum makes the stream checksum redundant : block checksums only validate that each block is free of errors, while the stream checksum verifies that all blocks are present and in the correct order. Finally, the stream checksum also validates the encoding/decoding stage itself.
v1.4 also introduces the definition of "skippable chunks", which can encapsulate user-defined data of any kind and be integrated into a flow of LZ4 streams.

[Edit 4] : Changed naming convention in v1.4.1 from "streaming" to "framing".

36 comments:

  1. Could there be some sort of delta compression based on LZ4?

    Something like Cloudflare's Railgun, which utilizes differential compression to achieve minimal IO.

    ReplyDelete
    Replies
    1. Well, when the streaming interface is ready, that's exactly the kind of use case it's built for.

      Delete
  2. Hi, here are conventional questions:

    (1) Concatenation: Is concatenating more than 2 streams possible ?

    (2) Concatenation: How should we implement concatenation-checking code ?
    A straightforward method looks like this:

    if (isEos) {
        uint32_t probe = read32(input_stream);
        if (ARCHIVE_MAGICNUMBER == probe) goto start_decoding;
        unread32(input_stream, probe);
        return;
    }

    But this code may read past the end of the stream data.

    (3) Concatenation: To prevent accidental concatenation, is an additional guard word after EoS recommended ?
    Multiple LZ4 streams without padding, which may cause accidental concatenation:

    individual but concatenated: MAGIC ... EoS, MAGIC ... EoS

    A GUARD word (GUARD != MAGIC) prevents this accident:

    individually guarded : MAGIC ... EoS, GUARD, MAGIC ... EoS, GUARD

    (4) Descriptor Flags: Is the byte order BC, FLG ?
    In the "Stream Descriptor" table, it looks like a 2-byte (16-bit) little-endian value.

    (5) Version Number: Shall the decoder check FLG.VersionNumber first (after the Magic Number) ?

    (6) Block checksum flag: Could we insert "If this flag is set," before the first sentence ?

    ReplyDelete
    Replies
    1. Takayuki> Is concatenating more than 2 streams possible ?

      Yes, there's no limit.

      Delete
    2. Takayuki> How should we implement concatenation checking code ?

      The straightforward method you describe looks good to me.

      Delete
    3. Takayuki> To prevent accidental concatenation, is additional guard word after EoS recommended ?

      I'm not sure. What is an "accidental concatenation" ? How could it happen ?

      Delete
    4. Takayuki> Descriptor Flags: Is byte order BC, FLG ?

      No, it is FLG - BC.
      You are right, this should be more clearly described.

      Delete
    5. Takayuki> Shall decoder check FLG.VersionNumber first (after Magic Number) ?

      Yes, exactly

      Delete
    6. Takayuki> (6) Block checksum flag

      Yes, good proposal, done

      Delete
    7. Thanks for tons of answers !

      @Yann > I'm not sure. What is an "accidental concatenation" ? How could it happen ?

      In my mind, an LZ4 stream has an API that looks like this:
      int LZ4_stream_uncompress(const char* source, char* dest, int destSize);

      This is a counterpart of LZ4_uncompress().
      "source" points to the whole stream, which has a MAGIC and an EoS.
      Since an LZ4 stream has an EoS, there is no argument like "sourceSize".

      On the other hand, suppose someone makes his own in-house LZ4 stream archiver.
      This archiver just concatenates LZ4 stream files into an archive file with no padding,
      and outputs tuples { filename, offset_in_archive } to a separate info file.

      Here, in the archive file, since there is no padding, magic numbers and EoS
      marks are always adjacent to each other.

      It looks like { MAGIC, LZ4 file1, EoS, MAGIC, LZ4 file2, EoS, .. }.

      So, after loading the archive & info is completed, he decompresses the first
      file with LZ4_stream_uncompress().
      He wants to decompress only the first file, but after the EoS of the first file,
      there is a MAGIC number. So concatenation will occur, and this process
      will continue to the end of the archive.

      Of course this will not happen if he sets an appropriate destSize, etc.
      But this kind of "fill a buffer from several files" will not work:

      size_t s = 32 * 1024*1024;
      std::vector<char> d(s);
      for (size_t r = 0; r < s; ) {
          src = getFilePtr(...);
          r += LZ4_stream_uncompress(src, d.data() + r, s - r);
      }

      So I think a MAGIC after EoS is dangerous.

      Delete
    8. One way to solve this would be to introduce another Magic Number, called as you propose "Guard", which would instruct the decoder to stop there. By the way, this is how the decoder will behave anyway : after the end of stream mark, it will try to decode the next magic number, and fail, stopping the decoding process.

      On the other hand, I'm not sure this issue is related to this specification.

      Here, we are just introducing a "stream specification". Creating an "archive format" on top of this must be dealt within another specification. I believe it will be up to the archive specification to precisely define how compressed streams are appended and how they should be interpreted.

      Delete
    9. > On the other hand, i'm not sure this issue is related to this specification.

      Agreed. My question was too implementation-centric.
      Thanks for the precise answer !

      Delete
  3. Skippable chunks: The proposed magic word is 0x184D2A50 and the following 15 values, which correspond to an LZ4 compressed block (highest bit is zero) with a size of 407'710'288 (+ up to 15) bytes.

    While it is unlikely that the valid block sizes will ever be expanded to include this value, I would still prefer to play it safe and use a value that is as large as possible, instead of one 'in the middle' of the available range. My proposal would be to use 0x7fffffff: it is the largest compressed block size that can be expressed in 32 bits, so even if valid block sizes are expanded beyond our expectations, an uncompressed block would probably be used instead.

    David

    PS: I realize that this is only a single value instead of 16 values. I suppose the reasoning to allow 16 values was to allow multiple 'streams'. Skippable chunks are a black-box for LZ4, what is stored inside is user defined. So if users ever need multiple 'streams' they can do it themselves inside the skippable chunks.

    ReplyDelete
    Replies
    1. Hi David

      Skippable chunk's magic number cannot be confused with block sizes.

      If the preceding stream is a Legacy one (now deprecated, therefore not advised), the legacy stream stops as soon as a block size exceeds its authorized maximum, i.e. LZ4_compressBound(LEGACY_BLOCKSIZE), a value slightly above 8MB. As a consequence, any Magic Number with a value > 16MB will stop the Legacy stream and trigger the stream selector.

      If the preceding stream is a fully defined LZ4S stream according to the above specification, the end of stream is explicit ("0000"), and most likely followed by a checksum. As a consequence, the next field is necessarily a Magic Number.

      Note there is currently no way to define large block sizes of >400MB, but this is a capability which may be defined in a later version of the specification.

      PS : Your understanding of skippable chunks usage is correct.

      Rgds

      Delete
    2. Thanks Yann!

      "Skippable chunks allow the integration of user-defined data into a flow of concatenated streams." I did not read this carefully enough and thought skippable chunks are allowed inside a stream, in the place of a data block. If they are not, your magic number makes perfectly sense :).

      I feel like your spec could still be more explicit about where exactly skippable chunks are allowed. E.g. is a stream allowed to start with a skippable chunk (which can make file-type detection by magic bytes more complicated), or do we need to insert a 0-byte stream before the skippable chunk, etc.

      You could also change the structure of your spec to make it more hierarchical, e.g. by adding hierarchical section numbers. At the moment the titles 'Data Blocks' and 'Skippable Chunks' look visually identical and are on consecutive pages. This conveys the message that they are 'on the same level' and describe items in the same place.

      Delete
    3. They are all very good comments. I'll update the document to take them into consideration.

      Delete
  4. I am curious as to why there are only two bits for the version number?

    ReplyDelete
    Replies
    1. The main ideas are :
      - It's likely we'll never need more than one or two versions.
      - In case we nonetheless reach 4 versions : the 4th one will reserve a few more bits, to keep further expansion possible.

      Delete
  5. For stream sizes, it seems a representation as a compressed integer (7-bit varint or zigzag encoded) would allow large stream sizes while minimizing the bloat for smaller streams. This is the technique I usually use.

    ReplyDelete
    Replies
    1. Indeed, Mark Adler was also in favor of this representation, while some other contributors were in favor of simplicity, arguing that LZ4 is not about efficiency anyway, but speed and simplicity.

      A direct 8-byte field looked simpler at the time this specification was written.

      Delete
    2. Simplicity is in the eye of the beholder. The stream size fields are useful in that they allow us to allocate exactly the correct size of output buffer; if using a compressed integer, it has no real negative impact on size. This is useful for the applications I am working on now, specifically using C#. At any rate, I include a 32-bit decoder here in C#; a 64-bit one is nearly identical:

      internal static uint Read7BitEncodedUInt32Core(BinaryReader reader)
      {
          uint result = 0;
          int shift = 0;
          do
          {
              byte next = reader.ReadByte();
              result |= (uint)(next & 0x7f) << shift;
              if ((next & 0x80) == 0)
                  return result;
              shift += 7;
          }
          while (shift <= 32);
          throw new FormatException("Top bit set in byte " + (shift / 7 + 1) +
              " in Read7BitEncodedUInt32Core.");
      }

      Delete
    3. Does it sometimes happen that some messages (streams) have length <= 127 ?

      If that's rare, would it be better to start directly at 2 bytes ? (allowing sizes <= 32767)

      Delete
    4. I cannot speak for others, but I have been using it to compress data going to databases, and for compression of real-time messages. In both cases, many can be small. Perhaps I am wrong, but I thought LZ4 was about both speed and efficiency.

      Delete
    5. OK, I've been pondering the alternatives a bit. So, to sum up, you would like a format which minimizes header size, in order to send small and very small packets with it.

      Reducing the Stream Size field is one way to get closer to that. But there might be other questions around it. For example, what about the 4-byte block size field ? Is it too long too ? What about the 4-byte Magic Number ? What about the 1-byte header checksum ?

      Also, one thing to keep in mind : if sending multiple streams, they will not be correlated, meaning it's impossible to use previous data to compress the next streams. Whereas, when sending one stream with multiple blocks (of variable size), it's possible to use previous blocks within the same stream to compress the next blocks.

      Which makes me wonder : do you want to send multiple very small independent streams ? Or multiple small blocks belonging to the same stream ?

      Delete
    6. In my case, it is multiple independent streams. It may be silly to use a streaming format for small messages. Perhaps small streams are not important, and I should only compress the whole message and forget about a streaming format (as now). Then we have to consider what to do with larger messages... it would be nice to have one format that works well for both tiny and large streams.

      Delete
    7. I will give you specific examples. Using the C# safe version of your LZ4 compression, a client of mine has reduced the size of his database enormously, with no negative performance impact (thank you very much!). We had to use the safe version (no pointers or unsafe code) as sometimes it is executed directly in SQL server, which has those limitations. This particular database only compresses one column in one table. It is a column containing a character string that is the formatted version of a receipt (a different representation would have allowed better compression, but this is the most reliable format for them). Mostly they are tiny, and sometimes longer. There are enormous numbers of these strings, and they are the only reason the database became large.

      A second example is real-time messages between two servers. These are binary-encoded messages specific to the application, containing some strings. Most are tiny, but there are occasional giant ones that contain database data. The giant ones require a streaming format, and they need to be broken up. I am in the process of figuring out what to do with these.

      Delete
  6. OK, I feel your use cases are perfectly valid.
    The current streaming format is most likely fine for large string messages.
    For very small ones though, there are 2 sub-cases.

    For real-time messages between servers, you may benefit from the "inter-dependent blocks" feature. In essence, you don't send a stream per packet; you start the link with a stream, and then you send "data blocks". Not only are the headers much smaller, but more importantly, each data block benefits from the previously sent ones, improving compression. For small data blocks, this feature produces huge compression improvements.

    However, for your column example, you probably want to decode each field individually, in which case it's not desirable to link data blocks together. Here, the streaming format might not be your best choice.
    That's the reason the LZ4 compression format and the LZ4 streaming format are kept separate. In several circumstances, a custom format on top of the LZ4 "core" will prove better tuned.

    ReplyDelete
    Replies
    1. Hello Yann,

      Your comments are helpful. I suppose I will add a marker to indicate which format is used for a message. I will read about the inter-dependent data blocks. It makes sense. My limit is about 3MB so I also have to figure out how to break messages up when they go over the limit as well. Thanks, Frank

      Delete
  7. Hi,

    Is there an ETA on when streaming interface for LZ4 will be available in a library, not merely in a CLI tool?

    Thanks,
    Alexander.

    ReplyDelete
    Replies
    1. LZ4 streaming interface in a lib is a primary objective, but I can't state a date. Real life is taking away too much time currently.

      Delete
  8. Hi, could you please provide C# example code for LZ4 ?
    Something like using (GZipStream compressionStream = new GZipStream(compressedFileStream, CompressionMode.Compress)) for gzip.
    How can we compress files using LZ4 ?

    ReplyDelete
  9. Sorry, I do not provide, and therefore do not support, a C# version.
    Your questions will likely be better answered by the authors of the LZ4 C# ports themselves :

    C# : by Milosz Krajewski, at http://lz4net.codeplex.com

    C# streaming : by Phill Djonov, at https://github.com/pdjonov/Lz4Stream

    ReplyDelete
  10. Hi Yann,

    any chance to re-implement the streaming API in a way that allows using different input buffers, not just rewinding the same one?

    I've been trying to get that to work by having a function that copies out the 64KB chunk preceding nextBlock, and then copies it back into the new buffer, adjusting base and bufferStart to point to new_buffer + 64KB, but that's not really right.

    The input buffer sliding code seems fairly unintuitive - what does 'base' really mean?

    ReplyDelete
    Replies
    1. Yes,

      and indeed I'm currently working on it, right now.
      But it's fairly complex, and not completed at this stage.

      For an early look, you can have a glance at the "streaming" branch on github : https://github.com/Cyan4973/lz4/tree/streaming

      The decompression functions are ready, and testable.

      However the compression functions are much more difficult to complete...

      Delete
  11. Could you please clarify the HC part for us non-English people.

    According to the Spec.

    One-byte checksum of all descriptor fields, including optional ones when present.
    The byte is second byte of xxh32() : { (xxh32()>>8) & 0xFF } ,
    using zero as a seed,
    and the full Frame Descriptor as an input (including optional fields when they are present).
    A different checksum indicates an error in the descriptor.

    I don't understand the -> "The byte is second byte of xxh32() : { (xxh32()>>8) & 0xFF }" part of the spec.

    I'm assuming I get:
    SPEC CODE - BYTE - BINARY
    FLG - 100 - 01100100
    BD - 112 - 01110000
    HC - 65533

    What should the xxhash be ?
    xxhash(100 + 112) = xxhash(212) ?
    or
    xxhash("100" + "112") = xxhash("100112") ?

    Thank you.

    ReplyDelete
    Replies
    1. (XXH32()>>8) & 0xFF
      is just a way to tell, using code, the same thing as :
      The byte is second byte of XXH32()

      As for XXH32(), its arguments are :
      1- pointer
      2- length
      3- seed
      The pointer : where the header starts, therefore the position of byte FLG
      length : 2 in your example (FLG & BD)
      seed : 0, as stated by the spec.

      Is it clearer ?

      Delete