Tuesday, July 28, 2015

Huffman revisited - part 1

Huffman compression is a well-known entropic compression technique, in use since the 1950s. It's optimal, in the sense that there is no better construction if one accepts the limitation of using an integer number of bits per symbol, a constraint that can severely limit its compression capability in the presence of high-probability symbols.
Huffman compression is very popular, and quite rightly so, thanks to its simplicity and clarity. (It's also patent-free, which helps too.) For a long time, it remained the entropic compressor of choice due to its excellent speed/efficiency trade-off.
Today, we can use more powerful entropic compressors, such as Arithmetic Coding or the newer ANS-based Finite State Entropy, which are able to grab fractional bits, hence ensuring a better compression ratio, closer to the Shannon limit.
The Shannon limit must be considered like the speed of light : a hard wall that cannot be crossed. Anytime someone claims the contrary, they are either hiding some cost (such as headers, or the decoder itself), or solving a different problem, entangling modeling and entropy. As long as entropy alone is considered, there is simply no way to beat the Shannon limit. You can only get closer to it.
This leads us to a simple question : are there situations where Huffman compression is good enough, meaning that it is so close to the Shannon limit that there is very little gain remaining, if any ?
The answer to this question is yes.
Let's put aside some curious corner cases where symbol frequencies are clean powers of 2. Of course, in such cases, Huffman compression would be optimal, but this is way too specific a situation to consider.
Let's therefore imagine a more "natural" situation where all symbol frequencies are randomly scattered along the probability axis, with the sole condition that the sum of all probabilities must be equal to 1.
A simple observation : the more numerous the symbols, the more likely each symbol's probability is to be small (since their total sum must be equal to 1).
This is an important observation. When the probability of a symbol is small, its deviation from the nearest power of 2 is also small. At some point, this deviation becomes negligible.
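
As a rough illustration (my own back-of-the-envelope model, not a measurement) : a symbol with probability 2% costs -log2(0.02) ≈ 5.64 bits optimally, while an integer-length code gives it 5 or 6 bits, and that sub-bit difference is then weighted by a mere 2% probability. A minimal sketch to compute this weighted excess, using the nearest integer bit length as a crude stand-in for a real Huffman code length :

#include <math.h>
#include <stdio.h>

/* Rough model : weighted cost excess of coding a symbol of probability p
 * with an integer number of bits instead of its optimal fractional cost.
 * The nearest integer (minimum 1) is only an approximation of the length
 * a real Huffman code would assign. */
static void costOfSymbol(double p)
{
    double const optimal = -log2(p);       /* Shannon cost, fractional bits */
    double nbBits = floor(optimal + 0.5);  /* nearest integer bit length    */
    if (nbBits < 1) nbBits = 1;            /* a code needs at least 1 bit   */
    printf("p=%6.2f%% : optimal=%6.3f bits, integer=%2.0f bits, weighted excess=%7.4f bits/symbol\n",
           p*100, optimal, nbBits, p * fabs(nbBits - optimal));
}

int main(void)
{
    costOfSymbol(0.90);    /* dominant symbol : large excess       */
    costOfSymbol(0.10);    /* moderate symbol : small excess       */
    costOfSymbol(0.004);   /* ~1/256, typical literal : negligible */
    return 0;
}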
(Chart : Huffman deviation from Shannon optimal)
So, if we can be in a situation where no symbol ever gets a large probability (<10%), Huffman compression is likely to provide a "good enough" compression result, meaning close enough to the hard Shannon limit that it is not worth trying to get even closer to it.
In a compression algorithm such as Zstandard, the literals are symbols which belong to this category. They are basically the "rest" from LZ compression, the bytes which couldn't be identified as part of repeated sequences. They can be any byte value from 0 to 255, which means each symbol gets an average probability of about 0.4%. Of course, there are some large differences between the most common and the least common ones, especially on text files. But in practice, most probabilities remain small, so the Huffman deviation should be negligible.
In Zstandard, all symbols are compressed using Finite State Entropy, which is very fast and performs fractional-bit compression. We are saying that, for literals, fractional bits make little difference, so Huffman can be "good enough". So could we use Huffman instead of FSE for such symbols ?
This would only make sense if Huffman compression brought some kind of advantage to the table, for example speed and/or memory usage. Alas, currently known Huffman implementations perform worse than Finite State Entropy. The zlib reference version, which is pretty good, maxes out at 250-300 MB/s, which isn't close to FSE results. My own, older, Huffman implementation, huff0, is not even as good as the zlib one.
But it's not game over. After all, analysing the FSE algorithm in detail, there is no reason for it to be faster than Huffman, since their complexities are similar. A fast, modern Huffman compressor should reach equivalent speed, if not better on the compression side (due to an additional operation FSE requires to provide fractional bits).
Part of the reason why FSE is fast is that it uses some clever bitStream techniques, combining multiple symbols into branchless writes, a trick which is not strictly tied to FSE and can be reused in different contexts. So the idea was to reuse the bitStream interface and combine it with a Huffman compressor.
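
To make the "branchless writes" idea concrete, here is a minimal sketch of such a bit container (illustrative only, not the actual FSE bitStream) : bits for several symbols are ORed into a 64-bit accumulator with no per-symbol branch, then committed to memory with a single fixed-size write.

#include <stdint.h>
#include <string.h>

/* Minimal bit-container sketch (not the FSE implementation).
 * Assumes a little-endian target and at most ~57 pending bits between flushes. */
typedef struct {
    uint64_t bitContainer;   /* pending bits, filled from bit 0 upward */
    unsigned bitPos;         /* number of pending bits                 */
    char*    ptr;            /* current output position                */
} BitC;

static void addBits(BitC* bc, uint32_t value, unsigned nbBits)
{
    uint64_t const mask = ((uint64_t)1 << nbBits) - 1;
    bc->bitContainer |= ((uint64_t)value & mask) << bc->bitPos;   /* no branch per symbol */
    bc->bitPos += nbBits;
}

static void flushBits(BitC* bc)
{
    size_t const nbBytes = bc->bitPos >> 3;
    memcpy(bc->ptr, &bc->bitContainer, sizeof(bc->bitContainer)); /* single unconditional write */
    bc->ptr += nbBytes;
    bc->bitPos &= 7;
    bc->bitContainer >>= nbBytes * 8;
}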
huff0 was refurbished and improved to employ the FSE bitStream. In order to preserve code compatibility, I kept the FSE design of compressing and decompressing in reverse directions, which is not strictly necessary for Huffman. I could verify, though, that it does not make any noticeable difference for Huffman compression, making this feature a non-event as long as it remains hidden within the block API level.
Moving huff0 to this new bitStream proved extremely easy. And the result was very rewarding. With little effort, I could make it reach 500 MB/s compression speed, way better than any other Huffman compressor I'm aware of, and more critically way better than FSE compression, making it a replacement candidate.
With such great result at hand, I confidently proceeded to implement huffman decompression based on the same design. I was in for a nasty surprise ...

Friday, May 29, 2015

Changing course

Although I started programming a few years ago, and selected data compression as my little hobby and obsession, I have nonetheless remained a part-time, amateur programmer.

My real-life job at the time was Telecommunication Project Manager, which later morphed into Marketing Product Manager. It may sound foreign to programming, but not really : within every new product, every innovation sees the contribution of multiple programming teams temporarily sharing some common objectives.

I therefore started programming with a good excuse : I was convinced that it helped me understand and communicate with programming teams, hence making me a better product manager. And, from what I can see today, I would say it worked reasonably well. But let's be fair, the real objective was to entertain my brain, simply because I liked programming, and compression.

But as you can guess, with just a few evenings and week-ends to spare, progress has been slow. Even more so since LZ4 became "production-ready" source code, requiring a lot of maintenance and care, hence taking a sizable share of the available time, and limiting further "research" activity.

That couldn't last. With a baby soon to come, it became clear that I would either have to stop, starved of free time, or eventually make programming my full-time activity.

I was lucky enough to receive a few propositions from several companies at this exact moment, while I was pondering my choices for the future. This acted as a signal for me, a perfect opportunity to change course.

Starting June 1st, I'll become a full-time employee at Facebook, in the Infrastructure division. In the short term, it may translate into some reduced freedom to communicate, but over the long term, it's the better choice to continue working in my field of choice, data compression.

I've selected Facebook for several reasons, not least because they are very keen to let my work on data compression continue in Open Source mode. That's a great plus in their favor.
Of course, I guess you are also aware this team has developed an impressive set of tools, processes and mindset to safely develop and deploy highly advanced software around the planet. So it's the kind of place where a lot of important practices can be learned. It's also an ideal crossover for my dual background in programming and telecommunications.

I'll need to ask for a few formal authorizations before being able to write again on this blog, but I'm optimistic about the outcome. And with programming now my primary activity, I should gradually find more and more time to do what I like : improving the current compression algorithms and code base, and plausibly, in the future, finding some time to research and deliver new ones.

Exciting times ahead...

Tuesday, April 7, 2015

Sampling, or a faster LZ4

Quite some time ago, I experimented with some unusual sampling methods, in an attempt to find a better way to compress data with LZ4.

The main idea was as follows : the LZ4 hash table gets filled pretty quickly, due to its small size. It becomes the dominant limitation, both for compression ratio and speed. In many cases, a hash cell is overwritten many times before actually being useful (i.e. producing a match). So, could there be some better way to update the hash table, which would update it less often but, in the end, update it more efficiently (i.e. limit the waste from over-writing) ?

It turned out my expectations were too optimistic. Any time I tried to reduce the update rate, it resulted in a correspondingly reduced compression ratio. With that experiment failed, I settled for an "optimal" sampling pattern, which became the core of LZ4.

Recently, I've revisited this method. After all, getting a lower compression ratio at a faster speed is not necessarily a bad outcome ; it depends on user expectations. So maybe, should users be allowed to select their own "optimal" speed/compression trade-off, they might actually prefer a different trade-off from the default one.

Enter LZ4_compress_fast(). It's a new function, available only in the developer branch for the time being, which handles a single new parameter : int acceleration.
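
For reference, the declaration looks approximately like this (argument names are mine ; check lz4.h in the dev branch for the authoritative prototype) :

/* Illustrative declaration ; see lz4.h in the dev branch for the exact prototype.
 * Works like the bounded-output compression variants, with one additional parameter. */
int LZ4_compress_fast (const char* source, char* dest,
                       int sourceSize, int maxDestSize,
                       int acceleration);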

The concept is fairly simple : the higher the value of acceleration, the faster the compression. Correspondingly, the compression ratio decreases too. It can be fine-tuned pretty precisely, each acceleration level providing a little 3-4% speed boost, meaning one can select a preferred speed range quite exactly.

In order to get a taste of this new parameter, a few limited tests were run on the same corpus using different acceleration values. Here are some early results :

                    Compression   Decompression   Ratio
memcpy                4200 MB/s      4200 MB/s    1.000
LZ4 fast 50           1080 MB/s      2650 MB/s    1.375
LZ4 fast 17            680 MB/s      2220 MB/s    1.607
LZ4 fast 5             475 MB/s      1920 MB/s    1.886
LZ4 default            385 MB/s      1850 MB/s    2.101

Silesia Corpus in single-thread mode, Core i5-4300U @1.9GHz, compiled with GCC v4.8.2 on Linux Mint 64-bits v17.


It gives a hint of the relatively wide range of newly accessible speed/compression trade-offs.

The new function prototype is currently only accessible within the "dev" branch. It's still considered experimental, but may find its way into the next release, r129, depending on user feedback.

Having a parameter to accelerate, rather than strengthen, compression is an unusual concept, so it's not yet clear if it's a very good one. What do you think ? Is a faster and programmable version, trading compression ratio for more speed, a good idea to fit into the LZ4 API ?

Edit : LZ4_compress_fast() is released as part of LZ4 r129.

Saturday, January 24, 2015

Zstandard - A stronger compression algorithm

 Zstd, short for Zstandard, is a new lossless compression algorithm, aiming at providing both great compression ratio and speed for your standard compression needs. "Standard" translates into everyday situations which neither look for highest possible ratio (which LZMA and ZPAQ cover) nor extreme speeds (which LZ4 covers).

It is provided as a BSD-licensed package, hosted on GitHub.

For a taste of its performance, here are a few benchmark numbers, completed on a Core i5-4300U @ 1.9 GHz, using fsbench 0.14.3, an open-source benchmark program by m^2.

Name            Ratio  C.speed D.speed
                        MB/s    MB/s

zlib 1.2.8 -6   3.099     18     275
zstd            2.872    201     498
zlib 1.2.8 -1   2.730     58     250
LZ4 HC r127     2.720     26    1720
QuickLZ 1.5.1b6 2.237    323     373
LZO 2.06        2.106    351     510
Snappy 1.1.0    2.091    238     964
LZ4 r127        2.084    370    1590
LZF 3.6         2.077    220     502

An interesting feature of zstd is that it can qualify as both a reasonably strong compressor and a fast one.

Zstd delivers high decompression speed, at around ~500 MB/s per core.
Obviously, your exact mileage will vary depending on your target system.

Zstd compression speed, on the other hand, can be configured to fit different situations.
The first, fast, derivative offers ~200 MB/s per core, which is suitable for a few real-time scenarios.
But, similar to LZ4, Zstd can offer derivatives trading compression time for compression ratio, while keeping decompression properties intact. "Offline compression", where compression time is of little importance because the content is only compressed once and decompressed many times, is therefore within scope.

Note that high-compression derivatives still have to be developed.
It's a complex area which will certainly benefit from the contributions of a few experts.


Another property Zstd is developed for is a configurable memory requirement, with the objective of fitting into low-memory configurations, or servers handling many connections in parallel.

On the decoding side, Zstd memory requirement is divided into 2 main parts :
  1. The entropy tables : Zstd entropy stage is handled by FSE (Finite State Entropy).
    FSE needs several transformation tables, which currently cost 10 KB.
    The intention is to make this size configurable, from a minimum of 2.5 KB to a maximum of 20 KB. This is a relatively mild requirement, mostly interesting for systems with very limited memory resources.
  2. The match window size, which is basically the size of the "look-back buffer" the decompression side must maintain in order to complete "match copy" operations.
    Basic idea is : the larger the window size, the better the compression ratio.
    However, it also increases memory requirement on the decoding side, so a trade off must be found.
    The current default window size is 512 KB, but this value will be configurable, from very small (KB) to very large (GB), in the expectation of fitting multiple scenarios' needs.

The compression stage needs to handle a few more memory segments, the number and size of which are highly dependent on the selected search algorithm. At a minimum, there is a need for a "look-up table", such as the one used by the "fast mode". The default size of this table is currently 128 KB, but this too will be configurable, from as low as a few KB to a few MB.
Stronger search algorithms will need more tables, hence more memory.
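
To give an idea of what such a "look-up table" is, here is a generic fast-mode sketch (not Zstd's actual data structure) : a few bytes at the current position are hashed, and the table remembers the last position where each hash value was seen, providing a single match candidate per position. With 32 K entries of 4 bytes each, the table weighs exactly 128 KB.

#include <stdint.h>
#include <string.h>

#define LOOKUP_LOG  15
#define LOOKUP_SIZE (1 << LOOKUP_LOG)        /* 32 K entries * 4 bytes = 128 KB */

static uint32_t lookupTable[LOOKUP_SIZE];    /* last position seen for each hash value */

static uint32_t hash4(const uint8_t* p)
{
    uint32_t v;
    memcpy(&v, p, 4);                        /* alignment-safe 4-byte read */
    return (v * 2654435761U) >> (32 - LOOKUP_LOG);
}

/* Returns the previous position sharing the same hash as the 4 bytes at `pos`,
 * then records `pos` as the most recent occurrence. */
static uint32_t getCandidateAndInsert(const uint8_t* base, uint32_t pos)
{
    uint32_t const h = hash4(base + pos);
    uint32_t const candidate = lookupTable[h];
    lookupTable[h] = pos;
    return candidate;
}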


While such speeds qualify Zstd as a "fast compressor / decompressor", it still doesn't reach LZ4 territory. Therefore, selecting which algorithm best suits your needs highly depends on your speed objective.

In order to help with such a selection, I've reused the benchmark methodology proposed here, which adds compression, communication, and decompression time in a simplistic transmission scenario. It results in the following chart :

(click to enlarge)

As expected, using "direct raw numbers representation" results in a clogged graphic, where each compressor is difficult to distinguish. Therefore, the representation is reworked, using the same scenario and same numbers, but dynamically zooming each speed sample so that the ranking is preserved, with the best solution receiving always the relative note "1", and other following depending on their speed difference. It creates the following graph :

(click to enlarge)

which is easier to interpret.
From this graph we see that LZ4 is a better choice for speeds above ~50 MB/s, while Zstd takes the lead for speeds between 0.5 MB/s and 50 MB/s. Below that point, stronger alternatives prevail.

Zstd development is just starting, so consider current results merely as early ones. The implementation will gradually evolve and improve over time, especially during this first year. This is a phase which will depend a lot on user feedback, since this feedback will be key in deciding the next priorities or features to add.

Tuesday, November 25, 2014

Portability woes : Endianness and Alignment (Part 2)


In the previous part, endianness and 32/64-bit detection were detailed. Now we'll have a look at the more complex issue of memory alignment.

Memory access alignment is a lesser-known issue, but its impact can be huge : it will crash your program or result in a disproportionate slowdown.

Let's first summarize the problem. When accessing a single byte in memory, there is no restriction. But when trying to access 2 bytes at a time (a short), or 4 bytes at a time (an int), alignment gets in the way.
The alignment property means a 2-byte field must always be accessed at an even address (a multiple of 2), a 4-byte field at an address which is a multiple of 4, and so on.
From a performance perspective, accessing many bytes at a time is a win, as it makes better use of the memory bus. But when a multi-byte field is accessed at a non-aligned memory address, all sorts of problems can happen : bus width or addressing limitations, cache line overlap, memory segment border overlap, and so on.
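
As a minimal illustration (my own, not taken from LZ4), the following read is perfectly natural and fast on x86/x64, but it is exactly the kind of access that can raise an exception on stricter architectures (and is, strictly speaking, undefined behaviour in portable C) :

#include <stdint.h>

/* Illustration only : read 4 bytes at an arbitrary, possibly odd, offset. */
static uint32_t readAnywhere(const uint8_t* buffer)
{
    return *(const uint32_t*)(buffer + 1);   /* buffer+1 is generally not 4-byte aligned */
}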

All these issues can be solved, and indeed, on the most widely known programming environment, x86/x64, these problems have been solved for a long, long time.
But it has a cost : it makes the CPU more complex, and consumes some precious transistor space. As a consequence, several CPU vendors chose to be a bit lazy on these issues, deciding not to address them, leaving the problem in the hands of software developers. In such cases, if an unaligned memory access is nonetheless performed, the CPU raises an exception, typically resulting in a program crash.
To make matters more complex, some CPUs addressed alignment issues, but in an inefficient manner, resulting in undesirably slow performance. Others address it correctly for short (2 bytes) or int (4 bytes) but not for long long (8 bytes).

The data alignment issue is well described, with many sources throughout the Internet. Unfortunately, finding a proper portable solution for it is not, many "advisers" simply telling you to avoid unaligned accesses altogether. Thanks, really.
But the above condition cannot be met in every circumstance. Consider how a compression algorithm works : it looks for similar patterns in past data. Such patterns can appear anywhere, not just at "aligned" addresses.

For portable code, this situation is a nightmare. The "safe" approach would be to always access data byte by byte, but then the impact on performance is huge, and for a speed-oriented application such as LZ4, this trade-off is unacceptable.

The way it was handled by LZ4 up to now relied on the compiler. The basic idea is : the compiler should be fully aware of whether its target CPU can, or cannot, handle unaligned memory accesses.
This is achieved through the "pack" instruction, which, in a nutshell, tells the compiler to "not align these data", and therefore to generate properly cautious assembly code when accessing them.

It gives the following result :

#if defined(__GNUC__)  && !defined(LZ4_FORCE_UNALIGNED_ACCESS)
#  define _PACKED __attribute__ ((packed))
#else
#  define _PACKED
#endif

#if !defined(LZ4_FORCE_UNALIGNED_ACCESS) && !defined(__GNUC__)
#  if defined(__IBMC__) || defined(__SUNPRO_C) || defined(__SUNPRO_CC)
#    pragma pack(1)
#  else
#    pragma pack(push, 1)
#  endif
#endif

typedef struct { U16 v; }  _PACKED U16_S;
typedef struct { U32 v; }  _PACKED U32_S;
typedef struct { U64 v; }  _PACKED U64_S;
typedef struct {size_t v;} _PACKED size_t_S;

#if !defined(LZ4_FORCE_UNALIGNED_ACCESS) && !defined(__GNUC__)
#  if defined(__SUNPRO_C) || defined(__SUNPRO_CC)
#    pragma pack(0)
#  else
#    pragma pack(pop)
#  endif
#endif

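/* Access macros : read or write a potentially unaligned 16/32/64-bit (or size_t) value at address x */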
#define A16(x)   (((U16_S *)(x))->v)
#define A32(x)   (((U32_S *)(x))->v)
#define A64(x)   (((U64_S *)(x))->v)
#define AARCH(x) (((size_t_S *)(x))->v)

If it looks a bit complex, that's because it is.
The first issue is that issuing the "pack" instruction must be done in a variety of ways, depending on the compiler and its version. It translates into this monstrous macro, which tries to cover all situations reported up to now. As you can guess, I regularly receive notice of new situations the macro cannot cope with.
This is about as bad as the previous stories regarding 32/64 bits and endianness.

But there is more.
Compilers are not that clever.
In many circumstances, the compiler will issue "safe" slow code to access unaligned data, even though the target CPU is able to handle this situation efficiently, resulting in a large speed drop. This is especially true for recent ARM CPUs.
To counterbalance this effect, there is a need to manually "turn off" the "pack" instruction, using in the above example the #define LZ4_FORCE_UNALIGNED_ACCESS.
Unfortunately, this manual switch is almost never used. Source code will most of the time be compiled "as is", which is no surprise.

So we have 2 issues : issuing the "pack" instruction is complex and not future-proof, and compilers don't automatically make the best choice.

To save the day, maybe a new runtime check will help, as for the previous issues ?
Alas, there is no portable runtime test available to check for alignment properties.
(Note : of course, one could imagine running an external program/process just for this purpose, but it's outside the scope of a little single-file library.)

So we are stuck, aren't we ?

Well, that's the difficult part. To make some progress on the current situation, I'm going to change the logic : instead of relying on the compiler, take explicit charge of handling unaligned accesses.

The core foundation of the new solution is based on the function below, already used within lz4frame :

static U32 LZ4_readLE32 (const BYTE* srcPtr)
{
    U32 value32 = srcPtr[0];
    value32 += (srcPtr[1]<<8);
    value32 += (srcPtr[2]<<16);
    value32 += (srcPtr[3]<<24);
    return value32;
}

What's good about this function ?
It handles both endianness and alignment in a safe way. The code is portable.

What's wrong with it ?
It's the safe approach, and therefore it is slower than necessary when the CPU can correctly handle unaligned memory accesses.

So, we will now special-case CPUs which do support efficient unaligned access.

Some of them are easily detectable, thanks to widely supported standard macros : __i386__, __x86_64__ and __ARM_ARCH_7__ are known architectures with good support for unaligned memory accesses. __ARM_ARCH_6__ is also able to handle it, but in a less efficient manner, so it's unclear if it's really faster than the portable version.

Finding a list of CPUs with efficient support for unaligned memory accesses (and their related detection macros) has proven an impossible task so far. One may have in mind that Linux faces a similar challenge, which is supposed to be solved thanks to the macro CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS. Unfortunately, I couldn't find a place where this macro is defined. It seems to be a decentralized methodology, where each architecture independently declares whether it's compatible or not. For the Linux kernel, it's likely the correct method. But it also means there is no central repository where this property is listed.

So I'm a bit stuck right now.
My expectation is that external contributors interested in LZ4 performance may benchmark the speed of the new version, tentatively enabling/disabling the prominent switch at the top of lz4.c when they see fit :

/*
 * CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS :
 * You can force the code to use unaligned memory access, should you know your CPU can handle it efficiently.
 * If it effectively results in better speed (up to 50% improvement can be expected)
 * please report your configuration to upstream (https://groups.google.com/forum/#!forum/lz4c)
 * so that an automatic detection macro can be added to mainline.
 */
/* #define CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS 1 */

Each time a CPU is known to efficiently handle unaligned memory access, its standard detection macro can be added to the list, in order to automatically use the faster code path.
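
To make the mechanism concrete, here is a simplified sketch of how such a switch can select the code path (illustrative ; the actual code in the AlignEndian branch may differ in its details). Thanks to the runtime endianness test of part 1, the fast path is only taken on little-endian targets, and the compiler is expected to fold the test away :

static U32 LZ4_readLE32_auto (const BYTE* srcPtr)
{
#if defined(CPU_HAS_EFFICIENT_UNALIGNED_MEMORY_ACCESS)
    if (LZ4_isLittleEndian())
        return *(const U32*)srcPtr;    /* fast path : direct, possibly unaligned, access */
#endif
    return LZ4_readLE32(srcPtr);       /* safe path : byte-by-byte reconstruction */
}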

Is it any better than the previous method ?
Well, sort of.

To begin with, there is no longer a need to fiddle with the different "pack" instructions depending on compilers and versions. This is one less complexity to manage, and one less dependency to worry about.
Getting formally in charge of unaligned accesses allows the introduction of dedicated functions. For example, a more adapted "string comparison" function could be introduced.

More importantly, instead of crashing when the detection fails, the library will now run slower, but still run correctly. This introduces a new risk, though : many users may simply not notice the slowdown, and just use the library "as is", unaware of the latent performance improvement which could be unleashed. The hope is that, as long as a few contributors can detect and report the performance issue, the situation can be solved for everybody with a similar setup.


Latest version of LZ4 using these new detection routines is accessible in the feature branch "AlignEndian" : https://github.com/Cyan4973/lz4/tree/AlignEndian

It's possible to compare it with the latest "official" release, r124. On x86/x64 CPUs, it was benchmarked and proved to provide similar performance. On other CPUs, though, it's still worthwhile to check.

Monday, November 24, 2014

Portability woes : Endianness and Alignment (Part 1)

In creating ultra-portable code, able to be compiled on (almost) every platform, there are some unusual problems to take into consideration. Listing them in order of increasing complexity, this post will review in detail address space (32 vs 64 bits) and endianness ; alignment restrictions follow in part 2.

Detecting 32 vs 64 bits
This is the easiest part, now practiced by many programmers. It's very common for a program to be designed with a single target environment in mind. But since PC programmers had to deal with the 32->64 bit transition, there are many solutions available, easily found throughout the Internet. However, they are not all equivalent.
Consider the initial solution adopted by LZ4 :

/* 32 or 64 bits ? */
#if (defined(__x86_64__) || defined(_M_X64) || defined(_WIN64) \
  || defined(__64BIT__)  || defined(__mips64) \
  || defined(__powerpc64__) || defined(__powerpc64le__) \
  || defined(__ppc64__) || defined(__ppc64le__) \
  || defined(__PPC64__) || defined(__PPC64LE__) \
  || defined(__ia64) || defined(__itanium__) || defined(_M_IA64) \
  || defined(__s390x__) )   /* Detects 64 bits mode */
#  define LZ4_ARCH64 1
#else
#  define LZ4_ARCH64 0
#endif

It works. But you can easily spot the weakness : this is a long macro, with many special cases, and there is no guarantee there will be no additional cases in the future. And by the way, this is what happened : the list was completed thanks to contributors adding each target platform, one by one, as they were discovered.
So it's complex, and not completely future-proof.
Let's compare to the new method :

static unsigned LZ4_64bits(void) { return sizeof(void*)==8; }

Yes, a single trivial line, and it is future-proof. It could be done as a macro too, but I prefer a static function, since it gives the compiler a chance to do something clever about it.

The initial feeling is that the macro is "runtime free" while the function will cost a small comparison test every time it's called, and will thus be slower.
But eventually, that's not what a modern compiler is expected to do. Clever compilers will realize this function always returns the same value, and therefore replace it by its result. When there are branches depending on the result, the compiler is also expected to automatically resolve the test, and remove the useless branch through dead-code elimination.
And in practice, it works well.
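
As a simplified illustration (not lifted from lz4.c), a branch such as the following is expected to be reduced, at compile time, to just one of its two paths :

#include <string.h>

static size_t LZ4_copyStep (void* dst, const void* src)
{
    if (LZ4_64bits()) {           /* constant-folded by the compiler    */
        memcpy(dst, src, 8);      /* 64-bit target : 8 bytes at a time  */
        return 8;
    } else {
        memcpy(dst, src, 4);      /* 32-bit target : 4 bytes at a time  */
        return 4;
    }
}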

It doesn't solve everything, though.
For example, with Visual Studio, the intrinsic function _BitScanForward64() is only accessible when compiling for x64. Compiling source code which mentions this function in 32-bit mode will fail at the link stage, even if the program never calls the function. That's a situation a runtime test cannot solve.
For this special case, it's still necessary to restrict the compiled source code through macro selection.
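
A sketch of what such a guard can look like (the function name is mine, illustrative only) :

#if defined(_MSC_VER) && defined(_M_X64)
#  include <intrin.h>
/* Only compiled in x64 mode : _BitScanForward64() would not link in 32-bit builds */
static unsigned LZ4_lowestBitPos64 (unsigned long long val)
{
    unsigned long r = 0;
    _BitScanForward64(&r, val);
    return (unsigned)r;
}
#else
/* 32-bit builds (or other compilers) need an alternative, e.g. a de Bruijn multiplication */
#endif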

Detecting Endianness
While 32/64 bits is (mostly) a question of performance, endianness impacts result correctness. So it's damn important to detect it correctly.
Code able to handle different endianness is less common, but it's still relatively easy to find several solutions over the Internet. And once again, they are not all equivalent.

Initially, LZ4 adopted the macro detection approach :

/*
 * Little Endian or Big Endian ?
 * Overwrite the #define below if you know your architecture endianess
 */
#include <stdlib.h>   /* Apparently required to detect endianess */
#if defined (__GLIBC__)
#  include <endian.h>
#  if (__BYTE_ORDER == __BIG_ENDIAN)
#     define LZ4_BIG_ENDIAN 1
#  endif
#elif (defined(__BIG_ENDIAN__) || defined(__BIG_ENDIAN) || defined(_BIG_ENDIAN)) && !(defined(__LITTLE_ENDIAN__) || defined(__LITTLE_ENDIAN) || defined(_LITTLE_ENDIAN))
#  define LZ4_BIG_ENDIAN 1
#elif defined(__sparc) || defined(__sparc__) \
   || defined(__powerpc__) || defined(__ppc__) || defined(__PPC__) \
   || defined(__hpux)  || defined(__hppa) \
   || defined(_MIPSEB) || defined(__s390__)
#  define LZ4_BIG_ENDIAN 1
#else
/* Little Endian assumed. PDP Endian and other very rare endian format are unsupported. */
#endif

As you can see, it is a big mess. It depends on so many different parameters that it's hard to maintain, and it's difficult to guarantee it will always work. Indeed, it does not. Quite regularly, I received external contributions regarding specific platforms which would fail the test, in both directions (some little-endian platforms declared as big-endian, and the other way round).
Add to this already complex situation the case of bi-endian CPUs, a growing list of hardware which can choose to be little-endian or big-endian at will. That makes using the architecture as an endianness hint an unsustainable proposition.

As previously, the intention is to replace this list of macros by a guaranteed runtime test. Here is the new method within LZ4 :

static unsigned LZ4_isLittleEndian(void)
{
    const union { U32 i; BYTE c[4]; } one = { 1 };   /* don't use static : performance detrimental */
    return one.c[0];
}

Once again, the runtime method is not just much shorter and more readable, it's also guaranteed to always produce the right result, whatever the CPU and its local mode.

However, this time, it's a little bit more difficult for the compiler to make the test "runtime free".

Here, unfortunately, your mileage may vary. For example, in the above code, I only succeeded in making the compiler realize the function always produces the same result after removing the "static" property from the variable "one". Other possibilities exist, such as filling a 32-bit value and then accessing it through a char* pointer (see the sketch below). They all work. At the end of the day, the question is : which version is most likely to be reduced into a constant value by as many compilers as possible ?
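
For reference, a sketch of the char* variant mentioned above (the name is mine), which relies on the same principle without a union :

static unsigned LZ4_isLittleEndian_alt (void)
{
    U32 const one = 1;
    return ((const BYTE*)&one)[0];   /* 1 on little-endian targets, 0 on big-endian ones */
}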

The above version has proved successful so far. I can only hope it will remain as successful for other compiler/target combinations. At least, it's no longer the correctness of the test which is at stake, only its performance.


In the next article, we'll review the memory access alignment restriction, which is, by far, the most complex issue.
In the meantime, should you wish to review and test the new detection methods, you can grab them in the latest LZ4 feature branch named "AlignEndian" : https://github.com/Cyan4973/lz4/tree/AlignEndian

It's possible to compare it with the latest "official" release, r124. On x86/x64 CPUs, it was benchmarked and proved to provide similar performance. On other CPUs, though, especially big-endian ones, it would deserve to be checked.

Saturday, September 27, 2014

Counting bytes fast - little trick from FSE

An apparently trivial and uninteresting task nonetheless received some special optimization care within FSE : counting the bytes (or 2-byte shorts when using the U16 variant).

It seems a trivial task, and could indeed be realized by a single-line function, such as this one (assuming the table is properly allocated and reset to zero) :

while (ptr<end) count[*ptr++]++;

And it works. So what's the performance of such a small loop ?
Well, when counting some random data, the loop performs at 1560 MB/s on the test system. Not bad.
(Edit : performance numbers are measured on a Core i5-3340M @2.7GHz configuration. The benchmark program is also freely available within the FSE project.)
But wait, data is typically not random, otherwise it wouldn't be compressible. Let's use a more typical compression scenario for FSE, with a distribution ratio of 20%. With this distribution, the counting algorithm works at 1470 MB/s. Not bad, but why does it run slower ? We are starting to notice a trend here.
So let's go to the other end of the spectrum, featuring highly compressible data with a distribution ratio of 90%. How fast does the counting algorithm run on such data ? As could be guessed, speed plummets, reaching a miserable 580 MB/s.

This is a 3x performance hit, and more importantly, it makes counting a sizable portion of the overall time to compress a block of data (remember that FSE targets speeds of 400 MB/s overall, so if counting alone costs that much, it drags the entire compression process down).

What is happening ? This is where it becomes interesting. This is an example of CPU write commit delay.

Because the algorithm writes into a table, this data can't be cached within registers. Writing to a table cell means the result must necessarily be written to memory.
Of course, this data is probably cached in L1, and a clever CPU will not suffer any delay for this first write.
The situation becomes tricky for the following byte. In the 90% distribution example, it means we have a high probability of counting the same byte twice. So, when the CPU wants to add +1 to the appropriate table cell, write commit delay gets in the way. +1 means the CPU has to perform both a read and then a write at this memory address. If the previous +1 is still not entirely committed, the cache will make the CPU wait a bit longer before delivering the data. And the impact is sensible, as measured by the benchmark.

So, how to escape this side effect ?
A first idea is : don't read & write to the same memory address twice in a row. A proposed solution can be observed in the FSE_count() function. The core loop is (once cleaned) as follows :

Counting1[*ip++]++;
Counting2[*ip++]++;
Counting3[*ip++]++;
Counting4[*ip++]++;

The burden of counting bytes is now distributed over 4 tables. This way, when counting 2 identical consecutive bytes, they get added into 2 different memory cells, escaping write commit delay. Of course, if we have to deal with 5 or more identical consecutive bytes, write commit delay will still be there, but at least the latency has been used to count 3 other bytes, instead of being wasted.

The function is overall more complex : more tables, hence more memory to initialize, special-casing of non-multiple-of-4 input sizes, regrouping all results at the end ; so intuitively there is a bit more work involved in this strategy. How does it compare with the naive implementation ?
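
For readers who prefer code to prose, here is a self-contained sketch of the idea (a simplified illustration, not the actual FSE_count() source) :

#include <stddef.h>
#include <string.h>

/* Simplified 4-table byte counter. Distributing increments over 4 tables avoids
 * back-to-back read-modify-write on the same cell when identical bytes repeat. */
static void countBytes_sketch (unsigned count[256],
                               const unsigned char* src, size_t srcSize)
{
    unsigned c1[256] = {0}, c2[256] = {0}, c3[256] = {0};
    const unsigned char* ip = src;
    const unsigned char* const end = src + srcSize;

    memset(count, 0, 256 * sizeof(unsigned));

    while ((size_t)(end - ip) >= 4)     /* main loop : 4 bytes per iteration, 4 distinct tables */
    {
        count[*ip++]++;
        c1[*ip++]++;
        c2[*ip++]++;
        c3[*ip++]++;
    }
    while (ip < end) count[*ip++]++;    /* remaining 0-3 bytes */

    {   int s;
        for (s = 0; s < 256; s++)       /* regroup partial counts into the final table */
            count[s] += c1[s] + c2[s] + c3[s];
    }
}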

When compressing random data, FSE_count() reaches 1780 MB/s, which is already slightly better than the naive strategy. But obviously, that's not the target. It's when the distribution gets squeezed that it makes the most difference, with the 90% distribution being counted at 1700 MB/s. Indeed, it still takes a hit, but a much smaller one, and the function proves overall much more stable.


With an average speed > 1700 MB/s, it may seem that counting is a fast enough operation. But it is still nonetheless the second largest contributor to overall compression time, gobbling on its own approximately 15% of the budget. That's perceptible, and still a tad too much if you ask me for such a simple task. But barring another great find, it's the fastest solution I could come up with.

Edit :
Should you be willing to test some new ideas for the counting algorithm, you may find it handy to get the benchmark program which produced the speed results mentioned in this article. The program is part of the "test" directory within the FSE project, as a single file named fullbench.c :
https://github.com/Cyan4973/FiniteStateEntropy/blob/master/test/fullbench.c

Edit 2 :
Thanks to recent comments, notably from gpdNathan and Henry Wong, a new and better reason has been provided to explain the observed delay. Its name is store-to-load forwarding. I would like to suggest reading the detailed explanation from Nathan Kurz, backed by his cycle-exact Likwid analysis, and the excellent article from Henry on CPU microarchitecture.
In a nutshell, while write commit delay used to be a problem, it should now be properly handled by the store cache on modern CPUs. However, it introduces some new issues, related to pipelining, serial dependencies and prefetching, with remarkably similar consequences, save for the number of lost cycles at stake, which is quite reduced.

Edit 3 :
Nathan Kurz provided an entry which beats the best speed so far, achieving 2010 MB/s on a Core i5-3340M @ 2.7 GHz. His entry is provided within the fullbench program (as algorithm 202), alongside a simplified version which achieves the same speed but is shorter (algorithm 201).
It's more than 10% better than the initial version suggested in this blog post, and so is definitely measurable.
Unfortunately, these functions use SSE 4.1 intrinsic functions, and therefore offer limited portability perspectives.