Saturday, September 27, 2014

Counting bytes fast - little trick from FSE

 An apparently trivial and uninteresting task nonetheless received some special optimization care within FSE : counting the bytes (or 2-bytes shorts when using the U16 variant).

It seems a trivial task, and could indeed be realized by a single-line function, such as this one (assuming table is properly allocated and reset to zero) :

while (ptr<end) count[*ptr++]++;

And it works. So what's the performance of such a small loop ?
Well, when counting some random data, the loop performs at 1560 MB/s on the test system. Not bad.
(Edit : Performance numbers are measured  on a Core i5-3340M @2.7GHz configuration. Benchmark program is also freely available within the FSE project)
But wait, data is typically not random, otherwise it wouldn't be compressible. Let's use a more typical compression scenario for FSE, with a distribution ratio of 20%. With this distribution, the counting algorithm works at 1470 MB/s. Not bad, but why does it run slower ? We are starting to notice a trend here.
So let's go to the other end of the spectrum, featuring highly compressible data with a distribution ratio of 90%. How fast does the counting algorithm run on such data ? As could be guessed, speed plummets, reaching a miserable 580 MB/s.

This is a 3x performance hit, and more importantly, it makes counting now a sizable portion of the overall time to compress a block of data (let's remind FSE targets speeds of 400 MB/s overall, so if just counting costs that much, it drags the entire compression process down).

What does happen ? This is where it becomes interesting. This is an example of CPU write commit delay.

Because the algorithm writes into a table, this data can't be cached within registers. Writing to a table cell means the result must necessarily be written to memory.
Of course, this data is probably cached into L1, and a clever CPU will not suffer any delay for this first write.
The situation becomes tricky for the following byte. In the 90% distribution example, it means we have a high probability to count the same byte twice. So, when the CPU wants to add +1 to the appropriate table cell, write commit delay gets into the way. +1 means CPU has to perform both a read and then a write at this memory address. If the previous +1 is still not entirely committed, cache will make the CPU wait a bit more before delivering the data. And the impact is sensible, as measured by the benchmark.

So, how to escape this side-effect ?
A first idea is, don't read&write to the same memory address twice in a row. A proposed solution can be observed in the FSE_count() function. The core loop is (once cleaned) as follows :

Counting1[*ip++]++;
Counting2[*ip++]++;
Counting3[*ip++]++;
Counting4[*ip++]++;

The burden of counting bytes is now distributed over 4 tables. This way, when counting 2 identical consecutive bytes, they get added into 2 different memory cells, escaping write commit delay. Of course, if we have to deal with 5 or more identical consecutive bytes, write commit delay will still be there, but at least, the latency has been used counting 3 other bytes, instead of wasted.

The function is overall more complex : more tables, hence more memory to init, special casing non-multiple-of-4 input sizes, regroup all results at the end, so intuitively there is a bit more work involved in this strategy. How does it compare with the naive implementation ?

When compressing random data, FSE_count() gets 1780 MB/s, which is already slightly better than the naive strategy. But obviously, that's not the target. This is when distribution gets squeezed that it makes the most difference, with 90% distribution being counted at 1700 MB/s. Indeed, it's still being hit, but much less, and prove overall much more stable.


With an average speed > 1700MB/s, it may seem that counting is a fast enough operation. But it is still nonetheless the second contributor to overall compression time, gobbling on its own approximately 15% of budget. That's perceptible, and still a tad too much if you ask me for such a simple task. But save another great find, it's the fastest solution I could come up with.

Edit :
Should you be willing to test some new ideas for the counting algorithm, you may find it handy to get the benchmark program which produced the speed results mentioned in this article. The program is part of the "test directory" within FSE project, as a single file named fullbench.c :
https://github.com/Cyan4973/FiniteStateEntropy/blob/master/test/fullbench.c

Edit 2 :
Thanks to recent comments, notably from gpdNathan, and Henry Wong, a new and better reason has been provided to explain the observed delay. Its name is store-to-load forwarding. I would like to suggest here the read of the detailed explanation from Nathan Kurz, backed by his cycle-exact Likwid analysis, and the excellent article from Henry on CPU microarchitecture.
In a nutshell, while write commit delay used to be a problem, it should now be properly handled by store-cache on modern CPU. However, it introduces some new issues, related to pipeline, serial dependency and prefetching, with remarkably similar consequences, save the number of lost cycles at stake, which is quite reduced.

Edit 3 :
Nathan Kurz provided an entry which beats the best speed so far, achieving 2010 MB/s on a Core i5-3340M @ 2.7 GHz. Its entry is provided within the fullbench program (as algorithm 202), alongside a simplified version which achieves the same speed but is shorter (algorithm 201).
It's more than 10% better than the initial entry suggested in this blog, and so is definitely measurable.
Unfortunately, these functions use SSE 4.1 intrinsic functions, and therefore offer limited portability perspectives.

Saturday, July 19, 2014

xxHash wider : assessing quality of a 64-bits hash function

 The initial version of xxHash was created in a bid to find a companion error detection algorithm for LZ4 decoder. The objective was set after discovering that usual implementation of CRC32 were so slow that the overall process of decoding + error check would be dominated by error check.
The bet was ultimately successful, and borrowed some its success from Murmurhash, most notably its test tool smHasher, the best of its kind to measure the quality of a hash algorithm. xxHash speed advantage stems from its explicit usage of ILP (Instruction Level Parallelism) to keep the multiple ALU of modern CPU cores busy.

Fast forward to 2014, the computing world has evolved a bit. Laptops, desktops and servers have massively transitioned to 64-bits, while 32-bits is still widely used but mostly within smartphones and tablets. 64-bits computing is now part of the daily experience, and it becomes more natural to create algorithms targeting primarily 64-bits systems.

An earlier demo of XXH64 quickly proved that moving to 64-bits achieves much better performance, just by virtue of wider memory accesses. For some time however, I wondered if it was a "good enough" objective, if XXH64 should also offer some additional security properties. It took the persuasion of Mathias Westerdhal to push me to create XXH64 as a simpler derivative of XXH32, which was, I believe now, the right choice.

XXH64 is therefore a fairly straighfoward application of XXH methodology to 64-bits : an inner loop with 4 interleaved streams, a tail sequence, to handle input sizes which are not multiple of 32, and a final avalanche, to ensure all bits are properly randomized. The bulk of the work was done by Mathias, while I mostly provided some localized elements, such as prime constants, shift sequences, and some optimization for short inputs.

The quality of XXH64 is very good, but that conclusion was difficult to assess. A key problem with 64-bits algorithms is that it requires to generate and track too many results to properly measure collisions (you need 4 billions hashes for a 50% chance of getting 1 collision). So, basically, all tests must be perfect, ending with 0 collision. Which is the case.
Since it's a bare minimum, and not a good enough objective to measure 64-bits quality, I also starred at bias metric. The key idea is : any bit within the final hash must have a 50% chance of becoming 0 or 1. The bias metric find the worst bit which deviates from 50%. Results are good, with typical worst deviation around 0.1%, equivalent to perfect cryptographic hashes such as SHA1.

Since I was still not totally convinced, I also measured each 32-bits part of the 64-bits hash (high and low) as individual 32-bits hashes. The theory is : if the 64-bits hash is perfect, any 32-bits part of it must also be perfect. And the good thing is : with 32-bits, collision can be properly measured. The results are also excellent, each 32-bits part scoring perfect scores in all possible metric.

But it was still not good enough. We could have 2 perfect 32-bits hashes coalesced together, but being a repetition of each other, which of course would not make an excellent 64-bits hash. So I also measured "Bit Independence Criteria", the ability to predict one bit thanks to another one. On this metric also, XXH64 got perfect score, meaning no bit can be used as a possible predictor for another bit.

So I believe we have been as far as we could to measure the quality of this hash, and it looks good for production usage. The algorithm is delivered with a benchmark program, integrating a coherency checker, to ensure results remain the same across any possible architecture. It's automatically tested using travis continuous test environment, including valgrind memory access verification.

Note that 64-bits hashes are really meant for 64-bits programs. They get roughly double speed thanks to increased number of bits loaded per cycle. But if you try to fit such an algorithm on a 32-bits program, the speed will drastically plummet, because emulating 64-bits arithmetic on 32-bits hardware is quite costly.

SMHasher speed test, compiled using GCC 4.8.2 on Linux Mint 64-bits. The reference system uses a Core i5-3340M @2.7GHz
VersionSpeed on 64-bitsSpeed on 32-bits
XXH6413.8 GB/s1.9 GB/s
XXH326.8 GB/s6.0 GB/s






Monday, July 7, 2014

Pointer overflow : an analysis of LZ4 literal attack

 Last week, when a blog announced to the wild that it was possible to overflow a pointer within LZ4, I immediately produced a fix within the next few hours to protect users, without checking how the code would naturally behave in such circumstance. After all, one assumption of 32-bits memory allocation was broken, so as a rule of thumb, I accepted the idea that it must have broken something.

With the fix available, I was curious to have a deeper look at the technical behavior of the overflow. What follows is a continuation of an attack scenario presented here, which incidentally match an observation I made a long time ago, while assessing the level of issue 52, and totally forgot last week. Since current code is protected against overflow scenario, I will look at this issue from an "old version" perspective, such as, for example, the relatively old r91 (march 2013). The behavior analyzed concerns the function LZ4_decompress_safe(), which is the one designed to be protected against malicious packets. Note that an unsafe version also exists, which is called LZ4_decompress_fast() and is not protected against malicious packets, and therefore offers no such guarantee.
(Note : the safe function is also mapped to LZ4_uncompress_unknownOutputSize(), the unsafe one to LZ4_uncompress()).

A key claim is that it is possible to achieve a Remote Code Execution on such older version. An RCE attack requires to deliver a crafted payload at a selected address in memory (Edit : see some relevant comments on this). The proposed attack is described here. A later version would add that it is possible to do the same with less payload if the destination buffer get allocated within high address region, but ultimately uses the same logic. The present article starts from there.

We will suppose that the target OS has no memory protection in place, such as detection of illegal reading or writing, which would make the attack pointless.

At the start of the attack, we have the destination pointer op, which points into a valid buffer region. If the malicious payload wants to trick the library into writing into an unauthorized region, it looks good enough to cheat on the length of data to be copied. Unfortunately, this condition is checked, and if the pointer gets beyond the valid destination buffer, the LZ4 decoder stops right there and output an error code.

For the attack to have a chance to succeed, the objective is to provide a length which is so large that it makes the pointer wraps beyond the last address 0xFFFFFFFF (note : this is only possible in 32-bits mode, 64-bits address space is way too large to overflow). It requires a lot on input data, but we'll suppose that this condition is met too. This is then possible because the end of literal area is calculated as a pointer called cpy :
cpy = op + length ;  // if length is large enough, we may have cpy < op because of overflow

OK, so what happens next ? The article claims that it delivers a payload of one's choice at the cpy address, basically the code to execute.
What's going to happen is a bit more complex. To understand why, let's follow the code. The relevant lines, for r91, are copied below :

        // get runlength
        token = *ip++; /* ... calculate length ... */
        // copy literals
        cpy = op+length; /* ... check buffer limits ... */
        LZ4_WILDCOPY(ip, op, cpy); ip -= (op-cpy); op = cpy;

        // get offset
        LZ4_READ_LITTLEENDIAN_16(ref,cpy,ip); ip+=2;
        if unlikely(ref < (BYTE* const)dest) goto _output_error;   // Error : offset outside of destination buffer

        // get matchlength

Since cpy < op, the tests checking for the end of output buffer will pass. We also suppose that the test for input buffer pass, so we move to the next line.

        LZ4_WILDCOPY(ip, op, cpy);

This macro is not too easy to understand. Basically, it will speculatively copy 8 bytes from ip (which is supposed to be valid, otherwise the decoder would have already stopped) to op, which is still valid. Yes, only 8 bytes, whatever the value of length. Why ? Because cpy < op,  so after 8 bytes it just stops there.

ip -= (op-cpy); op = cpy;

That's where it starts to become nasty. With op = cpy, the destination pointer is now at a forbidden area. Note that ip has changed too ! Basically, both ip and op have been translated by the same amount.

ip is now somewhere in memory. We don't know where exactly, but it's likely to be an unauthorized memory segment. So the next line of code matters :

LZ4_READ_LITTLEENDIAN_16(ref,cpy,ip); ip+=2;

In LZ4, a literal copy is necessarily followed by a match copy. This line calculates where the data to be copied is, and stores it into the pointer ref. At this point, some OS will crash, due to unauthorized or unmapped access, but we'll take the assumption that the read will silently pass.
By definition of the format, ref = op - distance;
op is currently the same as cpy, so points to an unauthorized memory area.
distance has just been provided by ip. But ip, too, is into an unauthorized memory area. So that means that we don't know what ip is reading. distance is basically a random 16-bits number.

So now, we have ref < op = cpy < validBuffer.
At this stage comes the next line of code :

if unlikely(ref < (BYTE* const)dest) goto _output_error;

Translated, it means that if ref < validBuffer, the decoder detects a problem, and immediately stops decoding. Which is the case here. As a consequence, the overflow finally gets detected here, and the decoder exits with an error code.


OK, so if the issue was already solved, why keeping issue 52 opened on the board ? Well, I confusely remember that I almost believed for a time that the issue was solved, but then realized that this check wasn't good enough. Indeed, after some more scrutiny, here is one border case scenario which requires some additional conditions.

In this specific scenario, it is possible to make the detection fail with the help of a bit of luck : suppose that cpy is so small that its value is < 64K. This requires to either be extremely lucky, or to know the start address op before even generating the crafted payload. This is a very strong hypothesis, and suggests either the existence of another exploit, or a level of insider knowledge associated with predictable allocation practices. But for this exercise, we will nonetheless suppose just that.
Now, let's also suppose that we get lucky, and distance (which is not controlled by the attacker, so is basically a random number) is large enough to make ref underflow memory space. Now ref stands at very high address, and therefore passes the detection test.

What follows is a match copy, which size is controllable by the attacker, thanks to the token, from 4 to 18 bytes (beyond that, size is no longer controllable). Let's suppose we'll only copy 4 bytes, from ref, a very high address presumed unauthorized, to cpy, a very low address < 64K.
This is starting to spell trouble, since an unauthorized memory area gets modified.
Note however that the attacker still does not control what is being copied.

The next stage therefore is another literal copy. It seems it is the right moment to deliver the code to execute ?
Unfortunately, ip is currently lost, somewhere in memory. It will deliver a random token, followed by a random literal sequence.

As long as both ip & op remain small enough, they will evade the detection, and the algorithm will continue to write some random noise at op position, but as soon as ref stops underflowing memory address space, which is probable at each step and necessary beyond 64K, it will get detected, and trigger an error code.

So, the conclusion is :
  • A blind overflow attempt of literal length is extremely likely to get detected
  • To remain undetected, the overflow must be accurate, and finish into the 0-64K memory segment (guaranteeing the success of this operation requires complementary knowledge)
  • It's not possible to start writing beyond address 64K without being detected.
  • Whatever get written into this low memory segment is some random copy & literals operations  from other parts of the memory, which are not under the control of the attacker.
This situation is not the same as safe : writing noise into the low-memory area can result in bad consequences, likely a process crash. Furthermore, the indirect protection requires reading a few bytes from unauthorized memory area, which is also susceptible to crash the program. These are good reasons to update today to r119 for 32-bits applications.

Wednesday, July 2, 2014

Vulnerabilities disclosure - how it's supposed to work

Note : there is a follow-up to this story

 In the lifetime of LZ4, I've received tons of feedbacks, feature requests, warming thanks, bug disclosures, often with their bugfix freely offered.

This is the essence of open source : through disseminate usage and reports, the little piece of software get exposed to more and more situations, becoming stronger, and reaching overtime a level a perfection that most closed-source implementations can only dream of, due to limited capabilities of internal testings only. The strength of the formula totally relies on such feedback, since it is impossible for a single organization to guess and test all possible situations.
It is an enlightening experience, and I can encourage every developer to follow a similar journey. It will open your eyes and perspective into a much larger linked world.

Among theses issues, there were more than a few security bugfixes. I can't tell how much I thank people for their welcomed contribution in this critical area. Disclosures were received by email, or using the LZ4 issue board. Typical contributors were modest professional developers, providing their piece of help for free, on top of a larger and larger edifice, and got a quick "thanks notice" in the version log, an honor they were not even requesting. This is this modesty and positive construction mindset which really touched and drove me forward.

Such early security fixes deserve high praise today, way more than a recent "integer overflow hype" which was souped up to receive inappropriate media coverage. It's a sad state of affair, but much more important issues got fixed way more politely and securely, as a sincere desire to improve a situation. Overtime, such security bug became less and less common, and almost disappeared, basically proving the strength of the open source formula, an implementation becoming stronger after each correction.


When an implementation becomes widespread, the rules of security disclosure gradually evolve, because the number of systems and user potentially exposed become too large to game with. A good illustration of a proper reporting process is explained on the CERT "Computer Emergency Response Team" website. The process is both transparent and clean. The security auditing firm would first start by contacting upstream developer team, ensure the bug is understood and a fix undertaken. It will issue a typical disclosure delay of 45 days, to ensure upstream organization get an incentive to solve the problem fast enough (without the delay, some organisations would opt for not doing anything at all !). After which delay, a public disclosure can be sent. Hype is optional, but is not forbidden either, as long as users get correctly protected beforehand.


That's why, in retrospect, I've been so angry last time about DonB code of conduct. I initially reacted at the overblown vulnerability description, and only realized later that it was also a question of disclosure policy. With just a little bit of communication, everything would have been fine, as always.
Unfortunately this is not what happened. The auditor barely left a footnote on the issue board to request a level raise on an item (which was accepted), and then simply lost contact, focusing instead on overselling a security risk for maximum media effect. In doing so, he never looked back to ensure that a fix was ongoing or being deployed to protect users before disclosure.

This behavior is totally untypical from a respectable security firm behavior. In fact, it is in total contradiction with core values of security auditing. This is considered insane to willingly sacrifice public safety for some selfish media coverage. I was thinking : "Damn, suppose he'd been right..."

So, let's just suppose it.
That means the security auditor decided to release to the public, hence to every electronic criminal worldwide, a critical vulnerability, "DCE level", the highest possible danger, allowing remote execution on target platform, without even bothering if a fix was completed and being deployed to protect exposed users. This is so borderline, I was incredulous. That much havoc, just for a headline in the medias ? Is that still within the limits of the law ?

Fortunately, the risk was not as large as claimed. And since then, I've been pushed into believing it was just a genuine mis-communication lapse, thanks to reassuring words from the auditor himself, acknowledging the communication problem and wishing to improve the situation for the future (and linking his nice words to Twitter, spreading a positive image). Considering that previous vulnerability couldn't result in anything serious, and willing to grant the privilege of doubt, I issued a more neutral-tone statement to end the story. I would then expect that, from now on, a normal communication level would be restored, future bugs being disclosed "normally", that is starting with an issue notification, and a discussion.

I was wrong.

In total contradiction with his own written statements, donb doubled down earlier today, broadcasting a new vulnerability issue directly to the wild, without a single notification to upstream :
http://blog.securitymouse.com/2014/07/i-was-wrong-proving-lz4-exploitable.html

The new vulnerability could be correct this time. I have not been able to prove/disprove it myself, but have no reason to disbelieve it. Some specific hardware and debugging capabilities seem required to observe it though.

Apparently, the risk is valid for ARM platforms (maybe some specific versions, or a set of additional platform specific rules, the exact scope of which I don't know about). I have doubts that it is only hardware related, I believe it must be OS-driven instead, or a combination of both.
The key point is the ability of the 32-bits system to allocate memory blocks at high addresses, specifically beyond 0x80000000h. This situation never happened in earlier tests. Each 32-bits process was encapsulated by the OS into its own virtual address space, which maximum size is 2 GB, irrespective of the total amount of RAM available (this was tested on 4 GB machines). With no counter example at hand, it became an assumption. The assumption was key in the discussions assessing gravity level, and remained undisputed up to now. Today's new information is that this situation can in fact be different for some combinations of OS and hardware, the precise list of which is not clearly established.
[Edit] The vulnerability existence can now be tested, using the new fuzzer test available at https://github.com/Cyan4973/lz4/tree/dev.
Should you have a system able to generate such a condition, you're very welcomed to test the proposed fix on it. The quality of the final fix for this use case will depend on its ability to be tested.
The issue tracker is at this address :
https://code.google.com/p/lz4/issues/detail?id=134
A first quickfix is proposed there.


In normal circumstances, a vulnerability disclosure is welcomed. For the open source movement, it translates into better and safer code. That's a view I totally embrace.
But obviously, everything depends on the way vulnerability is disclosed. Even a simple mail or a note on the issue board is a good first step. At the end of the day, the objective is to get the issue fixed and deployed first, before any user get exposed, to reduce the window of opportunity for malware spreaders. It's plain common sense.

This latest disclosure does not share such goodwill elements. By selecting direct wide broadcasting without ever notifying the upstream developer about its finding, the security auditor did the exact opposite of his social mission, ensuring maximum exposition danger for users and systems alike. I can only regret he made this choice, since it will create him a "special" reputation within his own professional circle. It's questionable if a paying company can entrust its critical security vulnerabilities to a self-declared security auditor with teenager-grade reactions like that.

As far as we are concerned, the goal of the game is to get a safer implementation of LZ4 available to the general public, trying nonetheless to make the window of opportunity as small as possible. In the longer run, the episode will serve as another reinforced stone, providing security benefit to the open sourced edifice. But in the short term, we suffer exposure.

The new status is as follows :
  • The vulnerability affects only 32-bits applications (64-bits are safe)
  • The new vulnerability affects systems allocating memory beyond address 0x80000000h (others were already safe)
This last point is very difficult to know. It seems Windows systems are safe for example, but that still leaves a lot of other systems to check. The new fuzzer tool is now designed to test the existence of this vulnerability, and check the efficiency of the last fix against this new exploit scenario.

You can get the new fuzzer tool and the proposed fix at : https://github.com/Cyan4973/lz4/tree/dev
The fix seems to provide good results so far, don't hesitate to test it on your target system, should it match the above conditions.

[Edit] : the fix is good to go to master, hello r119 !
[Edit 2] : since the second condition is relatively difficult to assess, the fix is recommended for any 32-bits application.

[Edit 3] : After further analysis, it seems the new overflow issue remain relatively difficult to exploit usefully. It has opened new fronts, but still require some favorable conditions, outside of attacker control, such as allocation at the extreme end of the available memory address range. Relatively large data blocks remain a necessity for a correspondingly good success perspective. Previously published conditions still apply to design an interesting attack scenario. With most LZ4 programs using small blocks, it makes overflow risk a rarity, if not provably impossible depending on allocation rules.
Still, with a fix available, updating your 32-bits applications to r119+ remains a recommended move.

[Edit 4] : End of Linux kernel risk assessment. This potential overflow has no useful attack scenario for now. It is nonetheless getting fixed, to protect future applications. Knowing current list of applications using LZ4 within the kernel, the only remaining attack scenario is a boot image modification. When such a scenario is possible, then you've got a lot more to worry about, a non-guaranteed potential overflow under cooperative conditions pales in comparison of a direct modification of the boot image, inserting typically some worm code.

------------------------------------------------------------------------------

[Edit] : Sight, I can only wish such thing never happen to you. I'm currently facing an organized internet straw-man campaign (guess by who), inventing words I never said and even strongly disagree with. I feel compelled to put some records straight for the public :
OK, so where is the problem ?
  • Broadcasting a vulnerability to the wild without even providing a single notification to upstream organization cannot get close to the definition of ethical disclosure, under no possible metric
  • Conflating gravity levels and spreading meritless fear to the public to harvest some free ad is a despicable scare-mongering practice
The debate over Responsible Disclosure is not new. In factit is gaining strength, precisely because software becomes ubiquitous. Long gone are the day when a vulnerability would mostly put at risk some isolated computers primarily used to play games. Computing is now interconnected, and the backbone of our most critical services. With Internet of Things, it's going to be present into everyday devices, including medical equipment, surveillance systems, smartgrid probes, etc.

In Responsible Disclosure, there is Disclosure, which is a good thing. There is also Responsible. For CERT, it translates into un-hyped vulnerability classification, a notification, and a fix delay. There is a huge difference between a calm notification into a public issue board, which remains nonetheless public, and a communication campaign designed to make a vulnerability known by a maximum number of people before a fix get a chance to get produced and deployed.

In Europe, law has selected its side, ruling in simple words that providing a manual to launch a cyber attack is about as good as providing a plan for a bomb. Of course, only the most adamant cases had to meet a judge, resulting in few convictions, mostly when the specific charges of "willful harm" were on the table. Currently, justice get involved when an offender explicitly targets a plaintiff. I believe that someday, it will simply no longer remain acceptable to consider public safety a dispensable collateral victim, freely exposed by the fire of an advertisement exercise or a petty personal revenge.

Saturday, June 28, 2014

Vulnerability assessment

[Edit] There is a new follow up to this story.

I've received an answer from Don Bailey. He blames the situation on a lack of communication. OK. In an attempt to bring the discussion to a more neutral level, this post is dedicated at providing a hopefully clear, concise and factual description of the situation. I hope it will help the reader to properly assess the risk level.

  • There was a vulnerability, it was fully described by Ludwig Strigeus and tracked for correction on the LZ4 issue board
  • The vulnerability is fixed. Update is recommended for 32-bits applications
  • The vulnerability requires uncommon conditions to be exploitable :
    • Input data supplied by untrusted source
    • Large block sizes, preferably ~16MB (beyond LZ4 file format specification)
    • 32-bits applications only
  • These conditions are not met by currently known programs using LZ4
    • For a list of them, get to the LZ4 homepage. It's correct to state that some yet unknown application may match these conditions. That's why the reference version has been fixed
    • Edit : One use case has been found, involving custom scripts to receive untrusted data from external sources using a Python version of LZ4 while not enforcing block size limits. The targeted Python version was using an unprotected version of the decoder. Conclusion remains the same : the Python version has been updated to cover such situation, and is a recommended upgrade
  • The Linux Kernel version, which is different and supplied by LG, has this vulnerability too
  • Linux Kernel programs (which are the only ones to access kernel functions) do not match the conditions that would make it exploitable
  • The Linux Kernel version is nonetheless being fixed too, to avoid any future risk for future versions of the kernel

Hope it helps.

Thursday, June 26, 2014

Debunking the LZ4 "20 years old bug" myth

[Edit] The below post was redacted in the heat, right after the initial publication. Should you wish a summarized, and arguably more neutral tone, view of the situation, please consider reading the follow up.

 A recent post on a security blog has made a wonderful job at sexing up a subtle bug affecting LZ4, claiming that it would result in remote code execution. Given the current spread of use of LZ4, from every modern Linux distro, including critically Android, to modern file systems such as ZFS, shipped with FreeBSD and Illumos, through widely deployed databases such as MySQL, or big data nodes powered by Hadoop, the implication is that it must be a pretty big deal.

The detailed article then makes a fairly good job at describing the conditions required to trigger that risk, but unfortunately, these explanations are lost, scattered within the content of an overly long and boring technical article, ensuring most readers will stop at the dramatic headline. The present article has the objective to counterweight those "high stakes" claims.

I won't spend too much time underlining the vast difference between reporting a bug for correction to the relevant team, and creating a dramatic communication for maximum coverage without even bothering if a fix is available for the exploit. The first behavior is valuable, helpful and welcomed. The second one is at best irresponsible, if not downright harmful. A security auditor is expected to know this simple common sense truth, and behave accordingly.
[Edit] : Later comments from DonB state that this situation is due to him not reading answers provided for him on the issue board.

But perhaps more importantly, this bug wasn't unknown, which contradicts the statement that it was found through careful code study. The risk was identified and published on the LZ4 public issue board by Ludwig Strigeus, the creator of ĀµTorrent. Ludwig made a very fine job at describing the risk. Instead of trying to make a headline, he even proposed a solution for it (unfortunately with side effects). The bug was classified minor (rightly or wrongly) for reasons which will be detailed below. Thus, it was tracked for correction, but with no urgency, trying to design a fix without initial side effects. After multiple partial improvements, a change of code related to the new streaming mode finally allowed such a fix to be created, closing the issue (by chance, merely a few days before DonB second "disclosure").

Let's now get into the technical details.
In order to use the vulnerability, a number of conditions must be met.
A first minor one is : it is necessary to work within a 32-bits environment. 64-bits is totally unaffected. That basically put most server-side applications outside of the risk zone.

Second, the attacker need to forge a special compressed block to overflow 32-bits address space. This is possible ... if the compressed block is something like 16 MB.

There is just a little problem with that : the legacy LZ4 file format is limited to 8 MB blocks. Any value larger than that just stops the decoding process. 8MB is not enough to trigger a problem, since 32-bits application are restricted to low address ranges. The newer streaming format is even stricter, with a hard limit at 4 MB. As a consequence, it's not possible to exploit that vulnerability using the documented LZ4 file/streaming format.

Well, but what about programs which use their own, undocumented, format ?
Indeed, same condition applies. To be exposed to this risk, very large blocks of 16MB or larger must be read by the decoder.
Does that happen ?
Let's have a look at several high-profile programs using LZ4. ZFS ? Max block size is 128 KB. Lucene ? Typical index size is 16 KB. It could be a bit more, say 64 KB, that's still far short of the objective. zram maybe ? nope, 4 KB memory segments. Linux kernel ? The boot section has to decompress a multi-megabytes kernel into memory, surely it can match the limit ? Yes, it does, but it uses the LZ4 Legacy File format, which limits each block to 8 MB maximum. OK, maybe AAA games such as Guild Wars 2 ? nope, real-time protocol is the realm of short messages, the protocol itself doesn't allow anything beyond a few KB. And so on, and on.

At the end of the day, none of the known implementation of LZ4 is exposed to this risk.
Sure, someone creating some private program at home could get into this pitfall, but this situation is not exactly a "security risk". Only widely deployed programs actually matter, since they are the potential targets by hackers or surveillance organizations.
Basically, most user programs employ LZ4 for small data packet structure, way below the critical limit. Programs which generate and distribute large compressed blocks (notably the lz4c pos-x compression utility, distributed within Linux Distro) use the documented framing format, which limits block size to 4 or 8 MB. Remove also from the list programs which never take "externally provided" data as input, they can't be targeted either.

So sorry, this is not a "new heartbleed" situation the author seems to dream for.

Sure, since a fix is currently available, it's nonetheless a good move to close this risk. But I wouldn't flag that action as "critical", since most applications stand outside of the "custom compression format using large blocks of > 8 MB on 32-bits system, and receiving data from untrusted external sources" scenario.

It's one thing to tell there is a potential vulnerability that should be fixed, to ensure it does not become exploitable in the future. It's a totally different thing to pretend there is a dangerous RCE (Remote Code Execution) exploit currently active on the Internet, which is scary. DonB article cleverly conflates the two, implying the second to create a flashy headline, while disseminating some minor disclaimer elements throughout the long article to pretend, whenever necessary, having said the first. That's clever, but conflating gravity level to grab some free ad is not a respectful behavior from a security firm.

The world is already full of real security threats, from which heartbleed is just one of the most recent examples. Triggering too many alarms to grab a bit of fame is a good way to weaken the power of future alarms. You can guess that when every headline claim a "new heartbleed" situation, no one will pay attention to the real next one which will matter. And that's dangerous.

The long list of "credits" at the end of the article is another reason for caution. Man, this guy must be desperate for attention : it happens I know some the influential names listed there, and they told me they barely heard indirectly about the guy. So why are they listed there ? Fame by association ? Sure, please thank Linus Torvald for "coordinating and advising" on the issue. This is becoming plain ridiculous.

Anyway, should you feel nonetheless at risk now, please don't hesitate to update your LZ4 version. It's a good thing to do anyway, and as stated previously, the vulnerability was already patched.

Tuesday, May 20, 2014

Streaming API for LZ4

 For quite some time, the LZ4 Streaming API project has been started and delayed, as other priorities stepped in the way. To be fair, one important delaying factor was the difficulty to define a "clean enough" API, something that would be simple to use and understand, while also providing the level of fine-tuning required by advanced programmers (typically within embedded environments).

I feel quite close today from breaking this shell, with an interface definition which matches my definition of "clean enough" API. So let's share some preliminary results.

Current streaming interface

The current streaming API exposes the following functions :

void* LZ4_create (const char* inputBuffer);
int   LZ4_compress_continue (void* LZ4_Data, const char* source, char* dest, int inputSize);
char* LZ4_slideInputBuffer (void* LZ4_Data);
int   LZ4_free (void* LZ4_Data);

Although it works as intended, this API introduces some critical limitations.

First, there is a need to create a data structure which tracks the behavior of the stream.
This is void* LZ4_data, which will be generated by LZ4_create().

It looks fine, but a first issue is that allocation happens inside LZ4_create(). In some circumstances, this may not be acceptable : sometimes allocation must be fully controlled by the host process. For example, standard functions such as malloc() may be unavailable. For this use case have been created 2 more functions :

int LZ4_sizeofStreamState(void);
int LZ4_resetStreamState(void* state, const char* inputBuffer);

It looks fine too, but is unfortunately still not usable for static allocation purpose (on stack, for example).

The second issue is the const char* inputBuffer argument, specified at creation stage, because it will be fixed during the entire compression process. It's not possible to jump between memory segments. This is a major limitation, preventing "in-place" compression of scattered memory segments.

LZ4_compress_continue() looks fine as a function definition, but what is less so is the documented requirement : up to 64KB of previously processed data must stand *before* the data to be compressed. This is a major constraint, typically requiring to prepare data into an intermediate buffer (which in some circumstances, might prove an unaffordable burden).

Then there is LZ4_slideInputBuffer(), which is arguably not at its right responsibility level. Its mission is to copy the last previous 64KB of source data to the beginning of the input buffer. It exists because the content of void* LZ4_data is not accessible.

No function is defined to simply load a dictionary : to achieve that goal, one has to compress a segment using LZ4_compress_continue(), throw the result, and compress the next segment using data accumulated so far.

To summarize, yes it can work, but it's cumbersome. One has to accept the idea that data to compress must be prepared into an intermediate buffer where above conditions can be met.
It's not convenient, and may sometimes not be possible, for example due to allocation limitations.
It also has a negative impact on performance, due to memory copy operations overhead.

New streaming interface definition

While defining the streaming API, I constantly struggled to find the right balance between a "fully automated" API, and a "user-controlled" one. The "fully automated API" would take in charge all kind of workload, generate data respecting the LZ4 Streaming Format with full headers and markers, and take care of allocation for required buffers. On the other hand, the "user-controlled" API would serve the needs for host programs which require full control over resource allocation.

I therefore settled with providing both.
The "user-controlled" version will be part of LZ4 library. It's the simpler one, and only takes care of generating compressed block format, chained or independent. It depends on the host process to pay attention to all memory operations (allocation, position and availability) and to provide its own custom format linking the successive blocks.

The "fully automated" API will be part of a new library, which will be called LZ4S for "LZ4 Streaming". It is basically a user program of the previous API.

In this blog post, we will therefore concentrate on the first API, since it is also an underlying foundation for the 2nd one.

typedef struct { unsigned int table[LZ4_STREAMSIZE_U32]; } LZ4_stream_t;

int LZ4_compress_continue (void* LZ4_stream, const char* source, char* dest, int inputSize);

OK, let's focus on the most important differences with current API.

An LZ4_stream_t structure is exposed. It is provided to give a better control over memory management to the host program. For example, allocation can be static (stack, global) or dynamic, and use any user-defined allocation function.

Obviously, the host program is both in charge of allocating and freeing this memory space. It may also duplicate it "at will", which is a good idea for "static dictionary compression" scenarios.

Cleaning (or resetting) LZ4_stream_t is only a matter of memset() it to zero. It's a requirement to do it once before using this structure.

LZ4_stream_t 's size is controlled by the value LZ4_STREAMSIZE_U32, which is checked at compile time thanks to a static assert piece of code, as already used within xxHash. It mostly depends on LZ4's hash table (which typical size is about 16KB). Hash table size is a parameter controlled by the defined value LZ4_MEMORY_USAGE. Up to now, this define was present into lz4.c. To be coherent with the new interface, it will be moved to lz4.h instead. I don't foresee any issue with that.

LZ4_stream_t is described as a table of unsigned int. This is intentional. The real usage of this memory bank remains hidden inside lz4.c. This way, internal variables within the structure cannot be used as a kind of implicit interface contract, allowing future modifications to happen without breaking compatibility with existing programs.

Once the streaming structure is created, you can start to populate it by loading a static dictionary using :

int LZ4_loadDict (void* LZ4_stream, const char* source, int size);

This part is optional. Loading a dictionary flushes any prior data reference from LZ4_stream_t , if there is any, so you may also use this function to initialize an LZ4_stream_t structure by simply setting a size of 0.

LZ4_compress_continue() looks the same as previously, and is indeed compatible,  but a major difference is that it no longer requires to compress next data block "right after" previous data. Previous and Next data can be anywhere in memory. The LZ4_stream_t structure will keep track of it automatically.

One strong assumption of LZ4_compress_continue() is that previously compressed data is still available. Unfortunately, this might not be the case, and this function cannot guess that.
Should previously compressed data be no longer accessible, for example because you need the space for something else, or you can't guarantee its availability, it's possible to solve that situation :
  • You can "save" the relevant data segment from previous block (typically, the last 64KB, it can also be less than that) into a "safe" place. At which point, you will need to indicate this change of position.
int LZ4_moveDict (void* LZ4_stream, char* safeBuffer, int dictSize);

Memory space available at char* safeBuffer must be at least dictSize,  since it is the amount of data  LZ4_moveDict() will copy there (Note : the maximum amount of data that will be copied is 64KB, even if dictSize is larger).


Decompression

The current streaming decompression API is defined as follows :

int LZ4_decompress_safe_withPrefix64k (const char* source, char* dest, int compressedSize, int maxOutputSize);
int LZ4_decompress_fast_withPrefix64k (const char* source, char* dest, int originalSize);

Its documented requirement is that previously decompressed data must stand *right before* the memory address where the next data block is going to be decompressed. It works, but implies the creation of a temporary buffer to match the requirement.

The new streaming API get rid of this limitation, allowing previous data to stand anywhere within accessible memory :

int LZ4_decompress_safe_usingDict (const char* source, char* dest, int compressedSize, int maxOutputSize, const char* dictStart, int dictSize);
int LZ4_decompress_fast_usingDict (const char* source, char* dest, int originalSize, const char* dictStart, int dictSize);

The dictionary definition is part of the argument list (const char* dictStart, int dictSize), therefore there is no need for a tracking structure, in contrast with compression.



A first code implementing this API is currently available in the "dev" branch of LZ4 on github. It is still early stage, and probably the right time to be provide your comments should you have any.

[Edit] Added : LZ4_moveDict() as a potential replacement of LZ4_SetDictPos()
[Edit] Added : Paragraph on decompression
[Edit] Modified : move to LZ4_moveDict()
[Edit] Interface definition converges towards LZ4_compress_continue()
[Edit] "streaming" branch now merged into "dev" branch