Monday, December 19, 2011

LZ4 into Hadoop-MapReduce

After a very fast evaluation, LZ4 has recently been integrated into the Apache project Hadoop - MapReduce.

This is important news, since, in my humble opinion, Hadoop is among the most advanced and ambitious projects to date (an opinion which is shared by some). It also serves as an excellent illustration of LZ4 usage as an in-memory compression algorithm for big server applications.

But first, a few words on Hadoop.
By 2005, Google shook the IT world by presenting BigTable, its home-grown distributed database with eventual consistency, able to store virtually the entire web and query it. BigTable was built on top of Google FS, a virtual file system spanning the entire planet: tens of thousands of computers distributed across hundreds of data centers all over the world, accessed as if they were a single massive machine. This limitless amount of stored data could then be processed in parallel, typically for query preparation, thanks to the MapReduce framework, which makes it possible to process petabytes of data in a small amount of time (if you can afford the number of servers necessary for that).

At this very moment, Google stole the crown of programmatic champion from Microsoft. It was now clear that they were setting the future. Although most of these technologies had already been studied, it was the first time they were executed together and so well, at such a huge scale, for commercially available products. This gave Google a lead of literally years over the competition, since most of its Web products were based on these capabilities.

Since then, all the other "big names" of IT (namely Yahoo, Facebook, Amazon, IBM, Microsoft, Apple, etc.) have sought to duplicate this architecture. These efforts finally converged into the open-source project Hadoop.
Hadoop now has most of the capabilities presented in 2005 by Google, including a distributed file storage system (HDFS), a distributed database (HBase), and the same distributed-computing framework as Google, MapReduce.

So, where does LZ4 fit in?
Well, in such an architecture, compression is used as a performance enabler.

As can be guessed, massive amounts of data are traveling between the different nodes. Moreover, each node is also processing a fair amount of data, more or less permanently.
In such situations, compression offers some advantages: less data to transfer means less time and cost to send and receive it. It also means that more data can be stored in RAM, or that more data can remain in the CPU cache. All this translates into better system speed.

Or does it? For this claim to hold, the compression algorithm must be "unobtrusive", which means it should consume very few CPU cycles. Otherwise, the cost of compression can cancel or even reverse the speed advantage. This basically means only fast compressors qualify for the job.
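To make this trade-off concrete, here is a minimal sketch of a break-even model. It assumes the three stages (compress, transfer, decompress) run one after another with no pipelining, and every number in it is hypothetical, chosen only to show the two regimes; none of it comes from actual Hadoop measurements.

```python
def plain_time(size_mb, link_mb_s):
    """Seconds to send size_mb uncompressed over a link of link_mb_s."""
    return size_mb / link_mb_s


def compressed_time(size_mb, link_mb_s, ratio, comp_mb_s, decomp_mb_s):
    """Seconds to compress, send the smaller payload, then decompress.

    Assumes the stages are strictly sequential, and that comp_mb_s and
    decomp_mb_s are both measured against the uncompressed size.
    """
    return (size_mb / comp_mb_s
            + (size_mb / ratio) / link_mb_s
            + size_mb / decomp_mb_s)


def speedup(size_mb, link_mb_s, ratio, comp_mb_s, decomp_mb_s):
    """Values above 1.0 mean compression made the transfer faster overall."""
    return plain_time(size_mb, link_mb_s) / compressed_time(
        size_mb, link_mb_s, ratio, comp_mb_s, decomp_mb_s)


# Hypothetical figures: a fast codec with a modest ratio, and a slow codec
# with a better ratio, both feeding the same 100 MB/s link.
fast = speedup(1000, link_mb_s=100, ratio=2.0, comp_mb_s=400, decomp_mb_s=1000)
slow = speedup(1000, link_mb_s=100, ratio=3.0, comp_mb_s=20, decomp_mb_s=200)
print(f"fast codec speedup: {fast:.2f}")
print(f"slow codec speedup: {slow:.2f}")
```

Under these assumed figures the fast codec comes out above 1.0 while the slow one falls well below it, despite its better ratio: the time spent compressing outweighs the bytes saved on the wire.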

In the beginning, LZO was such a champion. It offered great speed, but with some important usage limitations due to its GPL license.
Then, in early 2011, Google released Snappy, formerly Zippy, the very same algorithm used by Google in its own BigTable implementation. It quickly became a great alternative, thanks to its more permissive license and better performance.

LZ4 was also released this year, just after Snappy. Google's fame meant there was little attention left for competing algorithms, but it also raised awareness that fast compression algorithms have a role in IT architecture. LZ4 gradually improved over time, to the point of now providing better performance than Google's creation. One of Hadoop's contributors, Binglin Chang, made the effort to implement LZ4 as a JNI patch and compare it directly to Snappy. LZ4 performance was found to be better than Snappy's, even when using Snappy's own set of calibration tests.
In a relatively quick decision process, the LZ4 patch was then integrated into the main Hadoop - MapReduce source trunk.

/* Update : Google's Snappy developer kindly reminds us that benchmark figures depend on the tested configuration, and states that on their own test platform, Snappy keeps an edge with regard to compression speed. See comment : */

The advantage of using fast compression algorithms, as Hadoop does, can be replicated in many server-side applications, for example databases. Recently, column-oriented databases have been drawing attention, since they make heavy use of compression to gain impressive performance advantages. The idea remains the same: compress data to keep more of it in RAM and in the CPU cache, which directly translates into better performance.
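As a rough illustration of the "more data in RAM" argument, the sketch below compresses a made-up column of low-cardinality integers and counts how many such columns would fit in a fixed memory budget. The column contents and the 64 MiB budget are hypothetical, and zlib at level 1 is used only as a stand-in from the Python standard library; it is neither LZ4 nor what any particular database uses.

```python
import array
import zlib

# A hypothetical column: one million C ints (typically 32-bit) with only a
# few distinct values, like a country code or a status flag.
column = array.array("i", (i % 16 for i in range(1_000_000)))
raw = column.tobytes()

# zlib level 1 is only a stand-in for a fast compressor.
packed = zlib.compress(raw, 1)

ram_budget = 64 * 1024 * 1024  # hypothetical 64 MiB cache budget
print("raw bytes:     ", len(raw))
print("packed bytes:  ", len(packed))
print("columns in RAM:", ram_budget // len(raw), "raw vs",
      ram_budget // len(packed), "compressed")

# The compressed column is only useful if it restores exactly on access.
restored = array.array("i")
restored.frombytes(zlib.decompress(packed))
assert restored == column
```

The price is a decompression step on every access, which is why the speed of the decompressor matters as much as the ratio.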