I spent a few hours trying to find a definite answer to this question, and hopefully my post will save someone time and trouble.

The short answer is yes: if you compress Parquet files with Snappy, they are indeed splittable.

First off, why should you even care about compression? A typical Hadoop job is IO bound, not CPU bound, so a light and fast compression codec will actually improve performance. There are tons of posts on the web if you want more details about the various codecs, and you will find that both Cloudera and Hortonworks recommend Snappy. Snappy is designed for speed and it does not load your CPU cores hard. The downside, of course, is that it does not compress as well as gzip or bzip2.

If you've read about the Parquet format, you learn that Parquet already applies some cool, smart compression and encoding to your data, employing delta encoding, run-length encoding, dictionary encoding, etc. It is still a very good idea to use Snappy compression on top of that. In my tests (and your mileage will vary), Snappy reduced my Parquet files by at least 2x while improving job processing time by 10-20%.

Once you figure out that Snappy is the way to go and learn how to tweak the settings for intermediate and output compression, you will stumble upon the notion of a codec being "splittable" or not. (By the way, I do not believe "splittable" is an actual English word.) And if you pay attention, you quickly notice that Snappy is NOT splittable, and the next thing you read is that this is a really bad thing. It means that if an HDFS file has more than one block, a map/reduce job has to decompress the entire file (all the blocks) first, and only one core can do that at a time, hurting parallelism a lot.

This is when I started looking frantically for an answer and ended up spending hours. Earlier versions of the Cloudera documentation were plainly wrong, stating that Snappy is splittable, when we know it is not. The Hortonworks docs were even more vague on the subject. The recent version of the CDH documentation fortunately delivers a better message (link):

> For MapReduce, if you need your compressed data to be splittable, BZip2 and LZO formats can be split. Snappy and GZip blocks are not splittable, but files with Snappy blocks inside a container file format such as SequenceFile or Avro can be split. Snappy is intended to be used with a container format, like SequenceFiles or Avro data files, rather than being used directly on plain text, for example, since the latter is not splittable and cannot be processed in parallel using MapReduce. Splittability is not relevant to HBase data.

My only question was: why did they not mention their favorite Parquet format? I had to dig further to see if the Parquet/Snappy combo is indeed splittable.

Parquet stores rows and columns in so-called row groups, and you can think of them as the above-mentioned containers. The Parquet block size property defines the row group size, which normally would be the same as the HDFS block size. Snappy compresses individual Parquet row groups, which keeps a Parquet file splittable.

Tom White's excellent book Hadoop: The Definitive Guide, 4th Edition, also confirms this:

> The consequence of storing the metadata in the footer is that reading a Parquet file requires an initial seek to the end of the file (minus 8 bytes) to read the footer metadata length, then a second seek backward by that length to read the footer metadata. Unlike sequence files and Avro datafiles, where the metadata is stored in the header and sync markers are used to separate blocks, Parquet files don't need sync markers since the block boundaries are stored in the footer metadata. (This is possible because the metadata is written after all the blocks have been written, so the writer can retain the block boundary positions in memory until the file is closed.) Therefore, Parquet files are splittable, since the blocks can be located after reading the footer and can then be processed in parallel (by MapReduce, for example).

Now, to try this out. Assuming you have Hadoop installed on your cluster (if not, please follow an installation guide first), this is the machine config of my cluster nodes, though the steps that follow should work with your own installation/machine configs:

```
uname -a
```

Pig requires that the Snappy jar and native library be available on its classpath when a script is run. The Pig client here is installed at /tools/hadoop, and the jar needs to be placed within $PIG_HOME/lib:

```
/tools/hadoop/pig-0.9.1/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar
```

Also, you need to point Pig to the Snappy native libraries:

```
export PIG_OPTS="$PIG_OPTS -Djava.library.path=$HADOOP_HOME/lib/native/Linux-amd64-64"
```
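The footer layout described above (a trailing `PAR1` magic preceded by a 4-byte little-endian footer length) can be sketched in a few lines of Python. This is only a toy illustration of the two seeks a reader performs, not a real Parquet reader: the framing bytes follow the actual file layout, but the footer contents here are fake placeholders, since a real footer is Thrift-encoded metadata.

```python
import struct

MAGIC = b"PAR1"  # Parquet files both start and end with this 4-byte magic

def read_footer(data: bytes) -> bytes:
    """Locate the footer the way a reader does: look at end-of-file minus
    8 bytes, read the 4-byte little-endian footer length, then step
    backward by that length to read the footer metadata itself."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file: magic bytes missing")
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    return data[-8 - footer_len:-8]

# Toy file: header magic, opaque "row group" bytes, fake footer,
# footer length, trailing magic. Footer bytes are placeholders.
fake_footer = b"\x15\x00\x15\x02"
toy_file = (MAGIC
            + b"...row group bytes..."
            + fake_footer
            + struct.pack("<I", len(fake_footer))
            + MAGIC)

print(read_footer(toy_file) == fake_footer)  # True
```

Because the block boundaries live in this footer, a reader can jump straight to any row group without scanning for sync markers, which is exactly why Snappy-compressed row groups leave the file splittable.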