Hadoop

What it is and how people use it: my own summary.


The web offers plenty of introductions to what Hadoop is about. After reading up on it and trying it out a bit, I wanted to see if I could sum up what I see as the main points as concisely as possible. Corrections welcome.

Hadoop is an open source Apache project consisting of several modules. The key ones are the Hadoop Distributed File System (whose acronym is trademarked, apparently) and MapReduce. The HDFS lets you distribute storage across multiple systems, and MapReduce lets you distribute processing across multiple systems: your “Map” logic runs on the nodes where the data lives, your “Reduce” logic gathers up and aggregates the map results (on cluster nodes as well, not just one machine), and the master node that’s driving it all coordinates the job.
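
To make that division of labor concrete, here is the shape of the idea in plain Python, with no Hadoop and no distribution involved; it just shows what “Map” and “Reduce” each contribute to the ubiquitous word count example:

    # A toy illustration of the MapReduce idea: mappers emit (key, value)
    # pairs from their chunk of the input, and the reducer aggregates all
    # values that share a key. This is not Hadoop code.
    from collections import defaultdict

    def map_phase(line):
        return [(word, 1) for word in line.split()]

    def reduce_phase(pairs):
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    lines = ["the quick brown fox", "the lazy dog"]
    mapped = [pair for line in lines for pair in map_phase(line)]
    print(reduce_phase(mapped))   # {'the': 2, 'quick': 1, ...}

Hadoop’s contribution is running many copies of the map step on the nodes holding the data and shuffling their output to the reduce step, so the same two pieces of logic can scale to input far larger than one machine can hold.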

This ability to spread out storage and processing makes it easier to do large-scale processing without requiring large-scale hardware. You can spread the processing across whatever boxes you have lying around or across virtual machines on a cloud platform that you spin up for only as long as you need them. This ability to inexpensively scale up has made Hadoop one of the most popular technologies associated with the buzzphrase “Big Data.”

Writing Hadoop applications

Hardcore Hadoop usage often means writing the map and reduce tasks as Java programs that must import special Hadoop libraries and play by Hadoop rules; see the source of the Apache Hadoop Wiki’s Word Count program for an example. (Word count programs are ubiquitous in Hadoop primers.) Then, once you’ve started up the Hadoop background processes, you can use the Hadoop command line utilities to point at the JAR file holding your map and reduce logic and to say where on the HDFS to look for input and where to put output. While your program runs, you can check on its progress with web interfaces to the various background processes.

Instead of coding and compiling your own JAR file, one nice option is to use the hadoop-streaming-*.jar one that comes with the Hadoop distribution to hand off the processing to scripts you’ve written in just about any language that can read from standard input and write to standard output. There’s no need for these scripts to import any special Hadoop libraries. I found it very easy to go through Michael G. Noll’s Writing an Hadoop MapReduce Program in Python tutorial (creating yet another word count program) after first doing his Running Hadoop on Ubuntu Linux (Single-Node Cluster) tutorial to set up a small Hadoop environment. (If you try one of the many Hadoop tutorials you can find on the web, make sure to run the same version of Hadoop that the tutorial’s author did. The 2.* Hadoop releases are different enough from the 1.* ones that if you try to set up a distributed file system and share processing across it using a recent release while following instructions written using a 1.* release, there are more opportunities for problems. I had good luck with Hardik Pandya’s “How to Set Up a Multi-Node Hadoop Cluster on Amazon EC2,” split into Part 1 and Part 2, when I used the same release that he did.)
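
In the spirit of those tutorials, here is a minimal sketch of what the streaming version of word count can look like; the file names mapper.py and reducer.py are just my choices, and this is not a copy of Noll’s code:

    # mapper.py: read lines from standard input, emit "word<TAB>1" pairs.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    # reducer.py: Hadoop streaming sorts the mapper output by key before the
    # reducer sees it, so identical words arrive on consecutive lines.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Because the scripts only touch standard input and output, you can test the whole pipeline without Hadoop at all (cat input.txt | ./mapper.py | sort | ./reducer.py) before pointing hadoop-streaming-*.jar at the two scripts and at your HDFS input and output locations.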

Hadoop’s native scripting environments

Instead of writing your own applications, you can take advantage of the increasing number of native Hadoop scripting languages that shield you from the lower-level parts. Several popular ones build on HCatalog, a layer built on top of the HDFS. As the Hortonworks Hadoop tutorial Hello World! – An introduction to Hadoop with Hive and Pig puts it, “The function of HCatalog is to hold location and metadata about the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from data location and metadata like the schema. Additionally since HCatalog supports many tools, like Hive and Pig, the location and metadata can be shared between tools.” You can work with HCatalog directly, but it’s more common to use these other tools that are built on top of it, and you’ll often see HCatalog mentioned in discussions of those tools. (For example, the same tutorial refers to the need to register a file with HCatalog before Hive or Pig can use it.)

Apache Hive, according to its home page, “facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.” You can start up Hive and enter HiveQL commands at its prompt or you can pass it scripts instead of using it interactively. If you know the basics of SQL, you’ll be off and running pretty quickly. The 4:33 video Demonstration of Apache Hive by Rob Kerr gives a nice short introduction to writing and running Hive scripts.

Apache Pig is another Hadoop utility that takes advantage of HCatalog. The “Pig Latin” scripting language is less SQL-like (but straightforward enough) and lets you create data structures on the fly so that you can pipeline data through a series of steps. You can run its commands interactively at its grunt shell or in batch mode from the operating system command line.

When should you use Hive and when should you use Pig? It’s a common topic of discussion; a Google search for “pig vs. hive” gets over 2,000 hits. Sometimes it’s just a matter of convention at a particular shop. The stackoverflow thread Difference between Pig and Hive? Why have both? has some good points as well as pointers to more detailed discussions, including a Yahoo developer network discussion that doesn’t mention Hive by name but has a good description of the basics of Pig and how it compares to an SQL approach.

Hive and Pig are both very big in the Hadoop world, but plenty of other such tools are coming along. The home page of Apache Storm tells us that it “makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.” Apache Spark provides Java, Scala, and Python APIs, and its main advertised advantages are greater speed and an ability to layer on top of many different classes of data sources. There are other tools, but I mention these two because, according to the recent O’Reilly 2014 Data Science Salary Survey, “Storm and Spark users earn the highest median salary” among users of all the data science tools surveyed. Neither is restricted to use with Hadoop, but the big players described below advertise support for one or both as advantages of their Hadoop distributions.
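
As a taste of Spark’s Python API, here is a rough sketch of the inevitable word count; the HDFS paths are placeholders, and a real job would be launched with Spark’s own tools (for example its spark-submit script) rather than run as-is:

    # Word count with Spark's Python API; the input and output paths below
    # are placeholders for wherever your data actually lives on the HDFS.
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount")
    counts = (sc.textFile("hdfs:///path/to/input")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("hdfs:///path/to/output")

The chain of flatMap, map, and reduceByKey covers the same ground as the mapper and reducer scripts above, which is part of Spark’s appeal: less plumbing between you and the map-and-aggregate pattern.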

Another popular tool in the Hadoop ecosystem is Apache HBase, the most well-known of the column-oriented NoSQL databases. It can sit on top of HDFS, and its tables can host both input and output for MapReduce jobs.

The big players

The companies Cloudera, Hortonworks, and MapR have gotten famous and made plenty of money selling and supporting packaged Hadoop distributions that include additional tools to make them easier to set up and use than the Apache downloads. After hearing that Hortonworks stayed closer to the open source philosophy than the others, I tried their distribution and found that it includes many additional web-based tools to shield you from the command line. For example, it lets you enter Hive and Pig Latin commands into IDE-ish windows designed around these tools, and it includes a graphical drag-and-drop file browser interface to the HDFS. I found the tutorials in the “Hello World” section of their Tutorials page to be very helpful. I have no experience with the other two companies, but a Google search on cloudera hortonworks mapr finds a lot of discussions out there comparing the three.

Pre-existing big IT names such as IBM and Microsoft have also jumped into the Hadoop market; when you do a Google search for just hadoop, it’s interesting to see which companies have paid for Google AdWords placement and how prominently each one shows up.

Hadoop’s future

One of Hadoop’s main uses so far has been to batch process large amounts of data (usually data that fits into one giant table, such as server or transaction logs) to harvest summary data that can be handed off to analytics packages. This is why SAS and Pentaho, which do not have their own Hadoop distributions, have paid for good Google AdWords placement when you search for “hadoop”: they want you to use their products for the analytics part.

A hot area of growth seems to be the promise of using Hadoop for more real-time processing, which is driving the rise in Storm and Spark’s popularity. Even in batch processing, there are still plenty of new opportunities in the Hadoop world as people adapt more kinds of data for use with the growing tool set. The “one giant table” representation is usually necessary to make it easy to split your data up for distribution across multiple nodes; with my RDF hat on, I think there are some interesting possibilities for representing complex data structures in Hadoop using the N-Triples RDF syntax, which will still look like one giant three- (or four-) column table to Hadoop.
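
To show what I mean, here is a naive sketch of a streaming-style mapper that treats each line of an N-Triples file as a row in that giant table and emits the predicate (the middle column), so a reducer like the word count one above could tally how often each property is used; a real job would need proper N-Triples parsing, since literals can contain spaces:

    # Naive sketch: treat each N-Triples line as three whitespace-separated
    # columns (subject, predicate, object) and emit the predicate with a
    # count of 1 for a word-count-style reducer to aggregate.
    import sys

    for line in sys.stdin:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 2)   # subject, predicate, rest of the triple
        if len(parts) == 3:
            print("%s\t1" % parts[1])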

Cloudera’s Paolo Castagna has done some work in this direction, as described in his presentation “Handling RDF data with tools from the Hadoop ecosystem” (pdf). A more recent presentation, Quadrupling your Elephants: RDF and the Hadoop Ecosystem by YarcData’s Rob Vesse, shows some interesting work as well, including the beginnings of some Jena-based tools for processing RDF with Hadoop. There has been some work at the University of Freiburg on SPARQL query processing using Hadoop (pdf), and SPARQL City also offers a SPARQL front end to Hadoop-based storage. (If anyone’s looking for a semantic web project idea, you know what would be cool? A Hive adapter for D2R.) I think there’s a very bright future for the cross-pollination of all of these tools.