The Future of Hadoop
“You can always tell the pioneers,” goes the old saying. “They have arrows sticking out of them.” The computer landscape is littered with products that were first movers but were ultimately overtaken by fast followers, who learned from the innovators’ mistakes and went on to thrive. I believe this is the fate that awaits Hadoop, as Spark and Cassandra continue to gain momentum in the Big Data community. In order to explain, a bit of history is required.
Almost 20 years ago, Doug Cutting faced two issues in creating a web search engine: how to reliably store all that information, and how to create a massive lookup index. Thus was born Hadoop, which included a distributed, highly available file system, as well as the MapReduce framework for massively parallel computations.
MapReduce was indeed revolutionary: previously intractable problems could now be solved in a matter of minutes. But it did not take advantage of memory to improve performance, and it was terrible at handling incremental changes, e.g., adding the index for a single new tweet to the existing full web index. In time, Hadoop replaced the original MapReduce framework with Tez, which uses a directed acyclic graph for parallel processing, based on Microsoft’s 2007 Dryad paper. But Tez has been upstaged by another product based on Dryad: Spark. Spark’s implementation is more general purpose; for example, data at various stages of computation can be efficiently checkpointed and restored. Spark can run in the Hadoop ecosystem (where it will soon replace Tez), or it can run in its own standalone environment. More and more projects are choosing Spark as their Big Data solution, and then, as a secondary decision, choosing between Spark on Hadoop or Spark standalone. Over 25 percent of Spark projects today run outside of Hadoop, and the percentage is rising.
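To make the incremental-update weakness concrete, here is a minimal sketch of the map, shuffle and reduce steps in plain Python. It is not Hadoop's actual API; the function names (map_phase, shuffle, reduce_phase, build_index) and the tiny corpus are hypothetical, chosen only to illustrate the batch model. It builds a small inverted index and then shows that, in this model, adding a single document means running the whole job again.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(doc_id: str, text: str) -> Iterable[Tuple[str, str]]:
    """Emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group emitted values by key, as a framework would between phases."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word, doc_id in pairs:
        groups[word].append(doc_id)
    return groups


def reduce_phase(word: str, doc_ids: List[str]) -> Tuple[str, List[str]]:
    """Collapse all document ids for one word into a sorted posting list."""
    return word, sorted(doc_ids)


def build_index(corpus: Dict[str, str]) -> Dict[str, List[str]]:
    """Run the full map -> shuffle -> reduce pipeline over the entire corpus."""
    pairs = (
        pair
        for doc_id, text in corpus.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(w, ids) for w, ids in shuffle(pairs).items())


# Hypothetical two-document corpus.
corpus = {"d1": "big data tools", "d2": "big ideas"}
index = build_index(corpus)

# Batch model: one new document still triggers a full pass over every
# document, because nothing from the previous run is reused.
corpus["d3"] = "data in motion"
index = build_index(corpus)
```

In this sketch the second call to build_index reprocesses d1 and d2 even though only d3 is new, which is the cost the paragraph above describes for a single new tweet against a full web index.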
The Hadoop Distributed File System (HDFS) is also showing its age. For example, it requires an active NameNode in order to function, and it uses ZooKeeper to monitor the NameNode’s availability. As a result, it can experience “brown-outs” of up to a minute while ZooKeeper detects that the active NameNode has crashed. Hadoop has evolved mechanisms to improve availability, but other Big Data systems, such as Cassandra, achieve high availability without the need for a master node or an external monitoring facility, thus eliminating the risk of brown-outs.
The trend is clear. Hadoop as a concept revolutionized the world of data processing, and ushered in the era of Big Data. But Hadoop as a product ecosystem is certainly showing its age, and, for many use cases, it has been upstaged by more modern technologies like Spark, which had the benefit of learning from Hadoop’s growing pains. Spark has a more generic and extensible programming model, which makes it easier to use for analytics. It can also handle Big Data in Motion, via Spark Streaming, and serves as the basis for a powerful graph processing engine (GraphX) and a full-featured data science library (MLlib).
Perhaps this explains a recent Gartner report’s finding that, despite the growing demand for Big Data solutions, the demand for Hadoop was not accelerating as expected and that enthusiasm for Hadoop was low. In fact, a clear majority of enterprises surveyed said they had no plans to invest in Hadoop either now or in the future. So, while other Big Data technologies like Spark, Cassandra and MongoDB continue to attract lots of interest from companies, Hadoop appears to be suffering from a decline in demand. The leading Hadoop vendors, Cloudera and Hortonworks, may still have high valuations, but they spend far too much to acquire each new customer, and have yet to break through to the enterprise mainstream.
Why the lack of enthusiasm for Hadoop? Some analysts blame the high total cost of ownership; others blame the difficulty in finding engineers with the requisite skills. To me, these are just different ways of saying that Hadoop is showing its age. Hadoop, like any 20-year-old software system, has evolved over the years, and each evolution has made the ecosystem more complex and harder to deploy or maintain. A newer system like Spark has the advantage of a younger and more robust code base, and a more modern programming API that young engineers find easier to use than Hadoop’s.
So, here’s to you, Hadoop, for educating the world about the promise and possibilities of Big Data. The Big Data products that will succeed commercially, like Spark and Cassandra, could never have done it without you. As for you, Hadoop, take a seat next to VisiCalc (the first spreadsheet), AltaVista and Netscape, as products that were just too far ahead of their time.