Hadoop: A Capital Market Perspective
Technologies relevant to the financial services community come and go with varying ratios of signal to noise, and more often than not skew towards noise. A healthy degree of skepticism when approaching the phenomenon of Big Data, and more specifically Hadoop, is therefore well warranted given the fervor.
However, business leaders can’t afford to ignore technical innovations offering “10x” improvements in price/performance in the systems they depend on, and for that reason an investment of time in understanding Hadoop’s potential is warranted.
From the perspective of a trading professional, Hadoop can be viewed as a massively scalable storage and analytics platform with the promise of dramatic improvements in price/performance over traditional approaches. Born from pioneers in the search world and fostered in the open source community, it has seen adoption soar over the past 7 years, and Hadoop implementations can be found in almost any industry.
Hadoop has two essential functions:
• Analysis: algorithms that sort and index data to make it more easily discoverable. The original and largest use case is MapReduce, the algorithm on which Google built its empire.
• Storage: file management processes that store data across masses of servers while appearing as a single system, handling under the hood the replication necessary to minimize data loss.
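The MapReduce idea mentioned above can be illustrated with a minimal single-machine sketch in Python. This is not Hadoop's API: the trade records and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) are hypothetical and chosen only to show the three steps. On a real cluster, the map and reduce steps would run in parallel across many nodes.

```python
from collections import defaultdict

# Hypothetical trade records: (symbol, quantity). Illustrative data only.
TRADES = [
    ("AAPL", 100),
    ("MSFT", 250),
    ("AAPL", 300),
    ("IBM", 50),
    ("MSFT", 150),
]


def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    symbol, quantity = record
    yield symbol, quantity


def shuffle_phase(pairs):
    """Group all values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Total traded quantity per symbol.
print(run_job(TRADES))
```

Because each record is mapped independently and each key is reduced independently, the work can be split across as many machines as are available.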
Storage and analysis are hardly green fields: trading platforms have been using relational databases and data warehouses in this space for decades. But these systems were born in a time when computing (servers and storage hardware) was complicated and expensive. The opportunity for Hadoop, however, was born from Kryder’s law and the economic juggernaut of the Intel server, which have driven down the cost and complexity of storage and computing year after year.
Hadoop builds on the foundation of ready access to “cheap and cheerful” servers and disks by binding them, as many as you can get your hands on, into a single “cluster” where data is spread out, forming a big virtual pool (in fact, many have taken to calling these big Hadoop pools “data lakes”). A couple of key points of interest:
• Traditional storage techniques require great care in structuring data before it enters the storage platform; Hadoop does not. Just chuck it in the pool.
• The file system (“HDFS”) takes care of spreading the data out and replicating it enough within the pool (typically 3 times) to ensure that the loss of any single disk drive or system won’t result in data loss. This overcomes the limitations that might otherwise preclude you from using cheap off-the-shelf servers and disks.
• Spreading the data out across many servers (nodes) allows for parallel processing. This, combined with some of the innovative analytics Hadoop offers, is what ratchets up performance.
So what? So we can throw significantly more data in, get faster results out, and pay a lot less for the whole thing.
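To make the replication point concrete, here is a minimal Python sketch of why three copies protect against the loss of any single server. It is a simplified model, not HDFS code: the round-robin placement, the node and block names, and the functions `place_blocks` and `surviving_blocks` are all assumptions made for illustration. Real HDFS placement also takes rack topology into account.

```python
REPLICATION_FACTOR = 3  # the "typically 3 times" figure from the text


def place_blocks(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified placement for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [
            nodes[(i + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def surviving_blocks(placement, failed_nodes):
    """Return the blocks that still have a replica on a healthy node."""
    failed = set(failed_nodes)
    return {
        block
        for block, holders in placement.items()
        if set(holders) - failed
    }


# Hypothetical five-node cluster holding eight blocks of one file.
NODES = ["node1", "node2", "node3", "node4", "node5"]
BLOCKS = [f"blk_{i}" for i in range(8)]
PLACEMENT = place_blocks(BLOCKS, NODES)

# Fail each node in turn and check that every block is still readable.
no_loss = all(
    surviving_blocks(PLACEMENT, [node]) == set(BLOCKS) for node in NODES
)
print("Any single node can fail without data loss:", no_loss)
```

In this model, a block only becomes unavailable when all three nodes holding its replicas fail at once, which is what makes inexpensive hardware tolerable.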
Hadoop and Trading
We’ll get to limitations in a moment since they’re real and relevant, but let’s first consider where Hadoop can be, and already is, used in trading today. Keep in mind that Hadoop is a framework with many components that can be mixed and matched in varying combinations, so beware of assuming one implementation can do it all.
The area of trade data analytics, alerting and reporting is fertile ground for Hadoop. Consider:
• Its flexibility means data from different systems (from different vendors and departments, across different asset classes, spanning different aspects of the trade lifecycle) can be stored in a single system.
• Its ability to blend those data sources allows customers to seek correlations and identify results that might otherwise be exceptionally difficult, if not impossible, to find across silos.
• Its parallel processing nature means those datasets can be analyzed and reported on very rapidly, even at the scales needed to support banks and broker-dealers, high frequency outfits or exchanges.
A few examples of areas where Hadoop is being put to real-world use in trading:
• Portfolio pricing and risk analysis: positions and the analytics systems which mark them and factor the risk attributable to them can be readily fused into common views
• Regulatory and business reporting: 15-3c’s requirements for proof of best execution can be readily met by blending trade data and market data into common reports, just as swaps dealers are using Hadoop to meet CFTC compliance demands. And trading desks are using Hadoop to better understand the profitability drivers of their businesses through converged analytics
• Back-testing and strategy development: Market data sourced from myriad systems and vendors are aggregated and queried, exploiting the diversity of the data at hand to derive new insights
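As a rough illustration of what blending trade data with market data can look like, the Python sketch below matches each execution to the most recent quote at or before its timestamp and measures the execution price against the quote midpoint. Everything here is hypothetical: the sample quotes and executions, the field layout, and the functions `prevailing_quote` and `slippage_report`. It is not a statement of what any regulation requires, and a production system would run this kind of join across a cluster rather than in memory.

```python
from bisect import bisect_right

# Hypothetical quotes for one symbol: (timestamp_ms, bid, ask), sorted by time.
QUOTES = [
    (1000, 99.98, 100.02),
    (2000, 100.00, 100.04),
    (3000, 100.03, 100.07),
]

# Hypothetical executions: (timestamp_ms, side, price, quantity).
TRADES = [
    (1500, "BUY", 100.02, 200),
    (2500, "SELL", 100.01, 100),
]


def prevailing_quote(quotes, timestamp):
    """Return the most recent quote at or before `timestamp`, or None."""
    times = [quote[0] for quote in quotes]
    idx = bisect_right(times, timestamp) - 1
    return quotes[idx] if idx >= 0 else None


def slippage_report(trades, quotes):
    """Join each trade to its prevailing quote and compute slippage vs. mid."""
    rows = []
    for ts, side, price, qty in trades:
        quote = prevailing_quote(quotes, ts)
        if quote is None:
            continue  # no market data available before this trade
        _, bid, ask = quote
        mid = (bid + ask) / 2
        # Positive slippage means the fill was worse than the midpoint.
        slippage = price - mid if side == "BUY" else mid - price
        rows.append({
            "timestamp": ts,
            "side": side,
            "quantity": qty,
            "price": price,
            "mid": round(mid, 4),
            "slippage": round(slippage, 4),
        })
    return rows


for row in slippage_report(TRADES, QUOTES):
    print(row)
```

The same as-of matching pattern applies whether the two datasets come from one vendor or many, which is where a single shared repository pays off.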
Ben Cuthbert, CEO of the multi-asset trading software provider Celer Technologies, has been deploying Hadoop-based trading solutions for the past 5 years. The technology was originally deployed in Hadoop’s infancy with the goal of transforming the economics and practical limits of storing vast amounts of data, but, Cuthbert noted, “we soon realized that we had an amazing analytics platform in our hands.”
As a result, Celer is able to store every element of the trade lifecycle, often more than a terabyte of data per day, in a single repository with billions of rows. With that data in hand and so readily queryable, customers can realize trading analytics that would otherwise be incredibly difficult to produce, from trade and customer profitability, to trader workflow efficiency and error rates, to market data and execution correlations. Celer works with Options IT to run individually deployed hosted solutions, where each customer enjoys their own dedicated, private Hadoop cluster, sidestepping the risks involved in attempting to operate in a public cloud.
Business leaders can expect results from an investment in Hadoop-based solutions but should take care to avoid being undermined by its limitations. Given how nascent the technology still is and how much engineering effort and capital is being invested in Hadoop, these are of course moving targets, and we can expect fewer constraints with time. In the meantime, stay particularly attuned to these key factors:
• Data Control: Hadoop was designed with free and open access in mind and historically has lacked the means of controlling access within its storage. Until recently it was therefore best suited for managing data where no user or system leveraging it needs restrictions on which portions of the data it may consume.
• Platform Maturity: Hadoop’s APIs are still rather primitive and its developer community has yet to reach its full potential. And while the pool of experienced Hadoop professionals is growing rapidly, not everyone on staff may be able to adapt to supporting it. Says Cuthbert, “You need a team willing—eager, really—to embrace the technology. It’s certainly not for everyone.”
• Platform Consistency: The frenzy of innovation and competition surrounding Hadoop will continue to morph and evolve the platform in unpredictable ways. What may seem like a smart component to implement today may be orphaned in the not-too-distant future, leaving support and technical skills hard to find.