Real Time Data Integration on Hadoop
Data integration? Didn’t we solve that problem about 25 years ago?
Like a B-movie vampire, data integration keeps returning to company to-do lists, no matter how many times it has been vanquished. Most recently, it is showing up in streaming architectures on Hadoop.
Consider the modern insurance business. Every minute, if not every second, meaningful business events occur: insured vehicles move and generate telematics data on driving behavior; customers file claims; adjusters inspect damage to insured property; customers click ads on their mobile devices and computers; and agents generate quotes, to name just a few examples. The company wants to be aware of these events as they occur and to respond to each in the most appropriate time frame. Whether or not the response itself is instant, the decision about how to respond is made immediately.
To do business in real time, companies build streaming architectures in which each business event, as soon as it occurs, is routed to Hadoop for quick action. This creates a data integration challenge, because the events flow in from different systems, with different data structures and formats. Further, the most appropriate action depends not just on the immediate event but also on its context, as shown in Figure 1.
For example, when a happy customer who is under-insured asks a question about coverage – whether on the website or at the call center – the system should be able to generate an offer to increase coverage relevant to the customer’s interest. This can be presented to the customer, either by the call center agent or by the website, at the moment the customer is connected. If done in the right way, this is helpful and provides good service to the customer.
But if that same customer calls in upset because a loved one has just been injured in an auto accident, and a few minutes later calls again or turns to the website with a coverage question, it is NOT the right time to try to cross-sell or upgrade. This is the time to help that customer as much as possible, applying the coverage she has. Promotions and offers can wait until another time.
Because the first call may go to one person in the call center, while the later coverage question may go to a different person or to the website, there is a data integration requirement. And because the two events can be just moments apart, the data integration issue is “business real time”.
Streaming Data Enrichment
In the insurance example above, data from the call center and data from the website flow from different systems to Hadoop, where there is a real time decision engine making recommendations.
The necessary data integration has to be accomplished on the fly, as the data streams in. This very quick and focused data integration is often referred to as “streaming data enrichment”. In the insurance example, the company wants each recommendation to be based on the full context of the customer’s relationship with the company. Data integration in near real time is required because the first call provides a critical part of the context for the second call or website visit.
My colleague, NoSQL expert Bryce Cottam, suggests using a low latency NoSQL database, such as HBase, as the repository for the integrated data in this case. Apache HBase is an open source database included in, or available with, most Hadoop distributions. Integration can be further simplified by designing the solution around a specific data integration requirement. For the insurance example, the problem is to integrate the data by customer.
When a new event arrives, it contains an identifier that associates it with a customer. If necessary, this is translated to a global customer identifier by means of a fast lookup, typically in a table kept in memory. The global customer identifier is then used to find the correct records in the customer-organized event database. Relevant information about the event is then added to the customer records.
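The steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ID_MAP` table, the `enrich` function, the event field names and the sample identifiers are all hypothetical, and a plain dictionary stands in for the customer-organized event database (HBase in the design discussed here).

```python
from collections import defaultdict

# Hypothetical in-memory lookup table:
# (source system, local identifier) -> global customer identifier.
ID_MAP = {
    ("call_center", "CC-1001"): "cust-42",
    ("website", "web-user-77"): "cust-42",
}

# Stand-in for the customer-organized event database.
# A real deployment would write to a low latency store instead.
customer_events = defaultdict(list)


def enrich(event):
    """Resolve the global customer id, then file the event under that customer."""
    key = (event["source"], event["local_id"])
    global_id = ID_MAP.get(key)
    if global_id is None:
        raise KeyError(f"no global customer id for {key}")
    # Keep only the information relevant to later decisions.
    record = {"ts": event["ts"], "source": event["source"], "type": event["type"]}
    customer_events[global_id].append(record)
    return global_id
```

With this sketch, a call center event and a website event that carry different local identifiers both land in the same customer's list, which is the integration "by customer" that the example calls for.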
This customer storage design leverages the schema flexibility of a NoSQL data model, which means that records that are logically related (rather than schema related) are stored next to each other with ease. Thus, the entire history of the customer – to the extent relevant for event processing – can be stored close together in one set of colocated records. NoSQL databases are designed to handle the wide variations, record-to-record, in record length and makeup that result. This capability is critical for providing very fast access to data related to real-time events.
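One common way to get this colocation in a store that keeps rows sorted by key, as HBase does, is to start each row key with the customer identifier. The sketch below is an assumption about how that might look, not a description of any particular production design: the `row_key` format, the `SortedStore` class and the field names are hypothetical, and the toy class only imitates sorted storage and prefix scans in memory.

```python
import bisect

MAX_TS = 10**13 - 1  # assumed upper bound for millisecond timestamps


def row_key(customer_id, ts_ms):
    # Customer id first, so one customer's rows sort next to each other.
    # Reversed, zero-padded timestamp, so newer events sort first.
    return f"{customer_id}#{MAX_TS - ts_ms:013d}"


class SortedStore:
    """Toy stand-in for a store that keeps rows sorted by key."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> record (records may differ in shape)

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan_prefix(self, prefix):
        # Jump to the first key >= prefix, then read while keys share the prefix.
        i = bisect.bisect_left(self._keys, prefix)
        out = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            out.append(self._rows[self._keys[i]])
            i += 1
        return out
```

Because the records are free-form dictionaries, an accident report and a website question can carry different fields and still sit side by side, and a single scan over the prefix `cust-42#` returns that customer's history without touching other customers' rows.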
In our example, when the second call comes in – asking for details of coverage under the customer’s policies – the customer event application sees the customer’s entire history without having to pull data from multiple systems, including the call a few minutes earlier in which the auto crash is described. This application can thus make an informed recommendation that no offer be presented when a recently upset customer calls.
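A minimal sketch of such a recommendation rule follows. The event type names, the seven-day quiet period and the `recommend` function are assumptions made for illustration; a real decision engine would likely weigh many more signals.

```python
# Hypothetical event types that signal a customer in distress.
DISTRESS_TYPES = {"accident_report", "injury_claim"}

# Assumed quiet period after a distress event: seven days, in seconds.
QUIET_PERIOD_S = 7 * 24 * 3600


def recommend(history, now_s, under_insured):
    """Choose an action from the customer's colocated event history."""
    for ev in history:
        age = now_s - ev["ts"]
        if ev["type"] in DISTRESS_TYPES and 0 <= age <= QUIET_PERIOD_S:
            # A recent upsetting event: help the customer, present no offer.
            return "assist_only"
    return "offer_coverage_increase" if under_insured else "assist_only"
```

Given a history that contains an accident report from a few minutes earlier, the function returns `assist_only` even for an under-insured customer; with no recent distress event, the same customer would receive the coverage offer.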
Not Only An Insurance Example
The example above does occur in the insurance industry, but it is hardly limited to insurance. Similar examples exist in every business that provides products or services to consumers. Event-based customer interaction occurs in retail, banking, telecommunication, travel, health care, home services and many more commercial and government sectors.
The case for near real time data integration is also not limited to interaction with customers. From an information technology point of view, similar requirements exist when manufacturers, retailers and distributors react to events in their supply chains or factories.
Virtually every business now confronts – or soon will confront – an ever greater demand to react rapidly and appropriately to frequent business events. That rapid reaction will rely on streaming data enrichment, such as we have illustrated above.
Like a B-movie vampire, data integration is the Hadoop requirement that won’t go away. It is on the critical path for everything from real time business event processing to data discovery to production batch analytics and data modeling. So, almost any analytic data problem you solve on Hadoop will require data integration.
Perhaps the newest such requirement in many companies is streaming data enrichment. The approach illustrated in this article is one of the most fundamental for real time data integration and should be considered when streaming applications are to be implemented. It is one way that a B-movie vampire, sometimes known as data integration or enrichment, can be vanquished in near real time, on Hadoop.