Harnessing Data in an Uncontrollable Big Data World
Where and how to manage “big data” warehouses, analytical data marts and resulting business insights continues to be a challenge despite a wealth of new Hadoop technologies and vendor capabilities. In fact, the landscape is more confusing than ever.
As data has been further democratized, analytical data marts are popping up across your organization like weeds. These data marts are driven by more mature open source technology and inexpensive cloud hosting providers, enabling small teams to build and run them closer to the businesses they serve.
Further complicating your data management landscape are the many "end point" operational systems, like Salesforce and Workday, which have extended their core operational functionality to import data from across the organization, creating even more analytical data marts. Whether hand-crafted or built through existing or extended vendor capabilities, these end point analytical marts become yet another place in your organization for data aggregation and advanced analytics.
The organizational placement of the technology and analytical teams building and running these systems has also seen significant shifts. Many IT organizations have some sort of centralized team that manages the traditional enterprise data warehouse, which has now transformed into ever-growing Hadoop data lakes or reservoirs. There are obvious benefits to having all of your data in a single place managed by one team; however, such teams frequently cannot scale to meet the pace of change required by multiple stakeholders.
In contrast, decentralized teams were formed as a result of narrower, business-unit-driven projects or objectives. They sit closer to their internal customers, who demand agility that central teams are typically unable to achieve. These more narrowly scoped projects deliver business insight quickly, but at the expense of a well-architected, broader data environment that can achieve greater scale across a larger set of internal customers. While the insight gained from these "fit for purpose" data marts is significant, the marts create a host of other "data" issues that are magnified when multiplied by an ever-increasing number of data marts. A few of these issues include:
• Data latency – data takes an ever-increasing number of hops and transformations before it resides in a place where it can deliver business value
• Data veracity – fit-for-purpose data marts bring agility in unlocking new insights, but confidence in the data is lost because metrics are not consistent across the marts
• Data quality and data lineage – a web of data marts, sometimes pulling data from each other through multiple hops and transformations, frequently loses data nuances and creates a change management nightmare, ultimately impacting data quality
• Replication of data – at times in large quantities, causing uncontrolled cloud and internal infrastructure costs
• Data governance – inability to create or enforce any type of data governance or compliance framework
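To make the "hops" and lineage concerns concrete, here is a minimal sketch that treats data marts as a dependency graph and counts how far a mart sits from its raw sources. The dataset names and the LINEAGE mapping are hypothetical, invented purely for illustration, and the sketch assumes the graph has no cycles.

```python
from collections import deque

# Hypothetical lineage map: each dataset lists the datasets it reads from.
# Raw sources read from nothing. Assumed to be acyclic.
LINEAGE = {
    "crm_source": [],
    "erp_source": [],
    "sales_mart": ["crm_source"],
    "finance_mart": ["erp_source", "sales_mart"],
    "exec_dashboard_mart": ["finance_mart"],
}


def hops_from_source(dataset, lineage):
    """Return the longest chain of hops between a dataset and any raw source."""
    upstream = lineage[dataset]
    if not upstream:
        return 0
    return 1 + max(hops_from_source(parent, lineage) for parent in upstream)


def upstream_datasets(dataset, lineage):
    """Return every dataset the given dataset depends on, directly or indirectly."""
    seen = set()
    queue = deque(lineage[dataset])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(lineage[current])
    return seen


dashboard_hops = hops_from_source("exec_dashboard_mart", LINEAGE)
dashboard_upstream = upstream_datasets("exec_dashboard_mart", LINEAGE)
```

In this made-up example the dashboard mart sits three hops from raw data, and a change to any of its four upstream datasets could affect it, which is the change management problem described above.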
The answer to these seemingly conflicting issues lies in an architectural vision and framework built on several key tenets:
• Centralize (most of) your data management – Embrace the data lake/reservoir architecture. Extract and load your data into a central Hadoop platform where most of the truly heavy data management lifting should be performed once, the same way, for all to share, ensuring the highest data quality, especially for the core or critical business metrics.
• Enable a semantic layer, where business users can create their own data views without building yet another data mart.
• Enable your business users with a data-blending tool, for out of the box data management work, while still keeping them in a more defined and governed environment.
• Data governance by design – Enable the concepts of data catalogs, certified data and data lineage. This must be built into your core workflows to ensure usage but also add value beyond pure data governance. The true value is access to highly consumable data assets and a compute platform that enables faster time to insight.
• Keep your analytics and data science closer to the businesses they serve. This is where the true magic comes to life when properly enabled with consumable, clean and certified data.
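As one way to picture "governance by design", the sketch below models a tiny data catalog in which a business-defined view is allowed only if every dataset it reads is registered and certified. The class names, fields, and dataset names are hypothetical and not taken from any particular catalog product; a real platform would enforce this inside its own workflow tooling.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One dataset known to the (hypothetical) catalog."""
    name: str
    owner: str
    certified: bool = False
    sources: list = field(default_factory=list)  # lineage: upstream dataset names


class DataCatalog:
    """Minimal in-memory catalog tracking certification and lineage."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def certify(self, name):
        self._entries[name].certified = True

    def certified_names(self):
        return sorted(n for n, e in self._entries.items() if e.certified)

    def can_build_view(self, sources):
        """A governed view may only read registered, certified datasets."""
        return all(
            s in self._entries and self._entries[s].certified for s in sources
        )


catalog = DataCatalog()
catalog.register(CatalogEntry("core_revenue", owner="central_data_team"))
catalog.register(
    CatalogEntry("adhoc_leads", owner="marketing", sources=["core_revenue"])
)
catalog.certify("core_revenue")

view_on_certified = catalog.can_build_view(["core_revenue"])
view_on_uncertified = catalog.can_build_view(["core_revenue", "adhoc_leads"])
```

The design choice illustrated here is that the check lives in the same path users take to create a view, so governance is exercised as a side effect of normal work rather than as a separate compliance step.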
This sort of transformation requires a high degree of organizational buy-in and collaboration, and it begins with a shared vision across the key stakeholders, who understand the obvious tradeoffs each approach brings. A sound, reusable data management framework should serve as the focal point for bringing scale to the insights your businesses demand. The hyper pace of technology and data science innovation, combined with the necessary organizational change, will certainly make for an interesting and fun journey.