How To Conquer Your Dataflow Chaos
August 14, 2017


Imagine you’re the owner of a factory. Absent a supply chain management system or industrial controls, you rely on your customers to find and fix your delivery and quality problems. It’s a crazy scenario, yet it’s the model used by many enterprises to deal with big and fast data. It often leads to failures in data timeliness, completeness and accuracy.

Scenarios in which this process might suffice include publishing periodic BI reports based on stable sources like transaction databases. But as we move to a world of instant analysis and automated action, the speed, size and quirkiness of big and fast data require a focus on data operations.

Meet Dataflow Chaos

Over the last several years, the risk within modern data architectures has shifted from data at rest to data in motion. The conversation is moving from cheap and fast storage and analytics to guaranteeing the continuous delivery of timely and accurate data to drive modern apps and decision making.

The adoption of Hadoop, NoSQL databases and other big data stores on premises or in the cloud is going strong. But it is coupled with stories about the difficulty of delivering business value, with a key impediment being the lack of a repeatable and reliable process for continuously flowing high-quality data from source to use.

Three factors have emerged to accelerate the complexity for data in motion:

  • Data sprawl: We have seen an accelerated move from centralized monolithic systems to fragmented and loosely coupled systems, islands of specialized technology with their own operational requirements and characteristics. This diversity increases the operational challenge by orders of magnitude, whether it comes from managing elastically spawned microservice containers or from managing the intertwined upgrade paths of each infrastructure component. Continually managing the logistics of this data movement is a hurdle that must be overcome to put these systems to good use.
  • Data drift: Unexpected changes to the schema and semantics undermine the quality of incoming data and wreak havoc with applications that rely on that data. Data drift stems from the diversity of new digital systems — from IoT sensors to web clickstreams — that emit the data. These systems are almost always outside the control of the data consumer, and changes are often made upstream without notification. This is a critical side effect of modern infrastructure. It is a fallacy to think that one can control data drift.
  • Data urgency: Business users expect data to be put to use immediately after its creation. This opens up opportunities such as real-time personalization or fraud detection where minutes matter, but it also puts the data architecture under immense pressure, greatly reducing the margin for error. The flip side of data urgency is the threat of falling so far behind in processing data that dropping unprocessed data becomes the only way to recover.
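To make data drift concrete, here is a minimal Python sketch of a check that compares an incoming record against the schema a consumer expects. The field names, the `EXPECTED_SCHEMA` mapping and the `detect_drift` helper are all hypothetical, chosen only for illustration; they do not come from any particular product or from the original article.

```python
from typing import Any

# Hypothetical schema the consumer expects: field name -> Python type.
EXPECTED_SCHEMA = {"device_id": str, "temperature": float, "timestamp": int}


def detect_drift(record: dict[str, Any], expected: dict[str, type]) -> dict[str, list[str]]:
    """Compare one record against the expected schema and report differences."""
    # Fields the consumer relies on that the record no longer carries.
    missing = sorted(name for name in expected if name not in record)
    # Fields the producer added or renamed without notice.
    unexpected = sorted(name for name in record if name not in expected)
    # Fields that are present but carry a different type than expected.
    wrong_type = sorted(
        name
        for name, expected_type in expected.items()
        if name in record and not isinstance(record[name], expected_type)
    )
    return {"missing": missing, "unexpected": unexpected, "wrong_type": wrong_type}


# Assumed upstream change: "temperature" was renamed to "temp_c" and is now a string.
drifted = {"device_id": "sensor-7", "temp_c": "21.5", "timestamp": 1502668800}
report = detect_drift(drifted, EXPECTED_SCHEMA)
print(report)
# {'missing': ['temperature'], 'unexpected': ['temp_c'], 'wrong_type': []}
```

A check like this does not prevent drift, since the source systems sit outside the consumer's control. It only surfaces the change at the point of ingestion rather than in a downstream application.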

Together, data sprawl, data drift and data urgency create a burden in overseeing data movement across the enterprise. The complexity in each layer combines to create dataflow chaos, a situation where data engineers, enterprise architects and data scientists live in reactive mode, having to respond to unexpected outages or doubt in the data. This imperils trust in the applications the data serves.

Managing Dataflow Performance, Centrally

Today’s data movement technologies are ill-equipped to deal with these new dimensions of complexity. The old-world approach of developer-centric tooling and ad hoc custom coding in the face of data sprawl, data drift and data urgency leads to firefighting in order to keep critical data-driven applications fed with current, complete and quality data. It can also lead to the abandonment of otherwise great ideas for new applications.

Many enterprises institute controls that penalize innovation or attempt to solve the problem of data trust and availability for each application ad hoc. This creates multiplicative efforts that become cost-prohibitive to maintain. A better approach is to manage data movement as a focused and disciplined operational practice — dataflow performance management (DPM).

DPM is analogous to the performance-oriented disciplines that have taken hold in other areas of technology, from network performance management and application performance management to security information and event management. Like DPM, these methodologies were developed to deal with multidimensional complexity and to provide holistic management across numerous point systems and silos, replacing operational blindness and chaos with visibility and control.

DPM spans four phases of the dataflow life cycle: develop, operate, remediate and evolve. First, a dataflow topology must be developed; that is, it must be programmed, tested and placed into production. Once in operation, the dataflow is monitored to ensure it meets specified performance goals. While in operation, two feedback paths trigger changes. First, issues crop up due to data drift, data sprawl or logic bugs, which require remediation. Second, proactive and planned evolution of the dataflows supports new requirements. There is some level of development rework for both types of changes.
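As a rough illustration of the operate phase, the Python sketch below checks a snapshot of dataflow metrics against stated performance goals. The metric names, the thresholds and the `check_goals` helper are assumptions made for this example, not part of any DPM standard or tool.

```python
from dataclasses import dataclass


@dataclass
class PerformanceGoals:
    """Hypothetical goals for one dataflow."""
    max_lag_seconds: float       # timeliness: how stale delivered data may be
    min_records_per_minute: int  # completeness proxy: expected volume
    max_error_rate: float        # accuracy proxy: tolerated share of bad records


@dataclass
class DataflowMetrics:
    """Hypothetical metrics observed over the last minute."""
    lag_seconds: float
    records_per_minute: int        # all records seen, good and bad
    error_records_per_minute: int  # records rejected or routed to error handling


def check_goals(metrics: DataflowMetrics, goals: PerformanceGoals) -> list[str]:
    """Return the names of the goals the dataflow currently violates."""
    violations = []
    if metrics.lag_seconds > goals.max_lag_seconds:
        violations.append("lag")
    if metrics.records_per_minute < goals.min_records_per_minute:
        violations.append("throughput")
    total = metrics.records_per_minute
    error_rate = metrics.error_records_per_minute / total if total else 0.0
    if error_rate > goals.max_error_rate:
        violations.append("error_rate")
    return violations


goals = PerformanceGoals(max_lag_seconds=60.0, min_records_per_minute=1000, max_error_rate=0.01)
metrics = DataflowMetrics(lag_seconds=95.0, records_per_minute=1200, error_records_per_minute=30)
violations = check_goals(metrics, goals)
print(violations)  # ['lag', 'error_rate']
```

In this sketch, a non-empty result would feed the remediate path, while changing the goals themselves would belong to planned evolution.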

To execute a DPM approach, enterprises should create a central organizational structure — a data operations center. Similar to an NOC for network operations or an SOC for security operations, a data operations center provides a single point of accountability and control, helping the enterprise move from silos to a well-governed, efficient, comprehensive operation.

The data operations center serves as the center of excellence for managing data, whether at rest or in motion, and coordinates across functions and business units to implement best practices for DPM. It provides a single point of accountability for organization-wide data performance. As data moves across shared resources that connect a web of applications and systems, the data operations center is the team that establishes best practices, ensures day-in/day-out performance and operationalizes business process performance into data performance.

Creating a data operations center helps an organization move from an ad hoc “hairball of data movement pipelines” to a well-governed and comprehensive data operation, allowing the business to master the life cycle of its data movement and leverage more of its data more quickly, more broadly and more efficiently.

This article was originally published on www.forbes.com, where it can be viewed in full.