Real-time, not batch-time, analytics with Hadoop
January 15, 2016 analytics Hadoop

Today, we often hear the phrase “The 3 Vs” in relation to big data: Volume, Variety and Velocity. With the interest in, and popularity of, big data frameworks such as Hadoop, the focus has mostly been on volume and data at rest. Common requirements here are data ingestion, batch processing, and distributed queries. These are well understood. Increasingly, however, there is a need to manage and process data as it arrives, in real-time. There may be great value in the immediacy of that data and in the ability to act upon it very quickly. This is velocity and data in motion, also known as “fast data.” Fast data has become more important over the past few years because of the growth in endpoints that now stream data in real-time.

Big data + fast data is a powerful combination. It is adding real-time analytics to the mix, however, that delivers the business value. Let’s look at a real example, originally described by Scott Jarr of VoltDB.

Consider a company that builds systems to manage physical assets in precious metal mines. Inside a mine, there are sensors on miners as well as shovels and other assets. For a lost shovel, minutes or hours of reporting latency may be acceptable. However, a sensor on a miner indicating a stopped heart should require immediate attention. The system should, therefore, be able to receive very fast data.

Often, data events don’t exist in isolation. For example, we may not be concerned that an expensive piece of equipment has wandered outside its normal zone if there is a work order on it and it is moving to the repair depot. In this case, we can make a smart decision on a sensor event because the decision draws on other data already in our system. This is an example of combining big data with fast data.
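As a rough sketch of that kind of context-aware decision, the Python below checks a zone-exit event against stored work orders before raising an alert. The event fields, the `open_work_orders` set, and the `should_alert` helper are hypothetical names chosen for illustration; a real system would look this context up in its own data store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZoneEvent:
    """A single sensor event (hypothetical shape, for illustration only)."""
    asset_id: str
    current_zone: str
    normal_zone: str


# Assumed stored context: IDs of assets that have an open repair work order.
open_work_orders = {"drill-17"}


def should_alert(event: ZoneEvent, work_orders: set) -> bool:
    """Alert only when an asset is out of its zone with no work order to explain it."""
    if event.current_zone == event.normal_zone:
        return False  # asset is where it belongs
    # The fast event is judged against slower-moving stored data.
    return event.asset_id not in work_orders


if __name__ == "__main__":
    print(should_alert(ZoneEvent("drill-17", "tunnel-b", "tunnel-a"), open_work_orders))
    print(should_alert(ZoneEvent("drill-99", "tunnel-b", "tunnel-a"), open_work_orders))
```

The point of the sketch is only that the decision function takes both the incoming event and existing data as inputs; how that existing data is stored and queried is left open.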

Data are also very valuable when we use real-time analytics to count, aggregate, trend, and so on. In our example, we can analyze data in real-time for two distinct purposes:

  1. We want to see a real-time representation of the mine via a dashboard. This dashboard would show us the number of active sensors, how many items were outside of their normal zone, equipment utilization efficiency, and so on.
  2. We want to use real-time analytics for automated decision-making. For example, if a reading from a sensor on a miner showed low oxygen for an instant, this could be an anomaly. However, if the system detected a rapid drop in oxygen over several minutes for multiple miners working in the same area, that could be an emergency requiring immediate attention.
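The oxygen scenario can be sketched as a sliding-window check. The Python below is a minimal illustration, not a description of any production system: the class name `OxygenMonitor`, the five-minute window, the drop threshold, and the two-miner minimum are all assumed values, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300   # assumed meaning of "several minutes"
DROP_THRESHOLD = 2.0   # assumed drop, in percentage points of oxygen
MIN_MINERS = 2         # assumed meaning of "multiple miners"


class OxygenMonitor:
    """Keeps a short window of readings per miner and checks areas for a shared drop."""

    def __init__(self):
        self._readings = defaultdict(deque)  # miner_id -> deque of (timestamp, oxygen_pct)
        self._area = {}                      # miner_id -> most recently reported area

    def record(self, miner_id, area, timestamp, oxygen_pct):
        """Store one reading and return True if the miner's area now looks like an emergency."""
        self._area[miner_id] = area
        self._readings[miner_id].append((timestamp, oxygen_pct))
        return self.area_in_emergency(area, timestamp)

    def _drop(self, miner_id, now):
        """Oxygen drop between the oldest and newest readings still inside the window."""
        window = self._readings[miner_id]
        while window and now - window[0][0] > WINDOW_SECONDS:
            window.popleft()  # discard readings older than the window
        if len(window) < 2:
            return 0.0
        return window[0][1] - window[-1][1]

    def area_in_emergency(self, area, now):
        dropping = [
            miner for miner, miner_area in self._area.items()
            if miner_area == area and self._drop(miner, now) >= DROP_THRESHOLD
        ]
        return len(dropping) >= MIN_MINERS
```

In this sketch a low reading from a single sensor never triggers an emergency on its own; the alert fires only when at least `MIN_MINERS` miners in the same area show a drop within the window. A real deployment would tune these thresholds and would likely also require the drop to persist across several consecutive readings.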

Physical asset management in a mine is a real-world use-case to illustrate what is needed from various systems that manage fast data. However, the same pattern exists for distributed denial of service (DDoS) detection, log file management, optimizing advertisement placement, and so on. When dealing with fast data we need to:

  • Ingest data in a way that makes the data accessible and fast
  • Make a decision on each event at the point of its highest value, which is as soon as it arrives
  • Analyze data in real-time to enable automated decision-making and create human-readable dashboards
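A minimal sketch of how these three needs might fit together for a single event is shown below. Everything here is hypothetical: the `handle_event` function, the event dictionary layout, and the use of an in-memory list and `Counter` as stand-ins for a real data store and dashboard backend.

```python
from collections import Counter


def handle_event(event, store, dashboard):
    """Process one event: ingest it, decide on it, then update running aggregates."""
    # 1. Ingest: keep the event so later queries can reach it (stand-in for a database).
    store.append(event)

    # 2. Decide: act on the event immediately (assumed rule: a zero heart rate is an emergency).
    alert = None
    if event["kind"] == "heart_rate" and event["value"] == 0:
        alert = f"EMERGENCY: {event['sensor_id']}"

    # 3. Analyze: maintain counters that a dashboard could read at any time.
    dashboard["events_total"] += 1
    dashboard[f"events_{event['kind']}"] += 1
    if alert is not None:
        dashboard["alerts_total"] += 1
    return alert


if __name__ == "__main__":
    store, dashboard = [], Counter()
    stream = [
        {"sensor_id": "shovel-3", "kind": "location", "value": "tunnel-a"},
        {"sensor_id": "miner-4", "kind": "heart_rate", "value": 0},
    ]
    for event in stream:
        alert = handle_event(event, store, dashboard)
        if alert:
            print(alert)
    print(dict(dashboard))
```

The design choice worth noting is that all three steps happen per event, as it arrives, rather than in a later batch pass over stored data.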

In short, companies today understand that getting the full worth of their information means extracting value from data at every point in its lifecycle. The mining example operates at human time scales of seconds (a life-threatening event), minutes (stolen equipment?) and hours (we’ll need that shovel back by tomorrow). Companies like Emagine, a digital marketing agency, are blending fast and big data at sub-second scales, with mobile platforms that need to respond in less than 250 milliseconds, less than the blink of an eye.