Apache Hadoop is a mature development framework. Together with its large ecosystem and the support and contributions of key players such as Cloudera, Hortonworks, and Yahoo, it gives organizations many tools for managing data of varying sizes.
In the past, Hadoop’s batch-oriented MapReduce model was sufficient to meet the processing needs of many organizations. However, demand for faster data processing has grown, driven by developments in streaming technologies, the Internet of Things (IoT), and real-time analytics, among others. These demands require new processing models. One technology that meets them, and that is gaining considerable interest and widespread support, is Apache Spark. Spark’s speed and versatility make it a key part of today’s big-data processing stack in industries from energy to finance.
Spark is an open source, general-purpose computational framework that is more flexible than MapReduce. Spark brings to Hadoop the productivity of functional programming combined with the speed of in-memory data processing. For example, as shown in Figure 1, in a Logistic Regression performance test, Spark running in memory was about two orders of magnitude (roughly 100 times) faster than Hadoop MapReduce.
Figure 1: Logistic Regression Performance Test. Image source: Apache Spark, used with permission.
Some of the key characteristics of Spark include:
- It leverages distributed memory.
- It supports full Directed Acyclic Graph (DAG) expressions for data parallel computations.
- It improves the developer experience.
- It provides linear scalability and data locality.
- It supports fault tolerance.
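Three of the characteristics listed above (a DAG of transformations, in-memory reuse, and fault tolerance) can be illustrated with a small toy model. The sketch below is plain Python and does not use Spark or its API. The `LazyDataset` class and its `evict` method are hypothetical names invented for this illustration, and everything runs on a single machine rather than in distributed memory. It shows only the general idea: transformations are recorded rather than executed, an action triggers the computation, a cached result is reused, and a lost result is rebuilt from the recorded lineage.

```python
class LazyDataset:
    """Toy, single-machine stand-in for a Spark-like dataset (hypothetical, not Spark's API).

    Each instance is one node in a DAG of transformations.
    """

    def __init__(self, source=None, parent=None, op=None):
        self._source = source        # root data; only set on the root node
        self._parent = parent        # upstream node (the lineage edge)
        self._op = op                # function applied to the parent's rows
        self._cache_enabled = False
        self._cached = None
        self.compute_count = 0       # how many times this node was actually computed

    # Transformations: record a new DAG node, do no work yet.
    def map(self, fn):
        return LazyDataset(parent=self, op=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LazyDataset(parent=self, op=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        """Ask for this node's result to be kept in memory once computed."""
        self._cache_enabled = True
        return self

    def evict(self):
        """Simulate losing the in-memory copy, for example after a worker failure."""
        self._cached = None

    def _compute(self):
        if self._cached is not None:
            return self._cached      # reuse the in-memory result
        self.compute_count += 1
        if self._parent is None:
            rows = list(self._source)
        else:
            rows = self._op(self._parent._compute())  # walk the lineage upstream
        if self._cache_enabled:
            self._cached = rows
        return rows

    # Actions: these trigger the actual computation.
    def collect(self):
        return list(self._compute())

    def count(self):
        return len(self._compute())


# Build a two-step pipeline; nothing is computed at this point.
numbers = LazyDataset(source=range(10))
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n).cache()

first_result = evens_squared.collect()   # first action: computes and caches
row_count = evens_squared.count()        # second action: served from the cache

evens_squared.evict()                    # simulate a lost in-memory copy
recovered = evens_squared.collect()      # rebuilt from the recorded lineage
```

In this toy model the second action does not recompute anything, and the result after `evict()` matches the first one because it is rebuilt from the same chain of recorded transformations. Spark applies the same ideas across a cluster, which this sketch does not attempt to model.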
Spark offers benefits to many different types of users. Developers benefit from its support for popular programming languages such as Java, Python, and R, while data scientists benefit from its support for machine learning (ML) through its own distributed ML library (MLlib).
There is also a large and growing list of third-party packages for Spark. These enable integration with a wide variety of other tools, environments, frameworks, and languages, adding complexity along with capability.
Spark use cases in production include:
- A large technology company, where Spark is used for search personalization based on new machine learning investigations.
- A financial system that processes millions of stock positions and future scenarios in a matter of hours, where the same work previously took nearly a week using Hadoop MapReduce.
- Genomics research in an academic environment.
- Video systems, where Spark and Spark Streaming are used for both streaming and analysis.
- Health care, where Spark is used for predictive modeling of disease conditions.
While these examples give a sense of the breadth of problems being successfully tackled with Spark, it is essential to optimize a Spark architecture for the given use case. In short, as powerful as Spark can be, it remains complex. To get the best out of Spark, it therefore needs to be an integrated part of a broader Hadoop-based data management platform. Furthermore, to benefit from real-time or predictive analytics, it is vital to optimize the entire data supply chain.