Apache Spark is a rapidly evolving open-source engine for large-scale data processing and analytics. It has been in development for a number of years at UC Berkeley’s AMPLab, and is now being driven by Databricks, a Berkeley spin-out co-founded by Ion Stoica and Matei Zaharia. It is also reaching a level of maturity that moves it beyond pure experimentation, with a stable 1.0 release imminent and inclusion (current or planned) in all major Hadoop distributions.
There’s a good reason for all of the interest. Traditionally, Hadoop was the combination of a distributed file system (HDFS) with a specialized parallel processing framework (MapReduce). Hadoop’s MapReduce is very scalable and fault-tolerant, but it has some drawbacks:
- It achieves its fault-tolerance by inefficiently saving all intermediate results to disk, which can make it slow.
- It is difficult to understand and program.
- It has only one processing style (map, then reduce), which suits some problems better than others. Technologies such as Hive, which brings SQL-like queries to Hadoop, layer on top of MapReduce and must work within these limitations.
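To make the "map, then reduce" style concrete, here is a minimal sketch of a word count written in that shape. It is plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in memory on one machine, whereas real MapReduce distributes each phase across a cluster and writes intermediate results to disk.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's list of values into a single sum."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


def word_count(lines):
    """Run the three phases in the fixed order: map, shuffle, reduce."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    sample_lines = ["spark and hadoop", "hadoop and hdfs"]
    print(word_count(sample_lines))
```

Word count fits this shape neatly. Work that needs several passes over the same data generally has to be expressed as a chain of separate map and reduce jobs, which is where the single processing style starts to feel restrictive.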
More recently, the industry has seen a number of newer analytic and SQL-on-Hadoop engines emerge, and these use a range of custom approaches to process data in HDFS directly, bypassing MapReduce.
This seems like a good thing, but the proliferation of vendor-specific processing layers for Hadoop, each with its own strengths and limitations, highlights the need for a best-of-all-worlds, industry-standard layer. That is the role that Apache Spark will play.
We’re hearing Spark come up with almost every customer we speak to today. Most are still in learning mode, and a few are actively running experiments in the lab, but the interest is real and building fast.
They love that it is faster than classic MapReduce, that it is vendor-neutral and supported on all Hadoop distributions, that it provides a simple programming model, and that it is architected to support a wide variety of processing models (it includes extension libraries for machine learning, graph processing, and a simple SQL-like interface).
At Platfora, we’ve found ways to push MapReduce hard, and we’ve woven it natively into our scale-out in-memory processing model with a unique closed-loop architecture.
This gives analysts a workflow to analyze petabytes of data and get sub-second performance against the ‘lenses’ they are interested in. We’ve found this to be by far the best way to put massive amounts of data in the hands of analysts without long IT bottlenecks.
We’re thrilled to be working with Spark (and Databricks) to further accelerate this architecture. We expect to get speedups of 10x or more against raw data by using the new industry standard for low-level Hadoop processing.
We’re heavily committed to Spark and to its maturation to production-grade commercial levels over the coming 9-12 months, and we will be releasing Platfora with certified Spark support in the coming months.
Lastly, let me address a question that we’ve heard a few times: is the emergence of Spark good for Platfora, or does it erode our lead by making it easier to emulate what we do? It is the former. By making the Hadoop ecosystem stronger, Spark leaves us better able to serve the needs of more customers and to deliver results with even higher performance.
Spark itself is great, but it is a low-level development environment and an entirely different animal from Platfora. Platfora is a complete, visual, interactive, full-stack solution that lets business analysts get productive against large volumes and varieties of data in hours instead of weeks or months. Spark makes our differentiation even stronger, and we’re embracing it wholeheartedly.
Contact us today to learn more about Platfora and our coming support for Spark.