Platfora, Impala, and the Future of Hadoop

Roll back a mere six months and put yourself in the shoes of an IT director looking for a platform to store and analyze large volumes of data. You’d have a stark choice:

Option 1 — buy into the traditional MPP database approach of Teradata, Oracle Exadata, etc., and spend 12+ months modeling and implementing schemas, ETL pipelines, aggregation jobs, maintenance scripts and more. It is expensive to purchase and, once implemented, rigid and hard to evolve; on the plus side, it provides standard SQL support and good performance. Let’s call this the “Legacy Database Option.”

Option 2 — store the data in Hadoop, and query it directly using MapReduce, Pig and Hive (Hadoop’s SQL-like query layer). This option is low cost, scalable and requires no decisions or modeling up front — just write raw data files in any format and figure out how you want to use the data later. However, querying it down the road is complex and requires highly specialized IT staff. Keeping track of what data you have in the cluster can be just as hard, and performance is inadequate: each question can take minutes or hours across tens or hundreds of nodes. Let’s call this one the “Hadoop-Centric Option.”
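
To make that complexity concrete, here is a minimal sketch of what answering even one simple question (total revenue per region) looks like as a hand-written Hadoop Streaming job in Python. The file layout, column positions and invocation are all invented for illustration; in a SQL database, the same question is a one-line GROUP BY.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming job: total revenue per region.
# In SQL this is one line (SELECT region, SUM(revenue) ... GROUP BY region),
# but against raw files in HDFS it becomes a mapper, a reducer, and a
# hadoop-streaming invocation to wire them together.
import sys

def mapper():
    # Assumes comma-separated sales records with region in column 2
    # and revenue in column 5 (a made-up layout for illustration).
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        if len(fields) > 5:
            print("%s\t%s" % (fields[2], fields[5]))

def reducer():
    # Hadoop delivers mapper output sorted by key, so we can stream-sum.
    current_region, total = None, 0.0
    for line in sys.stdin:
        region, value = line.rstrip("\n").split("\t")
        if region != current_region:
            if current_region is not None:
                print("%s\t%.2f" % (current_region, total))
            current_region, total = region, 0.0
        try:
            total += float(value)
        except ValueError:
            pass  # raw data is messy; skip unparsable rows
    if current_region is not None:
        print("%s\t%.2f" % (current_region, total))

if __name__ == "__main__":
    # One script serves as both phases, e.g. (illustrative invocation):
    #   hadoop jar hadoop-streaming.jar -mapper 'job.py map' \
    #       -reducer 'job.py reduce' -input /sales -output /out
    mapper() if sys.argv[1:] == ["map"] else reducer()
```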

Both options fail to deliver what customers tell us they want now: rapid (but secure) self-service exploration and analysis that lets business users answer questions spanning disparate datasets in unanticipated ways. The solution must be consistently fast, responding to user queries in sub-second time frames and fast enough to drive today’s powerful data discovery tools.

Impala solves part of the problem

Inspired by Google’s Dremel and F1 work, Impala is an open source, Cloudera-led project that speeds up SQL on Hadoop by bypassing MapReduce and layering an MPP database-style execution engine on top of HDFS. In many ways a hybrid of Options 1 and 2, it runs Hive-style queries roughly 5-10x faster. It’s a clear shot across the bow of the traditional MPP database vendors. Impala will lag behind them on raw performance, but it has the potential to replace them by making ad hoc queries on smaller datasets nearly interactive (seconds instead of minutes).

The success of Impala will be great for the Hadoop ecosystem. However, it takes us back into the land of Option 1 (the Legacy Database), with the need for DBAs to manage transformation and maintenance jobs, design and implement aggregations, tune performance, etc.

Impala is quick at querying small amounts of data, but the laws of physics dictate that querying terabytes or petabytes of data is slow. It is slow even with the highly tuned I/O subsystems of Teradata and Oracle Exadata, and far slower in any Hadoop-centric architecture like Impala that must pull unoptimized raw files from HDFS. This isn’t news to any DBA. In response, they manually build and maintain ‘aggregate’ tables — much smaller tables of rolled-up data that can be queried in seconds but lack fine-grained detail — and instruct their data analysts to use them rather than the big, slow tables that clog up their systems when queried.
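
As a toy illustration of that roll-up pattern (all field names and the chosen grain are invented for the example), here is the essence of an aggregate table in Python: raw per-transaction rows collapse to one row per (day, region), which is fast to query but discards the customer- and product-level detail an analyst may later need.

```python
# A minimal sketch of the 'aggregate table' pattern a DBA maintains by hand.
# Raw detail (one row per transaction) is rolled up to one row per
# (day, region), so dashboards scan thousands of rows instead of billions,
# at the cost of losing per-transaction detail.
from collections import defaultdict

raw_sales = [
    # (day, region, customer_id, product_id, revenue) -- illustrative schema
    ("2013-03-01", "EMEA", 101, "A17", 120.0),
    ("2013-03-01", "EMEA", 102, "B02", 80.0),
    ("2013-03-01", "APAC", 103, "A17", 45.0),
    ("2013-03-02", "EMEA", 101, "C11", 200.0),
]

aggregate = defaultdict(lambda: {"revenue": 0.0, "orders": 0})
for day, region, _customer, _product, revenue in raw_sales:
    key = (day, region)                      # the chosen roll-up grain
    aggregate[key]["revenue"] += revenue     # pre-summed measure
    aggregate[key]["orders"] += 1

# Fast to query, but customer and product are gone: an analyst who wants
# to slice by product must ask the DBA for a new aggregate and wait.
for (day, region), totals in sorted(aggregate.items()):
    print(day, region, totals["revenue"], totals["orders"])
```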

But the presumption that a DBA knows up front what data will be important is usually wrong, and it creates a never-ending loop of change orders between analysts and the DBAs building the aggregates. This is exactly the challenge popular BI tools face when they operate against big data. If a desktop BI user finds interesting data that they want to drill into, or use as a dimension to slice a different dataset, they must go back to their DBAs for new aggregations that include that data, a tedious process that stalls their work.

Worse, if a desktop BI user hits the wrong tables (i.e. the raw data), or submits a complex query, they can chew up vast amounts of cluster resources and dramatically impact other users. This is not the scalable big-data architecture of the future, and it is exactly the painful world that every customer we talk to is trying to escape.

Platfora makes Hadoop consistently fast and self-service

Here is where Platfora comes into the picture. Our platform instantly turns raw data in Hadoop into interactive in-memory business intelligence. Platfora connects in minutes to any Hadoop distribution and automatically generates MapReduce jobs (with Impala acceleration on the roadmap) to build and maintain scale-out, in-memory aggregates.

Our scale-out middle tier is simultaneously an ‘aggregate cache’ of the data below and a lightning-fast in-memory analytical query engine for the users above. It holds ‘lenses’ — automatically materialized data marts that roll up raw Hadoop data and can be refined at the click of a button (using our Fractal Cache™ technology) to hold whatever level of detail is most interesting to users at any point in time. Performance is consistently sub-second, and the middle tier offloads work from the Hadoop cluster, turning it into a scalable enterprise resource.
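
To show the refinement idea in miniature (a toy sketch of the concept, not Platfora’s actual implementation), the same raw data can simply be re-aggregated at a new grain when a user adds a dimension, rather than waiting on a DBA change order:

```python
# Toy illustration of refine-on-demand: rebuild an aggregate ('lens')
# at a finer grain when a user adds a dimension with a click.
from collections import defaultdict

def build_lens(raw_rows, dimensions):
    """Roll raw rows up to the requested dimensions, summing revenue."""
    lens = defaultdict(float)
    for row in raw_rows:
        key = tuple(row[d] for d in dimensions)
        lens[key] += row["revenue"]
    return dict(lens)

raw = [  # illustrative records, standing in for raw files in Hadoop
    {"day": "2013-03-01", "region": "EMEA", "product": "A17", "revenue": 120.0},
    {"day": "2013-03-01", "region": "EMEA", "product": "B02", "revenue": 80.0},
    {"day": "2013-03-01", "region": "APAC", "product": "A17", "revenue": 45.0},
]

# Initial lens: day x region. Queries against it are in-memory lookups.
lens = build_lens(raw, ("day", "region"))

# The user spots something interesting and adds 'product'; the lens is
# rebuilt at the finer grain from the raw data, with no change order.
lens = build_lens(raw, ("day", "region", "product"))
print(lens)
```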

The front end is a completely web-based (HTML5 Canvas) exploratory BI framework, in the spirit of modern data discovery tools but natively built for Hadoop. Now users can interactively explore and visualize, build dashboards, collaborate and tell data stories seamlessly against any volume and diversity of Hadoop datasets. The front end ties directly into the middle tier, giving analysts the first closed-loop exploratory framework that lets them reshape their aggregations or add dimensions in a self-service manner, without IT involvement.

This is the future

Let’s not pine for Hadoop to emulate Option 1, with DBAs architecting and managing every aspect of the data warehouse — constantly modeling, tweaking, and maintaining systems for better performance while hopelessly guessing at the needs of their business users. Nor should we put up with Option 2, which is slow, complex and inconsistent in performance. The future can truly be better than the past — a world of consistently fast, scalable, and modern business analytics and BI for all of your data.