The Platfora blog.

Data Lake – Or Data Landfill?


Establishing a “data lake,” where datasets of all shapes and sizes coexist in a harmonious elephantine reservoir, has become a compelling vision for many enterprise IT departments over the last eighteen months or so. Yet many organizations struggle to create true business value around the data lake: it’s not clear to executives how Hadoop can or will enable the business to engage in the marketplace more effectively.

There’s a good reason for this.  Most organizations take a “bottom-up” approach to establishing a data lake. “Build it and they will come” is the typical strategy of technology teams intent on establishing a Hadoop presence, but without a clear business-driven use case in mind.

Spinning up a cluster and getting data into it are nontrivial tasks in the world of Hadoop. Exhausted after this initial effort, many IT teams focus on using Hadoop to store large amounts of infrastructure-generated log data, the value of which is often not obvious to business users.  Even with obvious cost reductions associated with data-warehouse offload, six to nine months after initialization, the business value created by Hadoop is most frequently represented by a question mark.  No wonder Gartner states that Hadoop is entering the trough of disillusionment.

It doesn’t help that Hadoop is a complex technology stack, written by programmers, for programmers. There is no obvious interface for nontechnical users; the closest thing to one is Hive, which provides a lightweight SQL-like query interface. For someone without SQL chops, getting data out of a Hadoop cluster appears to be a daunting task, friendly only to developers who prefer the command line to drag-and-drop. The difficulty of getting insight out of Hadoop is another reason why the bottom-up approach to the Hadoop journey can dump unsuspecting executives into a data landfill that provides no apparent value to the business.

For those thinking “There must be a better way!” – there is. A more business-engaged approach to big data involves identifying one or more use cases designed to address specific problems core to business users. For example, in contrast to the data warehouse cost-reduction approach often used by IT, the Marketing, Sales, and Product Management functions are typically looking for better ways to connect advertising spend to brand engagement, expand pipeline, or understand product utilization, respectively. Data stored in a Hadoop cluster can deliver powerful insights into all of these questions, provided you are able to correlate the multi-structured data stored within and to offer business analysts an interface they can actually use.

Correlating transactional data, device data, and social media (for example) is no small feat. Existing analytical solutions are not designed to deliver this capability. The vast majority of them aggregate data into a relational database and point analysts to it via a visualization interface. While this approach is quite suitable for operational BI, it is insufficient for exploratory BI, for which correlating large quantities of multi-structured data is a primary requirement. Bolting a Hadoop back-end onto the relational datastore – in effect, creating a three-tiered architecture – won’t do the trick, because the end user is still working with aggregated data that hasn’t been correlated across more than a narrow, preconfigured silo of perhaps two discrete data sets. It also imposes workflow limitations that inhibit the analyst’s ability to iterate through many different possible views of the data and make changes on the fly; with an aggregated approach, the data must be re-modeled for each pass, adding weeks or months to the analyst workflow.

Bolting a spreadsheet interface onto a SQL-over-Hadoop solution also comes up short, because now you’re turning big data into small data, sipping it through an ODBC straw. This approach might work if you know the questions you want to ask Hadoop, but what if you don’t? Without the ability to engage interactively with data and go back in time, whether by months or even years, line-of-business analysts remain unable to gain insight in an exploratory fashion.

Platfora was designed to address this very issue. Users of Platfora engage directly with multi-structured data, discovering trends of which they were previously unaware. They do so without a complex workflow in which data modelers, ETL specialists, DBAs, and Hadoop gurus each address just one stage of the slice-and-dice analytical exercise. Because Platfora users are empowered to work directly with data at each stage of the game, insights can be delivered in hours, not months, and IT is freed up to focus on higher-value tasks.

Platfora customers like Citigroup, Disney, Autotrader, American Express, and many other leading companies have seen sufficient value from this approach to bake Platfora into their big data reference architectures, precisely because doing so enables strategic business value to be delivered immediately from a Hadoop-centric big data stack.

Engage your Platfora representative for a discussion on how Platfora can ensure your Hadoop journey results in a data lake, not a data landfill.