Other times, I put too much pasta in the pot for the amount of water I have, and end up with a mess on my hands because it didn’t all cook evenly. What’s really funny is that I oscillate between the two extremes, over-compensating for my last failure the next time I make spaghetti for dinner. You’d think I’d figure it out, but every time I am completely confident that I’ll get it right, and I never do.
As it turns out, companies have the same kinds of trouble getting started with Big Data as I do cooking pasta. Either they hear the word ‘Big’ in Big Data and think they have to start by loading all of their data into Hadoop (too much water), or they think they need to solve huge problems (too much pasta) to make it worth their investment. Of course, neither of these things is true: there is a lot of value in doing some relatively simple things with Hadoop, assuming you start out the right way.
The best way to start this journey is with a use-case, and then work backwards from there.
The use-case should drive things like the initial size of your Hadoop cluster, what ecosystem tools you use, and the data sources you’ll need. Here’s an exercise you can try before you embark on your foray into Big Data:
Take out a pen and some paper, and write down as many questions about your business as you can that you can’t answer today. It doesn’t matter how crazy they seem. Then take that list, pick out the five most valuable ones, and ask yourself three questions about each of them:
- What kinds of data would I need to look at to do my analysis?
- Do I actually have access to all that data today?
- What would make it worth spending some time and money on it, if I know I can get my answer?
It’s okay if you need some help answering some of those follow-on questions. As an example, let’s consider the case of “Why do prospects come to my website, start to sign up for my service, but never complete their registration?” for a fictional Internet company. Let’s take the questions above and think it through.
What kinds of data would I need to look at to figure this out? Well, I would need to start with clickstream and registration data for my website. I should probably also look at my weblogs to make sure that prospects aren’t running into technical difficulties with my site, and if I have it, I’d like to include data from my call-center as well, for times when a prospect actually reaches back out to us to discuss whatever it was that caused them to abort their registration.
Do I actually have access to all that data today? Well, I have weblog and registration data from my website, and we already pay a company to provide us with clickstream data to manage our advertising spend, so I have that. I outsource my call-center to a company that specializes in that kind of work, but I reached out to them and they were happy to provide us with the raw data from their call management and ticketing systems for calls related to my company.
What would make it worth spending some time and money on it, if I know I can get my answer? Well, if I can get 5% more of the people who start to register on my site to sign up for my services, my company will make an additional $10M/year, so definitely!
If you do this exercise for your top five questions, you’ll probably have at least one where the value is there, and you have the data to approach the problem. From there, you’re off to the races, because instead of some Big Data science project that you aren’t sure will ever provide any value, you’re solving a real problem that you know is worth the investment!
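If you’d rather keep the list in a script than on paper, here is one rough way to sketch the exercise in Python. The registration example and its $10M/year figure come from the walkthrough above; the class, the function names, and the cost figure are made up purely for illustration, so treat them as placeholders for your own estimates.

```python
from dataclasses import dataclass


@dataclass
class Question:
    """One business question from the pen-and-paper list."""
    text: str
    data_needed: set[str]      # answer to "what kinds of data would I need?"
    data_available: set[str]   # answer to "do I have access to it today?"
    est_annual_value: float    # your own estimate, in dollars per year

    def missing_data(self) -> set[str]:
        # Data sources the analysis needs but that you can't get today.
        return self.data_needed - self.data_available


def worth_pursuing(q: Question, est_annual_cost: float) -> bool:
    # A question qualifies when no data is missing and the estimated
    # value is larger than what you expect to spend answering it.
    return not q.missing_data() and q.est_annual_value > est_annual_cost


# The worked example from the text.
registration = Question(
    text="Why do prospects start to sign up but never complete registration?",
    data_needed={"clickstream", "registration", "weblogs", "call-center"},
    data_available={"clickstream", "registration", "weblogs", "call-center"},
    est_annual_value=10_000_000,
)

# Hypothetical cost figure, not from the article; substitute your own.
HYPOTHETICAL_ANNUAL_COST = 1_000_000

print(worth_pursuing(registration, HYPOTHETICAL_ANNUAL_COST))
```

Run it over your top five questions, and whichever ones come back as worth pursuing are your candidate use-cases.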
At Platfora, we specialize in taking a use case just like this one and using Big Data Analytics to ask not just the initial question, but all of the follow-on questions that come from that initial insight. That’s where it gets really exciting, because instead of making business decisions based on an opinion, you can base those decisions on the data your business is already collecting but never knew how to leverage.
Keith McClellan leads Federal Engineering at Platfora, and has been focused on Big Data and related technologies for most of his career. If you’re interested in his random musings, he tweets @keithmcc (https://twitter.com/keithmcc) and occasionally writes for the Platfora blog.