In this vodcast interview, three of the people most closely involved in our big data journey discuss battle scars and what we’ve learned. They cover everything from the four Vs, data capture, storage and security to Hadoop, Redshift and Elasticsearch. Interview text below.
It’s been two years and we’ve now delivered our third big data project. Two years ago we began working with a client who was looking to undertake some analysis. They are a major online business: data is their business, and they set us an analytical challenge with all of it. The big data journey has taken us across the four Vs: volume, variability, variety and velocity. Some of that stuff comes at you pretty fast.
Q: What do you think has been the biggest challenge? Has it been the integration and unification of some of that data? And the different variety of data we’re dealing with? Is it being able to handle it fast? Is it being able to store it?
I think the first challenge was actually getting our heads around where the best place to start was. There are quite a lot of buzzwords and quite a lot of different ideas which are all related to big data, and each project is different. It took us a long time to understand all the different components that make up that ecosystem and decide which bits to use first. So when you go into a certain type of project, should you start with Hadoop or should you start with Elasticsearch? They do different things. It’s getting yourself up that learning curve of what niche each tool fills and where you use it. That was a real learning curve for us. Now we feel that we’re there: we’ve done a lot with different systems and feel much more comfortable choosing the right tools for the job.
Would you choose Hadoop? Would you choose Redshift? Or if you’re doing visualisation, maybe you base it on Elasticsearch because of its speed, the quick return of queries and things like that.
Ecosystem is a great description for working with big data because it’s not a linear process. It’s an ecosystem you have to grow, and there isn’t one tool that solves everything. As we’ve been evaluating different tools, they’ve been on a journey too. Elasticsearch, for example, has come a long way from when we first started using it. They’ve learnt from our use cases, in which they saw our speed issues, our storage issues, and some of the ad hoc querying as well. But then in terms of ecosystem, we have a lot of SQL analysis tools here as well, and people who know SQL. So we have played quite a bit with Redshift too.
There are different stages in projects. We often start by getting things going really quickly and trying to understand the data, and there are certain tools that are a lot quicker to get going with. We do some MapReduce analysis with Hadoop and batch queries with Redshift, which is a really great new tool that is SQL based. We’ve learnt that within a project you start with your ad hoc understanding of what you need to do, get going with it, and process or understand that large amount of data. Then to put it into a production system you might use a different set of tools entirely, and we often use something like Elasticsearch in the final step to actually serve the analysis to the users.
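The map-then-reduce pattern mentioned here can be sketched in a few lines of plain Python. This is purely illustrative (the event names and log format are invented, not from the project): map each raw log line to a key, then reduce by summing counts per key, which is the same shape a Hadoop batch job or a Redshift `GROUP BY` query takes at scale.

```python
from collections import Counter

# Hypothetical event log lines of the form "user_id action" (illustrative only)
log_lines = [
    "u1 click", "u2 view", "u1 click", "u3 click", "u2 view",
]

# Map step: emit the key (the action) from each line
mapped = (line.split()[1] for line in log_lines)

# Reduce step: sum occurrences per key
counts = Counter(mapped)  # Counter({'click': 3, 'view': 2})
```

In a production pipeline the map and reduce phases run distributed across many machines; the logic per record stays this simple.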
Q: So all of this is constantly changing – we’re constantly evaluating. But how has it been with clients? In my experience it’s been about getting clients to understand that their data policies need to change: changing their perception of storage and retention so they can defensibly delete what they don’t need, but also capture things they might not have captured in the past, so as to tell the richer story they want to read from their analysis. How has your experience been from a technical perspective?
Even beyond a technical perspective, it’s about understanding what data you need to capture and how long you need to store it for. We found that a lot of companies need to go through the process of working out what they need to get the best value from their data: what to keep, how best to store it, and the security they need to put in place. Deciding which teams will work with it, too. On the technical side, things like security are not as hard as deciding policies at the beginning and getting that structure in place. Going through that process of understanding what you need can be the most difficult thing.
The questions come. Why do we have to do certain things? Why do you want that much data? How are you actually going to transfer that data? So all four Vs come into play, not just for us to understand but for the client to understand.
Tackling how you use cloud computing if you have sensitive data. Do you mask the data? What actually is or what isn’t sensitive? The IT infrastructure of a company will have their policies but how do they evaluate if something is secure or not? It’s been an issue we’ve had to work through a number of times.
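One common answer to the masking question is pseudonymisation: replace each sensitive value with a keyed hash before it leaves the client’s infrastructure. The sketch below is a generic illustration using Python’s standard library, not the specific approach used on these projects; the secret key and field names are assumptions for the example.

```python
import hashlib
import hmac

# Assumption for illustration: the key is managed outside the cloud environment
SECRET_KEY = b"replace-with-a-managed-secret"

def mask(value: str) -> str:
    """Pseudonymise a sensitive field with HMAC-SHA256.

    The same input always maps to the same token, so joins and counts
    still work on the masked data, but the original value cannot be
    recovered without the key.
    """
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

# Hypothetical record: mask the sensitive field, keep the rest intact
record = {"email": "jane@example.com", "page": "/pricing"}
masked = {**record, "email": mask(record["email"])}
```

Because the mapping is deterministic, analysts can still ask "how many pages did this (masked) user visit?" without ever seeing the real identifier.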
How you move these massive volumes of data between different cloud solutions, or even between regions within one cloud solution, can often be a problem if you have terabytes and terabytes of data. It can be pretty difficult when keeping security in mind and ensuring that the process is secure. We came across some products and tools which can move absolute bucketloads of data in seconds, with security in place, which we wouldn’t have come across if we weren’t in this space. That sort of thing’s only possible once you get in there and go on that journey.
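The transcript doesn’t name the transfer tools, but most bulk-transfer approaches share one core idea: split the payload into fixed-size parts that can be moved in parallel and verified independently at the destination. A minimal sketch of that chunk-and-checksum idea, with assumed part size and dummy data:

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB parts; many multipart-transfer tools use a similar size

def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Hash each fixed-size part of a payload.

    The sender ships parts in parallel; the receiver recomputes each
    digest over what arrived and compares, so corruption or tampering
    in any single part is detected without re-transferring the whole payload.
    """
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

payload = b"x" * (20 * 1024 * 1024)  # 20 MiB of dummy data -> three parts
digests = chunk_digests(payload)
```

Real tools layer encryption in transit on top of this, but the part-level integrity check is what makes parallel terabyte-scale moves practical.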
Q: How has the leap from traditional SQL to dealing with big data been?
I think everyone here’s much happier since Redshift arrived. Redshift makes people happy.
Q: Are statistics capabilities easier with the big data ecosystem?
We have a number of stats guys, and they’re growing into the role of doing things not just at small scale but at large scale, using the whole data set available rather than sampling. There have definitely been advances in that area as well.
There’s the adage that big data is a bit like teenage sex – everyone’s talking about it, nobody really knows how to do it. I like to think we’ve graduated.
Of the three people we interviewed about their experience with big data, Pamela Edmond – Associate Director – was in her last week with us at the time of filming. We wanted to capture her input into a discussion about a journey she was instrumental in realising.