Mind The Gap

There is no question that Big Data is, well, big. In just about every market domain, there are efficiencies, insights, and new business to be had by doing the right kind of analysis on an increasingly vast collection of underlying multidimensional data. The emphasis here is on the right kind of analysis. You can have the biggest, baddest big data infrastructure, one that happily munches through brontobytes, and still be monstrously inefficient (or useless, or worse) if you don't have the right models running on the right features.

What we keep seeing is a huge distance between being able to imagine and postulate something about your business that you know is hidden in the data (maybe even to the point of running some sample models on a subset) and actually getting the infrastructure to be useful in production. The recurrent issues range from the seemingly simple recoding of experimental models to fit the underlying infrastructure (e.g., MapReduce), to getting the right data representation and features, to running an analysis that doesn't take until the end of civilization to finish. The cycle times can be really long (days to weeks), and that puts a serious crimp in the whole exploratory loop of postulating something about your business, confirming it really is an insight, and acting on that insight. It's still way too hard to bridge Big Data imagination with Big Data implementation.

We call this The Gap.

Mostly, we see companies deal with The Gap by building custom workflows that enable modelers, data scientists, coders, and operators of the big data infrastructure to work together effectively. In our portfolio, companies such as BloomReach, Lattice Engines, and Millennial Media do this exceptionally well, but their processes are homegrown and customized to their verticals. We are certainly encouraged by folks like Pentaho making BI workflows efficient, and by our friends at Wibidata, who are pioneering the concept of a Big Data application server. Even so, The Gap still looms large.

And thus we are excited to see Carlos Guestrin and friends dedicating their future to GraphLab, a solution for The Gap. Prof. Guestrin started the project in 2009 as a way of bridging the educational gap for his Machine Learning students at CMU. The essential problem was to find a way that students could rapidly explore many different kinds of machine learning algorithms on substantial (read: Big) amounts of data. If The Gap is a challenge for teams at companies, it is virtually uncrossable for anyone but the most specialized and dedicated in academia.

A seminal choice by Carlos was the representation for the underlying data: graphs. His big motivation was that many of the interesting domains (say, social networks) are very neatly and naturally organized as graphs. In fact, it's a very effective organizing principle, and a lot of the things we care about fit nicely into (natural) graph representations. This is powerful because it makes the data much easier to understand and reason about.
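To make that concrete, here is a minimal sketch in plain Python (an illustration of the representation, not GraphLab's actual API) of a social network as a graph, where vertices are users and directed edges are "follows" relationships:

```python
# A toy social network as a graph: vertices are users, directed
# edges are "follows" relationships. Plain-Python illustration;
# GraphLab's own data structures and API are not shown here.
follows = {
    "alice": ["bob", "carol"],
    "bob":   ["carol"],
    "carol": ["alice"],
    "dave":  ["alice", "bob", "carol"],
}

# Many questions we care about become simple graph traversals,
# e.g. "who follows this user?"
def followers_of(user, graph):
    return [u for u, out in graph.items() if user in out]

print(followers_of("carol", follows))  # ['alice', 'bob', 'dave']
```

The point isn't the dozen lines of code; it's that the question maps directly onto the structure of the data, with no translation layer in between.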

As pedagogically nice as graphs are (okay, pedagogy was the goal!), they can also be enormously challenging to compute efficiently upon, especially in parallel. Not to get mired in the details, but one thing that makes it tough is that many really interesting graphs are highly non-uniform, where a few nodes account for most of the connectivity (just think of the number of people following a celebrity on Twitter versus the number of followers for an average Twitter user). The GraphLab team has been chipping away at this computing problem since day one of the project. After many iterations, they’ve addressed this issue through a number of key design choices (e.g., it’s all in memory) and solved many other problems with graphs along the way. What they’ve engineered is something rather special and…
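To give a feel for that skew, here is a small stdlib-Python sketch (again an illustration, not GraphLab code) that grows a graph by preferential attachment, the classic "rich get richer" process behind celebrity-style connectivity, and then measures how much of the graph the top 1% of vertices touch:

```python
import random

random.seed(42)
edges = [(0, 1)]   # seed the graph with a single edge
targets = [0, 1]   # endpoints repeated by degree: sampling from this
                   # list picks vertices proportionally to their degree

# Each new vertex attaches to an existing one with probability
# proportional to that vertex's degree, so hubs keep growing.
for new_node in range(2, 10_000):
    neighbor = random.choice(targets)
    edges.append((new_node, neighbor))
    targets.extend([new_node, neighbor])

degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1

top = sorted(degree.values(), reverse=True)[: len(degree) // 100]
share = sum(top) / (2 * len(edges))
print(f"Top 1% of vertices touch {share:.0%} of all edge endpoints")
```

A skew like that is brutal for naive parallelism: however you split the graph across machines, you either cut straight through the hubs or overload whichever machine holds them.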

…wicked fast.

So fast, in fact, that it is now practical to shrink the exploration loop from days to hours. And so fast that you can begin to think of analytics being updated in real time. That gets our attention, because here is a tool with an increasingly rich set of Machine Learning algorithms that works really well on a powerful and intuitive representation. Yes, there is still a lot of engineering needed to make it truly accessible and easy to use, so it hasn't bridged The Gap just yet. However, we are super excited about the possibilities and even more excited about going on the journey with the stellar team at GraphLab.