Big Shoulders, Big Data

Well, folks, it’s happening. It’s been said that when a technology finally makes its way from the coasts to Chicago, it must be here to stay. It took Rails years to take firm root here – long after our brethren in New York and California had been going stark raving productive with it.

So what’s next, Chicago? Big Data. And technologies to manage it. See, we’re not home to dozens of household-name dot-coms. Sure, there’s Groupon, and GrubHub, and a few others, but Silicon Valley still outnumbers us thousands to a handful. That, however, doesn’t mean we don’t have our share of big-data problems to solve (or products to invent) right here in the Windy City.

Empirically, we at Redpoint are seeing a sudden and noticeable increase in demand for consultants with experience working with Big Data – and, in particular, with the Hadoop MapReduce framework. And herein lies a problem.

Chicago is simply not bursting at the seams with seasoned Hadoop warriors. And until weather patterns change, I don’t expect to see a huge influx of talent heading our way from The Valley any time soon. So, Chicago, what do we do?

We do what we have always done. Stand tall, stretch out those big shoulders, and get there ourselves. Take our formidable talent, breadth and depth of experience, and unmatched work ethic, and simply learn this stuff. And guess what? It’s not rocket science. (Of course not, it’s computer science. But I digress.)

After rolling up my sleeves and spending a little time (just a little) with Hadoop, I couldn’t help but think about how I might advise a client who needed someone to work on a Big Data project built on it. It’s very difficult to find experienced Hadoop talent here, and if you can find it, you’ll pay a premium. Is the premium worth it?

Not in my book. A seasoned software engineer with a Java background, who has enterprise experience and knows what batch processing is all about, can springboard into the Hadoop world with just a modest learning curve to slog through. As I was learning about Hadoop myself, I noted that nearly every “new” concept was very easy to pick up, and started to think about why that was. Before long, I found myself building a list of specific skills, traits and areas of expertise that enabled me (and would enable anyone) to quickly master Hadoop and MapReduce.

Here’s what I came up with:

  • Solid OO fundamentals. Because learning to write code within a new framework is much easier if you truly understand inheritance, abstraction and polymorphism.
  • Solid Java language fundamentals. Because you’ll be several steps ahead if you are familiar with things like Java’s I/O classes (though you’ll use Hadoop variants) and generics – both of which show up in the sketch after this list – and you’ll want to be able to write reasonably high-performing code on the first pass.
  • Experience with batch processing of files of appreciable size. Not necessarily Big Data size, but the kinds of files that you inspect with head or tail because loading them up in anything else would give you an instant 15-minute coffee break. Experience on the ETL side of a data warehouse, for example, would prepare you nicely for working with Big Data.
  • Familiarity with all things *NIX, because you’ll likely be installing and configuring software on Linux boxes and developing on a Mac. You’ll want to be able to hop into vi to edit config files quickly. Create, delete and traverse directories from the command line. Quickly find and/or repair unexpected data in input files using things like grep, sed and awk.
  • Experience working in a truly distributed system, where processing is done on multiple nodes and data doesn’t live in one monolithic database. Because if you’ve spent any time in an environment like that, you understand at a visceral level the fundamental tenets of a tool like Hadoop – that code needs to operate on co-resident data to run efficiently, and that it’s far more efficient to move code to where data resides (when the data is reasonably large) than vice versa. You’ll have an instant internal reaction to terms like CPU-bound and I/O-bound.
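
To make that concrete, here’s a minimal sketch of the canonical word-count example – the “hello world” of MapReduce – written against Hadoop’s org.apache.hadoop.mapreduce API. Treat it as an illustrative sketch rather than production code; the point is that nothing in it is exotic. It’s ordinary Java generics plus Hadoop’s Writable types (LongWritable, Text, IntWritable) standing in where the familiar java.io classes would be.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Mapper: receives one line of input at a time and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: all counts for a given word arrive together; sum them up.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : values) {
        sum += count.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```

If you can read that comfortably, you’re most of the way there. The rest is learning how the framework splits the input, shuffles the (word, count) pairs between the map and reduce phases, and runs your code on the nodes where the data already lives – plus a few lines of Job-setup boilerplate to wire the classes together and point them at paths in HDFS.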

So, if I happened to be building an elite team of Big Data consultants (ahem, wink, nudge), I’d be looking for folks with this set of experiences. Looking at it this way, Hadoop isn’t this big, mysterious and complicated thing. It’s just a framework. A new tool for solving a relatively new problem. And knowing how to use it effectively requires a rich set of fundamental knowledge building blocks that many of us already have.

So come on, Chicago. Let’s do this. Let’s not let our friends in California have all the fun.
