Monday, August 8, 2016

Big Data: Modern Day Miracle or Same Old Snake Oil?

We've been hearing a lot about Big Data in the news and on social media. What exactly does it consist of, and why is it a big deal? The Wikipedia article on the subject says "Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate." That certainly sounds ominous, but why is it being collected and what do you use it for? 

According to my research, Big Data should be able to predict when a machine will fail, identify manufacturing bottlenecks, point out waste and variability of care in healthcare, stop fraud on credit card purchases, predict drive times in high-volume areas, and even forecast the winners of sporting events and tune race cars. That's all great, but how does it happen? I heard the terms "Hadoop" and "NoSQL" a lot while researching this article, so let's dive a little deeper into the technology. 

In "Hadoop: what it is And How It Works" Brian Proffitt explains: "It's a way of storing enormous datasets across distributed clusters of servers and then running distributed analysis applications in each cluster." Great, but that doesn't really tell me much. To me, that sounds like a regular everyday cluster and that technology has been around for decades. Mr. Proffitt goes on to explain that the Hadoop File System (HDFS), the Data Processing Framework and MapReduce are the secret sauce that makes things work. So you dump your data into the HDFS and you can use MapReduce to do the data processing. Great, but where does the data actually reside? Apparently it resides on the local machine in whatever place the Hadoop developer dumps it, so you are reduced to network speeds to transfer amazingly large amounts of data from one machine to the next in the cluster. Anyone who has ever used Neflix Streaming Service knows what a stupendously bad idea that is. Here is what Mr. Proffitt has to say about MapReduce: "(it) runs as a serious of jobs, with each job essentially a separate Java application that goes out into the data and starts pulling out information as needed." 

Danger, Will Robinson. Here there be monsters. So you are telling me that we are going to use procedural programming tools to solve manufacturing bottleneck problems across a distributed network? Horse hockey. Sufferin' saddle soap. Buffalo chips. BULL COOKIES! I've used this analogy before and I'm going to use it again: if you want to find a rock about the size of a quarter, build two boxes, one with holes slightly larger than a quarter and one with holes slightly smaller than a quarter, and run all the rocks in the world through both boxes. What you are left with are all the rocks in the world about the size of a quarter. This is set-based programming. Procedural programming, on the other hand, takes the proverbial quarter, lines up all the rocks in the world, and compares them one by one. In database circles this is known as RBAR (REE-bar, like the stuff that holds concrete together), the acronym for 'Row By Agonizing Row.' I have run into this literally a hundred times in my career. Would you rather build two boxes, or would you rather have a pair of tweezers in one hand and a quarter in the other? 
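To make the analogy concrete, here is a hedged sketch in T-SQL against a hypothetical dbo.Rocks table. The first query is the two boxes; the cursor below it is the tweezers and the quarter, one rock at a time.

-- Hypothetical table: every rock in the world, one row each.
CREATE TABLE dbo.Rocks (
    RockId     BIGINT IDENTITY(1, 1) PRIMARY KEY,
    DiameterMm DECIMAL(6, 2) NOT NULL
);

-- Set-based (the two boxes): one statement, the engine does the work.
-- A US quarter is about 24.26 mm across.
SELECT RockId, DiameterMm
FROM dbo.Rocks
WHERE DiameterMm BETWEEN 23.0 AND 26.0;

-- RBAR (the tweezers): compare every rock, one by one.
DECLARE @RockId BIGINT, @DiameterMm DECIMAL(6, 2);
DECLARE RockCursor CURSOR FOR
    SELECT RockId, DiameterMm FROM dbo.Rocks;
OPEN RockCursor;
FETCH NEXT FROM RockCursor INTO @RockId, @DiameterMm;
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @DiameterMm BETWEEN 23.0 AND 26.0
        PRINT CAST(@RockId AS VARCHAR(20));
    FETCH NEXT FROM RockCursor INTO @RockId, @DiameterMm;
END;
CLOSE RockCursor;
DEALLOCATE RockCursor;

So what is the conclusion?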


There is no such thing as Big Data

If you can't process it with traditional databases, it can't be processed. Take whatever data you have and structure it. Yes, some NoSQL alternatives are document-based. You don't need documents. Even if you have documents (like this HTML page), take the contents, add them to a full-text index, and then you can search the documents without using tweezers and a quarter (there's a quick sketch of that below). You have noticed that Microsoft didn't rush out and create a new Big Data engine, haven't you? Nope, Microsoft will let you use Hadoop, because they will take your money if you are foolish enough to spend it that way, but there is already a better solution than RBAR. Structuring the data is going to take about as long as that first MapReduce query, but every query after that is basically free. The second MapReduce 'job,' on the other hand, takes precisely as long as the first one, and for the same dumb reason: procedural code running against huge amounts of data. 
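For the documents case, here is a hedged sketch using SQL Server full-text indexing. The table and column names are hypothetical, and it assumes the full-text feature is installed.

-- Hypothetical documents table; the primary key supplies the unique
-- key index that full-text indexing requires.
CREATE TABLE dbo.Documents (
    DocumentId INT IDENTITY(1, 1) NOT NULL,
    Title      NVARCHAR(200)      NOT NULL,
    Body       NVARCHAR(MAX)      NOT NULL,
    CONSTRAINT PK_Documents PRIMARY KEY (DocumentId)
);

-- A catalog to hold the full-text index, then the index itself on Body.
CREATE FULLTEXT CATALOG DocumentCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.Documents (Body)
    KEY INDEX PK_Documents
    ON DocumentCatalog;

-- Searching the documents: set-based, no tweezers required.
SELECT DocumentId, Title
FROM dbo.Documents
WHERE CONTAINS(Body, N'"manufacturing bottleneck"');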

So what do we get and what do we lose by using a traditional relational database instead of Hadoop/NoSQL? First, the data has to be structured. It is going to take time to look at the data and figure out how to get it into your Relational Database Management System (RDBMS). Of course, it would take the same amount of time to get it into your Hadoop train wreck as well, so that is a wash. SQL is pretty easy to use, and MapReduce is mind-bogglingly complicated. Mr. Proffitt (remember him?) even stated in the article above that some MapReduce implementations take a SQL-like language and translate it for you. That seems a little backwards. If you can describe what you are trying to do using the tools you have, why not just use the tools you have? 
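Take the credit card fraud example from the top of the article. Here is a hedged sketch with hypothetical table and column names; the threshold is made up, but the point is that the question fits in one statement, whereas the MapReduce version would be a Java program of its own.

-- Hypothetical table: one row per card swipe.
CREATE TABLE dbo.CardPurchases (
    PurchaseId   BIGINT IDENTITY(1, 1) PRIMARY KEY,
    CardNumber   CHAR(16)       NOT NULL,
    PurchaseTime DATETIME2(0)   NOT NULL,
    Amount       DECIMAL(10, 2) NOT NULL
);

-- "Which cards ran more than twenty purchases in a single day?"
SELECT CardNumber,
       CAST(PurchaseTime AS DATE) AS PurchaseDate,
       COUNT(*)                   AS PurchaseCount
FROM dbo.CardPurchases
GROUP BY CardNumber, CAST(PurchaseTime AS DATE)
HAVING COUNT(*) > 20;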

You already have a relational database. It is how the world has worked to this point. You can use SQL queries to ask your existing database questions. If you need advanced analytics, maybe you should abandon Boyce and Codd and say hello to Kimball and Inmon (there's a sketch of what that looks like below). Your data probably isn't big enough to need even a data warehouse (read up on Kimball and Inmon above), but if it is, don't scrap the Ferrari you've had built and go buy a Yugo with a Hadoop bumper sticker. The conclusion is that the whole Big Data movement is complete snake oil designed to get you, dear reader and holder of cash, to spend money on something you don't really understand and certainly do not need.
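For the curious, here is a hedged sketch of what a Kimball-style star schema might look like for the machine-failure example from the top of the article. Every table and column name is hypothetical; the point is that "advanced analytics" is still just tables and SQL.

-- One fact table ringed by dimensions: the Kimball star.
CREATE TABLE dbo.DimDate (
    DateKey       INT      NOT NULL PRIMARY KEY,  -- e.g. 20160808
    CalendarDate  DATE     NOT NULL,
    CalendarMonth TINYINT  NOT NULL,
    CalendarYear  SMALLINT NOT NULL
);

CREATE TABLE dbo.DimMachine (
    MachineKey  INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    MachineName NVARCHAR(100)      NOT NULL,
    PlantName   NVARCHAR(100)      NOT NULL
);

CREATE TABLE dbo.FactDowntime (
    DateKey         INT NOT NULL REFERENCES dbo.DimDate (DateKey),
    MachineKey      INT NOT NULL REFERENCES dbo.DimMachine (MachineKey),
    DowntimeMinutes INT NOT NULL,
    FailureCount    INT NOT NULL
);

-- "Which plants lost the most time last year?" is one query, not a Java job.
SELECT m.PlantName, SUM(f.DowntimeMinutes) AS TotalDowntimeMinutes
FROM dbo.FactDowntime AS f
JOIN dbo.DimMachine   AS m ON m.MachineKey = f.MachineKey
JOIN dbo.DimDate      AS d ON d.DateKey    = f.DateKey
WHERE d.CalendarYear = 2015
GROUP BY m.PlantName
ORDER BY TotalDowntimeMinutes DESC;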
