Tuesday, April 1, 2014

Analytics = Manipulation of Data Structure

In What the $&@#^ is Applied Big Data, venture capitalist Greg Sands raises an issue about which I as a scientist (and not just a "data scientist") have often expressed concern. 
"The world is awash in data. Figuring out what to do with it is the problem. The press is littered with reports about Big Data. Many CIOs report that their CEOs have come to them and said, “We need some of that Big Data.” That often means make sure we’re collecting all the available data, often deploying a new Hadoop-based infrastructure to store and analyze it. After this elaborate process and extensive investment, they’ll start mining to figure out if there are critical insights that come out of the data. We see many entrepreneurs that start the same way. Aggregate data and look for a problem."

First, note that "store and analyze" comes first. This implies that whatever form the data are in, they can just be "dumped" into Hadoop as is and analyzed at will.
Second, Sands is correct that science does not usually work this way. Rather:
"We prefer to start the other way -- with a business or technical problem that you’re trying to solve. Then asking, “What data assets and analytics can we bring to bear on this problem?” is the more fruitful way to narrow in on an important and high-value problem, [e.g.] ... use data to help improve the productivity of human capital by focusing initially on the call center, where every action is already instrumented ..."
The process by which analytics can be used to help improve productivity involves four steps:
  1. Definition of measurable human productivity and actionable factors that affect it,
  2. Formulation of a causal model of the effect of the latter on the former,
  3. Generation or location of data representing those variables,
  4. Testing the model by running a suitable analysis that, if it fails to reject the model, is assumed to validate it.
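The four steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the call-center records, the `training_hours` factor, and the `calls_resolved_per_hour` measure are invented, not taken from Sands's article, and a permutation test on a least-squares slope stands in for whatever "suitable analysis" a real study would choose.

```python
import random
from typing import NamedTuple


class AgentRecord(NamedTuple):
    # Steps 1 and 3: one row per agent (the entity); columns are its attributes.
    agent_id: int
    training_hours: float           # hypothetical actionable factor
    calls_resolved_per_hour: float  # hypothetical productivity measure


def slope(xs, ys):
    """Step 2: least-squares slope b of the assumed model y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Step 4: share of random shuffles whose slope is at least the observed one."""
    rng = random.Random(seed)
    observed = slope(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if slope(xs, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


# Made-up data, for illustration only.
records = [
    AgentRecord(1, 1.0, 2.9), AgentRecord(2, 2.0, 3.2),
    AgentRecord(3, 3.0, 3.1), AgentRecord(4, 4.0, 3.8),
    AgentRecord(5, 5.0, 4.0), AgentRecord(6, 6.0, 4.1),
    AgentRecord(7, 7.0, 4.6), AgentRecord(8, 8.0, 4.9),
]
xs = [r.training_hours for r in records]
ys = [r.calls_resolved_per_hour for r in records]

b = slope(xs, ys)
p = permutation_p_value(xs, ys)
print(f"slope={b:.3f}, p={p:.4f}")
```

Note that the analysis is possible only because the data were first organized as rows of one entity type with named attributes. A small p-value here is evidence against "no association"; by itself it says nothing about causation, which has to come from the model in step 2.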
These steps amount, essentially, to modeling a business aspect of interest (the variables are usually attributes of some entities) and organizing the corresponding data into a structure that lends itself to the chosen analytical operations (manipulation). Sands continues: 
"Traditional Enterprise data was structured, meaning it fit neatly into rows and columns like an Oracle database (or even a spreadsheet). Facebook’s relationship data or Google’s logfile data is unstructured and no longer fits that approach. Open Source tools like Hadoop, Cassandra, Mongo DB and Hive were developed to store, manage, search and analyze both structured and unstructured data – at massive scale."
But data do not magically "fit neatly into rows and columns." Reality must be modeled so that it can be represented by a data structure, which can then be manipulated to extract information; that manipulation is what analytics is. The so-called NoSQL products that Sands mentions have been promoted as "schema-less," which, for all practical purposes, means "unstructured data." As I explained in an earlier post, "unstructured data" is a contradiction in terms: no manipulation can extract information from random noise, because it carries none.

Software developer Chandermani Arora describes this phenomenon in his "Random Thoughts" blog: 
"Recently in one of the projects we planned to use some NoSQL Document database. One of the reasons we though[t] [a] document database would be a great fit was that we could get started without setting up any DB schema and start shoving entities into the document database. Nothing can be further from the truth. Data modeling is as essential and as fundamental an exercise for NoSQL stores as it is for RDBMS."
The real question is: what is the optimal structure that can serve most analytical needs with the least complexity? Any business reality can be modeled to "fit in rows and columns," and there are significant reasons why that is optimal for data management. I'll discuss this in more detail in future posts. 
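As a minimal sketch of the claim that any business reality can be modeled to fit in rows and columns, relationship data of the kind Sands calls unstructured can be held in a two-column table and queried with ordinary joins. The table name `friendship`, its columns, and the sample names are hypothetical, and SQLite (through Python's standard `sqlite3` module) merely stands in for any SQL DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE friendship (
        person TEXT NOT NULL,
        friend TEXT NOT NULL,
        PRIMARY KEY (person, friend),
        CHECK (person <> friend)
    )
    """
)

# Friendship is symmetric, so each pair is stored in both directions.
pairs = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]
conn.executemany(
    "INSERT INTO friendship (person, friend) VALUES (?, ?)",
    pairs + [(b, a) for a, b in pairs],
)


def friends_of_friends(name):
    """People two steps away from `name` who are not already direct friends."""
    rows = conn.execute(
        """
        SELECT DISTINCT f2.friend
        FROM friendship AS f1
        JOIN friendship AS f2 ON f2.person = f1.friend
        WHERE f1.person = ?
          AND f2.friend <> f1.person
          AND f2.friend NOT IN (
              SELECT friend FROM friendship WHERE person = ?
          )
        ORDER BY f2.friend
        """,
        (name, name),
    )
    return [row[0] for row in rows]


print(friends_of_friends("ann"))
```

The "graph" is just a relation over pairs of people, and the friends-of-friends question is a self-join on it. The declared key and constraint are the schema; without them the query would have nothing reliable to operate on.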
