Thursday, November 21, 2013

Structuring the World With 'Unstructured Data'



Database management depends on structure -- of reality and of the data representing it in databases -- which determines how database management systems (DBMS) manipulate data and enforce integrity.

Having argued this for decades, I have been, predictably, quite skeptical of the hype about systems that manage and extract information from so-called "unstructured data," purportedly obviating the need for business modeling and database design. It's also why I've been skeptical of the criticism of SQL-based DBMSs as inflexible because they force big data, much of it text, into tabular schemas that it doesn't fit or that are difficult to envision upfront.




My skepticism is being validated by examples such as "Load First, Model Later -- What Data Warehouses Can Learn from Big Data":
"The traditional extract, transform, load approach... requires a pre-defined data model to be implemented as a set number of tables. The data being loaded is then mapped to the existing data model represented by the tables in the database. If the data the user wants to load... does not match the existing tables, changes must be made... [that are] expensive and time consuming, especially for businesses that deal with complex data or work in a highly dynamic environment."
While this article goes on to say that data professionals get excited about big data tools because they allow data to be loaded first and modeled later, this is often interpreted as a promise that equivalent intelligent information can be extracted without the upfront effort of a database schema.
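What "load first, model later" looks like in practice can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`customer`, `amount`, `cust_name`, `total`) and the function `total_for` are all invented here and come from none of the articles quoted. The point it tries to show is that the modeling is deferred to query time, not avoided: the query has to assume a structure, and records that don't match it are silently left out of the answer.

```python
import json

# Hypothetical records, "loaded first" as raw JSON lines with no declared schema.
raw_lines = [
    '{"customer": "Acme", "amount": 120.0}',
    '{"cust_name": "Acme", "total": "80"}',   # same kind of fact, different shape
    '{"customer": "Zenith", "amount": 50.0}',
]

def total_for(customer):
    """Sum the amounts for one customer; return (total, records skipped)."""
    total = 0.0
    skipped = 0
    for line in raw_lines:
        doc = json.loads(line)
        # The query itself encodes a schema: it expects the keys
        # "customer" and "amount". Anything else cannot be used.
        if "customer" not in doc or "amount" not in doc:
            skipped += 1
            continue
        if doc["customer"] == customer:
            total += float(doc["amount"])
    return total, skipped

print(total_for("Acme"))
```

The second record arguably describes an Acme sale too, but nothing in the data says so; deciding that `cust_name` means `customer` is exactly the modeling work that was supposedly skipped.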
But then there’s "Modeling for NoSQL, Schemaless & Unstructured Data":
    "Join me and three data experts in my Big Challenges in Data Modeling webinar... Since I’m attending the NowSQL Now! conference, our topic this week will be how data architects and modelers should be involved in requirements and modeling efforts on NoSQL, Schemaless, and Unstructured Data projects... or is there a role for them at all?"
But modeling is structuring, and the result must be some schema, so what exactly does “modeling for schema-less databases” mean? Here’s a thought from Ashok Chandra, a Microsoft distinguished scientist and general manager of the Interaction and Intent Group at Microsoft Research Silicon Valley, in a post about the company's new Leibniz Platform for “entity resolution”: 
"Search technology began with words,” says Chandra. “We built a whole search infrastructure around words. But in this new era of search, we are working with entities, because people think in terms of them, such as a hotel, a movie, an event, a hiking trail, or a person. The Leibniz platform is designed from the ground up to deal in entities, with the goal of making it easier for people to accomplish the tasks they set out to do."
Those who read my initial posts here should recall entities as an element of business modeling or, in other words, structuring. So to make search engines more useful, we need to... guess what?
"How do you organize all the world’s information? A decision made by the editors in charge of Wikipedia’s newest, biggest project reveals the difficulty of such a task... During the early part of its development, Wikidata used a hierarchical taxonomy to organize its data entries... originally meant to organize bibliographic information across library systems, though it was expanded recently by Internet technologists to work for non-library systems, too... groups everything into huge taxonomical categories. Those categories are: 
  • Person. 
  • Organization. 
  • Place. 
  • Event. 
  • Work. 
  • Term."
Organizing is also structuring and the list does look very much like one of entities. But without the Wikipedia data itself also structured accordingly, how much better would the search engine be than a DBMS?

The point is that the questions that can be asked of and answered by a computerized database system depend on the data structure and, as I have argued here, "unstructured data" is a contradiction in terms and not all structures are created equal.
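A minimal Python sketch may make the point concrete. Everything in it is assumed for illustration: the hotel names, the `hotel` table and its columns are invented and come from none of the sources quoted above. The same facts are held once as free text and once as a table. "Which hotels are in Seattle?" is a one-line query against the table; against the text, all that can be asked directly is whether the string "Seattle" occurs.

```python
import sqlite3

# Hypothetical facts as free text, invented for illustration.
free_text = (
    "The Harbor Inn is a hotel in Seattle. "
    "Bay Lodge is in San Francisco. "
    "In Seattle you will also find the Pine Hotel."
)

# The same facts with an explicit structure: one row per hotel entity.
rows = [
    ("Harbor Inn", "Seattle"),
    ("Bay Lodge", "San Francisco"),
    ("Pine Hotel", "Seattle"),
]

def hotels_in(city):
    """Answer 'which hotels are in <city>?' from the structured form."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hotel (name TEXT PRIMARY KEY, city TEXT NOT NULL)")
    conn.executemany("INSERT INTO hotel VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT name FROM hotel WHERE city = ? ORDER BY name", (city,)
    )
    result = [name for (name,) in cursor]
    conn.close()
    return result

def text_mentions(city):
    """All the raw text supports directly: does the string occur at all?"""
    return city in free_text

print(hotels_in("Seattle"))
print(text_mentions("Seattle"))
```

Getting from the text to the list of hotels would require extracting entities and their attributes first, which is to say, imposing the structure the text was supposed to do without.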

Are you a skeptic like me when it comes to claims about "unstructured data" and the DBMS? Share with me below.


