Thursday, March 28, 2013

Social BigData and Relational Denial

Note: Minor edits 3/29/13.

In an online discussion initiated by the question "Does It Matter If Data is BIG or not?", MQ commented:
I still feel the discussion around Relational Modelling is confusing the point, and should be put aside until the problem is understood. If a company came to me and said 'Help me solve my big data issue - I have a billion emails I want to analyse' my answer is not 'just create a logical model using relational model theory' because this does not supply an answer. I will make more ground if I say 'right, lets discuss what this is, what technology you have, where the fail points and choke points are, etc and model (relational model) that as part of the process'.
I've built data models for 25 years (all levels) and firmly believed in Relational Theory across this entire period, so I am not saying drop Relational Models, just saying don't start there. Interestingly, I don't get any backlash against relational modelling using this approach - so perhaps the issues mentioned are about how the concept is sold to clients (a weapon rather than an intellectual concept)?

I prefer to reserve the term 'modeling' for the conceptual level and use 'database design' at the logical level. A logical model, which can be relationally designed, is a database representation of something in the real world. It is impossible to produce without establishing that something--a conceptual model--first. Given a bunch of emails and tasked with "analyzing them", it is impossible to "use relational model theory" directly on the emails as is, without a conceptual model that lends itself to such use. (Incidentally, the relational data model is relational theory, of which there is only one.)

The important point to remember, however, is that regardless of what the informational/analytical needs are, they require some data model: the manipulation of some data structure. The structure determines the operations that can be applied to it to serve those needs. So the question is what is the logical data structure whose manipulation "best" satisfies a given set of needs.

If there is valuable information in emails, will the kind of manipulation to which the email structure is amenable--and make no mistake about it, emails have one!--satisfy the needs? In some instances it may, but if it does not, what structure will do so cost-effectively?

If the analytical need is logical inferencing, that is, deriving facts that are logical implications of the facts embedded in the emails, then the manipulation to which their structure lends itself will not produce the desired result. Business Modeling for Database Design specifies a set of criteria to use in assessing whether any structure other than the relational one--the R-table--is superior, or a trade-down.
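To make "deriving facts that are logical implications" concrete, here is a minimal sketch in Python. It assumes the facts have already been modeled and extracted from the emails into two relations, represented as sets of tuples. The relation names (`sent`, `dept`), the people, and the departments are all hypothetical, invented for illustration only.

```python
# Hypothetical relations, assumed to have been extracted from emails
# after conceptual modeling. Each relation is a set of tuples, so it
# contains no duplicates and has no inherent ordering.
sent = {("ann", "bob"), ("bob", "cid")}                          # (sender, recipient)
dept = {("ann", "sales"), ("bob", "legal"), ("cid", "legal")}    # (person, department)

def dept_pairs(sent, dept):
    """Join sent with dept twice (on sender and on recipient) and
    project on the two department attributes."""
    return {
        (sender_dept, recipient_dept)
        for (sender, recipient) in sent
        for (person1, sender_dept) in dept if person1 == sender
        for (person2, recipient_dept) in dept if person2 == recipient
    }

# Derived facts: which department wrote to which department.
derived = dept_pairs(sent, dept)
```

None of the derived tuples is stored anywhere; each follows from the stored facts by join and projection, which is the kind of manipulation a raw email structure does not offer.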

How solutions are sold to clients is a completely different ballgame. The exchange was among data management professionals. If clients are end-users, there is no reason for them to be exposed to anything other than business concepts and terminology. But if they are IT personnel, they are expected to know and understand the data model alternatives--structure and, therefore, manipulation--and their pros and cons for the given needs. If they do, they would know when a relational approach is optimal and why others are not. "Backlash" against relational is more often than not an indication that they don't.
I don't like the BigData term, but it has drawn the industries focus to a looming issue - that of trying to gain value out of social media data - because other than that we have happily been motoring along with technology improvements allowing us to generally keep up with demands for increases in data processing, and allowing established paradigms to flourish. Emails, Blogs, etc. are an enticing target for corporates who are looking for an edge - there is a lot of research in this area. This is not an anti-intellectual or foundation knowledge lack - intellectuals are looking in different directions rather than accept a long-held theory, they will either reinforce or add to our foundation knowledge in the process.
From a data fundamentals perspective, social media data is not different from any other data: extracting information from it involves the manipulation of some data structure and, of course, that is not affected by volume. The question is still what data structure is "best". Consider what MQ describes as the challenges posed by "social BigData":
BigData ... seems more about extending our thinking to deal with a style of data that has many new paradigms or contexts - depending on the degree to which we've come across these issues previously. Looking at a set of blogs to try and extract 'Sentiment' analysis as an example, the challenges are not just with the structure / unstructure nature of the data. As samples of this:
- Not everyone blogs - so there is no guarantee that we are looking at everyones data, there is actually no guarantee we are looking at a fair distributed sample either
- Those that do blog, blog at different rates - in some cases people only blog when they have something to say, others blog about cleaning their teeth
- Some people only Blog when they want to make a point - they may only make 'happy about...' blogs - but we can't interpret they are therefore always happy.
- Linguistics - different education systems, levels, care factors etc. all influence how people type - blogs are viewed as informal language, with all the issues that implies for analysis
- Hidden Knowledge - depending on audience, there is often a level of expected reader knowledge - 'what concert are they talking about' may need knowledge of previous communications and worse.
- Abbreviated language like what 'lol' mean. And these can be sarcastic as well as contextual.
These are structural/analytical/manipulation issues. Some infer from this the following:
RS: As far as the other types of data goes is where relational theory falls short, but I am certain no one said relational is a size that fits all. Because and only because of this, we have to model an approach. I am deliberately saying an approach because modeling data alone may not be sufficient in this case (non-relational data). We don't want to feel like mice left in a horrendous maze trying to get to the cheese on the other end. That is why I proposed that we simplify this problem by gleaning useful metadata for analytics first, which will lend itself to some structure. Then the unstructured data (some or most of it) would be left - but we have probably made a dent already. Much like solving a jigsaw puzzle - all the pieces are unstructured, but maybe we can still make a square out if it.
  • It is not relational theory that falls short, just the opposite: the "non-relational" structures of the social BigData do not lend themselves to the kind of manipulation that produces provably correct logical inferencing. So informational demands must be relaxed to comply with the manipulative possibilities of "non-relational" data. Any alternative structure simply does not offer the advantages of the relational model.
  • If the demands cannot be relaxed, there is no way around modeling the reality embedded in the social BigData and mapping it to a relational database structure. This is often interpreted as a deficiency of the relational model, but that too is backwards. If you need provably correct logical inferencing, social BigData structures will fail you.
  • The implication of the last point is that the notion that the relational model is limited with respect to what realities it can represent is wrong. Reality is neither inherently relational nor non-relational; it can be represented either way in the database. If the information to be extracted is valuable, the question is whether it is valuable enough to justify conceptual modeling and database design. What should be avoided is the illusion that you can get results from "non-relational data" that are equivalent to those from "relational data" without conceptual modeling and database design.
  • "[G]leaning useful metadata for analytics first, which will lend itself to some structure" is nothing but an admission that whatever structure the data have--and they do have some, or they would not be data--it does not satisfy the analytical purpose. Extracting the "metadata" is essentially modeling part of the reality embedded in the emails so that it maps to a structure and manipulation that satisfies the analytical purposes that the data model underlying email cannot. Guess which one does.
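
As an illustration of the last point, "gleaning metadata" from an email can be sketched as mapping its headers to tuples of a relation. This is a minimal sketch using Python's standard `email` package. The sample message, the addresses, and the choice of attributes (message id, sender, recipient) are assumptions made for the example, not anything prescribed in the discussion.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr

# Hypothetical sample message, invented for illustration.
RAW = """\
Message-ID: <m1@example.com>
From: Ann <ann@example.com>
To: Bob <bob@example.com>, cid@example.com
Subject: Concert tickets

See you at the concert, lol.
"""

def header_tuples(raw):
    """Map one raw email to a set of (message_id, sender, recipient) tuples."""
    msg = message_from_string(raw)
    msg_id = msg["Message-ID"]
    _, sender = parseaddr(msg["From"])                 # drop the display name
    recipients = getaddresses(msg.get_all("To", []))   # [(name, address), ...]
    return {(msg_id, sender, address) for _, address in recipients}

rows = header_tuples(RAW)
```

The headers map to tuples directly because a small part of the reality (who wrote to whom) has been modeled in advance. The body text ("the concert", "lol") is left untouched, and nothing can be inferred from it until it, too, is modeled.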
