Thursday, July 22, 2021

Documents and Databases

'These new data technologies were developed because there are new usage scenarios for data — which do not fit into the relational model.'

Don't let the NoSQL label fool you. It's the relational model (RDM), not SQL, that its proponents are really dismissing. The main argument, as advanced in a recent LinkedIn exchange, is that lots of information "cannot be represented in rows and columns". IOW, the RDM is not general enough -- there are certain types of information that it is not suited for. Ignoring the tabular nonsense, the response from David McGoveran, is important enough to restate here.
“Information consists of facts (i.e., propositions asserted to be true) about objects, properties, and relationships among objects and properties. We have shown that a database relation -- which a R-table visualizes -- is constrained to represent a set of facts about (properties of) a group of entities with within-group relationships among properties and entities and cross-group relationships. Yet we are told that document information "do not fit" in a relational structure. They are referred to as "unstructured" (which, if they were, they would contain random noise, not information).

But documents don't lack structure. Rather, they are multistructured: have complex multi-level/type structures -- lots of content, metadata, interpretations, and internal relationships (formatting, semantic, structural or syntactic, and so on). At one level of analysis, they are just documents that have subject matter or content involving objects, properties and relationships. At another they might relate to that of other data (e.g., other documents). How we represent knowledge and in how much detail is determined by which of the structures we choose to represent and that always partially determines the class of queries we can express. This is precisely what Codd understood and tried to address via the RDM.
--David McGoveran

And there's the rub: which type of data (facts) at which document level is of interest? Take this post. There are facts about it (e.g., author, title, date and so on). There are facts in it (its content). Either can can be readily represented relationally, for example:


where CONTENT is a column defined on a text, PDF, or HTML domain with built-in operators applicable to values of either of those types (e.g., a substring operator for text). Facts at other levels (e.g., grammatical, or semantic) could be of interest and would require multi-table representation. One must choose the type/level of information of interest to represent relationally in a database. We can choose to not do the analysis and modeling of the content of documents, but that does not mean that they are unstructurable as facts. More often than not data professionals don’t know what type of facts are to be represented, or are unfamiliar with data modeling and relational fundamentals. Product advocates avoid to say that without investing time and effort in analysis and modeling one cannot ask the same questions of and produce results equivalent to those from relational databases (i.e., make precise inferences from data that are guaranteed to be correct -- logically valid and semantically consistent). In fact, the use of such products trades upfront structuring effort for subsequent prohibitive manipulation effort.

As David points out, "complaints about RDM are not about knowledge representation, but knowledge discovery -- the problem, for example, that Google Search, analytics and data integration face and attempt to solve. It's an expensive, imprecise, and difficult problem", but it is distinct from what database management does and the two should not be confused.





Saturday, July 10, 2021

Relational Misconceptions Part 2: RDM is Applied Theory

In Part 1 we showed (yet again) how even those with their heart in the right (relational) place can't help being affected by the common and entrenched industry misconceptions, in this case about relationships, relations and tables. More often than not authors exhibit the very misconceptions they try to debunk.

We left the author distinguishing sets (with unordered, unique elements) from tables (lists of ordered, possibly duplicate rows). 

Thursday, July 1, 2021

OBG: Experimental Science and Database Design

Note: To demonstrate the correctness and stability conferred by a sound theoretical foundation relative to the industry's fad-driven, ad-hoc "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old (2000-06), so that you can judge for yourself how well my arguments of then hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references.

The following is an email exchange from 2001 that I recommend reading jointly with my Data Meaning and Mining post (itself a revision of an article originally published at the old "All Analytics" website). I have slightly touched my replies for pedagogical purposes and clarity. You can substitute any data structure for XML hierarchy.

View My Stats