Thursday, July 22, 2021

Documents and Databases

'These new data technologies were developed because there are new usage scenarios for data — which do not fit into the relational model.'
--Reddit.com

Don't let the NoSQL label fool you. It's the relational model (RDM), not SQL, that its proponents are really dismissing. The main argument, as advanced in a recent LinkedIn exchange, is that lots of information "cannot be represented in rows and columns". IOW, the RDM is not general enough -- there are certain types of information that it is not suited for. Ignoring the tabular nonsense, the response from David McGoveran, is important enough to restate here.
“Information consists of facts (i.e., propositions asserted to be true) about objects, properties, and relationships among objects and properties. We have shown that a database relation -- which a R-table visualizes -- is constrained to represent a set of facts about (properties of) a group of entities with within-group relationships among properties and entities and cross-group relationships. Yet we are told that document information "do not fit" in a relational structure. They are referred to as "unstructured" (which, if they were, they would contain random noise, not information).

But documents don't lack structure. Rather, they are multistructured: have complex multi-level/type structures -- lots of content, metadata, interpretations, and internal relationships (formatting, semantic, structural or syntactic, and so on). At one level of analysis, they are just documents that have subject matter or content involving objects, properties and relationships. At another they might relate to that of other data (e.g., other documents). How we represent knowledge and in how much detail is determined by which of the structures we choose to represent and that always partially determines the class of queries we can express. This is precisely what Codd understood and tried to address via the RDM.
--David McGoveran

And there's the rub: which type of data (facts) at which document level is of interest? Take this post. There are facts about it (e.g., author, title, date and so on). There are facts in it (its content). Either can can be readily represented relationally, for example:

POSTS (AUTHOR,TITLE,DATE,CONTENT)

where CONTENT is a column defined on a text, PDF, or HTML domain with built-in operators applicable to values of either of those types (e.g., a substring operator for text). Facts at other levels (e.g., grammatical, or semantic) could be of interest and would require multi-table representation. One must choose the type/level of information of interest to represent relationally in a database. We can choose to not do the analysis and modeling of the content of documents, but that does not mean that they are unstructurable as facts. More often than not data professionals don’t know what type of facts are to be represented, or are unfamiliar with data modeling and relational fundamentals. Product advocates avoid to say that without investing time and effort in analysis and modeling one cannot ask the same questions of and produce results equivalent to those from relational databases (i.e., make precise inferences from data that are guaranteed to be correct -- logically valid and semantically consistent). In fact, the use of such products trades upfront structuring effort for subsequent prohibitive manipulation effort.

As David points out, "complaints about RDM are not about knowledge representation, but knowledge discovery -- the problem, for example, that Google Search, analytics and data integration face and attempt to solve. It's an expensive, imprecise, and difficult problem", but it is distinct from what database management does and the two should not be confused.

 

 

 

 

Saturday, July 10, 2021

Relational Misconceptions Part 2: RDM is Applied Theory

In Part 1 we showed (yet again) how even those with their heart in the right (relational) place can't help being affected by the common and entrenched industry misconceptions, in this case about relationships, relations and tables. More often than not authors exhibit the very misconceptions they try to debunk.

We left the author distinguishing sets (with unordered, unique elements) from tables (lists of ordered, possibly duplicate rows). 

Thursday, July 1, 2021

OBG: Experimental Science and Database Design

Note: To demonstrate the correctness and stability conferred by a sound theoretical foundation relative to the industry's fad-driven, ad-hoc "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments of then hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references.

The following is an email exchange from 2001 that I recommend reading jointly with my Data Meaning and Mining post (itself a revision of an article originally published at the old "All Analytics" website). I have slightly touched my replies for pedagogical purposes and clarity. You can substitute any data structure for XML hierarchy.

Thursday, June 10, 2021

Entities, Properties and Codd's Sleight of Hand

A recent LinkedIn post tried to illustrate in graphic form some old bunk that had been debunked to death decades ago: the purported inferiority of relational relative to directed graph database management. The plethora of "Great viewpoint!", "great representation", "love this!", "like", "Nice!, "Very clever!" reactions it triggered confirm the lack of foundation knowledge and familiarity with history of the field in the industry, without which there cannot be any progress. I will not waste any time debunking it. Instead, I will use a quote by a participant in the exchange from a Date and Darwen (D&D) book to bring up an important insight by David McGoveran.

“Overall, we believe the most appropriate design will emerge if careful consideration is given to the distinction between (a) declarative sentences in natural language, on the one hand, and (b) the vocabulary used in the construction of such sentences on the other. As we showed in Chapter 2 (but simplifying slightly here), it is unencapsulated tuples in relations that stand for those sentences, and it is encapsulated domain values in attributes in those tuples that stand for particular elements — typically nouns — in that vocabulary. To say it slightly differently (and to repeat what we said in Chapter 2, albeit in different words): Domains, or types, give us values that represent things we might wish to make statements about; relations give us ways of making those statements. Consider once again the EMP relvar of Design R. Suppose that relvar includes the tuple:

TUPLE {EMP# 'E7',ENAME 'Amy',DEPT# 'D5',SALARY MONEY(60000)}

The existence of this tuple in the relvar means, by definition, that the database includes something that assert  that the following declarative sentence (statement) is true:

Employee E7, named Amy, is assigned to department D5 and earns a salary of $60,000."

Sunday, May 16, 2021

(TYFK) Data Model, Logical Model and Schema


Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“Doesn't the data model come before the schema? I was tasked to build a data model and one of the resources was a schema. Isn't the schema made from the data model?”

“A data model can be different things. A schema can be a data model. Before that, there's a conceptual model, derived from the problem domain, then a logical model, capturing the entities, attributes, and relationships. After that, a schema is implemented based on those two models.”

“Yes, but if the system evolved, in practice you will have the schema (the structure of physical tables) as the ground truth, and you need to extract the logical model from it. In teaching environment of we tend to begin with the logical model and then create tables based on that.”

“this makes a little more sense to me. i thought a default data model would be out there but i can't find one. so i'm basically "recreating" one from the schema. then i assume i'll be adding on to it with third party products.”

                                                               --Reddit.com

Monday, May 10, 2021

(TYFK) What Domains Are and Are Not

Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“...a Data Domain refers to all the valid values which a data element (column) may contain. The rule for determining the domain boundary may be as simple as a data type with a list of possible values. For example, a database table that has information about people, with one record per person, might have an "age" column. This gender column might be declared as a SMALLINT data type, and allowed to have a value between 0 and 120. The data domain for the age column is hence 0 - 120. In a normalized data model, the reference domain is typically specified in a reference table. Following the previous example, the age reference table could have exactly 120 records, one per allowed value. Reference tables are formally related to other tables in a database by the use of foreign keys. A better way would be to enforce the data domain through a check constraint. For example, the age column would require positive numeric values between 0 and 120. I have found that the best way to figure out all of your data domains and constraints is to spend some time designing and normalizing all of your tables.”
--Quora.com

Misconceptions

  • There are no tables and, thus, no columns in relational databases;
  • Domains are not (programming) data types;
  • It is not the data model that is normalized;
  • A referenced relation does not reference domains;
  • A SQL CHECK constraint is not "better enforcement" of a referential constraint;
  • Constraints are not determined BY logical design;
  • Logical database design does not involve explicit normalization (to 1NF) or further normalization to 5NF.

Fundamentals

  • Relational databases consist of relations with attributes defined on domains; tables with columns visualize relations with attributes, but play no part in the RDM.
  • A relational domain represents a real world property and is a database object under DBMS control and, thus, is distinct from a programming data type which is an application object under programmer control that may not represent anything in the real world.
  • 1NF (normalization) and 5NF (full normalization) are properties of relations (which comprise logical models), not of the data model (i.e., the RDM).
  • An attribute which is a foreign key in a referencing relation references a primary key which is an attribute  in a referenced relation.
  • A constraint can be expressed in syntactically different ways by a data sublanguage. The CHECK constraint is a syntactic alternative in SQL to declare referential constraints.
  • Database relations are semantically constrained to be consistent with (i.e., represent faithfully) the corresponding conceptual model. Properties and properties in context (i.e., of specific entity types) are identified during conceptual modeling. Domain and attribute constraints respectively are specified during logical design to ensure consistency with the properties and properties in context they represent in the database.
  • Database design that adheres the three principles mandated by the RDM produces 1NF and 5NF databases that do not require explicit normalization and further normalization.

Note: The difference between relational domains and programming data types are specified in Codd's RM/V2 book. SQL tables are not relations and SQL data types are not relational domains.


Recommended reading

Domains: The Database Glue

Understanding Domains and Attributes

The Interpretation and Representation of Database Relations





Monday, May 3, 2021

(OBG) Don't Confuse Levels of Representation Part 2

Note: To demonstrate the correctness and stability due to a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references.

This is a continuation of an email exchange with readers in response to my post Normalization and Performance: Never the Twain Shall Meet started in Part 1

On Normalization, Performance and Database Correctness

(originally posted 03/22/2001)

“I read your article on Normalization and database speed. The rules of normalization are used to provide a guide when designing a logical approach to managing information. If that information management plan (IMP) is to be implemented using a database, then the ultimate "test" of the IMP is its performance on the physical level. The IMP architecture is commonly expressed using an Entity Relationship Diagram or ERD. Sure CPU, RAM, and RAID all play an important role in database performance. But when a DBA changes the way tables, columns, or indexes are structured he is changing the IMP and corresponding ERD. Now there are two sides to this story.
  • The Pro De-Normalization group viewpoint: If the above change results in a "faster" database then denormalization has been "proven" to be necessary.
  • The Pro Normalization group viewpoint: A change is only valid if it maintains the integrity, and flexibility that are inherent in a fully Normalized database. For example, violation of the second rule "Eliminate Redundant data" can result in corrupt data when data in two or more locations is not properly update, changed, or inserted in all locations.
When the second rule is violated then data corruption must be safeguarded against. The safeguard must be implemented on the database level because that is the only way corrupt data can be prevented regardless of the source (i.e. data loaded from a flat file, DBAs and users logged onto SQL*Plus changing information, other application software that also has access to the database, and last but not least; application developers that are not aware of the duplicate data stored in more than one location). I usually implement the safeguard by using database triggers so that when a record is changes in one table a trigger will make the corresponding change in another table. I have found that most "performance gains" achieved by denormalization are eradicated by the use of the trigger needed to safeguard the integrity of the data. The bottom line is that De-Normalization can often increase database speed. But this is most often achieved at the expense of data integrity.”
View My Stats