Thursday, June 10, 2021

Entities, Properties and Codd's Sleight of Hand

A recent LinkedIn post tried to illustrate in graphic form some old bunk that had been debunked to death decades ago: the purported inferiority of relational relative to directed graph database management. The plethora of "Great viewpoint!", "great representation", "love this!", "like", "Nice!, "Very clever!" reactions it triggered confirm the lack of foundation knowledge and familiarity with history of the field in the industry, without which there cannot be any progress. I will not waste any time debunking it. Instead, I will use a quote by a participant in the exchange from a Date and Darwen (D&D) book to bring up an important insight by David McGoveran.

“Overall, we believe the most appropriate design will emerge if careful consideration is given to the distinction between (a) declarative sentences in natural language, on the one hand, and (b) the vocabulary used in the construction of such sentences on the other. As we showed in Chapter 2 (but simplifying slightly here), it is unencapsulated tuples in relations that stand for those sentences, and it is encapsulated domain values in attributes in those tuples that stand for particular elements — typically nouns — in that vocabulary. To say it slightly differently (and to repeat what we said in Chapter 2, albeit in different words): Domains, or types, give us values that represent things we might wish to make statements about; relations give us ways of making those statements. Consider once again the EMP relvar of Design R. Suppose that relvar includes the tuple:

TUPLE {EMP# 'E7',ENAME 'Amy',DEPT# 'D5',SALARY MONEY(60000)}

The existence of this tuple in the relvar means, by definition, that the database includes something that assert  that the following declarative sentence (statement) is true:

Employee E7, named Amy, is assigned to department D5 and earns a salary of $60,000."

Sunday, May 16, 2021

(TYFK) Data Model, Logical Model and Schema


Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“Doesn't the data model come before the schema? I was tasked to build a data model and one of the resources was a schema. Isn't the schema made from the data model?”

“A data model can be different things. A schema can be a data model. Before that, there's a conceptual model, derived from the problem domain, then a logical model, capturing the entities, attributes, and relationships. After that, a schema is implemented based on those two models.”

“Yes, but if the system evolved, in practice you will have the schema (the structure of physical tables) as the ground truth, and you need to extract the logical model from it. In teaching environment of we tend to begin with the logical model and then create tables based on that.”

“this makes a little more sense to me. i thought a default data model would be out there but i can't find one. so i'm basically "recreating" one from the schema. then i assume i'll be adding on to it with third party products.”

                                                               --Reddit.com

Monday, May 10, 2021

(TYFK) What Domains Are and Are Not

Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“...a Data Domain refers to all the valid values which a data element (column) may contain. The rule for determining the domain boundary may be as simple as a data type with a list of possible values. For example, a database table that has information about people, with one record per person, might have an "age" column. This gender column might be declared as a SMALLINT data type, and allowed to have a value between 0 and 120. The data domain for the age column is hence 0 - 120. In a normalized data model, the reference domain is typically specified in a reference table. Following the previous example, the age reference table could have exactly 120 records, one per allowed value. Reference tables are formally related to other tables in a database by the use of foreign keys. A better way would be to enforce the data domain through a check constraint. For example, the age column would require positive numeric values between 0 and 120. I have found that the best way to figure out all of your data domains and constraints is to spend some time designing and normalizing all of your tables.”
--Quora.com

Misconceptions

  • There are no tables and, thus, no columns in relational databases;
  • Domains are not (programming) data types;
  • It is not the data model that is normalized;
  • A referenced relation does not reference domains;
  • A SQL CHECK constraint is not "better enforcement" of a referential constraint;
  • Constraints are not determined BY logical design;
  • Logical database design does not involve explicit normalization (to 1NF) or further normalization to 5NF.

Fundamentals

  • Relational databases consist of relations with attributes defined on domains; tables with columns visualize relations with attributes, but play no part in the RDM.
  • A relational domain represents a real world property and is a database object under DBMS control and, thus, is distinct from a programming data type which is an application object under programmer control that may not represent anything in the real world.
  • 1NF (normalization) and 5NF (full normalization) are properties of relations (which comprise logical models), not of the data model (i.e., the RDM).
  • An attribute which is a foreign key in a referencing relation references a primary key which is an attribute  in a referenced relation.
  • A constraint can be expressed in syntactically different ways by a data sublanguage. The CHECK constraint is a syntactic alternative in SQL to declare referential constraints.
  • Database relations are semantically constrained to be consistent with (i.e., represent faithfully) the corresponding conceptual model. Properties and properties in context (i.e., of specific entity types) are identified during conceptual modeling. Domain and attribute constraints respectively are specified during logical design to ensure consistency with the properties and properties in context they represent in the database.
  • Database design that adheres the three principles mandated by the RDM produces 1NF and 5NF databases that do not require explicit normalization and further normalization.

Note: The difference between relational domains and programming data types are specified in Codd's RM/V2 book. SQL tables are not relations and SQL data types are not relational domains.


Recommended reading

Domains: The Database Glue

Understanding Domains and Attributes

The Interpretation and Representation of Database Relations





Monday, May 3, 2021

(OBG) Don't Confuse Levels of Representation Part 2

Note: To demonstrate the correctness and stability due to a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references.

This is a continuation of an email exchange with readers in response to my post Normalization and Performance: Never the Twain Shall Meet started in Part 1

On Normalization, Performance and Database Correctness

(originally posted 03/22/2001)

“I read your article on Normalization and database speed. The rules of normalization are used to provide a guide when designing a logical approach to managing information. If that information management plan (IMP) is to be implemented using a database, then the ultimate "test" of the IMP is its performance on the physical level. The IMP architecture is commonly expressed using an Entity Relationship Diagram or ERD. Sure CPU, RAM, and RAID all play an important role in database performance. But when a DBA changes the way tables, columns, or indexes are structured he is changing the IMP and corresponding ERD. Now there are two sides to this story.
  • The Pro De-Normalization group viewpoint: If the above change results in a "faster" database then denormalization has been "proven" to be necessary.
  • The Pro Normalization group viewpoint: A change is only valid if it maintains the integrity, and flexibility that are inherent in a fully Normalized database. For example, violation of the second rule "Eliminate Redundant data" can result in corrupt data when data in two or more locations is not properly update, changed, or inserted in all locations.
When the second rule is violated then data corruption must be safeguarded against. The safeguard must be implemented on the database level because that is the only way corrupt data can be prevented regardless of the source (i.e. data loaded from a flat file, DBAs and users logged onto SQL*Plus changing information, other application software that also has access to the database, and last but not least; application developers that are not aware of the duplicate data stored in more than one location). I usually implement the safeguard by using database triggers so that when a record is changes in one table a trigger will make the corresponding change in another table. I have found that most "performance gains" achieved by denormalization are eradicated by the use of the trigger needed to safeguard the integrity of the data. The bottom line is that De-Normalization can often increase database speed. But this is most often achieved at the expense of data integrity.”

Friday, April 23, 2021

Relational Misconceptions Part 1: Relationships and Tables

Amid the plethora of industry misconceptions, an article titled "What if I told you there are no tables in relational databases?" is surprising. That it starts with:

“I’ve seen one sentence about relational databases repeated on the Web many, many times. I’ve seen it in countless comments, I’ve seen it in few articles. Recently I’ve even seen it in one book — which finally made me write this article. The sentence in question goes like this: "Ironically, relational databases deal poorly with relationships". Read it carefully. Think about it for a moment. I’m sure it must sound perfectly reasonable — for anyone who doesn’t understand the meaning of both "relational" and "irony".”

is practically shocking. Be that as it may, my regular readers know that I refer to pronouncements on the RDM as "heart in the right place": correct statements are not a guarantee of full grasp thereof, or effective explanations to practitioners lacking the necessary foundation knowledge. Hence this two part debunking.

Sunday, April 11, 2021

(TYFK) Relations, Tables, and Semantic Consistency

Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).


“A relation, or table, in a relational database ... must have a set of columns or attributes, and it must have a set of rows to contain the data. A tuple (or row) can be a duplicate. In practice, a database might actually contain duplicate rows, but there should be practices in place to avoid this, such as the use of unique primary keys (next up). Given that a tuple cannot be a duplicate, it follows that a relation must contain at least one attribute (or column) that identifies each tuple (or row) uniquely. This is usually the primary key. This primary key cannot be duplicated.”

 

Misconceptions

 
  • A relation is not a table and, thus, has neither columns, nor rows (certainly not fields);
  • "Duplicate tuples" is a contradiction in terms -- a table with duplicate rows does not visualize a relation (i.e., is not a R-table) -- and a database with duplicated data is not relational;
  • Unlike a mathematical relation, there is no such thing as a database relation without a PK, which would be semantically inconsistent (i.e. it would not be a faithful representation of group of entities, which are distinguishable in the real world);
 

Fundamentals

 

  • A database relation:
- is a relationship among domains (sets of values) -- a subset of their cross-product -- a set of tuples (sets of values drawn from the domains) or, in other words, a set of sets from sets;
- has attributes, which are representations (1:1 mappings) of the domains in the relation;
- is semantically constrained to be consistent with (i.e., represent faithfully) the entity group in the conceptual model represented by the database), including by a PK constraint:
. domains represent properties;
. relations represent entity groups;
. attributes represent entity properties;
. tuples represent (facts about -- properties of) entities;
. some constraints represent relationships among properties, entities and groups (some properties are relationships);
  • Duplication would violate the RDM and mean semantic inconsistency with (inaccurate representation of) the group.
  • A R-table visualizes a relation on some physical medium -- it plays no part in the relational model.


Further Reading


What Relations Really Are and Why They Are Important

Understanding Relations series

What Is a Relational Database




Thursday, March 25, 2021

(OBG) Don't Confuse Levels of Representation Part 1

Note: To demonstrate the correctness and stability due to a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references.

This is an email exchange with readers in response to my article Normalization and Performance: Never the Twain Shall Meet.

View My Stats