Sunday, May 16, 2021

TYFK: Data Model, Logical Model and Schema




Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“Doesn't the data model come before the schema? I was tasked to build a data model and one of the resources was a schema. Isn't the schema made from the data model?”

“A data model can be different things. A schema can be a data model. Before that, there's a conceptual model, derived from the problem domain, then a logical model, capturing the entities, attributes, and relationships. After that, a schema is implemented based on those two models.”

“Yes, but if the system has evolved, in practice you will have the schema (the structure of physical tables) as the ground truth, and you need to extract the logical model from it. In a teaching environment we tend to begin with the logical model and then create tables based on that.”

“This makes a little more sense to me. I thought a default data model would be out there, but I can't find one. So I'm basically "recreating" one from the schema. Then I assume I'll be adding on to it with third-party products.”

                                                               --Reddit.com
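
To make the exchange concrete, here is a minimal sketch, in SQL, of what "implementing a schema based on the logical model" typically amounts to in practice. The entity types (Customer, Order), their attributes, and all names are invented for illustration and are not taken from the exchange.

    -- Illustrative only: a toy logical model fragment expressed as a SQL schema.
    CREATE TABLE customers (
        customer_id   INTEGER PRIMARY KEY,
        customer_name VARCHAR(100) NOT NULL
    );

    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (customer_id), -- relationship to Customer
        order_date  DATE NOT NULL
    );

Keep in mind that, per the RDM, such tables merely visualize relations, and that the SQL schema is distinct from both the conceptual model and the logical model it is meant to implement.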

Monday, May 10, 2021

TYFK: What Domains Are and Are Not



Note: Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can review references -- reflecting the current understanding of the RDM, distinct from whatever has passed for it in the industry to date -- which explain and correct the misconceptions. You can acquire further knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

“...a Data Domain refers to all the valid values which a data element (column) may contain. The rule for determining the domain boundary may be as simple as a data type with a list of possible values. For example, a database table that has information about people, with one record per person, might have an "age" column. This age column might be declared as a SMALLINT data type, and allowed to have a value between 0 and 120. The data domain for the age column is hence 0 - 120. In a normalized data model, the reference domain is typically specified in a reference table. Following the previous example, the age reference table could have exactly 120 records, one per allowed value. Reference tables are formally related to other tables in a database by the use of foreign keys. A better way would be to enforce the data domain through a check constraint. For example, the age column would require positive numeric values between 0 and 120. I have found that the best way to figure out all of your data domains and constraints is to spend some time designing and normalizing all of your tables.”
--Quora.com
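
For concreteness, here is a sketch of the two SQL constructs the answer describes -- a "reference table" enforced via a foreign key, and a CHECK constraint -- using invented table and column names. As the debunking below explains, neither construct is a relational domain.

    -- The "reference table" approach described in the quote: allowed values as rows.
    CREATE TABLE allowed_ages (
        age SMALLINT PRIMARY KEY
    );

    CREATE TABLE persons_fk (
        person_id INTEGER PRIMARY KEY,
        age       SMALLINT NOT NULL REFERENCES allowed_ages (age)
    );

    -- The CHECK constraint approach: the same value restriction declared in-line.
    CREATE TABLE persons_check (
        person_id INTEGER PRIMARY KEY,
        age       SMALLINT NOT NULL CHECK (age BETWEEN 0 AND 120)
    );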

Misconceptions

  • There are no tables and, thus, no columns in relational databases;
  • Domains are not (programming) data types;
  • It is not the data model that is normalized;
  • A referencing relation does not reference domains;
  • A SQL CHECK constraint is not "better enforcement" of a referential constraint;
  • Constraints are not determined BY logical design;
  • Logical database design does not involve explicit normalization (to 1NF) or further normalization (to 5NF).

Fundamentals

  • Relational databases consist of relations with attributes defined on domains; tables with columns visualize relations with attributes, but play no part in the RDM.
  • A relational domain represents a real-world property and is a database object under DBMS control; it is thus distinct from a programming data type, which is an application object under programmer control and may not represent anything in the real world.
  • 1NF (normalization) and 5NF (full normalization) are properties of relations (which comprise logical models), not of the data model (i.e., the RDM).
  • An attribute that is a foreign key in a referencing relation references a primary key, which is an attribute in the referenced relation.
  • A constraint can be expressed in syntactically different ways by a data sublanguage. The CHECK constraint is a syntactic alternative in SQL to declare referential constraints.
  • Database relations are semantically constrained to be consistent with (i.e., represent faithfully) the corresponding conceptual model. Properties and properties in context (i.e., of specific entity types) are identified during conceptual modeling. Domain and attribute constraints, respectively, are specified during logical design to ensure consistency with the properties and properties in context they represent in the database.
  • Database design that adheres to the three principles mandated by the RDM produces 1NF and 5NF databases that do not require explicit normalization or further normalization.

Note: The differences between relational domains and programming data types are specified in Codd's RM/V2 book. SQL tables are not relations and SQL data types are not relational domains.
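
The closest SQL comes to declaring a "domain" is the standard CREATE DOMAIN statement (supported by, e.g., PostgreSQL). A sketch, with invented names, is shown below; note that even this is essentially a named, constrained data type, falling short of a relational domain as specified in RM/V2.

    -- Sketch only: SQL's CREATE DOMAIN names a constrained data type.
    CREATE DOMAIN age_value AS SMALLINT
        CHECK (VALUE BETWEEN 0 AND 120);

    CREATE TABLE persons (
        person_id INTEGER PRIMARY KEY,
        age       age_value NOT NULL
    );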


Recommended reading

Domains: The Database Glue

Understanding Domains and Attributes

The Interpretation and Representation of Database Relations





Monday, May 3, 2021

OBG: Don't Confuse Levels of Representation Part 2



Note: To demonstrate the correctness and stability due to a sound theoretical foundation relative to the industry's fad-driven "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my arguments hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references.

This is a continuation of an email exchange with readers in response to my post Normalization and Performance: Never the Twain Shall Meet, started in Part 1.

On Normalization, Performance and Database Correctness

(originally posted 03/22/2001)

“I read your article on Normalization and database speed. The rules of normalization are used to provide a guide when designing a logical approach to managing information. If that information management plan (IMP) is to be implemented using a database, then the ultimate "test" of the IMP is its performance on the physical level. The IMP architecture is commonly expressed using an Entity Relationship Diagram or ERD. Sure, CPU, RAM, and RAID all play an important role in database performance. But when a DBA changes the way tables, columns, or indexes are structured, he is changing the IMP and corresponding ERD. Now there are two sides to this story.
  • The Pro De-Normalization group viewpoint: If the above change results in a "faster" database then denormalization has been "proven" to be necessary.
  • The Pro Normalization group viewpoint: A change is only valid if it maintains the integrity and flexibility that are inherent in a fully Normalized database. For example, violation of the second rule, "Eliminate Redundant data," can result in corrupt data when data in two or more locations is not properly updated, changed, or inserted in all locations.
When the second rule is violated, data corruption must be safeguarded against. The safeguard must be implemented on the database level because that is the only way corrupt data can be prevented regardless of the source (i.e., data loaded from a flat file, DBAs and users logged onto SQL*Plus changing information, other application software that also has access to the database, and, last but not least, application developers that are not aware of the duplicate data stored in more than one location). I usually implement the safeguard by using database triggers so that when a record is changed in one table a trigger will make the corresponding change in another table. I have found that most "performance gains" achieved by denormalization are eradicated by the use of the trigger needed to safeguard the integrity of the data. The bottom line is that De-Normalization can often increase database speed. But this is most often achieved at the expense of data integrity.”
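
To make the reader's safeguard concrete, here is a sketch of the kind of trigger described. Oracle-style syntax is assumed (the reader mentions SQL*Plus), and the table and column names are invented: a customer name stored redundantly in a denormalized orders table is kept in sync with its master copy.

    -- Sketch only: propagate a change to a redundantly stored value.
    CREATE OR REPLACE TRIGGER sync_customer_name
    AFTER UPDATE OF customer_name ON customers
    FOR EACH ROW
    BEGIN
        UPDATE orders_denorm
           SET customer_name = :NEW.customer_name
         WHERE customer_id = :NEW.customer_id;
    END;
    /

Each such trigger adds row-level work to every update of the master copy, which is the overhead the reader says tends to eradicate the "performance gains" of denormalization.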