Friday, August 28, 2020

TYFK: Denormalization Part 1 - Fundamentals


““Main Question: How do we trade-off while doing denormalization? 
  • Sub-question 1: the standard to implement
- Do we always have to denormalize a model? For what kind of project must we use denormalization techniques while others may not?
- Since denormalization has its gains and losses, how well should we denormalize a data model? Perhaps, the more complete we denormalize, the more complex, uncertain and poor the situation will be.
  • Sub-question 2: the characteristics of normalization
-Does denormalization have several levels/forms the same as that of normalization? For instance: 1DNF, 2DNF...
- Given we can denormalize a data model, it may never be restored to the original one because to do normalization, one can have many ways while to build a data model, you can have multiple choices in determining entities, attributes, etc.””

Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

In Part 1 we discuss the relevant fundamentals in which we will ground the debunking in Part 2.

Monday, August 24, 2020

Oldies But Goodies: Data Sublanguage vs. SQL


Note: To demonstrate the stability of a sound foundation relative to the industry's fad-driven cookbook practices, I am re-publishing under "Oldies But Goodies" material from the old DBDebunk.com (2000-06), so that you can judge for yourself how well my debunkings hold up and whether the industry has progressed beyond the misconceptions they were intended to dispel. I may break long pieces into more pointed parts, and add comments and references to further reading.

From "Little Relationship to Relational" originally posted on March 29, 2001.
“Crucial to making the dream of relational databases a reality was the development of a new query language which took significant advantage of relational theory. Then contemporary query languages like CODASYL usually required incredibly complicated code to answer even the simplest of questions from a database. Codd himself developed a relational language called ALPHA that he often used to communicate his ideas in an academic context [Actually, this is not true], but it attracted almost no commercial interest. In 1974 and 1975 Raymond Boyce and Don Chamberlin of IBM designed a new fourth generation language to extract information from systems based on Codd's relational model known as Structured English Query Language, or SEQUEL. This would later be shortened to SQL but is still most correctly pronounced "sequel." As a relational language, SQL was designed to be descriptive rather than procedural. SQL allowed users to specify the results they wanted rather than having to specify the procedure to obtain the desired results. Theoretically this would absolve the user from having to figure out the best way to execute the program and make information available to a whole class of non-programmers. The model's scope and elegance soon drew many converts.”
It is somewhat of a stretch to deem SQL a full fledged data [sub-]language, and I am not sure how significant is the advantage it takes of relational theory. Codd, Date and others, myself included, have amply demonstrated that not only does SQL fail to support, or violates important relational features, but that it is actually a poorly designed language as well. It can be argued that at least one reason relational technology has not reached its potential is that SQL was practically its only concretization (an arguably better language, QUEL, could not compete due to lack of the then critical IBM compatibility). Not only do most practitioners think that SQL DBMSs are relational, but they actually blame the problems due to SQL's violations of, or lack of adherence to relational principles on the relational model itself!

The claim of non-procedurality (the term is declarative, not descriptive) is rather questionable. While SQL is, relatively speaking, less procedural than programming languages, or the quasi-data languages preceding, it is by no means as declarative as it should and could have been and in many cases (particularly those when it eschews relational features) the SQL formulation can be extremely procedural (see, for example, the chapters on normalization, redundancy and integrity in my latest book). One case in point are sub-queries (the "structured" in SQL's name comes from them): they were put in the language to avoid direct expressions of relational operations such as JOIN, INTERSECT, or DIFFERENCE. Unfortunately, they make SQL a highly redundant language--the same query can be expressed in many ways. Codd warned IBM about the optimization implications of such redundancy, but was ignored. In 1983 I ran multiple formulations of the same query against several SQL PC DBMSs and got response times ranging from 2 to 2500 seconds. Ironically, not all direct expressions could be avoided (e.g. UNION); what is more, explicit expression of all relational operations were later added to SQL anyway. So now there is double redundancy and the optimization complications worse!


Further Reading

Data Sublanguage series

Domains, R-tables, and SQL

Not Worth Repeating: Duplicates

To Really Understand Integrity, Don't Start with SQL

NULL Value is a Contradiction in Terms

Precision, Procedurality and SQL

DISTINCT and ORDER BY Are Not Relational
 

Language Redundancy and DBMS Performance: A SQL Story





Thursday, August 20, 2020

TYFK: Relations, Tables, Domains and Normalization


“The most popular data model in DBMS is the Relational Model. It is more scientific a model than others. This model is based on first-order predicate logic and defines a table as an n-ary relation. The main highlights of this model are:
  • Data is stored in tables called relations.
  • Relations can be normalized. In normalized relations, values saved are atomic values.
  • Each row in a relation contains a unique value.
  • Each column in a relation contains values from a same domain.”


Each "Test Your Foundation Knowledge" post presents one or more misconceptions about data fundamentals. To test your knowledge, first try to detect them, then proceed to read our debunking, which is based on the current understanding of the RDM, distinct from whatever has passed for it in the industry to date. If there isn't a match, you can acquire the knowledge by checking out our POSTS, BOOKS, PAPERS, LINKS (or, better, organize one of our on-site SEMINARS, which can be customized to specific needs).

Friday, August 7, 2020

Oldies But Goodies: Data Models and Physical Independence


Note: To appreciate the stability of a sound foundation vs the industry's fad-driven cookbook practice, I am re-publishing some of the articles and reader exchanges from the old DBDebunk.com (2000-06), giving you the opportunity to judge for yourself how well my claims/arguments hold up and whether the industry has progressed at all. I am adding comments on re-publication where necessary. Long pieces are broken into smaller parts for fast reading.

From "Little Relationship to Relational" originally posted on March 29, 2001.

 
“E.F. ("Ted") Codd conceived of his relational model for databases while working at IBM in 1969. Codd's approach took a clue from first-order predicate logic, the basis of a large number of other mathematical systems and presented itself [sic] in terms of set theory, leaving the physical definition of the data undefined and implementation dependent. In June of 1970, Codd laid down much of his extensive groundwork for the model in his article, "A Relational Model of Data for Large Shared Data Banks" published in the Communications of the ACM, a highly regarded professional journal published by the Association for Computing Machinery. Buoyed by an intense reaction against the ad hoc data models offered by the physically oriented mainframe databases, Codd's rigid separation of the logical model, with its rigorous mathematical underpinnings, from the less elegant realities of hardware engineering was revolutionary in its day. Codd and his relational ideas blazed across the academic computing landscape over the next few years.”

View My Stats