Sunday, December 2, 2018

What Is a Data Model, and What It Is Not




“The term data model is used in two distinct but closely related senses. Sometimes it refers to an abstract formalization of the objects and relationships found in a particular application domain, for example the customers, products, and orders found in a manufacturing organization. At other times it refers to a set of concepts used in defining such formalizations: for example concepts such as entities, attributes, relations, or tables. So the "data model" of a banking application may be defined using the entity-relationship "data model". This article uses the term in both senses.”
--Data Model, Wikipedia

What a True Data Model Is


Few practitioners realize that Codd invented the Relational Data Model (RDM) as the first exemplar of a data model, a concept that he formalized in 1980 as follows:


Sunday, November 25, 2018

Data and Meaning Part 1: The RDM Is Applied Theory




“Fabian - With respect, maybe it's time to 'shake the formal foundations' of data management, especially given the rising costs and increasing segregation of silos.”
“John, if I were to say what I really think, I would be accused of insulting, so I won't. You don't need to respect me, but you better respect formal foundations. Since they are what gives SOUNDNESS to data management practice, what you are really saying is that you don't care about soundness -- do you really intend to take this position? I would not be surprised, because the industry has long "shook" the formal foundations and lack of soundness is precisely what characterizes it. But because there is no longer proper education, practitioners are totally unaware of the relationship between formal foundations and soundness, everything is ad-hoc and arbitrary, yet they fail to recognize the consequences.”[1]
--LinkedIn.com
Thus went an exchange with John Gorman on LinkedIn, in which he posed several questions (which I answered in last week's post[2]); the subject was the importance of not confusing levels of representation and, more specifically, of avoiding conceptual-logical conflation (CLC)[3].

Somebody posted a link to my answers on LinkedIn, and in a comment on it John linked to a Richard Feynman YouTube lecture on "the general differences between the interests and customs of the mathematicians and the physicists". I responded that this is precisely my point: just as physics is not the mathematics used to describe it (a central issue in quantum mechanics), conceptual modeling is not data modeling -- the latter is the representation of the former in the database; they are distinct[2]. This brought to mind some older columns I published on the now-defunct All Analytics website, so this series is a revision thereof.

Saturday, November 10, 2018

Conceptual Modeling Is Not Data Modeling



“Ok, now that we have those two (Parts 3 and 4 of your series) 'on the table' so to speak, perhaps you would address these questions...
1. Would it be safe to say that facts expressed in a Conceptual model should be verifiable in reality?

2. Are the following facts logically equivalent or are they different:

a) The car with license number 62-JZK-6 has the color aquamarine blue
b) De auto met kenteken 62-JZK-6 heeft de kleur aquamarijnblauw

3. If a previously true fact is found in reality to be verifiably false, would that mean the Conceptual model is wrong or the Logical model, or reality?”

“I'm going to add another:

4. How does RDM handle temporal changes to the 'truth' of statement 2a) when:

a) The owner of the car paints it black.
b) The owner of the license plate legally transfers it to a truck.
c) The owner of the car replaces every single part except the chassis.”

John O'Gorman asked me these questions in a LinkedIn exchange[1] in response to my comments in another exchange on modeling[2], where I called attention to the confusion of levels of representation common in the industry, particularly conceptual-logical conflation (CLC)[3]: calling conceptual modeling data modeling both reflects and induces it.

Online exchanges are not a proper vehicle for learning, particularly of foundation knowledge, which is why I publish free blog posts, papers, and books to which to refer seriously interested data professionals. It just so happened that my just-posted four-part series covers the subject at hand[4], so I referred to it, as well as to other writings (the answers are already there if one cares to read them). I will not discuss the exchanges in full -- read them and judge for yourself -- but I promised to answer the questions here, where I can do them justice.

John raises primarily conceptual issues, not data model issues -- the latter are subservient to decisions at the former level -- but then asks "how does RDM handle" them. From experience, I recognize implicit doubt that the RDM can. As far as we know, there is no formal data model[5] that is a superior alternative to the RDM with respect to "handling" conceptual issues (in fact, there is no other formal data model -- i.e., one that satisfies Codd's definition -- period).


Since most of the issues involved are covered by McGoveran's work in progress[6] (in which my multi-part series is rooted), to ensure consistency with it I passed the questions by him. As he too pointed out, "Answers that work in all situations require highly complicated discussions and lots of time, and trying to teach someone without proper experience and educational background would be very cumbersome, or an oversimplification, via online exchanges."


Here's what's possible within the constraints of a blog post -- the serious reader is referred to our writings.

Saturday, November 3, 2018

Understanding Conceptual vs. Data Modeling Part 4: Properties-object Modeling



Revised 6/26/19.

In Part 1 and Part 2 we explained that when the RDM (1969-70) and the E/RM (1976) were introduced, there was no distinction between a conceptual and a logical level -- the conceptual-logical-physical distinction of levels of representation emerged in the mid-80s. Only in 1980 did Codd specify the three components of a formal data model -- structure, integrity, manipulation. While the RDM satisfies the specification, the E/RM does not: it is a conceptual modeling approach, whose weaknesses have been elaborated on elsewhere[1]. In Part 3 we presented a common example of conceptual-logical conflation (CLC), and the corresponding confusion of types of model (conceptual, logical, physical, and data).

As promised, here we outline a new conceptual modeling approach derived by David McGoveran from his work formalizing Codd's RDM. It makes an ontological commitment different from that of conventional modeling, which requires revision and extension of the RDM -- an objective of David's effort.


Sunday, October 28, 2018

Understanding Conceptual vs. Data Modeling Part 3: Don't Conflate Reality and Data




In Part 1 and Part 2 we explained that between 1975 and 1981, when the E/RM and RDM were introduced, there was no distinction between an informal conceptual level and a formal logical level. In 1980, however, Codd defined a formal data model, and in the late 80s the conceptual-logical-physical levels of representation emerged. Applied to the two models:

  • Only the RDM satisfies the definition;
  • The E/RM can be used at the conceptual level to model reality, and the RDM to model data at the logical level (i.e., to formalize conceptual models as logical models for database representation).
Current practitioners, however, continue to confuse levels of representation and to confuse/conflate types of model -- so much so that in my presentations I used to draw an imaginary line dividing the room into two sections, moving to the right section to discuss one level/model and to the left to discuss another.

Consider the question "Does data modeling slow down the application development process?" I will set aside the notion of "speeding up" application development by skipping "data modeling" altogether (whichever way it is meant), and focus on the response.

Saturday, September 29, 2018

Understanding Conceptual vs. Data Modeling Part 2: E/RM Models Reality, RDM Models Data




Re-write 10/17/18
Revised 11/1/18

In Part 1 we explained that when the RDM and the E/RM were introduced, the distinct conceptual-logical-physical levels of representation had not yet emerged, and a data model had not yet been formally defined. But in 1980 Codd defined a formal data model as a combination of (1) data structures, (2) integrity constraints, and (3) operators on the structures[1], and later the conceptual-logical-physical trinity of levels came into being. Given a conceptual level distinct from the logical, do the RDM and the E/RM satisfy the definition -- are they data models in today's terms?

Recall from Part 1 that the RDM has all three components and is defined in purely logical terms, so it is a data model. But the E/RM definition intermingles conceptual and logical terminology, and therefore is not consistent with two distinct levels. Moreover, as a data model the E/RM is incomplete:

“The E/RM is not a data model as formally defined by Codd: no explicit structural component except sets classified in various ways, no explicit manipulative component except implied set operations, and very limited integrity (keys).”
--David McGoveran
Contrary to claims, Date does not exactly say that the E/RM is a data model:
“[It] is not even clear that the E/R "model" is truly a data model at all, at least in the sense in which we have been using that term in this book so far (i.e., as a formal system involving structural, integrity, and manipulative aspects). Certainly the term "E/R modeling" is usually taken to mean the process of deciding the structure (only) of the database, although [it does deal with] certain integrity aspects also, mostly having to do with keys ... However, a charitable reading of [Chen's original E/RM paper] would suggest that the E/R model is indeed a data model, but one that is essentially just a thin layer on top of the relational model (it is certainly not a candidate for replacing the relational model, as some have suggested).”[2]
Note that even if, charitably, the E/RM is considered a data model, it does not measure up to the RDM.

Tuesday, September 18, 2018

Don't Conflate/Confuse Primary Keys, PK Constraints, and Indexes




“What is the difference between an index and a key? How are they related?”

“There seams to be some confusion between what a Primary Key is, and what an Index is and how they are used. The Primary Key is a logical object. By that I mean that is simply defines a set of properties on one column or a set of columns to require that the columns which make up the primary key are unique and that none of them are null. Because they are unique and not null, these values (or value if your primary key is a single column) can then be used to identify a single row in the table every time. In most if not all database platforms the Primary Key will have an index created on it. An index on the other hand doesn’t define uniqueness. An index is used to more quickly find rows in the table based on the values which are part of the index. When you create an index within the database, you are creating a physical object which is being saved to disk.”

“A primary key by default creates a clustered index. A unique constraint/key by default creates a non-clustered index.”

“An index is a (logically) ordered list of rows. For example, an index on LastName means all values are already sorted in LastName order. Usually index rows contain far fewer columns in them than the table itself (except the clustered index, which is the table). A key is a column or columns that defines the order of an index. For example, on an index ordered by (LastName,FirstName), then LastName and FirstName are the keys. Btw, a primary key is a physical object, not a logical one. The db engine needs physical rows in order to insure unique values in the index.”
--Difference between an index and a key?, SQLTeam.com
I have recently published a paper[1], and posted a multipart series[2] on relational keys. In the latter I stated as follows:
"As a relational feature, keys can only be properly understood within the formal foundation of the RDM, which is simple set theory (SST) expressible in first order predicate logic (FOPL) adapted and applied to database management. Yet that is precisely what is ignored and dismissed in the industry -- including by the authors of SQL[3]."
I have also written extensively on the widespread logical-physical confusion (LPC)[4], most recently specifically in the key-index context[5]. The replies above are further examples -- if any more were needed -- validating my repeated claim of a lack of foundation knowledge in the industry. Can you tell what is wrong with them, and what is correct?
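
To make the distinction concrete, here is a minimal sketch (the table and column names are mine, purely hypothetical): a primary key is a logical constraint declared on the data, while an index is a physical access structure -- a DBMS may use one to enforce or exploit the constraint, but nothing at the logical level depends on it.

-- Logical: the PK constraint asserts uniqueness and prohibits NULLs;
-- it says nothing about storage or access paths.
CREATE TABLE employees (
  emp_no    INTEGER     NOT NULL,
  last_name VARCHAR(40) NOT NULL,
  dept_code CHAR(3)     NOT NULL,
  CONSTRAINT employees_pk PRIMARY KEY (emp_no)
);

-- Physical: an index created purely for retrieval speed; it declares
-- nothing about uniqueness or identification and changes no query's result.
CREATE INDEX employees_dept_ix ON employees (dept_code);

Whether, and with what kind of index, a DBMS backs the constraint is an implementation decision -- which is precisely the level the quoted replies conflate with the logical one.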

Wednesday, August 29, 2018

DISTINCT and ORDER BY Are Not Relational




“One of the things that confuse SQL users all the time is how DISTINCT and ORDER BY are related in a SQL query ... most people quickly understand:
   
SELECT DISTINCT length
FROM film

[that] returns results in an arbitrary order, because the database can (and might apply hashing rather than ordering to remove duplicates) ... Most people also understand:
   
SELECT length
FROM film
ORDER BY length

[that] will give us duplicates, but in order ... And, of course, we can combine the two:
   
SELECT DISTINCT length
FROM film
ORDER BY length

[But if] somewhat intuitively, we may want to order the lengths differently, e.g. by title:
   
SELECT DISTINCT length
FROM film
ORDER BY title

[m]ost databases [sic] fail this query with an exception like Oracle’s:

ORA-01791: not a SELECTed expression

At first sight ... this

SELECT length
FROM film
ORDER BY title

works after all ... So, how are these different? We have to rewind and check out the logical order of SQL operations (as opposed to the syntactic order). And always remember, this is the logical order, not the actual order executed by the optimiser.”
--How SQL DISTINCT and ORDER BY are Related, Jooq.org
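
For readers who want to experiment, a sketch against the same film table used in the quote: because DISTINCT is logically evaluated before ORDER BY, ordering the distinct lengths by title is ill-defined; a common rewrite makes the ordering expression depend only on the grouped column.

-- Ill-defined (rejected by most SQL DBMSs): after DISTINCT there is no
-- single title per remaining row to order by.
-- SELECT DISTINCT length FROM film ORDER BY title;

-- One well-defined alternative: one ordering value per distinct length.
SELECT length
FROM film
GROUP BY length
ORDER BY MIN(title);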

Wednesday, August 15, 2018

Order Is For Society, Not Databases




8/18/18: I have re-written this post for a better explanation. If you read it prior to the revision, you should re-read it.
 
“I learned that there is no concept of order in terms of tuples (e.g. rows) in a table, but according to wikipedia "a tuple is an ordered list of elements". Does that mean that attributes do have an order? If yes why would they be treated differently, couldn't one add another column to a table (which is why the tuples don't have order)? [OTOH], "In this notation, attribute–value pairs may appear in any order." Does this mean attributes have no order?”
--Do the “columns” in a table in a RMDB have order?
“Is it possible to reorder rows in SQL database? For example, how can I swap the order of 2nd row and 3rd row's values? The order of the row is important to me since i need to display the value according to the order [and] 'Order by' won't work for me. For example, I put a list of bookmarks in database. I want to display based on the result I get from query. (not in alphabet order). Just when they are inserted. But user may re-arrange the position of the bookmark (in any way he/she wants). So I can't use 'order by'. An example is how the bookmark display in the bookmark in firefox. User can switch position easily. How can I mention that in DB?”
--How can I reorder rows in sql database

While some data professionals may know that the rows and columns of "database tables" are "unordered", few know what that means, or understand why. This is due to two of the many common misconceptions[1] rooted in the industry's lack of foundation knowledge -- and the two are not unrelated: that relational databases consist of tables[2], and logical-physical confusion (LPC)[3]. Both obscure understanding of the RDM and its practical implications, which is reflected in the answers to the above questions. Rather than debunk those answers, this post fills the gap in knowledge so that you can debunk them yourself -- try it before and after you read it.
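
A hint of where the relational answer to the second question lies (a sketch; the table and column names are mine, not the poster's): any ordering that matters to users is information about the bookmarks, and information belongs in the database as attribute values, with presentation order requested per query.

-- User-controlled ordering represented as data (a position attribute),
-- not as a physical sequence of rows.
CREATE TABLE bookmarks (
  bookmark_id   INTEGER      NOT NULL,
  url           VARCHAR(500) NOT NULL,
  list_position INTEGER      NOT NULL,
  CONSTRAINT bookmarks_pk PRIMARY KEY (bookmark_id)
);

-- "Reordering" is an update of values, not a rearrangement of rows...
UPDATE bookmarks SET list_position = list_position + 1 WHERE list_position >= 3;

-- ...and the desired presentation order is imposed, per query, by ORDER BY.
SELECT url FROM bookmarks ORDER BY list_position;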



Sunday, August 5, 2018

No Such Thing As "Primary Key Tuning"




“The choice of good InnoDB primary keys is a critical performance tuning decision. This post will guide you through the steps of choosing the best primary key depending on your workload ... You would be surprised how many times I had to explain the importance of primary keys and how many debates I had around the topic as often people have preconceived ideas that translate into doing things a certain way without further thinking.”
--Yves Trudeau, Principal architect, Percona.com

I will be labeled "pedantic" and a "purist" for saying this, but there is no such thing as "PK tuning". "Choice of a good PK for performance" reflects logical-physical confusion (LPC), which, in turn, is rooted in a lack of foundation knowledge and a failure to grasp the RDM -- not just by database practitioners, but even by DBMS designers.

Monday, July 30, 2018

Lenin, Trotsky, Data Management, and the Tyranny of Knowledge and Reason -- Version 2




14 years ago I published an editorial with this title in response to a Slashdot.com exchange triggered by one of my articles at DBAzine.com. The objective was to illustrate some of the damage done to the IT industry by the collapse of education, reflecting a trend in society as a whole. Things have only gotten worse since then, and for quite a while I considered a re-write. A recent LinkedIn exchange provided the impetus, so here it is.

Monday, July 23, 2018

Industry Practice Is No Substitute for Foundation Knowledge




“A short time ago a colleague asked me where he could find a "Databases 101" guide for the non-technical professional. As it turns out, the internet is littered with information, and mis-information, regarding data and databases. This makes it difficult for someone entering the field of data and databases to understand the bigger picture. He wanted something to help make sense of data and databases, specifically relational versus NoSQL. So, that’s what I decided to work on. This is the start of my Databases 101 guide for the non-database professional; something to help anyone understand why the word “database” is an overloaded term. We use the term "database" to describe a great many things ... I use the terms database and database engine interchangeably in this article.”
                            --Databases 101, ThomasLaRock.com

To “understand the bigger picture” and “make sense of data and databases” requires foundation knowledge. In its absence, efforts to educate defeat their own purpose, because they are grounded in the very misinformation they should be correcting. This is the mechanism by which misconceptions[1] are perpetuated and reinforced.

Sunday, July 15, 2018

Understanding Relations Part 3: Debunking Conventional Wisdom




(See Part 1 and Part 2)

“A common term used in database design is a "relational database" -- but a database relation is not the same thing and does not imply, as its name suggests, a relationship between tables. Rather, a database relation simply refers to an individual table in a relational database. In a relational database, the table is a relation because it stores the relation between data in its column-row format. The columns are the table's attributes, while the rows represent the data records. A single row is known as a tuple to database designers.”
“A relation, or table, in a relational database has certain properties.”

“First off, its name must be unique in the database, i.e. a database cannot contain multiple tables of the same name.”

“Next ... as with the table names, no attributes can have the same name.”

“Next, no tuple (or row) can be a duplicate. In practice, a database might actually contain duplicate rows, but there should be practices in place to avoid this, such as the use of unique primary keys (next up). Given that a tuple cannot be a duplicate, it follows that a relation must contain at least one attribute (or column) that identifies each tuple (or row) uniquely. This is usually the primary key. This primary key cannot be duplicated. This means that no tuple can have the same unique, primary key. The key cannot have a NULL value, which simply means that the value must be known.”

“Further, each cell, or field, must contain a single value. For example, you cannot enter something like "Tom Smith" and expect the database to understand that you have a first and last name; rather, the database will understand that the value of that cell is exactly what has been entered.”

“Finally, all attributes—or columns—must be of the same domain, meaning that they must have the same data type. You cannot mix a string and a number in a single cell.”

“All these properties, or constraints, serve to ensure data integrity, important to maintain the accuracy of data.”
  --Definition of Database Relation

It is easy to discern when explanations of relational features are not grounded in the formal foundations of the RDM[1], but in industry practices. Here are some further clarifications and corrections.
 

Wednesday, July 4, 2018

N-ary vs. Binary Relations: Anatomy of a Compound Misconception




Note: This is a re-write of an older post, to bring it into line with the McGoveran formalization and interpretation [1] of Codd's true RDM, which includes refinements, corrections, and extensions of his own.
“Codd has based his model on n-ary relations and that is the key mistake he has made; that leads to complex structure (absolutely not necessary) and situations where there are no values known and as a consequence the need of the concept we know too well -> the null pointers; binary relations (smallest possible) are sufficient to express any predicate/sententional formula and there is no possibility to have something like null; if a value is unknown then we do not know it thus it is not a fact for us thus it is not in our database.”
We suggest you test your foundation knowledge by figuring out what's wrong with this picture before you proceed.
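
To pin down the terminology the comment uses before you do (a sketch with hypothetical names, not the answer): an n-ary relation has n attributes per tuple, while the binary approach pairs an identifier with one descriptive attribute per relation.

-- An n-ary (here ternary) relation: one relation, three attributes.
CREATE TABLE employee (
  emp_no    INTEGER     NOT NULL PRIMARY KEY,
  emp_name  VARCHAR(40) NOT NULL,
  dept_code CHAR(3)     NOT NULL
);

-- The same information recast as binary relations, as the comment advocates:
-- each descriptive attribute paired with the identifier in its own relation.
CREATE TABLE employee_name (
  emp_no   INTEGER     NOT NULL PRIMARY KEY,
  emp_name VARCHAR(40) NOT NULL
);
CREATE TABLE employee_dept (
  emp_no    INTEGER NOT NULL PRIMARY KEY,
  dept_code CHAR(3) NOT NULL
);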

Friday, June 29, 2018

Understanding Relations Part 2: Beware the Misconceptions




Note: This is a re-write of two older posts (which now link here), to bring them into line with the McGoveran formalization and interpretation of Codd's real RDM, including his own refinements, corrections, and extensions[1].

(Continued from Part 1)



I started Part 1 with comments exposing the confusion of database relations with tables (reinforced by SQL), and of levels of representation -- two of the many misconceptions rampant in the industry[2]. Debunking them requires foundation knowledge lacking in the industry, so I provided some: I explained the difference between mathematical relations, database relations, and relational tables. We are now in a position to debunk the comments.

Sunday, June 24, 2018

Understanding Relations Part 1: Tables? So What?




Note: This is a re-write of two older posts (which now link here), to bring them into line with the McGoveran formalization and interpretation of Codd's real RDM, including his own refinements, corrections, and extensions[1]

“Put simply, a "relation" is a table, the heading being the definition of the structure and the rows being the data.”
“In simple English: relation is data in tabular format with fixed number of columns and data type of each column. This can be a table, a view, a result of a subquery or a function etc.”
“Practically, a "Relation" in relational model can be considered as a "Table" in actual RDBMS products(Oracle, SQL Server, MySQL, etc), and "Tuples" in a relation can also be considered as "Rows" or "Records" in a table.”
“In common usage, however, when someone refers to a "relation" in a database course, they are referring to a tabular set of data either permanently stored in the database (a table) or derived from tables according to a mathematical description (a view or a query result).”
“In SQL RDBMSes (such as MS SQL Server and Oracle] tables are permently stored relations, where the column names defined in the data dictionary form the "heading" and the rows are the "tuples" of the relation. Then from a table, a query can return a different relation.”
“Data is stored in two-dimensional tables consisting of columns (fields) and rows (records). Multi-dimensional data is represented by a system of relationships among two-dimensional tables.”
“I read [that] "Relations are multidimensional. They are not flat. They are not two dimensional. Don't let the term table mislead you." on the back cover of CJ Date's DATABASE IN DEPTH. Can anyone help how to visualize this multidimensional nature of relations?”
Because SQL DBMSs have been sold as relational databases (which they are not), and in SQL the data structure is the table, in the absence of foundation knowledge[2] most practitioners think that relational databases consist of tables, but do not ask themselves why and how that is significant for database practice. The subtitle of this post is a question I used to ask in presentations years ago, and it was always met with silence. I see no evidence of improvement -- in fact, it has gotten worse. To paraphrase Feynman, "Nobody understands the RDM".

That such a simple and commonly understood structure can be used to visualize relations is an advantage of the RDM, but a table is not a relation and, SQL notwithstanding, confusing the two reflects a lack of understanding of the RDM, misses its significance for database practice, and prevents taking full advantage of its benefits.
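
SQL itself makes the difference easy to demonstrate (a minimal sketch; the table and column names are hypothetical): absent a declared key, a SQL table happily accepts what a relation, being a set, cannot contain.

-- A SQL table with no declared key: not a relation.
CREATE TABLE t (a INTEGER, b VARCHAR(10));

INSERT INTO t VALUES (1, 'x');
INSERT INTO t VALUES (1, 'x');  -- duplicate row accepted; a relation has none

-- SQL columns are also positional (INSERT by position, SELECT *), whereas
-- a relation's attributes are identified by name only and are unordered.
SELECT * FROM t;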

Note: The table is the preferred way to picture relations, but there are others (e.g., arrays).

First, the fundamentals.

Sunday, June 17, 2018

Foreign Keys Part 2: Beware of Misconceptions




Note: This is the second part of a multipart re-write of several older posts to bring them into line with the McGoveran formalization and re-interpretation of Codd's real RDM, including revisions, refinements, and extensions of his own[1].

(Continued from Part 1)

Part 1 started with an online exchange triggered by the question “Do I Have to Use Foreign Keys? If I am already manipulating data properly, are foreign keys required? Do they have another purpose that I’m just not aware of?” Both the question and the replies exhibit misconceptions about FKs (there are misconceptions about almost everything in the RDM[2]) rooted in a lack of foundation knowledge, so we provided some FK fundamentals. We are now in a position to debunk the replies.


Sunday, June 10, 2018

Foreign Keys Part 1: Understanding the Fundamentals




Note: This is the first part of a two-part re-write of several earlier posts, to bring them into line with McGoveran's formalization and re-interpretation of Codd's true RDM, which includes his own corrections, refinements and extensions[1]. For a more in-depth treatment see the series of papers available here.
“Do I Have to Use Foreign Keys? If I am already manipulating data properly, are foreign keys required? Do they have another purpose that I’m just not aware of? I appreciate the guidance!”
“... [we] wish to make a point. There is something which is bad design/good design/mandatory/optional. Please stop insisting that Primary and Foreign keys are mandatory. They are good design habits but by no means mandatory. However, life is much more complex than a Normalized DB structure. This includes tables serving as event logs; tables, serving as User maintained materialized query tables, tables, serving as supporting structures, reflecting state of complex transactional databases; persistent tables serving as Result Set or Session keepers. And I personally believe that if they were truly mandatory, Sybase, Oracle, SQL Server, Ingres, DB2, etc. would require them. Oh, sorry, forgot the SQL standard itself. This is not the relational model we're talking about. These are commercially available RDBMSs which, not surprisingly, DO tend to listen to their customers. If they didn't, they wouldn't be in business!! Since Sybrand is unlikely to get FKs required by the SQL standard or the major RDBMS vendors, it seems that mandatory means that his answer to the question "Do I have to use foreign keys?" is "You would if you worked in my shop!". I'm inclined to agree with that.”
“Databases can work with or without primary keys and foreign keys. The choice is yours... However ... enforcing referential integrity can be done by many methods ... TMTOWTDT = There is more than one way to do this ... It all depends on your approach... In the last ten years... every one is enforcing referential integrity with help of primary and foreign keys but before this ... a lot of applications were working without primary and foreign keys to enforce referential integrity and to avoid orphaned rows/avoid duplicate records.”
“We don't have every possible logical relationship enforced by the database. Sometimes you have to compromise for performance reasons, as too many foreign key validations can slow down high volume inserts. Other times you have to create breakpoints just to keep the web of relationships from becoming too tangled and connecting hundreds or thousands of tables.”
“I think it is preferable to have FK constraints as an additional security layer and they can be disable[d] during loading if required; however, you need to be 100% certain your ETL is enforcing the constraints. It is best to do both - have the ETL reject records which fail FK checks and report on these whilst also enforcing FK intergrity on insert/update, if appropriate. The only additional thing I can add is - when you delete from a FK enabled DB, make sure you do it in the correct order.”
“I think, you have to learn about data structures and logical data design (not only database, which is nowadays interpreted mainly as only RDBMS), to be clear about usage primary, alternate, and foreign keys, normal forms, data integrity-and database integrity, because your database will work suboptimally without these knowledge if it will work at all.”
Data practitioners have high levels of tool knowledge, but lack a good grasp of fundamentals, for which reason they cannot be considered data professionals. Now, do not get me wrong: I do not mean that good knowledge of tools is unimportant -- if you work with them you gotta know them real well -- but the ability to fully assess them, use them optimally, and compensate for any shortcomings is limited in the absence of foundation knowledge. So let's have some before tackling the exchange.
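
As a small taste of those fundamentals in SQL terms (a sketch; the table and column names are hypothetical): a foreign key constraint is declared once, and the DBMS then enforces referential integrity for every statement, from any application or user -- a guarantee the application-side alternatives described in the quotes cannot give.

CREATE TABLE departments (
  dept_code CHAR(3)     NOT NULL PRIMARY KEY,
  dept_name VARCHAR(40) NOT NULL
);

CREATE TABLE employees (
  emp_no    INTEGER     NOT NULL PRIMARY KEY,
  emp_name  VARCHAR(40) NOT NULL,
  dept_code CHAR(3)     NOT NULL,
  CONSTRAINT employees_dept_fk FOREIGN KEY (dept_code)
    REFERENCES departments (dept_code)  -- declarative referential integrity
);

-- Rejected by the DBMS if no 'D99' department exists: no orphaned rows.
INSERT INTO employees VALUES (101, 'Smith', 'D99');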

Friday, June 8, 2018

DBDebunk on Social Media




I have deleted my Facebook account and am focusing my social media presence on two Twitter pages: DBDebunk and The PostWest.


Follow: @DBDEBUNK @THEPOSTWEST


On DBDebunk I will tweet links to new posts on this site, to To Laugh or Cry? and What's Wrong with This Picture? (which I am bringing back there), and occasional links to and comments on items of interest. I have created a #RelModel hashtag for use with it.

On ThePostWest I will tweet links to evidence for, and my take on, (1) Dystopian Western Decadence and Decline, including (2) the equivalence of The Only Acceptable Racism Left and The Weaponized Myth of a "Palestinian Nation". I will create a #ThePostWest hashtag for use with it later on.



Please make a note of it, disseminate, and follow.





Sunday, April 29, 2018

A New Understanding of Keys Part 3: Surrogate Key Illusions




Note: This is the third of three re-writes of older posts to bring them into line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades.


(Continued from Part 2)
"When defining a surrogate primary key for a [SQL Server] table, two options are the most common: Integer and UniqueIdentifier (aka Globally Unique Identifiers, or GUID's) ... Historically, Integer has been the logical choice. It’s human-readable, requires minimal storage, and can be set as an identity (auto-incrementing) to prevent the need for additional application logic. UniqueIdentifier comes with significant disadvantages. The most immediately noticeable is that it’s user-unfriendly. You’ll never hear a user or developer ask you about record “A78383A3-4AB1-42CF-B3FC-A4A23AD10398”. With high availability and replication becoming highly prevalent, UniqueIdentifier is being chosen more often, but has caveats that mean it isn’t always the optimal solution."
--Jeffrey J. Keller, Vertabelo.com

As we explained in Parts 1 and 2, keys can be properly understood only within the RDM. We revealed a new perspective on keys, discussed relationally valid kinds of keys, and revised the definitions of natural keys (NKs) and surrogate keys (SKs).

As we have seen, the formal PK mandate is distinct from PK selection, which may be pragmatic. A PK must represent a name -- either pre-assigned, or generated only when there is no simple name CK. Generated keys must ensure entity integrity and are managed by the DBMS transparently to users.

All this is absent from conventional wisdom and database practice, as the above example illustrates: generated SKs are overused for the wrong reasons, the most common being emulation of OIDs (an SK that is often database-wide and, thus, unique across relations), followed by performance.

Note: While OIDs have unique values, they often also have some physical significance.


Sunday, April 22, 2018

A New Understanding of Keys Part 2: Kinds of Keys




Note: This is the second of three re-writes of older posts to bring them into line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades.


(Continued from Part 1)
"Many data and information modelers talk about all kinds of keys (or identifiers. I'll forego the distinction for now). I hear them talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or Dimensional Keys (or Data Warehousing) and whatnot. Then a debate rises on the use (and misuse) of all these keys ... The foremost question we should actually ask ourselves: can we formally disambiguate kinds of keys (at all)? Of all kinds of key, the primary key and the surrogate key gained the most discussion."

"If we take a look at the relational model we only see of one or more attributes that are unique for each tuple in a relation -- no other formal distinction is possible. When we talk about different kinds of keys we base our nomenclature on properties and behavior of the candidate keys. We formally do not have a primary key, it is a choice we make and as such we might treat this key slightly different from all other available keys in a relation. The discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys."
--Martijn Evers, dm-unseen.blogspot.com
I've deplored the misuse and abuse of terminology due to a general lack of foundation knowledge in the industry[3] for longer than I care to remember, and keys are no exception. If "the discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys", then there is no understanding of relational keys whatsoever: whatever else it is, a data structure that contains NULLs is not a relation -- one reason SQL tables are not relations, SQL databases are not relational, and SQL DBMSs are not RDBMSs (for a relational solution to missing data without NULLs, see [4]).

We sure can disambiguate, but the key (pun intended) to keys is that they are a relational feature and, thus, can only be properly understood within the dual theoretical foundation of the RDM, which is an adaptation and application to database management of simple set theory (SST) expressible in first order predicate logic (FOPL). Thus, their "nomenclature on properties and behavior" should reflect what they represent from the real world and what function they fulfill in the RDM -- which is precisely what the industry disregards.


Sunday, April 15, 2018

A New Understanding of Keys Part 1: Primary Key Formal Mandate and Pragmatic Selection





Note: This is the first of three re-writes of older posts to bring them into line with McGoveran's formalization and interpretation[1] of Codd's true RDM. They are short extracts from a completely rewritten paper #4 in the PRACTICAL DATABASE FOUNDATIONS series[2] that provides a new perspective on relational keys, distinct from the conventional wisdom of the last five decades.

"The Internet is full of dogmatic commandments for choosing and using keys in relational databases. At times it verges on a holy war: should you use natural or artificial keys? Auto-incrementing integers, UUIDs? After wading through sixty-four articles, skimming sections in five books, and asking questions on IRC and StackOverflow I think I’ve put the pieces together and have a recommendation to harmonize the various camps. Many arguments about keys boil down to false dichotomies and failures to acknowledge other points of view."
--Joe Nelson, begriffs.com

As a relational feature, keys can only be properly understood within the formal foundation of the RDM, which is simple set theory (SST) expressible in first order predicate logic (FOPL), adapted and applied to database management. Yet that is precisely what is ignored and dismissed in the industry -- including by the authors of SQL. Dogma and holy war are products of ignorance. What Nelson did "piece together" from "sixty-four articles, five books and asking questions" is conventional wisdom, which cannot produce understanding because it has been wrong for decades.

Saturday, April 7, 2018

Name the Relational Violation Part 2: Self-defeating Constraint




Note: This two-part series is a rewrite of an older post (which now links here), to bring it into line with the McGoveran formalization and interpretation[1] of Codd's true RDM.

(Continued from Part 1)

In Part 1 I showed how several data practitioners failed to pinpoint the relational violation by a conditional uniqueness constraint -- a violation that should have been obvious with foundation knowledge. The closest anyone came was "more than one kind of business entity here [that] share the same properties (not attributes)", but even that missed the implications.


Tuesday, March 27, 2018

Name the Relational Violation Part 1: Conditional Uniqueness Constraint




Note: This is a rewrite of an older post (which now links here), to bring it into line with the McGoveran formalization and interpretation[1] of Codd's true RDM.

"I'm seeing more [data] professionals implementing the following type of conditional unique constraints, typically related to the use of 'soft deletes'):
TABLEX (SK,A,B,C,ACT_FLAG)
Uniqueness is defined for (A,B) iff ACT_FLAG='Yes'. Any row with a ACT_FLAG = 'No' is excluded from the unique requirement."

"Most SQL DB implementations I know of do not allow this type of constraint to be enforced declaratively. Instead it relies on tricks within the index specification for enforcement."

"This conditional application of unique constraints troubles me. The prevalent use of a surrogate primary key avoids duplicates in the table as a whole. But this approach seems to declare a business rule that can be turned off and on based on the value of a non-key column. It definitely feels there is more than one kind of business entity here -- but they do share the same attributes. I also see a similar design pattern when OO classes are mapped to a table during implementation (when the implementation approach is to combine classes into a single table and an attempt is made to enforce two types of 'uniqueness'."

"However, I cannot definitively find a specific rule/guideline within relational theory that it violates. I know there are design alternatives. The fact that there is no true always-on business key other than the surrogate key IS an issue, but the fact that technically the surrogate PK prevents duplicates is almost always presented as a counter argument."


The heart is in the right place, but the violation is obvious. Yet the several responses did not pinpoint it.
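
For concreteness, this is the kind of index "trick" the poster refers to (a sketch using the poster's table and column names; the index name is mine, and partial/filtered unique indexes are not standard SQL -- the syntax below is PostgreSQL's).

-- The quoted design, with the poster's column names.
CREATE TABLE tablex (
  sk       INTEGER NOT NULL PRIMARY KEY,
  a        INTEGER NOT NULL,
  b        INTEGER NOT NULL,
  c        INTEGER NOT NULL,
  act_flag CHAR(3) NOT NULL
);

-- Most SQL DBMSs cannot enforce the conditional uniqueness declaratively;
-- instead it is typically enforced with a partial (filtered) unique index:
CREATE UNIQUE INDEX tablex_ab_active_uq
  ON tablex (a, b)
  WHERE act_flag = 'Yes';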

Saturday, March 24, 2018

Data Modeling and NoSQL




Revised 3/25/18
"To the question “How relevant is data modeling in the world of NoSQL?” I give the following answer.

The main purpose of data modeling is to understand the business, some application domain, some users world. The model becomes a representation of that world -- the "things" in it, the relationships among those things and any constraints on those things or relationships. A secondary purpose is to build a database to contain information which pertains to and describes that domain."

"Generally we speak of the model coming first, then the implementation, and finally, the data gets collected and stored according to the model. Hence, the business data model should not be concerned with issues of physical stored representation, or the transformations/manipulations/constraints which are imposed to facilitate implementation in some data (storage) management system. That could be a relational DBMS, or a NoSQL tool".

" ... increasingly the data already exists in some form. Which leaves us with the task of figuring out what it means, what it represents -- that is, understanding the data as it represents some user domain. NoSQL tools are often designed to deal with existing data and to process it more efficiently (that may be an oversimplification!). Either way, you must understand the business in order to make sense of the data." 

--Gordon Everest, LinkedIn.com
I have written extensively on the three levels of representation and four types of model, and I won't repeat that here -- readers can refresh their memory if necessary[1,2]. Everest's comments are at best ambiguous with respect to the levels and models (e.g., by data modeling he means business modeling, and his "business data model" lumps together the business model and the data model). It is to avoid such ambiguities and the resulting confusion that I recommend the three-fold terminology of conceptual modeling, logical database design, and physical implementation, eschewing "data modeling"[3]. Here I will rely on my earlier writings to address strictly the issue of data modeling in the NoSQL context raised by Everest.

Saturday, March 17, 2018

Physical Independence Part 2: Logical-physical Confusion



Note: This is a rewrite of older posts (which now link here), to bring them into line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

Revised 3/17/18

(Continued from Part 1)

This is the second part of my response today to an old DBDebunk query:

"You constantly remind us that the relational model is a logical model having no connection to any physical model (so I infer). You also indicate how no commercial product fully implements the relational model. Therefore, how do we make use of the relational model when dealing with the physical constructs of a commercial database program (Oracle, Access, DB2, etc.)?" --DBDebunk.com
In Part 1 I explained physical independence (PI) and claimed that the industry has failed to internalize its importance. Here I provide evidence to that effect and discuss some consequences.

Monday, March 5, 2018

Physical Independence Part 1: Don't Mix Model with Implementation




Note: This is a rewrite of several older posts (which now link here), to bring them into line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

"You constantly remind us that the relational model is a logical model having no connection to any physical model (so I infer). You also indicate how no commercial product fully implements the relational model. Therefore, how do we make use of the relational model when dealing with the physical constructs of a commercial database program (Oracle, Access, DB2, etc.)?" --DBDebunk.com query

Sunday, February 18, 2018

This Week



1. Database Truth

"The RDM is a formal system. It has two parts. Semantics its outside the formal language (which is Deductive Subsystem), but not outside the interpretation (i.e., application) of that language (Interpretation Subsystem). Without an Interpretation Subsystem there is no possibility of applying the formal system and it remains an abstract game of symbols."

"Codd's 1979 paper described a way to "capture" semantics using the relational formalism. That formalism doesn't tell you how to discover semantics, but if you have them, then he showed (at least to some degree) how to express those semantics relationally."

"Semantics is about applying the RDM to some subject. In effect, what you do is restrict the power of the abstract formalism so that it is more closely aligned with your intended use. In my terminology, that means you:

  • Create axioms (expressed as constraints), limiting the vocabulary to the subject matter (and making it finite and usually fairly small); and,
  • Restrict the possible interpretations that can be used consistently with the resulting subset of the formalism."
                                                          --David McGoveran


2. Do You Know What's Wrong With This Picture?

"When someone refers to a relation in a database course, what does that mean?"

"It means that it is time to go to Wikipedia."

"A relation in the context of modeling a problem will include the fields and possibly the identification of fields which have relationships with other relations."

"A relation is a data structure which consists of a heading and an unordered set of tuples which share the same type."

"I was trying to show that, in SQL, a relation is more than just a table. Queries return relations. And within a query, relational math is happening, with many intermediate results, that themselves are relations."

"A relation is an abstract structure which contains a set of attributes, and a relvar is the dataset status in a particular moment of this relation. The first one can be considered as the table definition with columns, and the second one is dataset in this table." 

"Tuples need not have a key (or any way of locating them?) Having tried to answer this question so that I could explain it to my students, I am forced to the conclusion that the theory has nothing whatsoever to do with "data" in the usual sense. Perhaps information Theory would have been a better basis for data systems, rather than Mathematics? Computer applications are notably different from Physics, which math was created to model, and its child, Engineering. I think data was never intended to be "true", it must be useful."
                  --What is a relation in database terminology?, StackOverflow.com

Sunday, February 11, 2018

The Key to Relational Keys - A New Understanding




Version 3 of paper #4 in the PRACTICAL DATABASE FOUNDATION series is now available to order here. The Key To Relational Keys: A New Understanding is a completely new, re-written version that incorporates David McGoveran's formalization and interpretation of Codd's true RDM and introduces a new perspective. It is distinct from and supersedes all previous versions.

Abstract


The dual theoretical foundation of the RDM — simple set theory (SST) expressible in first order predicate logic (FOPL) — is applicable to database management because it is a theory of unique objects, and objects are unique in the real world that databases represent. In the real world objects are uniquely identified by (1) a combination of one or more defining properties and/or (2) names assigned to them as members of various groups. A database relation is a formal representation of an object group, with tuples representing (facts about) the member objects and attributes representing their properties and/or names. Relational keys are the attributes that formally represent in the database those identifying properties and names.

Their fundamental database role notwithstanding, relational keys are poorly understood. 70% of hits at dbdebunk.com are about keys, and misconceptions abound about their necessity, what kinds of keys are relational, their functions, their selection, and so on.
This paper defines and explains:

  • The relational key concept
  • The kinds of relational keys, their properties and functions
  • The formal PK mandate
  • PK selection
  • RDBMS key support

and debunks common misconceptions.


Sunday, January 28, 2018

This Week




1. Database truth of the week

"Relvars introduce a concept of assignment, which has no counterpart in either FOPL or set theory. If you add it to those formalisms you introduce computational completeness, which destroys both decidability (the existence of a general algorithm by which you can determine if an expression is or is not logically valid) and the guarantee that there exists a (query) evaluation procedure that will halt (the existence of a general algorithm by which you can evaluate the truth or falsity of every instantiated predicate expression given those instantiations from any given database). Therefore we must forbid relvars." --David McGoveran


2. What's wrong with this database picture?


"Many data and information modelers talk about all kinds of keys (or identifiers. I'll forego the distinction for now). I hear them talk about primary keys, alternate keys, surrogate keys, technical keys, functional keys, intelligent keys, business keys (for a Data Vault), human keys, natural keys, artificial keys, composite keys, warehouse keys or Dimensional Keys (or Data Warehousing) and whatnot. Then a debate rises on the use (and misuse) of all these keys ... The foremost question we should actually ask ourselves: can we formally disambiguate kinds of keys (at all)? Of all kinds of key, the primary key and the surrogate key gained the most discussion."
"If we take a look at the relational model we only see of one or more attributes that are unique for each tuple in a relation -- no other formal distinction is possible. When we talk about different kinds of keys we base our nomenclature on properties and behavior of the candidate keys. We formally do not have a primary key, it is a choice we make and as such we might treat this key slightly different from all other available keys in a relation. The discussion around primary keys stems more from SQL NULL problems, foreign key constraints and implementing surrogate keys. --Martijn Evers, Kinds of Keys: On the Nature of Key Classifications, dm-unseen.blogspot.com

Sunday, January 21, 2018

How to Think (and Not to Think) During Database Design



"I have to maintain some lists in DB (SQLServer, Oracle, DB2, Derby), I have 2 options to design underlying simple table:

"1st:
 NAME   VALUE
=================
 dept   HR
 dept   fin
 role   engineer
 role   designer
-----------------
UNIQUE CONSTRAINT (NAME, VALUE) and some other columns like auto generated ID, etc.
"2nd:
 NAME  VALUE_JSON_CLOB
==================================
dept   {["HR", "fin"]}
role   {["engineer", "designer"}]
----------------------------------
UNIQUE CONSTRAINT (NAME) and some other columns like auto generated ID, etc.
"There is no DELETE operation, only SELECT and INSERT/UPDATE. In first advantage is only INSERT is required but SELECT (fetch all values for a given NAME) will be slow. In second SELECT will be fast but UPDATE will be slow. By considering there could be 10000s of such lists with 1000s for possible values in the system with frequent SELECTs and less INSERTs, which TABLE design will be good in terms of select/insert/update performance." --SQL TABLE to store lists of strings, StackOverflow.com

Using a relational database to "maintain lists" probably does not merit attention, and I actually considered canceling the debunking of this example. But it provides an opportunity to demonstrate the gap between conventional wisdom, database practice, and SQL DBMSs on the one hand, and Codd's true RDM, as formalized and interpreted by McGoveran[1], on the other. Such use is induced by lack of foundation knowledge, so for the purpose of this discussion I treat the example as a case of "how not to think when performing database design".

Note: Certainly logical database design should not be contaminated with physical implementation considerations such as performance [2].
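
For reference, a sketch of the poster's first option as a plainly keyed table (the column names are the poster's; the table name is mine, and the composite uniqueness constraint is doing the identifying work -- the generated ID column is omitted only to keep the sketch minimal).

CREATE TABLE list_entries (
  name  VARCHAR(30) NOT NULL,
  value VARCHAR(60) NOT NULL,
  CONSTRAINT list_entries_pk PRIMARY KEY (name, value)
);

-- Fetch all values for a given name.
SELECT value FROM list_entries WHERE name = 'dept';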


Monday, January 15, 2018

This Week and The End of Empire




1. Database truth of the week

"ALL names are human created, either by non-algorithmic assignment, or via some algorithm. We ONLY know that two types of objects are distinct because they have different sets of defining properties and, for a given object type, we ONLY know that two objects are distinct because the values (observed or measured) of that object type's defining properties are distinct. Names (of objects of some type) allow us to distinguish two such entities ONLY when they are 1:1 with the values of the object defining properties. Two sets of names (whether human assigned or machine generated) consistently identify the same set of entities ONLY when they are 1:1." --David McGoveran


2. What's wrong with this database picture?

"I have to maintain some lists in DB (SQLServer, Oracle, DB2, Derby), I have 2 options to design underlying simple table:

"1st:
 NAME   VALUE
=================
 dept   HR
 dept   fin
 role   engineer
 role   designer
-----------------
UNIQUE CONSTRAINT (NAME, VALUE) and some other columns like auto generated ID, etc.
"2nd:
 NAME  VALUE_JSON_CLOB
==================================
dept   {["HR", "fin"]}
role   {["engineer", "designer"}]
----------------------------------
UNIQUE CONSTRAINT (NAME) and some other columns like auto generated ID, etc.
"There is no DELETE operation, only SELECT and INSERT/UPDATE. In first advantage is only INSERT is required but SELECT (fetch all values for a given NAME) will be slow. In second SELECT will be fast but UPDATE will be slow. By considering there could be 10000s of such lists with 1000s for possible values in the system with frequent SELECTs and less INSERTs, which TABLE design will be good in terms of select/insert/update performance." --SQL TABLE to store lists of strings, StackOverflow.com
