Sunday, September 29, 2013

Testing Your Foundation Knowledge




Expertise in a field and ability to convey it to others are distinct and the latter requires different motivation, skills and talent. Many top technical experts are more often than not poor communicators, whether verbally or in writing, for some inherent reasons, Codd being an excellent example. That's one of the core reasons for poor foundation knowledge in data management in general, and the appreciation of the relational model in particular.

In a previous post I started a little experiment: I asked both readers who think they know and understand the relational model (RM) and those who do not but want to, to comment on whether a theoretically correct explanation of data fundamentals offered by reader PK was helpful and, if not, why not. I promised to draw some conclusions regarding the difficulty of dispelling misconceptions without losing either theoretical rigor, or the audience--a non-trivial task for an educator in an industry that deems theory impractical.

I can't say the response exactly answered my question (I recommend reading the comments, though). But let me, as promised, try my hand at making better sense of both the explanation and the comments (for an in-depth treatment see paper #1, Business Modeling for Database Design). Let me know if it helps..


PK: For each base table the dba gives a parameterized statement (aka meaning aka predicate) that takes a row and yields a statement (aka meaning aka proposition) that either holds or doesn't. That is all one needs to update or query a database.
Base table meanings determine constraints. Which are just expressions that are true in every situation. Constraints determine CKs and FKs. Which do not determine or affect meanings. Updating/querying does not need constraints or keys. Which are however used in normalization and to allow a dbms to help maintain integrity and optimize queries. (And I'm not arguing against the importance of any of those roles.)
When a row is submitted by a user/application for insertion into an R-table, the DBMS must decide whether to accept or reject it. Acceptability is based on two criteria: Does the row
(1) represent a proposition that is true--a fact--in the real world?
(2) satisfy all the business rules in effect?
With respect to 1, a submitted row represents a true assertion by some user. Since the DBMS has no way of determining the truth of the assertion, it operates under the Closed World Assumption (CWA): if the user/application are authorized to update the R-table, it takes the user at his word.

However, this assertion is not the only one made by users. There are several other types of assertion made by a special user, the DBA, that must hold true at all times: the integrity constraints that represent the business rules in the database. They are the best formal approximation of those rules that the DBMS "understands". For example, if an R-table represents propositions about employees and the entity identity rule is:
Every employee is uniquely identified by an employee number.
A DBMS cannot understand it semantically, like users do. Its formal representation in the database, however, is a key constraint, an assertion by the DBA, say:
CREATE TABLE employees
...emp# PRIMARY KEY...
The DBMS can guarantee this algorithmically, by checking the EMP# value of a submitted row against the EMP# values of rows already in the R-table and accepting it only if it satisfies the constraint.
PK: Unfortunately most presentations of the RM (especially ER and ORM ones & "tools like ER Studio, ERWIN, and Visual Studio ...") focus on certain special cases of meanings involving keys and entities. So people think (wrongly) that keys are needed to even be able to interpret or manipulate a relational database. (So JH is in a certain sense right about keys being artifacts of the RM *as commonly (mis)presented*.) Also they think (wrongly) that keys or the associated constraints (if they even know that keys have associated constraints) are more fundamental than or different in kind from other constraints (rather than just being special cases of constraints).
If a person doesn't understand that constraints and keys are not needed to use a database (and PKs have no theoretical foundation) and that the meaning a query is built only from base table meanings then they're not going to understand constraints, keys, integrity or the RM. (And thus that FKs and constraints are needed to use one *safely*.)
I am in agreement, of course, with the equally fundamental nature of all constraints and that any misconception in that regard is due to poor education.

Are keys "artifacts" of the RM? That is not how I would put it, because it does not advance understanding in the industry. In abstract relational theory a relation on domains is the Cartesian product of the domains. That is, it is the set of tuples that are all the possible unique combinations of every value of each domain with each and every value of all other domains. In other words, a tuple is unique by definition, by virtue of being a unique combination of the values of all domains. This is neither correct, nor practical if tuples are used to represent (propositions about) entities in the real world (why?). So in RM, which is a database application of the theory
  • an R-table is a current subset of the Cartesian product
  • rows represent true propositions about entities in the real world
and entities in the real world have a distinct identity. A CK/PK is an adjustment to tuples that turns them into R-table rows, useful in database practice. Is it artifact? In a strict sense yes, but it exists for empirical practicality and in an industry that deems theory impractical it is more important to stress that, rather than its artifact nature.
PK: There are propositions in ... [Codd's] papers. The word "proposition" just isn't in the text. To have "a model based on n-ary relations" (1970 paper) or "the n-ary relation chosen as the single aggregate structure for the relational model" (Turing paper) *means* that one evaluates the truth of statements (aka propositions) generated from present and absent rows via a parameterized statement (aka predicate). Unfortunately the use of propositions & predicates ie what one is *doing* when querying is generally not taught clearly so people do not understand the RM.
Again, agreed; Codd's mathematical genius was not--to put it politely--in explaining RM to practitioners. But I would put it differently: Under the CWA, the rows present in an R-table represent the set of all corresponding propositions that are true in the real world and all rows representing false propositions are absent. What is not "generally taught" in this particular context is Codd's real genius: the exploitation of the equivalence of first order predicate logic and set theory for database management. It means that
  • R-tables can be manipulated mathematically as sets, yielding results that are also R-tables
  • The rows in the results represent true propositions that are logical implications of (inferences from) the true propositions represented by the rows recorded in the database
Otherwise put, logical inferencing via mathematical manipulation. This is what querying is.
PK: In his Turing paper Codd discusses "a fresh look at the addressing of data" in terms of the RM. And in the RM one can via superkeys identify multiple parts of a database seen as a set of rows. This is useful for updating and other times a relation happens to embody a function. However such "addressing" has no relational meaning without an understanding of rows as true propositions, and isn't the basis of querying. So "addressing" ends up being an impediment to understanding, reducing the relation to a mere data structure and losing the quintessential RM property of relation expressions corresponding to predicate expressions.
See what I mean? Unless you understand the equivalence--which is almost never taught--it is harder to understand and appreciate the practical value of RM and the costly implication of flouting it: query results are no longer guaranteed by the DBMS to be logically correct with respect to the real world.



No comments:

Post a Comment

View My Stats