Monday, August 5, 2013

Test Your Foundation Knowledge

The gap between theory and practice, said one sage, is greater in theory than it is in practice. It is not easy to find a better validation of this assertion than in database management. Particularly because it uses the term 'theory' in the two different senses, the confusion of which inhibits the ability of data professionals to understand and appreciate the value of theory--specifically, relational theory--for database  practice. The first sense is "sound theoretical foundation", the second is "just a theory" and carries a somewhat pejorative connotation ("not practical").

This confusion is rooted in systemic and cultural factors that make education on data fundamentals--without losing either the rigor and precision of the theory that give it its usefulness, or the audience--one of the most difficult tasks. 

Thus, when a reader started his comments with "... a very common (taught) fundamental misunderstanding/misconception of constraints and (primary, foreign & candidate) keys is an impediment to understanding the RM", I could not  agree more. I have been trying to dispel misconceptions for more than three decades and it is easier said than done. I wondered whether the formal accuracy of the reader's attempt to clarify the issue did not lose the audience.

So I decided to try a little experiment. The reader's comments are reproduced below. I would like to hear from readers--both those who think they know and understand the relational model and those who do not, but want to--what they understand the comments to mean, whether they are helpful and, if not, why not.

Feel free to comment publicly below this post, or privately via email (I will respect requests for anonymity). Succinctness is at a premium.

Subject to the responses I receive, if any, I will try to draw some conclusions with respect to the above mentioned difficulty and perhaps and perhaps try my hand at how I would express the same idea and let readers decide which is more helpful and why.

Here now the comments:
PK: For each base table the dba gives a parameterized statement (aka meaning aka predicate) that takes a row and yields a statement (aka meaning aka proposition) that either holds or doesn't. That is all one needs to update or query a database.

Base table meanings determine constraints. Which are just expressions that are true in every situation. Constraints determine CKs and FKs. Which do not determine or affect meanings. Updating/querying does not need constraints or keys. Which are however used in normalization and to allow a dbms to help maintain integrity and optimize queries. (And I'm not arguing against the importance of any of those roles.)

Unfortunately most presentations of the RM (especially ER and ORM ones & "tools like ER Studio, ERWIN, and Visual Studio ...") focus on certain special cases of meanings involving keys and entities. So people think (wrongly) that keys are needed to even be able to interpret or manipulate a relational database. (So JH is in a certain sense right about keys being artifacts of the RM *as commonly (mis)presented*.) Also they think (wrongly) that keys or the associated constraints (if they even know that keys have associated constraints) are more fundamental than or different in kind from other constraints (rather than just being special cases of constraints).

If a person doesn't understand that constraints and keys are not needed to use a database (and PKs have no theoretical foundation) and that the meaning of a query is built only from base table meanings then they're not going to understand constraints, keys, integrity or the RM. (And thus that FKs and constraints are needed to use one *safely*.)

There are propositions in ... [Codd's] papers. The word "proposition" just isn't in the text. To have "a model based on n-ary relations" (1970 paper) or "the n-ary relation chosen as the single aggregate structure for the relational model" (Turing paper) *means* that one evaluates the truth of statements (aka propositions) generated from present and absent rows via a parameterized statement (aka predicate). Unfortunately the use of propositions & predicates ie what one is *doing* when querying is generally not taught clearly so people do not understand the RM.

In his Turing paper Codd discusses "a fresh look at the addressing of data" in terms of the RM. And in the RM one can via superkeys identify multiple parts of a database seen as a set of rows. This is useful for updating and other times a relation happens to embody a function. However such "addressing" has no relational meaning without an understanding of rows as true propositions, and isn't the basis of querying. So "addressing" ends up being an impediment to understanding, reducing the relation to a mere data structure and losing the quintessential RM property of relation expressions corresponding to predicate expressions.

No comments:

Post a Comment

View My Stats