Sunday, January 12, 2014

Data Fundamentals and Education

Conveying the theoretical foundation of database management to practitioners without losing either precision/rigor, or the audience is a non-trivial endeavor, the skill for, or interest in which very few possess. More often than not one or the other are lost.

In Relations and Relationships I referred to the following LinkedIn exchange:
AT: ... William Kent and his book "Data and Reality" [1978] ... is devoted to "attributes", and (in my words) William confesses that he can not distinct [sic] between "relationships" and "attributes". Thus, the later might be completely redundant. 
GE: This confusion of entities vs. attributes has been with us a long time. Several years ago [a student] ... discovered a paper ... that discussed this dilemma. He proposed calling the thing, which we could not determine whether it was an entity or an attribute, an "entribute." Beautiful, eh? That was in 1939.

The same thing can be modeled either as an entity or an attribute, depending on the informational context and purpose e.g., color can be an attribute for a clothing merchant and an entity with attributes of its own for a paint manufacturer. This is a conceptual (business) model decision that must precede logical database design, which is impossible without it.

But note that Kent's problem is not with entities and attributes, but with relationships and attributes. Many view FK's as relationships, often because they think, erroneously, that relational comes "relationships between tables" rather than mathematical relations. This is what a theoretically minded reader tried to dispel in a comment:
P: Relations/tables denote relationships. A tuple/row is in a relation/table R if and only if some user-understood statement R(X,..) holds.
A FK constraint is the representation of a particular kind of constant relationship between relationships. For variables S meaning S(X,...) and T meaning T(Y,...) the declaration of a FK from S {X} to T {Y} means that FORALL X [EXISTS X1,... S(X,X1,...) IMPLIES EXISTS Y1,... T(X,Y1,...) . Eg EMPLOYEE(p,...) IMPLIES PERSON(p,...).

A FK constraint allows the dbms to reject updates that violate it as erroneous. But to read or update the database the users know what S(X,..) and T(X,...) mean in the world so for any situation that arises they know that FORALL X [S(X,...) IMPLIES T(X,...)]. Eg if they observe EMPLOYEE(p,..) then they will observe PERSON(p,...). The database does not need to represent this redundant and constant relationship; the dbms does, conceptually in the metadata, via the constraint declaration.
FKs are NOT the relations/relationships of a database, its relations/tables are. Sadly this is widely misunderstood.
He is, of course, right, but my guess is many practitioners will fail to appreciate his explanation. Let me see if I can make it more digestible.
  • The important relationship behind relational databases is the R-table--a relationship between a collection of N pools of values--domains: columns are subsets of the domains and rows are collections of N column values, one from each domain. Domains and columns represent in the database properties and entity attributes, respectively. Rows represent propositions that are true (facts) about those entities in the real world (see my paper, Business Modeling for Database Design).
Note: To those formally inclined, the relationship is, specifically, a subset of a mathematical relation, itself a relationship--the Cartesian product of the domains--the set of all possible unique combinations of the values of the N domains.
  • A foreign key (FK) is a column of one R-table whose values reference (point to) values of the primary key (PK) column in another R-table. As columns, both PK's and FK's represent neither entities, nor relationships, but attributes.
  • The referential integrity (RI) constraint represents in the database a business rule mandating that the FK and PK values are consistent. Both R-tables are subject to it and, since they are relationships, the constraint is a "relationship between relationships".
Consider now, the second example, from another LinkedIn exchange on normalization, in which I participated.
LH: Art you have obviously been lucky to work with fast disks and customers not demanding super fast responses. Today its not so much an issue but it has been. At the end of the day all I have ever done is get the job done. I don't care if its not right according 'to the theory' and neither does the customer, all they want is a performing system and now so does the consumer.
AK: And neither do I care whether "theory" is violated. What I care about is my clients' applications persisting and retrieving consistent and safe data ... over hundreds of systems at dozens of organizations over three decades, I haven't ever needed to denormalize below 3NF, which I consider the minimum to maintain data consistency and integrity, in order to get acceptable performance.
PP: ... Your attitude may deliver short term gains but at long term major costs. The best example is Germany, which is a industrial power house as it takes a long term approach even though it may have a greater short term cost.
Here is an edited version of my response.

"I don't care about theory either" misleads the uninformed rather than enlightens. The theory is there for the very purpose of "consistent and safe data". Violate it and the loss is precisely data integrity and the guarantee of provably logically correct answers to queries. It is a bad idea to agree that it's not theory that counts--it will be misinterpreted to reinforce the common confusion of 'sound theoretical foundation' with "just a theory" and the notion that theory is somehow at odds with practice. The gap between theory and practice is greater in theory than it is in practice.

While his warning is correct, PP too fails to address the core misconception. It is not just a matter of "short- and long-term concerns"--denormalization trades data integrity (AK's very objective!) for (uncertain) performance gains and biases the database for some applications and against others (see my paper The Costly Illusion: Normalization, Integrity and Performance).
Both AK and PP replied.
AK: I think we agree at base. But my point stands. LH stated that she doesn't care about theory. I said that don't either. It is the "purpose" of consistent, quality data that I care about. A minimum normalization level of 3NF gives me the minimum level of consistency and quality of data that is required for a sane application design.
No, we don't agree. Understanding that theory exists for practical reasons is "at the base." And misconceptions ought to be dispelled, not reinforced. A major reason for lack of true RDBMS's is precisely that neither DBMS designers, nor users understand and appreciate the practical value of theory. Proprietary ad-hoc products such as those referred to as NoSQL are rooted in disregard for theory.

The comment on 3NF is also misleading. While, fortunately, in most cases 3NF designs are also fully normalized (in 5NF), if and when they are not, consistency is not guaranteed; what is more, most practitioners don't know when that is the case. Suggesting that 3NF is enough will likely also be misinterpreted.
PP: There are many ways to achieve integrity and most databases exist to support a business which survive based on profit. Implicitly my statement supports that contention that more normalisation is better for long term profitability. It makes no suggestion as to what level. To suggest a business should make decisions solely based on theory is naive and lead to the destruction of the business. Prime example was "Betamax" versus "VHS". Even though "Betamax" was theoretically much superior technically, "VHS" won the day basically due to price.
From my own experiences I have found many people don't understand the cost-benefit of normalisation, or don't understand how to tune to support normalised structures. Total normalisation, doesn't provide the optimal cost with today's technology and technically. I have not seen a RDBMS that provides maximum normalisation as understood in today's terms.
  • I do not understand how a DBMS can "provide maximum normalization". It must be provided by users: it is essentially optimal logical database design. And the problem is that it is not well understood today.
  • I know of only one way to guarantee integrity of denormalized databases: controlling redundancy, either via additional integrity constraints, or application code. Both defeat the whole purpose of denormalization and are even likely to hurt performance rather than enhance it (again, see my paper).
  • I interpret "decisions solely based on theory are naive" to mean that trading off integrity for performance is sometimes necessary and justified in practice. Even assuming that all implementation factors affecting performance are at optimum--they rarely are--it is one thing to make such compromises consciously, being fully aware of the risk, and quite another to be unaware that they are compromises and oblivious to the risk, as so many practitioners are.
I am in agreement with PP's argument in his second paragraph, but the begged questions are: Why is the state of current technology he deplores the way it is and how does he expect it to improve in the context of ignorance of and disregard for theory?

No comments:

Post a Comment

View My Stats