Monday, March 24, 2014

Simplicity: Forgotten, Misunderstood, Underrated Relational Objective

In a LinkedIn exchange I argued that an optimal generality-to-simplicity ratio (ability to represent a maximal range of reality with minimal complexity) and a 1:1 correspondence between informal business modeling constructs and formal logical database constructs are beneficial. And I claimed that insofar as data models that are formally defined are concerned, the relational model scores best on both.

One of the responses I got was
GE: Though I might come up with slightly different lists, in general, I agree with your expression of criteria for selecting a primary key and of generality and simplicity, but disagree with your conclusion that "RM scores better than any other modeling scheme."
Let me take his points one by one.

GE: 1) Having to resolve a many to many relationship into two one-to-many relationships violates simplicity. Most people do not have a problem with a many-to-many relationship, even though the relational model does and consequently a relational DBMS.
The RM does not have any problem with M:M relationships. Quite the contrary: its representation of reality encapsulates all relationships, including M:M's, into a single type of structure--the R-table--which, therefore, requires only the relational algebra for all data manipulation and integrity enforcement. This is the epitome of simplicity. Both the hierarchic and the network alternatives, the only ones that come close to a data model in the Codd sense in which RM is one, require at least two, one reason for which they are more complex and must force themselves on reality that is not naturally so. In other words, that M:M relationships are  represented by associative R-tables is a relational advantage, not a drawback, from the simplicity perspective!
GE: 2) Forcing attributes to be single valued, and thus creating another table when it is multivalued also violates simplicity.
The response to this claim is two-pronged. First, when Codd came up with normal forms (NF), he was concerned that multivalued domains (MVD) and columns (MVC) would require second instead of first order predicate logic (FOL), a major complication. Further normalizing R-tables (1NF->5NF) was a way to accommodate them (as a way to "flatten" hierarchic schemas) without complicating the RM in a major way.

Those who perceive simplicity in MVC's make the quite common mistake of disregarding the complications for manipulation and integrity
  • They break the simple relational addressing scheme of R-table name+column name+key value, which guarantees logical access to each and every value;
  • The relational algebra is no longer sufficient; it must be extended to accommodate the MVC's; what is more, the extensions are for operations involving order, for which reason they are notoriously complex and prone to error. Indeed, it is precisely for these reasons that Codd invented further normalization in the first place.
Second, Date has subsequently shown that Codd needn't have worried: FOL can accommodate MVD's as relation-valued-domains/columns (RVA/RVC)--informally, nested R-tables--that have R-tables as values and are, therefore, single valued. Moreover, he and Darwen incorporated in their relational data language Tutorial D extensions to the relational algebra to support them. In other words, MVD's do not necessarily force the creation of new R-tables under RM.

Be that as it may, the question remains what practical benefits do nested R-tables provide that non-nested R-tables do not? Are two kinds of R-table simpler than one? In fact, except from convenience in some narrow circumstances possibly e.g. database catalog (maintained and manipulated by the DBMS), it is not clear that the practical benefits of RVD's/RVC's for users are significant enough to complicate the data model.
GE: 3) Humans having to ensure that a RM is normalized (because it can never be done by the system) certainly violates simplicity.
Computers are capable only of manipulating symbols mathematically. They do not and cannot understand semantics (what those symbols mean) captured in a business model, essentially a collection of business rules that only humans understand. That is why semantics are left out of the R-tables--they reside only in users' mind.

As I demonstrate in my paper Business Modeling for Database Design, a well-defined business model maps to a 5NF database. That requires only a thorough knowledge of the segment of business reality important enough to be represented in the database: if well identified and specified entity classes are mapped exactly to one R-table each, the R-tables will be in 5NF. One resorts to further normalization (1NF -> 5NF) only to repair poor designs. The practical benefits, simplicity included, materialize then at the logical level, but that does not relieve users from business knowledge and modeling. That kind of simplicity no DBMS, regardless of its data model, can provide, NoSQL claims notwithstanding.
GE:... There are lots of other examples I could give. Modeling in terms of ultimate implementation in RM, RDBMS violates another criteria, namely, the model should not be influenced by any aspect of technology or implementation.
This statement is perplexing, because independence of implementation is a core, if not the main advantage of the relational model! There is nothing in predicate logic and set theory that has anything to do with implementation (Ironically, this is one of the most common--and upside down--complaints against the RM, namely, that it "does not address implementation and, therefore, RDBMS's suffer from poor performance"). The only way in which GE's statement can make some sense is if he means SQL, rather than RDBMS's.
GE:... Take a look at ORM or any fact oriented modeling scheme. I would argue they would score better than RM on these criteria. Avoiding the need for normalization entirely is moving in the direction of simplicity, and that is where ORM is.
Halpin's ORM is rooted in Shir Nijssen's NIAM approach to business modeling. Since NIAM is a fact-based approach rooted in linguistics, it corresponds well to RM's predicate logic (predicates are "generalized facts"), which is why business models based on it map well to relational schemas (hence the title of Nijssen's seminal book CONCEPTUAL SCHEMA AND RELATIONAL DATABASE DESIGN, recommended and available via the home page of this site). This, however, does not mean that ORM is more simple than RM, as that would constitute a confusion of levels of representation--conceptual (ORM) and logical (RM).

Otherwise put, assemble your facts well and map them correctly to R-tables--you then don't need to normalize and you can't do simpler than that to boot.

No comments:

Post a Comment

View My Stats