ON POFN* AND POOD* - TWO COMPLEMENTARY DATABASE DESIGN PRINCIPLES
with Fabian Pascal, Hugh Darwen and David McGoveran

 

 

* POFN: Principle of Full Normalization; POOD: Principle of Orthogonal Design

 

 

From: Brian Selzer

Date: Dec 16, 2005

 

During the normalization process, it is asserted that decomposing a relation that exhibits redundancy into separate relations that don't can be done without losing information.  While this can easily be proven for static relations, it appears to me that this is not also true for relation variables.  As far as I can tell, what is lost is the 1:1 relationship for rows in the original table.  A foreign key constraint isn't sufficient (though that is what is commonly recommended in popular database texts), because it allows rows to exist in the primary key table that don't have corresponding rows in the foreign key table.  It appears to me that a circular inclusion dependency is also required in order to maintain the 1:1 relationship that is lost when you use a foreign key constraint.

 

Could you please comment on this?

 

 

From: Hugh Darwen

To: Brian Selzer

 

I'm baffled by the question.

 

Please give an example.  I'm afraid I can't make much sense of the question as written.  In particular, I don't know what is meant by "[losing] the 1:1 relationship with rows of the original table".

 

I will say this, for what it might be worth.  Normalisation always entails decomposition over a join dependency (JD).  By definition, a JD in R is two or more projections of R that, when joined, yield R.  So, if a certain JD holds in relvar R and R is "decomposed" accordingly into S and T, then for every possible value of R every tuple in R is a tuple of S JOIN T and every tuple of S JOIN T is a tuple of R.  That's a 1:1 correspondence, but perhaps not the one in question.

 

 

From: Brian Selzer

To: Hugh Darwen

 

Sorry about the confusion. I will try to be more clear.  Here's my argument:

 

A database schema D consisting of a single relation schema R is not equivalent to a database schema D' consisting of S and T, even if S and T are produced by "decomposition" of R according to Heath's Theorem.  In order to be equivalent, there must also exist a circular inclusion dependency between S and T.  Without the circular inclusion dependency, given an arbitrary value for D, there are an infinite number of values for D' for which S JOIN T = R.  Furthermore, a foreign key constraint between S and T is not sufficient to guarantee equivalence because it doesn't restrict the referenced relation schema, thus you can still have an infinite number of values for D'.

 

I was reading C. J. Date's, An Introduction to Database Systems, Seventh Edition, and though it goes into considerable detail on normalization and normal forms, it doesn't discuss in sufficient detail the kinds of database constraints that I believe are required as a consequence of applying the normalization procedure.  Most other books I've read simply recommend using a foreign key constraint.

 

Am I crazy?  This has bugged me for a long time, and I've run into many situations where I've had to enforce 1:1..n relationships between tables.  (Foreign key constraints work great for 1:0..n relationships.)  The commercial database products I've used don't support multiple assignment, so I've been forced to use other means, such as triggers, views and encapsulating logic in procedures in order to maintain integrity.

 

 

From: Hugh Darwen

To: Brian Selzer

 

I do now see the point.  What's more, when I think about it I realise that your point is one that has bugged me over the years too, except that I have expressed it differently (to myself, at least).

 

Normalisation is billed as helping significantly to avoid redundancy.  It is also billed as overcoming certain "update anomalies" (not a great term, but I don't have an improvement).

 

Take the case of ENROLMENT ( StudentId, Name, CourseId ), with Name determined by StudentId.  It violates 2NF.  We observe the JD * { {StudentId CourseId}, {StudentId, Name } } and decompose accordingly into IS_CALLED and IS_ENROLLED_ON.

 

Now we have avoided the redundancy of having to record a student's name in connection with every enrolment of that student.  But we have also solved some "update anomalies".  For example, we don't lose our record of a student's name when the student's last enrolment is deleted; and we can register a student before enrolling him or her on any courses.

 

Your point is that the constraint implicit in ENROLMENT has not been preserved.  The constraint in question is one to the effect that every named student is an enrolled one and every enrolled student is a named one.

 

I always teach the loss of this constraint as another advantage of normalisation, not a problem with it!  And of course, having been party (with Date) to the introduction of 6NF, I teach that too, as being a good idea the circular constraint does not apply.  A relvar in 5NF does not suffer from redundancy but might suffer from lack of independence.

 

For example, consider the soccer team's fixtures for the 2005-6 season.  The following is in 5NF: FIXTURE{Opponents, Venue, Date, GoalsFor, GoalsAgainst}, keyed on Date, and also on {Opponents, Venue} (where Venue is Home or Away).

 

But this FIXTURE suffers from an update anomaly arising from lack of independence.  The dates, venues and opponents for all fixtures are determined before the season starts, and it is useful to record them independently of the results, before the games in question are played.  My 5NF design prevents that from happening--the relvar would be better named RESULT! 

 

Unfortunately, the 6NF decomposition around Date would be overkill, as we'd have separate relvars for Opponents, Venue, GoalsFor and GoalsAgainst, with a constraint to the effect that every Date having a GoalsFor must also have a GoalsAgainst, an Opponents and a Venue, another to the effect that every Date having a Venue must also have an Opponents and, several more of that ilk.

 

So an optimal design for the soccer club is:

 

FIXTURE { Opponents, Venue, Date }

RESULT { Date, GoalsFor, GoalsAgainst}

 

alternatively:

 

FIXTURE { Opponents, Venue, Date }

RESULT { Opponents, Venue, GoalsFor, GoalsAgainst}

 

(Perhaps the first would be considered more economical.)

 

5-and-a-bit-NF!

 

And that's what I've been teaching in my new course.  But on looking through my lecture slides and notes I see that I do not use the term "nonloss decomposition" at all.  I've never liked it much, but I don't think the decision to omit it was a conscious one.  I do teach join dependency right up front (with a very simple example), but when I point out the overcoming of the "update anomaly" I am at pains to point out that in those circumstances the JD was in some sense a bit of a myth all along.  For FIXTURE JOIN RESULT { Date, Opponents, Venue } is not equal to FIXTURE (at least, not until the last game has been played and its result recorded).

 

Saying that the JD was a bit of a myth is not quite right but I can't think of a better way of putting it.  Given the original design combining FIXTURE and RESULT, the JD does hold.  The real problem is that the JD doesn't apply all the time in the real world.  In the real world, the fact that a fixture has been arranged does not imply that the game in question has been played, though the inverse is true. Thus, when we do the decomposition, the single FK is sufficient and we don't need circular INDs (which in this case would be circular FKs).

 

I think I'll leave it at that.  I'm not sure if I've answered the question but I hope I've shown an understanding of it and also how I avoid the need for multiple assignment (though that cannot always be avoided, as is well known).  If you have an example that is not covered well by my approach and you'd like me to consider, please send it.

 

I do think the theory of normalisation has an astoundingly messy history.  Codd's original 2NF and 3NF were very bad mistakes, with hindsight.  I do teach them, but only because if I don't my students will discover them in textbooks and wonder why I haven't.  I teach them and then say that they won't be examined on them.  BCNF, covering 2NF, 3NF and other cases all in the same manner is so much simpler.  4NF was another mistake.  5NF was good, but not so good as to make BCNF worthless in the way that BCNF makes 2NF and 3NF worthless.  6NF is good for temporal data (it was our study of temporal data that gave rise to it) and possibly good for all data if the following suggestion by me would be agreeable: design to 6NF but when you then find you have two or more relvars with the same key that are subject to circular INDs, combine them.  That's a bit radical as it seems to dispose of any need for 5NF and BCNF!

 

Comments?

 

 

From: Fabian Pascal

To: Brian Selzer

 

Note that I and David McGoveran do not subscribe to explicit relvar semantics in the data language.

 

In The Costly Illusion: Normalization, Integrity and Performance I make a point of stating that normalization has several benefits, only one of which, redundancy, at best a few are aware of:

 

·         redundancy

·         update anomalies (some reasonable operations cannot be performed)

·         database bias for some apps and against others

·         harder to understand databases

·         harder to formulate queries

·         possibly hard to interpret results

 

Except for redundancy, there is no solution for the others as long as databases are not fully normalized. That is, the integrity risks from redundancy can be addressed by additional constraints, but those defeat the purpose of normalization. There is no solution for the other five deficiencies except full normalization.

 

Incidentally, there are at least TWO formal design principles: the Principle of Full Normalization and the Principle of Orthogonal Design. The two together informally mean: exactly one R-table for each entity set; and exactly one entity set for each R-table.

 

 

From: Brian Selzer

 

Thank you for your very detailed reply.  I haven't had a chance to read TEMPORAL DATA AND THE RELATIONAL MODEL yet, but I'm looking forward to it. 

 

 

From: Hugh Darwen

 

I agree with all those.

 

I have not come across the POFN but it looks like my suggested 5-and-a-bit-NF. 

 

I categorically reject the POOD as formerly stated (though not any longer by Chris Date)—and your interpretation looks like that former statement, due to David McGoveran according to my understanding.

I once asked David if the following two relvars would be in violation of the POOD:

R1 { X INTEGER, Y INTEGER }
R2 { A INTEGER, B INTEGER }

(or something like that) and he replied in the affirmative, declaring that he attached no significance to attribute names.  It seems that the fact that both headings are of the same degree and have the same number of attributes of each declared type is what causes the POOD to be violated.  I was flabbergasted.  What's more, I have never been able to understand what problem(s) the POOD was trying to solve.  Note: "never been able to understand" does not imply "never been told"! (but even if I have been told, I certainly can't remember the words.

 

 

From: Fabian Pascal

To: Hugh Darwen

 

I do agree that David's position has not been made very clear. He has been working on a formal exposition of his ideas, but due some personal difficulties has not been able to complete the work. Furthermore, because it is formal, it is very deep hard work, natural language problems included. Some of his ideas are mentioned in my PRACTICAL DATABASE FOUNDATION papers; we will have to wait for his publication otherwise.

 

Ed. Note: A possible paper in the series on POOD is planned. See my update notice on the original paper by Date and McGoveran.

 

 

From: Brian Selzer

 

While I agree in principle with the proposition [of POFN and POOD], "exactly one relation for each entity set; and exactly one entity set for each relation," in practice I have often run across situations where I had to split a table vertically­either for security, performance or logistical reasons.  If it can happen to me, it can happen to students, so it would be a good thing for them to understand exactly what is required to maintain database equivalence (and integrity) in the face of such implementation hurdles.  So far I haven't come across a good description of the types of constraints required to maintain a 1:1 relationship or a 1:1..n relationship nor how to implement them with existing commercial database products.  Do you perhaps know of one?

 

 

From: Fabian Pascal

 

Please note that security, performance, or logistical reasons are not at the logical level. Now, if you need to violate relational design principles for those reasons, you have to accept the consequences, and you may end up defeating the purpose.

 

My instinctive response is that violating the POOD by splitting an entity set into multiple R-tables is not different in essence than violating the POFN by bundling two entity sets into one R-table—it would have similar consequences.

 

In the latter case you would need to add constraints to address corruption risks due to the redundancy without addressing the other consequences e.g. update anomalies, database bias, etc., and those constraints, as I demonstrate in The Costly Illusion: Normalization, Integrity and Performance, defeat the purpose of denormalization. In other words, you add prohibitive work for yourself and you gain nothing. What is more, if you want to do it to gain query (but no update) performance, SQL products don't make it easy for you to declare those constraints, and it may not even be possible.

 

In the former case consequences would be the same and you could probably figure things out in the same way.

 

 

From: David McGoveran

To: Brian Selzer

 

None of "security, performance or logistical reasons" are part of the logical model.

 

I think I need an example of both a 1:1 and a 1:1..n across a split to respond properly.

 

 

From: David McGoveran

To: Fabian Pascal

 

I see your point in a way, but it isn't correct from my perspective as phrased. POOD and POFN pertain only to the logical schema. Any "non-logical reasons" and denormalization can only pertain to the physical schema. The only physical schemata permitted are those that are semantically equivalent to the logical schema.

 

Per above, they aren't really being "added" just translated so that they reference the tables in the physical schema.

 

At least conceptually, but not always in terms of resulting performance because (e.g.) the optimizer may be able to do an index join and never access one or other of the "projections".

 

Design the logical, then the physical, then push the relation predicate down to the projections in the physical as table constraints and a set of a referential constraints among them.

 

 

From: Fabian Pascal

To: David McGoveran

 

One can logically denormalize (violate POFN) or split (violate POOD) if one wants to. IF they do—and of course this is not recommended, but IF they choose to do so—then they introduce redundancy which must be controlled. That requires some additional constraints to ensure that updates do not cause corruption.

 

Note that I am not referring to how to do it right and NOT violate. I am referring to what people usually do in practice: they do violate and in that case all they can do is add those constraints. It's prohibitive, it defeats the purpose, but IF they want to know what to do then that's it.

 

I told Chris Date that I don't like the concept of "physical denormalization" precisely because normalization and its violations are logical concepts. In my book, one can normalize or denormalize logically. Anything else is physical design and is not termed normalization or denormalization.

 

Any kind of DBMS product optimization of joins is unknown by the user at db design time, at which point he must decide whether to fully normalize and rely on optimization (which may not occur or may be poor), or denormalize to avoid the joins altogether.

 

Ed. Note: And since they are working with SQL product, chances are that optimization is not the best possible.

 

 

From: David McGoveran

To: Fabian Pascal

 

This is just a matter of word usage. Denormalize is a bad word in general. It is impossible for it to be "logical"—perhaps "denormalize the logical" is meaningful, but it can only mean to "undesign" or back out of the logical schema. What most people call denormalization is always for physical reasons. If the initial physical design is set equal to the logical, then one might "denormalize" it. Either way, yech!

 

 

From: Fabian Pascal

To: David McGoveran


I understand your view, but I formulate it differently.

Full normalization is correct logical design. Therefore, undoing it—what is done in industry practice—is also logical. It is done for physical reasons indeed, but it is not done physically—as it should be—but rather, incorrectly, at the logical level, due to ignorance, product limitations, etc.
Otherwise put: Do logical design correctly and you will end up with fully normalized databases. Don't mess up logical design, or undo the correct one and end up with logically denormalized databases; instead, make sure the physical design (for which the product should permit maximum flexibility) leads to good performance of fully normalized databases.

 

 

From: Brian Selzer

 

Here are some simple examples:

 

Brian Selzer is a member of the Selzer family.  A family cannot exist without members and a member cannot exist without a family.  There is a 1:1..n relationship between families and members.  It's obvious that a foreign key constraint on members that references families is not sufficient to enforce the 1:1..n constraint because it does not prevent the existence of a family without members.

 

For a 1:1 relationship, the first example I can come up with off the top of my head is splitting an employee into public information like name, department, shift, work center, and phone #, and private information like SSN, marital status, exemptions, and salary.  I think that the split makes sense because only the payroll application needs the private information, so it's easier to secure it in a separate table.  It also boosts the performance of every other application.

 

 

From: Fabian Pascal

To: Brian Selzer

 

I’ll let David respond to the more general question, but please note again that only the payroll application needs the private information, so it's easier to secure it in a separate table is not a formal logical design principle, but rather an arbitrary one based on database bias (see my The Costly Illusion: Normalization, Performance and Integrityfor an explanation of the term).
 
If a certain application needs access only to that information, the solution is the pertinent view (incidentally, access should be to views, not base tables as much as possible anyway; it’s SQL products that inhibit this approach due to update limitations and, to remind you, POOD was discovered while trying to formalize algorithms for view updates; violations thereof cause problems for precisely that). 

 

 

From: Brian Selzer

To: Fabian Pascal

 

Perhaps I didn't express myself clearly.  The private information is only used by the payroll application.  This does not mean that the public information isn't.  I guess I could create a view for every other application to use....

 

I think I mentioned before that from a logical standpoint I generally agree with POOD.  I understand your point about database bias, too.  I disagree, however: In addition to security, there is a measurable advantage (at the physical level) in separating infrequently used columns into their own table--particularly if they're only used by one application.  More rows fit on each page, so fewer reads are required to satisfy most queries.  The only disadvantage is in enforcing the necessary referential constraint.  This will certainly slow down INSERTs.  DELETEs are a bit slower, too, because there must be two index seeks.  But most UPDATEs are faster.  Since queries account for between 85 and 95% of database activity, improving query performance will generally yield better overall performance. 

 

The point I'm trying to make is that because external requirements influence physical database design, there may at times be a need to split a table vertically.  Therefore, it would be a good thing to fully understand the implications of doing so, and to have a thorough understanding of the constraints needed to maintain integrity, since a foreign key isn't sufficient.  That's why I think that the concept of "non-loss" decomposition as applied to relvars, or relation schemas should be debunked, or at least clarified. 

 

 

From: Fabian Pascal

To: Brian Selzer

 

That's the recommended approach in a truly relational environment, which of course you don't have. Therefore, what you are forced to do by SQL implementations, there is no RM to save you from the consequences thereof.

 

You did not understand my point. Security is not a formal logical design principle. Therefore, whatever the implementation forces you to do, it is a violation with consequences, whether you like it or not.

 

The implementation should then permit you to split table data at the physical level, not at the logical level. The problem is in SQL they are tied together, and you’re not given the necessary options (see the various postings on the TransRelational Model).

 

 

From: Brian Selzer

To: Fabian Pascal

 

What difference does it make if it isn't a formal logical design principal.  I already acknowledged that point.  But I have to live in the real world, and commercial database engines don't support multiple assignment, don't support declarative referential integrity constraints beyond primary and foreign keys, and have physical limitations, such as a maximum number of columns for a table.  Thus, there exists today a divide between the optimal logical database design and possible physical database designs.  The challenge is in choosing a physical design that is equivalent to the optimal logical design and meets all of the external requirements. 

 

 

From: Fabian Pascal

To: Brian Selzer

 

If you do not see the difference it makes, I can't explain it to you better. It is not a matter of sheer acknowledgement, but of understanding.

Correct, the real world does not offer you the right solutions; that's been our position for years. So you may have to violate formal principles to do your work. That has consequences which you cannot avoid, but can only try to attenuate as best you can. You seem to be asking for solutions that avoid those consequences.

The Costly Illusion makes it clear how to figure out the necessary constraints in the case of POFN violations, what are the consequences, what is possible to do with SQL systems, and what you end up with. Apply the same approach to violations of POOD and you'll answer your own questions.

 

 

From: David McGoveran

 

Minor historical correction: POOD was "discovered" by myself about 1985 while trying to figure out algorithms to merge (both relational and non-relational) databases into a single relational database and be able to back out of them. POOD is one of 3 logical design principles that I "defined" and have followed to great advantage since that time. It arose from a consideration of how to develop a mathematical representation (i.e., model) that has certain highly desirable transformation properties, and then translating those concepts into database semantics.

 

POOD has been the subject of various attempts at formalization since I informed Chris of the principle when he asked me about a logical conundrum in his early efforts at view updating. While I intuitively knew how to apply the principle, Chris and I engaged in extensive faxes in the early 90s trying to convey the idea to Chris and to reach a usable level of formalization.

 

 

From: Fabian Pascal

To: David McGoveran

 

After discussions with you and with Chris, and after reading Chris' last 3 papers which I sent you, I am more and more convince that, whatever the formal definition, the informal version to be applied by the practitioner is: no cross-table duplication, which means no two tables should have rows representing the same entity (or propositions about the same entity). It also has the nice symmetrical elegance together with the informal version of POFN.

It is quite possible that the difference between you and CJD stems from the difference between his header/body view of a relation, which you (and recently me) do not espouse, and therefore differing definitions of a tuple.

Unfortunately, if my sense is correct, this is not good news for 6NF.

 

 

Posted 2/24/06

© Fabian Pascal 2006 All Rights Reserved