* POFN: Principle of
Full Normalization; POOD: Principle of Orthogonal Design
From: Brian Selzer
Date: Dec 16, 2005
During the normalization process, it is asserted that
decomposing a relation that exhibits redundancy into separate relations that
don't can be done without losing information.
While this can easily be proven for static relations, it appears to me
that this is not also true for relation variables. As far as I can tell,
what is lost is the 1:1 relationship for
rows in the original table. A foreign
key constraint isn't sufficient (though that is what is commonly recommended in
popular database texts), because it allows rows to exist in the primary key
table that don't have corresponding rows in the foreign key table. It appears
to me that a circular inclusion
dependency is also required in order to maintain the 1:1 relationship that is
lost when you use a foreign key constraint.
Could you please comment on this?
From: Hugh Darwen
To: Brian Selzer
I'm baffled by the question.
Please give an example.
I'm afraid I can't make much sense of the question as written.
In particular, I don't know what is meant by
"[losing] the 1:1 relationship with rows of the original table".
I will say this, for what it might be worth.
Normalisation always entails decomposition
over a join dependency (JD). By
definition, a JD in R is two or more projections of R that, when joined, yield
R. So, if a certain JD holds in relvar
R and R is "decomposed" accordingly into S and T, then for every
possible value of R every tuple in R is a tuple of S JOIN T and every tuple of
S JOIN T is a tuple of R. That's a 1:1
correspondence, but perhaps not the one in question.
From: Brian Selzer
To: Hugh Darwen
Sorry about the confusion. I will try to be more clear.
Here's my argument:
A database schema D consisting of a single relation schema R
is not equivalent to a database schema D' consisting of S and T, even if S and
T are produced by "decomposition" of R according to Heath's
Theorem. In order to be equivalent,
there must also exist a circular inclusion dependency between S and T.
Without the circular inclusion dependency,
given an arbitrary value for D, there are an infinite number of values for D'
for which S JOIN T = R. Furthermore, a
foreign key constraint between S and T is not sufficient to guarantee
equivalence because it doesn't restrict the referenced relation schema, thus
you can still have an infinite number of values for D'.
I was reading C. J. Date's, An Introduction to Database
Systems, Seventh Edition, and though it goes into considerable detail on
normalization and normal forms, it doesn't discuss in sufficient detail the kinds
of database constraints that I believe are required as a consequence of
applying the normalization procedure.
Most other books I've read simply recommend using a foreign key
constraint.
Am I crazy? This has
bugged me for a long time, and I've run into many situations where I've had to
enforce 1:1..n relationships between tables.
(Foreign key constraints work great for 1:0..n relationships.)
The commercial database products I've used
don't support multiple assignment, so I've been forced to use other means, such
as triggers, views and encapsulating logic in procedures in order to maintain
integrity.
From: Hugh Darwen
To: Brian Selzer
I do now see the point.
What's more, when I think about it I realise that your point is one that
has bugged me over the years too, except that I have expressed it differently
(to myself, at least).
Normalisation is billed as helping significantly to avoid
redundancy. It is also billed as
overcoming certain "update anomalies" (not a great term, but I don't
have an improvement).
Take the case of ENROLMENT ( StudentId, Name, CourseId ),
with Name determined by StudentId. It
violates 2NF. We observe the JD * {
{StudentId CourseId}, {StudentId, Name } } and decompose accordingly into
IS_CALLED and IS_ENROLLED_ON.
Now we have avoided the redundancy of having to record a
student's name in connection with every enrolment of that student.
But we have also solved some "update
anomalies". For example, we don't
lose our record of a student's name when the student's last enrolment is
deleted; and we can register a student before enrolling him or her on any
courses.
Your point is that the constraint implicit in ENROLMENT has
not been preserved. The constraint in
question is one to the effect that every named student is an enrolled one and
every enrolled student is a named one.
I always teach the loss of this constraint as another
advantage of normalisation, not a problem with it!
And of course, having been party (with Date) to the introduction
of 6NF, I teach that too, as being a good idea the circular constraint does not
apply. A relvar in 5NF does not suffer
from redundancy but might suffer from lack of independence.
For example, consider the soccer team's fixtures for the
2005-6 season. The following is in 5NF:
FIXTURE{Opponents, Venue, Date, GoalsFor, GoalsAgainst}, keyed on Date, and
also on {Opponents, Venue} (where Venue is Home or Away).
But this FIXTURE suffers from an update anomaly arising from
lack of independence. The dates, venues
and opponents for all fixtures are determined before the season starts, and it
is useful to record them independently of the results, before the games in
question are played. My 5NF design
prevents that from happening--the relvar would be better named RESULT!
Unfortunately, the 6NF decomposition around Date would be
overkill, as we'd have separate relvars for Opponents, Venue, GoalsFor and
GoalsAgainst, with a constraint to the effect that every Date having a GoalsFor
must also have a GoalsAgainst, an Opponents and a Venue, another to the effect
that every Date having a Venue must also have an Opponents and, several more of
that ilk.
So an optimal design for the soccer club is:
FIXTURE { Opponents, Venue, Date }
RESULT { Date, GoalsFor, GoalsAgainst}
alternatively:
FIXTURE { Opponents, Venue, Date }
RESULT { Opponents, Venue, GoalsFor, GoalsAgainst}
(Perhaps the first would be considered more economical.)
5-and-a-bit-NF!
And that's what I've been teaching in my new course.
But on looking through my lecture slides and
notes I see that I do not use the term "nonloss decomposition" at
all. I've never liked it much, but I
don't think the decision to omit it was a conscious one.
I do teach join dependency right up front
(with a very simple example), but when I point out the overcoming of the
"update anomaly" I am at pains to point out that in those
circumstances the JD was in some sense a bit of a myth all along.
For FIXTURE JOIN RESULT { Date, Opponents,
Venue } is not equal to FIXTURE (at least, not until the last game has been
played and its result recorded).
Saying that the JD was a bit of a myth is not quite right but
I can't think of a better way of putting it.
Given the original design combining FIXTURE and RESULT, the JD does
hold. The real problem is that the JD
doesn't apply all the time in the real world.
In the real world, the fact that a fixture has been arranged does not
imply that the game in question has been played, though the inverse is true.
Thus, when we do the decomposition, the single FK is sufficient and we don't
need circular INDs (which in this case would be circular FKs).
I think I'll leave it at that.
I'm not sure if I've answered the question but I hope I've shown
an understanding of it and also how I avoid the need for multiple assignment
(though that cannot always be avoided, as is well known).
If you have an example that is not covered
well by my approach and you'd like me to consider, please send it.
I do think the theory of normalisation has an astoundingly
messy history. Codd's original 2NF and
3NF were very bad mistakes, with hindsight.
I do teach them, but only because if I don't my students will discover
them in textbooks and wonder why I haven't.
I teach them and then say that they won't be examined on them.
BCNF, covering 2NF, 3NF and other cases all
in the same manner is so much simpler.
4NF was another mistake. 5NF was
good, but not so good as to make BCNF worthless in the way that BCNF makes 2NF
and 3NF worthless. 6NF is good for
temporal data (it was our study of temporal data that gave rise to it) and
possibly good for all data if the following suggestion by me would be
agreeable: design to 6NF but when you then find you have two or more relvars
with the same key that are subject to circular INDs, combine them.
That's a bit radical as it seems to dispose
of any need for 5NF and BCNF!
Comments?
From: Fabian Pascal
To: Brian Selzer
Note that I and David McGoveran do not subscribe to explicit
relvar semantics in the data language.
In
The Costly Illusion: Normalization, Integrity and Performance I make a
point of stating that normalization has several benefits, only one of which,
redundancy, at best a few are aware of:
·
redundancy
·
update anomalies (some reasonable operations cannot be
performed)
·
database bias for some apps and against others
·
harder to understand databases
·
harder to formulate queries
·
possibly hard to interpret results
Except for redundancy, there is no solution for the others as
long as databases are not fully normalized. That is, the integrity risks from
redundancy can be addressed by additional constraints, but those defeat the
purpose of normalization. There is no solution for the other five deficiencies
except full normalization.
Incidentally, there are at least TWO formal design principles:
the Principle of Full Normalization and the Principle of Orthogonal Design. The
two together informally mean: exactly one R-table for each entity set; and
exactly one entity set for each R-table.
From: Brian Selzer
Thank you for your very detailed reply. I haven't had a chance to read TEMPORAL DATA AND THE RELATIONAL
MODEL yet, but I'm looking forward to it.
From: Hugh Darwen
I agree with all those.
I have not come across the POFN but it looks like my
suggested 5-and-a-bit-NF.
I categorically reject the POOD as formerly stated (though
not any longer by Chris Date)—and your interpretation looks like that former
statement, due to David McGoveran according to my understanding.
I once asked David if the following two relvars would be in violation of the
POOD:
R1 { X INTEGER, Y INTEGER }
R2 { A INTEGER, B INTEGER }
(or something like that) and he replied in the affirmative, declaring that he
attached no significance to attribute names. It seems that the fact that
both headings are of the same degree and have the same number of attributes of
each declared type is what causes the POOD to be violated. I was
flabbergasted. What's more, I have never been able to understand what
problem(s) the POOD was trying to solve. Note: "never been able to
understand" does not imply "never been told"! (but even if I
have been told, I certainly can't remember the words.
From: Fabian Pascal
To: Hugh Darwen
I do agree that David's position has not been made very
clear. He has been working on a formal exposition of his ideas, but due some
personal difficulties has not been able to complete the work. Furthermore,
because it is formal, it is very deep hard work, natural language problems
included. Some of his ideas are mentioned in my PRACTICAL
DATABASE FOUNDATION papers; we will have to wait for his
publication otherwise.
Ed. Note: A
possible paper in the series on POOD is planned. See my update notice on the original paper by Date
and McGoveran.
From: Brian Selzer
While I agree in principle with the proposition [of POFN and
POOD], "exactly one relation for each entity set; and exactly one entity
set for each relation," in practice I have often run across situations
where I had to split a table verticallyeither for security, performance or
logistical reasons. If it can happen to
me, it can happen to students, so it would be a good thing for them to
understand exactly what is required to maintain database equivalence (and
integrity) in the face of such implementation hurdles.
So far I haven't come across a good
description of the types of constraints required to maintain a 1:1 relationship
or a 1:1..n relationship nor how to implement them with existing commercial
database products. Do you perhaps know
of one?
From: Fabian Pascal
Please note that security, performance, or logistical reasons
are not at the logical level. Now, if you need to violate relational
design principles for those reasons, you have to accept the consequences, and
you may end up defeating the purpose.
My instinctive response is that violating the POOD by
splitting an entity set into multiple R-tables is not different in essence than
violating the POFN by bundling two entity sets into one R-table—it would have
similar consequences.
In the latter case you would need to add constraints to
address corruption risks due to the redundancy without addressing the other
consequences e.g. update anomalies, database bias, etc., and those constraints,
as I demonstrate in The
Costly Illusion: Normalization, Integrity and Performance, defeat the
purpose of denormalization. In other words, you add prohibitive work for
yourself and you gain nothing. What is more, if you want to do it to gain query
(but no update) performance, SQL products don't make it easy for you to declare
those constraints, and it may not even be possible.
In the former case consequences would be the same and you
could probably figure things out in the same way.
From: David McGoveran
To: Brian Selzer
None of "security, performance or logistical
reasons" are part of the logical model.
I think I need an
example of both a 1:1 and a 1:1..n across a split to respond properly.
From: David McGoveran
To: Fabian Pascal
I see your point in a way, but it isn't correct from my
perspective as phrased. POOD and POFN pertain only to the logical schema. Any
"non-logical reasons" and denormalization can only pertain to the
physical schema. The only physical schemata permitted are those that are
semantically equivalent to the logical schema.
Per above, they aren't really being "added" just
translated so that they reference the tables in the physical schema.
At least conceptually, but not always in terms of resulting
performance because (e.g.) the optimizer may be able to do an index join and
never access one or other of the "projections".
Design the logical, then the physical, then push the relation
predicate down to the projections in the physical as table constraints and a
set of a referential constraints among them.
From: Fabian Pascal
To: David McGoveran
One can logically denormalize (violate POFN) or split
(violate POOD) if one wants to. IF they do—and of course this is not recommended,
but IF they choose to
do so—then they introduce redundancy which must be controlled. That requires
some additional constraints to ensure that updates do not cause corruption.
Note that I am not referring to how to do it right and NOT
violate. I am referring to what people usually do in practice: they do violate
and in that case all they can do is add those constraints. It's prohibitive, it
defeats the purpose, but IF they want to know what to do then that's it.
I told Chris Date that I don't like the concept of
"physical denormalization" precisely because normalization and its
violations are logical concepts. In my book, one can normalize or denormalize
logically. Anything else is physical design and is not termed normalization or
denormalization.
Any kind of DBMS product optimization of joins is unknown by
the user at db design time, at which point he must decide whether to fully normalize
and rely on optimization (which may not occur or may be poor), or denormalize
to avoid the joins altogether.
Ed. Note: And since
they are working with SQL product, chances are that optimization is not the
best possible.
From: David McGoveran
To: Fabian Pascal
This is just a matter of word usage. Denormalize is a bad
word in general. It is impossible for it to be "logical"—perhaps
"denormalize the logical" is meaningful, but it can only mean to
"undesign" or back out of the logical schema. What most people call
denormalization is always for physical reasons. If the initial physical design
is set equal to the logical, then one might "denormalize" it. Either
way, yech!
From: Fabian Pascal
To: David McGoveran
I understand your view, but I formulate it differently.
Full normalization is correct logical design. Therefore, undoing it—what is
done in industry practice—is also logical. It is done for physical reasons
indeed, but it is not done physically—as it should be—but rather,
incorrectly, at the logical level, due to ignorance, product limitations, etc.
Otherwise put: Do logical design correctly and you will end up with fully
normalized databases. Don't mess up logical design, or undo the correct one and
end up with logically denormalized databases; instead, make sure the physical
design (for which the product should permit maximum flexibility)
leads to good performance of fully normalized databases.
From: Brian Selzer
Here are
some simple examples:
Brian
Selzer is a member of the Selzer family. A family cannot exist
without members and a member cannot exist without a family. There is a
1:1..n relationship between families and members. It's obvious that a
foreign key constraint on members that references families is not sufficient to
enforce the 1:1..n constraint because it does not prevent the existence of a
family without members.
For a 1:1
relationship, the first example I can come up with off the top of my
head is splitting an employee into public information like name,
department, shift, work center, and phone #, and private information like
SSN, marital status, exemptions, and salary. I think that the split makes
sense because only the payroll application needs the private
information, so it's easier to secure it in a separate table. It
also boosts the performance of every other application.
From: Fabian Pascal
To: Brian Selzer
I’ll let David respond to the more general question, but
please note again that only the payroll application needs the private
information, so it's easier to secure it in a separate table is not a
formal logical design principle, but rather an arbitrary one based on database
bias (see my The
Costly Illusion: Normalization, Performance and Integrityfor an
explanation of the term).
If a certain application needs access only to that information, the solution is
the pertinent view (incidentally, access should be to views, not base
tables as much as possible anyway; it’s SQL products that inhibit this approach
due to update limitations and, to remind you, POOD was discovered while trying
to formalize algorithms for view updates; violations thereof cause
problems for precisely that).
From: Brian Selzer
To: Fabian Pascal
Perhaps I didn't express myself
clearly. The private information is only used by the payroll
application. This does not mean that the public information
isn't. I guess I could create a view for every other
application to use....
I think I mentioned before that
from a logical standpoint I generally agree with POOD. I understand your
point about database bias, too. I disagree, however: In addition to
security, there is a measurable advantage (at the physical level) in separating
infrequently used columns into their own table--particularly if they're only
used by one application. More rows fit on each page, so fewer
reads are required to satisfy most queries. The only
disadvantage is in enforcing the necessary referential constraint. This
will certainly slow down INSERTs. DELETEs are a bit slower, too, because
there must be two index seeks. But most UPDATEs are faster. Since
queries account for between 85 and 95% of database activity, improving query
performance will generally yield better overall performance.
The point I'm trying to make is that because external
requirements influence physical
database design, there may at times be a need to split a table
vertically. Therefore, it would be a good thing to fully understand
the implications of doing so, and to have a thorough understanding of
the constraints needed to maintain integrity, since a foreign key isn't
sufficient. That's why I think that the concept of
"non-loss" decomposition as applied to relvars, or relation
schemas should be debunked, or at least clarified.
From: Fabian Pascal
To: Brian Selzer
That's the recommended approach in a truly relational
environment, which of course you don't have. Therefore, what you are forced to
do by SQL implementations, there is no RM to save you from the consequences
thereof.
You did not understand my point. Security is not a formal
logical design principle. Therefore, whatever the implementation forces you to
do, it is a violation with consequences, whether you like it or not.
The implementation should then permit you to split table data
at the physical level, not at the logical level. The problem is in SQL
they are tied together, and you’re not given the necessary options (see the
various postings on the TransRelational Model).
From: Brian Selzer
To: Fabian Pascal
What
difference does it make if it isn't a formal logical design principal. I
already acknowledged that point. But I have to live in the real world,
and commercial database engines don't support multiple assignment, don't
support declarative referential integrity constraints beyond primary and
foreign keys, and have physical limitations, such as a maximum number of
columns for a table. Thus, there exists today a divide between
the optimal logical database design and possible physical database
designs. The challenge is in choosing a physical design that is
equivalent to the optimal logical design and meets all of the external
requirements.
From: Fabian Pascal
To: Brian Selzer
If you do not see the difference it makes, I can't explain it
to you better. It is not a matter of sheer acknowledgement, but of
understanding.
Correct, the real world does not offer you the right solutions; that's been our
position for years. So you may have to violate formal principles to do your
work. That has consequences which you cannot avoid, but can only try to
attenuate as best you can. You seem to be asking for solutions that avoid those
consequences.
The Costly Illusion
makes it clear how to figure out the necessary constraints in the case of POFN
violations, what are the consequences, what is possible to do with SQL systems,
and what you end up with. Apply the same approach to violations of POOD and
you'll answer your own questions.
From: David McGoveran
Minor historical correction: POOD was "discovered"
by myself about 1985 while trying to figure out algorithms to merge (both
relational and non-relational) databases into a single relational database and
be able to back out of them. POOD is one of 3 logical design principles that I
"defined" and have followed to great advantage since that time. It
arose from a consideration of how to develop a mathematical representation
(i.e., model) that has certain highly desirable transformation properties, and
then translating those concepts into database semantics.
POOD has been the subject of various attempts at
formalization since I informed Chris of the principle when he asked me about a
logical conundrum in his early efforts at view updating. While I intuitively
knew how to apply the principle, Chris and I engaged in extensive faxes in the
early 90s trying to convey the idea to Chris and to reach a usable level of
formalization.
From: Fabian Pascal
To: David McGoveran
After discussions with you and with Chris, and after reading
Chris' last 3 papers which I sent you, I am more and more convince that,
whatever the formal definition, the informal version to be applied by the
practitioner is: no cross-table duplication, which means no two tables should
have rows representing the same entity (or propositions about the same entity).
It also has the nice symmetrical elegance together with the informal version of
POFN.
It is quite possible that the difference between you and CJD stems from the
difference between his header/body view of a relation, which you (and recently
me) do not espouse, and therefore differing definitions of a tuple.
Unfortunately, if my sense is correct, this is not good news for 6NF.
Posted 2/24/06
© Fabian Pascal 2006 All Rights Reserved