From: Michael Osuna
Date: 28 Feb 2006
In response to Hugh Darwen's second response to Brian Selzer
in On POFN and POOD
[...], I believe that all the confusion is merely do to the way you
denote relations. When you write
ENROLLMENT (StudentId,Name,CourseId )
you actually imply a whole slew of constraints. If we choose
an alternative notation we can make these explicit ... A domain is a set of
values. A function is a mapping from
one set of values to another ...
From: Fabian Pascal
To: Michael Osuna
Cc: Hugh Darwen
I will let Hugh respond. I will just note that work is being
done along these lines. Stay tuned to dbdebunk.
From: Hugh Darwen
To: Michael Osuna
I agree ... Indeed, you are echoing one of the discomforts
(with the way normalization theory is presented) I have had myself for many
years; and I thought I had at least tacitly expressed this discomfort in my
reply to Brian Selzer.
I'll stay with my example, but excuse me for reverting to my
(UK) spelling of it (what right did you have to change it? he asks indignantly!
:-):
ENROLMENT { StudentId, Name, CourseId }
A decomposition into CALLED { StudentId, Name } and
IS_ENROLLED_ON
{ StudentId, CourseId} requires some constraints to be expressed to maintain
logical equivalence with ENROLMENT, exactly as Michael Osuna details.
But the teaching of normalization has always included
teaching the problem referred to as "update anomaly", and how
normalization addresses such anomalies.
An update anomaly pertaining to ENROLMENT is that "deleting the
last of the courses on which a particular student is enrolled deletes all record
of that student's name". Another
(really the same one) is that "we can't record a student's name until that
student enrolls on some course".
Given those "anomalies", I always wanted to ask, screamingly,
"In that case, why did we propose ENROLMENT in the first place?
If a student can be named without being
enrolled, then ENROLMENT is not a true reflection of that part of our
enterprise that we wish our database to record information about; it expresses
constraints that are not constraints out there in the real world!"
That said, if it really is the case that we want to avoid
recording names of students who are not enrolled, then we will very likely
still want to decompose, to address the other problem that normalization
addresses: redundancy.
Hence my preferred approach to (relational) database design:
start with 6NF (guaranteed to avoid the kind of redundancy the NFs address,
also guaranteed to avoid "update anomalies", also very easy to
understand) and then consider opportunities for expressing certain constraints
conveniently by denormalizing!
E.g., if every student that has a student id also has a name,
and every student that has a student id and a birth date also has a name, and
every student that has a student id and a name also has a birth date, then we
can save ourselves a lot of writing—and probably the system a lot of work—by
joining the three 6NF relvars together into one.
But the old advice would remain, in general: in so
denormalizing, do not go so far as to violate 5NF.
From: Fabian Pascal
To: Hugh Darwen
Well, if you read my writings on normalization, you will see
that I keep repeating that correct database design that accurately reflects the
conceptual model will yield fully normalized (5NF) databases. Normalization as
such is pertinent only as a repair of poorly designed databases.
And also (to reiterate): update anomalies and redundancy are
not
the only drawbacks of databases that are not fully normalized (that is, poorly
designed). I counted several others: database application bias, harder to
understand databases, more complex queries, and harder to interpret results.
From: Michael Osuna
To: Fabian Pascal
I disagree with the statement: "correct database design
that accurately reflects the conceptual model will yield fully normalized (5NF)
databases". If you take
"correct" and "accurately" to mean that the database design
is logically equivalent to the conceptual
model.
I think a counter example to your claim would be that this:
IS_ENROLLED ( StudentId,CourseId )
IS_NAMED ( StudentId,Name )
is logically equivalent to this:
IS_ENROLLED ( StudentId,Name,CourseId )
IS_NAMED ( StudentId,Name )
with the additional constraint that IS_ENROLLED's Name and
IS_NAMED's Name are equal when joined.
Of course the latter example is almost certainly "poor" design
intuitively, and is definitely "poor" design if being in 5NF is the
definition of "good" design.
But it is correct and accurate.
A relation is essentially shorthand for specifying a number
constraints (and other things). In
fact, I argue that the normalization of relations into simpler relations is
logically equivalent to explicitly specifying the constraints that being a
relation implies.
In summary, I argue that the Normal Forms are a formal way to
say this design is "good" and that design is "poor", but
that it does not necessarily follow that a "correct" and
"accurate" design will be a "good" design. Further, I argue
that other formalizations of "good" can be made, and they could
include models that measure the logical cost of certain operations on the
database and seek to minimize them. In
some of these models my "poor" design may be the "good"
one.
From: Fabian Pascal
To: Michael Osuna
Well, I could not in an email expand on what I meant by
"correct" and "accurate". It is explained, however, in my
papers, particularly #2, which is now being updated.
Ed. Note: I agree
with most of the message, but I do not know what the last paragraph means. An
explanation is not invited, though.
From: Hugh Darwen
To: Fabian Pascal
Yes, I agree with all those too, Fabian, of course.
But I don't agree that 5NF is "fully
normalized". As a matter of fact,
I'm not even sure that 6NF is!
Consider
S {S#,City,Status,Name }
In 6NF this becomes
SC { S#,City }
SS { S#,Status }
SN { S#,Name }
But what if a supplier can be registered—i.e., allocated a
number—who (as yet) has no name, no status, and no city?
(Silly example, but one can think of better
ones to illustrate the point.)
I know what a proper n-ary relation represents in the real
world. I do not know what SC, SS and SN do. I know what you will say they
represent, but it is a convincing and practical way of representing the real
world.
What do you mean by "proper n-ary"?
What's an improper one? I hope you don't mean that n must be >2
to be proper!
More importantly, what do you propose in lieu of my 6NF
design in the case where those various properties of suppliers really can exist
independently of each other over time?
And why isn't the 6NF design practical under those circumstances?
From: Fabian Pascal
To: Hugh Darwen
Sorry, I misspoke. I prefer R-table to relation (explained in
my paper #1, Truly
Relational: What It Really Means), and I meant to say a proper
R-table—a table that obeys the relational discipline.
Perhaps it would be best to wait for the update of my paper
#2, Conceptual
Modeling and Database Design that I am currently working on to see what
we propose. I will just say that (a) a collection of properties defines a class
of possible entities, subsets of which occur at various times (b) entities are
collections of values of those properties which are not independent.
Within this framework I do not know what "properties of suppliers existing
independently of each other" means.
[If the example involves entities with unknown property
values] the solution is described in my paper #4, The Final NULL in
the Coffin. [If your example involves entities with variable
collections of properties at different times] the solution is proper entity
super/sub types [a paper on that may be forthcoming].
Ed. Note: the
SC/SS/SN design violates the POOD by splitting an entity into three R-tables
and duplicating key values.
From: Hugh Darwen
To: Fabian Pascal
No, I don't talk about properties having UNKNOWN values..
I consider "atomic" predicates, such as:
"Supplier S# is registered."
"Supplier S# is named SNAME"
"Supplier S# is located in CITY"
"Student SID is enrolled on course CID"
"Staff member TID is assigned to student SID on course
CID".
A predicate is atomic if its corresponding relation is
irreducible. A 6NF relation is
irreducible.
Irreducible relvars can conveniently be combined (via join)
only if the constraints we have been talking about hold true.
Note "conveniently". They don't have to be combined if proper
relational technology is available (it isn't).
Of course we can have SNC { S#, Sname, City } even if not
every registered supplier has both a name and a city, but then that relvar will
be usable only for those suppliers who do have both.
We will have to have additional relvars for those who have a name
but no city, those who have a city but no name, and those who have neither name
nor city. That does not strike me as
being very convenient, for obvious reasons.
I don't see how these observations can possibly be
controversial, even in an SQL environment where alternative approaches are
available—alternatives that are antirelational and troublesome, of course.
As you know, I did not agree with the approach you outlined
(but did not flesh out) in the paper you mention.
From: Fabian Pascal
To: Hugh Darwen
Well, you and Chris eschew entities. I sympathize, but I don't
think they can be avoided. [After what exactly (a) are the subjects of
propositions (b) what exactly do key values uniquely identify? (c)
whose are the attributes?]
Your design violates the POOD, but you don't accept it. [Ed. Note: Irreducibility or not, POOD
violations introduce certain redundancy: the existence of an employee is
stated multiple times].
Not exactly. Proper RDBMS support of entity type
hierarchies would handle that situation transparently, without inconvenient
user intervention.
Since I don't agree with your approach either, we will have
to agree to disagree and wait until the formal foundation for our approach
(David’s actually) can be published.
Posted 4/21/06
© Fabian Pascal 2000-2006 All Rights Reserved