Sunday, March 31, 2013

Site Update

1.
The 'Quote of the Week' was posted on the QUOTES page.

2.
A 'To Laugh or Cry' item was posted on the LAUGH/CRY page. Many of its pronouncements exhibit a poor grasp of foundation knowledge. It is a rich target for debunking, so I may tackle it at some point. You may want to test your mastery of fundamentals before I do.

3.
Links to online exchanges I participated in were posted on the FP ONLINE page.

4.
Job description: Solutions Developer. No comment.

5.
An oldie but goodie republished: Leonardo Was Right!


Thursday, March 28, 2013

Social BigData and Relational Denial

Note: Minor edits 3/29/13.

In an online discussion initiated by the question Does It Matter If Data is BIG or not? MQ commented:
I still feel the discussion around Relational Modelling is confusing the point, and should be put aside until the problem is understood. If a company came to me and said 'Help me solve my big data issue - I have a billion emails I want to analyse' my answer is not 'just create a logical model using relational model theory' because this does not supply an answer. I will make more ground if I say 'right, lets discuss what this is, what technology you have, where the fail points and choke points are, etc and model (relational model) that as part of the process'.
I've built data models for 25 years (all levels) and firmly believed in Relational Theory across this entire period, so I am not saying drop Relational Models, just saying don't start there. Interestingly, I don't get any backlash against relational modelling using this approach - so perhaps the issues mentioned are about how the concept is sold to clients (a weapon rather than an intellectual concept)?

Sunday, March 24, 2013

Site Update

1.
The 'Quote of the Week' was posted on the QUOTES page.

2.
A 'To Laugh or Cry' item was posted on the LAUGH/CRY page.

Here's an exchange between its author and myself:
I was thinking about database skills. When I started in the nineties, and systems were moving from Cobol to RDBMS, database and design skills were really valuable. Now, as much as we know that DB skills are important and valuable, it seems to be the GUI that is more important. At least in management's mind. :( At my alma mater, no student is studying the database track now. They either do app development, or networking. What does this mean for the future?
That is why the new generation of products are either applications and files, or labeled DBMSs when they are really application-specific DBMSs. Many practitioners do not distinguish between DBMS, applications, network and OS functions. It's all one big lump.
To which I add: This is also the reason application developers like object orientation: it is a programming "paradigm", not a data one.

3.
Links to online exchanges I participated in were posted on the FP ONLINE page.

4.
The following Advanced Database Design and Implementation - Course Outline requires no comment for the reader possessing foundation knowledge. It is good evidence of academia pursuing industry fads rather than leading the industry with science.
This year the course will examine the following two contemporary fields in the database systems area: XML Data Model and XML Databases and Data Warehousing.

XML Data Model and XML Databases will comprise approximately 65% of the course. There, we shall consider topics such as: XML documents, Document Type Definition (DTD) and XML Schema, XML constraints, XML query languages, Types of XML Databases, Mapping XML data to relational databases, Publishing relational databases as XML documents, and what research is going on in the XML database area. The practical experience will be achieved through the use of XML processors like xmllint and the native XML database management system eXist.
...
By the end of the course, students should be able to:
  • Design well formed XML documents that are valid with regard to a given DTD or XML Schema and thus develop the ability to solve practical engineering problems (BE graduate attribute 3(f)),
  • Analyze a part of the real world and design a corresponding XML DTD or Schema in XML normal form and thus develop the ability to formulate and build efficient models of complex systems using principles of engineering science and mathematics (BE graduate attribute 3(b) and BE graduate attribute 3(c)),
  • Design faithful models of a part of the real world using XML database constraints and thus develop the ability to apply mathematical and engineering science in solving engineering problems (BE graduate attribute 3(a)),
  • Use available web sources to learn about the eXist XML database management system and define XQuery queries and XUpdate updates of an almost arbitrary complexity against a native XML database and thus develop the ability to look for additional information from pertinent sources (BE graduate attribute)...
Interestingly, the pre-requisite is familiarity with "Relational Data Model, Structured Query Language (SQL), Relational Functional Dependencies and Normal Forms, PostgreSQL Data Base Management System."

Can't figure out why.

5.
On the lighter side: Models and met-a-date.



Monday, March 18, 2013

Depends on the Dependencies: Normal Forms and the Conceptual-Logical Conflation

by Fabian Pascal and Erwin Smout

REVISED: 10/19/16


Confusion of levels of representation is one of the most common and entrenched fallacies in data management [1]. We have written and spoken frequently about "denormalization for performance" [2,3,4,5,6], which is rooted in the logical-physical confusion (LPC), but there is another fallacy, caused by the conceptual-logical conflation (CLC). There is even less awareness of the CLC than of the LPC.

Note:

  • We distinguish between normalization (to 1NF) and further normalization (beyond 1NF up to 5NF), which are qualitatively different. When we mean the latter, we spell it out.
  • For the purposes of this discussion we exclude recently proposed normal forms higher than 5NF, which are targeted at specific narrow circumstances and have not yet attained general consensus. If and when we refer to 'full normalization' we mean 5NF.
  • We deliberately overlook the distinction between BCNF and 5NF, because cases where there actually is a distinction are fairly rare in practice and it does not materially affect our argument.
You encounter CLC when one or more tables are accompanied by the question "Are these fully normalized and, if not, how do I fully normalize them?" without any reference to the conceptual model that they are supposed to represent. Or the following:
"The primary purpose of normalization in a logical entity-relationship model is to identify a single point of definition (or a single point of update for the ACID-oriented modellers) for any given data-item (or groups of data-items in the higher normal forms). Unfortunately just about all discussions of “normal forms” are very simplistic and nearly all the examples of the “normalization process” deal with very simple examples focussed on the values assigned to data-items rather than the purpose of the data-item itself." --When was the last time you were able to build a relational model without a non-DB person insisting on the 'need' for denormalizing?
An enterprise-specific entity-relationship model (ERM) is at the conceptual level (about the real world). As defined by Codd and others, normal forms of relations are at the logical level (database representation). "Logical ERM" conflates the two levels. CLC is a frequent factor in poor database design: normal forms are not "in an ERM", but in a logical model.
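To illustrate the distinction with a minimal sketch (the business rule, table and column names here are our own hypothetical illustration, not taken from the exchange discussed below):

-- Conceptual level (a business rule about the real world):
--   "Each customer has exactly one credit rating."
-- Logical level (database representation): the dependency
-- {CUSTOMER_ID} -> {CREDIT_RATING} holds in CUSTOMERS and is enforced by
-- declaring CUSTOMER_ID the key; it is here that normal forms are assessed.
CREATE TABLE CUSTOMERS (
  CUSTOMER_ID   INTEGER     NOT NULL PRIMARY KEY,
  CUSTOMER_NAME VARCHAR(60) NOT NULL,
  CREDIT_RATING CHAR(3)     NOT NULL
);

The rule belongs to the conceptual model; the dependency, the key and the normal form are properties of the logical design that represents it.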

Note: The comment offers a very unusual formulation of the objective of full normalization -- redundancy elimination -- that very few would recognize as such (we did not). With considerable slack in the interpretation of terms, it goes like this: "Without redundancy, every data item appears only once in the logical structure and, thus, provides its own 'unique single point' to which updates must be applied" ("ACID modelers" is an equally odd term and we would not use it in this context). We also pondered the meaning of "focused on the values assigned to data-items rather than the purpose of the data-item itself": full normalization applies to relations, not to the attribute values within a relation.
 
First, the fundamentals:

  • A relation is normalized (in 1NF) if all its attributes are defined on simple domains without meaningful components, whose values are treated atomically by the data language;
  • A relation is fully normalized (in 5NF) if all dependencies that hold in it are functional dependencies (FDs) of the non-key attributes on the key (informally, this means that the relation represents facts about a single type of entity).
  • A relation that is in 2NF-4NF (i.e., represents facts about entities of multiple types) can be further normalized by replacing it with its 5NF projections that represent facts of a single type each, without loss of information. 
  • 3NF-5NF relations can be denormalized down to 2NF by replacing them with their joins [2], without loss of information (see the sketch following this list).
  • Relations in 1NF can be denormalized by introducing non-simple attributes (e.g., attributes defined on relation-valued domains with meaningful components), whose values are treated as non-atomic by the data language.
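Here is a minimal sketch of further normalization and its reversal (our own hypothetical relations and names, not from the example discussed below), assuming the dependencies {EMP_ID} -> {ENAME, DEPT_ID} and {DEPT_ID} -> {DEPT_NAME} hold:

-- Not fully normalized: bundles facts about two entity types
-- (employees and departments); key {EMP_ID}; 2NF, not 5NF.
CREATE TABLE EMP_DEPT (
  EMP_ID    INTEGER     NOT NULL PRIMARY KEY,
  ENAME     VARCHAR(50) NOT NULL,
  DEPT_ID   INTEGER     NOT NULL,
  DEPT_NAME VARCHAR(50) NOT NULL
);

-- Further normalization: replace it with its 5NF projections, each
-- representing facts about a single entity type; EMP_DEPT is recoverable,
-- without loss of information, as the join of the two on DEPT_ID.
CREATE TABLE DEPARTMENTS (
  DEPT_ID   INTEGER     NOT NULL PRIMARY KEY,
  DEPT_NAME VARCHAR(50) NOT NULL
);
CREATE TABLE EMPLOYEES (
  EMP_ID    INTEGER     NOT NULL PRIMARY KEY,
  ENAME     VARCHAR(50) NOT NULL,
  DEPT_ID   INTEGER     NOT NULL REFERENCES DEPARTMENTS (DEPT_ID)
);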
Simple examples are not necessarily simplistic for teaching purposes. Carefully selected simple examples are often essential to understand general principles, which can then be applied to more complex realities. In fact, the commenter himself uses a simple example to demonstrate what he calls "overnormalization":
"A particular bug-bear and a mistake that +90% of “data modelers” make, is analyzing “point in time” views of the business data and “normalizing” those values hence failing to consider change over time and the need to reproduce historic viewpoints. Let’s say we start with this list of data-items for a Sales-Invoice (completely omitting details of what’s been sold):
SALES-INVOICE
{Invoice-Date,
 Customer-Account-ID,
 Customer Name,
 Invoice-Address-Line-1,
 Invoice-Address-Line-2,
 Invoice-Address-Line-3,
 Invoice-Address-Line-4,
 Invoice-Address-Postcode,
 Net-Amount,
 VAT,
 Total-Amount
};
Nearly every time, through the blind application of [full] normalization we get this (you’ll need to draw the diagram to get the picture):
SALES-INVOICE
{Invoice-Date,
 Customer-Account-Id
 REFERENCES Customer-Account,
 Net-Amount,
 Vat,
 Total-Amount
};
CUSTOMER-ACCOUNT
{Customer-Account-Id,
 Customer-Name,
 Invoice-Address REFERENCES Address
};

ADDRESS
{Address-Line-1,
 Address-Line-2,
 Address-Line-3,
 Address-Line-4,
 Postcode
};"
To ascertain the normal form a relation (pictured by an R-table) is in, we must know the attribute dependencies that hold in it, which reflect property dependencies in the real world. Each such dependency is specified by one of the business rules comprising the conceptual model and is represented by an integrity constraint in the database. But we are asked to detect a normalization mistake -- a violation of 5NF -- in the absence of both the rules and the constraints (i.e., to infer/assume/guess the dependencies based solely on the column names). Hence Erwin's response:
"Your example has nothing to do with full normalization. To the extent that any "mistake" has been made, it's probably the failure to recognize that "address line as it was at the time of placing the order" is not necessarily the same as "address line as it is currently known right now, 10 years later". You might be interested to know that there are database design schemes that address this problem--correctly--without "denormalizing the details back into the invoice".
that evoked the following reply:
"If it’s not a mistake then why do so many “data modellers” make that mistake? I’ve lost count of the times that someone, when converting my information mode into a database schema, has done that type of conversion “to normalize the data” and then I have to argue with them to explain the error of their ways whilst they accuse me of “denormalizing” their data. There’s even a term for it – it’s called “over-normalization”."
The database field is riddled with misconceptions and errors due to a poor grasp of data fundamentals, CLC among them. "Overnormalization" is a label attached to all sorts of poor logical designs that have nothing to do with normalization/further normalization. Failure to spell out the conceptual model (business rules), and starting instead with logical database design (including confusing tables with relations), is a certain recipe for poor design and all its costly implications.

In the real world the dependency that holds is not

{customer} -> {address line}
but
{customer, invoice date} -> {address line}
Had the correct dependency in the conceptual model been spelled out, it would have been clear that the single relation is not denormalized. But if we are given either of the two relations without the dependencies and asked what normal form it is in, it is impossible to say with any certainty.
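For concreteness, a minimal sketch of the single-relation design (column names follow the quoted example, with CUSTOMER_NAME omitted for brevity; the types are ours, and we assume, purely for illustration, that {Customer-Account-ID, Invoice-Date} is the key and that the TOTAL_AMOUNT issue discussed in the note below is set aside):

-- The invoice-time address depends on the whole key
-- {CUSTOMER_ACCOUNT_ID, INVOICE_DATE}, per the dependency above,
-- so keeping it here violates no normal form.
CREATE TABLE SALES_INVOICE (
  CUSTOMER_ACCOUNT_ID      INTEGER       NOT NULL,
  INVOICE_DATE             DATE          NOT NULL,
  INVOICE_ADDRESS_LINE_1   VARCHAR(60),
  INVOICE_ADDRESS_LINE_2   VARCHAR(60),
  INVOICE_ADDRESS_LINE_3   VARCHAR(60),
  INVOICE_ADDRESS_LINE_4   VARCHAR(60),
  INVOICE_ADDRESS_POSTCODE VARCHAR(10),
  NET_AMOUNT               DECIMAL(12,2) NOT NULL,
  VAT                      DECIMAL(12,2) NOT NULL,
  TOTAL_AMOUNT             DECIMAL(12,2) NOT NULL,
  PRIMARY KEY (CUSTOMER_ACCOUNT_ID, INVOICE_DATE)
);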

Note: Incidentally, "blind application of normalization" would not have left NET_AMOUNT, VAT and TOTAL_AMOUNT in the SALES_INVOICE relation, because of the real world dependency

{net-amount, vat} -> {total-amount}
(unless VAT denotes a percentage and not an amount, which we are not told either). If this dependency is taken into account, then TOTAL_AMOUNT should be separated into its own relation with {NET_AMOUNT,VAT} as key. However, such a base relation would violate the Principle of Representational Minimality (PORM) [7] because it is redundant: it can be derived from the other base relations. If the calculated attribute is left in SALES_INVOICE, an integrity constraint must be enforced to guarantee that at all times
NET_AMOUNT + VAT = TOTAL_AMOUNT
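In SQL, a minimal sketch of such a constraint (a row-level CHECK; the column names are those of the example):

ALTER TABLE SALES_INVOICE
  ADD CONSTRAINT TOTAL_IS_NET_PLUS_VAT
  CHECK (TOTAL_AMOUNT = NET_AMOUNT + VAT);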
In conclusion:
  • Distortions or misuse of normalization/further normalization should not be confused with the real thing. Often normalization terminology is applied to unrelated design aspects (e.g. designs that avoid missing data, include calculated data or aggregation, or represent entity supertype-subtypes). "Overnormalization" is a misleading term used for any poor logical design.
  • Full normalization alone is not sufficient to achieve designs devoid of drawbacks. For example, "details of historical customer data moved into invoice tables" is not inconsistent with 5NF, yet is bad design.
  • Property dependencies are axioms about the real world, represented in the database by attribute dependencies expressed as integrity constraints. Normalization theory is logical design reasoning based on those axioms (e.g., identification of the key, assessing the normal form of a relation). Different dependencies identify different keys and normal forms. Applying normalization/further normalization without spelling out property dependencies specified in the conceptual model is like reasoning without axioms: any conclusion is possible.
  • Guessing normal forms by inspecting R-tables that picture relations, based on column names alone, is a fool's errand -- tricky even with simple relations [7] and practically impossible with realistically complex ones, particularly beyond BCNF.
  • With knowledge of data and relational fundamentals, a well defined, complete and documented conceptual model and adherence to the Principle of Full Normalization (POFN) should implicitly yield a 5NF database. Explicit normalization/further normalization is necessary only for repair of designs that did not adhere to the POFN. 
  • Practitioners may be aware of one drawback of denormalization -- redundancy -- but not of all the others [6]. What is more, few realize that the specific type of redundancy introduced by denormalizing further normalized relations (denormalizing 1NF relations does not introduce redundancy) must be controlled by the DBMS to ensure database consistency, via enforcement of special integrity constraints, and fewer still know how to formulate them (if SQL supports them at all). Unfortunately, even if they added these constraints -- a serious development burden -- it would all be for naught, because enforcing them would execute the very joins denormalization is intended to avoid [2] (a sketch follows this list).
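As a minimal sketch of such a constraint, suppose CUSTOMER_NAME were denormalized into SALES_INVOICE, duplicating the name recorded in CUSTOMER_ACCOUNT (the names follow the quoted example; the ASSERTION syntax is standard SQL, which few SQL DBMSs actually support -- part of the point):

-- Controls the redundancy introduced by the denormalized CUSTOMER_NAME copy:
-- every invoice's copy must agree with the name in CUSTOMER_ACCOUNT.
CREATE ASSERTION INVOICE_NAME_MATCHES_ACCOUNT CHECK (
  NOT EXISTS (
    SELECT *
    FROM SALES_INVOICE SI
    JOIN CUSTOMER_ACCOUNT CA
      ON CA.CUSTOMER_ACCOUNT_ID = SI.CUSTOMER_ACCOUNT_ID
    WHERE SI.CUSTOMER_NAME <> CA.CUSTOMER_NAME
  )
);

Enforcing it requires exactly the join that denormalization was supposed to avoid; where ASSERTION is unsupported, the same logic must be hand-coded in triggers or application code, which only adds to the burden.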


References

[1] Pascal, F., THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS.
[2] Pascal, F., The Costly Illusion: Normalization, Integrity and Performance.
[3] Pascal, F., Normalization, Further Normalization, Ease of Use, Integrity and Performance.
[4] Pascal, F., Tables, Full Normalization and Business Rules.
[5] Pascal, F., Denormalization for Performance: Don't Blame the Relational Model.
[6] Pascal, F., The Costly Illusion of Denormalization for Performance, forthcoming, All Analytics.
[7] McGoveran, D., LOGIC FOR SERIOUS DATABASE FOLKS, forthcoming (draft chapters).







Tuesday, March 12, 2013

Site Update

1.
The Quote of the Week was posted on the QUOTES page.

2.
A 'To Laugh or Cry' item was posted on the LAUGH/CRY page. It illustrates the perils of online business modeling and database design, and the time and effort imposed by the absence of foundation knowledge.

3.
Links to online exchanges I participated in were posted on the FP ONLINE page.

4.
The SCHEDULE page is now displaying an online monthly calendar which will be updated with my public seminars/lectures, with links to the details. The direct link is
http://pub11.bravenet.com/calendar/show.php?usernum=894201442.

5.
Recommendations:
  • Added Nijssen's CONCEPTUAL SCHEMA AND RELATIONAL DATABASE DESIGN to the recommended books (available via the home page). It is, in my opinion, the best that can be done at the informal conceptual/business level.
  • Moseley, B., and Marks, P., Out of the Tar Pit. A good read on complexity and the benefits of the relational model (h/t Eric Kaun).
6.
Miscellaneous:
  • Somebody was endorsed for 'Thought Leadership'. I guess this reflects the increasing rarity of thinking and thinkers. Time to appoint Chief Thought Officers.
  • Solutions Developer. An excellent example of the factotum approach to hiring and the exclusive demand for tool experience. Consider the probability that one person can be sufficiently competent in all the tools, without any guarantee of foundation knowledge. Related: A Data Warehouse quiz.
  • Making Friends with Science provides some context for the previous two items:
Making friends is truly the beginning of making lasting memories. To make friends with science is truly to start with making good friends that make lasting memories about science. I'm starting a new revolution in the way science will be made socially for the community and ask the community to step in and help make science fun, engaging, real, social and most importantly lasting friendships.


Thursday, March 7, 2013

Site Update

1.
The Quote of the Week was posted on the QUOTES page.

2.
A 'To Laugh or Cry' item was posted on the LAUGH/CRY page.

3.
Links to online exchanges I participated in were posted on the FP ONLINE page.

4. 
Looking for non-proprietary reference Semantic Data Model of Distribution Requirement Plan 

Is there any standard LDM exists for Automotives like CLDM or FSLDM 

Require database for banking customers  

Database table normalization

Detect a pattern? 

5.
My predicted consequences of the BigData and BI fad are coming to pass. On the one hand:

Big Data News Roundup From Porn to Data-ism

On the other:

Trends Shows Problems of Big Data Without Context

What is the purpose of DENSITY in STATISTICS