PAPERS


PRACTICAL DATABASE FOUNDATIONS SERIES

by Fabian Pascal


This series of papers (also available in seminar form) are a set of originally written papers, the objective of which is to make the scientific foundation of database management and its practical implications accessible to the data practitioner and user.

They cover aspects that are not addressed, or are distorted in the IT industry, are not sufficiently or properly clarified and/or are misunderstood. Their objective is to stay as close to and be as accurate about the formal foundations of the field as possible, while using simple, readily understood language.

The debunkings of common misconceptions provided at the end of each paper can be used by the reader to test himself/herself on comprehension of the subject matter.



NEW!! THE KEY TO KEYS: A MATTER OF IDENTITY NEW!!
v.1 (January 2014)


Note: This paper assumes familiarity with the concepts and terminology introduced in papers #1, Business Modeling For Database Design and #2, The Costly Illusion: Normalization, Integrity and Performance, in this series, which are both recommended as preamble. 

If entities in the real world did not have identifiers—attributes that capture their identity and uniquely identify them—we would not be able to tell them apart. It follows that an accurate database representation of a business reality must include keys, R-table columns that formally represent the real world identifiers in the database.
Keys and types thereof, their necessity, selection, function and properties are often not well known and understood.

This paper
  • Defines and explains the key concept;
  • Explains the function and properties of the various types of key;
  • Describes the criteria for key selection;
  • Specifies what is proper DBMS key support;
  • Assesses SQL's key support;
  • Debunks some common misconceptions about keys.

Note: Figures are numbered by Chapter. The examples are kept as simple as possible to make it easy to understand the ideas, but the reader should consider the implications for realistically complex tables.

Table of Contents

Introduction
1. R-tables and Integrity Constraints
2. Keys and Key Constraints
3. Kinds of Keys
3.1. Natural Keys
3.1.1. Candidate and Primary Keys
3.1.2. Simple and Composite Keys
3.2. Surrogate Keys
4. Key Functions
4.1. Duplicate prevention
4.2. Integrity Burden Reduction
4. 3. View Updatability
5. Foreign Keys and Referential Constraints
6. DBMS Key Support
7. Keys in SQL
7.1. SQL and Duplicates
Appendix A: Duplicate removal in SQL
Appendix B: Duplicates and Language Redundancy
Conclusion
References


THE FINAL NULL IN THE COFFIN: A RELATIONAL SOLUTION TO MISSING DATA 

v.3 (August 2013)

The relational data model is based on the two-valued logic (2VL) of the real world: every proposition about the real world is unequivocally true or false. But our knowledge of the real world is usually imperfect—some data is missing—which means that we don't always know whether propositions are true or not; 2VL no longer applies and data integrity and database query results are no longer guaranteed to be enforceable and provably logically correct with respect to the real world.

Missing data has possibly been the thorniest aspect of database management: without a logically sound yet practical solution, data professionals and users are left between a rock and a hard place. They must either (a) rely on SQL's arbitrary and flawed implementations of three-valued logic (3VL) based on NULLs and risk results that are easy to misinterpret, or erroneous in ways hard to discern, or (b) undertake in applications a prohibitively complex, error prone and unreliable burden that belongs in the DBMS.

This paper illustrates some of the drawbacks of the many-valued logic (nVL, n > 2) approach to missing data and SQL’s NULL scheme and proposes a solution within the 2VL/relational framework that:
  • Guarantees data integrity and logically correct query results;
  • Avoids the complications and problematics of nVL/NULL's;
  • Requires no changes to the relational model;
  • Is largely transparent to users;
  • Keeps users better apprised of the existence and effects of missing data.
The proposed solution requires research into its implications for data manipulation and integrity enforcements before it is implemented, but we believe it is theoretically sound and implementable in a truly relational DBMS (TRDBMS) using technologies that, unlike SQL, support full physical data independence e.g. the TransRelational™ Model (TRM).


Table of Contents
  • Introduction
  • "Inapplicable Data”: Nothing's Missing
  • Missing Data: Into the Unknown
  • SQL’s 3VL: NULL
  • Known Unknowns: Metadata
  • A 2VL Relational Solution
  • The Practicality of Theory
  • 2VL vs. NULL in the Real World
  • Relation Proliferation
  • The TransRelational™ Model
  • Conclusion
  • Some Misconceptions Debunked
  • References


THE COSTLY ILLUSION: NORMALIZATION, INTEGRITY AND PERFORMANCE 
v.5 (February 2013; revised April 2013)

A core database design principle is the Principle of Full Normalization (PFN). Database designs that do not adhere to the PFN present certain drawbacks for relational manipulation, integrity enforcement and usability and, consequently, for application development and data accessibility. Despite the plethora of information on the subject (not all of it correct, or well explicated), the subject is still poorly understood.

The first paper in this series, Business Modeling for Database Design, outlines a methodology that implicitly produces fully normalized databases. But because business modeling is informal, ad-hoc and based on subjective perceptions of reality, violations of the PFN, either due to inadvertent errors, or intentional “denormalization for performance”, due to the mistaken notion that logical database design, rather than physical implementation factors, is responsible for poor performance.

PFN violations impose considerable costs that are mostly ignored. Data redundancy is only one of them, albeit a major one. It increases the integrity burden for both developers and the DBMS. What is more, if they undertake the burden to control redundancy, not only will performance gains from denormalization fail materialize, but may actually  degrade performance.

This paper presents normalization as a method of database design repair. It is required only if and when databases are poorly designed as a way to eliminate the drawbacks caused by PFN violations.

It explains in easy to understand language:
  • The kinds of PFN violation possible;
  • The undesirable consequences of PFN violations;
  • How to repair designs and eliminate the drawbacks;

Table of Contents
  • Introduction
  • R-tables, Keys and Dependencies
  • Normalization and Normal Forms
  • Further Normalization As Design Repair
  • Join Dependencies
  • “The Whole Key” and 2NF
  • “Nothing But the Key” and 3NF
  • “The Whole Key” and BCNF
  • Multivalued Dependencies and 4NF
  • Interval Data and 6NF
  • “Denormalization For Performance”
  • The Logical-Physical Confusion
  • Redundancy Control
  • A Note on JDC’s and SQL
  • The Real Problem and Solution
  • Some Fallacies Debunked
  • Dependencies
  • Normal Forms
  • Overnormalization
  • Conclusions and Recommendations
  • References


BUSINESS MODELING FOR DATABASE DESIGN
v.5 (August 2012)

The purpose of this paper is to introduce practitioners to data fundamentals via the process of business modeling for database design. It focuses particularly on and distinguishes between the three types of model that are often confused: business model, logical model and data model.

The objective is understanding, as distinct from the "cookbook approach"

Table of Contents

· Preface
· Introduction
· Business Modeling
  · Basic Modeling Concepts
  · Business Rules
    · Property Business Rules
    · Class Business Rules
      · Attribute Rules
      · Distinctness Rules
      · Functional Dependency Rules
      · Referential Rules
      · Arbitrary Rules
      · Business Models
· Database Design
  · Predicates and Propositions
  · The Relational Data Model
    · Structure
    · Integrity
    · Manipulation
  · Logical Models
· Understanding Database Management
  · Truth vs. Correctness
  · Database and DBMS Defined
  · A Foundation Framework
· Applying the Framework: Exercises
· References



TRULY RELATIONAL-WHAT IT REALLY MEANS
v.2 (January 2011)

In the first paragraph of his first ever published exposition of the relational idea 42 years ago, E. F. Codd claimed three critical advantages of a relational model of data:
  • A sound theoretical basis—logic and mathematics—for database management
  • Physical data independence
  • DBMS-guaranteed integrity of data and query results
He also made explicit the objective of relieving practice from the complexities and rigidity of the hierarchic and network commercial DBMS products at the time.

Yet from 1969 to date the industry has failed to concretize Codd’s ideas by implementing SQL-based products that have limited relational fidelity and violate many relational principles. Moreover, instead of correcting the mistakes, the industry has been regressing to the same costly and unproductive technologies made obsolete by Codd’s innovation more than
thirty years ago.

Much of this is due to an educational failure: vendors, experts, users and the trade press do not know, understand and appreciate the practical value of Codd’s contribution and the huge cost of ignoring it. Young generations of practitioners are not even introduced to the relational model and are instilled either with the notion that SQL products are relational, or that relational technology is obsolete. Driven by industry rather than principles, academia has renounced its true function of education and is serving as product trainer for vendors.

It is therefore imperative—and proper to start this series, intended to make data fundamentals accessible to practitioners—to revisit Codd’s original work, reassert those aspects that have been ignored, recall those that were missed, clarify those that are opaque, correct misinterpretations as well as original mistakes and settle current disagreements and confusion over what the relational model really is.

This paper covers Codd’s seminal first two papers, Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks (1969), and A Relational Model of Data for Large Shared Data Banks (1970), the latter being an important public revision of the former (an internal IBM document), that contained changes and introduced new material.

Table of Contents

· Introduction
· Relations on Domains
· Relation Representation
· Time-varying Relations
· Relation Interpretation
· Data Sublanguage
· Atomicity, Nested Relations and Normalization
· Foreign Keys and (First) Normal Form
· Operations on Relations
· Kinds of Relations
· Derivability, Redundancy, Consistency
· Debunking Misconceptions
· Conclusion
· References





USE OF MATERIAL POLICY

Publisher hereby grants to purchaser non-exclusive, non-transferable, revocable rights to use the materials for non-commercial purposes as follows:

Authorized Use

Authorized Users: Individuals who purchase the materials for their own exclusive use;

Authorized Uses: Authorized users may use the materials as is consistent with the Fair Use Provisions of United States and international law, for non-commercial purposes as follows:
  • Individual buyers have exclusive access to the materials;
  • Academic buyers may use the material, with appropriate credit, in the preparation of course packs or other non-commercial, educational materials;
  • Authorized may use, with appropriate credit, figures, tables and brief excerpts in the user's own scientific, scholarly and educational works;
  • Institutional volume buyers should check with Publisher on authorized use;
Specific Restrictions on Use of Materials

Unauthorized Use: Authorized users shall not

  • Knowingly permit others to use the materials;
  • Modify or create a derivative work without prior written permission of Publisher;
  • Remove, obscure or modify any copyright or other notice;
  • Use the materials for commercial purposes, including but not limited to sale, reproduction, or distribution in any form;


PRICING

Single paper rates

Individuals $15.00
Corporate   $25.00
Nonprofit   $20.00
Students    $10.00
  
Contact us for volume licenses.

Order all 4 papers and get a two-paper combo free:
WHAT FIRST NORMAL FORM REALLY MEANS by C. J. Date
WHAT FIRST NORMAL FORM MEANS NOT by F. Pascal



UPDATES AND NEW VERSIONS

Updates are free to buyers of the original version, and can be downloaded using the same password. New versions require new purchase and password.


HOW TO ORDER

We prefer payments in US$ via PayPal. For check/money order payments email us at the email address on the About page.

  1. Select the paper(s) to order;
  2. Calculate the total order price;
  3. Specify the papers you order in the text field below;
  4. Click on the Buy Now button;
  5. At the payment form, either login to your Paypal account, or pay by credit card;
  6. Upon notification of payment by PayPal, we will email download instructions.


Which paper(s)?

Address any questions or payment/delivery issues to the above address.
We appreciate your support, which keeps this site free.

Thank you.

Fabian Pascal
Founder, Editor and Publisher

1 comment:

  1. Some professors erroneously tell students The item "law school is not all about memorization." my spouse and i say "erroneously" since the law school will be information about memorization... thus much more. But for the moment, let's only focus in grades - AS WELL AS with regard to all courses, That means focusing at exams. electronic press kit

    ReplyDelete