Sunday, October 4, 2015

Database Education: Oughts and OughtNot's



From an online exchange in response to A Tiny Intro to Database Systems:
C: As a non-CS grad coming fresh to databases, I found both the entity-relationship, and the object-oriented models confusing. Then I read Date [1] and Codd's [2] books and papers on the relational model, the one from the 1970s that is basically set and type theory applied to data, and found that to be a lot clearer and a more powerful abstraction to deal with your data model. As a non-"full time developer" it amazes me the number of "experienced" developers who are not aware of the relational model and who do not know what a foreign key is, or why referential integrity might be important.

For example, your Relational Model introduction has a discussion of various data types. But arguably, whether your integer is implemented as BIGINT or TINYINT is an implementation decision which should be separate from the model discussion (dixit Date). In other words, that attribute has a type of integer and how that integer is stored is a separate issue, and your RDBMS ought to abstract it away (as, I think, Postgres is pretty good with, and MySQL quite annoying). The beauty of the latest RDBMS developments, particularly in PostgreSQL world, is that the implementation has gotten so good that you don't need to really worry about it like you used to just a decade ago, at least in 95% of use cases.

I think one can teach SQL (and the relational model) to a non-developer in about 2 hours, because it is so declarative and intuitive. One day I'll go write that tutorial, as many clients need it sorely.


The confusion is indicative of the failure of database education to clearly delineate the three levels of representation. The E/RM is at the conceptual level, the RDM is at the logical level, while the (so-called) ODM purports to be at the latter, but on closer inspection is actually more akin at once to the other two. The RDM is clearer due to its being formal and its dual theoretical foundation that confers precision, completeness and simplicity. The conceptual level--and, therefore, E/RM and ODM--is informal and lacks such foundation.


Note: To the extent that the ODM can claim a theoretical foundation, it is the directed graph theory, which has proved too complex and inflexible in the past, relative to the RDM.


A R-table is defined on domains, which are data types constrained each to a value range and operator set determined by the corresponding property business rules. Domains are, therefore, elements of the logical, not physical model, for which the term implementation is reserved. The "kind of integer" is part of it only if it's how the DBMS physically stores columns of that type.

Teaching SQL would be much easier if preceded by that of RDM. Instead, SQL is usually taught syntactically--the SELECT clause, the FROM clause, the WHERE clause and so on--without proper reference to the relational  operations they express.

Note: What relational operations underlie a SQL SELECT statement, why it was designed the way it is and what are the pros and cons of that design?

E: To really teach the relational model would take quite some more time. I would discuss database normalization (3NF/4NF/BCNF), query optimization, indexes, foreign key constraints and bridge set-theory with the relational model. Optional parts would be triggers and other kind of constraints.To understand the query planner is tantamount to making good schema's and requires insight in the underlying data-structures (B-tree) and join methods. Then, for the student to get used to this way of thinking, I'd have them implement a simple project, e.g. a hotel-booking system.
Teaching the RDM should include, first and foremost, an introduction to its dual theoretical foundation--first order predicate logic (FOPL) and set theory; its three components--structure, integrity and manipulation; their practical benefits for database management; and the relational fidelity of SQL. Constraints, not just FK's, are a crucial integral part of integrity and triggers are just one way in which specific SQL DBMS's implement them).

Note: Normalization is not part of the RDM, but an adjunct to it--the principles of database design when using RDBMS's.

Z: ... "tertiary storage"? 512 byte page size? The whole topic of concurrency control without one mention of MVCC? ...
It is not the various types of physical storage that are important, but physical data independence (PDI), a core practical benefit of the RDM: the insulation of queries and applications from them and their changes. This would rely on levels of representation, which should precede the RDM.
B: The section "Schema Refinement - Functional Dependencies" is an example of what drives many students out of Computer Science. Even so, this is one of the better introductions to functional dependencies that I've read.
J1/2: It's a topic that you'll experience halfway through a graduate-level textbook on databases ... No gentle way to do it, and a student has probably been driven away well before they read about it ... It is painful to say the least ... put me to sleep in class, but it doesn't really have to be--the teaching technique is terribly important ... it's an example of a tricky topic that ought to be presented by a highly engaging, smart instructor instead of a boring one who may or may not understand the material very well.
An excellent example of how education was replaced with tool training under pressure from students, employers and vendors, who dismiss data fundamentals as "theory and not practical" due to precisely the replacement.

Functional dependencies (FD) are a core element of IT profesionals knowledge in general and a fundamental concept in database design and normalization, with considerable practical value (do you know what it is?) that indeed suffers from the lack of good teaching skills, even though FD's are rather very easy to explain. In the relational database context, in every R-table each non-key column is functionally dependent on the key, a formal way of saying that for every key value there is exactly one value of each non-key column. This is true of R-tables designed to represent a single entity class i.e., tables in fifth normal form (5NF).

A: If they can't stand theory, they shouldn't be trying to learn theoretical database foundations. I agree that, at introduction, material should be made as accessible to students as possible. At the same time, we shouldn't dumb down the field for more advanced students.
Precisely: without exposing the practical benefits of relational theory, there is no reason to expect practitioners to appreciate it. Dropping it altogether is not the solution to practitioners' boredom with it--it is tantamount to discarding physical laws in engineering education--an education demonstrating its practicality is.
T: Unless I a missing something this is just for relational databases. The term database can encompass pretty much everything from CSV files to in memory distributed grids.
Well, yes, but with considerable degrdation of meaning.

No comments:

Post a Comment

View My Stats