Sunday, December 16, 2012

"Schema-less Models" and the New World Disorder

There is a tendency in the database field to distort, use poorly defined
terminology, or use it inconsistently.  As I already argued in previous posts, technologies that are founded or used without foundation knowledge and understanding will prove costly fads.

Here are some accurate definitions:
  • Business model: An enterprise-specific informal representation of a segment of reality of interest, expressed in business terms--entities, attributes, classes, relationships, business rules;
  • Logical model: An enterprise-specific formal representation of a business model in the database, expressed in [relational] database terms--R-tables, integrity constraints.
  • Data model: A theory of data consisting of three components--structure, manipulation and integrity--used to represent informal business models  as formal logical models and manipulate the data for information retrieval and to keep data consistent. It is to logical models what a programming language is to programs. By Codd's definition there are only three such models: the old hierarchic and network models, that are not  as complete, well-defined and theoretically sound as the relational model that replaced them because they were prohibitively complex, rigid and costly.
  • Domain: A named pool of values from which R-table columns draw their values. Date and Darwen argue that domains are identical to programming types. But Codd used the term domain for the very purpose of distinguishing it from type. There is at least one important difference: unlike the latter, which is under application control, the former is under DBMS control. To use the formal language of theory, a relation is defined on domains, which means that there are no relations and, therefore, R-tables without domains. But note very carefully that the relational model is mute on what those domains are and imposes no restrictions on them. Thus, domains can have image, video, audio, text,  R-table values, or any other values of arbitrary complexity.
Consider now the terms used in Schemas vs. Schemaless structures and The PostgreSQL Type Farm:
There has been a lot of talk lately about schemaless models touted by NoSQL groups and how PostgreSQL fits into this New world order. Is PostgreSQL Object-Relational? Is it Multi-Model?
Points arising:
  • "Schemaless models": Here models refers to logical models. But a schema is essentially a logical model--it is just another name for it--so in this sense this is a contradiction in terms.
  • "Object-Relational": Here the reference is to a data model. Object-orientation is a programming, not data management paradigm. There is no "object data model" in the sense that the three mentioned above are. Orientation is not theory. The OO terminology is fuzzy, but if you make an effort to interpret how the proponents apply it to databases, you can detect hierarchic/network traces. To the extent that OO contributes anything to database management, it is type inheritance, with which the relational model is not incompatible (as Date and Darwen's Tutorial D relational data language demonstrates, but see comments on domains below).
  • "Multi-Model": Also refers to data models and implies a DBMS and databases that are based not on just one: why restrict ourselves, for example, to just a relational organization of the data? Aside from his realization that database management requires some data model, Codd's genius was in defining one that is the simplest yet most general and theoretically sound and powerful. It is not cost-effective/productive to defeat its benefits, not only by adding back all the problems of the two older data models (not to mention the quasi-models that are nothing of the kind), but to have to learn, implement and use all of them, when one can do the job better. That is precisely why Codd's core Information Principle demands that all data  in a relational database be represented explicitly and exclusively as values in R-tables and managed only by relational capabilities.
Date wrote an article for my old web site titled Models, Models Everywhere, Nor Any Time to Think.

But to continue:
We tend to think of PostgreSQL as type liberal and it's liberalness gets more liberal with each new release. PostgreSQL is fundamentally relational, but has little bias about what data types define each column of related tables. One of PostgreSQL great strengths is the ease with which different types can coexist in the same table and the flexible index plumbing and plan optimizer it provides that allows each type, regardless of how wild, to take full advantage of various index strategies and custom index bindings. Our 3 favorite custom non-built-in types we use in our workflow are PostGIS (of course), LTree (Hierarchical Type), and HStore (Key-Value type). In some cases, we may use all 3 in the same database and sometimes the same table - where we use PostGIS for spatial location, LTree for logical location, and Hstore just to keep track of random facts about an object that are easier to access than having a separate related table and are too random to warrant devoting a separate column for each.
When PostgreSQL uses SQL as its data language, it is not "fundamentally relational", but it is certainly "more so" than NoSQL. But its support of user-defined domains of arbitrary complexity is orthogonal to the data model.

Note very carefully, though, that unlike R-tables, domains encapsulate, which means that  operations applicable to domain values must be programmed into the domains; operations no so included will not be available to users. Take, for example, fingerprints: a fingerprint domain would have to include a 'matching operator' (the equivalent to = for numeric or string values). Its formulation and implementation requires expertise in the fingerprinting field, that raises some thorny issues (see Chapter 1 of PRACTICAL ISSUES IN DATABASE MANAGEMENT). "Burying" data into domains, takes them out of the scope of relational operations.  This is characteristic of the OO approach to database management, which trades database design for programming (for which reason OODBMS's are referred to as "DBMS building kits".
Sometimes we are guilty of using xml as well when we haven't figured out what schema model best fits a piece of data and hstore is too flat of a type to work. The advent of JSON in PostgreSQL 9.2 does provide for a nested schema-less model similar to what the XML type offers, but more JavaScript friendly. I personally see JSON as more of a useful transport type than one I'd build my business around of a type you'd use when you haven't figured out what if any structure is most suitable for your data. When you have no clue what structure a piece of data should be stored, you should let the data tell you what structure it wants to be stored in and only then will you discover by storing it in a somewhat liberal fashion how best to retrofit in a more structural self-descriptive manner. Schemas are great because they are self-describing, but they are not great when your data does not want to sit in a self-described bucket. You may find in the end that some data is just wild and refuses to stay between the lines and then by all means stuff it in xml or json or create a whole new type suit it feels comfortable in.
See what I mean?

This paragraph is an excellent illustration of the poor foundation knowledge and understanding in the IT industry. The notion of data "wanting a specific structure" or "telling you what the schema is" is sheer nonsense. Structure--how data is organized--is an exclusively human endeavor based on human perceptions of reality. Data does not organize themselves, their generation is already the result of some organization by somebody, otherwise  they would not be data, but random noise.

The real issue is that data collected/generated with one organizing principle in mind for a particular informational purposes (often not well thought out) may not lend themselves to manipulation that serves another informational purpose. The solution is not  that structure can be ignored, or that multiple structures are better than one, or that data will structure themselves. Rather, it is to invest as much upfront thinking and time as possible into modeling reality and design and use a data model that yields logical models that satisfy as many informational purposes as possible in the simplest and soundest way.

During my time in academia as a social scientist I was involved in a lot of data analysis--what the IT industry has now "discovered" as business intelligence. It was drilled into me that science involves the development of causal hypotheses about the world, which can be tested empirically by analyzing data that can disprove them. If the data do not disprove them, they  are assumed valid until disproved.

In practice, however, instead of developing theoretical hypotheses--knowledge and reasoning are demanding--researchers often succumb to running mindless data correlations and regressions to "let the data come up with a theory". The attraction of the schema-less/NoSQL/BI fad is, I believe, a similar illusion. Since database management cannot be done without some data model, either each NoSQL product has a different one (ad hoc, not well-defined or formally sound, incomplete, serving a narrow purpose), or is not really a DBMS.

In that sense, NoSQL is the new world disorder.

No comments:

Post a Comment

View My Stats