Thursday, July 1, 2021

OBG: Experimental Science and Database Design

Note: To demonstrate the correctness and stability conferred by a sound theoretical foundation relative to the industry's fad-driven, ad-hoc "cookbook" practices, I am re-publishing as "Oldies But Goodies" material from the old (2000-06), so that you can judge for yourself how well my arguments of then hold up and whether the industry has progressed beyond the misconceptions those arguments were intended to dispel. I may revise, break into parts, and/or add comments and/or references.

The following is an email exchange from 2001 that I recommend reading jointly with my Data Meaning and Mining post (itself a revision of an article originally published at the old "All Analytics" website). I have slightly touched my replies for pedagogical purposes and clarity. You can substitute any data structure for XML hierarchy.


(originally published on 11/10/01)

“I just got a copy of your PRACTICAL ISSUES IN DATABASE MANAGEMENT. One of the principles that you seem to consider important is that the database design must reflect the real world, however irregular and difficult it may be. How would you approach the problem that that Pat Phelan described - where there are thousands of possible experiments, and the number and types of the results of these are not known beforehand?

My take on the XML in a database field is that it could be borrowing trouble. If the client does not rigorously define how materials, processes, and outcomes are named and stored, the result could be an unworkable hodgepodge. However, creating tables on the fly seems also subject to lack of consistency, and a possible performance nightmare to boot. I am curious about this because I am working on a similar project, on a much smaller scale, of recording test results of a heterogeneous set of experiments.

BTW, the discussion referred to by Pat is "How do the store data from a system in which column data types are only determined at runtime?". In that discussion the focus was a system where there were so many possible experiments and outcomes that the DBA could not be expected to "imbue the data with meaning"--the scientists would need to be able to do that on the spot. If there were no physical constraints, the best solution would be to allow the scientists to create tables and relationships on the fly (presumably there would be some kind of interface provided for them). The consensus (in which I did not participate) was that the if would be too difficult for the DBA to plan for such a system (" creating tables on the fly are a DBAs nightmare"). Therefore the solution of an XML building interface that would store results in text field. If it would be impossible to query, at least the results would be stored until an XML enabled DBMS would make that possible.”

Fabian Pascal: In fact, this issue was already raised in a trade article about NASA, which I quote in my book and seminars:

“Consider the plight of NASA ... struggling with ... how to capture and analyze the [terabytes] of data beamed down to earth daily from orbiting satellites ... [a] problem is the way in which the raw data must be assigned to tables in order to be processed. This process inherently requires a degree of rationalization and some predisposition toward the ultimate use of the data. This is difficult because the scientist may not know ahead of time what analysis to run on the data. This lack of knowledge severely limits the usefulness of the system.”
The point is that data is by definition structured/organized, otherwise it would not be data, but random noise that carries no meaning and information -- no inferences can be made. Structuring data -- which is what database design is -- is a semantic endeavor: it is a formalization of a conceptual model, which means that data is already imbued with the meaning (the model) by the designer. Any structuring (organizing) principle derives from that meaning, which is why I keep stressing that a design can only be assessed with reference to a conceptual model.

Note that if a DBMS is truly relational (by which I don't mean SQL), it supports physical independence: data can be stored and physically accessed in any number of ways (if the DBMS supports them) and changes made without affecting the guaranteed logical access by applications, thus insulating them from changes required for performance optimization.
“Your analysis of the general problem is spot-on, I think. But the DBA is still left with few good options. I think the best would be to push for better hardware to overcome the performance degradation that prolific table/relationship creation would tend to create.”
Fabian Pascal: A database is an organized collection of facts and it's the regularity of organization/structure that enables to derive information from the data in the database -- querying is manipulation of the data structure. The DBA can only implement a logical database physically that was designed to represent the scientists' conceptual model -- the structure underlying a data model (XML is hierarchical) is applicable only to an existing conceptual model. What is the DBA supposed to implement in the absence of a conceptual model and logical design? That's like testing a hypothesis that has not been formulated and it's amazing that scientists do not realize this. 





No comments:

Post a Comment

View My Stats