Sunday, January 29, 2017

This Week



1. What's Wrong with This Database Picture?

"I have a database for a school ... [with] are numerous tables obviously but consider these:
CONTACT - all contacts (students, faculty) has fields such as LAST, FIRST, MI, ADDR, CITY, STATE, ZIP, EMAIL;
FACULTY - hire info, login/password for electronic timesheet login, foreign key to CONTACT;
STUDENT - medical comments, current grade, foreign key to CONTACT.
Do you think it is a good idea to have a single table hold such info? Or, would you have had the tables FACULTY and STUDENT store LAST, FIRST, ADDR and other fields? At what point do you denormalize for the sake of being more practical?What would you do when you want to close out one year and start a new year? If you had stand-alone student and faculty tables then you could archive them easily, have a school semester and year attached to them. However, as you go from one year to the next information about a student or faculty may change. Like their address and phone for example. The database model now is not very good because it doesn’t maintain a history. If Student A was in school last year as well but lived somewhere else would you have 2 contact rows? 2 student rows?  Or do you have just one of each and have a change log. Which is best?" --comp.databases.theory

Sunday, January 22, 2017

Are You a Thinking Data Professional?



Note: The following was intended as a comment to my post Don't Design Databases without Foundation Knowledge and Conceptual Models  by Todd Everett. He is a reader I deem a "thinking data professional" -- always the qualitative rather than quantitative target of my writings and teachings. It merits to be a post in its own right to benefit others.

Monday, January 16, 2017

Don't Design Databases Without Foundation Knowledge and Conceptual Models




"I have two tables, one is product which is a parent table with one primary key and I have another child table of product, which is a product_details table. But the child table is linking with parent table(product) with logical data instead of foreign key,as we are doing this relationship with the help of java code in the coding side, instead of depending on the data base, which make it as tight couple. To avoid tight coupling between the tables we are storing the primary key value in the child table.
CREATE TABLE `tbl_product` (
  `product_id` varchar(200) NOT NULL,
  `product_details_id` varchar(200) DEFAULT NULL,
  `currency` varchar(20) DEFAULT NULL,
  `lead_time` varchar(20) DEFAULT NULL,
  `brand_id` varchar(20) DEFAULT NULL,
  `manufacturer_id` varchar(150) DEFAULT NULL,
  `category_id` varchar(200) DEFAULT NULL,
  `units` varchar(20) DEFAULT NULL,
  `transit_time` varchar(20) DEFAULT NULL,
  `delivery_terms` varchar(20) DEFAULT NULL,
  `payment_terms` varchar(20) DEFAULT NULL,
  PRIMARY KEY (`product_id`));

CREATE TABLE `tbl_product_details` (
  `product_details_id` varchar(200) NOT NULL,
  `product_id` varchar(200) DEFAULT NULL,
  `product_name` varchar(50) DEFAULT NULL,
  `landingPageImage` varchar(100) DEFAULT NULL,
  `product_description_brief` text CHARACTER SET latin1,
  `product_description_short` text CHARACTER SET latin1,
  `product_price_range` varchar(50) DEFAULT NULL,
  `product_discount_price` varchar(20) DEFAULT NULL,
  `production_Type` varchar(20) DEFAULT NULL,
  PRIMARY KEY (`product_details_id`),
  UNIQUE KEY `product_id` (`product_id`));
Please suggest the Pros and Cons of the design, we are following this kind of relationship in my company, as the manager is saying it will give [us flexibility]. I know that if we lose the data from the table, we can't know the relationship between the two tables."--StackExchange.com

Monday, January 9, 2017

This Week



1. What's wrong with this picture
"I have two tables, one is product which is a parent table with one primary key and i do have another child table of product, which is a product_details table. But the child table is linking with parent table(product) with logical data instead of foreign key,as we are doing this relationship with the help of java code in the coding side, instead of depending on the data base, which make it as tight couple. To avoid tight coupling between the tables we are storing the primary key value in the child table.

CREATE TABLE `tbl_product` (
 `product_id` varchar(200) NOT NULL,
 `product_details_id` varchar(200) DEFAULT NULL,
 `currency` varchar(20) DEFAULT NULL,
 `lead_time` varchar(20) DEFAULT NULL,
 `brand_id` varchar(20) DEFAULT NULL,
 `manufacturer_id` varchar(150) DEFAULT NULL,
 `category_id` varchar(200) DEFAULT NULL,
 `units` varchar(20) DEFAULT NULL,
 `transit_time` varchar(20) DEFAULT NULL,
 `delivery_terms` varchar(20) DEFAULT NULL,
 `payment_terms` varchar(20) DEFAULT NULL,
 PRIMARY KEY (`product_id`));

CREATE TABLE `tbl_product_details` (
 `product_details_id` varchar(200) NOT NULL,
 `product_id` varchar(200) DEFAULT NULL,
 `product_name` varchar(50) DEFAULT NULL,
 `landingPageImage` varchar(100) DEFAULT NULL,
 `product_description_brief` text CHARACTER SET latin1,
 `product_description_short` text CHARACTER SET latin1,
 `product_price_range` varchar(50) DEFAULT NULL,
 `product_discount_price` varchar(20) DEFAULT NULL,
 `production_Type` varchar(20) DEFAULT NULL,
 PRIMARY KEY (`product_details_id`),
 UNIQUE KEY `product_id` (`product_id`));
Please suggest the Pros and Cons of the design, we are following this kind of relationship in my company, as the manager is saying it will give us flexible to us. I know that if we lose the data from the table, we can't know the relationship between the two tables."--StackExchange.com

Tuesday, January 3, 2017

Understanding the Relational Data Model: A New Series of Papers



"Nowadays, anyone who wishes to combat lies and ignorance and to write the truth must overcome at least five difficulties. He must have:
  1. The keenness to recognize it, although it is everywhere concealed;
  2. The courage to write the truth when truth is everywhere opposed;
  3. The skill to manipulate it as a weapon;
  4. The judgement to select in whose hands it will be effective, and
  5. The cunning to spread the truth among such persons."
--Berthold Brecht
A rather accurate explanation of why it has been so difficult to dispel the misuse and abuse of the Relational Data Model since inception. To the point that most of its core practical benefits have failed to materialize, with the IT industry regressing all the way back to its pre-relational and even pre-database state:
  • Graph DBMSs;
  • XML;
  • JSON;
  • NoSQL;
  • Application-specific databases and DBMSs;
  • "Unstructured data";
  • No integrity enforcement;
  • A cacophony of imperative programming languages rather than declarative data sublanguages (suffixed with QL, just like old non-relational DBMSs were with /R). 

Saturday, December 24, 2016

This Week with Season's Greetings





Data Sublanguages, Programming and Data Integrity

My December post @All Analytics

Both data science employers and candidates stress the eclectic nature of the required skills, programming in particular. Indeed, coding has acquired such an elevated role, that it now entirely replaces education. Aside from the societal destructive consequences of this trend, in the context of data management it is a regressive self-fulfilling prophecy that obscures and disregards the core practical objective of database management to minimize programming. You can frequently encounter it in comments like:
"Anything you can model in a DBMS you can model in Java. The next paradigm shift is business rules centralized in Java business objects, rather than hard-coded in SQL for better manageability, scalability, etc. The only ones that should reside in a database are referential integrity (and sometimes even that isn't really necessary). Don't let pushy DBAs tell you otherwise -- integrity constraints slow down development as well as performance."
Upside down and backwards.

Read it all (and comment there, not here, please).


THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.

("What's Wrong with this Picture" will return in 2017)


1. Quote of the Week

"The value of the model may be diminishing in certain enterprises, since busy with deliverables." --Harshendu Desai, LinkedIn.com

3. To Laugh or Cry?

5 Reasons Relational Databases Hold Back Your Business

4. Added to the LINKS page

  • What a Database Really Is: Predicates and Propositions
  • The Logical Fallacies

5. Of Interest


And now for something completely different


New at The PostWest (check it out)

My take of the week

Choosing not to veto, Obama lets anti-settlement resolution pass at UN Security Council
The press refused to publish Obama's Chicago speech to the Palestinian lobby to hide his anti-semitism. He was never as troubled by Assad, or Putin, or Erdogan as he was by Netanyahu.  That's because Jews have always been a soft target (Barak Obama's Israeli Settlements Canard).  If that is not anti-semitism, I don't know what is.
When the US is in the same camp with Russia, China, Iran and Turkey and her acts are cheered by Hamas, Islamic Jihad and Hezbollah, she has sold out and moved to the dark side.

America (like most other countries) is occupied Indian land via atrocities (and not by people who returned to their own country, like the Jews did). So when America returns its settlements to the Indians, Israel will return its "settlements" (which Israelis got when they defended themselves from "being thrown into the sea"). Until then moralizing and selling out Israel to genocidal terrorists is hypocritical anti-semitism, just like everyone else's (see below).


Global Hypocritical Anti-semitism

UN

 
US

EU

Article of the week

Israel and the Occupation Myth

Video of the week
The Red Disaster. The "life" in Romania during the 60s. The Jews did the worst due to deep anti-semitism. America paid Ceausescu to get us out, but neither she nor Europe wanted us. Had there been no Israel, we would have probably starved to death, not necessarily in a rotten jail. Nobody talks about us, or the hundreds of thousands of the Jewish refugees kicked out from the Arab countries, none of whom were murderous, but everybody is obsessed with the suffering of the Palestinians, who are genocidal.

Pinch-me of the week

Ahmad Tibi urges Israelis not to ‘live by the sword’. As if she is allowed to live without it.

Book of the week (Purchase via this link to support the site)
Bard, M., MYTHS AND FACTS: A GUIDE TO THE ARAB-ISRAELI CONFLICT

Note: I will not publish or respond to anonymous comments. If you want to say something, stand behind it. Otherwise don't bother, it'll be ignored.

Tuesday, December 20, 2016

On View Updating (C. J. Date and D. McGoveran)



My recent posts on denormalization [1], identical relations [2] and the POOD [3,4,5] based on D. McGoveran (DMG) interpretation of Codd's RDM, triggered online reactions [6,7] and some comments in place that reflect the current understanding of the RDM. One of my readers referred me back to a 2004 exchange triggered by an exchange @old dbdebunk.com with both CJD and DMG on view updating--an important aspect discussed in my posts--on which the two perspectives differ. Last week I asked what's wrong with CJD's position in the exchange. Here is the original exchange, albeit in abbreviated form (I made minor revisions for clarity and added references to more recent sources the reader may want to consult, such as the 2013 CJD book on the subject [8], which also purports to describe some of DMG's more recent thinking), in which DMG counters with his position and adds a new comment. Throughout, I've changed "relvar predicate" (still used by CJD) to "relation predicate", as preferred in the DMG interpretation.

Monday, December 12, 2016

This Week



THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.

1. What's wrong with this picture

"First of all, let me say that I no longer regard view updating as a fully solved problem. A year ago or so I thought it was--but then Hugh Darwen started to ask some hard questions and I realized I was wrong. (David McGoveran will probably disagree with me here.) That said, I remain optimistic that the problem is solvable. The discussion in my 8th Ed. is generally along the right lines, though it gets some of the details wrong." --C. J. Date

2. Quote of the Week

"SQL is the lingua franca for retrieving structured data. Existing semantics for SQL, however, either do not model crucial features of the language (e.g., relational algebra lacks bag semantics, correlated subqueries, and aggregation)." --Konstantin Weitz, homotopytypetheory.org

3. To Laugh or Cry?

Why is MongoDB wildly popular? It's a data structure thing

Monday, December 5, 2016

Prediction, Explanation and the November Surprise



Note: My November post @All Analytics, which I reposted here.  

Given the overhyped promise of "data science", the "shock" at the broad failure to predict the election outcome was inevitable. Skimming through the media and technical accounts, it looks like a better understanding of prediction and explanation is necessary for less surprises and sounder analytics. Let's take two examples (oversimplified somewhat to make the point). 

First, a game-theoretic account derived from observed behavior in a 2-player game in which one player gets a sum of money and decides how to share it with another, who can only accept or reject the offer: even though accepting any offer as better than nothing is rational, "we don’t behave rationally ... [but] emotionally ... we reject offers we consider unfair".

"… there’s been plenty of economic growth inside the U S--vastly increasing the pile of money to be divided. But ... The first player consists of those people who have benefitted from globalization and trade: the “elites”, derisively referred to as “the 1%”. And the second player ... everyone ... who aren’t in those upper income echelons ... are seeing the pile of money in the game growing ever bigger. And ... the other player keeps an ever-larger share of that pile for themselves ... Trump allowed them to channel their feelings into a rejection of the proposal that has been made—on trade, immigration, and globalisation, and dividing up those spoils ...[and they threw] everything out". --What voters do when they feel screwed--the economic theory)

Second, a complex algorithm that runs a multitude of sophisticated simulations on a "raft of carefully collected public and private polling numbers, as well as ground-level voter and early voting data”. Assume that “the raft” consists of, vote predictors—vote correlates discovered by computers (Whatdidn't Clinton’s data-driven campaign's algorithm named Ada see?).

Suppose (1) an appropriate hypothesis in the form of a correlation at the aggregate level between variables measuring affinity to the first and second player and vote had been derived in the former case which proved accurate and (2) the algorithm in the latter case produced an equally accurate prediction.  Is there any difference between the two approaches?

For those who equate prediction with explanation, the answer is yes. For those for whom explanation is about the past and prediction about the future, the question does not come up. But these are views that obscure rather than enlighten.

In both cases there is a data pattern in the form of predictive correlations. In the first case a theory of individual behavior specifies the causal mechanism—the individual behavior that explains how the pattern is produced--why it exists at the aggregate level. In the second case, the mechanism is of no particular interest and is not specified. In general explained behavioral predictions are more reliable than those without.

Data patterns discovered by computers explained can produce insights—causal mechanisms— for theory development, this is what data mining should be about. That's the context of discovery in science, which requires predictions from the theory developed from the discovered patterns to be tested in the context of validation on different data. But because, unlike in natural science, human behavior is not governed by unchanging universal laws, it is easier to explain post-hoc than to predict. Given the pressure for prediction in industry and politics, the temptation not to bother with the second context is too strong. 

In this age of "big data", "data mining", "data lakes" and machine learning the important difference between prediction and explanation should be understood and kept firmly in mind when performing analytics and assessing their results.

See also Unthinking Machines.

 




Re-write



See the rewrite
Class, Type, Relation and Domain in Database Management



Monday, November 28, 2016

This Week



THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.



1. What's wrong with this picture?

"Our terminology is broken beyond repair. [Let me] point out some problems with Date's use of terminology, specifically in two cases:
  1. "type" = "domain": I fully understand why one might equate "type" and "domain", but ... in today's programming practice, "type" and "domain" are quite different. The word "type" is largely tied to system-level (or "physical"-level) definitions of data, while a "domain" is thought of as an abstract set of acceptable values.
  2. "class" != "relvar": In simple terms, the word "class" applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a "relation" *is* a "class" (and trivially also a "set"), which contributes to confusion.
In modern programming parlance "class" is generally distinguished from "type" *only* in that "type" refers to "primitive" (system-defined) data definitions while "class" refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, "type" and "class" are actually synonymous." --Comment @dbdebunk.com

Sunday, November 20, 2016

The Principle of Orthogonal Database Design Part III




Note: This is a 11/24/17 re-write of Part III of a three-part series that replaced several previous posts (the pages of which redirect here), to to bring it in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.
 
(Continued from Part II)

POOD and SQL


As we have seen, if relations are uniquely constrained, with a true RDBMS supporting logical independence (LI) and constraint inheritance, database design can adhere to the POOD and enable DBMS-enforced consistency. A RDBMS can also support ESS explicitly.

Industry misconceptions notwithstanding, SQL DBMSs are, of course, not relational. They have weak declarative integrity support, which, coupled with bad database designs, makes adherence to the POOD (as well as the other design principles) difficult. While even its weak relational fidelity was sufficient to render SQL superior to what preceded it, this is but one example of the many advantages of the RDM that SQL has failed to concretize.

Monday, November 14, 2016

This Week



I have revised:
THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS available to order here.

1. Quotes of the Week
"Wow. Been using SQL for 15 years, and didn’t even know about Except and Intersect!" --Blog.Jooq.org
"I am looking for database schemas with many tables (>100 tables). Where can I find them? I am currently using mysql and haven't done serious database design. So interested in looking at samples with ER diagrams." --YCombinator.com
2. To Laugh or Cry?

Monday, November 7, 2016

The Principle of Orthogonal Database Design Part II




Note: This is a 11/24/17 re-write of Part II of a three-part series that replaced several previous posts (the pages of which redirect here), to bring in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.

(Continued from Part I)

To recall from Part I, adherence to the POOD means independent base relations (i.e., not derivable from other base relations), which the design example in Part I,

EMPS (EMP#,ENAME,HIREDATE)
SAL_EMPS (EMP#,ENAME,HIREDATE,SALARY)
COMM_EMPS (EMP#,ENAME,HIREDATE,COMMISSION)

violates: EMPS is derivable via union of projections of SAL_EMPS and COMM_EMPS. It requires at least:
  • A disjunctive constraint on each of the SAL_EMPS and COMM_EMPS relations, to ensure mutual exclusivity;
  • A redundancy control constraint to prevent inconsistency due to partial updates (can you formulate it?);
  • Use of the transaction management component of the data language to ensure that each candidate tuple is properly inserted (1) into EMPS and (2) the correct subtype relation;

Monday, October 31, 2016

This Week



1. Quote of the Week

"Normalization is and will always be a direct trade-off. You give up performance (time) for space. Indexing mitigates in the opposite direction of this trade-off." --YCombinator.com

2. To Laugh or Cry?

Comments on my "Denormalization for Performance: Don't Blame the Relational Model"

3. THE DBDEBUNK GUIDE TO MISCONCEPTIONS OF DATA FUNDAMENTALS 

is available. Order here.

Monday, October 24, 2016

The Costly Illusion of Denormalization for Performance



My October post @All Analytics.

Be that as it may, practitioners insist that performance improves when they denormalize databases, because "bundling" facts into less relations reduces joins. But even if this were always true -- it is not -- performance gains, if any, do not come from denormalization per se, but from trading off integrity for performance. What many data professionals miss is that the redundancy introduced by denormalization must be controlled by the DBMS to ensure data integrity, which requires special integrity constraints that, it turns out, involve the very joins that denormalization is intended to avoid, defeating its purpose. These constraints are practically never declared and enforced, which creates the illusion that denormalization improves performance at no cost.

Read it all. (Please comment there, not here)



Sunday, October 16, 2016

THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS Available to Order




When I discussed with a book publisher the idea of a guide/reference to misconceptions about data fundamentals, whose objective -- distinct from the usual cookbooks -- is to help data professionals base their practice on understanding, rather than cookbooks, he said "they are not interested in understanding, only in succeeding in their jobs". Apparently, the former is no longer a factor in the latter. Given the increasingly deteriorating experiences I had with publishers, it was time to stop bothering with them -- they pay and do very little -- and self-publish.

THE DBDEBUNK GUIDE TO MISCONCEPTIONS ABOUT DATA FUNDAMENTALS  - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER is now available for purchase ($35, order via the BOOKS page (not to be confused with the RECOMMENDED BOOKS page); contact me by email for volume discounts). 


Monday, October 10, 2016

This Week



1. Quote of the Week
"Legion is a Hadoop MapReduce tool that turns big, messy data sources into clean, normalized flat files ready for ingestion into relational tables in a data warehouse (e.g., Postgres COPY)." --GitHub.com
2. To Laugh or Cry?

Wednesday, September 28, 2016

Monday, September 26, 2016

This Week



1. Quote of the Week
"Which leads to another bad experience: the pernicious use of foreign keys. In the ORMs I've used, links between classes are represented in the data model as foreign keys which, if not configured carefully, result in a large number of joins when retrieving the object. (A recent count of one such table in my work resulted in over 600 attributes and 14 joins to access a single object, using the preferred query methodology.)
...
When you have foreign keys, you refer to related identities with an identifier. In your application, "identifier" takes on various meanings, but usually it's the memory location (a pointer). In the database, it's the state of the object itself. These two things don't really get along because you can really only use database identifiers in the database (the ultimate destination of the data you're working with)." --wozniak.ca

Monday, September 19, 2016

The Principle of Orthogonal Database Design Part I




Note: This is a 11/24/17 re-write of Part I of a three-part series that replaced several older posts (the pages of which which now redirect here), to bring in line with the McGoveran formalization and interpretation [1] of Codd's true RDM.
"The principle of orthogonal design (abbreviated POOD) ... is the second of the two principles of database design, which seek to prevent databases from being too complicated or redundant, the first principle being the principle of full normalization (POFN). Simply put, it says that no two relations in a relational database should be defined in such a way that they can represent the same facts. As with database normalization, POOD serves to eliminate uncontrolled storage redundancy and expressive ambiguity, especially useful for applying updates to virtual relations (views). Although simple in concept, POOD is frequently misunderstood ... is a restatement of the requirement that a database is a minimum cover set of the relational algebra. The relational algebra allows data duplication in the relations that are the elements of the algebra. One of the efficiency requirements of a database is that there be no data duplication. This requirement is met by the minimum cover set of the relational algebra." --Wikipedia.org
Well, not quite.

Monday, September 12, 2016

This Week



1. Quote of the Week
"Data sense-making does not benefit from the relational data model. Dr. Codd’s rules for relational modeling were designed to improve efficiencies in data processing and storage, not to make data more intelligible. In fact, structuring data relationally makes the work of data sensemaking more difficult, which is why Dr. Kimball created dimensional data modeling and also why an entire industry of middleware products emerged to hide the complexities of relational models." --Stephen Few, PerceptualEdge.com

2. To Laugh or Cry?

Monday, August 22, 2016

This Week



1. What's wrong with this picture?
"Can you have 2 tables
- VIEWS
- DOWNLOADS
with identical structure in a good DB schema (item_id, user_id, time). Some of the records will be identical but their meaning will be different depending on which table they are in. The "views" table is updated any time a user views an item for the first time. The "downloads" table is updated any time a user downloads an item for the first time. Both of the tables can exist without the other.

"I don't think that there is a problem, per se from a E/R modelling point of view, as long as they represent two semantically different entities."
"Are you saying that both tables have an 'item_id' Primary Key? In this case, the fields have the same name, but do not have the same meaning. One is a 'view_id', and the other one is a 'download_id'. You should rename your fields consequently to avoid this kind of misunderstanding."
 "Chris Date and Dave McGoveran formalised the Principle of Orthogonal Design. Roughly speaking it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim being to avoid certain types of redundancy and ambiguity that could result."

"When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence. Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often. This alone could be a good reason not to burden it with records from the downloads."
--Can you have 2 tables with identical structure in a good DB schema?, StackOverflow.com

Monday, August 8, 2016

This Week



1. What's wrong with this picture?
Q: "My understanding has always been that a primary key should be immutable, and my searching since reading this answer has only provided answers which reflect the same as a best practice. Under what circumstances would a primary key value need to be altered after the record is created?"

A: "When a primary key is chosen that is not immutable?" --StackExchange.com

2. Quote of the Week

"We start with a value. This may be a number, a date, or a chunk of text. Domain refers to the meaning for a value or a set of possible values. When we have a value with a consistently and widely used set of units, value domains, or applications, we call this value a data element. A data element may be a ticket number, a temperature reading, a hair color. (Some modeling approaches omit the notion of data elements or domains.)
In all the techniques and stages of data modeling, the concepts of entity and attribute are universal. An entity is the “thing” that must be manipulated as a data object. Entity represents an indivisible concept that consists of data elements. Each data element in the entity is called an attribute. Conversely, we could say that we build an entity by attributing data elements to it. You create an entity as a whole, and you delete it as a whole." --What a Concept! Is Logical Data Modeling Obsolete? -- LinkedIn.com
Note: In the preface to my PRACTICAL DATABASE FOUNDATIONS series of papers I deplored an author's (considered an expert) reliance on an industry ANSI committee as a starting point in an explanation of conceptual, logical and physical data fundamentals--there is practically 0 chance of soundness and 100% chance of confusion. So when this article started with "In the common usage established by ANSI in 1975, data modeling goes from abstract to concrete in three steps" I knew I would not read very far. Sure enough, I stopped almost immediately, after the above quote. If you don't understand why, I recommend you check out my papers.

Friday, July 29, 2016

Data Science, Coding, the Automation Paradox, and the Silicon Valley State




My July post @All Analytics.

Indeed, if what Linus Torvald is admiringly quoted as saying is true -- "Don’t ever make the mistake [of thinking] that you can design something better than what you get from ruthless massively parallel trial-and-error with a feedback cycle. That’s giving your intelligence much too much credit." -- why bother with education, science, and theory at all? Just “plug your coding skills” by getting a Microsoft or IBM “Professional Degree”, or even take a free coding course by your favorite billionaire.

Read it all. (Please comment there, not here) 

Monday, July 25, 2016

Duplicates: Stating the Same Fact More Than Once Does Not Make it Truer, Only Redundant




Here's what what wrong with last week's picture, namely:
RB: "From the tabular point of view, does it make sense why we can't have duplicate rows in a relation?"

John Sullivan: "As with everything else in life, it depends what you are trying to do (and exactly what you mean when you talk about a DBMS table v. a formal relation). From an operational (transactional) database point of view, for obvious reasons, you don't want duplicate rows (enforce a natural key). But if you're analysing data from various legacy sources (e.g. spreadsheets) it might be useful. Then again, you might introduce a surrogate key to give you more control over what's going on - again, depends on what you are trying to do." --LinkedIn.com

One of my readers once wondered why "database professionals understand uniqueness via keys, but don't seem to understand why duplicate rows should be prohibited and the consequences of breaking relational closure": 

Sunday, July 17, 2016

This Week



1. What's wrong with this picture? 
RB: "From the tabular point of view, does it make sense why we can't have duplicate rows in a relation?"

John Sullivan: "As with everything else in life, it depends what you are trying to do (and exactly what you mean when you talk about a DBMS table v. a formal relation). From an operational (transactional) database point of view, for obvious reasons, you don't want duplicate rows (enforce a natural key). But if you're analysing data from various legacy sources (e.g. spreadsheets) it might be useful. Then again, you might introduce a surrogate key to give you more control over what's going on - again, depends on what you are trying to do." --Question about the relational model


Sunday, July 10, 2016

Levels of Representation: Relationships, Rules, Relations and Constraints




What's Wrong with Last Week's Picture (Question about relational model )

There are relationships at both the conceptual and logical representation levels. Confusing them is bad conceptual modeling and database design.

Relationships, Rules and Relations

AT: "In my personal understanding, a relation is defined as a set of tuples. Then ... "in the relational model every relation represents a relationship". And then a quote from Chen: "each tuple of entities ... is a relationship". If I use the first and the second statements - I can say that a relationship is a set of tuples. The third statement says that a relationship is a tuple. So far, is a relationship a set of an element of a set? (Or may be a set of sets?)".
A relation is a set of tuples that is a subset of the Cartesian product of the domains, i.e., it is a relationship of domains at the logical level (Chen may have been misquoted: a tuple is a set of attribute values (i.e., a relationship thereof, that represents a fact about a single entity).

Thursday, July 7, 2016

NoSQL, Big Data Analytics, and the Loss of Knowledge and Reason



My June post @All Analytics:

The data management industry operates like the fashion industry. Its most persistent characteristic is migration from fad to fad. Every few years -- the number keeps getting smaller -- some "new" problem is discovered, for which the solution is so magical, that it is extended everywhere to everything, whether it is applicable or not. But many of these problems are old and fundamental and some of the “solutions” bring them back, rather than solve them. ...

Read it all. (Please comment there, not here)



Saturday, July 2, 2016

This Week



1. What's wrong with this picture?

AT: Well, I think I am a bit confused now. In my personal understanding, a relation is defined as a set of tuples. Then ... "in the relational model every relation represents a relationship". And then a quote from Chen: "each tuple of entities ... is a relationship". If I use the first and the second statements - I can say that a relationship is a set of tuples. The third statement says that a relationship is a tuple. So far, is a relationship a set of an element of a set? (Or may be a set of sets?)
GE: I argue that there is essentially no difference between relationships between entity (type tables) and between an entity and its attributes. They both represent relationships between two populations of things. Something is an attribute by virtue of there being a relationship. If relationships are represented by foreign keys and the entity tables must be in 1NF, as in the relational model, then all relationships must be at most Many-to-One (a very unnecessary limitation when modeling some user domain).
TF: The relational model was a mathematical construct, derived from set theory. Hence that particular terminology. The entity-relationship model is essentially a directed graph model, where relationships are prominent residents. Not so in the relational model (despite the name), where relationships (between relations, mind you) are not visible and in the SQL implementations is reduced to constraints. Relationships are about structure, which is as important as meaning (the semantics of the terms used in the universe being modeled).
2. Quote of the Week
"In Relational Theory sometimes the relationships, where we do our Joins are much more important than the attributes on an Entity." (quoted in LinkedIn.com exchange)

Sunday, June 19, 2016

This Week



1. What's wrong with this picture?

This week's picture is the one of the state of knowledge about keys that Martijn Evers painted in Kinds of Keys: On the Nature of Key Classifications, that I had already commented on. As a result of discussions I've been having with David McGoveran in the context of our forthcoming books (his LOGIC FOR SERIOUS DATABASE FOLKS, my DBDEBUNK GUIDE TO FUNDAMENTAL DATA MANAGEMENT MISCONCEPTIONS), I've decided to rewrite my
comments On Kinds of Keys on the subject. I refer the reader to Martijn's article for a refresher--my rewrite will be posted next week.

2. Quote of the Week

There are no rules of normalization for non-relational databases. Effectively, you start out by denormalizing everything. Which means you're designing the data organization to serve specific queries. So follow the same principle in NoSQL databases as you would for denormalizing a relational database: design your queries first, then the structure of the database is derived from the queries. --Bill Karwin, What is a good way to design a NoSQL database

Sunday, June 12, 2016

Levels of Representation: Conceptual Modeling, Logical Design and Physical Implementation



From last week:

What's wrong with this picture? (Kinds of Data Models, LinkedIn.com)

David Hay: "Part of the ... confusion as to what exactly was meant by “data modeling”--conceptual, logical or physical--is that most data modeling activities seem to focus on achieving good relational database designs ... my approach is the portrayal of the underlying structure of an enterprise’s data--without regard for any technology that might be used to manage it ... a “conceptual data model” ... that represents the business."
Nothing raises uncertainty whether to laugh or cry better than attempts to dispel confusion which suffer from the very confusion they purport to dispel.

Sunday, June 5, 2016

This Week



1. What's wrong with this picture?
David Hay: Part of the ... confusion as to what exactly was meant by “data modeling”--conceptual, logical or physical--is that most data modeling activities seem to focus on achieving good relational database designs ... my approach is the portrayal of the underlying structure of an enterprise’s data--without regard for any technology that might be used to manage it ... a “conceptual data model” ... that represents the business.

Nigel Higgs: ... many folks do not get the difference between the Barker entity relationship style of modeling and the relational style of modeling ... [because] the modeling conventions are very similar and the former [is always] a precursor to RDBMS design.

Clifford Heath: Any terminology for models must project three aspects of intention: (a) audience, (b) level of detail and (c) purpose. These three variables are sufficient to discriminate all the main kinds of models in use. The traditional terms of "conceptual/logical/physical" are manifestly inadequate.

Remy Fannader: Models are meant to describe sets of instances (objects or behaviors).
--Kinds of Data Models, LinkedIn.com

2. Quote of the Week

The first consideration that needs to be made when selecting a database is the characteristics of the data you are looking to leverage. If the data has a simple tabular structure, like an accounting spreadsheet, then the relational model could be adequate. Data such as geo-spatial, engineering parts, or molecular modeling, on the other hand, tends to be very complex. It may have multiple levels of nesting and the complete data model can be complicated. Such data has, in the past, been modeled into relational tables, but has not fit into that two-dimensional row-column structure naturally. --Jnan Dash, RDBMS vs. NoSQL: How do you pick?

Wednesday, May 25, 2016

Why Data Scientists Must Understand Normalization



My May post @All Analytics:

We are constantly told how data scientists must be “jacks of many skills”, but one of the most important is rarely included in the list.Very few databases are properly designed. Many SQL databases are denormalized inadvertently, or intentionally (and erroneously) "for performance". They require special constraints to control data redundancy and prevent inconsistencies, which are practically never enforced. Analysts cannot, therefore, take database consistency for granted. Furthermore, to issue sensible queries and ensure correct results and interpretation thereof, it’s not enough for analysts to know the types of fact represented in the database, but also whether and how the database designer has chosen to bundle -- nest or merge -- those facts and how to disentangle them for analysis.

Read it all. (Please comment there, not here)



 



Sunday, May 22, 2016

This Week (& a Change in Format)



I am changing the format of the posts on DBDebunk.

"This Week" posts every other week will continue to be followed by posts on data and relational fundamentals, but with a twist: each of the latter posts will explain the preceding week's "What's wrong with this picture?"Here's the first post in the new format

1. What's wrong with this picture?

"Data is stored in two-dimensional tables consisting of columns (fields) and rows (records). Multi-dimensional data is represented by a system of relationships among two-dimensional tables. This usually leads to data storage becoming redundant, and also difficult to maintain on account of addition and deletion anomalies. This is only the case if we do not normalize the data. Keys are fields or combinations of fields used to identify records." --Nigel Peck, Denormalization Summary

View My Stats