Monday, April 17, 2017

Don't Mix Model with Implementation

Here's what's wrong with last week's database picture, namely:
"When you design your database tables there are some important things to think of:
- Normalize to remove redundant data
- Use the smallest datatype possible
- Create as few indexes as possible, but not too few
- Avoid redundant indexes
- Every table must have clustered index
...
This is important in a normal database but it is even more important in SQL Azure because you have limited space for every database, your connections may be terminated due to heavy use of resources, you pay for what you use and the data that you transfer. You can use the SQL Azure management portal do design your tables or write the T-SQL statement yourself. The syntax to create a table in SQL azure is the same as in SQL server 2005/2008, but not all options are supported in SQL Azure.
...
CREATE TABLE [dbo].[table1]
 ([id] [int] IDENTITY(1,1) NOT NULL,
  [column1] [NVARCHAR](50) NOT NULL,
  [column2] [NVARCHAR](15) NOT NULL,
  [column3] [TINYINT] NULL,
  CONSTRAINT [pk_table1] PRIMARY KEY CLUSTERED ([id] ASC )
   WITH (-- PAD_INDEX = OFF,
            STATISTICS_NORECOMPUTE = OFF,
            IGNORE_DUP_KEY = OFF,
         -- FILLFACTOR=80,
         -- ALLOW_ROW_LOCKS = ON,
         -- ALLOW_PAGE_LOCKS = ON,
         -- DATA_COMPRESSION=PAGE)"
--Håkan Winther, A SQL Azure tip a day – Create a table
Read it all.

Saturday, April 15, 2017

This Week

Database Truth of the Week

"... systems of operations on data are most effective when they are formalisms, in which semantic considerations are unimportant until the formalism is applied to some specific application. In this way, database processing can join the ranks of successful mathematical abstractions. Differential equations, for instance, can be applied to situations ranging from orbit calculations to the quantum mechanics of the atom. The semantics of each application is unique to that application, but the formalism of differential equations is common. The power of the formalism lies in its abstraction from issues of meaning." --T. H. Merrett, Extending the Relational Algebra to Capture Less Meaning

Thursday, April 6, 2017

Understanding Kinds of Keys

My March post @All Analytics.

According to search queries hitting @dbdebunk.com, too many data professionals question the mandatory nature of primary keys, ask about changes to them, or prefer surrogate to natural keys. This indicates misunderstanding and misuse of a critical feature that can wreak havoc with inferences made from databases, including analytics. I have explained one function of keys, but there are several types of key that are poorly understood.
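One way to see the risk, as a minimal sketch only: a surrogate key by itself does not stop the same real-world entity from being recorded twice, while a declared natural key does. The example uses Python's built-in sqlite3 module as a stand-in DBMS; the two tables, the `ssn` natural key, and the `record_twice` helper are all hypothetical and are not taken from the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table with only a surrogate key
CREATE TABLE employee_surrogate_only (
    id   INTEGER PRIMARY KEY,
    ssn  TEXT NOT NULL,
    name TEXT NOT NULL
);
-- Same table, but the natural key is also declared to the DBMS
CREATE TABLE employee_with_natural_key (
    id   INTEGER PRIMARY KEY,
    ssn  TEXT NOT NULL UNIQUE,
    name TEXT NOT NULL
);
""")

def record_twice(table):
    """Insert the same employee twice; return how many rows the DBMS accepted."""
    accepted = 0
    for _ in range(2):
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    f"INSERT INTO {table} (ssn, name) VALUES (?, ?)",
                    ("123-45-6789", "Ann"),
                )
            accepted += 1
        except sqlite3.IntegrityError:
            pass  # the DBMS rejected the duplicate
    return accepted

rows_surrogate_only = record_twice("employee_surrogate_only")
rows_with_natural_key = record_twice("employee_with_natural_key")
```

With only the surrogate key, both inserts succeed and any count of employees is silently inflated; with the natural key declared, the second insert is rejected.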

Read it all (and please comment there, not here).




Saturday, April 1, 2017

"NULL Value" is a Contradiction in Terms

There is nothing wrong with Hugo Kornelis' picture of SQL NULL in NULL: The database's black hole. In fact, I recommend the series of which it is one part. It's SQL's picture of how to treat missing data that's wrong.
"Let’s first take a look at what NULL is supposed to be. Here is the definition of NULL from the SQL-2003 standard: null value--A special value that is used to indicate the absence of any data value."
While the absence of a value may be represented by some value at the implementation level, I strongly recommend that users not think of NULL as any kind of value at the model level. The problems with NULL stem precisely from the fact that it is not a value, but rather a marker for the absence of a value. "NULL value" is a contradiction in terms that distracts from the problems.
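A small sketch of the point, run through Python's built-in sqlite3 module (the choice of SQLite and the helper names `evaluate` and `count_rows` are assumptions for illustration, not part of the quoted material). If NULL were a value it would at least equal itself; in SQL the comparison is neither true nor false, and rows fall out of results accordingly.

```python
import sqlite3

def evaluate(expression):
    """Evaluate a scalar SQL expression and return it as a Python object."""
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(f"SELECT {expression}").fetchone()[0]
    finally:
        conn.close()

def count_rows(predicate):
    """Count rows of a two-row table (one row with 1, one with NULL) satisfying a predicate."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t (c INTEGER)")
        conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,)])
        return conn.execute(f"SELECT COUNT(*) FROM t WHERE {predicate}").fetchone()[0]
    finally:
        conn.close()

null_equals_null = evaluate("NULL = NULL")          # comes back as None, i.e. unknown
null_is_null = evaluate("NULL IS NULL")             # a special test, not a comparison
matched_by_equality = count_rows("c = NULL")        # equality never matches the marker
matched_by_tautology = count_rows("c = 1 OR c <> 1")  # "always true" for values only
```

The last predicate would hold for every row if every column entry were a value; the row with the NULL marker is nevertheless excluded.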

Sunday, March 26, 2017

This Week

1. What's wrong with this picture?

"Things get more complex when NULLable columns are used in expressions and predicates. In a procedural language, this wouldn’t have been a problem--if a procedural program fails to find the information it needs, it enters a conditional branch to handle this situation, as defined by the programmer. In a declarative, set-based language such as SQL, this was not possible. The alternatives were either to have the SQL developer add conditional expressions for each nullable column in a query to handle missing data, or to define a decent default behavior in SQL for missing data so that developers only have to write explicit conditional expressions if they need to override the default behavior." --Hugo Kornelis, NULL - The database's black hole

(Nothing wrong with Hugo's picture--in fact, I highly recommend the series of which the source of this quote is one part--only with SQL's picture of relational treatment of missing data).

Sunday, March 19, 2017

New Paper: The Interpretation and Representation of Database Relations

The data management field cannot and will not progress without educated and informed users. Recently I announced UNDERSTANDING THE REAL RDM, a new series of papers that will
  • Offer to the data practitioner an accessible informal preview of David's work.
  • Contrast it with the current common interpretation that emerged after EFC's passing and demonstrate the practical implications of the differences.



Saturday, March 11, 2017

What Is a True Relational System (and What It Is Not)

(This is a rewrite of a 12/10/16 post, to bring it in line with McGoveran's interpretation of Codd's RDM.)

Here's what's wrong with last week's picture, namely:
"A quick-and-dirty definition for a relational database might be: a system whose users view data as a collection of tables related to each other through common data values.

The whole basis for the relational model follows this train of thought: data is stored in tables, which are composed of rows and columns. Tables of independent data can be linked, or related, to one another if they each have columns of data that represent the same data value, called keys. This concept is so common as to seem trivial; however, it was not so long ago that achieving and programming a system capable of sustaining the relational model was considered a longshot with limited usefulness.

If a vendor’s database product didn’t meet Codd’s 12 item litmus tests, then it was not a member of the club ... these rules determine whether the database engine itself can be considered truly “relational”. These rules were constructed to support a data model that would ensure the ACID properties of transactions and also eliminate a variety of data manipulation anomalies that frequently occurred on non-relational database platforms (and **still do**)." --Kevin Kline, SQLBlog.com

Thursday, March 2, 2017

The Trouble with Data Warehouse Analytics

My February post @All Analytics

Warehouses are essentially databases biased for some data applications (and against others) and are rooted in poor database foundation knowledge and logical-physical confusion. Even when warehouses consist of relations, warehouse developers often do not understand their precise interpretation. The design is based on various unwarranted or false assumptions about what the data means. The above modeling methods do not allow documenting the transformation itself -- the relational algebra operations that comprise the transformation. But more often than not warehouses do not consist of relations, which are minimally required to be in first normal form (1NF), and are, therefore, not just denormalized, but non-relational. Consequently, all bets are off; sound derivations of correct analytical results are not guaranteed.

Read it all.

Saturday, February 25, 2017

This Week

1. What's wrong with this picture?

"A quick-and-dirty definition for a relational database might be: a system whose users view data as a collection of tables related to each other through common data values. The whole basis for the relational model follows this train of thought: data is stored in tables, which are composed of rows and columns. Tables of independent data can be linked, or related, to one another if they each have columns of data that represent the same data value, called keys. This concept is so common as to seem trivial; however, it was not so long ago that achieving and programming a system capable of sustaining the relational model was considered a longshot with limited usefulness." --Kevin Kline, SQLBlog.com

Sunday, February 19, 2017

Simple Domains and Value Atomicity

Revised 2/20/17.

Here's what's wrong with last week's picture, namely:

Q: "I'm currently trying to design a database and I'm not too sure about the best way to approach a dynamically sized array field of one of my objects. My first thought is to use a column in my object to store an array of integers. However the more I read, the more I think this isn't the best option. Concrete example wise, I have a player object that stores 0 to many items, which are represented by an integer. What is the best way to represent this?" 
A: "If a collection of values is atomic, store them together. Meaning, if you always care about the entire group, if you never search for nested values and never sort by nested values, then they should be stored together as a single field value. If not, they should be stored in a separate table, each value bring a row, each assigned the parent ID (foreign key) of a record on the other table that "owns" them as a group. For more info, search on the term "database normalization".

Some databases, support an array as a data type. For example, Postgres allows you to define a column as a one-dimension array, or even a two dimension array. If your database does not support array as a type of column definition, transform you data collection into an XML or JSON support if your database your database supports that type. For example, Postgres has basic support for storing, retrieving, and non-indexed searching of XML using XPath. And Postgres offers excellent industry-leading support for JSON as a data type including indexed support on nested values. Going this XML/JSON route can be an exception to the normalization rules I mentioned above." --StackOverflow.com

A focus on physical implementation ("dynamically sized array field") without well-defined conceptual and logical features for it to represent ("a player object" is hardly enough), together with confusion of levels of representation (a real-world object does not "store" anything), is always a red flag: an indication of a poor grasp of foundation knowledge. So let's introduce some.
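For concreteness, the "separate table" design that the quoted answer describes can be sketched as follows. This is only an illustration run through Python's built-in sqlite3 module. The table and column names (`player`, `player_item`, `item_no`) are hypothetical, and so is the assumption that a player holds a given item at most once, since the question supplies no conceptual model from which to derive either.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE player (
    player_id INTEGER PRIMARY KEY
);
-- One row per (player, item) pair instead of an array column.
-- Assumption: a player holds a given item at most once.
CREATE TABLE player_item (
    player_id INTEGER NOT NULL REFERENCES player (player_id),
    item_no   INTEGER NOT NULL,
    PRIMARY KEY (player_id, item_no)
);
""")

with conn:
    conn.executemany("INSERT INTO player VALUES (?)", [(1,), (2,)])
    # Player 1 holds two items; player 2 holds none, which needs no special marker.
    conn.executemany("INSERT INTO player_item VALUES (?, ?)", [(1, 10), (1, 20)])

items_of_player_1 = [row[0] for row in conn.execute(
    "SELECT item_no FROM player_item WHERE player_id = 1 ORDER BY item_no")]
items_of_player_2 = [row[0] for row in conn.execute(
    "SELECT item_no FROM player_item WHERE player_id = 2")]
holders_of_item_20 = [row[0] for row in conn.execute(
    "SELECT player_id FROM player_item WHERE item_no = 20")]
```

"Zero to many items" is simply zero to many rows, and the question "who holds item 20?" is an ordinary restriction rather than a search inside a column.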

Sunday, February 12, 2017

This Week

1. What's wrong with this picture?

"If a collection of values is atomic, store them together. Meaning, if you always care about the entire group, if you never search for nested values and never sort by nested values, then they should be stored together as a single field value. If not, they should be stored in a separate table, each value bring a row, each assigned the parent ID (foreign key) of a record on the other table that "owns" them as a group. For more info, search on the term "database normalization".

Some databases, support an array as a data type. For example, Postgres allows you to define a column as a one-dimension array, or even a two dimension array. If your database does not support array as a type of column definition, transform you data collection into an XML or JSON support if your database your database supports that type. For example, Postgres has basic support for storing, retrieving, and non-indexed searching of XML using XPath. And Postgres offers excellent industry-leading support for JSON as a data type including indexed support on nested values. Going this XML/JSON route can be an exception to the normalization rules I mentioned above." --Response to the Quote of the Week listed next, StackOverflow.com


Sunday, February 5, 2017

Meaning Criteria and Entity Supertype-Subtypes

(This is a rewrite of a 2012 post)

Here is what's wrong with last week's picture, namely:

"I have a database for a school ... [with] are numerous tables obviously but consider these:
  • CONTACT - all contacts (students, faculty) has fields such as LAST, FIRST, MI, ADDR, CITY, STATE, ZIP, EMAIL;
  • FACULTY - hire info, login/password for electronic timesheet login, foreign key to CONTACT;
  • STUDENT - medical comments, current grade, foreign key to CONTACT.
Do you think it is a good idea to have a single table hold such info? Or, would you have had the tables FACULTY and STUDENT store LAST, FIRST, ADDR and other fields? At what point do you denormalize for the sake of being more practical?

What would you do when you want to close out one year and start a new year? If you had stand-alone student and faculty tables then you could archive them easily, have a school semester and year attached to them. However, as you go from one year to the next information about a student or faculty may change. Like their address and phone for example. The database model now is not very good because it doesn’t maintain a history. If Student A was in school last year as well but lived somewhere else would you have 2 contact rows? 2 student rows?  Or do you have just one of each and have a change log. Which is best?" --comp.databases.theory

Tuesday, January 31, 2017

Outsmarting the DBMS: Analysts Should Beware

My January post @AllAnalytics:

Last month I alerted you to the failure by data professionals to appreciate the importance, for soundness reasons, of reliance on the DBMS rather than application code for data manipulation and integrity enforcement.
...

Relational design, databases, and DBMSs are the analyst's friends and, given the state of the database field, she/he would be well advised to be as familiar with them as she/he is with analytics and programming, to appreciate non-obvious risks to correctness.

Read it all (and please comment there, not here) 



Sunday, January 29, 2017

This Week

1. What's wrong with this picture?
"I have a database for a school ... [with] are numerous tables obviously but consider these:
CONTACT - all contacts (students, faculty) has fields such as LAST, FIRST, MI, ADDR, CITY, STATE, ZIP, EMAIL;
FACULTY - hire info, login/password for electronic timesheet login, foreign key to CONTACT;
STUDENT - medical comments, current grade, foreign key to CONTACT.
Do you think it is a good idea to have a single table hold such info? Or, would you have had the tables FACULTY and STUDENT store LAST, FIRST, ADDR and other fields? At what point do you denormalize for the sake of being more practical? What would you do when you want to close out one year and start a new year? If you had stand-alone student and faculty tables then you could archive them easily, have a school semester and year attached to them. However, as you go from one year to the next information about a student or faculty may change. Like their address and phone for example. The database model now is not very good because it doesn’t maintain a history. If Student A was in school last year as well but lived somewhere else would you have 2 contact rows? 2 student rows?  Or do you have just one of each and have a change log. Which is best?" --comp.databases.theory

Sunday, January 22, 2017

Are You a Thinking Data Professional?

Note: The following was written by Todd Everett as a comment on my post Don't Design Databases Without Foundation Knowledge and Conceptual Models. He is a reader I deem a "thinking data professional"--always the qualitative rather than quantitative target of my writings and teachings. It merits a post in its own right, for the benefit of others.

Monday, January 16, 2017

Don't Design Databases Without Foundation Knowledge and Conceptual Models


"I have two tables, one is product which is a parent table with one primary key and I have another child table of product, which is a product_details table. But the child table is linking with parent table(product) with logical data instead of foreign key,as we are doing this relationship with the help of java code in the coding side, instead of depending on the data base, which make it as tight couple. To avoid tight coupling between the tables we are storing the primary key value in the child table.
CREATE TABLE `tbl_product` (
  `product_id` varchar(200) NOT NULL,
  `product_details_id` varchar(200) DEFAULT NULL,
  `currency` varchar(20) DEFAULT NULL,
  `lead_time` varchar(20) DEFAULT NULL,
  `brand_id` varchar(20) DEFAULT NULL,
  `manufacturer_id` varchar(150) DEFAULT NULL,
  `category_id` varchar(200) DEFAULT NULL,
  `units` varchar(20) DEFAULT NULL,
  `transit_time` varchar(20) DEFAULT NULL,
  `delivery_terms` varchar(20) DEFAULT NULL,
  `payment_terms` varchar(20) DEFAULT NULL,
  PRIMARY KEY (`product_id`));

CREATE TABLE `tbl_product_details` (
  `product_details_id` varchar(200) NOT NULL,
  `product_id` varchar(200) DEFAULT NULL,
  `product_name` varchar(50) DEFAULT NULL,
  `landingPageImage` varchar(100) DEFAULT NULL,
  `product_description_brief` text CHARACTER SET latin1,
  `product_description_short` text CHARACTER SET latin1,
  `product_price_range` varchar(50) DEFAULT NULL,
  `product_discount_price` varchar(20) DEFAULT NULL,
  `production_Type` varchar(20) DEFAULT NULL,
  PRIMARY KEY (`product_details_id`),
  UNIQUE KEY `product_id` (`product_id`));
Please suggest the Pros and Cons of the design, we are following this kind of relationship in my company, as the manager is saying it will give [us flexibility]. I know that if we lose the data from the table, we can't know the relationship between the two tables."--StackExchange.com
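To make the stakes of that design concrete, here is a hedged sketch of what a declared foreign key buys. It uses Python's built-in sqlite3 module rather than the MySQL of the question, and trimmed-down hypothetical tables (`product`, `product_details_no_fk`, `product_details_fk`) plus a hypothetical `try_insert` helper. Without the constraint the DBMS accepts a details row for a nonexistent product; with it the row is rejected no matter which application issues the insert.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE product (
    product_id TEXT PRIMARY KEY
);
-- "Logical" link only: the relationship lives in application code.
CREATE TABLE product_details_no_fk (
    product_details_id TEXT PRIMARY KEY,
    product_id         TEXT NOT NULL UNIQUE
);
-- Same columns, but the relationship is declared to the DBMS.
CREATE TABLE product_details_fk (
    product_details_id TEXT PRIMARY KEY,
    product_id         TEXT NOT NULL UNIQUE REFERENCES product (product_id)
);
INSERT INTO product VALUES ('P1');
""")

def try_insert(table, details_id, product_id):
    """Return True if the DBMS accepts the details row, False if it rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(f"INSERT INTO {table} VALUES (?, ?)", (details_id, product_id))
        return True
    except sqlite3.IntegrityError:
        return False

orphan_without_fk = try_insert("product_details_no_fk", "D1", "NO_SUCH_PRODUCT")
orphan_with_fk = try_insert("product_details_fk", "D1", "NO_SUCH_PRODUCT")
valid_with_fk = try_insert("product_details_fk", "D2", "P1")
```

The "flexibility" gained by omitting the constraint is the freedom to store details of products that do not exist, a freedom every application touching the tables must then be trusted not to exercise.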

Monday, January 9, 2017

This Week

1. What's wrong with this picture?
"I have two tables, one is product which is a parent table with one primary key and i do have another child table of product, which is a product_details table. But the child table is linking with parent table(product) with logical data instead of foreign key,as we are doing this relationship with the help of java code in the coding side, instead of depending on the data base, which make it as tight couple. To avoid tight coupling between the tables we are storing the primary key value in the child table.

CREATE TABLE `tbl_product` (
 `product_id` varchar(200) NOT NULL,
 `product_details_id` varchar(200) DEFAULT NULL,
 `currency` varchar(20) DEFAULT NULL,
 `lead_time` varchar(20) DEFAULT NULL,
 `brand_id` varchar(20) DEFAULT NULL,
 `manufacturer_id` varchar(150) DEFAULT NULL,
 `category_id` varchar(200) DEFAULT NULL,
 `units` varchar(20) DEFAULT NULL,
 `transit_time` varchar(20) DEFAULT NULL,
 `delivery_terms` varchar(20) DEFAULT NULL,
 `payment_terms` varchar(20) DEFAULT NULL,
 PRIMARY KEY (`product_id`));

CREATE TABLE `tbl_product_details` (
 `product_details_id` varchar(200) NOT NULL,
 `product_id` varchar(200) DEFAULT NULL,
 `product_name` varchar(50) DEFAULT NULL,
 `landingPageImage` varchar(100) DEFAULT NULL,
 `product_description_brief` text CHARACTER SET latin1,
 `product_description_short` text CHARACTER SET latin1,
 `product_price_range` varchar(50) DEFAULT NULL,
 `product_discount_price` varchar(20) DEFAULT NULL,
 `production_Type` varchar(20) DEFAULT NULL,
 PRIMARY KEY (`product_details_id`),
 UNIQUE KEY `product_id` (`product_id`));
Please suggest the Pros and Cons of the design, we are following this kind of relationship in my company, as the manager is saying it will give us flexible to us. I know that if we lose the data from the table, we can't know the relationship between the two tables."--StackExchange.com

Tuesday, January 3, 2017

Understanding the Relational Data Model: A New Series of Papers

"Nowadays, anyone who wishes to combat lies and ignorance and to write the truth must overcome at least five difficulties. He must have:
  1. The keenness to recognize it, although it is everywhere concealed;
  2. The courage to write the truth when truth is everywhere opposed;
  3. The skill to manipulate it as a weapon;
  4. The judgement to select in whose hands it will be effective, and
  5. The cunning to spread the truth among such persons."
--Bertolt Brecht
This is a rather accurate explanation of why it has been so difficult to dispel the misuse and abuse of the Relational Data Model since its inception, to the point that most of its core practical benefits have failed to materialize, with the IT industry regressing all the way back to its pre-relational and even pre-database state:
  • Graph DBMSs;
  • XML;
  • JSON;
  • NoSQL;
  • Application-specific databases and DBMSs;
  • "Unstructured data";
  • No integrity enforcement;
  • A cacophony of imperative programming languages rather than declarative data sublanguages (suffixed with QL, just like old non-relational DBMSs were with /R).