From: CB
To: Editor
Would a column that holds a bitmask value be considered a
multivalued column (and thus a violation of 1NF)?
Say, for instance, that my database stores (among other
things) application users and the rights each user has within the application
(e.g., "read invoice", "create invoice", etc). There
exists a many-to-many relationship between users and the rights they
possess. The normal way to set this up (in accordance with the Normal
Forms) would be to create a Users table, a Rights table, and a UserRights table
(which associates user ids with their ids of their associated rights).
However, someone thought of the idea of eliminating the UserRights
table, and, instead, adding a column to the Rights table (RightsBit) to store a
"bit" value for each right (i.e., 1, 2, 4, 8, 16, 32, etc.), which
would serve as that table's primary key. Them they added a column to the
Users table (RightsBitMask) to store the bitmask of that user's rights (i.e.,
if the user has right 2 and right 8, his RightsBitMask column value would be
10. If he were assigned rights 1, 4, and 16, his value would be 21).
The advantage of this is that, when I want to find out which
people have a certain right, I don't have to perform any joins. I can
simply execute a query that performs a bitwise AND operation between the
RightsBitMask column and the bit value of the right(s) I'm querying for.
This runs much faster than performing a join query (of course, it goes without
saying that, if I were using a TRUE RDBMS, there would be no performance issue
- but at my company I'm forced to use a SQL DBMS. Since a SQL DBMS
doesn't separate the logical from he physical design, things such as joins and
normalization do indeed have a performance impact).
In addition, it is argued that using this bitmask field would
not slow down insert/update performance or introduce the possibility of
inaccurate data, because if I needed to change the rights a user had, all I
would need to do is perform an addition or subtraction operation on that one
single field.
What are your thoughts on all this?
From: Fabian Pascal
To: CB
I do not give specific design advice because I do not know
enough about the data and its uses to speak intelligently of it. To do that I
would have to learn about it and that's not feasible via email. I can only
state general principles.
First, read The Dangerous Illusion Parts 1 and 2 and see
if that gives you some ideas.
If you read my book, you know that the meaning of a table is
its predicate. Formulate the predicates of the 3 tables in the
relational design and those of the 2 tables in the bitmask design and see if
you can detect any differences/problems.
This is a good example of how physical implementation flaws
cause users to corrupt the logical design process with performance considerations,
which can only lead to trouble (see: Note on A Quote).
You may have no choice with a current product, but at least know what you're
buying into.
From: CB
I had written you a while ago regarding the use of bitmask
columns as a way of getting around needing an intermediate table (i.e., a
"user" table and a "rights" table, with a bitmask column in
the "user" table containing the sum of the bit values in then rights
bit field of the "rights" table in order to indicate what rights a
user has).
My original question was: would such a field constitute a
multivalued attribute, and thus be a violation of 1NF? And also, could
this introduce integrity problems?
Well, I think I've answered my own question!
(1) This design would not allow for referential
integrity. In other words, let's say that the bit value for the
"create contract" right was 4, and that of the "approve
contract" right was 8. The bitmask column of the users table for a
user who had these 2 rights would be 12. But ... there would nothing to stop
someone from changing the bit values in the rights column (say, to change the
bit value for "create contract" to 16, and that of "approve
contract" to 2). That would instantly corrupt all the data in the users
table; each user would be shown to have the wrong set of rights.
The only way to avoid this would be to create a trigger on
the rights table, preventing anyone from changing the bit values and the rights
each bit represents. But triggers create a performance overhead, which
might outweigh any performance improvements gained by using the bitmask field
rather than the intermediate table.
(2) In addition, using bit and bitmask fields severely
reduces the number of records that the table could hold. The bit field
for record 1 would be 1, record 2 would be 2, record 10 would be 512, etc., and
with only 30 or so records, you would end up with bit values that might be too
large for the column's data type. Thus, if you wanted to hold more than,
say, 30 records in the rights column, you would need to create a second column,
then a third, and so on. Thus you would end up with repeating groups -
another violation of the normal forms.
So I guess there's no getting around the need for fully
normalized database design. I thought I had found a trick to do so, but I
was wrong.
Fabian Pascal (with Chris Date) Responds: Initially I
was not sure I understood the bitmask scheme. Sure enough, when I violated my
reluctance to review/assess the specific design, my reply contained some
incorrect details. CB then sent me the following description of the two
possible designs:
Ø
Normalized version of database:
USERS
USERID USERNAME
===============
1
Bill
2
Joe
3
Jane
:
:
===============
RIGHTS
RIGHTSID RIGHTSDESC
=========================
1
Read
2
Write
3
Delete
4
Edit
Chg Permissions
:
:
=========================
USERIGHTS
USERID RIGHTSID
================
1
1
1
2
1
3
1
4
1
5
2
1
3
2
4
:
================
(the “intermediate” or “joining” table between USERS and
RIGHTS)
Ø
Denormalized version employing bit and bitmask column:
RIGHTS
RIGHTSBIT RIGHTSDESC
==========================
1
Read
2
Write
4
Delete
8
Edit
16
Chg Permissions
:
:
==========================
USERS
USERID USERNAME RIGHTSBITMASK
===============================
1
Bill 31
2
Joe 1
3
Jane 10
:
: :
===============================
How does this work? The RIGHTSBITMASK field in the
Users table contains the sum of the RIGHTSBIT values in the RIGHTS table for
each right the user possesses. Bill has all rights assigned to him, so
his RIGHTSBITMASK value is 31 (1+2+4+8+16).
To find out which rights a user has, one merely needs to
query that user’s RIGHTSBITMASK field using a bitwise AND comparison.
There is no need to “join” to the RIGHTS table. To add or remove a right
from a user, one merely needs to add or subtract the appropriate RIGHTSBIT
value to that user’s RIGHTSBITMASK field.
Based on this description we note as follows:
Ø
The bitmask scheme is not a denormalized design; it does not violate
1NF. Both designed are normalized, one is simply good design, the other has
drawbacks. (Much confusion surrounds the issue of 1NF; it is not possible to do
it justice in this exchange, so we intend to publish an article on the topic in
the near future.) In the meantime, we refer the reader to the following sources
where the subject is discussed:
§ C.J. Date, AN INTRODUCTION
TO DATABASE SYSTEMS, 7th ed., pp. 127-129 and 372-374.
§ Relation-Valued Attributes or Will the
Real First Normal Form Please Stand Up? in C.J. Date and H. Darwen, RELATIONAL
DATABASE WRITINGS 1989-1991
§ RM Prescription 7: TYPE GENERATOR
RELATION, pp. 135-139, C.J. Date and H. Darwen, THE THIRD
MANIFESTO, 2nd ed. (especially part b)
§ C.J. Date, H. Darwen
and N.A. Lorentzos, TEMPORAL DATA
AND THE RELATIONAL MODEL
Ø
My initial comment that the bitmask scheme is poor design remains
generally valid. Performance usually involves tradeoffs, and the optimal
tradeoff varies with databases, DBMSs and the pattern of access by
applications, which is one reason not to contaminate logical design with
implementation details. In this case, to avoid a third table (and joins), one
table is widened by one additional column, thus probably trading off
performance of applications that need to access the individual table for both
queries and updates, for performance of applications that need information from
both tables.
Ø
The most undesirable cost of the scheme, however, is, as CB figured out
himself, its integrity implication (which is usually the consequence of poor
design, see The Dangerous Illusion Parts 1 and 2). Instead of
relying on the straightforward referential integrity in the three-table design
case, the two-table design requires a database constraint (one spanning two
tables, see the integrity chapter in PRACTICAL ISSUES IN
DATABASE MANAGEMENT) to be formulated by the user. The problem is that
current SQL products are not likely to support such a constraint declaratively,
which means that it would have to be implemented procedurally by a trigger (in
itself not very desirable) or, worse, by application code. What is more, even
if the declarative constraint were supported, it turns out to be nightmarish
(not to mention that it involves a join for updates!). And to enforce the
constraint, the DBMS will have to perform the very join that is to be avoided
in the first place!!!
We will illustrate the problem in SQL for reasons of
familiarity, even though it is not to our taste. Given the two tables
RIGHTS and USERS, the columns RIGHTS and RIGHTSBITMASK are of type BIT(N).
Rights are assigned Ids--RIGHTBITs that are powers of 2--i.e., the ith
right has RIGHTSBIT equal to a bit string in which the ith bit, counting
from the right, is a one and the rest all zeros. (We assume here that (a) there
are exactly N distinct rights and (b) those rights are set up in
one-to-one correspondence (somehow, possibly arbitrarily with the integers 1 to
N.)
One benefit of this encoding scheme is that the query “Who
has all of the rights in some specified set S?” is easily dealt with by
setting up a mask SMASK containing one-bits in all and only those positions
corresponding to the privileges in S and ANDing that mask with
RIGHTSBITMASK:
SELECT userid,…
FROM users
WHERE (smask AND rightsbitmask) = smask;
(We ignore here the fact that the SQL standard doesn’t
actually support AND, OR, etc., on bit strings!—but I’m sure some products do.
If they don’t, they should. So should the standard, of course.)
In practice, however, there might be some values of i
for which no corresponding right exists (e.g., some rights might have been
deleted, or they might not have been assigned in a “dense” fashion in the first
place). Thus, we need to ensure that every one-bit appearing anywhere in
USERS.RIGHTSBITMASK actually appears in RIGHTS.RIGHTSBIT. This constraint is
somewhat analogous to a foreign key constraint, but it isn’t a FK
constraint, because it refers to components of column values instead of
column values per se. Here then is one possible declarative
specification of such a constraint:
CREATE ASSERTION
DREADFULLY_COMPLEX
CHECK ((NOT EXISTS
(SELECT *
FROM users
WHERE SUBSTRING(rightsbitmask FROM 1 FOR 1)) = B’1’
AND NOT EXISTS
(SELECT *
FROM rights
WHERE SUBSTRING(rightsbitmask FROM 1 FOR 1) = B’1’))
AND
(NOT EXISTS
(SELECT *
FROM users
WHERE SUBSTRING(rightsbitmask FROM 2 FOR 1)) = B’1’
AND NOT EXISTS
(SELECT *
FROM rights
WHERE SUBSTRING(rightsbitmask FROM 2 FOR 1) = B’1’))
AND
...
AND
(NOT EXISTS
(SELECT *
FROM users
WHERE SUBSTRING(rightsbitmask FROM N FOR 1)) = B’1’
AND NOT EXISTS
(SELECT *
FROM rights
WHERE SUBSTRING(rightsbitmask FROM N FOR 1) = B’1’)));
Problems with the foregoing (if it is supported by the
DBMS):
· It’s complicated. While it might be
possible to simplify it, there is almost zero chance that it will perform as
well as a traditional FK constraint—even though, conceptually, it is very
similar to a FK constraint (and FK constraints receive special case
implementation in SQL, but more general constraints do not.
· FK constraints allow you to specify
certain compensating actions (declaratively, of course)—e.g., cascade delete.
The foregoing expression allows nothing analogous.
· Perhaps most telling of all: The
foregoing expression is fragile—it involves N separate conditions
ANDed together. If the value of N changes (i.e., if the number of rights
supported changes—quite likely), the constraint will have to be changed too. This
shouldn’t happen to a dog.
Attempts to resolve performance issues at the logical level,
forced by the poor relational fidelity and implementation of SQL and its
commercial implementations (a) bias the database for some applications and
against others and/or (b) complicate integrity enforcement. Only by ignoring
these costs can such endeavors be deemed “better”. Practitioners should be
particularly suspicious of “bit-twiddling schemes”, which together with SQL
products defeat a practical relational purpose: human performance
bottlenecks usually overwhelm machine performance bottlenecks, so the
latter should not be traded off for the former.
A well-implemented, truly relational DBMS (TRDBMS) would
obviate the need for such complications. If products force users into such
schemes, users should be aware of the cost and its real cause, instead of
blaming the relational model and/or design.
Posted
01/17/03
[ABOUT]
[QUOTES]
[LINKS]