"By supporting
what it calls the "three faces" of database design -- XML,
relational, and object data -- IBM's database will be able to interact with XML
documents, structured information such as row and columns, and data written in
object-oriented programming languages, namely Java and C++. As such, the goal
of all three [DBMS] vendors is to more effectively run searches across
structured and unstructured data sets."
--Database Titans Embrace XML, InfoWorld
"The summary
focuses on how the need to manage unstructured data in addition to structured
data is influencing the adoption of XML databases. In particular the survey
examines respondent's opinions on: How widespread has the management of
unstructured data become; The use of XML to manage all data, whether structured
or not."
--Grant Laing, Intellor Group
"Virtuoso 2.7 ...
creates a profound synthesis of SQL and XML data management styles, and wraps
Web-services bindings around both. The SQL engine at the core of the product
can contain structured data, as well as semi-structured data (i.e. XML) and
unstructured data (files, images). There's also a tightly integrated WebDAV
(Web Distributed Authoring and Versioning) datastore that offers hierarchical
access to semi-structured and unstructured data."
--Jon Udell, Across the Universe, InfoWorld
In an article published under different titles in at least
two trade publications, Janet Perna, general manager of IBM Data Management
Solutions, writes:
"Companies today are operating at a huge data disadvantage
because they are unable to capitalize on their rich storehouses of information
to increase productivity, reduce response times, or improve customer
service".
This is because
"Eighty-five percent of the information that businesses
need to operate with does not fall into structured formats of automated
spreadsheets or databases. Instead, it exists in a wide range of unstructured
content, such as e-mail, graphics, or video. With so much diverse information,
employees spend roughly 25 percent to 35 percent of their time looking for the
information they need to do their jobs."
She recommends
"… proven strategies that can help companies capitalize on
their data ... integrate all systems and be certain that critical information
is digitized so employees can access every form of content ... graphic images,
video, and text ... make sure that certain systems are robust enough to manage
all current content, as well as accommodate future volume, while maintaining
fast response times"
Now, few would dispute that digitizing data enhances
accessibility and sharing of information; that is rather trivial. But even
though integration is like motherhood and apple pie--everybody is for it, and
nobody is against it--the devil is in what exactly one means by it. There are
limits to the accessibility and integration of what is called
(erroneously, as we shall see) "unstructured content", even when
digitized. Without a good grasp of data fundamentals, which is lacking in the
industry, users can be lured down a path that actually inhibits, rather
than enhances, intelligent access to information.
"Unstructured content" is a contradiction in terms:
data that is unstructured, that is, not organized in any way, is random
and, thus, carries no meaning and, therefore, no informational content. There
cannot be information management without some data organizing principle.
It is the precise function of a data model to provide,
among other things, a structure in accordance with such a principle, e.g.
the relational model's R-table. This imparts meaning to the data (see Something
to Call One's Own and The Name Game) that, if conveyed to a DBMS via data
definition, allows it to protect the integrity of the database and to
manipulate it in order to derive information from it at users' request. In
other words, the data model underlying a database determines what questions can
be asked of, and answered by, the DBMS, and how accurate and reliable the
results are.
Note: One of the common confusions in the industry is between data model,
business model, and logical model, which are distinct (see Models, Models
Everywhere, Nor Any Time to Think; On What Is a Data Model: Reply to Simon
Williams; On the So-called "Associative Model of Data"; and On "Respected
Technical Analysts").
Text, graphics and video are hardly random data and,
therefore, certainly not unstructured. Each can be organized in more than one
way. For example, text can be organized as e-mail (from: and to: address,
header, body, signature, message, thread), as articles (word, sentence,
paragraph, section), or as contracts (see below). Graphics can be organized as
various pixel configurations (GIF, JPG, TIFF). In information management the
distinction is not between structured and unstructured data, but between different
data structures--essentially, between different data models.
So if text, graphics and video are not random data, what do
those referring to "unstructured data" mean? Apparently, data that
"does not fall into structured formats of automated spreadsheets or
databases" or, in other words, non-tabular data. But "does not
fall into" is open to two interpretations:
a. data that has not been structured in tables
b. data that cannot be structured in tables
and the distinction is critical for data management purposes.
Note: Although they are lumped together--"both
involve rows and columns"--spreadsheets and databases have different
integrity and manipulation features and, thus, different underlying data
models. When spreadsheets are used for database purposes, as they were and
still are, some nasty consequences ensue.
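The difference in integrity can be made concrete with a minimal sketch in
Python, using the standard sqlite3 module. The customers data, the column names
and the credit_limit rule are all hypothetical, invented for illustration only.
A spreadsheet-like grid accepts a duplicate identifier and a non-numeric
amount without complaint, whereas a table whose constraints have been declared
to the DBMS rejects both.

```python
import sqlite3

# A spreadsheet-like grid: any cell can hold anything, so nothing stops bad rows.
sheet = [["customer_id", "name", "credit_limit"]]
sheet.append([1, "Acme Distribution", 50000])
sheet.append([1, "Borealis Media", "n/a"])  # duplicate id, non-numeric limit: accepted

# The same data declared to a DBMS, with hypothetical integrity constraints.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id  INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        credit_limit REAL NOT NULL
            CHECK (typeof(credit_limit) IN ('integer', 'real')
                   AND credit_limit >= 0)
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Acme Distribution', 50000)")

# Try the two kinds of bad row that the grid accepted silently.
rejections = []
bad_rows = [
    (1, "Borealis Media", 1000),   # duplicate key
    (2, "Borealis Media", "n/a"),  # non-numeric credit limit
]
for row in bad_rows:
    try:
        conn.execute("INSERT INTO customers VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError as exc:
        rejections.append(str(exc))

stored = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

The grid ends up with both bad rows in it, while the table still holds only the
one valid customer; the typeof() test is needed because SQLite, unlike most SQL
DBMSs, would otherwise store the text value in a REAL column.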
Consider a company selling rights to films in various parts
of the world. Its sales are recorded in legal contracts, many of which are
generated daily, that spell out transactions in textual detail. To be able to
function, the company needs a system which can tell it, at any point in time,
the state of its business, e.g. "what rights to film X in what regions for
what time period have been sold to what customers for what price?"
One option is to digitize the contracts "as is",
e.g. scan and OCR them into a "textbase". This seemingly "relieves"
the company of the burden of database design, hence the notion that the data
is "unstructured". But, of course, that is inaccurate: the data is
organized in accordance with a "contract text data model", so to
speak--words, paragraphs, clauses, sections, and so on--with its own
constraints and operators. These would have to be implemented in the system
that manages the data, so that it can protect the data's integrity and derive
information from it. And, as I explain in chapter 1 of PRACTICAL ISSUES IN DATA
MANAGEMENT, not only is the implementation of integrity and
manipulation for such models extremely complex--answering queries such as the
one above is a non-trivial, even prohibitive, proposition--but the operators
themselves are not even agreed on, in part because nonrelational models lack
a sound theoretical foundation.
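A minimal Python sketch suggests why such queries are hard to answer from
contract text. The two contract sentences, the company names and the regular
expression are all hypothetical; real contracts are far longer and far more
varied. Both sentences record a sale of rights to Film X, but an extraction
rule written with one phrasing in mind finds only one of them.

```python
import re

# Two hypothetical contract excerpts recording the same kind of fact.
contracts = [
    "Licensor grants Acme Distribution the rights to Film X in France "
    "from 2002-01-01 to 2004-12-31 for a fee of 50000 USD.",
    "For 80000 USD, Borealis Media shall hold, between 2002-06-01 and "
    "2003-05-31, Japanese rights in the picture entitled Film X.",
]

# An extraction rule that assumes the wording of the first contract.
pattern = re.compile(
    r"grants (?P<customer>.+?) the rights to (?P<film>.+?) in (?P<region>.+?) "
    r"from (?P<start>\d{4}-\d{2}-\d{2}) to (?P<end>\d{4}-\d{2}-\d{2}) "
    r"for a fee of (?P<price>\d+) USD"
)

found = []   # sales the rule managed to extract
missed = []  # contracts the rule could not interpret
for text in contracts:
    match = pattern.search(text)
    if match:
        found.append(match.groupdict())
    else:
        missed.append(text)
```

Here found holds only the Acme Distribution sale, and the Borealis Media
contract lands in missed, so "what rights to Film X have been sold?" gets half
an answer. Every new phrasing needs another rule, and nothing in the text
itself guarantees that the rules are complete or consistent.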
As Chris Date points out, different instances of [such] data
require widely different types of processing, and about the only thing they
have in common is that they are hard to deal with in today's DBMS products. An
"e-mail data model" is different from a "contract data
model", even though the underlying representation for both is text. There
are even different "e-mail data models" and different "contract
data models" (would the entertainment profession agree with, say, the real
estate profession on a "contract data model"?).
On the other hand, contract data can, of course, be
organized in tables. Via the process of data modeling, a company-specific
business (or conceptual) model can be defined, consisting of entity types
Films, Rights, Customers, Locations, and so on, each with attributes of
interest. This model can then be mapped, courtesy of the relational data model,
to a company-specific logical model, where the entity types are represented by
R-tables: attributes are represented by columns and entities by rows. The
logical model, including integrity constraints that represent business rules,
can be declared to an RDBMS, which will enforce the constraints, and
applications can invoke relational operations that do have a theoretical
foundation and, thus, are logically sound, agreed on, and simpler.
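What such a logical model might look like can be sketched in Python with the
standard sqlite3 module. The table and column names, the sample rows and the
two business rules (a positive price, a start date before the end date) are
assumptions made for illustration, not the design of any actual company, and
SQL is only an approximation of the relational model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

# Hypothetical logical model: one table per entity type, constraints declared.
conn.executescript("""
CREATE TABLE films     (film_id INTEGER PRIMARY KEY, title TEXT NOT NULL UNIQUE);
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE regions   (region_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE rights_sales (
    film_id     INTEGER NOT NULL REFERENCES films,
    region_id   INTEGER NOT NULL REFERENCES regions,
    customer_id INTEGER NOT NULL REFERENCES customers,
    starts_on   TEXT NOT NULL,   -- ISO dates, so text order is date order
    ends_on     TEXT NOT NULL,
    price       REAL NOT NULL CHECK (price > 0),
    CHECK (starts_on < ends_on),
    -- Simplification: this key does not rule out overlapping periods.
    PRIMARY KEY (film_id, region_id, starts_on)
);
""")

conn.execute("INSERT INTO films VALUES (1, 'Film X'), (2, 'Film Y')")
conn.execute("INSERT INTO customers VALUES (1, 'Acme Distribution'), (2, 'Borealis Media')")
conn.execute("INSERT INTO regions VALUES (1, 'France'), (2, 'Japan')")
conn.executemany(
    "INSERT INTO rights_sales VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 1, "2002-01-01", "2004-12-31", 50000.0),
        (1, 2, 2, "2002-06-01", "2003-05-31", 80000.0),
        (2, 1, 2, "2002-03-01", "2005-02-28", 30000.0),
    ],
)

def rights_sold(film_title):
    """What rights to a film, in what regions and periods, to whom, at what price."""
    return conn.execute("""
        SELECT r.name, s.starts_on, s.ends_on, c.name, s.price
        FROM rights_sales s
        JOIN films f     ON f.film_id = s.film_id
        JOIN regions r   ON r.region_id = s.region_id
        JOIN customers c ON c.customer_id = s.customer_id
        WHERE f.title = ?
        ORDER BY r.name
    """, (film_title,)).fetchall()

film_x_rights = rights_sold("Film X")

# A sale referring to a film that does not exist is refused by the DBMS.
try:
    conn.execute(
        "INSERT INTO rights_sales VALUES (99, 1, 1, '2003-01-01', '2003-12-31', 1000.0)"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

The film company's question becomes a single declarative query, and the
constraints are enforced by the DBMS rather than by each application.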
Otherwise put, data is not inherently tabular or non-tabular,
and how it is represented and manipulated is a matter of choice.
For any segment of reality there always exist, in principle,
conceptual models mappable to tabular representations, which satisfy specific
sets of informational needs (as the film company example illustrates). Such
structuring comes at a price--knowledge and time--but yields the significant
benefits of generality, soundness, completeness and simplicity of access. The
so-called "unstructured data" label--a misnomer--does not mean, as
is often implied, that data in text or images cannot be represented
relationally. It only means that a choice can be made not to structure
the data so. But there is no free lunch: the cost is the loss of precisely the
above-mentioned relational advantages.
Nonrelational databases may be adequate for certain purposes,
but questions of the kind that the film company needs answered are
impracticably difficult and costly to answer from "unstructured data"
(Internet search-engines are a case in point), even when digitized and "integrated"
(read: managed by the same system) with relationally structured data.
Practitioners are advised to familiarize themselves with
database fundamentals, to avoid being lured by marketing into the notion that
"multiface" products that "create a profound synthesis" of
structured and unstructured data obviate the need for database design, e.g.
"NeoCore has developed an XML database that has been
designed from the ground up to support XML. It is both data- and
document-centric hence we call it information-centric ... The biggest benefit
of using the NeoCore system is that it totally obviates the need for doing any
database design."
--Product marketing manager
"If you deploy a DBMS without doing any design, you
deserve anything you get", says Chris Date.
Structuring data is hard--it requires knowledge, thinking and
time. But it is an illusion to believe that it is possible to avoid it and
still achieve intelligent access to information. A trade article referred to
NASA
"… struggling with ... how to capture and analyze the
[terabytes] of data beamed down to earth daily from orbiting satellites"
and "the way in which the raw data must be assigned to tables in order to
be processed" may require "a degree of rationalization and some
predisposition toward the ultimate use of the data" that "is
difficult because the scientist may not know ahead of time what analysis to run
on the data".
But if it is "this lack of knowledge" that
"severely limits the usefulness of the (SQL) system", it is only the acquisition
and application of the knowledge that can eliminate the limits, not any
product technology.
Posted 06/09/02