Monday, August 22, 2005

The Web Database

there are many who have traced the history of database management systems, in particular the Great Debate between the network model and the relational model - embodied in their key proponents, Charles Bachman and E.F. Codd, respectively - and note that if there are any purely technical factors that contributed to the relational model's triumph over the network model it would be that the relational was simpler. not only were network databases more complex to manage from an adminstrative perspective, but from a user standpoint querying network databases was complex and error-prone because developers of the network model were never able to devise a simple declarative query language, having to rely on procedural devices like goto's and cursors and requiring an intimate low-level knowledge of the physical data structures by the user. some relational purists will argue that the relational model's solid mathematical foundation was the source of its technical superiority, but from a pragmatic perspective its grounding in predicate calculus was only important insofar as it simplified the problems of storing and accessing data.

we see the idea of simplicity appearing over and over again when we analyze the advantages of various successful models and systems over their competitors/predecessors. HTML vs. SGML. REST vs. SOAP. Hibernate over EJB and Spring over J2EE. Extreme Programming's KISS philosophy and the New Jersey approach to design. Capitalism vs. Communism. hell, even Nike is going barefoot these days, and in the world of organized violence the paring down of "barred" holds and the mixing of styles is all the rage. common to all of these frameworks is the greater flexibility and creative freedom to allow human ingenuity its fullest expression. when the prime value of the global network that all of our lives are being woven deeper and deeper into is the aggregation and multiplication of human capital, i think that it's no accident that models which release human capabilities are gaining more and more prominence over those that attempt to control them.

what many people fail to realize about the RDF model of data is that it is a simpler and more general model of data than anything that has come before it. not RDF with schemas and ontologies and all that jazz. that's actually more complex than anything that has come before it. i'm talking about basic RDF. designed originally as a data model for the web, one key requirement had to be met: that any data anywhere on the globe, whether it be in relational databases, network databases, flat files, or what have you, could be mapped to it. consequently, what was produced was a kind of lowest common denominator of data models. a key concept here is that of the fundamental, irreducible unit of data as the simplest kind of statement (or, more precisely, in the language of mathematics: a binary relation). even C.J. Date - arguably second only to Codd as an authority on the relational model - acknowledged in a recent comment on "relational binary database design" that there is an argument for the binary relation being an irreducible unit out of which the n-ary relations which relational theory deals with can be composed. in his comment, he describes how a ternary (3 column) relation can be composed by "joining" 2 binary relations. by breaking down the nature of data into something a bit more granular to manipulate we gain a power and flexibility not unlike that envisioned by Bill Joy when he waxes philosophic about nanotechnology and its promise of the ability to create any physical thing by manipulating granular components of matter. indeed much of the progress in our understanding of matter has been driven by successive discoveries of increasingly more granular, or atomic, units of matter.

"No tuples barred" data kung-fu

there's another aspect of RDF that has practical consequences that make it a good fit for the web: it's "self-describing" nature. this aspect of RDF is not just something that was artifically designed in or layered on; it follows quite naturally from its reductionist foundations. since we effectively use the irreducible binary relation as a kind of building block to compose larger types of relations, each irreducible binary relation must have an independent existence apart from the compositional relationships it participates in. it must have a global identifier to be independently recognizable by the system. when the most granular components of even the most complex dynamic aggregations of data are identifiable as individuals with an independent existence, the effect is that the data becomes self-describing. contrast that with the relational model wherein columns are defined relative to a relation. columns cannot be said to exist independent of some relation of which they are a part.

when data is self-describing, schema becomes inessential. there are no RDBMS's that I'm aware of that allow data to be created that does not conform to some pre-defined schema. XML, on the other hand, another self-describing data format, does not require a schema to exist before you can create a valid XML document. while schema may be useful for enforcing/confirming some kind of organization of the data, it is not essential to the creation and manipulation of data.

this allows you to have a database that does not require the kind of bureaucratic planning that the database modeling exercise in a large organization can devolve into before being put into action. if it were a relational database, it would be as if there were no conceivable tuple barred from creation. it allows a level of responsiveness and agility in reacting to problems and creating solutions that simply isn't possible with today's RBMS technology, and with the bureaucracy that has developed in many corporate IT departments around the administration and development of such database systems.

such a system would be much like a database created in Prolog (which almost certainly had an influence on the design of RDF due to its early "knowledge representation" aspirations). in Prolog you can assert any fact, i.e. make any statement that you want without having the predicates predefined. any kind of higher-order structure or logic that exists among the facts, such as a graph connecting a set of binary relations, is an emergent property of a dataset that can be discovered through inference, but is never explicitly defined anywhere in the system. while some sort of schemata may serve as a guide to a user entering facts and rules in a Prolog database, prolog is not aware of it, and has no way of enforcing it. this is much the way that the human brain, indeed matter itself, works. while it's possible at higher levels of organization for both the brain and matter to create rigid molds into which things that don't fit the mold are not accepted, they don't fundamentally work this way. by the same token, it is possible to create RDF systems that ridigly enforce RDF schemas and ontologies, but i wouldn't recommend it. the bigger your world gets the more flexiblity you want. as your horizon expands, it becomes increasingly difficult to define a single schema that fits all data, and the web is about as big a data universe as you can get. the simpler model scales better.

a recent article in HBS Working Knowledge, entitled "How Toyota and Linux Keep Collaboration Simple", describes how "The Toyota and Linux communities illustrate time-tested techniques for collaboration under pressure". the article makes the point that both groups follow a minimalist philosophy of using the simplest, most widely available technologies to enable far-flung groups to collaborate. a minimalist, widely available database technology (i.e. available as a service over HTTP) could allow a kind of real-time programming capability to rapidly create programs that allow collaborators across different organizations to analyze and attack novel problems with unique data patterns in near real-time. the web database should be like a CVS for data, allowing programmers to work in parallel with different representations of data and to merge those representations, in much the way source code version control systems allow different representations of program logic to be worked on in parallel, and merged. like CVS it should provide a lineage of the changes made to those representations allowing them to be "rolled back" if necessary, giving coders the confidence to move forward quickly pursuing a path, knowing that it will be easy to backtrack if necessary. it would be the perfect database technology for agile development, founded on the Jeet Kune Do of data models:
JKD advocates taking techniques from any martial art; the trapping and short-range punches of Wing Chun, the kicks of northern Chinese styles as well as Savate, the footwork found in Western fencing and the techniques of Western boxing, for example. Bruce Lee stated that his concept is not an "adding to" of more and more things on top of each other to form a system, but rather, a winnowing out. The metaphor Lee borrowed from Chan Buddhism was of constantly filling a cup with water, and then emptying it, used for describing Lee's philosophy of "casting off what is useless."

The best of all worlds

recently i came across this interview in 2003 with Don Chamberlin, co-inventor of SQL. nowadays, he spends his time working out a query language for XML and thinking about how to unify structured data and unstructured data under one model, and the integration of heterogenous data with self-describing data models (the latter is exactly what RDF is a good simple solution for, and XML isn't). it ends with some interesting quotes by Mr. Chamberlin:

Chamberlin: Well, you know I've thought about it, and I think the world needs a new query language every 25 years. Seriously, it's very gratifying to be able to go through two of these cycles. DB2 will support SQL and XQuery as sort of co-equals, and that's the right approach. It embodies the information integration idea that we are trying to accomplish.

Haderle: And do you think that, given the Internet's predominantly pointer-based navigation, that Charles Bachman [originator of the network database model] is thinking, "I finally won out over relational?"

Chamberlin: Well, there are a lot of hyperlinks in the world, aren't there? I have a talk, "A Brief History of Data," that I often give at universities. And in this talk, I refer to the Web as "Bachman's Revenge."

Haderle: I know that the IMS guys are saying, "I told you so."

so are we are ready for a new data model? is the web indeed "Bachman's Revenge", and will the new data model be really a return to something old? in some ways, yes. the web, and RDF, do superficially resemble the hyperspace of Bachman's network data model. the hyperlink is a binary relation between two nodes, and both the network data model and RDF are based conceptually, to some extent, on a graph model of data. this is directly attributable to the binary relation's fundamental role in graph theory. but RDF is also fundamentally different. in Bachman's network model it was "records" that were hyperlinked. these records looked more like the n-ary relations of the relational world (though they were never rigorously and formally defined as such). thus, there was a fundamental inconsistency in the network data model. in RDF, all data is modeled as binary relations, and thus all data is "in the graph". thus, all data in an RDF model is at once amenable to the kind of rigorous mathematical analysis and logical inference that the relational model is, and also mappable to a graph (a labeled directed graph, to be more exact). add to that basic structure a self-describing format, and the result is a model of data that achieves an elegance, simplicity, and flexibility that Bachman's model never did, making it a beautiful fit for the web.

in much the same way that the strength of RDF as a universal data model seems to be a result of it being a simplification and distillation of the essence of other models of data, with more dynamism and flexibility, the success of Java was driven in its early days by it being in some sense a distillation of the essence of other popular programming languages and platforms, that was simpler than any of the existing programming languages and platforms - a lowest common denominator that held the promise of portability across all platforms.

Back to the basics ...

so what i'm advocating, in part to help clear up the noise and confusion surrounding this technology, and partly to focus resources where they would reap the most value at this yet early stage in its evolution, is a focus on a simpler RDF. i'm more interested in an RDF--, than an RDF++. the reason the web took off was because it was so simple to use. anyone could write an HTML page. the hyperlink is the most basic and intuitively graspable data structure one could imagine. RDF, in its basic form, doesn't really do much more than add a label to that link, introduce a new kind of node - a literal, and a powerful query language against this network of nodes with labeled links. RDF has yet to "take off". let's wait till that happens and it gains some real traction before we start over-engineering it. let's see how we can cope without schemas and ontologies. let's see if the self-organizing nature of the web will allow us to get away without them. then maybe we'll discover that it's possible to start integrating the world's data on a grand scale.


mia said...

Your blog is great! It's hard to find blogs with good content and people talking about Database Management these days! I have a secret Database Management Exposed if you want to come check it out

Anonymous said...

This is lots of B.S. - there's nothing self-describing in RDF. At best it is yet another notation to describe something by means of reference to "something else" (e.g., a glossary, lexicon, which are an abstaction of a view of the world). "self-describing" cannot exist becuase it is subject to Gödel's Incompleteness Theorem.

Daniel Kim said...

wow, people are actually reading my blog! thanks for the comment, anonymous!

but i think that your interpretation of "self-describing" seems to be much more ambitious than what is usually meant by the term when applied to data formats (though i'll agree the term is confusing). Godel's theorem presents no problems here. what is meant by the term is quite mundane actually and can be demonstrated by a simple example.

we can see how the RDF format (and XML and JSON for that matter) is self-describing by comparing it to one that isn't. the most familiar example is the comma-seperated values, or .csv, format that your spreadsheet data can be exported to. taking the example from the Wikipedia entry linked, what's missing is the names of the fields. in both the abstraction of the data represented as a table, and in the serialization, this information is missing. notice that order is critical and that there is a fixed set of fields allowed - none can be omitted (notice how an "empty" fourth field must still be defined in the file for the 1999 Chevy even though there is no data), and no unanticipated fields are permitted. setting aside the issue of what a human would do with this data without some more information about what the fields mean, some "metadata", there isn't much computer programs can do either.

data that is meant to be processed by a computer is also rarely represented as a headerless table. by and large, programmers either manipulate containers of objects with named attributes, or tables/rows of data with named columns. programmers, being human, find it easier to deal with symbolic identifiers for things rather than remembering their position in an ordered sequence (imagine if each record had 10, or 20, fields). in order for us to get the data in the .csv format from the Wiki article into such a form, we'll have to write a custom program to load it into a list of objects in memory, perhaps, or maybe into a table in a relational database. it is not possible to write a generic program that will load an arbitrary .csv file into an arbitrary container of objects, or database table, without making use of some information, some "metadata", which is not contained in the .csv data itself.

if, however, the programs processing this data did use order to access the fields rather than names, representing it as a 2-dimensional array perhaps, then you'll notice that it would be impossible to merge two different datasets that encode different types of data with different fields into one (at either the abstract table representation level into one table or at the serialized representation level into one file) in a way that wouldn't destroy the ability of any custom programs that were written to handle at least one of the datasets to process the data. for instance, if i had another dataset about car engine information, with records that looked like this:


and i wanted to merge that dataset with the one from the Wiki example, i could append the extra fields (the transmission, engine, and cylinder fields) to the corresponding records in the Wiki example dataset, placing those fields at positions 6, 7, and 8, respectively, but that would break the programs that were written to process the engine information expecting those fields to be at position 3, 4, and 5.

now, let's imagine that we know what the field names are from the Wiki example, and that they are "model year", "maker", "model name", "description", and "price", respectively. let's also imagine that each record has a globally unique identifier that has the form gid:x, where x is an integer. the RDF representation of the data in the example would look something like this (forgiving some liberties i'm taking for the sake of brevity), in N-Triples notation (notice how we don't have to create an "empty" description placeholder for the Chevy):

<gid:1> <model_year> "1997" .
<gid:1> <maker> "Ford" .
<gid:1> <model_name> "E350" .
<gid:1> <description> "ac, abs, moon" .
<gid:2> <price> "3000.00" .
<gid:2> <model_year> "1999" .
<gid:2> <maker> "Chevy" .
<gid:2> <model_name> "Venture \"Extended Edition\"" .
<gid:3> <price> "4900.00" .
<gid:3> <model_year> "1996" .
<gid:3> <maker> "Jeep" .
<gid:3> <model_name> "air, moon roof, loaded MUST SELL!" .
<gid:3> <price> "4799.00" .

the RDF representation of the car engine data might look like this:

<gid:1> <transmission> "Manual" .
<gid:1> <engine> "5.4" .
<gid:1> <cylinders> "8" .
<gid:2> <transmission> "Automatic" .
<gid:2> <engine> "4.0" .
<gid:2> <cylinders> "4" .
<gid:3> <transmission> "Manual" .
<gid:3> <engine> "6.1" .
<gid:3> <cylinders> "6" .

the abstract representation of these two datasets would be a graph that looks something like this (i know the actual data is different but the w3c's online RDF graph generator tool seems to be on the fritz so I'm using this to help visualize what it "looks like"). it is easy to see how a single generic program could be used to process both files and load them into the proper graph representations in memory or in a database. that same program would work on any file representing any kind of data mapping to any kind of graph. it's also easy to see how data can easily be merged with either the abstract or serialized representations. graphs are automatically merged simply by virtue of sharing common nodes (if the w3c's online graph generator were working i'd show the merged representation). with the serialized format, we can simply add the combine the lines from each file into a single file in any order we like.

there is enough contextual information in the data itself to not require these programs to use any information other than what is in the data itself - information that might be embedded in the logic of custom programs which process it, in external configuration files, or in some external metadata dictionaries/repositories - to process it. unlike the tables and comma-delimited files of the .csv format, the RDF data contains within itself all the information necessary for programs written outside of the context in which any given data is conceived and generated to process that data, and to that extent RDF is self-describing. the "metadata" lives with the data. the same is true of JSON. simply eval() any arbitrary JSON text (or use the JSON parser) and it will produce Javascript objects with attributes. the same can be said of XML, though the DOM tree which is produced by XML parsers does not map directly to the data structures of any programming language as simply and cleanly as JSON does to Javascript.

Anonymous said...

Genial brief and this fill someone in on helped me alot in my college assignement. Thank you seeking your information.