Wednesday, February 8, 2006

RDF as a Quantum Model of Data

RDF as a model of data, with certain conditions, looks a lot like the quantum mechanical view of matter. It may be useful to conceptualize RDF as a quantum model of data, as radical a shift from the currently dominant relational model of data as quantum mechanics was from Newtonian mechanics. The "classical" relational model of data, like classical Newtonian mechanics, breaks down at scale. Like quantum mechanics, RDF provides a more useful model when we examine smaller basic units of data.

Let me see if I can explain the correspondences between RDF and quantum mechanics.

The key correspondence between RDF and quantum mechanics that I'll describe here is the notion of the indeterminacy of any property, otherwise known as the uncertainty principle. In quantum mechanics, the notion that any observable property of a particle has a definite value is replaced with a probability distribution of possible values for that property upon observation. Different interpretations of that probability distribution, and of what happens upon observation, lead to different philosophical conceptions of the nature of reality. In the Copenhagen interpretation, which gives special emphasis to the role of the observer, the particle does not "exist" until it is observed - there is no objective physical reality. In the many-worlds interpretation, the particle "exists" in all possible states at once - each property simultaneously has all possible values - and the act of observation splits the history of the observer into alternate histories corresponding to every possible state of the particle. The many-worlds interpretation takes the notion of a quantum superposition of possible states quite literally; the Copenhagen interpretation places the superposition in a kind of meta-reality outside of time and space and posits that the act of observation is required to pluck particular values out of that meta-reality. Each of these interpretations has a rough analogue in RDF, as well.

Let's illustrate with an example - something analogous to the position of a particle, but a bit closer to a real-world application. Assume we have some customer address data: for each customer, we have data about the country of their primary residence. We'll call this property "country of residence". In a relational database, the smallest recordable unit is a tuple, which is a member of a set called a relation - otherwise known as a record in a table, which is the language I'll use from now on. So in our relational database, we might have a table called "Customer" with two columns: "customer_id", which will serve as the primary key, and "country_of_residence". Our Customer table, with a record of the fact that Customer 123 lives in the US, might look like this:

| customer_id | country_of_residence |
|-------------|----------------------|
| 123         | US                   |

In an RDF datastore, the smallest recordable unit is the statement. Every statement is a 3-tuple, or triple, which specifies the value of a particular property for a particular thing. Thus every statement in RDF is implicitly a member of some binary relation; however, those relations need not be explicitly defined. So in our RDF datastore, we would simply have a statement that says:

<customer_123> <country_of_residence> 'US'

But it turns out that our dataset is really big. It has not only been accumulated over a long period of time, but has been culled from many sources. There are discrepancies everywhere. Many customers have multiple primary countries of residence. For instance, we have another piece of data that says that Customer 123's primary residence is Canada. This is a contradiction if we agree that the semantics of the word "primary" imply exclusivity - and, in fact, the way we've modeled it in our relational database does imply that. If we want to add another record to the Customer table indicating that Customer 123 lives in Canada, we can't - there'll be a primary key violation. We have a choice to make: decide that Canada is the "right" value and update the existing record, or decide that US is the right value and discard the Canada information. The relational model only accepts one version of the "truth". (If we do update the record, then depending on the database implementation we may be able to find, in the logs somewhere, the fact that Customer 123 lived in the US at some point in time, but that data is effectively in no-man's land now. From this we can see that relational databases are not designed to handle "versioning" of data.)
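To make the constraint concrete, here's a minimal sketch using an in-memory SQLite database. The schema is the illustrative one from the example above, not a real system:

```python
import sqlite3

# A minimal sketch of the schema above, using an in-memory SQLite
# database. The point: the primary key admits only one record -- and
# hence only one country -- per customer.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Customer (
    customer_id INTEGER PRIMARY KEY,
    country_of_residence TEXT
)""")
conn.execute("INSERT INTO Customer VALUES (123, 'US')")

# A second "primary" country for Customer 123 is rejected outright.
try:
    conn.execute("INSERT INTO Customer VALUES (123, 'CA')")
    conflict = None
except sqlite3.IntegrityError as e:
    conflict = e

print(conflict)  # a PRIMARY KEY / UNIQUE constraint failure
```

The only way forward in this schema is an UPDATE, which overwrites the old value - exactly the forced choice described above.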

In our RDF store, there is no problem. We simply add another statement saying that Customer 123's primary country of residence is Canada. It's just another statement. In fact, we can have an indefinite number of statements about Customer 123's country of residence, and let's imagine that in our dataset there are lots of them - 110 different statements, in fact. In our dataset, we have 5 different statements saying Customer 123 lives in the US, 10 that say she lives in Canada (CA), 15 that say she lives in Mexico (MX), 20 that say she lives in Argentina (AR), 25 that say she lives in the UK, 20 that say she lives in Germany (DE), 10 that say she lives in Japan (JP), and 5 that say she lives in South Africa (ZA). We add all those statements to our RDF datastore. Now, like Schrödinger's cat, our datastore is at once in multiple contradictory states:

<customer_123> <country_of_residence> 'US'
<customer_123> <country_of_residence> 'CA'
<customer_123> <country_of_residence> 'MX'
<customer_123> <country_of_residence> 'AR'
<customer_123> <country_of_residence> 'UK'
<customer_123> <country_of_residence> 'DE'
<customer_123> <country_of_residence> 'JP'
<customer_123> <country_of_residence> 'ZA'
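To see how contradictory statements coexist, here's a toy sketch of a triple store as a plain set of Python tuples. A real store would use a library such as rdflib; the identifiers are the ones from the example:

```python
# A toy triple store: a plain set of (subject, predicate, object)
# tuples. Nothing enforces a single value per property, so all of the
# contradictory statements coexist happily.
triples = {
    ("customer_123", "country_of_residence", country)
    for country in ["US", "CA", "MX", "AR", "UK", "DE", "JP", "ZA"]
}

def objects(store, subject, predicate):
    """Return every value ever stated for this subject/predicate pair."""
    return sorted(o for s, p, o in store if s == subject and p == predicate)

# All eight "primary" countries come back side by side.
print(objects(triples, "customer_123", "country_of_residence"))
```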

But we're losing some information modeling the data this way. We've lost the information that "Customer 123 lives in the US" was stated 5 times, that "Customer 123 lives in Canada" was stated 10 times, and so on. Furthermore, we've lost the information about which source, or sources, any given statement comes from. Let's also assume that for each fact in our dataset we have data telling when it was observed - we've lost that, too. There is a way to model this information in RDF: we can store a reification for every statement, wherein each statement itself has certain properties. Let's imagine that we're storing a source and a date for each statement, meaning for each statement we're storing who made the statement and when it was made:

<customer_123> <country_of_residence> 'US' stated by <source_a> at 2005-11-18T07:21:00
<customer_123> <country_of_residence> 'US' stated by <source_b> at 2005-03-30T11:25:00
<customer_123> <country_of_residence> 'US' stated by <source_c> at 2004-10-20T08:19:00
<customer_123> <country_of_residence> 'UK' stated by <source_a> at 2006-01-17T09:30:00

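A rough sketch of such reified statements, modeled simply as (subject, predicate, object, source, time) tuples rather than RDF's actual reification vocabulary, shows that provenance survives and can be queried:

```python
from datetime import datetime

# Reified statements sketched as (subject, predicate, object, source,
# time) tuples -- a simplification of RDF reification, not the real
# vocabulary. The data is the sample shown above.
statements = [
    ("customer_123", "country_of_residence", "US", "source_a", datetime(2005, 11, 18, 7, 21)),
    ("customer_123", "country_of_residence", "US", "source_b", datetime(2005, 3, 30, 11, 25)),
    ("customer_123", "country_of_residence", "US", "source_c", datetime(2004, 10, 20, 8, 19)),
    ("customer_123", "country_of_residence", "UK", "source_a", datetime(2006, 1, 17, 9, 30)),
]

# Provenance survives reification: we can still ask who said what, when.
from_source_a = [(o, t) for s, p, o, src, t in statements if src == "source_a"]
print(from_source_a)  # source_a stated 'US' in 2005 and 'UK' in 2006
```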
We can now plot a histogram of the statement counts for Customer 123's country of residence:

US | #####                     (5)
CA | ##########                (10)
MX | ###############           (15)
AR | ####################      (20)
UK | ######################### (25)
DE | ####################      (20)
JP | ##########                (10)
ZA | #####                     (5)

What we have, essentially, is a probability distribution for the possible values of Customer 123's country of residence. As in quantum mechanics, we can now only resort to statistical methods to give any sort of objective answer to the question of what country Customer 123 resides in. We might choose the mode - the UK - as the most reliable value. If we were dealing with a property whose range has an ordering - such as Customer 123's credit score - we might choose a median, or perhaps we'd compute an average for a property with a continuous range - such as her weight. The uncertainty relationship between an average value and a particular value is, in many cases, loosely analogous to the uncertainty relationship between position and momentum, or between energy and time, in quantum mechanics: as we widen or narrow the time range we sample (with a single value at a point in time being the limit of narrowing it), the accuracy of a "measured" value at a given time is inversely related to its accuracy as an overall average irrespective of time. That is to say, the smaller the chunk of statements for a property that we look at, the more accurately it reflects the value of that property at that time, but the less it reflects the "movement", or evolution, of the value over time.
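A quick sketch of the statistical view, using the statement counts from the example above:

```python
from collections import Counter

# The statement counts from the example above, treated as an empirical
# probability distribution over Customer 123's country of residence.
counts = Counter({"US": 5, "CA": 10, "MX": 15, "AR": 20,
                  "UK": 25, "DE": 20, "JP": 10, "ZA": 5})
total = sum(counts.values())
distribution = {country: n / total for country, n in counts.items()}

# The mode -- the single most frequently stated value.
mode, mode_count = counts.most_common(1)[0]
print(mode, round(distribution[mode], 3))  # UK, at about 0.227 of all statements
```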

But there are other ways of dealing with these sorts of ambiguities. Through some deliberate, more subtly defined act of "observation" we can collapse the possibilities into a few values, or just one. This is loosely analogous to the "collapse of the wavefunction" in the Copenhagen interpretation of quantum mechanics, which fixes the values of the observable properties of a particle upon observation. We could, for example, specify in our query against the RDF datastore that only the most recent value should be retrieved, or only the most recent value from a given source, blithely unaware of all the other values. A key difference from observation in the quantum mechanical sense, though, is that we can define our query to merely reduce the possible values rather than fix them to a single one. For instance, we could define our query to get all of the statements made in the past year and then perhaps take an average, combining techniques that collapse, or reduce, the set of possible values with the application of statistical measures to the resulting probability distribution.
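A sketch of two such "collapsing" queries over the reified statements from the earlier example, here as (value, source, time) tuples with the illustrative timestamps from above:

```python
from datetime import datetime

# The reified statements from the earlier example, as (value, source,
# time) tuples. Timestamps are the illustrative ones given above.
statements = [
    ("US", "source_a", datetime(2005, 11, 18, 7, 21)),
    ("US", "source_b", datetime(2005, 3, 30, 11, 25)),
    ("US", "source_c", datetime(2004, 10, 20, 8, 19)),
    ("UK", "source_a", datetime(2006, 1, 17, 9, 30)),
]

def most_recent(stmts):
    """Full collapse: the single latest statement wins."""
    return max(stmts, key=lambda s: s[2])[0]

def stated_since(stmts, cutoff):
    """Partial collapse: reduce to the statements made since a cutoff."""
    return [value for value, source, t in stmts if t >= cutoff]

print(most_recent(statements))                         # 'UK'
print(stated_since(statements, datetime(2005, 6, 1)))  # ['US', 'UK']
```

The second query leaves a smaller set of possibilities to which statistical measures can then be applied.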

The application has an important role to play here. If the datastore itself is viewed as a kind of quantum soup containing all possible states, applications can define queries that select states according to specific rules, in some sense playing the role that consciousness plays in the Copenhagen interpretation of quantum mechanics. The analogy is a bit of a stretch, but you get the picture. Applications are critical in reducing the ambiguity of the base dataset for a user by focusing in on well-defined subsets of the data within a particular usage context. The contrast between a statistical view of the data in aggregate and a focus on particular values is reminiscent of the wave-particle duality in quantum mechanics.

Now, an initial gut reaction to the vagueness and "indeterminacy" of this way of modeling data, from those of us used to building systems on relational databases, is to say, "Well, that sucks!" But the ability to model this kind of uncertainty is precisely what is needed at scale. The more data you have, and the more sources it comes from, the more likely you are to encounter differing results, differences of opinion, and otherwise conflicting data. If you measure my weight at three different times, even within the same day, there is a good chance you'll get three different weights. If you ask three different people to rate the service at some restaurant, you're likely to get even greater discrepancies. Whether the data is subjective or "objective" (some would suggest that no data is objective), the more data you get, the less possible it is to definitively identify a single "right" value, and the more important the fuzzy knowledge that statistics provides becomes. Barring the use of statistical methods, you can decide to privilege certain values based on some criteria, such as who provided the data (you may decide to trust the rating of someone you trust over the ratings of others) or when the data was provided (you may decide to trust the most recent weight measurement as the most accurate), making context the determinant of the right data. We can now see how Clay Shirky's principles that "Merges Are Probabilistic", "User and Time are Core Attributes", and "The Filtering is Done Post Hoc", outlined in his brilliant "Ontology is Overrated" essay, can be applied to structured data.

RDF as a quantum model of data is a good fit for Web 2.0 for a number of reasons. If you accept that an important aspect of Web 2.0 is the integration of all web sites into a single global platform for a new breed of applications that combine data and functionality from existing sites in new ways, then you essentially believe that the web is evolving towards one giant database. To integrate all of the structured data contained in the relational database silos that comprise the "backend" of many websites, we need a new kind of relational data model designed to do that, with the simplicity and flexibility to scale. We know that at scale, with data coming from many different sources, uncertainty, indeterminacy, and contradiction are inevitable, and we need a data model flexible enough to accommodate that. Furthermore, if you believe that social aspects of computing are an important theme in Web 2.0, then you need a data model that accommodates multiple points of view simultaneously, that can handle the great diversity of opinion about anything and everything that characterizes the web. Like the multiverse of the many-worlds interpretation of quantum mechanics, different world views must exist side-by-side - the web is a web of webs.

A final note

Some will point out that we could change the design of our relational schema to allow customers to have multiple countries of residence. We could pull the country_of_residence column out into another table, called "Customer_Country_of_Residence" for example (let's abbreviate it as CCR), and create a one-to-many relationship between the Customer table and the CCR table, with the CCR table having a foreign key to the Customer table. (Let's set aside the fact that there should probably have been a "Country" table with the ISO code as its primary key, and a many-to-many relationship between the Customer and Country tables. By the same token, countries should have been modeled as resources in RDF, rather than as literal values, but I wanted to keep the example as simple as possible.) Carrying this further, we could do this for all columns that we would normally put in the Customer table in our relational database, such as name, sex, nationality, etc., pulling each of the columns out into their own tables with a foreign key back to the Customer table. Having done this, we would see that every table in our schema (except for the Customer table itself) is, in fact, a binary relation. Each 2-tuple in every table can be mapped to an RDF triple with the name of the table as the property. Following this sort of schema design philosophy to achieve the kind of flexibility we get with RDF, any n-ary relational data model essentially reduces to a binary relational model resembling RDF.
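As a sketch of that reduction, here's how rows of such binary tables map one-to-one onto triples, with the table name as the property. The table contents, including the name "Alice", are made up for illustration:

```python
# Sketch of the reduction: each 2-tuple in a binary relation becomes an
# RDF-style triple whose property is the table name. (The table
# contents, including the name "Alice", are hypothetical.)
customer_country = [(123, "US"), (123, "CA")]  # Customer_Country_of_Residence
customer_name = [(123, "Alice")]               # a hypothetical Customer_Name table

def rows_to_triples(property_name, rows):
    """Map each 2-tuple in a binary relation to an RDF-style triple."""
    return [(f"customer_{key}", property_name, value) for key, value in rows]

triples = (rows_to_triples("country_of_residence", customer_country)
           + rows_to_triples("name", customer_name))
print(triples)
```

Note that the multiple-countries "contradiction" that broke the original schema is now just two rows - or two triples.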