Monday, June 18, 2007

The Sky Hook

A hook shot, in basketball, is a play in which the offensive player, usually turned perpendicular to the basket, gently throws the ball with a sweeping motion of his arm in an upward arc with a follow-through which ends over his head. Unlike the jump shot, it is shot with only one hand; the other arm is often used to create space between the shooter and the defensive player. Once the ball is in shooting position, the hook shot is one of the most difficult shots to effectively defend. - Wikipedia, "Hook Shot"

It's been called "the most beautiful thing in sports", by Bill Russell. Basketball's version of the pirouette, the player spinning on one toe with arms in the Bolshoi Fourth Position - there is perhaps nothing closer to ballet in any contact sport. Nothing so graceful, nothing with so delicate a form belying the hidden strength required to pull it off, nothing so sweet. The sky hook -- or at least the version of it perfected by Kareem Abdul-Jabbar -- is truly the most lovely thing to behold in all of basketball.

But the thing about the sky hook that fascinated me so much when it recently popped up again into my consciousness (thanks to Michael Sokolove), the thing that made me believe that practicing it and actually trying to execute the shot in pickup games (which no one does anymore) might actually have some deep and important lessons to teach me, was the realization of how perfect a union of form and function it is. Its form actually derives almost entirely from its function -- the spin to create space between oneself and the defender; the arm held out at the horizontal to ward off the defender and maintain that space; the ball released at the apex of the motion of fully extending the other arm in the vertical at the farthest possible angle from the defender -- all of its style is also all of its substance. There is nothing in it that does not contribute to the main objective: to put the ball in the basket. It is pure economy of motion, so much so that it almost seems effortless, at times.

Perhaps this perfectly efficient utilization of energy had something to do with Kareem's incredible longevity. Other than Robert Parish, no other player has played in more games, but Parish was nowhere near as prolific as Kareem. Indeed, Kareem is the NBA's all-time leading scorer, with the eighth highest field goal percentage in the league's history.

When Kareem Abdul-Jabbar left the game in 1989 at age 42, no NBA player had ever scored more points, blocked more shots, won more Most Valuable Player Awards, played in more All-Star Games or logged more seasons. His list of personal and team accomplishments is perhaps the most awesome in league history: Rookie of the Year, member of six NBA championship teams, six-time NBA MVP, two-time NBA Finals MVP, 19-time All-Star, two-time scoring champion, and a member of the NBA 35th and 50th Anniversary All-Time Teams. He also owned eight playoff records and seven All-Star records. No player achieved as much individual and team success as did Abdul-Jabbar. - bio

The sky hook was not only the most beautiful, the most poetic offensive maneuver in basketball, it was also the most consistently devastating weapon, the most eternally difficult to defend against, perhaps the most reliable single shot the game has ever seen. It pleased equally the aesthete and the businessman in us. Which is what, I realized, I've been trying to do for my entire career as a programmer. It's what we programmers are reaching for when attempting to craft solutions that possess that elusive quality of "elegance".

We use different terms as conceptual tools to try and get a handle on exactly what it is we're doing as programmers. Computer Scientist. Engineer. Hacker. The appropriately vague "Developer". None of them quite right. Each addressing some aspect of what we do -- "scientist" emphasizing the analytical aspect, "hacker" emphasizing the more aesthetic aspects -- but ignoring others (except for the neutral, and popular, "developer"). Even as we hack away in eXtreme fashion abandoning the hope that the process of creating software can ever be predictable, and embracing the creative humanity of programming, we acknowledge the brilliance of Knuth and his algorithms, are forever engaged in the quest to optimize tradeoffs between time and space, continue to try to create software that is generic and reusable (implicitly searching for universal and general "laws of software", I think, when we do this), and can never forget that, ultimately, what we are doing as programmers is creating virtual machines.

All software must function, must have some utility. All software does some thing for its user. But, any savvy programmer knows that there are all sorts of non-functional issues, within a technological landscape that is under constant flux, that must be confronted and resolved in the development of any software system. And, I think, the very best among us are able to incorporate aesthetic concerns into the process. The mysterious thing about this is that oftentimes the software that addresses the functional and various other non-functional concerns in the most optimal fashion -- that, in effect, has the most business value -- is also the most beautiful. Just like Kareem's sky hook.

After a certain high level of technical skill is achieved, science and art tend to coalesce in aesthetics, plasticity and form. The greatest scientists are always artists as well. - Albert Einstein

So often style and substance, or form and function, are considered orthogonally. While I do believe that beauty is an inherent good, to be appreciated for its own sake, and believe the same about utility and efficiency, I don't believe that treating them separately is the best approach to building software. Striving to create computer systems that are at once maximally beautiful and maximally useful, not ones that make trade-offs between the two, or that try to balance the two, but that, like Kareem's sky hook, possess a form which derives perfectly from function even while being beautiful, seems to yield the best result.

Perhaps, it's because there is no way to take the human element out of software. Even software that doesn't directly interact with lots of users at runtime must be maintained by humans, consists of code which must be read by humans, possesses a design and an organization that must be apprehended by humans. Man and machine is another false dichotomy. And ugly software is hard to digest not only by end users, causing its popularity to suffer, but by the programmers who must maintain and evolve the software, causing its maintenance to suffer (and any software project manager worth his salt is intimately aware that the lion's share of the cost of any software system - 70 to 80 percent - goes towards maintenance). There is nothing more easily absorbed by a humanoid, more efficiently absorbed if you will, than those things which please the soul.

To succeed in doing this, even on the smallest of scales, is I think the greatest source of wonder and of pleasure that one can get in programming. To create something that not only creates measurable utility in the world but also creates immeasurable beauty, without sacrificing either for the sake of the other, well there is something almost miraculous about that. Which is why I feel so fortunate to be able to do this thing called computer programming.

And why I'm off tomorrow to the basketball courts to practice my sky hook. I don't really know exactly how this perfect union of form and function is achieved, in any sort of predictably repeatable fashion. But I have this feeling that there is some ineffable lesson in how to do it that I might absorb by practicing my hook shot.

B-ball anyone?

Wednesday, February 8, 2006

RDF as a Quantum Model of Data

RDF as a model of data, with certain conditions, looks a lot like the quantum mechanical view of matter. It may be useful to conceptualize RDF as a quantum model of data, as radical a shift from the currently dominant relational model of data as quantum mechanics was from Newtonian mechanics. The "classical" relational model of data, like classical Newtonian mechanics, breaks down at scale. Like quantum mechanics, RDF provides a more useful model when we examine smaller basic units of data.

Let me see if I can explain the correspondences between RDF and quantum mechanics.

The key correspondence between RDF and quantum mechanics that I'll describe here is the notion of the indeterminacy of any property, otherwise known as the uncertainty principle. In quantum mechanics, the notion that any observable property of a particle has a definite value is replaced with a probability distribution of possible values for that property upon observation. Different interpretations of that probability distribution, and of what happens upon observation, lead to different philosophical conceptions of the nature of reality. In the Copenhagen interpretation, which gives special emphasis to the role of the observer, the particle does not "exist" until it is observed - there is no objective physical reality. In the many-worlds interpretation, the particle "exists" in all possible states at once - each property simultaneously has all possible values - and the act of observation splits the history of the observer into alternate histories corresponding to every possible state of the particle. The many-worlds interpretation takes the notion of a quantum superposition of possible states quite literally; the Copenhagen interpretation places the superposition in a kind of meta-reality outside of time and space and posits that the act of observation is required to pluck particular values out of that meta-reality. Each of these interpretations has a rough analogue in RDF, as well.

Let's use an example to illustrate. Let's use something analogous to the position of a particle, but a bit closer to a real-world application. Assume we have some customer address data. For example, for each customer we have data about what country they reside in, as their primary residence. We'll call this property "country of residence". In a relational database, the smallest recordable unit is a tuple, which is a member of a set called a relation. This is otherwise known as a record in a table, which is the language I'll use from now on. So in our relational database, we might have a table called "Customer" with two columns: a "customer_id", which will serve as the primary key, and a "country_of_residence" column. Our customer table, with a record of the fact that Customer 123 lives in the US might look like this:

| customer_id | country_of_residence |
| 123 | US |

In an RDF datastore, the smallest recordable unit is the statement. Every statement is a 3-tuple, or a triple, which specifies the value of a particular property for a particular thing. Thus every statement in RDF is implicitly a member of some binary relation; however those relations need not be explicitly defined. So in our RDF datastore, we would simply have a statement that says:

<customer_123> <country_of_residence> 'US'

But it turns out that our dataset is really big. It has not only been accumulated over a long period of time, but has been culled from many sources. There are discrepancies everywhere. Many customers have multiple primary countries of residence. For instance, we have another piece of data that says that Customer 123's primary residence is Canada. This is a contradiction if we agree to take the semantics of the word "primary" to imply exclusiveness and, in fact, the way we've modeled it in our relational database does imply that. If we want to add another record indicating that Customer 123 lives in Canada to the Customer table we can't - there'll be a primary key violation. We have a choice to make. Decide if Canada is the "right" value, and update the existing record if so, or decide that US is the right value and discard the Canada information. The relational model only accepts one version of the "truth". (If we do update the record, depending on the database implementation, we may be able to find in the logs somewhere the fact that the Customer 123 lived in the US at some point in time, but that data is effectively in no-man's land now. From this we can see that relational databases are not designed to handle "versioning" of data.)
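
To make the primary key problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database and the exact column types are my own assumptions for illustration; only the table and column names come from the example above.

```python
import sqlite3

# Hypothetical in-memory schema mirroring the Customer table described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customer ("
    " customer_id INTEGER PRIMARY KEY,"
    " country_of_residence TEXT NOT NULL)"
)
conn.execute("INSERT INTO Customer VALUES (123, 'US')")

# A second record for the same customer is rejected by the primary key.
try:
    conn.execute("INSERT INTO Customer VALUES (123, 'CA')")
    second_insert_ok = True
except sqlite3.IntegrityError:
    second_insert_ok = False

# The only way to record Canada is to overwrite the US value.
conn.execute(
    "UPDATE Customer SET country_of_residence = 'CA' WHERE customer_id = 123"
)
rows = conn.execute("SELECT * FROM Customer").fetchall()
print(second_insert_ok, rows)
```

After the update, the table holds a single row for Customer 123 and the US value is gone, which is the "one version of the truth" behavior described above.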

In our RDF store, there is no problem. Simply add another statement saying that Customer 123's primary country of residence is Canada. It's just another statement. In fact we can have an indefinite number of statements about Customer 123's country of residence, and let's imagine that in our dataset there are lots of them - 110 different statements, in fact. In our dataset, we have 5 different statements saying Customer 123 lives in the US, 10 that say she lives in Canada (CA), 15 that say she lives in Mexico (MX), 20 that say she lives in Argentina (AR), 25 that say she lives in the UK, 20 that say she lives in Germany (DE), 10 that say she lives in Japan (JP), and 5 that say she lives in South Africa (ZA). We add all those statements to our RDF datastore. Now, like Schrödinger's cat, our datastore is at once in multiple contradictory states:

<customer_123> <country_of_residence> 'US'
<customer_123> <country_of_residence> 'CA'
<customer_123> <country_of_residence> 'MX'
<customer_123> <country_of_residence> 'AR'
<customer_123> <country_of_residence> 'UK'
<customer_123> <country_of_residence> 'DE'
<customer_123> <country_of_residence> 'JP'
<customer_123> <country_of_residence> 'ZA'
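
Here is a rough sketch of what just happened, assuming a toy triple store modeled as a plain Python set of (subject, property, value) tuples. This is not a real RDF library, and the names are only illustrative. Because a graph of bare triples behaves like a set, repeated statements collapse into one.

```python
# A toy triple store: a set of (subject, property, value) tuples.
# Illustration only; a real RDF store offers far more than this.
store = set()

# Per-country statement counts taken from the example above.
incoming = {
    "US": 5, "CA": 10, "MX": 15, "AR": 20,
    "UK": 25, "DE": 20, "JP": 10, "ZA": 5,
}

total_added = 0
for country, count in incoming.items():
    for _ in range(count):
        # Adding a contradictory or repeated statement never fails.
        store.add(("customer_123", "country_of_residence", country))
        total_added += 1

# Many statements went in, but only the distinct triples remain.
print(total_added, len(store))
```

Nothing is rejected, but the store ends up holding only the eight distinct triples listed above.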

But we're losing some information modeling the data this way. We've lost the information that "Customer 123 lives in the US" was stated 5 times, that "Customer 123 lives in Canada" was stated 10 times, etc. Furthermore, we've lost the information about which source, or sources, any given statement comes from. Let's also assume that for each fact in our dataset, we also have data telling when each fact was observed - we've lost that, too. There is a way to model this information in RDF: we can store a reification for every statement, wherein each statement itself has certain properties. Let's imagine that we're storing a source and a date for each statement, meaning for each statement we're storing who made the statement and when it was made:

<customer_123> <country_of_residence> 'US' stated by <source_a> at 2005:11:18T07:21:00
<customer_123> <country_of_residence> 'US' stated by <source_b> at 2005:03:30T11:25:00
<customer_123> <country_of_residence> 'US' stated by <source_c> at 2004:10:20T08:19:00
<customer_123> <country_of_residence> 'UK' stated by <source_a> at 2006:01:17T09:30:00

We can now plot a histogram for Customer 123's country of residence:

30 |
   |                      __
20 |                 __  |  |  __
   |            __  |  | |  | |  |
10 |       __  |  | |  | |  | |  |  __
   |  __  |  | |  | |  | |  | |  | |  |  __
 0 |_|__|_|__|_|__|_|__|_|__|_|__|_|__|_|__|_
      US   CA   MX   AR   UK   DE   JP   ZA

What we have essentially is a probability distribution for the possible values of Customer 123's country of residence. As in quantum mechanics, we can now only resort to statistical methods to give any sort of objective answer to the question of what country Customer 123 resides in. We might choose the mode - the UK - as the most reliable value. If we were dealing with another sort of property with an ordering on the range - such as Customer 123's credit score - we might choose a median, or perhaps we'd compute an average for a property with a continuous range - such as her weight. The uncertainty relationship between an average value and a particular value in many cases would be in a way analogous to the uncertainty relationship between position and momentum, or between energy and time, in quantum mechanics. The accuracy in time of a "measured" value for a property would be inversely proportional to its accuracy as a value irrespective of time (an overall average), as the time range sampled is increased or decreased (with a single value at a point in time being the limit of decreasing the range). That is to say, the smaller the chunk of statements for a property that we look at the more accurately it reflects the value of that property at that time, but the less it reflects the "movement", or evolution, of the value over time.
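
The statistical view described above can be sketched in a few lines of Python. The country counts are the ones from the example; the credit scores and weights are made-up values that I've added purely to show the median and the average.

```python
from collections import Counter
from statistics import mean, median

# Statement counts per country, as given in the example.
country_counts = Counter({
    "US": 5, "CA": 10, "MX": 15, "AR": 20,
    "UK": 25, "DE": 20, "JP": 10, "ZA": 5,
})

# Normalize the counts into a probability distribution.
total = sum(country_counts.values())
distribution = {c: n / total for c, n in country_counts.items()}

# The mode is the most frequently stated value.
mode_country, mode_count = country_counts.most_common(1)[0]

# Hypothetical values for an ordered and a continuous property.
credit_scores = [640, 700, 720]
weights_kg = [61.0, 62.0, 63.0]
typical_score = median(credit_scores)
typical_weight = mean(weights_kg)

# A crude text histogram, one row per country.
for country, n in country_counts.items():
    print(f"{country} {'#' * n} {n}")
print("mode:", mode_country, "median score:", typical_score)
```

Which statistic is appropriate depends on the property: a mode for unordered values like countries, a median for ordered ones, an average for continuous ones.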

But there are other ways of dealing with these sorts of ambiguities. Through some deliberate, more subtly defined, act of "observation" we can collapse the possibilities into a few or just one. This is loosely analogous to the "collapse of the wavefunction" in the Copenhagen interpretation of quantum mechanics, which fixes the values of the observable properties of a particle upon observation. We could, for example, in our query against the RDF datastore, specify that only the most recent value should be retrieved, or that only the most recent value from a given source be retrieved, blithely unaware of all the other values. A key difference here from observation in the quantum mechanical sense, though, is that we can define our query to merely reduce the possible values rather than fix it to a single one. For instance we could define our query to get all of the statements made in the past year, and then perhaps take an average, combining techniques to collapse, or reduce, the set of possible values with the application of statistical measures to the resulting probability distribution.
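
As a sketch of these acts of "observation", assume each reified statement is held as a small Python record with a source and a timestamp. The Statement class, the helper functions, and the "as of" date are my own inventions for illustration; the four sample statements are the ones listed earlier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Statement:
    subject: str
    prop: str
    value: str
    source: str
    stated_at: datetime

S, P = "customer_123", "country_of_residence"
statements = [
    Statement(S, P, "US", "source_a", datetime(2005, 11, 18, 7, 21)),
    Statement(S, P, "US", "source_b", datetime(2005, 3, 30, 11, 25)),
    Statement(S, P, "US", "source_c", datetime(2004, 10, 20, 8, 19)),
    Statement(S, P, "UK", "source_a", datetime(2006, 1, 17, 9, 30)),
]

def most_recent(stmts, source=None):
    """Collapse to a single statement: the latest, optionally per source."""
    candidates = [s for s in stmts if source is None or s.source == source]
    return max(candidates, key=lambda s: s.stated_at, default=None)

def stated_since(stmts, cutoff):
    """Reduce, rather than collapse: keep statements made at or after cutoff."""
    return [s for s in stmts if s.stated_at >= cutoff]

latest = most_recent(statements)
latest_from_b = most_recent(statements, source="source_b")

# Assumed "now" for the query; any reference date would do.
as_of = datetime(2006, 2, 8)
past_year = stated_since(statements, as_of - timedelta(days=365))

print(latest.value, latest_from_b.value, [s.value for s in past_year])
```

The first two queries fix a single value, while the third only narrows the set, leaving a smaller distribution to which statistics can still be applied.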

The application has an important role to play here. If the datastore itself is viewed as kind of quantum soup containing all possible states, applications can define queries that select states according to specific rules, in some sense playing the role that consciousness plays in the Copenhagen interpretation of quantum mechanics. The analogy is a bit of a stretch, but you get the picture. Applications are critical in reducing the ambiguity of the base dataset for a user by focusing in on well-defined subsets of the data within a particular usage context. The gap between a statistical view of the data in aggregate form, or only focusing on particular values, is reminiscent of the wave-particle duality in quantum mechanics.

Now an initial gut reaction to the vagueness and "indeterminacy" of this way of modeling data, from those of us used to building systems on relational databases, is to say, "Well that sucks!". But the ability to model this kind of uncertainty is precisely what is needed at scale. The more data, and the more sources of it, that you have, the more likelihood you have of there being different results, differences of opinion, and any otherwise conflicting data. If you measure my weight at three different times, even within the same day, there is a good chance you'll get three different weights. If you ask three different people to provide a rating for the service at some restaurant, you're likely to get even greater discrepancies. Whether the data is subjective or "objective" (some would suggest that no data is objective), the more data you get the less it is possible to definitively identify a single "right" value, and the more important the fuzzy knowledge that statistics provides becomes. Barring the use of statistical methods, you can decide to privilege certain values based on some criteria, such as who provided the data (you may decide to trust the rating of someone you trust over the ratings of others), or when the data was provided (you may decide to trust the most recent weight measurement as the most accurate), making context the determinant of the right data. We can now see how Clay Shirky's principles that "Merges Are Probabilistic", "User and Time are Core Attributes", and "The Filtering is Done Post Hoc", outlined in his brilliant "Ontology is Overrated" essay, can be applied to structured data.

RDF as a quantum model of data is a good fit for Web 2.0 for a number of reasons. If you accept that an important aspect of Web 2.0 is the integration of all web sites into a single global platform for a new breed of applications that combine data and functionality from existing sites in new ways, then you essentially believe that the web is evolving towards one giant database. To integrate all of the structured data contained in the relational database silos that comprise the "backend" of many websites, we need a new kind of relational data model designed to do that, with the simplicity and flexibility to scale. We know that at scale, with data coming from many different sources, uncertainty, indeterminacy, and contradiction are inevitable, and we need a data model flexible enough to accommodate that. Furthermore, if you believe that social aspects of computing are an important theme in Web 2.0, then you need a data model that accommodates multiple points of view simultaneously, that can handle the great diversity of opinion about anything and everything that characterizes the web. Like the multiverse of the many-worlds interpretation of quantum mechanics, different world views must exist side-by-side - the web is a web of webs.

A final note

Some will point out that we could change the design of our relational schema to allow customers to have multiple countries of residence. We could pull the country_of_residence column out into another table, called "Customer_Country_of_Residence" for example (let's abbreviate it as CCR), and create a one-to-many relationship between the Customer table and the CCR table, with the CCR table having a foreign key to the Customer table. (Let's set aside the fact that there should probably have been a "Country" table with the ISO code as its primary key, and a many-to-many relationship between the Customer and Country tables. By the same token, countries should have been modeled as resources in RDF, rather than as literal values, but I wanted to keep the example as simple as possible.) Carrying this further, we could do this for all columns that we would normally put in the Customer table in our relational database, such as name, sex, nationality, etc., pulling each of the columns out into their own tables with a foreign key back to the Customer table. Having done this, we would see that every table in our schema (except for the Customer table itself) is, in fact, a binary relation. Each 2-tuple in every table can be mapped to an RDF triple with the name of the table as the property. Following this sort of schema design philosophy to achieve the kind of flexibility we get with RDF, any n-ary relational data model essentially reduces to a binary relational model resembling RDF.
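
The reduction described above can be sketched with sqlite3. The two-column tables follow the schema in the text; the Customer_Name table, the name 'Alice', and the table_to_triples helper are hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (customer_id INTEGER PRIMARY KEY);
CREATE TABLE Customer_Country_of_Residence (
    customer_id INTEGER NOT NULL REFERENCES Customer(customer_id),
    country TEXT NOT NULL
);
CREATE TABLE Customer_Name (
    customer_id INTEGER NOT NULL REFERENCES Customer(customer_id),
    name TEXT NOT NULL
);
INSERT INTO Customer VALUES (123);
INSERT INTO Customer_Country_of_Residence VALUES (123, 'US');
INSERT INTO Customer_Country_of_Residence VALUES (123, 'CA');
INSERT INTO Customer_Name VALUES (123, 'Alice');
""")

def table_to_triples(conn, table):
    """Map each 2-tuple of a binary table to a triple, using the table
    name as the property. Only pass trusted, hard-coded table names."""
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY rowid").fetchall()
    return [(f"customer_{cid}", table, value) for cid, value in rows]

triples = []
for table in ("Customer_Country_of_Residence", "Customer_Name"):
    triples.extend(table_to_triples(conn, table))

for t in triples:
    print(t)
```

With no primary key on the two-column tables, Customer 123 can now have both 'US' and 'CA' recorded, and every row reads directly as a subject, property, value triple.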

Tuesday, August 23, 2005

Web Databases vs. Web Services/API's

It seems like everyone and their mother is talking about Web 2.0, mash-ups, and Web Services, lately. On the same day that Mike Weiksner posted this article to my "for" bucket, I was reading a BusinessWeek article called "Mix, Match, and Mutate". Today, published an article entitled "From Web page to Web platform". Perhaps the most lyrical and eloquent rhapsody around this idea appeared in the most recent issue of Wired, in an article entitled "We are the Web". The passage begins:

These are safe bets, but they fail to capture the Web's disruptive trajectory. The real transformation under way is more akin to what Sun's John Gage had in mind in 1988 when he famously said, "The network is the computer." He was talking about the company's vision of the thin-client desktop, but his phrase neatly sums up the destiny of the Web: As the OS for a megacomputer that encompasses the Internet, all its services, all peripheral chips and affiliated devices from scanners to satellites, and the billions of human minds entangled in this global network. This gargantuan Machine already exists in a primitive form. In the coming decade, it will evolve into an integral extension not only of our senses and bodies but our minds.

Later he remarks:

By 2015, desktop operating systems will be largely irrelevant. The Web will be the only OS worth coding for.

This vision is similar to previous pipe dreams, like The Intergalactic Foundation (which I, in my college years and fresh-out-of-college years, happened to have been a big believer in), except it doesn't seem like such a pipe dream anymore. The web has taught us a great deal about what is necessary to make a truly "intergalactic" web platform work, and if we look at the evolution of pipe dream towards realistic vision we see a trend towards increasing simplicity of the model. SOAP was a revelation because it looked like CORBA reincarnated on a more lightweight web substrate. In its first incarnation it did not require any specialized software other than a web server and xml parser, which are much easier to come by and simpler beasts than CORBA ORB's. Unfortunately, SOAP seems to be following the path of CORBA in a spiral of increasing complexity towards irrelevance. For that reason, REST appears to be the architecture of choice for these emerging "web service" applications.

The common thread through all of these discussions about distributed computing platforms is the notion of API's, and so the RPC (remote procedure call), in some form, to this day remains the key figure in the vision of Web 2.0. But what I think has been absent from these discussions is consideration of a DBMS for the web. For decades now, some sort of DBMS has served as the backbone for the vast majority of "data-driven" applications, which happen to comprise virtually 100% of corporate IT systems and "business apps". The reason is simple: a standard, consistent, elegant data management platform is not a trivial undertaking, and yet is a requirement for all such applications. For most software developers, developing these applications would be unthinkable without a DBMS, usually an RDBMS.

Databases often serve as an integration point between several applications that share the same data (in fact, this was one of the primary motivations for the development of the first database management systems). Sometimes the quickest way to extend the functionality of an existing application that you've inherited, is to go around the code and look at the database and build a new app directly against that. This is frowned upon but fairly common, in my experience, often because the existing code either doesn't provide an API, per se, or the API is deficient in some way (functionally, or non-functionally). Still, the philosophy that one shouldn't access a database directly, and should go through API's instead, persists and this is still the way many systems are integrated. What are the reasons for this?

Well one reason is that you want to protect your database from "corruption". There are often complex rules surrounding how records get updated that cannot be fully expressed through the "data integrity" machinery of the DBMS, and so some sort of API call (which might be a stored procedure in the RDBMS) backed by code which enforces these rules is required. Furthermore, the space and shape of update operations is usually pretty well understood and to some degree fixed. The application designers can usually map out the majority of useful write operations and provide API calls, or end-user functionality, which accomplish them. Not so with the reading of the data. Application developers often find that users need to be able to generate "reports" about the data that were not foreseen. There are myriad possible ways that a user might want to filter, sort, count, or see relationships amongst the different data elements, and the chances of predicting all of the ones users will want ahead of time are slim. Thus the robust market for reporting and OLAP software that hits the database directly, as well as the trend of building data warehouses - large uber-databases with data culled and integrated from multiple systems across an enterprise, to which OLAP software is then applied.

Another reason for the persistence of this API-oriented thinking, I think, is that there is still engrained in our collective software engineering unconscious this notion of the importance of "encapsulation". We were taught the importance of writing, and writing to, abstract interfaces in our software development, and to treat the implementations of these interfaces as "black boxes" that cannot, and should not, be seen into. It was thought that encapsulation could not only provide greater security, but also prevent users of software libraries from building dependencies in their systems on the parts of the software library most likely to change (the implementations vs. the more stable interfaces), causing the client system to break. While this interface vs. implementation concept has a lot of merit when developing software frameworks, from a practical standpoint its value is negligible in the context of pure read access of data, particularly when the database software and database schema of a production application is the thing least likely to change. Even when the schema does change, this usually requires a change to interfaces representing data anyway since there is usually a straight mapping from database schema to these interfaces. The open-source era has also taught us a lot about the relative value of this black-box notion of software components. Contrary to our prior intuition, in a globally networked environment with constant, instant, and open communication, lots of eyes looking deep into software can increase its safety and reliability. Our ability to respond to changes in software components which break the apps we build on top of them is also enhanced.

A Case Study

Recently, I wrote a Greasemonkey script that reinforced my belief in the need for a web database service for Web 2.0 apps. While it was a fairly trivial script that I wrote simply to tinker around, it highlights some of the shortcomings of a purely API-centric approach to these new cross-web applications. Basically what the script does is replace the photos in the slideshows of city guides on the Yahoo travel site with Flickr photos that are tagged with that city's name and have been flagged by the Flickr system as "interesting".

Well, the first problem is that the Flickr API does not give you a way to retrieve interesting photos. They have a search method that allows you to retrieve photos with the tags you specify, but "interestingness" is some special system attribute which is not modeled as a tag. In a situation like this, where the method hard-codes a limited set of ways in which you can query the data, you're pretty much shit up the creek if you want to query the data in a way that the developers didn't anticipate. You can ask the Flickr development team to provide it, and hope that they honor your request, and implement it within a reasonable timeframe, but your deadline will likely be past by then. Luckily for me, there's a screen I can scrape to grab the photos I need, an inelegant hack that does the job, but which is an ugly solution.

The second problem I had was that I wanted to filter out any photos tagged as "nude", not wanting to offend the users of my script with the sight of unwanted genitalia when they're exploring possible vacation destinations. There is no exclude tag option for the search method, and no easy way to do this. I could if I wanted to, put a loop in my program to repeatedly call the search method (assuming the search method did actually provide an option to specify "interesting" photos), and for each photo in the result page invoke the Flickr service again to find out all that photo's tags and throw it away if it has a "nude" tag, calling the search method repeatedly until I have the number of photos I need to fill in the slide show. Now, it's unlikely that the search method will need to be invoked more than twice, but I have to code for an indefinite number of iterations of this loop cuz I can't know for certain at any time for any given city how many nude photos there will be in the results. And two invocations of the search method is already more than I should have to make. Not only is this solution more work to implement, but it has very unfavorable performance characteristics, and puts unnecessary load on the server. Instead of making one service call over the network, I have to make (N+1)*X calls, where N is the number of results in each page, and X is the number of pages that need to be processed to fill the slide show. In this case, this requirement turned out not to be worth the effort and performance impact it would have, so I let it go.
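
To show where the (N+1)*X figure comes from, here is a small simulation. FakePhotoService is an in-memory stand-in that I made up for this sketch; it is not the Flickr API, and its method names and the sample data are assumptions. Each method call counts as one network request.

```python
class FakePhotoService:
    """In-memory stand-in for a remote photo API (hypothetical)."""

    def __init__(self, photos):
        self.photos = photos  # list of (photo_id, set_of_tags)
        self.calls = 0        # every method call is one "request"

    def search(self, tag, page, per_page):
        self.calls += 1
        matches = [pid for pid, tags in self.photos if tag in tags]
        start = (page - 1) * per_page
        return matches[start:start + per_page]

    def get_tags(self, photo_id):
        self.calls += 1
        return dict(self.photos)[photo_id]


def collect_clean_photos(service, tag, needed, per_page):
    """Page through search results, fetching tags one photo at a time."""
    kept, page = [], 1
    while len(kept) < needed:
        results = service.search(tag, page, per_page)
        if not results:
            break  # ran out of photos before filling the slideshow
        for pid in results:
            if "nude" not in service.get_tags(pid):
                kept.append(pid)
        page += 1
    return kept[:needed]


# Ten photos tagged "paris"; photos 2 and 5 are also tagged "nude".
photos = [(i, {"paris"} | ({"nude"} if i in (2, 5) else set()))
          for i in range(1, 11)]
service = FakePhotoService(photos)
kept = collect_clean_photos(service, "paris", needed=5, per_page=5)
print(kept, service.calls)
```

With N = 5 results per page and X = 2 pages needed, the loop makes (5+1)*2 = 12 requests to fill a five-photo slideshow.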

The third problem I encountered was a consequence of the screen scraping approach I was forced to take. I wanted to display the title of each photo, just like the default Yahoo slideshow does. The search method of the Flickr API returns the title of each photo in the results, but unfortunately the screen that shows a page of "interesting" photos with a given tag does not. If I want to display the title of each photo in the slideshow, I have the same (N+1)*X problem I have with wanting to filter out nude photos; I'd have to make a separate call to get the title for each photo in the page. This was not such an easy requirement to let go of, so we're forced to pay the performance penalty.

Now this was a very small script with very limited functionality, but you can see the issues that crop up when you want to build a real-world web app using a purely API-based approach. It is not possible to approximate the power of a full relational/pattern-matching calculus, the kind that is approximated with a typical database query language like SQL, with a set of name-value pairs, which is what the input to a method/REST-endpoint essentially is (the usual way around this is to allow one of the name-value pairs to represent a query that gets executed directly against the database; this is nothing more than proxying the DB query interface through the method call). It is also generally much more efficient to look at a diagram of a data model to figure out what query to run against a database than it is to read a functional API spec to figure out how to orchestrate a set of API calls to accomplish what one query could.

We need a WDBMS (Web Database Management System) or WDBS (Web Database Service)

I say, let's use API's when appropriate (for most write access to data), and give access to DBMS query interfaces when appropriate (which is often the case for read access to rich data repositories). We have a good architecture for Web Services/API's, which is proving itself in real and prominent (press-worthy, at least) apps, in REST. Where's our web database architecture, which can complement REST in its simplicity and ability to scale to a global level? Well, as I've expounded on in previous posts, I think RDF is it.

Another point to consider is that as these mash-ups get more sophisticated they will no longer be pure mash-ups. Instead of merely exploiting existing relationships between data in different web sites, they will allow for the creation and storage of new relationships amongst data that is globally distributed across the web. These applications will need to have write access to their own databases, built on DBMS's designed for the web.

Designed for the web, these databases should be available as online services that can be accessed over the web. There should be a consistent serialization defined from an arbitrary dataset to an "on-the-wire" transport format in the lingua franca of the web - XML - which RDF provides, or alternatively into another web format that is simpler and better - JSON (this simple requirement could naively have been achieved by storing your data as XML with some sort of XML database technology, but XML has many problems as a data model, not the least of which being that it violates the KISS principle). Physically, they should look like the web, with a similar topology and the ability to be massively distributed and decentralized, with distributed query mechanisms that can work in a peer-to-peer fashion. As the data substrate underpinning the sophisticated mash-ups of the future, I see them filling in what might be viewed as the currently "negative space" of the web, the gaps between web sites. I can see these kinds of database services really coming into their own serving as data hubs between multiple sites.

As an experiment, I will be putting a stab at such a WDBS online in the near future: a web app that I'm putting together using Kowari's RDF database engine. It will be available for free use by mash-up experimentalists who just have a Mozilla browser with Greasemonkey at their disposal, and need some place online to store their data. More news on that coming up ...

Monday, August 22, 2005

The Web Database

there are many who have traced the history of database management systems, in particular the Great Debate between the network model and the relational model - embodied in their key proponents, Charles Bachman and E.F. Codd, respectively - and note that if there are any purely technical factors that contributed to the relational model's triumph over the network model it would be that the relational was simpler. not only were network databases more complex to manage from an administrative perspective, but from a user standpoint querying network databases was complex and error-prone because developers of the network model were never able to devise a simple declarative query language, having to rely on procedural devices like goto's and cursors and requiring an intimate low-level knowledge of the physical data structures by the user. some relational purists will argue that the relational model's solid mathematical foundation was the source of its technical superiority, but from a pragmatic perspective its grounding in predicate calculus was only important insofar as it simplified the problems of storing and accessing data.

we see the idea of simplicity appearing over and over again when we analyze the advantages of various successful models and systems over their competitors/predecessors. HTML vs. SGML. REST vs. SOAP. Hibernate over EJB and Spring over J2EE. Extreme Programming's KISS philosophy and the New Jersey approach to design. Capitalism vs. Communism. hell, even Nike is going barefoot these days, and in the world of organized violence the paring down of "barred" holds and the mixing of styles is all the rage. common to all of these frameworks is the greater flexibility and creative freedom to allow human ingenuity its fullest expression. when the prime value of the global network that all of our lives are being woven deeper and deeper into is the aggregation and multiplication of human capital, i think that it's no accident that models which release human capabilities are gaining more and more prominence over those that attempt to control them.

what many people fail to realize about the RDF model of data is that it is a simpler and more general model of data than anything that has come before it. not RDF with schemas and ontologies and all that jazz. that's actually more complex than anything that has come before it. i'm talking about basic RDF. designed originally as a data model for the web, one key requirement had to be met: that any data anywhere on the globe, whether it be in relational databases, network databases, flat files, or what have you, could be mapped to it. consequently, what was produced was a kind of lowest common denominator of data models. a key concept here is that of the fundamental, irreducible unit of data as the simplest kind of statement (or, more precisely, in the language of mathematics: a binary relation). even C.J. Date - arguably second only to Codd as an authority on the relational model - acknowledged in a recent comment on "relational binary database design" that there is an argument for the binary relation being an irreducible unit out of which the n-ary relations which relational theory deals with can be composed. in his comment, he describes how a ternary (3 column) relation can be composed by "joining" 2 binary relations. by breaking down the nature of data into something a bit more granular to manipulate we gain a power and flexibility not unlike that envisioned by Bill Joy when he waxes philosophic about nanotechnology and its promise of the ability to create any physical thing by manipulating granular components of matter. indeed much of the progress in our understanding of matter has been driven by successive discoveries of increasingly more granular, or atomic, units of matter.
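as a rough illustration of that composition, here is a minimal sketch in JavaScript. the relation names and the data are invented for the example, and `joinOnSubject` and `project` are hypothetical helpers, not anything taken from Date's comment: two binary relations that share a subject are joined into one ternary relation, and projecting the result gives the binary relations back.

```javascript
// Two binary relations, each a list of [subject, value] pairs (invented data).
const livesIn = [["alice", "paris"], ["bob", "tokyo"]];
const worksFor = [["alice", "acme"], ["bob", "initech"]];

// Natural join on the shared first element: two binary relations in,
// one ternary (3-column) relation out.
function joinOnSubject(left, right) {
  const rows = [];
  for (const [s1, a] of left) {
    for (const [s2, b] of right) {
      if (s1 === s2) rows.push([s1, a, b]);
    }
  }
  return rows;
}

// Projection onto two columns recovers a binary relation from the ternary one.
function project(rows, i, j) {
  return rows.map(row => [row[i], row[j]]);
}

const person = joinOnSubject(livesIn, worksFor);
console.log(person);
// [ [ 'alice', 'paris', 'acme' ], [ 'bob', 'tokyo', 'initech' ] ]

console.log(project(person, 0, 1)); // same pairs as livesIn
```

the round trip only works this cleanly because every subject appears exactly once in both relations; a subject missing from one side would simply drop out of the join.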

"No tuples barred" data kung-fu

there's another aspect of RDF that has practical consequences that make it a good fit for the web: its "self-describing" nature. this aspect of RDF is not just something that was artificially designed in or layered on; it follows quite naturally from its reductionist foundations. since we effectively use the irreducible binary relation as a kind of building block to compose larger types of relations, each irreducible binary relation must have an independent existence apart from the compositional relationships it participates in. it must have a global identifier to be independently recognizable by the system. when the most granular components of even the most complex dynamic aggregations of data are identifiable as individuals with an independent existence, the effect is that the data becomes self-describing. contrast that with the relational model wherein columns are defined relative to a relation. columns cannot be said to exist independent of some relation of which they are a part.

when data is self-describing, schema becomes inessential. there are no RDBMS's that I'm aware of that allow data to be created that does not conform to some pre-defined schema. XML, on the other hand, another self-describing data format, does not require a schema to exist before you can create a valid XML document. while schema may be useful for enforcing/confirming some kind of organization of the data, it is not essential to the creation and manipulation of data.

this allows you to have a database that does not require the kind of bureaucratic planning that the database modeling exercise in a large organization can devolve into before being put into action. if it were a relational database, it would be as if there were no conceivable tuple barred from creation. it allows a level of responsiveness and agility in reacting to problems and creating solutions that simply isn't possible with today's RDBMS technology, and with the bureaucracy that has developed in many corporate IT departments around the administration and development of such database systems.

such a system would be much like a database created in Prolog (which almost certainly had an influence on the design of RDF due to its early "knowledge representation" aspirations). in Prolog you can assert any fact, i.e. make any statement that you want without having the predicates predefined. any kind of higher-order structure or logic that exists among the facts, such as a graph connecting a set of binary relations, is an emergent property of a dataset that can be discovered through inference, but is never explicitly defined anywhere in the system. while some sort of schemata may serve as a guide to a user entering facts and rules in a Prolog database, Prolog is not aware of it, and has no way of enforcing it. this is much the way that the human brain, indeed matter itself, works. while it's possible at higher levels of organization for both the brain and matter to create rigid molds into which things that don't fit the mold are not accepted, they don't fundamentally work this way. by the same token, it is possible to create RDF systems that rigidly enforce RDF schemas and ontologies, but i wouldn't recommend it. the bigger your world gets the more flexibility you want. as your horizon expands, it becomes increasingly difficult to define a single schema that fits all data, and the web is about as big a data universe as you can get. the simpler model scales better.
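a minimal sketch of what "no tuples barred" could look like in code, written in JavaScript. `FactStore`, `assertFact`, `match` and `predicates` are hypothetical names made up for this illustration, not the API of Prolog, Kowari, or any RDF library, and the identifiers are invented example data. the store accepts any statement without predicates being declared first, and the set of predicates is something you discover from the data afterwards.

```javascript
// A toy schemaless fact store (hypothetical; not any real RDF or Prolog API).
class FactStore {
  constructor() {
    this.facts = [];
  }

  // Any (subject, predicate, object) statement is accepted; nothing is predefined.
  assertFact(subject, predicate, object) {
    this.facts.push({ subject, predicate, object });
  }

  // Pattern match: keys left out of the pattern (or set to undefined) are wildcards.
  match(pattern) {
    return this.facts.filter(fact =>
      Object.keys(pattern).every(
        key => pattern[key] === undefined || fact[key] === pattern[key]
      )
    );
  }

  // The "schema" is emergent: it is read off the data, never declared up front.
  predicates() {
    return [...new Set(this.facts.map(fact => fact.predicate))];
  }
}

const store = new FactStore();
store.assertFact("http://example.org/alice", "knows", "http://example.org/bob");
store.assertFact("http://example.org/alice", "shoeSize", 9);
store.assertFact("http://example.org/bob", "livesIn", "Tokyo");

console.log(store.predicates());
// [ 'knows', 'shoeSize', 'livesIn' ]

console.log(store.match({ subject: "http://example.org/alice" }).length); // 2
```

a schema could still be layered on top as a check that runs over `store.facts`, but nothing about creating or querying the data depends on it.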

a recent article in HBS Working Knowledge, entitled "How Toyota and Linux Keep Collaboration Simple", describes how "The Toyota and Linux communities illustrate time-tested techniques for collaboration under pressure". the article makes the point that both groups follow a minimalist philosophy of using the simplest, most widely available technologies to enable far-flung groups to collaborate. a minimalist, widely available database technology (i.e. available as a service over HTTP) could allow a kind of real-time programming capability to rapidly create programs that allow collaborators across different organizations to analyze and attack novel problems with unique data patterns in near real-time. the web database should be like a CVS for data, allowing programmers to work in parallel with different representations of data and to merge those representations, in much the way source code version control systems allow different representations of program logic to be worked on in parallel, and merged. like CVS it should provide a lineage of the changes made to those representations allowing them to be "rolled back" if necessary, giving coders the confidence to move forward quickly pursuing a path, knowing that it will be easy to backtrack if necessary. it would be the perfect database technology for agile development, founded on the Jeet Kune Do of data models:
JKD advocates taking techniques from any martial art; the trapping and short-range punches of Wing Chun, the kicks of northern Chinese styles as well as Savate, the footwork found in Western fencing and the techniques of Western boxing, for example. Bruce Lee stated that his concept is not an "adding to" of more and more things on top of each other to form a system, but rather, a winnowing out. The metaphor Lee borrowed from Chan Buddhism was of constantly filling a cup with water, and then emptying it, used for describing Lee's philosophy of "casting off what is useless."

The best of all worlds

recently i came across this interview in 2003 with Don Chamberlin, co-inventor of SQL. nowadays, he spends his time working out a query language for XML and thinking about how to unify structured data and unstructured data under one model, and the integration of heterogeneous data with self-describing data models (the latter is exactly what RDF is a good simple solution for, and XML isn't). it ends with some interesting quotes by Mr. Chamberlin:

Chamberlin: Well, you know I've thought about it, and I think the world needs a new query language every 25 years. Seriously, it's very gratifying to be able to go through two of these cycles. DB2 will support SQL and XQuery as sort of co-equals, and that's the right approach. It embodies the information integration idea that we are trying to accomplish.

Haderle: And do you think that, given the Internet's predominantly pointer-based navigation, that Charles Bachman [originator of the network database model] is thinking, "I finally won out over relational?"

Chamberlin: Well, there are a lot of hyperlinks in the world, aren't there? I have a talk, "A Brief History of Data," that I often give at universities. And in this talk, I refer to the Web as "Bachman's Revenge."

Haderle: I know that the IMS guys are saying, "I told you so."

so are we ready for a new data model? is the web indeed "Bachman's Revenge", and will the new data model be really a return to something old? in some ways, yes. the web, and RDF, do superficially resemble the hyperspace of Bachman's network data model. the hyperlink is a binary relation between two nodes, and both the network data model and RDF are based conceptually, to some extent, on a graph model of data. this is directly attributable to the binary relation's fundamental role in graph theory. but RDF is also fundamentally different. in Bachman's network model it was "records" that were hyperlinked. these records looked more like the n-ary relations of the relational world (though they were never rigorously and formally defined as such). thus, there was a fundamental inconsistency in the network data model. in RDF, all data is modeled as binary relations, and thus all data is "in the graph". thus, all data in an RDF model is at once amenable to the kind of rigorous mathematical analysis and logical inference that the relational model is, and also mappable to a graph (a labeled directed graph, to be more exact). add to that basic structure a self-describing format, and the result is a model of data that achieves an elegance, simplicity, and flexibility that Bachman's model never did, making it a beautiful fit for the web.

in much the same way that the strength of RDF as a universal data model seems to be a result of it being a simplification and distillation of the essence of other models of data, with more dynamism and flexibility, the success of Java was driven in its early days by it being in some sense a distillation of the essence of other popular programming languages and platforms, that was simpler than any of the existing programming languages and platforms - a lowest common denominator that held the promise of portability across all platforms.

Back to the basics ...

so what i'm advocating, in part to help clear up the noise and confusion surrounding this technology, and partly to focus resources where they would reap the most value at this yet early stage in its evolution, is a focus on a simpler RDF. i'm more interested in an RDF--, than an RDF++. the reason the web took off was because it was so simple to use. anyone could write an HTML page. the hyperlink is the most basic and intuitively graspable data structure one could imagine. RDF, in its basic form, doesn't really do much more than add a label to that link, introduce a new kind of node - a literal, and a powerful query language against this network of nodes with labeled links. RDF has yet to "take off". let's wait till that happens and it gains some real traction before we start over-engineering it. let's see how we can cope without schemas and ontologies. let's see if the self-organizing nature of the web will allow us to get away without them. then maybe we'll discover that it's possible to start integrating the world's data on a grand scale.

Tuesday, July 26, 2005

JavaScript and RDF - (almost) perfect together

JavaScript and RDF. a match made in heaven. or perhaps, on earth, rather. what do i mean by that? well let me explain.

the match between JavaScript and RDF, not being forged in heaven could never be perfect. it is a fine match, nonetheless. and we gain much if we remember that there is no perfection down here on earth. many of us share the continual experience that the more data we accumulate, and the more perspectives we acquire, the less crisp and clean do the lines of any theories we hold appear to be. the boundaries drawn by our theories are constantly being scratched out, and redrawn, as we learn more, and for some of us the lines look more like blurry smudges than sharp lines. fine, you say, but what does any of this have to do with JavaScript and RDF? what does an age-old antagonism between Platonic idealism and Epicurean empiricism have to do with RDF and JavaScript?

today we live in a world with ever more digital data from an ever increasing number of sources, and a world in which all of this data is ever more connected via the web. information technology, no longer controlled by an ordained elite with the power to control by whom, how, and wherefore information is created, processed, and distributed, is now largely in the hands of "the people" who are now using the means at their disposal to create massive amounts of data with an unprecedented level of freedom and ease, driving unprecedented levels of creativity and innovation, as well as noise. several important open standards for how this data is represented and distributed have been critical in enabling this tidal wave of information to set forth - TCP/IP, HTTP, and HTML being chief among them. the philosophy of "open source" computer code has been important, as well.

okay, we know all this, i hear you saying. get to the point, you say. we're gettin there ...

by and large the data in this tidal wave is unstructured. HTML being in large part a standard for marking up unstructured text, this makes sense. while Google does an admirable job of helping you harvest this sea of unstructured data, it can't help you with all that structured data out there, much of which is locked up in relational databases behind firewalls, only presented to the outside world in chopped up, regurgitated, mixed-with-HTML form. what's missing is a standard for structured data that will scale to the broad, decentralized, and open nature of the web. old models of data that worked well within isolated, well-controlled domains will not scale to meet the requirements of a massive, global web of data.

but i misspoke. we do have such a model of data, and anyone interested enough to read this far probably knows what I'm about to say: RDF. in RDF, everything has an identifier, called a URI, which is global in scope. more importantly, RDF's structural properties give it the flexibility to accommodate all of the world's structured data in one big structured database - the fabled "Semantic Web", that could be queried with a language that is as powerful as SQL is for relational databases. don't underestimate the gravity and presumption of this statement. all of the data now locked up in relational database silos, and in non-relational ones, with the great multitude of world views, concepts, and prejudices that the schemas underlying those databases embody, could be united into one giant database. and then, at any time, anything, anywhere, could be related to anything anywhere else in the world, in any way, by merely creating a labeled pointer, and then a query involving the relationship between these two things could be executed. the phrases "at any time" and "in any way" are key here. in RDF the relationships are dynamic, rather than being predefined by a schema as they are in the relational world.

"wow - data integration nirvana!", some who have worked in enterprise data integration might say. but then they would scratch their heads and say, "it's not so simple as that". there are all kinds of issues surrounding how data from different sources was modeled, the meanings of the different fields and tables and such, formatting issues, and all that dirty data out there. but this would only underscore RDF's unique potential as a model of structured data for the web. these sorts of problems have perennially plagued those working in the trenches of enterprise data integration efforts. many of these problems are in large part due to the fact that there is no perfect schema; the corporate data model is a myth; or as clay shirky would say: "ontologies are overrated". and rather than going away, these problems are only magnified exponentially when you scale out to the web. the genius of RDF is that it doesn't see resolving all of these "ontological" issues as a prerequisite for integration (that is, unless you're in the ontology-oriented RDF camp, in which case you see the use of ontologies modeled in languages like OWL as a key component of the semantic web. i actually believe that the dissonance in the discourse about RDF and the semantic web, between discussions of its fundamental flexibility on the one hand and very esoteric discussions about ontologies on the other, is largely responsible for the confusion surrounding it, and for how slow RDF has been on the uptake). we can unify and connect all of the world's structured data even though it's all quite messy, complicated, and multi-faceted. and even as there is ever more data produced, and the lines we draw in the data are continually erased and redrawn, RDF accommodates all of this roiling diversity, change, instability, and uncertainty quite well.

ok, rather than trying to drive the point home any further, i'm going to assume that you're with me on the notion that RDF, with its inherent flexibility is an ideal data platform for the web. that you get how rather than requiring the kind of Platonic purity of forms that the relational paradigm implies, it allows for a more organic, florescence of structured data. and i'll take it for granted that you think this is a good thing, a worthy thing. so what of JavaScript? it's just some scripting language used to spice up HTML and make web pages more flashy, right? HA! that's what they used to say about Java in the early days, before folks started realizing its potential ...

the seed of my sense of the affinity between RDF and JavaScript was planted when I was working on an RDF project at my last company. one of my colleagues jokingly labelled my goal of spreading RDF as "hashmaps everywhere". i laughed at the truth embedded in that joke, but i wasn't fully aware of how true it was. for those of you who don't know, hashmaps are a widely used implementation of the Map interface in the Java programming language. maps are otherwise known as "associative arrays", "hashes", or "dictionaries" in other languages. in a very real way, the RDF model of data could be described as interlinked associative arrays. this simplification and reduction to something akin to an essence of RDF was in the back of my mind months later, when I was working on an AJAX application, using JSON as a data interchange format. prior to this, i had never looked too deeply into JavaScript, but the similarities between RDF and JSON were apparent. both are a very general, minimalist means of representing data, with simplicity being a primary virtue. both can be modeled very simply as sets of connected associative arrays, with the distinction that JSON is more suitable for representing tree-like sets of data, than a global graph of data. in essence, JSON - which is essentially a serialization of JavaScript's object model - is very suitable for representing localized subsets of the uber-graph of data - "the semantic web" - represented in RDF. in fact, in JavaScript an object is an associative array; therefore the properties of any object are completely dynamic.
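to make "interlinked associative arrays" a bit more tangible, here is a minimal sketch. the `ex:` identifiers and the helper names `toMaps` and `toTree` are invented for the example, and it assumes each property has a single value, which real RDF does not require. a handful of triples becomes one associative array per subject, and a tree-shaped slice of that graph, cut off at a chosen depth so cycles don't get in the way, is serialized as JSON.

```javascript
// Invented example triples: [subject, property, value]. Note the cycle:
// alice knows bob, and bob knows alice.
const triples = [
  ["ex:alice", "name", "Alice"],
  ["ex:alice", "knows", "ex:bob"],
  ["ex:bob", "name", "Bob"],
  ["ex:bob", "knows", "ex:alice"]
];

// One associative array per subject (assumes single-valued properties).
function toMaps(allTriples) {
  const nodes = {};
  for (const [subject, property, value] of allTriples) {
    if (!nodes[subject]) nodes[subject] = {};
    nodes[subject][property] = value;
  }
  return nodes;
}

// Cuts a tree out of the graph: links are expanded into nested objects
// until the depth budget runs out, after which only the identifier is kept.
function toTree(nodes, id, depth) {
  const tree = { id };
  for (const [property, value] of Object.entries(nodes[id] || {})) {
    const isLink = Object.prototype.hasOwnProperty.call(nodes, value);
    tree[property] = isLink && depth > 0 ? toTree(nodes, value, depth - 1) : value;
  }
  return tree;
}

const nodes = toMaps(triples);
const json = JSON.stringify(toTree(nodes, "ex:alice", 1));
console.log(json);
// {"id":"ex:alice","name":"Alice","knows":{"id":"ex:bob","name":"Bob","knows":"ex:alice"}}
```

the graph as a whole is cyclic and could not be serialized as JSON directly; the depth cut-off is what turns a local neighbourhood of it into a tree.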

JavaScript is a prototype-based programming language. in traditional object-oriented programming languages, you need to define a class model, sometimes called an object model, for your data. class models, like RDBMS schemas, are essentially ontologies, and define a narrow, prescriptive container for your data. anything that doesn't fit within the model isn't allowed. the assumption in early waterfall models of software development is that you create the perfect model for your data upfront, and then design your programs around that assumption of perfectness.

of course, the class model is rarely perfect and often changes. iterative development styles and refactoring techniques arose to address this reality. more recently, reflection-based techniques and dynamic byte-code manipulation are the rage, allowing for programs that are more robust and flexible in the face of variability in class structures. but these techniques are rather cumbersome to use, and seem like a big ugly patch on a language that is fundamentally statically typed. prototype-based languages, on the other hand, start out with the assumption that you cannot predefine a perfect class model. there are no classes of data, only instances. some of those instances may serve as prototypes for other instances, but by and large the language is much more empirically oriented than formally oriented.
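a small sketch of what that looks like in JavaScript, using the standard `Object.create`. the objects and property names are invented for the example: there is no class anywhere, one instance serves as the prototype of another, and properties nobody planned for can be added to any object at any time.

```javascript
// An ordinary object; no class is declared anywhere.
const photo = {
  title: "untitled",
  owner: "anonymous",
  describe() {
    return this.title + " by " + this.owner;
  }
};

// A new instance that uses the existing instance as its prototype.
const sunset = Object.create(photo);
sunset.title = "sunset";      // overrides the inherited value
sunset.location = "paris";    // a property the prototype never anticipated

console.log(sunset.describe());                       // "sunset by anonymous"
console.log(Object.getPrototypeOf(sunset) === photo); // true

// The prototype itself can change later; existing instances see the change.
photo.shout = function () {
  return this.title.toUpperCase() + "!";
};
console.log(sunset.shout()); // "SUNSET!"

// The instance only stores what is actually its own.
console.log(Object.keys(sunset)); // [ 'title', 'location' ]
```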

and so, with JavaScript, you have for your application tier, what you have with RDF, for your data tier. a programming model that is built to accommodate a world of data and function most of which does not fit nicely into clean Platonic shapes, that is more interested in accommodating whatever you throw at it than being a tool for designing the perfect glove. a match made in heaven. oops, i mean on earth.

i think it is no mere coincidence that RDF and JavaScript are both relatively young technologies, both having arisen after the rise of the web. they are both a product of the times, in which change is increasingly rapid, time increasingly scarce, data increasingly abundant and interconnected, and knowledge, or understanding of the data, decreasingly perfect. now i realize that JavaScript has heretofore been relegated largely to cosmetic client-side web page enhancements, and has made virtually no inroads into the server side where most of the meat of applications today is considered to reside (Netscape's failed LiveWire technology notwithstanding). but there are new projects that are reviving the concept of JavaScript on the server side, and with the emergence of the AJAX web programming model we should be seeing more intelligence moving to the client side.

so what is my vision of RDF, JavaScript, and the web of the future? well, i'm not quite sure, but it involves web apps with lots of JavaScript manipulating RDF, that is shuffled around in JSON format. and, somehow, the art of programming starts to look more like jazz. but more on that in a future post ...

Wednesday, May 18, 2005

Iron Chef Conspiracy Theory

am i the only one who, after observing the tasting panel's reactions to both chefs' dishes, is at times completely baffled at the moment the winner of the Iron Chef challenge is announced? i mean, sometimes, the panel's reactions are absolutely sizzling about Chef A, and just sort of mild about Chef B (e.g., Bobby Flay), and somehow when they come back after the commercial break they've forgotten about how awesome Chef A's food was, and Chef B (e.g., Bobby Flay) wins! and then your jaw drops to the floor and you have to start wondering if something fishy isn't going on.

don't get me wrong. my beef isn't with Chef B. i've warmed up to him since the time i met him in the local Whole Foods Market once, randomly skewering him when I recognized him in the aisles. he was actually nice about it. that automatically makes him the only celebrity chef for whom extra signals of affection light up in the brain when i see him on television. but there are a couple of egregious examples i can remember with Chef B, and he hasn't even been doing this Iron Chef thing that long. and i remember at the beginning of last night's show one of the panelists making some remark about how hot Chef B was. i wonder how often that panelist is looking for a good seat at one of B.'s restaurants here in New York. the point totals were close. i bet she put him over the top - score for Chef B.!

anyways, i'm still gonna watch the show because it truly is inspiring. if that kind of creativity and devotion were applied to more areas of our life, how delicious would life be?

Wednesday, May 11, 2005

My first blog

This is my first blog ever. Hopefully future posts will contain more interesting announcements than this one.