Raw Thought

by Aaron Swartz

The Techniques of Mass Collaboration: A Third Way Out

I’m not the first to suggest that the Internet could be used for bringing users together to build grand databases. The most famous example is the Semantic Web project (where, in full disclosure, I worked for several years). The project, spearheaded by Tim Berners-Lee, inventor of the Web, proposed to extend the working model of the Web to more structured data, so that instead of simply publishing text web pages, users could publish their own databases, which could be aggregated by search engines like Google into major resources.

The Semantic Web project has received an enormous amount of criticism, much (in my view) rooted in misunderstandings, but much legitimate as well. In the news today is just the most recent example, in which famed computer scientist turned Google executive Peter Norvig challenged Tim Berners-Lee on the subject at a conference.

The confrontation symbolizes the (at least imagined) standard debate on the subject, which Mark Pilgrim termed million dollar markup versus million dollar code. Berners-Lee’s W3C, the supposed proponent of million dollar markup, argues that users should publish documents that state in special languages that computers can process exactly what they want to say. Meanwhile Google, the supposed proponent of million dollar code, thinks this is an impractical fantasy, and that the only way forward is to write more advanced software to try to extract the meaning from the messes that users will inevitably create.[^1]

[^1]: I say supposed because although this is typically how the debate is seen, I don’t think either the W3C or Google actually hold the strict positions on the subject typically ascribed to them. Nonetheless, the question is real and it’s convenient to consider the strongest forms of the positions.

But yesterday I suggested what might be thought of as a third way out; one Pilgrim might call million dollar users. Both the code and the markup positions make the assumption that users will be publishing their own work on their own websites and thus we’ll need some way of reconciling it. But Wikipedia points to a different model, where all the users come to one website, where the interface for inputting data in the proper format is clear and unambiguous, and the users can work together to resolve any conflicts that may come up.
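To make the contrast concrete, here is a minimal sketch of the "million dollar users" model, written as an illustration rather than a description of how Wikipedia or any real site is built. Everything in it is an assumption made for the example: the `CentralSite` class, the `REQUIRED_FIELDS` schema, and the sample records are all hypothetical. It shows the two properties described above: the format is enforced at the point of input, and disagreements are queued for the users to settle instead of being reconciled by markup or by clever code.

```python
from dataclasses import dataclass, field

# Hypothetical schema: every record on the central site must supply these fields.
REQUIRED_FIELDS = ("artist", "title", "year")


@dataclass
class CentralSite:
    """One shared store that every contributor writes to (illustrative only)."""
    records: dict = field(default_factory=dict)    # key -> accepted record
    conflicts: list = field(default_factory=list)  # (key, current, proposed, user)

    def submit(self, user, key, record):
        # The input form enforces the format up front, so nothing has to be
        # guessed at or reconciled after publication.
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        current = self.records.get(key)
        if current is not None and current != record:
            # A disagreement is queued for contributors to settle together.
            self.conflicts.append((key, current, record, user))
            return "conflict"
        self.records[key] = record
        return "accepted"

    def resolve(self, index, keep_proposed):
        # A contributor settles one queued conflict, choosing which version stays.
        key, current, proposed, _user = self.conflicts.pop(index)
        self.records[key] = proposed if keep_proposed else current
        return self.records[key]


# Two users disagree about a release year; a third party settles it.
site = CentralSite()
first = site.submit("alice", "album-1",
                    {"artist": "Example Band", "title": "Demo", "year": 1999})
second = site.submit("bob", "album-1",
                     {"artist": "Example Band", "title": "Demo", "year": 2000})
final = site.resolve(0, keep_proposed=True)
```

The point of the sketch is only that the hard part moves from reconciling formats after the fact to giving users one clear place to enter data and argue it out.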

Indeed, this method strikes me as so superior that I’m surprised I don’t see it discussed in this context more often. Ignorance doesn’t seem plausible; even if Wikipedia was a late-comer, sites like ChefMoz and MusicBrainz followed this model and were Semantic Web case studies. (Full disclosure: I worked on the Semantic Web portions of MusicBrainz.) Perhaps the reason is simply that both sides — W3C and Google — have the existing Web as the foundation for their work, so it’s not surprising that they assume future work will follow from the same basic model.

One possible criticism of the million dollar users proposal is that it’s somehow less free than the individualist approach. One site will end up being in charge of all the data and thus will be able to control its formation. This is not ideal, certainly, but if the data is made available under a free license it’s no worse than things are now with free software: those who don’t like the direction things are going can always exercise their right to “fork” the project. And we can try to dampen such problems by making sure the central sites are run as democratically as possible.

Another argument is that innovation will be hampered: under the individualist model, any person can start doing a new thing with their data, and hope that others will pick up the technique. In the centralized model, users are limited by the functionality of the centralized site. This too can be ameliorated by making the centralized site as open to innovation as possible, but even if it’s closed, other people can still do new things by downloading the data and building additional services on top of it (as indeed many have done with Wikipedia).

It’s been eight years since Tim Berners-Lee published his Semantic Web Roadmap and it’s difficult to deny that things aren’t exactly going as planned. Actual adoption of Semantic Web technologies has been negligible and nothing that promises to change that appears on the horizon. Meanwhile, the million dollar code people have not fared much better. Google has been able to launch a handful of very targeted features, like music search and answers to very specific kinds of questions, but these are mere conveniences, far from changing the way we use the Web.

By contrast, Wikipedia has seen explosive growth, Amazon.com has become the premier site for product information, and when people these days talk about user-generated content, they don’t even consider the individualized sense that the W3C and Google assume. Perhaps it’s time to try the third way out.

July 19, 2006


Your positions are a bit too liberal/socialist for my taste. Sure, it sounds great that all the world’s information could be gathered at one web site but the reality is the complete opposite. Google recognizes that the world’s data is scattered amongst billions of web pages with sub-optimal structure and consistency.

posted by pwb on July 19, 2006 #

Hello Aaron. I think the third alternative you mention is a very important one. I’ve written before about what I see as unmistakable advantages in leveraging XML/RDF/OWL technologies in controlled environments and I’m as surprised as you are that this alternative context is not in the discourse within the SW communities. I do think it is a very unfortunate misconception (one amongst many) that the web (i.e., an open/distributed environment) is the only context that Semantic Web technologies were meant for.

On the other hand, I believe single purpose databases (Wikipedia can be considered as such) have much to gain in automation of content management, data manipulation, targeted querying, etc. from applying Semantic Web technologies. Most of the social factors that are serious issues are alleviated by a system that is (even somewhat) closed.

There are even other, more intrinsic advantages: closed world assumptions, the unique name assumption, minimal semantic ambiguity, etc.

I strongly believe that SW technologies are applicable to two contexts (the first being the most well known): open, distributed systems (where the social factor can be as much a pain as a catalyst) and closed, single (or minimal) purpose systems (where what you lose with closed doors you gain with precision, automation, and (other) possible opportunities to streamline logical reasoning in such an environment).

posted by Chimezie Ogbuji on July 19, 2006 #

pwb: I’m at a loss for understanding how these details of technology could be considered “liberal/socialist”, but your larger point seems to miss the point — as I note in the last paragraph, Google isn’t collecting data, it’s collecting text.

Chimezie: Can you say a bit more about what single purpose databases can get from Semantic Web technologies? It seems like most of the technical stuff can be better done with more standard database software, while the other stuff is simply styles of thought, which I commend but which aren’t technologies.

posted by Aaron Swartz on July 19, 2006 #

I think what pwb is talking about wrt liberal/socialist is the centralization of information, as opposed to the individualist/libertarian world where everyone does their own thing.

But I don’t think it’s a useful point at all. Wikipedia is both libertarian and collectivist: libertarian because it does nothing (and can do nothing) to constrain the freedom available to those who choose not to participate, collectivist for all the obvious reasons that have been done to death.

I think the crucial thing about the net, and why it is such a democratizing technology, is the low start-up cost. You hint at this when you mention forking. On the net, it seems like a lot of the pie-in-the-sky libertarian theories might actually work. For instance, it looks so far like consolidating power on the net is much more difficult than doing so in less malleable spheres. People like to talk about how Google is gaining control of all information on the net, but Wikipedia has already proven that false. If Google wanted to suppress a piece of information, you’d probably still be able to find it on Wikipedia. (And this is not even considering all the other search engines.)

posted by Mark on July 19, 2006 #

What if it’s some of each: million dollar markup, code AND users? Suppose there is a collectively built services portal which intermediates between unruly data messes and clients requiring data in a uniform format.

posted by storkblemish on July 19, 2006 #


Central database maintainer doesn’t work as a general solution.

You’ve identified some cases where it’s a good specific solution, and that’s certainly true.

But that still leaves “everything else”.

The central database solution requires a fairly large amount of money to maintain the database. In Amazon’s case, that’s part of the ongoing business; in Wikipedia’s case, part of the venture capital.

It’s not “million dollar users”, but “million dollar website”. And the problem is then you’ve got to have the million dollars for that niche. Some niches - maybe even the most profitable niches - can support that. But there’s still a lot which won’t.

posted by Seth Finkelstein on July 19, 2006 #

I see four potential models and only three covered:

Google: users dispersed, data together
Wikipedia: users together, data together
Semantic Web: users dispersed, data dispersed
?: users together, data dispersed

I’d like to see more discussion of the last one, as it seems like the best option, taking advantage of collective knowledge without creating the various problems associated with centralized data, many of which you’ve touched on here.

posted by Scott Reynen on July 20, 2006 #

I think the technique of getting RDF data from Wikipedia is particularly interesting for things like timezones and airport latitude/longitudes.

My latest hack in this direction is Choosing flight itineraries using tabulator and data from Wikipedia.

I’m following the semantic media wiki project and a few related ideas. I think there’s a good chance of getting something really nifty in that space.

posted by Dan Connolly on July 20, 2006 #

Without being an expert, I think there are three problems about data here: getting the data, storing it, and retrieving it. More than the users/data division, I see a producer/storage/user trilogy. And the problem is that when you have dispersed creators with dispersed storage, data tends to be repeated, irrelevant, or very hard to find. Avoiding this is what I think is the biggest advantage of the Wiki model. The other option is web indexes, but I haven’t seen any that work. What about neural networks, or web search engines based on semantic analysis?

posted by Sago on July 25, 2006 #

Scott’s users-together-data-dispersed type … isn’t that everybody going to their offices at MegaCorp USA, sitting down, logging in, and going to n different sites?

posted by Niels Olson on July 26, 2006 #
