Raw Thought

by Aaron Swartz

Introducing theinfo.org

A lot of the work I’ve been doing on Open Library for the past few months has to do with handling large quantities of data. Either I’m writing crawlers to download it from various public web sites, or I’m meeting with librarians to persuade them to give me copies, or I’m evaluating algorithms for processing it, or I’m building tools for viewing it all.

And while I’ve been doing this for information about books, I’ve noticed my friends doing similar things in other fields. Reporters try to get large data sets to write stories. Programmers get large data sets to add features to their sites. Friends are trying to make available data about the inner workings of the government.

And while each community has ways of talking to each other — reporters talking to other reporters, RDF people talking to other RDF people, library hackers talking to other library hackers — there’s no community that cuts across these topical lines. And that’s too bad, because there’s a lot there we could share, from tips on how not to get caught when crawling to tools to make it easier to build big charts and maps.

So that’s why I’ve started a new community site for people who work with large data sets. It’s called theinfo.org and I’d really appreciate it if you joined the mailing lists and spread the word.



January 15, 2008


Great idea Aaron. I’m a consultant working on international trade agreements. The availability of comprehensive data sets on trade flows, trade barriers and production/demographic data has transformed the making and analysis of international trade relations in the two or three decades that I’ve been at it.

There are some very large data sets accessible via the Web. But in this domain they can be SO large that it’s difficult to access them via the web (the bandwidth problem is only the start).

I’ll keep an eye out for developments here (thanks very much for the link to CKAN, by the way. Didn’t know about that).



posted by Peter Gallagher on January 15, 2008 #

Where were you 6 months ago when we started dreaming up RescueTime (YC08)?

Thanks for doing this… Eager to see how it shapes up.

posted by Tony Wright on January 15, 2008 #

Cool beans. This is pretty much my job.

BTW, markup isn’t being processed correctly on user pages: http://theinfo.org/user/Harkins

posted by Peter Harkins on January 15, 2008 #

Hi Aaron!

Looks like another good idea to me. Unfortunately, I have stopped working on this project, but theinfo.org reminded me of my own http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/04-03-03/04-03-03.html that I had almost forgotten…

Good luck in this new project.


posted by Eric van der Vlist on January 16, 2008 #

I posted a long list of dataset links for you at the datawrangling blog. They need to be cleaned up and organized, but I’m sure some info’ers with more time on their hands can wikify them…

posted by Peter Skomoroch on January 18, 2008 #

Lots of possible datasets:
http://www.trustlet.org/wiki/Trust_network_datasets
http://www.trustlet.org/wiki/Repositories_of_datasets

Highly unsorted and messy ;(

I subscribed to the theinfo mailing list by the way.

posted by paolo on January 18, 2008 #

