Raj Reddy wants to put a million books on the Internet.

Brewster Kahle, the man behind The Internet Archive and Alexa, wants to digitize all the out-of-copyright (and thus Public Domain) books and put them on the Web. Read more for his email describing his plans and co-conspirators for the project.

Subject: Public Access to the Public Domain: finding the list of out of copyright books and getting them online
From: Brewster Kahle
To: Steve Harris, Gregory B. Newby, David Wolber, Raj Reddy, Robert Thibadeau, Michael Shamos, Jim Fruchterman, James Michalko, Judith Bush, Lawrence Lessig, archivists-talk
Date: Wed, 29 May 2002 01:46:02 -0700

This is a note to a bunch of us technical and lawyerly doers in the world of Public Domain (PD) books, proposing a short-term strategy in order to get feedback and coordinate efforts.

Steve Harris is a Gutenberg person and lawyer, Greg Newby is a Gutenberg person and tech guru, Dave Wolber is a prof at USF and is working with the Internet Archive this summer with 12 of his undergrads, Raj Reddy is “universal access to human knowledge”, Robert Thibadeau is a tech master of the Universal Library at CMU, Michael I. Shamos is leader of the Universal Library at CMU and a lawyer, Jim Fruchterman started bookshare.org (20k scanned online books for the blind), Jim Michalko runs RLG (a union catalog), Judith Bush is running the RLG project to get the catalog onto the net, and Larry Lessig has started the Creative Commons. archivists-talk@yahoogroups.com is an unmoderated list that discusses this kind of thing. Sorry if I left key people off; it is late.

If we want to help people put a pile of books online here is a strategy:

1. take a large catalog of books in libraries,
2. tag each entry with its US copyright status,
3. prioritize those that are out of copyright,
4. try to inspire the world to digitize the out-of-copyright books,
5. format the books for online distribution,
6. organize the resulting digitized books,
7. cause enlightenment in all corners of the globe.

1. Get catalog: Research Libraries Group (rlg.org) is up for it and will have their catalog ready in July. Maybe OCLC would be up for it too.

2. Tag PD works: Tagging needs to be done.
For US PD rules, see: The Public Domain: How to Find and Use Copyright-Free Writings, Music, Art & More (Public Domain, 1st Ed)

Steps to tag:

A. Get the registration records in dirty OCR’ed form, since most are in page-image form right now. Newby has started this.

B. Maybe hand-clean the registration records (Gutenberg has started; see Distributed Proofreaders), maybe with money from IA or CMU for offshore help; or maybe don’t clean them up, but use smart computer algorithms to get it right enough. (Harris pointed out that the US Catalog of Copyright Entries is a start at a hand-cleaned version.)

C. Do a “join” to match the registration records against the catalog records and find the cataloged books that have been registered; the rest are then public domain.

D. Publicize what we have done and the methods in a series of meetings with lawyers, publishers, and authors groups so that we can refine a “best faith effort” to do large scale copyright clearance.  This step is key so that we are not surprising anyone. Maybe Creative Commons can sponsor these?  IA can help.
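A rough sketch of the “join” in step C, in Python. Everything here is an assumption made for illustration: the record fields (`title`, `author`), the `normalize` helper, and the sample data are hypothetical, and real catalog and registration records would need far more careful matching. With dirty OCR a registered book can easily fail to match, so the unmatched list is only a set of candidates for review, not a clearance.

```python
import re


def normalize(title, author):
    """Build a crude match key: lowercase, drop punctuation, collapse spaces."""
    text = f"{title} {author}".lower()
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return " ".join(text.split())


def split_by_registration(catalog, registrations):
    """Split catalog entries into (matched, unmatched) against registrations.

    Both inputs are lists of dicts with hypothetical 'title' and 'author' keys.
    """
    registered_keys = {normalize(r["title"], r["author"]) for r in registrations}
    matched, unmatched = [], []
    for entry in catalog:
        key = normalize(entry["title"], entry["author"])
        (matched if key in registered_keys else unmatched).append(entry)
    return matched, unmatched


# Made-up sample records, only to show the shape of the data.
catalog = [
    {"title": "Moby-Dick", "author": "Melville, Herman"},
    {"title": "Some Later Novel", "author": "Doe, Jane"},
]
registrations = [
    {"title": "SOME LATER NOVEL.", "author": "Doe Jane"},
]

matched, unmatched = split_by_registration(catalog, registrations)
print("registered:", [e["title"] for e in matched])
print("PD candidates:", [e["title"] for e in unmatched])
```

An exact-key match like this is the simplest possible join; the “smart computer algorithms” mentioned in step B would presumably replace `normalize` with some form of fuzzy matching.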

3. Prioritize: We need good ideas here. Maybe we could get checkout counts, or how many libraries hold a book, to know which ones should be digitized first. Otherwise we could just leave it up to the individuals who scan, and they do it in any order they want.
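One way the holdings-count idea could look, as a minimal Python sketch. The `id` and `title` fields and the `holdings` mapping (book id to number of holding libraries) are hypothetical; no such data has been agreed on yet.

```python
def prioritize(candidates, holdings):
    """Order candidate books by how many libraries hold them, most-held first.

    candidates: list of dicts with hypothetical 'id' and 'title' keys.
    holdings:   dict mapping book id -> number of holding libraries.
    Books with no holdings data count as 0; ties are broken by title.
    """
    return sorted(
        candidates,
        key=lambda e: (-holdings.get(e["id"], 0), e["title"]),
    )


# Made-up sample data.
candidates = [
    {"id": "a", "title": "Alpha"},
    {"id": "b", "title": "Beta"},
    {"id": "c", "title": "Gamma"},
]
holdings = {"a": 3, "b": 40}

ranked = prioritize(candidates, holdings)
print([e["title"] for e in ranked])
```

Checkout counts could be swapped in for `holdings` without changing the function.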

4. Scanning: There are a number of ways to inspire this. We need a repeatable and inexpensive process that builds in QA. Universal Library is now cranking on a centralized approach; can we augment this with a Napster-like version? bookshare.org is an example where blind people are scanning in books.

5. Format: Archival and access formats are a problem. There is no “MP3” of online books yet. Of course, keep the high-res scans (IA will provide free storage for anyone who needs it), and then have some meetings where we try to get the list of supported access standards down to a manageable number. I suggest a gritty meeting in August in San Francisco that, again, the IA can sponsor.

6. Organize: I think we can build an amazon.com-like site with a catalog that starts with RLG’s records (they have verbally agreed). Internet Archive is building such a site this summer; hopefully Dave Wolber’s army will help populate it with the 20k or so existing online books. We could use help.

7. Educate and enlighten: We are now in touch with many worldwide literacy programs. I am sure you guys are in touch with them as well. If we deliver a free, massive, and relevant library, they can do a lot of the hard work of getting it to the many, many people.

Please comment, destroy, or whatever this strategy.

Then pick a hunk to take charge of.

The Internet Archive is best suited to #5 and #6 (Format and Organize) but can help or do any pieces.

“Public Access to the Public Domain” is a step towards Raj’s “Universal Access to Human Knowledge”.  How hard can it be?


posted June 01, 2002 11:42 PM (Technology)


Aaron Swartz (me@aaronsw.com)