arcget: Retrieve a site from the Internet Archive

Servers die. Companies collapse. URLs change. The Web is a very messy place. Thankfully, the Internet Archive is there to record it all.

But once it's in there, how do you get it back? Sure, the Wayback Machine is nice for getting a couple of pages, but anything more than that is a royal pain. Wouldn't it be nice if there were some easy way to get back that data? arcget is that easy way.

$ python arcget.py linguafranca.com/images/covers
Grabbing linguafranca.com/images/covers ...
FAILED: linguafranca.com/images/covers/0010cover-sm.gif
linguafranca.com/images/covers/0011cover.gif
linguafranca.com/images/covers/0103cover.gif
linguafranca.com/images/covers/0104cover.gif
linguafranca.com/images/covers/0105cover_small.gif
FAILED: linguafranca.com/images/covers/-
and so on...

arcget asks the Internet Archive for a list of all the files it has for that site, then goes through and tries to find a working copy of each one. It downloads that copy, strips out the modifications made by the Wayback Machine, and saves the result in a properly named file.
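
The "find a working copy, then save it" step can be sketched roughly as follows. This is not arcget's actual code. `fetch` is a hypothetical callable standing in for whatever downloads a single capture (returning None on failure), and `local_path` is only a guess at what "properly named" means: the site's own host and path, reused as directories on disk.

```python
import os
from typing import Callable, Iterable, Optional


def local_path(url: str, root: str = ".") -> str:
    """Map a site URL to a file path under root (illustrative naming rule)."""
    # Drop the scheme if one is present.
    if "://" in url:
        url = url.split("://", 1)[1]
    # Directory-style URLs get a placeholder file name (an assumption).
    if url.endswith("/") or "/" not in url:
        url = url.rstrip("/") + "/index.html"
    # Skip empty and relative segments so the file stays under root.
    parts = [p for p in url.split("/") if p not in ("", ".", "..")]
    return os.path.join(root, *parts)


def first_working_copy(
    capture_urls: Iterable[str],
    fetch: Callable[[str], Optional[bytes]],
) -> Optional[bytes]:
    """Return the body of the first capture that fetch() can retrieve."""
    for capture_url in capture_urls:
        body = fetch(capture_url)
        if body is not None:
            return body
    return None  # every capture failed


def save(url: str, body: bytes, root: str = ".") -> str:
    """Write body to the path derived from url, creating directories."""
    path = local_path(url, root)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    return path
```

When `first_working_copy` returns None, the caller would report the file as FAILED, as in the transcript above.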

Caveats and Warnings

arcget gets the oldest version of each file. This is because the newest version is generally whatever lame site has replaced the site you want to archive, and it's hard to tell programmatically whether a given copy comes from the old site or the new one.
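
As a rough illustration of "oldest first", the sketch below groups captures by URL and sorts each group chronologically. It assumes the captures arrive as (timestamp, url) pairs with Wayback-style 14-digit YYYYMMDDhhmmss timestamps, which sort correctly as plain strings. How arcget actually obtains and orders the list is not shown here, and the function name is made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def oldest_first(captures: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (timestamp, url) pairs by URL, timestamps sorted oldest first."""
    by_url = defaultdict(list)
    for timestamp, url in captures:
        by_url[url].append(timestamp)
    # Fixed-width digit strings sort chronologically without parsing.
    return {url: sorted(stamps) for url, stamps in by_url.items()}
```

Trying the timestamps in this order means a later capture is only used when the earlier ones fail.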

The Internet Archive's error messages are less than clear and 404 pages aren't always clearly marked, so arcget may mistakenly download some error pages as the actual files.

The Internet Archive rewrites all <link> and <script> tag URLs to go through the Archive. arcget doesn't fix these links.
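
If you want to repair those links yourself after downloading, something like the following sketch may help. It assumes the rewritten URLs have the form `web.archive.org/web/<timestamp>/<original-url>`, optionally with a two-letter modifier such as `js_` after the timestamp. The Archive's rewriting may differ from this, so check a few pages by hand first. `unarchive_links` is a hypothetical helper and not part of arcget.

```python
import re

# An href= or src= attribute whose value starts with a Wayback prefix,
# e.g. href="http://web.archive.org/web/20011201000000/  or  src="/web/20011201000000js_/
WAYBACK_ATTR = re.compile(
    r"""((?:href|src)\s*=\s*["'])"""                # group 1: attribute name and opening quote
    r"""(?:https?://web\.archive\.org)?/web/\d{4,14}(?:[a-z]{2}_)?/""",
    re.IGNORECASE,
)


def unarchive_links(html: str) -> str:
    """Strip the assumed Wayback prefix from href/src attribute values."""
    return WAYBACK_ATTR.sub(r"\1", html)
```

This is a regular-expression shortcut rather than real HTML parsing, so unusual markup may slip through.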

The Internet Archive is not 100% reliable. Sometimes files will work, sometimes they won't.

BE NICE: The Internet Archive is a valuable public service. Don't abuse it by hammering the servers with lots of requests.
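
One simple way to be nice is to enforce a minimum pause between requests. The sketch below is a generic throttle, not something taken from arcget. The class name is made up and the interval is yours to choose, since the Archive's preferred request rate is not stated here.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable for testing
        self._sleep = sleep
        self._last = None                 # time of the previous request

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call `wait()` immediately before each download. The first call returns at once, and later calls sleep only for as long as needed.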

Download: arcget.py.

Users

arcget helped restore:

The Lingua Franca Archive

History

2005-12-28: 0.8. First release. Used to restore one major site.