You're writing an RSS reader and a user types in diveintomark.org. How are you supposed to subscribe to that? feedfinder uses RSS autodiscovery, Atom autodiscovery, spidering, URL correction, and Web service queries -- whatever it takes -- to find the feed.

>>> import feedfinder
>>>
>>> feedfinder.feed('scripting.com')
'http://scripting.com/rss.xml'
>>>
>>> feedfinder.feeds('delong.typepad.com')
['http://delong.typepad.com/sdj/atom.xml', 
 'http://delong.typepad.com/sdj/index.rdf', 
 'http://delong.typepad.com/sdj/rss.xml']
>>>
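
For the curious, the autodiscovery part boils down to fetching the page and reading the <link rel="alternate"> tags in its head. Here is a minimal Python 3 sketch of just that step, using only the standard library. It is not feedfinder's actual code, which also spiders the page, corrects URLs, and falls back to Web service queries as described above; the names FEED_TYPES, LinkFinder, and autodiscover are illustrative.

import urllib.parse
import urllib.request
from html.parser import HTMLParser

# MIME types commonly used to advertise feeds via <link rel="alternate">.
FEED_TYPES = ('application/rss+xml', 'application/atom+xml',
              'application/rdf+xml', 'text/xml')

class LinkFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate"> tags whose type looks like a feed."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'link':
            return
        attrs = dict(attrs)
        rel = (attrs.get('rel') or '').lower()
        mime = (attrs.get('type') or '').lower()
        if 'alternate' in rel and mime in FEED_TYPES and attrs.get('href'):
            self.links.append(attrs['href'])

def autodiscover(url):
    """Return the feed URLs advertised in the HTML page at url."""
    if not url.startswith('http'):
        url = 'http://' + url
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8', errors='replace')
    finder = LinkFinder()
    finder.feed(html)
    # Resolve relative hrefs against the page URL.
    return [urllib.parse.urljoin(url, href) for href in finder.links]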

Download: feedfinder.py

Related

If you like this, you'll love Mark Pilgrim's Ultra-liberal feed parser. You might also be interested in rss2email. And want to know about feeds as soon as they're updated? Try atomstream.

History

feedfinder was originally written by Mark Pilgrim and is currently maintained by Aaron Swartz.

2006-05-31: 1.371. Stupid typo.
2006-05-31: 1.37. Use timelimit function from web.py. Check the feed before robots.txt. Strip URIs. Support for "XML-level redirects". Delete bizarre code.
2006-04-24: 1.36. Improve error messages. Use standard error parser. Catch more errors. Support feed:// URIs. Add --debug command-line option.
2006-04-14: 1.35. Replace named entities.
2006-04-14: 1.34. Timeout threads no longer hold up program execution. New argument all forces return of all feeds.
2006-04-10: 1.33. Better timelimit system using function decorators and threads.
2006-04-09: 1.32. Try guesses on common feed locations (helps with blogspot sites).
2006-04-03: 1.31. Give up on using timeouts in threads (caused an error before).
2006-04-02: 1.3. First version by Aaron. Change getFeeds to feeds, add feed, stop overwriting the timeout for all sockets, doc tweaks, turn Syndic8 back on, better robustness.
2004-01-09: 1.2. Add support for Atom, change name and license, no longer query Syndic8 by default.
2003-02-20: 1.1. Add support for Robots Exclusion Standard.
????-??-??: 1.0. Initial release.

Open Tasks

Return the type of each feed (atom, rss10, etc.) along with its URL. Have feed return the "best" feed for a URI (preferring Atom).
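
One way this task could look, purely as a hypothetical sketch (feed_type, best_feed, and the preference table are not part of feedfinder): classify each fetched document by its root element and pick the most preferred flavor.

# Hypothetical sketch of the open task above: classify a fetched feed by its
# root element and prefer Atom when several feeds are available.
PREFERENCE = {'atom': 0, 'rss20': 1, 'rss10': 2, 'unknown': 3}

def feed_type(data):
    """Very rough guess at the feed flavor from the start of the document."""
    head = data.lstrip()[:500]
    if '<feed' in head:
        return 'atom'
    if '<rdf:RDF' in head:
        return 'rss10'
    if '<rss' in head:
        return 'rss20'
    return 'unknown'

def best_feed(feeds_with_data):
    """Given (url, data) pairs, return the URL of the most preferred feed."""
    return min(feeds_with_data,
               key=lambda pair: PREFERENCE[feed_type(pair[1])])[0]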

Make the timeout level configurable. Make sure it applies to robots.txt queries as well.
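
For context, the decorator-and-threads timelimit approach mentioned in the history could be made configurable roughly like this. This is only a sketch under that assumption, not the timelimit function feedfinder borrows from web.py; TIMEOUT, timelimit, and wrapper are illustrative names. Reading the setting at call time is what keeps it adjustable by callers.

import threading

TIMEOUT = 10.0  # seconds; callers can change this before invoking network code

def timelimit(func):
    """Run func in a daemon worker thread and give up after TIMEOUT seconds.

    The worker is a daemon thread, so a timed-out call doesn't hold up
    program exit; the thread is simply abandoned in the background."""
    def wrapper(*args, **kwargs):
        result = {}
        def worker():
            try:
                result['value'] = func(*args, **kwargs)
            except Exception as exc:
                result['error'] = exc
        t = threading.Thread(target=worker)
        t.daemon = True
        t.start()
        t.join(TIMEOUT)  # read the module-level setting at call time
        if t.is_alive():
            raise RuntimeError('call timed out after %s seconds' % TIMEOUT)
        if 'error' in result:
            raise result['error']
        return result['value']
    return wrapper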