Aaron Swartz

html2text

(THE ASCIINATOR)

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

Also known as: html to text, htm to txt, htm2txt, ...
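
As a rough illustration of the kind of conversion described above, here is a minimal sketch built only on Python's standard-library HTMLParser. It is not the code in html2text.py: the names TinyMarkdownConverter and to_markdown are hypothetical, and it covers only a handful of tags (headings, paragraphs, emphasis, list items, and links). The real script deals with much more, such as tables, text wrapping, and Unicode.

```python
import re
from html.parser import HTMLParser


class TinyMarkdownConverter(HTMLParser):
    """Toy HTML-to-Markdown converter (hypothetical; not html2text's code)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []     # pieces of output text, joined at the end
        self.href = None  # target of the link currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            # <h2> becomes "## ", and so on.
            self.out.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag in ("em", "i"):
            self.out.append("_")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag == "li":
            self.out.append("\n* ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in ("em", "i"):
            self.out.append("_")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag == "a":
            # Inline link syntax: [text](url)
            self.out.append("](%s)" % (self.href or ""))
            self.href = None

    def handle_data(self, data):
        # Collapse runs of whitespace into one space, as a browser would.
        self.out.append(re.sub(r"\s+", " ", data))


def to_markdown(html):
    """Convert a small HTML fragment to Markdown-flavoured plain text."""
    parser = TinyMarkdownConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


sample = ('<h1>Title</h1><p>Some <em>plain</em> text and a '
          '<a href="http://example.com/">link</a>.</p>')
print(to_markdown(sample))
```

Run on the sample fragment, this prints a "# Title" heading, a blank line, and then the line "Some _plain_ text and a [link](http://example.com/)." The output reads as ordinary text and is also valid Markdown.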

Try

Enter the address of the web page you'd like to convert.

Example sites: aaronsw.com, daringfireball.net.

Bookmarklet: 2text

Buy

html2text is available under the GNU GPL 3.0.

Download the latest: html2text.py

History

2011-01-28: 2.40. This and later versions are available on GitHub.
2010-09-06: 2.39. let people grab https urls (tx Romain)
2010-02-03: 2.38. package properly (tx Michael Jenny, Vincent Fretin)
2009-09-14: 2.37. don't use stdout by default (tx Greg Brown). warning: may not be backwards-compatible in some odd use cases
2009-08-10: 2.36. relative url resolution (tx Kevin North)
2008-11-20: 2.35. undo last change (tx Sumit Rangwala)
2008-10-09: 2.34. elim extra \ns (tx Keith Bussell)
2008-09-19: 2.33. add support for abbr (tx Nathan Youngman)
2008-07-31: 2.32. fix parsing bug with fastcompany (tx Elias Soong)
2008-07-23: 2.31. fix unicode support (tx John Chapman)
2008-05-26: 2.3. prelim JS support, various fixes, improved performance (tx Johannes Fitz)
2008-05-13: 2.292. add SKIP_INTERNAL_LINKS (tx Christian Siefkes)
2008-04-25: 2.291. add shebang, fix wrapping (tx Christian Siefkes)
2007-11-01: 2.29. fix degenerate sites (cough 9rules) that don't close head tags; fix crash when feedparser wasn't available (tx Johann Burkard)
2007-04-12: 2.28. fix tables (tx Pete Savage)
2007-04-09: 2.27. fix line breaks (tx Danny O'Brien)
2007-02-23: 2.26. input unicode better (tx John Cavanaugh for the push)
2006-10-13: 2.25. output unicode better (tx s s)
2006-02-22: 2.24. preliminary support for dt/dd
????-??-??: 2.23. fix for python2.1
2004-08-27: 2.21. old bug with extra closing list tags (tx Jonathan)
2004-08-26: 2.2. text wrapping (tx++ Joey Schulze!), suppress dupe links (tx Ricardo Reyes), python2.1 support.
2004-08-23: 2.12. added hr (tx merlin)
2004-06-30: 2.11. python2.1 codec support.
2004-06-27: 2.1. better module, unicode support. expand ndash.
2004-03-27: 2.01a. fix bug w/ charrefs in links (tx Ian G.)
2004-03-19: 2.0a. complete rewrite, supports Markdown
2003-03-16: 1.0. port to Python
2000-06-19: html2text.tcl (with Lars Pind)

Aaron Swartz (me@aaronsw.com)