Fun with wget

For Storage Wars and Data Dumps, I spent some time playing with wget to download websites.

I found some good instructions at programminghistorian.org.

The command

wget -r --no-parent -w 2 --limit-rate=20k <website>

set my terminal plowing through every link it could find.

-r stands for recursive retrieval, meaning wget follows the links on each page it downloads, while --no-parent keeps it from climbing above the starting directory into the rest of the site. (A recursive wget already stays on the starting host unless you pass -H.) -w 2 makes it wait 2 seconds between requests, and --limit-rate=20k caps the download speed at about 20 KB/s, so the recursive requests don’t cause problems for the server.
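
Here is the same command spelled out with long option names, as a rough sketch. The function name polite_mirror and the WGET override (handy for a dry run with echo) are made up for illustration; the flags themselves are the ones from the command above.

```shell
# Hypothetical helper: the command above with long option names.
# Set WGET=echo for a dry run that prints the arguments instead of downloading.
polite_mirror() {
  # --recursive       follow links found on each downloaded page (-r)
  # --no-parent       never climb above the starting directory
  # --wait=2          pause 2 seconds between requests (-w 2)
  # --limit-rate=20k  cap download speed at roughly 20 KB/s
  "${WGET:-wget}" --recursive --no-parent --wait=2 --limit-rate=20k "$1"
}
```

Running `WGET=echo polite_mirror https://example.org/archive/` (a placeholder URL) prints the arguments wget would receive without touching the network.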

First, I tried wget with no options, just a URL. This downloads a single HTML file per request. When I loaded those files in a browser, none of the links worked, and neither did any of the referenced CSS or JS. Without styling, the menus and other elements that recede into the background on the live page sit at the forefront alongside the main content.
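
One way to get a single page that still renders offline is to ask wget for the page’s assets and have it rewrite the links. A sketch, assuming the standard --page-requisites, --convert-links, and --adjust-extension options; fetch_page and the WGET dry-run override are hypothetical names, not part of wget.

```shell
# Hypothetical helper: fetch one page plus what it needs to display locally.
# Set WGET=echo for a dry run that prints the arguments instead of downloading.
fetch_page() {
  # --page-requisites   also download the CSS, JS, and images the page references
  # --convert-links     rewrite links in the saved HTML so they work when viewed locally
  # --adjust-extension  save HTML and CSS files with matching .html/.css suffixes
  "${WGET:-wget}" --page-requisites --convert-links --adjust-extension "$1"
}
```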

Then, I went after as much of the site as I could nab.

I’m interested in creating an archive / repository of local events, so I looked at a few sites that list upcoming events. I tried timeoutny, which was a bit of a mess. I tried the New Yorker, which is great because it picks just a handful of events and the format would be easy to parse. But the best was nononsensenyc, which hosts archives of its plain-text emails. Finally, I also tried wget’ing wfmu.org, which has so many recursive links that it’s still going. So far it’s downloaded 1842+ items that make up 100+ megabytes, consisting of JS libraries, images, PHP files, favicons, audio files, and more. One interesting component is the playlist pages, which document the music played in almost every program dating back to 1999.
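
To check how much a crawl has pulled down so far, something like the following works. mirror_stats is a made-up name, and the directory argument is whatever folder wget is writing into (by default a recursive download lands in a folder named after the host).

```shell
# Hypothetical helper: count files and total size under a download directory.
mirror_stats() {
  dir=${1:-.}
  # Count regular files; tr strips the padding some wc implementations add.
  count=$(find "$dir" -type f | wc -l | tr -d ' ')
  # du -sh prints "<size><TAB><path>"; keep only the size column.
  size=$(du -sh "$dir" | cut -f1)
  echo "$count files, $size"
}
```

For example, `mirror_stats wfmu.org` would report on the folder from the crawl described above, assuming wget used its default directory name.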

Next I downloaded the Terminal Boredom forum, for which I had to tell wget to ignore the site’s robots.txt using the -e robots=off option.
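
The robots setting is passed like any other flag. A sketch using the long form --execute, which is equivalent to -e; mirror_ignoring_robots and the WGET dry-run override are hypothetical names, and reusing the wait and rate-limit flags from earlier is an assumption, since the post doesn’t say which other options went with this download.

```shell
# Hypothetical helper: recursive download that skips the robots.txt check.
# Set WGET=echo for a dry run that prints the arguments instead of downloading.
mirror_ignoring_robots() {
  # --execute robots=off (short form: -e robots=off) runs "robots=off" as if it
  # were a line in .wgetrc, so wget does not consult robots.txt while crawling.
  "${WGET:-wget}" --execute robots=off --recursive --no-parent --wait=2 --limit-rate=20k "$1"
}
```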
