Ever have a situation where a site is going offline and you really, really want to save it?

Before you continue reading, I’m assuming you really want the command to download, say, example.com. No worries. Get the command and go:

wget --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com
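If you prefer wget’s short flags (assuming a reasonably recent GNU wget), the same command can be written as:

wget -e robots=off -r -np -c -nc http://example.com

Here -e is --execute, -r is --recursive, -np is --no-parent, -c is --continue and -nc is --no-clobber.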

If you want more, read on.

Warning

  1. This method of downloading may be illegal in your country.

Check local laws before proceeding, as always. IANAL (I am not a lawyer), but in most countries you should be fine, since this is no different from visiting each page in your browser, taking a screenshot (or hitting Ctrl-S) and saving the contents of the page. BUT laws vary and lawyers make bank, so always make sure you’re on the legal side. I’m not responsible if you end up in prison (or for any other actions you take, etc., etc.).

  2. This method of downloading will certainly piss off server engineers.

This tutorial is a bit of a jerk move to server providers. But hey, this is your last chance to archive that random IP address! If you want to be a good Internet citizen, there are ways to reduce bandwidth and keep service providers happy; jump to the “Being a good citizen” section below.

  3. This method of downloading may not be compliant with the Internet Archive.

If you are doing this to hoard data and maybe submit a little bit of history to the Internet Archive, don’t. This way of downloading websites may not be compliant with the rules and data structure required by the Internet Archive. Consider going on Reddit’s r/DataHoarder to find ways you can start archiving websites to submit to the Internet Archive (AKA Wayback Machine).

All right! So if you are familiar with the warnings, let’s get started!

wget

The tool we’re going to use today is wget. You may have used this tool before when downloading .isos from Ubuntu’s website, or when downloading source tarballs from websites. Whatever the case may be, the tool is very powerful and can do much more than downloading .tar.gzs off of the Internet.

So this is probably the most common usage you’ve seen:

wget https://www.ubuntu.com/path/to/iso/Ubuntu-18.04-desktop.iso

And this saves the file Ubuntu-18.04-desktop.iso in your current directory. Not bad, right?
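As a quick aside, if you’d rather save the file under a different name, wget’s -O flag lets you pick the output filename (the name below is just an example):

wget -O bionic.iso https://www.ubuntu.com/path/to/iso/Ubuntu-18.04-desktop.iso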

But what if you want to go… deeper?

Downloading an entire website

First, we gotta tell wget to download everything. Recursively download everything.

wget --recursive http://example.com

All right, not bad, what’s next? Well, what if you’re downloading under a certain URL and you don’t want wget going up to the parent directory? The next parameter solves that.

wget --recursive --no-parent http://example.com
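For example, if you point wget at a subdirectory (the /blog/ path here is hypothetical), --no-parent keeps it from wandering up into the rest of the site:

wget --recursive --no-parent http://example.com/blog/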

Now, what if your download was interrupted? Simple. Continue the download!

wget --recursive --no-parent --continue http://example.com
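By the way, --continue also works on its own: if a big single download (like that Ubuntu ISO from earlier) gets cut off and the server supports resuming, you can pick up where you left off:

wget --continue https://www.ubuntu.com/path/to/iso/Ubuntu-18.04-desktop.iso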

What if there are multiple links to a single page? We don’t want to download the page multiple times. So we use:

wget --recursive --no-parent --continue --no-clobber http://example.com

Finally, some websites use “robots.txt” to tell crawlers which pages to stay away from. But today we want to download everything, so turn off the robots check. This makes wget non-compliant with polite scraping conventions, but in this case we’re not an automated crawling bot like Google’s Googlebot.
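For reference, a robots.txt that asks all crawlers to stay away from everything looks something like this (hypothetical example):

User-agent: *
Disallow: /

Compliant crawlers read this file and back off. With robots=off, wget ignores it entirely: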

wget --execute robots=off --recursive --no-parent --continue --no-clobber http://example.com

There we go! This is the final command you get, and it is the same one we have at the top of the blog post.

Being a good citizen

Now, if you just rampantly download from a website, the server will get overloaded and crash, the website will go down, and everyone will be pissed off. To be a good citizen, throttle your downloads.

Let’s wait a while before downloading the next page.

wget --execute robots=off --recursive --no-parent --continue --no-clobber --wait=20 http://example.com

20 seconds is pretty reasonable between page loads. Let’s throttle wget a bit more and cap the bandwidth:

wget --execute robots=off --recursive --no-parent --continue --no-clobber --wait=20 --limit-rate=50k http://example.com

This limits the download rate to 50 KB/s, which is pretty gentle on the server. If you’re downloading large files, you might want to bump this up.
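--limit-rate also accepts an m suffix for megabytes, so for bigger files something like this (the exact rate is up to you and the server) is an option:

wget --execute robots=off --recursive --no-parent --continue --no-clobber --wait=20 --limit-rate=1m http://example.com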

Finally, to jitter the downloads so the requests don’t arrive at perfectly regular intervals, add a random wait (wget varies the delay around the --wait value):

wget --execute robots=off --recursive --no-parent --continue --no-clobber --wait=20 --random-wait --limit-rate=50k http://example.com

When websites are terrible

Sometimes network admins detect scrapers by looking at the User-Agent string and block any non-browser programs from accessing the website. So even though we’re trying to be good citizens with the tricks above, it’s just impossible to scrape the website.

Or is it? Fortunately, changing the User-Agent string is pretty easy:

wget --user-agent="Mozilla/5.0 Firefox/4.0.1" --execute robots=off --recursive --no-parent --continue --no-clobber --wait=20 --random-wait --limit-rate=50k http://example.com

You can change it however you want. Firefox is just used as an example.
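If you want the request to look even closer to a real browser, you can paste in the full User-Agent string your own browser reports (the string below is just an illustrative Firefox-on-Linux one, not a recommendation):

wget --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0" --execute robots=off --recursive --no-parent --continue --no-clobber --wait=20 --random-wait --limit-rate=50k http://example.com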

So that’s pretty much it! Hopefully you learned a bit more about wget today and how to use it to download entire websites!