11 November, 2014

Using wget to Download an Entire Website


wget is a versatile tool, intended for downloading files from the Internet. Among its many options is --mirror, which allows us to download a full Website so that it can later be viewed offline. This is useful mainly for two purposes: A. Creating a full backup of the Website, and B. Browsing the Website in those cases in which we're offline & without an available Internet connection.

Note: Don't expect wget to be able to access "back-end" components, which may be required for the full functionality of the Website. The --mirror option only covers the visible, "front-end" components.

Here's how it's done (the wait time, local directory and URL below are example values, to be adjusted as needed):

$ wget --mirror --convert-links -p --no-parent -w 2 -P ./local-copy https://example.com/


Explanation of the options used above:

--mirror (or: '-m') : The main option, which turns on recursive downloading with infinite depth and timestamping, thereby achieving the actual backup of the Website.
--convert-links (or: '-k') : Rewrite the Website's links into a local viewing format, so that they work locally & offline.
-p : Download all page requisites (images, stylesheets and so on), so that the Website's functionality and appearance are preserved as much as possible.
--no-parent : Only files below the specified location will be downloaded. The parent directory won't be accessed during the download process.
-w : Add a delay (in seconds) between retrievals, in order to avoid putting stress on the remote server (which could possibly result in being blocked during the process).
-P : The local directory to save the Website to.
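
Once the download completes, the local copy can be browsed directly from disk. As a sketch, assuming the example command above (which, by default, places the files under ./local-copy/example.com/), the starting page can be opened in a browser:

$ firefox ./local-copy/example.com/index.html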

It's also possible to exclude certain file types & directories from the download, and to prevent access to external domains, by adding the following options (a combined example follows the list):

--reject= : Don't download the specified file types (comma-separated list of suffixes or patterns). For example: --reject "*.pdf,*.jpg" (quoted, so the shell doesn't expand the wildcards) will prevent the downloading of any .pdf or .jpg files.
--exclude-directories= : Don't download the specified directories (comma-separated list).
--domains= : Don't follow any links outside of the specified domain names (comma-separated list).
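
As a sketch combining these options (the rejected file types, excluded directories and domain name below are example values):

$ wget --mirror --convert-links -p --no-parent -w 2 -P ./local-copy --reject "*.pdf,*.jpg" --exclude-directories=/private,/tmp --domains=example.com https://example.com/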

Note: The mirroring achieved with wget isn't always perfect, and there are cases in which certain aspects of the Website's functionality won't be fully available while offline. If the results are not satisfactory, a possible alternative is the HTTrack Website Copier tool.
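
For reference, a minimal HTTrack invocation could look like this (a sketch, reusing the same example URL and local directory as above):

$ httrack "https://example.com/" -O ./local-copy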
