Mirror a Website with WGET: Step‑by‑Step Guide

Mirroring a website with wget lets you create a local copy for offline browsing, backups, or testing. This guide shows a reliable, secure process using wget on Unix-like systems (Linux, macOS) and explains common options, example commands, and troubleshooting tips.

1. Install wget

  • Debian/Ubuntu: sudo apt update && sudo apt install wget
  • macOS (Homebrew): brew install wget
  • Fedora: sudo dnf install wget

2. Basic mirror command

Use wget’s recursive and page-requisite options to download the site and adjust links for offline viewing:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com/

Key options:

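  • --mirror: shorthand for -r -N -l inf --no-remove-listing (recursion with no depth limit, plus timestamping).
  • --convert-links: after the download finishes, rewrites links in the saved pages so they point to the local files.
  • --adjust-extension: adds a suitable extension (e.g., .html) when the URL does not already end in one.
  • --page-requisites: downloads the CSS, JS, and images needed to render each page.
  • --no-parent: never ascends to the parent directory of the starting URL.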

3. Preserve site structure and avoid overloading

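Add a delay, a rate limit, and retries so the mirror does not strain the server, and save everything under a dedicated directory:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
     --wait=1 --limit-rate=200k --retry-connrefused --tries=5 \
     -P ./local-copy https://example.com/

  • --wait=1: waits 1 second between requests to reduce server load.
  • --limit-rate=200k: throttles the download to about 200 KB/s.
  • --retry-connrefused, --tries=5: treats "connection refused" as a transient error and gives up after 5 attempts per file.
  • -P ./local-copy: saves files under the given directory.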
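
The options above can be wrapped in a small shell function so the same politeness settings are applied every time. This is a minimal sketch rather than a definitive script: the function name polite_mirror and the WGET variable are hypothetical, introduced here only for illustration. Setting WGET=echo prints the arguments instead of downloading, which is a cheap way to check a command before pointing it at a real site.

```shell
# Sketch: wrap the polite mirror options in a reusable function.
# polite_mirror and WGET are illustrative names, not part of wget.
# Usage: polite_mirror URL [DEST_DIR]
polite_mirror() {
    url=${1:-}
    dest=${2:-./local-copy}

    if [ -z "$url" ]; then
        echo "usage: polite_mirror URL [DEST_DIR]" >&2
        return 2
    fi

    # ${WGET:-wget} lets a caller substitute another command (e.g. echo)
    # for a dry run; by default the real wget is used.
    "${WGET:-wget}" \
        --mirror --convert-links --adjust-extension \
        --page-requisites --no-parent \
        --wait=1 --limit-rate=200k \
        --retry-connrefused --tries=5 \
        -P "$dest" "$url"
}
```

For example, WGET=echo polite_mirror https://example.com/ ./copy prints the option list without touching the network, and polite_mirror https://example.com/ ./copy performs the actual download.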

4. Mirror a site with authentication or cookies

  • For HTTP basic auth:
wget --user=username --password='password' --mirror … https://example.com/
  • For sites requiring cookies (e.g., login sessions), first export cookies from a browser (e.g., using an extension) to cookies.txt, then:
wget --load-cookies cookies.txt --mirror … https://example.com/

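Be cautious when storing credentials: a password passed with --password ends up in your shell history and is visible in the process list, so prefer --ask-password, which prompts for it instead. Remove cookie files after use.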
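
The cleanup advice above can be built into the command itself. The following is a minimal sketch, assuming the cookie file is disposable and can be deleted once the mirror finishes; mirror_with_cookies and the WGET variable are hypothetical names used only for illustration, and the wget options shown are a reduced set rather than a recommended full command.

```shell
# Sketch: mirror with a cookie file, then delete the file afterwards.
# mirror_with_cookies and WGET are illustrative names, not part of wget.
# Usage: mirror_with_cookies COOKIE_FILE URL
mirror_with_cookies() {
    cookies=${1:-}
    url=${2:-}

    if [ ! -f "$cookies" ] || [ -z "$url" ]; then
        echo "usage: mirror_with_cookies COOKIE_FILE URL" >&2
        return 2
    fi

    # Make the session cookies readable by the current user only.
    chmod 600 "$cookies"

    # Record the exit status without aborting, so cleanup always runs.
    status=0
    "${WGET:-wget}" --load-cookies "$cookies" --mirror --no-parent "$url" || status=$?

    # Remove the cookie file whether or not the download succeeded.
    rm -f "$cookies"
    return "$status"
}
```

The function returns wget's own exit status, so a caller can still tell whether the mirror succeeded even though the cookie file is gone either way.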

5. Exclude or include specific paths

  • Exclude paths:
--exclude-directories=/private,/tmp
  • Include only certain file types:
--accept=html,htm,css,js,jpg,png
  • Reject file types:
--reject=zip,gz

6. Incremental updates and timestamping

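By default, --mirror uses timestamping (-N) to skip files that have not changed on the server. To force a full re-download, drop the timestamping by replacing --mirror with its remaining components, so existing local files are overwritten:

-r -l inf --no-remove-listing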

To run regular updates, use cron:

0 3 * * * wget --mirror --convert-links … -P /var/www/mirror https://example.com/
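
A scheduled mirror of a large site can still be running when the next one starts. One way to guard against overlapping runs is a lock directory, sketched below. This is an assumption-laden illustration, not part of wget or cron: run_locked and the lock path are hypothetical names, and the approach relies only on mkdir failing when the directory already exists.

```shell
# Sketch: run a command only if no earlier run still holds the lock.
# run_locked and the lock path are illustrative choices, not wget features.
# Usage: run_locked LOCK_DIR COMMAND [ARGS...]
run_locked() {
    lock=$1
    shift

    # mkdir is atomic: it fails if the directory already exists,
    # which here means another run is still in progress.
    if ! mkdir "$lock" 2>/dev/null; then
        echo "another mirror run holds $lock; skipping" >&2
        return 1
    fi

    # Run the command and keep its exit status for the caller.
    status=0
    "$@" || status=$?

    # Release the lock so the next scheduled run can start.
    rmdir "$lock"
    return "$status"
}
```

A script called from cron could then invoke run_locked /tmp/example-mirror.lock followed by the wget command. If the script is killed mid-run, the lock directory is left behind and has to be removed by hand before the next run will start.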

7. Troubleshooting

  • 403 Forbidden: the server is refusing the request. Confirm that you are allowed to access the content; if the server rejects wget's default user agent, set one explicitly:
--user-agent="Mozilla/5.0 (compatible; MirrorBot/1.0)"
  • Large sites: limit the recursion depth with -l, and use --span-hosts carefully if you need assets hosted on other domains.
  • JavaScript-heavy sites: wget fetches only server-rendered HTML; use a headless browser tool such as Puppeteer for client-rendered content.
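
When a mirror fails, wget's exit status narrows down the cause before you open the log. The helper below is a small sketch that maps the exit codes listed in the wget manual to short messages; describe_wget_exit is a hypothetical name chosen for this example.

```shell
# Sketch: translate wget's documented exit status into a short message.
# describe_wget_exit is an illustrative helper name, not part of wget.
# Usage: describe_wget_exit STATUS
describe_wget_exit() {
    case "$1" in
        0) echo "no problems occurred" ;;
        1) echo "generic error" ;;
        2) echo "parse error (command line, .wgetrc or .netrc)" ;;
        3) echo "file I/O error" ;;
        4) echo "network failure" ;;
        5) echo "SSL verification failure" ;;
        6) echo "username/password authentication failure" ;;
        7) echo "protocol error" ;;
        8) echo "server issued an error response" ;;
        *) echo "unknown exit status: $1" ;;
    esac
}
```

Calling describe_wget_exit "$?" right after a wget command prints the matching message. A 403 Forbidden response falls under status 8, as do other server error responses, so the log is still needed to see which URLs were affected.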

8. Legal and ethical considerations

  • Respect robots.txt and the site’s terms of service.
  • Avoid heavy scraping that might impact site performance; prefer contacting the site owner for large backups.

9. Example: full command for a polite mirror

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
     --wait=2 --limit-rate=100k --tries=5 --retry-connrefused \
     --user-agent="Mozilla/5.0 (compatible; MirrorBot/1.0)" \
     -P ./example-mirror https://example.com/

This creates a browsable local copy under ./example-mirror. Remove sensitive files (cookies, passwords) after use and run mirrors responsibly.
