Mirror a Website with WGET: Step‑by‑Step Guide
Mirroring a website with wget lets you create a local copy for offline browsing, backups, or testing. This guide shows a reliable, secure process using wget on Unix-like systems (Linux, macOS) and explains common options, example commands, and troubleshooting tips.
1. Install wget
- Debian/Ubuntu:
sudo apt update && sudo apt install wget
- macOS (Homebrew):
brew install wget
- Fedora:
sudo dnf install wget
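After installing, you can confirm wget is available on your PATH:

```shell
# Print the installed wget version to confirm the install worked
wget --version
```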
2. Basic mirror command
Use wget’s recursive and page-requisite options to download the site and adjust links for offline viewing:
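A minimal invocation combining these options looks like the following; `https://example.com/` is a placeholder for the site you want to mirror:

```shell
# Recursively download the site and rewrite links for offline viewing.
# https://example.com/ is a placeholder target URL.
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com/
```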
Key options:
- --mirror: shorthand for -r -N -l inf --no-remove-listing (recursion, infinite depth, timestamping).
- --convert-links: rewrites links for local viewing.
- --adjust-extension: adds suitable extensions (e.g., .html).
- --page-requisites: downloads the CSS, JS, and images needed to render pages.
- --no-parent: prevents ascending to parent directories.
3. Preserve site structure and avoid overloading
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent --wait=1 --limit-rate=200k --retry-connrefused --tries=5 -P ./local-copy https://example.com/
- --wait=1: waits 1 second between requests to reduce server load.
- --limit-rate=200k: throttles download speed.
- --retry-connrefused, --tries=5: handle transient errors.
- -P ./local-copy: saves files to a specific directory.
4. Mirror a site with authentication or cookies
- For HTTP basic auth, pass credentials with --user and --password.
- For sites requiring cookies (e.g., login sessions), first export cookies from a browser (e.g., using an extension) to cookies.txt, then load them with --load-cookies.
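Sketches of both cases; the username, password, paths, and URLs below are placeholders to replace with your own:

```shell
# HTTP basic auth (alice/secret are placeholder credentials):
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
     --user=alice --password='secret' https://example.com/private/

# Cookie-based session, reusing cookies.txt exported from a browser:
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
     --load-cookies cookies.txt https://example.com/members/
```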
Be cautious storing credentials; remove cookie files after use.
5. Exclude or include specific paths
- Exclude paths:
--exclude-directories=/private,/tmp
- Include only certain file types:
--accept=html,htm,css,js,jpg,png
- Reject file types:
--reject=zip,gz
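Putting the filters together, a sketch that mirrors only page content while skipping a hypothetical /private directory (the path and file types are illustrative):

```shell
# Mirror HTML, styles, scripts, and images only,
# and never descend into /private.
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent \
     --accept=html,htm,css,js,jpg,png \
     --exclude-directories=/private \
     https://example.com/
```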
6. Incremental updates and timestamping
By default, --mirror uses timestamping (-N) to skip files that haven't changed on the server. To force a complete re-download, delete the local copy before rerunning. Note that --no-use-server-timestamps only stops wget from setting local file times from the server's, which affects later timestamp comparisons rather than forcing a re-fetch.
To run regular updates, use cron:
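For example, a weekly entry added with crontab -e (the destination path, log path, and schedule are placeholders to adjust):

```
# m h dom mon dow  command — re-sync the mirror every Sunday at 03:00
0 3 * * 0  wget --mirror --convert-links --adjust-extension --page-requisites --no-parent -P /backups/example-mirror https://example.com/ >> /var/log/mirror.log 2>&1
```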
7. Troubleshooting
- 403 Forbidden: check robots.txt, site permissions, or set a user agent:
--user-agent="Mozilla/5.0 (compatible; MirrorBot/1.0)"
- Large sites: consider limiting recursion depth with -l, and use --span-hosts carefully if following external assets.
- JavaScript-heavy sites: wget fetches only server-rendered HTML; use a headless browser (e.g., Puppeteer) for client-rendered content.
8. Legal and ethical considerations
- Respect robots.txt and the site’s terms of service.
- Avoid heavy scraping that might impact site performance; prefer contacting the site owner for large backups.
9. Example: full command for a polite mirror
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent --wait=2 --limit-rate=100k --tries=5 --retry-connrefused --user-agent="Mozilla/5.0 (compatible; MirrorBot/1.0)" -P ./example-mirror https://example.com/
This creates a browsable local copy under ./example-mirror. Remove sensitive files (cookies, passwords) after use and run mirrors responsibly.