How to mirror, spider, or archive a website

Programs

 * httrack - a dedicated website copier, available as both a GUI and a command-line tool

wget
wget is a command-line program for downloading files from the internet, but it also has very powerful mirroring capabilities.


 * Simple
 * wget -m http://www.mooooo.com
 * wget -r -np http://www.mooooo.com/files/
 * the first command mirrors(-m) the whole site; the second recursively downloads(-r) just the files/ directory without ascending(-np) into its parent. Note that capital -R is wget's reject list, not a recursion flag

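A hedged variation on the simple mirror above, for when you want to avoid hammering the server (www.mooooo.com is the article's placeholder host; --limit-rate and -w are standard wget options):

```shell
# Throttled mirror: --limit-rate caps download bandwidth, and
# -w waits the given number of seconds between retrievals.
wget -m --limit-rate=200k -w 1 http://www.mooooo.com
```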

 * Advanced
 * wget -m -R "*.jpg,*.exe,*.doc,*.gif,*.zip,*search*,*index.cgi*" -l 2 http://www.website.com/snork/doodles/wiggles
 * this will mirror(-m) the site recursively, reject(-R) any files matching the listed patterns (quote the list so the shell does not expand the wildcards itself), and limit the recursion depth(-l) to 2 levels below the starting URL; add -np as well if you want to stop wget from ascending above the starting directory
 * wget -m -c -H -D www.tuto.com,pdf.tuto.com http://www.tutomax.com/
 * mirror(-m) http://www.tutomax.com/, resume any partially downloaded files(-c), and span to other hosts(-H) only when they belong to the listed domains(-D) www.tuto.com,pdf.tuto.com; without -H, the -D list has no effect because wget stays on the starting host
 * wget -F -i maximDatasheetsR.html
 * read the list of URLs from the specified input file(-i) and force that file to be parsed as HTML(-F), so wget follows the links it contains; if the file uses relative links, you may also need --base=URL so they resolve correctly
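The quoting around the -R reject list matters: an unquoted pattern such as *.jpg can be expanded by the shell before wget ever sees it. A minimal local demonstration (no network needed; demo/ is a throwaway directory made up for this illustration):

```shell
# Show shell glob expansion: this is why -R "*.jpg,..." should be quoted.
mkdir -p demo && cd demo
touch photo.jpg notes.doc
echo unquoted: *.jpg    # shell expands the glob: prints "unquoted: photo.jpg"
echo quoted: "*.jpg"    # quoting keeps the literal pattern: prints "quoted: *.jpg"
cd .. && rm -r demo
```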

The documentation for wget is available here: https://www.gnu.org/software/wget/manual/wget.html

From HowTo Wiki, a Wikia wiki.