Latest revision as of 17:46, 22 November 2021
Wikimedia Commons is a great resource for free/open images, and sometimes you may want to download all the images in one of their categories or pages. Wikimedia Commons doesn't offer a simple way to do this. This howto shows one method.
Command line method
This method uses some commands common to Unix-based operating systems. If you are using Windows, try installing Cygwin, which will let you use these commands inside Windows.
Requirements
These are standard Unix commands (wget, grep, sed), and are likely already installed if you are using Linux or any other Unix-like OS.
Steps
In this example we will get all the images on this page: http://commons.wikimedia.org/wiki/Crystal_Clear. It will grab the original, full-quality images, not the lower-quality thumbnails shown on the page.
- Get the webpages for each image file
- Command:
wget -r -l 1 -e robots=off -w 1 http://commons.wikimedia.org/wiki/Crystal_Clear
- Description: This command downloads all the webpages linked from http://commons.wikimedia.org/wiki/Crystal_Clear and puts them in the directory 'commons.wikimedia.org'
The images we are interested in are linked from the commons.wikimedia.org/wiki/File:* pages, so we need to extract the image links from those HTML files.
- Extract the Image links
- Command:
WIKI_LINKS=`grep fullImageLink commons.wikimedia.org/wiki/File\:* | sed 's/^.*><a href="//'| sed 's/".*$//'`
- Description: This creates a list of image links in the variable $WIKI_LINKS
- Download the Images
- Command:
wget -nc -w 1 -e robots=off -P downloaded_wiki_images $WIKI_LINKS
- Description: This will download all the images into a folder called 'downloaded_wiki_images'
- Delete all temp files
- Command:
rm -rf commons.wikimedia.org
- Description: Deletes all the HTML pages used to get the links
- Note 1: If you are trying to get all the images in a category that has more than 200 images, you will have to run the commands on each category page, i.e. 0-200, 200-400, 400-600, etc.
- Note 2: This method works as of January 2010, but as time passes Wikimedia Commons' page format may change and this method may stop working. If so, view the source of one of the image pages, find the image's URL, and see what has changed. The commands should be easy to modify to get it working again.
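Before running the extraction against real downloads, the grep/sed pipeline from the "Extract the Image links" step can be sanity-checked on a single line. The HTML below is a hypothetical example of the fullImageLink markup, not copied from a real page, so the exact attributes are assumptions:

```shell
# Hypothetical sample of the markup around a full-size image link
line='<div class="fullImageLink" id="file"><a href="https://upload.wikimedia.org/commons/x/xy/Example.png"><img alt="Example"></a></div>'

# Same pipeline as in the extraction step:
# strip everything up to the href, then strip the trailing markup
url=$(echo "$line" | sed 's/^.*><a href="//' | sed 's/".*$//')
echo "$url"
```

If the real page format has drifted, this is the quickest place to see it: paste in one line from a downloaded File: page and check that a bare URL comes out.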
Script
If you want to do all the steps described above in one command, save this script:
#!/bin/bash
WIKI_URL=$1
if [ "$WIKI_URL" == '' ]; then
    echo "The first argument is the main webpage"
    echo
    exit 1
fi

# Download Image pages
echo "Downloading Image Pages"
wget -r -l 1 -e robots=off -w 1 -nc $WIKI_URL

# Extract Image Links
echo "Extracting Image Links"
WIKI_LINKS=`grep fullImageLink commons.wikimedia.org/wiki/File\:* | sed 's/^.*a href="//'| sed 's/".*$//'`

echo "Downloading Images"
wget -nc -w 1 -e robots=off -P downloaded_wiki_images $WIKI_LINKS

echo "Cleaning up temp files"
rm -rf commons.wikimedia.org/

echo "Done"
exit
Alternative (works only for categories)
Getting the filenames via WikiSense and the MediaWiki API works only for categories, but should be faster. Just run the script with the desired category as its argument.
#!/bin/bash
# Get all Images in Category (and 5 subcategories)
wget "http://toolserver.org/~daniel/WikiSense/CategoryIntersect.php?wikifam=commons.wikimedia.org&basecat=${1}&basedeep=5&mode=iul&go=Scannen&format=csv" -O list

# Read the list file after file
while read line; do
    name=$(echo $line | tr ' ' "\t" | cut -f2) # Extract filename
    api="http://commons.wikimedia.org/w/api.php?action=query&titles=File:${name}&prop=imageinfo&iiprop=url"
    url=$(curl "${api}&format=txt" 2>/dev/null | grep "\[url\]" | tr -d \ |cut -d\> -f2) # Get the URL of the File via API
    echo $name
    echo $api
    echo $url
    wget $url # Download File
done < list

rm list # Clean up
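The filename extraction in the loop assumes the WikiSense CSV puts the file name in the second space-separated field. A quick sketch with a made-up input line (the real CSV layout may differ) shows what the tr/cut step produces and how the API query URL is built from it:

```shell
# Made-up CSV line in the assumed "pageid filename timestamp" layout
line='12345 Crystal_Clear_app_aim.png 20100101000000'

# Same extraction as in the loop: turn spaces into tabs, take field 2
name=$(echo $line | tr ' ' "\t" | cut -f2)
echo "$name"

# Build the imageinfo API query for that file, as the script does
api="http://commons.wikimedia.org/w/api.php?action=query&titles=File:${name}&prop=imageinfo&iiprop=url"
echo "$api"
```

Note that this splitting breaks on filenames that themselves contain spaces; underscores in Commons filenames usually avoid the problem, but it is worth checking the downloaded list file.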
2021 fandom attempt
While attempting to grab the images when moving off Fandom to a custom wiki host, the above instructions did not work. The following script was used instead to download the images. It has the annoyance of requiring you to manually copy in the URLs of all the wiki's sitemap pages (visit the special page /wiki/Local_Sitemap on the Fandom wiki).
#!/bin/bash

# Download all the pages via the enumeration of the sitemap
wget -r -l 1 -e robots=off -w 1 https://magical-camp.fandom.com/wiki/Local_Sitemap?namefrom=MC%3A+Succubus+Veronica https://magical-camp.fandom.com/wiki/Local_Sitemap?namefrom=MC%3A+Lovely+Witch%27s+Panties https://magical-camp.fandom.com/wiki/Local_Sitemap?namefrom=MC%3A+Cursed+Wig https://magical-camp.fandom.com/wiki/Local_Sitemap

# Then let's find all the images embedded in the pages
grep -r 'class="image"' magical-camp.fandom.com/ > img_tags; wc -l img_tags

# And pull out the linked URLs to the original image.
#grep -P -o '(?<=href=")https.*?(?="[> ])' img_tags > img_urls
# This variant removes the extra bits off the URL so wget will save the real
# filename instead of a file called latest?somerandomstring:
grep -P -o '(?<=href=")https.*?(?=/revision)' img_tags > img_urls

# Now download the actual images
wget -i img_urls -P images -w 1
echo "Finished - images are in the folder 'images'"
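The grep -P look-around extraction is the part most worth testing before a full run (it needs GNU grep built with PCRE support). The tag below is a hypothetical example of Fandom's image markup, so the exact attribute order and hostname are assumptions; it shows why cutting the URL at /revision yields a clean filename:

```shell
# Hypothetical Fandom image tag; the real markup may differ
tag='<a href="https://static.wikia.nocookie.net/magical-camp/images/a/ab/Example.png/revision/latest?cb=20211122" class="image">'

# Keep everything between href=" and /revision, as in the script:
# (?<=href=") is a lookbehind, (?=/revision) a lookahead, so neither
# anchor is included in the matched URL
echo "$tag" | grep -P -o '(?<=href=")https.*?(?=/revision)'
```

Without the /revision cut, wget would save each file under the tail of the full URL (latest?cb=...), which is why the earlier commented-out pattern was abandoned.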