sitemap2proxy

When doing a web app test you usually end up spidering the site you are testing, but what if the site could tell you most of that information without you having to go hunting for it? Bring on sitemap.xml, a file used by a lot of sites to tell spiders, like Google, all about their content.

This script takes that file, parses it to extract all the URLs, then requests each one through your proxy of choice (Burp, ZAP, etc.). It won't find anything that isn't mentioned in the file and it won't do any brute forcing, but it is a nice way to identify all the pages on the site that the admins want you to know about.
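The URL extraction step can be sketched roughly as follows. This is a minimal illustration and not the script's actual code: the `extract_urls` helper is a made-up name, and it assumes the standard sitemap layout where each page address sits inside a `<loc>` element. It uses REXML, which ships with Ruby (as a bundled gem on newer versions).

```ruby
require "rexml/document"

# Hypothetical helper: return the text of every <loc> element in a sitemap.
# Works for both <urlset> files and <sitemapindex> files, since both use <loc>.
def extract_urls(xml)
  doc = REXML::Document.new(xml)
  urls = []
  # Walk every element in the tree; name is the local name, so the
  # sitemap's default XML namespace does not get in the way.
  doc.each_recursive do |el|
    urls << el.text.to_s.strip if el.name == "loc"
  end
  urls.reject(&:empty?)
end
```

Each URL returned by a helper like this would then be requested through the proxy.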

A good companion to this script is pagefinder from Tim Tomes, a tool that checks a list of sites to try to find a sitemap and/or robots.txt file.

Installation

sitemap2proxy is a simple Ruby script and doesn't require any additional gems to be installed. Just make it executable and that's it.

Usage

Usage is pretty simple: you can specify either a sitemap that you've already downloaded or point it at one on the site. It will take either raw XML (sitemap.xml) or a gzip'ed file (sitemap.xml.gz). I've not seen any other variants, but if there are any let me know and I'll add handling for them. The other parameter it requires is the proxy URL.
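One way to accept both raw and gzip'ed sitemaps is to look at the first two bytes of the data rather than trusting the file name. The sketch below is an assumption about how that could be done, not a copy of what the script does; `sitemap_xml` is a hypothetical helper name.

```ruby
require "zlib"
require "stringio"

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = "\x1f\x8b".b

# Hypothetical helper: return plain XML whether the input is raw or gzip'ed.
def sitemap_xml(data)
  bytes = data.b # binary copy so the comparison ignores text encodings
  return data unless bytes.start_with?(GZIP_MAGIC)

  Zlib::GzipReader.new(StringIO.new(bytes)).read
end
```

With something like this in place, `sitemap_xml(File.binread("sitemap.xml.gz"))` and `sitemap_xml(File.binread("sitemap.xml"))` would both hand back XML ready for parsing.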

By default the requests are made with the Googlebot user agent string to try to hide the traffic in the logs. If you want to change this you can specify your own agent using the --ua parameter.
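For anyone curious how a proxied request with a custom user agent looks in Ruby, here is a rough sketch using only the standard library. The helper names are invented for illustration, and the Googlebot string shown is the commonly published one, which may not match the script's default exactly.

```ruby
require "net/http"
require "uri"

# Assumption: the widely published Googlebot string, not necessarily the script's own default.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Hypothetical helper: an HTTP client for the target that sends traffic via the proxy.
def proxied_client(target, proxy)
  t = URI.parse(target)
  p = URI.parse(proxy)
  http = Net::HTTP.new(t.host, t.port, p.host, p.port)
  http.use_ssl = (t.scheme == "https")
  http
end

# Hypothetical helper: a GET request carrying the chosen user agent.
def build_get(target, ua = GOOGLEBOT_UA)
  req = Net::HTTP::Get.new(URI.parse(target))
  req["User-Agent"] = ua
  req
end

# Fetch one URL through the proxy and return the response code as a string.
def fetch_via_proxy(target, proxy, ua = GOOGLEBOT_UA)
  proxied_client(target, proxy).request(build_get(target, ua)).code
end
```

Calling `fetch_via_proxy("http://example.com/", "http://localhost:8080")` would make the request show up in Burp or ZAP, assuming one of them is listening on that port.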

Here are some examples.

Grab Google's sitemap.xml file and pass it through a local proxy on port 8080:

./sitemap2proxy.rb --url http://www.google.com/sitemap.xml --proxy http://localhost:8080

Note: I wouldn't recommend running this against Google. They have 35k records in their sitemap and just parsing that takes quite a while.

Do the same thing, this time pretending to be the Yahoo bot:

./sitemap2proxy.rb --url http://www.google.com/sitemap.xml \
	--proxy http://localhost:8080 \
	--ua "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

Parse a file you've already downloaded and send it through a proxy on a different machine:

./sitemap2proxy.rb --file sitemap.xml.gz --proxy http://proxyserver.int:8080

Do the above but verbosely:

./sitemap2proxy.rb -v --file sitemap.xml.gz --proxy http://proxyserver.int:8080

If you are stuck, simply ask for instructions:

./sitemap2proxy.rb --help

Interesting Fact

While testing this I found that the robots.txt file on google.com specifies a bunch of additional sitemaps. I didn't know you could do that. You should always be checking the robots.txt file for juicy stuff, and I think the possible findings just got juicier.
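Those extra sitemaps are listed in robots.txt on lines starting with "Sitemap:", so pulling them out is straightforward. The snippet below is a small sketch of the idea and is not part of sitemap2proxy; `sitemaps_from_robots` is a hypothetical name.

```ruby
# Hypothetical helper: collect the URLs from every "Sitemap:" line in a robots.txt body.
def sitemaps_from_robots(robots_txt)
  robots_txt.each_line.map do |line|
    # The field name is matched case-insensitively; the capture is the URL.
    line[/\A\s*sitemap\s*:\s*(\S+)/i, 1]
  end.compact
end
```

Each URL it returns could then be passed to the script with --url.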

Download

Change Log

  • Version 1.1 - Added response code stats
  • Version 1 - Released
