Scrapy is an extremely efficient framework for writing web spiders. With just 30 lines of code, you can write a crawler that goes through every link on your website and reports each one, along with its HTTP status and the page it was linked from.
I use this spider to detect broken links on my site. It can also be used to see how many links your site actually has, and later to build a sitemap.
My website is very small, with under 10K links, so I set CONCURRENT_REQUESTS_PER_DOMAIN to 2 in my spider settings so that the site is not overloaded.
CONCURRENT_REQUESTS_PER_DOMAIN = 2
If you have a large website, you can increase this number to allow more concurrent connections. More concurrent connections make the crawler faster, but they also put more load on the site.
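If you prefer to keep this setting with the spider rather than in the project-wide settings.py, Scrapy also reads per-spider overrides from a custom_settings dictionary on the spider class. A minimal sketch (the DOWNLOAD_DELAY entry is an illustrative addition, not something from the original setup):

```python
# Per-spider overrides that Scrapy reads from the spider's
# custom_settings class attribute.
custom_settings = {
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,  # at most 2 parallel requests per domain
    "DOWNLOAD_DELAY": 0.5,                # optional pause (seconds) between requests
}
```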
Run the spider and let it dump the data into a CSV file.
scrapy crawl brokenlink -t csv -o broken.csv
Alternatively, if you want to run the crawl in batches, with support for pausing and resuming, pass a JOBDIR with -s:
scrapy crawl brokenlink -s JOBDIR=crawls/broken -t csv -o broken.csv
Here is the spider code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item, Field

class LinkItem(Item):
    url = Field()
    referer = Field()
    status = Field()

class BrokenLinkSpider(CrawlSpider):
    name = "brokenlink"
    allowed_domains = ["yourdomain.com"]
    start_urls = ['http://www.yourdomain.com']
    # Scrapy only passes 2xx responses to callbacks by default;
    # list any extra statuses you want to record here.
    handle_httpstatus_list = [404, 410, 301, 500]

    rules = (
        Rule(LinkExtractor(allow=(), deny=('pattern_not_to_be_crawled',), unique=True),
             callback='parse_my_url', follow=True),
    )

    def parse_my_url(self, response):
        item = LinkItem()
        item['url'] = response.url
        item['referer'] = response.request.headers.get('Referer', None)
        item['status'] = response.status
        yield item
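Once the crawl finishes, the exported CSV can be filtered down to just the problem URLs with a few lines of Python. This is only a sketch: it assumes the CSV has the url, referer, and status columns produced by the item above, and the sample rows here are made up.

```python
import csv
import io

# Made-up sample of what the exported CSV might look like.
sample = """url,referer,status
http://www.yourdomain.com/ok,http://www.yourdomain.com/,200
http://www.yourdomain.com/gone,http://www.yourdomain.com/links,404
http://www.yourdomain.com/moved,http://www.yourdomain.com/,301
"""

def broken_rows(csv_text, threshold=300):
    # Keep rows whose status is >= threshold (redirects and errors).
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row["status"]) >= threshold]

for row in broken_rows(sample):
    print(row["status"], row["url"], "<-", row["referer"])
```

To run it against a real export, read broken.csv from disk instead of the sample string.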