Tuesday, May 1, 2012

Python link checker to find broken links in a website

Recently I took the online course "CS 101: Building a Search Engine" at www.udacity.com, where I learnt the basics of Python programming, crawlers and search engines. I thought of building a small utility that uses a crawler to locate broken links in a web site. The utility crawls the pages of your web site and checks whether all the links are accessible. If a link fails, it reports the URL along with its HTTP error code. It is best suited to web sites whose pages are connected through ordinary links (using "href") and which do not rely heavily on form submissions or AJAX calls.

Algorithm:
  • It accepts 2 parameters: 1) the start page URL and 2) the max depth to which the crawler runs.
  • It first crawls the root page and adds all of its links to the to-be-crawled list.
  • Once a page is crawled, it is removed from the to-be-crawled list and added to the crawled list along with its status: "OK" for an accessible page, the HTTP status code for an HTTP error, and "invalid url" for a wrong URL.
  • It keeps crawling links until it reaches the max depth.
  • After it finishes crawling, it writes a file named "site-health.txt" that lists every URL along with its status.
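
The steps above can be sketched roughly as follows. This is not the code from check-web-health.py, only a minimal illustration written for Python 3 (the utility itself was tested on Python 2.7.2). The names fetch, crawl and write_report are hypothetical, extract_links(html, base_url) is a callable you supply yourself, and both the exact meaning of "depth" (the root page counts as depth 0) and the "url status" line format of the report are assumptions.

```python
from collections import deque
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch(url):
    """Return (status, html) for one URL, using the three statuses listed above."""
    try:
        with urlopen(url, timeout=10) as response:
            return "OK", response.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # The server answered, but with an error code such as 404 or 500.
        return str(err.code), ""
    except (OSError, ValueError):
        # Malformed URL, unknown host, refused connection, timeout, ...
        return "invalid url", ""


def crawl(root, max_depth, extract_links, fetch_page=fetch):
    """Check pages breadth-first and return a dict mapping url -> status."""
    to_crawl = deque([(root, 0)])  # the to-be-crawled list, with each link's depth
    crawled = {}                   # the crawled list, with each page's status
    while to_crawl:
        url, depth = to_crawl.popleft()
        if url in crawled:
            continue
        status, html = fetch_page(url)
        crawled[url] = status
        # Assumption: links on pages already at max_depth are not followed.
        if status == "OK" and depth < max_depth:
            for link in extract_links(html, url):
                if link not in crawled:
                    to_crawl.append((link, depth + 1))
    return crawled


def write_report(crawled, path="site-health.txt"):
    """Write one 'url status' line per checked URL (assumed format)."""
    with open(path, "w") as report:
        for url, status in crawled.items():
            report.write("%s %s\n" % (url, status))
```

Passing fetch_page and extract_links in as arguments keeps the loop easy to try out against a fake site, without touching the network.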

Note:
  • This utility is most useful during the release or support phase of a project, when you want to make sure after every new build that all the links are working.
  • It does not crawl content that is loaded through AJAX calls.
  • It only follows links written as <a href="<url>"></a>.
  • It follows only links that are internal to the domain of the root URL; it does not crawl the pages of external domains. For example, if your root URL is www.a.com and your web site links to an external site www.b.com, it will crawl all the links inside the www.a.com domain and check the link to www.b.com itself, but it won't crawl the links found on www.b.com. If you want to crawl additional domains, edit the source code as follows:
    • Find the statement domain = get_domain(root) in the source code and change it to
                   domain = get_domain(root, "b", "c", <other domains>)
                  It will then crawl the root URL's domain as well as domains b, c and any other domains given in that statement.
  • I have tested it with Python 2.7.2. Please make sure that you have Python 2.7.2 or a later version installed on your machine.
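
To illustrate the two link-related notes above (only <a href="..."> links are picked up, and only pages on allowed domains are crawled further), here is a small sketch. It is written for Python 3 using only the standard library and is not taken from check-web-health.py. LinkParser, extract_links and is_internal are hypothetical names, and is_internal is just one possible way to do what get_domain is used for in the real source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page's own URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute URLs of all <a href="..."> links found in html."""
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def is_internal(url, allowed_domains):
    """True if url's host is one of the domains whose pages should be crawled."""
    return urlparse(url).hostname in allowed_domains
```

With allowed_domains = {"www.a.com"}, a link to http://www.b.com/ would still be collected and could have its status checked, but is_internal returns False for it, so a crawler built on this sketch would not go on to follow the links on that page.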
Source code:
https://docs.google.com/open?id=0B8O-miA80x0gS2JnSkVqTkZtTWs

How to use:
  • Download check-web-health.py from the above link and open it in an editor.
  • Go to the last line of the code, which reads: check_web_health('http://google.com',2)
  • Change this line to check_web_health(<url of start page>,<max depth of crawling>)
  • Save and run the program.
  • After the program exits, look for a file named "site-health.txt" in the same directory as check-web-health.py. Each line in this file contains a URL along with its status.

My knowledge of Python programming is at an intermediate level, so there may be some issues with this utility. Please use it at your own risk :)

Please let me know about any issues you face while using this utility.

Thanks
