Chrome Crawler – A web crawler written in JavaScript

EDIT: I now have a newer, better version of this called “Recursive

Depending on your level of geekness you may or may not enjoy this one.

I proudly present Chrome Crawler, my latest Google Chrome extension:

The idea is simple really. You give it a URL, and it goes off and finds all the links on that page. It then follows them to more pages, gets all the links there, follows those too, and so on and so on.

Along the way it checks each page to see if there are any ‘interesting’ files linked there. If it finds an interesting link it will flag it for you so you can check it out.
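To make the idea a bit more concrete, here is a minimal sketch of that loop. It is not the extension's actual source: the names `extractLinks`, `isInteresting` and `crawl`, the `maxPages` limit and the default list of 'interesting' file extensions are all assumptions made for illustration, and the regex-based link extraction is a simplification of real HTML parsing.

```javascript
// Hypothetical sketch of the crawl loop; none of these names come from
// the Chrome Crawler source.

// Pull href values out of raw HTML and resolve them against the page URL.
function extractLinks(html, baseUrl) {
  const links = [];
  const hrefPattern = /href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      const url = new URL(match[1], baseUrl);
      url.hash = ""; // Treat page.html#top and page.html as the same page.
      if (url.protocol === "http:" || url.protocol === "https:") {
        links.push(url.href);
      }
    } catch (err) {
      // Skip anything that does not parse as a URL.
    }
  }
  return links;
}

// A link is 'interesting' if its path ends with one of the given extensions.
function isInteresting(url, extensions) {
  const path = new URL(url).pathname.toLowerCase();
  return extensions.some((ext) => path.endsWith(ext));
}

// Breadth-first crawl. fetchPage(url) must return a Promise of the page HTML.
async function crawl(startUrl, fetchPage, options = {}) {
  const { maxPages = 50, extensions = [".pdf", ".zip", ".mp3"] } = options;
  const queue = [new URL(startUrl).href];
  const visited = new Set(queue);
  const interesting = new Set();
  let crawled = 0;

  while (queue.length > 0 && crawled < maxPages) {
    const pageUrl = queue.shift();
    let html;
    try {
      html = await fetchPage(pageUrl);
    } catch (err) {
      continue; // Page could not be loaded: move on to the next one.
    }
    crawled += 1;

    for (const link of extractLinks(html, pageUrl)) {
      if (isInteresting(link, extensions)) {
        interesting.add(link); // Flag the file, but do not crawl into it.
      } else if (!visited.has(link)) {
        visited.add(link);
        queue.push(link);
      }
    }
  }
  return { crawled, interesting: [...interesting] };
}
```

In a browser you could pass something like `(url) => fetch(url).then((res) => res.text())` as `fetchPage`. The `visited` set is what stops the crawler going round in circles when two pages link to each other.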

There's an options page that lets you customise the way it all works:

If you are still confused check out the video below:

So why did I make this? Well to be frank, I made it mostly “just ’cause I can”!

Also, having learned from my last Chrome extension project, PostToTumblr, I realised the Chrome API allows you to do some things that you wouldn't normally be allowed to do on a website (namely cross-origin XHR), and I wanted to do something to take advantage of it.
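For the curious, a cross-origin request from an extension page looks just like a normal XHR. The difference is that it is allowed to reach other sites, provided the extension has asked for access to those hosts in the permissions section of its manifest. The sketch below is only an illustration under that assumption: `getPageText` is a made-up helper name, not something from the extension's source, and the `XhrClass` parameter exists purely so the function can be tried outside a browser.

```javascript
// Hypothetical helper, not taken from the Chrome Crawler source.
// XhrClass defaults to the browser's XMLHttpRequest and is injectable
// so the function can also be exercised with a stand-in.
function getPageText(url, XhrClass = XMLHttpRequest) {
  return new Promise((resolve, reject) => {
    const xhr = new XhrClass();
    xhr.open("GET", url, true); // third argument true = asynchronous
    xhr.onload = () => {
      if (xhr.status >= 200 && xhr.status < 300) {
        resolve(xhr.responseText);
      } else {
        reject(new Error("HTTP " + xhr.status + " for " + url));
      }
    };
    xhr.onerror = () => reject(new Error("Network error for " + url));
    xhr.send();
  });
}
```

On an ordinary web page the same call to another site would be blocked by the same-origin policy unless that site opted in with CORS headers.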

It didn't take me long to knock out this project: one lazy Saturday for the majority of the code, and today for a quick fix or two, writing this post and making the video. As such I expect there to be many bugs and problems, so if you encounter one drop me an email (my address is on the options page).

Oh, finally, I wouldn't try using this on a Google page as you will likely end up seeing this quite often:

Anyways you can grab it over on the Chrome extensions gallery here. If you enjoy it please leave me a review / comment, much love!

11 Comments

PAEz

about 2 years ago

This is some nice code to make custom crawlers with, thanks so much for sharing it. Love the way you handled the settings. I didn't have much experience with get and set but I'll be using them in the future. One thing that might be a nice addition to this is an option to block images, as it downloads the images when it makes the page and that's really unnecessary. Unfortunately the only way I know how to do that (content settings in options doesn't work) is to block images in the whole browser. I couldn't figure out a way to just target stuff from the page being crawled.... If you ever find a way I'd love to hear it. Here's how I block all images in Chrome from being downloaded (I put this in a separate extension)....

chrome.webRequest.onBeforeRequest.addListener(
  function(info) { return {cancel: true}; },
  // filters
  { urls: [ "", ], types: ["image"] },
  // extraInfoSpec
  ["blocking"]);

PAEz

about 2 years ago

urggh... the code didn't come out right (stripping tags I guess). The URLs entry should be left sharp bracket "no idea what they're really called ;)" all_urls right sharp bracket, inside the quotes, i.e. "<all_urls>"

Raja

about 2 years ago

hello sir, this extension is really osum... i really liked it... i need some help from you. could you please contact me when you come online... or give me your mail id. my id : kkrajdurai@gmail.com

Angry

about 1 year ago

Dumbest reason to implement a new scraper I have heard of. Basically: "I know how to be a criminal, so I thought I would be one."

mikecann

about 1 year ago

not sure what you mean there? criminal?

Shivin Saxena

about 7 months ago

I am working on a similar Chrome extension project which needs to do exactly what your extension does: get all outbound links from the current website. Unfortunately, I have not been able to implement it or understand the algorithm or strategy behind making a recursive call to scan all links one by one. I am decent at JavaScript and it would be great if you could share some tips with me! :)

mikecann

about 7 months ago

Hi mate,

You should search my site for "Recursive". It was a follow-up project I did that works on the same principle, full source included with plenty of info!

Mike

Shivin Saxena

about 2 months ago

Thanks! I will look into the code. How did you get your crawler to spider URLs so quickly? Are you using some server-side scripting or some other way of multi-threading AJAX and XHR calls to each URL you find on your way? Actually I am pretty new to this and am working on a similar extension :)

mikecann

about 2 months ago

Nope nothing fancy. Just async ajax calls
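As a rough illustration of why plain async calls are enough, the sketch below starts every request straight away instead of waiting for one page before asking for the next. It is an assumption about the general approach rather than the extension's code: `fetchAll` is a made-up name, and `fetchPage` stands for any function that returns a Promise of the page text.

```javascript
// Hypothetical sketch: fetchPage is any function returning a Promise of page text.
function fetchAll(urls, fetchPage) {
  // Start every request immediately; none waits for another to finish.
  const pending = urls.map((url) =>
    fetchPage(url).then(
      (html) => ({ url, ok: true, html }),
      (error) => ({ url, ok: false, error })
    )
  );
  // Resolves once every request has either succeeded or failed,
  // with results in the same order as the input URLs.
  return Promise.all(pending);
}
```

Because a failed request is turned into an `ok: false` entry rather than a rejection, one dead link does not throw away the results of the others.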

windows 8 upgrade

about 6 months ago

It was wonderful to read about chrome crawler which is web crawler written in JavaScript. It was nice of you to share the options of the chrome crawler with image, as it was easy to understand through the image that you have shared.
