I recently released a personal project on GitHub. It is called CrowLeer and you can find it here.
Over the last year I worked for a customer who needed software capable of extracting specific data from the pages of a number of public websites. I was ready to write the code to recognize and store that data, but couldn't find an existing crawler that fit my needs. They come in all shapes:
- Some offer a lot of very useful SEO data but can't download pages
- Others have a download feature but lack the granular control needed to avoid downloading or following a great number of irrelevant pages
- The ones that can download pages and give proper control over the flow of the crawl lack reliability or a proper way to integrate with other software
I ended up using one of the "unreliable" ones mentioned above (with loads of ad-hoc middleware) and called it a day, but months later I decided to create my own as a personal project.
CrowLeer was created with simplicity, control and ease of integration in mind. You can find all the details on the GitHub page linked at the top of the article. I plan to greatly expand its features, but I already find it much more functional than many of the competitors I've worked with.
If you use it in your project, or just try it out, you can send me feedback at my email address. Even negative feedback will be much appreciated.