I recently decided to release a personal project of mine on GitHub. It's called CrowLeer, and you can find it here.
Over the last year I worked for a customer who needed software capable of extracting particular data from pages on a number of public websites. I was ready to write the code for recognizing and storing that data, but couldn't find an existing crawler that fit my needs. They come in all shapes:
- Some offer a lot of very useful SEO data but can't download pages
- Others have a download feature but lack the granular control needed to avoid downloading or following a great number of irrelevant pages
- The ones that can download and offer proper control over the flow of the crawl lack reliability or a proper way to integrate with other software
I ended up using one of the previously mentioned "unreliable" ones (with loads of ad-hoc middleware) and called it a day, but months later I decided to create my own as a personal project.
CrowLeer was created with simplicity, control and interfaceability in mind. You can find all the details on the GitHub page linked at the top of the article. I have plans to greatly expand its features, but I already find it much more functional than many of the competitors I've worked with.