Because the crawling process is I/O bound, it is very useful to fetch pages in multiple threads. I chose an architecture with an administrator class and a variable number of worker threads doing the I/O work. This sped up the crawling process by more than a factor of three.
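The excerpt does not include the code itself; the following is a minimal sketch of such an administrator/worker split in Python 2, using the standard threading and Queue modules. The class and method names are illustrative, not PySeeek's actual API.

import Queue
import threading
import urllib2

class Administrator(object):
    """Owns the URL queue and a pool of worker threads doing the I/O."""

    def __init__(self, num_workers=4):
        self.queue = Queue.Queue()
        self.results = []
        for _ in range(num_workers):
            thread = threading.Thread(target=self._worker)
            thread.daemon = True  # don't keep the process alive on exit
            thread.start()

    def _worker(self):
        # Each worker blocks on the queue and fetches pages; appending to
        # a list is safe here thanks to CPython's GIL.
        while True:
            url = self.queue.get()
            try:
                self.results.append((url, urllib2.urlopen(url).read()))
            except urllib2.URLError:
                pass  # skip unreachable pages in this sketch
            self.queue.task_done()

    def crawl(self, urls):
        for url in urls:
            self.queue.put(url)
        self.queue.join()  # block until every URL has been handled
        return self.results

admin = Administrator(num_workers=8)
pages = admin.crawl(['http://example.com/'])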
I often had the problem that the Interactive Console from PyDev does not start in the project source folder. On Windows, the default working directory is:
So I wrote a short script that sets the working directory automatically and can be added to “Initial interpreter commands” under Preferences > PyDev > Interactive Console. Here it is:
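(The ten-line listing itself was lost in extraction; what follows is a minimal sketch of a script with the described behavior. That it scans sys.path in a for/else loop is an assumption, suggested by the note about the else block below.)

import os
import sys

# Assumption: PyDev puts the project's source folders at the front of
# sys.path, so the first existing directory there is the project folder.
for entry in sys.path:
    if entry and os.path.isdir(entry):
        os.chdir(entry)
        print('Working directory set to %s' % entry)
        break
else:
    print('No project source folder found on sys.path.')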
Please note that there have to be two newlines at the end of the script. Otherwise the else block would not be left correctly.
This works fine as long as you don't have any custom paths in your PYTHONPATH environment variable.
In this part of the blog series about a search engine in Python, I want to show how to automate the crawling process and how to handle robots.txt files. Here is a log of the performance of the resulting crawler:
(performance log not preserved in this excerpt)
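The post's actual robots.txt handling is not shown in this excerpt; as one sketch, it can be done with Python 2's standard robotparser module. The user agent name PySeeekBot and the URLs are placeholders.

import robotparser

# Download and parse the site's robots.txt once per host.
parser = robotparser.RobotFileParser()
parser.set_url('http://example.com/robots.txt')
parser.read()

# Ask before fetching each page.
if parser.can_fetch('PySeeekBot', 'http://example.com/page.html'):
    print('allowed to crawl')
else:
    print('disallowed by robots.txt')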
In my project SITforC I created a Windows installer with NSIS, so that the software does not depend on an existing Python installation. This required a portable version of Python, because I didn't want to use tools like py2exe or PyInstaller. If you like the idea of a portable Python version and need one with your own dependencies, see this short guide to generating one of your own.
Note: You can download the current version of SITforC to get a portable Python version with some common dependencies like
I'm currently working on an experimental search engine in Python named PySeeek and want to share my experience. In this blog post I want to present the basics of a simple web crawler, using the built-in library urllib2 to fetch resources and lxml to parse the HTML.
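As a minimal sketch of these two steps (the URL and the user agent string are placeholders, and error handling is omitted):

import urllib2
from lxml import html

# Fetch a page with urllib2.
request = urllib2.Request('http://example.com/',
                          headers={'User-Agent': 'PySeeekBot'})
page = urllib2.urlopen(request).read()

# Parse the HTML with lxml and extract all outgoing links.
tree = html.fromstring(page)
for href in tree.xpath('//a/@href'):
    print(href)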