schlamar.blog

Python Search Engine - Crawler Part 3

Because the crawling process is I/O bound, it is very useful to fetch pages in multiple threads. I chose an architecture with one administrator class and a variable number of worker threads that do the I/O work. This sped up the crawling process by more than a factor of three.
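
Here is a minimal sketch of that pattern, written in Python 2 to match the rest of the series. The class and method names (Administrator, crawl, _work) are illustrative, not the actual PySeeek code: the administrator keeps a queue of URLs and a fixed pool of worker threads that fetch pages concurrently.

import urllib2
from Queue import Queue
from threading import Thread

class Administrator(object):
    """Hands URLs to worker threads and collects the downloaded pages."""

    def __init__(self, num_workers=10):
        self.tasks = Queue()
        self.results = Queue()
        for _ in range(num_workers):
            worker = Thread(target=self._work)
            worker.daemon = True  # don't block interpreter shutdown
            worker.start()

    def _work(self):
        while True:
            url = self.tasks.get()
            try:
                html = urllib2.urlopen(url, timeout=10).read()
                self.results.put((url, html))
            except Exception:
                pass  # skip pages that fail to download
            finally:
                self.tasks.task_done()

    def crawl(self, urls):
        for url in urls:
            self.tasks.put(url)
        self.tasks.join()  # block until every page has been fetched

admin = Administrator()
admin.crawl(['http://example.com/'])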

PyDev Interactive Console - Set Working Directory

I often had the problem that the Interactive Console from PyDev does not start in the project source folder. On Windows, the default working directory is:

>>> os.getcwd()
'C:\\Windows\\system32'

So I wrote a short script that sets the working directory automatically; it can be added to “Initial interpreter commands” under Preferences > PyDev > Interactive Console. Here it is:

import sys; print('%s %s' % (sys.executable or sys.platform, sys.version))
import os
# Guess the project source folder: drop the interpreter directory,
# PyDev's own entries and the stdlib zip from sys.path.
base_path, _ = os.path.split(sys.executable)
cwd_path = [path for path in sys.path if base_path not in path
        and 'org.python.pydev' not in path and 'python26.zip' not in path]
if len(cwd_path) == 1:
    os.chdir(cwd_path[0])
else:
    # several candidates remain: choose the shortest path
    os.chdir(sorted(cwd_path, key=len)[0])

Please note that there have to be two newlines at the end of the script; otherwise the else block is not exited correctly.

This works fine as long as you don't have any custom paths in your PYTHONPATH environment variable.

Python Search Engine - Crawler Part 2

In this part of the blog series about a search engine in Python I want to present how to automate the crawling process and how to handle the robots.txt file. Here is a log showing the performance of the resulting crawler:

Total runtime: 25 min
Pages processed: 1963
Average: 1.265 Pages/s 75.927 Pages/min
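
For the robots.txt handling, one possible approach (a sketch, not necessarily the exact PySeeek implementation; the user agent name 'PySeeekBot' is made up) is to cache one parser per host and ask it before every request:

import urlparse
import robotparser  # urllib.robotparser in Python 3

_parsers = {}  # one cached parser per host

def allowed(url, agent='PySeeekBot'):
    """Check the host's robots.txt before fetching the given URL."""
    host = urlparse.urlparse(url).netloc
    if host not in _parsers:
        parser = robotparser.RobotFileParser()
        parser.set_url('http://%s/robots.txt' % host)
        try:
            parser.read()
        except IOError:
            pass  # robots.txt not reachable; with no rules parsed, can_fetch() allows everything
        _parsers[host] = parser
    return _parsers[host].can_fetch(agent, url)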

Python Portable on Windows

In my project SITforC I created a Windows installer with NSIS, so that the software does not depend on an existing Python installation. This required a portable version of Python, because I did not want to use tools like py2exe or PyInstaller. If you like the idea of a portable Python version and need one with your own dependencies, see this short guide on generating one yourself.

Note: You can download the current version of SITforC to get a portable Python version with some common dependencies like pygtk, numpy and matplotlib.

Python Search Engine - Crawler Part 1

I’m currently working on an experimental search engine in Python named PySeeek and want to share my experience. In this blog post I want to present some basics for creating a simple web crawler, using the built-in library urllib2 to fetch the pages and lxml to parse the HTML.
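
As a first taste of what the following parts build on, here is a minimal, self-contained sketch (not the final crawler) that downloads a page with urllib2 and extracts the outgoing links with lxml:

import urllib2
import lxml.html

def fetch_links(url):
    """Download a page and return the absolute URLs of all links on it."""
    html = urllib2.urlopen(url, timeout=10).read()
    doc = lxml.html.fromstring(html)
    doc.make_links_absolute(url)  # turn relative hrefs into full URLs
    return doc.xpath('//a/@href')

print(fetch_links('http://example.com/'))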