building a web crawler: mongo vs sql

Toby Segaran’s Programming Collective Intelligence is a lot of fun. Recently I was playing around with some of the code in the chapter on “Searching and Ranking,” which builds a simple web crawler. (The full source tree for the book is available on Toby’s blog.) Contributing to the steepness of the learning curve, in my case at least, was the crash course in relational databases. All of the data scraped by the crawler is dispersed into various tables, and connecting the data requires tables relating those tables, which are also built concurrently. The code design accomplishes all of this with elegance, and the discussion in the book is exemplary in its lucidity. Nevertheless, it takes a couple of days for a beginner to completely absorb.

Take for instance the task of searching indexed pages. The database used by the crawler routines is SQLite, a lightweight relational database. Once a tree of webpages has been crawled and indexed, it is searched by submitting SQL queries to the database. For example, to find all pages that contain two words requires submitting the query

select w0.urlid,w0.location,w1.location
from wordlocation w0,wordlocation w1
where w0.urlid=w1.urlid
and w0.wordid=10
and w1.wordid=17

This string is difficult for a human to debug, since the two “words,” wordid=10 and wordid=17, point to rows in a table and are only determined via lookup; on top of that, the string itself is generated programmatically by a Python method that calls several helper methods in the same class.
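Roughly, the generation works like this. Here is a simplified sketch (not the code from the book or from my repository) of how a query for an arbitrary list of word row ids can be assembled:

def build_query(wordids):
	# Join one copy of the wordlocation table per search term and
	# require all copies to agree on the urlid.
	fields = 'w0.urlid'
	tables, clauses = [], []
	for i, wordid in enumerate(wordids):
		fields += ',w%d.location' % i
		tables.append('wordlocation w%d' % i)
		if i > 0:
			clauses.append('w%d.urlid=w%d.urlid' % (i - 1, i))
		clauses.append('w%d.wordid=%d' % (i, wordid))
	return 'select %s from %s where %s' % (
		fields, ','.join(tables), ' and '.join(clauses))

# build_query([10, 17]) yields a query equivalent to the one above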

It was a good feeling once I started seeing how all the pieces fit together.

The goal that I set for myself was to see what this would look like using a document-oriented database like MongoDB. I’ll explain a little bit about how I approached this and describe some of the things that I discovered. Please pull the code base from the repository if you’d like to sing along.

Dependencies

The experiments require both MongoDB and SQLite. Also ensure that their respective Python bindings, pymongo and PySQLite, are installed, as well as Beautiful Soup for parsing HTML documents.

Storing Records

Like SQLite, MongoDB has an interactive command interpreter, which is useful for experimenting with insertion and query commands. The documentation page has some clean tutorials for getting up and running. All “records” accessed via the command interpreter are JSON objects, which translate immediately into Python dictionaries. The main data scraped from a webpage for these experiments are the links and the individual words, and each page is indexed by its url. Thus a single dictionary record may look something like:

{
	"url" : "http://kiwitobes.com/wiki/Programming_language.html",
	"words" : [
		"your", "continued", "donations", "keep", "wikipedia",...
		],
	"links" : [
		"http://kiwitobes.com/wiki/Artificial_language.html",
		"http://kiwitobes.com/wiki/Control",
		"http://kiwitobes.com/wiki/Computer.html",
		"http://kiwitobes.com/wiki/Natural_language.html",
		...
		]
}

The innermost loop of the crawler algorithm has two main steps: (1) use Beautiful Soup to parse the elements out of an HTML page and insert them into the single dictionary record; (2) insert the record into the database. Compare this to the intermediary layers needed when storing all of the data in an SQL database. The “urls” and “words” are stored in separate tables, and instead of using the strings directly, row ids are used. With a uniform naming convention, a helper routine like the following converts indexed urls and words into row ids:

def getentryid(self, table, field, value, createnew=True):
	# Look up the rowid of `value` in `table`; insert a new row if it is not there yet.
	cur = self.con.execute("select rowid from %s where %s='%s'" % (table, field, value))
	res = cur.fetchone()
	if res is None:
		cur = self.con.execute("insert into %s (%s) values ('%s')" % (table, field, value))
		return cur.lastrowid
	else:
		return res[0]

These row ids are in turn used to generate SQL insert commands (as Python strings) for maintaining the table of “links” from a url and the table of words associated with a url.
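By contrast, the MongoDB version of this step is little more than building the dictionary shown above and handing it to a collection. Here is a minimal sketch using current bs4 and pymongo APIs; the database and collection names are my own choices, and the parsing is deliberately simplified:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pymongo

pages = pymongo.MongoClient()['searchindex']['pages']   # assumed names

def addtoindex_mongo(url):
	# Step (1): parse the words and links out of the page.
	soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
	record = {
		'url': url,
		'words': soup.get_text().lower().split(),
		'links': [a['href'] for a in soup.find_all('a', href=True)],
	}
	# Step (2): insert the single record; no row-id bookkeeping required.
	pages.insert_one(record)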

Searching

The structure in which the data is recorded automatically assigns a set of words to each url, but searching requires a mapping in the opposite direction: given a query word, or a small collection of words, how do we get all urls associated with them? Think of the urls and words as the two sets of nodes of a bipartite graph: there is an “edge” between a url and a word only if that word appears on the webpage for that url. Our storage format as JSON records implicitly captures this information as a directed graph; the search problem would be much easier to solve if it were undirected.

One convenience of querying a Mongo database is the simplicity of the format: supplying a JSON argument of the form {"key": "value"} to the find() method returns all records with that value for the given key, and a comma-separated list of key-value pairs refines the query further. This inspired my (first-approximation) solution to the search problem: build another database collection with a record containing both a "url":value and a "word":value pair for every word on a webpage. Querying this collection for a "word":value pair then returns all records with the webpages where that word occurs, and solving the problem for multiple search terms reduces to finding the intersection of the resulting sets of webpages. In effect, this amounts to extracting the undirected bipartite graph, as a database collection, from the directed graph. For the executive summary, there are three main methods in the searcher class:

  1. a method count() to determine “stop words” in a language-neutral fashion: hash the words to their respective frequencies in the dataset and ignore those above a hand-tuned threshold;
  2. a method buildIndex() to assemble the collection containing all “url”:value, “word”:value data, ignoring the stop words;
  3. a search method simpleQuery('string') that returns the raw list of urls from a query of key words using the reduced index (a rough sketch of the last two follows).
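To make the approach concrete, here is a rough sketch of the last two methods, assuming a pymongo connection, a pages collection holding the crawler records shown earlier, and a wordindex collection for the url/word pairs; the class in the repository also carries the location data described below, but the shape of the idea is the same:

import pymongo

class searcher:
	def __init__(self):
		self.db = pymongo.MongoClient()['searchindex']   # assumed database name

	def buildIndex(self, stopwords=set()):
		# One record per (url, word) edge of the bipartite graph.
		for page in self.db.pages.find():
			for location, word in enumerate(page['words']):
				if word in stopwords:
					continue
				self.db.wordindex.insert_one(
					{'url': page['url'], 'word': word, 'location': location})

	def simpleQuery(self, querystring):
		tokens = querystring.lower().split()
		# One find() per token, then intersect the sets of urls.
		hits = [set(rec['url'] for rec in self.db.wordindex.find({'word': t}))
				for t in tokens]
		return tokens, set.intersection(*hits) if hits else set()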

That sucked

It turns out that querying a record against a collection in this kind of database is really slow. I can now understand the appeal of a relational database, even with all of the overhead it demands in planning, implementation, and the intermediary steps for inserting and querying.

My next goal was to refactor the code from the original PCI project so I could compare the performance of the two approaches side by side. I therefore wrote two versions of the searcher class: one in the module searcher_mongo.py and another in searcher_sqlite.py. The classes in the two modules have the same names and the same methods; only the implementations differ. In the SQLite module, searcher.buildIndex() builds the tables urllist, wordlist, and wordgraph from the data indexed by the crawler, literally representing the “urls,” “words,” and “edges” of the bipartite graph.
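For reference, the SQLite side looks roughly like the following; the column names are my reading of the description above rather than a verbatim copy of the repository, and the index is my own addition:

def initTables(self):
	# urls, words, and the edges between them (with each word's location)
	self.con.execute('create table urllist(url)')
	self.con.execute('create table wordlist(word)')
	self.con.execute('create table wordgraph(urlid, wordid, location)')
	self.con.execute('create index wordidx on wordgraph(wordid)')   # speeds up word lookups
	self.con.commit()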

One more detail: the code base from Toby’s PCI project tracked not only the edges between urls and words, but also the “location” of each word on the page, essentially the integer index of that word when the page is represented as an array. This information is used in post-processing the raw list of search results to provide various rankings. Although these post-processing features aren’t implemented in this version of my class library, the location data is captured and indexed by the searcher.buildIndex() methods of both modules.

The searcher.simpleQuery() method returns the search results as structured data. The first item is a list of “tokens” from the search query string, and the second item is a dictionary of url:data pairs, where the “data” is in turn another dictionary encapsulating the token:locations data for that document.
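For example, the returned structure looks something like this; the urls and location indices here are invented purely for illustration:

tokens = ['natural', 'language']
results = {
	'http://kiwitobes.com/wiki/Natural_language.html': {
		'natural': [3, 118],           # positions of the token in that document
		'language': [4, 119, 240],
	},
	# ... one entry per url that matched every token
}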

Results

I made no attempt to run an exhaustive battery of comparison tests, but the numbers corroborate my experience. To obtain them I used the cProfile package; see the main_test.py module in the source tree. The test program first indexes a set of webpages using the crawler class, starting from an initial seed list. It then connects a searcher class to this database for each of the respective modules:

>>> Sql = searcher_sqlite.searcher()
>>> Mdb = searcher_mongo.searcher()

The word index is built (and timed) for the SQL version:

>>> Sql.initTables()
>>> Sql.count(0.14)
>>> cProfile.run('Sql.buildIndex()')
searcher_sqlite : building word index
         796334 function calls in 21.349 seconds
...

And the same is done for the MongoDB version:

>>> Mdb.count(0.14)
>>> cProfile.run('Mdb.buildIndex()')
searcher_mongo : building word index
         5257029 function calls in 18.389 seconds
...

Finally it compares the search performance for each of the classes:

>>> cProfile.run('(tokens, results) = Mdb.simpleQuery("anything")')
         2025 function calls (2019 primitive calls) in 3.787 seconds

>>> cProfile.run('(tokens, results) = Sql.simpleQuery("anything")')
         165 function calls in 0.004 seconds

and

>>> cProfile.run('(tokens, results) = Mdb.simpleQuery("something special")')
         5560 function calls in 7.593 seconds

>>> cProfile.run('(tokens, results) = Sql.simpleQuery("something special")')
         385 function calls in 0.015 seconds

and

>>> cProfile.run('(tokens, results) = Mdb.simpleQuery("what about Google?")')
         8748 function calls (8743 primitive calls) in 6.315 seconds

>>> cProfile.run('(tokens, results) = Sql.simpleQuery("what about Google?")')
         1737 function calls in 0.171 seconds

These tests were run on a late 2010 MacBook Air. Your mileage may vary. Comments welcome.

getting with github

I have a lot of projects on SourceForge. It’s home to the first project to which I devoted serious contributions, and it seemed a natural place to host my own projects. Some of those are academic research projects, some are ideas I wanted to share with friends, and some are code bases where I simply replaced a brittle makefile with a build configuration system.

However, the self-imposed ostracism of not being active on GitHub grows more conspicuous with each piece of code I want to publish. It’s like being one of those rare souls who aren’t on Facebook.

Recently I “seeded” my GitHub profile with a small code base, gslextn, parts of which have already been incorporated elsewhere. I’m an avid user of the GSL, but eventually found the process of working with an API where things like matrices and vectors are C-structs to be wanting of more lubrication. Wrapping essential objects in C++ classes to exploit operator overloading and RAII idioms — without succumbing to the impulse to wrap each API call in a class method — seems like a good balance.

refactoring

I recently switched the back-end of this site from Indexhibit to WordPress. Call it a refactoring of os-scientific.org. Republishing a few articles as pages on the new site was a good exercise in learning the ins and outs of WordPress. (Not to mention an exercise in patience.)

Not only do I write a lot of code, I write a lot about it — actually a lot about the intersection of code and math. I have been using Doxygen for a while now to document code. What I really like about it as a publishing tool is that:

  • it handles code snippets elegantly, and
  • it lets me incorporate LaTeX.

In fact Doxygen’s \mainpage has become my replacement for \documentclass{article}, allowing me to keep expository notes in the same file tree as a code base and to generate clean-looking HTML.

Indexhibit plays reasonably well with external webpages, so that the index.html generated from a Doxygen main page is essentially a ready-to-go blog post. WordPress does not play so well with foreign HTML code pasted into the editor, particularly HTML with lots of formatting directives. After a little experimentation with regular expressions, I found a way to scrub Doxygen-erated HTML into something more palatable to WordPress.

Let me mention that the site is running MathJax in the background. Formatting symbols in posts is handled exactly as in LaTeX documents, i.e., LaTeX markup escaped with the standard dollar-sign symbols.

For the scrubbing filter I used sed:

 $ sed -f doxy2wp.txt $DOCPATH/index.html > wordpress.html

where doxy2wp.txt contains the regular expressions:

s/<span[^>]*>//g; s/<\/span>//g
s/<div[^>]*>//g; s/<\/div>//g
s/^ //g
s/<img class=\"formula...\" alt=\"//g
s/\" src=\"form_[0-9]*\.png\"\/>//g
s/<p>//g; s/<p class="formulaDsp">//g; s/<\/p>//g

A bit of explanation: the first three lines strip away Doxygen artifacts that are superfluous in WordPress. The last three lines extract the LaTeX markup needed for MathJax. Although Doxygen renders all math symbols as PNG files, it thankfully embeds the original LaTeX markup as an attribute of the img tag; for example, something like <img class="formulaInl" alt="$x^2$" src="form_0.png"/> collapses to just $x^2$.