Source (plain): svn://svn.saintamh.org/code/search/trunk/
Source (highlighted): http://svn.saintamh.org/code/search/trunk/

This software provides a Web search interface that looks like a simplified Google, but combines results from various search engines (as of 2012: Google, Bing, Yandex and Exalead).

Background

I decided to write this when Bing came out and a Microsoft employee published the Blind Search tool, where you type in a query and results from three engines are shown side by side in three columns, without telling you which engine produced which column. I used it as my search engine for a week or so, and I was surprised to notice that:

  1. results are quite different from engine to engine;

  2. Google, the One True Search Engine I'd been using exclusively for years, and which I'd always just assumed to be the best, didn't always have the most relevant results;

  3. perhaps even more surprisingly, Bing was actually sometimes quite good!

This made me realize I might get a better web search experience overall if I simply used all three engines. The Blind Search tool is not practical for everyday use, though, so I set about writing my own.

Algorithms

The scraping is very standard stuff, and so are the data structures. To achieve near-tolerable speeds, the queries to the three search engines are run in parallel. The proper way to do this would have been to use threads, but the version of Perl installed on my Web server wasn't compiled with thread support, so the program uses subprocesses instead. To pipe data from the child processes back to the parent, I wrote a quick and dirty serialization library.
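To give a rough idea of the subprocess approach, here is a minimal sketch. It is written in Python rather than Perl, it is not the actual code from the repository, and it assumes a Unix-like system where `os.fork` is available. The function name `run_in_children` is made up for the example, and JSON stands in for the home-made serialization library. Each task runs in its own child process and writes its result to a pipe that the parent reads.

```python
import json
import os


def run_in_children(tasks):
    """Run each zero-argument callable in its own child process.

    `tasks` maps a name to a callable returning JSON-serializable data.
    Returns {name: result}; a task that fails yields an empty list.
    """
    pending = {}  # name -> (child pid, read end of its pipe)
    for name, func in tasks.items():
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: compute, serialize, write to the pipe, then exit
            # immediately so it never runs the parent's code.
            os.close(read_fd)
            status = 1
            try:
                payload = json.dumps(func())
                with os.fdopen(write_fd, "w") as out:
                    out.write(payload)
                status = 0
            finally:
                os._exit(status)
        # Parent: keep only the read end, so the pipe reports EOF
        # as soon as the child is done writing.
        os.close(write_fd)
        pending[name] = (pid, read_fd)

    results = {}
    for name, (pid, read_fd) in pending.items():
        with os.fdopen(read_fd) as inp:
            payload = inp.read()
        os.waitpid(pid, 0)  # reap the child
        results[name] = json.loads(payload) if payload else []
    return results
```

In a real program each callable would fetch and scrape one engine's result page. All children are started before the parent reads anything, so the total wait is roughly that of the slowest engine rather than the sum of all of them.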

The one challenging bit was figuring out how to rank the results when merging them into one set. If page A appears in 1st position in one engine, but page B appears in 2nd position in the other two, which one wins out and goes on top? Are certain engines' rankings more valuable than others? The algorithm I finally settled on puts each result at the highest rank any engine gave it (so if a page was put on top by any of the three engines, it'll appear on top in the merged output). Ties are resolved by looking at the second highest position given to the page by any engine, and so on.
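The merging rule can be sketched in a few lines. Again this is Python rather than the original Perl, and the name `merge_rankings` is invented for the example. One detail is an assumption of mine that the description above doesn't settle: when a page is missing from an engine, the missing position is treated as worse than any real rank, so between two pages with the same best rank, the one that more engines returned comes first.

```python
import math


def merge_rankings(rankings):
    """Merge per-engine result lists into one ordered list of URLs.

    `rankings` maps an engine name to its URLs, best first. Each URL is
    assumed to appear at most once per engine.
    """
    positions = {}  # url -> list of 1-based ranks, one per engine that returned it
    for urls in rankings.values():
        for rank, url in enumerate(urls, start=1):
            positions.setdefault(url, []).append(rank)

    engine_count = len(rankings)

    def sort_key(url):
        # Best rank first, then second best, and so on.
        ranks = sorted(positions[url])
        # Assumption: an engine that didn't return the page counts as
        # an infinitely bad rank.
        return ranks + [math.inf] * (engine_count - len(ranks))

    # sorted() is stable, so fully tied pages keep first-seen order.
    return sorted(positions, key=sort_key)
```

With the example from the paragraph above, page A (1st in one engine) has the key `[1, inf, inf]` and page B (2nd in the other two) has `[2, 2, inf]`, so A goes on top.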

Results

The program gathers some usage statistics. Basically, they show that Google does provide the most relevant results for me: I click on Google results more often than on other engines', and Google shows more results that aren't returned by other engines.

But Google's not enough on its own: about 1 out of 4 results I click on wasn't returned at all by Google (not on the first page of results, anyway, which is in most cases all I read). This is the main reason why I continue using this little hack for all my web searching.