From September 2009 to August 2012, I used as my everyday search engine a little script that fetched results from several search engines and merged them into a single list. The page didn't show which results came from which engine, and it tracked my clicks. This page shows some stats extracted from that click data, in the hope of answering the question that kind of sparked the whole project: "which engine gives the best results?"

Clickthrough rates

This chart shows, for each search engine, how often I clicked on their results:

If an engine always returned off-topic results, I would never click on them, and its clickthrough rate would be 0. If an engine always returned results so relevant that every result I ever clicked had been returned by it (possibly along with other engines), then its clickthrough rate would be 100%.
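Just to make the metric concrete, here is a rough Python sketch of how such a clickthrough rate could be computed. The click-log format below (a set of engine names per clicked result) is made up for illustration; it isn't the format my script actually used.

```python
from collections import defaultdict

def clickthrough_rates(clicks):
    """Per-engine clickthrough rate: the share of all clicked results
    that the engine returned.

    `clicks` has one entry per clicked result, each entry being the set
    of engines that returned that result (hypothetical format).
    """
    counts = defaultdict(int)
    for engines in clicks:
        for engine in engines:
            counts[engine] += 1
    total = len(clicks)
    return {engine: count / total for engine, count in counts.items()}

# Two clicked results: both returned by Google, one also by Bing.
# Google's rate is 1.0 (2/2), Bing's is 0.5 (1/2).
rates = clickthrough_rates([{"google", "bing"}, {"google"}])
```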

On the search results page was a button that would reveal the names of the search engines that each result originated from. The numbers here exclude those times when I opted to see the engine identities, so as to keep the test "blind". I rarely chose to see the engine names anyway, except when debugging this app itself.

Some engines were invoked fewer times, either because I added them after the experiment was already underway or because I stopped using them before it ended.

Average rank

This chart shows the average rank, within the engine's result page, of the results that I clicked on.

Smaller values mean I tended to click on links that were closer to the top of that engine's results; higher values mean I tended to dig deeper into that engine's results.
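The metric itself is just the mean of those ranks. A minimal sketch, again with a made-up input format (the rank each clicked result had in one engine's own list):

```python
def average_click_rank(click_ranks):
    """Average rank (1 = top) of the clicked results within one engine's
    result list. Results that engine did not return are simply omitted."""
    return sum(click_ranks) / len(click_ranks)

# Hypothetical example: clicks landing on an engine's 1st, 2nd and 6th
# results give an average rank of 3.0.
avg = average_click_rank([1, 2, 6])  # -> 3.0
```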

User votes

Midway through the 3-year experiment, I added a "+" and a "-" link next to every search result, which allowed me to flag results that stood out as noticeably good or bad. Again, I couldn't see which engine I was voting for.

Here are the tallies. The percentage indicates how often the search engine received an up- or downvote, relative to how many times that engine's results were shown.

An alternative way of scoring these votes is to give more weight to votes on results that are ranked higher. This seems intuitive: if, for instance, I downvote a link that was returned by two engines, and the downvoted link occupied the top position in engine A's results but only the 30th position in engine B's results, then it seems reasonable that more blame should go to engine A than to engine B. Returning an irrelevant link in 30th position is not nearly as bad as returning it in the top position.

In the chart that follows, each value is the sum, over all votes on links returned by that engine, of the inverse of the rank that the engine gave to the link (positive for upvotes, negative for downvotes). So in the previous example, the downvote would subtract a full point from engine A's tally, but only 1/30th of a point from engine B's.
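In other words, the weighted tally is just a sum of signed reciprocal ranks. A minimal sketch, with the (vote, rank) pair format made up for illustration:

```python
def weighted_vote_score(votes):
    """Reciprocal-rank weighted vote tally for one engine.

    `votes` is a list of (vote, rank) pairs, where `vote` is +1 for an
    upvote or -1 for a downvote, and `rank` is the position (1 = top)
    that this engine gave to the voted link.
    """
    return sum(vote / rank for vote, rank in votes)

# The example from the text: one downvote on a link that engine A ranked
# 1st and engine B ranked 30th.
score_a = weighted_vote_score([(-1, 1)])   # -> -1.0
score_b = weighted_vote_score([(-1, 30)])  # -> -0.0333...
```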

Engine correlation matrix

Each of these charts focuses on one engine and shows, for the results returned by that engine that I clicked on, the percentage that were also returned by each of the other engines.

So, for instance, the "Google" chart has 64.89% in the "DuckDuckGo" column: that means that 64.89% of the results I clicked that had been returned by Google had also been returned by DuckDuckGo.
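Built from the same kind of hypothetical click log as in the sketch above, the matrix could be computed like this:

```python
from collections import defaultdict

def overlap_matrix(clicks):
    """Pairwise overlap between engines on clicked results.

    `clicks` is a list of sets, each naming the engines that returned one
    clicked result (hypothetical format). Returns matrix[a][b] = share of
    a's clicked results that engine b also returned.
    """
    both = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for engines in clicks:
        for a in engines:
            totals[a] += 1
            for b in engines:
                if b != a:
                    both[a][b] += 1
    return {a: {b: n / totals[a] for b, n in both[a].items()} for a in totals}

# Of the two clicked results returned by Google below, one was also
# returned by DuckDuckGo, so matrix["google"]["duckduckgo"] is 0.5.
matrix = overlap_matrix([{"google", "duckduckgo"}, {"google"}])
```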

Conclusion

The results have been pretty consistent over time: Google emerges as the clear winner no matter how you measure it. Their results are more relevant and complete than any other engine's.

The engines complement each other quite well, however: Bing often has relevant results that Google doesn't have, for instance. And 1 out of 4 results that I clicked didn't appear in Google at all.

Another interesting finding (not reflected in the data on this page) is that the quality of Yandex and Google results was slowly but surely improving over the whole period, while the other engines seemed more stable.

After 3 years of using that little script as my search engine, I concluded that I would just use Google :)