Long road to DSearch, Part 2: out of the Jungle but into the daymare

While still in the Jungle in early 2002, I came across the then rising supernova of the search engines: Google. They had posted details of their first programming contest. Now, for historical reasons that will remain undisclosed for now, I do hate programming contests, particularly those that limit the choice of programming languages to something lame like Pascal (yuck!), and the rules for many of them are often too far removed from practical, useful things that can be done. In this respect the contest that Google offered back then was really good. I was a bit annoyed that I could not write in my language of preference at the time, Perl, but other than that the idea for the contest was good: Google just gave you a bunch of crawled web pages and you had to do something interesting with the data. An excellent approach! Sadly, Google’s later contests moved in the wrong direction, so badly that it is not worth talking about, but their first attempt was the best. As you will see later, this contest played a pretty critical role in the decision to start working on DSearch.

Many failures can be turned into opportunities, and it is especially rewarding when the failure was not yours. So when all the lousy architectural decisions made by expensive fishy consultants became apparent, I decided to give some extra thought to how to make a better search engine, because the one we had implemented could basically be summed up in the following SQL query:

select * from Products where Keywords like '%KeyWord%'

It is hard (if possible at all) to do worse than that. This algorithm is particularly bad when one or more of the keywords is wrong, which makes the database scan the whole table before finding nothing: a handful of pointless queries could hit the site very hard and effectively allow bad chaps to execute a DoS attack on an e-commerce site. The database in question was DB2, a poor but “free” replacement for the great Sybase DB. The box running the database had 12 CPUs and they were running at around 75-80%, which is way too high, so I took that as a chance to play around with a newer approach that I had invented to make searching faster. Basically, we needed to get away from table scanning and, ideally, decide quickly whether some keywords would never result in any matches, so that we could abort the search early.

Products were already referenced by a unique integer product ID, so it was only logical to turn keywords into numbers too: a simple Perl script took product IDs with their keywords, tokenized the keywords and converted them into unique WordIDs, thus creating a lexicon, or dictionary. The dictionary was kept as a separate data table with a unique index on the keyword, which allowed a very quick lookup to determine either that some keywords were not present at all (made-up queries probably designed to DoS us), or which WordIDs we needed to search for. Later, when I started reading up on relevant research papers, I found out that this approach is called an Inverted Index.
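To make the lexicon idea a bit more concrete, here is a minimal sketch in Python (not the original Perl script, which is not shown here). The function names, the tokenizing rule and the in-memory dictionaries are all my assumptions for illustration; the real thing lived in database tables.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(products):
    """Build a lexicon (word -> WordID) and postings (WordID -> sorted ProductIDs).

    `products` is an iterable of (product_id, keywords) pairs.
    """
    lexicon = {}
    postings = defaultdict(set)
    for product_id, keywords in products:
        for word in tokenize(keywords):
            # A word seen for the first time gets the next free WordID.
            word_id = lexicon.setdefault(word, len(lexicon) + 1)
            postings[word_id].add(product_id)
    return lexicon, {wid: sorted(pids) for wid, pids in postings.items()}


def lookup_word_ids(lexicon, query):
    """Return the WordIDs for a query, or None as soon as a keyword is unknown."""
    word_ids = []
    for word in tokenize(query):
        word_id = lexicon.get(word)
        if word_id is None:
            return None  # unknown keyword: no need to touch the product data
        word_ids.append(word_id)
    return word_ids
```

The early `None` is the cheap rejection step: a made-up keyword costs one dictionary lookup instead of a full table scan.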

The search itself was done in a table containing clustered WordIDs and ProductIDs. In the case of multiple words the results were union’ised using SQL, a fairly fast operation when the data is already pre-sorted, and in any case it would beat a big table scan. When the search went live, overall database load dropped to 20-25%, a very substantial decrease. As the prototype was done in Perl/T-SQL, which turned out to be an unofficial “bad” language at the time, almost everything had to be changed to Java and DB2 SQL, something that was done by my good colleagues James and Mark.
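As a rough illustration of why a union over pre-sorted data is cheap, here is a small Python sketch. It is not the SQL that actually ran; `postings` is a hypothetical mapping of WordID to an already sorted list of ProductIDs, standing in for the clustered table.

```python
import heapq


def union_postings(posting_lists):
    """Merge sorted ProductID lists into one sorted list without duplicates."""
    result = []
    # heapq.merge walks the already sorted inputs once, without re-sorting them.
    for product_id in heapq.merge(*posting_lists):
        if not result or result[-1] != product_id:
            result.append(product_id)
    return result


def search(postings, word_ids):
    """Return the ProductIDs that match any of the given WordIDs."""
    lists = [postings.get(word_id, []) for word_id in word_ids]
    return union_postings(lists)
```

Because every list is read only once and in order, the cost grows with the number of matching rows rather than with the size of the whole product table.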

The most annoying part for me was that the powers that be in the company removed the sub-second search time shown on the search pages. Google was showing theirs at the time to demonstrate how fast they were, and so could we (on an obviously much lower scale), but that idea was overruled.

By this point the situation in the Jungle had become rather unbearable, and ultimately a good company was driven into the ground. A lot of the good work that I did beyond the search engine perished, and this was actually a very valuable lesson to learn; later it influenced my decisions pretty heavily. But at the time I was just thinking that the worst was behind me as I joined a new dynamic company to work alongside two colleagues in the e-commerce department there. Surely life was good, as I had a chance to re-implement all the good things I did at Jungle.com, but little did I know that I would experience the worst possible daymare (like a nightmare, only during your working day) of my life…

To Be Continued

One Response to “Long road to DSearch, Part 2: out of the Jungle but into the daymare”

  1. elhoim Says:

    Sad but interesting history! Too bad you do not continue writing it…
