Archive for the 'Algorithms' Category

Beware of the sorts you use!

Wednesday, August 1st, 2007

Usually I try to make incremental changes to a big, already-debugged piece of software in order to avoid introducing new bugs that can bite rather painfully later, when you least expect it. One reasonably good way of testing that new changes do not break old functionality is to run the software on the same inputs, save the known-good output (I call it “gold”), and then compare it with the new output. The outputs should be identical if the changes you made were designed to improve things like performance or scalability of the same code without changing the actual results. The two outputs can easily be compared using the fc /b command to confirm they are byte-for-byte identical, or just by visually checking the file sizes (more dangerous). Say, for example, the new code does complex calculations on multiple CPU cores and then merges the results: those results should be exactly the same as if the code were running serially on just one core. Sounds simple, but not always!
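To make the “gold” comparison concrete, here is a minimal sketch of such a byte-for-byte check in Python (the post itself uses fc /b on Windows; the file names and function below are my own illustration, not from the post):

```python
import filecmp

def matches_gold(new_output_path: str, gold_output_path: str) -> bool:
    """Byte-for-byte comparison of a fresh output file against the saved
    known-good ('gold') file -- the same check as `fc /b` on Windows."""
    # shallow=False forces a real content comparison instead of
    # trusting matching size/mtime metadata.
    return filecmp.cmp(new_output_path, gold_output_path, shallow=False)

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    if matches_gold("new_run.out", "gold.out"):
        print("PASS: new output is byte-identical to the gold output")
    else:
        print("FAIL: outputs differ -- the change altered the results")
```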


First post and a taste of things to come

Monday, January 1st, 2007

For some time I have been thinking of starting a blog as a means of recording some interesting finds, as well as venting some of the frustrations experienced in the process of building DSearch.

The current problem I am working on is automatically determining the best recrawl rate for pages that generate dynamic content that is technically different every time it is requested, whether due to personalisation, hidden internals like client-side web analytics, or because the pages are simply designed to look updated so that search engines recrawl them more often than they really need to. The solution requires an algorithm that is resistant to small changes on a page yet can determine whether a substantial part of the page has changed; it should also be very fast, since we can’t spend much time analysing each page, and it should take very little space… if that’s your cup of tea then stay tuned for updates!
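The post does not reveal the eventual solution, but one well-known technique that fits all three constraints (tolerance of small edits, speed, and a tiny storage footprint) is a similarity fingerprint such as SimHash. The sketch below is purely illustrative and not necessarily what DSearch uses; the 64-bit fingerprint size and word-shingle parameters are my own assumptions:

```python
import hashlib

def _hash64(token: str) -> int:
    # Stable 64-bit hash of a token (MD5 truncated to 8 bytes).
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")

def simhash(text: str, shingle_size: int = 4) -> int:
    """64-bit SimHash fingerprint computed over word shingles.
    Small edits to the text flip only a few bits of the fingerprint."""
    words = text.split()
    shingles = [" ".join(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))]
    counts = [0] * 64
    for shingle in shingles:
        h = _hash64(shingle)
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming(a: int, b: int) -> int:
    # Number of differing bits between two fingerprints.
    return bin(a ^ b).count("1")

if __name__ == "__main__":
    page_v1 = "breaking news story about the weather in the city today " * 10
    page_v2 = page_v1 + "hello dear visitor, a small personalised greeting"
    page_v3 = "a completely different article on an unrelated topic entirely " * 10
    print(hamming(simhash(page_v1), simhash(page_v2)))  # small distance
    print(hamming(simhash(page_v1), simhash(page_v3)))  # large distance
```

The appeal for this kind of problem: the stored state per page is a single 64-bit integer, the comparison is one XOR plus a popcount, and a small Hamming-distance threshold cheaply separates cosmetic churn (personalised greetings, analytics tokens) from a genuine rewrite of the page.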