Beware of the sorts you use!

Usually I try to make incremental changes to a debugged big piece of software in order to avoid introducing new bugs that can bite rather painfully later when you least expect it. One reasonably good way of testing that new changes do not break old stuff is to use same inputs and save knowingly good (I call it “gold”) output from software and then compare it with the new output. It should be identical if the changes that you made were designed to improve things like performance or scalability of the same code but not change actual output. This output can be easily compared using fc /b command to know that outputs are exact or just visually check size (more dangerous). Say for example new code might do complex calculations on multiple CPU cores and then merge results, but such results should be exactly the same as if it was running serially on just one core. Sounds simple, but not always!

I’ve just spent some hours tracing bug that I thought I introduced after yet another rewrite only to find logical “legit” explanation for the behavior – Array.Sort in .NET is based on QuickSort, which is said to be “unstable”. This mumbo-jumbo scientific term means in human language that if you happen to sort keys with values, then it is possible that those keys that are identical (but with different values) may change order in which they do (actually their values would swap, not keys since they are identical in value) – global order will be perfectly sorted, but in terms of two identical keys they might swap places. This won’t happen all the time as it depends on QuickSort internal splits and actual data you have – the less identical keys you have the less likely you will be hit by this not that well explained up-front “feature”: Array.Sort in .NET and other languages should really carry a “health” warning mentioning this subtle but real issue.

This problem can occur even when you use same well tested input data (which makes it looks as if you just added new bug that you can’t find) – it is sufficient to change buffer sizes that you sort for such effect of swapped keys to hit you. The end result might well be valid – it will just look a bit different from what you expected, and very different if you calculate hashes or compress output data.

Leave a Reply