How do you test a complex system that is trying to mimic being smart? You want to automate testing, so that your quality meter is available for every little change you make. Unit tests help catch classical programming regressions, but the major part of the challenge is keeping the 'smart' part under control. Unfortunately, the only way to tell whether the system is doing a good job is to have a human check the results. The trick is that if you could automate that judgment in general, you would already have solved the hardest problem.

So what you basically do is generate a set of evaluation data, manually, and build something like a unit test suite, except that instead of pass/fail results you get statistics. Now you would think the problem is solved, but that's far from the truth.
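To make this concrete, here is a minimal sketch of what such a harness could look like. Everything in it is a hypothetical illustration (the `evaluate` function, the set-overlap scoring, the toy cases), not the actual system described in this post; the point is only that each hand-judged case yields a score, and the run collapses into a few statistics rather than pass/fail.

```python
# Minimal sketch of an evaluation harness: like unit tests, but each
# hand-labeled case yields a score, and the run is summarized as
# statistics. All names here are hypothetical illustrations.
from statistics import mean

def evaluate(system, cases):
    """Run every manually judged case through the system and collect scores."""
    scores = []
    for query, expected in cases:
        predicted = system(query)
        # Score by overlap with the human-judged expected set (0.0 - 1.0).
        overlap = len(set(predicted) & set(expected))
        scores.append(overlap / max(len(expected), 1))
    # Instead of pass/fail, you get a handful of summary numbers.
    return {"mean": mean(scores), "min": min(scores), "n": len(scores)}

# Toy system plus manually judged evaluation data.
cases = [("q1", ["a", "b"]), ("q2", ["c"])]
system = lambda q: {"q1": ["a", "b"], "q2": ["x"]}[q]
print(evaluate(system, cases))  # e.g. {'mean': 0.5, 'min': 0.0, 'n': 2}
```

Note how the summary already hides information: a mean of 0.5 could be two mediocre answers or one perfect and one useless one, which is exactly the "just a few final numbers" problem described below.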
First, the dataset changes: when new content enters the system, you get completely new related stories, and a human has to go back and judge them all over again. Second, the evaluation data expands: as you add new tests, you generally can't run them through previous versions of your algorithms, since that would be prohibitively expensive. Third, the statistics hardly give you an overview of what exactly your changes caused, just a few final numbers. And then there is the problem of pipelined processing. Even if you improve the first stage, the end results might get worse, because the second stage was already adapted to the old first stage. So you actually need to evaluate each part of the system in isolation and then together.
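The pipeline point can be sketched as code. This is again a hypothetical two-stage example (the names, the `jaccard` scoring, and the toy stages are mine, not the system from this post): each stage is scored in isolation against human-judged intermediate output, and then the whole pipeline is scored end to end. It deliberately constructs the paradox above, where stage two is perfect in isolation yet the combined result fails.

```python
# Sketch: score each pipeline stage in isolation against human-judged
# intermediate results, then score the full pipeline end to end.
# All functions and data here are hypothetical illustrations.

def jaccard(a, b):
    """Simple set-overlap score between predicted and judged items."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def evaluate_pipeline(stage_one, stage_two, cases):
    """Each case: (input, judged stage-one output, judged final output)."""
    s1, s2, e2e = [], [], []
    for inp, gold_mid, gold_out in cases:
        mid = stage_one(inp)
        s1.append(jaccard(mid, gold_mid))
        # Stage two in isolation: feed it the *judged* intermediate,
        # so its score is not polluted by stage one's mistakes.
        s2.append(jaccard(stage_two(gold_mid), gold_out))
        # End to end: feed stage two whatever stage one actually produced.
        e2e.append(jaccard(stage_two(mid), gold_out))
    n = len(cases)
    return {"stage_one": sum(s1) / n,
            "stage_two": sum(s2) / n,
            "end_to_end": sum(e2e) / n}

# Toy stages: stage one misses a keyword that stage two depends on.
cases = [("doc", ["k1", "k2"], ["s1"])]
stage_one = lambda inp: ["k1"]
stage_two = lambda mid: ["s1"] if "k2" in mid else []
print(evaluate_pipeline(stage_one, stage_two, cases))
```

Here stage two scores 1.0 on judged input but the end-to-end score is 0.0, because it was (over)fitted to what the old stage one happened to emit. Only by measuring all three numbers can you tell which stage to blame.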
In the end you find that you spend a disproportionate amount of time evaluating even the smallest changes. So you are tempted to just skip the evaluation, which naturally you shouldn't.
Ok, so much for today. I think the evaluation run has just finished, and I should go check the results. Again.