At Zemanta we are constantly experimenting with new ideas for improving our service. Most of our experiments yield no gain, but one often learns more from failure than from success.
One of the fruitless but illuminating experiments we ran recently was term pruning. We had observed experimentally that 52% of terms occur in only one document, and that excluding those terms has no influence on the precision of our recommendations. Our recommendation engine is computationally very demanding, and making it more efficient is a never-ending process. A chance to prune 52% of terms seemed quite promising for increasing the performance of our engine and reducing index size.
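The kind of measurement described above can be sketched in a few lines. This is a minimal illustration on a toy corpus, not our production code and not the Lucene API: the function name `singleton_term_share` and the sample documents are hypothetical, and documents are assumed to be already tokenized.

```python
from collections import Counter


def singleton_term_share(docs):
    """Return (share of distinct terms that occur in exactly one document,
    document-frequency counter) for a list of tokenized documents."""
    df = Counter()
    for tokens in docs:
        # Use a set so each term is counted once per document.
        df.update(set(tokens))
    if not df:
        return 0.0, df
    singletons = sum(1 for count in df.values() if count == 1)
    return singletons / len(df), df


# Hypothetical toy corpus, for illustration only.
docs = [
    ["lucene", "index", "term"],
    ["lucene", "solr", "payload"],
    ["index", "posting", "lucene"],
]
share, df = singleton_term_share(docs)
print(f"{share:.0%} of distinct terms occur in only one document")
```

On a real index the document frequencies would come from the index itself rather than from re-tokenizing the corpus.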
Our recommendation engine is based on Apache Lucene/Solr. At the recent Lucene EuroCon conference, Andrzej Bialecki presented a Lucene patch that provides an easy tool for index manipulation. Using this tool, we removed all terms occurring in only one document, together with all postings and payloads belonging to those terms. It turned out that the efficiency of our engine did not change, and the index size decreased only slightly (by 1.5%).
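To make the pruning step concrete, here is a minimal sketch on a toy in-memory inverted index. It is not the Lucene patch mentioned above and does not use any Lucene API; the dictionary layout (term to a mapping of document id to positions) and the name `prune_singleton_terms` are assumptions made purely for illustration.

```python
def prune_singleton_terms(index):
    """Return a copy of the index without terms found in only one document.

    `index` maps term -> {doc_id: positions}; dropping a term drops its
    postings (and whatever per-posting data is stored with them).
    """
    return {term: postings for term, postings in index.items() if len(postings) > 1}


def count_postings(index):
    """Total number of (term, document) postings in the index."""
    return sum(len(postings) for postings in index.values())


# Hypothetical toy index, for illustration only.
index = {
    "lucene": {0: [0], 1: [0], 2: [2]},
    "index": {0: [1], 2: [0]},
    "term": {0: [2]},
    "solr": {1: [1]},
}
pruned = prune_singleton_terms(index)
print(f"terms: {len(index)} -> {len(pruned)}")
print(f"postings: {count_postings(index)} -> {count_postings(pruned)}")
```

Note that each pruned term carries exactly one posting by definition, so the share of postings removed is typically much smaller than the share of terms removed.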
In our opinion, this experiment shows that Lucene is very efficient at storing terms and their associated data (postings and payloads), and that the presence of rarely used terms in the index is not a concern.
[This post was originally published at The Unreasonable Effectiveness of Data blog]