Term pruning

At Zemanta we are constantly experimenting with new ideas for improving our service. Most of our experiments yield no gain, but one often learns more from failure than from success.

One fruitless but illuminating experiment we ran recently was term pruning. We had observed that 52% of terms occur in only one document, and that excluding these terms had no influence on the precision of our recommendations. Our recommendation engine is computationally very demanding, and making it more efficient is a never-ending process. A chance to prune 52% of terms seemed quite promising for increasing the performance of our engine and reducing index size.
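To make the measurement concrete, here is a minimal sketch of how the share of terms occurring in only one document could be computed. The toy corpus, the whitespace tokenization, and the function names `document_frequencies` and `singleton_term_share` are all assumptions made for illustration; this is not the code behind the 52% figure.

```python
from collections import Counter

def document_frequencies(docs):
    """Map each term to the number of documents it appears in."""
    df = Counter()
    for text in docs:
        # set() so a term repeated within one document is counted once
        df.update(set(text.lower().split()))
    return df

def singleton_term_share(docs):
    """Fraction of distinct terms that occur in exactly one document."""
    df = document_frequencies(docs)
    if not df:
        return 0.0
    singletons = sum(1 for count in df.values() if count == 1)
    return singletons / len(df)

# Hypothetical sample documents, for illustration only
docs = [
    "lucene stores terms and postings",
    "solr is built on lucene",
    "rare terms have short postings",
]
print(f"{singleton_term_share(docs):.0%} of terms occur in only one document")
```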

Our recommendation engine is based on Apache Lucene/Solr. At a recent Lucene EuroCon conference, Andrzej Bialecki presented a Lucene patch that provides an easy-to-use tool for index manipulation. Using this tool, we removed all terms occurring in only one document, along with all postings and payloads belonging to those terms. It turned out that the efficiency of our engine did not change, and the index size decreased only slightly (by 1.5%).
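The pruning step itself can be sketched on a toy inverted index. This is plain Python, not the Lucene patch mentioned above; the dictionary layout (term mapped to a list of document ids) and the helper names `prune_singleton_terms` and `index_stats` are hypothetical, chosen only to illustrate the idea.

```python
def prune_singleton_terms(index):
    """Return a copy of the index without terms found in only one document."""
    return {term: postings for term, postings in index.items() if len(postings) > 1}

def index_stats(index):
    """Return (number of terms, total number of postings)."""
    return len(index), sum(len(postings) for postings in index.values())

# Hypothetical toy index: term -> list of ids of documents containing it
index = {
    "lucene": [1, 2, 3, 4],
    "index": [1, 3, 4],
    "term": [2, 3],
    "zemanta": [1],
    "eurocon": [4],
    "payload": [2],
}

pruned = prune_singleton_terms(index)
terms_before, postings_before = index_stats(index)
terms_after, postings_after = index_stats(pruned)
print(f"terms: {terms_before} -> {terms_after}")
print(f"postings: {postings_before} -> {postings_after}")
```

In this toy index, half of the terms are removed but only a quarter of the postings, because each pruned term carries exactly one posting. The numbers say nothing about a real Lucene index; they only show that the share of terms pruned and the share of data pruned can differ.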

In our opinion, this experiment shows that Lucene is very efficient at storing terms and their associated data (postings and payloads), and that the presence of rarely used terms in the index is not a concern.

[This post was originally published at The Unreasonable Effectiveness of Data blog]

  • http://twitter.com/dusano Dušan Omerčević

    If you are developing information retrieval techniques, then the one thing you might learn from our post is that term pruning does not increase efficiency.


  • http://www.discoverrei.com bruce

    My question now is how can I effectively apply this knowledge on my Data blog?