Fruit Blog

Apache Lucene meetup, report

Posted by andraz, under Uncategorized on November 5th, 2009

We use Lucene/Solr a lot at Zemanta, so it is very interesting to know what’s happening there, and what better place for that than a Lucene meetup?

First up was Erik Hatcher (co-author of the book Lucene in Action [internal note to Zemanta guys: it's on the bookshelf]), talking about building search interfaces with Solr Flare and how he built search for public and university libraries. A quote: “You don’t have a faceting problem, libraries have a faceting problem!” Bottom line: Lucene enabled him to build great search solutions with free software, replacing expensive systems like Endeca, for which one of the libraries had paid $250k.

Uwe Schindler (yeah, home page is in German!) presented a new and vastly more efficient way to handle date and numeric range searches that came with Lucene 2.9. In a nutshell, tries mean that prefixes (of different lengths) of each numeric value are indexed as extra terms, so a range query can be answered by combining a small number of prefix terms, using the standard posting lists underneath, instead of enumerating every distinct value in the range. You spend a bit more disk/memory space and get blazingly fast date and numeric range queries. The implementation details are a bit awkward since Lucene still only supports UTF8 for terms, so he had to devise a way to encode binary numbers in UTF8 efficiently (solution: only use 7-bit characters). Fun.

The restriction that terms must be UTF8 strings is going away in the next version of Lucene (3.1?). This will be done by Flexible Indexing, where terms are just binary. Flexible Indexing will also let the underlying code change the index bit format, for example to use more efficient delta encoding of posting lists to save space. Anyone interested in this stuff should definitely watch the presentation Challenges in Building Large-Scale Information Retrieval Systems by Google’s Jeffrey Dean. He goes through 10 years of Google’s architectural evolution and how the changing CPU/memory/disk balance affected architectural decisions. It’s a must-see.
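For anyone who hasn't met delta encoding of posting lists, here is a minimal Python sketch of the general technique: store the gaps between sorted document ids and write each gap as a variable-length integer. This is only an illustration under my own assumptions (the function names are made up, and it says nothing about which encodings Flexible Indexing will actually ship).

```python
def encode_postings(doc_ids):
    """Delta-encode sorted, non-negative doc ids into variable-length bytes."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev  # small when ids are dense
        prev = doc_id
        # Emit 7 bits per byte, low bits first; the high bit means "more follows".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Invert encode_postings: rebuild the doc ids from the gap bytes."""
    doc_ids = []
    current = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation byte: keep accumulating this gap
        else:
            current += gap
            doc_ids.append(current)
            gap = 0
            shift = 0
    return doc_ids
```

For example, the five ids [3, 7, 1000, 1001, 200000] encode to 8 bytes, versus 20 bytes as fixed 32-bit integers. The denser the posting list, the smaller the gaps and the bigger the saving.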

Back to Lucene. I got to know Katta, which is properly done distributed sharding and serving for Lucene, with shared TF-IDF calculation (thus avoiding the pitfalls of biased shards). Katta was presented by its founder Stefan Groschupf. It is basically a proper Lucene-in-the-cloud solution. It uses ZooKeeper for cluster bookkeeping and automatically distributes both indexes and queries. It does not, however, do indexing; you have to do that yourself (for example with Hadoop if you have lots of data). It makes two round trips when answering a request in order to pull in the TF-IDF data and keep exactly the same accuracy even with a distributed index.
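The two-round-trip idea can be sketched in a few lines of Python. This is my own toy model, not Katta's code or API: the Shard class, the method names, and the smoothed IDF formula are all assumptions made for illustration. The first trip collects document counts and document frequencies from every shard, the second scores documents on each shard using the merged, global IDF.

```python
import math
from collections import Counter


class Shard:
    """A toy shard holding {doc_id: list of tokens} (hypothetical structure)."""

    def __init__(self, docs):
        self.docs = docs

    def stats(self, terms):
        """Round trip 1: local doc count and per-term document frequency."""
        df = Counter()
        for tokens in self.docs.values():
            present = set(tokens)
            for term in terms:
                if term in present:
                    df[term] += 1
        return len(self.docs), df

    def search(self, terms, idf):
        """Round trip 2: score local docs with the globally agreed IDF."""
        scores = {}
        for doc_id, tokens in self.docs.items():
            tf = Counter(tokens)
            score = sum(tf[term] * idf[term] for term in terms)
            if score > 0:
                scores[doc_id] = score
        return scores


def distributed_search(shards, terms):
    """Merge statistics from all shards, then score with the global IDF."""
    total_docs = 0
    global_df = Counter()
    for shard in shards:
        count, df = shard.stats(terms)
        total_docs += count
        global_df.update(df)  # Counter.update adds counts
    # Smoothed IDF; the exact formula is an assumption for this sketch.
    idf = {t: math.log((1 + total_docs) / (1 + global_df[t])) + 1 for t in terms}
    merged = {}
    for shard in shards:
        merged.update(shard.search(terms, idf))
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))
```

Because every shard scores with the same global statistics, the ranking is the same as if all documents lived in one index, no matter how unevenly the terms are spread across shards. Scoring each shard with only its local IDF would skip the extra trip but let a biased shard distort the ranking.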

The plan for Katta is to make it general and enable sharding of any data (MySQL, for example), and also to do indexing in the future. My personal hope is that it somehow starts to work closely with Solr.

Before this post becomes a book, here are some other topics discussed. Andrzej Bialecki presented ideas for static pruning of terms from indexes (not very useful for Zemanta since it doesn’t work well for multi-term searches). Marvin Humphrey gave a talk about Lucy, the loose C port of Lucene, which is easy to bind to languages like Perl (Marvin has a great story about how music got him into programming, how he never knew before that he had that talent, and how he sees it as very similar to musical talent). LinkedIn’s Jake Mannix presented real-time search with Lucene. Basically, they add documents and have them available for search instantly. They do a lot of juggling with caches and keep swapping the active writers/readers for new ones.

A few final notes:

  • For the lightning talks I did a quick impromptu presentation about how we use Solr/Lucene at Zemanta to do document-to-document search. I asked people doing similarly unusual stuff to contact me at the end. I was a bit disappointed that we seem to be alone in this. Only the LinkedIn guys said they have similar, albeit subtly different, challenges: how to do document-to-document search between different types of ‘documents’, such as between companies and people, people and groups, etc. The similarity is that they too are using lots (tens) of terms in complex queries.
  • The meetup was organized by Grant Ingersoll, the founder of Lucid Imagination, which seems to be increasingly the go-to company for the interesting things happening in open source search and big data. Grant is also one of the fathers of Apache Mahout. I wish them luck!
  • Thank you, Nathan Kurz, for your fantastic ice cream at the end! Everybody should try the delicious Scream Sorbet. I’ve never eaten sweet potato ice cream before and probably never would have, were it not for this great Lucene programmer turned ice-cream entrepreneur. He also has an amazing story of how, faced with the possibility of not being able to code anymore, he found his other talents. Oh, and he still hopes to return to coding after he’s done building the greatest ice-cream company ever.

After the final notes: Do you think these posts are too long? Should they be split into separate ones? Should they be shorter in general? Tell me on Twitter.

  • http://lucene.grantingersoll.com/ Grant Ingersoll

    Andraz, if you're still here, I'd like to talk more about the doc-doc stuff, as I have done some of that in my past. Try to track me down

    -Grant

  • andraz

    Unfortunately today I can't come to Hadoop meetup. Anyone willing to record it, please? :)

    However I am now living in San Francisco. Where are you/lucid based?
