Downloading the internet with a single machine


The core of Zemanta is a recommendation engine. With a single call to the API it can recommend everything from related articles and relevant images to interesting links for your text.

Okay, it only does those three things …
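
For the curious, here’s roughly the shape of such a call. The endpoint and parameter names below are my best recollection of the old API docs – treat them as assumptions, not a reference:

    # Illustrative only: endpoint and parameter names are assumptions.
    import requests

    response = requests.post("http://api.zemanta.com/services/rest/0.0/", data={
        "method": "zemanta.suggest",   # one call covers all the recommenders
        "api_key": "your-api-key",
        "format": "json",
        "text": "The text of your blog post goes here ...",
    })
    suggestions = response.json()      # related articles, images, in-text links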

Still, doing that requires data. Lots of data. Regularly following over 600,000 RSS feeds has already amassed about five million evergreen articles, and fills a full slave hard drive with content every three months or so.

Most of this happens on a single awesome machine.

Looking from the stratosphere, the architecture appears simple:

  • an aggregator downloads the internet
  • a frontend serves requests

Downloading the internet

Grumpy old bear

Aggregators and crawlers wrestle the grumpy old grizzly bear that is the internet. They’re the riot shield making sure all other services in your over-engineered application are talking to a predictable resource.

Zemanta’s aggregator works in two stages.

A resolver takes any URL imaginable and translates it into a proper feed using a bunch of heuristics and even the good old trick of looking into the HTML. Source suggestions mostly fly in from user preferences, but there’s also a service stumbling around the internet in a drunken stupor and finding interesting things to look at.
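
That look-into-the-HTML trick deserves a quick sketch. Pages usually advertise their feeds in link rel="alternate" tags, so a toy resolver – the real one is Zemanta-internal, requests and BeautifulSoup are my stand-ins – might look like this:

    # Minimal feed autodiscovery sketch: scan <link rel="alternate"> tags.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def resolve_feed(url):
        """Return the first feed URL a page advertises, or None."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("link", rel="alternate"):
            if link.get("type") in FEED_TYPES and link.get("href"):
                return urljoin(url, link["href"])
        return None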

The second stage is a bit trickier. This one needs to make sure websites don’t crash.

Not only does it have to say “Whoa there website, you’re a bit slow right now. I’ll come back later”, it also needs to overcome the uncanny valley of web scraping: decide what would make a good thumbnail, figure out whether a website deserves to be put in something called a “main bucket”, wonder whether maybe this page is spam …
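
The “come back later” part boils down to per-domain politeness. A toy version, assuming a cooldown that doubles whenever a site misbehaves – every name here is mine, not Zemanta’s:

    # Per-domain backoff sketch: slow or failing sites get visited less often.
    import time
    from urllib.parse import urlparse

    next_visit = {}   # domain -> earliest time we may fetch again
    cooldown = {}     # domain -> current backoff in seconds

    def polite_fetch(url, fetch, base=30, cap=3600):
        domain = urlparse(url).netloc
        if time.time() < next_visit.get(domain, 0):
            return None   # too soon; requeue and come back later
        try:
            response = fetch(url)
            cooldown[domain] = base   # site is healthy, reset the backoff
            return response
        except IOError:
            # slow or broken site: wait twice as long before the next try
            cooldown[domain] = min(cooldown.get(domain, base) * 2, cap)
            return None
        finally:
            next_visit[domain] = time.time() + cooldown.get(domain, base)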

And then you’re still left with indexing all this data so you can look it up later. Quickly.

Quite a task. No wonder MySQL was the best key-value store to do this.

Wait what?

Turns out MySQL is not only fast, it’s also pretty good at being a huge key-value store. As long as you don’t need to run an ALTER TABLE – that takes about a week.
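
Used that way the schema is almost insultingly simple: one indexed key, one blob of json. A sketch with pymysql – the table and column names are invented:

    # MySQL as a key-value store: get/put on a two-column InnoDB table.
    import json
    import pymysql

    conn = pymysql.connect(host="localhost", user="aggregator",
                           password="secret", db="zemanta")

    def put(key, doc):
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO articles (k, v) VALUES (%s, %s) "
                "ON DUPLICATE KEY UPDATE v = VALUES(v)",
                (key, json.dumps(doc)))
        conn.commit()

    def get(key):
        with conn.cursor() as cur:
            cur.execute("SELECT v FROM articles WHERE k = %s", (key,))
            row = cur.fetchone()
        return json.loads(row[0]) if row else None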

Don’t worry, all the modern bells and whistles are being used as well – with the flip-side of Cassandra’s eventual consistency being eventual inconsistency. But more about that some other time.

Indexing

MySQL and Cassandra are great at storing json blobs and such, but Zemanta needs full-text search. Now that there is a problem.

Luckily it’s a problem a lot of people have. This means there’s an off-the-shelf solution in the form of Solr, a software beast written in Java, running on top of Lucene, giving everyone the ability to perform quick full-text searching over thousands, even millions, of documents.

I have no idea how it works.

But I understand enough to know why complex queries could be a problem. A problem so big some queries take whole seconds to answer. Unacceptable for everyone but the most patient clients.

Realizing most of your lookups will be based on phrases picked out by entity extraction solves this problem. Extract your entities in the crawler stage, ask Solr to do some searching and cache the responses.
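
In code the idea fits on a napkin. A sketch with pysolr and an in-process dict standing in for the real cache – extract_entities is whatever the crawler stage uses:

    # Lookups keyed on extracted phrases: cheap Solr queries, highly cacheable.
    import pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/articles", timeout=5)
    cache = {}   # stand-in for memcached or similar

    def related_articles(text, extract_entities):
        results = {}
        for phrase in extract_entities(text):
            if phrase not in cache:
                # the same phrases come up again and again, so the cache
                # absorbs most of the expensive full-text work
                cache[phrase] = list(solr.search('"%s"' % phrase, rows=10))
            results[phrase] = cache[phrase]
        return results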

Marvelous.

The frontend

Killer suggestions, by maureen lunn

At this point we are sitting on a pile of constantly updating data, waiting for somebody, anybody, to issue an API request so we can delight them with our vast piles of knowledge.

But not so fast!

A single machine might be enough to handle crawling, but it certainly won’t handle everyone talking at the same time. Sprinkling in a bit of simple rsync-based Solr replication, some sysadmin-fu, and a bit of Amazon’s EC2 makes the problem as good as solved.

Next we are left with running Django behind Nginx to handle incoming requests, unpack the data, and make sure the response is something the client can actually read.
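
The Django side is a thin shell. A bare-bones sketch, assuming a POST with a json body – the view and field names are invented:

    # Unpack the request, hand off to the recommenders, respond with json.
    import json
    from django.http import HttpResponse

    def run_recommenders(text):
        # stand-in for the Octo fan-out described below
        return {"articles": [], "images": [], "links": []}

    def suggest(request):
        payload = json.loads(request.body)
        suggestions = run_recommenders(payload.get("text", ""))
        return HttpResponse(json.dumps(suggestions),
                            content_type="application/json")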

Octo takes over from here – he’s the sort of process that can take a request, split it into many tiny requests, and make sure all subprocesses complete their darn task in a timely and orderly fashion.
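
Octo itself is Zemanta-internal, but the fan-out pattern is easy to mimic. A sketch with a thread pool and a hard deadline:

    # Split one request into per-recommender subrequests and insist they
    # all answer within the deadline; a slow one doesn't sink the rest.
    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    def fan_out(text, recommenders, deadline=2.0):
        results = {}
        with ThreadPoolExecutor(max_workers=len(recommenders)) as pool:
            futures = {name: pool.submit(fn, text)
                       for name, fn in recommenders.items()}
            for name, future in futures.items():
                try:
                    results[name] = future.result(timeout=deadline)
                except TimeoutError:
                    results[name] = []   # the slowpoke gets dropped
        return results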

This is where the actual recommendations happen. First a rule-based named entity recognition (NER) algorithm takes a shot at the input text. Its only problem is it can’t really decide whether Albert Einstein is one name or two, which human users find kind of odd.

To make the suggestions more human-like Zemanta created something called WTD, a statistics-based entity extractor that uses a huge accumulation of knowledge, the Aho-Corasick string matching algorithm, and a sprinkling of things I have no hope of understanding.
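
WTD’s internals are beyond me too, but the Aho-Corasick piece is approachable: build an automaton out of every phrase you know about, then find all of them in the text in a single pass. With the pyahocorasick library:

    # Match a dictionary of known phrases against text in one linear scan.
    import ahocorasick

    automaton = ahocorasick.Automaton()
    for phrase in ["Albert Einstein", "Aho-Corasick string matching algorithm"]:
        automaton.add_word(phrase, phrase)
    automaton.make_automaton()

    text = "The Aho-Corasick string matching algorithm scans the text once."
    for end_index, phrase in automaton.iter(text):
        start = end_index - len(phrase) + 1
        print("found %r at offset %d" % (phrase, start))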

All I know is that it works.

Wrapping up

After that it’s up to Octo to collect suggestions from all the different recommenders – entity extraction, related articles, images … – and pack them up into something Django can turn into json and send off to the user.

Hopefully the user likes what they get!

And I just noticed WTD recognised “Aho-Corasick string matching algorithm” as an entity. Colour me impressed, I was certain it wouldn’t.

Impressively picked out entity

You should follow me on twitter here.
