What's the best thumbnail for this page?

A feature exists that we all use tens of times a day. A feature almost every social startup implements eventually. A feature so neglected not a single library is devoted to it.

Thumbnail extraction.

Whenever you share something on Facebook or G+, a thumbnail is extracted from the page. Sometimes good, sometimes bad, sometimes laughably atrocious.

There's no denying it - pictures make things sexier. People engage with sexy things. If your <insert site> can drive engagement and clickthroughs ... congratz! You have a successful product on your hands.

So why has nobody solved the problem of: Here is a link, which image best represents it?

Perhaps it's just one of those problems that seem almost too trivial to solve. Sure, just throw a hacker at it for two nights and you'll come up with something that works. It won't be perfect, but as it turns out, users aren't very demanding. The simple fix is just to give them a choice of three images and they'll solve the problem for you.

This attitude ganks the Don't-Repeat-Yourself principle in a dark alleyway and beats it with a stick. Repeatedly.

Lucky for you, two solutions roam around GitHub - Goose and Reddit.

Goose

Goose is a web scraping library written originally in Java and being ported to Scala as we speak. Part of its offering is an image extractor.

The approach is pretty simple:

  1. Does anything match a regex of known elements - hand tuned for a few sites
  2. If not, are there any large images? - download all images, check their dimensions, make sure to filter out any known duds (banner-y images, gifs, etc.)
  3. If not, check some meta tags
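
The "if not, try the next thing" structure above boils down to a fallback chain. Here's a minimal Python sketch of that idea - not Goose's actual code, which is Java/Scala. The name first_match and the stand-in strategies in the usage line are made up for illustration.

```python
from typing import Callable, Iterable, Optional

# A strategy takes the page HTML and returns an image URL, or None if it found nothing.
Strategy = Callable[[str], Optional[str]]


def first_match(html: str, strategies: Iterable[Strategy]) -> Optional[str]:
    """Try each strategy in order and stop at the first one that finds an image."""
    for strategy in strategies:
        url = strategy(html)
        if url is not None:
            return url
    return None


# Hypothetical stand-ins for the three steps: known elements, large images, meta tags.
strategies = [lambda html: None, lambda html: "large.jpg", lambda html: "meta.jpg"]
winner = first_match("<html></html>", strategies)
```

Because the chain stops at the first hit, the meta tag check never runs once the large-image step finds something.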

Since the three steps are tried in order and each one runs only if the previous one found nothing, the second step is the likeliest to find an image. It's also a somewhat slow one, since it downloads every image, but it works well enough. Just implement this and you're probably golden ...

... if you aren't processing tens of thousands of pages a day.

Reddit

Reddit's approach is even simpler - a 1688-sloc Python script.

  1. Traverse DOM looking for images
  2. Download image until you can read its metadata
  3. Abort download
  4. If area is smaller than 5000 pixels, ignore
  5. If image is too wide or too narrow, ignore
  6. If name contains "sprite", penalize
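
Steps 2 and 3 work because image formats put their dimensions near the start of the file. As a rough illustration, here's a stdlib-only Python sketch that reads the size of a PNG from its first 24 bytes and then stops pulling data. It only handles PNG; a real implementation would lean on an imaging library to cover JPEG, GIF and friends. The function names are mine, not Reddit's.

```python
import struct
from typing import Iterable, Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
PNG_HEADER_LENGTH = 24  # signature (8) + chunk length (4) + "IHDR" (4) + width (4) + height (4)


def png_size(header: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) from the start of a PNG file, or None if it isn't one."""
    if len(header) < PNG_HEADER_LENGTH or not header.startswith(PNG_SIGNATURE):
        return None
    if header[12:16] != b"IHDR":
        return None
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def size_from_stream(chunks: Iterable[bytes]) -> Optional[Tuple[int, int]]:
    """Consume chunks only until the header is complete, then stop reading."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        if len(buffer) >= PNG_HEADER_LENGTH:
            # Returning here leaves the rest of the stream unread,
            # which is the "abort download" step.
            return png_size(buffer)
    return None
```

Feed it whatever iterator of byte chunks your HTTP client hands you and close the connection as soon as it returns.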

Then all found images are sorted by area, largest first, and the top one becomes the lucky winner of the Being a Thumbnail award. This still touches every image on the page, but it only downloads enough of each to determine its size, whereas Goose actually writes all images to disk.
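
The filtering and ranking can be sketched in a few lines of Python. Only the 5000 threshold comes from the list above. The aspect ratio limit and the size of the sprite penalty are values I assumed for illustration, and the names are hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class Candidate(NamedTuple):
    url: str
    width: int
    height: int


MIN_AREA = 5000         # from the steps above
MAX_ASPECT_RATIO = 1.5  # assumption: what counts as "wide or narrow"
SPRITE_PENALTY = 10     # assumption: how hard to penalize sprites


def score(candidate: Candidate) -> Optional[float]:
    """Return the (possibly penalized) area, or None if the image is rejected."""
    area = candidate.width * candidate.height
    if area < MIN_AREA:
        return None
    longer = max(candidate.width, candidate.height)
    shorter = min(candidate.width, candidate.height)
    if longer / shorter > MAX_ASPECT_RATIO:
        return None
    if "sprite" in candidate.url.lower():
        area /= SPRITE_PENALTY
    return area


def best_thumbnail(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Pick the surviving candidate with the largest score."""
    scored = [(score(c), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]
```

Note that a sprite isn't thrown out, just pushed down the ranking, so it can still win if nothing else survives the filters.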

But the most amazing part of Reddit's algorithm is what happens next.

The best candidate is cropped ... according to colour entropy. This basically ensures the thumbnail isn't just sky or whatever, but actually something useful to look at.

A bit of simple maths on the image's histogram and voila, perfect thumbnails.
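
To make the "simple maths" concrete, here's a stdlib-only Python sketch of the idea: compute the Shannon entropy of a strip's histogram and keep shaving off whichever edge is duller until the image is square. It's a simplification under my own assumptions - the image is a plain list of grayscale rows rather than a real image object, and only tall images are handled (a wide one could be transposed first). It isn't Reddit's actual code.

```python
import math
from collections import Counter
from typing import List

Image = List[List[int]]  # rows of grayscale pixel values, 0-255


def entropy(rows: Image) -> float:
    """Shannon entropy, in bits, of the pixel-value histogram."""
    histogram = Counter(pixel for row in rows for pixel in row)
    total = sum(histogram.values())
    return -sum((n / total) * math.log2(n / total) for n in histogram.values())


def crop_to_square(image: Image, step: int = 1) -> Image:
    """Trim a tall image to a square, dropping the lower-entropy edge each round."""
    rows = list(image)
    width = len(rows[0])
    while len(rows) > width:
        cut = min(step, len(rows) - width)
        top, bottom = rows[:cut], rows[-cut:]
        if entropy(top) < entropy(bottom):
            rows = rows[cut:]   # the top strip is duller, drop it
        else:
            rows = rows[:-cut]  # the bottom strip is duller (or tied), drop it
    return rows
```

A strip of flat sky has entropy 0, so it loses to any strip with actual detail in it.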

Zemanta

Zemanta's approach seems light years more sophisticated, at least according to what @tomaz describes; I didn't get to poke around the code. Rather than just slapping together a bunch of heuristics, they take a more disciplined approach:

  1. get image candidates from DOM - just like Reddit and, to an extent, Goose
  2. get media:thumbnail from the RSS
  3. look at Open Graph metadata (og:image) and other embed hints
  4. very dataset specific, but looking for WordPress's featured image can be of great help
  5. images embedded with Zemanta have special CSS classes - very helpful
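
Of these, the og:image lookup is the easiest to show. Here's a small Python sketch using only the standard library's HTML parser. It illustrates the technique and is not Zemanta's code; the class and function names are made up.

```python
from html.parser import HTMLParser
from typing import List


class OpenGraphImageParser(HTMLParser):
    """Collects the content of every <meta property="og:image"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.images: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("property") == "og:image" and attributes.get("content"):
            self.images.append(attributes["content"])


def og_images(html: str) -> List[str]:
    """Return all og:image URLs in document order."""
    parser = OpenGraphImageParser()
    parser.feed(html)
    return parser.images
```

A page can declare more than one og:image, which is why this returns a list rather than a single URL.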

With the candidates collected, all it takes now is a priority sort according to the extracted metadata, then shoving all the images into RabbitMQ and waiting for the workers to process each useful image.
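
Roughly, that last step could look like the Python sketch below. The priority order is my guess, not something @tomaz confirmed, and an in-process queue.Queue stands in for RabbitMQ so the example runs without a broker.

```python
import queue
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    url: str
    source: str  # which of the steps above produced this candidate


# Assumed ordering, most trusted source first. The real priorities are unknown to me.
SOURCE_PRIORITY = {
    "zemanta-class": 0,
    "og:image": 1,
    "media:thumbnail": 2,
    "featured-image": 3,
    "dom": 4,
}


def enqueue_by_priority(candidates: Iterable[Candidate], work_queue: "queue.Queue") -> List[Candidate]:
    """Sort candidates by source priority and push them onto the queue in that order."""
    ordered = sorted(
        candidates,
        key=lambda c: SOURCE_PRIORITY.get(c.source, len(SOURCE_PRIORITY)),
    )
    for candidate in ordered:
        work_queue.put(candidate)
    return ordered
```

Workers then pull from the queue in order, so the most promising images get processed first and unknown sources go last.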

So what IS the best thumbnail for this page?

To be honest, I don't know. Unlike article extraction, it doesn't seem anyone anywhere has ever put a lot of thought into getting thumbnails out of a website. Let alone performed a precision/recall study of different approaches.

Guess it's just not something users care much about and almost any old image will do as long as it's found on that page and not on too many other pages.

But it's an interesting little challenge. If someone makes it into a service, I bet a lot of people will be very happy - one less problem to think about.
