The uncanny valley of web scraping

Web scraping is one of those “Bah, I can do that in a weekend!” problems. What could possibly be so difficult about going to a website and deciding what the actual content is?

Then you try.

It doesn’t work.

After a bit of digging you realize this is actually a pretty tough problem – but surely there’s a library for this, right? I mean, with so many products needing this, surely someone’s already solved this problem?

They have!

A bunch of libraries exist, ranging from advanced algorithm implementations, like boilerpipe, to a car crash of hand-tuned heuristics, like Readability.

Most of them extract the meat of the article perfectly. Very few actually miss anything. But it feels wrong. It just doesn’t look exactly like what you consider content as a human.

Maybe there’s the “<tags> by <author> at <time>” line left on the top, maybe it’s on the bottom, perhaps there’s a bit of cruft strewn in here and there … Most scrapers handle this with varying degrees of success – sometimes they’re better on blogposts, sometimes on news sites, sometimes on something else completely.

Sometimes the website is so weird a scraper will just give up and decide the first two paragraphs aren’t content. Granted, a lot of blog writers could do with just cutting away the first few paragraphs, but that’s beside the point.

Your content indexer, article previewer or full RSS builder is going nowhere fast. And you’re way past that weekend project phase too!

Easy for humans

This task is so easy for humans that we notice every little detail. While extracting most of the content with a bit of cruft works perfectly well for indexing and data mining, showing that to a user will only end in tears.

It’s a bit like drawing hands or faces – unless you get it within 5% of perfection it just looks wrong. You’re almost better off drawing it at 80% of perfection and calling it a cartoon.

The uncanny valley of article extraction!

The closer you are to perfection, the fewer subconscious cues users get to pick out the content themselves, and the more jarring the difference between what they expect and what they get.

Now what?

USE ALL METADATA

Instead of relying just on what scraping algorithms say, you should help them out with as much knowledge of the website as you can get.

1. If there is a full RSS feed, why are you even scraping? The content in there is usually clean. (The story of translating a URL to an RSS URL will come another day.)

2. Without a full RSS feed you can still learn a lot about the start of the article by looking at the excerpt published in the RSS feed. Clean up the HTML, take the published excerpt, then go on a search through the DOM to look for an approximate match – voila, the beginning of your article! (A rough sketch of this idea follows the list.)

3. Sometimes you can achieve a lot by relying on good old regular expressions and hand-tuned heuristics. A lot of those erroneous first and last lines look very similar. Just write a regex to detect a few variations of those and clean them out (see the second sketch below).

4. Another reasonable approach is guessing which articles come from the same website (hint: not just the same domain). These have almost the same cruft around every article. You can run a clustering algorithm on them and figure out which bits your scrapers usually leave in or miss – then just fix those (the third sketch below shows the idea).
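
Here is a rough sketch of idea 2 in Python, assuming lxml is installed and using difflib for the approximate match. The function name, the set of block-level tags and the 0.8 similarity threshold are illustrative guesses to tune on your own corpus, not anything a library prescribes.

```python
import difflib
import lxml.html


def find_article_start(page_html, rss_excerpt, threshold=0.8):
    """Return the DOM element whose text best matches the RSS excerpt."""
    tree = lxml.html.fromstring(page_html)
    excerpt = " ".join(rss_excerpt.split()).lower()

    best_score, best_node = 0.0, None
    # Compare the excerpt against the start of every block-level chunk of text.
    for node in tree.iter("p", "div", "td", "li"):
        text = " ".join(node.text_content().split()).lower()
        if not text:
            continue
        score = difflib.SequenceMatcher(None, excerpt, text[:len(excerpt)]).ratio()
        if score > best_score:
            best_score, best_node = score, node

    # A low best score usually means the excerpt was rewritten for the feed
    # and never appears verbatim on the page.
    return best_node if best_score >= threshold else None
```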
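
Idea 3 is mostly regex bookkeeping. A minimal sketch, assuming the cruft shows up as whole lines at the top or bottom of the extracted text; the patterns below cover only a few common variations, and you would grow the list as you run into new ones.

```python
import re

# Illustrative patterns for "<tags> by <author> at <time>" style lines.
CRUFT_PATTERNS = [
    re.compile(r"^(posted\s+)?by\s+.+\s+(on|at)\s+.+$", re.IGNORECASE),
    re.compile(r"^(tags?|filed under|categor(y|ies)):?\s+.+$", re.IGNORECASE),
    re.compile(r"^\d+\s+comments?$", re.IGNORECASE),
]


def strip_cruft(text):
    """Drop matching lines from the top and bottom of an extracted article."""
    lines = text.splitlines()

    def is_cruft(line):
        return any(p.match(line.strip()) for p in CRUFT_PATTERNS)

    while lines and is_cruft(lines[0]):
        lines.pop(0)
    while lines and is_cruft(lines[-1]):
        lines.pop()
    return "\n".join(lines)
```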
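
And a sketch of idea 4, boiled down to its simplest form: once you have grouped extractions that you believe come from the same template, lines that show up in most of them are almost certainly boilerplate rather than content. The 60% cut-off is an arbitrary assumption.

```python
from collections import Counter


def find_shared_cruft(articles, min_fraction=0.6):
    """Return lines that repeat across a cluster of same-template articles."""
    counts = Counter()
    for article in articles:
        # Count each distinct non-empty line once per article.
        for line in {l.strip() for l in article.splitlines() if l.strip()}:
            counts[line] += 1
    cutoff = len(articles) * min_fraction
    return {line for line, n in counts.items() if n >= cutoff}


def remove_shared_cruft(article, cruft_lines):
    """Strip the shared boilerplate lines from a single extraction."""
    return "\n".join(
        line for line in article.splitlines() if line.strip() not in cruft_lines
    )
```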

Zemanta uses a combination of these to create article previews in their widget, and I have to say, until I talked with the guys about this blogpost it didn’t even cross my mind that those had to be scraped (even though I’ve tried solving the same problem myself). And that’s how it’s supposed to be!

It is possible to make a content extractor worthy of a human observer, just not easy.

For those more technically inclined – all of this is explained in great detail over at Tomaž Kovačič’s blog.

  • http://www.techotalk.com/ Rohit kothari

i guess scraping is best as it reduces website speed

  • http://www.webscio.net/ Martin B.

Hehe, can only sympathize with the post.. Have done several years of blog crawling/scraping and it’s a real hell.. Tried using RSS excerpts, but sometimes people just post different content in their feeds that doesn’t appear on the actual site at all.. I think in the end the best approach was doing some simplistic machine learning to figure out where articles start based on several permalinks – though even then things may differ if the HTML is not constructed too cleanly, with weird images, dates, tags appearing where they shouldn’t..

    And yes, it’s SO easy for humans it’s just annoying :)

  • http://blog.databigbang.com Sebastian Wain

    That is right but those 300 characters help you to identify where the content is. The Google Reader NoAPI is just another trick in this space.

  • Anonymous

yes, but do you get the full text for a feed that has only the first 300 characters of each post?

  • http://swizec.com Swizec

    The problem is, what if you aren’t? Can’t just add custom css selectors for every website you want to support … and what if it gets redesigned?

  • http://swizec.com Swizec

    What’s the actual difference? I mean, what kind of information retrieval are you doing that a human would need help with beyond just a clean text or data?

  • http://swizec.com Swizec

    Ah yes, the problem of even accessing the website, let alone getting the content out.

    Forgot to touch on that one in the post, thanks for mentioning it :)

  • http://feedity.com/ Feedity

    We have a home-brewed scraper and parser (written in C#) at Feedity - http://feedity.com and let me tell you – it’s one thing to scrape data but to derive information out of it is not as easy.

  • http://twitter.com/mark_ellul Mark Ellul

    I used to use Scrapy for my scraping needs http://scrapy.org/

  • Anonymous

I scrape a *lot* of websites and have used a wide variety of tools/techniques over the years.  The first thing I check for is a sitemap of any kind, often websites will have a root /sitemap.xml even though it’s not linked from anywhere.  Depending on the website / your scraping objective, a sitemap.xml can save a lot of time.  I wish I’d started explicitly looking for them a long time ago.

I’m a big fan of python’s lxml library, I’ve found scraping with it and XPath to be faster (both to write the code as well as run the code) than BeautifulSoup.  But I won’t hesitate to use BeautifulSoup if it’s better for any given site or page section.

     9 times out of 10 I use an http proxy between my scraper code and the website.  The proxy does verbose logging and I often have it cache the responses when the site I’m scraping is especially slow or has a shitty (read: annoying) implementation of auth tokens / session management.

  • http://blog.databigbang.com Sebastian Wain

From my experience, lxml.html is better than BeautifulSoup.

  • http://blog.databigbang.com Sebastian Wain

    Yes.

  • Anonymous

how? you are trying to solve the golden grail of computer science. How to get computers to interpret what you want when you can’t be bothered to figure it out for yourself. That isn’t meant to be derogatory in any way :D

  • http://blog.databigbang.com Sebastian Wain

It is impossible to create a generic scraper because it is an undecidable problem; you can just use heuristics or specific markup (like HTML5 article)

  • http://blog.databigbang.com Sebastian Wain

    The formal one for the future is to use the HTML5 article element/tag: 
    http://dev.w3.org/html5/spec/Overview.html#the-article-element

  • Me

    “Scrape all the things” would have been better, IMO.

  • Scrapey Scrape

    If you’re writing a scraper that’s customized for a particular site, and that site uses the same template throughout, such as in a blog with a consistent theme, you just need to find the right css selectors for the things you want to scrape and run that through BeautifulSoup.

  • Anonymous

does that work even if the feed is not a full feed?

  • Anonymous

    good luck with that one :)
    btw: use BoilerPipe as a fall-back, it’s good enough

  • Twirrim

On the subject of using regex to parse HTML (a very bad idea, as personal experience has taught me): http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

    For what it’s worth, the BeautifulSoup libraries for python are brilliant and make life a lot easier.

  • http://twitter.com/valejo Tre Jones

     Another idea: watch out for Instapaper tags. http://www.instapaper.com/publishers

  • http://twitter.com/avalaxy Leon Cullens

    I ran into the same problem, and I can confirm that it’s nearly impossible to create a generic scraper that will correctly scrape the content of lots of different websites.

    Personally I’m doing all the sites one-by-one using regular expressions, DOM traversal and RSS parsers.

    Maybe one day I will find a fail-safe solution that can do this for me, but I think that this will only happen when the web agrees to use one standard format to display content.

  • http://blog.databigbang.com Sebastian Wain

    I have another alternative to retrieve historic RSSs beyond the actual one. Using Google Reader “NoAPI”. You can take a look at my article on that: 
    http://blog.databigbang.com/extraction-of-main-text-content/
