Web scraping is one of those "Bah, I can do that in a weekend!" problems. What could possibly be so difficult about going to a website and deciding what the actual content is?
Then you try.
It doesn't work.
After a bit of digging you realize this is actually a pretty tough problem - but surely there's a library for this, right? I mean, with so many products needing this, surely someone's already solved this problem?
Most of the libraries out there extract the meat of the article perfectly. Very few actually miss anything. But it feels wrong. The result just doesn't look exactly like what you, as a human, consider content.
Maybe there's the "<tags> by <author> at <time>" line left on the top, maybe it's on the bottom, perhaps there's a bit of cruft strewn in here and there ... Most scrapers deal with this with varying degrees of success: sometimes they're better on blog posts, sometimes on news sites, sometimes on something else completely.
Sometimes the website is so weird a scraper will just give up and decide the first two paragraphs aren't content. Granted, a lot of blog writers could do with just cutting away the first few paragraphs, but that's beside the point.
Your content indexer, article previewer or full RSS builder is going nowhere fast. And you're way past that weekend project phase too!
Easy for humans
This task is so easy for humans we notice every little detail. While extracting most of the content with a bit of cruft works perfectly well for indexing and data mining - showing that to a user will only end in tears.
It's a bit like drawing hands or faces - unless you get it within 5% of perfection it just looks wrong. You're almost better off drawing it at 80% of perfection and calling it a cartoon.
The uncanny valley of article extraction!
The closer you are to perfection, the fewer subconscious clues users will get to pick out the content themselves and the more jarring the difference between what they expect and what they get.
Instead of relying just on what scraping algorithms say, you should help them out with as much knowledge of the website as you can get.
1. If there is a full RSS feed, why are you even scraping? The content in there is usually clean. (The story of translating a URL to an RSS URL will come another day.)
2. Without a full RSS feed you can still learn a lot about the start of the article from looking at the excerpt published in the RSS feed. Clean up the HTML, take the published excerpt, then go on a search through the DOM to look for an approximate match - voila, the beginning of your article!
3. Sometimes you can achieve a lot by relying on good old regular expressions and hand-tuned heuristics. A lot of those erroneous first and last lines look very similar. Just write a regex to detect a few variations of those and clean them out.
4. Another reasonable approach is guessing which articles come from the same website (hint: not just the same domain). These have almost the same cruft around every article. You can run a clustering algorithm on these and figure out which bits your scrapers usually leave in or miss - then just fix those.
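To make point 2 a bit more concrete, here is a minimal sketch of the excerpt search using only the Python standard library. Everything in it is an assumption made for illustration: the names `extract_blocks` and `find_article_start`, the list of block-level tags, the 0.6 similarity cut-off, and the idea that the RSS excerpt has already been reduced to plain text. A real page will need a more careful walk through the DOM.

```python
import difflib
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "li", "blockquote", "h1", "h2", "h3", "td"}
SKIP_TAGS = {"script", "style"}  # their text is never article content


def normalize(text):
    """Collapse runs of whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip()


class BlockTextParser(HTMLParser):
    """Collects the visible text of the page, split at block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        text = normalize("".join(self._buffer))
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)


def extract_blocks(page_html):
    """Return the page text as a list of blocks, in document order."""
    parser = BlockTextParser()
    parser.feed(page_html)
    parser.close()
    parser._flush()  # keep any trailing text that no closing tag flushed
    return parser.blocks


def find_article_start(blocks, rss_excerpt, min_ratio=0.6, min_chars=20):
    """Return the index of the block that best matches the excerpt, or None."""
    # Feeds often end the excerpt with "..." or "[...]"; drop that marker.
    excerpt = re.sub(
        r"\s*(\[\s*)?(\.{3}|…)(\s*\])?\s*$", "", normalize(rss_excerpt)
    ).lower()
    best_index, best_ratio = None, 0.0
    for index, block in enumerate(blocks):
        # Compare only the overlapping prefix: the excerpt is usually cut
        # short, and it may also run on past the first paragraph.
        length = min(len(excerpt), len(block))
        if length < min_chars:
            continue
        ratio = difflib.SequenceMatcher(
            None, excerpt[:length], block[:length].lower()
        ).ratio()
        if ratio > best_ratio:
            best_index, best_ratio = index, ratio
    return best_index if best_ratio >= min_ratio else None
```

If `find_article_start` returns an index, `blocks[index:]` is a candidate for the article text with the leading cruft cut away. The match is approximate on purpose, since feed excerpts are often truncated or lightly reformatted compared to the page.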
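The regex clean-up from point 3 could look roughly like this. The patterns below are made-up examples of the kind of variations you might see, not a list taken from any real scraper, and the helper names `looks_like_cruft` and `strip_edge_cruft` are hypothetical. The sketch only ever trims the first and last few lines, so a sentence in the middle of the article that happens to resemble a byline is left alone.

```python
import re

# Hypothetical patterns for byline-style cruft; every site needs its own set.
CRUFT_PATTERNS = [
    # "<tags> by <author> at <time>", e.g. "python, scraping by Jane Doe at 10:32"
    re.compile(r"^.{0,80}\bby\s+\S.{0,60}\s+(at|on)\s+.*\d.*$", re.IGNORECASE),
    # "Posted on <date>", "Published: <date>", "Updated at <time>"
    re.compile(r"^(posted|published|updated)\s*(on|at|:)\s.{0,80}$", re.IGNORECASE),
    # "Filed under: ...", "Tags: ...", "Categories: ..."
    re.compile(r"^(filed under|tags?|categories)\s*:?\s.{0,120}$", re.IGNORECASE),
]
MAX_CRUFT_LENGTH = 150  # anything longer is assumed to be real content


def looks_like_cruft(line):
    """True if a short line matches one of the hand-tuned patterns."""
    line = line.strip()
    if not line or len(line) > MAX_CRUFT_LENGTH:
        return False
    return any(pattern.match(line) for pattern in CRUFT_PATTERNS)


def strip_edge_cruft(lines, max_edge=3):
    """Drop cruft-looking lines from the top and bottom of an extracted article."""
    start, end = 0, len(lines)
    while start < end and start < max_edge and looks_like_cruft(lines[start]):
        start += 1
    removed = 0
    while end > start and removed < max_edge and looks_like_cruft(lines[end - 1]):
        end -= 1
        removed += 1
    return lines[start:end]
```

Requiring a digit after "at" or "on" in the first pattern is a deliberate, if crude, way to tell a timestamp from ordinary prose such as "explained by experts at length".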
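Point 4 talks about running a clustering algorithm. The sketch below is a much simpler stand-in, not real clustering: it assumes you have already grouped articles by website and split each one into text blocks, then counts which blocks (after blurring digits) keep showing up across articles. The function names and the 60% threshold are assumptions chosen for illustration.

```python
import re
from collections import Counter


def fingerprint(block):
    """Normalise a text block so near-identical cruft collapses to one key."""
    text = block.lower()
    text = re.sub(r"\d+", "#", text)  # dates, times and counters vary per article
    return re.sub(r"\s+", " ", text).strip()


def find_recurring_cruft(articles, min_share=0.6):
    """Return fingerprints that appear in at least min_share of the articles.

    `articles` is a list of articles from one website, each a list of blocks.
    """
    counts = Counter()
    for blocks in articles:
        # Use a set so a block repeated inside one article counts only once.
        counts.update({fingerprint(block) for block in blocks})
    threshold = min_share * len(articles)
    return {fp for fp, seen in counts.items() if seen >= threshold}


def remove_recurring_cruft(blocks, cruft):
    """Drop every block whose fingerprint was flagged as recurring cruft."""
    return [block for block in blocks if fingerprint(block) not in cruft]
```

Blocks that survive extraction on most articles from the same site are unlikely to be content, so they are reasonable candidates for removal. The opposite problem, content your scraper keeps missing, needs a comparison against the full page and is not covered here.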
Zemanta uses a combination of these to create article previews in their widget, and I have to say, until I talked with the guys about this blog post it didn't even cross my mind that those had to be scraped (even though I've tried solving the same problem myself). And that's how it's supposed to be!
It is possible to make a content extractor worthy of a human observer, just not easy.
For those more technically inclined - all of this is explained in great detail over at Tomaž Kovačič's blog.