April 26, 2012
by Swizec Teller
21 Comments

RSS will never die

This was supposed to be a post about translating a URL into a [good] RSS feed. After reading The War on RSS and some of the passionate debate it kicked off on HackerNews I decided to write something else.

In short: RSS will never die.

The War on RSS part un

In May 2009 Steve Gillmor wrote on Techcrunch

It’s time to get completely off RSS and switch to Twitter. RSS just doesn’t cut it anymore. The River of News has become the East River of news, which means it’s not worth swimming in if you get my drift.

~ Rest in Peace RSS, Steve Gillmor on Techcrunch, May 2009

It sparked a meme. Suddenly everyone and their dog was convinced RSS was dead and we should all move on. Twitter would save us from something as horrible as a fourteen-year-old idea. That’s much too old for us web people.

In early 2011 RSS still wasn’t quite dead. “If RSS is dead, what’s next?”, a guy asked on Quora. This time, a very diplomatic answer came from the Robert Scoble (when I met him he said my startup idea is a fail just because it revolved around RSS):

First off, let’s define what dead means.

To me, anytime someone says a tech is dead it usually means that tech is not very interesting to discuss anymore, or isn’t seeing the most innovative companies doing new things with it

Essentially Scoble thinks RSS is dead because Google Reader stopped working out for him and nobody is innovating in the RSS space anymore.

Bummer.

Five months later he wrote about Feedly – an RSS reader for the iPad – saying “don’t miss out and get Feedly on your iPad”. He had called the idea of an RSS reader for the iPad stupid just 7 months prior.

Guess RSS isn’t that bad after all :)

The War on RSS part deux

This week – April 2012 – RSS still wasn’t quite dead. The War on RSS got a lot of passionate attention on HackerNews.

There’s a veritable explosion of companies removing RSS from their products … for whatever reason. Usually because it doesn’t directly benefit the bottom line – they prefer proprietary formats.

The next Mac OS – Mountain Lion – will likely ship without native RSS support. Gone from Safari (in favor of their proprietary Reader/Read Later thingy). Gone from Mail.

Somewhere in the last few versions Firefox removed the RSS icon from its usual place in the URL bar.

Twitter removed public support for RSS feeds of user accounts. The feeds still exist – discovering them just takes a bit of trickery since they aren’t even mentioned in the HTML anymore.
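
For context, a feed is normally "mentioned in the HTML" through a `<link rel="alternate">` tag in the page head, which is what readers and browsers look for. Here is a minimal sketch of that discovery step using only Python's standard library; the names `FeedLinkParser` and `discover_feeds` are made up for illustration, and the MIME types listed are just the two common ones. When a site drops the tag, this returns nothing and you are back to trickery.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# The two MIME types feeds are usually advertised with (an assumption, not exhaustive).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects hrefs of <link rel="alternate"> tags that advertise a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, base_url):
    """Return absolute feed URLs advertised in the given HTML, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

Calling `discover_feeds(page_html, page_url)` on a page without the tag gives an empty list, which is exactly the situation described above.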

Once upon a time even Facebook had support for profile RSS feeds. These have long been gone, so long in fact I don’t remember ever having seen them.

And there has never been native RSS support in Chrome. So much for that.

This time RSS is well and truly busted, right? Took an arrow to the knee, never to be heard from again.

RSS Will Never Die

For a piece of tech that was declared dead and boring almost three years ago, RSS can stir up a surprisingly strong debate … mostly passionate users clinging on for dear life.

I asked Twitter whether anyone still uses RSS as a human. The replies started flying in as quickly as I pressed the submit button. 11 yes, 1 no-ish, 1 sort of no and 1 resounding no.

The data is skewed, yes. Only people passionate enough to care replied and I am well aware that Normal Humans ™ don’t knowingly use RSS. Still, that’s quite a few responses for a random question posted to Twitter by some random guy.

It shows RSS will never die because of a simple reality: power users.

There is something called the 90-9-1 rule of online participation. At its core is the idea that 90% of users only lurk, 9% contribute now and then, and the top 1% of contributors produce most of the content.

Saying those top contributors are your power users is a pretty safe bet. And that’s why RSS is here to stay for at least a while longer – all those people doing most of the sharing? A lot of their stuff comes from RSS.

Why do people still use RSS anyway?

Ok, so the top 1% of that top 1% may have moved away from RSS and onto social media. Or at least that’s what everyone was claiming back in 2009 when Twitter was still something fresh, new and exciting. And most of all, much, much slower.

Twitter is not a replacement for RSS. Not by a long shot. It’s too busy!

My Twitter stream gets about 30 new messages every minute or two. This isn’t an environment to follow important-ish updates. Certainly not a place to look for 500+ word chunks of text that take ten minutes to read.

And god forbid anyone writes their blog only once a week, I’d miss 99% of their updates!

That’s where RSS comes in.

Not only does it take an hour for ten new posts to reach my Google Reader, but when something does slip by, there is a sidebar full of subscriptions where I can see that, hey, there’s a bunch of stuff I want to read … eventually. No pressure. It’s all going to be here tomorrow, a week from now … even a month.

By the way, anything older than a week or two stops existing on Twitter.

When I want to read The Art of Manliness, I can just waltz over to Google Reader and check out the last few posts. No rush. The content is long, but it’s informative and it waits for me. There’s also no interruption or conversation. Just the curated best of what they have to say.

None of that on their Twitter though. Even though they only post every couple of hours, most of it is still reposts of old stuff and answering questions. I think there’s actually less than one new Actual Post ™ per day.

It gets worse for people, like me, who use Twitter as persons. Most of it is just random chitchat you don’t care about, sharing cool links from the web and generally everything but an RSS replacement for my personal blog.

Consequently, RSS offers bigger exposure to your content.

Looking at a recent personal post … tweeting three times created 67 clickthroughs. Posting to RSS reached 145 readers, however Feedburner calculates that.

That’s a big difference!

RSS may have flopped for the regular user. It’s complex and kind of weird, but for that most important of readers – a fan – it will never really die.

And that’s before we even consider computers needing a simple and open way to follow websites’ updates.
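
To make the "simple and open" part concrete, here is a rough sketch of how little code a program needs to follow a site through RSS 2.0, using only the standard library. The sample feed, the `new_items` helper and the idea of tracking seen `guid` values are illustrative assumptions, not how any particular reader works.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 document standing in for a downloaded feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>RSS will never die</title>
      <link>http://example.com/rss-will-never-die</link>
      <guid>http://example.com/?p=2</guid>
    </item>
    <item>
      <title>Older post</title>
      <link>http://example.com/older-post</link>
      <guid>http://example.com/?p=1</guid>
    </item>
  </channel>
</rss>"""


def new_items(feed_xml, seen_guids):
    """Return (guid, title) pairs for items whose guid has not been seen yet."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        # Fall back to the link when a feed omits <guid>.
        guid = item.findtext("guid") or item.findtext("link")
        if guid and guid not in seen_guids:
            fresh.append((guid, item.findtext("title", default="")))
    return fresh
```

A crawler would fetch the feed on a schedule, call `new_items` with the guids it already stored, and process whatever comes back.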

April 18, 2012
by mateja
0 comments

EDF2012 Hackathon – is your business ready for Linked Open Data?

Planning a trip to Denmark in June? No? Think again.

Copenhagen Business School (CBS) will host the European Data Forum 2012 (EDF2012) on June 6-7, 2012. This two-day conference is a meeting place for small and medium-sized enterprises (SMEs), researchers, policy makers and community initiatives to discuss the challenges of Big Data, novel data-driven business models, technical innovations and other important aspects (does open government data ring any bells?). The good thing is you only have to pay for your travel, because participation is free of charge. One of the main organizers is the LOD2 consortium (full disclosure: we are in this consortium).

One of the really cool side events of the conference is the EDF2012 Hackathon, which will focus on “Integrating Linked Open Data into Business Content Management”. Organizers are counting on senior developers and software architects who are looking for ways to use web-based data resources (think of Linked Open Data) and incorporate them into their business solutions. Factual is a nice example of a service using Open Data (just one of many).

open government data - simple venn diagram

Photo credit: justgrimes

Well, EDF2012 is not the only event related to Open Data this year. Last month I participated in Open Government Data Business Day 2012 in Vienna; this month Seattle, King County and Washington state are organizing Startup Weekend GOV on April 27 – 29, 2012; and Open Government Data (OGD) Camp will take place in Helsinki this autumn. I didn’t make all this up: visit the Open Government Data hub, where a lot of interesting events are on the calendar.

Anyway, back to EDF2012… You can find more information on their website, register (free of charge) for the event, follow them on Twitter, tweet (use the hashtag #EDF2012) and/or spread the PR word. :)

March 24, 2012
by mateja
0 comments

A day in the life of LOD2 people…

… and life of the LOD2 in a day.

A plenary meeting for the LOD2 project took place in Vienna this week (March 21-23). When Martin suggested we volunteer for the blog parade, I was the first to raise my hand, not because we were offered LOD2 mugs as a reward (I found that out after raising my hand) but to use this as a chance to show people what the “LOD2 project meeting” is all about, and after all… we at Zemanta are pro-blogging! :)

A lot has been said about the LOD2 project and the first day of the meeting, therefore I’ll skip the introduction and rather focus on integration, the main topic of the afternoon discussion on our 2nd day of LOD2 plenary.

Note: from this point on I’m assuming you’re familiar with the notion of Linked Open Data (LOD). 

Dealing with open data requires lots of different kinds of (software) tools: first the data needs to be obtained one way or another, then cleaned and transformed to appropriate formats, so it can be enriched, linked… you name it. Follow me through this little exercise: imagine you are one of the poor souls… or let’s say open data pioneers, e.g. administrative workers, assigned to the task of making your agency’s data open, free, and re-usable. You may have LOD-aware friends who tell you about SILK or PoolParty, or they mention OntoWiki, LIMES, Sindice, maybe even LODGrefine (to name but a few).

Scream (Photo credit: Wikipedia)

At first you’re puzzled, but then you pluck up the courage to google the names. Wow! So many choices and functionalities, which ones to use? How to install this? What the heck is RDF schema?

You start dreaming how wonderful it would be to have all the tools in one place and someone to guide you through the whole process, letting you know what you’re supposed to do next…

Fortunately, your dream is a LOD2 reality (in the making). The tools mentioned above are integrated into the LOD2 Technology Stack and even more tools will be integrated into it. The best thing is that the tools in the LOD2 stack are freely available under an open source license. Sometimes semantic web dreams DO come true. :)

However, having all the tools in one place is only the beginning. What about the flow of information between components? We are currently working on making the workflow as smooth and user-friendly as possible. This includes an integrated and unified user interface that provides a shared user experience by exposing the functionalities of the tools that matter to the semantic web user.

It seems (business) people are more likely to trust data if they have some provenance information: who did what to which pieces of data from which origin. Sure, this requires authentication, which has also been on our agenda for months now. Having a simple username and password in a local database doesn’t feel right for the world of Linked Open Data; authentication has to be webish. Besides, who wants to log in to each and every one of the tools separately? Google, Amazon, Facebook, Twitter… they all made us sign-in-lazy by enabling access to all sorts of services in 1-2 clicks. Fairly easy, right? Not really. Tools in the stack support different kinds of authentication (if any), and some of them define user roles to grant access to certain functionalities. Should we use OAuth or OpenID? What about WebID? We had several (online) integration meetings – before this one in Vienna – to discuss pros and cons. Finally we decided to go with WebID and see what happens. If not sure what to do: 1. research, 2. experiment, 3. evaluate. Rinse and repeat.

And now something completely different. Another topic on our agenda for this day was the development of application scenarios and testing of the LOD2 Stack configurator. In my opinion one of the key elements in demonstrating the value of any tool is real-life use case scenarios. The LOD2 Stack is no exception. The Mihajlo Pupin Institute presented scenarios they’re working on, e.g. introducing LOD into their national statistical office (making it more transparent and accessible with LOD), airport emergency training and leveraging HRM corporate intelligence. We also heard a few words about the LOD2 Management suite… and then it was already time to go to the Open Government Data Business Day 2012 – a call for action for both government agencies and SMEs to start contributing to LOD.

The Open Government Data (OGD) Business Day event deserves a blog post of its own, but I’ll try to summarize it in a few points:

  • Power to the user: Vienna is empowering end-users by giving them means to check if they are really getting what they paid for (e.g. speed, latency of network connections).
  • Let them play: Open up data, let people look, take and play. We can all benefit from this.
  • Follow best practices: UK did it (data.gov.uk), so did the Netherlands, Austria is doing it, what are you/we waiting for?
  • Gravity kick: If you open a little bit of data, it is hard to stop opening it.
  • Making data free pays: Change/update outdated business models, find new ways of getting the revenue. Don’t charge for data, make money from good services based on open data instead!
  • Killer app myth: Don’t wait for a killer app to show you the $$ value of open data, it might be already there, but you just don’t see it. Remember email and SMS.
  • Barrier nutcracker: There will always be barriers on different levels preventing opening up the data. Break them by building the community, opening up discussion, starting a dialog. And yes, DO send an email to government agencies asking if you can get the data you’re interested in. Guess what, you just might get it!
  • Dinosaurs beware: If you don’t want to die, keep up the pace. Wolters Kluwer Deutschland, one of the players in the publishing industry, knows this. They’re making their way towards becoming publishers of Open Data. Hell yeah! :)
  • Game of less & more: We need less data fragmentation and more data roaming. Elvis would say: A little less conversation, a little more action please.

It was a very busy day, indeed. You can find more about the plenary at the LOD2 blog.

March 19, 2012
by Swizec Teller
5 Comments

What’s the best thumbnail for this page?

A feature exists that we all use tens of times a day. A feature almost every social startup implements eventually. A feature so neglected not a single library is devoted to it.

goose (Photo credit: nic_r)

Thumbnail extraction.

Whenever you share something on Facebook or G+, a thumbnail is extracted from the page. Sometimes good, sometimes bad, sometimes laughably atrocious.

There’s no denying it – pictures make things sexier. People engage with sexy things. If your <insert site> can drive engagement and clickthroughs … congratz! You have a successful product on your hands.

So why has nobody solved the problem of: Here is a link, which image best represents it?

Perhaps it’s just one of those problems that seem almost too trivial to solve. Sure, just throw a hacker at it for two nights and you’ll come up with something that works. It won’t be perfect, but as it turns out, users aren’t very demanding. The simple fix is just to give them a choice of three images and they’ll solve the problem for you.

This attitude ganks the Don’t-Repeat-Yourself principle in a dark alleyway and beats it with a stick. Repeatedly.

Lucky for you, two solutions roam around Github – Goose and Reddit.

Goose

Goose is a web scraping library written originally in Java and being ported to Scala as we speak. Part of its offering is an image extractor.

Geese Over Odiham (Photo credit: Vampire Bear)

The approach is pretty simple:

  1. Does anything match a regex of known elements – hand tuned for a few sites
  2. If not, are there any large images? – download all images, check their dimensions, make sure to filter out any known duds (banner-y images, gifs, etc.)
  3. If not, check some meta tags

Since the three steps are mutually exclusive (the first one that finds something wins), the second is the likeliest to find an image. This is actually a somewhat slow approach, but it works well enough. Just implement this and you’re probably golden …

… if you aren’t processing tens of thousands of pages a day.
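
The ordered-fallback idea can be sketched in a few lines of Python. This is not Goose's code: the strategy functions and the dictionary standing in for a parsed page are hypothetical placeholders, and only the "try each step in order, stop at the first hit" structure comes from the description above.

```python
def first_match(strategies, page):
    """Try each (name, strategy) in order and return the first non-empty result."""
    for name, strategy in strategies:
        image = strategy(page)
        if image:
            return name, image
    return None, None


# Hypothetical stand-ins for the three steps; `page` is a plain dict here,
# where a real extractor would work on parsed HTML.
def known_element(page):
    return page.get("known_element_image")


def largest_image(page):
    # images: list of (url, width, height), known duds already filtered out
    images = page.get("images", [])
    if not images:
        return None
    return max(images, key=lambda img: img[1] * img[2])[0]


def meta_tag(page):
    return page.get("meta_image")


GOOSE_LIKE = [
    ("known element", known_element),
    ("large image", largest_image),
    ("meta tag", meta_tag),
]
```

Because the chain stops at the first hit, the order of `GOOSE_LIKE` decides which source wins; putting the cheap meta tag check first would avoid most image downloads, at the cost of trusting whatever the page author declared.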

Reddit

Reddit’s approach is even simpler – a 1688 SLOC Python script.

Rally to Restore Sanity and/or Fear.

Image via Wikipedia

  1. Traverse DOM looking for images
  2. Download image until you can read its metadata
  3. Abort download
  4. If area is smaller than 5000, ignore
  5. If image is wide or narrow, ignore
  6. If name contains “sprite”, penalize

Then all found images are ordered by area and the first one becomes the lucky winner of the Being a Thumbnail award. This still basically downloads all images, but with the caveat that it actually doesn’t – just enough to determine size, whereas Goose actually writes all images to disk.
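
Here is a rough sketch of those rules, not Reddit's actual script. The 5000 area threshold comes from the list above; the aspect-ratio cut-off and the sprite penalty are assumed values, and `png_size` only illustrates the "read just enough bytes" trick for PNG, whose width and height sit in the first 24 bytes of the file.

```python
import struct

MIN_AREA = 5000          # from the steps above
MAX_ASPECT_RATIO = 1.5   # assumed cut-off for "wide or narrow"
SPRITE_PENALTY = 10      # assumed divisor for sprite-like file names

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(first_bytes):
    """Read (width, height) from the first 24 bytes of a PNG, or None."""
    if len(first_bytes) < 24 or first_bytes[:8] != PNG_SIGNATURE:
        return None
    # Bytes 8-15 are the IHDR chunk length and type; width and height follow.
    return struct.unpack(">II", first_bytes[16:24])


def score(url, width, height):
    """Return a ranking score for one image, or None if it should be ignored."""
    area = width * height
    if area < MIN_AREA:
        return None
    if max(width, height) / min(width, height) > MAX_ASPECT_RATIO:
        return None
    if "sprite" in url.lower():
        area /= SPRITE_PENALTY
    return area


def best_thumbnail(candidates):
    """Pick the URL with the highest score from (url, width, height) tuples."""
    scored = [(score(url, w, h), url) for url, w, h in candidates]
    scored = [(s, url) for s, url in scored if s is not None]
    if not scored:
        return None
    return max(scored)[1]
```

In practice the download loop would feed chunks into an image parser until dimensions are known and then close the connection, which is what keeps this cheaper than writing every image to disk.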

But the most amazing part of Reddit’s algorithm is what happens next.

The best candidate is cropped … according to colour entropy. This basically ensures the thumbnail isn’t just sky or whatever, but actually something useful to look at.

A bit of simple maths on the image’s histogram and voila, perfect thumbnails.
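
To give a feel for that "simple maths", here is a minimal sketch of entropy-based cropping on a grayscale image stored as a list of rows. It is an illustration under assumptions, not Reddit's implementation: the function names, the row-only cropping and the one-row slice size are all made up for the example. The idea is to repeatedly drop whichever edge slice has the flatter histogram (lower Shannon entropy).

```python
import math


def entropy(pixels):
    """Shannon entropy (in bits) of a flat list of grayscale values."""
    if not pixels:
        return 0.0
    counts = {}
    for p in pixels:
        counts[p] = counts.get(p, 0) + 1
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def crop_rows_by_entropy(image, target_height, slice_height=1):
    """Trim rows from the less 'interesting' edge until target_height rows remain.

    image is a list of rows, each row a list of grayscale values.
    """
    top, bottom = 0, len(image)
    while bottom - top > target_height:
        step = min(slice_height, bottom - top - target_height)
        top_slice = [p for row in image[top:top + step] for p in row]
        bottom_slice = [p for row in image[bottom - step:bottom] for p in row]
        # Drop the slice with the flatter histogram (e.g. plain sky).
        if entropy(top_slice) < entropy(bottom_slice):
            top += step
        else:
            bottom -= step
    return image[top:bottom]
```

A uniform strip of sky has entropy zero, so it is the first thing to go, while rows with many different values survive the crop.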

Zemanta

Zemanta’s approach seems light years more sophisticated, at least according to what @tomaz describes; I didn’t get to poke around the code. Rather than just slapping together a bunch of heuristics, they take a more disciplined approach:

  1. get image candidates from DOM – just like reddit and, to an extent, Goose
  2. get media:thumbnail from the RSS
  3. look at opengraph metadata (og:image) and other embed hints
  4. very dataset specific, but looking for WordPress‘s featured image can be of great help
  5. images embedded with Zemanta have special CSS classes – very helpful
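
Of these hints, the Open Graph one is the easiest to show. Below is a minimal standard-library sketch that pulls `og:image` values out of a page; the class and function names are invented for illustration and say nothing about how Zemanta's code is written.

```python
from html.parser import HTMLParser


class OpenGraphImageParser(HTMLParser):
    """Collects the content of every <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = (attr.get("property") or "").lower()
        if prop == "og:image" and attr.get("content"):
            self.images.append(attr["content"])


def og_images(html):
    """Return og:image URLs in the order they appear in the HTML."""
    parser = OpenGraphImageParser()
    parser.feed(html)
    return parser.images
```

Since the page author chose this image on purpose, it makes a strong candidate without downloading anything.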

With the candidates collected, all it takes now is performing a priority sort according to the extracted metadata, shoving all the images into RabbitMQ and waiting for the workers to process each useful image.
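
A toy version of that last step might look like the sketch below. Everything specific in it is an assumption: the ordering in `SOURCE_PRIORITY` is a guess (the description above gives none), the candidate dictionaries are a made-up shape, and Python's in-process `queue.Queue` merely stands in for RabbitMQ.

```python
import queue

# Assumed ordering of hint sources, most trusted first.
SOURCE_PRIORITY = {
    "og:image": 0,
    "media:thumbnail": 1,
    "wordpress-featured": 2,
    "zemanta-class": 3,
    "dom": 4,
}


def enqueue_candidates(candidates, work_queue):
    """Sort candidates by source priority, queue them, and return the URL order."""
    ordered = sorted(
        candidates,
        # Unknown sources sort after every known one.
        key=lambda c: SOURCE_PRIORITY.get(c["source"], len(SOURCE_PRIORITY)),
    )
    for candidate in ordered:
        work_queue.put(candidate)
    return [c["url"] for c in ordered]
```

Workers would then pull candidates off the queue in that order, so the most promising images are downloaded and checked first.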

So what IS the best thumbnail for this page?

To be honest, I don’t know. Unlike article extraction, it doesn’t seem anyone anywhere has ever put a lot of thought into getting thumbnails out of a website. Let alone performing a precision/recall study of different approaches.

Guess it’s just not something users care much about and almost any old image will do as long as it’s found on that page and not on too many other pages.

But it’s an interesting little challenge. If someone makes it into a service I bet a lot of people will be very happy – one less problem to think about.
