Fruit Blog

NoSQL meetup, report

Posted by andraz, under Uncategorized on November 4th, 2009

It’s ApacheCon week (happening in Oakland)! Meetups galore! The NoSQL meetup took place yesterday, so here’s the report.

The “NoSQL” name. It’s bad. It’s a problem. It’s negative. The people who came up with the name (Eric Evans) are trying to change it. Current contenders: AlongSideSQL, NotOnlySQL. The name problem stems from the fact that SQL had a complete grip on the database world for a long time. What’s going on now is a Cambrian explosion of new data storage paradigms, driven by cloud computing and new requirements. However, these projects really don’t have much in common except for the fact that they are not SQL.

In my humble opinion, Steve Yen had the most insightful presentation: NoSQL is a horseless carriage (a.k.a. the car). I found one slide especially important. He proposed a taxonomy of NoSQL:

  • key‐value‐cache – memcached, repcached, coherence, infinispan, eXtreme scale, jboss cache, velocity, terracotta
  • key‐value‐store – keyspace, flare, schema‐free, RAMCloud
  • eventually‐consistent key‐value‐store – dynamo, voldemort, Dynomite, SubRecord, MotionDb, Dovetaildb
  • ordered‐key‐value‐store – tokyo tyrant, lightcloud, NMDB, luxio, memcachedb, actord
  • data‐structures server – redis
  • tuple‐store – gigaspaces, coord, apache river
  • object database – ZopeDB, db4o, Shoal
  • document store – CouchDB, Mongo, Jackrabbit, XMLDatabases, ThruDB, CloudKit, Persevere, Riak Basho, Scalaris
  • wide columnar store – BigTable, Hbase, Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI

And this is not an exhaustive list of types and implementations! It’s madness. Madness, I tell you! There were two important additional theses:

  • There will be just a few winners
  • And functionality-wise they will all converge to SQLdx (dx being a derivative).

Relax we shall not.

People generally see these solutions as throwing away ACID and transactions in exchange for:

  • scalability by adding nodes without fuss
  • reliability
  • availability
  • high throughputs or low latency

Next up was Ryan Rawson, introducing HBase 0.20, which is basically an open source implementation of Google’s BigTable. It is used by many companies, most famously by StumbleUpon. Something to freak out old-school SQLers: millions of columns in a row. Fun. Some of these projects come from use cases rooted in new views on data, business and processes. Basically, you recognize that the data and the real world are inconsistent anyway (“Your database is already eventually consistent with reality”), so why not relax some of the constraints in order to be able to scale in one way or another? Good quote: “When in doubt, take customer’s order!”.
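
To make the very wide records above a bit more concrete, here is a minimal sketch of how a BigTable-style row can be pictured: a sparse, sorted map from column name to value. This is only an illustration in plain Python, not the HBase API; the `WideRow` class and the column names are made up for the example.

```python
from bisect import bisect_left, insort


class WideRow:
    """Toy model of one wide-column row: a sparse, sorted map of column -> value."""

    def __init__(self, key):
        self.key = key
        self._cells = {}    # column name -> value; columns that are absent cost nothing
        self._columns = []  # column names kept sorted so prefix scans are cheap

    def put(self, column, value):
        # Only new columns need to be placed into the sorted index.
        if column not in self._cells:
            insort(self._columns, column)
        self._cells[column] = value

    def get(self, column, default=None):
        return self._cells.get(column, default)

    def scan(self, prefix):
        """Return (column, value) pairs whose column name starts with prefix."""
        start = bisect_left(self._columns, prefix)
        result = []
        for column in self._columns[start:]:
            if not column.startswith(prefix):
                break  # sorted order means all matches are contiguous
            result.append((column, self._cells[column]))
        return result


# Hypothetical data: one row per user, one column per visit.
row = WideRow("user:42")
row.put("visit:2009-11-04", "/apachecon")
row.put("visit:2009-11-03", "/nosql-meetup")
row.put("profile:name", "example")

visits = row.scan("visit:")
```

Because every row carries its own set of columns, adding a new column to one row does not touch any other row, which is what makes rows with a huge number of columns practical in this model.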

Brian Cooper from Yahoo presented their PNUTS data store (internally called Sherpa). It is interesting to see what kind of challenges Yahoo has: moving data between data centers around the world to keep it close to the user, and recognizing that there’s nothing wrong if a user’s away-status change looks different from the west coast than from the east coast, as long as the difference lasts only a short time (a few seconds). Many data centers, many racks, many machines. Interesting.
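
The status example can be sketched as a toy simulation of asynchronous replication. This is an assumption-laden illustration of eventual consistency in general, not a description of how PNUTS actually replicates data; the `ReplicatedStore` class, the replica names and the two-second delay are all invented for the example.

```python
import heapq


class ReplicatedStore:
    """Toy store: a write lands on one replica at once and on the others after a delay."""

    def __init__(self, replica_names, delay):
        self.replicas = {name: {} for name in replica_names}
        self.delay = delay   # assumed fixed replication delay, in seconds
        self.now = 0.0       # simulated clock
        self._pending = []   # heap of (apply_time, seq, replica, key, value)
        self._seq = 0        # tie-breaker so updates apply in write order

    def write(self, origin, key, value):
        # The local replica sees the write immediately.
        self.replicas[origin][key] = value
        # Every other replica gets it only after the replication delay.
        for name in self.replicas:
            if name != origin:
                heapq.heappush(
                    self._pending,
                    (self.now + self.delay, self._seq, name, key, value),
                )
                self._seq += 1

    def advance(self, seconds):
        # Move the clock forward and apply every update that is now due.
        self.now += seconds
        while self._pending and self._pending[0][0] <= self.now:
            _, _, name, key, value = heapq.heappop(self._pending)
            self.replicas[name][key] = value

    def read(self, replica, key):
        return self.replicas[replica].get(key)


store = ReplicatedStore(["west", "east"], delay=2.0)
store.write("west", "user:status", "away")

# Right after the write, the two coasts disagree.
before = (store.read("west", "user:status"), store.read("east", "user:status"))

# Once the delay has passed, they converge.
store.advance(2.0)
after = (store.read("west", "user:status"), store.read("east", "user:status"))
```

The point of the sketch is only the window between `before` and `after`: reads are fast and local on both coasts, and the price is a short period in which they can return different answers.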

Project Voldemort was presented by Alex Feinberg (who has a cool Twitter name, @strlen). It’s LinkedIn’s implementation of Amazon’s famous Dynamo paper. It’s open source and already used for a lot of LinkedIn’s internal back-end processing. [Update: Alex mailed me a more accurate explanation: LinkedIn employs Voldemort for user-facing data-driven features, such as "People You May Know" and the "Recommendation Engine".]

The last presentation was from Eric Evans, describing Cassandra. The first promise was: “No dataloss anymore, honestly!”. Cassandra’s history is interesting. It’s a project that started inside Facebook and was then released as a tarball code dump. Later it was adopted by the community and brought into the Apache family. Nobody knows what Facebook is currently doing with it internally. Oh, and the story about Cassandra’s name is also interesting: it’s a play on Oracle. In Greek mythology, Cassandra predicted the future and was always right, but no one believed her.

All in all, what’s coming for NoSQL in the long term is thesis (SQL), antithesis (NoSQL), synthesis (NoSQL offering SQL features and vice versa).

Before I end this report, two more points:

  • ZooKeeper is a cornerstone of many projects. I am sorry I didn’t know about it at the beginning of 2008, when we implemented something very similar internally for Zemanta. We have to look into the possibility of adopting it.
  • The standard scenario that people use to describe their product’s reliability: “Imagine a big San Francisco earthquake…”, and then they explain how their solution keeps running. Let me repeat: imagine a big San Francisco earthquake shutting down all the datacenters in SF and Silicon Valley. Imagine.
  • gtani
    The taxonomy is valuable. I think the other important dimension you alluded to with "eventual consistency" is CAP. For example, of the erlang db's, Mnesia gives up P (and with enough dirty writes, potentially C) ; CouchDB gives up C, Scalaris gives up A (as a set of very loose generalizations). Transactions in mnesia can be expensive, I'm not sure how scalaris mitigates that. And riak, from my limited blog reading, seems to give you more control with the N, R, W settings.
  • andraz
    Hi tani!

    Yeah, there needs to be a whole matrix of properties and these solutions laid out on them. This is all about balancing the requirements in one way or another and you can't have it all. It's gonna be a gold mine for writers of white-papers and analysts.

    However, the question for architects right now is which solutions are here to stay, so that we can bet our infrastructures on them. This is even more important than a perfect fit to business needs, because when you choose something that goes away and is not maintained anymore, you need to spend a lot of resources to switch.

    To anyone wanting to know more about CAP theorem, I'd suggest reading http://www.julianbrowne.com/article/viewer/brew...
  • gtani
    People have tried

    http://spreadsheets.google.com/ccc?key=0Ale_YaC...

    The threshold questions before adopting a db framework of whatever variety are: 1) could you maintain it yourself if you had to? Some of the listings are 13K SLOC and under. It would be nice if they had numbers of committers, coverage ratio, numbers of bugs submitted and closed out, stuff like that. And I don't think answering "Expansion?" and "Partitoning?" as yes/no questions is all that meaningful.

    2) Where are early adopters/people submitting patches? And how many people put it into production. The spreadsheet is a decent first attempt at collecting all that.