It’s ApacheCon week (happening in Oakland)! Meetups galore! Yesterday the NoSQL meetup took place, so here’s the report.
The “NoSQL” name. It’s bad. It’s a problem. It’s negative. The people who invented the name (Eric Evans) are trying to change it. Current contenders: AlongSideSQL, NotOnlySQL. The name problem stems from the fact that for a long time SQL had a complete grip on the database world. What’s going on now is a Cambrian explosion of new data storage paradigms, driven by cloud computing and new requirements. However, these projects really don’t have much in common except for the fact that they are not SQL.
In my humble opinion Steve Yen had the most insightful presentation: NoSQL is a horseless carriage (a.k.a. the car). I found one slide especially important. He proposed a taxonomy of NoSQL, which is:
- key-value-cache – memcached, repcached, coherence, infinispan, eXtreme scale, jboss cache, velocity, terracotta
- key-value-store – keyspace, flare, schema-free, RAMCloud
- eventually-consistent key-value-store – dynamo, voldemort, Dynomite, SubRecord, MotionDb, Dovetaildb
- ordered-key-value-store – tokyo tyrant, lightcloud, NMDB, luxio, memcachedb, actord
- data-structures server – redis
- tuple-store – gigaspaces, coord, apache river
- object database – ZopeDB, db4o, Shoal
- document store – CouchDB, Mongo, Jackrabbit, XMLDatabases, ThruDB, CloudKit, Persevere, Riak Basho, Scalaris
- wide columnar store – BigTable, HBase, Cassandra, Hypertable, KAI, OpenNeptune, Qbase, KDI
And this is not an exhaustive list of types and implementations! It’s madness. Madness, I tell you! There were two important additional theses:
- There will be just a few winners
- And functionality-wise they will all converge to SQLdx (dx being a derivative).
Relax we shall not.
People generally see these solutions as throwing away ACID and transactions in exchange for:
- scalability by adding nodes without fuss
- high throughput or low latency
Next up was Ryan Rawson introducing HBase 0.20, which is basically an open source implementation of Google’s BigTable. It is used by many companies, but most famously by StumbleUpon. Something to freak out old-school SQLers: millions of columns in a row. Fun. Some of these projects come from use cases rooted in new views on data, business and processes. Basically you recognize that the data and the real world are inconsistent anyway (“Your database is already eventually consistent with reality”), so why not relax some of the constraints in order to be able to scale in one way or another. Good quote: “When in doubt, take customer’s order!”.
Brian Cooper from Yahoo presented their PNUTS data store (internally called Sherpa). It is interesting to see what kind of challenges Yahoo has: moving data between data centers around the world to be closer to the user, and recognizing that there’s nothing wrong if a user’s away status looks different from the west coast than from the east coast, as long as the difference lasts only a short time (a few seconds). Many data centers, many racks, many machines. Interesting.
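To make the away-status example concrete, here is a toy sketch of two replicas that disagree for a moment and then converge. It is not how PNUTS actually works; it assumes a simple last-writer-wins rule with explicit timestamps, and the `Replica` class, its methods and the `alice:status` key are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical single-data-center replica (illustration only)."""
    name: str
    data: dict = field(default_factory=dict)    # key -> (timestamp, value)
    outbox: list = field(default_factory=list)  # writes not yet shipped to peers

    def write(self, key, value, timestamp):
        # Apply locally right away, ship to other data centers later.
        self._apply(key, value, timestamp)
        self.outbox.append((key, value, timestamp))

    def _apply(self, key, value, timestamp):
        # Last-writer-wins: keep the value with the newest timestamp.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def replicate_to(self, peer):
        # Asynchronous replication step: push pending writes to a peer.
        for key, value, timestamp in self.outbox:
            peer._apply(key, value, timestamp)
        self.outbox.clear()


west = Replica("west")
east = Replica("east")

west.write("alice:status", "online", timestamp=1)
west.replicate_to(east)

west.write("alice:status", "away", timestamp=2)
stale = east.read("alice:status")  # east has not seen the second write yet

west.replicate_to(east)
fresh = east.read("alice:status")  # replicas agree again
```

Between the second write and the second `replicate_to` call, a reader on the east coast sees the old status. That window is the "few seconds" the talk considered acceptable.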
Project Voldemort was presented by Alex Feinberg (who has a cool Twitter name, @strlen). It’s LinkedIn’s implementation of Amazon’s famous Dynamo paper. It’s open source and already used for a lot of LinkedIn’s internal back-end processing. [Update: Alex mailed me a more accurate explanation: LinkedIn employs Voldemort for user-facing data-driven features, such as "People You May Know" and the "Recommendation Engine"].
The last presentation was from Eric Evans describing Cassandra. The first promise was: “No data loss anymore, honestly!”. Cassandra’s history is interesting. It’s a project that started inside Facebook and was then released as a tarball code dump. Later it was adopted by the community and brought into the Apache family. Nobody knows what Facebook is currently doing with it internally. Oh, and the story about Cassandra’s name is also interesting: it’s a play on Oracle. In Greek mythology Cassandra predicted the future and was always right, but no one believed her.
All in all, what’s coming for NoSQL in the long term is thesis (SQL), antithesis (NoSQL), synthesis (NoSQL offering SQL features and vice versa).
Before I end this report, two more points:
- ZooKeeper is a cornerstone of many projects. I am sorry I did not know about it at the beginning of 2008, when we implemented something very similar internally for Zemanta. We have to look into the possibility of adopting it.
- The standard scenario that people use for describing their product’s reliability: “Imagine a big San Francisco earthquake…”, and then they explain how their solution keeps running. Let me repeat: Imagine a big San Francisco earthquake shutting down all the datacenters in SF and Silicon Valley. Imagine.