Bad Wolf, in progress
The following products were reviewed to assess their capabilities:
Cassandra (used by Facebook, Twitter, Digg), CouchDB, and MongoDB.
We also looked at these products, but ruled them out for various qualitative reasons:
Project Voldemort (used by LinkedIn), HBase, and Redis.
Project Voldemort and HBase did not have sufficient Python drivers and were excluded from the analysis. Redis was ruled out because sharding was implemented outside of the project in a non-standardized manner.
We tested each technology using the login data for ~1.5 million users. The first test inserted all the rows into the database, and the second test queried 100,000 random usernames. Both tests were timed so that relative performance could be compared.
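The two timed tests described above can be sketched roughly as follows. This is a minimal illustration, not the harness actually used for the evaluation: `InMemoryStore`, `make_rows`, and `run_benchmark` are hypothetical names, the in-memory store is only a stand-in for a real database driver, the row contents are placeholders, and sampling usernames with replacement is an assumption.

```python
import random
import time


class InMemoryStore:
    """Hypothetical stand-in for a database client (not a real driver)."""

    def __init__(self):
        self._rows = {}

    def insert(self, username, record):
        self._rows[username] = record

    def get(self, username):
        return self._rows.get(username)


def make_rows(n):
    """Build n placeholder (username, record) pairs; real login data would go here."""
    return [("user%d" % i, {"id": i}) for i in range(n)]


def run_benchmark(store, rows, n_queries, seed=0):
    """Time a full insert pass, then n_queries lookups of random usernames."""
    # Test 1: insert every row and time the whole pass.
    start = time.perf_counter()
    for username, record in rows:
        store.insert(username, record)
    insert_seconds = time.perf_counter() - start

    # Pick the usernames to query before starting the clock, so that
    # sampling cost is not counted as query time.
    rng = random.Random(seed)
    usernames = [username for username, _ in rows]
    sample = [rng.choice(usernames) for _ in range(n_queries)]

    # Test 2: look up each sampled username and time the whole pass.
    start = time.perf_counter()
    hits = sum(1 for username in sample if store.get(username) is not None)
    query_seconds = time.perf_counter() - start

    return {
        "insert_seconds": insert_seconds,
        "query_seconds": query_seconds,
        "hits": hits,
    }


if __name__ == "__main__":
    # Scaled-down run; the evaluation used ~1.5 million rows and 100,000 queries.
    result = run_benchmark(InMemoryStore(), make_rows(15000), 1000)
    print(result)
```

To compare real systems with a harness like this, each database would need a thin wrapper exposing the same `insert` and `get` methods, so that the timing code stays identical across backends.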
Disk Usage: ~3.1G
Disk Usage: ~6.5G
Disk Usage: ~1.0G
Cassandra provided acceptable performance and had the best cluster management. Schema design was difficult and required much more foresight. Single-node performance was not as fast as MongoDB's, but would presumably improve at scale.
CouchDB had the most attractive development environment but was unacceptably slow and heavy.
MongoDB had the fastest single-node performance. MongoDB's toolset was well rounded and easy to understand. MongoDB was also the most flexible system, and ad hoc queries were much easier to write. Clustering support exists but is not as advanced as Cassandra's.
Appendix: Project descriptions and links
The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model.
MongoDB (from “humongous”) is a scalable, high-performance, open source, dynamic-schema, document-oriented database.