July 2010

Month July 2010

Bad Wolf, in progress

Farm Wedding

Greenroom: a PHP / Mongo Framework

Greenroom is a new PHP / Mongo framework, being built to the following guidelines:

1. Existing frameworks depend on SQL for much of their CRUD functionality.  A fresh start will allow for the highest quality Mongo framework.

2. Code management is not the responsibility of the framework.

3. A high-quality default CRUD interface is a must.  The most important API is the Field and Model API.

During the Alpha stage, exploratory work on implementing the basic types will be performed.  Once the types crystallize, their final behavior will be documented, tested, and implemented.

NoSQL Cloud Database Evaluation

The following products were reviewed to assess their capabilities:

Cassandra
Used by Facebook, Twitter, Digg
http://cassandra.apache.org/

CouchDB
http://couchdb.apache.org/

MongoDB
http://www.mongodb.org/display/DOCS/Home

We also looked at these products, but ruled them out for various qualitative reasons:

Project Voldemort
Used by LinkedIn
http://project-voldemort.com/

Redis
http://code.google.com/p/redis/

HBase
http://hadoop.apache.org/hbase/

Project Voldemort and HBase did not have sufficient Python drivers and were excluded from the analysis.  Redis was excluded because sharding was implemented outside of the project in a non-standardized manner.
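To illustrate why non-standardized sharding was a concern, here is a minimal sketch of client-side sharding. It assumes keys are routed by hashing them and taking the result modulo the number of shards. The two routing functions are hypothetical and are not taken from any Redis client. They only show that two clients choosing different schemes would send the same key to different nodes.

```python
import hashlib
import zlib


def shard_by_md5(key, shard_count):
    """Hypothetical scheme A: first 4 bytes of the MD5 digest, modulo shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % shard_count


def shard_by_crc32(key, shard_count):
    """Hypothetical scheme B: CRC32 of the key, modulo shard count."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


if __name__ == "__main__":
    keys = ["user%d" % i for i in range(1000)]
    # Count the keys that the two schemes would route to different shards.
    disagreements = sum(
        1 for key in keys if shard_by_md5(key, 4) != shard_by_crc32(key, 4)
    )
    print("keys routed differently by the two schemes:", disagreements)
```

Each scheme is deterministic on its own, so a single client works fine. The trouble starts when several clients or libraries share one cluster without agreeing on the routing rule.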

Methodology

We tested each technology using the login data for ~1.5 million users. The first test inserted all the rows into the database, and the second test queried 100,000 random usernames.  Both tests were timed to provide a way of comparing relative performance.
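The two tests can be sketched roughly as follows. This is not the original test script. The InMemoryStore class is a hypothetical stand-in for a real database driver, all function names are made up for illustration, and the random usernames are assumed to be drawn with replacement.

```python
import random
import time


class InMemoryStore:
    """Hypothetical stand-in for a database client; a real run would wrap a driver."""

    def __init__(self):
        self._rows = {}

    def insert(self, username, record):
        self._rows[username] = record

    def find(self, username):
        return self._rows.get(username)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def load_all(store, rows):
    """Test 1: insert every (username, record) pair; return the number inserted."""
    count = 0
    for username, record in rows:
        store.insert(username, record)
        count += 1
    return count


def query_random(store, usernames, sample_size, seed=0):
    """Test 2: look up sample_size randomly chosen usernames; return the hit count."""
    rng = random.Random(seed)
    hits = 0
    for username in rng.choices(usernames, k=sample_size):
        if store.find(username) is not None:
            hits += 1
    return hits


def run_benchmark(store, rows, sample_size):
    """Run both tests against one store and collect counts and timings."""
    usernames = [username for username, _ in rows]
    inserted, load_seconds = timed(load_all, store, rows)
    hits, query_seconds = timed(query_random, store, usernames, sample_size)
    return {
        "inserted": inserted,
        "hits": hits,
        "load_seconds": load_seconds,
        "query_seconds": query_seconds,
    }


if __name__ == "__main__":
    # Small synthetic data set, far smaller than the ~1.5 million rows
    # and 100,000 lookups used in the tests described above.
    rows = [("user%d" % i, {"password_hash": "x"}) for i in range(10000)]
    print(run_benchmark(InMemoryStore(), rows, sample_size=1000))
```

To compare products this way, each database would get its own store class with the same insert and find methods, so the timed code path is identical apart from the driver.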

Cassandra Results
Load: 52m9.835s
Query: 1m42.434s
Disk Usage: ~3.1G

CouchDB Results
Load: 198m3.774s
Query: 15m54.026s
Disk Usage: ~6.5G

MongoDB Results
Load: 15m38.976s
Query: 1m1.990s
Disk Usage: ~1.0G
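For comparing the timings above, it helps to convert them to seconds and express each one relative to a baseline. The sketch below does that with MongoDB as the baseline. The timing strings are copied from the results. The function names and the choice of baseline are assumptions made for illustration.

```python
import re

# Timings copied from the results above, in minutes-and-seconds form.
RESULTS = {
    "Cassandra": {"load": "52m9.835s", "query": "1m42.434s"},
    "CouchDB": {"load": "198m3.774s", "query": "15m54.026s"},
    "MongoDB": {"load": "15m38.976s", "query": "1m1.990s"},
}

_DURATION = re.compile(r"^(\d+)m(\d+(?:\.\d+)?)s$")


def to_seconds(text):
    """Convert a string such as '52m9.835s' to a number of seconds."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError("unrecognized duration: %r" % text)
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def relative_to(baseline, results=RESULTS):
    """Return {product: {test: ratio}}, where ratio = product time / baseline time."""
    base = {test: to_seconds(value) for test, value in results[baseline].items()}
    return {
        product: {
            test: to_seconds(value) / base[test] for test, value in timings.items()
        }
        for product, timings in results.items()
    }


if __name__ == "__main__":
    for product, ratios in relative_to("MongoDB").items():
        print("%-10s load x%.1f  query x%.1f"
              % (product, ratios["load"], ratios["query"]))
```

A ratio above 1 means the product took longer than the baseline on that test. Each figure comes from a single run on a single node, so the ratios are rough indicators only.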

Cassandra provided acceptable performance and had the best cluster management. Schema design was difficult and required much more foresight. Single-node performance trailed MongoDB's (the load took about 3.3 times as long and the queries about 1.7 times as long), but Cassandra would presumably fare better at scale.

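CouchDB had the most attractive development environment but was unacceptably slow and heavy: its load took about 12.7 times as long as MongoDB's, its queries about 15.4 times as long, and it used ~6.5G of disk against MongoDB's ~1.0G.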

MongoDB had the fastest single-node performance. MongoDB's toolset was well rounded and easy to understand. MongoDB was also the most flexible system: ad hoc queries were much easier to write. Clustering support exists but is not as advanced as Cassandra's.

Appendix: Project descriptions and links

Cassandra
The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model.
http://cassandra.apache.org/
http://incubator.apache.org/thrift/about/
http://github.com/vomjom/pycassa/
http://github.com/digg/lazyboy
http://stackoverflow.com/questions/1502735/whats-the-best-practice-in-designing-a-cassandra-data-model
http://jetfar.com/installing-cassandra-and-thrift-on-snow-leopard-a-quick-start-guide/
http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg02426.html

CouchDB
Apache CouchDB is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.
http://couchdb.apache.org/
http://code.google.com/p/couchdb-python/
http://davidwatson.org/2008/02/python-couchdb-rocks.html

MongoDB
MongoDB (from “humongous”) is a scalable, high-performance, open source, dynamic-schema, document-oriented database.
http://www.mongodb.org/