Basic Development Infrastructure is a Silver Bullet

Below is a typical complaint from a new developer, as well as the typical response:

I’ve inherited 200K lines of spaghetti code — what now?

The typical problem is that a developer has inherited a complicated, undocumented application. The typical response is the rigid imposition of best practices.

Unfortunately, to be viable, process improvement must bear fruit within an organization’s funding cycle. Thus, any investment in best practices is tightly coupled to organization size.

The first improvement should always be to document the basics:

  1. What is the “code” ?
  2. How are changes made to the “code”?
  3. What is the “data”?
  4. Where / how is the “data” stored?

Next, establish the following environments:

  1. production: where users do work
  2. staging: where developers and users work together
  3. development: where developers work alone

The key is identifying a reliable way for “code” to move from development to staging to production, and for “data” to move from production to staging to development.

There are many forms of distribution: copying a file, updating working copies, or uploading RPMs to mirrored package repositories. It doesn’t matter so much which method is chosen, so long as it is reliable for the organization.

Similarly, there are many ways of backing up and restoring data, including many vendor-specific tools. Again, reliability is the key.

Establishing this development architecture allows any developer to quickly gain traction and implement further improvements.

MongoDB Content Taxonomy Schema

The greatest thing about MongoDB is replica sets. It is a nice feeling to have failover and distributed queries across three database machines. It’s also nice to be able to replace all three in one evening while the application stays available.

But the next great thing about doing CMS work with MongoDB is that all records are documents. That is to say, each record is an open-ended collection of key/value pairs. It is nice to tack on additional fields as the content evolves into more structured forms.

Where this agility breaks down is with evolving query requirements. Scalable MongoDB performance requires indexes, but there is a hard limit on the number of indexes a single collection can have.

In particular, content taxonomy can easily lead to an explosion of fields, with a subsequent explosion of query techniques and indexes. For example, this post might have the following fields:

  • Category: Tech
  • Tags: MongoDB, Taxonomy

This is a collection of key/value pairs. Both keys and values are short, a few words at most. A given key can have one or more values. The values for a given key should form a unique set with no duplicates, though two keys may have identical values.

Finally, an incoming request starts with slugs rather than actual values:

http://hexane.org/blog/category/tech

So the database needs to support not only queries along the structure outlined above, but also lookups by slug. Thankfully there are a limited number of keys (<20), making hard-coding the mappings an option, as sketched below. Values can be anything, so the schema needs to accommodate them.
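
For illustration, a hard-coded mapping might look something like the following sketch (the segment and key names here are hypothetical):

// Map URL path segments to taxonomy keys (names hypothetical)
var SLUG_KEYS = {
  category: 'category',
  tag: 'tags',
  section: 'section',
  topic: 'topics'
};

// e.g. /blog/category/tech -> {key:'category', slug:'tech'}
function parseTaxonomyPath(segment, slug) {
  var key = SLUG_KEYS[segment];
  return key ? {key: key, slug: slug} : null;
}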

Initially this data was stored in discrete fields on the root document. Each field had its own index, and each unique combination of query fields necessitated yet another index. Furthermore, there was a centralized and growing collection of slug2term mappings. Things got out of hand quickly.
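
To illustrate the problem, the abandoned layout looked something like this sketch (using the same field names as the examples below):

// The original approach: one root-level field per taxonomy key
db.article.insert({title:'wealth news article', section:'News', topics:'Wealth'})
// Every field needs its own index...
db.article.ensureIndex({section:1})
db.article.ensureIndex({topics:1})
// ...and every queried combination needs another compound index
db.article.ensureIndex({section:1,topics:1})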

The Solution

The solution was to leverage MongoDB’s dot notation queries. An index can be built on a dot notation path, so a single set of indexes can serve every taxonomy key.

Inserting the Data

Insert the data as an array of key / term / slug documents under a single field name:

db.article.insert({
  title:'wealth news article',
  taxonomy:[
    {key:'section', term:'News', slug:'news'},
    {key:'topics', term:'Wealth', slug:'wealth'}
  ]
})
db.article.insert({
  title:'retirement news article',
  taxonomy:[
    {key:'section', term:'News', slug:'news'},
    {key:'topics', term:'Retirement', slug:'retirement'}
  ]
})
db.article.insert({
  title:'wealth blog article',
  taxonomy:[
    {key:'section', term:'Blogs', slug:'blogs'},
    {key:'topics', term:'Wealth', slug:'wealth'}
  ]
})
db.article.insert({
  title:'retirement blog article',
  taxonomy:[
    {key:'section', term:'Blogs', slug:'blogs'},
    {key:'topics', term:'Retirement', slug:'retirement'}
  ]
})
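// Two compound indexes cover term and slug lookups for every taxonomy key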
db.article.ensureIndex({'taxonomy.key':1,'taxonomy.term':1})
db.article.ensureIndex({'taxonomy.key':1,'taxonomy.slug':1})

Note a couple of things here:

  • The data is not normalized. This is quite intentional. Replica sets are fast, but in general you want to minimize the number of centralized tables when designing a distributed database.
  • The generated slug is stored alongside the term. This costs some disk space, but is otherwise harmless.
  • There is one set of indexes for all possible taxonomy fields!

Querying the Data

So now here are the queries. To properly match a particular key/term pair I use the $elemMatch operator. To combine multiple such matches in a single query I use the $all operator.

db.article.find({
  taxonomy:{
    $elemMatch:{'key':'section', 'term':'News'}
  }
}).explain()
db.article.find({
  taxonomy:{
    $elemMatch:{'key':'section', 'slug':'news'}
  }
}).explain()
db.article.find({
  taxonomy:{$all:[
    {$elemMatch:{'key':'section', 'slug':'news'}},
    {$elemMatch:{'key':'topics', 'slug':'retirement'}}
  ]}
}).explain()
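
Tying this back to the request above, a URL like http://hexane.org/blog/category/tech (run through the hypothetical mapping sketched earlier) reduces to a single indexed slug query:

db.article.find({
  taxonomy:{
    $elemMatch:{'key':'category', 'slug':'tech'}
  }
})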

At the end of the day, any database is going to require some queries, and these are not the most beautiful things in the world. But scaling is harder than development, and replica sets make it all worth it.

PasteBin of the above solution

Greenroom: a PHP / Mongo Framework

Creating a new PHP / Mongo Framework with the following guidelines:

1. Existing frameworks depend on SQL for much of their CRUD functionality.  A fresh start will allow for the highest quality Mongo framework.

2. Code management is not the responsibility of the framework.

3. A high quality default CRUD interface is a must.  The most important API is the Field and Model API.

During the Alpha stage, exploratory work on implementing the basic types will be performed. Once the types crystallize, final behavior will be documented, tested, and implemented.

NoSQL Cloud Database Evaluation

The following products were reviewed to assess their capabilities:

Cassandra
Used by Facebook, Twitter, Digg
http://cassandra.apache.org/

CouchDB
http://couchdb.apache.org/

MongoDB
http://www.mongodb.org/display/DOCS/Home

We also looked at these products, but ruled them out for various qualitative reasons:

Project Voldemort
Used by LinkedIn
http://project-voldemort.com/

Redis
http://code.google.com/p/redis/

HBase
http://hadoop.apache.org/hbase/

Project Voldemort and HBase did not have sufficient Python drivers and were ruled out of the analysis. Redis was ruled out because sharding was implemented outside of the project in a non-standardized manner.

Methodology

We tested each technology using login data for ~1.5 million users. The first test inserted all the rows into the database, and the second test queried 100,000 random usernames. Both tests were timed to compare relative performance; a sketch of the MongoDB leg appears below.
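
Here is a minimal sketch of what the MongoDB leg looked like, expressed in the mongo shell. The collection name and record schema are assumptions, and the actual tests were driven through each database’s Python driver:

// Load test: insert ~1.5 million login records (schema hypothetical)
for (var i = 0; i < 1500000; i++) {
  db.logins.insert({username:'user' + i, password:'hash' + i});
}
db.logins.ensureIndex({username:1});

// Query test: look up 100,000 random usernames
for (var j = 0; j < 100000; j++) {
  db.logins.findOne({username:'user' + Math.floor(Math.random() * 1500000)});
}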

Cassandra Results
Load: 52m9.835s
Query: 1m42.434s
Disk Usage: ~3.1G

CouchDB Results
Load: 198m3.774s
Query: 15m54.026s
Disk Usage: ~6.5G

MongoDB Results
Load: 15m38.976s
Query: 1m1.990s
Disk Usage: ~1.0G

Cassandra provided acceptable performance and had the best cluster management. Schema design was difficult and required much more foresight. Single-node performance was not as fast as MongoDB’s, but would presumably improve with scale.

CouchDB had the most attractive development environment but was unacceptably slow and heavy.

MongoDB had the fastest single-node performance. MongoDB’s toolset was well rounded and easy to understand. MongoDB was also the most flexible system, making it much easier to write ad hoc queries. Clustering support exists but is not as advanced as Cassandra’s.

Appendix: Project descriptions and links

Cassandra
The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model.
http://cassandra.apache.org/
http://incubator.apache.org/thrift/about/
http://github.com/vomjom/pycassa/
http://github.com/digg/lazyboy
http://stackoverflow.com/questions/1502735/whats-the-best-practice-in-designing-a-cassandra-data-model
http://jetfar.com/installing-cassandra-and-thrift-on-snow-leopard-a-quick-start-guide/
http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg02426.html

CouchDB
Apache CouchDB is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.
http://couchdb.apache.org/
http://code.google.com/p/couchdb-python/
http://davidwatson.org/2008/02/python-couchdb-rocks.html

MongoDB
MongoDB (from “humongous”) is a scalable, high-performance, open source, dynamic-schema, document-oriented database.
http://www.mongodb.org/

Takahashi Glitch

Entry for Rhizome’s Tiny Sketch Competition

Continued exploration of linking background color and foreground text. Grey scales are used to reduce source code size (one variable instead of three).

glitch

Entry for Rhizome’s Tiny Sketch Competition

I was interested in random text as a texture, and in the numerical symmetry between ASCII and RGB values. The contest limitations broke my experiment in a frenetic manner.