The greatest thing about MongoDB are the replica sets. It is a nice feeling to have failover and distributed queries across 3 database machines. It’s also nice to replace all 3 in one evening with constant application availability.
But the next great thing doing CMS work using MongoDB is that all records are documents. That is to say each record is an open ended collection of key/term pairs. It is nice to tack on additional fields as the content evolves to more structured forms.
Where this agility breaks down is with evolving query requirements. Scalable MongoDB performance requires indexes, but there is a hard limit on the number of available indexes.
In particular, context taxonomy can easily lead to an explosion of fields with a subsequent explosion of query techniques an indexes. For example, this post might have the following fields:
- Category: Tech
- Tags: MongoDB, Taxonomy
This is a collection of key / value pairs. Both keys and values are short with a few words as the maximum length. A given key can have one or more values. The values for a given key should be a unique set with no duplicates, though two keys may have identical values.
Finally, on a request I will start with slugs rather than actual values:
http://hexane.org/blog/category/tech
So not only does the database need to support queries along the structure outlined above, but also needs to lookup by slug. Thankfully there are a limited number of keys (<20) making hard coding the mappings an option. Values can be anything, so the schema needs to accommodate.
Initially this data was stored in discrete fields on the root document. Each field had its own index. Each unique combination of query fields also necessitated a unique index. Furthermore, there was a centralized and growing collection of slug2term mappings. Things got out of hand quickly.
The Solution
The solution was to leverage MongoDB’s dot notation queries. You can index on the basis of a dot notation query.
Inserting the Data
Insert the data as an array of key / term / slug documents under a single field name:
db.article.insert({
title:'wealth news article',
taxonomy:[
{key:'section', term:'News', slug:'news'},
{key:'topics', term:'Wealth', slug:'wealth'}
]
})
db.article.insert({
title:'retirement news article',
taxonomy:[
{key:'section', term:'News', slug:'news'},
{key:'topics', term:'Retirement', slug:'retirement'}
]
})
db.article.insert({
title:'wealth blog article',
taxonomy:[
{key:'section', term:'Blogs', slug:'blogs'},
{key:'topics', term:'Wealth', slug:'wealth'}
]
})
db.article.insert({
title:'retirement blog article',
taxonomy:[
{key:'section', term:'Blogs', slug:'blogs'},
{key:'topics', term:'Retirement', slug:'retirement'}
]
})
db.article.ensureIndex({'taxonomy.key':1,'taxonomy.term':1})
db.article.ensureIndex({'taxonomy.key':1,'taxonomy.slug':1}) |
Note a couple things here:
- The data is not normalized. This is quite intentional. Replica sets are fast, but in general you want to minimize the number of centralized tables when designing a distributed database.
- The generated slug is stored alongside the term. Losing some disk, but otherwise unconcerned.
- There is one set of indexes for all possible taxonomy fields!
Querying the data
So now here are the queries. To properly match a particular key / term pair I use the $elemMatch operator. To execute multiple queries I use the $all operator.
db.article.find({
taxonomy:{
$elemMatch:{'key':'section', 'term':'News'}
}
}).explain()
db.article.find({
taxonomy:{
$elemMatch:{'key':'section', 'slug':'news'}
}
}).explain()
db.article.find({
taxonomy:{$all:[
{$elemMatch:{'key':'section', 'slug':'news'}},
{$elemMatch:{'key':'topics', 'slug':'retirement'}}
]}
}).explain() |
At the end of the day, any database is going to require some queries. These are not the most beautiful things in the world. But scaling is harder than performing development, and replica sets make it all worth it.
PasteBin of the above solution