Calculating Frequent Itemsets Using NoSQL Databases
I have long been interested in use cases where a Map/Reduce implementation using a NoSQL database would be preferable to a simple SQL database. A group of students and I decided to examine the effects of converting a MySQL/PHP based version of the Apriori algorithm to a Couchbase/JS based one. The Apriori algorithm is used to determine the most common sets of items, given a large set of “baskets”. The most common example of this is finding the most common sets of products sold together in the grocery store. In this example, a basket would simply be all of the products a customer purchased at one time. We are interested in discovering which products, or sets of products (say, “beer, milk, and cheese”), appear in a high percentage of baskets.
The implementation of Apriori that we tested against was for the Brovine gene database. In this database, the customer stores various information about an animal’s genetic information, including specific genes found and transcription factors. Transcription factors are proteins that bind to a specific part of the DNA strand and copy that part of DNA, which leads to gene expression. The customers are interested in finding sets of transcription factors (products) that occur together in different genes (market baskets).
Our group decided to implement several different implementations for the Berkeley DB, Couchbase, and MongoDB separately in an attempt to rule out any differences based on the specific database or programmatic choice. We decided to use Python for the BerkeleyDB and the MongoDB implementations, and Node.js for the Couchbase implementation. BerkeleyDB is a key-value store that touts very fast retrieval times, and Couchbase and MongoDB are both document stores. An attempt was made to do as much of the work as possible on the database using MapReduce for the MongoDB and Couchbase implementations, and the BerkeleyDB implementation was used as a control: we wanted to see whether MapReduce is actually a good choice for the Apriori algorithm. The only operation BerkeleyDB is used for in that implementation is to store and retrieve the basekets.
For our experiments, we tested each implementation (the original Brovine implementation, BerkeleyDB, MongoDB, and Couchbase) on the original dataset, which consisted of about 50 baskets with 100 items each. To view the effects of a larger database on each implementation, we also duplicated the dataset 10 and 100 times, and reran the tests.
The results we received were very surprising. The BerkeleyDB implementation literally crushed all other implementations because of the database retrieval time, being more than seven times slower than the other implementations. However, every new implementation we wrote was faster than the original algorithm, which used MySQL/PHP; the BerkeleyDB implementation was almost 50 times faster.
If you’d like to read more, view the full paper [PDF].