By Kevin Sookocheff – @soofaloofa
Vendasta recently released an outstanding marketing platform, MAST 10X, that sends automated email to business users based on how interested they are in buying the product. It’s an automated way for our partners to generate leads for their salespeople. Our partners love it, but they all had the same question: “How do I know it’s working?” We took this as a challenge to prove to our partners that marketing automation works by using real-life data and showing the results. This article outlines how we went about provisioning a data aggregation tool that fits our needs, highlighting the powerful analytical tools available within Elasticsearch.
Let’s recap the exact problem. We needed to store data related to email marketing campaigns and offer analytics over that data. We did not know ahead of time what types of analytics we needed, and required a flexible solution that could grow with our needs.
Let’s begin with Google Cloud Datastore, a storage solution we’ve used in the past and continue to use for many of our data storage and processing tasks. Our core product stores roughly five terabytes of data in Cloud Datastore, and we rely on it for automated distribution and replication of data, automatic scaling, and its easy-to-use management tools and API. Since we were already familiar with Datastore, we could easily use it to store our email data. But what about displaying analytics over that data? To understand why this is a problem for Datastore, we need a brief digression into NoSQL databases.
Datastore is a NoSQL database that stores documents as key-value pairs on top of Google’s Bigtable technology. This allows for easy distribution and replication of the data, but makes it difficult to perform joins and queries over the documents. The limitation of Datastore most relevant to this discussion is that, at any point in time, you don’t actually know how many documents are stored in your database or what their composition is; to find out, you need to iterate over all keys matching the kind of document you are searching for until you exhaust your inputs. Coming back to our email analytics example: if you store a document for each email sent, it isn’t feasible to compute statistics by iterating through the entire dataset each time you need to show a statistic. For example, to count the number of emails sent to a particular user, you would have to walk the data item by item and roll up the numbers. The problem gets worse when you need more complex statistics or when your data is distributed across different machines. Hadoop and MapReduce provide great tools for running batch jobs that compute statistics, but they are not nearly performant enough for real-time analytics. We needed something different.
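The roll-up problem described above can be sketched in a few lines. The `fetch_emails` helper here is a hypothetical stand-in for a Datastore query; in production it would page through the entire Email kind over the network, which is exactly what makes this approach infeasible at scale.

```python
def fetch_emails():
    # Stand-in for a Datastore query; in production this would page
    # through every stored Email entity over the network.
    return [
        {"to": "alice@example.com", "subject": "Welcome!"},
        {"to": "bob@example.com", "subject": "Welcome!"},
        {"to": "alice@example.com", "subject": "Still interested?"},
    ]

def count_emails_to(address):
    # Counting for one user still touches every record, every time.
    count = 0
    for email in fetch_emails():
        if email["to"] == address:
            count += 1
    return count

print(count_emails_to("alice@example.com"))  # 2
```

The cost is linear in the total number of emails ever sent, per statistic, per page load.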
Elasticsearch is primarily a text search platform. It is built on top of the Apache Lucene text search engine, which builds an inverted index of all the words in your documents. Think of the inverted index as the index in a book. If you flip to the end of a textbook you might see entries like the following.
cloud 12–35, 45, 50
computing 67, 84–99
This short snippet from an index says that the word “cloud” can be found on pages 12 to 35, and also on pages 45 and 50, and that the word “computing” can be found on page 67 and on pages 84 to 99. Given an index in a book, finding all of the pages containing a particular word is a fast operation: the index tells you exactly where to look! Elasticsearch works the same way. For each document inserted into the database, Elasticsearch builds an inverted index mapping each term in your document to the locations in the database where the documents containing it can be found. Now when we query for a particular term, Elasticsearch immediately knows which documents to return; it looks in its index to find exactly where they are.
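The book-index analogy translates directly into code. This is a minimal sketch of the structure Lucene maintains (the real index also stores positions, frequencies, and more): each term maps to the set of documents containing it, so lookup is a single dictionary access regardless of how many documents are stored.

```python
from collections import defaultdict

documents = {
    1: "cloud computing at scale",
    2: "email marketing in the cloud",
    3: "computing statistics over email",
}

# Build the inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

# Finding every document containing "cloud" is now one lookup.
print(sorted(index["cloud"]))  # [1, 2]
```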
Elasticsearch takes this idea one step further with aggregations. A search with an attached aggregation first returns all the results of the search and places those results into the appropriate buckets for the aggregation being computed. Then, within each bucket, an aggregate statistic is calculated. For example, you can define a search query that returns all documents referencing the word “cloud,” and then count the documents returned. Elasticsearch does this by executing your query and loading each matching document into a bucket; the aggregation then counts the documents in the bucket and returns your result, without having to iterate over all the data in your database. This design also lets you compute multiple aggregations in a single query: the documents returned by your query can be placed into multiple buckets, one per aggregation, and each aggregation is calculated concurrently. For example, you can use Elasticsearch to query for all emails sent to a particular user and compute multiple statistics over the returned data in a single request. The end result is near real-time analytics over your data.
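A sketch of what such a request body looks like in the Elasticsearch query DSL, built here as a plain Python dict. The index layout and field names (`recipient`, `opens`) are hypothetical, not our production schema; the point is that one request carries both the filter and several aggregations.

```python
# One request: filter to a single recipient, then compute two
# statistics over the matched documents in the same pass.
query = {
    "query": {
        "term": {"recipient": "alice@example.com"}
    },
    "aggs": {
        # Metric aggregations computed over every matching document.
        "emails_sent": {"value_count": {"field": "recipient"}},
        "total_opens": {"sum": {"field": "opens"}},
    },
    "size": 0,  # we only want the aggregation results, not the hits
}
```

This body would be POSTed to the cluster's `_search` endpoint; the response carries one result per named aggregation.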
Rolling out Elasticsearch
After evaluating Elasticsearch and its suitability for this job, we decided to roll it out for production use at Vendasta. Our previous experience with Google App Engine as a managed service meant we had never had to deal with provisioning or monitoring servers. As a company, moving to a non-managed database like Elasticsearch put us in less-travelled territory. We needed to provision our own secure cluster of virtual machines serving Elasticsearch for production use cases.
An Elasticsearch cluster is composed of machines taking on one or both of two roles: “Master” or “Data.” Each node in your cluster must be assigned a role and becomes part of the cluster’s configuration for data sharing and replication, as well as quorum size. We experimented (and continue to experiment) with different cluster arrangements. We currently deploy a six-node cluster with three “Master” nodes and three “Data” nodes, distributed across Compute Engine zones in the US region so that each zone has a single “Master” node and a single “Data” node.
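In `elasticsearch.yml` terms, the role split looks roughly like the following sketch for a dedicated master node (a data node flips the two booleans); the quorum value assumes the three-master layout described above.

```yaml
# Dedicated master: participates in elections, holds no data.
node.master: true
node.data: false

# Quorum: a majority of the three master-eligible nodes must be
# reachable before a master is elected, preventing split-brain.
discovery.zen.minimum_master_nodes: 2
```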
Once we had a clearer picture of the cluster arrangement, we set to work automating deployment, management, and security using custom Docker containers defining an Nginx proxy as a security layer on top of an Elasticsearch server. Nginx is configured to keep a constant connection open to Elasticsearch to minimize connection setup and teardown latency. Kubernetes provides resiliency for the Docker containers, and data replication is handled by the Elasticsearch cluster itself. Our Docker images are stored in and served from Google Cloud Storage, keeping network requests within Google’s network to reduce latency. Logs from these virtual machines are sent to Google Cloud Monitoring, which allows querying log data across all nodes in our cluster. We continue to optimize our monitoring and deployment strategy as new problems and solutions emerge.
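The keepalive arrangement can be sketched with standard Nginx directives. The upstream address, certificate setup, and basic-auth scheme here are illustrative, not our exact configuration; the key pieces are the `keepalive` pool and the HTTP/1.1 settings it requires.

```nginx
upstream elasticsearch {
    server 127.0.0.1:9200;
    keepalive 16;              # pool of idle connections held open
}

server {
    listen 443 ssl;

    location / {
        auth_basic           "Elasticsearch";
        auth_basic_user_file /etc/nginx/es.htpasswd;

        proxy_pass http://elasticsearch;
        # Both settings are required for upstream keepalive to work.
        proxy_http_version 1.1;
        proxy_set_header   Connection "";
    }
}
```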
We now have another tool in our data storage and processing workflow that can compute near real-time analytics on stored data. From conception to delivery, deploying an Elasticsearch cluster involved a lot of collaboration and discussion. By moving to a non-managed service, Vendasta has solidified its expertise in deploying, managing and monitoring custom computing environments. This expertise will continue to pay dividends as we constantly explore new ways to deliver value to our customers through technological innovation.
Like to talk tech? Check out Kevin’s website at sookocheff.com