At Alpine Data Labs, our strength is providing fast machine learning algorithms that will help you analyze your data. It is imperative that our machine learning algorithms be correct. Given the variety of tasks that fall under “Machine Learning,” determining how to quantify “correctness” and how to test it in an automated way is quite a challenge.
When testing correctness, we aren’t interested in building a good model in the classic machine learning sense. It is certainly good practice to do a test/train split or examine the bias/variance trade-off when building a model, but none of that matters if the algorithm isn’t doing what the user expects it to do. To ensure that we are getting the intended behavior, we need to test either the fit parameters or the predictions of the model against other machine learning software. Which software should we choose?
THE GOLD STANDARD
There are a number of very good open-source machine learning tools that we considered using, although none overlaps exactly with Alpine’s capabilities. After experimenting with Python’s scikit-learn and R, we decided to use R as our “Gold Standard” for several reasons. Both are open-source, widely used, well-documented and have good coverage of machine learning algorithms, but R better matches the kind of statistical and machine learning tests that we do at Alpine Data Labs. Most importantly, R has a highly extensible package system that allows users to add functionality. Because of this package system, most machine learning algorithms are already implemented, many of them several times over (http://cran.r-project.org/web/views/MachineLearning.html). As an example, the Recursive Partitioning (aka decision trees) section lists about 12 different tree models. Alpine Chorus features several tree models, and matching each of them to an equivalent R package isn’t always obvious.

One interesting result did come out of my exploratory testing with R. There was a discrepancy in the Naive Bayes class predictions between Alpine Chorus and the e1071 R library on a publicly available spam email data set (http://archive.ics.uci.edu/ml/datasets/Spambase). After some digging into the data set, I was able to characterize the observations that were misclassified: they all had very small unnormalized posterior probabilities. In this rare situation, the class prediction was getting switched by R due to a bug in the Laplace smoothing. After exchanging a few emails with one of the authors of the package, we agreed that a one-line code change would fix the issue. The recently released version 1.6-4 of e1071 includes the fix (http://cran.r-project.org/web/packages/e1071/news.html).
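For readers unfamiliar with the technique at the center of that bug, here is a minimal sketch of Laplace (add-alpha) smoothing for one categorical feature of a Naive Bayes model. This is the generic textbook computation, written in Python for illustration; it is not the e1071 code in question.

```python
from collections import Counter

def smoothed_likelihoods(values, alpha=1.0, domain=None):
    """Estimate P(value | class) with Laplace (add-alpha) smoothing.

    values: observed feature values for one class.
    domain: all possible values of the feature (defaults to those observed).
    """
    domain = domain or sorted(set(values))
    counts = Counter(values)
    total = len(values)
    k = len(domain)
    # Every value gets a pseudo-count of alpha, so a value never seen in
    # this class still gets nonzero probability -- a zero would wipe out
    # the entire unnormalized posterior product for that class.
    return {v: (counts[v] + alpha) / (total + alpha * k) for v in domain}

probs = smoothed_likelihoods(["a", "a", "b"], alpha=1.0,
                             domain=["a", "b", "c"])
# "c" was never observed for this class but still receives probability
# (0 + 1) / (3 + 3) = 1/6, and the smoothed estimates sum to 1.
```

Because the smoothing term is added to both numerator and denominator, a one-character mistake in either place shifts every probability slightly, which is exactly the kind of error that only shows up on borderline observations with tiny unnormalized posteriors.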
CHOOSING DATA SETS
We also want to test each Alpine Chorus algorithm against a variety of data sets. We might choose a set with many features, a set with many observations and a set with missing values. By choosing a variety of data sets we increase the chances that we will uncover any unintended behavior.
HOW DO WE DESIGN AND RUN A TEST?
Some algorithms are deterministic. For example, a basic logistic regression run against a specific data set will always return the same coefficients, residuals, p-values and any other value you might want to test against. Alpine Chorus and R should return the same values, and we can use Avalanche to verify this! Built using Robot Framework and Python, Avalanche is Alpine Data Labs’ automated testing framework. The test described below is just one of many that make up Avalanche’s Correctness library. Let’s say we want to test logistic regression on a certain data set. First we run the logistic regression in R, pick a few variables to test on, and save the results to a JSON file. Since our library is written in Python, the JSON format is a natural choice. The results of a logistic regression might include:
{ "Null deviance": 24667.028647045 }
We run the same analysis using Alpine and save the workflow. Finally we can set up a Robot test case to upload and run the workflow, then compare the results of the tests.
| read R results | path_to_R_results |
| read Alpine results | logistic_regression.afm |
| compare Alpine results to R | alpine_results | r_results |
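The comparison keyword at the end can be backed by a small Python helper. Here is a sketch of what such a helper might look like, assuming both result sets have been parsed into dictionaries of named numeric values; the function and file names are illustrative, not the actual Avalanche internals.

```python
import json
import math

def compare_results(alpine_results, r_results, rel_tol=1e-6):
    """Compare two dicts of named numeric results within a relative tolerance.

    Floating-point output from two independent implementations rarely
    matches bit-for-bit, so each value is checked with math.isclose
    rather than strict equality.
    """
    mismatches = []
    for name, expected in r_results.items():
        actual = alpine_results.get(name)
        if actual is None or not math.isclose(actual, expected, rel_tol=rel_tol):
            mismatches.append((name, expected, actual))
    if mismatches:
        raise AssertionError("Results differ: %s" % mismatches)

# The R-side reference values would be loaded from the saved JSON file, e.g.:
# with open("path_to_R_results") as f:
#     r_results = json.load(f)
r_results = {"Null deviance": 24667.028647045}
compare_results({"Null deviance": 24667.028647052}, r_results)  # passes
```

A keyword built this way passes when every tested quantity agrees within tolerance and fails with a list of the offending names and values otherwise.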
This works great for many algorithms, but others have inherent randomness! Running a k-means clustering algorithm on a data set starts by randomly choosing initial centroids, so it will generally converge to a slightly different solution each time. Another example is choosing hyperparameters via k-fold cross-validation. In these cases we have to use probabilistic tests of the fit parameters: rather than demanding exact equality, we check that results agree within a tolerance or fall within an expected range across repeated runs.
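To make the tolerance-based idea concrete, here is a hedged sketch: a tiny one-dimensional Lloyd’s k-means with random restarts, where the test asserts only that the best objective over several runs lands near a known reference value, rather than demanding identical centroids on every run. The toy data and implementation are purely illustrative, not Alpine’s or R’s.

```python
import random

def kmeans_1d(points, k, iters=50):
    """One run of Lloyd's algorithm on 1-D data with random initial centroids."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[idx].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    # Objective: total within-cluster sum of squared distances.
    return sum(min((p - c) ** 2 for c in centroids) for p in points)

data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
# Individual runs may start from different seeds, so the test checks the
# best objective over several restarts against the known optimum for this
# data (centroids 1 and 11, objective 4.0), within a tolerance.
best = min(kmeans_1d(data, k=2) for _ in range(10))
assert abs(best - 4.0) < 1e-6
```

The same pattern applies to cross-validated hyperparameter searches: assert that the selected value falls in an expected set, or that a score lies within a band, instead of pinning down one exact answer.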
SCOPE OF CORRECTNESS TESTING
The last step in implementing our logistic regression test is to run it against all the data sources we support. The modular nature of Avalanche means that running tests on various data sources is as easy as changing a single variable in the test definition. If we want to run the same test against a HAWQ data source, we can replace the second line above with one that points at a workflow configured for HAWQ, along these lines (the workflow file name here is illustrative):
| read Alpine results | logistic_regression_hawq.afm |
Avalanche and automated correctness testing at Alpine are a work in progress, but we are currently testing many of our machine learning algorithms on databases and Hadoop nightly.