Machine learning is not magic, it is not AI, and it does not really learn the way humans do. It is, however, a very powerful set of tools that can help you gain insights and build predictive systems, provided you know how it works and how to use it.
Thanks to the explosive growth of big-data services and the various machine learning success stories in the media [1, 2, 3], there are probably more people interested in machine learning now than ever before. Additionally, the abundance of online educational resources (such as the popular free class in machine learning) has made it very easy to jump into the topic, and the availability of many high-quality machine learning tools has enabled technically savvy people to apply machine learning to their own problems. In fact, many machine learning tools are so well designed (including our own Alpine!) that a person doesn’t have to understand what’s going on underneath in order to train classifiers (among other models).
Unfortunately, it seems that the hype and the easy tools have led a lot of organizations to jump on the bandwagon without really understanding what makes machine learning work. For example, a couple of heads of data analytics teams I interviewed with told me (paraphrasing), “Now that we have all these logs, surely we can apply machine learning and improve our services!” and “I don’t know how these algorithms work, but I heard Support Vector Machine is great, so let’s use that to train classifiers!”
Although these are just a couple of anecdotal examples, I strongly suspect that this attitude is common among companies that start out with expertise in their particular domains and turn to data science only later, once they have accumulated enough data from customers. Based on my encounters, the people in charge of data analytics often misjudge the difficulty of applying machine learning and believe that it can be used as a black-box component that automatically discovers patterns in their data.
Unfortunately, with this attitude, companies are keeping themselves from fully leveraging their data, and they may in fact be harming themselves by misinterpreting data and drawing the wrong conclusions. To become a more mature data-science organization, one should drop the black-box approach and get serious about understanding the internals of data science.
What’s the danger in treating machine learning as a black-box component? It becomes more obvious once you realize that machine learning is just a tool. It doesn’t magically build a predictive model for you; a person still has to build the model. For example, if you are building a model through supervised learning, you should already know your problem domain well enough to know which data and features are needed to predict the target. You should also know the details of your tools (the machine learning models): their properties, strengths, and weaknesses (e.g., can the model learn non-linear or non-monotonic relationships?). A good analogy is a programmer, the application domain, and the programming language. Would you trust a person with little expertise in any language to build a high-quality software system? Can a person who knows nothing about the application area build the system?
Despite the name ‘machine learning’ (statistics people will tell you that it’s a PR achievement by computer scientists), machine learning algorithms are rarely capable of discovering completely new insights from arbitrary data. For example, they will rarely discover patterns if they are fed raw data without proper transformations. Machine learning algorithms usually work best if the person using them already has a good ‘theory’ about how the prediction system should be structured (e.g., it helps to know whether variables are monotonically related). The tools are merely there to help the person ‘configure’ that structure (e.g., find the coefficients of linear models, find the split points of trees, etc.). Although there are active research areas (such as the deep belief networks mentioned in this article) that are trying to get machines to learn and discover new structures, once you actually try these algorithms you’ll quickly realize that even the state-of-the-art methods are limited in what they can learn. And despite their intent to be ‘automatic’ feature learners, ironically, these new algorithms require even more in-depth knowledge from the user to be successful.
What are some specific examples of machine learning knowledge that you need to be better at data science? The academic trove of machine learning knowledge is so broad that saying ‘all of statistics and computer science’ would hardly be an exaggeration. But besides thousands of pages of math and engineering knowledge, I believe there is a more concise list of practical tips that can help practitioners avoid certain common mistakes. Here is a partial list of some obvious and some not-so-obvious caveats in machine learning applications.
- Research the problem domain in advance, invest time in sensible feature collection/transformation/engineering, and form reasonable hypotheses about the relationships between the predictors and the target. As I mentioned above, machine learning usually doesn’t work if you don’t already have some good ideas about the problem, the features, and the structure of the solution. In particular, figuring out the killer features and transformations will likely consume most of your time.
- For example, when you are doing stock analyses of many different types of companies, you may do much better when features are normalized ‘per-company-type’ rather than globally.
- Other forms of domain knowledge may be incorporated in the form of the model’s bias (linear, SVM with polynomial kernels, trees, etc.), constraints (e.g., you might already know that a variable should always have a positive coefficient), intercept (in some problems, not having a bias term may make more sense), etc.
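To make the per-company-type normalization idea concrete, here is a minimal sketch using pandas. The column names and numbers are made up for illustration:

```python
import pandas as pd

# Hypothetical data: company types and a raw feature on very different scales.
df = pd.DataFrame({
    "company_type": ["bank", "bank", "retail", "retail"],
    "revenue":      [100.0, 300.0, 10.0, 30.0],
})

# Global z-score normalization mixes the scales of very different company types.
df["revenue_global_z"] = (df["revenue"] - df["revenue"].mean()) / df["revenue"].std()

# Per-company-type normalization compares each company to its peers instead.
df["revenue_group_z"] = df.groupby("company_type")["revenue"].transform(
    lambda s: (s - s.mean()) / s.std()
)
```

Under the per-group scheme, a small retailer and a small bank end up directly comparable, which is often what a cross-sector model actually needs.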
- Know the bias of your data. This is probably one of the most important things you have to know in advance. Your data (both training and validation) often do not represent the ‘true population’. There are all kinds of biases that you might not be aware of. Don’t assume that a classifier you trained can be applied to random population samples, even if it performs well on your validation data, because your training/validation samples may not be random.
- Some common biases include selection bias (your sensors are not sampling at random) and presentation bias (e.g., if you’re collecting click information, your data are heavily skewed toward what the user sees on the first few pages). A more subtle one is survivorship bias (e.g., when you are doing stock analysis, you may only have data on the ‘surviving’ companies and not the failed ones).
- Failure to know the bias of your data may lead to catastrophic results in some domains. As an example, I’ve heard a story that the downfall of Long-Term Capital Management in the 1990s [5, 6] can be partly attributed to a risk model that was trained on a biased sample containing very few downturns.
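The survivorship effect is easy to simulate. Here is a toy sketch (all numbers are made up; this is not a model of LTCM): conditioning on survival silently drops the worst outcomes, so the surviving sample looks rosier than the true population.

```python
import numpy as np

# Synthetic yearly returns from a distribution that includes severe downturns.
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.05, scale=0.2, size=100_000)

# "Survivors" are those that never fell below a threshold -- the failed
# cases vanish from the data set before we ever see them.
survivors = returns[returns > -0.3]

true_mean = returns.mean()      # what the population actually looks like
biased_mean = survivors.mean()  # what the surviving sample suggests
```

Any model trained only on `survivors` will systematically underestimate downside risk.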
- Know the inductive bias of the model and how it limits what you can learn. Most machine learning algorithms start with assumptions about the relationship between the predictor and predicted variables (more formally, this is referred to as the hypothesis space of the model).
- For instance, I suspect most people already have a pretty intuitive idea of what linear models mean. However, it helps to be more explicit about the details.
- It turns out that even non-linear algorithms such as decision trees have inductive biases.
- For example, with a linear classifier you can’t learn an XOR function, and even decision trees may not be able to learn it with a balanced data set and a vanilla approach. In short, it helps to know the algorithmic details if you want to learn particular patterns.
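You can see this limitation directly with scikit-learn. In this sketch, a logistic regression (a linear classifier) cannot separate the four XOR points, while a vanilla decision tree happens to fit them — scikit-learn keeps splitting even when no single split reduces impurity, which is exactly the kind of algorithmic detail worth knowing:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# The XOR pattern: no single line separates the two classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

linear = LogisticRegression().fit(X, y)
tree = DecisionTreeClassifier().fit(X, y)

linear_acc = linear.score(X, y)  # a linear boundary cannot get all four right
tree_acc = tree.score(X, y)      # two nested splits reproduce XOR exactly
```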
- It often helps to do proper scaling/normalization on your data. With certain algorithms, scaling your numeric data can yield very different results.
- Regularized linear models may not make much sense unless you normalize all the predictor variables to comparable scales.
- Neural networks and deep belief networks are also heavily affected by feature scaling (I found that scaling the outputs of lower layers can also help in some cases).
- If you want to get a feel for variable importance in unregularized linear models, the predictors should be properly normalized.
- Algorithms based on decision trees (e.g., Random Forest, Boosted Trees, etc.) are usually immune to this sort of ‘monotonic’ feature transformation.
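Here is a sketch of both points on synthetic data: a distance-based method (k-nearest neighbors) is thrown off by a feature living on a much larger scale, while a decision tree, which splits each feature on its own axis, is unaffected by the rescaling:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Feature 0 determines the label; feature 1 is pure noise on a huge scale.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=200), rng.normal(scale=1000.0, size=200)])
y = (X[:, 0] > 0).astype(int)

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Distances in k-NN are dominated by the large-scale noise feature...
knn_raw = KNeighborsClassifier().fit(X, y).score(X, y)
knn_scaled = KNeighborsClassifier().fit(X_scaled, y).score(X_scaled, y)

# ...while a tree considers each feature on its own axis, so a monotonic
# rescaling leaves its behavior unchanged.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y).score(X, y)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y).score(X_scaled, y)
```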
- There are certain classes of algorithms that can be used as black boxes while still yielding good results. For example, algorithms such as random forest require very little tuning and can capture a lot of non-linear relationships. When you have very little idea about the problem domain initially, these algorithms can provide a good starting point. However, they still benefit greatly from good features and transformations. Additionally, more expressive models like trees are more prone to over-fitting, particularly if you don’t have a lot of data.
- There are usually multiple ways to interpret a model, and it can help to look at all of them. For example, Support Vector Machine is usually introduced as a ‘max-margin’ classifier. However, another way to look at SVM is as an L2-regularized hinge-loss model. Viewed as a regularized model, it becomes clear that feature normalization is an important prerequisite.
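One way to make the regularized-loss view concrete is to fit a linear SVM as stochastic gradient descent on an L2-penalized hinge loss. In this sketch the dataset and the `alpha` value are arbitrary; since the two estimators optimize the same kind of objective, their predictions should mostly agree:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# The "max-margin" formulation, as packaged by LinearSVC...
svm = LinearSVC(C=1.0).fit(X, y)

# ...and the same model written explicitly as L2-regularized hinge loss.
sgd = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3, random_state=0).fit(X, y)

# The two views of the model should make largely the same predictions.
agreement = (svm.predict(X) == sgd.predict(X)).mean()
```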
- Be aware of your predictors’ origins. Some of them may be derived from the same source as the label, in which case it makes no sense to use them as predictors. Large organizations often have hundreds or thousands of predictor variables, and sometimes you may not realize that a predictor shares its source with the label. For example, when your click prediction seems incredibly accurate, one of your predictors may itself be derived from the click information. This is sometimes referred to as information leakage.
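A minimal simulation of leakage (everything here is synthetic, and the ‘engagement’ column is a hypothetical name): one feature is secretly computed from the label, and the resulting accuracy looks too good to be true — because it is.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
clicked = rng.integers(0, 2, size=n)    # the label: did the user click?
honest_feature = rng.normal(size=n)     # genuinely unrelated to the label

# A hypothetical "engagement" column that was computed *from* the click log,
# i.e., it leaks the label with a little noise on top.
leaky_engagement = clicked + rng.normal(scale=0.01, size=n)

X = np.column_stack([honest_feature, leaky_engagement])
acc = LogisticRegression().fit(X, clicked).score(X, clicked)
# Near-perfect accuracy like this is often a symptom of leakage,
# not of a genuinely great model.
```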
- In typical industrial data sets, outliers are very common, and they can mess up your conclusions. Certain algorithms, like linear regression, are heavily affected by outliers. For example, say you are trying to predict house price from square footage. The two might have a positive linear relationship, but a few strange samples (say, a couple of extremely cheap sales for really large houses) can throw the learned model completely off. You can either remove the outliers or use learning algorithms based on robust statistics (trimmed regression, Huber-loss regression, etc.). I feel that robust techniques are not mentioned often enough in the typical literature.
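Here is a sketch of that house-price scenario using scikit-learn’s `HuberRegressor` (the units and numbers are invented; the true slope is 200):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Synthetic housing data: price grows linearly with size.
rng = np.random.default_rng(0)
sqft = rng.uniform(1.0, 4.0, size=100)                     # 1000s of sq ft
price = 200.0 * sqft + rng.normal(scale=20.0, size=100)    # $1000s

# Two strange samples: very large houses sold extremely cheap.
sqft_all = np.append(sqft, [3.9, 3.95])
price_all = np.append(price, [50.0, 60.0])
X = sqft_all.reshape(-1, 1)

ols = LinearRegression().fit(X, price_all)
robust = HuberRegressor().fit(X, price_all)
# The squared loss lets the outliers drag the OLS slope well below the true
# 200, while the Huber loss grows only linearly in the residual and stays
# close to it.
```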
- Explore the ‘hyper-parameters’ of your algorithms, but be aware that you can over-fit hyper-parameters as well. Algorithms based on support vector machines, boosting, regularization, neural networks, etc. have additional parameters you can tune (e.g., the lambda or cost variable in SVM that controls the trade-off between the loss and the regularization terms), and changing these can yield very different results. You should know what these hyper-parameters really mean; e.g., it helps to know the difference between L1 and L2 regularization.
- Additionally, when you are trying to find the ‘optimal’ hyper-parameters, you are essentially doing greedy learning of the hyper-parameters. It’s a good idea to divide the data into three sets: a training set, validation set 1, and validation set 2 (or you can use cross-validation).
1. Train your model on the training set with a particular set of hyper-parameters.
2. Measure the performance of the model on validation set 1.
3. Repeat steps 1 and 2 with different hyper-parameters, and keep the hyper-parameters that yield the best results on validation set 1. This is essentially hyper-parameter learning.
4. Measure your real generalization performance on validation set 2.
As a useful reference, this paper talks about the potential for over-fitting hyper-parameters.
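The three-way-split protocol above can be sketched as follows (the dataset, the model, and the candidate C values are arbitrary placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1500, random_state=0)

# Split into train / validation 1 (hyper-parameter selection) /
# validation 2 (final, untouched estimate of generalization).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val1, X_val2, y_val1, y_val2 = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = SVC(C=C).fit(X_train, y_train)   # step 1: train with these hyper-parameters
    score = model.score(X_val1, y_val1)      # step 2: measure on validation set 1
    if score > best_score:                   # step 3: keep the best on validation set 1
        best_C, best_score = C, score

# Step 4: only this number is an honest estimate of generalization --
# best_score itself is optimistically biased by the search.
final_model = SVC(C=best_C).fit(X_train, y_train)
generalization = final_model.score(X_val2, y_val2)
```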
- Keep in mind that accuracy is not the only criterion that matters. If your service depends on run-time prediction, you might prefer simpler models that run faster. Organizations often make more money by processing more requests than by being more accurate but slower in their predictions. For example, the Netflix-prize-winning algorithm, while impressive in its predictive performance, may not be practical for run-time product recommendations.
- Don’t assume that state-of-the-art results from the literature can be readily applied to your domain. Nowadays, neural-network-based algorithms (such as deep belief nets and drop-out networks) are breaking all kinds of records on famous data sets, and they are exciting to watch. However, training these neural networks is very difficult, and you’ll often find that they don’t work miracles when you apply them to your domain (or that they require a lot of tuning).
- Failing to find a strong pattern in your data when you train a model doesn’t prove that there isn’t one. You may not have transformed the features properly, or you may not be using the right model. I’ve heard a story about someone concluding that a particular feature was useless for predicting stock prices because his linear model said so. However, another person later found that, with the proper transformations, the feature was in fact very useful.
- Never, ever mix up training and validation data. This sounds so basic that some people may feel insulted it is mentioned at all. But I’ve actually seen a senior engineer mix up training and validation data and then refuse to acknowledge that his conclusions might be wrong.
Because of all these caveats, it often takes time and several iterations to properly explore/experiment with your data and come up with reasonable conclusions.
In short, if you want to take full advantage of the data your organization has accumulated over time, do not believe in the black-box magic of machine learning! Invest in the skills to understand, interpret, and apply the internals of machine learning models and algorithms.