7 tools in every data scientist’s toolbox

There is huge number of machine learning methods, statistical tools and data mining techniques available for a given data related task, from self organizing maps to Q-learning, from streaming graph algorithms to gradient boosted trees. Many of these methods, while powerful in specific domains and problem setups, are arcane and utilized or even understood by few.

On the other hand, there are some methods and concepts that are widely used and consistently useful (or downright irreplaceable) in a large variety of domains, problem settings and scales. Knowing and understanding them well will give practitioners a solid base to tackle a large subset of common data related problems when complemented by programming, data manipulation and visualization skills.

Here’s a list of statistical and machine learning concepts that are in every data scientist’s toolbox.

Tree based methods

Some of the most universally useful methods in data science are decision tree based: decision trees, random forests and gradient boosted trees. Decision trees as base learners have a lot of very useful characteristics most of which are inherited by derived methods such as random forests. They’re

Robust to outliers
Can deal with both continuous and categorical data
Can learn non-linear relationships in the data well
Require very little input preparation (see previous three points)
Easy to interpret, via plotting the tree or extracting the tree rules. This can be very useful to give you the “feel” of the data

The main negative of decision trees is that they are a high variance method and tend to overfit, i.e. do not generalize well. This is where using decision trees as base learners for ensemble methods comes in.

Random forests are simply sets of decision trees, trained using bootstrapped data and random feature selection. This fixes the high variance problem of decision trees, making random forests one of the most versatile and widely used machine learning methods. They have high accuracy and low variance all the while inheriting most benefits from decision trees. When compared to many other more sophisticated models, they require very little tuning. In general, it is pretty hard to train a really bad performing random forest models. Even with out of the box hyperparameters, random forest models perform quite well in general. Finally, they are trivially parallelizable in both training and testing phase.

One drawback that is often highlighted about random forests is that they are a black box, i.e. that there is no way to interpret the model or the resulting predictions. Fortunately, this is no really true due to recent developments in making random forests more interpretable. There are methods for decomposing random forest predictions into feature contributions, selecting compact rule sets and summarizing the extracted tree rules (inTrees package in R).

There are excellent implementations for tree based methods in most mainstream languages, with python (scikit learn) and R(randomForest, party) probably being the most accessible.

Linear (regularized) models

Linear models (such as linear and logistic regression) are typically one of the first models to be taught in ML courses and covered in textbooks, and for good reason. They are very powerful for their relative simplicity. They are fast to train and used especially often when good interpretability is of essence. The general form: \(y = a +b_1X_1+ \ldots +b_nX_n\) means that it’s easy to see the relative importance and contribution of each feature and sanity check the model.

A drawback of linear models is that unlike tree based methods, they are much more sensitive to outliers (requiring input sanitation), require explicit handling of categorical features (via one-hot encoding) and expect a linear relation between input and the response variable.
It is possible to overcome the latter via basis expansion: i.e. by including transformations of the input features by logarithmic, polynomial or some other transformation, depending on the data at hand. This is usually most efficiently done when combining with with regularization (Lasso and Ridge regression). These are very powerful techniques for feature selection and for preventing over fitting, allowing to filter out irrelevant features (and irrelevant transforms in case of basis expansion).

Another great aspect of linear models is that very effective online (streaming) algorithms exist, making training models even on massive datasets easily accessible, by requiring constant memory.

There are an excellent set of linear model and regularization libraries in Python (scikit learn, statsmodel) and R(lm). For large datasets, there are online learning tools available such as vowpal wabbit.

Quantifying confidence: hypothesis testing, confidence- and prediction intervals

Being able to quantify the certainty in the estimates and predictions that are produced based on data is often one of the most crucial aspects in a data scientist’s work. If you don’t take variance into account in your estimates, it becomes easy to come to arbitrary conclusions. Thus, understanding and using hypothesis testing is something every data scientist utilizes.

There are multiple ways to do hypothesis testing. Statistics courses spend a lot of time on statistical tests (such as t-, z– and F-test ) and their closed form solution. In practice, confidence intervals are often better alternatives for hypothesis testing, by providing more information about the estimates, quantifying both their location and precision. In Bayesian world, credible intervals offer a similar benefit.

While there has been a lot of controversy around using p-values (due to them having been misused and abused a lot in some scientific circles), they remain a valuable tool when applied correctly. For example for categorical data, chi-squared test can be an excellent tool for understanding if the effect you see in your bar charts is real.

Finally, it is important to understand what hypothesis testing is really about. It’s often viewed as some arcane formula, which tells you the right answer by magically producing a p-value that can then be compared to 0.05. In the end, every test is the same: trying to answer the question whether the observed effect is real or not. And while having a “test” as a closed closed form solution for calculating the test score and p-values is great, you can achieve the same thing — an answer to the question: is this effect real? — via simulation. In fact it can be better to use simulation if you’re uncertain whether all the assumptions that need to hold for the analytical test actually do. There is a great writeup on this topic: There is only on test.

Speaking of simulation and bootstrapping…

Resampling methods: bootstrapping, cross validation, Monte Carlo

Resampling methods are a powerful set of tools that employ resampling to produce new hypothetical sample sets as if they were sampled from the underlying population. They are excellent when parametric approaches are difficult to use or just don’t apply. As such they are often crucial for many data analysis and machine learning tasks.

Bootstrapping, or sampling with replacement allows obtaining measures of confidence such as variance or confidence intervals to sample estimates (such as mean or median). The estimates can be obtained by sampling with replacement from the observed dataset, measuring the estimate we’re interested in (for example the mean) and then repeating the process until we have enough readings to compute confidence intervals, variance or any other property of the estimate we want (for example via percentile bootstrap).

Depending on the underlying distribution (for example skewed vs symmetric), the estimations can be biased. There are ways to mitigate that, for example via bias corrected and accelerated bootstrap (good overview here). R’s boot package has multiple different bootstrap methods available. But often, going beyond percentile bootstrap could be overkill: we are often interested in the order of magnitude of the measure of confidence anyway, not necessarily an exact value.

Bootstrapping also has its uses in machine learning, for example for creating ensembles of models (such as random forests).

Cross-validation is another resampling method, used to make sure that the results we see on our sample set would actually apply to an independent dataset. In other words, to make sure we are not overfitting our models. This is a must in machine learning tasks where prediction is involved. Just as with bootstrapping, the idea is simple: randomly partition the dataset into two — training set and test set — measure the performance of the model trained on the train set on the test set, and then repeat the experiment after spitting the data randomly again. After enough experiments, averaging over the result gives a good estimate of how the model would perform on a new dataset sampled from the underlying population. There are a lot of methods for that in R and an excellent set of libraries in scikit-learn

Both approaches mentioned in the section are similar in spirit to the wide spectrum of Monte Carlo methods, employed in physics since the 40s.

Finding hidden groups: (centroid-based) clustering

Clustering is one of the most commonly used approaches in unsupervised learning, used to find hidden groups or partitionings in the the data. There are a large number of different approaches to this: hierarchical, density-, distribution and centroids-based clustering; there is a nice visual summary of many of the methods in scikit-learn’s clustering page.

Clustering methods, being easy to apply and shown early on in most introductory textbooks, enjoy a wide popularity. What seems to happen often though is that beginning practitioners turn to them to obtain a set of clusters… and stop there. More often than not there isn’t a lot of value in only doing the clustering and leaving it at that. Usually, clustering is more useful as a tool when chained together with other data analysis methods. For example, it can be very effective as a tool for dimensionality reduction, for further analyzing how different groups of objects behave. A good example for this is combining clustering with time dimension. It can be difficult to track the evolution of the dataset under study when you have thousands of features. Reducing it to a smaller set of clusters and observing how the cluster distributions change can on the other hand reveal interesting patterns in the data not visible otherwise.

Scikit-learn includes an excellent set of clustering methods, likewise for for R’s cluster package.

Feature selection

Feature selection is usually not treated as a separate topic in ML or data science literature, rather it is viewed as a a loose set of techniques that are mostly natural side effects of other, more fundamental methods such as lasso, random forest, etc. While this is technically true, I’ve found that having a unified understanding of feature selection to be greatly beneficial in many data science and machine learning tasks. Understanding feature selection methods well leads to better performing models, better understanding of the data and better intuition about the algorithms underlying many machine learning models.

There are in general two reasons why feature selection is used:

Reducing the number of features, to reduce overfitting and improve the generalization of models.
To gain a better understanding of the features and their relationship to the response variables.

An important factor to take into account is that these two goals are often at odds with each other and thus require different approaches: depending on the data at hand a feature selection method that is good for goal (1) isn’t necessarily good for goal (2) and vice versa. For the previous, model based methods (for example linear model based and random forest based) are usually better, while for the latter, univariate feature selection methods can be the most useful, since they do not underestimate feature’s importance due to correlation with other features, like model based feature selection methods tend to do.

Measuring performance: metrics, loss functions, measures of relevance

The questions of how good is an estimator is the first one to come up once a model is built. It is easy to apply a measure, but it can also be easy to interpret it incorrectly. For example, accuracy of 95% in a classification task can sound wonderful, until you realize that in 95% of the cases, your data has a particular response, so your classifier is simply predicting this constant value. This doesn’t mean that accuracy is a bad metric per se, simply that one needs to be careful how and where it is applied. Thus, it is crucial to understand the metric you are applying in the context to the data at hand.
For classification ROC curve and AUC are excellent measures for classifier performance. They are not necessarily intuitive at first sight, so it’s worth taking the time to understand how the numbers are calculated. Similarly, precision and recall, confusion matrix and F1 are widely used for evaluating classification. Typical minimization targets in ML classification tasks are logistic loss and hinge loss.

For regression tasks, R^2 is an excellent measure that shows how good your estimators are in terms of a trivial estimator that predicts the mean. Additionally, it has a strong relation to correlation coefficient, namely R^2 is the square of the correlation coefficient between outcomes and predicted values. Typical minimization target in ML tasks are squared error and absolute error.

From data science point of view, interpretability of the measures can be important, in which case they can roughly be divided into 3 groups. Firstly, measures that lie in a given range (e. g. AUC lying between 0.5 and 1, R^2 between 0 and 1) are excellent in the sense that their numbers are comparable when the underlying data or responses change. Secondly, measures such as accuracy or mean absolute error are good at returning values that are easy to interpret in the context of concrete data, as they lie in the same scale as the underlying data and are therefor easy for humans to evaluate: are we off by 1%, 10% or 50% on average. Finally, measures such as squared error or log loss can be useful as optimization targets, but in general not as great for a quick interpretation by humans.

Scikit-learn provides a nice set of metrics and scoring functions in its metrics module.

Summary

The list above only scratches the surface of modern ML and statistical tools. There are many, many powerful and widely used methods left untouched in this post: deep learning, Bayesian methods, SVMs, recommender systems, graph algorithms etc, etc. Yet, i would say that mastering the above will give a practioner a very solid baseline to tackle a very wide area of data related tasks, and furthermore will make stepping into other, more sophisticated methods much easier.