However, this linear breakdown is inherently imperfect, since a linear combination of features cannot capture interactions between them. A classic example of a relation where a linear combination of inputs cannot capture the output is exclusive or (XOR), defined as

X1 | X2 | OUT |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |

In this case, neither X1 nor X2 provides anything towards predicting the outcome in isolation. Each value only becomes predictive in conjunction with the other input feature.

A decision tree can easily learn a function to classify the XOR data correctly via a two-level tree (depicted below). However, if we consider feature contributions at each node, then at the first step through the tree (when we have looked only at X1), we haven’t yet moved away from the bias, so the best we can predict at that stage is still “don’t know”, i.e. 0.5. If we had to write out the contribution from the feature at the root of the tree, we would (incorrectly) say that it is 0. After the next step down the tree, we can make the correct prediction, at which stage we might say that the second feature provided all the predictive power, since we move from a coin flip (predicting 0.5) to a concrete and correct prediction, either 0 or 1. But attributing this only to the second-level variable in the tree is clearly wrong, since the contribution comes from both features and should be attributed equally to both.

This information is of course available along the tree paths. We simply need to gather all the conditions (and thus features) along the path that leads to a given node.

As you can see, the contribution of the first feature at the root of the tree is 0 (*value* staying at 0.5), while observing the second feature gives the full information needed for the prediction. We can now combine the features along the decision path, and correctly state that X1 and X2 together create the contribution towards the prediction.
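To make the path-based attribution concrete, here is a minimal sketch on a hand-built XOR tree. The tuple layout of the tree and the `joint_contributions` helper are assumptions made for this illustration only; they are not how treeinterpreter represents trees internally.

```python
# Hand-built two-level XOR tree (illustrative layout, not treeinterpreter's).
# Internal node: (feature_index, value_at_node, child_if_0, child_if_1)
# Leaf:          (None, value_at_leaf)
XOR_TREE = (0, 0.5,
            (1, 0.5, (None, 0.0), (None, 1.0)),   # branch for X1 == 0
            (1, 0.5, (None, 1.0), (None, 0.0)))   # branch for X1 == 1


def joint_contributions(tree, x):
    """Walk the decision path and attribute each change in node value
    to the set of all features tested so far on that path."""
    bias = tree[1]
    contributions = {}
    seen = []
    node = tree
    while node[0] is not None:
        feature, value = node[0], node[1]
        child = node[3] if x[feature] == 1 else node[2]
        if feature not in seen:
            seen.append(feature)
        key = tuple(sorted(seen))
        contributions[key] = contributions.get(key, 0.0) + (child[1] - value)
        node = child
    return node[1], bias, contributions


prediction, bias, contribs = joint_contributions(XOR_TREE, (1, 0))
print(prediction, bias, contribs)
```

For the input (1, 0), the root split on feature 0 alone contributes nothing, and the whole move from 0.5 to the final prediction is credited to the pair (0, 1).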

The joint contribution calculation is supported by v0.2 of the treeinterpreter package (clone or install via pip). Joint contributions can be obtained by passing the *joint_contribution* argument to the *predict* method, which returns the triple [prediction, bias, contributions], where contributions maps tuples of feature indices to absolute contributions.

Here’s an example, comparing two subsets of the Boston housing data and calculating which feature combinations contribute to the difference in estimated prices:

    from sklearn.ensemble import RandomForestRegressor
    import numpy as np
    # Note: load_boston was removed in scikit-learn 1.2, so this import
    # requires an older scikit-learn release.
    from sklearn.datasets import load_boston
    from treeinterpreter import treeinterpreter as ti, utils

    boston = load_boston()
    rf = RandomForestRegressor()

    # We train a random forest model, ...
    rf.fit(boston.data[:300], boston.target[:300])

    # take two subsets from the data ...
    ds1 = boston.data[300:400]
    ds2 = boston.data[400:]

    # and check what the predicted average price is
    print(np.mean(rf.predict(ds1)))
    print(np.mean(rf.predict(ds2)))

    21.9329
    17.8863207547

The average predicted price is different for the two datasets. We can break down why and check the joint feature contribution for both datasets.

    prediction1, bias1, contributions1 = ti.predict(rf, ds1, joint_contribution=True)
    prediction2, bias2, contributions2 = ti.predict(rf, ds2, joint_contribution=True)

Since biases are equal for both datasets (because the model is the same), the difference between the average predicted values has to come only from (joint) feature contributions. In other words, the sum of the feature contribution differences should be equal to the difference in average prediction.

We can make use of the *aggregated_contribution* convenience method, which takes the contributions for individual predictions and aggregates them for the whole dataset:

    aggregated_contributions1 = utils.aggregated_contribution(contributions1)
    aggregated_contributions2 = utils.aggregated_contribution(contributions2)

    print(np.sum(list(aggregated_contributions1.values())) -
          np.sum(list(aggregated_contributions2.values())))
    print(np.mean(prediction1) - np.mean(prediction2))

    4.04657924528
    4.04657924528

Indeed we see that the contributions exactly match the difference, as they should.
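The identity being checked here can be reproduced on toy numbers. The sketch below assumes a hypothetical layout (one dict of joint contributions per sample) and a hypothetical `aggregate` helper that averages them; it illustrates the idea and is not the package's own code.

```python
def aggregate(contributions):
    """Average per-sample joint contributions over a dataset.

    `contributions` holds one dict per sample, mapping a tuple of feature
    indices to that sample's contribution (layout assumed for this sketch).
    """
    totals = {}
    for sample in contributions:
        for features, value in sample.items():
            totals[features] = totals.get(features, 0.0) + value
    n = len(contributions)
    return {features: total / n for features, total in totals.items()}


bias = 20.0  # same model, hence the same bias for both datasets
toy1 = [{(0,): 1.0, (0, 1): 2.0}, {(0,): 3.0}]
toy2 = [{(0,): -1.0}, {(1,): 1.0, (0, 1): 2.0}]


def predictions(dataset):
    # Each prediction is the bias plus the sum of that sample's contributions.
    return [bias + sum(sample.values()) for sample in dataset]


mean1 = sum(predictions(toy1)) / len(toy1)
mean2 = sum(predictions(toy2)) / len(toy2)
delta_contrib = sum(aggregate(toy1).values()) - sum(aggregate(toy2).values())
print(mean1 - mean2, delta_contrib)
```

Because the bias cancels out, the difference in mean prediction and the difference in summed average contributions come out identical.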

Finally, we can check which feature combination contributed by how much to the difference of the predictions in the two datasets:

    res = []
    for k in set(aggregated_contributions1.keys()).union(
            set(aggregated_contributions2.keys())):
        res.append(([boston["feature_names"][index] for index in k],
                    aggregated_contributions1.get(k, 0) - aggregated_contributions2.get(k, 0)))

    for lst, v in (sorted(res, key=lambda x: -abs(x[1])))[:10]:
        print(lst, v)

    (['RM', 'LSTAT'], 2.0317570671740883)
    (['RM'], 0.69252072064203141)
    (['CRIM', 'RM', 'LSTAT'], 0.37069750747155134)
    (['RM', 'AGE'], 0.11572468903150034)
    (['INDUS', 'RM', 'AGE', 'LSTAT'], 0.054158313631716165)
    (['CRIM', 'RM', 'AGE', 'LSTAT'], -0.030778806073267474)
    (['CRIM', 'RM', 'PTRATIO', 'LSTAT'], 0.022935961564662693)
    (['CRIM', 'INDUS', 'RM', 'AGE', 'TAX', 'LSTAT'], 0.022200426774483421)
    (['CRIM', 'RM', 'DIS', 'LSTAT'], 0.016906509656987388)
    (['CRIM', 'INDUS', 'RM', 'AGE', 'LSTAT'], -0.016840238405056267)

The majority of the delta came from the feature for number of rooms (RM), in conjunction with demographics data (LSTAT).

Making random forest predictions interpretable is pretty straightforward, leading to a similar level of interpretability as linear models. However, in some cases, tracking the feature interactions can be important, in which case representing the results as a linear combination of features can be misleading. By using the *joint_contribution* keyword for prediction in the treeinterpreter package, one can trivially take into account feature interactions when breaking down the contributions.

The latter is most commonly tackled with the most straightforward approach: calculating some point estimates, typically the mean or median, and tracking these. This could be time-based, such as calculating the daily mean, or based on some unit of “work done”, such as batches in a production line or versions of a software product. The calculated point estimates can be displayed on dashboards to be checked visually, or a delta can be computed between consecutive numbers and compared against some pre-defined threshold to see if an alarm should be raised.

The problem with such point estimates is that they don’t necessarily capture the change in the underlying distribution of the tracked measure well. A change in the distribution can be significant while the mean or the median remains unchanged.

In cases when change needs to be detected, measuring it directly on the distribution is typically a much better option.
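A tiny made-up example shows how point estimates can hide a change. The two samples below are invented for illustration: they share the same mean and median, yet their value counts differ markedly.

```python
from collections import Counter
from statistics import mean, median

# Invented samples: identical mean and median, different spread.
before = [4, 5, 5, 5, 5, 5, 5, 6]
after = [1, 1, 5, 5, 5, 5, 9, 9]

print(mean(before), median(before))
print(mean(after), median(after))

# The value counts reveal what the point estimates miss.
print(Counter(before))
print(Counter(after))
```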

Histogram intersection calculates the similarity of two discretized probability distributions (histograms), with possible values of the intersection lying between 0 (no overlap) and 1 (identical distributions). Given bin edges and two normalized histograms, it can be calculated by

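    import numpy

    def histogram_intersection(h1, h2, bins):
        # bins holds the bin edges; convert them to bin widths
        bins = numpy.diff(bins)
        sm = 0
        for i in range(len(bins)):
            # overlapping area of the two histograms within bin i
            sm += min(bins[i]*h1[i], bins[i]*h2[i])
        return sm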

For example, for the two distributions \(\mathcal{N}(2,1)\) and \(\mathcal{N}(3,1.5)\), the intersection is ~0.66, which is easy to represent graphically.
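As a rough way to try this out without any plotting, the following standard-library sketch samples from the two Gaussians, builds density-normalized histograms and computes their intersection. It assumes that 1 and 1.5 are standard deviations (the text does not say whether they are standard deviations or variances), and the bin edges, sample size and the `histogram` helper are arbitrary choices made for this example.

```python
import random


def histogram(samples, edges):
    """Density-normalized histogram: bar areas sum to 1 over the given edges.
    Samples falling outside the edges are ignored."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / (total * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]


def histogram_intersection(h1, h2, edges):
    widths = [b - a for a, b in zip(edges, edges[1:])]
    return sum(w * min(a, b) for w, a, b in zip(widths, h1, h2))


rng = random.Random(0)
edges = [-6 + 0.5 * i for i in range(41)]        # 40 bins covering [-6, 14]
a = [rng.gauss(2, 1) for _ in range(5000)]       # assumed: sigma = 1
b = [rng.gauss(3, 1.5) for _ in range(5000)]     # assumed: sigma = 1.5
h_a, h_b = histogram(a, edges), histogram(b, edges)

overlap = histogram_intersection(h_a, h_b, edges)
print(overlap)
```

The exact number depends on the sampling noise and the chosen bins, so it should be read as an estimate rather than a reference value.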

Histogram intersection has a few extra benefits:

- It works equally well on categorical data, where we can use category frequencies to compute the intersection
- Dealing with null values comes for free, simply by making nulls part of the distribution: if there is an increase in nulls, it changes the intersection, even if non-null values will continue to be distributed the same way. In contrast, when tracking point estimates such as the mean, null value checks need to be explicitly added as an additional item to track.
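Both points can be illustrated with category frequencies. In this sketch the data is invented and `categorical_intersection` is a hypothetical helper; `None` stands in for a null value and is simply treated as one more category.

```python
from collections import Counter


def categorical_intersection(xs, ys):
    """Histogram intersection over category frequencies (nulls included)."""
    fx, fy = Counter(xs), Counter(ys)
    categories = fx.keys() | fy.keys()
    return sum(min(fx[k] / len(xs), fy[k] / len(ys)) for k in categories)


yesterday = ["a"] * 50 + ["b"] * 50
# Same a/b balance among non-null values, but 20% of the values are now null.
today = ["a"] * 40 + ["b"] * 40 + [None] * 20

score = categorical_intersection(yesterday, today)
print(score)
```

The non-null values keep their 50/50 split, yet the intersection drops below 1 because the nulls take up part of the probability mass.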

As always, there are some caveats. One issue is that the intersection depends on how the bins have been selected. This becomes an issue especially for long-tailed distributions. For example, for the lognormal distribution with the same \(\mu\) and \(\sigma\) as above, the histogram might look something like the following, with practically all density concentrated in the first bin. Calculating the intersection on that would clearly give very misleading results.

This can be tackled either by moving the histogram to log scale, or simply by clipping the long tail and using the result as an approximation. For this particular example, clipping gives the same result as the “true” intersection calculated in log scale.

Two methods often recommended for detecting change in distribution are Kullback-Leibler divergence and statistical tests, for example a chi-squared test, which can be used both on categorical data and on a histogram of continuous data. Both methods have significant drawbacks, however.

Kullback-Leibler divergence is measured in bits and, unlike histogram intersection, does not lie in a given range. This makes comparison across two different datasets difficult. Secondly, it is only defined when zeroes in the distributions lie in the same bins. This can be overcome by transferring some of the probability mass to locations with zero probability; however, this is extra complexity that needs to be handled. Finally, KL divergence is not a true metric, i.e. it does not obey the triangle inequality, and is non-symmetric, i.e. in general, when comparing two different distributions \(P\) and \(Q\), \(KL(P,Q) \neq KL(Q,P)\).
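These drawbacks are easy to see on a three-bin example. The distributions below are invented, and the `smooth` helper is just one simple way of moving a little mass into empty bins; it is an assumption of this sketch, not a recommended recipe.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in bits for two discrete distributions over the same bins."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue             # a bin with no mass in P contributes nothing
        if qi == 0:
            return math.inf      # P has mass where Q has none
        total += pi * math.log2(pi / qi)
    return total


def smooth(p, eps=1e-6):
    """Shift a small amount of mass to every bin so that none is empty."""
    shifted = [pi + eps for pi in p]
    s = sum(shifted)
    return [x / s for x in shifted]


p = [0.5, 0.5, 0.0]
q = [0.9, 0.05, 0.05]

print(kl_divergence(p, q))           # finite
print(kl_divergence(q, p))           # infinite: q has mass in p's empty bin
print(kl_divergence(q, smooth(p)))   # finite again after smoothing
```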

Using a statistical test, for example a chi-squared test, might seem like a good option, since it gives you a p-value you can use to estimate how likely it is that the distribution has changed. The problem is that the p-value of a test is a function of the size of the data, so we can obtain tiny p-values with very small differences in the distributions. In the following, we are comparing \(\mathcal{N}(0,1)\) and \(\mathcal{N}(0.005,1)\) in the first plot, and \(\mathcal{N}(0,1)\) and \(\mathcal{N}(1,1)\) in the second plot.

In both cases, we have tiny p-values; in fact, it is even smaller for the first plot. From a practical point of view, the change in the first case is likely irrelevant, since the absolute change is so small.
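The dependence on sample size can be shown with a two-bin goodness-of-fit test, which needs nothing beyond the standard library. The counts are made up: in both cases the observed split is 50.5% vs 49.5% against an expected 50/50, and only the amount of data differs.

```python
import math


def chi2_two_bins(observed, expected_share=(0.5, 0.5)):
    """Chi-squared goodness-of-fit statistic and p-value for two bins."""
    n = sum(observed)
    stat = sum((o - n * e) ** 2 / (n * e)
               for o, e in zip(observed, expected_share))
    # Two bins give 1 degree of freedom, for which the chi-squared
    # survival function reduces to erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


small = chi2_two_bins((505, 495))            # 1,000 observations
large = chi2_two_bins((505_000, 495_000))    # 1,000,000 observations
print(small)
print(large)
```

The relative difference is identical in both cases, yet the test statistic grows linearly with the number of observations and the p-value collapses accordingly.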

A chi-squared test can of course still be very useful and complement the histogram intersection, by letting us know if the amount of data is small enough that measuring the intersection likely won’t give meaningful results.

One drawback of histogram intersection is that it does not consider distances between bins, which can be important in case of ordinal data.

For example, consider the following plot with three different histograms. The histogram intersections between histograms 1 and 2, and between 1 and 3, are the same. However, assuming it is ordinal data, we might want to say that histograms 1 and 2 are actually more similar to each other, since the changed bins are closer to each other than in the change between 1 and 3.

There are various methods to take this into account, the most well known being earth mover’s distance. There is a nice overview of these methods here.
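For one-dimensional histograms on equally spaced bins, earth mover’s distance reduces to the summed absolute difference of the cumulative distributions. The three histograms below are invented to mimic the situation described above; they are not the ones from the plot.

```python
from itertools import accumulate


def intersection(h1, h2):
    return sum(min(a, b) for a, b in zip(h1, h2))


def emd_1d(h1, h2):
    """Earth mover's distance between two normalized histograms defined on
    the same equally spaced bins, measured in units of bin widths."""
    return sum(abs(a - b) for a, b in zip(accumulate(h1), accumulate(h2)))


h1 = [0.5, 0.5, 0.0, 0.0, 0.0]
h2 = [0.5, 0.0, 0.5, 0.0, 0.0]   # half the mass moved one bin to the right
h3 = [0.5, 0.0, 0.0, 0.0, 0.5]   # half the mass moved three bins to the right

print(intersection(h1, h2), intersection(h1, h3))
print(emd_1d(h1, h2), emd_1d(h1, h3))
```

The intersection cannot tell the two changes apart, while the earth mover’s distance grows with how far the mass has travelled.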

What we can do though is to take a purely statistical approach, by estimating which fighters are the likeliest to have the highest skill level, considering their wins and the quality of their opponents. This blog post intends to do just that, by using state-of-the-art statistical methods that have been successfully used in similar settings, for example to analyze and rank chess players (see TrueSkill Through Time: Revisiting the History of Chess) and for ranking players in competitive online games.

The data we will utilise is obtained from Sherdog fight finder, a comprehensive database of MMA matches, dating back to 1980 (with the first match between Casemiro Martins and Rickson Gracie). The dataset includes results of over 200,000 matches and over 94,000 fighters in total. The following chart displays the number of matches per year in the dataset.

In this data, we have full details of match outcomes, including win type (if the fight ended with a win for one of the contestants), but also information about matches ending in no contest or disqualification. Since the study tries to estimate skill, we ignore data about both NC and disqualified matches (even though the latter means a win for one of the fighters, it is not possible to infer the skill level based on that fact alone, Jon Jones vs Matt Hamill being a good example). All computations are done on the snapshot of the data as of Dec. 20, 2015.

Our goal is to assign a skill level to each fighter, based on observed match outcomes. As mentioned earlier, the methodology we will employ is based on TrueSkill, the Bayesian statistical approach used by Microsoft for Xbox to determine the skill level of players so that they can be matched up optimally in competitive multiplayer games such as Halo. The same approach has also been employed to rank chess players using historic chess match outcome data from 1850 to 2006 (see TrueSkill Through Time: Revisiting the History of Chess).

We rely on Bayesian inference to infer the skill of competitors based on the observed match outcomes. As the skills of fighters are unknown, they are assigned a probability distribution, in the current case a Gaussian with mean \(\mu\) and variance \(\sigma^2\). The mean \(\mu\) of a fighter’s skill specifies the average performance of the fighter. Our uncertainty in the skill level is specified by the variance \(\sigma^2\). The match outcome is determined by the performance of both contestants. Since the performance of a fighter can vary from match to match (good days and bad days), it can be thought of as a noisy version of skill. The winner of the match is the one with the higher performance in that particular match.

The following is a high-level summary of the model we base the statistical inference on:

- Each fighter’s skill is a normally distributed (Gaussian) random variable (the mean and variance of which we will learn from the data). Since a fighter’s skill changes over the years, we have a skill ranking for a fighter for each year he is active. These are the values we want to infer.
- In every match, a fighter’s performance comes from skill, i.e. it is drawn from the Gaussian \( \mathcal{N}(skill, performanceVariance)\), where \(performanceVariance\) is another hidden variable, learned from data.
- A fight ends in a decision victory for fighter \(f_1\) if his performance \(p_1\) is greater than the performance \(p_2\) of fighter \(f_2\) plus some threshold \(decisionThreshold\), i.e. \(p_1 > p_2 + decisionThreshold\).
- If the performance difference is less than the decision threshold, i.e. \(\mid p_1 - p_2 \mid \lt decisionThreshold\), the fight ends in a draw.
- A fight ends in a finish victory (submission, KO or TKO) for fighter \(f_1\) if \(p_1 > p_2 + finishThreshold\).
- Both the finish and decision thresholds are themselves hidden variables that are learned from data.
- The finish threshold is constrained to be strictly larger than the decision threshold, i.e. \(finishThreshold > decisionThreshold\).
- Two contestants in a match are assumed to be on a somewhat similar level, i.e. their skill levels are very unlikely to be vastly different. In other words, UFC level fighters are in general matched up with other UFC level fighters, not low level amateurs. This assumption helps us in inferring the skill level of fighters who have a very small number of fights. Thus for each match, we assume \(\mid skill_1 - skill_2 \mid \lt matchupThreshold \). The value of the matchup threshold is again a hidden variable learned from data.
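Read forwards, the assumptions above describe how a single match outcome is generated. The sketch below simulates that generative story in plain Python; it is not the Infer.NET model and performs no inference. All numeric values are made up, and treating a margin above the finish threshold as a finish (rather than a decision) is my reading of the thresholds.

```python
import math
import random


def match_outcome(p1, p2, decision_threshold, finish_threshold):
    """Map two sampled performances to an outcome label."""
    diff = p1 - p2
    if abs(diff) <= decision_threshold:
        return "draw"
    winner = "1" if diff > 0 else "2"
    kind = "finish" if abs(diff) > finish_threshold else "decision"
    return f"{kind}_{winner}"


def simulate_match(skill1, skill2, performance_variance,
                   decision_threshold, finish_threshold, rng):
    # Performance is a noisy version of skill.
    sd = math.sqrt(performance_variance)
    p1 = rng.gauss(skill1, sd)
    p2 = rng.gauss(skill2, sd)
    return match_outcome(p1, p2, decision_threshold, finish_threshold)


rng = random.Random(0)
# Hypothetical values: skills 1100 vs 1000, performance sd 100,
# decision threshold 20, finish threshold 150.
results = [simulate_match(1100, 1000, 100.0 ** 2, 20, 150, rng)
           for _ in range(2000)]
wins1 = sum(r.endswith("_1") for r in results)
wins2 = sum(r.endswith("_2") for r in results)
print(wins1, wins2, results.count("draw"))
```

Inference runs this story in reverse: given many observed outcomes, it searches for the skills, variances and thresholds that make those outcomes most plausible.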

To implement the model we use Infer.NET, a Bayesian inference framework. The model is in large part based on the chess analysis model built by Microsoft Research on the same toolset. The code for the model can be found on GitHub. The final inference is done on the full Sherdog fight database.

While most matches are within a weight class, there are also matches where contestants in different weight classes are matched up. In this case, a heavier fighter has an advantage on average. This translates into a bias, where the average skill levels of fighters in higher weight classes will be observed to be higher than those of lower ones. To create a true pound for pound ranking, we can normalize the skill by weight class, by dividing each fighter’s skill level by the average skill level of the given weight class.

The mean skill level over all fighters is 1000 (since this is chosen as the prior/baseline). The following are the weight class averages (NA denotes no information about a fighter’s weight class).

Weight class | Average inferred skill |
---|---|
Heavyweight | 1159 |
Light Heavyweight | 1141 |
Middleweight | 1140 |
Welterweight | 1132 |
Lightweight | 1119 |
Featherweight | 1115 |
Bantamweight | 1085 |
Flyweight | 1085 |
NA | 893 |

A note about this table: this shouldn’t be interpreted in absolute terms (“flyweights are only 6% weaker than heavyweights”). Rather, this is a bias correction table for the pound for pound ranking. In p4p terms, every weight class should have the same average skill level. Since across-weight class matches bias the numbers, this is the table we can use to correct it.
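The correction itself is a single division. In the sketch below the class averages come from the table above, while the two fighters and their raw skill of 1300 are hypothetical.

```python
# Weight class averages taken from the table above.
WEIGHT_CLASS_AVERAGE = {
    "Heavyweight": 1159, "Light Heavyweight": 1141, "Middleweight": 1140,
    "Welterweight": 1132, "Lightweight": 1119, "Featherweight": 1115,
    "Bantamweight": 1085, "Flyweight": 1085, "NA": 893,
}


def pound_for_pound(skill, weight_class):
    """Normalize an inferred skill by the average of the fighter's class."""
    return skill / WEIGHT_CLASS_AVERAGE[weight_class]


# Two hypothetical fighters with the same raw inferred skill.
heavy = pound_for_pound(1300, "Heavyweight")
fly = pound_for_pound(1300, "Flyweight")
print(heavy, fly)
```

With equal raw skill, the flyweight ends up with the higher pound for pound score, since his class average is lower.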

Long story short, statistically, Jon Jones has the highest skill ranking of any fighter in the database. The following graph shows the computed skill level of the top 10 male fighters out of the 94,000 fighters in the database.

Jones stands quite a bit above other fighters in terms of his skill rating. Where other top fighters’ ratings are very close to each other, Jon Jones’ rating is clearly above. Statistically, this isn’t a surprise. He is without a loss (the model disregards DQ losses as irrelevant to skill), and has wins over extremely high level competition.

Looking at the rest of the list, there are some interesting results. While Daniel Cormier isn’t typically ranked among the very top, statistically speaking, he should be. His only loss is to the number one, Jones, and almost all of his wins are over very high quality opponents. The combined record of his opponents is the best in the whole top 10.

Anderson Silva’s position as an all-time great is statistically somewhat hurt by the start and end of his career. His two losses to Chris Weidman and the losses to relatively weak fighters early in his career mean that, from a purely statistical point of view, he does not quite reach Jon Jones’s level. But even so, he is among the very top.

Ben Askren is definitely a surprise in the list. His undefeated record is the main reason he is this high. Since we have not observed a loss for him, it makes his probable skill very high from a statistical perspective. This is in some sense a drawback of the model, but fundamentally, it *is* difficult to draw a conclusion about undefeated fighters. Khabib Nurmagomedov is another example of this. Normally, he isn’t considered p4p top 10. Statistically however, it makes a lot of sense, being 22-0 and having a very solid win list, including the current champion.

Two fighters have made meteoric rises in the rankings. Rafael dos Anjos and Conor McGregor, starting from a relatively low baseline (due to their lower level competition and losses earlier in their careers), have risen to the top 10 very quickly, thanks to their very strong wins in recent years.

One thing that should be emphasized is that the statistical skill levels for the top fighters are actually extremely close. For example, Conor McGregor came into the top 10 ranking only after beating Aldo, who was top 5 prior to that. This means a single fight can change the rankings very significantly.

Additionally, as mentioned earlier, there is also a variance associated with each inferred skill, which reflects our statistical uncertainty in the estimation. And this uncertainty is (naturally) very high, due to the low number of fights each fighter has (unlike, say, chess, where the number of games can be in the thousands for a competitor). This is illustrated by the following chart, where we plot 95% confidence intervals around the mean for fighters often considered to be the top 3 of all time.
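For a Gaussian skill estimate, such a 95% interval is roughly the mean plus or minus 1.96 standard deviations. The means and variances below are hypothetical, chosen only to show what overlapping intervals look like; they are not values produced by the model.

```python
import math


def interval_95(mean, variance):
    """Approximate central 95% interval of a Gaussian skill estimate."""
    half_width = 1.96 * math.sqrt(variance)
    return mean - half_width, mean + half_width


def overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical fighters: (mean skill, variance of the estimate).
fighter_a = interval_95(1500.0, 60.0 ** 2)
fighter_b = interval_95(1450.0, 70.0 ** 2)
print(fighter_a, fighter_b, overlap(fighter_a, fighter_b))
```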

As can be seen, the intervals are fairly wide and overlapping. What this means is that while Jones has the highest average skill *given the match outcomes*, we cannot conclude with absolute certainty that his skill level is above the other top ranked fighters, simply that it is likely to be the case. And indeed, a slight underperformance in one of his fights (such as a loss to Gustafsson) would have changed the rankings drastically, showing how fragile the ranking estimation actually is. This all boils down to the fact that the number of fights of a typical competitor is fairly small and there is a lot of “luck” involved, with small mistakes making large changes in the overall picture. Our model is well able to capture and quantify this uncertainty.

Many consider the long-reigning flyweight champion Demetrious Johnson to be among the all-time best, so him not being in the top 10 statistically is somewhat surprising. Looking at the data further, the main reason for this seems to be overall competition, which surprisingly seems slightly weaker in the division. For example, his opponents’ and their opponents’ combined win percentage is 62.9%. For comparison, the combined win ratio of Chris Weidman’s opponents and their opponents is 64.6%; for McGregor it is 64.1%. This translates to Demetrious’s opponents being ranked lower on average than the opponents of the fighters in the top 10, in turn translating to his score being lower.

Bayesian inference provides an excellent toolset for inferring hidden parameters in the data, especially when the amount of data to draw conclusions on is small. As a result it is especially useful in a setting such as reasoning about the skill levels of top MMA competitors. Based on the available data, Jon Jones is the best ever from a statistical perspective. But it has to be kept in mind that the variance (uncertainty) in the estimates is very high and small changes in the data, such as one win or loss, can significantly alter the rankings. And this reflects the real world intuition well: a loss or two can easily change the perception of the fighter in the list of all time greats.

The code for the model, to replicate the results or develop it further, can be found on GitHub.

On the other hand, there are some methods and concepts that are widely used and consistently useful (or downright irreplaceable) in a large variety of domains, problem settings and scales. Knowing and understanding them well will give practitioners a solid base to tackle a large subset of common data related problems when complemented by programming, data manipulation and visualization skills.

Here’s a list of statistical and machine learning concepts that are in every data scientist’s toolbox.

Some of the most universally useful methods in data science are decision tree based: decision trees, random forests and gradient boosted trees. Decision trees as base learners have a lot of very useful characteristics, most of which are inherited by derived methods such as random forests. They

- Are robust to outliers
- Can deal with both continuous and categorical data
- Can learn non-linear relationships in the data well
- Require very little input preparation (see previous three points)
- Are easy to interpret, via plotting the tree or extracting the tree rules. This can be very useful to give you the “feel” of the data

The main negative of decision trees is that they are a high variance method and tend to overfit, i.e. do not generalize well. This is where using decision trees as base learners for ensemble methods comes in.

Random forests are simply sets of decision trees, trained using bootstrapped data and random feature selection. This fixes the high variance problem of decision trees, making random forests one of the most versatile and widely used machine learning methods. They have high accuracy and low variance, all the while inheriting most benefits from decision trees. When compared to many other more sophisticated models, they require very little tuning. In general, it is pretty hard to train a really badly performing random forest model. Even with out of the box hyperparameters, random forest models perform quite well in general. Finally, they are trivially parallelizable in both the training and testing phases.

One drawback that is often highlighted about random forests is that they are a black box, i.e. that there is no way to interpret the model or the resulting predictions. Fortunately, this is not really true, due to recent developments in making random forests more interpretable. There are methods for decomposing random forest predictions into feature contributions, selecting compact rule sets and summarizing the extracted tree rules (inTrees package in R).

There are excellent implementations of tree based methods in most mainstream languages, with Python (scikit-learn) and R (randomForest, party) probably being the most accessible.

Linear models (such as linear and logistic regression) are typically among the first models to be taught in ML courses and covered in textbooks, and for good reason. They are very powerful for their relative simplicity. They are fast to train and used especially often when good interpretability is of the essence. The general form \(y = a +b_1X_1+ \ldots +b_nX_n\) means that it’s easy to see the relative importance and contribution of each feature and sanity check the model.

A drawback of linear models is that, unlike tree based methods, they are much more sensitive to outliers (requiring input sanitization), require explicit handling of categorical features (via one-hot encoding) and expect a linear relation between the input and the response variable.

It is possible to overcome the latter via basis expansion, i.e. by including logarithmic, polynomial or other transformations of the input features, depending on the data at hand. This is usually most effective when combined with regularization (Lasso and Ridge regression). These are very powerful techniques for feature selection and for preventing overfitting, allowing us to filter out irrelevant features (and irrelevant transforms in the case of basis expansion).
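A small standard-library sketch of basis expansion: the synthetic data follows \(y = 1 + 3\log(x)\), so a straight line in \(x\) fits poorly while a straight line in \(\log(x)\) fits exactly. The data and the `fit_line` and `sse` helpers are invented for this illustration, and regularization is left out to keep it short.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope


def sse(xs, ys, intercept, slope):
    """Sum of squared errors of the fitted line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 4, 8, 16, 32, 64]
ys = [1 + 3 * math.log(x) for x in xs]    # synthetic, non-linear in x

raw_fit = fit_line(xs, ys)
log_xs = [math.log(x) for x in xs]        # basis expansion: use log(x) instead
log_fit = fit_line(log_xs, ys)

sse_raw = sse(xs, ys, *raw_fit)
sse_log = sse(log_xs, ys, *log_fit)
print(sse_raw, sse_log)
```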

Another great aspect of linear models is that very effective online (streaming) algorithms exist, making training models even on massive datasets easily accessible, by requiring constant memory.
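The constant-memory property is easiest to see with plain stochastic gradient descent: only the current weights are kept, and each observation is used once and then discarded. This is a bare-bones sketch on a synthetic, noiseless stream with a made-up learning rate, not a picture of how any particular online learning tool works.

```python
import random


def sgd_linear_regression(stream, learning_rate=0.1):
    """Fit y = w * x + b with one pass over a stream of (x, y) pairs."""
    w, b = 0.0, 0.0
    for x, y in stream:
        error = (w * x + b) - y
        # Gradient step on the squared error of this single observation.
        w -= learning_rate * error * x
        b -= learning_rate * error
    return w, b


def synthetic_stream(n, rng):
    """Yield n observations of y = 2x + 1 without storing them."""
    for _ in range(n):
        x = rng.uniform(-1, 1)
        yield x, 2.0 * x + 1.0


w, b = sgd_linear_regression(synthetic_stream(5000, random.Random(0)))
print(w, b)
```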

There is an excellent set of linear model and regularization libraries in Python (scikit-learn, statsmodels) and R (lm). For large datasets, there are online learning tools available such as Vowpal Wabbit.

Being able to quantify the certainty in the estimates and predictions that are produced based on data is often one of the most crucial aspects in a data scientist’s work. If you don’t take variance into account in your estimates, it becomes easy to come to arbitrary conclusions. Thus, understanding and using hypothesis testing is something every data scientist utilizes.

There are multiple ways to do hypothesis testing. Statistics courses spend a lot of time on statistical tests (such as t-, z- and F-tests) and their closed form solutions. In practice, confidence intervals are often better alternatives for hypothesis testing, providing more information about the estimates by quantifying both their location and precision. In the Bayesian world, credible intervals offer a similar benefit.

While there has been a lot of controversy around using p-values (due to them having been misused and abused a lot in some scientific circles), they remain a valuable tool when applied correctly. For example, for categorical data, a chi-squared test can be an excellent tool for understanding if the effect you see in your bar charts is real.

Finally, it is important to understand what hypothesis testing is really about. It’s often viewed as some arcane formula, which tells you the right answer by magically producing a p-value that can then be compared to 0.05. In the end, every test is the same: trying to answer the question whether the observed effect is real or not. And while having a “test” as a closed form solution for calculating the test score and p-values is great, you can achieve the same thing (an answer to the question: is this effect real?) via simulation. In fact, it can be better to use simulation if you’re uncertain whether all the assumptions that need to hold for the analytical test actually do. There is a great writeup on this topic: There is only one test.
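One simulation-based way to ask “is this effect real?” is a permutation test: shuffle the group labels many times and count how often a difference at least as large as the observed one appears by chance. The sketch below uses made-up data, and the iteration count and the choice of the difference in means as the test statistic are arbitrary.

```python
import random


def permutation_test(a, b, n_iter=2000, seed=0):
    """Estimate the p-value for the absolute difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)                       # reassign group labels at random
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    # The +1 terms count the observed split itself as one permutation.
    return (hits + 1) / (n_iter + 1)


group_a = list(range(1, 11))      # 1..10
group_b = list(range(11, 21))     # 11..20, clearly shifted
p_diff = permutation_test(group_a, group_b)
p_same = permutation_test(group_a, group_a)
print(p_diff, p_same)
```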

Speaking of simulation and bootstrapping…

Resampling methods are a powerful set of tools that employ resampling to produce new hypothetical sample sets as if they were sampled from the underlying population. They are excellent when parametric approaches are difficult to use or just don’t apply. As such they are often crucial for many data analysis and machine learning tasks.

Bootstrapping, or sampling with replacement, allows obtaining measures of confidence, such as variance or confidence intervals, for sample estimates (such as the mean or median). The estimates can be obtained by sampling with replacement from the observed dataset, measuring the estimate we’re interested in (for example the mean) and then repeating the process until we have enough readings to compute confidence intervals, variance or any other property of the estimate we want (for example via percentile bootstrap).
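The procedure described above fits in a few lines. This is a sketch of the plain percentile bootstrap on a made-up sample; the number of resamples and the way the percentile indices are picked are simple choices made for this example.

```python
import random
from statistics import mean


def percentile_bootstrap(sample, statistic=mean, n_resamples=2000,
                         alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic`."""
    rng = random.Random(seed)
    estimates = sorted(
        statistic(rng.choices(sample, k=len(sample)))   # resample with replacement
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


sample = [12, 15, 9, 22, 17, 14, 30, 11, 13, 16,
          19, 10, 25, 18, 14, 12, 21, 15, 13, 17]
low, high = percentile_bootstrap(sample)
print(mean(sample), (low, high))
```

Passing `statistic=median` (from the `statistics` module) instead gives an interval for the median with no other changes.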

Depending on the underlying distribution (for example skewed vs symmetric), the estimations can be biased. There are ways to mitigate that, for example via bias corrected and accelerated bootstrap (good overview here). R’s boot package has multiple different bootstrap methods available. But often, going beyond percentile bootstrap could be overkill: we are often interested in the order of magnitude of the measure of confidence anyway, not necessarily an exact value.

Bootstrapping also has its uses in machine learning, for example for creating ensembles of models (such as random forests).

Cross-validation is another resampling method, used to make sure that the results we see on our sample set would actually apply to an independent dataset. In other words, to make sure we are not overfitting our models. This is a must in machine learning tasks where prediction is involved. Just as with bootstrapping, the idea is simple: randomly partition the dataset into two (a training set and a test set), measure on the test set the performance of the model trained on the training set, and then repeat the experiment after splitting the data randomly again. After enough experiments, averaging over the results gives a good estimate of how the model would perform on a new dataset sampled from the underlying population. There are a lot of methods for that in R and an excellent set of libraries in scikit-learn.
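The repeated random split described above can be sketched with the standard library alone. The “model” here is deliberately trivial (it always predicts the training mean), and the data, split share and repeat count are invented; in practice `fit` and `score` would wrap a real learner and metric.

```python
import random
from statistics import mean


def repeated_holdout(xs, ys, fit, score, n_repeats=20, test_share=0.3, seed=0):
    """Average test score over repeated random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_test = max(1, int(len(xs) * test_share))
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return mean(scores), scores


def fit_mean(train_x, train_y):
    # Trivial placeholder model: remember the mean of the training targets.
    return mean(train_y)


def mse(model, test_x, test_y):
    return mean([(y - model) ** 2 for y in test_y])


xs = list(range(50))
ys = [2 * x + 1 for x in xs]
avg_score, all_scores = repeated_holdout(xs, ys, fit_mean, mse)
print(avg_score)
```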

Both approaches mentioned in the section are similar in spirit to the wide spectrum of Monte Carlo methods, employed in physics since the 40s.

Clustering is one of the most commonly used approaches in unsupervised learning, used to find hidden groups or partitionings in the data. There are a large number of different approaches to this: hierarchical, density-, distribution- and centroid-based clustering; there is a nice visual summary of many of the methods on scikit-learn’s clustering page.

Clustering methods, being easy to apply and shown early on in most introductory textbooks, enjoy a wide popularity. What seems to happen often though is that beginning practitioners turn to them to obtain a set of clusters… and stop there. More often than not there isn’t a lot of value in only doing the clustering and leaving it at that. Usually, clustering is more useful as a tool when chained together with other data analysis methods. For example, it can be very effective as a tool for dimensionality reduction, for further analyzing how different groups of objects behave. A good example for this is combining clustering with time dimension. It can be difficult to track the evolution of the dataset under study when you have thousands of features. Reducing it to a smaller set of clusters and observing how the cluster distributions change can on the other hand reveal interesting patterns in the data not visible otherwise.

Scikit-learn includes an excellent set of clustering methods, as does R’s cluster package.

Feature selection is usually not treated as a separate topic in ML or data science literature; rather, it is viewed as a loose set of techniques that are mostly natural side effects of other, more fundamental methods such as lasso, random forest, etc. While this is technically true, I’ve found that having a unified understanding of feature selection is greatly beneficial in many data science and machine learning tasks. Understanding feature selection methods well leads to better performing models, better understanding of the data and better intuition about the algorithms underlying many machine learning models.

There are in general two reasons why feature selection is used:

- Reducing the number of features, to reduce overfitting and improve the generalization of models.
- To gain a better understanding of the features and their relationship to the response variables.

An important factor to take into account is that these two goals are often at odds with each other and thus require different approaches: depending on the data at hand, a feature selection method that is good for the first goal isn’t necessarily good for the second, and vice versa. For the former, model based methods (for example linear model based and random forest based) are usually better, while for the latter, univariate feature selection methods can be the most useful, since they do not underestimate a feature’s importance due to correlation with other features, as model based feature selection methods tend to do.

The question of how good an estimator is tends to be the first one to come up once a model is built. It is easy to apply a measure, but it can also be easy to interpret it incorrectly. For example, an accuracy of 95% in a classification task can sound wonderful, until you realize that in 95% of the cases your data has one particular response, so your classifier is simply predicting this constant value. This doesn’t mean that accuracy is a bad metric per se, simply that one needs to be careful how and where it is applied. Thus, it is crucial to understand the metric you are applying in the context of the data at hand.
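
The 95% trap is easy to reproduce. The following is a small sketch on made-up labels (the 95/5 class split and the variable names are assumptions for illustration): a "classifier" that always predicts the majority class scores 95% accuracy while never finding a single positive case.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# A "model" that has learned nothing and always predicts the majority class.
constant_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, constant_pred)
positives_found = sum(1 for t, p in zip(y_true, constant_pred) if t == 1 and p == 1)

print(baseline_accuracy)  # 0.95
print(positives_found)    # 0
```

Comparing any reported accuracy against this kind of constant baseline is a cheap first sanity check.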

For classification, the ROC curve and AUC are excellent measures of classifier performance. They are not necessarily intuitive at first sight, so it’s worth taking the time to understand how the numbers are calculated. Similarly, precision and recall, the confusion matrix and F1 are widely used for evaluating classification. Typical minimization targets in ML classification tasks are logistic loss and hinge loss.
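
Since it pays to know how these numbers are calculated, here is a minimal sketch of the confusion matrix counts and the precision, recall and F1 scores derived from them, for the binary case. The helper names and the toy labels are made up for this example; scikit-learn’s metrics module provides production versions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1); 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)
print(precision, recall, f1)
```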

For regression tasks, R^2 is an excellent measure that shows how good your estimator is relative to a trivial estimator that predicts the mean. Additionally, it has a strong relation to the correlation coefficient: for a least-squares linear fit, R^2 is the square of the correlation coefficient between outcomes and predicted values. Typical minimization targets in ML regression tasks are squared error and absolute error.
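
To make the comparison with the trivial mean estimator concrete, here is a short sketch of R^2 computed from its definition (one minus the ratio of the residual sum of squares to the total sum of squares around the mean). The function name and the toy numbers are chosen for this example only.

```python
def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, where SS_tot is the error of always predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(y_true, y_true)                   # exact predictions
trivial = r_squared(y_true, [6.0] * 4)                # always predict the mean
reversed_ = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])   # worse than the mean

print(perfect, trivial, reversed_)  # 1.0 0.0 -3.0
```

A score of 0 means "no better than predicting the mean", and a model that does worse than that ends up with a negative value.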

From a data science point of view, interpretability of the measures can be important, in which case they can roughly be divided into three groups. Firstly, measures that lie in a fixed range (e.g. AUC, which lies between 0.5 and 1 for any classifier no worse than random, or R^2, which lies between 0 and 1 for any model no worse than predicting the mean) are excellent in the sense that their numbers are comparable when the underlying data or responses change. Secondly, measures such as accuracy or mean absolute error are good at returning values that are easy to interpret in the context of concrete data, as they lie on the same scale as the underlying data and are therefore easy for humans to evaluate: are we off by 1%, 10% or 50% on average? Finally, measures such as squared error or log loss can be useful as optimization targets, but are in general not as well suited to quick interpretation by humans.

Scikit-learn provides a nice set of metrics and scoring functions in its metrics module.

The list above only scratches the surface of modern ML and statistical tools. There are many powerful and widely used methods left untouched in this post: deep learning, Bayesian methods, SVMs, recommender systems, graph algorithms, etc. Yet I would say that mastering the above will give a practitioner a very solid baseline for tackling a very wide range of data related tasks, and furthermore will make stepping into other, more sophisticated methods much easier.

I’ve had quite a few requests for code to do this. Unfortunately, most random forest libraries (including scikit-learn) don’t expose tree paths of predictions. The implementation for sklearn required a hacky patch for exposing the paths. Fortunately, since 0.17.dev, scikit-learn has two additions in the API that make this relatively straightforward: obtaining leaf node_ids for predictions, and storing all intermediate values in all nodes in decision trees, not only leaf nodes. Combining these, it is possible to extract the prediction paths for each individual prediction and decompose the predictions by inspecting the paths.

Without further ado, the code is available on GitHub, and also via `pip install treeinterpreter`.

*Note: this requires scikit-learn 0.17, which is still in development. You can check how to install it at http://scikit-learn.org/stable/install.html#install-bleeding-edge*

Let’s take a sample dataset, train a random forest model, predict some values on the test set and then decompose the predictions.

    from treeinterpreter import treeinterpreter as ti
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.ensemble import RandomForestRegressor
    import numpy as np
    from sklearn.datasets import load_boston

    boston = load_boston()
    rf = RandomForestRegressor()
    rf.fit(boston.data[:300], boston.target[:300])

Let’s pick two arbitrary data points that yield different price estimates from the model.

    instances = boston.data[[300, 309]]
    print "Instance 0 prediction:", rf.predict(instances[0])
    print "Instance 1 prediction:", rf.predict(instances[1])

Instance 0 prediction: [ 30.76]

Instance 1 prediction: [ 22.41]

Predictions that the random forest model made for the two data points are quite different. But why? We can now decompose the predictions into the bias term (which is just the trainset mean) and individual feature contributions, so we see which features contributed to the difference and by how much.

We can simply call the treeinterpreter `predict` method with the model and the data.

prediction, bias, contributions = ti.predict(rf, instances)

Printing out the results:

    for i in range(len(instances)):
        print "Instance", i
        print "Bias (trainset mean)", bias[i]
        print "Feature contributions:"
        for c, feature in sorted(zip(contributions[i], boston.feature_names),
                                 key=lambda x: -abs(x[0])):
            print feature, round(c, 2)
        print "-"*20

Instance 0

Bias (trainset mean) 25.2849333333

Feature contributions:

RM 2.73

LSTAT 1.71

PTRATIO 1.27

ZN 1.04

DIS -0.7

B -0.39

TAX -0.19

CRIM -0.13

RAD 0.11

INDUS 0.06

AGE -0.02

NOX -0.01

CHAS 0.0

--------------------

Instance 1

Bias (trainset mean) 25.2849333333

Feature contributions:

RM -4.88

LSTAT 2.38

DIS 0.32

AGE -0.28

TAX -0.23

CRIM 0.16

PTRATIO 0.15

B -0.15

INDUS -0.14

CHAS -0.1

ZN -0.05

NOX -0.05

RAD -0.02

The feature contributions are sorted by their absolute impact. We can see that in the first instance, where the prediction was high, most of the positive contributions came from the RM, LSTAT and PTRATIO features. For the second instance the predicted value is much lower, since the RM feature actually has a very large negative impact that is not offset by the positive impact of other features, thus taking the prediction below the dataset mean.

But is the decomposition actually correct? This is easy to check: bias and contributions need to sum up to the predicted value:

    print prediction
    print bias + np.sum(contributions, axis=1)

[ 30.76 22.41]

[ 30.76 22.41]

Note that when summing up the contributions, we are dealing with floating point numbers, so the values can differ slightly due to rounding errors.
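
To see why the bias and contributions always add up to the prediction, it can help to do the decomposition by hand on a tiny tree. The sketch below is not treeinterpreter’s implementation: the tree, its node values and the thresholds are invented for illustration, with only the feature names borrowed from the housing data. Every node stores the mean response of the training samples reaching it, and each step down the path credits the change in that value to the feature that was split on.

```python
# A hand-built regression tree (all numbers are hypothetical).
# "value" is the mean response at the node; internal nodes also carry a split.
tree = {
    "value": 25.0, "feature": "RM", "threshold": 6.5,
    "left": {
        "value": 20.0, "feature": "LSTAT", "threshold": 15.0,
        "left": {"value": 23.0},
        "right": {"value": 15.0},
    },
    "right": {"value": 35.0},
}

def decompose(node, x):
    """Return (prediction, bias, contributions) for one instance x (a dict)."""
    bias = node["value"]  # value at the root = training set mean
    contributions = {}
    while "feature" in node:
        feature = node["feature"]
        child = node["left"] if x[feature] <= node["threshold"] else node["right"]
        # Credit the change in node value to the feature used for this split.
        contributions[feature] = contributions.get(feature, 0.0) + child["value"] - node["value"]
        node = child
    return node["value"], bias, contributions

prediction, bias, contributions = decompose(tree, {"RM": 6.0, "LSTAT": 20.0})
print(prediction, bias, contributions)
```

Because the per-step changes telescope along the path, the bias plus the summed contributions recovers the leaf value exactly; for a forest, the same quantities are averaged over the trees.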

One use case where this approach can be very useful is when comparing two datasets. For example:

- Understanding the exact reasons why estimated values are different on two datasets, for example what contributes to estimated house prices being different in two neighborhoods.
- Debugging models and/or data, for example understanding why average predicted values on newer data do not match the results seen on older data.

For this example, let’s split the remaining housing price data into two test datasets and compute the average estimated prices for them.

    ds1 = boston.data[300:400]
    ds2 = boston.data[400:]
    print np.mean(rf.predict(ds1))
    print np.mean(rf.predict(ds2))

22.1912

18.4773584906

We can see that the average predicted prices for the houses in the two datasets are quite different. We can now trivially break down the contributors to this difference: which features contribute to it and by how much.

    prediction1, bias1, contributions1 = ti.predict(rf, ds1)
    prediction2, bias2, contributions2 = ti.predict(rf, ds2)

We can now calculate the mean contribution of each feature to the difference.

    totalc1 = np.mean(contributions1, axis=0)
    totalc2 = np.mean(contributions2, axis=0)

Since biases are equal for both datasets (because the training data for the model was the same), the difference between the average predicted values has to come only from feature contributions. In other words, the sum of the feature contribution differences should be equal to the difference in average prediction, which we can trivially check to be the case:

    print np.sum(totalc1 - totalc2)
    print np.mean(prediction1) - np.mean(prediction2)

3.71384150943

3.71384150943
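
The identity follows directly from the decomposition. As a sketch, with notation introduced here purely for illustration (\(b\) for the bias, \(c_{ik}\) for the contribution of feature \(k\) to prediction \(i\), and \(D_1\), \(D_2\) for the two datasets), each prediction is \(\hat{y}_i = b + \sum_k c_{ik}\), so

\[
\frac{1}{|D_1|}\sum_{i \in D_1}\hat{y}_i - \frac{1}{|D_2|}\sum_{j \in D_2}\hat{y}_j
= (b - b) + \sum_k \left(\bar{c}^{(1)}_k - \bar{c}^{(2)}_k\right)
= \sum_k \left(\bar{c}^{(1)}_k - \bar{c}^{(2)}_k\right),
\qquad
\bar{c}^{(d)}_k = \frac{1}{|D_d|}\sum_{i \in D_d} c_{ik}.
\]

Here \(\bar{c}^{(1)}_k\) and \(\bar{c}^{(2)}_k\) correspond to the entries of `totalc1` and `totalc2`.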

Finally, we can just print out the differences of the contributions in the two datasets. The sum of these is exactly the difference between the average predictions.

    for c, feature in sorted(zip(totalc1 - totalc2, boston.feature_names), reverse=True):
        print feature, round(c, 2)

LSTAT 2.8

CRIM 0.5

RM 0.5

PTRATIO 0.09

AGE 0.08

NOX 0.03

B 0.01

CHAS -0.01

ZN -0.02

RAD -0.03

INDUS -0.03

TAX -0.08

DIS -0.14

Exactly the same method can be used for classification trees, where features contribute to the estimated probability of a given class.

We can see this on the sample iris dataset.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_iris

    iris = load_iris()
    rf = RandomForestClassifier(max_depth=4)
    idx = range(len(iris.target))
    np.random.shuffle(idx)
    rf.fit(iris.data[idx][:100], iris.target[idx][:100])

Let’s predict now for a single instance.

    instance = iris.data[idx][100:101]
    print rf.predict_proba(instance)

Breakdown of feature contributions:

    prediction, bias, contributions = ti.predict(rf, instance)
    print "Prediction", prediction
    print "Bias (trainset prior)", bias
    print "Feature contributions:"
    for c, feature in zip(contributions[0], iris.feature_names):
        print feature, c

Prediction [[ 0. 0.9 0.1]]

Bias (trainset prior) [[ 0.36 0.262 0.378]]

Feature contributions:

sepal length (cm) [-0.1228614 0.07971035 0.04315104]

sepal width (cm) [ 0. -0.01352012 0.01352012]

petal length (cm) [-0.11716058 0.24709886 -0.12993828]

petal width (cm) [-0.11997802 0.32471091 -0.20473289]

We can see that the strongest contributors to predicting the second class were petal length and width, which had the largest impact on updating the prior.

Making random forest predictions interpretable is actually pretty straightforward, and leads to a level of interpretability similar to that of linear models. With treeinterpreter (`pip install treeinterpreter`), this can be done with just a couple of lines of code.

For regression, a prediction returning a single value (typically meant to minimize the squared error) likewise does not relay any information about the underlying distribution of the data or the range of response values we might later see in the test data.

Looking at the following plots, both the left and right plot represent similar, learned models for predicting Y from X. But while the model predictions would be similar, confidence in them would be quite different for obvious reasons: we have far fewer and more spread out data points in the second case.

A useful concept for quantifying the latter issue is **prediction intervals**. A prediction interval is an estimate of an interval into which future observations will fall with a given probability. In other words, it can quantify our confidence or certainty in the prediction. Unlike confidence intervals from classical statistics, which are about a population parameter (such as the mean), prediction intervals are about individual predictions.

For linear regression, calculating the prediction intervals is straightforward (under certain assumptions like the normal distribution of the residuals) and included in most libraries, such as R’s predict method for linear models.

But how to calculate the intervals for tree based methods such as random forests?

A general method for finding prediction intervals for decision tree based methods is Quantile Regression Forests.

The idea behind quantile regression forests is simple: instead of recording the mean value of response variables in each tree leaf in the forest, record all observed responses in the leaf. The prediction can then return not just the mean of the response variables, but the full conditional distribution \(P(Y \leq y \mid X = x)\) of response values for every \(x\). Using the distribution, it is trivial to create prediction intervals for new instances simply by using the appropriate percentiles of the distribution. For example, the 95% prediction interval would be the range between the 2.5 and 97.5 percentiles of the distribution of the response variables in the leaves. And of course one could calculate other estimates on the distribution, such as the median, standard deviation, etc. Unfortunately, quantile regression forests are not very widely used. While the method is available in R’s quantregForest package, most machine learning packages do not seem to include it.
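
The percentile step can be sketched in a few lines. This is a simplification rather than a full quantile regression forest: it assumes the responses from the relevant leaves have already been pooled into one list (`leaf_responses`, made up here) and treats them all equally, whereas the full method weights observations. The `percentile` helper interpolates linearly between order statistics.

```python
def percentile(values, q):
    """q-th percentile (0-100), interpolating linearly between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

def prediction_interval(leaf_responses, coverage=95):
    """Central interval expected to contain `coverage` percent of responses."""
    tail = (100 - coverage) / 2.0
    return percentile(leaf_responses, tail), percentile(leaf_responses, 100 - tail)

# Hypothetical responses observed in the leaves that a new instance falls into.
leaf_responses = [float(v) for v in range(1, 102)]  # 1.0, 2.0, ..., 101.0

low, high = prediction_interval(leaf_responses, coverage=90)
median = percentile(leaf_responses, 50)
print(low, high, median)  # 6.0 96.0 51.0
```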

But here’s a nice thing: one can use a random forest as quantile regression forest simply by expanding the tree fully so that each leaf has exactly one value. (And expanding the trees fully is in fact what Breiman suggested in his original random forest paper.) Then a prediction trivially returns individual response variables from which the distribution can be built if the forest is large enough. One caveat is that expanding the tree fully can overfit: if that does happen, the intervals will be useless, just as the predictions. The nice thing is that just like accuracy and precision, the intervals can be cross-validated.

Let’s look at the well-known Boston housing dataset and try to create prediction intervals using vanilla random forest from scikit-learn:

    from sklearn.ensemble import RandomForestRegressor
    import numpy as np
    from sklearn.datasets import load_boston

    boston = load_boston()
    X = boston["data"]
    Y = boston["target"]
    size = len(boston["data"])

We’ll use 400 samples for training, leaving 106 samples for test. The size of the forest should be relatively large, so let’s use 1000 trees.

    trainsize = 400
    idx = range(size)
    # shuffle the data
    np.random.shuffle(idx)
    rf = RandomForestRegressor(n_estimators=1000, min_samples_leaf=1)
    rf.fit(X[idx[:trainsize]], Y[idx[:trainsize]])

We can now define a function to calculate prediction intervals for every prediction:

    def pred_ints(model, X, percentile=95):
        err_down = []
        err_up = []
        for x in range(len(X)):
            # collect the prediction of every individual tree for this instance
            preds = []
            for pred in model.estimators_:
                preds.append(pred.predict(X[x])[0])
            err_down.append(np.percentile(preds, (100 - percentile) / 2.))
            err_up.append(np.percentile(preds, 100 - (100 - percentile) / 2.))
        return err_down, err_up

Let’s compute 90% prediction intervals and test how many observations in the test set fall into the interval.

    err_down, err_up = pred_ints(rf, X[idx[trainsize:]], percentile=90)
    truth = Y[idx[trainsize:]]
    correct = 0.
    for i, val in enumerate(truth):
        if err_down[i] <= val <= err_up[i]:
            correct += 1
    print correct/len(truth)

0.905660377358

This is pretty close to what we expected: 90.6% of observations fell into the prediction intervals. Plotting the true values and predictions together with error bars visualizes this nicely.

If we set the prediction interval to be 50% for the same model and test data, we see 51% of observations fall into the interval, again very close to the expected value. Plotting the error bars again, we see that they are significantly smaller:

What can also be observed on the plot is that on average, predictions that are more accurate have smaller prediction intervals, since these are usually “easier” predictions to make. The correlation between absolute prediction error and prediction interval size is ~0.6 for this dataset.

And again, just as one can and should use cross-validation for estimating the accuracy of the model, one should also cross-validate the intervals to make sure that they give unbiased results on the particular dataset at hand. And just like one can do probability calibration, interval calibration can also be done.

There are situations when the tree is not expanded fully, such that there is more than one data point per leaf. This can happen because either

a) a node is already pure, so splitting further makes no sense, or

b) the node is not pure, but the feature vector is exactly the same for all responses, so there isn’t anything to do a further split on.

In case of a), we know the response and node size, so we still know the distribution perfectly and can use it for calculating the intervals. If case b) happens, we are in trouble, since we don’t know the distribution of responses in the non-expanded leaf. Luckily, the latter very rarely happens with real world datasets and is easy to check for.
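
Checking for case b) amounts to looking for identical feature vectors that come with different responses. A minimal sketch, assuming the training data is available as plain lists of rows and targets (the function name and toy data are invented for this example):

```python
from collections import defaultdict

def conflicting_duplicates(X, y):
    """Return feature vectors that occur with more than one distinct response."""
    responses = defaultdict(set)
    for row, target in zip(X, y):
        responses[tuple(row)].add(target)
    return {row: sorted(vals) for row, vals in responses.items() if len(vals) > 1}

X = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [3.0, 4.0]]
y = [10.0, 12.0, 7.0, 7.0]

conflicts = conflicting_duplicates(X, y)
print(conflicts)  # {(1.0, 2.0): [10.0, 12.0]}
```

An empty result means that fully grown trees can separate every training point, so case b) cannot occur on that data.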

Utilizing prediction intervals can be very beneficial in many machine learning and data science tasks, since they can tell a lot about the underlying data that we are learning about and provide a simple way to sanity check our results. While they seem to enjoy relatively widespread use for linear models due to the ease of access with these methods, they tend to be underutilized for tree-based methods such as random forests. But actually they are relatively straightforward to use (keeping the caveats in mind) by utilizing the fact that a random forest can return a conditional distribution instead of just the conditional mean. In fact, estimating the intervals this way can be more robust than prediction intervals for linear methods, since it does not rely on the assumption of normally distributed residuals.

One question that has intrigued me is whether, among the topics that are regularly on the front page, there are any that are consistently preferred over others in terms of upvotes or comments. For example, do science news get more or fewer upvotes than stories on gaming on average? Do bitcoin stories get more or fewer comments than posts on web design?

Obviously, the variance in such user feedback is huge, so any topic will have both very big and largely ignored stories. But given that tens of thousands of stories make the front page every year, we should be able to see the aggregate differences in topic popularity, if these differences exist.

Luckily, there are tools and libraries available that make answering such questions not only possible, but fairly straightforward.

- Requests – an excellent Python HTTP library for downloading post contents
- Boilerpipe – a library for extracting an article from a webpage and removing all the boilerplate
- BeautifulSoup – HTML parsing, for when boilerpipe fails
- Gensim – robust topic modelling
- Seaborn – Matplotlib extension for beautiful plots

Hckr news provides a convenient archive for accessing all posts that have made it to the front page of Hacker News, together with the number of comments and upvotes for each submission. The total number of posts for 2014 was 35981. Around 34500 of them were still accessible as of January 2015, the rest were returning 4xx or 5xx.

Note that this is different from the set of *all* Hacker News submissions, since the majority of submissions never make it to the front page.

The following plots show the distribution of upvotes and comments in Hacker News front page stories in 2014 on a log scale (for comments, what is shown is log(comments + 1) to accommodate posts with zero comments).

The following scatter-plot shows that while there is a strong correlation between upvotes and comments, there are also quite a few outliers: posts with no comments but a very significant number of upvotes and posts with high comment count but very few upvotes.

Most webpages include (a lot of) boilerplate text that is usually irrelevant to the actual story that is being linked: “popular posts”, “latest headlines”, terms of service, “follow us” links, etc. All of these pollute the content by introducing a lot of keywords that are not actually related to the article. Fortunately, there are libraries available for removing such boilerplate and extracting the actual article from the page. Boilerpipe is an excellent Java library for doing just that (and it also has Python bindings).

Removing the boilerplate and extracting the textual content is as easy as running

    from boilerpipe.extract import Extractor

    extractor = Extractor(extractor='ArticleExtractor', html=html)
    extracted_text = extractor.getText()

As usual with such heuristics, it doesn’t work on all pages. In case boilerpipe returns an empty string, I simply extract the textual content of the page (minus content from style and script tags) using BeautifulSoup.

    import BeautifulSoup

    soup = BeautifulSoup.BeautifulSoup(html)
    # Strip out style and script content
    strip_tags = ["script", "style"]
    for tag in soup(strip_tags):
        tag.extract()
    text = soup.getText(separator=" ")
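
The "try the article extractor, fall back to plain text" logic can be sketched without either library. The version below is only an illustration: it uses the standard library’s `html.parser` in place of BeautifulSoup, and `article_extractor` is a hypothetical callable standing in for boilerpipe.

```python
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def plain_text(html):
    """Textual content of the page, minus script and style content."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

def extract_article(html, article_extractor):
    """Use the article extractor first; fall back to plain text if it returns nothing."""
    text = article_extractor(html)
    return text if text.strip() else plain_text(html)

html = ("<html><head><style>p {color: red}</style></head><body>"
        "<p>Hello</p><script>var x = 1;</script><p>world</p></body></html>")

# Simulate an extractor that fails by returning an empty string.
fallback_text = extract_article(html, lambda _html: "")
print(fallback_text)  # Hello world
```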

There are a number of different methods for extracting topics from a text corpus, from simple tf-idf based heuristics to Latent semantic analysis to sophisticated generative model based approaches. The most popular among the latter is latent Dirichlet allocation.

LDA is a generative model whereby every document is assumed to come from a mixture of topics, and every topic is viewed as a multinomial distribution over words. Learning these topics is then an exercise in Bayesian inference. (Here’s a detailed technical overview paper on LDA and the necessary background on model-based machine learning.)
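
The generative story is easier to grasp when run forwards. The sketch below generates one document from two made-up topics: it draws a topic mixture from a Dirichlet distribution, then for every word picks a topic from the mixture and a word from that topic. The topics, the `alpha` values and the function names are all assumptions for illustration; inference (what Gensim and friends do) runs this process in reverse.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """topics: list of {word: probability} dicts, one per topic."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        k = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])     # pick a word from it
    return theta, words

# Two hypothetical topics with tiny vocabularies.
topics = [
    {"code": 0.5, "function": 0.3, "string": 0.2},
    {"startup": 0.6, "market": 0.4},
]

rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], n_words=10, rng=rng)
print(theta)
print(words)
```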

There are a number of libraries that implement LDA, including Vowpal Wabbit, plda, Mallet etc. The one I’ve found to strike the best balance between ease-of-use and efficiency is Gensim. It implements an online version of LDA, which converges to a good solution relatively fast. It also comes with methods for text cleaning and parsing (including stemming).

LDA assumes that the number of topics is given a priori (similar to many clustering algorithms). There are approaches available which learn the number of topics from the data, such as the hierarchical Dirichlet process. In practice however, the inferred topic counts and resulting topics are often not what’s expected. The optimal number of topics from the structural/syntactic point of view isn’t necessarily optimal from the semantic point of view. Thus in practice, running LDA with a number of different topic counts, evaluating the learned topics, and deciding whether the number of topics should be increased or decreased usually gives better results.

For this dataset, 30 topics (and running 3 passes over the dataset) seemed to strike a good balance between covering a wide enough array of domains and keeping “junk” and overlapping topics to a minimum.

    from gensim.models import LdaModel

    lda = LdaModel(corpus, num_topics=30, id2word=corpus.dictionary, passes=3)

The LDA topics are distributions over words, which naturally lends itself as a naming scheme: just take a number (for example 5-10) of most probable words in the distribution as the topic descriptor. This typically works quite well. There are interesting approaches on how to improve topic naming, for example taking into account word centrality in the word network in the corpus etc. For this particular dataset, the naive top words approach turned out to be descriptive enough.
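
The naive naming scheme is a one-liner once a topic is available as a mapping from words to probabilities. In this sketch the probabilities are invented, and `topic_name` is a helper written for this example rather than a Gensim function.

```python
def topic_name(word_probs, n_words=8):
    """Join the n_words most probable words of a topic into a descriptor."""
    top = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)[:n_words]
    return "-".join(word for word, _ in top)

# Hypothetical word distribution for a single topic.
word_probs = {
    "bitcoin": 0.05, "currency": 0.09, "btc": 0.08,
    "usd": 0.07, "fiat": 0.06, "the": 0.001,
}

name = topic_name(word_probs, n_words=5)
print(name)  # currency-btc-usd-fiat-bitcoin
```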

When looking at the Hacker News dataset, the topics LDA is able to extract from the submitted stories closely track the expected, intuitive topic distribution. There are multiple topics on programming, computer systems, startups, but also science, health, government, gaming etc., i.e. topics that are often featured.

Here’s the full list, each named by the top 8 words in the distribution, roughly grouped by their domain. (The latter is done manually and admittedly somewhat arbitrarily.)

Topic | Domain |
---|---|
function-value-code-string-return-list-int-byte | Programming, code |
memory-thread-performance-code-write-process-data-run | |
number-point-algorithm-value-example-result-set-problem | |
return-value-class-self-def-object-method-string | |
code-javascript-api-html-function-web-application-user | |
language-type-program-code-programmer-java-write-class | |
network-device-power-design-technology-system-internet-machine | Computer systems, servers, data and databases |
data-database-map-table-analysis-information-graph-model | |
server-client-http-request-service-ruby-connection-user | |
nameserver-file-net-kernel-dn-com-type-process | |
file-run-command-install-build-package-docker-version | |
science-research-human-paper-scientist-university-theory-researcher | Science, space |
space-nasa-tesla-rocket-launch-star-china-nuclear | |
project-bug-source-fix-code-open-support-software | Software development, open source |
company-business-product-market-year-startup-team-million | Startups, business |
text-color-gallery-line-image-display-file-font | (Web) design |
currency-btc-usd-fiat-bitcoin-money-price-bank | Money, bitcoin, investing |
women-health-drug-study-brain-children-patient-cell | Health |
water-air-land-light-plant-film-earth-sea | Environment |
security-key-attack-password-hacker-encryption-network-secure | Security, cryptography |
game-play-design-player-video-work-create-look | Gaming |
app-apple-phone-device-mobile-android-user-ios | Mobile, apps |
government-law-state-court-case-public-report-information | Government, law |
user-link-email-site-service-customer-post-account | Online, computer usage |
search-page-facebook-web-google-site-music-yahoo | |
google-microsoft-windows-video-browser-user-support-chrome | |
year-world-city-american-york-live-state-told | Location, travel, geography |
work-think-know-way-good-year-look-lot | General |
com-http-www-emacs-lisp-org-book-pdf | No clear theme ("junk" topics) |
php-frac-julia-theta-xwl-fff-drosophila-aubyn | |

Here are a few examples of top pages within some of the topics, i.e. pages where the proportion of the particular topic is the highest.

**Company-business-product-market-year-startup-team-million:**

- Airbnb valued at $13B ahead of staff stock sale
- SendGrid Replaces CEO
- Slack raises $120M on $1.12B valuation

**Function-value-code-string-return-list-int-byte:**

**Security-key-attack-password-hacker-encryption-network-secure:**

- How PGP Works Under the Hood
- MiniLock – File encryption software that does more with less
- Anonyfish – Chat Anonymously With Another Secret User

**Science-research-human-paper-scientist-university-theory-researcher:**

- Biefeld–Brown effect
- Solving a mystery of thermoelectrics
- ‘Solid’ light could compute previously unsolvable problems

**Game-play-design-player-video-work-create-look:**

- FPV with Oculus Rift and a Quadcopter
- 2048 game to the Atari 2600 VCS
- Unofficial Demake Port Of Super Smash Bros Arrives On TI-83/84 Calculators

**Google-microsoft-windows-video-browser-user-support-chrome:**

- 64-Bit Chrome for Windows
- Mobile Internet Explorer’s New User Agent
- Microsoft announces the Surface Pro 3

**Year-world-city-american-york-live-state-told:**

- The Return of Africa’s Strongmen
- Homeless with GoPro Cameras in SF
- Watch a Street Collapse Swallow an Entire Block of Cars in Baltimore

**Women-health-drug-study-brain-children-patient-cell:**

- Why Seven Hours of Sleep Might Be Better Than Eight
- Long-Term Culture of Stem Cells from Adult Human Liver
- Aspirin could dramatically cut cancer risk, according to biggest study yet

The top posts for each topic seem to match the topic descriptors quite well, which is a nice high level sanity check that they indeed capture the expected content of the stories.

When looking at the distributions of the topics, i.e. the average proportion of a topic in a page, we see that the general topics are at the top, while the junk, hard-to-interpret topics are at the bottom, showing that they are indeed outliers. The proportion of programming or computer system related topics are not at the top individually, but this is due to the fact that they are divided into several subtopics. On aggregate, programming and computer related topics clearly dominate.

We are finally ready to look at the upvote and comment distribution per topic. For this, we divide the upvotes and comments of each page proportionally between the topics assigned to that page. We then normalize the scores for each topic by the proportion of that topic in the whole corpus, so that each topic’s scores lie on the same scale. The following plot shows the average upvotes per topic, together with 95% confidence interval bars to make sure that the differences are statistically significant (the green line denotes the mean number of upvotes over all front page posts):
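
One way to read the procedure above is as a weighted average: each page’s upvotes are split according to its topic proportions, and each topic’s total is divided by the total share of the corpus that the topic covers. The sketch below implements that reading on made-up pages; the data layout and the name `topic_scores` are assumptions for illustration.

```python
from collections import defaultdict

def topic_scores(pages):
    """pages: list of (upvotes, {topic: proportion}); returns average upvotes per topic."""
    weighted_votes = defaultdict(float)
    topic_mass = defaultdict(float)
    for upvotes, mixture in pages:
        for topic, share in mixture.items():
            weighted_votes[topic] += upvotes * share  # split votes by topic share
            topic_mass[topic] += share                # topic's share of the corpus
    return {t: weighted_votes[t] / topic_mass[t]
            for t in weighted_votes if topic_mass[t] > 0}

# Three hypothetical front page stories.
pages = [
    (100, {"security": 1.0}),
    (40, {"security": 0.5, "science": 0.5}),
    (10, {"science": 1.0}),
]

scores = topic_scores(pages)
print(scores)  # {'security': 80.0, 'science': 20.0}
```

The same function works for comments by passing comment counts instead of upvotes.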

The most upvoted topics are those related to software and open source projects, government and law (prominent examples of the latter being posts on Snowden, the EFF, etc.), mainstream tech news (such as major announcements from Google and Microsoft) and games.

At the bottom of the scale are science, health, environment and geography related news. These are topics that aren’t the “core” topics for Hacker News (which are technology and startups) and they are in general also non-controversial, when compared to topics related to government/law, money/bitcoin etc. This is well illustrated by the plot of the average number of comments per topic:

At the top we again see mainstream tech related posts, but also money and bitcoin related news, law/government and mobile/apps stories. The topics with the fewest comments seem to be the most technical ones: science, algorithms, data/databases, code, optimizations etc.

When looking at the scatter plot of upvotes vs comments, we see that while there is a correlation between the two, the ratio can vary quite significantly. There are a few things that stand out. Bitcoin/money related news, while not that highly upvoted, gets a lot of discussion; the (open source) software topic gets the most upvotes despite being only mediocre in the amount of discussion it generates; and the programming topics cluster together, being mostly similar in generating little discussion but still getting solid upvoting from the community. And finally, science is clearly the least favorite front page topic on Hacker News.

We can see that there are clear differences between topics in terms of upvoting and commenting. In general, the results seem to match the intuitive expectations of Hacker News content quite well. First of all, the extracted topics and their descriptions are clearly in line with the intent of the page. Secondly, we see that topics that are more controversial or polarizing indeed create the most discussion, and to an extent this also translates to upvotes. At the same time, the more technical topics receive a lot fewer comments on average.

There are a number of interesting things still to look at:

- The evolution of topics over time – which topics have increased and which ones decreased in popularity over time?
- Analyzing all submissions. In this analysis, we only looked at posts that actually made it to the front page. Another interesting study would be to look at all posts that are submitted. This is however non-trivial, since important stories usually have multiple posts from different sites covering the same story. Generally only one of them gets picked up to be on the front page. For the analysis, these close duplicates would have to be filtered out.
- The structure of posts – is there anything in the post structure, word usage or similar that makes it hit big on the front page?

]]>

In this post, I’ll look at two other methods: stability selection and recursive feature elimination (RFE), which can both be considered wrapper methods. They both build on top of other (model based) selection methods such as regression or SVM, building models on different subsets of data and extracting the ranking from the aggregates.

As a wrap-up I’ll run all previously discussed methods, to highlight their pros, cons and gotchas with respect to each other.

Stability selection is a relatively novel method for feature selection, based on subsampling in combination with selection algorithms (which could be regression, SVMs or another similar method). The high level idea is to apply a feature selection algorithm on different subsets of data and with different subsets of features. After repeating the process a number of times, the selection results can be aggregated, for example by checking how many times a feature ended up being selected as important when it was in an inspected feature subset. We can expect strong features to have scores close to 100%, since they are always selected when possible. Weaker but still relevant features will also have non-zero scores, since they would be selected when stronger features are not present in the currently selected subset, while irrelevant features would have scores (close to) zero, since they would never be among the selected features.
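To make the idea concrete, here is a minimal from-scratch sketch. It is not sklearn's implementation: the helper names `lasso_cd` and `stability_scores`, the coordinate-descent lasso used as the base selector, and all parameter values are assumptions chosen for illustration. It repeatedly draws a random subset of rows and features, fits a lasso on it, and counts how often each feature gets a non-zero coefficient in the rounds where it was available.

```python
import numpy as np


def lasso_cd(X, y, alpha, n_iter=100):
    """Lasso by coordinate descent; minimizes (1/2n)||y - Xw||^2 + alpha*||w||_1."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = X[:, j] @ (y - X @ w + X[:, j] * w[j])
            # Soft-thresholding sets weak coefficients to exactly zero.
            w[j] = np.sign(rho) * max(abs(rho) - alpha * n, 0.0) / col_sq[j]
    return w


def stability_scores(X, y, alpha=0.2, n_rounds=50, row_frac=0.5,
                     col_frac=0.75, seed=0):
    """Share of rounds in which a feature was selected, given it was in the subset."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    selected = np.zeros(p)
    present = np.zeros(p)
    for _ in range(n_rounds):
        rows = rng.choice(n, size=int(row_frac * n), replace=False)
        cols = rng.choice(p, size=max(1, int(col_frac * p)), replace=False)
        Xs = X[np.ix_(rows, cols)]
        Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)  # standardize the subset
        ys = y[rows] - y[rows].mean()
        w = lasso_cd(Xs, ys, alpha)
        present[cols] += 1
        selected[cols] += (w != 0)
    return selected / np.maximum(present, 1)


# Toy data (hypothetical): two informative features, then four noise features.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 6))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)
scores = stability_scores(X_demo, y_demo)
print(np.round(scores, 2))
```

On this toy data the two informative features should be selected in essentially every round, while the noise features are picked up mainly in rounds where a strong feature was left out of the subset.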

Sklearn implements stability selection in the randomized lasso and randomized logistic regression classes.

    from sklearn.linear_model import RandomizedLasso
    from sklearn.datasets import load_boston

    boston = load_boston()

    # using the Boston housing data.
    # Data gets scaled automatically by sklearn's implementation
    X = boston["data"]
    Y = boston["target"]
    names = boston["feature_names"]

    rlasso = RandomizedLasso(alpha=0.025)
    rlasso.fit(X, Y)

    print "Features sorted by their score:"
    print sorted(zip(map(lambda x: round(x, 4), rlasso.scores_),
                     names), reverse=True)

Features sorted by their score:

[(1.0, 'RM'), (1.0, 'PTRATIO'), (1.0, 'LSTAT'), (0.62, 'CHAS'), (0.595, 'B'), (0.39, 'TAX'), (0.385, 'CRIM'), (0.25, 'DIS'), (0.22, 'NOX'), (0.125, 'INDUS'), (0.045, 'ZN'), (0.02, 'RAD'), (0.015, 'AGE')]

As you can see from the example, the top 3 features have equal scores of 1.0, meaning they were always selected as useful features (of course this could and would change when changing the regularization parameter, but sklearn’s randomized lasso implementation can choose a good \(\alpha\) parameter automatically). The scores drop smoothly from there, and in general the drop-off is not as sharp as is often the case with pure lasso or random forest. This means stability selection is useful both for pure feature selection to reduce overfitting and for data interpretation: in general, good features won’t get 0 as coefficients just because there are similar, correlated features in the dataset (as is the case with lasso). For feature selection, I’ve found it to be among the top performing methods for many different datasets and settings.

Recursive feature elimination is based on the idea of repeatedly constructing a model (for example an SVM or a regression model) and choosing either the best or worst performing feature (for example based on coefficients), setting the feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted. Features are then ranked according to when they were eliminated. As such, it is a greedy optimization for finding the best performing subset of features.

The stability of RFE depends heavily on the type of model that is used for feature ranking at each iteration. Just as non-regularized regression can be unstable, so can RFE when utilizing it, while using ridge regression can provide more stable results.
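The elimination loop itself is short. The following is a minimal sketch, assuming ordinary least squares on standardized features as the ranking model and dropping the feature with the smallest absolute coefficient at each step; `rfe_ranking` is a hypothetical helper, not sklearn's `RFE`.

```python
import numpy as np


def rfe_ranking(X, y):
    """Rank features by recursive elimination; rank 1 is eliminated last (best)."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # make coefficients comparable
    y = y - y.mean()
    remaining = list(range(X.shape[1]))
    ranking = np.zeros(X.shape[1], dtype=int)
    while remaining:
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        worst = int(np.argmin(np.abs(coef)))
        # The feature dropped now gets the worst rank still unassigned.
        ranking[remaining[worst]] = len(remaining)
        del remaining[worst]
    return ranking


# Toy data (hypothetical): decreasing true effect sizes, last feature is noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (5 * X_demo[:, 0] + 2 * X_demo[:, 1] + 0.5 * X_demo[:, 2]
          + rng.normal(scale=0.1, size=300))
ranking = rfe_ranking(X_demo, y_demo)
print(ranking)
```

Swapping the least squares fit for a regularized model only changes the line that computes `coef`, which is where the stability of the base model enters.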

Sklearn provides RFE for recursive feature elimination and RFECV for finding the ranks together with the optimal number of features via a cross validation loop.

    from sklearn.datasets import load_boston
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression

    boston = load_boston()
    X = boston["data"]
    Y = boston["target"]
    names = boston["feature_names"]

    # use linear regression as the model
    lr = LinearRegression()
    # rank all features, i.e continue the elimination until the last one
    rfe = RFE(lr, n_features_to_select=1)
    rfe.fit(X,Y)

    print "Features sorted by their rank:"
    print sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names))

Features sorted by their rank:

[(1.0, 'NOX'), (2.0, 'RM'), (3.0, 'CHAS'), (4.0, 'PTRATIO'), (5.0, 'DIS'), (6.0, 'LSTAT'), (7.0, 'RAD'), (8.0, 'CRIM'), (9.0, 'INDUS'), (10.0, 'ZN'), (11.0, 'TAX'), (12.0, 'B'), (13.0, 'AGE')]

I’ll now take all the examples from this post and the three previous ones, and run the methods on a sample dataset to compare them side by side. The dataset will be the so-called Friedman #1 regression dataset (from Friedman’s Multivariate Adaptive Regression Splines paper). The data is generated according to the formula \(y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \epsilon\), where \(x_1\) to \(x_5\) are drawn from a uniform distribution and \(\epsilon\) is the standard normal deviate \(N(0,1)\). Additionally, the original dataset had five noise variables \(x_6,…,x_{10}\), independent of the response variable. We will increase the number of variables further and add four variables \(x_{11},…,x_{14}\), each of which is very strongly correlated with \(x_1,…,x_4\), respectively, generated by \(f(x) = x + N(0, 0.01)\). This yields a correlation coefficient of more than 0.999 between the variables. This will illustrate how different feature ranking methods deal with correlations in the data.
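The correlation figure can be checked by hand. For independent noise, the correlation between \(x\) and \(x + \text{noise}\) is \(\sigma_x / \sqrt{\sigma_x^2 + \sigma_{noise}^2}\), and a uniform variable on \([0, 1]\) has variance \(1/12\). The sketch below assumes the second argument of \(N(0, 0.01)\) is a standard deviation (the text does not say), and also evaluates the value .025 that appears in the code listing.

```python
import math


def corr_with_noisy_copy(var_x, noise_std):
    """Correlation between x and x + independent noise with the given std."""
    return math.sqrt(var_x / (var_x + noise_std ** 2))


var_uniform = 1.0 / 12.0  # variance of Uniform(0, 1)

corr_001 = corr_with_noisy_copy(var_uniform, 0.01)
corr_0025 = corr_with_noisy_copy(var_uniform, 0.025)
print(round(corr_001, 4), round(corr_0025, 4))
```

Under that reading, a noise standard deviation of 0.01 gives a correlation just above 0.999, while 0.025 gives roughly 0.996, so both produce near-duplicate features.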

We’ll run each of the above listed methods on the dataset and normalize the scores so that they are between 0 (for the lowest ranked feature) and 1 (for the highest ranked feature). For recursive feature elimination, the top five features will all get score 1, with the rest of the ranks spaced equally between 0 and 1 according to their rank.
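As a small worked example of this normalization, here is a plain-Python sketch of min-max scaling applied to RFE ranks. The helper `min_max` is hypothetical. It relies on sklearn's RFE assigning rank 1 to every selected feature, so with 14 features and five selected the ranks run from 1 to 10, and flipping the sign (`order=-1`) makes the best rank map to 1.

```python
def min_max(values, order=1):
    """Scale values to [0, 1]; order=-1 flips them so that small ranks score high."""
    signed = [order * v for v in values]
    lo, hi = min(signed), max(signed)
    if hi == lo:
        # All values equal: nothing to spread out.
        return [0.0 for _ in signed]
    return [round((v - lo) / (hi - lo), 2) for v in signed]


# Five selected features share rank 1, the other nine get ranks 2..10.
rfe_ranks = [1] * 5 + list(range(2, 11))
rfe_scores = min_max(rfe_ranks, order=-1)
print(rfe_scores)
```

The eliminated features end up at multiples of 1/9 (0.89, 0.78, 0.67 and so on), which is why the RFE column in the results only contains that fixed set of values.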

    from sklearn.datasets import load_boston
    from sklearn.linear_model import (LinearRegression, Ridge,
                                      Lasso, RandomizedLasso)
    from sklearn.feature_selection import RFE, f_regression
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.ensemble import RandomForestRegressor
    import numpy as np
    from minepy import MINE

    np.random.seed(0)

    size = 750
    X = np.random.uniform(0, 1, (size, 14))

    # "Friedman #1" regression problem, with one noise draw per sample
    Y = (10 * np.sin(np.pi*X[:,0]*X[:,1]) + 20*(X[:,2] - .5)**2 +
         10*X[:,3] + 5*X[:,4] + np.random.normal(0, 1, size))
    # Add 4 additional correlated variables (correlated with X1-X4)
    X[:,10:] = X[:,:4] + np.random.normal(0, .025, (size,4))

    names = ["x%s" % i for i in range(1,15)]

    ranks = {}

    def rank_to_dict(ranks, names, order=1):
        # scale scores to [0, 1]; order=-1 flips ranks so that rank 1 maps to 1
        minmax = MinMaxScaler()
        ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
        ranks = map(lambda x: round(x, 2), ranks)
        return dict(zip(names, ranks))

    lr = LinearRegression(normalize=True)
    lr.fit(X, Y)
    ranks["Linear reg"] = rank_to_dict(np.abs(lr.coef_), names)

    ridge = Ridge(alpha=7)
    ridge.fit(X, Y)
    ranks["Ridge"] = rank_to_dict(np.abs(ridge.coef_), names)

    lasso = Lasso(alpha=.05)
    lasso.fit(X, Y)
    ranks["Lasso"] = rank_to_dict(np.abs(lasso.coef_), names)

    rlasso = RandomizedLasso(alpha=0.04)
    rlasso.fit(X, Y)
    ranks["Stability"] = rank_to_dict(np.abs(rlasso.scores_), names)

    # stop the search when 5 features are left (they will get equal scores)
    rfe = RFE(lr, n_features_to_select=5)
    rfe.fit(X,Y)
    ranks["RFE"] = rank_to_dict(map(float, rfe.ranking_), names, order=-1)

    rf = RandomForestRegressor()
    rf.fit(X,Y)
    ranks["RF"] = rank_to_dict(rf.feature_importances_, names)

    f, pval = f_regression(X, Y, center=True)
    ranks["Corr."] = rank_to_dict(f, names)

    mine = MINE()
    mic_scores = []
    for i in range(X.shape[1]):
        mine.compute_score(X[:,i], Y)
        m = mine.mic()
        mic_scores.append(m)

    ranks["MIC"] = rank_to_dict(mic_scores, names)

    # mean score per feature across all methods
    r = {}
    for name in names:
        r[name] = round(np.mean([ranks[method][name]
                                 for method in ranks.keys()]), 2)

    methods = sorted(ranks.keys())
    ranks["Mean"] = r
    methods.append("Mean")

    print "\t%s" % "\t".join(methods)
    for name in names:
        print "%s\t%s" % (name, "\t".join(map(str,
                          [ranks[method][name] for method in methods])))

Here’s the resulting table, with the results from each method and the mean:

Feature | Lin. corr. | Linear reg. | Lasso | MIC | RF | RFE | Ridge | Stability | Mean |
---|---|---|---|---|---|---|---|---|---|
x1 | 0.3 | 1.0 | 0.79 | 0.39 | 0.18 | 1.0 | 0.77 | 0.61 | 0.63 |
x2 | 0.44 | 0.56 | 0.83 | 0.61 | 0.24 | 1.0 | 0.75 | 0.7 | 0.64 |
x3 | 0.0 | 0.5 | 0.0 | 0.34 | 0.01 | 1.0 | 0.05 | 0.0 | 0.24 |
x4 | 1.0 | 0.57 | 1.0 | 1.0 | 0.45 | 1.0 | 1.0 | 1.0 | 0.88 |
x5 | 0.1 | 0.27 | 0.51 | 0.2 | 0.04 | 0.78 | 0.88 | 0.6 | 0.42 |
x6 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.44 | 0.05 | 0.0 | 0.06 |
x7 | 0.01 | 0.0 | 0.0 | 0.07 | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 |
x8 | 0.02 | 0.03 | 0.0 | 0.05 | 0.0 | 0.56 | 0.09 | 0.0 | 0.09 |
x9 | 0.01 | 0.0 | 0.0 | 0.09 | 0.0 | 0.11 | 0.0 | 0.0 | 0.03 |
x10 | 0.0 | 0.01 | 0.0 | 0.04 | 0.0 | 0.33 | 0.01 | 0.0 | 0.05 |
x11 | 0.29 | 0.6 | 0.0 | 0.43 | 0.14 | 1.0 | 0.59 | 0.39 | 0.43 |
x12 | 0.44 | 0.14 | 0.0 | 0.71 | 0.12 | 0.67 | 0.68 | 0.42 | 0.4 |
x13 | 0.0 | 0.48 | 0.0 | 0.23 | 0.01 | 0.89 | 0.02 | 0.0 | 0.2 |
x14 | 0.99 | 0.0 | 0.16 | 1.0 | 1.0 | 0.22 | 0.95 | 0.53 | 0.61 |

The example should highlight some of the interesting characteristics of the different methods.

With **linear correlation** (Lin. corr.), each feature is evaluated independently, so the scores for features \(x_1…x_4\) are very similar to \(x_{11}…x_{14}\), while the noise features \(x_6…x_{10}\) are correctly identified to have almost no relation with the response variable. It’s not able to identify any relationship between \(x_3\) and the response variable, since the relationship is quadratic (in fact, this applies to almost all other methods except for MIC). It’s also clear that while the method is able to measure the linear relationship between each feature and the response variable, it is not optimal for selecting the top performing features for improving the generalization of a model, since all top performing features would essentially be picked twice.

**Lasso** picks out the top performing features, while forcing other features to be close to zero. It is clearly useful when reducing the number of features is required, but not necessarily for data interpretation (since it might lead one to believe that features \(x_{11}…x_{13}\) do not have a strong relationship with the output variable).

**MIC** is similar to the correlation coefficient in treating all features “equally”. Additionally, it is able to find the non-linear relationship between \(x_3\) and the response.

**Random forest’s** impurity based ranking is typically aggressive in the sense that there is a sharp drop-off of scores after the first few top ones. This can be seen from the example, where the third ranked feature already has a 4x smaller score than the top feature (whereas for the other ranking methods, the drop-off is clearly not that aggressive).

**Ridge regression** forces regression coefficients to spread out similarly between correlated variables. This is clearly visible in the example where \(x_{11}…x_{14}\) are close to \(x_1…x_4\) in terms of scores.
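The spreading effect of ridge can be seen on a tiny synthetic case. This is a sketch under assumptions: the closed-form helper `ridge_coef` (no intercept), the two nearly identical features, and the penalty value are all made up for illustration and are unrelated to the Friedman data above.

```python
import numpy as np


def ridge_coef(X, y, alpha):
    """Closed-form ridge solution without intercept: (X'X + alpha*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)


rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + rng.normal(scale=0.01, size=500)  # near-duplicate of x0
X_demo = np.column_stack([x0, x1])
y_demo = 2 * x0 + rng.normal(size=500)      # only x0 enters the true model

ols_w = ridge_coef(X_demo, y_demo, alpha=0.0)     # unregularized, can swing wildly
ridge_w = ridge_coef(X_demo, y_demo, alpha=10.0)  # splits the effect roughly evenly
print(np.round(ols_w, 2), np.round(ridge_w, 2))
```

With the penalty in place, the total effect of about 2 is shared almost equally between the two copies, whereas the unregularized split between them is driven largely by noise.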

**Stability selection** is often able to make a useful compromise between data interpretation and top feature selection for model improvement. This is illustrated well in the example. Just like Lasso it is able to identify the top features (\(x_1\), \(x_2\), \(x_4\), \(x_5\)). At the same time their correlated shadow variables also get a high score, illustrating their relation with the response.

Feature ranking can be incredibly useful in a number of machine learning and data mining scenarios. The key though is to have the end goal clearly in mind and understand which method works best for achieving it. When selecting top features for model performance improvement, it is easy to verify if a particular method works well against alternatives simply by doing cross-validation. It’s not as straightforward when using feature ranking for data interpretation, where stability of the ranking method is crucial and a method that doesn’t have this property (such as lasso) could easily lead to incorrect conclusions. What can help there is subsampling the data and running the selection algorithms on the subsets. If the results are consistent across the subsets, it is relatively safe to trust the stability of the method on this particular data and therefore straightforward to interpret the data in terms of the ranking.
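One way to run such a consistency check is sketched below. The helper names are hypothetical, and absolute Pearson correlation stands in for whatever ranking method is being examined. The ranking is computed on several random subsets and the rankings are compared with the Spearman rank correlation, averaged over all pairs of subsets.

```python
import numpy as np


def abs_corr_scores(X, y):
    """Placeholder ranking method: absolute Pearson correlation with the target."""
    return np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])


def ranking_stability(score_fn, X, y, n_subsets=10, frac=0.5, seed=0):
    """Mean pairwise Spearman correlation between rankings from random subsets."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    ranks = []
    for _ in range(n_subsets):
        rows = rng.choice(n, size=int(frac * n), replace=False)
        scores = score_fn(X[rows], y[rows])
        # Double argsort turns scores into ranks (assumes no exact ties).
        ranks.append(np.argsort(np.argsort(scores)))
    # Spearman correlation is the Pearson correlation of the rank vectors.
    corr = np.corrcoef(np.array(ranks))
    return corr[np.triu_indices(n_subsets, k=1)].mean()


# Toy data (hypothetical) with three clearly separated effect sizes.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(600, 3))
y_demo = (5 * X_demo[:, 0] + 3 * X_demo[:, 1] + X_demo[:, 2]
          + rng.normal(scale=0.5, size=600))
stability = ranking_stability(abs_corr_scores, X_demo, y_demo)
print(round(stability, 3))
```

A value near 1 means the subsets agree on the ordering. Values well below that suggest the ranking should not be read too literally on this data.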

]]>In this post, I’ll discuss random forests, another popular approach for feature ranking.

Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.

A random forest consists of a number of decision trees. Every node in the decision trees is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is called impurity. For classification, it is typically either Gini impurity or information gain/entropy, and for regression trees it is variance. Thus when training a tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure.
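For a single regression split, the quantity being accumulated can be written down directly. This is a minimal sketch with hypothetical helper names, showing the variance of the parent node minus the size-weighted variances of the two children. A full implementation would additionally weight each node by the share of samples reaching it and sum per feature over all nodes and trees.

```python
def variance(values):
    """Population variance, the impurity measure for regression trees."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(y_left, y_right):
    """Parent variance minus the size-weighted variance of the two children."""
    parent = y_left + y_right
    n = len(parent)
    return (variance(parent)
            - len(y_left) / n * variance(y_left)
            - len(y_right) / n * variance(y_right))


# A split that separates the responses perfectly removes all the variance...
good_split = impurity_decrease([1.0, 1.0], [3.0, 3.0])
# ...while a split that leaves both children as mixed as the parent removes none.
useless_split = impurity_decrease([1.0, 3.0], [1.0, 3.0])
print(good_split, useless_split)
```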

This is the feature importance measure exposed in sklearn’s Random Forest implementations (random forest classifier and random forest regressor).

    from sklearn.datasets import load_boston
    from sklearn.ensemble import RandomForestRegressor
    import numpy as np

    # Load boston housing dataset as an example
    boston = load_boston()
    X = boston["data"]
    Y = boston["target"]
    names = boston["feature_names"]

    rf = RandomForestRegressor()
    rf.fit(X, Y)

    print "Features sorted by their score:"
    print sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_),
                     names), reverse=True)

Features sorted by their score:

[(0.5298, 'LSTAT'), (0.4116, 'RM'), (0.0252, 'DIS'), (0.0172, 'CRIM'), (0.0065, 'NOX'), (0.0035, 'PTRATIO'), (0.0021, 'TAX'), (0.0017, 'AGE'), (0.0012, 'B'), (0.0008, 'INDUS'), (0.0004, 'RAD'), (0.0001, 'CHAS'), (0.0, 'ZN')]

There are a few things to keep in mind when using the impurity based ranking. Firstly, feature selection based on impurity reduction is biased towards preferring variables with more categories (see Bias in random forest variable importance measures). Secondly, when the dataset has two (or more) correlated features, then from the point of view of the model, any of these correlated features can be used as the predictor, with no concrete preference of one over the others. But once one of them is used, the importance of others is significantly reduced since effectively the impurity they can remove is already removed by the first feature. As a consequence, they will have a lower reported importance. This is not an issue when we want to use feature selection to reduce overfitting, since it makes sense to remove features that are mostly duplicated by other features. But when interpreting the data, it can lead to the incorrect conclusion that one of the variables is a strong predictor while the others in the same group are unimportant, while actually they are very close in terms of their relationship with the response variable.

The effect of this phenomenon is somewhat reduced thanks to random selection of features at each node creation, but in general the effect is not removed completely. In the following example, we have three correlated variables \(X_0, X_1, X_2\), and no noise in the data, with the output variable simply being the sum of the three features:

    size = 10000
    np.random.seed(seed=10)

    # three noisy copies of the same underlying variable
    X_seed = np.random.normal(0, 1, size)
    X0 = X_seed + np.random.normal(0, .1, size)
    X1 = X_seed + np.random.normal(0, .1, size)
    X2 = X_seed + np.random.normal(0, .1, size)
    X = np.array([X0, X1, X2]).T
    Y = X0 + X1 + X2

    rf = RandomForestRegressor(n_estimators=20, max_features=2)
    rf.fit(X, Y)

    print "Scores for X0, X1, X2:", map(lambda x: round(x, 3),
                                        rf.feature_importances_)

Scores for X0, X1, X2: [0.278, 0.66, 0.062]

When we compute the feature importances, we see that \(X_1\) is computed to have over 10x higher importance than \(X_2\), while their “true” importance is very similar. This happens despite the fact that the data is noiseless, we use 20 trees, random selection of features (at each split, only two of the three features are considered) and a sufficiently large dataset.

One thing to point out though is that the difficulty of interpreting the importance/ranking of correlated variables is not random forest specific, but applies to most model based feature selection methods.

Another popular feature selection method is to directly measure the impact of each feature on accuracy of the model. The general idea is to permute the values of each feature and measure how much the permutation decreases the accuracy of the model. Clearly, for unimportant variables, the permutation should have little to no effect on model accuracy, while permuting important variables should significantly decrease it.
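The idea does not depend on the model being a random forest. As a minimal model-agnostic sketch (the helper `permutation_drop`, the least squares stand-in model and the toy data are assumptions made for illustration), the relative drop in \(R^2\) after shuffling one column at a time can be computed for any prediction function:

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def permutation_drop(predict, X_test, y_test, n_repeats=10, seed=0):
    """Mean relative drop in R^2 when each feature column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    base = r2(y_test, predict(X_test))
    importances = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link to y
            drops.append((base - r2(y_test, predict(X_perm))) / base)
        importances[j] = np.mean(drops)
    return importances


# Toy data (hypothetical): strong feature, weak feature, pure noise.
rng = np.random.default_rng(42)
X_all = rng.normal(size=(500, 3))
y_all = 4 * X_all[:, 0] + X_all[:, 1] + rng.normal(scale=0.1, size=500)
X_train, X_test = X_all[:350], X_all[350:]
y_train, y_test = y_all[:350], y_all[350:]

# Least squares stands in for the model; any predict function would do.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
importances = permutation_drop(lambda A: A @ coef, X_test, y_test)
print(np.round(importances, 3))
```

Note that the relative drop can exceed 1, since predictions on a shuffled strong feature can be worse than predicting the mean, which makes \(R^2\) negative.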

This method is not directly exposed in sklearn, but it is straightforward to implement it. Continuing from the previous example of ranking the features in the Boston housing dataset:

    from sklearn.cross_validation import ShuffleSplit
    from sklearn.metrics import r2_score
    from collections import defaultdict

    X = boston["data"]
    Y = boston["target"]

    rf = RandomForestRegressor()
    scores = defaultdict(list)

    # crossvalidate the scores on a number of different random splits of the data
    for train_idx, test_idx in ShuffleSplit(len(X), 100, .3):
        X_train, X_test = X[train_idx], X[test_idx]
        Y_train, Y_test = Y[train_idx], Y[test_idx]
        r = rf.fit(X_train, Y_train)
        acc = r2_score(Y_test, rf.predict(X_test))
        for i in range(X.shape[1]):
            # shuffle one feature column and measure the relative drop in R^2
            X_t = X_test.copy()
            np.random.shuffle(X_t[:, i])
            shuff_acc = r2_score(Y_test, rf.predict(X_t))
            scores[names[i]].append((acc-shuff_acc)/acc)

    print "Features sorted by their score:"
    print sorted([(round(np.mean(score), 4), feat) for
                  feat, score in scores.items()], reverse=True)

Features sorted by their score:

[(0.7276, 'LSTAT'), (0.5675, 'RM'), (0.0867, 'DIS'), (0.0407, 'NOX'), (0.0351, 'CRIM'), (0.0233, 'PTRATIO'), (0.0168, 'TAX'), (0.0122, 'AGE'), (0.005, 'B'), (0.0048, 'INDUS'), (0.0043, 'RAD'), (0.0004, 'ZN'), (0.0001, 'CHAS')]

In this example `LSTAT` and `RM` are two features that strongly impact model performance: permuting them decreases model performance by ~73% and ~57% respectively. Keep in mind though that these measurements are made only after the model has been trained (and is depending) on all of these features. This doesn’t mean that if we train the model without one of these features, the model performance will drop by that amount, since other, correlated features can be used instead.

Random forests are a popular method for feature ranking, since they are so easy to apply: in general they require very little feature engineering and parameter tuning and mean decrease impurity is exposed in most random forest libraries. But they come with their own gotchas, especially when data interpretation is concerned. With correlated features, strong features can end up with low scores and the method can be biased towards variables with many categories. As long as the gotchas are kept in mind, there really is no reason not to try them out on your data.

Next up: Stability selection, recursive feature elimination, and an example comparing all discussed methods side by side.

]]>