
I'd like to take a look at how we as machine learning practitioners evaluate our models, and especially how we compare our work to that of others. This is a broad area with a lot of tricky corner cases. If you are in academia, you might want to compare your new idea to the current state of the art. If you are in business, you might be trying out N models and looking for the most accurate one to commercialise. In both cases, you will need to evaluate the competing models correctly in order to be sure you've got the best one. This post will show some common pitfalls.

To keep things simple, we'll pick a binary classification task with an approximately uniform class distribution. We'll train basic models only, without trying to learn anything about the data. We won't clean up the data in any way and we'll use the most basic performance measure (accuracy score).

The aim is to focus on how we report our results, not how we get to these results. We'll start with an obvious but downright incorrect evaluation method and we'll iteratively improve on it until we get to something reasonable.

First let's import some helpers and load a sample dataset. For the purposes of this post we don't need to know anything about the data.

In [1]:
# "reporting" is the helper module accompanying this post; it provides the
# helpers used below (make_models, score, train_test_split, pd, datasets)
from reporting import *
bc_dataset = datasets.load_breast_cancer()

1. Comparison on training set

The first thing one might be tempted to try is the cardinal sin of machine learning: training and evaluating a model on the same data. You rarely see this outside of introductory courses, but people new to the field are often tempted to do it.

In [2]:
X, y = bc_dataset.data, bc_dataset.target
models = make_models()
for m in models:
    # fit every model on the full dataset...
    m.fit(X, y)
# ...and then score it on the very same data
score(models, X, y)
Out[2]:
model accuracy
0 MultinomialNB 0.896309
2 KNeighborsClassifier 0.947276
1 LogisticRegression 0.957821

Based on these results, we might conclude we are on to something: accuracy in the high 90s sounds quite good! However, we are testing on the training set, so we aren't really getting an estimate of how well the model will do "in the wild". Some models are more powerful than others and have the capacity to "bend over backwards" to fit the training data well, but that is not necessarily a good strategy. When a model is optimised too aggressively for a given dataset, it can get away with "memorising" the data rather than learning the underlying patterns. Testing a model on the exact same examples it learned from therefore gives us an overly optimistic estimate of its performance.
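
To make the "memorising" point concrete, here is a quick sketch: a high-capacity model such as a decision tree will typically score perfectly on the data it was trained on, yet noticeably worse on data it has never seen (exact numbers will vary from run to run).

# A standalone illustration of "memorising": a decision tree typically fits
# its own training data perfectly, but does worse on held-out examples.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy on training data:", tree.score(X_tr, y_tr))   # usually 1.0
print("accuracy on held-out data:", tree.score(X_te, y_te))   # usually lower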

2. Comparison on a fixed test set

To address the previous problem, it would be reasonable to split the data in two, so we can train on one part and test on the other. This used to be common in the natural language processing literature, e.g. in part-of-speech tagging or parsing, where sections 1-22 of the Penn Treebank corpus were used for training and section 23 for testing.

In [3]:
# hold out 10% of the data as a fixed test set
X_tr, X_test, y_tr, y_test = train_test_split(X, y, 
                                              test_size=0.1, random_state=0)
models = make_models()
for m in models:
    m.fit(X_tr, y_tr)
score(models, X_test, y_test)
Out[3]:
model accuracy
0 MultinomialNB 0.877193
2 KNeighborsClassifier 0.894737
1 LogisticRegression 0.947368

Getting better! The scores we get are similar to before (which in itself tells us something about the data, but that's for another post). We are now "examining" our models on data they haven't seen before, so we aren't cheating any more. But how did we decide how to divide the data into a train and test set? What if we just got lucky? Let's try again with a slightly different random seed (the parameter that controls how the data is split).

In [4]:
X_tr, X_test, y_tr, y_test = train_test_split(X, y, 
                                              test_size=0.1, random_state=1)
models = make_models()
for m in models:
    m.fit(X_tr, y_tr)
score(models, X_test, y_test)
Out[4]:
model accuracy
0 MultinomialNB 0.877193
1 LogisticRegression 0.964912
2 KNeighborsClassifier 0.964912

Oh, there are some considerable changes here. Luck must be a factor, so let's try to eliminate it by using cross-validation.

3. Cross-validation and mean value

Cross-validation (CV) is an extension of the train/test split method. First, we divide the data into N chunks. Then we train a model on the first N-1 chunks and evaluate it on the last chunk. We then repeat the process, using the second-to-last chunk for evaluation and all the other chunks for training, and so on. This gives us a total of N accuracy scores, each obtained by testing on unseen data.
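
Written out by hand with scikit-learn's KFold, the procedure looks roughly like this (five chunks for brevity); the cross_val_score helper used below does essentially this for us:

# A minimal sketch of cross-validation by hand using KFold; roughly what
# cross_val_score (used below) does under the hood.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True)
scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    clf = LogisticRegression(max_iter=5000)
    clf.fit(X[train_idx], y[train_idx])                  # train on N-1 chunks
    scores.append(clf.score(X[test_idx], y[test_idx]))   # test on the held-out chunk
print(scores)  # one accuracy score per chunk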

Let's add CV to our experiment and use pandas to hold and wrangle the accuracy scores:

In [5]:
from sklearn.model_selection import cross_val_score

res = []
for clf in make_models(n=6):
    # 25-fold cross-validation, run in parallel across all CPU cores
    scores = cross_val_score(clf, bc_dataset.data, bc_dataset.target,
                             cv=25, n_jobs=-1)
    res.extend({'model': clf.__class__.__name__, 
                'accuracy': s} for s in scores)
    
df = pd.DataFrame(res)
df.groupby('model').mean().sort_values('accuracy')
Out[5]:
accuracy
model
LinearSVC 0.890988
MultinomialNB 0.894466
DecisionTreeClassifier 0.924585
KNeighborsClassifier 0.926008
GaussianNB 0.940395
LogisticRegression 0.952569

Now we have repeated our experiment 25 times, so the effect of luck has hopefully averaged out. But if we deploy the best model in the real world, we can't control for luck. Maybe it will get lucky and our users will be blown away, or maybe on a bad day accuracy will drop and our users will be underwhelmed. This raises the question: how likely is a user to be disappointed? Let's find out.

4. Cross-validation with error bars

Beyond looking at how well a model does on average, we probably also want to know how much its accuracy can vary depending on luck. Let's look at the same results again, but this time visualise how much each model's accuracy varies across the 25 tests we ran. A box plot is a good starting point here:

In [6]:
import seaborn as sns
%matplotlib inline

chart = sns.boxplot(x="model", y="accuracy", data=df)
sns.despine(trim=True)
chart.set_xticklabels(chart.get_xticklabels(), rotation=45);

Oh. The box plot shows that while all the models we tried were similar on average, some of them can be quite inconsistent, with the difference between a "good" run and a "bad" run as high as 20 percentage points. This might be an important factor to consider when deploying to production. Maybe we want a model that is slightly less accurate on average, but performs more consistently.
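
If you prefer numbers to pictures, the same spread can be read off directly from the scores dataframe, for example:

# mean, standard deviation and best/worst fold accuracy for each model
summary = df.groupby('model').accuracy.agg(['mean', 'std', 'min', 'max'])
summary['range'] = summary['max'] - summary['min']   # gap between best and worst run
summary.sort_values('std')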

A box plot like this is probably sufficient for many low-value/low-risk applications. However, we still don't have a rigorous measure of the effect luck has on a model. In other words, is the most accurate (on average) model really better than the second best, or could the difference between them be sufficiently explained by sheer luck?

5. Testing for statistical significance

Let's pick two models, one of which seems better than the other, and check if this accuracy difference can be explained by chance. To do that, we'll perform a statistical significance test:

In [7]:
from scipy.stats import ttest_ind

model1_scores = df[df.model == "GaussianNB"].accuracy
model2_scores = df[df.model == "LogisticRegression"].accuracy
ttest_ind(model1_scores, model2_scores)
Out[7]:
Ttest_indResult(statistic=-0.949993065148047, pvalue=0.3468747353057434)

The high p-value would suggest that a difference this small is likely to have been observed by chance, so we don't have enough evidence to show one model is better than the other.

WARNING: Do not do this in real life. To keep things short, I deliberately committed several major statistical sins. First, a t-test is not appropriate here because the samples are not independent and the accuracies are not normally distributed. Second, p-values have been widely criticised recently because they are often misunderstood and misused. Picking the correct statistical test is a science in its own right. I won't go into any detail here, since the aim of this post is to argue that one should do significance testing, not to show how to do it properly. I recommend this or this as a good introduction.
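
Purely as an illustration (all of the caveats above still apply), a non-parametric alternative such as the Wilcoxon signed-rank test could be run on the same scores. The pairing assumes both models were scored on the same 25 splits, which should hold here because cross_val_score was given the same data and the same cv setting for every model:

# A sketch of a non-parametric, paired alternative to the t-test above.
# Each pair of scores comes from the same train/test split, and no normality
# assumption is made. This is an illustration, not a recommendation.
from scipy.stats import wilcoxon

print(wilcoxon(model1_scores, model2_scores))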

6. All-pairs statistical significance

So we found that one of our models is not demonstrably better than the other. To be thorough, we want to check each model against every other model. We'll do this in the most naive way possible, but keep in mind that this method suffers from the multiple comparisons problem. In practice we should really be using some form of multiple-comparison correction, e.g. a Bonferroni correction (a sketch of this follows the results table below):

In [8]:
from itertools import combinations

# models ordered by mean accuracy; also used later for the critical difference diagram
df1 = df.groupby('model').mean().sort_values('accuracy')
table, insignificant = [], []

# run the same (flawed) t-test on every pair of models
for (name1, data1), (name2, data2) in combinations(df.groupby("model"), 2):
    test_res = ttest_ind(data1.accuracy, data2.accuracy)
    if test_res.pvalue > 0.05:
        insignificant.append((df1.index.get_loc(name1), df1.index.get_loc(name2)))
    table.append(
        {
            "Model A": name1, "Model B": name2, 
            "Difference": abs(data1.accuracy.mean() - data2.accuracy.mean()),
            "Significant": "❌" if test_res.pvalue > 0.05 else "✅"
        }
    )
pd.DataFrame(table)
Out[8]:
Model A Model B Difference Significant
0 DecisionTreeClassifier GaussianNB 0.015810
1 DecisionTreeClassifier KNeighborsClassifier 0.001423
2 DecisionTreeClassifier LinearSVC 0.033597
3 DecisionTreeClassifier LogisticRegression 0.027984
4 DecisionTreeClassifier MultinomialNB 0.030119
5 GaussianNB KNeighborsClassifier 0.014387
6 GaussianNB LinearSVC 0.049407
7 GaussianNB LogisticRegression 0.012174
8 GaussianNB MultinomialNB 0.045929
9 KNeighborsClassifier LinearSVC 0.035020
10 KNeighborsClassifier LogisticRegression 0.026561
11 KNeighborsClassifier MultinomialNB 0.031542
12 LinearSVC LogisticRegression 0.061581
13 LinearSVC MultinomialNB 0.003478
14 LogisticRegression MultinomialNB 0.058103
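
Before interpreting the table, a quick aside: the Bonferroni correction mentioned above could be applied to these pairwise p-values roughly like this (a sketch using statsmodels, which isn't otherwise used in this post):

# Collect the raw pairwise p-values, then adjust them for the number of
# comparisons made (Bonferroni). reject[i] is True if pair i remains
# significant after the correction.
from itertools import combinations
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

pairs, pvalues = [], []
for (name1, data1), (name2, data2) in combinations(df.groupby("model"), 2):
    pairs.append((name1, name2))
    pvalues.append(ttest_ind(data1.accuracy, data2.accuracy).pvalue)

reject, corrected, _, _ = multipletests(pvalues, alpha=0.05, method="bonferroni")
for (a, b), p, sig in zip(pairs, corrected, reject):
    print(a, "vs", b, "corrected p =", round(p, 3), "significant:", sig)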

Back to the results: we are now in a better position to tell which of the differences are "real" and which are down to chance. However, this table is very hard to read, and it's not obvious which model we should be using in production. To help with clarity, we can draw a critical difference diagram. The mean accuracy of each model is plotted as a dot on an axis, with a bold line connecting models that are not significantly different from one another.

In [9]:
from critical_difference.plot import do_plot
# plot each model's mean accuracy, joining models whose difference is not significant
do_plot(df1.accuracy, insignificant, df1.index)

We can now see clearly that our top performer is in fact part of a cluster of three models, each of which performs about as well as the other two. That will be good to know when choosing which model to roll out to production!

That's it, I hope you've learned something today. Please don't hesitate to reach out if you have any feedback or just want to talk.


The code for this post is available on GitHub.