How to draw and interpret a precision-recall curve for a classifier. In this article, I will show you step by step how to calculate it, both in theory and in practice with scikit-learn. You will also learn how to interpret it, when to use it, and how to compare two models based on it.

In the entry on classification metrics I described the intuition and application of the basic metrics for classification evaluation: precision, recall and F1. However, when analyzing a classifier's ability to distinguish classes, we can go a step further and use graphs as well.
From this entry you will learn how to plot the precision-recall curve, how to interpret it, and how to use it to compare models.

This entry is part of a series on measuring the quality of classification, the following have been published so far:

  1. Precision, Recall and F1 – Classifier Evaluation Measures
  2. Precision-Recall Curve – How to Plot and Interpret It
  3. ROC curve (todo)

An image is worth more than the F1 measure

The F1 measure is undoubtedly a useful construct, but visual data in the form of graphs speaks to us better. Moreover, a single aggregate value does not give the full picture of our classifier's capabilities. In the data scientist's arsenal we have two magic curves for assessing its "power": the precision-recall curve and the ROC curve.

By analyzing their graphs, we can evaluate our classifier from a broader perspective. To plot either of them, we need the "raw" values of the probabilities (certainties) of class membership. Most often these are normalized values produced by the training algorithm before they are "projected" onto target labels. The pairs of points for these curves are created by changing the threshold value used to assign the target classes; only then do we calculate the appropriate quality measures (e.g. precision and recall), which, for a given threshold, we put on the X and Y axes.

An important note: we plot such a curve for a selected class. By default, in binary classification this is the positive class, i.e. the one coded as "1". Of course, nothing prevents us from plotting such curves for every class in turn, e.g. in a multi-class problem, as sketched below.
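For illustration, here is a minimal sketch of the multi-class case, one curve per class in a one-vs-rest fashion. The data below (y_true, y_scores) is made up just for this example; in practice y_scores would typically come from predict_proba.

 import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import precision_recall_curve

# made-up example data: 3 classes, one score column per class (e.g. from predict_proba)
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_scores = np.random.rand(8, 3)

classes = [0, 1, 2]
y_true_bin = label_binarize(y_true, classes=classes)  # one binary column per class (one-vs-rest)

curves = {}
for i, cls in enumerate(classes):
    # treat class `cls` as positive and all other classes as negative
    precision, recall, _ = precision_recall_curve(y_true_bin[:, i], y_scores[:, i])
    curves[cls] = (recall, precision)  # points for the curve of class `cls`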

There are as many classifiers as there are thresholds

Let's look at a short example. Assume we have a trained model (we don't care which one at the moment). The only thing we care about is that, when predicting on our data, it returns an array of real numbers containing the decision value for each object in the test set (we can interpret it as a probability, but we don't have to). For different cutoff thresholds, we get different target classes. Let's look at a mini example:

 import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])

# y_score are computed by our classification algorithm
# y_score = ClassificationAlgorithm.predict(data)
y_score = np.array([0.1, 0.4, 0.35, 0.7, 0.2, 0.3, 0.6, 0.8])

# thresholding at 0.4 (scores >= 0.4 become class 1)
y_pred = (y_score>=0.4).astype(int)
# y_pred=[0 1 0 1 0 0 1 1]

# thresholding at 0.6 (scores >= 0.6 become class 1)
y_pred = (y_score>=0.6).astype(int)
# y_pred=[0 0 0 1 0 0 1 1]

If we apply a threshold of 0.4 and assign everything equal to or above this value to class 1 and the rest to class 0, we get the classification y_pred=[0 1 0 1 0 0 1 1].
If we choose a different threshold, e.g. 0.6, the distribution of predicted labels will be different: y_pred=[0 0 0 1 0 0 1 1].

Which threshold is better? We can answer this question by computing various measures, e.g. precision and recall; you will agree that their values differ in the two cases (see the snippet below).
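Below is a minimal check of this claim for the toy example above; the values in the comments are simply what the toy data implies.

 import numpy as np
import sklearn.metrics as skm

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.7, 0.2, 0.3, 0.6, 0.8])

for t in (0.4, 0.6):
    y_pred = (y_score >= t).astype(int)
    p = skm.precision_score(y_true, y_pred)
    r = skm.recall_score(y_true, y_pred)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
# threshold=0.4: precision=0.50 recall=0.67
# threshold=0.6: precision=0.67 recall=0.67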

This example shows that we can go a step further in model evaluation. It is not worth evaluating just one labeling of the data (obtained with a fixed threshold); instead we should evaluate the "ability" of our classifier to distinguish classes across many threshold levels. A good classifier will give good results at different thresholds. Of course, there is still a single best threshold, but the plotted curve shows the general characteristics that help us choose the better model.

How to draw a precision-recall curve?

The precision-recall curve shows the relationship between precision and recall for different classifier cutoff values; we plot it for a selected class. It shows us the classifier's overall ability to recognize that class. By default, most libraries plot it for the positive class (+1). On the X-axis we plot the calculated recall values and on the Y-axis the precision for the selected thresholds.

Steps to determine the precision-recall curve (a minimal manual sketch follows the list):

  1. we choose the candidate threshold values based on the values in the y_score array, picking those that change the labeling of at least one instance (this is what changes precision and recall). In our case, the threshold array will look like this: [0.3, 0.35, 0.4, 0.6, 0.7, 0.8]. At each of these thresholds the labeling of our elements changes.
  2. we label the elements using y_score and the selected threshold, then calculate precision and recall for this labeling. This gives us the first pair of points (recall, precision) that we can put on the graph.
  3. we repeat point 2 for the subsequent thresholds and the labelings obtained from them, putting each resulting (recall, precision) pair on the graph.
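Before reaching for scikit-learn, here is a minimal manual sketch of these three steps (illustrative only, not the library implementation):

 import numpy as np

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.7, 0.2, 0.3, 0.6, 0.8])

points = []
for t in np.unique(y_score):  # step 1: every unique score is a candidate threshold
    # (scikit-learn additionally stops adding thresholds once recall has reached 1)
    y_pred = (y_score >= t).astype(int)  # step 2: label at this threshold
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    points.append((tp / (tp + fn), tp / (tp + fp)))  # step 3: one (recall, precision) pair per threshold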

Note that the obtained graph is not a function: for a given recall value we will often get several precision values. Also, do not assume that the curve is drawn "from left to right" or even "from right to left" (for the positive class, as the threshold increases, recall goes from 1 to 0). The best way to think about it is that each labeling gives us a pair of points, which we place in a two-dimensional space; finally, we connect these points with a line.

Precision-recall curve in scikit-learn

To generate the discussed curve we will use the scikit-learn library. An extended example is on my GitHub in the file precision_recall_curve.py.

 """Example of computing precision recall curve
"""
#%%
import matplotlib.pyplot as plt
import numpy as np
import sklearn.metrics as skm

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.7, 0.2, 0.3, 0.6, 0.8])

#%% compute precision recall for all classes

precision0, recall0, thresholds0 = skm.precision_recall_curve(y_true, 1-y_score, pos_label=0)

precision1, recall1, thresholds1 = skm.precision_recall_curve(y_true, y_score, pos_label=1)

#%% plot curve
plt.plot(recall0, precision0, 'ro')
plt.plot(recall0, precision0, 'r', label='class 0')

plt.plot(recall1, precision1, 'bo')
plt.plot(recall1, precision1, 'b', label='class 1')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve')
plt.legend()
plt.show()

In the first lines we declare the data: y_true represents the actual labels, and y_score represents the numerical values of the classifier's decisions, before projection onto classes.

Then, using sklearn.metrics.precision_recall_curve, we determine the precision, recall and threshold values for class 0 and class 1 respectively. When calculating the measures for class 0, I took a shortcut and used a small hack: when setting the argument pos_label=0, we also have to invert the values from the y_score array (1 - y_score), because they were computed assuming that the positive class is encoded as 1, not 0.
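An alternative to this trick, if the model exposes predict_proba, is to pass the class-0 probability column directly. A sketch with assumed names (clf is a fitted binary classifier, X_test and y_test are the test data; none of these appear in the example above):

 # assumed names: clf (fitted binary classifier), X_test, y_test
proba = clf.predict_proba(X_test)  # shape (n_samples, 2); columns follow clf.classes_
precision0, recall0, thresholds0 = skm.precision_recall_curve(
    y_test, proba[:, 0], pos_label=0)  # score of class 0, so no 1 - y_score inversion is needed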

Finally, using matplotlib we mark the points on the graph with plt.plot(recall0, precision0, 'ro') and then connect them with a continuous line: plt.plot(recall0, precision0, 'r', label='class 0').
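As a side note, more recent scikit-learn versions (1.0 and later, if I remember correctly) also offer a convenience helper that plots the curve for the positive class in one call; a minimal sketch:

 # alternative plotting helper (assumes scikit-learn >= 1.0)
from sklearn.metrics import PrecisionRecallDisplay

PrecisionRecallDisplay.from_predictions(y_true, y_score, name='class 1')
plt.show()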

After running the code, we should see the following graph.

Precision-recall curve for a binary classifier.

Interpreting the precision-recall curve

A random classifier

Let's start with the worst case: a classifier that assigns labels at random according to the distribution of elements in the dataset. When the ratio of positive to negative classes is 1:1, the curve will run at a level of about 0.5 (a 50% chance of hitting the correct class).

Below are 3 examples of such curves for different numbers of elements in the set: 10, 100 and 1000. Notice that the curve converges to the expected value of 0.5.

  • Precision-recall curve for a random classifier, 10 samples
  • Precision-recall curve for a random classifier, 100 samples
  • Precision-recall curve for a random classifier, 1000 samples

However, if we change the class distribution to a 1:2 ratio (positive:negative), then for random assignment the curve will converge to a value of 0.33 (a quick check is sketched after the figures below).

  • Precision-recall curve for a random classifier, unbalanced set (1:2), 10 objects
  • Precision-recall curve for a random classifier, unbalanced set (1:2), 100 objects
  • Precision-recall curve for a random classifier, unbalanced set (1:2), 1000 objects
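A useful rule of thumb: the level that the random classifier's curve converges to is simply the fraction of positive examples in the set (the "no-skill" baseline). A quick sketch to verify it:

 import numpy as np

np.random.seed(5)
# 1:2 ratio of positives to negatives, as in the plots above
y_true = np.random.choice([0, 1], size=1000, p=[2/3, 1/3])
baseline = y_true.mean()  # fraction of positives = expected precision of a random classifier
print(f"no-skill baseline ~ {baseline:.2f}")  # about 0.33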

The perfect case

Now let's move on to the next extreme case, the ideal classifier. Before you read on, please consider what such a curve might look like in the ideal case. At what level will the graph be, where does it begin and where does it end?

Well, you probably figured it out correctly. If the classifier is ideal, the precision for every recall value is 1. As we successively increase the threshold, the recognition stays correct; the only thing that changes (decreases!) is the recall.

Precision-recall curve for the perfect model.

Below is the code that generates the above graphs; the detailed code with comments is on my GitHub in the ksopyla/scikit-learn-tutorial project in the file /metrics/precision_recall_curve_edge_case.py.

 """Example of computing precision recall curve for random and ideal
classifier.
"""
# %%

import matplotlib.pyplot as plt
import numpy as np
import sklearn.metrics as skm


# %% random classifier (class balance controlled by pos_class_prob)
# set random seed for reproducibility
np.random.seed(5)
N = 10  # change this number, try: 10, 100, 1000

pos_class_prob = 1.0/3  # try 1.0/2, 1.0/3, 2.0/3
# sample N labels (positive with probability 'pos_class_prob') and N uniform random scores
y_true = np.random.choice(np.array([0, 1]), N, p=[1-pos_class_prob, pos_class_prob])
y_score = np.random.rand(N)

precision, recall, thresholds = skm.precision_recall_curve(y_true, y_score)

# %% plot curve

plt.plot(recall, precision, 'bo')
plt.plot(recall, precision, 'b', label='class 1')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title(f'Precision-Recall curve for random classifier, {N} samples')
plt.grid(True)
plt.show()



# %% ideal classifier
# set random seed for reproducibility
np.random.seed(5)
N=50

# to generate an ideal classifier, we first generate scores and then derive the labels from them
y_score = np.random.rand(N)
y_true = (y_score >= 0.5).astype(int)

precision, recall, thresholds = skm.precision_recall_curve(y_true, y_score)

#%% plot curve
plt.plot(recall, precision, 'bo')
plt.plot(recall, precision, 'b', label='class 1')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title(f'Precision-Recall curve for ideal classifier, {N} samples')
plt.grid(True)
plt.show()

Comparing two models

By comparing the curves of two trained models, we can easily assess which one is better when one curve dominates the other.

The precision-recall curve of model 1 dominates that of model 0.

What about a situation where there is no dominant curve and they just intersect, like below?

Comparing the precision-recall curves of the two models is no longer so obvious.

In such a situation, we look at which curve has the larger area under it. Of course, the better model will have the larger area. It sounds simple, but it is not easy to read off the graph. Two additional measures can help us here: the first is sklearn.metrics.average_precision_score and the second is sklearn.metrics.auc. Each calculates this area in a slightly different way (see the documentation and the note below the output). Using them in our case, we get:

 Model 0 average_precision=0.7180555555555556 area under curve=0.7434523809523809
Model 1 average_precision=0.7634920634920634 area under curve=0.7551587301587301
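For reference (this is how the scikit-learn documentation describes them): average_precision_score computes the step-wise sum AP = Σ_n (R_n − R_{n−1}) · P_n, where P_n and R_n are the precision and recall at the n-th threshold, while auc applies the trapezoidal rule to the plotted (recall, precision) points, which interpolates linearly between them and can therefore be somewhat too optimistic.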

The short code for the model comparison example is below, and the whole thing is on GitHub in the ksopyla/scikit-learn-tutorial project in the file /metrics/precision_recall_curve_model_comparison.py.

 """Example of computing precision recall curve
"""
# %%

import matplotlib.pyplot as plt
import numpy as np
import sklearn.metrics as skm

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# looking only at the curves, it is not so obvious which one is better
# output from model 0
y_score0 = np.array([0.7, 0.45, 0.3, 0.35, 0.45, 0.7, 0.3, 0.33, 0.55, 0.8])
# output from model 1
y_score1 = np.array([0.6, 0.3, 0.3, 0.55, 0.65, 0.4, 0.5, 0.33, 0.75, 0.3])

#%%
# first model
precision0, recall0, thresholds0 = skm.precision_recall_curve(y_true, y_score0)
# second model
precision1, recall1, thresholds1 = skm.precision_recall_curve(y_true, y_score1)

avg_prec0 = skm.average_precision_score(y_true, y_score0)
auc0 = skm.auc(recall0, precision0)
print(f"Model 0 average_precision={avg_prec0} area under curve={auc0}")

avg_prec1 = skm.average_precision_score(y_true, y_score1)
auc1 = skm.auc(recall1, precision1)
print(f"Model 1 average_precision={avg_prec1} area under curve={auc1}")


#%% plot curve
plt.plot(recall0, precision0, 'ro')
plt.plot(recall0, precision0, 'r', label='model 0')

plt.plot(recall1, precision1, 'bo')
plt.plot(recall1, precision1, 'b', label='model 1')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall curve for 2 ml models')
plt.legend()
plt.show()

Summary

In this post I touched on the topic of assessing classifier quality using precision-recall curves. We discussed how to plot such a curve in scikit-learn, how to interpret it, and how to use it to compare machine learning models.

All examples are located on GitHub in the repository https://github.com/ksopyla/scikit-learn-tutorial.
Clone or fork the repository, and I would be grateful if you gave it a star.
