This article explains how to draw and interpret a precision-recall curve for a classifier. I will show you step by step how it is calculated, both by hand and in scikit-learn. You will also learn how to interpret it, when to use it, and how to compare two models based on it.

In the entry on classification metrics I described the intuition behind the basic metrics for evaluating classification (precision, recall and F1) and how to apply them. However, when analyzing how well a classifier recognizes classes, we can go a step further and use graphs as well.

From this entry you will learn:

- how the precision-recall curve is created,
- how to interpret it and when to use it,
- how to plot it in scikit-learn,
- how to judge which model is better on this basis.

This entry is part of a series on measuring the quality of classification. The following have been published so far:

- Precision, Recall and F1 – Classifier Evaluation Measures
- Precision-Recall Curve: How to Plot and Interpret It
- ROC curve (todo)

The F1 measure is undoubtedly a useful construct, but data presented visually, as graphs or charts, is easier to take in. Moreover, a single aggregate value does not give the full picture of what our classifier can do. To assess its "power", a data scientist has two curves in their arsenal:

- precision-recall curve
- ROC curve.

By analyzing these graphs, we can evaluate our classifier more broadly. To plot either of them, we need the "raw" values of the probabilities (certainties) of class membership. Most often, these are the normalized values returned by the trained model before they are "projected" onto target labels. The points of these curves are created by changing the threshold value that we use to assign the target classes. For each threshold we calculate the appropriate quality measures (e.g. precision and recall) and put them on the X and Y axes.

It is important to note that we plot such a curve for **the selected class**. In binary classification this is by default the positive class, i.e. the one coded as "1". Of course, nothing prevents us from plotting such curves for each class in turn, e.g. in a multi-class problem.

Let's look at a short example. Assume we have a trained model (which one does not matter at the moment). All we care about is that, when predicting on our data (*data*), it returns an array of real numbers containing **the decision value for each object from the test set** (we can interpret it as a probability, but we don't have to). For different cutoff thresholds, we get different target class assignments:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])

# y_score is computed by our classification algorithm, e.g.
# y_score = ClassificationAlgorithm.predict(data)
y_score = np.array([0.1, 0.4, 0.35, 0.7, 0.2, 0.3, 0.6, 0.8])

# thresholding: class 1 for scores >= 0.4
y_pred = (y_score >= 0.4).astype(int)
# y_pred = [0 1 0 1 0 0 1 1]

# thresholding: class 1 for scores >= 0.6
y_pred = (y_score >= 0.6).astype(int)
# y_pred = [0 0 0 1 0 0 1 1]
```

If we apply a threshold of 0.4 and assign everything equal to or above this value to class 1 and the rest to class 0, we will get the classification: *y_pred=[0 1 0 1 0 0 1 1].*

We can choose a different threshold value, e.g. 0.6, and then the predicted labels will be different: *y_pred=[0 0 0 1 0 0 1 1]*.

Which threshold is better? We can answer this question by calculating different measures, e.g. precision and recall. As you might expect, they can differ between the two cases.
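
To make the comparison concrete, here is a minimal sketch that counts precision and recall for both thresholds by hand. The helper `precision_recall_at` is my own name, introduced only for illustration (it is not a scikit-learn function), and returning precision 1.0 when nothing is predicted positive is an assumed convention.

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.7, 0.2, 0.3, 0.6, 0.8])


def precision_recall_at(y_true, y_score, threshold):
    """Hypothetical helper: precision and recall of class 1 at one threshold."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    # Assumed convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)


p04, r04 = precision_recall_at(y_true, y_score, 0.4)
p06, r06 = precision_recall_at(y_true, y_score, 0.6)
print(f"threshold 0.4: precision={p04:.3f} recall={r04:.3f}")
print(f"threshold 0.6: precision={p06:.3f} recall={r06:.3f}")
```

Both thresholds find two of the three positive examples, but at 0.6 there is one false alarm fewer, so precision rises from 0.5 to about 0.667 while recall stays at about 0.667.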

This example shows that we can go a step further in model evaluation. Rather than evaluating just one way of labeling the data (through a fixed threshold), it is worth evaluating the "ability" of our classifier to distinguish classes at many threshold levels. A good classifier will give good results at different thresholds. Of course, there is a best threshold. The plotted curve will show the general characteristics that will help us choose a better model.

The precision-recall curve shows the relationship between precision and recall for different classifier cutoff values. We plot it for **the selected class**, and it shows the overall **ability of the classifier to recognize that class**. By default, most libraries plot it for the positive class (+1). For the selected thresholds, we put the calculated recall values on the X-axis and the precision values on the Y-axis.

Steps to determine the precision-recall curve:

1. We choose the classifier threshold values (thresholds) based on the values from the *y_score* array. We choose those that change the labeling of at least one instance (this changes precision or recall). In our case, the threshold array looks like this: [0.3, 0.35, 0.4, 0.6, 0.7, 0.8]. At each of these thresholds the labeling of our elements changes. Thresholds below 0.3, the lowest score of a positive example, are left out because recall stays at 1 for them.
2. We label the elements of *y_score* using the selected threshold. Then, for this labeling, we calculate precision and recall. This gives us the first point (recall, precision) that we can put on the graph.
3. We proceed as in step 2 for the subsequent thresholds and the labelings obtained from them, putting each resulting (recall, precision) pair on the graph.
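
The steps above can be sketched in plain NumPy. This is only an illustration of the procedure on the toy data from this article, not the way scikit-learn implements it. Starting the thresholds at the lowest positive score is my assumption, made so that the list matches the one given above.

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.7, 0.2, 0.3, 0.6, 0.8])

# Step 1: candidate thresholds are the distinct scores, starting from the
# lowest score of a positive example (below it recall stays at 1).
thresholds = np.unique(y_score)
thresholds = thresholds[thresholds >= y_score[y_true == 1].min()]

points = []  # (threshold, recall, precision)
for t in thresholds:
    # Step 2: label the elements for this threshold ...
    y_pred = y_score >= t
    tp = np.sum(y_pred & (y_true == 1))
    # ... and compute the pair (recall, precision).
    precision = tp / np.sum(y_pred)
    recall = tp / np.sum(y_true == 1)
    # Step 3: store the point and repeat for the next threshold.
    points.append((float(t), float(recall), float(precision)))

for t, r, p in points:
    print(f"threshold={t:.2f} recall={r:.3f} precision={p:.3f}")
```

In the printed points, a recall of about 0.667 appears with three different precision values (0.4, 0.5 and about 0.667), so one recall value can correspond to several precision values.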

Note that the resulting graph is not a function: for a given recall value we will often get several precision values. Also, do not assume that the curve is created "from left to right". If anything, it is created from "right to left", because for the positive class recall changes from 1 to 0 as the threshold increases. The best way to think about it is that each labeling gives us one point, which we place in two-dimensional space. Finally, we can connect these points with a line.

To generate the curve discussed above, we will use the scikit-learn library. An extended example is on my GitHub in the file precision_recall_curve.py.

```python
"""Example of computing precision recall curve"""
#%%
import matplotlib.pyplot as plt
import numpy as np
import sklearn.metrics as skm

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.7, 0.2, 0.3, 0.6, 0.8])

#%% compute precision recall for all classes
precision0, recall0, thresholds0 = skm.precision_recall_curve(y_true, 1 - y_score, pos_label=0)
precision1, recall1, thresholds1 = skm.precision_recall_curve(y_true, y_score, pos_label=1)

#%% plot curve
plt.plot(recall0, precision0, 'ro')
plt.plot(recall0, precision0, 'r', label='class 0')
plt.plot(recall1, precision1, 'bo')
plt.plot(recall1, precision1, 'b', label='class 1')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve')
plt.legend()
plt.show()
```

In the first lines we declare the data: *y_true* represents the actual labels, and *y_score* represents the numerical values of the classifier's decisions, before projection onto classes.

Then, using the function *sklearn.metrics.precision_recall_curve*, we determine the precision, recall, and threshold values for class 0 and class 1 respectively. When calculating the measures for class 0, I took a shortcut and used a small hack. When setting the argument *pos_label=0*, we also need to reverse the values from the *y_score* array (*1 - y_score*), because they were calculated assuming that the positive class is encoded as 1, not 0.
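
As a quick sanity check of this trick, the sketch below compares it with a second way of expressing the same thing: swapping the labels so that the old class 0 becomes class 1. It assumes only the `pos_label` argument of `precision_recall_curve` that is already used in the example. The variable names are mine.

```python
import numpy as np
import sklearn.metrics as skm

y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.7, 0.2, 0.3, 0.6, 0.8])

# Variant used in the article: keep the labels, flip the scores,
# and tell scikit-learn that class 0 is the positive one.
p_a, r_a, t_a = skm.precision_recall_curve(y_true, 1 - y_score, pos_label=0)

# Equivalent view: swap the labels so that the old class 0 becomes 1.
p_b, r_b, t_b = skm.precision_recall_curve(1 - y_true, 1 - y_score)

same = (
    np.allclose(p_a, p_b) and np.allclose(r_a, r_b) and np.allclose(t_a, t_b)
)
print("identical curves:", same)
```

The scores have to be flipped in both variants because a higher score must always mean "more likely the positive class", whichever class we declare positive.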

Finally, using matplotlib, we mark the points on the graph with *plt.plot(recall0, precision0, 'ro')* and then connect them with a continuous line using *plt.plot(recall0, precision0, 'r', label='class 0')*.

After running the code, we should see the following graph.

Let's start with the worst case. Suppose our classifier assigns classes at random, according to the distribution of elements in the dataset. When the ratio of positive to negative examples is 1:1, the curve will run at a precision of about 0.5 (a 50% chance of hitting the correct class).

Below are 3 examples of such curves for different numbers of elements in the set: 10, 100, 1000. Notice that the curve converges to the expected value of 0.5.

However, if we change the class distribution to a ratio of 1:2 (positive:negative), then for random assignment the curve will converge to 1/(1+2), i.e. about 0.33.
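
The reason is that when scores are unrelated to the labels, the share of true positives among the examples predicted as positive is simply the share of positives in the whole set, whatever the threshold. Below is a small simulation sketch of this. The function name `random_precision`, the sample size and the threshold of 0.5 are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # large sample so that the estimate is stable


def random_precision(pos_fraction, threshold=0.5):
    """Precision of class 1 when scores are drawn independently of the labels."""
    y_true = rng.random(N) < pos_fraction  # labels with the given class share
    y_score = rng.random(N)                # scores unrelated to the labels
    y_pred = y_score >= threshold
    return float(np.sum(y_pred & y_true) / np.sum(y_pred))


prec_balanced = random_precision(1 / 2)    # ratio 1:1
prec_one_to_two = random_precision(1 / 3)  # ratio 1:2
print(f"1:1 -> precision ~ {prec_balanced:.3f}")
print(f"1:2 -> precision ~ {prec_one_to_two:.3f}")
```

With a sample this large, the two estimates should land close to 0.5 and 0.33 respectively. Changing `threshold` changes recall but should leave the precision roughly where it is.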

Now let's move on to the next extreme case, the ideal classifier. Before you read on, please consider what such a curve might look like in the ideal case. At what level will the graph be, where does it begin and where does it end?

Well, you probably figured it out correctly. If the classifier is ideal, then the precision for different recall values should be 1. As we successively increase the threshold, every example we still label as positive is recognized correctly. The only thing that changes (decreases!) is the recall.
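
This behaviour can be checked numerically with a minimal sketch. It builds an "ideal" classifier by deriving the labels from the scores and sweeps the threshold over the scores of the positive examples only. The seed and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
y_score = rng.random(50)
# An "ideal" classifier by construction: labels are derived from the scores.
y_true = (y_score >= 0.5).astype(int)

# Sweep the thresholds over the scores of the positive examples, low to high.
thresholds = np.sort(y_score[y_true == 1])
precisions, recalls = [], []
for t in thresholds:
    y_pred = y_score >= t
    tp = np.sum(y_pred & (y_true == 1))
    precisions.append(float(tp / np.sum(y_pred)))
    recalls.append(float(tp / np.sum(y_true == 1)))

print("all precisions equal 1:", all(p == 1.0 for p in precisions))
print("recall goes from", recalls[0], "down to", recalls[-1])
```

Every threshold in this range keeps precision at exactly 1, while recall falls from 1 towards 0 as the threshold grows.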

Below is the code that generates the above graphs. The detailed code with comments is on my GitHub in the ksopyla/scikit-learn-tutorial project in the file /metrics/precision_recall_curve_edge_case.py.

```python
"""Example of computing precision recall curve for random and ideal classifier."""
# %%
import matplotlib.pyplot as plt
import numpy as np
import sklearn.metrics as skm

# %% random classifier
# set random seed for reproducibility
np.random.seed(5)

N = 10  # change this number, try: 10, 100, 1000
pos_class_prob = 1.0 / 3  # try 1.0/2 1.0/3 2.0/3

# generate N random labels from {0, 1}; positive examples are sampled
# with probability 'pos_class_prob'
y_true = np.random.choice(np.array([0, 1]), N, p=[1 - pos_class_prob, pos_class_prob])
y_score = np.random.rand(N)

precision, recall, thresholds = skm.precision_recall_curve(y_true, y_score)

# %% plot curve
plt.plot(recall, precision, 'bo')
plt.plot(recall, precision, 'b', label='class 1')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title(f'Precision-Recall curve for random classifier, {N} samples')
plt.grid(True)
plt.show()

# %% ideal classifier
# set random seed for reproducibility
np.random.seed(5)

N = 50
# in order to generate an ideal classifier, we first generate scores and then labels
y_score = np.random.rand(N)
y_true = (y_score >= 0.5).astype(int)

precision, recall, thresholds = skm.precision_recall_curve(y_true, y_score)

# %% plot curve
plt.plot(recall, precision, 'bo')
plt.plot(recall, precision, 'b', label='class 1')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title(f'Precision-Recall curve for ideal classifier, {N} samples')
plt.grid(True)
plt.show()
```

By comparing the curves for two trained models we can easily assess which one is better in a situation where one curve dominates the other.

What about a situation where there is no dominant curve and they just intersect, like below?

In such a situation, we look at which curve has the larger area under it. The better model will have the larger area. It sounds simple, but the area is not easy to read off the graph. Two measures can help us here: *sklearn.metrics.average_precision_score* and *sklearn.metrics.auc*. They calculate this area in slightly different ways: the first sums the precision at each threshold weighted by the increase in recall, while the second applies the trapezoidal rule to the curve points (see the documentation for details). Using them in our case, we get:

Model 0 average_precision=0.7180555555555556 area under curve=0.7434523809523809

Model 1 average_precision=0.7634920634920634 area under curve=0.7551587301587301
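
The two numbers reported for Model 0 can be reproduced by hand, which shows where the difference between the measures comes from. The sketch below uses plain NumPy and the Model 0 scores from this example. It reflects my understanding of the two definitions (rectangles versus trapezoids) rather than scikit-learn's actual implementation.

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_score0 = np.array([0.7, 0.45, 0.3, 0.35, 0.45, 0.7, 0.3, 0.33, 0.55, 0.8])

# Curve points for thresholds from high to low, so recall is non-decreasing.
thresholds = np.unique(y_score0)[::-1]
n_pos = np.sum(y_true)
recall = np.array([np.sum((y_score0 >= t) & (y_true == 1)) for t in thresholds]) / n_pos
precision = np.array([np.mean(y_true[y_score0 >= t]) for t in thresholds])

# Prepend the conventional starting point (recall=0, precision=1).
recall = np.concatenate(([0.0], recall))
precision = np.concatenate(([1.0], precision))

step = np.diff(recall)  # recall gained at each threshold
# Average precision: rectangles, each using the precision at its right end.
average_precision = float(np.sum(step * precision[1:]))
# Trapezoidal rule: linear interpolation between neighbouring points.
trapezoid_area = float(np.sum(step * (precision[1:] + precision[:-1]) / 2))

print(f"average precision = {average_precision:.4f}")
print(f"trapezoid area    = {trapezoid_area:.4f}")
```

The results agree with the Model 0 line above (about 0.7181 and 0.7435). The trapezoidal area is larger here because it interpolates linearly between points, so it is worth stating which of the two measures is being compared.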

The short code for the model comparison example is below, and the whole thing is on GitHub in the ksopyla/scikit-learn-tutorial project in the file /metrics/precision_recall_curve_model_comparison.py.

```python
"""Example of computing precision recall curve"""
# %%
import matplotlib.pyplot as plt
import numpy as np
import sklearn.metrics as skm

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# looking only at the curves, it is not so obvious which one is better
# output from model 0
y_score0 = np.array([0.7, 0.45, 0.3, 0.35, 0.45, 0.7, 0.3, 0.33, 0.55, 0.8])
# output from model 1
y_score1 = np.array([0.6, 0.3, 0.3, 0.55, 0.65, 0.4, 0.5, 0.33, 0.75, 0.3])

#%%
# first model
precision0, recall0, thresholds0 = skm.precision_recall_curve(y_true, y_score0)
# second model
precision1, recall1, thresholds1 = skm.precision_recall_curve(y_true, y_score1)

avg_prec0 = skm.average_precision_score(y_true, y_score0)
auc0 = skm.auc(recall0, precision0)
print(f"Model 0 average_precision={avg_prec0} area under curve={auc0}")

avg_prec1 = skm.average_precision_score(y_true, y_score1)
auc1 = skm.auc(recall1, precision1)
print(f"Model 1 average_precision={avg_prec1} area under curve={auc1}")

#%% plot curve
plt.plot(recall0, precision0, 'ro')
plt.plot(recall0, precision0, 'r', label='model 0')
plt.plot(recall1, precision1, 'bo')
plt.plot(recall1, precision1, 'b', label='model 1')

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall curve for 2 ml models')
plt.legend()
plt.show()
```

In this post I touched on the topic of assessing classifier quality using precision-recall curves. We discussed how to plot such a curve in scikit-learn, how to interpret it, and how to use it to compare machine learning models.

All examples are located on GitHub in the repository https://github.com/ksopyla/scikit-learn-tutorial.

Feel free to clone or fork it, and I would be grateful if **you gave it a star.**


