I’ll start this post by asking you a question: what metric do you use to check the quality of your classification model? You can leave your answer in a comment 🙂 I hope your answer wasn’t accuracy (and if it was, I hope you at least know when it should NOT be accuracy).

While writing this post I had doubts about whether the topic is too obvious, but on the other hand, if it helps even one person, it will already be a success!

Popular metrics in classification

Let's start by explaining which metrics we have for measuring classification performance and how to interpret them. First, let's introduce the confusion matrix (it will be useful for explaining some of the metrics). In the binary setting (only a positive and a negative class), it is a 2×2 matrix whose main diagonal contains the numbers of correctly classified observations (the green fields). The red fields contain the numbers of observations falsely classified as positive and falsely classified as negative. Such a matrix can easily be extended to any number of classes; the matrix then simply has more rows and columns. For simplicity, let's focus on the binary case.

[Figure: Confusion matrix]
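To make this concrete, here is a minimal sketch of how such a matrix can be obtained in practice (using scikit-learn's confusion_matrix on made-up labels, so the numbers are purely illustrative):

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for a binary problem
# (1 = positive class, 0 = negative class).
y_true = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]

# Rows correspond to the true class, columns to the predicted class:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```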

Metrics:

- Accuracy: the share of all observations that were classified correctly, i.e. (TP + TN) / (TP + TN + FP + FN).
- Recall (per class): how many observations from a given class were actually assigned to it, i.e. TP / (TP + FN).
- Precision (per class): how many of the observations assigned to a given class actually belong to it, i.e. TP / (TP + FP).
- F1-score (per class): the harmonic mean of precision and recall, i.e. 2 · precision · recall / (precision + recall).

As noted for accuracy, this value is not tied to any single class; it can be treated as a metric of the model's overall effectiveness. The other metrics listed are calculated "per class", which means that if we want to use them to select the best model, we must, for example, average their values across the individual classes.
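To show the difference between an "overall" metric and "per class" metrics in code, here is a small sketch (again scikit-learn, with the same toy labels as above; the data is made up purely for illustration):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]

# A single overall number, not tied to any particular class.
print("accuracy:", accuracy_score(y_true, y_pred))

# One value per class (index 0 = negative class, index 1 = positive class).
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred)
print("precision per class:", precision)
print("recall per class:   ", recall)

# To get a single number for model selection, we can e.g. average over classes.
print("macro-averaged recall:", recall.mean())
```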

We already know which popular metrics we have and how to interpret them; now it's time to decide what to use. Well, here I have to reach for what is probably data scientists' favourite saying: "it depends". First, think about what the class proportions look like in your data. If you have significantly more observations in one class than in the other, then that class will have a greater impact on the accuracy value.

I will now bring up my favourite example. Let's say you want to predict whether a given person suffers from a very rare disease. The disease is so rare that the positive class makes up only about 1% of your data set. Of course, in such a situation you can use various methods of artificially augmenting the data set, or try outlier detection instead of classic classification methods, but more about that another time. Let's focus on the situation where we absolutely want / have to do classification. In such a situation, first of all, we CANNOT use accuracy as the measure of the model's effectiveness. Why is that? For clarity, let's assume we have 990 people in the healthy class and 10 people in the sick class. Our model can simply always predict that a person is healthy and achieve 99% accuracy. Of course, such 99% "effectiveness" is absolutely worthless 😅😅
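A quick sanity check of that claim, as a toy sketch (the 990/10 split is simply the hypothetical data set from the example above):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 990 healthy people (0) and 10 sick people (1).
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts healthy.
y_pred = np.zeros_like(y_true)

# 0.99 accuracy, even though the model never finds a single sick person.
print(accuracy_score(y_true, y_pred))  # 0.99
```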

So if not accuracy, then what?

Here again, we have a lot of possibilities, and it all depends on your data. I will now try to go through the cases, showing both when it is worth using the individual metrics and the potential problems that come with them.

What if we used average recall or positive-class recall?

Yes and no. Recall tells us how many observations from a given class were actually assigned to it. That sounds great, because we immediately know how effectively we predict that a person is sick when they really are sick. So what doesn't work in this approach? Option 1, positive-class recall: again we can end up with a model that assigns everyone to one class, this time the positive one, and again we have a useless model. Maybe an average, then? After all, if the model assigned everyone to one group, the negative class would get a recall of zero, which would ruin the average. Following the average indeed greatly reduces the chance of ending up with a model fixated on a single group, but another threat lurks here. With a difference in numbers as large as the one mentioned earlier, i.e. 990 vs. 10, the model could additionally flag 90 healthy people as sick: the positive-class recall would be a perfect 100%, the negative-class recall would still be high at about 91% (900 out of 990), so the average recall would look excellent. However, we would then be in a situation where, if the model predicts that you are sick, you only have a 10% chance of actually being sick (10 true positives out of 100 positive predictions). Is that bad? Ehhhh, and I have to use those words again… it depends 😅😅
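To see those numbers in code, here is a sketch of the 990 vs. 10 scenario with the over-eager predictions constructed by hand:

```python
import numpy as np
from sklearn.metrics import recall_score, precision_score

# 990 healthy people (0) and 10 sick people (1).
y_true = np.array([0] * 990 + [1] * 10)

# The model catches all 10 sick people, but also flags 90 healthy people as sick.
y_pred = y_true.copy()
y_pred[:90] = 1  # 90 healthy people incorrectly predicted as sick

print("recall per class:", recall_score(y_true, y_pred, average=None))  # [~0.91, 1.0]
print("macro recall:", recall_score(y_true, y_pred, average="macro"))   # ~0.95
print("precision (sick):", precision_score(y_true, y_pred))             # 0.10
```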

This time, a huge number of factors matter. First of all, the severity of the disease is important. If such a model were to support the initial diagnosis of a serious disease, then a proportion of 9 out of 10 flagged people actually being healthy would probably be acceptable (better that a few more healthy people are referred for further testing than that sick ones are missed). If, on the other hand, it were, say, a test diagnosing Covid during a pandemic, I can hardly imagine sending 9 healthy people out of every 10 into quarantine. As you can see, "it depends" describes this case very well.

So maybe it’s worth focusing on precision?

Precision is an indicator that tells us about the model's "performance" in the sense that it answers what percentage of people classified as sick are actually sick. And here we return to a situation analogous to the one with recall. This time the model can be so cautious that out of 10 sick people it classifies only one as sick, but thanks to that it does not label a single healthy person as sick. In such a situation, precision for the sick class is 100%, but recall is a mere 10%. It is easy to see that recall and precision are very closely related. Can we find a way to combine the information from both of these metrics? And so we arrive at the F1-score…
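And the mirror-image situation in code, again on the same hand-constructed 990/10 data:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# 990 healthy people (0) and 10 sick people (1).
y_true = np.array([0] * 990 + [1] * 10)

# An overly cautious model: it flags only a single sick person and nobody else.
y_pred = np.zeros_like(y_true)
y_pred[-1] = 1  # the one (correctly) detected sick person

print("precision (sick):", precision_score(y_true, y_pred))  # 1.0
print("recall (sick):", recall_score(y_true, y_pred))        # 0.1
```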

F1-score is a cure for everything

F1-score is arguably the best single metric of model effectiveness when the classes are imbalanced. As the harmonic mean of precision and recall, it is high only when both precision and recall are high at the same time, while a low value of either of them drags the F1-score down. Thus, the situations described earlier, where one of recall/precision is ideal but the other is very low, translate into a low F1-score. If we are guided by the average F1-score across classes, we will most likely choose a sensible classifier. Of course, in a situation where the classes are balanced (or at least the number of observations per class is similar), accuracy will also be a good metric of effectiveness. Unfortunately, such an ideal situation is rare, and it is F1-score that should (in most situations) serve as the basic metric for assessing the effectiveness of the model.
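To close the loop, here is how the two degenerate models from the previous sections look through the lens of macro-averaged F1 (a sketch on the same made-up 990/10 data):

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0] * 990 + [1] * 10)

# Model A: flags all 10 sick people plus 90 healthy ones (recall 100%, precision 10%).
pred_a = y_true.copy()
pred_a[:90] = 1

# Model B: flags only a single sick person (precision 100%, recall 10%).
pred_b = np.zeros_like(y_true)
pred_b[-1] = 1

# Both macro F1 values land far below the 0.99 that plain accuracy suggested.
print("macro F1, model A:", f1_score(y_true, pred_a, average="macro"))
print("macro F1, model B:", f1_score(y_true, pred_b, average="macro"))
```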

Summary

Each metric carries some information, which in a particular case may even be exactly the information we are looking for. It all depends on the problem, the seriousness of the situation, and how many false positives or false negatives we can afford. Each situation should be considered individually. However, if we are simply looking for a metric that will almost always work, then F1-score is the one whose value lets us find the golden mean between the other metrics.

P.S. This summary omits the AUC metric and the ROC curve that goes with it. I think they deserve a bit more discussion. As soon as such a post appears, I'll throw a link here. In the meantime, let me know in the comments if you were *aware* of the potential problems discussed in the article.