Correlation comes up in all kinds of analyses. At the same time, we (the authors) rarely stop to consider that for someone reading or listening, the concept of "correlation" may not be obvious at all. If you are such a person, this article is for you 😊😊

Correlation is said to be **negative** when it is less than zero and **positive** when it is greater than zero, but what does that really mean? By **correlation** we most often mean **a linear relationship** (there are more types of correlation in general, but more on that later). **What does "linear relationship" actually mean?**

Let's take height as an example (**NOTE!** Take this example with a grain of salt, because of course several other variables matter, and **the data itself is made up by me**). I am 158 cm tall, and my mother is 160 cm. I know someone who is 164 cm, and her mother is 162 cm. Now let's stand on the street and ask people for two values: how tall they are and how tall their mother is. This gives us two variables: the child's height and the mother's height. **We can present the collected information on a graph.**

We can also see that the “dots” are arranged along a straight line, let’s draw it!

The dots do not match the line perfectly - if they did, the correlation would be exactly 1, but here it is 0.91 (still quite high). The correlation value is calculated using the formula:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / ( √Σ(xᵢ − x̄)² · √Σ(yᵢ − ȳ)² )

where x̄ and ȳ are the means of the two variables - equivalently, the covariance of the variables divided by the product of their standard deviations.
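To see the formula in action, here is a small sketch in Python with NumPy. The height values are invented for illustration (in the spirit of the street survey above), and the hand-rolled computation is checked against NumPy's built-in `corrcoef`:

```python
import numpy as np

# Made-up heights in cm (illustrative data, not a real survey)
child = np.array([158.0, 164.0, 170.0, 175.0, 180.0, 168.0, 172.0, 185.0])
mother = np.array([160.0, 162.0, 166.0, 170.0, 172.0, 165.0, 167.0, 176.0])

# Pearson correlation straight from the formula:
# r = sum((x - x_mean)(y - y_mean)) / (sqrt(sum((x - x_mean)^2)) * sqrt(sum((y - y_mean)^2)))
x = child - child.mean()
y = mother - mother.mean()
r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(round(r, 2))                        # strong positive correlation
print(np.corrcoef(child, mother)[0, 1])   # NumPy agrees with the manual result
```

Because the made-up points lie almost exactly on a line, the resulting coefficient comes out close to 1.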

Here we have an example **of positive correlation**: as one variable increases, so does the other, and the fitted **straight line "goes up"**. With negative correlation, as one variable increases the other decreases, and the line "goes down".

So we have discussed the direction of correlation, but what about its strength? This is where it gets interesting, because depending on which source we reach for, we will get slightly different information 😅

I most often encounter 2 divisions:

The first is less detailed:

- |corr| < 0.3 – weak correlation
- 0.3 ≤ |corr| < 0.7 – moderate correlation
- |corr| ≥ 0.7 – strong correlation

The second one is more detailed:

- |corr| < 0.2 – weak correlation
- 0.2 ≤ |corr| < 0.4 – low correlation
- 0.4 ≤ |corr| < 0.6 – moderate correlation
- 0.6 ≤ |corr| < 0.8 – high correlation
- 0.8 ≤ |corr| < 0.9 – very high correlation
- 0.9 ≤ |corr| ≤ 1.0 – almost complete correlation

In my opinion, **neither division is better or worse, more or less correct**. Both are fine; they simply differ in their level of detail, and which one we reach for depends on why we are assessing the strength of the correlation. Very often, when analyzing data (including analyzing correlations between variables), **the less detailed division is enough**. This is especially true when we build a regression model, where the absence of a linear relationship between the independent variables is an important issue. **There we may even consider |corr| > 0.5 to be too high a value.**
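As a sketch of that regression use case, the snippet below builds three hypothetical predictors (the names and the 0.5 threshold are just this article's convention, not a universal rule), where `x2` is deliberately constructed from `x1`, and flags predictor pairs whose correlation exceeds the threshold:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three hypothetical predictors; x2 is built from x1 on purpose,
# so the pair (x1, x2) should trip the |corr| > 0.5 check
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # theoretical corr with x1 ≈ 0.8
x3 = rng.normal(size=n)                    # unrelated to the others

X = np.column_stack([x1, x2, x3])
corr = np.corrcoef(X, rowvar=False)        # 3x3 correlation matrix

# Flag pairs that are "too correlated" to use together as regressors
threshold = 0.5
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        if abs(corr[i, j]) > threshold:
            print(f"x{i + 1} and x{j + 1}: corr = {corr[i, j]:.2f}")
```

Only the (x1, x2) pair is reported; the pairs involving x3 stay near zero.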

We have shown on the graph what positive and negative correlation look like (both examples involved very strong correlation, because |corr| > 0.9). Some things are best explained visually, so **let's see what different levels of correlation look like:**

The first thing that should catch our eye is the fact that **the closer we are to zero, the more the data forms a disorganized cloud** , while **the higher the absolute value, the more the data resembles a cloud centered around a straight line** .
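You can reproduce such clouds yourself. A minimal sketch, assuming a bivariate normal distribution as the data source: draw samples with a chosen theoretical correlation `rho` and check that the sample correlation lands close to it:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_with_corr(rho, n=2000):
    """Draw n points (x, y) from a bivariate normal with correlation rho."""
    cov = [[1.0, rho], [rho, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    return x, y

# The closer rho is to 0, the more shapeless the cloud of (x, y) points;
# the closer |rho| is to 1, the more it hugs a straight line
for rho in [0.0, 0.3, 0.7, 0.95]:
    x, y = sample_with_corr(rho)
    print(rho, round(np.corrcoef(x, y)[0, 1], 2))
```

Plotting each `(x, y)` pair as a scatter plot gives exactly the progression described above, from a disorganized cloud to a near-perfect line.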

**The correlation describing a linear relationship,** i.e. the one from the height example, is the so-called **Pearson correlation**. We say that it is **a parametric correlation**, because to calculate it we need to estimate (approximate) parameters (according to the formula: the mean, i.e. the estimator of the expected value, and the standard deviation).

People who have already had some contact with statistics know (or at least should know 😅) that **if we have parameters to calculate, then the variables should also have a parametric distribution** (in short, a distribution shows what values a variable can take and how likely each of those values is, and a parametric distribution is one described with the help of parameters). Since we have parameters, we usually **also have certain assumptions about our data**. And what if these assumptions are not met? Then nonparametric correlation comes to the rescue. The most popular **nonparametric correlations** are **the Spearman correlation coefficient** and **the Kendall Tau correlation coefficient**.

- If we calculate a parametric correlation, then in theory we assume that the data come from a parametric distribution (usually the normal distribution). In practice, however, checking these assumptions is very often skipped, which can mean computing the Pearson correlation on data for which a nonparametric correlation would be a better fit.

Both **Spearman's correlation coefficient and Kendall's Tau are rank correlations**, which means that the relationship between variables is determined from the order of the observations. When calculating these coefficients, **we do not use the values of individual observations directly** (as is the case with Pearson's correlation) **but their positions** after arranging all observations in ascending order. If several observations share the same value, each of them is assigned the average of the positions they jointly occupy.
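The ranking mechanism, including the averaged positions for ties, can be seen with SciPy. A small sketch on invented numbers (note the two tied 170s, which share the average rank (2 + 3) / 2 = 2.5):

```python
import numpy as np
from scipy import stats

# Invented sample with a tie: the two 170s share rank (2 + 3) / 2 = 2.5
x = np.array([158, 170, 170, 182, 195])
y = np.array([160, 166, 168, 171, 180])

print(stats.rankdata(x))   # tied values get averaged positions

# Rank correlations: built on positions, not on the raw values
rho, _ = stats.spearmanr(x, y)
tau, _ = stats.kendalltau(x, y)
print(round(rho, 3), round(tau, 3))

# Spearman is simply Pearson applied to the ranks
print(np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1])
```

The last line makes the connection to Pearson explicit: replacing each value with its rank and computing the ordinary Pearson coefficient reproduces Spearman's rho.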

However, it should be remembered that **a lack of correlation** (regardless of which correlation we are talking about) **does not necessarily mean a lack of dependence between the variables.** A perfect example is a quadratic relationship: for such data, all three correlations discussed here come out close to zero, just as they do for a shapeless cloud of points. This can be seen clearly in the image below.
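This is easy to verify numerically. In the sketch below, y is completely determined by x (a perfect quadratic dependence), yet Pearson, Spearman and Kendall all come out near zero, because the relationship is not monotone, let alone linear:

```python
import numpy as np
from scipy import stats

# Perfect quadratic dependence: y is fully determined by x
x = np.linspace(-1, 1, 201)
y = x ** 2

# ...yet all three coefficients sit at (essentially) zero,
# because the relationship is neither linear nor monotone
pearson = np.corrcoef(x, y)[0, 1]
rho, _ = stats.spearmanr(x, y)
tau, _ = stats.kendalltau(x, y)
print(round(pearson, 3), round(rho, 3), round(tau, 3))
```

The symmetry of x around zero makes the positive and negative contributions cancel exactly, so "no correlation" here definitely does not mean "no dependence".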

We said that the absence of correlation does not necessarily mean the absence of dependence. We must also remember that this works both ways, i.e. **the existence of a statistically significant correlation between two variables does not necessarily mean that one depends on the other.** A perfect example would be the number of drownings in a given month versus the number of ice creams eaten in that month. Does one of these variables depend on the other? Of course not! Yet the correlation between them would be high, because the values of both increase in the summer months - they share a common cause. **A high correlation between variables that do not actually influence each other is called a spurious correlation.**

Correlation expresses the level of dependence between two variables, e.g. between the height of a parent and the height of a child. It takes values between -1 and 1, where zero means no dependence, and the further from zero, the stronger the dependence. It should be remembered that correlation can take many forms, such as linear correlation (Pearson) or rank correlation (Spearman / Kendall's Tau).