– Dad! Come to the table, quick! We're all waiting for you, as always – Otylka ran up with her complaints.

– Yes, honey. Just a sec. Let me just check if my random forest is working properly.

– But Daddy, you don't have any trees here.

– Oh, I have hundreds of them, honey. But on the computer – I saw the incomprehension in my younger daughter's eyes, so I tried again. – Imagine that cooking is a search for the secret recipe for a delicious cake. To find it, I hired 100 chefs, and each of them bakes the cake as best they can. At the end, we ask each of them what cake they made, and then we combine all their ideas into the best recipe.

– This sounds like magical cooking!

– Exactly! The Random Forest is a bit like magical cooking in the computer world. Each tree brings its own unique flavor to the final dish.

Random forest is a powerful machine learning algorithm that is incredibly popular due to its effectiveness and wide range of applications. Thanks to its flexibility and simplicity, it remains one of the most versatile tools in the machine learning arsenal. In this post, we will uncover the secrets of how it works and why it is so effective!

Note! If you don't know how decision trees are created, I recommend that you first read the previous post [ LINK ].

Our example

Let's go back to the example from the previous post about decision trees. Our task was to classify bank customers: will a given person repay a loan, or will they have problems paying the money back with interest?

The random forest algorithm works in a very simple way and consists of two steps.

Step 1. We build an ensemble of N decision trees, creating the so-called random forest. For each tree, we randomly select X data points from the training set and Y features, and on the set prepared in this way we grow an independent decision tree.

And here we immediately have the answer to the question: “Why is the random forest random?” We call the forest random because the randomness occurs at two stages.

First, each tree gets its own random sample of observations, drawn with replacement from the training set (bootstrap sampling, the basis of so-called bagging).

Second, every tree starts from the same full set of input features, but only a random subset of those features is used when the tree is built.

Step 2. We make a prediction with each tree built in the first step. For classification, the final result is determined by majority vote; for regression, we can take, for example, the average of the values predicted by all the trees.
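
To make these two steps concrete, here is a minimal, hand-rolled sketch of the idea in Python. The helper toy_random_forest_predict is made up for illustration; unlike scikit-learn's real implementation (which draws a feature subset at every split), it picks the feature subset once per tree and uses a simple square-root heuristic for the subset size:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def toy_random_forest_predict(X_train, y_train, X_test, n_trees=100, random_state=2024):
    rng = np.random.default_rng(random_state)
    n_samples, n_features = X_train.shape
    votes = []
    for _ in range(n_trees):
        # Step 1a: draw a bootstrap sample (observations with replacement)
        rows = rng.integers(0, n_samples, size=n_samples)
        # Step 1b: draw a random subset of features (here: sqrt of all features)
        cols = rng.choice(n_features, size=int(np.sqrt(n_features)), replace=False)
        tree = DecisionTreeClassifier(random_state=random_state)
        tree.fit(X_train[rows][:, cols], y_train[rows])
        # Step 2: collect each tree's prediction for the test set
        votes.append(tree.predict(X_test[:, cols]))
    # Final result: majority vote over all trees
    return (np.array(votes).mean(axis=0) >= 0.5).astype(int)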

And that's it! And did you know that the idea behind this approach is the so-called wisdom of the crowd?

The wisdom of the crowd

The wisdom of crowds is the phenomenon in which the collective opinions or decisions of many people or models are often more accurate than the opinion of an individual acting alone.

The most famous example of the wisdom of crowds is probably a weight-judging contest held in 1906 at a cattle fair in Plymouth. Eight hundred people (seasoned farmers as well as families with children) took part and had to estimate the weight of an ox after it had been slaughtered and dressed.

Francis Galton, the renowned statistician, collected all the answers and was astonished to find that the average value of these estimates was surprisingly close to the animal's actual weight, within just one percent!

This is the wisdom of the crowd! By collecting estimates from different people and using them to calculate an average value, we are more likely to get closer to the real answer. It is through the combination of different perspectives and experiences that we can make wise decisions.
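
If you would like to see this effect in numbers, here is a tiny simulation; the true weight and the spread of the guesses are made-up illustrative values, not Galton's actual data:

import numpy as np

rng = np.random.default_rng(2024)
true_weight = 1200                                     # assumed "real" weight for the illustration
guesses = true_weight + rng.normal(0, 100, size=800)   # 800 noisy individual guesses

print(f'Typical error of a single guess: {np.abs(guesses - true_weight).mean():.1f}')
print(f'Error of the crowd average:      {abs(guesses.mean() - true_weight):.1f}')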

Now let's just replace hundreds of people with hundreds of decision trees and we get a forest.

Random Forest History

It is also worth mentioning the scientists who created such a great algorithm.

The origins of the random forest idea date back to the 1990s, when Tin Kam Ho helped lay the foundations for the technique. In 1995, Ho proposed the “random subspace method”, which used randomly selected subsets of features to create diverse decision trees.

The early development of the random forest concept was also influenced by the work of [ LINK ] Amit and Geman, who introduced the idea of searching a random subset of the available decisions when splitting a node in the context of growing a single tree.

The currently known form of the algorithm was developed by Leo Breiman, who in 2001 proposed the concept of a committee of decision trees that cooperate with each other to obtain better results than a single decision tree.

Random Forest – Advantages

Random forest is a powerful tool in data analysis that has many advantages:

  1. Robustness to overfitting : Because a random forest is built from multiple decision trees trained on random subsets of the data, it is less susceptible to overfitting than an individual decision tree. In practice, this means it generalizes better to new data.
  2. Ability to work with large data sets : Even though building a random forest takes longer than building a single decision tree, it can still handle large data sets efficiently thanks to parallel processing (a short sketch of enabling this follows this list).
  3. Results stability : Because random forest is based on averaging the results of multiple decision trees, its results are more stable and less susceptible to changes in the training data.
  4. Feature Importance Detection : Random forest can provide information on which features are important for prediction, which is useful in data analysis and interpretation.
  5. Ease of use : Random Forest is relatively easy to use and does not require too many advanced parameter settings, making it an accessible tool for people with varying levels of data analysis experience.
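
As a side note to point 2, here is a short sketch of how parallel training is switched on in scikit-learn. It assumes training data X_train and y_train like the ones we prepare later in this post; n_jobs=-1 simply tells the library to use all available CPU cores:

from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 grows the trees in parallel on all CPU cores
rf_parallel = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=2024)
rf_parallel.fit(X_train, y_train)  # X_train, y_train as prepared later in this post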

Random Forest – Disadvantages

Of course, there are no perfect algorithms. Although the random forest has many advantages, it also has some disadvantages:

  1. Slower processing : The main limitation of the random forest is that a large number of trees can make the algorithm too slow for real-time predictions. A random forest requires more processing power than a single decision tree, a linear regression (for regression tasks) or a logistic regression (for classification tasks). For most real-world applications the random forest is fast enough, but there are certainly situations where run-time performance matters and another approach would be preferred.
  2. Harder to interpret : The results of a random forest can be more difficult to interpret than those of a single decision tree. Because the forest is made up of many trees, understanding which features drive a particular prediction is more complicated.

Python code

Let's start as usual by importing the required libraries:

 from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

import matplotlib.pyplot as plt
import time

and generate a data set (details of the `make_classification` function can be found in the previous post):

 # Create a balanced random dataset
X, y = make_classification(n_samples=10000, 
                           n_features=50, 
                           n_classes=2, 
                           weights=[0.5, 0.5], 
                           random_state=2024)

# Division into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2024)

print(f'X_train: {X_train.shape}; X_test: {X_test.shape}; ')
print(f'y_train: (1){y_train.sum()}; y_test: (1){y_test.sum()}; ')

Let's build our first random forest model with default parameters:

 # Create an instance of RandomForestClassifier
rf_model = RandomForestClassifier(random_state=2024)

# Train the model using your training data
rf_model.fit(X_train, y_train)

# Make predictions on test data
predictions = rf_model.predict(X_test)

# Calculating AUC on test data
y_pred_proba = rf_model.predict_proba(X_test)[:, 1]
auc = round(roc_auc_score(y_test, y_pred_proba),3)
print(f"AUC: {auc}")

The performance of our model according to the AUC metric is 0.972. This is a very good result with the default parameters:

 # default parameters
rf_model.get_params()

Random forest - hyperparameters

Hyperparameters in a random forest are used either to increase the predictive power of the model or to speed up its training. Let's get acquainted with the main random forest hyperparameters that are worth knowing.

Let's first create two helper functions. The first one trains the model and returns the AUC metric and the training time:

 def rf_calc(params):
    """
    Calculates the AUC score and time taken by a 
    RandomForestClassifier model.
    """
    start_time = time.time()
    
    # Build RandomForestClassifier from the passed parameters
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)
    
    # Predict probabilities for the positive class
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    # Calculate AUC score
    auc = round(roc_auc_score(y_test, y_pred_proba), 3)

    # Calculate the time taken
    t = round(time.time() - start_time, 2)

    # Return the AUC metric and the elapsed time
    return auc, t

and the other for plotting the results so that we can better understand how the hyperparameters work.

 def chart_for_param_dict(d, param_name, min_auc_lim=0.9):
    """
    Creates a plot showing AUC values and corresponding 
    training times for different parameter values.
    """
    # Extracting data from dictionary and sorting by keys
    param_keys = sorted(list(d.keys()))
    auc_values = [d[depth]['auc'] for depth in param_keys]
    time_values = [d[depth]['time'] for depth in param_keys]

    # Creating the plot with specified figure size
    fig, ax1 = plt.subplots(figsize=(10, 5))

    # Plotting AUC values against the range of indices
    bars = ax1.bar(range(len(param_keys)), auc_values, color='gray', label='AUC')
    ax1.set_xlabel('Index')
    ax1.set_ylabel('AUC')
    ax1.set_xticks(range(len(param_keys)))  
    ax1.set_xticklabels(param_keys)  
    ax1.tick_params('y')
    ax1.set_ylim(min_auc_lim, 1.01)

    # Creating secondary y-axis for time values
    ax2 = ax1.twinx()
    ax2.plot(range(len(param_keys)), time_values,
             color='crimson', label='Time', marker='o')
    ax2.set_ylabel('Time (sec)', color='crimson')
    ax2.tick_params('y', colors='crimson')

    # Adding legends
    lines1, labels1 = ax1.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    lines = lines1 + lines2
    labels = labels1 + labels2
    ax1.legend(lines, labels, loc='upper left')

    # Adding title
    plt.title(f'AUC and Time for {param_name} parameter')

    # Adding values on top of each bar
    for bar, value in zip(bars, auc_values):
        ax1.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.001, f'{value:.3f}',
                 ha='center', va='bottom', rotation=45, fontsize=8)

    # Display the plot
    plt.show()

n_estimators

The n_estimators parameter controls the number of decision trees created in the forest: the larger its value, the more trees are built. It is worth remembering that each tree in a random forest is grown independently of the others and generates its own predictions.

 param_name = 'n_estimators'

results = {}
for param in [1, 5, 10, 20, 30, 40, 50, 75, 100,
              150, 200, 250, 300, 400, 500, 750, 1000]:
    # change parameters for this run
    params = {
        param_name:param,
        'random_state': 2024,
    }
    
    # build model and save auc & time
    auc, t = rf_calc(params)
    results[param] = {'auc': auc, 'time': t}
    print(f'For {param_name}:{param} auc={auc} in {t} sec.')
    
# Create chart :)
chart_for_param_dict(results, param_name, min_auc_lim=0.8)

A larger number of trees can lead to better and more stable predictions. However, increasing this value can also increase the model training time.

In our case, n_estimators = 150 seems to offer the best balance between model power and computation time. To gain another 0.001 AUC we would have to double the number of trees to 300.

max_depth

The max_depth parameter specifies the maximum depth of each decision tree in the forest. The depth of the tree indicates how many splits (layers) can occur before reaching the leaves, which are the final decision nodes.

 param_name = 'max_depth'

results = {}
for param in range(1, 21):
    # change parameters for this run
    params = {
        param_name:param,
        'n_estimators': 150,
        'random_state': 2024,
    }
    
    # build model and save auc & time
    auc, t = rf_calc(params)
    results[param] = {'auc': auc, 'time': t}
    print(f'For {param_name}:{param} auc={auc} in {t} sec.')
    
# Create chart :)
chart_for_param_dict(results, param_name)

The larger the max_depth value, the more complex the decision trees can be, allowing a more precise fit to the training data. However, if the trees become too deep, they may memorize details of the training data, reducing the model's ability to generalize to new data.

I always try to choose a depth beyond which additional levels no longer significantly improve the model's power. This keeps the model simpler, faster and easier to explain.

In our case I choose max_depth = 6.

Please also note that for this diverse data set, even if we increase the maximum tree depth to 20(*), the model will not overfit, thanks to random sampling.

So if you are afraid of overfitting, use a random forest 😉

(*) 2^20 = 1,048,576 – this is the maximum number of leaves in a tree of depth 20.
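
If you want to verify this yourself, a simple check is to compare the AUC on the training and the test set for a deep forest; a large gap between the two would signal overfitting. A quick sketch (it reuses the imports and data from above, and no particular numbers are asserted here):

deep_model = RandomForestClassifier(n_estimators=150, max_depth=20, random_state=2024)
deep_model.fit(X_train, y_train)

# Compare train and test AUC; a large gap would indicate overfitting
auc_train = round(roc_auc_score(y_train, deep_model.predict_proba(X_train)[:, 1]), 3)
auc_test = round(roc_auc_score(y_test, deep_model.predict_proba(X_test)[:, 1]), 3)
print(f'Train AUC: {auc_train}, Test AUC: {auc_test}')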

min_samples_split

The min_samples_split parameter controls the minimum number of samples required to split an internal node. If the number of samples at a node is smaller than min_samples_split, the tree will not split that node any further. This helps control how closely the tree fits the details of the training data, which can help avoid overfitting.

 param_name = 'min_samples_split'

results = {}
for param in [2, 3, 4, 5, 10, 20, 50, 100, 200, 500, 750, 1000, 2000]:
    # change parameters for this run
    params = {
        param_name:param,
        'n_estimators': 150,
        'max_depth': 6,
        'random_state': 2024,
    }
    
    # build model and save auc & time
    auc, t = rf_calc(params)
    results[param] = {'auc': auc, 'time': t}
    print(f'For {param_name}:{param} auc={auc} in {t} sec.')
    
# Create chart :)
chart_for_param_dict(results, param_name)

We can see that as the parameter increases, the training time decreases, because fewer candidate splits need to be evaluated.

min_samples_leaf

The min_samples_leaf parameter controls the minimum number of samples required to form a leaf in a decision tree. You can think of it as the minimum number of examples that must end up in every leaf of the tree. A candidate split is only accepted if it leaves at least min_samples_leaf samples in each of the resulting branches; otherwise the node is not split further and becomes a leaf.

 param_name = 'min_samples_leaf'

results = {}
for param in [1, 2, 3, 4, 5, 10, 20, 50, 100, 200, 500, 750, 1000, 2000]:
    # change parameters for this run
    params = {
        param_name:param,
        'n_estimators': 150,
        'max_depth': 6,
        'min_samples_split': 20,
        'random_state': 2024,
    }
    
    # build model and save auc & time
    auc, t = rf_calc(params)
    results[param] = {'auc': auc, 'time': t}
    print(f'For {param_name}:{param} auc={auc} in {t} sec.')
    
# Create chart :)
chart_for_param_dict(results, param_name)

A larger value of this parameter can lead to simpler trees, which can help prevent overfitting and increase the generalizability of the model.

Feature Importance

Another great advantage of the random forest algorithm is that it is very easy to measure the relative importance of each feature in the prediction.

Feature Importance in Random Forests is a way to determine which features have the greatest impact on a model's predictions. This helps us understand what information is most important to the model. For example, if a model is supposed to predict whether a person will survive a disaster, it may turn out that age or gender are key factors. This allows us to better understand how the model works and what features are relevant to our problem.

Feature importance for random forests in the scikit-learn library is calculated by measuring each feature's contribution to reducing the impurity of the decision tree nodes across the forest. The greater the reduction in impurity caused by a feature, the more important it is to the model. Scikit-learn calculates these values automatically after the model has been trained and scales them so that all the importances sum to one. In other words, the higher a feature's importance, the more the model relies on it when making decisions.

Here's what the features from our example look like:

 import pandas as pd

model = RandomForestClassifier(
    n_estimators=150,
    max_depth=6, 
    min_samples_split=20,
    random_state=2024)

model.fit(X_train, y_train)

# Create a list of feature names if available
feature_names = [f"Feature {i}" for i in range(1, X_train.shape[1]+1)] 

# Calculate Feature Importance
feature_importance = model.feature_importances_

# Create a DataFrame to store feature importance data
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})

# Sort the DataFrame by Feature Importance values in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False).reset_index(drop=True)

# Add a cumulative importance column
feature_importance_df['Cumulative Importance'] = feature_importance_df['Importance'].cumsum()

# Display sorted features and their importance - top 10
feature_importance_df.head(10)

As you can see, the first 5 of the 50 variables account for almost 96% of the total importance. So, if we wanted to simplify the model, I would build it on just these 5 variables.
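
Here is a sketch of that simplification; it continues from the feature_importance_df built above, and the column indices are recovered from the 1-based "Feature i" names generated earlier:

# Select the 5 most important features and retrain the model on them only
top_idx = [int(name.split()[-1]) - 1
           for name in feature_importance_df['Feature'].head(5)]

simple_model = RandomForestClassifier(
    n_estimators=150,
    max_depth=6,
    min_samples_split=20,
    random_state=2024)
simple_model.fit(X_train[:, top_idx], y_train)

# AUC of the simplified model on the test set
auc_simple = round(roc_auc_score(y_test, simple_model.predict_proba(X_test[:, top_idx])[:, 1]), 3)
print(f'AUC on the top 5 features only: {auc_simple}')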

You can use this technique for feature selection – you can read about the details in this post .

Summary

I hope I've convinced you that random forest is a truly beautiful tool.

Choosing a random forest as a modeling algorithm is beneficial at the initial stage of a project, due to its simplicity and effectiveness. It is actually quite hard to build a random forest that performs poorly, which means that even during early experiments we will get reasonable results.

Good luck!

Greetings from the bottom of my heart,