– Dad! Come to the table, quick! We're waiting for you, as always – Otylka ran up to complain.
– Yes, honey. Just a sec. Let me just check if my random forest is working properly.
– But Daddy, you don't have any trees here.
– Oh, I have hundreds, honey. But on the computer – I saw the confusion in my younger daughter's eyes, so I tried again. – Imagine that cooking is like searching for a secret recipe for a delicious cake. For this purpose, I hired 100 chefs, and each of them tries to bake this cake as well as they can. At the end, we ask each of them what cake they made, and then we combine all their ideas to get the best recipe.
– This sounds like magical cooking!
– Exactly! The Random Forest is a bit like magical cooking in the computer world. Each tree brings its own unique flavor to the final dish.
Random forest is a powerful machine learning algorithm that is incredibly popular thanks to its effectiveness and wide range of applications. Due to its flexibility and simplicity, it remains one of the most versatile tools in the machine learning arsenal. In this post, we will uncover the secrets of how it works and why it is so effective!
Note! If you don't know how decision trees are created, I recommend that you first read the previous post [LINK].
Let's go back to the example from the previous post about decision trees. Our task was to classify bank customers, whether a given person would repay a loan or have problems paying back the money with interest.
The random forest algorithm works in a very simple way and consists of two steps.
Step 1. We build an ensemble of N decision trees, creating the so-called random forest. For each tree, we randomly select X data points from the training set and Y features, and on the set prepared in this way we train an independent decision tree.
And here we immediately have the answer to the question: “Why is the random forest random?” We call the forest random because randomness occurs at two stages.
First, each tree gets its own random sample of observations, drawn with replacement (bootstrapping, not to be confused with boosting).
Second, every tree starts from the same set of input features, but each one ultimately uses a different, randomly selected subset of them.
Step 2. We make a prediction with each tree built in the first step, and the final result (in the case of classification) is decided by majority vote. In the case of regression, we can take, for example, the average of the values predicted by all the trees.
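To make the two steps more concrete, here is a minimal, purely illustrative sketch of the idea in Python (the helper names toy_random_forest and toy_forest_predict are my own, and the per-tree feature sampling follows the simplified description above; scikit-learn's RandomForestClassifier actually re-draws the feature subset at every split):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def toy_random_forest(X, y, n_trees=100, n_features=5, random_state=2024):
    """Step 1: build n_trees independent trees on random rows and random features."""
    rng = np.random.default_rng(random_state)
    forest = []
    for _ in range(n_trees):
        # draw a bootstrap sample of the rows (with replacement)
        rows = rng.integers(0, len(X), size=len(X))
        # draw a random subset of the features (without replacement)
        cols = rng.choice(X.shape[1], size=n_features, replace=False)
        # fit an independent decision tree on this random slice of the data
        tree = DecisionTreeClassifier(random_state=random_state)
        tree.fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def toy_forest_predict(forest, X):
    """Step 2: every tree votes and the majority wins (for classification)."""
    votes = np.array([tree.predict(X[:, cols]) for tree, cols in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)

For regression, the last line would simply return the average of the trees' predictions instead of the majority vote.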
And that's it! Did you know that the idea behind this approach is the so-called wisdom of the crowd?
The wisdom of the crowd is a phenomenon in which the collective opinions or decisions of many people or models are often more accurate than the opinion of an individual acting alone.
The most famous example of the wisdom of the crowd is probably a contest held in 1906 at a cattle fair in Plymouth. About eight hundred people (from experienced farmers to families with children) took part and had to estimate the weight of a slaughtered steer after dressing.
Francis Galton, the renowned statistician, collected all the answers and was astonished to find that the average value of these estimates was surprisingly close to the animal's actual weight, within just one percent!
This is the wisdom of the crowd! By collecting estimates from different people and using them to calculate an average value, we are more likely to get closer to the real answer. It is through the combination of different perspectives and experiences that we can make wise decisions.
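If you want to see this effect in numbers, here is a tiny, made-up simulation (the weight and the error spread are invented, not Galton's data): each "person" guesses the true weight with some random error, and the average of all the guesses lands far closer to the truth than a typical individual guess.

import numpy as np

rng = np.random.default_rng(2024)
true_weight = 1200                                    # illustrative "true" weight of the steer
guesses = true_weight + rng.normal(0, 80, size=800)   # 800 noisy individual guesses

print(f'Typical individual error: {np.abs(guesses - true_weight).mean():.1f}')
print(f'Error of the average guess: {abs(guesses.mean() - true_weight):.1f}')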
Now let's just replace hundreds of people with hundreds of decision trees and we get a forest.
It is also worth mentioning the scientists who created such a great algorithm.
The origins of the random forest idea date back to the 1990s, when Tin Kam Ho helped lay the foundations for the technique. In 1995, Ho proposed the “random subspace method”, which used randomly selected subsets of features to build diverse decision trees.
The early development of the random forest concept was also influenced by the work of Amit and Geman [LINK], who introduced the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree.
The currently known form of the algorithm was developed by Leo Breiman, who in 2001 proposed the concept of a committee of decision trees cooperating with each other to obtain better results than a single decision tree.
Random forest is a powerful tool in data analysis that has many advantages: it is resistant to overfitting thanks to averaging over many trees, it works for both classification and regression, it requires little data preprocessing, it copes well with a large number of features, and it provides a built-in measure of feature importance.
Of course, there are no perfect algorithms. Although the random forest has many advantages, it also has some disadvantages: it is slower to train and to score than a single decision tree, it uses more memory, and it is much harder to interpret than a single tree that you can simply draw.
Let's start as usual by importing the required libraries:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
import time
and generating a data set (details of the `make_classification` function are in the previous post):
# Create a balanced random dataset
X, y = make_classification(n_samples=10000,
                           n_features=50,
                           n_classes=2,
                           weights=[0.5, 0.5],
                           random_state=2024)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2024)

print(f'X_train: {X_train.shape}; X_test: {X_test.shape}')
print(f'y_train (class 1): {y_train.sum()}; y_test (class 1): {y_test.sum()}')
Let's build our first random forest model with default parameters:
# Create an instance of RandomForestClassifier
rf_model = RandomForestClassifier(random_state=2024)

# Train the model using the training data
rf_model.fit(X_train, y_train)

# Make predictions on the test data
predictions = rf_model.predict(X_test)

# Calculate AUC on the test data
y_pred_proba = rf_model.predict_proba(X_test)[:, 1]
auc = round(roc_auc_score(y_test, y_pred_proba), 3)
print(f"AUC: {auc}")
According to the AUC metric, the power of our model is 0.972 – a very good result for the default parameters:
# default parameters
rf_model.get_params()
Hyperparameters in a random forest are used either to increase the predictive power of the model or to speed it up. Let's get acquainted with the main random forest hyperparameters worth knowing.
Let's first create two helper functions. The first one is for training the model, which will return the AUC and time metrics:
def rf_calc(params):
    """Calculate the AUC score and training time of a RandomForestClassifier model."""
    start_time = time.time()

    # Build a RandomForestClassifier based on the passed parameters
    model = RandomForestClassifier(**params)
    model.fit(X_train, y_train)

    # Predict probabilities for the positive class
    y_pred_proba = model.predict_proba(X_test)[:, 1]

    # Calculate the AUC score
    auc = round(roc_auc_score(y_test, y_pred_proba), 3)

    # Calculate the time taken
    t = round(time.time() - start_time, 2)

    # Return the AUC metric and the time
    return auc, t
and the other for plotting the results so that we can better understand how the hyperparameters work.
def chart_for_param_dict(d, param_name, min_auc_lim=0.9):
    """Create a plot showing AUC values and corresponding training times
    for different values of a hyperparameter."""
    # Extract data from the dictionary and sort by keys
    param_keys = sorted(list(d.keys()))
    auc_values = [d[k]['auc'] for k in param_keys]
    time_values = [d[k]['time'] for k in param_keys]

    # Create the plot with the specified figure size
    fig, ax1 = plt.subplots(figsize=(10, 5))

    # Plot AUC values as bars against the range of indices
    bars = ax1.bar(range(len(param_keys)), auc_values, color='gray', label='AUC')
    ax1.set_xlabel('Index')
    ax1.set_ylabel('AUC')
    ax1.set_xticks(range(len(param_keys)))
    ax1.set_xticklabels(param_keys)
    ax1.tick_params('y')
    ax1.set_ylim(min_auc_lim, 1.01)

    # Create a secondary y-axis for the time values
    ax2 = ax1.twinx()
    ax2.plot(range(len(param_keys)), time_values, color='crimson', label='Time', marker='o')
    ax2.set_ylabel('Time (sec)', color='crimson')
    ax2.tick_params('y', colors='crimson')

    # Add legends
    lines1, labels1 = ax1.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    lines = lines1 + lines2
    labels = labels1 + labels2
    ax1.legend(lines, labels, loc='upper left')

    # Add a title
    plt.title(f'AUC and Time for {param_name} parameter')

    # Add values on top of each bar
    for bar, value in zip(bars, auc_values):
        ax1.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.001,
                 f'{value:.3f}', ha='center', va='bottom', rotation=45, fontsize=8)

    # Display the plot
    plt.show()
The `n_estimators` parameter controls the number of decision trees created in the forest: the larger the value, the more trees in the forest. It is worth remembering that each tree in a random forest operates independently of the others and generates its own predictions.
param_name = 'n_estimators'
results = {}

for param in [1, 5, 10, 20, 30, 40, 50, 75, 100, 150, 200, 250,
              300, 400, 500, 750, 1000]:
    # change the parameter under test
    params = {
        param_name: param,
        'random_state': 2024,
    }

    # build the model and save auc & time
    auc, t = rf_calc(params)
    results[param] = {'auc': auc, 'time': t}
    print(f'For {param_name}:{param} auc={auc} in {t} sec.')

# Create chart :)
chart_for_param_dict(results, param_name, min_auc_lim=0.8)
A larger number of trees can lead to better and more stable predictions. However, increasing this value can also increase the model training time.
In our case, `n_estimators` = 150 seems to be the best balance between model power and computation time. To gain another 0.001 AUC, we would have to double the number of trees to 300.
The `max_depth` parameter specifies the maximum depth of each decision tree in the forest. The depth of a tree indicates how many splits (layers) can occur before reaching the leaves, which are the final decision nodes.
param_name = 'max_depth'
results = {}

for param in range(1, 21):
    # change the parameter under test
    params = {
        param_name: param,
        'n_estimators': 150,
        'random_state': 2024,
    }

    # build the model and save auc & time
    auc, t = rf_calc(params)
    results[param] = {'auc': auc, 'time': t}
    print(f'For {param_name}:{param} auc={auc} in {t} sec.')

# Create chart :)
chart_for_param_dict(results, param_name)
The larger the `max_depth` value, the more complex the decision trees can become, allowing a more precise fit to the training data. However, if a tree becomes too deep, it can start learning the details of the training data, and its ability to generalize to new data decreases.
I always try to choose a depth at which further depth levels no longer significantly affect the model's power. This makes the model simpler, faster and easier to explain.
In our case, I choose `max_depth` = 6.
Please also note that on this diverse data set, even if we increase the maximum tree depth to 20(*), the model does not overfit, thanks to the random sampling.
So if you are afraid of overfitting, use a random forest 😉
(*) 2^20 = 1,048,576 – the maximum number of leaves in a tree of depth 20 (i.e. a tree with 20 levels of splits)
The `min_samples_split` parameter controls the minimum number of samples required to split an internal node. If the number of samples available at a node is less than the specified `min_samples_split` value, the tree is not split into smaller branches at that point. This helps control how closely the tree fits the details of the training data, which can help avoid overfitting.
param_name = 'min_samples_split'
results = {}

for param in [2, 3, 4, 5, 10, 20, 50, 100, 200, 500, 750, 1000, 2000]:
    # change the parameter under test
    params = {
        param_name: param,
        'n_estimators': 150,
        'max_depth': 6,
        'random_state': 2024,
    }

    # build the model and save auc & time
    auc, t = rf_calc(params)
    results[param] = {'auc': auc, 'time': t}
    print(f'For {param_name}:{param} auc={auc} in {t} sec.')

# Create chart :)
chart_for_param_dict(results, param_name)
We can see that as the parameter increases, the training time decreases, because fewer candidate splits need to be evaluated.
min_samples_leaf
The `min_samples_leaf` parameter controls the minimum number of samples required to form a leaf in a decision tree. You can think of it as the minimum number of examples that must end up in each leaf of the tree. If a split at a given node would leave fewer than `min_samples_leaf` samples in a leaf, the tree is not split further in that direction, and the node is treated as a leaf.
param_name = 'min_samples_leaf'
results = {}

for param in [1, 2, 3, 4, 5, 10, 20, 50, 100, 200, 500, 750, 1000, 2000]:
    # change the parameter under test
    params = {
        param_name: param,
        'n_estimators': 150,
        'max_depth': 6,
        'min_samples_split': 20,
        'random_state': 2024,
    }

    # build the model and save auc & time
    auc, t = rf_calc(params)
    results[param] = {'auc': auc, 'time': t}
    print(f'For {param_name}:{param} auc={auc} in {t} sec.')

# Create chart :)
chart_for_param_dict(results, param_name)
A larger value of this parameter can lead to simpler trees, which can help prevent overfitting and increase the generalizability of the model.
Another great advantage of the random forest algorithm is that it is very easy to measure the relative importance of each feature in the prediction.
Feature Importance in Random Forests is a way to determine which features have the greatest impact on a model's predictions. This helps us understand what information is most important to the model. For example, if a model is supposed to predict whether a person will survive a disaster, it may turn out that age or gender are key factors. This allows us to better understand how the model works and what features are relevant to our problem.
Feature Importance in the Scikit-learn library for random forests is calculated by measuring the contribution of each feature to improving prediction accuracy. This is done by analyzing how much each feature reduces the impurity of the decision tree nodes in the forest. The greater the reduction in impurity caused by a feature, the more important it is to the model. Scikit-learn calculates these values automatically after the model has been trained, scaling them so that the importances sum to one. In other words, the higher the value, the more the model relies on that feature when making decisions.
Here's what the features from our example look like:
import pandas as pd

model = RandomForestClassifier(n_estimators=150,
                               max_depth=6,
                               min_samples_split=20,
                               random_state=2024)
model.fit(X_train, y_train)

# Create a list of feature names if available
feature_names = [f"Feature {i}" for i in range(1, X_train.shape[1] + 1)]

# Calculate Feature Importance
feature_importance = model.feature_importances_

# Create a DataFrame to store the feature importance data
feature_importance_df = pd.DataFrame({'Feature': feature_names,
                                      'Importance': feature_importance})

# Sort the DataFrame by Feature Importance values in descending order
feature_importance_df = feature_importance_df.sort_values(
    by='Importance', ascending=False).reset_index(drop=True)

# Add a cumulative importance column
feature_importance_df['Cumulative Importance'] = feature_importance_df['Importance'].cumsum()

# Display sorted features and their importance - top 10
feature_importance_df.head(10)
As you can see, the first 5 of the 50 variables account for almost 96% of the total feature importance. So, if we wanted to simplify the model, I would build it on just these 5 variables.
You can use this technique for feature selection – you can read about the details in this post.
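As a rough sketch of that idea (the 0.96 cutoff is just an example, and the index mapping assumes the `feature_names` convention "Feature 1" … "Feature 50" used above), the selection could look like this:

# Keep only the features that together cover ~96% of the total importance
top_features = feature_importance_df.loc[
    feature_importance_df['Cumulative Importance'] <= 0.96, 'Feature'].tolist()

# Map the feature names back to column indices ("Feature 1" -> column 0)
top_idx = [int(name.split()[-1]) - 1 for name in top_features]

# Retrain the same model on the reduced feature set and compare AUC
slim_model = RandomForestClassifier(n_estimators=150, max_depth=6,
                                    min_samples_split=20, random_state=2024)
slim_model.fit(X_train[:, top_idx], y_train)

slim_auc = roc_auc_score(y_test, slim_model.predict_proba(X_test[:, top_idx])[:, 1])
print(f'AUC on {len(top_idx)} selected features: {round(slim_auc, 3)}')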
I hope I've convinced you that random forest is a truly beautiful tool.
Choosing a random forest as the modeling algorithm is beneficial in the initial stage of a project because of its simplicity and effectiveness. Building an ineffective random forest is harder than it might seem, which means that even quick experiments will give reasonable results.
Good luck!
Greetings from the bottom of my heart,