Ensemble Models - Bagging, RandomForest and Boosting


As a recap of Decision Trees, we learnt that a Decision Tree is a computational model containing a set of if-then-else decisions used to classify data, much like a program flow diagram. The root node consists of the entire set of data. A segment or branch is formed by applying a splitting rule to the data above it. A node which branches out further is called a decision node. Bottom nodes which do not branch out are called leaves. Splitting rules are applied on each branch, creating a hierarchy of branches which forms the decision tree. Data points are assigned to nodes in a mutually exclusive manner: for each data point, the decision tree provides a unique path ending in a leaf, and that leaf defines its class. Decision Trees are simple and useful for interpretation.
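
As a quick illustration (not part of the original recap, and using the iris dataset only as an assumed example), here is a minimal sketch of fitting a single Decision Tree and printing its learned if-then-else rules:

# Minimal sketch: a single Decision Tree and its if-then-else rules (illustrative only)
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier, export_text

iris = datasets.load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned splitting rules; each leaf assigns a class
print(export_text(tree, feature_names=iris.feature_names))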

However, simple Decision Trees typically are not competitive with the best supervised learning approaches in terms of prediction accuracy. Ensemble Models, or Ensemble Learning, is the method of combining many weaker learning models in order to obtain better prediction capability. These learning models may run in parallel or sequentially, and in certain ensemble techniques they may learn from each other, thus compounding learning. The accuracy of the ensemble model is often better than that of any of the individual weaker models that were combined to create it. Generally, the more varied the learning models within the ensemble, the more robust the results tend to be. However, certain ensemble methods (like Random Forests) use the same type of learning model throughout and still yield excellent results very quickly. Ensemble models are computation intensive.
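
As a small, purely illustrative sketch of combining varied learners (this uses scikit-learn's VotingClassifier, which is not covered further in this notebook; the chosen base models and the iris dataset are assumptions for the example):

# Illustrative sketch: combining varied base learners by majority vote
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

iris = datasets.load_iris()
ensemble = VotingClassifier(estimators=[
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier(random_state=0)),
    ('knn', KNeighborsClassifier())
])
ensemble.fit(iris.data, iris.target)
print(ensemble.score(iris.data, iris.target))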

Bagging, Boosting and Stacking are some ensemble techniques that build on Decision Trees to provide higher accuracy.

In this notebook we will learn about -

  • Bagging
  • Random Forest
  • Boosting

In the following notebooks we will learn about XGBoost and Boosted Trees, which are special types of boosted models.

Bagging

Bagging, or Bootstrap Aggregating, is an ensemble algorithm designed to improve the accuracy of other Machine Learning algorithms such as Decision Trees by using a concept called the bootstrap.

The decision trees discussed typically suffer from high variance. For example, if we split the training data into two halves, each half produces a different fit; and if there are multiple such splits of the training data, each subset produces a different outcome, such as a different numerical value in a regression setting or a different class in a classification setting. The bootstrap is an approach to reduce this high variance in a statistical learning method such as Decision Trees. In this approach, we take a number of bootstrapped samples (samples drawn with replacement) from the training dataset and construct a separate Decision Tree from each.

Using bagging, we take many bootstrapped training samples from the population, build a separate Decision Tree on each sample, and finally combine the resulting predictions. This reduces the variance and makes the prediction more accurate, so the error is lower than that of a single Decision Tree. In a regression setting, the different numerical predictions are averaged to get a more accurate predicted value. In a classification setting, we predict the class with each Decision Tree and finally take the class that occurs most frequently (a majority vote).
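
To make the idea concrete, here is a minimal sketch (an illustration written for this note, not library code) of bagging done by hand: draw bootstrap samples, fit a Decision Tree on each, and take a majority vote:

# Illustrative sketch: bagging by hand on the iris dataset
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

iris = datasets.load_iris()
X, y = iris.data, iris.target

trees = []
for i in range(10):
    # Draw a bootstrap sample (sampling with replacement) from the training data
    X_boot, y_boot = resample(X, y, random_state=i)
    trees.append(DecisionTreeClassifier(random_state=i).fit(X_boot, y_boot))

# Each tree votes; the predicted class is the one that occurs most frequently
votes = np.array([t.predict(X) for t in trees])
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print((majority == y).mean())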

Bagging is a generic concept and can be used on other machine learning algorithms as well.

In Python, the scikit-learn library has modules for Bagging in both the classification and regression settings. Here is sample code to apply BaggingRegressor (or BaggingClassifier).

from sklearn.ensemble import BaggingClassifier, BaggingRegressor

# Regression setting (max_features=13 assumes the training data has at least 13 features)
regr1 = BaggingRegressor(max_features=13, random_state=1)
regr1.fit(X_train, y_train)

# or, classification setting

clf1 = BaggingClassifier(max_features=13, random_state=1)
clf1.fit(X_train, y_train)

Note that the syntax for BaggingRegressor and BaggingClassifier does not change. The functioning of the bagging technique also does not change; the only difference is the nature of the target variable in the two cases.

The base learning model can be set to any simple learning model. If nothing is specified, the default base learning model for both the Classifier and the Regressor is a decision tree.
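
For example, a different base model can be plugged in. Note that the name of this parameter has changed across scikit-learn versions (base_estimator in older releases, estimator from version 1.2 onwards), so the sketch below assumes a recent version; the choice of k-nearest neighbours is only an illustration:

# Illustrative sketch: bagging with a non-default base learner (assumes scikit-learn >= 1.2)
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
bag_knn = BaggingClassifier(estimator=KNeighborsClassifier(), n_estimators=50, random_state=1)
bag_knn.fit(iris.data, iris.target)
print(bag_knn.score(iris.data, iris.target))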

Exercise

We will now try to see what results we can derive by applying the Bagging Classifier to the iris dataset. Run the code given below to see the output.

In [33]:
# STEP 1 - Loading the iris data set
from sklearn import datasets

iris = datasets.load_iris()

# STEP 2 - Defining the predictors and target variable
X = iris.data
y = iris.target

# STEP 3 - Splitting the data into training and test data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=20052017)

# STEP 4 - Importing and fitting the model
from sklearn.ensemble import BaggingClassifier

iris_BC = BaggingClassifier(n_estimators=100, max_features=3, random_state=0)
BC = iris_BC.fit(X_train,y_train)

# STEP 5 - Calculating the score of the model using the test data
BC.score(X_test,y_test)
Out[33]:
0.9555555555555556

Solution code

# Just run the above code

Random Forest

Random Forest is an ensemble tree-based algorithm involving multiple Decision Trees which are combined to yield a single prediction that is the collective consensus of the individual trees. By combining a large number of decision trees we can obtain dramatic improvements in prediction accuracy, though at some cost to the interpretability of the model.

Random Forest is an extension of the plain bagging technique. In Random Forest we build a forest of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of, say, 'm' predictors is chosen as candidate split predictors from the full set of 'p' predictors. The split is allowed to use only one of those 'm' predictors. For each subsequent split, a fresh random sample of 'm' predictors is taken. This means the splits do not always consider the same set of predictors, unlike in Bagging, where every split in every tree may consider all 'p' predictors.

Thus, while building the random forest, the algorithm is not allowed to consider the majority of the available predictors at any single split. This reduces the chance of a strong predictor carrying more weight than the much weaker predictors. In contrast, if we used all predictors at each split, more or less the same predictor would drive the overall prediction regardless of the split. In that case 'm' equals 'p', which is simply Bagging.

Owing to this selection of a subset of features from the total features available, the Random Forest model handles data sets with high dimensionality efficiently.
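
The number of predictors 'm' considered at each split is controlled through the max_features parameter of RandomForestClassifier / RandomForestRegressor; the values in the sketch below are illustrative assumptions:

# Illustrative sketch: controlling the number of predictors 'm' sampled at each split
from sklearn.ensemble import RandomForestClassifier

# max_features='sqrt' considers roughly sqrt(p) predictors at each split (a common choice for
# classification); max_features=None considers all p predictors, which is essentially Bagging
rf_sqrt = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
rf_bag = RandomForestClassifier(n_estimators=100, max_features=None, random_state=0)
# Fit on a training set defined as in the earlier examples, e.g.:
# rf_sqrt.fit(X_train, y_train)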

The final prediction of the model is obtained by combining the individual decision trees: in a classification setting the class receiving the majority of votes across the trees is predicted, while in a regression setting the numerical predictions of the individual trees are averaged.

In Python, scikit-learn has modules for Random Forest in both the classification and regression settings. Here is sample code to apply Random Forest.

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Regression setting (max_features=13 assumes the training data has at least 13 features)
regr1 = RandomForestRegressor(max_features=13, random_state=1)
regr1.fit(X_train, y_train)

# or, classification setting

clf1 = RandomForestClassifier(max_features=13, random_state=1)
clf1.fit(X_train, y_train)

Exercise

Implement a Random Forest Classifier on the iris dataset and observe the results. Print the test accuracy for a train-test split in the ratio 70-30.

In [55]:
# STEP 1 - Loading the iris data set
from sklearn import datasets

iris = datasets.load_iris()

# STEP 2 - Defining the predictors and target variable
X = iris.data
y = iris.target

# STEP 3 - Splitting the data into training and test data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Solution code

# STEP 4 - Importing and fitting the model
from sklearn.ensemble import RandomForestClassifier

iris_RFC = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
RFC = iris_RFC.fit(X_train,y_train)

# STEP 5 - Calculating the score of the model using the test data
RFC.score(X_test,y_test)

Observing the results of the Bagging Classifier and the Random Forest Classifier, we can conclude a few points -

  • The results for both models are the same, as Random Forest is just an extension of the Bagging technique
  • The base learning model in both ensembles applied above is the same - Decision Tree - and the number of trees created in both cases is also the same - 100. Hence there is little difference in the results of the two models. A significant difference may arise if the size of the dataset is large and there are more features (the iris dataset has only 4 features)
  • The only two differing parameters in the above models are the maximum number of features used to construct trees in the Bagging Classifier and the maximum depth of the trees created in the Random Forest Classifier. Try varying these parameters (keeping the number of trees the same in both models) and observe where the two models yield different results; a sketch of such a comparison follows this list.
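
Below is a minimal sketch of such a comparison (the parameter values are assumptions chosen only for illustration):

# Illustrative sketch: varying the differing parameters of the two ensembles on iris
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,
                                                    test_size=0.3, random_state=0)

for m, depth in [(1, 1), (2, 2), (3, 3), (4, None)]:
    # max_features limits the features sampled per tree in Bagging,
    # while max_depth limits how deep each Random Forest tree may grow
    bag = BaggingClassifier(n_estimators=100, max_features=m, random_state=0)
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
    print(m, depth,
          bag.fit(X_train, y_train).score(X_test, y_test),
          rf.fit(X_train, y_train).score(X_test, y_test))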

Boosting

Boosting is an ensemble algorithm designed to improve the accuracy of other Machine Learning algorithms, similar to Bagging, except that in boosting the individual models such as Decision Trees are built sequentially, each one depending on the models built before it. In bagging, by contrast, the trees are built independently on different bootstrapped samples of the training dataset.

Boosting does not involve bootstrapping; instead, each tree is fit sequentially on a modified version of the original training dataset. Because the component decision trees are used sequentially, each new tree focuses on the errors left by the ensemble built so far, so the resulting algorithm is "boosted" at each step.
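
Here is a minimal, hand-rolled sketch of this sequential idea for a regression problem (an illustration of the principle, not the exact algorithm of any particular library; the dataset and parameters are assumptions): each new tree is fit to the residual errors left by the trees before it.

# Illustrative sketch: boosting by hand on a synthetic regression problem
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

learning_rate = 0.1
prediction = np.zeros_like(y, dtype=float)
for i in range(50):
    residual = y - prediction                      # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=1, random_state=i).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # boost the ensemble with the new tree

print(np.mean((y - prediction) ** 2))              # training error after boosting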

There are many boosting algorithms such as LPBoost, TotalBoost, BrownBoost, XGBoost, MadaBoost, LogitBoost, and others. Scikit-Learn has a GradientBoostingRegressor and a GradientBoostingClassifier for the regression and classification settings respectively.

One of the most popular boosting algorithms is XGBoost, an extreme gradient boosting algorithm that is fast and scalable.

In Python, scikit-learn has modules for Gradient Boosting in both the classification and regression settings. Here is sample code to apply GradientBoostingClassifier.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_hastie_10_2

# make_hastie_10_2 generates a synthetic binary classification dataset
X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)

Exercise:

In this exercise, you are required to create a GradientBoostingClassifier for the iris dataset, fit it, and print the test accuracy score. Ensure a train-test split in the 70-30 ratio.

In [53]:
from sklearn.ensemble import GradientBoostingClassifier

X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Solution code

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)