Created
Aug 2, 2025
Last Modified
2 weeks ago

Ensemble Learning

Definition:

Ensemble learning is a machine learning technique where multiple models are combined to produce a better overall prediction than any single model alone.

Advantages
  • Improved Predictive Accuracy:
    Combining several models lets their individual errors partly cancel out, so the ensemble usually predicts better than any one of its members.

Disadvantages
  • Low Interpretability:
    Ensembles (like Random Forest, Gradient Boosting, Stacking) are often difficult to understand because they merge many models.


Why Ensembles Help

  • Statistical Stability:
    When the dataset is limited, many different hypotheses (possible models) may fit the data equally well.

  • A traditional algorithm picks only one hypothesis, which may:

    • perform well on training data

    • but perform poorly on unseen data

  • Ensemble methods reduce this risk by combining many hypotheses.
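The effect of combining hypotheses can be illustrated with a small calculation. The sketch below assumes binary labels and, importantly, that every model errs independently with the same probability. That independence assumption is not from the notes above and rarely holds exactly in practice. The function name majority_vote_error is made up for this illustration.

```python
from math import comb

def majority_vote_error(n_models: int, p_error: float) -> float:
    """Probability that a majority of the models is wrong.

    Assumes an odd n_models, binary labels, and that each model is wrong
    independently with the same probability p_error (an idealisation).
    """
    k_min = n_models // 2 + 1  # fewest wrong votes that still form a majority
    return sum(
        comb(n_models, k) * p_error**k * (1 - p_error) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

single = majority_vote_error(1, 0.3)     # one model on its own
ensemble = majority_vote_error(11, 0.3)  # majority vote over 11 such models
print(f"single model: {single:.3f}, ensemble of 11: {ensemble:.3f}")
```

Under these assumptions the ensemble errs noticeably less often than a single model. Correlated models would gain less, which is why ensemble methods try to make their members diverse.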


Key Terms

  • Hypothesis:
    A candidate model, that is, one particular mapping from inputs to outputs that a learning algorithm could produce from the data.

  • Hypothesis Space:
    The set of all possible hypotheses a learning algorithm can choose from.

  • Limitation of Algorithms:
    Due to computational constraints, algorithms cannot guarantee finding the absolute best hypothesis in the entire hypothesis space.


Bagging (Bootstrap Aggregating)

Bagging is an ensemble method built from homogeneous weak learners (all models are of the same type). The models are trained independently, in parallel, each on its own random sample of the data, and their outputs are combined to produce the final prediction.
Example: Random Forest


Benefits of Bagging

  • Reduces overfitting

  • Improves accuracy

  • Handles unstable models (like decision trees)

  • Each model is trained on a different random sample of the same data, so the models produce different outcomes, which increases diversity


Steps of Bagging

  1. Create multiple subsets from the original dataset

    • Done randomly with replacement (bootstrap sampling)

    • Subsets are of almost equal size, and their feature values follow roughly the same distribution as the original data

  2. Select the observations for each subset with replacement, so the same observation can appear more than once in a subset

  3. Train multiple models in parallel, each on a different subset

  4. Collect predictions from all models

  5. Combine the predictions

    • Classification: Use majority voting

    • Regression: Take the average of all model outputs
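The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation. The base learner (a 1-nearest-neighbour rule on a single numeric feature), the toy dataset, and all function names are assumptions chosen for brevity. The models are trained in a simple loop here, but since none depends on another they could be trained in parallel.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(data, rng):
    """Steps 1 and 2: draw len(data) observations with replacement."""
    return [rng.choice(data) for _ in data]

def fit_one_nn(sample):
    """Hypothetical base learner: 1-nearest-neighbour on (x, target) pairs."""
    def predict(x):
        nearest = min(sample, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict

def bagging_fit(data, n_models, fit=fit_one_nn, seed=0):
    """Step 3: train each model on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(n_models)]

def predict_classification(models, x):
    """Steps 4 and 5: collect the predictions and take a majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def predict_regression(models, x):
    """Steps 4 and 5: collect the predictions and average them."""
    return mean(model(x) for model in models)

# Toy 1-D dataset (an assumption): the label is 0 below 5 and 1 from 5 upward.
data = [(float(x), int(x >= 5)) for x in range(10)]
models = bagging_fit(data, n_models=25)
label_low = predict_classification(models, 1.0)
label_high = predict_classification(models, 8.0)
print(label_low, label_high)
```

The same list of models serves both aggregation rules. Which one applies depends only on whether the targets are class labels or numbers.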

Figure: Bagging process. Bootstrapped datasets are generated from the training data, and the models trained on them are combined for improved performance.


Boosting

Boosting is an ensemble method that combines a sequence of weak models into one strong classifier. The models are built one after another, and each new model tries to correct the errors made by the models before it.

Algorithm of Boosting

  1. Initialize the dataset

    • Assign equal weights to all data points.

  2. Train the first model

    • Identify the wrongly classified data points.

  3. Update weights

    • Increase weights of wrongly classified points

    • Decrease weights of correctly classified points

  4. Check accuracy

    • If the desired accuracy is reached → go to Step 5

    • Else → repeat from Step 2

  5. Stop

    • The boosting process ends.
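One concrete instance of this loop is AdaBoost, sketched below. The notes above describe boosting generically, so the specific weight formula (the model weight alpha and the exponential update) is AdaBoost's, not something stated above. The one-feature decision stump, the toy data, and the parameters max_rounds and target_accuracy are assumptions made for the illustration. Labels are taken to be -1 or +1.

```python
import math

def stump_predict(threshold, polarity, x):
    """Decision stump: predict polarity at or above the threshold, else its opposite."""
    return polarity if x >= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(threshold, polarity, x) != y
            )
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps trained so far."""
    score = sum(a * stump_predict(t, p, x) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

def adaboost_fit(xs, ys, max_rounds=10, target_accuracy=1.0):
    n = len(xs)
    weights = [1.0 / n] * n                       # step 1: equal weights
    ensemble = []                                 # (alpha, threshold, polarity)
    for _ in range(max_rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)    # step 2
        err = min(max(err, 1e-10), 1 - 1e-10)     # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)   # more accurate stump, larger say
        ensemble.append((alpha, threshold, polarity))
        # Step 3: misclassified points gain weight, correct ones lose weight.
        weights = [
            w * math.exp(-alpha * y * stump_predict(threshold, polarity, x))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Step 4: stop once the desired training accuracy is reached.
        correct = sum(adaboost_predict(ensemble, x) == y for x, y in zip(xs, ys))
        if correct / n >= target_accuracy:
            break
    return ensemble

# Toy data (an assumption): no single threshold separates these labels.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, -1, 1, 1, 1]
ensemble = adaboost_fit(xs, ys)
predictions = [adaboost_predict(ensemble, x) for x in xs]
```

On this toy data a single stump always misclassifies some points, while the weighted vote of several stumps fits all of them. The final prediction is the weighted vote, not the output of the last model alone.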


Random Forest

Random Forest is a supervised learning algorithm used for both classification and regression. It builds a forest of randomized decision trees, which are trained independently (and can therefore be trained in parallel) using the bagging technique.

The final prediction is based on:

  • Majority Vote → for classification

  • Average of Predictions → for regression


Algorithm (Random Forest)

  1. Bootstrap Sampling
    If the training set has N examples, randomly select N data points with replacement from the original dataset.
    This sample becomes the training set for one decision tree.

  2. Random Feature Selection
    If there are P input features, randomly choose M of them (M < P) at each node and split on the best of those M.
    The value of M remains fixed throughout tree construction.

  3. Prediction of New Data

    • Pass the new input through each decision tree.

    • Each tree gives its own classification or regression output.

    • Combine the outputs using:

      • Majority vote (classification)

      • Average (regression)
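The three steps can be sketched for classification in plain Python. This is a simplified illustration under several assumptions: splits are scored with Gini impurity, trees are limited to a small depth, class labels are plain values such as integers (not tuples), and the dataset and all names are invented for the example. The parameter m plays the role of M above.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, features):
    """Best (feature, threshold) among the given features, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] < t]
            right = [y for row, y in zip(rows, labels) if row[f] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, m, rng, depth=0, max_depth=5):
    """Grow one tree; leaves are labels, inner nodes are 4-tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    features = rng.sample(range(len(rows[0])), m)   # step 2: m random features
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    f, t = split

    def grow(keep):
        idx = [i for i, row in enumerate(rows) if keep(row[f])]
        return build_tree([rows[i] for i in idx], [labels[i] for i in idx],
                          m, rng, depth + 1, max_depth)

    return (f, t, grow(lambda v: v < t), grow(lambda v: v >= t))

def tree_predict(tree, row):
    while isinstance(tree, tuple):                  # descend until a leaf label
        f, t, left, right = tree
        tree = left if row[f] < t else right
    return tree

def forest_fit(rows, labels, n_trees, m, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # step 1: N draws with replacement
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], m, rng))
    return forest

def forest_predict(forest, row):
    votes = Counter(tree_predict(tree, row) for tree in forest)  # step 3
    return votes.most_common(1)[0][0]

def make_row(x):
    """Toy features (an assumption): two informative ones and a parity feature."""
    return (x, 10 - x, x % 2)

rows = [make_row(x) for x in range(10)]
labels = [int(x >= 5) for x in range(10)]
forest = forest_fit(rows, labels, n_trees=25, m=2)
low_class = forest_predict(forest, make_row(0))
high_class = forest_predict(forest, make_row(9))
```

For regression the same structure applies, with leaves storing the mean target value and the final step averaging the tree outputs instead of voting.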


Advantages of Random Forest

  • Efficient on large datasets

  • Handles a large number of input features without feature deletion

  • Provides feature importance estimates

  • Deals well with missing data

  • Models (forests) can be saved and reused

  • Works for both classification and regression

  • Helps detect variable interactions


Disadvantages of Random Forest

  • For regression, it cannot predict beyond the range of training data

  • Large model size (many trees) → more memory usage and slower predictions

  • Acts like a black box → difficult to interpret
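The first disadvantage follows from how tree predictions are formed: each leaf predicts an average of training targets, and the forest averages the leaves, so the output can never leave the range of targets seen in training. The sketch below demonstrates this with bagged depth-1 regression trees on a single feature. The data (y = 2x) and all function names are assumptions for the illustration.

```python
import random
from statistics import mean

def fit_regression_stump(xs, ys):
    """Depth-1 regression tree: one threshold, each side predicts its mean."""
    best = None
    for t in sorted(set(xs))[1:]:       # skip the minimum so both sides are non-empty
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    if best is None:                    # all x values identical: no split possible
        overall = mean(ys)
        return lambda x: overall
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x < t else right_mean

def regression_forest_fit(xs, ys, n_trees=20, seed=0):
    """Train each stump on its own bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_regression_stump([xs[i] for i in idx],
                                          [ys[i] for i in idx]))
    return trees

def regression_forest_predict(trees, x):
    return mean(tree(x) for tree in trees)

xs = [float(x) for x in range(10)]
ys = [2.0 * x for x in xs]              # training targets lie between 0 and 18
trees = regression_forest_fit(xs, ys)
far_prediction = regression_forest_predict(trees, 100.0)
print(far_prediction, "versus the true value", 2.0 * 100.0)
```

Even though the underlying relationship gives 200 at x = 100, the prediction stays at or below 18, the largest target in the training data.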