Ensambling Learning
Ensembling Learning
Definition:
Ensemble learning is a machine learning technique where multiple models are combined to produce a better overall prediction than any single model alone.
Advantages
Improved Predictive Accuracy:
Combining several models reduces errors and improves overall performance.
Disadvantages
Low Interpretability:
Ensembles (like Random Forest, Gradient Boosting, Stacking) are often difficult to understand because they merge many models.
Why Ensembles Help
Statistical Stability:
When the dataset is limited, many different hypotheses (possible models) may fit the data equally well.A traditional algorithm picks only one hypothesis, which may:
perform well on training data
but perform poorly on unseen data
Ensemble methods reduce this risk by combining many hypotheses.
Key Terms
Hypothesis:
Any possible model or outcome generated from the data.Hypothesis Space:
The set of all possible hypotheses a learning algorithm can choose from.Limitation of Algorithms:
Due to computational constraints, algorithms cannot guarantee finding the absolute best hypothesis in the entire hypothesis space.
Bagging (Bootstrap Aggregating)
Bagging is a homogeneous weak learner ensemble method where multiple models are trained independently in parallel and their outputs are combined to produce a final prediction.
Example: Random Forest
Benefits of Bagging
Reduces overfitting
Improves accuracy
Handles unstable models (like decision trees)
Different experiments on the same input may produce different outcomes, increasing diversity
Steps of Bagging
Create multiple subsets from the original dataset
Done randomly with replacement (bootstrap sampling)
Subsets have almost equal size and similar feature values
Select observations with replacement for each subset
Train multiple models in parallel, each on a different subset
Collect predictions form all models
Combine the predictions
Classification: Use majority voting
Regression: Take the average of all model outputs

Boosting
Boosting is an ensemble method in which models are arranged in sequence to create a strong classifier. The process involves building models sequentially, where each model aims to correct the error made by the previous model.
Algorithm of Boosting
Initialize the dataset
Assign equal weights to all data points.
Train the first model
Identify the wrongly classified data points.
Update weights
Increase weights of wrongly classified points
Decrease weights of correctly classified points power
Check accuracy
If the desired accuracy is reached → go to Step 5
Else → repeat from Step 2
Stop
The boosting process ends.
Random Forest
Random Forest is a supervised learning algorithm used for both classification and regression. It works by randomly creating a forest of decision trees, trained in parallel using the bagging technique.
The final prediction is based on:
Majority Vote → for classification
Average of Predictions → for regression
Algorithm (Random Forest)
Bootstrap Sampling
If the training set has examples, randomly select N data points with replacement from the original dataset.
This sample becomes the training set for one decision tree.Random Feature Selection
If there are input features, choose M features at each node .
The value of remains fixed throughout tree construction.Prediction of New Data
Pass the new input through each decision tree.
Each tree gives its own classification or regression output.
Combine the outputs using:
Majority vote (classification)
Average (regression)
Advantages of Random Forest
Efficient on large datasets
Handles a large number of input features without feature deletion
Provides feature importance estimates
Deals well with missing data
Models (forests) can be saved and reused
Works for both classification and regression
Helps detect variable interactions
Disadvantages of Random Forest
For regression, it cannot predict beyond the range of training data
Large model size (many trees) → more memory usage and slower predictions
Acts like a black box → difficult to interpret
