Cross Validation
Methods of Cross Validation
1. Hold-Out Cross Validation
🔹 Overview:
The simplest and most commonly used validation method in machine learning.
🔹 Process:
Split the dataset into two sets:
Training Set
Testing Set
Common split ratios: 70:30, 80:20, 50:50
Split is done randomly (use `random_state` for reproducibility)
🔹 Drawbacks:
❌ High Variance: Result depends heavily on how data is split.
❌ High Bias: Uses only part of the data for training.
❌ Not suitable for small datasets: Leads to unreliable test error estimation.
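The steps above can be sketched with scikit-learn's `train_test_split`; the iris dataset and logistic regression are just placeholder choices to make the example runnable:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80:20 hold-out split; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)

# Train on the training set, estimate test error on the held-out set
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
```

Re-running with a different `random_state` can change `test_accuracy` noticeably, which is exactly the high-variance drawback noted above.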
2. Leave-One-Out Cross Validation (LOOCV)
🔹 Overview:
A type of cross-validation where each data point gets a turn as a test set.
🔹 Process:
For a dataset of n points:
Train the model n times.
Each time, use 1 point for testing and the remaining n-1 points for training.
Final error = average of all n test errors
🔹 Advantages:
✅ Very low bias: Nearly all data is used for training in each iteration.
🔹 Drawbacks:
❌ High variance: Models are trained on almost identical data.
❌ Very slow: Computationally expensive for large datasets.
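A minimal sketch of LOOCV with scikit-learn's `LeaveOneOut` splitter (dataset and model are again placeholder choices); `cross_val_score` handles the n train/test loops:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # n = 150 points

# One model fit per data point: 150 fits here, which is why LOOCV is slow
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

print(len(scores))        # 150 — one test result per point
mean_accuracy = scores.mean()  # final estimate = average of all n test results
```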
3. K-Fold Cross Validation
🔹 Overview:
A popular and efficient cross-validation method.
🔹 Process:
Split the dataset into K equal folds.
Repeat K times:
Use 1 fold for testing.
Use K-1 folds for training.
Final error = average of K test errors
🔹 Advantages:
✅ Lower variance than LOOCV and lower bias than hold-out.
✅ Faster than LOOCV.
✅ Common choices of K = 5 or 10 work well in practice.
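The K-fold loop above can be sketched as follows, using scikit-learn's `KFold` with K = 5 (dataset and model are placeholder choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# K = 5: each iteration tests on 1 fold and trains on the other 4
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print(len(scores))             # 5 — one score per fold
mean_accuracy = scores.mean()  # final estimate = average of the K test scores
```

Only 5 fits instead of LOOCV's 150, which is where the speed advantage comes from.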
4. Stratified K-Fold Cross Validation
🔹 Overview:
A variation of K-Fold that maintains class proportions across all folds.
🔹 Use Case:
Especially suited to imbalanced datasets.
🔹 Example:
Dataset with 60% female and 40% male users:
Normal K-Fold may split unevenly.
Stratified K-Fold ensures each fold has 60% female, 40% male.
🔹 Advantage:
✅ Better represents the population in each fold.
✅ Gives more reliable evaluation on classification problems.
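The 60/40 example above can be checked directly with scikit-learn's `StratifiedKFold`; the toy labels below stand in for the female/male split (0 = female, 1 = male, a made-up encoding for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy dataset: 100 samples, 60% class 0 and 40% class 1
y = np.array([0] * 60 + [1] * 40)
X = np.arange(100).reshape(-1, 1)  # dummy feature column

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    # Every test fold of 20 samples keeps the 60/40 ratio: 12 vs 8
    print(np.bincount(y[test_idx]))
```

A plain `KFold` on the same data gives no such guarantee; its folds can drift away from the 60/40 proportions.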
