Created: Apr 8, 2025

Cross Validation

Methods of Cross Validation


1. Hold-Out Cross Validation

🔹 Overview:
The simplest and most commonly used validation method in machine learning: set aside part of the data before training and evaluate on it afterwards.

🔹 Process:

  • Split the dataset into two sets:

    • Training Set

    • Testing Set

  • Common split ratios: 70:30, 80:20, 50:50

  • Split is done randomly (fix a seed, e.g. random_state in scikit-learn, for reproducibility)

🔹 Drawbacks:

  • High Variance: Result depends heavily on how data is split.

  • High Bias: Uses only part of the data for training.

  • Not suitable for small datasets: Leads to unreliable test error estimation.
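The hold-out split above can be sketched in plain Python. In practice you would use scikit-learn's `train_test_split`; the `holdout_split` helper here is just an illustration of the idea:

```python
import random

def holdout_split(data, test_ratio=0.3, seed=42):
    """Randomly shuffle data and split it into (train, test) sets."""
    rng = random.Random(seed)      # fixed seed -> reproducible split
    shuffled = data[:]             # copy so the original list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]   # train, test

train, test = holdout_split(list(range(10)), test_ratio=0.3)
# a 70:30 split of 10 points gives 7 training and 3 test points
```

Changing the seed changes which points land in the test set, which is exactly the "high variance" drawback listed above.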


2. Leave-One-Out Cross Validation (LOOCV)

🔹 Overview:
A type of cross-validation where each data point gets a turn as a test set.

🔹 Process:

  • For a dataset of n points:

    • Train the model n times.

    • Each time, use 1 point for testing and the remaining n-1 points for training.

  • Final error = average of all n test errors

🔹 Advantages:

  • Very low bias: Nearly all data is used for training in each iteration.

🔹 Drawbacks:

  • High variance: The n models are trained on nearly identical data, so their test-error estimates are highly correlated.

  • Very slow: Computationally expensive for large datasets.
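The LOOCV splitting scheme can be written as a small generator (scikit-learn provides `LeaveOneOut` for this; the version below is only a sketch of the index logic):

```python
def loocv_splits(n):
    """Yield (train_indices, test_index): each of the n points is the test set exactly once."""
    for i in range(n):
        train = [j for j in range(n) if j != i]  # all points except i
        yield train, i

splits = list(loocv_splits(4))
# 4 points -> 4 splits, each training on the other 3 points
```

For n points this means fitting the model n times, which is why LOOCV becomes expensive on large datasets.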


3. K-Fold Cross Validation

🔹 Overview:
A popular and efficient cross-validation method.

🔹 Process:

  • Split the dataset into K equal folds.

  • Repeat K times:

    • Use 1 fold for testing.

    • Use K-1 folds for training.

  • Final error = average of K test errors

🔹 Advantages:

  • ✅ Lower variance than LOOCV and lower bias than hold-out.

  • ✅ Faster than LOOCV.

  • ✅ Works well with K = 5 or 10.
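The K-Fold procedure above can be sketched as index arithmetic (scikit-learn's `KFold` does this, plus optional shuffling; the `kfold_splits` helper here is an illustration only):

```python
def kfold_splits(n, k):
    """Split indices 0..n-1 into k near-equal folds; yield (train, test) index lists."""
    # first n % k folds get one extra sample when n is not divisible by k
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                # this fold tests
        train = indices[:start] + indices[start + size:]  # the other k-1 folds train
        yield train, test
        start += size

splits = list(kfold_splits(10, 5))
# 10 points, K=5 -> 5 splits, each with 8 training and 2 test points
```

Every point appears in a test fold exactly once, so the averaged error uses the whole dataset while each model still trains on most of it.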


4. Stratified K-Fold Cross Validation

🔹 Overview:
A variation of K-Fold that maintains class proportions across all folds.

🔹 Use Case:

  • Perfect for imbalanced datasets.

🔹 Example:

Dataset with 60% female and 40% male users:
Normal K-Fold may produce folds with skewed class proportions.
Stratified K-Fold ensures each fold has 60% female, 40% male.

🔹 Advantage:

  • ✅ Better represents the population in each fold.

  • ✅ Improves performance on classification problems.
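One simple way to preserve class proportions, as in the 60/40 example above, is to deal each class's samples round-robin across the folds (scikit-learn's `StratifiedKFold` is the standard implementation; this is only a sketch):

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)      # group sample indices by class
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        # deal each class's samples round-robin across the folds
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    return folds

labels = ["F"] * 6 + ["M"] * 4   # 60% female, 40% male, as in the example
folds = stratified_folds(labels, 2)
# each of the 2 folds gets 3 "F" and 2 "M" -> the 60:40 ratio is preserved
```

Each fold then tests on a miniature version of the full class distribution, which is what makes stratification valuable on imbalanced classification problems.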