Data Preprocessing And Grid Search
What is Data Preprocessing?
Data preprocessing is the process of preparing raw data for use in a machine learning model.
Its main goal is to improve data quality, ensuring accuracy, consistency, and reliability.
Main Steps of Data Preprocessing:
1. Data Cleaning
Involves identifying and fixing:
Missing Values:
Replace with the mean, median, or most probable value.
Noisy Data:
Irrelevant or incorrect data (e.g., entry errors). Use clustering (like DBSCAN) or regression smoothing to detect and remove it.
Duplicate Data:
Remove repeated entries to avoid skewed results.
Example:
| S.No | Age | Salary | Experience |
|---|---|---|---|
| 1 | 30 | 5000 | 4 |
| 2 | 32 | 6500 | 6 |
| 3 | 36 | 4300 | 7 |
| 4 | 28 | 2100 | missing |
| 5 | 39 | 5500 | 8 |
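A minimal sketch of the cleaning steps above, applied to the example table using only the Python standard library. The helper names `impute` and `drop_duplicates` are illustrative, not from any particular library, and the choice to ignore the `s_no` column when comparing rows is an assumption.

```python
from statistics import mean, median

# Rows from the example table; None marks the missing Experience value.
rows = [
    {"s_no": 1, "age": 30, "salary": 5000, "experience": 4},
    {"s_no": 2, "age": 32, "salary": 6500, "experience": 6},
    {"s_no": 3, "age": 36, "salary": 4300, "experience": 7},
    {"s_no": 4, "age": 28, "salary": 2100, "experience": None},
    {"s_no": 5, "age": 39, "salary": 5500, "experience": 8},
]

def impute(rows, column, strategy=mean):
    """Return a copy of rows with missing values in `column` filled in."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill_value = strategy(observed)  # e.g. mean or median of the observed values
    return [
        {**r, column: fill_value if r[column] is None else r[column]}
        for r in rows
    ]

def drop_duplicates(rows, ignore=("s_no",)):
    """Keep the first occurrence of each row, comparing all columns except `ignore`."""
    seen, unique = set(), []
    for r in rows:
        key = tuple((k, v) for k, v in sorted(r.items()) if k not in ignore)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

cleaned = drop_duplicates(impute(rows, "experience", strategy=mean))
print(cleaned[3]["experience"])  # mean of the observed values 4, 6, 7, 8
```

Passing `strategy=median` instead fills the gap with the median of the observed values, which is less sensitive to outliers.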
2. Data Integration
Combines data from multiple sources into a single dataset.
Challenges include varying formats and inconsistencies.
Record Linkage:
Match records that refer to the same entity.
Data Fusion:
Merge data from different sources to create a unified dataset.
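As a rough illustration of record linkage and data fusion, the sketch below matches records from two sources on a normalized e-mail address and merges their fields. The sources (`crm`, `billing`), the records, and the rule that the primary source wins on conflicts are all hypothetical assumptions made for the example.

```python
# Hypothetical records from two sources describing overlapping people.
crm = [
    {"email": "Ana@Example.com ", "name": "Ana Silva", "phone": None},
    {"email": "bo@example.com", "name": "Bo Chen", "phone": "555-0101"},
]
billing = [
    {"email": "ana@example.com", "name": "A. Silva", "phone": "555-0100"},
    {"email": "cy@example.com", "name": "Cy Odum", "phone": None},
]

def link_key(record):
    """Record linkage key: a normalized e-mail address (an assumed identifier)."""
    return record["email"].strip().lower()

def fuse(primary, secondary):
    """Data fusion: prefer the primary source, fall back to the secondary for gaps."""
    merged = {}
    for record in secondary + primary:  # primary is applied last, so it wins
        entity = merged.setdefault(link_key(record), {})
        for field, value in record.items():
            # Never overwrite a known value with a missing one.
            if value is not None or field not in entity:
                entity[field] = value
        entity["email"] = link_key(record)  # store the normalized form
    return list(merged.values())

unified = fuse(crm, billing)
for entity in unified:
    print(entity)
```

Real datasets rarely share a clean key like this, so linkage often relies on fuzzy matching across several fields; the exact-match key here keeps the sketch short.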
3. Data Transformation
Converts data into a format suitable for analysis and interpretation.
Normalization:
Scales features to a common range (e.g., [0, 1]).
Standardization:
Adjusts features to have mean = 0 and standard deviation = 1.
Discretization:
Converts continuous features into discrete categories.
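The three transformations can be sketched on the Age column of the earlier example table. This is a minimal standard-library version; the bin edges and labels used for discretization are arbitrary choices for illustration, and the functions assume the column is not constant (otherwise the divisions would fail).

```python
from bisect import bisect_right
from statistics import mean, pstdev

ages = [30, 32, 36, 28, 39]  # Age column from the example table

def normalize(values):
    """Min-max normalization: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: shift to mean 0 and scale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def discretize(values, edges, labels):
    """Map each value to the label of its bin; `edges` are the bin lower bounds."""
    return [labels[bisect_right(edges, v)] for v in values]

print(normalize(ages))
print(standardize(ages))
print(discretize(ages, edges=[30, 36], labels=["under 30", "30 to 35", "36 and over"]))
```

For this column the mean is 33 and the population standard deviation is 4, so an age of 39 standardizes to (39 - 33) / 4 = 1.5.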
4. Data Reduction
Reduces the dataset size while retaining essential information.
Feature Selection:
Keeps the most relevant attributes.
Feature Extraction:
Converts features into a lower-dimensional space (e.g., using PCA).
Numerosity Reduction:
Reduces the number of data points (e.g., through sampling) without losing key patterns.
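Two of these ideas can be sketched briefly: a simple variance filter as one possible form of feature selection, and random sampling as numerosity reduction. The `country_code` column is a hypothetical constant feature added for illustration, and the helper names are not from any library. Feature extraction with PCA is not shown because it needs linear algebra beyond a short standard-library example.

```python
import random
from statistics import pvariance

# Columns from the example table plus a hypothetical constant feature.
features = {
    "age": [30, 32, 36, 28, 39],
    "salary": [5000, 6500, 4300, 2100, 5500],
    "country_code": [1, 1, 1, 1, 1],  # carries no information: same for every row
}

def select_features(columns, min_variance=0.0):
    """Feature selection: keep only columns whose variance exceeds the threshold."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > min_variance
    }

def sample_rows(rows, fraction, seed=0):
    """Numerosity reduction: keep a random sample of the rows (without replacement)."""
    k = max(1, round(len(rows) * fraction))
    return random.Random(seed).sample(rows, k)

selected = select_features(features)      # drops the constant column
rows = list(zip(*selected.values()))      # one tuple per data point
subset = sample_rows(rows, fraction=0.6)  # 3 of the 5 rows
print(list(selected), len(subset))
```

Variance depends on the scale of each feature, so a nonzero threshold is usually only meaningful after the features have been normalized.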
Grid Search
Grid Search is a technique used to tune hyperparameters of ML algorithms.
It exhaustively evaluates every combination in a predefined grid of candidate values.
It helps in selecting the best-performing combination of hyperparameters for a given model.
Example Hyperparameters:
Learning Rate (in Gradient Descent)
Depth of Decision Tree
Number of Neighbors in KNN
Note: These values are not learned from data; they are set manually and tuned using grid search.
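To make the idea concrete, here is a minimal sketch of grid search written from scratch for a small k-nearest-neighbors classifier. The toy dataset, the grid values, and the use of leave-one-out accuracy as the score are all assumptions for illustration. In practice a library utility such as scikit-learn's `GridSearchCV` is the usual tool.

```python
from collections import Counter
from itertools import product

# Hypothetical toy data: (feature vector, class label), two well-separated classes.
data = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.8, 1.1), "a"), ((1.1, 1.3), "a"),
    ((3.0, 3.2), "b"), ((3.1, 2.9), "b"), ((2.9, 3.0), "b"), ((3.3, 3.1), "b"),
]

def distance(p, q, metric):
    if metric == "manhattan":
        return sum(abs(a - b) for a, b in zip(p, q))
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5  # euclidean

def knn_predict(train, point, k, metric):
    """Majority vote among the k training points closest to `point`."""
    nearest = sorted(train, key=lambda item: distance(item[0], point, metric))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k, metric):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (point, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += knn_predict(train, point, k, metric) == label
    return hits / len(data)

# The grid: every combination of these values is evaluated.
param_grid = {"k": [1, 3, 5, 7], "metric": ["euclidean", "manhattan"]}

results = []
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    results.append((loo_accuracy(data, **params), params))

best_score, best_params = max(results, key=lambda r: r[0])
print(best_score, best_params)
```

The grid has 4 x 2 = 8 combinations, and the cost grows multiplicatively with each added hyperparameter, which is why grid search becomes slow on large grids. On this toy data, k = 7 fails completely: with one point held out, its own class has only three members left, so the other class always wins the vote.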
Advantages & Disadvantages
| Advantages | Disadvantages |
|---|---|
| Improves model performance | Time-consuming |
| Ensures data consistency | May lead to potential data loss |
| Helps handle messy real-world data | Resource-intensive process |
