Statistics for Machine Learning: Practice Numericals With Solutions
Topic: A complete statistics practice set for Data Science and Machine Learning students — covering mean, median, mode, standard deviation, variance, quartiles, grouped data, and skewness with step-by-step solutions.
Before you can understand how a machine learning model learns, you need statistics. Every algorithm you'll ever use — linear regression, naive Bayes, neural networks, clustering — is built on statistical concepts. Mean and variance tell you about your data's center and spread. Standard deviation flags outliers. Skewness tells you if your distribution is lopsided and whether to normalize.
This practice set covers everything from ungrouped data calculations to grouped frequency distributions — the exact problems that appear in university exams and data science interviews. Every question includes a full step-by-step solution.
Important Formulas — Keep These Open While Solving
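Formula | Expression |
|---|---|
Arithmetic Mean | $\bar{x} = \dfrac{\sum x}{n}$ |
Mean (Frequency Distribution) | $\bar{x} = \dfrac{\sum fx}{\sum f}$ |
Standard Deviation | $\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$ |
Variance | $\sigma^2 = \dfrac{\sum (x - \bar{x})^2}{n}$ |
Median (Grouped Data) | $\text{Median} = L + \dfrac{\frac{N}{2} - CF}{f} \times h$ |
Mode (Grouped Data) | $\text{Mode} = L + \dfrac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$ |
Coefficient of Variation | $CV = \dfrac{\sigma}{\bar{x}} \times 100$ |
Karl Pearson's Skewness | $Sk = \dfrac{\text{Mean} - \text{Mode}}{\sigma}$ or $Sk = \dfrac{3(\text{Mean} - \text{Median})}{\sigma}$ |

Here $L$ is the lower boundary of the median (or modal) class, $N = \sum f$, $CF$ is the cumulative frequency before the median class, $f$ is the frequency of the median class, $h$ is the class width, and $f_1$, $f_0$, $f_2$ are the frequencies of the modal class and of the classes just before and just after it.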
Part 1: Mean, Median, Mode (Ungrouped Data)
Q1. Arithmetic Mean
Find the mean of: 12, 15, 18, 20, 25, 30, 35
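Solution:
$$\bar{x} = \frac{12 + 15 + 18 + 20 + 25 + 30 + 35}{7} = \frac{155}{7} \approx 22.14$$
Why it matters in ML: The mean is the foundation of linear regression. The least-squares regression line always passes through the point $(\bar{x}, \bar{y})$.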
Q2. Median
Find the median of: 7, 12, 15, 18, 21, 24, 27, 30
Solution:
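Data is already sorted. n = 8 (even), so the median is the average of the 4th and 5th values:
$$\text{Median} = \frac{18 + 21}{2} = 19.5$$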
Why it matters in ML: The median is robust to outliers — which is why median absolute deviation is preferred over standard deviation for outlier detection in real datasets.
Q3. Mode
Find the mode of: 4, 5, 7, 8, 5, 9, 5, 10, 7, 8
Solution:
Count frequencies:
4 → 1 time
5 → 3 times
7 → 2 times
8 → 2 times
9 → 1 time
10 → 1 time
The value 5 occurs most often (3 times), so Mode = 5.
Why it matters in ML: In classification problems, predicting the most frequent class is the baseline — called the "zero-rule classifier." If your model can't beat the mode, it's not learning anything.
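If you want to double-check Q1 to Q3 by machine, here is a minimal sketch using Python's built-in `statistics` module. The variable names are only illustrative and are not part of the original problems.

```python
import statistics

q1_data = [12, 15, 18, 20, 25, 30, 35]
q2_data = [7, 12, 15, 18, 21, 24, 27, 30]
q3_data = [4, 5, 7, 8, 5, 9, 5, 10, 7, 8]

q1_mean = statistics.mean(q1_data)      # sum of the values divided by 7
q2_median = statistics.median(q2_data)  # average of the two middle values (n is even)
q3_mode = statistics.mode(q3_data)      # the most frequent value

print(round(q1_mean, 2), q2_median, q3_mode)
```

Note that `statistics.mode` returns the first mode it encounters when several values tie; `statistics.multimode` returns all of them.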
Q4. Mean and Median
Find both mean and median of: 22, 25, 28, 30, 32, 35, 40, 45
Solution:
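Mean:
$$\bar{x} = \frac{22 + 25 + 28 + 30 + 32 + 35 + 40 + 45}{8} = \frac{257}{8} = 32.125$$
Median (n = 8, even): the average of the 4th and 5th values:
$$\text{Median} = \frac{30 + 32}{2} = 31$$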
Q5. Missing Value
The mean of 10, 15, 20, 25, x is 18. Find x.
Solution:
$$\frac{10 + 15 + 20 + 25 + x}{5} = 18 \;\Rightarrow\; 70 + x = 90 \;\Rightarrow\; x = 20$$
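Q4 and Q5 can be verified the same way. The sketch below is just one way to do it: Q5 uses the fact that the five values must add up to n × mean, so the unknown is whatever is left over. All names are illustrative.

```python
import statistics

# Q4: mean and median of the same data set
q4_data = [22, 25, 28, 30, 32, 35, 40, 45]
q4_mean = statistics.mean(q4_data)
q4_median = statistics.median(q4_data)

# Q5: the total must equal n * mean, so x is the remainder
known_values = [10, 15, 20, 25]
target_mean = 18
n = len(known_values) + 1          # four known values plus the unknown x
x = n * target_mean - sum(known_values)

print(q4_mean, q4_median, x)
```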
Part 2: Standard Deviation & Variance
Standard deviation measures how spread out your data is around the mean. A small σ means the data clusters tightly. A large σ means it's scattered. In ML, high-variance features can dominate distance-based algorithms like KNN — which is why feature scaling (normalization/standardization) matters.
Q6. Variance and Standard Deviation
Find variance and standard deviation of: 5, 7, 9, 11, 13
Solution:
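Step 1: Find the mean:
$$\bar{x} = \frac{5 + 7 + 9 + 11 + 13}{5} = \frac{45}{5} = 9$$
Step 2: Find the deviations and squared deviations:
x | $x - \bar{x}$ | $(x - \bar{x})^2$ |
|---|---|---|
5 | −4 | 16 |
7 | −2 | 4 |
9 | 0 | 0 |
11 | +2 | 4 |
13 | +4 | 16 |
Total | | 40 |
Step 3: Divide by n and take the square root:
$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{40}{5} = 8, \qquad \sigma = \sqrt{8} \approx 2.83$$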
Q7. Standard Deviation
Find the standard deviation of: 2, 4, 4, 4, 5, 5, 7, 9
Solution:
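Mean:
$$\bar{x} = \frac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = \frac{40}{8} = 5$$
Squared deviations:
x | $x - \bar{x}$ | $(x - \bar{x})^2$ |
|---|---|---|
2 | −3 | 9 |
4 | −1 | 1 |
4 | −1 | 1 |
4 | −1 | 1 |
5 | 0 | 0 |
5 | 0 | 0 |
7 | +2 | 4 |
9 | +4 | 16 |
Total | | 32 |
$$\sigma^2 = \frac{32}{8} = 4, \qquad \sigma = \sqrt{4} = 2$$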
Note: This dataset is a popular textbook example because the numbers work out cleanly: the mean is exactly 5 and the population standard deviation is exactly 2. The values are not symmetric around the mean, but the deviations still sum to zero, as they always do.
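Here is a short sketch for checking Q6 and Q7 in Python, assuming the population convention (dividing by n). The `statistics` module also offers sample versions that divide by n − 1, and it is worth seeing that they give different numbers. Variable names are illustrative.

```python
import statistics

q6_data = [5, 7, 9, 11, 13]
q7_data = [2, 4, 4, 4, 5, 5, 7, 9]

# pvariance / pstdev divide by n (population formulas)
q6_variance = statistics.pvariance(q6_data)
q6_sd = statistics.pstdev(q6_data)
q7_sd = statistics.pstdev(q7_data)

# variance / stdev divide by n - 1 (sample formulas), so they come out larger
q6_sample_variance = statistics.variance(q6_data)

print(q6_variance, round(q6_sd, 2), q7_sd, q6_sample_variance)
```

Many libraries default to one convention or the other, so when your answer is slightly off from a tool's output, the divisor is the first thing to check.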
Q8. Coefficient of Variation
Find mean, standard deviation, and CV for: 10, 12, 15, 18, 20
Solution:
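Mean:
$$\bar{x} = \frac{10 + 12 + 15 + 18 + 20}{5} = \frac{75}{5} = 15$$
Squared deviations:
x | $x - \bar{x}$ | $(x - \bar{x})^2$ |
|---|---|---|
10 | −5 | 25 |
12 | −3 | 9 |
15 | 0 | 0 |
18 | +3 | 9 |
20 | +5 | 25 |
Total | | 68 |
$$\sigma^2 = \frac{68}{5} = 13.6, \qquad \sigma = \sqrt{13.6} \approx 3.69$$
$$CV = \frac{\sigma}{\bar{x}} \times 100 = \frac{3.69}{15} \times 100 \approx 24.59\%$$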
Why it matters in ML: CV compares variability across features with different units (e.g., salary in thousands vs age in years). A high CV feature needs normalization before using distance-based models.
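A minimal sketch of the CV calculation follows, again assuming the population standard deviation. The function name `coefficient_of_variation` is made up for this example.

```python
import statistics

def coefficient_of_variation(values):
    """Return the population CV as a percentage: (sigma / mean) * 100."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(values) / mean * 100

q8_cv = coefficient_of_variation([10, 12, 15, 18, 20])
print(round(q8_cv, 2))
```

Because CV divides by the mean, it only makes sense for data measured on a ratio scale with a mean well away from zero.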
Part 3: Range, Quartiles, and Dispersion
Q9. Range
Find the range of: 18, 25, 12, 30, 45, 28, 35
Solution:
$$\text{Range} = \text{Maximum} - \text{Minimum} = 45 - 12 = 33$$
Q10. Quartiles
Find Q1, Q2 (Median), and Q3 for: 5, 8, 10, 12, 15, 18, 20, 22, 25
Solution:
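Sorted data (n = 9): 5, 8, 10, 12, 15, 18, 20, 22, 25
Q2 (Median) = 5th value = 15
Q1 = median of the lower half (5, 8, 10, 12) = $\frac{8 + 10}{2} = 9$
Q3 = median of the upper half (18, 20, 22, 25) = $\frac{20 + 22}{2} = 21$
Interquartile Range (IQR): $Q_3 - Q_1 = 21 - 9 = 12$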
Why it matters in ML: The IQR is how box plots detect outliers. Any value below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$ is flagged. This is one of the most common data cleaning steps in any ML pipeline.
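The quartiles and the box plot rule can be sketched with `statistics.quantiles`. This assumes the "exclusive" method, which places Q1 and Q3 at positions (n + 1)/4 and 3(n + 1)/4; other quartile conventions can give slightly different values for the same data. Variable names are illustrative.

```python
import statistics

q10_data = [5, 8, 10, 12, 15, 18, 20, 22, 25]

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = statistics.quantiles(q10_data, n=4, method="exclusive")
iqr = q3 - q1

# Box plot rule: anything outside the fences is flagged as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in q10_data if v < lower_fence or v > upper_fence]

print(q1, q2, q3, iqr, outliers)
```

For this data set the fences sit well outside the observed range, so nothing is flagged.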
Part 4: Grouped Data / Class Interval Questions
These are the most important question types for university exams. The key difference from ungrouped data: you use the midpoint of each class interval as your representative value.
Q11. Mean from Frequency Distribution (Direct Method)
Class Interval | Frequency (f) | Midpoint (x) | fx |
|---|---|---|---|
0–10 | 5 | 5 | 25 |
10–20 | 8 | 15 | 120 |
20–30 | 12 | 25 | 300 |
30–40 | 10 | 35 | 350 |
40–50 | 5 | 45 | 225 |
Total | 40 | | 1020 |
$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{1020}{40} = 25.5$$
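The direct method is easy to sketch in code: take each class midpoint, weight it by its frequency, and divide by the total frequency. The `(lower, upper, frequency)` tuple layout and the function name are assumptions made for this example.

```python
# (lower bound, upper bound, frequency) for each class in Q11
q11_classes = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 10), (40, 50, 5)]

def grouped_mean(classes):
    """Mean of grouped data using class midpoints as representative values."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return total_fx / total_f

q11_mean = grouped_mean(q11_classes)
print(q11_mean)
```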
Q12. Median from Grouped Data
Class Interval | Frequency | Cumulative Frequency |
|---|---|---|
0–10 | 4 | 4 |
10–20 | 6 | 10 |
20–30 | 10 | 20 |
30–40 | 8 | 28 |
40–50 | 2 | 30 |
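$N = 30$, so $\frac{N}{2} = 15$
The cumulative frequency just exceeds 15 at the 20–30 class → Median class = 20–30
$L = 20$, $CF = 10$, $f = 10$, $h = 10$
$$\text{Median} = L + \frac{\frac{N}{2} - CF}{f} \times h = 20 + \frac{15 - 10}{10} \times 10 = 25$$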
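The grouped median procedure (find the class where the cumulative frequency first reaches N/2, then interpolate inside it) could be sketched like this. The function name and tuple layout are hypothetical, and the classes are assumed to be sorted and contiguous.

```python
def grouped_median(classes):
    """classes: list of (lower, upper, frequency), sorted by lower bound."""
    half = sum(f for _, _, f in classes) / 2
    cumulative = 0  # CF of all classes before the current one
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= half:
            # Interpolate within the median class: L + (N/2 - CF) / f * h
            return lower + (half - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("no observations in the frequency table")

q12_classes = [(0, 10, 4), (10, 20, 6), (20, 30, 10), (30, 40, 8), (40, 50, 2)]
q12_median = grouped_median(q12_classes)
print(q12_median)
```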
Q13. Mode from Grouped Data
Class Interval | Frequency |
|---|---|
0–10 | 3 |
10–20 | 7 |
20–30 | 12 ← Modal class |
30–40 | 9 |
40–50 | 4 |
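$L = 20$, $f_1 = 12$, $f_0 = 7$, $f_2 = 9$, $h = 10$
$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h = 20 + \frac{12 - 7}{24 - 7 - 9} \times 10 = 20 + \frac{5}{8} \times 10 = 26.25$$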
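Here is a rough sketch of the grouped mode formula. The function name is made up, and it assumes a single modal class with equal class widths; if the modal class is the first or last one, the missing neighbour's frequency is treated as 0.

```python
def grouped_mode(classes):
    """classes: list of (lower, upper, frequency), sorted by lower bound."""
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))                     # modal class (first one on ties)
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0               # class before the modal class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0  # class after the modal class
    lower, upper, _ = classes[i]
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * (upper - lower)

q13_classes = [(0, 10, 3), (10, 20, 7), (20, 30, 12), (30, 40, 9), (40, 50, 4)]
q13_mode = grouped_mode(q13_classes)
print(q13_mode)
```

The denominator becomes zero when the modal class and both neighbours share the same frequency, so a production version would need to handle that case.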
Q14. Standard Deviation from Grouped Data
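Class | f | Midpoint (x) | fx | $d = x - 35$ | $d^2$ | $fd^2$ |
|---|---|---|---|---|---|---|
10–20 | 5 | 15 | 75 | −20 | 400 | 2000 |
20–30 | 8 | 25 | 200 | −10 | 100 | 800 |
30–40 | 15 | 35 | 525 | 0 | 0 | 0 |
40–50 | 10 | 45 | 450 | +10 | 100 | 1000 |
50–60 | 7 | 55 | 385 | +20 | 400 | 2800 |
Total | 45 | | 1635 | | | 6600 |
The deviations in this table are measured from the assumed mean $A = 35$ (the midpoint of the 30–40 class), not from the actual mean $\bar{x} = \frac{1635}{45} \approx 36.33$. The shortcut formula therefore needs a correction term, with $\sum fd = -100 - 80 + 0 + 100 + 140 = 60$:
$$\sigma = \sqrt{\frac{\sum fd^2}{N} - \left(\frac{\sum fd}{N}\right)^2} = \sqrt{\frac{6600}{45} - \left(\frac{60}{45}\right)^2} \approx \sqrt{146.67 - 1.78} \approx 12.04$$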
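As a cross-check, the sketch below computes the grouped standard deviation directly from the actual mean of the midpoints rather than from an assumed mean. It uses the population formula (dividing by N); the function name and tuple layout are illustrative.

```python
import math

# (lower bound, upper bound, frequency) for each class in Q14
q14_classes = [(10, 20, 5), (20, 30, 8), (30, 40, 15), (40, 50, 10), (50, 60, 7)]

def grouped_mean_and_sd(classes):
    """Return (mean, population SD) of grouped data using class midpoints."""
    freqs = [f for _, _, f in classes]
    mids = [(lower + upper) / 2 for lower, upper, _ in classes]
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, mids)) / n
    variance = sum(f * (x - mean) ** 2 for f, x in zip(freqs, mids)) / n
    return mean, math.sqrt(variance)

q14_mean, q14_sd = grouped_mean_and_sd(q14_classes)
print(round(q14_mean, 2), round(q14_sd, 2))
```

Measuring deviations from a convenient round number such as 35 keeps hand arithmetic simple, but the result only matches this direct calculation once the $(\sum fd / N)^2$ correction is subtracted.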
Part 5: Previous-Year Style Mixed Numericals
Q15. Mean, Median & Mode from Frequency Distribution
Marks | No. of Students (f) | Midpoint (x) | fx | CF |
|---|---|---|---|---|
0–10 | 5 | 5 | 25 | 5 |
10–20 | 9 | 15 | 135 | 14 |
20–30 | 12 | 25 | 300 | 26 |
30–40 | 8 | 35 | 280 | 34 |
40–50 | 6 | 45 | 270 | 40 |
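Total | 40 | | 1010 | |
Mean:
$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{1010}{40} = 25.25$$
Median:
$\frac{N}{2} = 20$ → Median class = 20–30 (CF crosses 20), so $L = 20$, $CF = 14$, $f = 12$, $h = 10$
$$\text{Median} = 20 + \frac{20 - 14}{12} \times 10 = 25$$
Mode: Highest frequency = 12 → Modal class = 20–30, so $f_1 = 12$, $f_0 = 9$, $f_2 = 8$
$$\text{Mode} = 20 + \frac{12 - 9}{24 - 9 - 8} \times 10 = 20 + \frac{3}{7} \times 10 \approx 24.29$$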
Q16. Mean Deviation About Mean
Data: 14, 18, 20, 22, 25, 30
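Mean:
$$\bar{x} = \frac{14 + 18 + 20 + 22 + 25 + 30}{6} = \frac{129}{6} = 21.5$$
x | $\lvert x - \bar{x} \rvert$ |
|---|---|
14 | 7.5 |
18 | 3.5 |
20 | 1.5 |
22 | 0.5 |
25 | 3.5 |
30 | 8.5 |
Total | 25 |
$$\text{Mean Deviation} = \frac{\sum \lvert x - \bar{x} \rvert}{n} = \frac{25}{6} \approx 4.17$$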
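Mean deviation about the mean has no dedicated function in Python's `statistics` module, so here is a small sketch that follows the definition directly. The function name is made up for this example.

```python
import statistics

def mean_deviation_about_mean(values):
    """Average absolute distance of each value from the mean."""
    mean = statistics.mean(values)
    return sum(abs(v - mean) for v in values) / len(values)

q16_md = mean_deviation_about_mean([14, 18, 20, 22, 25, 30])
print(round(q16_md, 2))
```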
Q17. Average Salary Problem
Given: Average salary of 8 employees = ₹24,000. One leaves, a new one joins at ₹30,000, new average = ₹25,000. Find the salary of the employee who left.
Solution:
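Total salary (original 8 employees): $8 \times 24{,}000 = ₹192{,}000$
Total salary (new 8 employees): $8 \times 25{,}000 = ₹200{,}000$
Difference after swap: $200{,}000 - 192{,}000 = ₹8{,}000$. The headcount is unchanged, so this difference equals the new salary minus the old one.
Salary of the employee who left: $30{,}000 - 8{,}000 = ₹22{,}000$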
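The same reasoning in a few lines of Python, as a quick sanity check. All variable names are illustrative.

```python
n_employees = 8
old_average = 24_000
new_average = 25_000
joiner_salary = 30_000

old_total = n_employees * old_average
new_total = n_employees * new_average

# Headcount is unchanged, so the change in total = joiner's salary - leaver's salary
leaver_salary = joiner_salary - (new_total - old_total)
print(leaver_salary)
```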
Q18. Karl Pearson's Coefficient of Skewness
Given: Mean = 45, Median = 42, σ = 6
$$Sk = \frac{3(\text{Mean} - \text{Median})}{\sigma} = \frac{3(45 - 42)}{6} = 1.5$$
Since $Sk > 0$, the distribution is positively skewed (tail on the right).
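Why it matters in ML: Skewed features can hurt linear models and distance-based models, which are sensitive to a long tail. Tree-based models are far less affected because they split on the ordering of values rather than their magnitude. A positive skew above 1 is often a signal to try a log transformation before feeding data into your model. This is one of the first checks in any exploratory data analysis (EDA).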
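A minimal sketch of the median-based Pearson coefficient, taking the summary statistics as inputs. The function name is hypothetical.

```python
def pearson_median_skewness(mean, median, sd):
    """Karl Pearson's second coefficient: 3 * (mean - median) / sd."""
    if sd == 0:
        raise ValueError("skewness is undefined when the standard deviation is zero")
    return 3 * (mean - median) / sd

q18_skew = pearson_median_skewness(mean=45, median=42, sd=6)
direction = "positive" if q18_skew > 0 else "negative" if q18_skew < 0 else "none"
print(q18_skew, direction)
```

Keep in mind that this is a different measure from the moment-based skewness most data libraries report, so the two numbers will not generally match on the same data.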
Quick Revision: What Each Measure Tells You
Measure | What It Captures | ML Relevance |
|---|---|---|
Mean | Center of data | Regression baseline, feature scaling |
Median | Robust center | Outlier-resistant imputation |
Mode | Most frequent value | Baseline classifier, categorical imputation |
Variance / σ | Spread around mean | Feature importance, noise detection |
IQR | Middle 50% spread | Outlier detection (box plot rule) |
CV | Relative variability | Comparing features with different units |
Skewness | Symmetry of distribution | Signals need for log/sqrt transformation |
Written by Abhijeet Singh Rajput · Published on Notehub
