Statistics

Feb 20, 2025


Statistics for Machine Learning: Practice Numericals With Solutions

Topic: A complete statistics practice set for Data Science and Machine Learning students — covering mean, median, mode, standard deviation, variance, quartiles, grouped data, and skewness with step-by-step solutions.

Before you can understand how a machine learning model learns, you need statistics. Every algorithm you'll ever use (linear regression, naive Bayes, neural networks, clustering) is built on statistical concepts. Mean and variance tell you about your data's center and spread. Standard deviation helps flag outliers. Skewness tells you if your distribution is lopsided and whether a transformation may help.

This practice set covers everything from ungrouped data calculations to grouped frequency distributions — the exact problems that appear in university exams and data science interviews. Every question includes a full step-by-step solution.

Important Formulas — Keep These Open While Solving

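Formula	Expression
Arithmetic Mean	$\bar{x} = \frac{\sum x}{n}$
Mean (Frequency Distribution)	$\bar{x} = \frac{\sum f x}{\sum f}$
Standard Deviation	$\sigma = \sqrt{\frac{\sum (x - \bar{x})^{2}}{N}}$
Variance	$\sigma^{2} = \frac{\sum (x - \bar{x})^{2}}{N}$
Median (Grouped Data)	$M = L + \left(\frac{\frac{N}{2} - cf}{f}\right) \times h$
Mode (Grouped Data)	$Z = L + \left(\frac{f_{1} - f_{0}}{2 f_{1} - f_{0} - f_{2}}\right) \times h$
Coefficient of Variation	$CV = \frac{\sigma}{\bar{x}} \times 100$
Karl Pearson's Skewness	$S_{k} = \frac{\text{Mean} - \text{Mode}}{\sigma}$ or $\frac{3 (\text{Mean} - \text{Median})}{\sigma}$

In the grouped-data formulas, $L$ is the lower boundary of the median (or modal) class, $cf$ is the cumulative frequency before the median class, $f$ is the frequency of the median class, $f_{1}$, $f_{0}$ and $f_{2}$ are the frequencies of the modal class and of the classes just before and after it, and $h$ is the class width.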

Part 1: Mean, Median, Mode (Ungrouped Data)

Q1. Arithmetic Mean

Find the mean of: 12, 15, 18, 20, 25, 30, 35

Solution:

$\bar{x} = \frac{\sum x}{n} = \frac{12 + 15 + 18 + 20 + 25 + 30 + 35}{7} = \frac{155}{7}$

$\bar{x} \approx 22.14$

Why it matters in ML: The mean is the foundation of linear regression. An ordinary least squares line fitted with an intercept always passes through $(\bar{x}, \bar{y})$.
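As a quick check on the arithmetic, here is a minimal sketch using Python's standard library. The variable names are illustrative and not part of the original problem.

```python
from statistics import fmean

data = [12, 15, 18, 20, 25, 30, 35]

# Arithmetic mean: sum of the values divided by how many there are
mean_manual = sum(data) / len(data)
mean_stdlib = fmean(data)  # library equivalent, should agree with the manual value

print(round(mean_manual, 2))  # 22.14
```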

Q2. Median

Find the median of: 7, 12, 15, 18, 21, 24, 27, 30

Solution:

Data is already sorted. n = 8 (even).

$\text{Median} = \frac{\text{4th term} + \text{5th term}}{2} = \frac{18 + 21}{2}$

$\text{Median} = 19.5$

Why it matters in ML: The median is robust to outliers, which is why the median absolute deviation (MAD) is often preferred over the standard deviation for outlier detection in real datasets.
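The even/odd rule can be sketched as a small function. `median_manual` is a hypothetical helper written for this illustration, and the extra value 1000 is made up to show how little an extreme observation moves the median.

```python
from statistics import median

data = [7, 12, 15, 18, 21, 24, 27, 30]

def median_manual(values):
    """Middle value for odd n, mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_manual(data))           # 19.5
print(median_manual(data + [1000]))  # 21 (one extreme value barely shifts it)
```

The standard library's `statistics.median` follows the same rule and can replace the helper in practice.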

Q3. Mode

Find the mode of: 4, 5, 7, 8, 5, 9, 5, 10, 7, 8

Solution:

Count frequencies:

4 → 1 time
5 → 3 times
7 → 2 times
8 → 2 times
9 → 1 time
10 → 1 time

Mode = 5

Why it matters in ML: In classification problems, predicting the most frequent class is the baseline — called the "zero-rule classifier." If your model can't beat the mode, it's not learning anything.
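A minimal sketch of the frequency count and the zero-rule baseline follows. Treating the numbers as class labels is an assumption made purely for illustration.

```python
from collections import Counter

data = [4, 5, 7, 8, 5, 9, 5, 10, 7, 8]

# Count how often each value occurs and take the most frequent one
counts = Counter(data)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 5 3

# Zero-rule baseline: always predict the most frequent label
baseline_accuracy = mode_count / len(data)
print(baseline_accuracy)  # 0.3
```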

Q4. Mean and Median

Find both mean and median of: 22, 25, 28, 30, 32, 35, 40, 45

Solution:

Mean:

$\bar{x} = \frac{22 + 25 + 28 + 30 + 32 + 35 + 40 + 45}{8} = \frac{257}{8}$

$\bar{x} = 32.125$

Median (n = 8, even):

$\text{Median} = \frac{\text{4th term} + \text{5th term}}{2} = \frac{30 + 32}{2} = 31$

Q5. Missing Value

The mean of 10, 15, 20, 25, x is 18. Find x.

Solution:

$\bar{x} = \frac{10 + 15 + 20 + 25 + x}{5} = 18$

$70 + x = 90$

$x = 20$
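The same rearrangement, sketched in Python with illustrative variable names: the required total is the target mean times the number of values, and the unknown is whatever remains after subtracting the known values.

```python
known = [10, 15, 20, 25]
target_mean = 18
n = len(known) + 1  # four known values plus the unknown x

# mean * n gives the required total; subtract what is already known
x = target_mean * n - sum(known)
print(x)  # 20
```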

Part 2: Standard Deviation & Variance

Standard deviation measures how spread out your data is around the mean. A small σ means the data clusters tightly. A large σ means it's scattered. In ML, high-variance features can dominate distance-based algorithms like KNN — which is why feature scaling (normalization/standardization) matters.

Q6. Variance and Standard Deviation

Find variance and standard deviation of: 5, 7, 9, 11, 13

Solution:

Step 1 — Find mean:

$\bar{x} = \frac{5 + 7 + 9 + 11 + 13}{5} = \frac{45}{5} = 9$

Step 2 — Find deviations and squared deviations:

x	$x - \bar{x}$	$(x - \bar{x})^{2}$
5	−4	16
7	−2	4
9	0	0
11	+2	4
13	+4	16
Total		40

Step 3:

$\sigma^{2} = \frac{40}{5} = 8$

$\sigma = \sqrt{8} \approx 2.83$
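The three steps can be sketched in Python as follows. This uses the population formulas (dividing by N), and the variable names are illustrative.

```python
from statistics import pstdev

data = [5, 7, 9, 11, 13]

# Step 1: mean
mean = sum(data) / len(data)

# Step 2: squared deviations from the mean
squared_devs = [(x - mean) ** 2 for x in data]

# Step 3: population variance (divide by N) and its square root
variance = sum(squared_devs) / len(data)
std_dev = variance ** 0.5

print(variance, round(std_dev, 2))  # 8.0 2.83
print(round(pstdev(data), 2))       # 2.83 (library check)
```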

Q7. Standard Deviation

Find the standard deviation of: 2, 4, 4, 4, 5, 5, 7, 9

Solution:

Mean:

$\bar{x} = \frac{2 + 4 + 4 + 4 + 5 + 5 + 7 + 9}{8} = \frac{40}{8} = 5$

Squared deviations:

x	$x - \bar{x}$	$(x - \bar{x})^{2}$
2	−3	9
4	−1	1
4	−1	1
4	−1	1
5	0	0
5	0	0
7	+2	4
9	+4	16
Total		32

$\sigma = \sqrt{\frac{32}{8}} = \sqrt{4} = 2$

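Note: This dataset is a popular textbook example because the numbers work out cleanly: the mean is exactly 5 and the population standard deviation is exactly 2. The values are not symmetric around the mean (the deviations run from −3 to +4); the positive and negative deviations simply cancel, as they do for any dataset.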
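The formula used in this set divides by N, which is the population standard deviation. Many tools also offer a sample version that divides by N − 1, so it is worth knowing which one a library call returns. A small sketch with Python's `statistics` module:

```python
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = pstdev(data)  # divides by N, as in the formula used here
sample_sd = stdev(data)       # divides by N - 1 (sample estimate)

print(population_sd)        # 2.0
print(round(sample_sd, 3))  # 2.138
```

For exam-style problems that state the formula with N, the population version is the one that matches the hand calculation.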

Q8. Coefficient of Variation

Find mean, standard deviation, and CV for: 10, 12, 15, 18, 20

Solution:

Mean:

$\bar{x} = \frac{10 + 12 + 15 + 18 + 20}{5} = \frac{75}{5} = 15$

Squared deviations:

x	$x - \bar{x}$	$(x - \bar{x})^{2}$
10	−5	25
12	−3	9
15	0	0
18	+3	9
20	+5	25
Total		68

$\sigma = \sqrt{\frac{68}{5}} = \sqrt{13.6} \approx 3.69$

$CV = \frac{\sigma}{\bar{x}} \times 100 = \frac{3.69}{15} \times 100 \approx 24.6\%$

Why it matters in ML: CV compares variability across features with different units (e.g., salary in thousands vs age in years). A high CV feature needs normalization before using distance-based models.
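A minimal sketch of the CV calculation follows. `coefficient_of_variation` is a hypothetical helper name, and it uses the population standard deviation to match the formula above.

```python
from statistics import fmean, pstdev

def coefficient_of_variation(values):
    """Population CV as a percentage: (sigma / mean) * 100."""
    return pstdev(values) / fmean(values) * 100

data = [10, 12, 15, 18, 20]
print(round(coefficient_of_variation(data), 1))  # 24.6
```

Because the result is a unit-free percentage, the same function can be applied to two features measured in different units and the outputs compared directly.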

Part 3: Range, Quartiles, and Dispersion

Q9. Range

Find the range of: 18, 25, 12, 30, 45, 28, 35

Solution:

Range = Max - Min = 45 - 12 = 33

Q10. Quartiles

Find Q1, Q2 (Median), and Q3 for: 5, 8, 10, 12, 15, 18, 20, 22, 25

Solution:

Sorted data (n = 9):

$Q_{1}$ = value at position $\frac{n + 1}{4} = \frac{10}{4} = 2.5$, halfway between the 2nd and 3rd values: $\frac{8 + 10}{2} = 9$

$Q_{2}$ = value at position $\frac{n + 1}{2} = 5$, the 5th value: $15$

$Q_{3}$ = value at position $\frac{3 (n + 1)}{4} = 7.5$, halfway between the 7th and 8th values: $\frac{20 + 22}{2} = 21$

Interquartile Range (IQR):

$IQR = Q_{3} - Q_{1} = 21 - 9 = 12$

Why it matters in ML: The IQR is how box plots detect outliers. Any value below $Q_{1} - 1.5 \times IQR$ or above $Q_{3} + 1.5 \times IQR$ is flagged. This is one of the most common data cleaning steps in any ML pipeline.
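The quartile positions and the box-plot rule can be sketched with the standard library. The `"exclusive"` method of `statistics.quantiles` uses the (n + 1) positions from the hand calculation; the fence and outlier variable names are illustrative.

```python
from statistics import quantiles

data = [5, 8, 10, 12, 15, 18, 20, 22, 25]

# method="exclusive" places the quartiles at (n + 1) * k / 4
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 9.0 15.0 21.0 12.0

# Box-plot rule: flag values outside the 1.5 * IQR fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)  # -9.0 39.0 []
```

Quartile conventions differ between tools. For instance, `method="inclusive"` on the same data gives Q1 = 10 and Q3 = 20, so it helps to confirm which convention is in use before comparing results with a hand calculation.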

Part 4: Grouped Data / Class Interval Questions

These are the most important question types for university exams. The key difference from ungrouped data: you use the midpoint of each class interval as your representative value.

Q11. Mean from Frequency Distribution (Direct Method)

Class Interval	Frequency (f)	Midpoint (x)	fx
0–10	5	5	25
10–20	8	15	120
20–30	12	25	300
30–40	10	35	350
40–50	5	45	225
Total	40		1020

$\bar{x} = \frac{\sum f x}{\sum f} = \frac{1020}{40} = 25.5$
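The direct method can be sketched as a short function. Representing each class as a `(lower, upper, frequency)` tuple and the name `grouped_mean` are choices made for this illustration.

```python
# (lower bound, upper bound, frequency) for each class interval
classes = [(0, 10, 5), (10, 20, 8), (20, 30, 12), (30, 40, 10), (40, 50, 5)]

def grouped_mean(class_table):
    """Mean of grouped data, using class midpoints as representative values."""
    total_f = sum(f for _, _, f in class_table)
    total_fx = sum(f * (lo + hi) / 2 for lo, hi, f in class_table)
    return total_fx / total_f

print(grouped_mean(classes))  # 25.5
```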

Q12. Median from Grouped Data

Class Interval	Frequency	Cumulative Frequency
0–10	4	4
10–20	6	10
20–30	10	20
30–40	8	28
40–50	2	30

$N = 30$, so $\frac{N}{2} = 15$

The cumulative frequency just exceeds 15 at the 20–30 class → Median class = 20–30

$L = 20$, $f = 10$, $cf = 10$, $h = 10$

$\text{Median} = 20 + \left(\frac{15 - 10}{10}\right) \times 10 = 20 + 5 = 25$
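A minimal sketch of the grouped-median procedure: walk through the classes, keep a running cumulative frequency, and apply the interpolation formula in the first class where the running total reaches N/2. The tuple layout and the name `grouped_median` are assumptions for this example.

```python
# (lower bound, upper bound, frequency) for each class interval
classes = [(0, 10, 4), (10, 20, 6), (20, 30, 10), (30, 40, 8), (40, 50, 2)]

def grouped_median(class_table):
    """Median = L + ((N/2 - cf) / f) * h, applied to the median class."""
    n = sum(f for _, _, f in class_table)
    cumulative = 0  # cf: cumulative frequency before the current class
    for lo, hi, f in class_table:
        if cumulative + f >= n / 2:
            return lo + (n / 2 - cumulative) / f * (hi - lo)
        cumulative += f
    raise ValueError("empty frequency table")

print(grouped_median(classes))  # 25.0
```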

Q13. Mode from Grouped Data

Class Interval	Frequency
0–10	3
10–20	7
20–30	12 ← Modal class
30–40	9
40–50	4

$L = 20$, $f_{1} = 12$, $f_{0} = 7$, $f_{2} = 9$, $h = 10$

$\text{Mode} = 20 + \left(\frac{12 - 7}{2 (12) - 7 - 9}\right) \times 10 = 20 + \left(\frac{5}{8}\right) \times 10$

$= 20 + 6.25 = 26.25$
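The modal-class formula can be sketched as below. This assumes equal class widths and a single class with the highest frequency; treating a missing neighbouring class as frequency 0 is a common convention adopted here, not something stated in the problem. `grouped_mode` is a hypothetical helper name.

```python
# (lower bound, upper bound, frequency) for each class interval
classes = [(0, 10, 3), (10, 20, 7), (20, 30, 12), (30, 40, 9), (40, 50, 4)]

def grouped_mode(class_table):
    """Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h for the modal class."""
    freqs = [f for _, _, f in class_table]
    i = freqs.index(max(freqs))                     # first class with the highest frequency
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0               # preceding class (0 if none)
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0  # following class (0 if none)
    lo, hi, _ = class_table[i]
    return lo + (f1 - f0) / (2 * f1 - f0 - f2) * (hi - lo)

print(grouped_mode(classes))  # 26.25
```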

Q14. Standard Deviation from Grouped Data

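Class	f	Midpoint (x)	fx	$d = x - A$	$d^{2}$	$f d$	$f d^{2}$
10–20	5	15	75	−20	400	−100	2000
20–30	8	25	200	−10	100	−80	800
30–40	15	35	525	0	0	0	0
40–50	10	45	450	+10	100	+100	1000
50–60	7	55	385	+20	400	+140	2800
Total	45		1635			60	6600

The deviations $d$ are measured from the assumed mean $A = 35$ (the midpoint of the 30–40 class), not from the actual mean, so the variance needs the correction term $\left(\frac{\sum f d}{N}\right)^{2}$.

$\bar{x} = \frac{1635}{45} \approx 36.33$

$\sigma^{2} = \frac{\sum f d^{2}}{N} - \left(\frac{\sum f d}{N}\right)^{2} = \frac{6600}{45} - \left(\frac{60}{45}\right)^{2} \approx 146.67 - 1.78 = 144.89 \Rightarrow \sigma = \sqrt{144.89} \approx 12.04$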
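As a cross-check, here is a minimal sketch that computes the grouped standard deviation directly about the actual mean $\sum f x / \sum f$, so no assumed-mean correction is involved. The tuple layout and the name `grouped_std` are illustrative.

```python
# (lower bound, upper bound, frequency) for each class interval
classes = [(10, 20, 5), (20, 30, 8), (30, 40, 15), (40, 50, 10), (50, 60, 7)]

def grouped_std(class_table):
    """Population mean and standard deviation of grouped data via class midpoints."""
    n = sum(f for _, _, f in class_table)
    mids = [((lo + hi) / 2, f) for lo, hi, f in class_table]
    mean = sum(f * x for x, f in mids) / n
    # Squared deviations are taken from the actual mean, weighted by frequency
    variance = sum(f * (x - mean) ** 2 for x, f in mids) / n
    return mean, variance ** 0.5

mean, sigma = grouped_std(classes)
print(round(mean, 2), round(sigma, 2))  # 36.33 12.04
```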

Part 5: Previous-Year Style Mixed Numericals

Q15. Mean, Median & Mode from Frequency Distribution

Marks	No. of Students (f)	Midpoint (x)	fx	CF
0–10	5	5	25	5
10–20	9	15	135	14
20–30	12	25	300	26
30–40	8	35	280	34
40–50	6	45	270	40
Total	40		1010

Mean: $\bar{x} = \frac{1010}{40} = 25.25$

Median:

$\frac{N}{2} = 20$ → Median class = 20–30 (CF crosses 20)

$\text{Median} = 20 + \left(\frac{20 - 14}{12}\right) \times 10 = 20 + 5 = 25$

Mode: Highest frequency = 12 → Modal class = 20–30

$\text{Mode} = 20 + \left(\frac{12 - 9}{2 (12) - 9 - 8}\right) \times 10 = 20 + \left(\frac{3}{7}\right) \times 10 \approx 20 + 4.29 = 24.29$

Q16. Mean Deviation About Mean

Data: 14, 18, 20, 22, 25, 30

Mean: $\bar{x} = \frac{14 + 18 + 20 + 22 + 25 + 30}{6} = \frac{129}{6} = 21.5$

x	$|x - \bar{x}|$
14	7.5
18	3.5
20	1.5
22	0.5
25	3.5
30	8.5
Total	25

$MD = \frac{\sum |x - \bar{x}|}{n} = \frac{25}{6} \approx 4.17$
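The mean deviation is the average of the absolute distances from the mean. A short sketch with illustrative variable names:

```python
data = [14, 18, 20, 22, 25, 30]

mean = sum(data) / len(data)

# Absolute deviations ignore the sign, so they do not cancel out
abs_devs = [abs(x - mean) for x in data]
mean_deviation = sum(abs_devs) / len(data)

print(mean, sum(abs_devs), round(mean_deviation, 2))  # 21.5 25.0 4.17
```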

Q17. Average Salary Problem

Given: Average salary of 8 employees = ₹24,000. One leaves, a new one joins at ₹30,000, new average = ₹25,000. Find the salary of the employee who left.

Solution:

Total salary (original 8 employees):

8 × 24,000 = ₹1,92,000

Total salary (new 8 employees):

8 × 25,000 = ₹2,00,000

Difference after swap:

New total = Old total − Left salary + 30,000

2,00,000 = 1,92,000 − Left salary + 30,000

Left salary = 1,92,000 + 30,000 − 2,00,000 = ₹22,000
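The same bookkeeping can be sketched in a few lines of Python. The variable names are illustrative; the figures are the ones given in the question.

```python
n_employees = 8
old_average = 24_000
new_average = 25_000
joiner_salary = 30_000

old_total = n_employees * old_average
new_total = n_employees * new_average

# new_total = old_total - leaver_salary + joiner_salary, solved for leaver_salary
leaver_salary = old_total + joiner_salary - new_total
print(leaver_salary)  # 22000
```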

Q18. Karl Pearson's Coefficient of Skewness

Given: Mean = 45, Median = 42, σ = 6

$S_{k} = \frac{3 (\text{Mean} - \text{Median})}{\sigma} = \frac{3 (45 - 42)}{6} = \frac{9}{6} = 1.5$

Since $S_{k} > 0$, the distribution is positively skewed (tail on the right).

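Why it matters in ML: Skewed features can hurt linear and distance-based models. Tree-based models are far less sensitive, because their splits depend only on the ordering of values. As a rule of thumb, a positive skew above 1 is a signal to consider a log transformation (for strictly positive values) before feeding data into your model. This is one of the first checks in any exploratory data analysis (EDA).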
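A minimal sketch of the median-based Pearson coefficient, followed by a look at how a log transform changes it. The `feature` list is hypothetical data made up for this illustration, and the helper names are not from the original.

```python
from math import log
from statistics import fmean, median, pstdev

def pearson_skewness(mean, med, sigma):
    """Pearson's median-based skewness: 3 * (mean - median) / sigma."""
    return 3 * (mean - med) / sigma

print(pearson_skewness(45, 42, 6))  # 1.5

def skewness_of(values):
    """Apply the same coefficient to raw data (population sigma)."""
    return pearson_skewness(fmean(values), median(values), pstdev(values))

# Hypothetical right-skewed feature (made up for illustration only)
feature = [1, 2, 2, 3, 3, 4, 50]
logged = [log(v) for v in feature]  # log requires strictly positive values

before = skewness_of(feature)
after = skewness_of(logged)
print(before > after)  # the log transform pulls in the long right tail
```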

Quick Revision: What Each Measure Tells You

Measure	What It Captures	ML Relevance
Mean	Center of data	Regression baseline, feature scaling
Median	Robust center	Outlier-resistant imputation
Mode	Most frequent value	Baseline classifier, categorical imputation
Variance / σ	Spread around mean	Feature importance, noise detection
IQR	Middle 50% spread	Outlier detection (box plot rule)
CV	Relative variability	Comparing features with different units
Skewness	Symmetry of distribution	Signals need for log/sqrt transformation


Written by Abhijeet Singh Rajput · Published on Notehub