Dimensionality Reduction
In statistics and machine learning, dimensionality reduction is the process of reducing the number of variables under consideration by obtaining a smaller set of principal variables. Dimensionality reduction can be implemented in two primary ways:
1. Feature Selection:
In feature selection, we aim to identify k features out of a total of n that provide the most meaningful information. The remaining (n-k) dimensions are discarded, as they contribute less to the model's performance.
2. Feature Extraction:
In feature extraction, we generate a new set of k features that are combinations of the original n features. This approach transforms the data into a lower-dimensional space while retaining important information.
Some well-known feature extraction techniques include:
Principal Component Analysis (PCA) – an unsupervised linear projection method.
Linear Discriminant Analysis (LDA) – a supervised linear projection method.
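As an illustration of the two techniques above, both are available in scikit-learn (a sketch assuming scikit-learn is installed; the toy data and labels below are made up):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 examples, 5 features (rows = examples here)
y = rng.integers(0, 3, size=100)   # 3 class labels, used only by LDA

# Unsupervised: PCA ignores the labels entirely
X_pca = PCA(n_components=2).fit_transform(X)

# Supervised: LDA uses the labels to find class-discriminative directions
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (100, 2) (100, 2)
```

Note that scikit-learn expects examples as rows, whereas the derivation later in these notes arranges features as rows.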
Usefulness of Dimensionality Reduction:
For most learning algorithms, the computational complexity depends on the number of input dimensions (d) and the size of the dataset (n). Reducing dimensionality helps in lowering both memory and computational costs.
When an input feature is determined to be unnecessary, we reduce the cost of extracting and processing it.
Simpler models tend to be more robust, especially when working with small datasets.
When the data can be described with fewer features, we gain better insight into the underlying processes and patterns, enabling knowledge extraction.
If data can be represented in fewer dimensions without significant loss of information, it can be plotted and visually analyzed to detect patterns, structures, and outliers.
Principal Component Analysis
The PCA (Principal Component Analysis) algorithm relies on concepts from linear algebra and statistics, and proceeds in the following steps.
Step 1: Data Representation
We consider a dataset with n features, denoted $x_1, x_2, \ldots, x_n$, observed over N examples. Let $x_{ij}$ denote the value of feature $x_i$ in the j-th example. The data can then be arranged as an $n \times N$ table:

| Features | Example 1 | Example 2 | ... | Example N |
|---|---|---|---|---|
| $x_1$ | $x_{11}$ | $x_{12}$ | ... | $x_{1N}$ |
| $x_2$ | $x_{21}$ | $x_{22}$ | ... | $x_{2N}$ |
| ... | ... | ... | ... | ... |
| $x_n$ | $x_{n1}$ | $x_{n2}$ | ... | $x_{nN}$ |
Step 2: Compute the Mean of Each Feature
For each feature $x_i$, compute its mean over the N examples:
$$\bar{x}_i = \frac{1}{N} \sum_{j=1}^{N} x_{ij}$$
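The per-feature means can be computed directly with NumPy (a minimal sketch; the toy data values below are made up):

```python
import numpy as np

# Toy dataset: n = 2 features (rows), N = 4 examples (columns)
X = np.array([[2.5, 0.5, 2.2, 1.9],
              [2.4, 0.7, 2.9, 2.2]])

# Mean of each feature, taken across the examples (axis=1)
means = X.mean(axis=1)
# means[0] = (2.5 + 0.5 + 2.2 + 1.9) / 4 = 1.775
# means[1] = (2.4 + 0.7 + 2.9 + 2.2) / 4 = 2.05
```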
Step 3: Compute the Covariance Matrix
Consider the variables $x_i$ and $x_j$ (i and j need not be different). The covariance of the ordered pair $(x_i, x_j)$ is defined as
$$\operatorname{Cov}(x_i, x_j) = \frac{1}{N-1} \sum_{k=1}^{N} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)$$
We compute the following matrix S, called the covariance matrix of the data; the element in the i-th row and j-th column is the covariance $\operatorname{Cov}(x_i, x_j)$:
$$S = \begin{bmatrix} \operatorname{Cov}(x_1, x_1) & \cdots & \operatorname{Cov}(x_1, x_n) \\ \vdots & \ddots & \vdots \\ \operatorname{Cov}(x_n, x_1) & \cdots & \operatorname{Cov}(x_n, x_n) \end{bmatrix}$$
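NumPy computes this matrix with `np.cov`, which treats each row as a variable and divides by N-1 by default (a sketch continuing the toy data above):

```python
import numpy as np

# Rows are features, columns are examples (made-up toy values)
X = np.array([[2.5, 0.5, 2.2, 1.9],
              [2.4, 0.7, 2.9, 2.2]])

# Covariance matrix: np.cov uses rows as variables and the 1/(N-1) divisor
S = np.cov(X)

# Manual check of one entry against the definition: Cov(x1, x2)
x1, x2 = X[0], X[1]
c12 = ((x1 - x1.mean()) * (x2 - x2.mean())).sum() / (X.shape[1] - 1)
# S[0, 1] matches c12, and S is symmetric
```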
Step 4: Compute Eigenvalues and Eigenvectors
To obtain the principal components, we solve for the eigenvalues and eigenvectors of the covariance matrix S. Set the characteristic determinant to zero:
$$\det(S - \lambda I) = 0$$
This is a polynomial equation of degree n in $\lambda$. Because S is a real symmetric matrix, it has n real roots, and those roots are the eigenvalues of S; we find the n roots $\lambda_1, \lambda_2, \ldots, \lambda_n$.
If $\lambda_i$ is an eigenvalue, then the corresponding eigenvector is a vector $U_i$ satisfying
$$(S - \lambda_i I)U_i = 0$$
This is a system of n homogeneous linear equations, and it always has a nontrivial solution.
We next find a set of n orthogonal eigenvectors $U_1, U_2, \ldots, U_n$ such that $U_i$ is the eigenvector corresponding to $\lambda_i$.
We then normalize the eigenvectors. Given any vector
$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
we normalize it by dividing X by its length $\|X\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$. Given the eigenvector $U_i$, the corresponding normalized eigenvector is computed as
$$e_i = \frac{U_i}{\|U_i\|}$$
In this way we compute the n normalized eigenvectors $e_1, e_2, \ldots, e_n$.
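In NumPy, `np.linalg.eigh` handles this step for symmetric matrices, and the eigenvectors it returns are already normalized to unit length (a sketch continuing the toy data above):

```python
import numpy as np

# Covariance matrix of the made-up toy data used earlier
S = np.cov(np.array([[2.5, 0.5, 2.2, 1.9],
                     [2.4, 0.7, 2.9, 2.2]]))

# eigh is specialized for symmetric matrices; eigenvalues come back in
# ascending order, and the eigenvectors are the COLUMNS of the result
eigenvalues, eigenvectors = np.linalg.eigh(S)

# Each column is already a unit vector, so no extra normalization is needed
lengths = np.linalg.norm(eigenvectors, axis=0)

# Verify the defining equation S u = lambda u for the first pair
u, lam = eigenvectors[:, 0], eigenvalues[0]
# S @ u equals lam * u (up to floating-point tolerance)
```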
Step 5: Derive the New Data
Order the eigenvalues from highest to lowest. The unit eigenvector corresponding to the largest eigenvalue is the first principal component, the unit eigenvector corresponding to the next-highest eigenvalue is the second principal component, and so on.
Let the eigenvalues in descending order be $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, and let the corresponding unit eigenvectors be $e_1, e_2, \ldots, e_n$. Choose a positive integer $p < n$ so that the first p eigenvalues capture most of the total variance.
Take the eigenvectors corresponding to the eigenvalues $\lambda_1, \ldots, \lambda_p$ and form the matrix whose columns are these unit eigenvectors:
$$W = \begin{bmatrix} e_1 & e_2 & \cdots & e_p \end{bmatrix}$$
We also form the mean-adjusted data matrix $\bar{X}$ by subtracting each feature's mean from the corresponding row of the data table. Next, compute the matrix
$$X_{\text{new}} = W^T \bar{X}$$
Step 6: The New Dataset
The matrix $X_{\text{new}}$ is the new dataset. Each of its rows gives the values of one of the new features; since it has only p rows, the new dataset has only p features.
This is how PCA achieves dimensionality reduction of the dataset.
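Putting all six steps together, here is a minimal from-scratch PCA sketch in NumPy, following the conventions of these notes (features as rows, examples as columns; the random data is made up):

```python
import numpy as np

def pca(X, p):
    """Reduce an (n_features x N_examples) matrix X to p features."""
    means = X.mean(axis=1, keepdims=True)          # Step 2: feature means
    X_adj = X - means                              # mean-adjusted data
    S = np.cov(X)                                  # Step 3: covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(S)  # Step 4: ascending order
    order = np.argsort(eigenvalues)[::-1]          # Step 5: sort descending
    W = eigenvectors[:, order[:p]]                 # columns = top-p unit eigenvectors
    return W.T @ X_adj                             # Step 6: new p x N dataset

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))   # 5 features, 100 examples
X_new = pca(X, 2)
print(X_new.shape)  # (2, 100)
```

Each row of `X_new` is one of the p new features, exactly as described in Step 6.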
