Principal Component Analysis
A [see page 2, method] to transform a data-set from the data space into a feature space where we get optimal dimensionality reduction (we can pick out the dimensions with the largest variance and drop the dimensions with very little variance while losing very little information).
At a high level, PCA essentially rotates our data until it is spread apart as much as possible along the new axes.
PCA works by creating a Feature Matrix \( F \) (from the eigenvectors of the covariance matrix) and then [see page 4, transforming] the data \( x \) with this matrix to get our new data \[ \tilde{x} = F^T x \].
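As a rough illustration (my own sketch, not from the referenced slides), the minimal NumPy version of this pipeline could look as follows; the function name pca_transform and the rows-are-samples data layout are assumptions made for the example:

```python
import numpy as np

def pca_transform(X, n_components):
    """Project samples onto the top PCA directions.

    X is assumed to be an (n_samples, n_features) array; the name and layout
    are illustrative choices, not from the notes.
    """
    # Centre the data so the covariance matrix describes spread around the mean.
    X_centred = X - X.mean(axis=0)

    # Covariance matrix of the features (n_features x n_features).
    C = np.cov(X_centred, rowvar=False)

    # Eigen-decomposition; eigh is used because C is symmetric.
    eigenvalues, eigenvectors = np.linalg.eigh(C)

    # Sort directions by decreasing eigenvalue (variance along that direction).
    order = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, order]

    # Feature matrix F: one eigenvector per column.
    F = eigenvectors[:, :n_components]

    # New data: x_tilde = F^T x for every (centred) sample.
    return X_centred @ F    # equivalent to (F.T @ X_centred.T).T
```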
The data we get for (un-)supervised learning can contain a lot of information, but not all of it will be useful. There could be some features that are highly correlated.
PCA finds the directions along which the data varies the most and then transforms the data into a coordinate-system where the data is spread out along as many of these directions as desired.
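To make this concrete, the toy example below (made-up data and seed, purely for illustration) builds two strongly correlated features and checks that after projecting onto the eigenvector basis the new features are numerically uncorrelated, with almost all of the variance along one direction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated features: the second is mostly a copy of the first.
x1 = rng.normal(size=1000)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=1000)
X = np.column_stack([x1, x2])

C = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Project onto the eigenvector basis: x_tilde = F^T x with F = all eigenvectors.
X_tilde = (X - X.mean(axis=0)) @ eigenvectors

# The transformed features are uncorrelated (off-diagonals are ~0), and almost
# all of the variance sits along the direction with the largest eigenvalue.
print(np.round(np.cov(X_tilde, rowvar=False), 3))
print(np.round(eigenvalues / eigenvalues.sum(), 3))
```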
Finding the Feature Matrix
We [see page 5, define] our basis-vector \( \hat{e}_n \) as an eigenvector of the covariance matrix \( C \), i.e. \( C \hat{e}_n = \lambda_n \hat{e}_n \), which we intend to use to transform the sample data points.
The feature-matrix is a matrix constructed with one or more basis vectors as its columns. For example, a feature matrix containing two basis vectors can be constructed as: \[ F = \begin{pmatrix} \hat{e}_{11} & \hat{e}_{21} \\ \hat{e}_{12} & \hat{e}_{22} \end{pmatrix} \]
Note: The eigenvalue of each basis-vector specifies how much variation there is along that direction (eigenvector). Therefore a feature-matrix built from the eigenvectors (of the covariance-matrix) with the \(N\) largest eigenvalues transforms the sample space onto the \( N \) directions that capture the most variance; these directions are mutually uncorrelated.
Note: We define the Principal Component as the basis vector with the largest eigenvalue.
Warn: The length of each eigenvector in PCA must be \(1\).
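Putting the notes above together, here is a small sketch of assembling \( F \) in NumPy (the covariance values are made up for illustration); np.linalg.eigh already returns unit-length eigenvectors, so we only need to sort them by decreasing eigenvalue and place the chosen ones as columns:

```python
import numpy as np

# Toy covariance matrix of a 2-dimensional sample space (illustrative values).
C = np.array([[2.0, 1.2],
              [1.2, 1.0]])

# C e_n = lambda_n e_n: eigh returns unit-length eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Order the basis vectors by decreasing eigenvalue; the first one is then the
# principal component (the direction with the most variation along it).
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Feature matrix with the two basis vectors as its columns, as in the note above.
F = eigenvectors[:, :2]

# Each eigenvector already has length 1, as PCA requires.
print(np.linalg.norm(F, axis=0))   # -> [1. 1.]
```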
See the [see page 15, walkthrough]. TODO: Write a custom walkthrough, including coverage of eigenvectors.
Reducing Dimensionality
Observe that the dimensionality of our new data depends solely on the number of columns in our feature matrix (the rows of \( F^T \)), which equals the number of basis vectors we intend to use.
We can of course use all the eigenvectors we find (matching the number of dimensions in our sample-space), in which case we get a lossless transformation onto the PCA axes that we can reverse. However, in practice we will most likely use only a subset of these eigenvectors (those with the largest eigenvalues, \( \lambda_1 > \lambda_2 > \ldots > \lambda_N \)), reducing the dimensionality of our samples lossily while preserving the most important details.
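The sketch below (made-up data) illustrates this trade-off: keeping only the two directions with the largest eigenvalues gives a lossy reduction from three dimensions to two whose reconstruction is close but not exact, while keeping every eigenvector makes the transform exactly reversible.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up 3-dimensional samples where the third feature is almost redundant.
X = rng.normal(size=(500, 3))
X[:, 2] = 0.95 * X[:, 0] + 0.05 * X[:, 2]

X_centred = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centred, rowvar=False))
eigenvectors = eigenvectors[:, np.argsort(eigenvalues)[::-1]]  # largest first

# Keep only the N = 2 directions with the largest eigenvalues (lossy).
F = eigenvectors[:, :2]
X_tilde = X_centred @ F                       # reduced representation: 3 dims -> 2 dims

# Mapping back only approximates the original (centred) samples ...
X_approx = X_tilde @ F.T
print(np.max(np.abs(X_centred - X_approx)))   # small, but not zero

# ... whereas using every eigenvector makes the transform lossless.
X_all = (X_centred @ eigenvectors) @ eigenvectors.T
print(np.allclose(X_centred, X_all))          # True
```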
See the [see page 7, final] recipe for PCA.