Correlation

Tags: math

Correlation is a measure of how strongly two variables are related. The goal is to take a collection of paired data and return a real number \( r \) in the range \( -1 \leq r \leq 1 \) that indicates how correlated the data is.

For example, when we plot the pairs on a scatter diagram, if an increase in \( x \) is consistently accompanied by an increase in \( y \) then we say the two variables are positively correlated. If it is accompanied by a decrease in \( y \) then they're negatively correlated. And if the change in \( y \) seems independent of the change in \( x \) then we presume there's no correlation.

Note: correlation doesn't imply causation. It doesn't prove that one variable causes the other. For example, a scatter graph of iron and steel production against the number of penguins at the South Pole may show positive correlation, but there's no link between the two.

Pearson's Product Moment Correlation Coefficient

Measures how close the points in a paired data set are to forming a straight line. This means it doesn't rule out other types of correlation (such as non-linear ones), but it does give a value for linear correlation.

It is built from the deviations of each \( x \) from the mean \( \bar{x} \) and of each \( y \) from the mean \( \bar{y} \), which gives the formula:

\begin{align} r = \frac{S_{xy}}{\sqrt{S_{x x} S_{y y}}} \label{eq:pearson-prod-coeff} \end{align}

Where:

\begin{align*} S_{xy} &= \sum xy - \frac{\sum x \sum y}{n} \\
S_{xx} &= \sum x^2 - \frac{(\sum x)^2}{n} \\
S_{yy} &= \sum y^2 - \frac{(\sum y)^2}{n}
\end{align*}
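As a rough sketch of how the formula translates into code, the snippet below computes \( r \) directly from the three sums. The function name `pearson_r` and both data sets are made up for illustration. The second data set follows \( y = x^2 \) on symmetric \( x \) values, a clear non-linear relationship that still gives \( r = 0 \).

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from the summary sums S_xy, S_xx and S_yy."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    if s_xx == 0 or s_yy == 0:
        # The denominator vanishes when either variable never changes.
        raise ValueError("r is undefined when either variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)


# Illustrative data only: a roughly linear, upward trend.
r_linear = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

# Illustrative data only: y = x^2, perfectly related but not linearly.
r_curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(round(r_linear, 4))  # S_xy = 6, S_xx = 10, S_yy = 6, so r = 6 / sqrt(60)
print(r_curved)            # S_xy = 0, so r = 0
```

For the first data set the sums work out to \( S_{xy} = 6 \), \( S_{xx} = 10 \) and \( S_{yy} = 6 \), so \( r = 6 / \sqrt{60} \approx 0.77 \).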

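The correlation coefficient is unaffected by both translation and scaling. Translating (adding a constant to every \( x \) or every \( y \)) moves the origin on the graph but the correlation between the elements remains the same. Scaling (multiplying every \( x \) or every \( y \) by a positive constant) alters the scale on the graph but does not affect the correlation of the elements. Multiplying one variable by a negative constant flips the sign of \( r \) but leaves its magnitude unchanged.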
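A quick numerical check of this property, using made-up data and a hypothetical helper `pearson_r` that implements the sum formulas above. It also shows the one caveat: scaling a single variable by a negative factor flips the sign of \( r \).

```python
import math


def pearson_r(xs, ys):
    """Pearson's r via the summary sums S_xy, S_xx and S_yy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)


# Illustrative data only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
base = pearson_r(xs, ys)

translated = pearson_r([x + 100 for x in xs], ys)  # shift every x by 100
scaled = pearson_r(xs, [3 * y for y in ys])        # stretch every y by 3
flipped = pearson_r([-2 * x for x in xs], ys)      # negative scale on x

print(math.isclose(base, translated))  # translation leaves r unchanged
print(math.isclose(base, scaled))      # positive scaling leaves r unchanged
print(math.isclose(base, -flipped))    # negative scaling only flips the sign
```

All three comparisons hold for this data set. With floating-point inputs the values may differ in the last few digits, which is why the sketch compares with `math.isclose` rather than `==`.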

Links to this note

Least Squares Regression Line