Brain Dump

K-Means Clustering

Tags
adaptive-intelligence

The most common clustering implementation, where we [see page 6, visualise] points in some N-dimensional data-space. Each point has a certain degree of similarity to each cluster and is assigned to the cluster to which it is closest by some distance measure (example: Euclidean distance).

The name [see page 11, K-means clustering] comes from having \( k \) prototypes and assigning a data-point to the closest prototype.

Mathematically we perform a clustering using a [see page 6, prototype] \( m_k \) (AKA: centroid, mean, \( w_k \)) for each cluster \( k \) in the set of clusters. We therefore assign a point \( x^{\mu} \) to the cluster \( k \) where: \[ |m_k - x^{\mu}| \leq |m_i - x^{\mu}| \quad \forall i \]
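As a concrete sketch of that assignment step (using NumPy; the function and array names here are my own, not from the source):

    import numpy as np

    def assign_clusters(X, prototypes):
        """Assign each data-point to its nearest prototype.

        X          : (n_points, n_dims) array of data-points x^mu
        prototypes : (K, n_dims) array of prototypes m_k
        Returns an (n_points,) array of cluster indices.
        """
        # Euclidean distance from every point to every prototype.
        distances = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        # Pick the cluster k with the smallest distance for each point.
        return np.argmin(distances, axis=1)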

Warn: K-means [see page 22, struggles] with data that has complex structures.

Formulation

We define the [see page 9, reconstruction-error] as the sum of the squared distance between each point in a cluster and its prototype. \[ E = \frac{1}{2} \sum_{k} \sum_{x^{\mu} \in C_k} (m_k - x^{\mu})^2 \] Where:

  • \( K \) is the number of clusters (a hyper-parameter)
  • \( m_k \) is the prototype of cluster \( k \)
  • \( C_k \) the set of data-points in the cluster \( k \)
  • \( n_k \) is the number of data-points in cluster \( k \)

Which can be written equivalently in non-vector notation as: \[ E = \frac{1}{2} \sum_{k} \sum_{x^{\mu} \in C_k} \sum_{l} (m_{kl} - x^{\mu}_l)^2 \]
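A minimal sketch of computing this reconstruction-error, reusing the assign_clusters helper sketched above (the function name and loop structure are assumptions for illustration):

    def reconstruction_error(X, prototypes, assignments):
        """Half the sum of squared distances between each point and its cluster's prototype."""
        E = 0.0
        for k, m_k in enumerate(prototypes):
            C_k = X[assignments == k]           # points currently assigned to cluster k
            E += 0.5 * np.sum((C_k - m_k) ** 2)
        return E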

This function is used to assign a point to a cluster: the point goes to whichever cluster gives the lowest error contribution (i.e. the nearest prototype).

The idea being that if the data is really well clustered, each point will be close to its cluster's prototype and therefore the error for that cluster will be small. If the clustering is very ambiguous the error will instead be very high.

We can then [see page 10, derive] the delta-function for our centroids as:

\begin{align*}
\Delta m_{k} &= - \eta \frac{\partial E}{\partial m_{k}} \\
             &= - \eta \sum_{x^{\mu} \in C_k} (m_{k} - x^{\mu})
\end{align*}

Observe that this rule only updates the weights for the winning prototype: the error a point contributes only affects its own cluster and no other clusters.
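A hedged sketch of that update as a batch gradient step with learning rate \( \eta \) (the function name and the empty-cluster handling are assumptions):

    def update_prototypes(X, prototypes, assignments, eta=0.1):
        """Move each prototype down the gradient of E for its own cluster only."""
        new_prototypes = prototypes.copy()
        for k in range(len(prototypes)):
            C_k = X[assignments == k]
            if len(C_k) == 0:
                continue                        # empty cluster: leave its prototype unchanged
            # delta m_k = -eta * sum_{x in C_k} (m_k - x)
            delta = -eta * np.sum(prototypes[k] - C_k, axis=0)
            new_prototypes[k] = prototypes[k] + delta
        return new_prototypes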

TODO: Copy in derivation of \( m_k \) from [see page 10, here].

Final Algorithm

So the final K-Means [see page 12, algorithm] is:

  1. Initialise the cluster centres (prototypes).
  2. Assign each point \( x^{\mu} \) to the cluster with the lowest error.
  3. Recompute each cluster centre from the new cluster-set.
  4. Repeat until the cluster-centres [see page 11, converge].
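Putting those steps together, a minimal batch K-Means sketch (this uses the standard closed-form mean update rather than the incremental delta-rule, and initialises from random data-points, both of which are assumptions on my part):

    def k_means(X, K, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Initialise prototypes from K random data-points.
        prototypes = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(max_iter):
            # 2. Assign each point to the cluster with the lowest error (nearest prototype).
            assignments = assign_clusters(X, prototypes)
            # 3. Recompute each cluster centre as the mean of its assigned points.
            new_prototypes = np.array([
                X[assignments == k].mean(axis=0) if np.any(assignments == k) else prototypes[k]
                for k in range(K)
            ])
            # 4. Repeat until the cluster centres converge.
            if np.allclose(new_prototypes, prototypes):
                break
            prototypes = new_prototypes
        return prototypes, assignments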

Weight Normalisation

When using the classic Euclidean distance to calculate point-to-prototype distances, we end up computing a dot product between the prototype and the data-point.

We should [see page 17, normalize] the data (keep its magnitude under 1) to avoid far-away clusters getting a high response simply because they point in the same direction as the data-point.
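One way to do this is to scale each data-point to unit length (whether the notes mean per-point or per-feature scaling is not stated, so this is an assumption):

    def normalise(X, eps=1e-12):
        """Scale each data-point to unit Euclidean length so no point dominates the dot product."""
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X / np.maximum(norms, eps)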

Initialisation Methods

Various [see page 17, initialisation-methods] exist which use data-points in the training-set to pick the initial list of prototypes.

  • K-Means: Select data points that are far away from each other.
  • ROBIN: Select data points far away from each other and in dense regions of the feature space (where a lot of points already cluster together).
  • DK-Means++: Similar to ROBIN but deterministic.

Clustering is sensitive to the initial cluster centres, and the accuracy of the final clustering can be directly improved by a good initialisation.
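A rough sketch of the "far away from each other" idea as a greedy farthest-point heuristic (this is a generic illustration, not the exact K-Means++/ROBIN/DK-Means++ procedure from the slides):

    def farthest_point_init(X, K, seed=0):
        """Greedily pick K data-points that are far away from each other."""
        rng = np.random.default_rng(seed)
        prototypes = [X[rng.integers(len(X))]]      # first prototype: a random data-point
        while len(prototypes) < K:
            # Distance from every point to its nearest already-chosen prototype.
            d = np.min(
                np.linalg.norm(X[:, None, :] - np.array(prototypes)[None, :, :], axis=2),
                axis=1,
            )
            # Next prototype: the point farthest from all chosen prototypes.
            prototypes.append(X[np.argmax(d)])
        return np.array(prototypes)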