Brain Dump

Clustering

Tags
adaptive-intelligence

A form of unsupervised learning where we identify and group similar samples based on the structure of the data.

While clustering can be useful for grouping it isn't very useful for classification, unless you have some way of associating the clusters with samples.

A complete clustering forms a [see page 7, voronoi tesselation], where the boundary between two clusters can be seen as the decision boundary between one class of clusters and another.

Evaluation

Silhouette

Is an internal method that evaluates a cluster using a [see page 20, measure] of the distance of each point in the cluster and a measure of the distance to every point outside of the cluster.

\begin{align*} a^{\mu} = \frac{1}{n_k - 1} \sum_{\mu' \in C_k, \mu' \neq \mu} |x^{\mu} - x^{\mu '} | \
b^{\mu} = \min_{l \neq k} [ \frac{1}{n_l} \sum_{\mu' \in C_l} |x^{\mu} - x^{\mu'} ] \
s^{\mu} = \frac{b^{\mu} - {a}{\mu}}{\text{max}({a}{\mu}, {b}^{mu})} \end{align*}

For large clusters you want \( b \) to be large (far away from external clusters), and the internal distance \( a \) to be small (cluster points should be close together).

The range of the silhouette should be \( -1 \leq {s}^{\mu} \leq 1 \).

Purity

An external method that evaluates a cluster based on the average [see page 21, purity] of all the clusters. This approach requires some knowledge of existing clusterings.

We define purity as the maximum number of points in a cluster which're known to belong to the same cluster.

\begin{align*} P_{C_k} = \frac{\text{max}_{l}(n_k^{l})}{n_k} \
P = \sum_{k} \frac{n_k}{n} P_{C_k} \end{align*}