Brain Dump

Box Plots

Tags
math

A box plot is a figure that shows the distribution of a set of data, making it easy to compare two or more data sets. It divides the data into quartiles, each covering a quarter of the sorted values: if you sort the data and pick the \( \frac{n}{4} \)-th value, you get the lower quartile \( Q_1 \). Similarly, \( Q_2 \) is the median of the data set, and the upper quartile \( Q_3 \) is the \( \frac{3n}{4} \)-th value of the sorted data.

Note: We define an outlier (often, though not always, an erroneously recorded value) as a value that lies more than \( 1.5 \times \text{IQR} \) beyond the nearest quartile, where IQR is the interquartile range, i.e. the distance between the upper and lower quartiles, \( Q_3 - Q_1 \).
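
Written out as explicit cut-offs, the rule reads as follows. The names \( f_{\text{low}} \) and \( f_{\text{high}} \) are introduced here purely for illustration and are not part of the definition above. \[ f_{\text{low}} = Q_1 - 1.5 \, (Q_3 - Q_1), \qquad f_{\text{high}} = Q_3 + 1.5 \, (Q_3 - Q_1) \] A value \( x \) is then an outlier when \( x < f_{\text{low}} \) or \( x > f_{\text{high}} \).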

\begin{figure}
  \centering
  \begin{tikzpicture}
    \draw[very thin, gray] (0,0) grid (10,3);
    \draw
      (1, 2)                 node[above] {$LV$} -- (1, 1)
      (9,2)                  node[above] {$HV$} -- (9,1)
      (3,.75)                node[below] {$LQ$} rectangle (6,2.25) node[above] {$UQ$}
      (4.5,2.25) -- +(0,-1.5) node[midway,right] {$M$};
    \draw (1, 1.5) -- (3, 1.5) (6, 1.5) -- (9, 1.5);
  \end{tikzpicture}
  \caption{An example box plot with the Lowest Value (\( LV \)), Lower Quartile (\(LQ \)),
Median value (\( M \)), Upper Quartile (\( UQ \)), and Highest Value (\( HV \)) annotated.}
\end{figure}

For example, consider the following distribution: \[ 17,23,35 \; (\textbf{LQ}),36,51,53 \; (\textbf{M}),54,55,60 \; (\textbf{UQ}),77,110 \] Note: In this example the only outlier is 110, because it is greater than \( Q_3 + 1.5 (Q_3 - Q_1) = 60 + 1.5 \times 25 = 97.5 \).
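
The example above can be checked with a short Python sketch. It assumes one particular quartile convention, namely taking the value at 1-based rank \( \lceil kn/4 \rceil \) of the sorted data, which reproduces the quartiles marked in the example; other conventions (such as interpolating between neighbours) can give different quartiles. The function names `quartiles` and `outliers` are hypothetical helpers, not part of any library.

```python
import math


def quartiles(values):
    """Return (Q1, Q2, Q3) for a non-empty list of numbers.

    Assumption: Qk is the value at 1-based rank ceil(k * n / 4) of the
    sorted data. This is only one of several quartile conventions.
    """
    data = sorted(values)
    n = len(data)
    # Convert the 1-based rank to a 0-based list index.
    return tuple(data[math.ceil(k * n / 4) - 1] for k in (1, 2, 3))


def outliers(values):
    """Return the values lying more than 1.5 * IQR beyond the nearest quartile."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


data = [17, 23, 35, 36, 51, 53, 54, 55, 60, 77, 110]
print(quartiles(data))  # (35, 53, 60)
print(outliers(data))   # [110]
```

The lower cut-off here is \( 35 - 1.5 \times 25 = -2.5 \), so no value in the example is an outlier on the low side.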

Skewness

TODO: Insert skewness diagram.

There are several measures of skewness that can be used alongside the box plot:

  1. Use the Quartiles:

    The distribution has

    • positive skew if \( Q_2 - Q_1 < Q_3 - Q_2 \).
    • negative skew if \( Q_2 - Q_1 > Q_3 - Q_2 \).
  2. Use the mean, median, and mode in alphabetical order.

    The distribution has:

    • negative skew if the less-than constraint holds: \( \text{mean} < \text{median} < \text{mode} \).
    • positive skew if the greater-than constraint holds: \( \text{mean} > \text{median} > \text{mode} \).
  3. Use the following formula: \[ \frac{3 (\text{mean} - \text{median})}{\text{standard deviation}} \] If this is positive, the distribution has positive skew; if it is negative, the distribution has negative skew.
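
As a rough illustration, the quartile comparison (measure 1) and the formula (measure 3) can be sketched in Python and applied to the example data set from earlier. The helper names `quartile_skew` and `pearson_skew` are hypothetical, and the choice of the population standard deviation (`statistics.pstdev`) is an assumption, since the note does not say which standard deviation is meant. That choice affects only the size of the result, not its sign.

```python
import statistics


def quartile_skew(q1, q2, q3):
    """Classify skew by comparing Q3 - Q2 with Q2 - Q1."""
    upper, lower = q3 - q2, q2 - q1
    if upper > lower:
        return "positive"
    if upper < lower:
        return "negative"
    return "symmetric"


def pearson_skew(values):
    """Compute 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return 3 * (mean - median) / sd


data = [17, 23, 35, 36, 51, 53, 54, 55, 60, 77, 110]
print(quartile_skew(35, 53, 60))  # negative
print(pearson_skew(data) < 0)     # True
```

For this data set both measures report negative skew: \( Q_2 - Q_1 = 18 \) exceeds \( Q_3 - Q_2 = 7 \), and the mean (about 51.9) lies below the median (53).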