Box Plots
- Tags
- math
A box plot is a figure that summarizes the distribution of a set of data, making it easy to compare two or more sets of data. It divides the data into quartiles, where each quartile covers a quarter of the distribution of values; that is to say, if you sort the distribution and pick the \( \frac{n}{4} \)-th value, you get the lower quartile \( Q_1 \). Similarly, \( Q_2 \) is the median of the data set, and the upper quartile \( Q_3 \) is the \( \frac{3n}{4} \)-th value of the sorted distribution. When a position is not a whole number, these notes round it up to the next whole position.
Note: We define an outlier (often, though not always, an erroneously recorded value) as a value that lies more than \( 1.5 \times \text{IQR} \) beyond the nearest quartile, i.e. below \( Q_1 - 1.5 \times \text{IQR} \) or above \( Q_3 + 1.5 \times \text{IQR} \). Here IQR refers to the interquartile range, i.e. the distance between the upper and lower quartiles, \( Q_3 - Q_1 \).
\begin{figure}
\centering
\begin{tikzpicture}
\draw[very thin, gray] (0,0) grid (10,3);
\draw
(1, 2) node[above] {$LV$} -- (1, 1)
(9,2) node[above] {$HV$} -- (9,1)
(3,.75) node[below] {$LQ$} rectangle (6,2.25) node[above] {$UQ$}
(4.5,2.25) -- +(0,-1.5) node[midway,right] {$M$};
\draw (1, 1.5) -- (3, 1.5) (6, 1.5) -- (9, 1.5);
\end{tikzpicture}
\caption{An example box plot with the Lowest Value (\( LV \)), Lower Quartile (\(LQ \)),
Median value (\( M \)), Upper Quartile (\( UQ \)), and Highest Value (\( HV \)) annotated.}
\end{figure}
For example, consider the following distribution: \[ 17,23,35 \; (\textbf{LQ}),36,51,53 \; (\textbf{M}),54,55,60 \; (\textbf{UQ}),77,110 \] Note: In this example the only outlier is 110, because it is greater than \( Q_3 + 1.5 (Q_3 - Q_1) = 60 + 1.5 \times 25 = 97.5 \). No value falls below the lower limit \( Q_1 - 1.5 (Q_3 - Q_1) = 35 - 1.5 \times 25 = -2.5 \).
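The quartile and outlier rules above can be checked with a short Python sketch. The function names `quartiles` and `outliers` are made up for this illustration, and the rounding rule (take the value at position \( \lceil kn/4 \rceil \), counting from 1) is an assumption chosen to match the example; other quartile conventions exist and can give slightly different values.

```python
import math

def quartiles(values):
    """Return (Q1, Q2, Q3), picking the ceil(k*n/4)-th sorted value (1-indexed).

    Assumption: fractional positions are rounded up, with no interpolation.
    """
    data = sorted(values)
    n = len(data)

    def pick(k):
        # Convert the 1-indexed position to a 0-indexed list index.
        return data[math.ceil(k * n / 4) - 1]

    return pick(1), pick(2), pick(3)

def outliers(values):
    """Return the values lying more than 1.5 * IQR beyond the nearest quartile."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

data = [17, 23, 35, 36, 51, 53, 54, 55, 60, 77, 110]
print(quartiles(data))  # (35, 53, 60)
print(outliers(data))   # [110]
```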
Skewness
TODO: Insert skewness diagram.
There are several measures of skewness that can be associated with a box plot:
Use the quartiles:
The distribution has:
- positive skew if \( Q_2 - Q_1 < Q_3 - Q_2 \).
- negative skew if \( Q_2 - Q_1 > Q_3 - Q_2 \).
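As a minimal sketch, the quartile comparison can be written as a small Python function. The name `quartile_skew` and the "symmetric" label for the equal case are assumptions of this illustration; the notes only define the two strict inequalities.

```python
def quartile_skew(q1, q2, q3):
    """Classify skew by comparing the two halves of the box."""
    lower = q2 - q1  # distance from lower quartile to median
    upper = q3 - q2  # distance from median to upper quartile
    if lower < upper:
        return "positive"
    if lower > upper:
        return "negative"
    return "symmetric"  # assumption: equal halves are labelled symmetric

# Quartiles of the example distribution: Q1 = 35, Q2 = 53, Q3 = 60.
print(quartile_skew(35, 53, 60))  # negative
```

For the example distribution the box is wider below the median (18) than above it (7), so this measure reports negative skew even though the single outlier lies at the high end.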
Use the mean, median, and mode, compared in alphabetical order.
The distribution has:
- negative skew if the less-than constraint holds: \( \text{mean} < \text{median} < \text{mode} \).
- positive skew if the greater-than constraint holds: \( \text{mean} > \text{median} > \text{mode} \).
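A short Python sketch of this comparison, using the standard `statistics` module. It follows the usual convention that a mean above the median above the mode indicates positive skew, and the reverse ordering negative skew. The function name `skew_from_averages`, the "inconclusive" label, and the two sample data sets are assumptions made for illustration.

```python
import statistics

def skew_from_averages(values):
    """Classify skew from the ordering of mean, median, and mode."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    # With several equally common values, Python 3.8+ returns the first one seen.
    mode = statistics.mode(values)
    if mean > median > mode:
        return "positive"
    if mean < median < mode:
        return "negative"
    return "inconclusive"  # assumption: any other ordering is left undecided

print(skew_from_averages([1, 1, 2, 3, 8]))  # positive (mean 3, median 2, mode 1)
print(skew_from_averages([2, 7, 8, 9, 9]))  # negative (mean 7, median 8, mode 9)
```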
Use the following formula (Pearson's second skewness coefficient): \[ \frac{3 (\text{mean} - \text{median})}{\text{standard deviation}} \] If this is positive, the distribution has positive skew; if it is negative, the distribution has negative skew.
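The formula can be evaluated with Python's standard `statistics` module, as in the sketch below. The notes do not say whether the standard deviation is the population or the sample one; this sketch assumes the population standard deviation (`statistics.pstdev`). The choice changes the magnitude of the result but not its sign. The function name is hypothetical.

```python
import statistics

def pearson_median_skewness(values):
    """Return 3 * (mean - median) / standard deviation.

    Assumption: the population standard deviation is used.
    """
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.pstdev(values)
    return 3 * (mean - median) / sd

data = [17, 23, 35, 36, 51, 53, 54, 55, 60, 77, 110]
skew = pearson_median_skewness(data)
print("positive" if skew > 0 else "negative" if skew < 0 else "none")
```

For the example distribution the mean (\( 571 / 11 \approx 51.9 \)) is slightly below the median (53), so the coefficient is negative.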