         Set 1            Set 2            Set 3            Set 4
       x      y        x      y        x      y        x      y
      10   8.04       10   9.14       10   7.46        8   6.58
       8   6.95        8   8.14        8   6.77        8   5.76
      13   7.58       13   8.74       13  12.74        8   7.71
       9   8.81        9   8.77        9   7.11        8   8.84
      11   8.33       11   9.26       11   7.81        8   8.47
      14   9.96       14   8.10       14   8.84        8   7.04
       6   7.24        6   6.13        6   6.08        8   5.25
       4   4.26        4   3.10        4   5.39       19  12.50
      12  10.84       12   9.13       12   8.15        8   5.56
       7   4.82        7   7.26        7   6.42        8   7.91
       5   5.68        5   4.74        5   5.73        8   6.89
The four sets of data that make up the quartet are similar in many respects. For all four:

    mean of the x values = 9.0
    mean of the y values = 7.5
    sample variance of x = 11.0
    sample variance of y = 4.12 (approximately)
    correlation between x and y = 0.816
    least-squares regression line: y = 3.00 + 0.500x
However, when the data are plotted, the differences among the data sets are revealed.
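The shared statistics are easy to verify directly. Here is a quick check using only the Python standard library; the data are the four (x, y) sets from the table above, and the helper name stats is mine:

```python
# Verify that all four sets of Anscombe's Quartet share the same summary
# statistics: means, variances, correlation, and least-squares line.
from statistics import mean, variance

sets = {
    1: ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
        [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
        [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
        [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def stats(x, y):
    # Sums of squares/cross-products about the means.
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5
    return mx, my, variance(x), variance(y), r, slope, intercept

for k, (x, y) in sets.items():
    mx, my, vx, vy, r, b, a = stats(x, y)
    print(f"set {k}: mean x={mx:.2f}  mean y={my:.2f}  var x={vx:.2f}  "
          f"var y={vy:.2f}  r={r:.3f}  fit: y = {a:.2f} + {b:.3f} x")
```

Each set prints essentially the same line of statistics; only a plot reveals how different the sets really are.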
Numerical Recipes has a "robust" linear fitting routine called MEDFIT that is supposed to be less sensitive to outliers; it works by minimizing the sum of the absolute deviations of the data from the fit rather than the sum of the squared deviations. To test it, I fit lines to all four datasets in Anscombe's Quartet using both least-squares and MEDFIT. The results are below:
The robust fit makes little difference when the data are linear and free of large outliers (set 1). It is markedly better than the least-squares line when the data are mostly linear, with moderate slope and a single outlier (set 3). For sets that are not well represented by a straight line (sets 2 and 4), it is arguably no worse than least-squares fitting; in any case, prior examination of these datasets would have shown that fitting a straight line was not appropriate.
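MEDFIT itself is a Fortran routine, and it locates the optimal slope more cleverly than this, but the underlying idea of least-absolute-deviations fitting can be sketched in a few lines of Python. The key fact is that for a fixed slope b, the intercept minimizing the sum of absolute residuals is simply the median of y - b*x, so the problem reduces to a one-dimensional search over the slope. The function names and the coarse-to-fine grid search here are my own simplification, not the Numerical Recipes algorithm:

```python
# Least-absolute-deviations (L1) line fit, the idea behind MEDFIT.
# For a fixed slope b, the best intercept is the median of the residuals
# y - b*x; we then search over b for the slope with the lowest L1 cost.
from statistics import median

def lad_cost(b, x, y):
    # Best intercept for this slope, and the resulting sum of |residuals|.
    a = median(yi - b * xi for xi, yi in zip(x, y))
    return a, sum(abs(yi - (a + b * xi)) for xi, yi in zip(x, y))

def lad_fit(x, y, lo=-10.0, hi=10.0, rounds=6, steps=200):
    # Coarse-to-fine grid search over the slope; crude but adequate here.
    b = 0.0
    for _ in range(rounds):
        grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
        b = min(grid, key=lambda g: lad_cost(g, x, y)[1])
        span = (hi - lo) / steps
        lo, hi = b - span, b + span
    a, _ = lad_cost(b, x, y)
    return a, b
```

Applied to set 3, whose ten non-outlying points lie almost exactly on y = 4 + 0.346x, this fit tracks the collinear points instead of being dragged toward the outlier at (13, 12.74) the way the least-squares line is.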
Acknowledgments:
In Edward Tufte's excellent book, The Visual Display of Quantitative Information, this quartet is used as an example of the importance of looking at your data before analyzing it. I borrowed the HTML tables of the datasets and fit parameters, and the small GIF showing the data, from J. Randall Flanagan's pages for his Statistics in Psychology course at Queen's University, Kingston, Ontario.