A Cleveland Library
One of the pioneers in developing guidelines for comprehensible data graphics was William Cleveland. Now a professor at Purdue, he was working at Bell Labs in the Statistics and Data Mining Research Department when the following books were written. He developed lowess, a now widely used statistical technique for fitting locally weighted curves to scatterplot data. He also analyzed all 377 graphs in volume 207 of Science, and found significant problems in a third of them; hence these books.
The Elements of Graphing Data
Chapman and Hall/CRC (2nd ed.), 1994 (1985) • ISBN: 0963488414
Now out of print, and consequently pretty pricey, so I wouldn’t recommend you rush out and buy it. Worth reading for the nice summaries of graphical perception experiments (for example the way we tend to underestimate areas and volumes, and are bad at judging slopes and comparing lengths), and for his wise words on the use of error bars. But the graphics are almost all terrible (as is the typography). I’d be inclined to blame the primitive state of computer graphing packages in the 1980s, but Tufte’s The Visual Display of Quantitative Information came out around the same time, and its illustrations are so far superior that there’s really no excuse. One quirk of Cleveland’s is his habit of extending scales back before zero to keep the axes from interfering with data points, although this sometimes creates nonsense (like less-than-zero parts per billion); discussed further here. He also has a penchant for log base 2 transformation of data. I don’t know about you, but I haven’t memorized my powers of 2, and a scale which goes | 32 | 64 | 128 | 256 | makes it very difficult to read off intermediate values and make comparisons. Cleveland also makes pronouncements against placing labels inside the graph and against using color as a scale, both of which I would take issue with. So on the whole, though the book contains some common sense, I can’t recommend following its example.
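To make the log base 2 complaint concrete, here is a minimal sketch of the arithmetic a reader has to do to interpolate between ticks on such a scale. The function names are my own, for illustration only; nothing here comes from Cleveland’s book.

```python
import math

def value_at_fraction(lower_tick, upper_tick, fraction):
    """Value found `fraction` of the way from one tick to the next on a log scale."""
    return lower_tick * (upper_tick / lower_tick) ** fraction

def position_of(value, lower_tick, upper_tick):
    """Fractional position of `value` between two ticks on a log scale."""
    return math.log(value / lower_tick) / math.log(upper_tick / lower_tick)

# A point drawn exactly halfway between the 64 and 128 ticks is not 96.
print(round(value_at_fraction(64, 128, 0.5), 1))   # 90.5

# And 96 itself sits well past the halfway mark.
print(round(position_of(96, 64, 128), 3))          # 0.585
```

So a point that looks like it is in the middle of the 64 to 128 interval is about 90.5, which is not something most of us can work out at a glance.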
Visualizing Data
Hobart Press, 1993 • ISBN: 0963488406
A very different and more satisfying book. Not only are the graphics better, but the organization is more sensible, proceeding from univariate to multivariate to multiway data. Cleveland is concerned with visualization as the first step in statistical analysis, and walks you through fitting curves, generating residuals, and plotting both residuals and fit. He’s not as interested in tweaking his data graphics. Many graphs do have grayed-out elements, usually a reference frame (curiously, this tends not to correspond to the scale tick marks, when it would be perfectly easy for it to do so), but there are still plenty of minor flaws: too many boxes and frames, axis labeling that is rarely intuitive, and the log2 scale cropping up again. There are also box plots everywhere. I’ve never understood what’s so great about a box plot (a one-dimensional scatter plot presented for some reason in two dimensions). Perhaps this is a topic for a future post. These problems aside, Cleveland’s discussion of the importance of visualization in statistics is right on the money, although the bulk of the book is concerned with a rather dry standard procedure for examining datasets.
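For anyone who hasn’t met the fit-then-residuals routine, here is a rough sketch of the idea. It uses an ordinary least-squares straight line as a stand-in, which is my simplification and not the smoothing methods the book works with, and the data are made up purely for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares straight line; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def fitted_and_residuals(x, y):
    """Split each observation into a fitted value and what is left over."""
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, residuals

# Made-up data with deliberate curvature (roughly y = x squared).
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 4.1, 8.9, 16.2, 24.8, 36.1]

fitted, residuals = fitted_and_residuals(x, y)
for xi, fi, ri in zip(x, fitted, residuals):
    print(f"x={xi}  fit={fi:7.2f}  residual={ri:6.2f}")
```

With these numbers the residuals come out positive at both ends and negative in the middle. That kind of pattern is what a residual plot is meant to expose: the straight line has missed a curve in the data.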
Both these books are often touted as essential additions to an information graphics library, but only Visualizing Data is really worth reading, and I wouldn’t call it riveting. And as usual, I recommend you try a price comparison search on Addall or BestWebBuys rather than Amazon.