Lies, damned lies, and data visualization
Posted by Scott on August 30, 2010
I have a ton of respect for information visualization. It’s captivating, intuitive, and informative, all wrapped together in an awesome ‘art meets science’ package. There are so many people doing information visualization brilliantly and it’s a super exciting field to watch.
Worth reflecting on, however, is the current state of mainstream data viz. The infographic, having graced the cover of USA Today for some time now, seems to be getting more sophisticated, and we find ourselves in an era of some sort of amalgam of infographics, infoviz, and “data science”. How many charts, graphics, or other graphical depictions of data do you see every day? Here are three reasons why the data visualization we see today is more potent than the ‘statistics’ referred to in the ‘lies, damned lies’ quote popularized by Mark Twain (in addition to the sheer volume and size of audience, both of which have got to be orders of magnitude larger than in Twain’s day):
1. Seeing is compelling and humans are cognitively lazy
Let’s start with the basics: visualized data is… well, visual, and often much more compelling. As an example, which of the following representations of self-reported smokers by state (courtesy of Many Eyes) is more interesting:
I grabbed pretty much the first example I saw on Many Eyes, but the difference is stark. The map is endlessly interesting, while your eyes glaze over by about Arkansas when reading the list. Couple that with the fact that humans are cognitively lazy and unlikely to take the time to parse the list and the map is the clear winner. Now this is basically a good thing, except:
2. Lack of context/perspective (and humans are cognitively lazy)
David McCandless recently gave a very inspiring TED talk about data visualization. The most important part for me was where he showed that while the U.S. military budget is indeed the largest in the world in absolute terms, when framed as a fraction of GDP it’s actually 8th in the world. Or consider this recent Chart of the Day from Silicon Alley Insider, with the headline announcing that teens text every 10 minutes when awake:
That big green bar for people under 18 certainly pops. The text actually clarifies that this includes messages sent and received, which cuts in half the perception the headline implies, that teens send a text every 10 minutes (not to mention that you can text more than one person at a time, and thus messages received are likely far higher than messages sent). More misleading, the numbers focus on the average over time (1 text every 10 minutes), while the usage pattern likely is not that constant – texting probably happens in bursts when coordinating, etc., and then not at all during other times. This lack of depiction of the actual temporal patterns of the texting is what I mean by lack of context or perspective. This is not to say teens aren’t heavy texters, but it’s not as clear cut as the figure, and certainly not as clear cut as the headline. Again, though, we’re cognitively lazy and unlikely to wade through the details. (Note: this same lack of context applies to traditional statistics as well, which often fail to include covariates, and so on.)
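To make the average-versus-pattern point concrete, here is a minimal sketch in Python. The numbers are invented for illustration and are not taken from the chart: it assumes a 16-hour waking day split into 10-minute slots, and compares a perfectly steady texter with one who sends and receives the same total in a handful of bursts.

```python
# Hypothetical illustration only: none of these numbers come from the chart.
SLOTS = 96  # assumed 16 waking hours, divided into 10-minute slots

# Steady texter: exactly one message in every slot.
steady = [1] * SLOTS

# Bursty texter: the same 96 messages, packed into 8 bursts of 12.
bursty = [0] * SLOTS
for start in range(0, SLOTS, 12):
    bursty[start] = 12


def summarize(messages):
    """Return (mean messages per slot, share of slots with no messages)."""
    mean = sum(messages) / len(messages)
    idle_share = sum(1 for m in messages if m == 0) / len(messages)
    return mean, idle_share


for name, pattern in (("steady", steady), ("bursty", bursty)):
    mean, idle_share = summarize(pattern)
    print(f"{name}: mean per 10 min = {mean:.1f}, idle slots = {idle_share:.0%}")
```

Both patterns average one message every 10 minutes, yet under these made-up assumptions the bursty texter is silent for most of the day. A single bar showing the average cannot tell the two apart.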
3. No established metrics for significance
In inferential statistics there are generally established values for different tests, such as the well-known p < .05, but many more around effect sizes, confidence intervals, and so on. Infographics never say, “have a look, but keep in mind there’s a 60% likelihood this could have occurred by chance.” Here is an example, from a recent USA Today:
You look at this and think that more African-Americans follow baseball than whites or Hispanics. Actually if you assume they sampled 100 people of each ethnicity and then test the results with a chi-square for independence (which I did), this is a non-significant effect. In fact, it’s not even close to significant. Even if this infographic did come with such a disclaimer, it would be like the lawyer who says a bunch of stuff that she knows will be stricken from the record, but that she knows the jury can’t help but process. (BTW, for these percentages to reach significance you would need close to 1000 people of each ethnicity sampled, but the infographic doesn’t report the sample size.)
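For readers who want to try this kind of check themselves, here is a rough sketch of a chi-square test for independence in plain Python. The three percentages below are placeholders I made up for illustration; they are not the figures from the USA Today graphic, so the exact numbers will differ from my test. The p-value uses the fact that with three groups and two outcomes (follows baseball or not) there are 2 degrees of freedom, where the chi-square tail probability is simply exp(-x/2).

```python
import math

# Placeholder shares of each group who "follow baseball".
# These are invented for illustration, NOT the infographic's numbers.
HYPOTHETICAL_SHARES = [0.37, 0.34, 0.31]


def chi_square_independence(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df2(stat):
    """Upper-tail p-value of a chi-square statistic with exactly 2 df."""
    return math.exp(-stat / 2)


def run_at_sample_size(n):
    """Build a 3x2 table (follows / does not) with n people per group."""
    table = [[share * n, (1 - share) * n] for share in HYPOTHETICAL_SHARES]
    stat = chi_square_independence(table)
    return stat, p_value_df2(stat)


for n in (100, 1000):
    stat, p = run_at_sample_size(n)
    print(f"n={n} per group: chi2={stat:.2f}, p={p:.3f}")
```

With these placeholder shares, 100 people per group gives a p-value of roughly 0.67, nowhere near significance, while the same percentages with 1000 per group come in under .05. The percentages on the page are identical in both cases; only the unreported sample size changes the conclusion.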
Statistics are about extracting meaning from numbers. So is data visualization. The question, and this is at the heart of the Twain quote, is to what extent can we trust what they are saying. Is there an infographic equivalent of p < .05?