Least Squares

just trying to minimize error

Archive for August, 2010

Lies, damned lies, and data visualization

Posted by Scott on August 30, 2010

I have a ton of respect for information visualization. It’s captivating, intuitive, and informative, all wrapped together in an awesome ‘art meets science’ package. There are so many people doing information visualization brilliantly and it’s a super exciting field to watch.

Worth reflecting on, however, is the current state of mainstream data viz. The infographic, having graced the cover of USA Today for some time now, seems to be getting more sophisticated and we find ourselves in an era of some sort of amalgam of infographics, infoviz, and “data science”. How many charts, graphics, or other graphical depictions of data do you see every day? Here are three reasons why the data visualization we see today is more potent than the ‘statistics’ referred to in the ‘lies and damned lies’ quote popularized by Mark Twain (in addition to the sheer volume and size of audience, both of which have got to be orders of magnitude larger than in Twain’s day):

1. Seeing is compelling and humans are cognitively lazy

Let’s start with the basics: Visualized data is… well, visual and often much more compelling. As an example, which of the following representations of self-reported smokers by state (courtesy of ManyEyes) is more interesting:



I grabbed pretty much the first example I saw on Many Eyes, but the difference is stark. The map is endlessly interesting, while your eyes glaze over by about Arkansas when reading the list. Couple that with the fact that humans are cognitively lazy and unlikely to take the time to parse the list, and the map is the clear winner. Now this is basically a good thing, except:

2. Lack of context/perspective (and humans are cognitively lazy)

David McCandless recently gave a very inspiring TED talk about data visualization, the most important part of which for me was where he showed that while the military budget for the U.S. is indeed the largest in the world, when framed as a fraction of GDP it’s actually 8th in the world. Or consider this recent Chart of the Day from Silicon Alley Insider, with the headline announcing that teens text every 10 minutes when awake:

That big green bar for people under 18 certainly pops. The text actually clarifies that this includes messages sent and received, which cuts in half the perception the headline implies, that teens send a text every 10 minutes (not to mention that you can text more than one person at a time, and thus messages received are likely far higher than messages sent). More misleading, the numbers focus on the average over time (1 text every 10 minutes), while the usage pattern likely is not that constant – texting probably happens in bursts when coordinating, etc., and then not at all during other times. This lack of depiction of the actual temporal patterns of the texting is what I mean by lack of context or perspective. This is not to say teens aren’t heavy texters, but it’s not as clear-cut as the figure, and certainly not as clear-cut as the headline. Again, though, we’re cognitively lazy and unlikely to wade through the details. (Note: this same lack of context applies to traditional statistics as well, which often fail to include covariates, and so on.)

3. No established metrics for significance

In inferential statistics there are generally established thresholds for different tests, such as the well-known p < .05, along with many more conventions around effect sizes, confidence intervals, and so on. Infographics never say, “have a look, but keep in mind there’s a 60% likelihood this could have occurred by chance.” Here is an example, from a recent USA Today:

You look at this and think that more African-Americans follow baseball than whites or Hispanics. Actually, if you assume they sampled 100 people of each ethnicity and then test the results with a chi-square test for independence (which I did), this is a non-significant effect. In fact, it’s not even close to significant. Even if this infographic did come with such a disclaimer, it would be like the lawyer who says a bunch of stuff that she knows will be stricken from the record, but that she knows the jury can’t help but process. (BTW, for these percentages to reach significance you would need close to 1000 people of each ethnicity sampled, but the infographic doesn’t report the sample size.)
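For readers who want to try this kind of check themselves, here is a minimal sketch of a chi-square test for independence on a 3-group by 2-outcome table. The shares in `hypothetical_shares` are placeholders made up for illustration, not the figures from the USA Today graphic, and the helper names are likewise assumptions. The p-value shortcut `exp(-x/2)` is exact only for 2 degrees of freedom, which is what a 3×2 table has.

```python
import math

def chi_square_independence(table):
    """Return (Pearson chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

def p_value_two_dof(statistic):
    """Chi-square upper-tail probability; this closed form holds only for dof == 2."""
    return math.exp(-statistic / 2)

# HYPOTHETICAL shares of each group who say they follow baseball (placeholders).
hypothetical_shares = [0.48, 0.45, 0.41]

def p_value_at_sample_size(n_per_group, shares=hypothetical_shares):
    # Build a [follows, does not follow] row of counts for each group.
    table = [[round(s * n_per_group), n_per_group - round(s * n_per_group)]
             for s in shares]
    statistic, dof = chi_square_independence(table)
    if dof != 2:
        raise ValueError("p_value_two_dof only applies to 2 degrees of freedom")
    return statistic, p_value_two_dof(statistic)

for n in (100, 1000):
    stat, p = p_value_at_sample_size(n)
    print(f"n={n} per group: chi2={stat:.2f}, p={p:.3f}")
```

With these placeholder shares, the same percentages are nowhere near significant at 100 people per group and clear the .05 threshold at 1000 per group, which is the general pattern described above: identical bars on a chart can mean very different things depending on a sample size the graphic never shows.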

Statistics are about extracting meaning from numbers. So is data visualization. The question, and this is at the heart of the Twain quote, is to what extent we can trust what they are saying. Is there an infographic equivalent of p < .05?


Posted in Uncategorized | 3 Comments »

Top Twitter Authors for Topic ‘God’

Posted by Scott on August 20, 2010

Yesterday I tweeted about an NPR piece on the neuroscience of religious experience. This got me wondering about people who tweet about ‘god’, so I ran ‘god’ through this algorithm we’ve been working on for finding topical authorities in Twitter. Here are the results, along with a few observations.

Notes/Caveats: This was computed using one ‘authoritativeness’ method (developed largely by Aditya Pal, with a bit of chiming in from me). There are many ways to do this, each of which would likely yield a different result. Without going into the details, our method is not a graph-based solution (though we do incorporate some graph features), it looks only at the most recent 5 days of Twitter, and does not include latent topics. So, if a person hasn’t tweeted the word ‘god’ in the last 5 days, they won’t be included. Also, people use the word ‘god’ all the time in non-religious ways and thus you end up with people like @stewie_griffin and @chazsom3ers on the list, who I simply ignore.


  1. RevRunWisdom
  2. chazsom3ers
  3. VanNessVanWu
  4. MaxLucado
  5. ihatequotes
  6. jaesonma
  7. CSLewisDaily
  8. UGOdotcom
  9. RickWarren
  10. Stewie_Griffin
  11. TheLoveStories
  12. DaRealAmberRose
  13. JoyceMeyer
  14. DeepakChopra
  15. FunnyOrFact

Some (very non-scientific) observations

In terms of basic numbers, these folks (ignoring the couple of non-religious accounts) have an average of 257k followers, ranging from 12k to 1.3m, though most are in the 100-200k range (removing the person with 1.3m followers lowers the average to 127k). They have an average of 19.3 tweets containing the word ‘god’ over the past 5 days, ranging from 7 to 53, though most are very close to 20 (or 4 per day).
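As a rough sanity check on how much one account moves that average, here is a small sketch that works backward from the rounded figures quoted above (257k mean, 127k mean without the top account, 1.3m top account). The function name is made up for illustration, and because the inputs are rounded the result is only approximate.

```python
def implied_count(mean_all, mean_without_top, top):
    """Solve mean_all * n - top == mean_without_top * (n - 1) for n."""
    return (top - mean_without_top) / (mean_all - mean_without_top)

# Follower figures in thousands, as rounded in the text above.
n = implied_count(mean_all=257, mean_without_top=127, top=1300)
print(f"implied number of accounts: {n:.1f}")

# Share of all followers held by the single largest account.
top_share = 1300 / (257 * round(n))
print(f"largest account's share of total followers: {top_share:.0%}")
```

The figures imply roughly nine accounts in the religious subset, with the largest one holding more than half of the combined followers, which is why dropping it roughly halves the mean.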

The results seem to fall into a couple of categories, but in general, the dominant trend is around the sharing of inspirational quotes and words of wisdom or encouragement, with a healthy dose of business-oriented media savvy thrown in. People like @joycemeyer are clearly leveraging Twitter.

Religious hipsters/musicians

@RevRunWisdom (Run from Run DMC; 1.3M followers) is the number one result and really falls more in the category of what I snarkily call the ‘self-help section of Twitter’ (see below). He mainly posts quotes, using lots of hashtags like #anxietyfree and #powerprayer. There’s little in the way of personal content, and you get the feeling he uses Twitter kind of like preaching – it’s about conveying a message, not about his personal life. Overall though, he’s a celeb who feels very ‘real’.

Also very hip looking, but much less of a celeb, is @jaesonma (12k followers): If you click through to his webpage, it’s about ‘God, Culture, Mission.’ From what I can tell, this is a slick, hip, social media-savvy way to spread his message. @VanNessVanWu (27k followers) falls into this same category: contemporary religious soul musician.

Ministers The business of god

This is interesting. If the democratization of preaching afforded by Twitter lets the religious hipster get 12k followers, how about actual ministers? Oops, we don’t know because there aren’t any on the list, really. The closest is @MaxLucado (100k followers) who is a minister, but also an author, and whose tweets seem to swing between inspirational quotes and updates about his book tour. @RickWarren (“Location: I live in the State of Grace”; 136k followers) runs pastors.com. His tweets are mainly inspirational quotes, and the whole thing looks very well-meaning, though very aware of the power of Twitter to reach a large audience. Similarly, @JoyceMeyer clearly uses Twitter in a very media/business savvy manner, with lots of promotion of her inspiration empire. Finally, we have Deepak Chopra, though somewhat surprisingly he only has 265k followers (@RevRunWisdom has 5 times as many).

Reading these folks’ tweets makes me wonder about the balance of spreading inspiration and marketing a business. My guess is this approach is super successful: people follow for the inspiration and also get updates on tour info, promotions, etc. It seems well intentioned, but there’s no doubt it’s also a business.

The self-help section of Twitter

Just like at your local Barnes and Noble, there is quite the demand for inspiration and encouragement on Twitter, and the character limit seems perfect for quote sharing. Authors like @cslewisdaily, @ihatequotes, and @thelovestories (“Location: Your heart”) do virtually nothing but quote sharing. I’d love to know what percentage of Twitter is quote sharing.


MSR Folks Working in Social Computing

Posted by Scott on August 18, 2010

It was pointed out to me recently that MSR’s web presence for our collective social computing effort is painfully out of date. Ah, group web pages – serious diffusion of responsibility. In a small effort to remedy the situation (without, of course, actually taking the time to fix the group web pages), here is a list of some of the people and groups working in the social media/computing space as of summer 2010.

Note that while I provide short descriptions, each person’s work is considerably more multifaceted than that. Also, this list could be a touch shorter if you restricted your definition of social computing, but it would be much longer if you broadened it to include areas like CSCW, telepresence, and HCI. Finally, I’m sure I missed people, especially outside the Redmond lab.

Redmond Lab

Adaptive Systems and Interaction (ASI, CLUES)


Machine Learning and Applied Statistics

Natural Language Processing


Text Mining, Search, Navigation

New England Lab

China Lab

  • Chen Zhao (SNSs in China, enterprise social computing)

India Lab


Outside MSR

  • FUSE – Lots of innovation in social media going on there.
  • Bing – Bing is also really innovative and has great people. Check www.bing.com/social, which should continue to get more sophisticated and more interesting. Also, check the Twitter and hyperlocal blog map mashups from Matt Hurst and company.

