Entropy-based tweet sampling
Posted by Scott on March 29, 2011
Problem: There are 100,000+ tweets from the past couple of days on the disaster in Japan, and you want the find the “best” 10. What do you do?
This is a very real problem if you want to search social media. There is no pagerank, no algorithm for determining the best content. What does “best” even mean in this context? Most recent? Most retweeted? Originating from the most authoritative sources (assuming you could determine who those were)? This is also a potentially very helpful problem to solve: how useful would it be for search engines to return the best social media results for any given topic?! (Note that current approaches focus on recent content and content with heavily shared links; see Bing Social and Google Realtime.)
This is the challenge Munmun De Choudhury took on last summer during her internship at MSR, working with Mary Czerwinski and me. A pair of papers on the work is set to come out in upcoming the HyperText and ICWSM conferences (early drafts here and here, respectively). Here I provide a short summary, but there are many more details in the papers.
We broke the problem down into two pieces: sampling and end user cognition. Intuitively, we want to find the sample of tweets on a given topic that the user finds most engaging, informative, and memorable. I’ll break down each piece in turn:
Sampling: A popular topic in Twitter will see tens to hundreds of thousands of tweets over the course of a couple of days. These tweets originate from people all over the planet, provide all sorts of angles on a topic (e.g., the political versus the economic perspective), and so on. How do you sample from such an incredibly diverse media source? Here Munmun had her first stroke of brilliance: leverage the diversity by sampling based on it. That is, by characterizing the set of tweets by its level of information diversity (entropy), it can then be sampled to meet a desired level of diversity. From a usage scenario standpoint, a user could then specify a level of diversity (ranging from highly homogenous to highly heterogeneous).
To do this, each tweet was characterized numerically in terms of the following attributes: whether it is a retweet, is a reply, and/or contains a link, it’s timestamp, location (timezone) of the author, thematic categorization (politics, sports, technology, etc.), number of followers and followees of the author, and degree of activity of the author (number of tweets). Doing this for every tweet on a topic generates the information space for tweets over that topic that can then be sampled according to a desired level of diversity.
That makes sense, but it’s still a huge amount of items to wade through. Might there be a way to prune without losing valuable information? Here was Munmun’s second stroke of genius: what if we treat tweet streams like signals that can be compressed? As we show in the ICWSM paper, using a Haar wavelet transform, she was able to eliminate about half of the set of tweets on a topic with minimal loss of the information space. Yes! From the remaining set of tweets, she employed a greedy sampling algorithm that sampled 10 tweets that best matched a specified level of diversity.
This sampling process, then, follows three high level steps: 1) characterize the set of tweets on a topic as an information space by quantifying each tweets along a number of attributes, or dimensions, 2) prune the information space in such a way as to not lose much of that information space, 3) from the remaining items, iteratively create a sample of 10 items that matches a desired level of diversity. Here is an example from the tweets about last summer’s oil spill in the Gulf of Mexico. Our method is ‘PM’ at the bottom. MTU represents tweets with the most tweeted URLs, which was the second best method in our tests.
It turns out that for a given level of diversity, there is pretty much one set of 10 items that meets that level. Here we show that regardless of the starting “seed” tweet, the resulting set of 10 tweets for a given level of diversity is very similar. Note the slight increase in variation at the middle levels of information diversity (because there are more ways the tweet space can be combined -more degrees of freedom- to achieve levels of diversity toward the middle of the diversity scale):
We go on to show that Munmun’s sampling process better matches target levels of diversity, across a wide range of topics and across levels of diversity.
OK, but how do you know these are good samples? PageRank offers an objective measure for which result is the best for a given search term. In the absence of such a metric, we turned to user cognition. That is, our reasoning was that the best social media results would be informative, engaging to read, and actually be remembered by users. A user study showed that Munmun’s method (PM) generated tweet samples that were largely better than a number of baselines (B1-3) along these user cognition measures:
Summary: How do you sample a highly diverse media space like Twitter for the best content, and how do you define “best”? In this work we approached this problem from a entropy-based sampling and user cognition standpoint, with promising results.