Selection bias and the perils of data science

According to the Guardian Data Blog, Obama is heading for electoral success, on the basis of a Twitter-based analysis.

It’s all very nice to see mapped out, and the use of geocoding is cool (though possibly flawed), but underlying the approach is a massive potential for selection bias.

The problem is quite simply this: if Democrat supporters use Twitter more frequently (or are more likely to tweet about their political preferences) than Republicans, then the number of tweets supporting Obama over Romney is of course going to suggest that Obama is in the lead. On the other hand, if Republicans are more Twitter-active than Democrats, then there could be an underestimation of the level of support for Obama. Essentially, we’ve got a reasonable estimate for a numerator, but no clue about the denominator.
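To put some rough numbers on that, here is a minimal sketch in Python. The function name `tweet_share` and the tweeting rates are made up purely for illustration; nothing here comes from the Guardian analysis. It assumes every tweet can be cleanly assigned to one side, which is itself optimistic.

```python
def tweet_share(true_support, rate_a, rate_b):
    """Share of political tweets that favour candidate A.

    true_support: fraction of voters who actually back A (between 0 and 1)
    rate_a, rate_b: average tweets per supporter of A and of B
                    (hypothetical values, chosen for illustration)
    """
    tweets_a = true_support * rate_a
    tweets_b = (1 - true_support) * rate_b
    return tweets_a / (tweets_a + tweets_b)


# A dead-heat electorate, but A's supporters tweet three times as often.
print(tweet_share(0.5, rate_a=3.0, rate_b=1.0))  # 0.75

# Same electorate, with the tweeting habits reversed.
print(tweet_share(0.5, rate_a=1.0, rate_b=3.0))  # 0.25

# Only when both sides tweet at the same rate does the share match reality.
print(tweet_share(0.5, rate_a=2.0, rate_b=2.0))  # 0.5
```

The same 50/50 electorate can look like a landslide in either direction, depending only on who tweets more. Without knowing the tweeting rates (the denominator), the tweet share on its own can’t tell you which scenario you’re in.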

To answer a question well, the design of the study is crucial. It’s so important to ask the right question in the right sample of people. Think about it – if you want to estimate what proportion of the British population smoke, would you get a more accurate result if you randomly stopped people walking down the road and asked them, or if you approached groups of people standing outside pubs of an evening? Not that the ‘stopping people in the street’ method is without bias (results may vary according to neighbourhood, time of day, day of week), but the point is that if you go to where a group of people with similar behaviours hang out, and find that something is very common in that group, you can’t assume the results apply to the rest of the population.

Over-interpretation of findings is another trap to avoid. A candidate’s name appearing in a tweet doesn’t guarantee endorsement. A geocode attached to a tweet doesn’t guarantee that the tweeter is eligible to vote in the place they are tweeting from. Some tweeters may not be eligible to vote at all (e.g. children). Prolific tweeters may mention a topic many times, but still only get one vote. A view expressed in the Twittersphere might not be maintained all the way to the ballot box.*
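The ‘prolific tweeter’ problem is easy to see with a toy example. This is only a sketch: the usernames and tweets below are invented, and the helper names `count_by_tweet` and `count_by_user` are my own, not anything from the Guardian’s method.

```python
from collections import Counter

# Hypothetical (user, candidate mentioned) pairs; not real data.
tweets = [
    ("alice", "Obama"), ("alice", "Obama"),
    ("alice", "Obama"), ("alice", "Obama"),
    ("bob", "Romney"),
    ("carol", "Romney"),
]


def count_by_tweet(tweets):
    """Count every tweet, so prolific tweeters dominate the total."""
    return Counter(candidate for _, candidate in tweets)


def count_by_user(tweets):
    """Count each (user, candidate) pair once, closer to one person, one vote."""
    return Counter(candidate for _, candidate in set(tweets))


print(count_by_tweet(tweets))  # Obama 4, Romney 2
print(count_by_user(tweets))   # Obama 1, Romney 2
```

Counting tweets puts Obama ahead two to one; counting people puts Romney ahead. Even the per-user count still assumes that a mention is an endorsement and that each user is an eligible voter, neither of which is guaranteed.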

All that the Guardian map really shows us is that Twitter users are more likely to talk about Obama and Biden than Romney and Ryan.** That’s all. It doesn’t even tell us whether they’re talking positively about the incumbent (working that out is what the authors call ‘sentiment analysis’). Being a ‘politically opinionated Twitter user’ also comes with a whole set of characteristics attached, which could themselves influence the content of your tweets (a problem we call ‘confounding’).

You may be wondering why I’m labouring over this example, but I think it also demonstrates another point, namely the perils of the new wave of ‘data science’. It may be the sexy new discipline (or a sexy re-branding of stuff that people have already been doing for ages under duller names) but it’s not without issues. There’s so much data out there these days that it’s good to make the most of it. We need to be careful, though, not to generate meaningless analyses.

Don’t get me wrong, I like a good infographic as much as the next nerd, and I’m generally quite impressed with what the Guardian’s Data Blog is trying to achieve. But

Big Data + Cool Tools =/= Right Answers

unless a fair amount of critical thought and intelligence is also added to the equation. It’s been suggested that a data scientist is “a better software engineer than any statistician and a better statistician than any software engineer”, but I’d argue that a decent heap of contextual understanding is also required.

Working with big clinical datasets, it concerns me how little talk there is about data errors, missingness, and potential for misinterpretation in this burgeoning field (or at least at the popular science reporting end of it). Anyone can sign up for a Kaggle competition, but not everyone has the subject-specific knowledge to spot whether a particular approach is appropriate for the question under consideration, and what potential sources of bias and confounding may lead to a misleading conclusion.

Admittedly, this map isn’t a very serious example, and I think even the creators would agree it’s a fairly informal analysis. But as more of these analyses get featured in the popular press, I wonder if there’s a risk that (a) people become so accustomed to these maps or plots that they casually accept the headlines without thinking more deeply or (b) the noise of dubious infographics and analyses drowns out the signal of the good, reliable stuff, and the bubble bursts. We’ve already got to the point where there are collections of bad examples and infographics about infographics (sigh!).

If Obama wins the election I’ll be a happy bunny (yay for Obamacare!), but it won’t make me think that Twitter predicted the outcome.

*I desperately want to make an ‘intention to Tweet’ pun but I can’t quite make it work.

**I’m not even sure how they figured out that tweets containing ‘Ryan’ pertained to the election. Perhaps it had to be in combination with another relevant word. Otherwise we can add classification errors into the mix. At least the other candidates have distinctive surnames!

Image credit: Image by Scot A. Hale and Mark Graham, sourced from the Guardian website, used under a Creative Commons License (Attribution-NonCommercial-ShareAlike 3.0 Unported)

(This post was brought to you by #AcWriMo!)

5 comments

  1. Great post. Good job!!

  2. Reblogged this on Mr Epidemiology and commented:
    A great post illustrating selection bias using the 2012 US Election and Twitter.

  3. Good explanation! So we decided to tweet about it on @EpidemiologyUU 🙂

  4. I guess Twitter analysis, now most popularly called sentiment analysis, has given nearly accurate results. Sentiment analysis works by judging the polarity of a sentence: based on the sentiment expressed, the polarity is classed as negative or positive. It has been used not only in the USA but in Canadian elections too. I guess the field is still evolving, and we are yet to see a real-time example of sentiment analysis in use. Twitter is a really good source of data. It is the medium where people express their opinions about any brand or product, and by using Twitter data companies can understand their customers better. Companies can also check brand perception among their customers and gauge the response to new products. Therefore, I believe that sentiment analysis has a lot of scope for brands and companies.
