Correlation vs Causation (part 1)

Correlation vs Causation by xkcd

I’m a massive fan of the webcomic xkcd. Don’t be surprised if you find me using Randall Munroe’s creative outputs on a regular basis to help me get my point across.

It’s easy to find things that correlate in everyday life. Cold weather spells correlate with higher fuel bills. The start of the festival season correlates with increased sales of camping gear. Risk of death increases as body temperature plummets. But how do you tell when the apparent association between A and B actually means that A causes B?

Say you work for the London Strawberry Marketing Board and you’ve been set the task of determining whether permanent residence in a given area determines the rate of consumption of strawberries. You’re asked to focus on the month of June. You determine strawberry consumption by measuring the number of punnets sold per day in local shops and markets. You decide to analyse your data by postcode, and in doing so find that the SW19 area has considerably higher rates of strawberry consumption than any other area. You conclude that being a permanent resident of SW19 somehow causes you to have a heightened preference for strawberries.

You are wrong.

You have identified a correlation, but not necessarily a causal relationship. See, SW19 is the home of the All England Lawn Tennis and Croquet Club, which plays host to Wimbledon each summer, traditionally starting in late June. You thought that measuring strawberry sales was a good indication of the consumption habits of the permanent residents. But you failed to take into account the effect of the vast number of outside visitors at that time of year, many of whom purchase strawberries on their way to watch the tennis. This alternative explanation of the observed data is known as ‘confounding’, but we’ll return to that another day.

So, back to the drawing board. After adjusting for sporting fixtures, you find that the June strawberry consumption rates in SW19 are only marginally higher than in the rest of London. You realise that this is not necessarily attributable to any feature of the SW19 residents themselves. Perhaps the result is due to chance, or perhaps the local residents have a Pavlovian (pavlova-ian?) conditioning to up their strawberry eating at that time of year. Either way, the effect is too small to worry your bosses with.
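For anyone who likes to see this sort of thing in numbers, here is a minimal sketch of the strawberry story in Python. Everything in it is made up for illustration: the postcodes other than SW19, the resident and visitor counts, the purchase rates and the match days are all assumptions, and "adjusting for sporting fixtures" is done in the crudest way possible, by simply leaving the match days out. Residents in every postcode are given exactly the same appetite for strawberries, so any gap in the naive figures comes from visitors alone.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical numbers, invented purely for illustration.
RESIDENTS = 10_000               # permanent residents per postcode
RESIDENT_RATE = 0.05             # punnets per resident per day, same everywhere
VISITORS_PER_DAY = 40_000        # assumed tournament visitors to SW19 per match day
VISITOR_RATE = 0.3               # assumed punnets bought per visitor per day
MATCH_DAYS = set(range(24, 31))  # assumed match days falling within June

def daily_sales(postcode, day):
    """Simulated punnets sold in one postcode on one June day."""
    sales = RESIDENTS * RESIDENT_RATE * random.uniform(0.9, 1.1)
    if postcode == "SW19" and day in MATCH_DAYS:
        # Visitors add to sales without changing residents' habits.
        sales += VISITORS_PER_DAY * VISITOR_RATE * random.uniform(0.9, 1.1)
    return sales

postcodes = ["SW19", "SW18", "N1", "E8"]
june = range(1, 31)

# Generate one month of sales data per postcode.
sales = {p: {d: daily_sales(p, d) for d in june} for p in postcodes}

# Naive analysis: average daily sales over the whole of June.
naive = {p: mean([sales[p][d] for d in june]) for p in postcodes}

# Crude adjustment: average over non-match days only.
adjusted = {
    p: mean([sales[p][d] for d in june if d not in MATCH_DAYS])
    for p in postcodes
}

for p in postcodes:
    print(f"{p}: naive {naive[p]:8.1f}   adjusted {adjusted[p]:8.1f}")
```

In the naive column SW19 towers over everywhere else, while in the adjusted column it sits in the same range as the other postcodes. Dropping match days is only one way to adjust, and a real analysis would want something more careful, but it shows how a measurement choice can manufacture an apparent effect.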

Just a hypothetical example, of course, but hopefully one which makes the point that sometimes it’s necessary to dig a little deeper when you appear to have found a relationship. And also that choosing how you measure your outcome and exposure(s), and the time-frame in which you measure them, can affect the result you get. So how do you work out if a correlation is a causal relationship? That’s for next time…

