Chance, bias and infinite monkeys

The Infinite Monkey theorem, in which an endless typing pool of monkeys is hypothesised to one day produce Shakespearean output, is well known. Now an enthusiastic programmer has decided to exploit cloud computing capabilities to simulate millions of ‘virtual monkeys’ typing away randomly, to see how long it takes before those sonnets arise. Quite wisely, he’s opted for an iterative approach in which text strings matching Shakespeare are retained and built upon, rather than waiting for an e-Monkey to produce the Complete Works in one single output. This improves his chances of success considerably.

It’s not the first time an empirical approach has been attempted:

“…in 2003, Paignton Zoo carried out a practical test by putting a keyboard connected to a PC into the cage of six crested macaques. After a month the monkeys had produced five pages of the letter “S” and had broken the keyboard.”

That paragraph made me laugh, but it contains an important point. Real life monkeys don’t produce random text. Even if they got over their ‘S’ fixation, the Paignton Monkeys may well have gone on to produce text that followed patterns – they might have favoured the keys around the edges, or the ones with the most aesthetically pleasing outputs on the screen. Maybe they’d have a sensory preference for the ‘F’ and ‘J’ keys with their raised surfaces. To suggest that monkey keyboard outputs are truly random doesn’t give the primates enough cognitive credit.

It seems to me that the chances of infinite monkeys producing Shakespeare are lower than those of a truly random algorithmic approach, whether you allow for an iterative process or not. Once those monkeys start developing keyboard preferences, the output is not going to be random, which means that some combinations of text might never arise.
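To make that concrete, here is a rough Python sketch of the iterative approach, in which each keystroke that matches the next character of the target is kept. It is my own toy version and not the programmer's actual code. The key weights for the 'biased monkey' are invented purely for illustration: a heavy preference for 's' and a complete refusal to press 'b'.

```python
import random
import string

ALPHABET = string.ascii_lowercase + " "  # a simplified 27-key keyboard


def keystrokes_to_match(target, weights, rng, max_keystrokes=50_000):
    """Type one key at a time and keep any key that matches the next
    character of `target`. Return the number of keystrokes needed, or
    None if the target is not completed within `max_keystrokes`."""
    matched = 0
    for n in range(1, max_keystrokes + 1):
        key = rng.choices(ALPHABET, weights=weights, k=1)[0]
        if key == target[matched]:
            matched += 1
            if matched == len(target):
                return n
    return None


rng = random.Random(0)  # fixed seed so the run is repeatable
target = "to be or not to be"

# A truly random typist: every key is equally likely.
uniform = [1.0] * len(ALPHABET)

# A hypothetical monkey with preferences (made-up weights):
# it loves 's' and never presses 'b' at all.
biased = [0.0 if c == "b" else 20.0 if c == "s" else 1.0 for c in ALPHABET]

uniform_result = keystrokes_to_match(target, uniform, rng)
biased_result = keystrokes_to_match(target, biased, rng)

print("uniform typist:", uniform_result)
print("biased typist: ", biased_result)  # None: 'b' never appears
```

The uniform typist finishes the phrase well within the keystroke cap, because each character needs 27 attempts on average. The biased typist never gets past "to ", however long you let it run, because one of the required keys has zero probability.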

What on earth has any of this got to do with epidemiology?

A well-trained epidemiologist will question any apparent statistical association laid out before him/her. We’re conditioned to look for alternative explanations before believing that A is truly associated with B, let alone asserting that the relationship might be causal. The alternative explanations we usually explore are chance (the result is a random finding, possibly due to not having a large enough sample), bias (non-random error in the data collection or analysis) or confounding (the result is actually due to something else which we didn’t measure adequately).

This is also important when considering ‘missing data’. It’s rare to get a data set where every variable is completely filled in for every subject, but the occurrence of ‘missingness’ might have an underlying pattern. This can have implications for the analysis and interpretation of your findings.

Imagine you have a dataset of patients admitted to hospital accident and emergency departments in which staff have been asked to record each patient’s occupation in order to answer a research question. When you look at the dataset, occupation is missing in 34% of records. If it is ‘missing at random’ then you can proceed with your analysis, though you need to consider whether the number of complete records is now too small to answer your question. But what if the missingness is not random? What if the data is more likely to be missing in the most serious emergency cases, but more commonly present in the less serious? What if some staff are less likely to bother recording occupation if it turns out the patient is unemployed or retired? What if some hospitals are a lot worse at collecting data than others, and these hospitals are also dealing with very different patient populations? All of these things could introduce different biases into the analysis or interpretation, and give an inaccurate view of reality.
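A small simulation shows how much this can matter. The sketch below is illustrative only and every number in it is an assumption: a made-up population in which 30% of patients are unemployed or retired, and two ways of losing 34% of the occupation records. In the first, every record has the same chance of being missing. In the second, staff skip the field far more often for patients who are not working (62% versus 22%, chosen so the overall missingness still comes out at about 34%).

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

N = 100_000
TRUE_SHARE = 0.30  # assumed share of patients who are unemployed or retired

# True status for each simulated patient (True = not working).
not_working = [rng.random() < TRUE_SHARE for _ in range(N)]


def complete_case_summary(p_missing_if_working, p_missing_if_not_working):
    """Drop records at the given rates, then return
    (share not working among recorded cases, overall fraction missing)."""
    recorded = []
    for nw in not_working:
        p_missing = p_missing_if_not_working if nw else p_missing_if_working
        if rng.random() >= p_missing:
            recorded.append(nw)
    share = sum(recorded) / len(recorded)
    missing = 1 - len(recorded) / N
    return share, missing


# Scenario 1: missingness unrelated to occupation (34% for everyone).
random_share, random_missing = complete_case_summary(0.34, 0.34)

# Scenario 2: missingness depends on the value being recorded.
biased_share, biased_missing = complete_case_summary(0.22, 0.62)

print(f"true share not working:    {TRUE_SHARE:.3f}")
print(f"random missingness:        {random_share:.3f} ({random_missing:.0%} missing)")
print(f"non-random missingness:    {biased_share:.3f} ({biased_missing:.0%} missing)")
```

Both scenarios lose about a third of the records, so the headline '34% missing' looks identical. Under random missingness the complete records still give an estimate close to the true 30%. Under the non-random pattern the estimate drops to roughly 17%, which is what the arithmetic predicts: 0.30 × 0.38 / (0.30 × 0.38 + 0.70 × 0.78) ≈ 0.17.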

Whether it’s infinite monkeys vs algorithms, or missing data in your dataset, it is crucial to distinguish between what happens by chance and what occurs due to preference or bias. Failure to do so might scupper your chances of finding a causal relationship – or replicating classic literature before your hard drive fails. Perhaps it’s time to find a better analogy for randomness.
