What, will these data ne’er be clean?

I’ve had a bit of a blogging hiatus due to PhD commitments and winter hibernation, but hopefully now I’m back. One of the things I’ve been pre-occupied with lately is ‘data cleaning’ . If you know what that is, then you’ll be nodding sagely in understanding at my absence. If you don’t, let me enlighten you.

Like many people in my field, I cut my statistical analysis teeth on nice polished datasets containing no more than a handful of variables and a few hundred records, where one could feel smug after inspecting each record by eye and spotting the ‘deliberate mistakes’ such as implausible ages or height, or the occasional missing data point.

These days I work with very big datasets: typically hundreds of thousands of patient records, each containing up to a couple of hundred pieces of information. Eyeballing these records for anomalies is obviously not practical or efficient. Unfortunately a lot of standard epidemiology training doesn’t actually teach you how to manage this on a large scale. Firstly it relies on being able to code pretty well in your programme of choice. Secondly it requires you to ask the right questions of your dataset.

Nobody’s ever given me a checklist or guidance on this so I’ve had to be creative and figure it out as I go along. The following bits of advice are drawn from my own and others experiences of looking at large clinical or demographic databases, from developed and developing countries. The overall message is that no database is safe, even if someone tells you it’s ‘pretty clean’.

Check for zombies

If there are a number of date variables in your dataset, check whether events happened in a plausible order. If the patient is listed has having died 15 years before their operation date, something has gone wrong in the coding. If you find a patient aged over 130, either exclude them from your analysis or call Guinness.

Set the a-gender

Men don’t usually become pregnant or develop gynaecological cancers. Women don’t tend to have testicular problems. If your research question relates to a gender-specific condition, do have a quick check to make sure everything adds up.

Ditto for questions where you are looking at repeat admissions. While some people do of course undergo gender reassignment surgery, the frequency of that is probably a lot lower than some databases suggest. And it’s rare that people yo-yo back and forth between sexes – far more likely a coding issue. “Gender=unknown” is probably less likely to indicate someone transitioning and more likely to indicate that the data inputter never met the patient and can’t figure out their sex from their name. Don’t read too much into it.

When is nothing really something?

Fields where you expect numerical inputs are not free from the weird stuff either. Data inputters have a variety of ways of using and abusing integers. If vast swathes of your cohort appear to have a zero reading for their blood pressure, it may just mean that coders have decided that values below a certain risk threshold are easier to code as ‘0’ meaning ‘normal’ rather than the actual value. This is why if there is ANY accompanying documentation with the dataset, it’s a good idea to read the fine print.

You might also find that a lot of the numbers end in 0 or 5. Despite the freedom to enter exact values, and the capacity for some medical instruments to give readings to one decimal place, some of that detail gets lost in translation and people end up rounding up or down (not necessarily correctly). So just because a variable should be continuous, doesn’t mean you are safe to proceed in treating it as such in your analysis without checking the distribution of values first.

The long and the short and the tall

Have variables for height and weight? Then calculate BMI, even if you don’t need it for your analysis. If you find patients with BMI values of over 150, then what is more likely: that you have patients who are very short and very wide, or someone has transposed the numbers?

When your dataset is not all there

When there are various stages and different people involved in the flow of data from the patient to your desk, things can get lost along the way. If you work with data that is periodically updated, don’t assume that everything will be stable from one version to the next. In one dataset I work with, as I moved from an old version to an update, the number of patients I was excluding from the analysis increased four-fold. Turns out that the data extractors had somehow introduced an arbitrary cut-off in the length of some text strings and as a consequence I accidentally misclassified 60,000 patients. Took a while to figure that one out. And I only noticed because I’d kept detailed records of the number of patients excluded at different stages of the cleaning, so that was an important lesson learned.

Curb freedom of expression

Many people like free text and argue that subtle detail about cases can be lost about cases if data entry is limited to numbers and categorical options from drop down menus. I’d agree to this point to an extent – nobody wants important details to be left out. The trouble is that if a piece of information is simple and can easily be categorised, then data which has been aggregated from several different data entry interfaces, some of which  allow free text, can be unnecessarily difficult to handle. For example, in one dataset I work with, a procedure can be either ‘elective’ (planned), ‘urgent’, or ’emergency’. Three options (four if you allow ‘unknown’). Unfortunately when I look at the data, I see:

1. Elective

1. Routine

1. elective

1.00. Elective

2. Urgent

2. urgent

2.00. Urgent

3. Emergency

3. emergency

3.00. Emergency

Elective (Routine admission)

Emergency (Unscheduled patient)

Something like that takes me several lines of code and much muttering under my breath to resolve. Repeat across 100+ variables and you will understand why Edison’s’ perspiration vs inspiration ratio  maps well to data processing vs analysis.

Over to you. What ‘surprises’ have you found in your datasets?*

*Obviously don’t share anything that compromises data confidentiality



  1. As for height and weight, I’ve found plenty of giants and dwarves – lots of this seem to be due to the data enterer being a bit confused about (centi)meters/feet and inches or (kilo)grams/stones. As you mentioned, calculating BMIs is a neat way of finding the outliers. Another method we used was comparing our extreme heights and weights with those found in surveys like the Health Survey for England, and if using longitudinal data, check if there’s consist data within one patient over time (details in this paper written by another PhD student from my department: http://www.ncbi.nlm.nih.gov/pubmed/22052713).

    Another oddity that also pops up in ‘my’ data are Disney figures. As lots of different people enter data, there are a few fake patients that are used to practice data coding. These tend to be named after Disney figures like Mickey Mouse and Pinocchio and lead pretty miserable lives. I believe Pinocchio in particular has had a few close calls death.

  2. I’ve never come across the tragic cartoon characters, perhaps because I never get to see patient names. Perhaps I should ask someone further upstream to make sure that the particularly extreme cases aren’t actually data entry dummy runs!

  3. […] the really nitty-gritty details came out, imparted by people who were clearly up to their elbows in data cleaning and validation on a regular basis. That’s not to say that broad overviews of bigger areas […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: