Hospitals and other medical services record a lot of data on their patients – contact details, demographics, treatments and outcomes. A great deal of detail is also recorded about illnesses and injuries, and it helps if this is done in a systematic and consistent fashion. In the UK, ‘Read Codes’ do the job, but another worldwide standard is the International Statistical Classification of Diseases and Related Health Problems, or ICD for short.
The latest version, ICD-10, has undergone considerable expansion in the US of late. Actually, that’s an understatement for an eight-fold increase: the original 18,000 codes have grown to 140,000. The motivation is to provide a finer degree of detail about the health issues and events experienced by patients. As an outcomes researcher, I welcome the ability to collect detailed information on whether a patient’s diabetes has complications, or whether or not that heart defect is congenital. As a rational person, however, I’m not sure how much value there is in recording whether that orca bite was sustained on the ‘first’ or ‘subsequent encounter’, or whether it was a volleyball or basketball that did the striking. On the up side, I can see that page forming the basis of a very nerdy party game, if anyone wants to join me.
Fine detail alone is only of so much use – we also need to know how accurately things are measured. It is often argued that if continuous data (e.g. height in cm, age in years) are available, we should use them in their raw form during analysis, rather than categorising them, so that detail isn’t lost. That’s all very well in principle, but it depends how confident we are in the accuracy of that continuous information. Most clinical researchers would concede that it is more informative to report actual blood pressure measurements than simply to categorise people as ‘hypertensive’ or ‘normal’. But if you look at a real-life clinical dataset, you will find that the allegedly ‘continuous’ recording of blood pressure contains a lot of numbers ending in 0 or 5, which suggests that the people collecting the data are prone to rounding up or down. So if we analyse blood pressure as a continuous variable, we’re implying a degree of accuracy in the measurement that just isn’t there. In such cases, it would be better to group the values into categories.
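That rounding habit is easy to check for. Here is a minimal sketch of a terminal-digit preference check – the readings, the function name and the 20% baseline are my own illustrative assumptions, not taken from any real dataset: if measurements were genuinely continuous, we would expect roughly two in ten readings to end in 0 or 5, so a much larger share is a hint of rounding by the recorder.

```python
from collections import Counter

def terminal_digit_share(values, digits=(0, 5)):
    """Fraction of readings whose last digit falls in `digits`."""
    last = [int(v) % 10 for v in values]
    counts = Counter(last)
    return sum(counts[d] for d in digits) / len(last)

# Hypothetical systolic blood pressure readings (mmHg).
sbp = [120, 135, 140, 118, 125, 130, 150, 145, 122, 140, 135, 160]

share = terminal_digit_share(sbp)
# With genuine continuous measurement, ~20% of readings should end in 0 or 5.
print(f"{share:.0%} of readings end in 0 or 5")  # prints "83% of readings end in 0 or 5"
```

A share like that, far above the ~20% expected by chance, is the kind of evidence that should make you think twice before treating the variable as truly continuous.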
Some detail is important. I can see why researchers studying zoonoses would want to know whether an infection arose after contact with a duck rather than a goose (assuming both patient and clinician have the knowledge to differentiate between the two). But who exactly cares whether a toe-stubbing incident occurred in the bathroom or the bedroom – or in the ‘Garden or yard of other non-institutional residence’? Even allowing for the fact that some of these data are of more interest to actuaries than to clinicians or health researchers, it’s hard to imagine how useful such fine-grained information will be. If nothing else, some of those sub-sub-sub-groups are going to be very small.
When designing any data collection process, the balance between too much information and not enough is really important. Ask yourself:
- Do I really care whether this item is a subset of a subset of a subset?
- Would I miss this detail if it were gone?
- Will I regret at a later date not having obtained this detail?
- Will this extra information help me better understand a basic process/make a clinical decision/develop a policy?
- Does the recording of finer detail reflect the accuracy of data collection possible in real world settings?
If the answer to any of those is no, then maybe it’s better to save the time and the terabytes.