To Impute or Not to Impute: Working with Incomplete Data

July 31, 2020
Dray McFarlane

That is the question. Or, if not the question, at least an important one! Especially when paired with its follow-up: how should we impute? But before we go down that path and say the word impute another dozen times, maybe I should explain what it means in this context!

Basically, imputation is part of the answer to how we handle incomplete data. For any number of reasons, when we get a set of data that we want to analyze, there are going to be some gaps: surveys or forms left incomplete, fields that don't apply to everyone, events that haven't occurred yet, or many other explanations. Regardless of the reason, our models aren't going to handle blanks very well.

So what are our options? Do we just trash that field and move on to ones that are complete? Even assuming any data set will ever be complete - which might be a little optimistic - that can still leave you ignoring some really good, predictive information. If only a few records out of a large data set weren't filled in, we don't want to get rid of that element entirely. Do we trash just those records? Maybe! If it's a small number of records that wouldn't impact the model much, that's a reasonable option when you're more interested in aggregate results. But if you're attempting to predict behavior at an individual level, it would be less than ideal to just shrug your shoulders any time someone left off their birthday.
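To make those two "trash it" options concrete, here's a minimal sketch in pandas; the column names and toy values are made up for illustration:

```python
import pandas as pd

# Toy data with gaps; the column names are hypothetical.
df = pd.DataFrame({
    "age": [34, None, 52, 41],
    "satisfaction": [8, 9, None, 7],
})

df_no_field = df.drop(columns=["satisfaction"])  # trash the whole field
df_no_records = df.dropna()                      # trash just the incomplete records
```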

What else can we do if getting rid of it doesn't make sense? Here's where we get to imputation. Basically, we're going to populate those blank values with something so we can still include the record and the feature in our model.

So that's the first question - to impute or not to impute? If dropping the missing data makes sense in the context of what we're trying to achieve, maybe we don't need to impute; but if that means losing a significant amount of valuable data or risking not being able to produce any results at the level we're aiming for, let's impute.

And that takes us to the follow-up question: how do we actually fill in those blank values? What is going to be helpful in maintaining the integrity of the model and keeping that feature valuable? Well, context matters. Depending on the data, we get to choose between several different methods here.

We can use some numeric techniques - fill in blanks with the mean or median of our populated data - which work well if you can assume people generally fall within a range. This works pretty well for something like a level of satisfaction measured on a scale like 1-10: if you don't have an answer, you end up assuming something fairly neutral.
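As a quick sketch of that idea in pandas (the satisfaction column is made up for illustration), median imputation can be a one-liner:

```python
import pandas as pd

# Hypothetical 1-10 satisfaction scores with some blanks.
df = pd.DataFrame({"satisfaction": [7.0, 3.0, None, 9.0, None, 5.0]})

# Fill the blanks with the median of the populated values - a
# "fairly neutral" assumption for anyone who didn't answer.
df["satisfaction"] = df["satisfaction"].fillna(df["satisfaction"].median())
```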

What if we're looking at something like days since the last time a person attended a meeting? A mean or median value there could grossly misrepresent reality. A blank there might intentionally mean that someone never attended a meeting, and that has drastically different predictive value than slotting in some number among the people who have attended. In this case, we might use a constant value instead. Something very large could work here and still allow the feature to be useful in our model and results.
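A rough sketch of the constant-value approach, again in pandas; the column name and the 9999 sentinel are arbitrary choices for the example, not recommendations:

```python
import pandas as pd

# Hypothetical "days since last meeting" with blanks for people
# who never attended.
df = pd.DataFrame({"days_since_meeting": [12.0, None, 45.0, None, 3.0]})

# A large constant keeps "never attended" clearly separate from
# everyone who has a real value.
df["days_since_meeting"] = df["days_since_meeting"].fillna(9999)
```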

A constant value is a pretty easy fallback, but it requires you to pick the value yourself. A variation that splits the difference between the mean, median, and constant approaches is to use the most frequent value. Rather than something in the middle (mean or median) or something you picked (constant), this still lets the data inform the choice, on the assumption that records generally share the most common preference if not otherwise stated.
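A minimal sketch of most-frequent imputation, assuming a made-up categorical column:

```python
import pandas as pd

# Hypothetical contact-preference field with blanks.
df = pd.DataFrame({"contact_pref": ["email", "phone", None, "email", None]})

# mode() lists the most common value(s); take the first one.
most_common = df["contact_pref"].mode().iloc[0]
df["contact_pref"] = df["contact_pref"].fillna(most_common)
```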

It feels like each of these methods requires a lot of context to make the right choice - and that's absolutely true. But there's good news here: with current tools we can pretty much try all of them! You need to understand the pitfalls so you don't produce fake results (just populating blanks with values you think will lead to something interesting is frowned upon), but once you're operating within good statistical practice, you can set up and train your models with each approach and compare to see what gives the best results.
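Here's one way that try-them-all comparison might look, sketched with scikit-learn's SimpleImputer in a pipeline. The data here is randomly generated just to make the example self-contained; your own features, labels, and model would take its place:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Stand-in data: 200 rows, 5 features, ~10% of values knocked out.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan
y = (rng.random(200) > 0.5).astype(int)

# Train the same model with each imputation strategy and compare.
for strategy in ["mean", "median", "most_frequent", "constant"]:
    model = make_pipeline(SimpleImputer(strategy=strategy),
                          LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{strategy}: {scores.mean():.3f}")
```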

Once you have these basics spun up, you can even start playing with more complex approaches that add new fields to identify when imputation has occurred. Maybe a very large value for days since last meeting on its own was ruining the predictive capability of that field - but what about combining it with another field that is set to 1 whenever the first was blank? That allows the model to be even smarter, since it now knows when you were correcting for missing data.
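scikit-learn's SimpleImputer supports exactly this pattern through its add_indicator option; here's a minimal sketch, reusing the made-up meetings column from earlier:

```python
import numpy as np
from sklearn.impute import SimpleImputer

days = np.array([[12.0], [np.nan], [45.0], [np.nan], [3.0]])

# add_indicator=True appends a 0/1 column flagging imputed rows, so
# the model can tell a real value apart from a filled-in 9999.
imputer = SimpleImputer(strategy="constant", fill_value=9999, add_indicator=True)
print(imputer.fit_transform(days))
```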

As always, there is a bit of an art to this process and experience is very valuable. Working with similar data structures over and over will allow you to get to the most effective way to handle missing data quickly rather than having to go through all of this every time.

