Dealing with missing data is a key issue for reliable predictive analytics. You may have heard of data prediction or missing data imputation, but how exactly does data science handle missing or ambiguous information? Let’s take a look at these fundamental questions.
Why use data prediction?
A simple case in point: a CSR director in charge of tracking energy consumption at 50 sites can’t afford to just abandon analysis where data is missing. The same applies when managers of a property portfolio have consumption figures for common areas but not private spaces. Data prediction is a kind of compromise, employed to avoid massive losses of information.
Imputing missing data provides a more reliable macro view and more accurate predictions. Many algorithms don’t work if data is missing, so without imputation you face a trade-off: discard the incomplete records, or discard the variables that contain gaps.
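To get a feel for how much information is lost when incomplete data is simply discarded, here is a minimal sketch in plain Python. The site names, variables and figures are invented for illustration, and `complete_rows` is a hypothetical helper, not part of any library.

```python
# Hypothetical monthly figures for four sites; None marks a missing value.
rows = [
    {"site": "A", "electricity": 1200, "gas": 300, "water": 45},
    {"site": "B", "electricity": 950, "gas": None, "water": 40},
    {"site": "C", "electricity": None, "gas": 280, "water": 38},
    {"site": "D", "electricity": 1100, "gas": None, "water": 52},
]


def complete_rows(rows, variables):
    """Keep only the rows that have a value for every listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


# Option 1: keep every variable and drop the incomplete sites.
kept_all_variables = complete_rows(rows, ["electricity", "gas", "water"])

# Option 2: drop the variable with the most gaps (gas) and keep more sites.
kept_without_gas = complete_rows(rows, ["electricity", "water"])

print(len(kept_all_variables), len(kept_without_gas))  # 1 3
```

Either way something is lost: three of the four sites in the first case, an entire variable in the second. Imputation is an attempt to avoid both.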
Did you say prediction? For what cases?
Simply put, data prediction is about filling in gaps in the information stored within a database. Several scenarios may require the imputation of missing data:
- The data exists, but we don’t have it: if I want an annual analysis of electricity consumption before my December bill arrives, for example, I might temporarily enter a provisional value until I get the missing figures.
- The absence of data is ambiguous: in answer to the question “Is equipment present at the site?”, an empty box can be interpreted in several ways. It might mean that the site doesn’t have the equipment, or that the information isn’t known. Missing data is problematic because it’s open to multiple interpretations.
- Data is missing for a perfectly logical reason – in cases where questions overlap, for example: if a building doesn’t have air conditioning, there will of course be no answer to the question “What type of air conditioning system do you have?”.
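These scenarios can be kept apart in the data by recording why a value is blank rather than leaving an undifferentiated empty box. The sketch below is one possible way to do it; the `Missing` labels, the field names `has_ac` and `ac_type`, and both functions are assumptions made for illustration.

```python
from enum import Enum


class Missing(Enum):
    """Hypothetical labels for the reason behind a blank value."""
    PENDING = "exists but not received yet"
    UNKNOWN = "ambiguous or not known"
    NOT_APPLICABLE = "question does not apply"


def classify_bill_gap(amount, bill_received):
    """Explain a blank consumption figure, or return None if it is filled in."""
    if amount is not None:
        return None
    # No bill yet: the figure exists somewhere, we just don't have it.
    return Missing.UNKNOWN if bill_received else Missing.PENDING


def classify_ac_gap(site):
    """Explain a blank 'ac_type' answer, or return None if it is filled in."""
    if site.get("ac_type") is not None:
        return None
    if site.get("has_ac") is False:
        # Overlapping questions: no air conditioning, so no system type.
        return Missing.NOT_APPLICABLE
    # 'has_ac' is True or itself blank: the empty box is ambiguous.
    return Missing.UNKNOWN


print(classify_bill_gap(None, bill_received=False))
print(classify_ac_gap({"has_ac": False, "ac_type": None}))
print(classify_ac_gap({"has_ac": None, "ac_type": None}))
```

Only the values labelled as pending or unknown are candidates for imputation; a value that is not applicable is blank for a good reason and should stay that way.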
Primary data prediction methods
There are two main methods for imputing missing data:
- A system of logical rules can be used to fill in gaps or resolve ambiguities, by looking for information elsewhere (e.g. in previous consumption records) or by identifying the reason for a blank answer. For example, we could assume that if I have no information on the type of heating, it’s because my heating is electric.
- More complex machine learning algorithms can infer the missing value from other known, similar situations. If other sites of comparable size and location have a boiler, the value “boiler” could be imputed to the “heating type” field.
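The two approaches can be sketched side by side in a few lines of Python. The site records below are invented, and the second function is a deliberately simple nearest-neighbour vote standing in for the "more complex machine learning algorithms" mentioned above, not a description of any particular product or library.

```python
from collections import Counter

# Hypothetical site records; None marks a missing heating type.
sites = [
    {"name": "A", "area_m2": 1200, "region": "north", "heating_type": "boiler"},
    {"name": "B", "area_m2": 1100, "region": "north", "heating_type": "boiler"},
    {"name": "C", "area_m2": 300, "region": "south", "heating_type": "electric"},
    {"name": "D", "area_m2": 1150, "region": "north", "heating_type": None},
]


def impute_by_rule(site, default="electric"):
    """Logical rule from the example: a blank heating type is assumed electric."""
    if site["heating_type"] is not None:
        return site["heating_type"]
    return default


def impute_from_similar(site, all_sites, k=2):
    """Majority vote among the k sites in the same region with the closest area."""
    candidates = [
        s for s in all_sites
        if s["heating_type"] is not None and s["region"] == site["region"]
    ]
    if not candidates:
        return None  # nothing comparable, so leave the gap as it is
    nearest = sorted(
        candidates, key=lambda s: abs(s["area_m2"] - site["area_m2"])
    )[:k]
    votes = Counter(s["heating_type"] for s in nearest)
    return votes.most_common(1)[0][0]


target = sites[3]
print(impute_by_rule(target))              # electric
print(impute_from_similar(target, sites))  # boiler
```

The two methods disagree for site D, which shows why the choice of method matters: the rule encodes an assumption about why the field is blank, while the similarity-based approach relies on what comparable sites look like.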
When missing data is imputed, the choice of a data prediction method will depend on the context, logic, data type, etc.
If not addressed, missing data poses a problem for data analytics. It can distort the overall picture, or even prevent prediction algorithms from working properly. Data prediction lets you draw conclusions based on sound data analysis, providing a more realistic view and helping identify the right decisions for your energy management.