3.4 Processing
3.4.4 Imputation

Editing does little to improve the survey results if no corrective action is taken when items fail the rules set out during the editing process. When all of the data have been edited using those rules and a file is found to have missing data, imputation is usually done as a separate step.

Missing or invalid values affect the quality of the survey results. Imputation is the process used to assign replacement values for missing, invalid or inconsistent values that have failed edits. It takes place after follow-up with respondents (where possible) and after manual review and correction of questionnaires (where applicable). At this stage, all kinds of errors are corrected, including errors made by respondents and errors that occurred during coding and data capture.

Imputation procedures are designed to fill in the gaps. When errors are detected, invalid, missing or incomplete entries are replaced with appropriate imputed values, and answers are provided for questions that received no response. In general, changes are made to the minimum number of fields needed for the completed record to pass all of the edits. This procedure is best carried out by those with full access to the microdata and good auxiliary information.

Although imputation can improve the quality of the final data, care must be taken to choose an appropriate imputation methodology. Some methods of imputation do not preserve the relationship between variables. In fact, some can actually distort the underlying distributions.
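
The distortion can be seen in a small numerical sketch. The example below is illustrative only: the income values are invented, and it uses the simplest possible approach (filling every gap with the mean of the observed values, a method described later in this section) to show how the spread of a variable shrinks after imputation.

```python
import statistics

# Hypothetical reported incomes; None marks a value that is missing
# or has failed edits. These figures are invented for illustration.
reported = [30_000, 42_000, None, 55_000, None, 61_000, 38_000, None]

observed = [v for v in reported if v is not None]
mean_observed = statistics.mean(observed)

# Fill every gap with the same value: the mean of the observed responses.
mean_filled = [mean_observed if v is None else v for v in reported]

# Sample variance before and after filling the gaps.
var_observed = statistics.variance(observed)
var_filled = statistics.variance(mean_filled)

print(f"variance of observed values: {var_observed:,.0f}")
print(f"variance after mean filling: {var_filled:,.0f}")
```

The filled-in values add nothing to the sum of squared deviations but do increase the count, so the variance computed on the completed file is smaller than the variance of the values that were actually reported.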

Some commonly used imputation methods include:

  • Deductive imputation is usually the first method used. It applies when a value can be deduced with certainty, that is, when there is only one possible response to the question (e.g. all the values are given but the total or subtotal is missing). It can be carried out during the collection, capture, editing, or later stages of data processing.
  • Hot-deck imputation uses the answers from another record of the same survey, referred to as the donor, to answer the question (or set of questions) that needs imputation. The donor can be randomly selected from a pool of donors with the same set of predetermined characteristics. For example, if a questionnaire has been returned with the yearly income missing, then we could define the donor pool as records with the same province, same occupation and same amount of experience as the respondent whose record requires imputation. A list of possible donors matching these criteria is created and one of them is randomly selected. Once a donor is found, the donor response (in this case, the yearly income) replaces the missing or invalid response.
  • Cold-deck imputation is similar to hot-deck imputation. The difference is that hot-deck imputation uses donors from the same survey, while cold-deck imputation uses donors from another source, such as historical data from an earlier iteration of the same survey or from administrative data.
  • Mean value imputation replaces the missing or inconsistent value with the mean value calculated from the responding units with the same set of predetermined characteristics. For example, if a record is missing an individual’s total yearly income, then we could impute the observed average income in that individual’s province for the same occupation and the same level of experience as the respondent. One drawback of mean imputation is that it distorts the distribution and the relationships between variables by creating an artificial spike at the group mean. This artificially lowers the estimated sampling variance of the final estimates if conventional formulas for the sampling variance are used.
  • Nearest neighbor imputation is another type of donor imputation. In this case, a distance criterion based on the predetermined characteristics must be developed to determine which responding unit is most similar to the unit with the missing value. The closest responding unit is then used as the donor.
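
The methods listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any production system: the records, the field names, the class variables in CLASS_KEYS and all function names are hypothetical, and the nearest neighbor distance (absolute difference in years of experience within the same province and occupation) is just one possible choice of criterion.

```python
import random
from statistics import mean

# Hypothetical records; income None means the value is missing or failed edits.
records = [
    {"id": 1, "province": "ON", "occupation": "nurse", "experience": 5, "income": 70_000},
    {"id": 2, "province": "ON", "occupation": "nurse", "experience": 5, "income": 74_000},
    {"id": 3, "province": "ON", "occupation": "nurse", "experience": 5, "income": None},
    {"id": 4, "province": "QC", "occupation": "welder", "experience": 12, "income": 66_000},
    {"id": 5, "province": "QC", "occupation": "welder", "experience": 9, "income": None},
]

# Assumed set of predetermined characteristics that define an imputation class.
CLASS_KEYS = ("province", "occupation", "experience")


def deduce_total(components, total):
    """Deductive imputation: a missing total is fully determined by its parts."""
    return sum(components) if total is None else total


def donor_pool(recipient, records, keys=CLASS_KEYS):
    """Respondents with a reported income that match the recipient on every key."""
    return [
        r for r in records
        if r["income"] is not None and all(r[k] == recipient[k] for k in keys)
    ]


def hot_deck(recipient, records, rng):
    """Copy the income of a randomly selected donor from the same class."""
    pool = donor_pool(recipient, records)
    return rng.choice(pool)["income"] if pool else None


def class_mean(recipient, records):
    """Mean value imputation: average income of respondents in the same class."""
    pool = donor_pool(recipient, records)
    return mean(r["income"] for r in pool) if pool else None


def nearest_neighbor(recipient, records):
    """Donor is the respondent in the same province and occupation whose
    experience is closest to the recipient's (an assumed distance measure)."""
    pool = donor_pool(recipient, records, keys=("province", "occupation"))
    if not pool:
        return None
    donor = min(pool, key=lambda r: abs(r["experience"] - recipient["experience"]))
    return donor["income"]


rng = random.Random(0)  # seeded only so that the sketch is repeatable
print(deduce_total([10, 20, 30], None))       # total rebuilt from its parts
print(hot_deck(records[2], records, rng))     # income of record 1 or record 2
print(class_mean(records[2], records))        # average of records 1 and 2
print(hot_deck(records[4], records, rng))     # None: no donor with the same class
print(nearest_neighbor(records[4], records))  # falls back to the closest welder
```

Record 5 shows why methods are often combined: no respondent shares its exact class, so the hot-deck pool is empty, while a nearest neighbor search with a looser match still finds a donor.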

There are other, more sophisticated imputation methods that use statistical modelling to assign an appropriate replacement value.
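
The text does not name a specific model-based method. As one simple illustration, the sketch below uses regression imputation: a straight line is fitted by least squares to the complete records and the fitted value replaces the missing one. The data, the choice of years of experience as the only predictor, and all variable names are assumptions made for the example.

```python
from statistics import mean

# Hypothetical (years of experience, yearly income) pairs;
# None marks a missing income. Values are invented for illustration.
data = [(2, 40_000), (4, 50_000), (6, 60_000), (8, None), (10, 80_000)]

# Fit the line only on records where income was reported.
complete = [(x, y) for x, y in data if y is not None]
x_bar = mean(x for x, _ in complete)
y_bar = mean(y for _, y in complete)

# Ordinary least squares estimates for income = intercept + slope * experience.
slope = (
    sum((x - x_bar) * (y - y_bar) for x, y in complete)
    / sum((x - x_bar) ** 2 for x, _ in complete)
)
intercept = y_bar - slope * x_bar

# Replace each missing income with the value predicted by the fitted line.
imputed = [
    (x, y if y is not None else intercept + slope * x)
    for x, y in data
]
print(imputed)
```

Because every imputed value falls exactly on the fitted line, this deterministic form of regression imputation, like mean imputation, tends to understate the variability of the completed data.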

The method of imputation can vary from survey to survey and even, depending on particular circumstances, within the same survey. Quite often, different methods are combined to provide the most suitable value for a variable. These methods can be applied either manually or with an automated system. To facilitate this, Statistics Canada has developed a generalized imputation system that imputes data based on the methodological input of experienced statisticians who have analyzed the survey and suggested the best approaches to impute meaningful data.

One risk with imputation is that it can destroy reported data to create records that fit preconceived models that may later turn out to be incorrect. The suitability of an imputation method therefore depends on the survey, its objectives, the available auxiliary information and the nature of the error.

In addition, all of these imputation methods can be applied to other sources of data, not only to survey data. For example, Statistics Canada receives and uses financial data from the Canada Revenue Agency to reduce response burden, and these administrative data often have missing or inconsistent values. To make good use of them, rigorous editing and imputation systems have been put in place to improve the data quality before the data move to the next step.

Note also that in the case of total nonresponse, when very little or no data have been collected for a record or unit, a common approach is to perform a nonresponse weight adjustment, which will be discussed in detail later in the section on estimation.

