Imputation is the process used to assign replacement values for missing, invalid or inconsistent data that have failed edits. This occurs after following up with respondents (if possible) and manual review and correction of questionnaires (if applicable). Imputation is typically used to treat item nonresponse and, occasionally, unit nonresponse. Unit nonresponse occurs when no usable information is collected for a given record while item nonresponse occurs when some but not all the desired information is collected. After imputation, the survey data file should normally only contain plausible and internally consistent data records that can then be used for estimation of the population quantities of interest.
Under the Fellegi-Holt principle (Fellegi and Holt, 1976), the fields to be imputed are determined by making changes to the minimum number of responded values so as to ensure that the completed record passes all of the edits. The determination of the fields to be imputed can be done before imputation or simultaneously with imputation.
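The minimum-change search can be illustrated with a small sketch. The record fields, domains and edit rules below are purely hypothetical toy examples (not from any Statistics Canada system), and production systems such as CANCEIS use far more efficient algorithms than this brute-force enumeration; the sketch only shows the principle of changing the fewest responded values needed to pass all edits.

```python
from itertools import combinations, product

# Hypothetical toy record: three categorical fields with small domains.
domains = {
    "age_grp": ["child", "adult"],
    "marital": ["single", "married"],
    "employed": ["yes", "no"],
}

# Hypothetical edits: a record fails if any rule returns False.
edits = [
    lambda r: not (r["age_grp"] == "child" and r["marital"] == "married"),
    lambda r: not (r["age_grp"] == "child" and r["employed"] == "yes"),
]

def passes(record):
    return all(edit(record) for edit in edits)

def fields_to_impute(record):
    """Return a smallest set of fields that can be changed so the
    record passes all edits (Fellegi-Holt minimum-change principle)."""
    names = list(record)
    for size in range(len(names) + 1):           # try the smallest sets first
        for subset in combinations(names, size):
            for values in product(*(domains[f] for f in subset)):
                candidate = dict(record, **dict(zip(subset, values)))
                if passes(candidate):
                    return set(subset)
    return set(names)

failed = {"age_grp": "child", "marital": "married", "employed": "yes"}
print(fields_to_impute(failed))  # changing age_grp alone satisfies both edits
```

Here both edit failures share the field `age_grp`, so the minimum-change solution imputes that single field rather than the two fields `marital` and `employed`.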
Imputation is performed by those with full access to the microdata, who are thus in possession of auxiliary information known both for units with fields to be imputed and for units without them. Auxiliary information can be used to predict missing values using a regression model, to find "close" donor records from which to impute recipients, or to build imputation classes (e.g., Haziza and Beaumont, 2007). It can also be used directly as substitute values for the unknown missing values.
The basic principle underlying imputation is to use available auxiliary information in order to approximate as accurately as possible the unknown missing values and thus to produce quality estimates of population characteristics. Therefore, the application of this principle should normally lead to a reduction in both the bias and variance caused by not having observed all the desired values.
Good imputation processes are automated, objective and reproducible, make an efficient use of the available auxiliary information, have an audit trail for evaluation purposes and ensure that imputed records are internally consistent.
The choice of auxiliary variables used for imputation, also called matching variables for donor imputation, should mainly be based on the strength of their association with the variables to be imputed. To this end, consider using modelling techniques and consult subject matter experts to obtain information about variables.
Consider using different sources of data (e.g. current survey data, historical data, administrative data, paradata, etc.) for variables that can be used as auxiliary variables for imputing the missing values. Study the quality and appropriateness of these available variables to determine which ones to ultimately use as auxiliary variables.
Evaluate the type of nonresponse. That is, try to determine which auxiliary variables can explain the nonresponse mechanism(s) in order to use them to enrich the imputation method. Include such auxiliary variables in the imputation method, especially if they are also associated with the variables to be imputed.
Take into account the type of characteristics to be estimated (such as level vs. change, high-level aggregates vs. small domains, and cross-sectional vs. longitudinal) when choosing auxiliary variables and developing an imputation strategy so as to preserve relationships of interest; e.g. use historical auxiliary information if you are interested in changes or use domain information (if available) if you are interested in domain estimation.
Imputation methods and their implementation
Imputation methods can be classified as either deterministic or stochastic, depending on whether or not there is some degree of randomness in the imputation process (Kalton and Kasprzyk, 1986; Kovar and Whitridge, 1995). Deterministic imputation methods include logical imputation, historical (e.g. carry-forward) imputation, mean imputation, ratio and regression imputation and nearest-neighbour imputation. These methods can be further divided into methods that rely solely on deducing the imputed value from data available for the nonrespondent and other auxiliary data (logical and historical) and those that make use of the observed data from other responding units for the given survey. Observed data from responding units can be used directly by transferring data from a chosen donor record or by means of explicit parametric models (ratio and regression). Stochastic imputation methods include the random hot deck, nearest neighbour imputation where a random selection is made from several "closest" nearest neighbours, regression with random residuals, and any other deterministic method with random residuals added.
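The deterministic/stochastic distinction can be made concrete with two of the methods named above, applied to hypothetical toy data: ratio imputation (deterministic, using an explicit model through an observed auxiliary variable) and the random hot deck (stochastic, transferring a randomly chosen respondent's value).

```python
import random

# Toy data: y has missing values (None); x is a fully observed auxiliary variable.
x = [10, 20, 30, 40, 50]
y = [2.1, None, 6.0, None, 9.8]

resp = [i for i, v in enumerate(y) if v is not None]  # responding units

# Deterministic ratio imputation: impute y_i = R * x_i, where
# R = (sum of observed y) / (sum of the corresponding x).
R = sum(y[i] for i in resp) / sum(x[i] for i in resp)
y_ratio = [v if v is not None else R * x[i] for i, v in enumerate(y)]

# Stochastic random hot deck: replace each missing value with the observed
# value of a randomly selected respondent (the donor).
rng = random.Random(1)
y_hotdeck = [v if v is not None else y[rng.choice(resp)] for v in y]
```

Running the deterministic method twice gives identical imputed values, whereas the hot deck generally does not; that extra randomness is what distinguishes the two classes and what the "random residuals" variants add to otherwise deterministic methods.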
A serious modelling effort should normally be done to choose appropriate auxiliary variables and an appropriate imputation model. (An imputation model is a set of assumptions about the variables requiring imputation.) Once such a model has been found, the imputation strategy should be determined as much as possible in agreement with this model. This should help in controlling the nonresponse bias and variance and may be needed for proper variance estimation.
Try to ensure that the imputed record is internally consistent while resembling the failed-edit record as closely as possible. This is achieved by imputing the minimum number of variables in some sense, thereby preserving as much respondent data as possible (Fellegi-Holt principle). The underlying assumption is that a respondent is more likely to make only one or two errors rather than several, although this is not always true in practice.
In some surveys, it is necessary to use several different types of imputation methods depending on the availability of auxiliary information. This is usually achieved in an automated hierarchy of methods. Carefully develop and test the methods used at each level of the hierarchy and limit as much as possible the number of such levels. Similarly, when collapsing of imputation classes is required, carefully develop and test the imputation methods for each set of classes.
When donor imputation is used, try to impute data for a record from as few donors as possible. Operationally, this may be interpreted as one donor per section of questionnaire, since it is virtually impossible to treat all variables at once for a large questionnaire. Also, try to limit the number of times a specific donor is used to impute recipients in order to control the variance of imputed estimators. Based on available donors, this may imply allowing equally good imputation actions an appropriate chance of being selected to avoid artificially inflating the size of certain groups in the population.
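A simple sketch of these two controls, under hypothetical data and names, is a donor-selection routine that draws at random from the k nearest respondents on a matching variable while capping how many recipients any one donor may serve; the distance measure, k and the cap are all illustrative choices, not prescriptions.

```python
import random

def select_donors(recipients, donors, k=3, max_uses=2, seed=7):
    """For each recipient, pick a donor at random from the k closest
    respondents (on one matching variable), never reusing a donor more
    than max_uses times. All names and parameters are illustrative."""
    rng = random.Random(seed)
    uses = {d: 0 for d in donors}
    chosen = {}
    for r, r_x in recipients.items():
        avail = [d for d in donors if uses[d] < max_uses]      # enforce the cap
        avail.sort(key=lambda d: abs(donors[d] - r_x))          # distance on matching variable
        pick = rng.choice(avail[:k])                            # random among k nearest
        uses[pick] += 1
        chosen[r] = pick
    return chosen

# Hypothetical donors and recipients, keyed by id, valued by the matching variable.
donors = {"d1": 10.0, "d2": 12.0, "d3": 30.0, "d4": 33.0}
recipients = {"r1": 11.0, "r2": 11.5, "r3": 31.0}
chosen = select_donors(recipients, donors)
print(chosen)
```

Drawing randomly among several equally close donors, rather than always taking the single nearest, is what gives equally good imputation actions an appropriate chance of being selected and keeps any one donor from being duplicated into an artificially large group.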
For large surveys, it may be necessary to process variables sequentially in two or more passes, rather than in a single pass, so as to reduce computational costs. As well, there may be extensive response errors on a record. Either of these conditions can make it difficult to follow the guidelines exactly: there may be cases where more than one donor is required (per section of the questionnaire), and more than the minimum number of variables are imputed.
Impact on estimates
Information on the imputation process should be retained on the post-imputation files and be available for proper evaluation of the impact of imputation on estimates as well as on variances. Such information includes variables indicating which values were imputed and by which method, variables used to indicate which donors were used to impute recipients and so on. Retain the unimputed and imputed values of the record's fields for evaluation purposes.
Consider the degree and impact of imputation when analyzing data. Even when the degree of imputation is low, changes to individual records may have a significant impact; for example, when changes are made to large units or when large changes are made to a few units. In general, the greater the degree and impact of imputation, the more judicious the analyst needs to be in using the data. In such cases, analyses may be misleading if the imputed values are treated as observed values.
The imputation methods used may not preserve relationships between variables and may have a significant impact on distributions of data. For example, not much may have changed at the aggregate level, but values in one domain could have moved systematically up, while values in another domain could have moved down by an offsetting amount. This may actually mean that the domain variable needs to be taken into account in the imputation strategy.
Evaluate the degree and effects of imputation. The Generalized System for Imputation Simulation (GENESIS) is one possible tool for this purpose. It performs imputation in a simulation environment and can be used to assess the bias and variance of imputed estimators in specific settings.
There exist a number of generalized systems that implement a variety of algorithms for either continuous or categorical data. They should be considered during the development of the imputation methodology. The systems are usually simple to use once the edits are specified, and they include algorithms to determine which fields to impute. They are well documented and retain audit trails to allow evaluation of the imputation process. Two systems currently available at Statistics Canada are the Generalized Edit and Imputation System (GEIS/BANFF) (Kovar et al., 1988; Statistics Canada, 2000a) for quantitative economic variables and the Canadian Census Edit and Imputation System (CANCEIS) (Bankier et al., 1999) for qualitative and quantitative variables.
Consider the use of techniques to adequately measure the sampling variance under imputation and to measure the added variance caused by nonresponse and imputation (Lee et al., 2002; Haziza, 2008; Beaumont and Rancourt, 2005). This information is required to satisfy Statistics Canada's Policy on Informing Users of Data Quality and Methodology (Statistics Canada, 2000d; see Appendix 2 where this Policy is reproduced). The System for the Estimation of Variance due to Nonresponse and Imputation (SEVANI), developed at Statistics Canada, can be used for this purpose.
The final report and recommendations of the Committee on Quality Measures (Beaumont, Brisebois, Haziza, Lavallée, Mohl, Rancourt and Trépanier, 2008) contain additional guidelines for variance estimation in the presence of imputation that should be read and considered before implementing any new methodology or software.
To obtain general training on imputation or greater detail on some specific issues, there are different resources. First, it is suggested to take the Statistics Canada course "0423: Nonresponse and Imputation: Theory and application". The Imputation Bulletin is also an interesting and useful source of information on the subject. Finally, external consultants, such as David Haziza and J.N.K. Rao, as well as a number of internal consultants, including the members of the Statistical Research and Innovation Division, the members of the Committee on Quality Measures and the members of the Committee on Practices in Imputation, are available to answer questions.
Main quality elements: accuracy, timeliness, interpretability, coherence.
Estimates obtained after nonresponse has been observed and imputation has been used to deal with this nonresponse are usually not equivalent to the estimates that would have been obtained had all the desired values been observed without error. The difference between these two types of estimates is called the nonresponse error. The nonresponse bias and variance (i.e. the bias and variance caused by not having observed all the desired values) are two quantities associated with the nonresponse error that are usually of interest. These unknown quantities, for which we would ideally like to obtain an accurate measure, are related to the 'accuracy' aspect of quality.
In theory, the nonresponse bias is eliminated if the imputation strategy is based on a correctly specified imputation model with good predictive power. Such an imputation model also leads to a reduction of the nonresponse variance. An imputation model is correctly specified if, given the chosen auxiliary variables, the assumptions underlying its first two moments (usually the mean and variance) hold. It is predictive if the chosen auxiliary variables are strongly associated with the variables to be imputed. As pointed out in the above guidelines, variables used in the definition of the estimator and variables associated with the nonresponse mechanism should be considered as potential auxiliary variables. The objective of these guidelines is to ensure that, given the chosen auxiliary variables, the respondents and nonrespondents are similar with respect to the measured variables.
It is difficult to measure the magnitude of the nonresponse bias but it is possible to derive indicators that are associated with it. Since the magnitude of the nonresponse bias depends on the adequacy of the imputation model, standard model validation techniques, which can be found in classical textbooks on regression, can be used to derive useful indicators. For instance, graphs of model residuals versus different auxiliary variables, including predicted values, can be used to detect possible model misspecifications. The residuals could also be used to derive different statistics. For logistic regression, the Hosmer-Lemeshow test statistic may be a useful indicator. These indicators may also be useful for giving an idea of how the nonresponse variance has been controlled, especially those giving information on the strength of the association between the auxiliary variables and the variables to be imputed.
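As a minimal sketch of such a residual check, assuming a simple ratio model fitted to hypothetical observed data, the residuals can be computed and inspected against the auxiliary variable; a systematic trend would suggest the model is misspecified. The data below are illustrative only.

```python
# Hypothetical observed pairs (x fully observed, y observed for respondents).
x = [10, 20, 30, 40, 50]
y = [2.0, 4.1, 5.7, 8.4, 9.9]

# Fit the ratio model y ~ R * x with R = sum(y) / sum(x).
R = sum(y) / sum(x)

# Residuals y_i - R * x_i; with this estimator of R they sum to zero,
# so the diagnostic value lies in their pattern against x (or against
# the predicted values R * x), not in their average level.
residuals = [yi - R * xi for xi, yi in zip(x, y)]
print([round(r, 3) for r in residuals])
```

Plotting these residuals against x (or any other candidate auxiliary variable) is the graphical version of this check; statistics derived from the residuals serve the same purpose when many models must be screened.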
In addition to the above model diagnostics, estimates of the nonresponse variance or estimates of the total variance may provide good measures of the increased variability due to nonresponse provided that the nonresponse bias can be assumed to be reasonably small. The total variance is the sampling variance to which a nonresponse component is added to reflect the additional uncertainty due to nonresponse. Many variance estimation methods that take nonresponse and imputation into account exist as well as some software. For instance, estimates of the nonresponse component or the total variance can be obtained using SEVANI.
Other indicators can be considered and are useful to give an indication of the degree of imputation but are more difficult to directly relate to the nonresponse bias and variance. The imputation rate by variable and by important domains is one of these indicators. For estimates of totals and means, another useful indicator is the contribution to key estimates that comes from imputed values. A large contribution from imputed values may be an indication that the nonresponse bias and/or variance are not small. Other indicators of the impact of imputation on final estimates can also be determined to provide additional information on the reliability of the estimates.
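These two indicators are straightforward to compute once imputation flags are kept on the file, as the guidelines above recommend. The sketch below uses hypothetical values, flags and design weights.

```python
# Hypothetical microdata: reported/imputed values, imputation flags, design weights.
values  = [5.0, 12.0, 7.5, 20.0, 3.0]
imputed = [False, True, False, True, False]
weights = [10, 10, 15, 15, 20]

# Imputation rate for the variable (unweighted share of imputed records).
imput_rate = sum(imputed) / len(imputed)

# Contribution of imputed values to the weighted estimate of the total.
total = sum(w * v for w, v in zip(weights, values))
imput_contrib = sum(w * v for w, v, f in zip(weights, values, imputed) if f) / total

print(f"imputation rate: {imput_rate:.0%}, contribution to total: {imput_contrib:.1%}")
```

Note how the two indicators can diverge: here a modest imputation rate still yields a large contribution to the total, because the imputed records carry large values, which is exactly the situation in which the estimates deserve extra scrutiny.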
As stressed in the above discussion, a serious modelling effort should be made before determining any imputation strategy. This requires time and resources. In practice, an appropriate balance must thus be struck between the time taken to produce the imputed data file (timeliness) and the quality of the underlying imputation model so as to avoid unduly delaying the release of data. Whenever appropriate, the use of generalized systems for imputation may contribute to a substantial reduction of the processing time, especially the time needed for system development, and thus ensure that more time can be devoted to the choice of a good imputation strategy.
Finally, the imputation methodology used should be clearly described and provided to users along with some of the above indicators and measures. This ensures a better interpretability of the survey results. As much as possible and relevant, the use of similar imputation methodologies across surveys collecting similar information should be considered for coherence purposes.
Bankier, M., M. Lachance and P. Poirier. 1999. "A generic implementation of the New Imputation Methodology." Proceedings of the Survey Research Methods Section. American Statistical Association, 548-553.
Beaumont, J.-F., F. Brisebois, D. Haziza, P. Lavallée, C. Mohl, E. Rancourt, and J. Trépanier. 2008. Final report and recommendations: Variance estimation in the presence of imputation, Committee on Quality Measures. Statistics Canada technical report.
Beaumont, J.-F., and E. Rancourt. 2005. "Variance estimation in the presence of imputation at Statistics Canada." Paper presented at the Statistics Canada's Advisory Committee on Statistical Methods, May 2005.
Fellegi, I.P. and D. Holt. 1976. "A systematic approach to automatic edit and imputation." Journal of the American Statistical Association. Vol. 71. p. 17-35.
Haziza, D., and J.-F. Beaumont. 2007. "On the construction of imputation classes in surveys." International Statistical Review. Vol. 75. p. 25-43.
Haziza, D. 2008. "Imputation and inference in the presence of missing data." In Handbook of Statistics. Vol. 29. Chapter 10: Sample Surveys: Theory, Methods and Inference. D. Pfeffermann and C.R. Rao (eds.). Elsevier BV (to appear).
Kalton, G. and D. Kasprzyk. 1986. "The treatment of missing survey data." Survey Methodology. Vol. 12. p. 1-16.
Kovar, J.G., and P. Whitridge. 1995. "Imputation of business survey data." In Business Survey Methods. B.G. Cox et al. (eds.) New York. Wiley. p. 403-423.
Kovar, J.G., J. MacMillan and P. Whitridge. 1988. Overview and strategy for the Generalized Edit and Imputation System. Statistics Canada, Methodology Branch Working Paper no. BSMD 88-007 E/F.
Lee, H., E. Rancourt and C.-E. Särndal. 2002. "Variance estimation from survey data under single imputation." In Survey Nonresponse. R.M. Groves et al. (eds.) New York. Wiley. p. 315-328.
Statistics Canada. 2000d. "Policy on Informing Users of Data Quality and Methodology." Statistics Canada Policy Manual. Section 2.3. (Reproduced in Appendix 2). Last updated March 4, 2009. /about-apercu/policy-politique/info_user-usager-eng.htm
Statistics Canada. 2000a. Functional description of the Generalized Edit and Imputation System. Statistics Canada Technical Report.