Imputation
Scope and purpose
Imputation is the process used to determine and assign replacement
values for missing, invalid or inconsistent data that have failed edits.
On the record being edited, some responses are changed, or values are
assigned where they are missing, so that the resulting record is plausible
and internally consistent and the estimates are of high quality. Many of
these problems will already have been resolved through follow-up with the
respondent or through review and manual correction of the questionnaire.
However, it is generally impossible to resolve all problems at these early
stages because of concerns about response burden, cost and timeliness.
Since it is usually desirable to produce a complete and consistent
microdata file, imputation is used to handle the remaining edit failures.
Although imputation can improve the quality of the final data by correcting
for missing, invalid or inconsistent responses, care must be exercised
in choosing an appropriate imputation methodology. Some methods of imputation
do not preserve the relationships between variables and can actually distort
underlying distributions. Therefore, imputation must be taken into account
when producing estimates and their associated variance estimates.
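To illustrate how an imputation method can distort a distribution, the following minimal sketch applies mean imputation to a small set of made-up values. The data and variable names are hypothetical and are used only for illustration. Every missing value receives the respondent mean, so the mean is unchanged but the spread of the completed data is understated.

```python
import statistics

# Hypothetical reported values; None marks a missing response.
reported = [10.0, 12.0, None, 20.0, None, 18.0]

observed = [v for v in reported if v is not None]
mean_obs = statistics.mean(observed)

# Mean imputation: every missing value receives the respondent mean.
completed = [mean_obs if v is None else v for v in reported]

# Sample variance before and after imputation.
var_observed = statistics.variance(observed)
var_completed = statistics.variance(completed)

print(f"respondent mean:            {mean_obs}")
print(f"variance, respondents only: {var_observed:.2f}")
print(f"variance, after imputation: {var_completed:.2f}")
```

In this example the variance computed from the completed file is smaller than the variance among respondents, which is why treating imputed values as observed values can lead to understated variance estimates.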
Principles
Imputation is best done by those with full access to the microdata and
in possession of good auxiliary information. It may be automated, manual
or a combination of both. Good imputation attempts to limit the bias caused
by not having observed all of the desired values, has an audit trail for
evaluation purposes and ensures that imputed records are internally consistent.
Good imputation processes are automated, objective, reproducible and efficient.
Under the Fellegi-Holt principles (1976), changes are made to the minimum
number of fields to ensure that the completed record passes all of the
edits.
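The minimum-change idea can be sketched with a brute-force search over a few categorical fields. This is only an illustration under stated assumptions: the fields, domains and edits below are hypothetical, and the exhaustive search stands in for the far more efficient approaches used in production systems. It is not the algorithm of Fellegi and Holt themselves.

```python
from itertools import combinations, product

# Hypothetical categorical fields and their allowed values.
DOMAINS = {
    "age_group": ["child", "adult"],
    "marital_status": ["single", "married"],
    "employment": ["employed", "not_employed"],
}

# Hypothetical edits: each function returns True when the record PASSES.
EDITS = [
    lambda r: not (r["age_group"] == "child" and r["marital_status"] == "married"),
    lambda r: not (r["age_group"] == "child" and r["employment"] == "employed"),
]


def passes(record):
    """Return True if the record satisfies every edit."""
    return all(edit(record) for edit in EDITS)


def minimum_change_fields(record):
    """Return all smallest sets of fields that can be changed so the record passes."""
    fields = list(DOMAINS)
    for size in range(len(fields) + 1):
        solutions = []
        for subset in combinations(fields, size):
            # Try every combination of values for the fields in this subset.
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = dict(record, **dict(zip(subset, values)))
                if passes(candidate):
                    solutions.append(subset)
                    break
        if solutions:
            return solutions  # stop at the smallest size that works
    return []


failed = {"age_group": "child", "marital_status": "married", "employment": "employed"}
fields_to_impute = minimum_change_fields(failed)
print(fields_to_impute)
```

For this record, changing the age group alone is enough to satisfy both edits, whereas changing either of the other two fields alone is not, so the search identifies the age group as the single field to impute.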
Imputation methods can be classified as either deterministic or stochastic,
depending upon whether or not there is some degree of randomness in the
imputed data (Kalton and Kasprzyk, 1986; Kovar and Whitridge, 1995). Deterministic
imputation methods include logical imputation, historical imputation,
mean imputation, ratio and regression imputation and single donor nearest-neighbour
imputation. These methods can be further divided into methods that rely
solely on deducing the imputed value from data available for the nonrespondent
and other auxiliary data (logical and historical) and those that make
use of the observed data for other responding units for the given survey.
Use of observed data from responding units can be made directly by transferring
data from a chosen donor record or by means of models (ratio and regression).
Stochastic imputation methods include the hot deck, nearest-neighbour
imputation in which the donor is selected at random from among several
of the closest neighbours, regression with random residuals, and any other
deterministic method with random residuals added.
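The distinction between deterministic and stochastic methods can be sketched by contrasting ratio imputation with a random hot deck, both applied within imputation classes. The records, field names and functions below are hypothetical, and the sketch assumes that every class contains at least one respondent with a non-zero auxiliary value.

```python
import random

# Hypothetical records: x is an auxiliary variable known for all units,
# y is the survey variable (None = missing), cls is the imputation class.
records = [
    {"id": 1, "cls": "A", "x": 10.0, "y": 20.0},
    {"id": 2, "cls": "A", "x": 20.0, "y": 40.0},
    {"id": 3, "cls": "A", "x": 30.0, "y": None},
    {"id": 4, "cls": "B", "x": 5.0, "y": 15.0},
    {"id": 5, "cls": "B", "x": 6.0, "y": None},
]


def donors_for(record, records):
    """Respondents in the same imputation class as the given record."""
    return [d for d in records if d["cls"] == record["cls"] and d["y"] is not None]


def ratio_impute(records):
    """Deterministic: imputed y = x * (sum of y / sum of x) among class respondents."""
    imputed = {}
    for r in records:
        if r["y"] is None:
            donors = donors_for(r, records)
            ratio = sum(d["y"] for d in donors) / sum(d["x"] for d in donors)
            imputed[r["id"]] = r["x"] * ratio
    return imputed


def random_hot_deck(records, rng):
    """Stochastic: copy y from a donor drawn at random within the same class."""
    imputed = {}
    for r in records:
        if r["y"] is None:
            imputed[r["id"]] = rng.choice(donors_for(r, records))["y"]
    return imputed


ratio_values = ratio_impute(records)
hot_deck_values = random_hot_deck(records, random.Random(12345))
print(ratio_values)
print(hot_deck_values)
```

Rerunning the ratio method always yields the same imputed values, while the hot deck result depends on the random draw. Fixing the seed, as done here, is one way to keep a stochastic method reproducible.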
Guidelines
- Evaluate the type of nonresponse. That is, try to determine which
auxiliary variables can explain the nonresponse mechanism(s) in order
to use them to enrich the imputation method. Include such auxiliary
variables in the imputation method.
- Carefully develop and test the imputation approach. Study the quality
and appropriateness of available variables to determine which ones to
use as auxiliary variables, as matching variables or to build imputation
classes. For this purpose, consult subject matter experts and use modelling
techniques.
- Take into account the type of estimates to be produced, such as level
vs. change, high-level aggregates vs. small domains, and cross-sectional
vs. longitudinal.
- Try to have the imputed record closely resemble the record that failed
the edits. This is achieved by imputing as few variables as possible,
thereby preserving as much respondent data as possible. The underlying
assumption is that a respondent is more likely to make only one or two
errors rather than several, although this is not always true in practice.
Make imputed records internally consistent.
- In some surveys, it is necessary to use several different types of
imputation methods. This is usually achieved in an automated hierarchy
of methods. Limit the number of such levels and carefully develop and
test the methods used at each level of the hierarchy. Similarly, when
collapsing imputation classes is required, carefully develop and test
the imputation methods for the new classes.
- When donor imputation is used, try to impute data for a record from
as few donors as possible. Operationally, this may be interpreted as
one donor per section of questionnaire, since it is virtually impossible
to treat all variables at once for a large questionnaire. Also, based
on available donors, allow equally good imputation actions an appropriate
chance of being selected to avoid artificially inflating the size of
certain groups in the population.
- For large surveys, it may be necessary to process variables in two
or more passes, rather than in a single pass, so as to reduce computational
costs. As well, there may be extensive response errors on a record.
Either of these conditions can make it difficult to follow the guidelines
exactly: there may be cases where more than one donor is required, and
more than the minimum number of variables are imputed.
- During the development of the imputation methodology, note that there
exist a number of generalized systems that implement a variety of algorithms,
for either continuous or categorical data. The systems are usually simple
to use once the edits are specified, and they include algorithms to
determine which fields to impute. They are well documented and retain
audit trails to allow evaluation of the imputation process. Two systems
currently available at Statistics Canada are the Generalized Edit and
Imputation System (GEIS/BANFF) (Kovar et al., 1988; Statistics Canada,
2000a) for quantitative economic variables and the Canadian Census Edit
and Imputation System (CANCEIS) (Bankier et al., 1999) for qualitative
and quantitative variables.
- Flag imputed values and clearly identify the methods and sources of
imputation. Retain the unimputed and imputed values of the record’s
fields for evaluation purposes. Evaluate the degree and effects of imputation.
Consider the use of techniques to adequately measure the sampling variance
under imputation and to measure the added variance introduced by imputation
(Lee et al., 2002). This information is required to satisfy Statistics
Canada’s Policy on Informing Users of Data Quality and Methodology
(Statistics Canada, 2000d).
- Consider the degree and impact of imputation when analyzing data.
The imputation methods used may have a significant impact on distributions
of data. For example, it is possible that not very much has changed
at the aggregate level, but that values in one domain have moved systematically
up, while values in another domain have moved down by an offsetting
amount. As well, even when the degree of imputation is low, changes
to individual records may have a significant impact, for example when
changes are made to large units or when large changes are made to a
few units. In general, the greater the degree and impact of imputation,
the more judicious the analyst needs to be in using the data. In such
cases, analyses may be misleading if the imputed values are treated
as observed data.
- Note that the Imputation Bulletin produced by the Methodology Branch
presents Statistics Canada’s software and practices in imputation
as well as recent developments in the field of imputation. Also, the
Committee on Practices in Imputation (CoPI) meets regularly to discuss
issues related to imputation and specific implementations of imputation.
Valuable comments and suggestions can be obtained from the CoPI when
designing an imputation strategy.
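The guidelines on using an automated hierarchy of methods, flagging imputed values and retaining the unimputed values can be sketched as follows. The units, field names and flag codes are hypothetical. The hierarchy shown (reported value, then the previous cycle's value carried forward as a simple form of historical imputation, then ratio imputation) is only one possible ordering and is not a description of any particular production system.

```python
# Hypothetical units: 'current' is this cycle's report (None = missing),
# 'previous' is last cycle's value, 'x' is an auxiliary size measure.
units = [
    {"id": 1, "current": 100.0, "previous": 90.0, "x": 10.0},
    {"id": 2, "current": None, "previous": 80.0, "x": 8.0},
    {"id": 3, "current": None, "previous": None, "x": 5.0},
    {"id": 4, "current": 60.0, "previous": None, "x": 6.0},
]


def impute_with_hierarchy(units):
    """Apply a two-level hierarchy and record which method produced each value."""
    respondents = [u for u in units if u["current"] is not None]
    ratio = sum(u["current"] for u in respondents) / sum(u["x"] for u in respondents)

    result = []
    for u in units:
        rec = dict(u, original=u["current"])  # retain the unimputed value
        if u["current"] is not None:
            rec.update(final=u["current"], flag="REPORTED")
        elif u["previous"] is not None:
            # Level 1: historical imputation (previous value carried forward).
            rec.update(final=u["previous"], flag="HISTORICAL")
        else:
            # Level 2: ratio imputation using the auxiliary variable.
            rec.update(final=u["x"] * ratio, flag="RATIO")
        result.append(rec)
    return result


imputed = impute_with_hierarchy(units)

# Degree of imputation: share of units whose final value was imputed.
imputation_rate = sum(r["flag"] != "REPORTED" for r in imputed) / len(imputed)

for r in imputed:
    print(r["id"], r["original"], r["final"], r["flag"])
print("imputation rate:", imputation_rate)
```

Keeping the original value, the final value and a method flag on every record makes it possible to report the degree of imputation by method and to exclude or down-weight imputed values in later analyses.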
References
Bankier, M., Lachance, M. and Poirier, P. (1999). A generic implementation
of the New Imputation Methodology. Proceedings of the Survey Research
Methods Section, American Statistical Association, 548-553.
Fellegi, I.P. and Holt, D. (1976). A systematic approach to automatic
edit and imputation. Journal of the American Statistical Association,
71, 17-35.
Kalton, G. and Kasprzyk, D. (1986). The treatment of missing survey data.
Survey Methodology, 12, 1-16.
Kovar, J.G., and Whitridge, P. (1995). Imputation of business survey
data. In Business Survey Methods, B.G. Cox et al. (eds.),
Wiley, New York, 403-423.
Kovar, J.G., MacMillan, J. and Whitridge, P. (1988). Overview and strategy
for the Generalized Edit and Imputation System. Statistics Canada, Methodology
Branch Working Paper No. BSMD 88-007 E/F.
Lee, H., Rancourt, E. and Särndal, C.-E. (2002). Variance estimation
from survey data under single imputation. In Survey Nonresponse,
R.M. Groves et al. (eds.), Wiley, New York, 315-328.
Statistics Canada (2000a). Functional description of the Generalized
Edit and Imputation System. Statistics Canada technical report.
Statistics Canada (2000d). Policy on Informing Users of Data Quality
and Methodology. Policy Manual, 2.3.