Scope and purpose
Data editing is the application of checks to detect missing,
invalid or inconsistent entries or to point to data records that are potentially
in error. Some of these checks involve logical relationships that follow
directly from the concepts and definitions. Others are more empirical
in nature or are obtained as a result of the application of statistical
tests or procedures (e.g., outlier analysis techniques). The checks may
be based on data from previous collections of the same survey or from
other sources.
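The kinds of checks described above can be illustrated with a minimal sketch. The field names (revenue, expenses, profit, employees, prev_revenue), the rounding tolerance and the ratio bounds are all hypothetical choices made for this example; they are not taken from any particular survey or edit system.

```python
from typing import Callable, Optional

Record = dict[str, Optional[float]]


def present(record: Record, *fields: str) -> bool:
    """Return True when every named field is reported (not missing)."""
    return all(record.get(f) is not None for f in fields)


def fails_balance(r: Record) -> bool:
    # Logical edit that follows from the definitions:
    # profit should equal revenue minus expenses (0.5 rounding tolerance, assumed).
    if not present(r, "revenue", "expenses", "profit"):
        return False
    return abs(r["revenue"] - r["expenses"] - r["profit"]) > 0.5


def fails_history(r: Record) -> bool:
    # Empirical edit based on the previous collection: flag revenue that
    # more than tripled or fell below one third (bounds are assumed).
    if not present(r, "revenue", "prev_revenue") or r["prev_revenue"] <= 0:
        return False
    ratio = r["revenue"] / r["prev_revenue"]
    return ratio > 3.0 or ratio < 1.0 / 3.0


# Each edit is a (name, predicate) pair; the predicate returns True on failure.
EDITS: list[tuple[str, Callable[[Record], bool]]] = [
    ("missing_revenue", lambda r: r.get("revenue") is None),
    ("invalid_employees", lambda r: present(r, "employees") and r["employees"] < 0),
    ("inconsistent_profit", fails_balance),
    ("large_change", fails_history),
]


def failed_edits(record: Record) -> list[str]:
    """Return the names of all edits that the record fails, in rule order."""
    return [name for name, fails in EDITS if fails(record)]


if __name__ == "__main__":
    example = {"revenue": 100.0, "expenses": 60.0, "profit": 50.0,
               "employees": -2.0, "prev_revenue": 90.0}
    print(failed_edits(example))
```

In this sketch the first three edits correspond to missing, invalid and inconsistent entries, while the last one only points to a record that is potentially in error.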
Editing encompasses a wide variety of activities, ranging from interviewer
field checks, computer-generated warnings at the time of data collection
or capture, through identification of units for follow-up, all the way
to complex relationship verifications, error localization for the purposes
of imputation, and data validation. The last two topics will be addressed
in the sections on Imputation and on Data
quality evaluation.
Principles
The goals of editing are threefold (Granquist, 1984): to provide the basis
for future improvement of the survey vehicle, to provide information about
the quality of the survey data, and to tidy up the data. There is good
reason to believe that a disproportionate amount of resources is concentrated
on the third objective of "cleaning up the data." As a result,
learning from the editing process often plays an undeserved, secondary
role.
It is recognized that fatal errors (e.g., invalid or inconsistent
entries) should be removed from the data sets in order to maintain the
Agency's credibility and to facilitate further automated data processing
and analysis. However, a caution against the overuse of query edits
(those pointing to questionable records that may potentially be in error)
must be heeded. Data editing is most likely the single most expensive
activity of a sample survey or census cycle. It is estimated to account
for at least one-quarter of the total survey budget on average, reaching
as high as 40% in the case of business surveys (Gagnon, Gough and Yeo,
1994). When the impact of such painstaking, often manual, editing on the
final estimates is negligible, it is called over-editing. Not only is the
practice of over-editing costly in terms of finances, timeliness
and increased response burden, but it can also lead to severe biases resulting
from fitting data to implicit models imposed by the edits.
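One way to judge whether editing effort is paying off is to compare an estimate computed from the raw data with the same estimate computed from the edited data. The following is a minimal sketch for a weighted total; the function names and the illustrative figures are assumptions made for this example, not an established procedure.

```python
from typing import Sequence


def weighted_total(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted total: the sum of weight * value over all units."""
    return sum(w * v for v, w in zip(values, weights))


def relative_impact(raw: Sequence[float], edited: Sequence[float],
                    weights: Sequence[float]) -> float:
    """Change in the weighted total due to editing, relative to the edited total."""
    edited_total = weighted_total(edited, weights)
    if edited_total == 0:
        raise ValueError("edited total is zero; relative impact is undefined")
    return (edited_total - weighted_total(raw, weights)) / edited_total


def unit_contributions(raw: Sequence[float], edited: Sequence[float],
                       weights: Sequence[float]) -> list[float]:
    """Contribution of each unit's change to the shift in the weighted total."""
    return [w * (e - r) for r, e, w in zip(raw, edited, weights)]


if __name__ == "__main__":
    raw = [100.0, 200.0, 3000.0]     # hypothetical reported values
    edited = [100.0, 210.0, 300.0]   # hypothetical values after editing
    weights = [10.0, 10.0, 10.0]
    print(relative_impact(raw, edited, weights))
    print(unit_contributions(raw, edited, weights))
```

If the relative impact of a group of changes is close to zero, the effort spent on them had little effect on the estimate, which is the situation described above as over-editing.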
Guidelines
- Ensure that all edits are internally consistent (i.e., not self-contradictory).
- Reapply edits to units to which corrections were made to ensure that
no further errors were introduced directly or indirectly by the correction
process.
- Editing is well suited for identifying fatal errors (Granquist and
Kovar, 1997), since this process can be easily automated.
Perform this activity as quickly and as expediently as possible. While
some manual intervention may be necessary, generalized, reusable software
is particularly useful for this purpose. Examples of such software are
Banff, the SAS release of the Generalized Edit and Imputation System
(Statistics Canada, 2000a), and CANCEIS, the Canadian Census Edit and
Imputation System (Bankier et al., 1999).
- The hit rate of edits, that is, the proportion of warning or
query edits that point to true errors, has been shown to be poor, often
as low as 20-30% (Linacre and Trewin, 1989). Furthermore, the impact
of errors has been shown to be highly differential, particularly in
surveys that collect numeric data. In other words, it is not uncommon
for a few errors to be responsible for the majority of changes in the
estimates. Consider editing in a selective manner to achieve potential
efficiency gains (Granquist and Kovar, 1997), without detrimental impact
on data quality. Priorities may be set according to types or severity
of error or according to the importance of the variable or the reporting
unit.
- For business surveys, put in place a strategy for selective follow-up.
The use of a score function (Latouche and Berthelot, 1992) concentrates
resources on the important sample units, the key variables, and the
most severe errors.
- Keep in mind that the usefulness of editing is limited, and the process
can in fact be counter-productive (see, for example, Linacre and Trewin,
1989). Often, data changes based on edits are erroneously considered
as data corrections. It can be argued that a point in time is reached
during the editing process when just as many errors are introduced as
are corrected through the process. Identify and respect this logical
end of the process.
- Automation allows and tempts survey managers to increase the scope
and volume of checks that can be performed. Minimize any such increases
if they make little difference to the estimates from the survey. Instead
of increasing the editing effort, redirect resources into activities
with a higher pay-off (e.g., data analysis, response error analysis).
- Limit the reliance on editing to fix problems after the fact, especially
in the case of repeated surveys. The contribution of editing to error
reduction is limited. While some editing is essential, reduce its scope
and redirect its purpose. Assign a high priority to learning from the
editing process. To reduce errors, focus on the earlier phases of data
collection rather than cleaning up at the end. Practice error prevention
rather than error correction. To this end, move the editing step to
the early stages of the survey process, preferably while the respondent
is still available, for example, through the use of computer-assisted
telephone, personal or self-interview methods.
- Edits cannot possibly detect small, systematic errors reported consistently
in repeated surveys, errors that can lead to serious biases in the estimates.
"Tightening" the edits is not the solution. Use other methods,
such as traditional quality control methods, careful analysis and review
of concepts and definitions, post-interview studies, data validation,
data confrontation (see section on Administrative
data use) with other data sources that might be available for some
units, etc., to detect such systematic errors.
- Identify extreme data values in a survey period or across survey periods
(this exercise is known as the outlier detection process). The presence
of such outlying data is a warning sign of potential errors. Use simple
univariate detection methods (Hidiroglou and Berthelot, 1986) or more
complex and graphical methods (De Waal, Van de Pol and Renssen, 2000).
- When conducting follow-ups, do not overestimate the respondents' ability
to report or correct reports. Their aggregations may be different, their
memory limited, and their "pay-off" negligible. Limit respondent
follow-up activity.
- Do not underestimate the ability of the editing process to fit the
reported data to the models imposed by the edits. There exists a real
danger of creating spurious changes just to ensure that the data pass
the edits. Control the process!
- The editing process is often very complex. When editing is under the
Agency’s control, make available detailed and up-to-date procedures
with appropriate training to all staff involved, and monitor the work
itself. Consider using formal quality control procedures.
- Editing can serve a useful purpose in tidying up some of the data,
but its much more useful role derives from its ability to provide information
about the survey process, either as quality measures for the current
survey or to suggest improvements for future surveys. Consider editing
to be an integral part of the data collection process in its role of
gathering intelligence about the process. In this role, editing can
be invaluable in sharpening definitions, improving the survey vehicle,
evaluating the quality of the data, identifying nonsampling error sources,
serving as the basis of future improvement of the whole survey process,
and feeding the continuous learning cycle. To accomplish this goal,
monitor the process and produce audit trails, diagnostics and performance
measures, and use these to identify best practices.
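The ideas of selective follow-up and hit rates discussed in the guidelines above can be sketched as follows. The score used here (the weighted absolute difference between the reported and an anticipated value, relative to the anticipated total) is a simplified illustration only; it is not the score function of Latouche and Berthelot (1992). The class, function names, threshold and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    unit_id: str
    weight: float       # sampling weight
    reported: float     # value reported in the current period
    anticipated: float  # assumed reference value, e.g., from the previous period


def score(unit: Unit, anticipated_total: float) -> float:
    """Weighted absolute deviation relative to the anticipated weighted total."""
    return unit.weight * abs(unit.reported - unit.anticipated) / anticipated_total


def select_for_follow_up(units: list[Unit], threshold: float) -> list[str]:
    """Return ids of units whose score reaches the threshold, highest score first."""
    total = sum(u.weight * u.anticipated for u in units)
    if total <= 0:
        raise ValueError("anticipated total must be positive")
    scored = sorted(((score(u, total), u.unit_id) for u in units), reverse=True)
    return [unit_id for s, unit_id in scored if s >= threshold]


def hit_rate(flagged: set[str], true_errors: set[str]) -> float:
    """Proportion of flagged units that turned out to be true errors."""
    if not flagged:
        return 0.0
    return len(flagged & true_errors) / len(flagged)


if __name__ == "__main__":
    units = [
        Unit("A", weight=1.0, reported=5000.0, anticipated=1000.0),
        Unit("B", weight=20.0, reported=105.0, anticipated=100.0),
        Unit("C", weight=20.0, reported=400.0, anticipated=100.0),
    ]
    print(select_for_follow_up(units, threshold=0.1))
    print(hit_rate({"A", "B", "C", "D", "E"}, {"C"}))
```

Under these assumptions, unit B is not followed up because its deviation has little potential effect on the total, and tracking the hit rate over time provides one of the performance measures that can feed back into the design of the edits.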
References
Bankier, M., Lachance, M. and Poirier, P. (1999). A generic implementation
of the New Imputation Methodology. Proceedings of the Survey Research
Methods Section, American Statistical Association, 548-553.
De Waal, T., Van de Pol, F. and Renssen, R. (2000). Graphical macro editing:
possibilities and pitfalls. Proceedings of the Second International
Conference on Establishment Surveys, Buffalo, New York.
Gagnon, F., Gough, H. and Yeo, D. (1994). Survey of editing practices
in Statistics Canada. Statistics Canada technical report.
Granquist, L. (1984). On the role of editing. Statistisk tidskrift,
2, 105-118.
Granquist, L. and Kovar, J.G. (1997). Editing of survey data: how much
is enough? In Survey Measurement and Process Quality,
Lyberg et al. (eds.). Wiley, New York, 415-435.
Hidiroglou, M. A. and Berthelot, J.-M. (1986). Statistical editing and
imputation for periodic business surveys. Survey Methodology,
12, 73-83.
Latouche, M. and Berthelot, J.-M. (1992). Use of a score function to
prioritize and limit recontacts in editing business surveys. Journal
of Official Statistics, 8, 389-400.
Linacre, S. J. and Trewin, D. J. (1989). Evaluation of errors and appropriate
resource allocation in economic collections. Proceedings of the
Annual Research Conference, U.S. Bureau of the Census, 197-209.
Statistics Canada (2000a). Functional description of the Generalized
Edit and Imputation System. Statistics Canada technical report.