12-539 Data Quality Guidelines

Editing

Scope and purpose

Data editing is the application of checks to detect missing, invalid or inconsistent entries or to point to data records that are potentially in error. Some of these checks involve logical relationships that follow directly from the concepts and definitions. Others are more empirical in nature or are obtained as a result of the application of statistical tests or procedures (e.g., outlier analysis techniques). The checks may be based on data from previous collections of the same survey or from other sources.
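
As a rough illustration of the three kinds of checks described above, the following Python sketch flags missing, invalid and inconsistent entries in a single record. The record layout (revenue, expenses, profit) and the function name check_record are hypothetical and are not taken from any Statistics Canada system; a real survey would derive its edits from its own concepts and definitions.

```python
# Minimal sketch of logical edits on one record (hypothetical field names).
FIELDS = ("revenue", "expenses", "profit")


def check_record(rec: dict) -> list[str]:
    """Return the names of the edits that the record fails."""
    failures = []

    # Missing entries: a required field is absent or None.
    for field in FIELDS:
        if rec.get(field) is None:
            failures.append(f"missing:{field}")

    # Invalid entries: revenue and expenses cannot be negative.
    for field in ("revenue", "expenses"):
        value = rec.get(field)
        if value is not None and value < 0:
            failures.append(f"invalid:{field}")

    # Inconsistent entries: profit must equal revenue minus expenses.
    # This edit is only evaluated when all three fields are present.
    if all(rec.get(field) is not None for field in FIELDS):
        if rec["profit"] != rec["revenue"] - rec["expenses"]:
            failures.append("inconsistent:profit")

    return failures


if __name__ == "__main__":
    print(check_record({"revenue": 100, "expenses": 60, "profit": 40}))
    print(check_record({"revenue": 100, "expenses": None, "profit": 40}))
```

Returning the names of the failed edits, rather than a single pass/fail flag, keeps a record of which checks fired and how often.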

Editing encompasses a wide variety of activities, ranging from interviewer field checks, computer-generated warnings at the time of data collection or capture, through identification of units for follow-up, all the way to complex relationship verifications, error localization for the purposes of imputation, and data validation. The last two topics will be addressed in the sections on Imputation and on Data quality evaluation.
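
The more empirical checks mentioned in this section, such as outlier analysis, can also be sketched briefly. The example below uses a simple quartile-based fence. This rule is an assumption chosen for illustration and is not the method of any reference cited in this document; the function name flag_outliers and the multiplier k are hypothetical.

```python
import statistics


def flag_outliers(values, k=3.0):
    """Return the positions of values lying outside quartile-based fences.

    A value is flagged when it is more than k interquartile ranges
    below the first quartile or above the third quartile.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]


if __name__ == "__main__":
    reported = [10, 12, 11, 13, 12, 11, 500]
    for position in flag_outliers(reported):
        # A flagged value is a warning sign, not proof of an error.
        print(f"unit {position}: value {reported[position]} needs review")
```

A flagged value only points to a record that is potentially in error; whether to change it remains a separate decision.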

Principles

The goals of editing are threefold (Granquist, 1984): to provide the basis for future improvement of the survey vehicle, to provide information about the quality of the survey data, and to tidy up the data. There is good reason to believe that a disproportionate amount of resources is concentrated on the third objective of "cleaning up the data." As a result, learning from the editing process often plays an undeserved, secondary role.

It is recognized that fatal errors (e.g., invalid or inconsistent entries) should be removed from the data sets in order to maintain the Agency's credibility and to facilitate further automated data processing and analysis. However, a caution against the overuse of query edits (those pointing to questionable records that may potentially be in error) must be heeded. Data editing is most likely the single most expensive activity of a sample survey or census cycle. It is estimated to account for at least one-quarter of the total survey budget on average, reaching as high as 40% in the case of business surveys (Gagnon, Gough and Yeo, 1994). When the impact of such painstaking, often manual, editing on the final estimates is negligible it is called over-editing. Not only is the practice of over-editing costly in terms of finances, timeliness and increased response burden, but it can also lead to severe biases resulting from fitting data to implicit models imposed by the edits.
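
One possible way to check whether editing effort is drifting into over-editing is to compare a weighted estimate computed from the raw data with the same estimate computed after editing. The sketch below does this for a weighted total. The function names are hypothetical, and the choice of a weighted total as the estimate of interest is an assumption made for the example.

```python
def weighted_total(values, weights):
    """Weighted total of the reported values."""
    return sum(w * y for y, w in zip(values, weights))


def relative_impact(raw, edited, weights):
    """Relative change in the weighted total caused by editing."""
    raw_total = weighted_total(raw, weights)
    if raw_total == 0:
        raise ValueError("raw weighted total is zero; relative impact is undefined")
    return (weighted_total(edited, weights) - raw_total) / raw_total


if __name__ == "__main__":
    raw = [100.0, 250.0, 90.0]
    edited = [100.0, 252.0, 90.0]  # one value changed during editing
    weights = [10, 5, 20]
    print(f"relative impact of editing: {relative_impact(raw, edited, weights):.4%}")
```

If the relative impact stays very small while the editing cost is high, the situation matches the description of over-editing given above.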

Guidelines

Ensure that all edits are internally consistent (i.e., not self-contradictory).
Reapply edits to units to which corrections were made to ensure that no further errors were introduced directly or indirectly by the correction process.
Editing is well suited for identifying fatal errors (Granquist and Kovar, 1997), since this process can be easily automated. Perform this activity as quickly and as expediently as possible. While some manual intervention may be necessary, generalized, reusable software is particularly useful for this purpose. Examples of such software are Banff, the SAS release of the Generalized Edit and Imputation System (Statistics Canada, 2000a), and CANCEIS, the Canadian Census Edit and Imputation System (Bankier et al., 1999).
The hit rate of edits, that is, the proportion of warning or query edits that point to true errors, has been shown to be poor, often as low as 20-30% (Linacre and Trewin, 1989). Furthermore, the impact of errors has been shown to be highly differential, particularly in surveys that collect numeric data. In other words, it is not uncommon for a few errors to be responsible for the majority of changes in the estimates. Consider editing in a selective manner to achieve potential efficiency gains (Granquist and Kovar, 1997), without detrimental impact on data quality. Priorities may be set according to types or severity of error or according to the importance of the variable or the reporting unit.
For business surveys, put in place a strategy for selective follow-up. The use of a score function (Latouche and Berthelot, 1992) concentrates resources on the important sample units, the key variables, and the most severe errors.
Keep in mind that the usefulness of editing is limited, and the process can in fact be counter-productive (see, for example, Linacre and Trewin, 1989). Often, data changes based on edits are erroneously considered as data corrections. It can be argued that a point in time is reached during the editing process when just as many errors are introduced as are corrected through the process. Identify and respect this logical end of the process.
Automation allows and tempts survey managers to increase the scope and volume of checks that can be performed. Minimize any such increases if they make little difference to the estimates from the survey. Instead of increasing the editing effort, redirect resources into activities with a higher pay-off (e.g., data analysis, response error analysis).
Limit the reliance on editing to fix problems after the fact, especially in the case of repeated surveys. The contribution of editing to error reduction is limited. While some editing is essential, reduce its scope and redirect its purpose. Assign a high priority to learning from the editing process. To reduce errors, focus on the earlier phases of data collection rather than cleaning up at the end. Practice error prevention rather than error correction. To this end, move the editing step to the early stages of the survey process, preferably while the respondent is still available, for example, through the use of computer-assisted telephone, personal or self-interview methods.
Edits cannot possibly detect small, systematic errors reported consistently in repeated surveys, errors that can lead to serious biases in the estimates. "Tightening" the edits is not the solution. Use other methods, such as traditional quality control methods, careful analysis and review of concepts and definitions, post-interview studies, data validation, data confrontation (see section on Administrative data use) with other data sources that might be available for some units, etc., to detect such systematic errors.
Identify extreme data values in a survey period or across survey periods (this exercise is known as the outlier detection process). The presence of such outlying data is a warning sign of potential errors. Use simple univariate detection methods (Hidiroglou and Berthelot, 1986) or more complex and graphical methods (De Waal et al., 2000).
When conducting follow-ups, do not overestimate the respondents' ability to report or correct reports. Their aggregations may be different, their memory limited, and their "pay-off" negligible. Limit respondent follow-up activity.
Do not underestimate the ability of the editing process to fit the reported data to the models imposed by the edits. There exists a real danger of creating spurious changes just to ensure that the data pass the edits. Control the process!
The editing process is often very complex. When editing is under the Agency’s control, make available detailed and up-to-date procedures with appropriate training to all staff involved, and monitor the work itself. Consider using formal quality control procedures.
Editing can serve a useful purpose in tidying up some of the data, but its much more useful role derives from its ability to provide information about the survey process, either as quality measures for the current survey or to suggest improvements for future surveys. Consider editing to be an integral part of the data collection process in its role of gathering intelligence about the process. In this role, editing can be invaluable in sharpening definitions, improving the survey vehicle, evaluating the quality of the data, identifying nonsampling error sources, serving as the basis of future improvement of the whole survey process, and feeding the continuous learning cycle. To accomplish this goal, monitor the process and produce audit trails, diagnostics and performance measures, and use these to identify best practices.
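
The selective follow-up strategy and the hit rate discussed in the guidelines above can be illustrated with a short sketch. The score used here (weighted absolute deviation from an expected value, relative to the expected weighted total) is a simplified stand-in and not the score function of Latouche and Berthelot (1992). The Unit fields, the threshold and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    unit_id: str
    weight: float    # sampling weight
    reported: float  # value reported in the current period
    expected: float  # anticipated value, e.g. from the previous period (assumption)


def scores(units):
    """Score each unit by its potential impact on the estimated total."""
    total = sum(u.weight * u.expected for u in units)
    if total == 0:
        raise ValueError("expected weighted total is zero; scores are undefined")
    return {u.unit_id: u.weight * abs(u.reported - u.expected) / total for u in units}


def select_for_followup(units, threshold=0.01):
    """Return unit ids whose score reaches the threshold, highest score first."""
    s = scores(units)
    selected = [uid for uid, value in s.items() if value >= threshold]
    return sorted(selected, key=lambda uid: s[uid], reverse=True)


def hit_rate(flagged, true_errors):
    """Proportion of flagged units that turned out to be true errors."""
    flagged = set(flagged)
    if not flagged:
        return 0.0
    return len(flagged & set(true_errors)) / len(flagged)


if __name__ == "__main__":
    sample = [
        Unit("A", 1, 1000, 1000),
        Unit("B", 10, 120, 100),
        Unit("C", 1, 5000, 500),
    ]
    print(select_for_followup(sample, threshold=0.05))
    print(hit_rate(["A", "B", "C", "D"], ["C"]))
```

Tracking the hit rate over time is one example of the performance measures that the last guideline asks for: a persistently low value suggests that the query edits, or the follow-up threshold, should be revisited.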

References

Bankier, M., Lachance, M. and Poirier, P. (1999). A generic implementation of the New Imputation Methodology. Proceedings of the Survey Research Methods Section, American Statistical Association, 548-553.

De Waal, T., Van de Pol, F. and Renssen, R. (2000). Graphical macro editing: possibilities and pitfalls. Proceedings of the Second International Conference on Establishment Surveys, Buffalo, New York.

Gagnon, F., Gough, H. and Yeo, D. (1994). Survey of editing practices in Statistics Canada. Statistics Canada technical report.

Granquist, L. (1984). On the role of editing. Statistisk tidskrift, 2, 105-118.

Granquist, L. and Kovar, J.G. (1997). Editing of survey data: how much is enough? In Survey Measurement and Process Quality, Lyberg et al. (eds.). Wiley, New York, 415-435.

Hidiroglou, M. A. and Berthelot, J.-M. (1986). Statistical editing and imputation for periodic business surveys. Survey Methodology, 12, 73-83.

Latouche, M. and Berthelot, J.-M. (1992). Use of a score function to prioritize and limit recontacts in editing business surveys. Journal of Official Statistics, 8, 389-400.

Linacre, S. J. and Trewin, D. J. (1989). Evaluation of errors and appropriate resource allocation in economic collections. Proceedings of the Annual Research Conference, U.S. Bureau of the Census, 197-209.

Statistics Canada (2000a). Functional description of the Generalized Edit and Imputation System. Statistics Canada technical report.

Date Modified: 2014-04-10