Editing

Scope and purpose
Principles
Guidelines
Quality indicators
References

Scope and purpose

Data editing is the application of checks to detect missing, invalid or inconsistent entries or to point to data records that are potentially in error. Some of these checks involve logical relationships that follow directly from the concepts and definitions. Others are more empirical in nature or are obtained as a result of the application of statistical tests or procedures (e.g., outlier analysis techniques). The checks may be based on data from previous collections of the same survey or from other sources.
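As an illustration only, the three kinds of checks named above (missing, invalid and inconsistent entries) can be sketched as simple rules applied to one record. The field names, the tolerance and the helper names below are hypothetical and are not taken from any particular survey or editing system.

```python
from typing import Callable, Dict, List, NamedTuple, Optional

Record = Dict[str, Optional[float]]


class Edit(NamedTuple):
    name: str
    passes: Callable[[Record], bool]  # True when the record satisfies the edit


def wages_balance(rec: Record, tolerance: float = 0.5) -> bool:
    """Consistency edit: the wage components must add up to the reported total."""
    parts = (rec.get("wages_full_time"), rec.get("wages_part_time"), rec.get("wages_total"))
    if any(value is None for value in parts):
        return True  # missing values are left to the completeness edits
    full_time, part_time, total = parts
    return abs(full_time + part_time - total) <= tolerance


# Hypothetical edit set: one completeness, one validity and one consistency edit.
EDITS: List[Edit] = [
    Edit("revenue_reported", lambda rec: rec.get("revenue") is not None),
    Edit("employees_non_negative",
         lambda rec: rec.get("employees") is None or rec["employees"] >= 0),
    Edit("wages_balance", wages_balance),
]


def failed_edits(record: Record, edits: List[Edit] = EDITS) -> List[str]:
    """Return the names of the edits that the record fails, in edit order."""
    return [edit.name for edit in edits if not edit.passes(record)]
```

A record that fails an edit is only flagged here; deciding which field is actually in error is a separate step (error localization).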

Editing encompasses numerous activities at various steps of a survey. It includes interviewer field checks and computer-generated warnings at the time of data collection and capture. It also includes identification of units for follow-up and detailed checks of the micro data during and after data collection. Finally, it includes error localization for the purposes of imputation, and complex relationship verifications at a macro level for the purposes of data validation.

Principles

A data record that has been altered as a result of edits should be closer to the truth after those alterations than before.  We design edits to detect and correct inconsistencies, and not to generate bias by imposing implicit models.  When further editing has a negligible impact on the final survey estimates, it is called over-editing, and should be avoided.

Analysis of edit failure rates and the magnitude of changes due to edits provide information about the quality of the survey data, and can also suggest improvements to the survey vehicle.

Guidelines

Design

  • Editing is well suited for identifying fatal errors (Granquist and Kovar, 1997), since the process can be easily automated. Perform this activity as quickly and as expediently as possible. While some manual intervention may be necessary, generalized, reusable software is particularly useful for this purpose. The Banff system for edit and imputation (Statistics Canada, 2009) and CANCEIS, the Canadian Census Edit and Imputation System (Bankier et al., 1999), are examples of such software. Customized applications can also be built on more general software that is not designed solely for editing. Logiplus, the Statistics Canada system for managing decision logic tables, is an example of such software.

  • Automation allows and tempts survey managers to increase the scope and volume of checks that can be performed. Minimize any such increases if they make little difference to the estimates from the survey. Instead of increasing the editing effort, redirect resources into activities with a higher pay-off (e.g., data analysis, response error analysis).

  • Identify extreme data values in a survey period or across survey periods (this exercise is known as the outlier detection process). The presence of such outlying data is a warning sign of potential errors. Use simple univariate detection methods (Hidiroglou and Berthelot, 1986) or more complex and graphical methods (De Waal, Van de Pol and Renssen, 2000).

  • The impact of errors has been shown to be highly variable, particularly in surveys that collect numeric data; it is not uncommon for a few errors to be responsible for the majority of changes in the estimates. Consider editing in a selective manner to achieve potential efficiency gains (Granquist and Kovar, 1997) without detrimental impact on data quality.  Priorities may be set according to type or severity of error, or according to the importance of the variable or the reporting unit.

  • Hit rates of edits, which are the proportions of warning or query edits that point to true errors, have been shown to be poor, often as low as 20-30%. Develop edits that are efficient and monitor the efficiency on a regular basis.

  • Edits cannot possibly detect small, systematic errors reported consistently in repeated surveys, errors that can lead to serious biases in the estimates. "Tightening" the edits is not the solution. Use other methods, such as traditional quality control methods, careful analysis and review of concepts and definitions, post-interview studies, data validation, and data confrontation with other data sources that might be available for some units, to detect such systematic errors.

  • Limit the reliance on editing to fix problems after the fact, especially in the case of repeated surveys. The contribution of editing to error reduction is limited. While some editing is essential, reduce its scope and redirect its purpose. Assign a high priority to learning from the editing process. To reduce errors, focus on the earlier phases of data collection rather than cleaning up at the end. Practice error prevention rather than error correction. To this end, move the editing step to the early stages of the survey process, preferably while the respondent is still available, for example, through the use of computer-assisted telephone or personal or self-interview methods.

  • In designing data collection processes, especially editing and coding, make sure that the procedures are applied to all units of study as consistently and in as error-free a manner as possible. Automation is desirable. Enable the staff or systems to refer difficult cases to a small number of knowledgeable experts. Centralize the processing in order to reduce costs and make it simpler to take advantage of available expert knowledge. Because the collected information can produce unexpected results, use processes that can be adapted to make appropriate changes when efficiency requires it.
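The univariate outlier detection mentioned above can be sketched as follows. This is a minimal sketch in the spirit of the period-to-period ratio method associated with Hidiroglou and Berthelot (1986); the exact transformation, the function name `hb_outliers` and the default values of the parameters `u`, `a` and `c` are assumptions made for illustration and should be checked against the paper before any real use. The sketch assumes strictly positive values in both periods.

```python
import statistics
from typing import List, Sequence


def hb_outliers(previous: Sequence[float], current: Sequence[float],
                u: float = 0.5, a: float = 0.05, c: float = 4.0) -> List[int]:
    """Return the indices of units whose period-to-period change looks extreme."""
    if len(previous) != len(current) or len(previous) < 2:
        raise ValueError("need two equally long series with at least two units")
    if min(previous) <= 0 or min(current) <= 0:
        raise ValueError("the ratio method assumes strictly positive values")

    ratios = [cur / prev for prev, cur in zip(previous, current)]
    r_med = statistics.median(ratios)

    # Centre the ratios on their median so increases and decreases are treated symmetrically.
    centred = [1 - r_med / r if r < r_med else r / r_med - 1 for r in ratios]

    # Give more importance to large units through a size factor (0 <= u <= 1).
    effects = [s * max(prev, cur) ** u
               for s, prev, cur in zip(centred, previous, current)]

    q1, _, q3 = statistics.quantiles(effects, n=4)
    e_med = statistics.median(effects)

    # Distances from the median to the quartiles, with a floor to avoid a zero-width interval.
    d_low = max(e_med - q1, abs(a * e_med))
    d_high = max(q3 - e_med, abs(a * e_med))

    lower, upper = e_med - c * d_low, e_med + c * d_high
    return [i for i, e in enumerate(effects) if e < lower or e > upper]
```

Flagged units are candidates for review, not confirmed errors; an outlying value can be a genuine change.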

Data collection and failed edit follow-up

  • Editing can serve a useful purpose in tidying up some of the data, but its much more useful role derives from its ability to provide information about the survey process, either as quality measures for the current survey or to suggest improvements for future surveys. Consider editing to be an integral part of the data collection process in its role of gathering intelligence about the process. In this role, editing can be invaluable in sharpening definitions, improving the survey vehicle, evaluating the quality of the data, identifying nonsampling error sources, serving as the basis of future improvement of the whole survey process, and providing valuable information to improve other survey processes and other surveys (Granquist, Kovar and Nordbotten, 2006). To accomplish this goal, monitor the process and produce audit trails, diagnostics and performance measures, and use these to identify best practices.

  • When conducting follow-ups, do not overestimate the respondents' ability to correct errors. Their aggregations may be different, their memory limited, and their "pay-off" negligible. Limit respondent follow-up activity.

  • For business surveys, put in place a strategy for selective follow-up. The use of a score function (Latouche and Berthelot, 1992) concentrates resources on the important sample units, the key variables, and the most severe errors.
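The idea of selective follow-up can be sketched with a simple score per unit. The score below is a generic illustration, not the score function of Latouche and Berthelot (1992): it assumes that an anticipated value (for example, the previous period's value) and an estimated total are available for each variable, and the names `unit_score`, `select_for_follow_up`, `importance` and `threshold` are hypothetical.

```python
from typing import Dict, List


def unit_score(weight: float, reported: Dict[str, float], anticipated: Dict[str, float],
               totals: Dict[str, float], importance: Dict[str, float]) -> float:
    """Weighted deviation from the anticipated values, relative to the estimated totals."""
    return sum(
        importance[var] * weight * abs(reported[var] - anticipated[var]) / totals[var]
        for var in reported
    )


def select_for_follow_up(units: List[dict], totals: Dict[str, float],
                         importance: Dict[str, float], threshold: float) -> List[str]:
    """Return the ids of units whose score exceeds the threshold, highest score first."""
    scored = [
        (unit_score(u["weight"], u["reported"], u["anticipated"], totals, importance), u["id"])
        for u in units
    ]
    return [uid for score, uid in sorted(scored, reverse=True) if score > threshold]
```

With a score of this kind, a large unit with a suspicious change in a key variable is recontacted before a small unit with a minor discrepancy, and units below the threshold are left to automated treatment.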

Quality assurance

  • Ensure that all edits are internally consistent (i.e., not self-contradictory).

  • Keep in mind that the usefulness of editing is limited, and the process can in fact be counter-productive. Often, data changes based on edits are erroneously considered as data corrections. It can be argued that a point is reached during the editing process when just as many errors are introduced as are corrected through the process. Identify and respect this logical end of the process.

  • Reapply edits to units to which corrections were made to ensure that no further errors were introduced directly or indirectly by the correction process.

  • Do not underestimate the ability of the editing process to fit the reported data to implicit models imposed by the edits. There exists a real danger of creating spurious changes just to ensure that the data pass the edits. Control the process!

  • The editing process is often very complex. When editing is under the Agency's control, make available detailed and up-to-date procedures with appropriate training to all staff involved, and monitor the work itself. Consider using formal quality control procedures.

  • Monitor the frequency of edit rejects and the number and type of corrections applied, broken down by stratum, collection mode, processing type, data item and language of the collection. This will help in evaluating the quality of the data and the efficiency of the editing function.
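The monitoring described above can be sketched as a small set of diagnostics computed from an audit trail of edit failures. The layout of the audit trail (one dictionary per failure, with hypothetical keys such as `edit`, `stratum`, `mode` and `confirmed_error`) is an assumption made for this example.

```python
from collections import Counter
from typing import Iterable, List, Optional


def failure_counts(failures: Iterable[dict], key: str) -> Counter:
    """Count edit failures by one breakdown key, e.g. 'edit', 'stratum' or 'mode'."""
    return Counter(f[key] for f in failures)


def hit_rate(failures: List[dict]) -> Optional[float]:
    """Proportion of resolved edit failures that turned out to be true errors.

    Failures whose 'confirmed_error' is None have not been resolved yet and are ignored.
    Returns None when no failure has been resolved.
    """
    resolved = [f for f in failures if f.get("confirmed_error") is not None]
    if not resolved:
        return None
    return sum(1 for f in resolved if f["confirmed_error"]) / len(resolved)
```

Tracking these figures from one survey cycle to the next shows which edits rarely point to true errors and are therefore candidates for revision or removal.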

Quality indicators

Main quality elements:  accuracy, timeliness

Measurement error is the error that occurs during the reporting process, while processing error is the error that occurs when processing data. The latter includes errors in data capture, coding, editing and tabulation of the data as well as in the assignment of survey weights. Although it is not usually possible to calculate measurement error and processing error individually, an indicator of their combined importance is given by the edit failure rate. This is the number of units rejected by edit checks divided by the total number of units. Outputs should be accompanied by a definition of the two types of error and a description of the main sources of errors. This informs users of the mechanisms in place to minimize error by ensuring fine-tuned data collection, capture and processing. Both measurement error and processing error affect bias and variance.
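The edit failure rate defined above is a simple proportion. A minimal sketch, assuming a hypothetical mapping from each unit identifier to the list of edits that unit failed:

```python
from typing import Dict, List


def edit_failure_rate(failed_edits_by_unit: Dict[str, List[str]]) -> float:
    """Number of units rejected by at least one edit, divided by the total number of units."""
    if not failed_edits_by_unit:
        raise ValueError("at least one unit is required")
    rejected = sum(1 for failed in failed_edits_by_unit.values() if failed)
    return rejected / len(failed_edits_by_unit)
```

A unit that fails several edits is counted once, since the rate is defined in terms of units rather than individual edit failures.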

Editing rates for key variables should be reported.  They may be higher due to measurement error (for instance, because of poor question wording) or because of processing error (for instance, data capture errors).

The total contribution to key estimates from edited values should be reported.  This is the extent to which the values of key estimates are informed by data that have been edited.  This may give an indication of the effect of measurement error on key estimates.  This indicator applies only for means and totals.
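For a weighted total, these two indicators can be sketched as below. The record layout (keys `weight`, `value` and `edited`) and the function names are assumptions for illustration; the contribution is computed as the share of the weighted total that comes from edited values, which is one plausible reading of the indicator.

```python
from typing import List


def editing_rate(units: List[dict]) -> float:
    """Proportion of units whose value for the key variable was changed by editing."""
    if not units:
        raise ValueError("at least one unit is required")
    return sum(1 for u in units if u["edited"]) / len(units)


def contribution_of_edited_values(units: List[dict]) -> float:
    """Share of the weighted total that comes from edited values."""
    total = sum(u["weight"] * u["value"] for u in units)
    if total == 0:
        raise ValueError("the estimated total is zero, so the share is undefined")
    edited = sum(u["weight"] * u["value"] for u in units if u["edited"])
    return edited / total
```

The two figures can differ widely: a low editing rate combined with a high contribution means that a few edited units dominate the estimate.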

Data editing is crucial to ensuring the accuracy and consistency of data. However, it can become a costly and time-consuming endeavour. It is likely the single most expensive activity of a sample survey or census cycle. When painstaking, often manual editing has a negligible impact on the final estimates, it is called over-editing. Not only is the practice of over-editing costly in terms of finances, timeliness and increased response burden, but it can also lead to severe biases resulting from fitting data to implicit models imposed by the edits.

References

Bankier, M., M. Lachance and P. Poirier. 1999. "A generic implementation of the New Imputation Methodology." Proceedings of the Survey Research Methods Section. American Statistical Association. p. 548-553.

De Waal, T., F. Van de Pol and R. Renssen. 2000. "Graphical macro editing: possibilities and pitfalls." Proceedings of the Second International Conference on Establishment Surveys. Buffalo, New York.

Granquist, L. and J.G. Kovar. 1997. "Editing of survey data: how much is enough?" Survey Measurement and Process Quality. Lyberg et al (eds.). New York. Wiley. p. 415-435.

Granquist, L., J. Kovar and S. Nordbotten. 2006. "Improving Surveys: Where Does Editing Fit In?" Statistical Data Editing. Vol. 3: Impact on Data Quality, Chapter 4: "Looking forward". Conference of European Statisticians. United Nations Statistical Commission and United Nations Economic Commission for Europe.

Hidiroglou, M. A. and J.-M. Berthelot. 1986. "Statistical editing and imputation for periodic business surveys." Survey Methodology. Vol. 12. p. 73-83.

Latouche, M. and J.-M. Berthelot. 1992. "Use of a score function to prioritize and limit recontacts in editing business surveys." Journal of Official Statistics. Vol. 8. p. 389-400.

Statistics Canada. 2009. Functional Description of the Banff System for Edit and Imputation. Statistics Canada Technical Report.
