To test the potential usefulness of the new error
localization approach, I conducted a small simulation study using the R
environment for statistical computing (R Development Core Team 2015). A
prototype of the algorithm in Figure 6.1 was implemented in R. This
prototype made liberal use of the existing functionality for Fellegi-Holt-based
automatic editing available in the editrules package (van der Loo
and de Jonge 2012; de Jonge and van der Loo 2014). The
program was not optimized for computational efficiency, but it turned out to
work sufficiently fast for the relatively small error localization problems
encountered in this simulation study. (The R code used in this study is
available from the author upon request.)
The simulation study involved records of five
numerical variables that should satisfy nine linear edit rules, including two
equality edits. Edits of this form might typically be encountered for
structural business statistics (SBS), as part of a much larger set of edit
rules (Scholtus 2014).
I created a random error-free data set of 2,000
records by drawing from a multivariate normal distribution (using the mvtnorm
package) with a fixed mean vector and covariance matrix. Only records that
satisfied all of the above edits were added to the data set. Note that the
covariance matrix is singular; it incorporates the two equality edits.
Technically, the resulting data follow a so-called truncated multivariate
singular normal distribution; see de Waal et al. (2011, pages 318ff) or
Tempelman (2007).
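As a rough illustration of this data-generation step (the paper's actual edit rules, mean vector, and covariance matrix are not reproduced in this excerpt, so every parameter below is an assumption chosen only to show the mechanics), one can draw two free variables and map them through a rank-deficient linear transformation, keeping only draws that satisfy the inequality edits:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed structure: two free variables mapped up to five, with two
# equality edits built in by construction (x3 = x1 + x2, x5 = x3 - x4).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],   # x3 = x1 + x2
              [0.3, 0.2],   # x4, an assumed linear combination
              [0.7, 0.8]])  # x5 = x3 - x4
mean_free = np.array([100.0, 50.0])
cov_free = np.array([[400.0, 100.0],
                     [100.0, 225.0]])

def satisfies_edits(x):
    # Assumed inequality edits: all variables non-negative.
    return bool(np.all(x >= 0.0))

# Rejection sampling: the implied 5x5 covariance A @ cov_free @ A.T has
# rank 2, i.e. it is singular, and rejection handles the truncation.
records = []
while len(records) < 2000:
    x = A @ rng.multivariate_normal(mean_free, cov_free)
    if satisfies_edits(x):
        records.append(x)
data = np.vstack(records)
```

Because the five variables are a linear image of two free variables, the two equality edits hold exactly for every accepted record, which is what makes the covariance matrix singular.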
Table 7.1 lists the nine allowed edit operations that
were considered in this study. Note that the first five lines contain the FH
operations for this data set. As indicated in the table, each edit operation
has an associated type of error. A synthetic data set to be edited was created
by randomly adding errors of these types to the above-mentioned error-free data
set. The probability of each type of error is listed in the fourth column of
Table 7.1. The associated “ideal” weight according to (4.2) is shown in the
last column.
To limit the amount of computational work, I only
considered records that required three or fewer edit operations. Records
without errors were also removed. This left 1,025 records to be edited, each
containing one, two, or three of the errors listed in Table 7.1.
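The error-injection step can be sketched as follows. The probabilities are those of Table 7.1; the assignment of operations to variable positions (e.g. IC34 interchanging the third and fourth variables) and the magnitudes of the perturbations are assumptions read off the operation names, not taken from the paper:

```python
import numpy as np

# Error probabilities from Table 7.1; variable assignments are assumed.
ERROR_PROBS = {
    "FH1": 0.10, "FH2": 0.08, "FH3": 0.06, "FH4": 0.04, "FH5": 0.02,
    "IC34": 0.07, "TF21": 0.09, "CS4": 0.11, "CS5": 0.13,
}

def add_errors(record, rng):
    """Independently apply each error type with its Table 7.1 probability;
    return the perturbed record and the list of errors introduced."""
    x = record.copy()
    applied = []
    for name, p in ERROR_PROBS.items():
        if rng.random() >= p:
            continue
        applied.append(name)
        if name.startswith("FH"):
            j = int(name[2]) - 1
            x[j] += rng.normal(0.0, 50.0)   # an arbitrary "erroneous value"
        elif name == "IC34":
            x[2], x[3] = x[3], x[2]         # interchange two variables
        elif name == "TF21":
            t = 0.5 * x[0]                  # part of one amount reported
            x[0] -= t                       # under an adjacent variable
            x[1] += t                       # (direction is an assumption)
        elif name == "CS4":
            x[3] = -x[3]                    # sign error
        elif name == "CS5":
            x[4] = -x[4]                    # sign error
    return x, applied
```

Under these probabilities a little under half of the records receive no error at all, which is consistent with the study keeping only erroneous records for editing.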
Table 7.1
Allowed edit operations for the simulation study

name   operation                          associated type of error                              error probability   "ideal" weight
FH1    impute x1                          erroneous value of x1                                 0.10                2.20
FH2    impute x2                          erroneous value of x2                                 0.08                2.44
FH3    impute x3                          erroneous value of x3                                 0.06                2.75
FH4    impute x4                          erroneous value of x4                                 0.04                3.18
FH5    impute x5                          erroneous value of x5                                 0.02                3.89
IC34   interchange x3 and x4              true values of x3 and x4 interchanged                 0.07                2.59
TF21   transfer an amount from x2 to x1   part of the true value of one variable reported       0.09                2.31
                                          as part of the other
CS4    change the sign of x4              sign error in x4                                      0.11                2.09
CS5    change the sign of x5              sign error in x5                                      0.13                1.90
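The tabulated weights are consistent with confidence weights of the form w = ln((1 − p)/p). Reading expression (4.2) this way is an assumption, since the expression itself is not reproduced in this excerpt, but it can be checked against all nine rows of Table 7.1:

```python
import math

# Error probabilities from Table 7.1 and a candidate reading of the
# "ideal" weight in (4.2): w = ln((1 - p) / p), rounded to 2 decimals.
probs = [0.10, 0.08, 0.06, 0.04, 0.02, 0.07, 0.09, 0.11, 0.13]
weights = [round(math.log((1.0 - p) / p), 2) for p in probs]
print(weights)  # [2.2, 2.44, 2.75, 3.18, 3.89, 2.59, 2.31, 2.09, 1.9]
```

Every computed value matches the last column of the table, so rarer error types receive larger weights, as one would expect from a minimum-weight formulation.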
Several error localization approaches were applied to
this data set. First of all, I tested error localization according to the
Fellegi-Holt paradigm (i.e., using only the edit operations FH1 to FH5) and
according to the new paradigm (i.e., using all edit operations in Table 7.1).
Both approaches were tested once using the “ideal” weights listed in Table 7.1
and once with all weights equal to 1 (“no weights”). The latter case simulates
a situation where the relevant edit operations would be known, but not their
respective frequencies. Finally, to test the robustness of the new error
localization approach to a lack of information about relevant edit operations,
I also applied this approach with one of the non-FH operations in Table 7.1
missing from the set of allowed edit operations.
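For intuition, the Fellegi-Holt side of this comparison amounts to finding a minimum-weight set of variables that can be released for imputation so that the record can satisfy the edits. The brute-force sketch below is not the Figure 6.1 algorithm, and its single assumed equality edit stands in for the paper's nine rules, which are not reproduced in this excerpt; it only illustrates the weighted objective:

```python
from itertools import combinations

def feasible(record, free):
    # Single assumed equality edit x3 = x1 + x2: if it already holds,
    # nothing needs to change; otherwise freeing any variable involved
    # in the edit makes the record satisfiable.
    if record[2] == record[0] + record[1]:
        return True
    return any(j in free for j in (0, 1, 2))

def localize(record, weights):
    """Exhaustive search for a minimum-weight feasible set of variables."""
    best = None
    n = len(record)
    for k in range(n + 1):
        for free in combinations(range(n), k):
            if feasible(record, set(free)):
                w = sum(weights[j] for j in free)
                if best is None or w < best[0]:
                    best = (w, set(free))
    return best[1] if best else None

# "Ideal" weights for FH1-FH5 from Table 7.1.
weights = [2.20, 2.44, 2.75, 3.18, 3.89]
print(localize([100, 50, 170, 40, 110], weights))  # {0}
```

For the violated record the cheapest single variable in the edit (weight 2.20) is selected; a consistent record yields the empty set. The new paradigm enlarges the search space of this problem from variable sets to paths of edit operations.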
The quality of error localization was evaluated in two
ways. Firstly, I evaluated how well the optimal paths of edit operations found
by the algorithm matched the true distribution of errors, using the following
contingency table for all
combinations of records and edit operations:
Table 7.2
Contingency table of errors and edit operations suggested by the algorithm

                                  edit operation was suggested   edit operation was not suggested
associated error occurred         true positive                  false negative
associated error did not occur    false positive                 true negative
From this table, I computed indicators that measure the proportion of false
negatives, false positives, and overall wrong decisions, respectively. Similar
indicators are discussed by de Waal et al. (2011, pages 410-411). I
also computed a fourth indicator, defined in terms of the fraction of records
in the data set for which the error localization algorithm found exactly the
right solution. A good error localization algorithm should have low scores on
all four indicators.
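The exact definitions of these indicators are not reproduced in this excerpt; one natural reading (cf. de Waal et al. 2011), written out as a sketch over the four cells of Table 7.2:

```python
def quality_indicators(tp, fn, fp, tn):
    """Proportions of false negatives, false positives, and overall wrong
    decisions over all (record, edit operation) combinations.  These
    formulas are an assumed reading of the indicators in the text, not a
    reproduction of the paper's definitions."""
    n = tp + fn + fp + tn
    fn_rate = fn / (tp + fn)    # errors whose operation was not suggested
    fp_rate = fp / (fp + tn)    # suggestions where no error occurred
    wrong = (fn + fp) / n       # overall wrong decisions
    return fn_rate, fp_rate, wrong
```

For example, with 80 true positives, 20 false negatives, 10 false positives, and 890 true negatives, this gives a false-negative rate of 0.2, a false-positive rate of about 0.011, and an overall error rate of 0.03.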
It should be noted that the above quality indicators
put the original Fellegi-Holt approach at a disadvantage, as this approach does
not use all the edit operations listed in Table 7.1. Therefore, I also
calculated a second set of quality indicators that looks at erroneous values
rather than edit operations. In this case, the first indicator measures the
proportion of values in the data set that were affected by errors but left
unchanged by the optimal solution of the error localization problem, and
similarly for the other measures.
Table 7.3 displays the results of the simulation study
for both sets of quality indicators. In both cases, a considerable improvement
in the quality of the error localization results is seen for the approach that
used all edit operations, compared to the approach that used only FH
operations. In addition, leaving one relevant edit operation out of the set of
allowed edit operations had a negative effect on the quality of error
localization. In some cases this effect was quite large
particularly in terms of edit operations used
, but the
results of the new error localization approach still remained substantially
better than those of the Fellegi-Holt approach. Contrary to expectation, not
using different confidence weights actually improved the quality of the error
localization results somewhat for this data set under the Fellegi-Holt approach
(both sets of indicators) and to some extent also under the new approach (only
the second set of indicators). Finally, it is seen that using all edit
operations led to an increase in computing time compared to using only FH
operations, but this increase was not dramatic.
Table 7.3
Quality of error localization in terms of edit operations used and identified erroneous values; computing time required
[Table body not reproduced in this excerpt; rows: approach; columns: quality indicators (edit operations), quality indicators (erroneous values), and computing time.]
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.