To test the potential usefulness of the new error
localization approach, I conducted a small simulation study using the R
environment for statistical computing (R Development Core Team 2015). A
prototype of the algorithm in Figure 6.1 was implemented in R. This
prototype made liberal use of the existing functionality for Fellegi-Holt-based
automatic editing available in the editrules package (van der Loo
and de Jonge 2012; de Jonge and van der Loo 2014). The
program was not optimized for computational efficiency, but it turned out to
work sufficiently fast for the relatively small error localization problems
encountered in this simulation study. (The R code used in this study is
available from the author upon request.)
The simulation study involved records of five
numerical variables that should satisfy nine linear edit rules, including two
equality edits. Edits of this form might typically be encountered for
structural business statistics (SBS), as part of a much larger set of edit
rules (Scholtus 2014).
I created a random error-free data set of 2,000
records by drawing from a multivariate normal distribution (using the mvtnorm
package) with a fixed mean vector and covariance matrix. Only records that
satisfied all of the above edits were added to the data set. Note that the
covariance matrix is singular; it incorporates the two equality edits.
Technically, the resulting data follow a so-called truncated multivariate
singular normal distribution; see de Waal et al. (2011, pages 318ff) or
Tempelman (2007).
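As a rough illustration of this data-generation step (the paper's actual edit rules, mean vector, and covariance matrix are not reproduced in this excerpt, so every parameter below is an assumption chosen only to show the mechanics), one can draw two free variables and map them through a rank-deficient linear transformation, keeping only draws that satisfy the inequality edits:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed structure: two free variables mapped up to five, with two
# equality edits built in by construction (x3 = x1 + x2, x5 = x3 - x4).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],   # x3 = x1 + x2
              [0.3, 0.2],   # x4, an assumed linear combination
              [0.7, 0.8]])  # x5 = x3 - x4
mean_free = np.array([100.0, 50.0])
cov_free = np.array([[400.0, 100.0],
                     [100.0, 225.0]])

def satisfies_edits(x):
    # Assumed inequality edits: all variables non-negative.
    return bool(np.all(x >= 0.0))

# Rejection sampling: the implied 5x5 covariance A @ cov_free @ A.T has
# rank 2, i.e. it is singular, and rejection handles the truncation.
records = []
while len(records) < 2000:
    x = A @ rng.multivariate_normal(mean_free, cov_free)
    if satisfies_edits(x):
        records.append(x)
data = np.vstack(records)
```

Because the five variables are a linear image of two free variables, the two equality edits hold exactly for every accepted record, which is what makes the covariance matrix singular.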
Table 7.1 lists the nine allowed edit operations that
were considered in this study. Note that the first five lines contain the FH
operations for this data set. As indicated in the table, each edit operation
has an associated type of error. A synthetic data set to be edited was created
by randomly adding errors of these types to the above-mentioned error-free data
set. The probability of each type of error is listed in the fourth column of
Table 7.1. The associated “ideal” weight according to (4.2) is shown in the
last column.
To limit the amount of computational work, I only
considered records that required three or fewer edit operations. Records
without errors were also removed. This left 1,025 records to be edited, each
containing one, two, or three of the errors listed in Table 7.1.
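The error-injection step can be sketched as follows. The probabilities are those of Table 7.1; the assignment of operations to variable positions (e.g. IC34 interchanging the third and fourth variables) and the magnitudes of the perturbations are assumptions read off the operation names, not taken from the paper:

```python
import numpy as np

# Error probabilities from Table 7.1; variable assignments are assumed.
ERROR_PROBS = {
    "FH1": 0.10, "FH2": 0.08, "FH3": 0.06, "FH4": 0.04, "FH5": 0.02,
    "IC34": 0.07, "TF21": 0.09, "CS4": 0.11, "CS5": 0.13,
}

def add_errors(record, rng):
    """Independently apply each error type with its Table 7.1 probability;
    return the perturbed record and the list of errors introduced."""
    x = record.copy()
    applied = []
    for name, p in ERROR_PROBS.items():
        if rng.random() >= p:
            continue
        applied.append(name)
        if name.startswith("FH"):
            j = int(name[2]) - 1
            x[j] += rng.normal(0.0, 50.0)   # an arbitrary "erroneous value"
        elif name == "IC34":
            x[2], x[3] = x[3], x[2]         # interchange two variables
        elif name == "TF21":
            t = 0.5 * x[0]                  # part of one amount reported
            x[0] -= t                       # under an adjacent variable
            x[1] += t                       # (direction is an assumption)
        elif name == "CS4":
            x[3] = -x[3]                    # sign error
        elif name == "CS5":
            x[4] = -x[4]                    # sign error
    return x, applied
```

Under these probabilities a little under half of the records receive no error at all, which is consistent with the study keeping only erroneous records for editing.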
Table 7.1
Allowed edit operations for the simulation study

name   operation                          associated type of error                              error probability   "ideal" weight
FH1    impute x1                          erroneous value of x1                                 0.10                2.20
FH2    impute x2                          erroneous value of x2                                 0.08                2.44
FH3    impute x3                          erroneous value of x3                                 0.06                2.75
FH4    impute x4                          erroneous value of x4                                 0.04                3.18
FH5    impute x5                          erroneous value of x5                                 0.02                3.89
IC34   interchange x3 and x4              true values of x3 and x4 interchanged                 0.07                2.59
TF21   transfer an amount from x2 to x1   part of the true value of one variable reported       0.09                2.31
                                          as part of the other
CS4    change the sign of x4              sign error in x4                                      0.11                2.09
CS5    change the sign of x5              sign error in x5                                      0.13                1.90
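The tabulated weights are consistent with confidence weights of the form w = ln((1 − p)/p). Reading expression (4.2) this way is an assumption, since the expression itself is not reproduced in this excerpt, but it can be checked against all nine rows of Table 7.1:

```python
import math

# Error probabilities from Table 7.1 and a candidate reading of the
# "ideal" weight in (4.2): w = ln((1 - p) / p), rounded to 2 decimals.
probs = [0.10, 0.08, 0.06, 0.04, 0.02, 0.07, 0.09, 0.11, 0.13]
weights = [round(math.log((1.0 - p) / p), 2) for p in probs]
print(weights)  # [2.2, 2.44, 2.75, 3.18, 3.89, 2.59, 2.31, 2.09, 1.9]
```

Every computed value matches the last column of the table, so rarer error types receive larger weights, as one would expect from a minimum-weight formulation.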
Several error localization approaches were applied to
this data set. First of all, I tested error localization according to the
Fellegi-Holt paradigm (i.e., using only the edit operations FH1 to FH5) and
according to the new paradigm (i.e., using all edit operations in Table 7.1).
Both approaches were tested once using the “ideal” weights listed in Table 7.1
and once with all weights equal to 1 (“no weights”). The latter case simulates
a situation where the relevant edit operations would be known, but not their
respective frequencies. Finally, to test the robustness of the new error
localization approach to a lack of information about relevant edit operations,
I also applied this approach with one of the non-FH operations in Table 7.1
missing from the set of allowed edit operations.
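For intuition, the Fellegi-Holt side of this comparison amounts to finding a minimum-weight set of variables that can be released for imputation so that the record can satisfy the edits. The brute-force sketch below is not the Figure 6.1 algorithm, and its single assumed equality edit stands in for the paper's nine rules, which are not reproduced in this excerpt; it only illustrates the weighted objective:

```python
from itertools import combinations

def feasible(record, free):
    # Single assumed equality edit x3 = x1 + x2: if it already holds,
    # nothing needs to change; otherwise freeing any variable involved
    # in the edit makes the record satisfiable.
    if record[2] == record[0] + record[1]:
        return True
    return any(j in free for j in (0, 1, 2))

def localize(record, weights):
    """Exhaustive search for a minimum-weight feasible set of variables."""
    best = None
    n = len(record)
    for k in range(n + 1):
        for free in combinations(range(n), k):
            if feasible(record, set(free)):
                w = sum(weights[j] for j in free)
                if best is None or w < best[0]:
                    best = (w, set(free))
    return best[1] if best else None

# "Ideal" weights for FH1-FH5 from Table 7.1.
weights = [2.20, 2.44, 2.75, 3.18, 3.89]
print(localize([100, 50, 170, 40, 110], weights))  # {0}
```

For the violated record the cheapest single variable in the edit (weight 2.20) is selected; a consistent record yields the empty set. The new paradigm enlarges the search space of this problem from variable sets to paths of edit operations.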
The quality of error localization was evaluated in two
ways. Firstly, I evaluated how well the optimal paths of edit operations found
by the algorithm matched the true distribution of errors, using the following
contingency table for all
combinations of records and edit operations:
Table 7.2
Contingency table of errors and edit operations suggested by the algorithm

                                  edit operation was suggested   edit operation was not suggested
associated error occurred         true positive                  false negative
associated error did not occur    false positive                 true negative
From this table, I computed indicators that measure the proportion of false
negatives, false positives, and overall wrong decisions, respectively. Similar
indicators are discussed by de Waal et al. (2011, pages 410-411). I
also computed a fourth indicator, defined in terms of the fraction of records
in the data set for which the error localization algorithm found exactly the
right solution. A good error localization algorithm should have low scores on
all four indicators.
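The exact definitions of these indicators are not reproduced in this excerpt; one natural reading (cf. de Waal et al. 2011), written out as a sketch over the four cells of Table 7.2:

```python
def quality_indicators(tp, fn, fp, tn):
    """Proportions of false negatives, false positives, and overall wrong
    decisions over all (record, edit operation) combinations.  These
    formulas are an assumed reading of the indicators in the text, not a
    reproduction of the paper's definitions."""
    n = tp + fn + fp + tn
    fn_rate = fn / (tp + fn)    # errors whose operation was not suggested
    fp_rate = fp / (fp + tn)    # suggestions where no error occurred
    wrong = (fn + fp) / n       # overall wrong decisions
    return fn_rate, fp_rate, wrong
```

For example, with 80 true positives, 20 false negatives, 10 false positives, and 890 true negatives, this gives a false-negative rate of 0.2, a false-positive rate of about 0.011, and an overall error rate of 0.03.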
It should be noted that the above quality indicators
put the original Fellegi-Holt approach at a disadvantage, as this approach does
not use all the edit operations listed in Table 7.1. Therefore, I also
calculated a second set of quality indicators that looks at erroneous values
rather than edit operations. In this case, the first indicator measures the
proportion of values in the data set that were affected by errors but left
unchanged by the optimal solution of the error localization problem, and
similarly for the other measures.
Table 7.3 displays the results of the simulation study
for both sets of quality indicators. In both cases, a considerable improvement
in the quality of the error localization results is seen for the approach that
used all edit operations, compared to the approach that used only FH
operations. In addition, leaving one relevant edit operation out of the set of
allowed edit operations had a negative effect on the quality of error
localization. In some cases this effect was quite large
particularly in terms of edit operations used
, but the
results of the new error localization approach still remained substantially
better than those of the Fellegi-Holt approach. Contrary to expectation, not
using different confidence weights actually improved the quality of the error
localization results somewhat for this data set under the Fellegi-Holt approach
(both sets of indicators) and to some extent also under the new approach (only
the second set of indicators). Finally, it is seen that using all edit
operations led to an increase in computing time compared to using only FH
operations, but this increase was not dramatic.
Table 7.3
Quality of error localization in terms of edit operations used and identified erroneous values; computing time required
[Table body not reproduced in this excerpt; rows: approach; columns: quality indicators (edit operations), quality indicators (erroneous values), and computing time.]
Survey Methodology publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves. All papers will be refereed. However, the authors retain full responsibility for the contents of their papers and opinions expressed are not necessarily those of the Editorial Board or of Statistics Canada.