A generalized Fellegi-Holt paradigm for automatic error localization

7. Simulation study

To test the potential usefulness of the new error localization approach, I conducted a small simulation study using the R environment for statistical computing (R Development Core Team 2015). I implemented a prototype of the algorithm in Figure 6.1 in R. This prototype made liberal use of the existing functionality for Fellegi-Holt-based automatic editing available in the editrules package (van der Loo and de Jonge 2012; de Jonge and van der Loo 2014). The program was not optimized for computational efficiency, but it proved fast enough for the relatively small error localization problems encountered in this simulation study. (Note: The R code used in this study is available from the author upon request.)

The simulation study involved records of five numerical variables that should satisfy the following nine linear edit rules:

$$
\begin{aligned}
x_1 + x_2 &= x_3, \\
x_3 - x_4 &= x_5, \\
x_j &\geq 0, \quad j \in \{1, 2, 3, 4\}, \\
x_1 &\geq x_2, \\
x_5 &\geq -0.1\,x_3, \\
x_5 &\leq 0.5\,x_3.
\end{aligned}
$$

Edits of this form might typically be encountered for SBS, as part of a much larger set of edit rules (Scholtus 2014).
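As an illustration, the nine edit rules listed above can be written as a small record check. This is a minimal Python sketch rather than the R prototype used in the study; the function name `satisfies_edits` and the tolerance `tol` for the two equality edits are assumptions introduced here.

```python
def satisfies_edits(x, tol=1e-8):
    """Return True if the record x = (x1, x2, x3, x4, x5) satisfies all nine edits.

    tol is a hypothetical numerical tolerance for the two equality edits.
    """
    x1, x2, x3, x4, x5 = x
    checks = [
        abs(x1 + x2 - x3) <= tol,   # x1 + x2 = x3
        abs(x3 - x4 - x5) <= tol,   # x3 - x4 = x5
        x1 >= 0,                    # x_j >= 0 for j = 1, ..., 4
        x2 >= 0,
        x3 >= 0,
        x4 >= 0,
        x1 >= x2,                   # x1 >= x2
        x5 >= -0.1 * x3,            # x5 >= -0.1 x3
        x5 <= 0.5 * x3,             # x5 <= 0.5 x3
    ]
    return all(checks)


print(satisfies_edits((500, 250, 750, 600, 150)))    # True
print(satisfies_edits((500, 250, 750, -600, 150)))   # False
```

The second record has a sign error in the fourth variable, which violates both the second equality edit and the non-negativity edit for that variable.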

I created a random error-free data set of 2,000 records by drawing from a multivariate normal distribution (using the mvtnorm package) with the following parameters:

$$
\boldsymbol{\mu} = \begin{pmatrix} 500 \\ 250 \\ 750 \\ 600 \\ 150 \end{pmatrix}
\quad \text{and} \quad
\boldsymbol{\Sigma} = \begin{pmatrix}
10{,}000 & -1{,}250 & 8{,}750 & 7{,}500 & 1{,}250 \\
-1{,}250 & 5{,}000 & 3{,}750 & 4{,}000 & -250 \\
8{,}750 & 3{,}750 & 12{,}500 & 11{,}500 & 1{,}000 \\
7{,}500 & 4{,}000 & 11{,}500 & 11{,}750 & -250 \\
1{,}250 & -250 & 1{,}000 & -250 & 1{,}250
\end{pmatrix}.
$$

Only records that satisfied all of the above edits were added to the data set. Note that $\boldsymbol{\Sigma}$ is a singular covariance matrix that incorporates the two equality edits. Technically, the resulting data follow a so-called truncated multivariate singular normal distribution; see de Waal et al. (2011, pages 318ff) or Tempelman (2007).
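The data generation step can be sketched as rejection sampling. The study itself used the mvtnorm package in R; the Python sketch below is a substitute that uses only the standard library. Because the covariance matrix is singular, it draws only the first, second, and fourth variables from their joint normal distribution (using the corresponding entries of the mean vector and covariance matrix) and derives the third and fifth variables from the two equality edits, which is consistent with the remaining covariance entries. The helper names, the hand-written Cholesky factorization, and the seed are assumptions made for illustration.

```python
import math
import random

# Mean and covariance of (x1, x2, x4); x3 and x5 follow from the equality edits.
MU = (500.0, 250.0, 600.0)
COV = (
    (10000.0, -1250.0, 7500.0),
    (-1250.0, 5000.0, 4000.0),
    (7500.0, 4000.0, 11750.0),
)


def cholesky(a):
    """Lower-triangular factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def inequality_edits_hold(x1, x2, x3, x4, x5):
    """Check the seven inequality edits; the equalities hold by construction."""
    return (min(x1, x2, x3, x4) >= 0 and x1 >= x2
            and -0.1 * x3 <= x5 <= 0.5 * x3)


def draw_error_free(n, seed=1):
    """Draw n records, discarding any draw that violates an edit."""
    rng = random.Random(seed)
    low = cholesky(COV)
    records = []
    while len(records) < n:
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        x1, x2, x4 = (MU[i] + sum(low[i][k] * z[k] for k in range(3))
                      for i in range(3))
        x3 = x1 + x2          # first equality edit
        x5 = x3 - x4          # second equality edit
        if inequality_edits_hold(x1, x2, x3, x4, x5):
            records.append((x1, x2, x3, x4, x5))
    return records


data = draw_error_free(2000)
print(len(data))   # 2000
```

Discarding rejected draws is what turns the singular normal distribution into its truncated version.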

Table 7.1 lists the nine allowed edit operations that were considered in this study. Note that the first five lines contain the FH operations for this data set. As indicated in the table, each edit operation has an associated type of error. A synthetic data set to be edited was created by randomly adding errors of these types to the above-mentioned error-free data set. The probability of each type of error is listed in the fourth column of Table 7.1. The associated “ideal” weight according to (4.2) is shown in the last column.
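Equation (4.2) is not reproduced in this section. The weights in Table 7.1 are, however, numerically consistent with a log-odds form, $w_g = \log\{(1 - p_g)/p_g\}$ with the natural logarithm. The Python sketch below recomputes the weights under that assumption; the function name `log_odds_weight` is introduced here for illustration.

```python
import math

# Error probabilities p_g from Table 7.1.
P = {
    "FH1": 0.10, "FH2": 0.08, "FH3": 0.06, "FH4": 0.04, "FH5": 0.02,
    "IC34": 0.07, "TF21": 0.09, "CS4": 0.11, "CS5": 0.13,
}


def log_odds_weight(p):
    """Assumed form of the 'ideal' weight: log of the odds against the error."""
    return math.log((1.0 - p) / p)


weights = {g: round(log_odds_weight(p), 2) for g, p in P.items()}
for g, w in weights.items():
    print(g, w)
```

Under this form, rarer error types receive larger weights, so the corresponding edit operations are more costly to include in a solution.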

To limit the amount of computational work, I only considered records that required three edit operations or fewer. Records without errors were also removed. This left 1,025 records to be edited, each containing one, two, or three of the errors listed in Table 7.1.

Table 7.1
Allowed edit operations for the simulation study

| name | operation | associated type of error | $p_g$ | $w_g$ |
|---|---|---|---|---|
| FH1 | impute $x_1$ | erroneous value of $x_1$ | 0.10 | 2.20 |
| FH2 | impute $x_2$ | erroneous value of $x_2$ | 0.08 | 2.44 |
| FH3 | impute $x_3$ | erroneous value of $x_3$ | 0.06 | 2.75 |
| FH4 | impute $x_4$ | erroneous value of $x_4$ | 0.04 | 3.18 |
| FH5 | impute $x_5$ | erroneous value of $x_5$ | 0.02 | 3.89 |
| IC34 | interchange $x_3$ and $x_4$ | true values of $x_3$ and $x_4$ interchanged | 0.07 | 2.59 |
| TF21 | transfer an amount from $x_2$ to $x_1$ | part of the true value of $x_1$ reported as part of $x_2$ | 0.09 | 2.31 |
| CS4 | change the sign of $x_4$ | sign error in $x_4$ | 0.11 | 2.09 |
| CS5 | change the sign of $x_5$ | sign error in $x_5$ | 0.13 | 1.90 |
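To make the operations in Table 7.1 concrete, each one can be read as a function that maps an observed record to an edited record. The Python sketch below is one such reading, not the formal definition used elsewhere in the paper; the function names are hypothetical, and the imputed value and the transferred amount are treated as free parameters.

```python
def impute(x, j, value):
    """FHj: replace the value of variable j (1-based) by a new value."""
    y = list(x)
    y[j - 1] = value
    return tuple(y)


def interchange_34(x):
    """IC34: interchange the values of x3 and x4."""
    x1, x2, x3, x4, x5 = x
    return (x1, x2, x4, x3, x5)


def transfer_21(x, amount):
    """TF21: move an amount from x2 to x1, leaving x1 + x2 unchanged."""
    x1, x2, x3, x4, x5 = x
    return (x1 + amount, x2 - amount, x3, x4, x5)


def change_sign(x, j):
    """CS4 / CS5: change the sign of variable j (1-based)."""
    y = list(x)
    y[j - 1] = -y[j - 1]
    return tuple(y)


# Each observed record below contains one error relative to (500, 250, 750, 600, 150).
print(change_sign((500, 250, 750, 600, -150), 5))    # (500, 250, 750, 600, 150)
print(interchange_34((500, 250, 600, 750, 150)))     # (500, 250, 750, 600, 150)
print(transfer_21((400, 350, 750, 600, 150), 100))   # (500, 250, 750, 600, 150)
print(impute((500, 250, 750, 600, 999), 5, 150))     # (500, 250, 750, 600, 150)
```

The non-FH operations repair a record without introducing a free value (IC34, CS4, CS5) or with a single free amount (TF21), which is what distinguishes them from plain imputation.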

Several error localization approaches were applied to this data set. First of all, I tested error localization according to the Fellegi-Holt paradigm (i.e., using only the edit operations FH1 to FH5) and according to the new paradigm (i.e., using all edit operations in Table 7.1). Both approaches were tested once using the “ideal” weights listed in Table 7.1 and once with all weights equal to 1 (“no weights”). The latter case simulates a situation where the relevant edit operations would be known, but not their respective frequencies. Finally, to test the robustness of the new error localization approach to a lack of information about relevant edit operations, I also applied this approach with one of the non-FH operations in Table 7.1 missing from the set of allowed edit operations.

The quality of error localization was evaluated in two ways. Firstly, I evaluated how well the optimal paths of edit operations found by the algorithm matched the true distribution of errors, using the following contingency table for all $1{,}025 \times 9 = 9{,}225$ combinations of records and edit operations:

Table 7.2
Contingency table of errors and edit operations suggested by the algorithm

| | edit operation was suggested | edit operation was not suggested |
|---|---|---|
| associated error occurred | $TP$ | $FN$ |
| associated error did not occur | $FP$ | $TN$ |

From this table, I computed indicators that measure the proportion of false negatives, false positives, and overall wrong decisions, respectively:

$$
\alpha = \frac{FN}{TP + FN}; \qquad \beta = \frac{FP}{FP + TN}; \qquad \delta = \frac{FN + FP}{TP + FN + FP + TN}.
$$

Similar indicators are discussed by de Waal et al. (2011, pages 410-411). I also computed $\bar{\rho} = 1 - \rho$, with $\rho$ the fraction of records in the data set for which the error localization algorithm found exactly the right solution. A good error localization algorithm should have low scores on all four indicators.
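The four indicators can be computed directly from the true and suggested sets of edit operations per record. The Python sketch below assumes that "exactly the right solution" means the suggested set of operations equals the true set; the function name and the three example records are hypothetical and are not taken from the study.

```python
ALL_OPS = ("FH1", "FH2", "FH3", "FH4", "FH5", "IC34", "TF21", "CS4", "CS5")


def localization_indicators(true_ops, suggested_ops, all_ops=ALL_OPS):
    """Return (alpha, beta, delta, rho_bar) over all record/operation pairs."""
    tp = fn = fp = tn = exact = 0
    for truth, found in zip(true_ops, suggested_ops):
        for g in all_ops:
            occurred, suggested = g in truth, g in found
            if occurred and suggested:
                tp += 1
            elif occurred:
                fn += 1
            elif suggested:
                fp += 1
            else:
                tn += 1
        if set(truth) == set(found):
            exact += 1
    alpha = fn / (tp + fn)                    # proportion of false negatives
    beta = fp / (fp + tn)                     # proportion of false positives
    delta = (fn + fp) / (tp + fn + fp + tn)   # proportion of wrong decisions
    rho_bar = 1 - exact / len(true_ops)       # records not solved exactly
    return alpha, beta, delta, rho_bar


# Hypothetical example with three records.
truth = [{"CS4"}, {"FH1", "CS5"}, {"IC34"}]
found = [{"CS4"}, {"FH1"}, {"FH3", "FH4"}]
print(localization_indicators(truth, found))
```

In this example the counts are TP = 2, FN = 2, FP = 2 and TN = 21, so that α = 2/4, β = 2/23, δ = 4/27, and one record out of three is solved exactly.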

It should be noted that the above quality indicators put the original Fellegi-Holt approach at a disadvantage, as this approach does not use all the edit operations listed in Table 7.1. Therefore, I also calculated a second set of quality indicators $\alpha$, $\beta$, $\delta$, and $\bar{\rho}$ that look at erroneous values rather than edit operations. In this case, $\alpha$ measures the proportion of values in the data set that were affected by errors but left unchanged by the optimal solution of the error localization problem, and similarly for the other measures.
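For the value-based indicators, each edit operation first has to be translated into the set of variables it changes. The Python sketch below uses a mapping read off from the operation descriptions in Table 7.1; the mapping and the function names are assumptions, as the exact counting rules are not spelled out in this section.

```python
# Variables (1-based indices) changed by each edit operation; an assumed mapping.
AFFECTED = {
    "FH1": {1}, "FH2": {2}, "FH3": {3}, "FH4": {4}, "FH5": {5},
    "IC34": {3, 4}, "TF21": {1, 2}, "CS4": {4}, "CS5": {5},
}


def affected_values(ops):
    """Union of the variables changed by a set of edit operations."""
    values = set()
    for g in ops:
        values |= AFFECTED[g]
    return values


def value_counts(true_ops, found_ops, n_vars=5):
    """Return (TP, FN, FP, TN) over the variables of a single record."""
    truth, found = affected_values(true_ops), affected_values(found_ops)
    tp = len(truth & found)
    fn = len(truth - found)
    fp = len(found - truth)
    tn = n_vars - tp - fn - fp
    return tp, fn, fp, tn


# A Fellegi-Holt solution that imputes x3 and x4 when they were interchanged:
print(value_counts({"IC34"}, {"FH3", "FH4"}))   # (2, 0, 0, 3)
```

At the level of edit operations this solution counts as one false negative and two false positives, whereas at the level of values it identifies exactly the right variables, which illustrates why the second set of indicators is fairer to the Fellegi-Holt approach.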

Table 7.3 displays the results of the simulation study for both sets of quality indicators. In both cases, a considerable improvement in the quality of the error localization results is seen for the approach that used all edit operations, compared to the approach that used only FH operations. In addition, leaving one relevant edit operation out of the set of allowed edit operations had a negative effect on the quality of error localization. In some cases this effect was quite large (particularly in terms of edit operations used), but the results of the new error localization approach still remained substantially better than those of the Fellegi-Holt approach. Contrary to expectation, not using different confidence weights actually improved the quality of the error localization results somewhat for this data set under the Fellegi-Holt approach (both sets of indicators) and to some extent also under the new approach (only the second set of indicators). Finally, it is seen that using all edit operations led to an increase in computing time compared to using only FH operations, but this increase was not dramatic.

Table 7.3
Quality of error localization in terms of edit operations used and identified erroneous values; computing time required

The first four indicator columns refer to edit operations (ops), the next four to erroneous values (values).

| approach | $\alpha$ (ops) | $\beta$ (ops) | $\delta$ (ops) | $\bar{\rho}$ (ops) | $\alpha$ (values) | $\beta$ (values) | $\delta$ (values) | $\bar{\rho}$ (values) | time* |
|---|---|---|---|---|---|---|---|---|---|
| Fellegi-Holt (weights) | 74% | 12% | 23% | 80% | 19% | 10% | 13% | 32% | 46 |
| Fellegi-Holt (no weights) | 70% | 12% | 21% | 74% | 13% | 8% | 9% | 24% | 33 |
| all operations (weights) | 14% | 3% | 5% | 24% | 10% | 5% | 7% | 17% | 98 |
| except IC34 | 29% | 5% | 9% | 35% | 15% | 9% | 11% | 29% | 113 |
| except TF21 | 34% | 5% | 10% | 37% | 10% | 5% | 7% | 18% | 80 |
| except CS4 | 28% | 6% | 9% | 39% | 10% | 5% | 7% | 17% | 80 |
| except CS5 | 35% | 7% | 10% | 47% | 11% | 6% | 7% | 18% | 82 |
| all operations (no weights) | 27% | 5% | 8% | 36% | 6% | 4% | 5% | 13% | 99 |