Editing and imputation

Results

All (8)

  • Articles and reports: 11-633-X2017006
    Description:

    This paper describes a method of imputing missing postal codes in a longitudinal database. The 1991 Canadian Census Health and Environment Cohort (CanCHEC), which contains information on individuals from the 1991 Census long-form questionnaire linked with T1 tax return files for the 1984-to-2011 period, is used to illustrate and validate the method. The cohort contains up to 28 consecutive fields for postal code of residence, but because of frequent gaps in postal code history, missing postal codes must be imputed. To validate the imputation method, two experiments were devised where 5% and 10% of all postal codes from a subset with full history were randomly removed and imputed. (A sketch of this delete-and-reimpute validation follows the results list.)

    Release date: 2017-03-13

  • Articles and reports: 12-001-X201600214661
    Description:

    An example presented by Jean-Claude Deville in 2005 is subjected to three estimation methods: the method of moments, the maximum likelihood method, and generalized calibration. The three methods yield exactly the same results for the two non-response models. A discussion follows on how to choose the most appropriate model.

    Release date: 2016-12-20

  • Articles and reports: 12-001-X201600214676
    Description:

    Winsorization procedures replace extreme values with less extreme values, effectively moving the original extreme values toward the center of the distribution. Winsorization therefore both detects and treats influential values. Mulry, Oliver and Kaputa (2014) compare the performance of the one-sided Winsorization method developed by Clark (1995) and described by Chambers, Kokic, Smith and Cruddas (2000) to the performance of M-estimation (Beaumont and Alavi 2004) in highly skewed business population data. One aspect of particular interest for methods that detect and treat influential values is the range of values designated as influential, called the detection region. The Clark Winsorization algorithm is easy to implement and can be extremely effective. However, the resultant detection region is highly dependent on the number of influential values in the sample, especially when the survey totals are expected to vary greatly by collection period. In this note, we examine the effect of the number and magnitude of influential values on the detection regions from Clark Winsorization using data simulated to realistically reflect the properties of the population for the Monthly Retail Trade Survey (MRTS) conducted by the U.S. Census Bureau. Estimates from the MRTS and other economic surveys are used in economic indicators, such as Gross Domestic Product (GDP). (A sketch of one-sided Winsorization follows the results list.)

    Release date: 2016-12-20

  • Articles and reports: 12-001-X20060029555
    Description:

    Researchers and policy makers often use data from nationally representative probability sample surveys. The number of topics covered by such surveys, and hence the amount of interviewing time involved, have typically increased over the years, resulting in increased costs and respondent burden. A potential solution to this problem is to carefully form subsets of the items in a survey and administer one such subset to each respondent. Designs of this type are called "split-questionnaire" designs or "matrix sampling" designs. The administration of only a subset of the survey items to each respondent in a matrix sampling design creates what can be considered missing data. Multiple imputation (Rubin 1987), a general-purpose approach developed for handling data with missing values, is appealing for the analysis of data from a matrix sample, because once the multiple imputations are created, data analysts can apply standard methods for analyzing complete data from a sample survey. This paper develops and evaluates a method for creating matrix sampling forms, each form containing a subset of items to be administered to randomly selected respondents. The method can be applied in complex settings, including situations in which skip patterns are present. Forms are created in such a way that each form includes items that are predictive of the excluded items, so that subsequent analyses based on multiple imputation can recover some of the information about the excluded items that would have been collected had there been no matrix sampling. The matrix sampling and multiple-imputation methods are evaluated using data from the National Health and Nutrition Examination Survey, one of many nationally representative probability sample surveys conducted by the National Center for Health Statistics, Centers for Disease Control and Prevention. The study demonstrates the feasibility of the approach applied to a major national health survey with complex structure, and it provides practical advice about appropriate items to include in matrix sampling designs in future surveys. (A sketch of the form-assignment step in matrix sampling follows the results list.)

    Release date: 2006-12-21

  • Articles and reports: 12-001-X20020026427
    Description:

    We proposed an item imputation method for categorical data based on a Maximum Likelihood Estimator (MLE) derived from a conditional probability model (Besag 1974). We also defined a measure for the item non-response error that was useful in evaluating the bias relative to other imputation methods. To compute this measure, we used Bayesian iterative proportional fitting (Gelman and Rubin 1991; Schafer 1997). We implemented our imputation method for the 1998 dress rehearsal of the Census 2000 in Sacramento, and we used the error measure to compare item imputations between our method and a version of the nearest neighbour hot-deck method (Fay 1999; Chen and Shao 1997, 2000) at aggregate levels. Our results suggest that our method gives additional protection against imputation biases caused by heterogeneities between domains of study, relative to the hot-deck method.

    Release date: 2003-01-29

  • Articles and reports: 11-522-X20010016304
    Description:

    This paper discusses in detail issues dealing with the technical aspects of designing and conducting surveys. It is intended for an audience of survey methodologists.

    This paper describes a test of two alternative sets of ratio edit and imputation procedures, both using the U.S. Census Bureau's generalized editing and imputation subsystem ("Plain Vanilla") on 1997 Economic Census data. The quality of the edited and imputed data from both sets of procedures was compared at both the micro and macro levels. A discussion follows on how these quantitative comparisons gave rise to the recommended changes to the current editing and imputation procedures.

    Release date: 2002-09-12

  • Articles and reports: 11-522-X20010016305
    Description:

    This paper discusses in detail issues dealing with the technical aspects of designing and conducting surveys. It is intended for an audience of survey methodologists.

    A review of the Office for National Statistics (ONS) identified the need for new methods that would improve the efficiency of the data validation and editing processes in business surveys without adversely affecting data quality. Methods for automating the correction of systematic errors, and for applying selective editing, were developed. However, the way the organization and procedures of ONS business surveys have evolved presented a number of challenges in implementing these methods. This paper describes these challenges and how they were addressed, and considers their relevance to other organizations. Approaches to evaluating the impact of the new methods on both quality and efficiency are also discussed.

    Release date: 2002-09-12

  • Articles and reports: 11-522-X20010016306
    Description:

    This paper discusses in detail issues dealing with the technical aspects of designing and conducting surveys. It is intended for an audience of survey methodologists.

    The paper deals with the problem of automatically detecting and correcting inconsistent or out-of-range data in a general statistical data collection process. The proposed approach is capable of handling both qualitative and quantitative values. The purpose of this new approach is to overcome the computational limits of the Fellegi-Holt method, while maintaining its positive features. As customary, data records must respect a set of rules in order to be declared correct. By encoding the rules as linear inequalities, we develop mathematical models for the problems of interest. As a first relevant point, the set of rules itself is checked for inconsistency or redundancy by solving a sequence of feasibility problems. As a second relevant point, imputation is performed by solving a sequence of set-covering problems. (A sketch of the error-localization idea follows the results list.)

    Release date: 2002-09-12
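
Illustrative sketches (not from the papers above)

The CanCHEC entry (11-633-X2017006) validates imputation by deleting known postal codes and re-imputing them. The sketch below illustrates that delete-and-reimpute idea on a toy history, using a simple carry-forward fill as a stand-in imputation rule; impute_carry_forward and validate_by_deletion are hypothetical names, and none of this reproduces the paper's actual method.

```python
import random

def impute_carry_forward(history):
    """Stand-in imputation rule: fill gaps (None) with the most recent earlier
    value, then back-fill any leading gaps with the first observed value."""
    filled = list(history)
    last = None
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = last
        else:
            last = v
    nxt = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled

def validate_by_deletion(full_histories, frac=0.05, seed=1):
    """Randomly blank out a fraction of known postal codes, re-impute them,
    and report the share of deleted values recovered exactly."""
    rng = random.Random(seed)
    hits = trials = 0
    for history in full_histories:
        positions = [i for i, v in enumerate(history) if v is not None]
        n_drop = max(1, int(frac * len(positions)))
        dropped = rng.sample(positions, n_drop)
        masked = [None if i in dropped else v for i, v in enumerate(history)]
        imputed = impute_carry_forward(masked)
        for i in dropped:
            trials += 1
            hits += (imputed[i] == history[i])
    return hits / trials if trials else float("nan")

# toy 28-year postal-code histories with complete records
histories = [["K1A0B1"] * 10 + ["M5V3L9"] * 18,
             ["H3Z2Y7"] * 28]
print(validate_by_deletion(histories, frac=0.10))
```

Running the same harness with frac=0.05 and frac=0.10 mirrors, in spirit only, the two experiments described in the abstract.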
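
The Winsorization note (12-001-X201600214676) turns on replacing values above a cutoff with less extreme values. The sketch below shows a generic one-sided Winsorization in which the excess over an assumed cutoff is shrunk by the design weight; choosing the cutoff to minimize estimated mean squared error, as in Clark's method, is not reproduced here, and winsorize_one_sided is a hypothetical name.

```python
def winsorize_one_sided(values, weights, cutoff):
    """One-sided Winsorization: observations above the cutoff are pulled back
    toward it.  Here the excess is divided by the design weight, a common
    variant; plain truncation would simply return min(y, cutoff)."""
    out = []
    for y, w in zip(values, weights):
        if y > cutoff:
            out.append(cutoff + (y - cutoff) / w)  # shrink the excess
        else:
            out.append(y)
    return out

# toy skewed sales data with one influential unit
sales   = [12.0, 15.0, 9.0, 14.0, 480.0]
weights = [10.0, 10.0, 10.0, 10.0, 10.0]
print(winsorize_one_sided(sales, weights, cutoff=100.0))
# the detection region here is simply (cutoff, infinity): any value above 100 is treated
```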
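
The split-questionnaire entry (12-001-X20060029555) rests on administering only a subset of items to each respondent, which creates planned missing data to be handled later by multiple imputation. The sketch below shows only the form-assignment step, with made-up forms and item names; the paper's procedure for choosing predictive items and the subsequent multiple imputation are not shown.

```python
import random

# Hypothetical forms: each form keeps a core block plus a rotating block,
# so every form retains items that help predict the items it omits.
ALL_ITEMS = ["age", "sex", "income", "smoking", "diet", "exercise", "bmi"]
FORMS = {
    "A": ["age", "sex", "income", "smoking", "diet"],
    "B": ["age", "sex", "income", "exercise", "bmi"],
    "C": ["age", "sex", "smoking", "exercise", "diet"],
}

def administer(respondent_ids, seed=7):
    """Randomly assign one form per respondent; unasked items become planned
    missing data (None) to be handled later by multiple imputation."""
    rng = random.Random(seed)
    records = []
    for rid in respondent_ids:
        form = rng.choice(sorted(FORMS))
        asked = set(FORMS[form])
        records.append({
            "id": rid,
            "form": form,
            # a real survey would collect answers; here None marks omitted items
            "responses": {item: ("<answer>" if item in asked else None)
                          for item in ALL_ITEMS},
        })
    return records

for rec in administer(range(3)):
    missing = [k for k, v in rec["responses"].items() if v is None]
    print(rec["id"], rec["form"], "planned-missing:", missing)
```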
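
The last entry (11-522-X20010016306) localizes errors via covering problems: find a smallest set of fields that touches every violated edit rule, so that imputing those fields can make the record consistent. The sketch below illustrates that idea with hypothetical edit rules written as predicates and a brute-force search for a minimum cover; the paper encodes rules as linear inequalities and solves the feasibility and set-covering problems properly, which is not reproduced here.

```python
from itertools import combinations

# Hypothetical edit rules: each entry names the rule, lists the fields it
# involves, and gives a predicate that returns True when the record passes.
RULES = [
    ("age_range",     ["age"],            lambda r: 0 <= r["age"] <= 120),
    ("minor_income",  ["age", "income"],  lambda r: r["age"] >= 16 or r["income"] == 0),
    ("married_adult", ["age", "marital"], lambda r: r["marital"] != "married" or r["age"] >= 16),
]

def failed_rules(record):
    return [(name, fields) for name, fields, check in RULES if not check(record)]

def fields_to_impute(record):
    """Smallest set of fields that covers every failed rule (exhaustive search
    over subsets, standing in for the exact set-covering step)."""
    failed = failed_rules(record)
    if not failed:
        return set()
    candidates = sorted({f for _, fields in failed for f in fields})
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(chosen & set(fields) for _, fields in failed):
                return chosen  # every failed rule touches a chosen field
    return set(candidates)

record = {"age": 12, "income": 30000, "marital": "married"}
print(fields_to_impute(record))  # {'age'} covers both failed rules
```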