Editing and imputation

Results

All (8)

  • Articles and reports: 12-001-X201600214676
    Description:

    Winsorization procedures replace extreme values with less extreme values, effectively moving the original extreme values toward the center of the distribution. Winsorization therefore both detects and treats influential values. Mulry, Oliver and Kaputa (2014) compare the performance of the one-sided Winsorization method developed by Clark (1995) and described by Chambers, Kokic, Smith and Cruddas (2000) to the performance of M-estimation (Beaumont and Alavi 2004) in highly skewed business population data. One aspect of particular interest for methods that detect and treat influential values is the range of values designated as influential, called the detection region. The Clark Winsorization algorithm is easy to implement and can be extremely effective. However, the resultant detection region is highly dependent on the number of influential values in the sample, especially when the survey totals are expected to vary greatly by collection period. In this note, we examine the effect of the number and magnitude of influential values on the detection regions from Clark Winsorization using data simulated to realistically reflect the properties of the population for the Monthly Retail Trade Survey (MRTS) conducted by the U.S. Census Bureau. Estimates from the MRTS and other economic surveys are used in economic indicators, such as the Gross Domestic Product (GDP).

    Release date: 2016-12-20
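
    As a complement to the entry above, here is a minimal sketch of one-sided Winsorization with a fixed cutoff K, assuming the common form in which a flagged value's weighted contribution becomes y_i + (w_i - 1)K. The cutoff, weights and figures are illustrative; choosing K well, which is what determines the detection region studied in the note, is the hard part and is not shown.

    ```python
    import numpy as np

    def winsorize_one_sided(y, w, K):
        """One-sided winsorized values for cutoff K: values above K are
        shrunk to y*_i = K + (y_i - K) / w_i, so the weighted contribution
        w_i * y*_i becomes y_i + (w_i - 1) * K."""
        y = np.asarray(y, dtype=float)
        w = np.asarray(w, dtype=float)
        return np.where(y > K, K + (y - K) / w, y)

    # Toy example: one influential value in a skewed sample. K is assumed
    # here; in practice it is estimated, e.g. from historical data.
    y = np.array([120.0, 80.0, 95.0, 110.0, 5000.0])
    w = np.array([10.0, 10.0, 10.0, 10.0, 10.0])   # design weights
    K = 600.0
    print("unwinsorized total:", np.sum(w * y))
    print("winsorized total:  ", np.sum(w * winsorize_one_sided(y, w, K)))
    # The detection region here is simply {y_i > K}; the note examines how
    # the number and size of influential values shift that region when K
    # must be estimated.
    ```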

  • Articles and reports: 12-001-X201400114001
    Description:

    This article addresses the impact of different sampling procedures on realised sample quality in probability samples. Any such impact was expected to result from the varying degrees of freedom interviewers have to interview easily available or cooperative individuals (thus producing substitutions). The analysis was conducted in a cross-cultural context using data from the first four rounds of the European Social Survey (ESS). Substitutions are measured as deviations from a 50/50 gender ratio in subsamples of heterosexual couples. Significant deviations were found in numerous ESS countries. Deviations were lowest in samples that used official registers of residents as the sampling frame (individual person register samples), even where one partner was more difficult to contact than the other. The extent of substitution did not differ across ESS rounds and was only weakly correlated with payment and control procedures. It can be concluded from the results that individual person register samples are associated with higher sample quality.

    Release date: 2014-06-27
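
    The gender-ratio indicator described above lends itself to a tiny worked check: under strict probability sampling, the selected respondent in an opposite-sex couple should be male about half the time, so a significant departure from 0.5 signals substitution. A sketch with made-up counts (not ESS data); requires SciPy.

    ```python
    from scipy.stats import binomtest

    # Hypothetical counts for one country/round: respondents living with an
    # opposite-sex partner, and how many of them are male.
    n_male, n_total = 612, 1000   # illustrative figures only

    result = binomtest(n_male, n_total, p=0.5)
    print(f"share male: {n_male / n_total:.3f}, p-value: {result.pvalue:.4f}")
    # A clear deviation from 0.5 suggests interviewers substituted the
    # easier-to-reach partner, the indicator the article exploits.
    ```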

  • Articles and reports: 12-001-X201400114002
    Description:

    We propose an approach for multiple imputation of items missing at random in large-scale surveys with exclusively categorical variables that have structural zeros. Our approach is to use mixtures of multinomial distributions as imputation engines, accounting for structural zeros by conceiving of the observed data as a truncated sample from a hypothetical population without structural zeros. This approach has several appealing features: imputations are generated from coherent, Bayesian joint models that automatically capture complex dependencies and readily scale to large numbers of variables. We outline a Gibbs sampling algorithm for implementing the approach, and we illustrate its potential with a repeated sampling study using public use census microdata from the state of New York, U.S.A.

    Release date: 2014-06-27
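
    A minimal sketch of the idea above: a mixture-of-multinomials (latent class) imputation engine fit by Gibbs sampling, with missing items drawn from the class-conditional multinomials. Structural zeros are handled here by simple rejection of illegal draws, a crude stand-in for the paper's truncated-population treatment; the data, dimensions and priors are all assumed.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: two categorical items with 3 levels each, coded 0..2; -1 = missing.
    # Structural zero: the combination (2, 0) is impossible by definition.
    X = np.array([[0, 1], [1, -1], [2, 2], [-1, 0], [1, 1], [2, -1]])
    n, p = X.shape
    levels = [3, 3]
    C = 2                                   # number of mixture components
    structural_zeros = {(2, 0)}

    pi = np.full(C, 1.0 / C)
    theta = [rng.dirichlet(np.ones(levels[j]), size=C) for j in range(p)]
    Xc = np.where(X >= 0, X, 1)             # fill missing with a legal level to start

    for sweep in range(200):
        # 1. Sample each record's latent class given its (completed) items.
        logp = np.log(pi) + sum(np.log(theta[j][:, Xc[:, j]]).T for j in range(p))
        probs = np.exp(logp - logp.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(C, p=probs[i]) for i in range(n)])

        # 2. Impute missing items from the class-conditional multinomials,
        #    rejecting structural zeros (a crude truncation correction).
        for i in range(n):
            for j in range(p):
                if X[i, j] < 0:
                    while True:
                        Xc[i, j] = rng.choice(levels[j], p=theta[j][z[i]])
                        if tuple(Xc[i]) not in structural_zeros:
                            break

        # 3. Conjugate (Dirichlet) updates given the completed data.
        pi = rng.dirichlet(1.0 + np.bincount(z, minlength=C))
        for j in range(p):
            for c in range(C):
                kc = np.bincount(Xc[z == c, j], minlength=levels[j])
                theta[j][c] = rng.dirichlet(1.0 + kc)

    print(Xc)  # one completed dataset; repeating this yields multiple imputations
    ```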

  • Articles and reports: 11-522-X20020016715
    Description:

    This paper will describe the multiple imputation of income in the National Health Interview Survey and discuss the methodological issues involved. In addition, the paper will present empirical summaries of the imputations as well as results of a Monte Carlo evaluation of inferences based on multiply imputed income items.

    Analysts of health data are often interested in studying relationships between income and health. The National Health Interview Survey, conducted by the National Center for Health Statistics of the U.S. Centers for Disease Control and Prevention, provides a rich source of data for studying such relationships. However, the nonresponse rates on two key income items, an individual's earned income and a family's total income, are over 20%. Moreover, these nonresponse rates appear to be increasing over time. A project is currently underway to multiply impute individual earnings and family income along with some other covariates for the National Health Interview Survey in 1997 and subsequent years.

    There are many challenges in developing appropriate multiple imputations for such large-scale surveys. First, there are many variables of different types, with different skip patterns and logical relationships. Second, it is not known what types of associations will be investigated by the analysts of multiply imputed data. Finally, some variables, such as family income, are collected at the family level and others, such as earned income, are collected at the individual level. To make the imputations for both the family- and individual-level variables conditional on as many predictors as possible, and to simplify modelling, we are using a modified version of the sequential regression imputation method described in Raghunathan et al. (Survey Methodology, 2001).

    Besides issues related to the hierarchical nature of the imputations just described, there are other methodological issues of interest such as the use of transformations of the income variables, the imposition of restrictions on the values of variables, the general validity of sequential regression imputation and, even more generally, the validity of multiple-imputation inferences for surveys with complex sample designs.

    Release date: 2004-09-13
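
    A minimal sketch of sequential regression imputation in the spirit of Raghunathan et al. (2001), restricted to continuous variables and ordinary least squares. The real method chooses a regression form per variable type, honours skip patterns, bounds and restrictions, and draws regression parameters rather than fixing them; all figures below are simulated.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    def sequential_regression_impute(X, sweeps=10):
        """Impute np.nan entries by cycling through the variables, regressing
        each on the others, and drawing imputations from the fitted model."""
        X = X.copy()
        miss = np.isnan(X)
        col_means = np.nanmean(X, axis=0)
        for j in range(X.shape[1]):            # crude starting fill
            X[miss[:, j], j] = col_means[j]
        for _ in range(sweeps):
            for j in range(X.shape[1]):
                if not miss[:, j].any():
                    continue
                obs = ~miss[:, j]
                Z = np.delete(X, j, axis=1)
                A = np.column_stack([np.ones(len(X)), Z])
                beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
                resid = X[obs, j] - A[obs] @ beta
                sigma = resid.std(ddof=A.shape[1])
                # Draw imputations rather than plugging in the mean, so that
                # repeated completed datasets reflect imputation uncertainty.
                X[miss[:, j], j] = A[miss[:, j]] @ beta + \
                    rng.normal(0.0, sigma, miss[:, j].sum())
        return X

    # Tiny demo: an income-like variable with about 25% missingness.
    n = 200
    age = rng.uniform(20, 65, n)
    income = 1000 + 120 * age + rng.normal(0, 500, n)
    income[rng.random(n) < 0.25] = np.nan
    completed = sequential_regression_impute(np.column_stack([age, income]))
    print(np.nanmean(income), completed[:, 1].mean())
    ```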

  • Articles and reports: 11-522-X20010016304
    Description:

    This paper discusses in detail issues dealing with the technical aspects of designing and conducting surveys. It is intended for an audience of survey methodologists.

    This paper describes a test of two alternative sets of ratio edit and imputation procedures, both using the U.S. Census Bureau's generalized editing and imputation subsystem ("Plain Vanilla") on 1997 Economic Census data. The quality of the edited and imputed data from both sets of procedures was compared, at both the micro and macro level. The paper then discusses how these quantitative comparisons gave rise to the recommended changes to the current editing and imputation procedures.

    Release date: 2002-09-12
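
    A minimal sketch of a single ratio edit with median-ratio imputation, the building block such generalized systems apply many times over. The items, bounds and the decision to always correct the numerator are assumptions here; a subsystem like "Plain Vanilla" reconciles many ratios jointly and decides which item to change.

    ```python
    import numpy as np

    def ratio_edit_impute(numer, denom, lo, hi):
        """Flag records whose ratio numer/denom falls outside [lo, hi] and
        replace the numerator by median-ratio imputation."""
        numer = numer.astype(float)
        ratio = numer / denom
        ok = (ratio >= lo) & (ratio <= hi)
        med = np.median(ratio[ok])       # trusted records define the typical ratio
        numer[~ok] = med * denom[~ok]    # impute the records that failed the edit
        return numer, ~ok

    # Toy example: payroll per employee; the bounds are assumed, not official.
    payroll = np.array([500.0, 520.0, 20.0, 480.0, 5100.0])
    employees = np.array([10.0, 11.0, 9.0, 10.0, 10.0])
    imputed, flagged = ratio_edit_impute(payroll, employees, lo=30.0, hi=80.0)
    print(flagged, imputed)
    ```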

  • Articles and reports: 12-001-X199300114475
    Description:

    In the creation of micro-simulation databases which are frequently used by policy analysts and planners, several datafiles are combined by statistical matching techniques for enriching the host datafile. This process requires the conditional independence assumption (CIA) which could lead to serious bias in the resulting joint relationships among variables. Appropriate auxiliary information could be used to avoid the CIA. In this report, methods of statistical matching corresponding to three methods of imputation, namely, regression, hot deck, and log linear, with and without auxiliary information are considered. The log linear methods consist of adding categorical constraints to either the regression or hot deck methods. Based on an extensive simulation study with synthetic data, sensitivity analyses for departures from the CIA are performed and gains from using auxiliary information are discussed. Different scenarios for the underlying distribution and relationships, such as symmetric versus skewed data and proxy versus nonproxy auxiliary data, are created using synthetic data. Some recommendations on the use of statistical matching methods are also made. Specifically, it was confirmed that the CIA could be a serious limitation which could be overcome by the use of appropriate auxiliary information. Hot deck methods were found to be generally preferable to regression methods. Also, when auxiliary information is available, log linear categorical constraints can improve performance of hot deck methods. This study was motivated by concerns about the use of the CIA in the construction of the Social Policy Simulation Database at Statistics Canada.

    Release date: 1993-06-15
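
    A minimal sketch of hot deck statistical matching with a categorical constraint, as studied above: each host record receives Z from its nearest donor on the common variable X within the same category cell. The files and variables are hypothetical, and note that matching only on X is precisely the conditional independence assumption the article cautions against.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    def constrained_hot_deck_match(host_x, host_cell, donor_x, donor_z, donor_cell):
        """Attach the donor variable Z to host records by nearest-neighbour
        hot deck on X, restricted to donors in the same categorical cell."""
        z_matched = np.empty(len(host_x))
        for i, (x, c) in enumerate(zip(host_x, host_cell)):
            pool = np.flatnonzero(donor_cell == c)
            if pool.size == 0:                  # no donor in cell: use all donors
                pool = np.arange(len(donor_x))
            j = pool[np.argmin(np.abs(donor_x[pool] - x))]
            z_matched[i] = donor_z[j]
        return z_matched

    # Hypothetical host and donor files sharing X = income, cell = region code.
    host_income = rng.normal(50, 10, 5)
    host_region = np.array([0, 1, 0, 1, 1])
    donor_income = rng.normal(50, 10, 8)
    donor_region = rng.integers(0, 2, 8)
    donor_spending = 0.6 * donor_income + rng.normal(0, 2, 8)
    print(constrained_hot_deck_match(host_income, host_region,
                                     donor_income, donor_spending, donor_region))
    ```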

  • Articles and reports: 12-001-X198600214451
    Description:

    The Canadian Census of Construction (COC) uses a complex plan for sampling small businesses (those having a gross income of less than $750,000). Stratified samples are drawn from overlapping frames. Two subsamples are selected independently from one of the samples, and more detailed information is collected on the businesses in the subsamples. There are two possible methods of estimating totals for the variables collected in the subsamples. The first approach is to determine weights based on sampling rates. A number of different weights must be used. The second approach is to impute values to the businesses included in the sample but not in the subsamples. This approach creates a complete “rectangular” sample file, and a single weight may then be used to produce estimates for the population. This “large-scale imputation” technique is presently applied for the Census of Construction. The purpose of the study is to compare the figures obtained using various estimation techniques with the estimates produced by means of large-scale imputation.

    Release date: 1986-12-15
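
    The two estimation approaches described above can be contrasted directly. A sketch with made-up rates and weights, and a deliberately simple random-donor imputation (one of many possible choices):

    ```python
    import numpy as np

    rng = np.random.default_rng(3)

    # Full sample of n businesses with weight w_full; a detailed item y is
    # collected only in a subsample taken at rate f, so the subsample weight
    # is w_full / f. All figures are illustrative.
    n, f, w_full = 1000, 0.2, 25.0
    y = rng.lognormal(mean=10, sigma=1, size=n)
    in_sub = rng.random(n) < f

    # Approach 1: weight the subsample by the product of sampling rates.
    t_weighted = np.sum(y[in_sub] * (w_full / f))

    # Approach 2: "large-scale imputation" - fill y for non-subsampled units
    # (here with a random donor from the subsample) to obtain a rectangular
    # file, then apply the single full-sample weight.
    y_rect = y.copy()
    donors = np.flatnonzero(in_sub)
    y_rect[~in_sub] = y[rng.choice(donors, size=(~in_sub).sum())]
    t_imputed = np.sum(y_rect * w_full)

    print(f"weighted subsample total:  {t_weighted:,.0f}")
    print(f"imputed rectangular total: {t_imputed:,.0f}")
    ```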

  • Articles and reports: 12-001-X198000154837
    Description:

    Statistics on sales of establishments classified as restaurants, caterers and taverns have been collected since 1951. The sample has not been updated for births since 1968 and, as a result, it is no longer representative of the current universe. This paper reports on several methodological aspects of the redesign: the sampling unit, sample design, sample size and allocation, data collection methods, edits and imputation, accumulations and calculations, and frame and sample maintenance. The new survey will reduce manual procedures wherever possible: collection, editing, imputation, tabulation and updating will be completely computerized. Data collection will be decentralized and will take place by telephone.

    Release date: 1980-06-15