Statistics by subject – Statistical methods

All (52) (25 of 52 results)

  • Technical products: 11-522-X201700014735
    Description:

    Microdata dissemination normally requires that data reduction and modification methods be applied, and the degree to which these methods are applied depends on the control methods that will be required to access and use the data. An approach that is in some circumstances more suitable for accessing data for statistical purposes is secure computation, which involves computing analytic functions on encrypted data without the need to decrypt the underlying source data to run a statistical analysis. This approach also allows multiple sites to contribute data while providing strong privacy guarantees. In this way, the data can be pooled and contributors can compute analytic functions without any party learning the others' inputs. We explain how secure computation can be applied in practical contexts, with some theoretical results and real healthcare examples.

    Release date: 2016-03-24
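
    A minimal illustration of the idea (not the authors' protocol): additive secret sharing lets several contributors compute a pooled total without any party revealing its own value. The modulus, function names and counts below are assumptions made for this sketch.

        import secrets

        MODULUS = 2**61 - 1  # large public modulus, chosen arbitrarily for this sketch

        def share(value, n_parties):
            """Split an integer into n additive shares that sum to value mod MODULUS."""
            shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
            shares.append((value - sum(shares)) % MODULUS)
            return shares

        def secure_sum(all_shares):
            """Each party adds up the shares it holds; combining the partial sums recovers the total."""
            partial_sums = [sum(col) % MODULUS for col in zip(*all_shares)]
            return sum(partial_sums) % MODULUS

        # Three sites pool a count without exposing their individual counts.
        inputs = [120, 75, 301]
        shared = [share(v, n_parties=3) for v in inputs]
        assert secure_sum(shared) == sum(inputs)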

  • Technical products: 11-522-X201700014723
    Description:

    The U.S. Census Bureau is researching uses of administrative records in survey and decennial operations in order to reduce costs and respondent burden while preserving data quality. One potential use of administrative records is to fill in race and Hispanic origin responses when they are missing. When federal and third party administrative records are compiled, race and Hispanic origin responses are not always the same for an individual across different administrative records sources. We explore different sets of business rules used to assign one race and one Hispanic response when these responses are discrepant across sources. We also describe the characteristics of individuals with matching, non-matching, and missing race and Hispanic origin data across several demographic, household, and contextual variables. We find that minorities, especially Hispanics, are more likely to have non-matching Hispanic origin and race responses in administrative records than in the 2010 Census. Hispanics are less likely to have missing Hispanic origin data but more likely to have missing race data in administrative records. Non-Hispanic Asians and non-Hispanic Pacific Islanders are more likely to have missing race and Hispanic origin data in administrative records. Younger individuals, renters, individuals living in households with two or more people, individuals who responded to the census in the nonresponse follow-up operation, and individuals residing in urban areas are more likely to have non-matching race and Hispanic origin responses.

    Release date: 2016-03-24
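
    As a schematic example only, a "business rule" for discrepant responses can be expressed as a small function like the one below; the rule shown (keep the most recent non-missing response) is hypothetical and is not one of the rule sets evaluated by the Census Bureau.

        def resolve_response(records):
            """Pick one response from several administrative sources.

            records: list of (source_year, response) pairs; response may be None.
            Hypothetical rule: keep the most recent non-missing response.
            """
            non_missing = [(year, resp) for year, resp in records if resp is not None]
            if not non_missing:
                return None  # the response stays missing
            return max(non_missing, key=lambda pair: pair[0])[1]

        # Discrepant Hispanic-origin responses across three sources.
        resolve_response([(2008, "Not Hispanic"), (2010, "Hispanic"), (2009, None)])  # -> "Hispanic"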

  • Technical products: 11-522-X201700014747
    Description:

    The Longitudinal Immigration Database (IMDB) combines the Immigrant Landing File (ILF) with annual tax files. This record linkage is performed using a tax filer database. The ILF includes all immigrants who have landed in Canada since 1980. In looking to enhance the IMDB, the possibility of adding temporary residents (TR) and immigrants who landed between 1952 and 1979 (PRE80) was studied. Adding this information would give a more complete picture of the immigrant population living in Canada. To integrate the TR and PRE80 files into the IMDB, record linkages between these two files and the tax filer database were performed. This exercise was challenging in part due to the presence of duplicates in the files and conflicting links between the different record linkages.

    Release date: 2016-03-24
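
    To illustrate the kind of problem described, the sketch below performs a deterministic linkage on a shared identifier and flags the conflicting links that arise when the right-hand file contains duplicates; the file contents and field names are invented for the example.

        from collections import defaultdict

        def link_on_key(left, right, key):
            """Link two files on a common key; return clean links and conflicting (one-to-many) links."""
            index = defaultdict(list)
            for rec in right:
                index[rec[key]].append(rec)

            linked, conflicts = [], []
            for rec in left:
                matches = index.get(rec[key], [])
                if len(matches) == 1:
                    linked.append((rec, matches[0]))
                elif len(matches) > 1:
                    conflicts.append((rec, matches))  # duplicates in the right-hand file
            return linked, conflicts

        landings = [{"person_id": "A1", "year": 1975}, {"person_id": "B2", "year": 1960}]
        taxfile = [{"person_id": "A1", "income": 40000}, {"person_id": "A1", "income": 41000}]
        pairs, conflicts = link_on_key(landings, taxfile, "person_id")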

  • Technical products: 11-522-X201700014739
    Description:

    Vital statistics datasets such as the Canadian Mortality Database lack identifiers for certain populations of interest such as First Nations, Métis and Inuit. Record linkage between vital statistics and survey or other administrative datasets can circumvent this limitation. This paper describes a linkage between the Canadian Mortality Database and the 2006 Census of the Population and the planned analysis using the linked data.

    Release date: 2016-03-24

  • Technical products: 11-522-X201700014732
    Description:

    The Institute for Employment Research (IAB) is the research unit of the German Federal Employment Agency. Via the Research Data Centre (FDZ) at the IAB, administrative and survey data on individuals and establishments are provided to researchers. In cooperation with the Institute for the Study of Labor (IZA), the FDZ has implemented the Job Submission Application (JoSuA) environment which enables researchers to submit jobs for remote data execution through a custom-built web interface. Moreover, two types of user-generated output files may be distinguished within the JoSuA environment which allows for faster and more efficient disclosure review services.

    Release date: 2016-03-24

  • Articles and reports: 82-003-X201501114243
    Description:

    A surveillance tool was developed to assess dietary intake collected by surveys in relation to Eating Well with Canada’s Food Guide (CFG). The tool classifies foods in the Canadian Nutrient File (CNF) according to how closely they reflect CFG. This article describes the validation exercise conducted to ensure that CNF foods determined to be “in line with CFG” were appropriately classified.

    Release date: 2015-11-18

  • Articles and reports: 12-001-X201400214092
    Description:

    Survey methodologists have long studied the effects of interviewers on the variance of survey estimates. Statistical models including random interviewer effects are often fitted in such investigations, and research interest lies in the magnitude of the interviewer variance component. One question that might arise in a methodological investigation is whether or not different groups of interviewers (e.g., those with prior experience on a given survey vs. new hires, or CAPI interviewers vs. CATI interviewers) have significantly different variance components in these models. Significant differences may indicate a need for additional training in particular subgroups, or sub-optimal properties of different modes or interviewing styles for particular survey items (in terms of the overall mean squared error of survey estimates). Survey researchers seeking answers to these types of questions have different statistical tools available to them. This paper aims to provide an overview of alternative frequentist and Bayesian approaches to the comparison of variance components in different groups of survey interviewers, using a hierarchical generalized linear modeling framework that accommodates a variety of different types of survey variables. We first consider the benefits and limitations of each approach, contrasting the methods used for estimation and inference. We next present a simulation study, empirically evaluating the ability of each approach to efficiently estimate differences in variance components. We then apply the two approaches to an analysis of real survey data collected in the U.S. National Survey of Family Growth (NSFG). We conclude that the two approaches tend to result in very similar inferences, and we provide suggestions for practice given some of the subtle differences observed.

    Release date: 2014-12-19

  • Articles and reports: 12-001-X201400214089
    Description:

    This manuscript describes the use of multiple imputation to combine information from multiple surveys of the same underlying population. We use a newly developed method to generate synthetic populations nonparametrically using a finite population Bayesian bootstrap that automatically accounts for complex sample designs. We then analyze each synthetic population with standard complete-data software for simple random samples and obtain valid inference by combining the point and variance estimates using extensions of existing combining rules for synthetic data. We illustrate the approach by combining data from the 2006 National Health Interview Survey (NHIS) and the 2006 Medical Expenditure Panel Survey (MEPS).

    Release date: 2014-12-19
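
    For orientation only, the sketch below applies the standard multiple-imputation combining rules (Rubin's rules) to point and variance estimates from m synthetic analyses; the paper uses extensions of such rules adapted to synthetic data, which are not reproduced here.

        import numpy as np

        def combine_rubin(point_estimates, within_variances):
            """Standard MI combining rules: pooled estimate and total variance
            T = U_bar + (1 + 1/m) * B, with U_bar the mean within-imputation variance
            and B the between-imputation variance."""
            q = np.asarray(point_estimates, dtype=float)
            u = np.asarray(within_variances, dtype=float)
            m = len(q)
            b = q.var(ddof=1)
            return q.mean(), u.mean() + (1 + 1 / m) * b

        estimate, total_variance = combine_rubin([2.1, 2.4, 2.0, 2.3], [0.10, 0.12, 0.09, 0.11])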

  • Technical products: 11-522-X201300014282
    Description:

    The IAB-Establishment Panel is the most comprehensive establishment survey in Germany, with almost 16,000 firms participating every year. Face-to-face paper-and-pencil interviews (PAPI) have been conducted since 1993. An ongoing project examines possible effects of switching the survey to computer-assisted personal interviews (CAPI) combined with a web-based version of the questionnaire (CAWI). In a first step, questions about internet access, willingness to complete the questionnaire online and reasons for refusal were included in the 2012 wave. First results indicate widespread refusal to take part in a web survey. A closer look reveals that smaller establishments, long-time participants and older respondents are reluctant to use the internet.

    Release date: 2014-10-31

  • Technical products: 11-522-X201300014262
    Description:

    Measurement error is one source of bias in statistical analysis. However, its possible implications are mostly ignored. One class of models that can be especially affected by measurement error is fixed-effects models. By validating the survey responses on welfare receipt from five panel survey waves against register data, the size and form of longitudinal measurement error can be determined. It is shown that the measurement error for welfare receipt is serially correlated and non-differential. However, when estimating the coefficients of longitudinal fixed-effects models of welfare receipt on subjective health for men and women, the coefficients are biased only for the male subpopulation.

    Release date: 2014-10-31

  • Articles and reports: 12-001-X201400114003
    Description:

    Outside of the survey sampling literature, samples are often assumed to be generated by a simple random sampling process that produces independent and identically distributed (IID) samples. Many statistical methods are developed largely in this IID world. Application of these methods to data from complex sample surveys without making allowance for the survey design features can lead to erroneous inferences. Hence, much time and effort have been devoted to developing statistical methods that analyze complex survey data and account for the sample design. This issue is particularly important when generating synthetic populations using finite population Bayesian inference, as is often done in missing data or disclosure risk settings, or when combining data from multiple surveys. By extending previous work in the finite population Bayesian bootstrap literature, we propose a method to generate synthetic populations from a posterior predictive distribution in a fashion that inverts the complex sampling design features and generates simple random samples from a superpopulation point of view, adjusting the complex data so that they can be analyzed as simple random samples. We consider a simulation study with a stratified, clustered, unequal-probability-of-selection sample design, and use the proposed nonparametric method to generate synthetic populations for the 2006 National Health Interview Survey (NHIS) and the Medical Expenditure Panel Survey (MEPS), which are stratified, clustered, unequal-probability-of-selection sample designs.

    Release date: 2014-06-27
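
    As a rough sketch of the underlying idea, the plain Bayesian bootstrap below draws Dirichlet(1, ..., 1) probabilities over the observed units and resamples them up to a chosen population size; the proposed method additionally undoes stratification, clustering and unequal selection probabilities, which this sketch does not attempt.

        import numpy as np

        rng = np.random.default_rng(2014)

        def bayesian_bootstrap_population(sample, population_size):
            """Draw one synthetic population from a plain (unweighted) Bayesian bootstrap."""
            n = len(sample)
            probs = rng.dirichlet(np.ones(n))                   # posterior draw of unit probabilities
            idx = rng.choice(n, size=population_size, p=probs)  # resample up to the population size
            return np.asarray(sample)[idx]

        synthetic = bayesian_bootstrap_population([3.2, 5.1, 4.4, 6.0, 2.9], population_size=1000)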

  • Articles and reports: 12-001-X201200211758
    Description:

    This paper develops two Bayesian methods for inference about finite population quantiles of continuous survey variables from unequal probability sampling. The first method estimates cumulative distribution functions of the continuous survey variable by fitting a number of probit penalized spline regression models on the inclusion probabilities. The finite population quantiles are then obtained by inverting the estimated distribution function. This method is quite computationally demanding. The second method predicts non-sampled values by assuming a smoothly-varying relationship between the continuous survey variable and the probability of inclusion, by modeling both the mean function and the variance function using splines. The two Bayesian spline-model-based estimators yield a desirable balance between robustness and efficiency. Simulation studies show that both methods yield smaller root mean squared errors than the sample-weighted estimator and the ratio and difference estimators described by Rao, Kovar, and Mantel (RKM 1990), and are more robust to model misspecification than the regression through the origin model-based estimator described in Chambers and Dunstan (1986). When the sample size is small, the 95% credible intervals of the two new methods have closer to nominal confidence coverage than the sample-weighted estimator.

    Release date: 2012-12-19

  • Articles and reports: 12-001-X201100211606
    Description:

    This paper introduces a U.S. Census Bureau special compilation by presenting four other papers of the current issue: three papers from authors Tillé, Lohr and Thompson as well as a discussion paper from Opsomer.

    Release date: 2011-12-21

  • Articles and reports: 12-001-X201000111250
    Description:

    We propose a Bayesian Penalized Spline Predictive (BPSP) estimator for a finite population proportion in an unequal probability sampling setting. This new method allows the probabilities of inclusion to be directly incorporated into the estimation of a population proportion, using a probit regression of the binary outcome on the penalized spline of the inclusion probabilities. The posterior predictive distribution of the population proportion is obtained using Gibbs sampling. The advantages of the BPSP estimator over the Hájek (HK), Generalized Regression (GR), and parametric model-based prediction estimators are demonstrated by simulation studies and a real example in tax auditing. Simulation studies show that the BPSP estimator is more efficient, and its 95% credible interval provides better confidence coverage with shorter average width than the HK and GR estimators, especially when the population proportion is close to zero or one or when the sample is small. Compared to linear model-based predictive estimators, the BPSP estimators are robust to model misspecification and influential observations in the sample.

    Release date: 2010-06-29

  • Articles and reports: 12-001-X200900211045
    Description:

    In analysis of sample survey data, degrees-of-freedom quantities are often used to assess the stability of design-based variance estimators. For example, these degrees-of-freedom values are used in construction of confidence intervals based on t distribution approximations; and of related t tests. In addition, a small degrees-of-freedom term provides a qualitative indication of the possible limitations of a given variance estimator in a specific application. Degrees-of-freedom calculations sometimes are based on forms of the Satterthwaite approximation. These Satterthwaite-based calculations depend primarily on the relative magnitudes of stratum-level variances. However, for designs involving a small number of primary units selected per stratum, standard stratum-level variance estimators provide limited information on the true stratum variances. For such cases, customary Satterthwaite-based calculations can be problematic, especially in analyses for subpopulations that are concentrated in a relatively small number of strata. To address this problem, this paper uses estimated within-primary-sample-unit (within PSU) variances to provide auxiliary information regarding the relative magnitudes of the overall stratum-level variances. Analytic results indicate that the resulting degrees-of-freedom estimator will be better than modified Satterthwaite-type estimators provided: (a) the overall stratum-level variances are approximately proportional to the corresponding within-stratum variances; and (b) the variances of the within-PSU variance estimators are relatively small. In addition, this paper develops errors-in-variables methods that can be used to check conditions (a) and (b) empirically. For these model checks, we develop simulation-based reference distributions, which differ substantially from reference distributions based on customary large-sample normal approximations. The proposed methods are applied to four variables from the U.S. Third National Health and Nutrition Examination Survey (NHANES III).

    Release date: 2009-12-23
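
    For reference, the customary Satterthwaite-type approximation that the paper builds on can be computed as below for a variance estimator formed as a sum of independent stratum contributions; this is the standard formula, not the authors' within-PSU-based degrees-of-freedom estimator.

        def satterthwaite_df(stratum_variances, stratum_dfs):
            """Satterthwaite approximation: df ~ (sum v_h)^2 / sum(v_h^2 / d_h),
            where v_h is the estimated variance contribution of stratum h
            and d_h its degrees of freedom (e.g., n_h - 1)."""
            numerator = sum(stratum_variances) ** 2
            denominator = sum(v ** 2 / d for v, d in zip(stratum_variances, stratum_dfs))
            return numerator / denominator

        # Two primary units selected per stratum gives d_h = 1 in every stratum.
        satterthwaite_df([4.0, 1.0, 0.5], [1, 1, 1])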

  • Technical products: 11-522-X200800010987
    Description:

    Over the last few years, there has been considerable progress in web data collection. Today, many statistical offices offer a web alternative in many different types of surveys. It is widely believed that web data collection may raise data quality while lowering data collection costs. Experience has shown that, when offered the web as a second alternative to paper questionnaires, enterprises have been slow to embrace the web alternative. On the other hand, experiments have also shown that by promoting web over paper, it is possible to raise web take-up rates. However, there are still few studies of what happens when the contact strategy is changed radically and the web option is the only option given in a complex enterprise survey. In 2008, Statistics Sweden took the step of using a more or less web-only strategy in the survey of industrial production (PRODCOM). The web questionnaire was developed in the generalised tool for web surveys used by Statistics Sweden. The paper presents the web solution and some experiences from the 2008 PRODCOM survey, including process data on response rates and error ratios as well as the results of a cognitive follow-up of the survey. Some important lessons learned are also presented.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010970
    Description:

    RTI International is currently conducting a longitudinal education study. One component of the study involved collecting transcripts and course catalogs from high schools that the sample members attended. Information from the transcripts and course catalogs also needed to be keyed and coded. This presented a challenge because the transcripts and course catalogs were collected from different types of schools, including public, private, and religious schools, from across the nation, and they varied widely in both content and format. The challenge called for a sophisticated system that could be used by multiple users simultaneously. RTI developed such a system: a high-end, multi-user, multitask, user-friendly and low-maintenance system for keying and coding high school transcripts and course catalogs. The system is web based and has three major functions: transcript and catalog keying and coding, transcript and catalog keying quality control (keyer-coder end), and transcript and catalog coding QC (management end). Given the complex nature of transcript and catalog keying and coding, the system was designed to be flexible, with the ability to transport keyed and coded data throughout the system to reduce keying time, the ability to logically guide users through all the pages that a type of activity required, the ability to display appropriate information to help keying performance, and the ability to track all the keying, coding, and QC activities. Hundreds of catalogs and thousands of transcripts were successfully keyed, coded, and verified using the system. This paper will report on the system needs and design, implementation tips, problems faced and their solutions, and lessons learned.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010956
    Description:

    The use of Computer Audio-Recorded Interviewing (CARI) as a tool to identify interview falsification is quickly growing in survey research (Biemer, 2000, 2003; Thissen, 2007). Similarly, survey researchers are starting to expand the usefulness of CARI by combining recordings with coding to address data quality (Herget, 2001; Hansen, 2005; McGee, 2007). This paper presents results from a study included as part of the establishment-based National Center for Health Statistics' National Home and Hospice Care Survey (NHHCS), which used CARI behavior coding and CARI-specific paradata to: 1) identify and correct problematic interviewer behavior or question issues early in the data collection period, before either could negatively impact data quality; and 2) identify ways to diminish measurement error in future implementations of the NHHCS. During the first 9 weeks of the 30-week field period, CARI recorded a subset of questions from the NHHCS application for all interviewers. Recordings were linked with the interview application and output and then coded in one of two modes: Code by Interviewer or Code by Question. The Code by Interviewer method provided visibility into problems specific to an interviewer as well as more generalized problems potentially applicable to all interviewers. The Code by Question method yielded data that spoke to understandability of the questions and other response problems. In this mode, coders coded multiple implementations of the same question across multiple interviewers. Using the Code by Question approach, researchers identified issues with three key survey questions in the first few weeks of data collection and provided guidance to interviewers in how to handle those questions as data collection continued. Results from coding the audio recordings (which were linked with the survey application and output) will inform question wording and interviewer training in the next implementation of the NHHCS, and guide future enhancement of CARI and the coding system.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800011014
    Description:

    In many countries, improved quality of economic statistics is one of the most important goals of the 21st century. First and foremost, the quality of National Accounts is in focus, regarding both annual and quarterly accounts. To achieve this goal, data quality regarding the largest enterprises is of vital importance. To assure that the quality of data for the largest enterprises is good, coherence analysis is an important tool. Coherence means that data from different sources fit together and give a consistent view of the development within these enterprises. Working with coherence analysis in an efficient way is normally a work-intensive task consisting mainly of collecting data from different sources and comparing them in a structured manner. Over the last two years, Statistics Sweden has made great progress in improving the routines for coherence analysis. An IT tool that collects data for the largest enterprises from a large number of sources and presents it in a structured and logical manner has been built, and a systematic approach to analyse data for National Accounts on a quarterly basis has been developed. The paper describes the work in both these areas and gives an overview of the IT tool and the agreed routines.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010991
    Description:

    In the evaluation of prospective survey designs, statistical agencies generally must consider a large number of design factors that may have a substantial impact on both survey costs and data quality. Assessments of trade-offs between cost and quality are often complicated by limitations on the amount of information available regarding fixed and marginal costs related to: instrument redesign and field testing; the number of primary sample units and sample elements included in the sample; assignment of instrument sections and collection modes to specific sample elements; and (for longitudinal surveys) the number and periodicity of interviews. Similarly, designers often have limited information on the impact of these design factors on data quality.

    This paper extends standard design-optimization approaches to account for uncertainty in the abovementioned components of cost and quality. Special attention is directed toward the level of precision required for cost and quality information to provide useful input into the design process; sensitivity of cost-quality trade-offs to changes in assumptions regarding functional forms; and implications for preliminary work focused on collection of cost and quality information. In addition, the paper considers distinctions between cost and quality components encountered in field testing and production work, respectively; incorporation of production-level cost and quality information into adaptive design work; as well as costs and operational risks arising from the collection of detailed cost and quality data during production work. The proposed methods are motivated by, and applied to, work with partitioned redesign of the interview and diary components of the U.S. Consumer Expenditure Survey.

    Release date: 2009-12-03

  • Technical products: 11-536-X200900110809
    Description:

    Cluster sampling and multi-stage designs involve sampling of units from more than one population. Auxiliary information is usually available for the population and sample at each of these levels. Calibration weights for a sample are generally produced using only the auxiliary information at that level. This approach ignores the auxiliary information available at other levels. Moreover, it is often of practical interest to link the calibration weights between samples at different levels. Integrated weighting in cluster sampling ensures that the weights for the units in a cluster are all the same and equal to the cluster weight. This presentation discusses a generalization of integrated weighting to multi-stage sample designs. This is called linked weighting.

    Release date: 2009-08-11
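
    Integrated weighting, as described above, simply carries the cluster weight down to every unit in the cluster; a minimal sketch (with invented cluster labels):

        def integrated_weights(cluster_weights, cluster_of_unit):
            """Give each unit the calibration weight of the cluster it belongs to."""
            return [cluster_weights[c] for c in cluster_of_unit]

        integrated_weights({"c1": 12.5, "c2": 8.0}, ["c1", "c1", "c2", "c2", "c2"])
        # -> [12.5, 12.5, 8.0, 8.0, 8.0]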

  • Articles and reports: 12-001-X200900110880
    Description:

    This paper provides a framework for estimation by calibration in two-phase sampling designs. This work grew out of the continuing development of generalized estimation software at Statistics Canada. An important objective in this development is to provide a wide range of options for effective use of auxiliary information in different sampling designs. This objective is reflected in the general methodology for two-phase designs presented in this paper.

    We consider the traditional two-phase sampling design. A phase one sample is drawn from the finite population and then a phase two sample is drawn as a subsample of the first. The study variable, whose unknown population total is to be estimated, is observed only for the units in the phase two sample. Arbitrary sampling designs are allowed in each phase of sampling. Different types of auxiliary information are identified for the computation of the calibration weights at each phase. The auxiliary variables and the study variables can be continuous or categorical.

    The paper contributes to four important areas in the general context of calibration for two-phase designs: (1) Three broad types of auxiliary information for two-phase designs are identified and used in the estimation. The information is incorporated into the weights in two steps: a phase one calibration and a phase two calibration. We discuss the composition of the appropriate auxiliary vectors for each step, and use a linearization method to arrive at the residuals that determine the asymptotic variance of the calibration estimator. (2) We examine the effect of alternative choices of starting weights for the calibration. The two "natural" choices for the starting weights generally produce slightly different estimators. However, under certain conditions, these two estimators have the same asymptotic variance. (3) We re-examine variance estimation for the two-phase calibration estimator. A new procedure is proposed that can improve significantly on the usual technique of conditioning on the phase one sample. A simulation in section 10 serves to validate the advantage of this new method. (4) We compare the calibration approach with the traditional model-assisted regression technique which uses a linear regression fit at two levels. We show that the model-assisted estimator has properties similar to a two-phase calibration estimator.

    Release date: 2009-06-22
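
    As background, a single-phase linear (GREG-type) calibration adjusts design weights d_i to w_i = d_i(1 + x_i'lambda) so that the weighted auxiliary totals reproduce known benchmarks; the paper applies this kind of adjustment at each of the two phases. The sketch below illustrates only that single-phase step and is not the generalized estimation software discussed.

        import numpy as np

        def linear_calibration(d, X, totals):
            """Linear calibration of design weights d so that w @ X equals the known totals.

            d: (n,) design weights; X: (n, p) auxiliary variables; totals: length-p benchmarks.
            """
            d = np.asarray(d, dtype=float)
            X = np.asarray(X, dtype=float)
            t = np.asarray(totals, dtype=float)
            lam = np.linalg.solve(X.T @ (d[:, None] * X), t - d @ X)
            return d * (1.0 + X @ lam)

        d = np.array([10.0, 10.0, 20.0, 20.0])
        X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0], [1.0, 7.0]])  # intercept + one covariate
        w = linear_calibration(d, X, totals=[65.0, 300.0])
        print(w @ X)  # reproduces the benchmark totals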

  • Articles and reports: 12-001-X200800210760
    Description:

    The design of a stratified simple random sample without replacement from a finite population deals with two main issues: the definition of a rule to partition the population into strata, and the allocation of sampling units to the selected strata. This article examines a tree-based strategy that addresses these issues jointly when the survey is multipurpose and multivariate information, quantitative or qualitative, is available. Strata are formed through a hierarchical divisive algorithm that selects finer and finer partitions by minimizing, at each step, the sample allocation required to achieve the precision levels set for each surveyed variable. In this way, large numbers of constraints can be satisfied without drastically increasing the sample size, and without discarding variables selected for stratification or diminishing the number of their class intervals. Furthermore, the algorithm tends not to define empty or almost empty strata, thus avoiding the need to collapse strata. The procedure was applied to redesign the Italian Farm Structure Survey. The results indicate that the gain in efficiency achieved using our strategy is nontrivial. For a given sample size, this procedure achieves the required precision by exploiting a number of strata which is usually a very small fraction of the number of strata available when combining all possible classes from any of the covariates.

    Release date: 2008-12-23

  • Technical products: 11-522-X200600110409
    Description:

    In unequal-probability-of-selection samples, correlations between the probability of selection and the sampled data can induce bias. Weights equal to the inverse of the probability of selection are often used to counteract this bias. Highly disproportional sample designs have large weights, which can introduce unnecessary variability in statistics such as the estimate of the population mean. Weight trimming reduces large weights to a fixed cutpoint value and adjusts weights below this value to maintain the untrimmed weight sum. This reduces variability at the cost of introducing some bias. Standard approaches are not "data-driven": they do not use the data to make the appropriate bias-variance tradeoff, or else do so in a highly inefficient fashion. This presentation develops Bayesian variable selection methods for weight trimming to supplement standard, ad-hoc design-based methods in disproportional probability-of-inclusion designs where the variance due to the sample weights exceeds the bias correction. These methods are used to estimate linear and generalized linear regression model population parameters in the context of stratified and poststratified known-probability sample designs. Applications will be considered in the context of traffic injury survey data, in which highly disproportional sample designs are often utilized.

    Release date: 2008-03-17
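
    The standard trimming step that the presentation takes as its starting point can be sketched as below: weights above an (arbitrarily chosen) cutpoint are trimmed and the remaining weights are scaled up to preserve the weight sum. The Bayesian variable-selection machinery itself is not reproduced here.

        def trim_weights(weights, cutpoint):
            """Trim weights at the cutpoint, then rescale the untrimmed weights
            so that the total weight sum is preserved."""
            trimmed = [min(w, cutpoint) for w in weights]
            excess = sum(weights) - sum(trimmed)
            untrimmed_total = sum(t for w, t in zip(weights, trimmed) if w <= cutpoint)
            factor = 1.0 + excess / untrimmed_total
            return [t if w > cutpoint else t * factor for w, t in zip(weights, trimmed)]

        trim_weights([5.0, 6.0, 7.0, 30.0], cutpoint=15.0)  # sums to 48, like the input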

  • Technical products: 11-522-X200600110442
    Description:

    The District of Columbia Healthy Outcomes of Pregnancy Education (DC-HOPE) project is a randomized trial funded by the National Institute of Child Health and Human Development to test the effectiveness of an integrated education and counseling intervention (INT) versus usual care (UC) to reduce four risk behaviors among pregnant women. Participants were interviewed at baseline and three additional time points. Multiple imputation (MI) was used to estimate data for missing interviews. MI was done twice: once with all data imputed simultaneously, and once with data for women in the INT and UC groups imputed separately. Analyses of both imputed data sets and the pre-imputation data are compared.

    Release date: 2008-03-17

Data (0) (0 results)

Analysis (21) (21 of 21 results)

  • Articles and reports: 12-001-X200700210498
    Description:

    In this paper we describe a methodology for combining a convenience sample with a probability sample in order to produce an estimator with a smaller mean squared error (MSE) than estimators based on only the probability sample. We then explore the properties of the resulting composite estimator, a linear combination of the convenience and probability sample estimators with weights that are a function of bias. We discuss the estimator's properties in the context of web-based convenience sampling. Our analysis demonstrates that the use of a convenience sample to supplement a probability sample for improvements in the MSE of estimation may be practical only under limited circumstances. First, the remaining bias of the estimator based on the convenience sample must be quite small, equivalent to no more than 0.1 of the outcome's population standard deviation. For a dichotomous outcome, this implies a bias of no more than five percentage points at 50 percent prevalence and no more than three percentage points at 10 percent prevalence. Second, the probability sample should contain at least 1,000-10,000 observations for adequate estimation of the bias of the convenience sample estimator. Third, it must be inexpensive and feasible to collect at least thousands (and probably tens of thousands) of web-based convenience observations. The conclusions about the limited usefulness of convenience samples with estimator bias of more than 0.1 standard deviations also apply to direct use of estimators based on that sample.

    Release date: 2008-01-03
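
    Schematically, a composite estimator of this kind is a convex combination of the two sample means with a mixing weight driven by the estimated bias and variances; the sketch below uses the usual MSE-minimizing weight and is not the exact estimator studied in the paper.

        def composite_estimate(mean_prob, var_prob, mean_conv, var_conv, bias_conv):
            """Combine a probability-sample mean with a convenience-sample mean using
            the MSE-minimizing weight lam = (var_conv + bias^2) / (var_prob + var_conv + bias^2)."""
            lam = (var_conv + bias_conv ** 2) / (var_prob + var_conv + bias_conv ** 2)
            return lam * mean_prob + (1.0 - lam) * mean_conv

        # Large web convenience sample (small variance, some bias) plus a small probability sample.
        composite_estimate(mean_prob=0.42, var_prob=0.0004,
                           mean_conv=0.45, var_conv=0.00005, bias_conv=0.02)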

  • Articles and reports: 82-003-S200700010362
    Description:

    This article summarizes the design, methods and results emerging from the Canadian Health Measures Survey pre-test, which took place from October through December 2004 in Calgary, Alberta.

    Release date: 2007-12-05

  • Articles and reports: 12-001-X20070019849
    Description:

    In sample surveys where units have unequal probabilities of inclusion in the sample, associations between the probability of inclusion and the statistic of interest can induce bias. Weights equal to the inverse of the probability of inclusion are often used to counteract this bias. Highly disproportional sample designs have large weights, which can introduce undesirable variability in statistics such as the population mean estimator or population regression estimator. Weight trimming reduces large weights to a fixed cutpoint value and adjusts weights below this value to maintain the untrimmed weight sum, reducing variability at the cost of introducing some bias. Most standard approaches are ad-hoc in that they do not use the data to optimize bias-variance tradeoffs. Approaches described in the literature that are data-driven are a little more efficient than fully-weighted estimators. This paper develops Bayesian methods for weight trimming of linear and generalized linear regression estimators in unequal probability-of-inclusion designs. An application to estimate injury risk of children rear-seated in compact extended-cab pickup trucks using the Partners for Child Passenger Safety surveillance survey is considered.

    Release date: 2007-06-28

  • Articles and reports: 12-001-X20030026780
    Description:

    Coverage errors and other coverage issues related to population censuses are examined in the light of the recent literature. In particular, when actual population census counts of persons are matched with their corresponding post-enumeration survey counts, the aggregated results in a dual-record-system setting can provide coverage error statistics.

    In this paper, coverage error issues are evaluated and alternative solutions are discussed in the light of the results from the latest Population Census of Turkey. Using the census and post-enumeration survey data, a regional comparison of census coverage was also made and showed greater variability among regions. Some methodological remarks are also made on possible improvements to the current enumeration procedures.

    Release date: 2004-01-27
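
    In its simplest form, the dual record system calculation referred to above is the Lincoln-Petersen (dual-system) estimator; the counts in the example are invented.

        def dual_system_estimate(census_count, pes_count, matched_count):
            """Lincoln-Petersen dual-system estimate of the true population size:
            N_hat = census_count * pes_count / matched_count."""
            return census_count * pes_count / matched_count

        n_hat = dual_system_estimate(census_count=950_000, pes_count=50_000, matched_count=47_000)
        undercoverage = 1 - 950_000 / n_hat  # estimated share of people missed by the census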

  • Articles and reports: 88-003-X20020026371
    Description:

    When constructing questions for questionnaires, one of the rules of thumb has always been "keep it short and simple." This article is the third in a series of lessons learned during cognitive testing of the pilot Knowledge Management Practices Survey. It studies the responses given to long questions, thick questionnaires and too many response boxes.

    Release date: 2002-06-14

  • Articles and reports: 88-003-X20020026369
    Description:

    Eliminating the "neutral" response in an opinion question not only encourages the respondent to choose a side, it gently persuades respondents to read the question. Learn how we used this technique to our advantage in the Knowledge Management Practices Survey, 2001.

    Release date: 2002-06-14

  • Articles and reports: 12-001-X19990024884
    Description:

    In the final paper of this special issue, Estevao and Särndal consider two types of design-based estimators used for domain estimation. The first, a linear prediction estimator, is built on the principle of model fitting, requires known auxiliary information at the domain level, and results in weights that depend on the domain to be estimated. The second, a uni-weight estimator, has weights which are independent of the domain being estimated and has the clear advantage that it does not require the calculation of different weight systems for each different domain of interest. These estimators are compared and situations under which one is preferred over the other are identified.

    Release date: 2000-03-01

  • Articles and reports: 12-001-X19970013103
    Description:

    This paper discusses the use of some simple diagnostics to guide the formation of nonresponse adjustment cells. Following Little (1986), we consider construction of adjustment cells by grouping sample units according to their estimated response probabilities or estimated survey items. Four issues receive principal attention: assessment of the sensitivity of adjusted mean estimates to changes in k, the number of cells used; identification of specific cells that require additional refinement; comparison of adjusted and unadjusted mean estimates; and comparison of estimation results from estimated-probability and estimated-item based cells. The proposed methods are motivated and illustrated with an application involving estimation of mean consumer unit income from the U.S. Consumer Expenditure Survey.

    Release date: 1997-08-18
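
    A hedged sketch of the propensity-cell idea discussed above: group units into k cells on their estimated response probabilities and inflate respondent weights by the inverse of each cell's weighted response rate. The choice of k = 5 and the variable names are illustrative; the propensities would come from a separate response model.

        import numpy as np

        def propensity_cell_adjustment(weights, responded, propensities, k=5):
            """Nonresponse adjustment using k cells formed from estimated response probabilities.

            weights: base design weights; responded: 0/1 response indicator;
            propensities: estimated response probabilities for every sampled unit.
            """
            weights = np.asarray(weights, dtype=float)
            responded = np.asarray(responded, dtype=bool)
            cuts = np.quantile(propensities, np.linspace(0, 1, k + 1)[1:-1])
            cells = np.digitize(propensities, cuts)

            adjusted = np.zeros_like(weights)
            for c in np.unique(cells):
                in_cell = cells == c
                rate = weights[in_cell & responded].sum() / weights[in_cell].sum()
                adjusted[in_cell & responded] = weights[in_cell & responded] / rate
            return adjusted

        rng = np.random.default_rng(0)
        p = rng.uniform(0.2, 0.9, size=200)
        adjusted = propensity_cell_adjustment(np.full(200, 10.0), rng.uniform(size=200) < p, p)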

  • Articles and reports: 12-001-X19960022982
    Description:

    In work with sample surveys, we often use estimators of the variance components associated with sampling within and between primary sample units. For these applications, it can be important to have some indication of whether the variance component estimators are stable, i.e., have relatively low variance. This paper discusses several data-based measures of the stability of design-based variance component estimators and related quantities. The development emphasizes methods that can be applied to surveys with moderate or large numbers of strata and small numbers of primary sample units per stratum. We direct principal attention toward the design variance of a within-PSU variance estimator, and two related degrees-of-freedom terms. A simulation-based method allows one to assess whether an observed stability measure is consistent with standard assumptions regarding variance estimator stability. We also develop two sets of stability measures for design-based estimators of between-PSU variance components and the ratio of the overall variance to the within-PSU variance. The proposed methods are applied to interview and examination data from the U.S. Third National Health and Nutrition Examination Survey (NHANES III). These results indicate that the true stability properties may vary substantially across variables. In addition, for some variables, within-PSU variance estimators appear to be considerably less stable than one would anticipate from a simple count of secondary units within each stratum.

    Release date: 1997-01-30

  • Articles and reports: 12-001-X199500214395
    Description:

    When redesigning a sample with a stratified multi-stage design, it is sometimes considered desirable to maximize the number of primary sampling units retained in the new sample without altering unconditional selection probabilities. For this problem, an optimal solution which uses transportation theory exists for a very general class of designs. However, this procedure has never been used in the redesign of any survey (that the authors are aware of), in part because even for moderately-sized strata, the resulting transportation problem may be too large to solve in practice. In this paper, a modified reduced-size transportation algorithm is presented for maximizing the overlap, which substantially reduces the size of the problem. This reduced-size overlap procedure was used in the recent redesign of the Survey of Income and Program Participation (SIPP). The performance of the reduced-size algorithm is summarized, both for the actual production SIPP overlap and for earlier, artificial simulations of the SIPP overlap. Although the procedure is not optimal and theoretically can produce only negligible improvements in expected overlap compared to independent selection, in practice it gave substantial improvements in overlap over independent selection for SIPP, and generally provided an overlap that is close to optimal.

    Release date: 1995-12-15

  • Articles and reports: 12-001-X198800214583
    Description:

    This note provides an overview of SQL, highlighting its strengths and weaknesses.

    Release date: 1988-12-15

Reference (31)

Reference (31) (25 of 31 results)

  • Technical products: 11-522-X201700014747
    Description:

    The Longitudinal Immigration Database (IMDB) combines the Immigrant Landing File (ILF) with annual tax files. This record linkage is performed using a tax filer database. The ILF includes all immigrants who have landed in Canada since 1980. To enhance the IMDB, the possibility of adding temporary residents (TR) and immigrants who landed between 1952 and 1979 (PRE80) was studied. Adding this information would give a more complete picture of the immigrant population living in Canada. To integrate the TR and PRE80 files into the IMDB, record linkages between these two files and the tax filer database were performed. This exercise was challenging in part due to the presence of duplicates in the files and conflicting links between the different record linkages.

    Release date: 2016-03-24

  • Technical products: 11-522-X201700014739
    Description:

    Vital statistics datasets such as the Canadian Mortality Database lack identifiers for certain populations of interest such as First Nations, Métis and Inuit. Record linkage between vital statistics and survey or other administrative datasets can circumvent this limitation. This paper describes a linkage between the Canadian Mortality Database and the 2006 Census of the Population and the planned analysis using the linked data.

    Release date: 2016-03-24

  • Technical products: 11-522-X201700014732
    Description:

    The Institute for Employment Research (IAB) is the research unit of the German Federal Employment Agency. Via the Research Data Centre (FDZ) at the IAB, administrative and survey data on individuals and establishments are provided to researchers. In cooperation with the Institute for the Study of Labor (IZA), the FDZ has implemented the Job Submission Application (JoSuA) environment, which enables researchers to submit jobs for remote data execution through a custom-built web interface. Moreover, two types of user-generated output files are distinguished within the JoSuA environment, which allows for faster and more efficient disclosure review services.

    Release date: 2016-03-24

  • Technical products: 11-522-X201300014282
    Description:

    The IAB-Establishment Panel is the most comprehensive establishment survey in Germany, with almost 16,000 firms participating every year. Face-to-face interviews with paper and pencil (PAPI) have been conducted since 1993. An ongoing project examines possible effects of changing the survey to computer-aided personal interviews (CAPI) combined with a web-based version of the questionnaire (CAWI). As a first step, questions about internet access, willingness to complete the questionnaire online and reasons for refusal were included in the 2012 wave. First results indicate a widespread refusal to take part in a web survey. A closer look reveals that smaller establishments, long-time participants and older respondents are reluctant to use the internet.

    Release date: 2014-10-31

  • Technical products: 11-522-X201300014262
    Description:

    Measurement error is one source of bias in statistical analysis, but its possible implications are mostly ignored. One class of models that can be especially affected by measurement error is fixed-effects models. By validating the survey responses of five panel survey waves for welfare receipt against register data, the size and form of longitudinal measurement error can be determined. It is shown that the measurement error for welfare receipt is serially correlated and non-differential. However, when estimating the coefficients of longitudinal fixed-effects models of the effect of welfare receipt on subjective health for men and women, the coefficients are biased only for the male subpopulation.
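
    For context, the following is a minimal sketch, with hypothetical column names, of the within transformation behind such fixed-effects models; the paper's point is that error in the welfare-receipt measure can bias the resulting coefficient.

      import numpy as np
      import pandas as pd

      def within_estimator(df, y='health', x='welfare', unit='person_id'):
          """One-regressor fixed-effects (within) estimator: demean y and x
          within each unit, then regress demeaned y on demeaned x by OLS."""
          g = df.groupby(unit)
          y_dm = df[y] - g[y].transform('mean')
          x_dm = df[x] - g[x].transform('mean')
          return np.sum(x_dm * y_dm) / np.sum(x_dm ** 2)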

    Release date: 2014-10-31

  • Technical products: 11-522-X200800010987
    Description:

    Over the last few years, there has been substantial progress in web data collection. Today, many statistical offices offer a web alternative in many different types of surveys. It is widely believed that web data collection may raise data quality while lowering data collection costs. Experience has shown that, when offered the web as a second alternative to paper questionnaires, enterprises have been slow to embrace it. On the other hand, experiments have also shown that by promoting web over paper it is possible to raise web take-up rates. However, there are still few studies on what happens when the contact strategy is changed radically and the web option is the only option given in a complex enterprise survey. In 2008, Statistics Sweden took the step of using a more or less web-only strategy in the survey of industrial production (PRODCOM). The web questionnaire was developed in the generalised tool for web surveys used by Statistics Sweden. The paper presents the web solution and some experiences from the 2008 PRODCOM survey, including process data on response rates and error ratios as well as the results of a cognitive follow-up of the survey. Some important lessons learned are also presented.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010970
    Description:

    RTI International is currently conducting a longitudinal education study. One component of the study involved collecting transcripts and course catalogs from the high schools that sample members attended; information from these documents also needed to be keyed and coded. This presented a challenge because the transcripts and course catalogs were collected from different types of schools, including public, private and religious schools, from across the nation, and they varied widely in both content and format. The challenge called for a sophisticated system that could be used by multiple users simultaneously. RTI developed such a system: a web-based, multi-user, multitask, user-friendly and low-maintenance application for keying and coding high school transcripts and course catalogs. The system has three major functions: transcript and catalog keying and coding; keying quality control (keyer-coder end); and coding quality control (management end). Given the complex nature of the work, the system was designed to be flexible, with the ability to carry keyed and coded data throughout the system to reduce keying time, to guide users logically through all the pages that a given activity requires, to display appropriate information to support keying performance, and to track all keying, coding and QC activities. Hundreds of catalogs and thousands of transcripts were successfully keyed, coded and verified using the system. This paper reports on the system needs and design, implementation tips, problems faced and their solutions, and lessons learned.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010956
    Description:

    The use of Computer Audio-Recorded Interviewing (CARI) as a tool to identify interview falsification is quickly growing in survey research (Biemer, 2000, 2003; Thissen, 2007). Similarly, survey researchers are starting to expand the usefulness of CARI by combining recordings with coding to address data quality (Herget, 2001; Hansen, 2005; McGee, 2007). This paper presents results from a study, included as part of the establishment-based National Center for Health Statistics' National Home and Hospice Care Survey (NHHCS), which used CARI behavior coding and CARI-specific paradata to: 1) identify and correct problematic interviewer behavior or question issues early in the data collection period, before either could negatively impact data quality; and 2) identify ways to diminish measurement error in future implementations of the NHHCS. During the first 9 weeks of the 30-week field period, CARI recorded a subset of questions from the NHHCS application for all interviewers. Recordings were linked with the interview application and output, and then coded in one of two modes: Code by Interviewer or Code by Question. The Code by Interviewer method provided visibility into problems specific to an interviewer as well as more generalized problems potentially applicable to all interviewers. The Code by Question method yielded data on the understandability of the questions and other response problems. In this mode, coders coded multiple implementations of the same question across multiple interviewers. Using the Code by Question approach, researchers identified issues with three key survey questions in the first few weeks of data collection and provided guidance to interviewers on how to handle those questions as data collection continued. Results from coding the audio recordings (which were linked with the survey application and output) will inform question wording and interviewer training in the next implementation of the NHHCS, and guide future enhancement of CARI and the coding system.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800011014
    Description:

    In many countries, improved quality of economic statistics is one of the most important goals of the 21st century. First and foremost, the quality of the National Accounts is in focus, regarding both annual and quarterly accounts. To achieve this goal, data quality regarding the largest enterprises is of vital importance. To assure that the quality of data for the largest enterprises is good, coherence analysis is an important tool. Coherence means that data from different sources fit together and give a consistent view of the development within these enterprises. Working with coherence analysis in an efficient way is normally a work-intensive task consisting mainly of collecting data from different sources and comparing them in a structured manner. Over the last two years, Statistics Sweden has made great progress in improving its routines for coherence analysis. An IT tool that collects data for the largest enterprises from a large number of sources and presents it in a structured and logical manner has been built, and a systematic approach to analysing data for the National Accounts on a quarterly basis has been developed. The paper describes the work in both these areas and gives an overview of the IT tool and the agreed routines.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010991
    Description:

    In the evaluation of prospective survey designs, statistical agencies generally must consider a large number of design factors that may have a substantial impact on both survey costs and data quality. Assessments of trade-offs between cost and quality are often complicated by limitations on the amount of information available regarding fixed and marginal costs related to: instrument redesign and field testing; the number of primary sample units and sample elements included in the sample; assignment of instrument sections and collection modes to specific sample elements; and (for longitudinal surveys) the number and periodicity of interviews. Similarly, designers often have limited information on the impact of these design factors on data quality.

    This paper extends standard design-optimization approaches to account for uncertainty in the abovementioned components of cost and quality. Special attention is directed toward the level of precision required for cost and quality information to provide useful input into the design process; sensitivity of cost-quality trade-offs to changes in assumptions regarding functional forms; and implications for preliminary work focused on collection of cost and quality information. In addition, the paper considers distinctions between cost and quality components encountered in field testing and production work, respectively; incorporation of production-level cost and quality information into adaptive design work; as well as costs and operational risks arising from the collection of detailed cost and quality data during production work. The proposed methods are motivated by, and applied to, work with partitioned redesign of the interview and diary components of the U.S. Consumer Expenditure Survey.
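
    To make the cost-quality trade-off concrete, the following is a minimal sketch, not the paper's method, of classic cost-constrained optimum allocation across strata, with a loop showing how the allocation and the achieved variance shift when the assumed per-unit costs are off by 20 percent in either direction.

      import numpy as np

      def optimum_allocation(W, S, c, budget):
          """Textbook cost-constrained optimum allocation: n_h proportional to
          W_h * S_h / sqrt(c_h), scaled so that sum(c_h * n_h) = budget.
          Returns the allocation and the variance of the stratified mean
          (finite-population correction ignored)."""
          W, S, c = map(np.asarray, (W, S, c))
          n = budget * (W * S / np.sqrt(c)) / np.sum(W * S * np.sqrt(c))
          return n, np.sum(W**2 * S**2 / n)

      W, S, c, budget = [0.6, 0.4], [10.0, 25.0], [20.0, 50.0], 10000.0
      for scale in (0.8, 1.0, 1.2):            # +/- 20% error in assumed costs
          n, var = optimum_allocation(W, S, np.multiply(c, scale), budget)
          print(scale, n.round(1), round(var, 4))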

    Release date: 2009-12-03

  • Technical products: 11-536-X200900110809
    Description:

    Cluster sampling and multi-stage designs involve sampling units from more than one population. Auxiliary information is usually available for the population and sample at each of these levels. Calibration weights for a sample are generally produced using only the auxiliary information at that level. This approach ignores the auxiliary information available at other levels. Moreover, it is often of practical interest to link the calibration weights between samples at different levels. Integrated weighting in cluster sampling ensures that the weights for the units in a cluster are all the same and equal to the cluster weight. This presentation discusses a generalization of integrated weighting to multi-stage sample designs, called linked weighting.
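
    The following is a minimal sketch, with a hypothetical data layout, of the integrated-weighting idea mentioned above: calibrate once at the cluster level with linear (GREG-type) calibration, then give every unit in a cluster its cluster's calibrated weight.

      import numpy as np
      import pandas as pd

      def integrated_weights(units, cluster_x, cluster_totals, d):
          """units: DataFrame with a 'cluster' column, one row per element.
          cluster_x: DataFrame of cluster-level auxiliaries, one row per
          sampled cluster, indexed by cluster id.  cluster_totals: known
          population totals of those auxiliaries.  d: cluster design weights.
          Returns element-level weights that are identical within a cluster."""
          X = cluster_x.to_numpy(dtype=float)
          t = np.asarray(cluster_totals, dtype=float)
          d = np.asarray(d, dtype=float)
          # Linear calibration: w_c = d_c * (1 + x_c' lambda), with lambda chosen
          # so that the calibrated cluster weights reproduce cluster_totals.
          lam = np.linalg.solve(X.T @ (d[:, None] * X), t - X.T @ d)
          w_cluster = d * (1.0 + X @ lam)
          return units['cluster'].map(pd.Series(w_cluster, index=cluster_x.index))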

    Release date: 2009-08-11

  • Technical products: 11-522-X200600110409
    Description:

    In unequal-probability-of-selection samples, correlations between the probability of selection and the sampled data can induce bias. Weights equal to the inverse of the probability of selection are often used to counteract this bias. Highly disproportional sample designs have large weights, which can introduce unnecessary variability in statistics such as the estimated population mean. Weight trimming reduces large weights to a fixed cutpoint value and adjusts the weights below this value to maintain the untrimmed weight sum. This reduces variability at the cost of introducing some bias. Standard approaches are not "data-driven": they do not use the data to make the appropriate bias-variance trade-off, or else do so in a highly inefficient fashion. This presentation develops Bayesian variable selection methods for weight trimming to supplement standard, ad hoc design-based methods in disproportional probability-of-inclusion designs where the variance due to the sample weights exceeds the bias correction. These methods are used to estimate linear and generalized linear regression model population parameters in the context of stratified and poststratified known-probability sample designs. Applications will be considered in the context of traffic injury survey data, in which highly disproportional sample designs are often utilized.
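
    For reference, the following is a minimal sketch of the standard design-based trimming step that the abstract contrasts with the proposed Bayesian methods: weights above a cutpoint are set to the cutpoint, and the remaining weights are scaled up so the total weight is preserved.

      import numpy as np

      def trim_weights(w, cutpoint):
          """Trim weights at `cutpoint` and rescale the untrimmed weights
          proportionally so that the weight sum is unchanged.  Rescaling can
          push further weights over the cutpoint, so repeat until none do."""
          w = np.asarray(w, dtype=float)
          total = w.sum()
          out, capped = w.copy(), np.zeros(len(w), dtype=bool)
          while True:
              newly = (out > cutpoint) & ~capped
              if not newly.any():
                  return out
              capped |= newly
              out[capped] = cutpoint
              free = ~capped
              if not free.any():
                  return out               # every weight is at the cutpoint
              out[free] = w[free] * (total - cutpoint * capped.sum()) / w[free].sum()

      w = np.array([1.0, 2.0, 3.0, 4.0, 90.0])
      print(trim_weights(w, cutpoint=30.0))    # sums to w.sum() = 100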

    Release date: 2008-03-17

  • Technical products: 11-522-X200600110442
    Description:

    The District of Columbia Healthy Outcomes of Pregnancy Education (DC-HOPE) project is a randomized trial funded by the National Institute of Child Health and Human Development to test the effectiveness of an integrated education and counseling intervention (INT) versus usual care (UC) to reduce four risk behaviors among pregnant women. Participants were interviewed at baseline and three additional time points. Multiple imputation (MI) was used to estimate data for missing interviews. MI was done twice: once with all data imputed simultaneously, and once with data for women in the INT and UC groups imputed separately. Analyses of both imputed data sets and the pre-imputation data are compared.
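
    For readers unfamiliar with how multiply imputed results are combined, the following is a minimal sketch of Rubin's combining rules; the pooled-versus-separate comparison in the abstract concerns the imputation step itself, and these rules apply either way. The input values shown are purely illustrative.

      import numpy as np

      def rubin_combine(estimates, variances):
          """Combine m completed-data estimates and their variances using
          Rubin's rules: pooled estimate, total variance, and the share of
          the total variance due to between-imputation variability."""
          q = np.asarray(estimates, dtype=float)
          u = np.asarray(variances, dtype=float)
          m = len(q)
          qbar = q.mean()
          ubar = u.mean()                    # within-imputation variance
          b = q.var(ddof=1)                  # between-imputation variance
          t = ubar + (1 + 1 / m) * b         # total variance
          return qbar, t, (1 + 1 / m) * b / t

      print(rubin_combine([0.42, 0.45, 0.40, 0.44, 0.43],
                          [0.004, 0.004, 0.005, 0.004, 0.004]))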

    Release date: 2008-03-17

  • Technical products: 11-522-X200600110410
    Description:

    The U.S. Survey of Occupational Illnesses and Injuries (SOII) is a large-scale establishment survey conducted by the Bureau of Labor Statistics to measure incidence rates and impact of occupational illnesses and injuries within specified industries at the national and state levels. This survey currently uses relatively simple procedures for detection and treatment of outliers. The outlier-detection methods center on comparison of reported establishment-level incidence rates to the corresponding distribution of reports within specified cells defined by the intersection of state and industry classifications. The treatment methods involve replacement of standard probability weights with a weight set equal to one, followed by a benchmark adjustment.

    One could use more complex methods for detection and treatment of outliers for the SOII, e.g., detection methods that use influence functions, probability weights and multivariate observations, or treatment methods based on Winsorization or M-estimation. Evaluation of the practical benefits of these more complex methods requires one to consider three important factors. First, severe outliers are relatively rare, but when they occur, they may have a severe impact on SOII estimators in cells defined by the intersection of states and industries. Consequently, practical evaluation of the impact of outlier methods focuses primarily on the tails of the distributions of estimators, rather than on standard aggregate performance measures like variance or mean squared error. Second, the analytic and data-based evaluations focus on the incremental improvement obtained through use of the more complex methods, relative to the performance of the simple methods currently in place. Third, development of the abovementioned tools requires somewhat nonstandard asymptotics that reflect trade-offs in effects associated with, respectively, increasing sample sizes; increasing numbers of publication cells; and changing tails of the underlying distributions of observations.
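
    The following is a minimal sketch, with hypothetical column names and an illustrative interquartile-range rule, of the simple approach described above: flag establishment incidence rates that are extreme relative to their state-by-industry cell and replace the flagged units' probability weights with one (the subsequent benchmark adjustment is omitted).

      import numpy as np
      import pandas as pd

      def flag_and_downweight(df, k=3.0):
          """Flag an incidence rate as an outlier when it lies more than k
          interquartile ranges above the third quartile of its
          state-by-industry cell, then set flagged weights to one."""
          df = df.copy()
          cell = df.groupby(['state', 'industry'])['rate']
          q1 = cell.transform(lambda s: s.quantile(0.25))
          q3 = cell.transform(lambda s: s.quantile(0.75))
          df['outlier'] = df['rate'] > q3 + k * (q3 - q1)
          df['w_adj'] = np.where(df['outlier'], 1.0, df['weight'])
          return df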

    Release date: 2008-03-17

  • Technical products: 11-522-X20050019445
    Description:

    This paper describes an innovative use of data mining on response data and metadata to identify, characterize and prevent falsification by field interviewers on the National Survey on Drug Use and Health (NSDUH). Interviewer falsification is the deliberate creation of survey responses by the interviewer without input from the respondent.

    Release date: 2007-03-02

  • Technical products: 11-522-X20050019455
    Description:

    The Data Documentation Initiative (DDI) is an internationally developed standard for creating metadata. The Data Liberation Initiative (DLI), along with partner universities including the University of Guelph, is working towards the goal of creating metadata for all Statistics Canada surveys available to the DLI community.

    Release date: 2007-03-02

  • Technical products: 11-522-X20040018749
    Description:

    In its attempt to measure the mental health of Cambodian refugees in the U.S., the Rand Corporation introduces a novel methodology for efficiently listing, screening and identifying households, ultimately yielding a random sample of eligible participants.

    Release date: 2005-10-27

  • Technical products: 11-522-X20040018747
    Description:

    This document describes the development and pilot of the first American Indian and Alaska Native Adult Tobacco Survey. Meetings with expert panels and tribal representatives helped to adapt methods.

    Release date: 2005-10-27

  • Technical products: 11-522-X20040018755
    Description:

    This paper reviews the robustness of methods for dealing with response errors for rare populations. It also reviews problems with weighting schemes for these populations. It develops an asymptotic framework intended to deal with such problems.

    Release date: 2005-10-27

  • Technical products: 11-522-X20030017700
    Description:

    This paper suggests a useful framework for exploring the effects of moderate deviations from idealized conditions. It offers evaluation criteria for point estimators and interval estimators.

    Release date: 2005-01-26

  • Technical products: 11-522-X20020016714
    Description:

    In this highly technical paper, we illustrate the application of the delete-a-group jack-knife variance estimator approach to a particular complex multi-wave longitudinal study, demonstrating its utility for linear regression and other analytic models. The delete-a-group jack-knife variance estimator is proving a very useful tool for measuring variances under complex sampling designs. This technique divides the first-phase sample into mutually exclusive and nearly equal variance groups, deletes one group at a time to create a set of replicates and makes analogous weighting adjustments in each replicate to those done for the sample as a whole. Variance estimation proceeds in the standard (unstratified) jack-knife fashion.
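
    The following is a minimal sketch, not the CHAP implementation, of the delete-a-group jack-knife just described, applied to a weighted mean with G randomly formed, nearly equal groups.

      import numpy as np

      def delete_a_group_jackknife(y, w, G=30, seed=1):
          """Delete-a-group jack-knife variance of a weighted mean: assign
          units at random to G nearly equal groups, delete one group at a
          time, re-estimate from the remaining units, and combine."""
          rng = np.random.default_rng(seed)
          y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
          groups = rng.permutation(np.arange(len(y)) % G)
          full = np.average(y, weights=w)
          reps = np.array([np.average(y[groups != g], weights=w[groups != g])
                           for g in range(G)])
          return full, (G - 1) / G * np.sum((reps - full) ** 2)

    For a weighted mean the within-replicate reweighting factor cancels; for more general estimators the retained weights would be adjusted in each replicate, as the description above indicates.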

    Our application is to the Chicago Health and Aging Project (CHAP), a community-based longitudinal study examining risk factors for chronic health problems of older adults. A major aim of the study is the investigation of risk factors for incident Alzheimer's disease. The current design of CHAP has two components: (1) Every three years, all surviving members of the cohort are interviewed on a variety of health-related topics. These interviews include cognitive and physical function measures. (2) At each of these waves of data collection, a stratified Poisson sample is drawn from among the respondents to the full population interview for detailed clinical evaluation and neuropsychological testing. To investigate risk factors for incident disease, a 'disease-free' cohort is identified at the preceding time point and forms one major stratum in the sampling frame.

    We provide proofs of the theoretical applicability of the delete-a-group jack-knife for particular estimators under this Poisson design, paying needed attention to the distinction between finite-population and infinite-population (model) inference. In addition, we examine the issue of determining the 'right number' of variance groups.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016718
    Description:

    Cancer surveillance research requires accurate estimates of risk factors at the small area level. These risk factors are often obtained from surveys such as the National Health Interview Survey (NHIS) or the Behavioral Risk Factors Surveillance Survey (BRFSS). Unfortunately, no one population-based survey provides ideal prevalence estimates of such risk factors. One strategy is to combine information from multiple surveys, using the complementary strengths of one survey to compensate for the weakness of the other. The NHIS is a nationally representative, face-to-face survey with a high response rate; however, it cannot produce state or substate estimates of risk factor prevalence because sample sizes are too small. The BRFSS is a state-level telephone survey that excludes non-telephone households and has a lower response rate, but does provide reasonable sample sizes in all states and many counties. Several methods are available for constructing small-area estimators that combine information from both the NHIS and the BRFSS, including direct estimators, estimators under hierarchical Bayes models and model-assisted estimators. In this paper, we focus on the latter, constructing generalized regression (GREG) and 'minimum-distance' estimators and using existing and newly developed small-area smoothing techniques to smooth the resulting estimators.
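
    As a pointer to what the model-assisted option involves, the following is a minimal sketch of a generalized regression (GREG) estimator of a mean; it is illustrative only, and the small-area versions in the paper add the smoothing discussed above.

      import numpy as np

      def greg_mean(y, X, d, x_pop_means):
          """GREG estimator of a population mean: survey-weighted mean of y
          plus a regression adjustment toward known population means of the
          auxiliary variables in X."""
          y, X, d = (np.asarray(a, dtype=float) for a in (y, X, d))
          beta = np.linalg.solve(X.T @ (d[:, None] * X), X.T @ (d * y))
          wmean_y = np.sum(d * y) / np.sum(d)
          wmean_x = (d @ X) / np.sum(d)
          return wmean_y + (np.asarray(x_pop_means, dtype=float) - wmean_x) @ beta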

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016750
    Description:

    Analyses of data from social and economic surveys sometimes use generalized variance function models to approximate the design variance of point estimators of population means and proportions. Analysts may use the resulting standard error estimates to compute associated confidence intervals or test statistics for the means and proportions of interest. In comparison with design-based variance estimators computed directly from survey microdata, generalized variance function models have several potential advantages, as will be discussed in this paper, including operational simplicity; increased stability of standard errors; and, for cases involving public-use datasets, reduction of disclosure limitation problems arising from the public release of stratum and cluster indicators.

    These potential advantages, however, may be offset in part by several inferential issues. First, the properties of inferential statistics based on generalized variance functions (e.g., confidence interval coverage rates and widths) depend heavily on the relative empirical magnitudes of the components of variability associated, respectively, with:

    (a) the random selection of a subset of items used in estimation of the generalized variance function model;
    (b) the selection of sample units under a complex sample design;
    (c) the lack of fit of the generalized variance function model; and
    (d) the generation of a finite population under a superpopulation model.

    Second, under certain conditions, one may link each of components (a) through (d) with different empirical measures of the predictive adequacy of a generalized variance function model. Consequently, these measures of predictive adequacy can offer some insight into the extent to which a given generalized variance function model may be appropriate for inferential use in specific applications.

    Some of the proposed diagnostics are applied to data from the U.S. Survey of Doctoral Recipients and the U.S. Current Employment Survey. For the Survey of Doctoral Recipients, components (a), (c) and (d) are of principal concern. For the Current Employment Survey, components (b), (c) and (d) receive principal attention, and the availability of population microdata allows the development of especially detailed models for components (b) and (c).
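
    The following is a minimal sketch of fitting one classic generalized variance function form, relative variance approximated by a + b/x with x an estimated total, by ordinary least squares to a set of direct variance estimates; the paper's diagnostics concern how far such a fit can be trusted for inference, not the fit itself.

      import numpy as np

      def fit_gvf(estimates, variances):
          """Fit relvariance(x) = a + b / x by least squares, where x is an
          estimated total and relvariance = variance / x**2.  Returns (a, b)
          and a function giving the predicted standard error for a new x."""
          x = np.asarray(estimates, dtype=float)
          relvar = np.asarray(variances, dtype=float) / x**2
          A = np.column_stack([np.ones_like(x), 1.0 / x])
          (a, b), *_ = np.linalg.lstsq(A, relvar, rcond=None)
          return a, b, lambda x0: x0 * np.sqrt(a + b / x0)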

    Release date: 2004-09-13
