Articles and reports: 12-001-X201700114822
Description:

We use a Bayesian method to make inference about a finite population proportion when binary data are collected using a two-fold sample design from small areas. The two-fold sample design has a two-stage cluster sample design within each area. A former hierarchical Bayesian model assumes that for each area the first stage binary responses are independent Bernoulli distributions, and the probabilities have beta distributions which are parameterized by a mean and a correlation coefficient. The means vary with areas but the correlation is the same over areas. However, to gain some flexibility we have now extended this model to accommodate different correlations. The means and the correlations have independent beta distributions. We call the former model a homogeneous model and the new model a heterogeneous model. All hyperparameters have proper noninformative priors. An additional complexity is that some of the parameters are weakly identified, making it difficult to use a standard Gibbs sampler for computation. So we have used unimodal constraints for the beta prior distributions and a blocked Gibbs sampler to perform the computation. We have compared the heterogeneous and homogeneous models using an illustrative example and a simulation study. As expected, the two-fold model with heterogeneous correlations is preferred.
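
As an illustration of what parameterizing a beta distribution by a mean and a correlation can look like, the sketch below uses one common convention, assumed here rather than taken from the paper: the correlation is 1 / (alpha + beta + 1). The function names are hypothetical, and the unimodality check (both shape parameters above 1) is one possible reading of a "unimodal constraint".

```python
def beta_shapes_from_mean_corr(mean, corr):
    """Convert (mean, correlation) to beta shape parameters (alpha, beta).

    Assumed convention: mean = alpha / (alpha + beta) and
    corr = 1 / (alpha + beta + 1).
    """
    if not (0.0 < mean < 1.0 and 0.0 < corr < 1.0):
        raise ValueError("mean and corr must lie strictly between 0 and 1")
    scale = (1.0 - corr) / corr          # equals alpha + beta
    return mean * scale, (1.0 - mean) * scale


def is_unimodal(alpha, beta):
    """A beta density has a single interior mode when both shapes exceed 1."""
    return alpha > 1.0 and beta > 1.0


alpha, beta = beta_shapes_from_mean_corr(0.3, 0.2)
print(alpha, beta, is_unimodal(alpha, beta))
```

Under this convention a smaller correlation gives a larger alpha + beta, that is, a beta distribution more tightly concentrated around its mean.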

Release date: 2017-06-22

Articles and reports: 82-003-X201700614829
Description:

POHEM-BMI is a microsimulation tool that includes a model of adult body mass index (BMI) and a model of childhood BMI history. This overview describes the development of BMI prediction models for adults and of childhood BMI history, and compares projected BMI estimates with those from nationally representative survey data to establish validity.

Release date: 2017-06-21

Articles and reports: 12-001-X201600114543
Description:

The regression estimator is extensively used in practice because it can improve the reliability of the estimated parameters of interest such as means or totals. It uses control totals of variables known at the population level that are included in the regression set up. In this paper, we investigate the properties of the regression estimator that uses control totals estimated from the sample, as well as those known at the population level. This estimator is compared to the regression estimators that strictly use the known totals both theoretically and via a simulation study.
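
To make the role of the control total concrete, here is a minimal sketch of a regression estimator of a total with a single auxiliary variable. It is a textbook form, not the paper's estimator: it assumes the survey weights already sum to the population size, and the function and argument names are hypothetical. The argument `x_control_total` may be a known population total or one estimated from another sample, which is the distinction the paper studies.

```python
def regression_estimate_total(y, x, weights, x_control_total):
    """Weighted estimate of the y total, adjusted by a regression on x."""
    n_hat = sum(weights)
    y_ht = sum(w * yi for w, yi in zip(weights, y))   # weighted y total
    x_ht = sum(w * xi for w, xi in zip(weights, x))   # weighted x total
    x_bar, y_bar = x_ht / n_hat, y_ht / n_hat
    # Weighted least-squares slope of y on x (model with intercept).
    s_xy = sum(w * (xi - x_bar) * (yi - y_bar)
               for w, xi, yi in zip(weights, x, y))
    s_xx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = s_xy / s_xx
    # Shift the weighted total by the gap between control and estimated x totals.
    return y_ht + slope * (x_control_total - x_ht)


estimate = regression_estimate_total(
    y=[2.0, 4.0, 6.0, 8.0], x=[1.0, 2.0, 3.0, 4.0],
    weights=[10.0] * 4, x_control_total=110.0)
print(estimate)
```

When y is strongly related to x, the adjustment term removes most of the sampling error carried by the weighted x total, which is where the gain in reliability comes from.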

Release date: 2016-06-22

Technical products: 11-522-X201700014711
Description:

After the 2010 Census, the U.S. Census Bureau conducted two separate research projects matching survey data to databases. One study matched to the third-party database Accurint, and the other matched to U.S. Postal Service National Change of Address (NCOA) files. In both projects, we evaluated response error in reported move dates by comparing the self-reported move date to records in the database. We encountered similar challenges in the two projects. This paper discusses our experience using “big data” as a comparison source for survey data and our lessons learned for future projects similar to the ones we conducted.

Release date: 2016-03-24

Technical products: 11-522-X201700014717
Description:

Release date: 2016-03-24

Articles and reports: 12-001-X201500214249
Description:

The problem of optimal allocation of samples in surveys using a stratified sampling plan was first discussed by Neyman in 1934. Since then, many researchers have studied the problem of the sample allocation in multivariate surveys and several methods have been proposed. Basically, these methods are divided into two classes: The first class comprises methods that seek an allocation which minimizes survey costs while keeping the coefficients of variation of estimators of totals below specified thresholds for all survey variables of interest. The second aims to minimize a weighted average of the relative variances of the estimators of totals given a maximum overall sample size or a maximum cost. This paper proposes a new optimization approach for the sample allocation problem in multivariate surveys. This approach is based on a binary integer programming formulation. Several numerical experiments showed that the proposed approach provides efficient solutions to this problem, which improve upon a ‘textbook algorithm’ and can be more efficient than the algorithm by Bethel (1985, 1989).
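
For background, the classical single-variable Neyman allocation mentioned at the start of this abstract can be sketched as follows. This is not the binary integer programming approach the paper proposes. The function name is hypothetical, and the result is left unrounded.

```python
def neyman_allocation(total_n, stratum_sizes, stratum_sds):
    """Allocate total_n across strata in proportion to N_h * S_h."""
    products = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    denom = sum(products)
    if denom <= 0:
        raise ValueError("at least one stratum needs positive size and spread")
    return [total_n * p / denom for p in products]


# Two strata: a small, highly variable one and a large, homogeneous one.
print(neyman_allocation(100, stratum_sizes=[100, 300], stratum_sds=[3.0, 1.0]))
```

With several survey variables each variable would suggest a different allocation, which is the conflict that the multivariate methods described above try to resolve.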

Release date: 2015-12-17

Articles and reports: 12-001-X201500114200
Description:

We consider the observed best prediction (OBP; Jiang, Nguyen and Rao 2011) for small area estimation under the nested-error regression model, where both the mean and variance functions may be misspecified. We show via a simulation study that the OBP may significantly outperform the empirical best linear unbiased prediction (EBLUP) method not just in the overall mean squared prediction error (MSPE) but also in the area-specific MSPE for every one of the small areas. A bootstrap method is proposed for estimating the design-based area-specific MSPE, which is simple and always produces positive MSPE estimates. The performance of the proposed MSPE estimator is evaluated through a simulation study. An application to the Television School and Family Smoking Prevention and Cessation study is considered.

Release date: 2015-06-29

Articles and reports: 82-003-X201500314143
Description:

This study evaluates the representativeness of the pooled 2007/2009-2009/2011 Canadian Health Measures Survey immigrant sample by comparing it with socio-demographic distributions from the 2006 Census and the 2011 National Household Survey, and with selected self-reported health and health behaviour indicators from the 2009/2010 Canadian Community Health Survey.

Release date: 2015-03-18

Articles and reports: 12-001-X201400214113
Description:

Rotating panel surveys are used to calculate estimates of gross flows between two consecutive periods of measurement. This paper considers a general procedure for the estimation of gross flows when the rotating panel survey has been generated from a complex survey design with random nonresponse. A pseudo maximum likelihood approach is considered through a two-stage model of Markov chains for the allocation of individuals among the categories in the survey and for modeling nonresponse.

Release date: 2014-12-19

Technical products: 11-522-X201300014255
Description:

The Brazilian Network Information Center (NIC.br) has designed and carried out a pilot project to collect data from the Web in order to produce statistics about the webpages’ characteristics. Studies on the characteristics and dimensions of the web require collecting and analyzing information from a dynamic and complex environment. The core idea was to collect data automatically from a sample of webpages by using software known as a web crawler. The motivation for this paper is to disseminate the methods and results of this study as well as to show current developments related to sampling techniques in a dynamic environment.

Release date: 2014-10-31

Technical products: 11-522-X201300014284
Description:

The decline in response rates observed by several national statistical institutes, their desire to limit response burden and the significant budget pressures they face support greater use of administrative data to produce statistical information. The administrative data sources they must consider have to be evaluated according to several aspects to determine their fitness for use. Statistics Canada recently developed a process to evaluate administrative data sources for use as inputs to the statistical information production process. This evaluation is conducted in two phases. The initial phase requires access only to the metadata associated with the administrative data considered, whereas the second phase uses a version of data that can be evaluated. This article outlines the evaluation process and tool.

Release date: 2014-10-31

Technical products: 11-522-X201300014264
Description:

While wetlands represent only 6.4% of the world’s surface area, they are essential to the survival of terrestrial species. These ecosystems require special attention in Canada, since that is where nearly 25% of the world’s wetlands are found. Environment Canada (EC) has massive databases that contain all kinds of wetland information from various sources. Before the information in these databases could be used for any environmental initiative, it had to be classified and its quality had to be assessed. In this paper, we will give an overview of the joint pilot project carried out by EC and Statistics Canada to assess the quality of the information contained in these databases, which has characteristics specific to big data, administrative data and survey data.

Release date: 2014-10-31

Articles and reports: 82-003-X201401014098
Description:

This study compares registry and non-registry approaches to linking 2006 Census of Population data for Manitoba and Ontario to Hospital data from the Discharge Abstract Database.

Release date: 2014-10-15

Articles and reports: 82-003-X201301011873
Description:

A computer simulation model of physical activity was developed for the Canadian adult population using longitudinal data from the National Population Health Survey and cross-sectional data from the Canadian Community Health Survey. The model is based on the Population Health Model (POHEM) platform developed by Statistics Canada. This article presents an overview of POHEM and describes the additions that were made to create the physical activity module (POHEM-PA). These additions include changes in physical activity over time, and the relationship between physical activity levels and health-adjusted life expectancy, life expectancy and the onset of selected chronic conditions. Estimates from simulation projections are compared with nationally representative survey data to provide an indication of the validity of POHEM-PA.

Release date: 2013-10-16

Articles and reports: 12-001-X201200111688
Description:

We study the problem of nonignorable nonresponse in a two dimensional contingency table which can be constructed for each of several small areas when there is both item and unit nonresponse. In general, the provision for both types of nonresponse with small areas introduces significant additional complexity in the estimation of model parameters. For this paper, we conceptualize the full data array for each area to consist of a table for complete data and three supplemental tables for missing row data, missing column data, and missing row and column data. For nonignorable nonresponse, the total cell probabilities are allowed to vary by area, cell and these three types of "missingness". The underlying cell probabilities (i.e., those which would apply if full classification were always possible) for each area are generated from a common distribution and their similarity across the areas is parametrically quantified. Our approach is an extension of the selection approach for nonignorable nonresponse investigated by Nandram and Choi (2002a, b) for binary data; this extension creates additional complexity because of the multivariate nature of the data coupled with the small area structure. As in that earlier work, the extension is an expansion model centered on an ignorable nonresponse model so that the total cell probability is dependent upon which of the categories is the response. Our investigation employs hierarchical Bayesian models and Markov chain Monte Carlo methods for posterior inference. The models and methods are illustrated with data from the third National Health and Nutrition Examination Survey.

Release date: 2012-06-27

Articles and reports: 12-001-X201100211603
Description:

In many sample surveys there are items requesting binary response (e.g., obese, not obese) from a number of small areas. Inference is required about the probability for a positive response (e.g., obese) in each area, the probability being the same for all individuals in each area and different across areas. Because of the sparseness of the data within areas, direct estimators are not reliable, and there is a need to use data from other areas to improve inference for a specific area. Essentially, a priori the areas are assumed to be similar, and a hierarchical Bayesian model, the standard beta-binomial model, is a natural choice. The innovation is that a practitioner may have much-needed additional prior information about a linear combination of the probabilities. For example, a weighted average of the probabilities is a parameter, and information can be elicited about this parameter, thereby making the Bayesian paradigm appropriate. We have modified the standard beta-binomial model for small areas to incorporate the prior information on the linear combination of the probabilities, which we call a constraint. Thus, there are three cases. The practitioner (a) does not specify a constraint, (b) specifies a constraint and the parameter completely, and (c) specifies a constraint and information which can be used to construct a prior distribution for the parameter. The griddy Gibbs sampler is used to fit the models. To illustrate our method, we use an example on obesity of children in the National Health and Nutrition Examination Survey in which the small areas are formed by crossing school (middle, high), ethnicity (white, black, Mexican) and gender (male, female). We use a simulation study to assess some of the statistical features of our method. We have shown that both (b) and (c) gain precision over (a), with the gain under (b) larger than under (c).
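
The way a beta-binomial model borrows strength across areas can be illustrated with the conjugate posterior mean for a single area. This minimal sketch treats the beta hyperparameters as fixed, whereas the paper places priors on them and adds a constraint. The function name and the numbers are hypothetical.

```python
def posterior_mean_proportion(successes, trials, alpha, beta):
    """Posterior mean of p when p ~ Beta(alpha, beta) and
    successes ~ Binomial(trials, p)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return (alpha + successes) / (alpha + beta + trials)


# A sparse area: 3 positive responses out of 10; the prior mean is 0.2.
print(posterior_mean_proportion(3, 10, alpha=2.0, beta=8.0))
```

The result lies between the direct estimate (0.3) and the prior mean (0.2). The fewer observations an area has, the closer its estimate is pulled toward the prior mean shared across areas.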

Release date: 2011-12-21

Articles and reports: 12-001-X201100111443
Description:

Dual frame telephone surveys are becoming common in the U.S. because of the incompleteness of the landline frame as people transition to cell phones. This article examines nonsampling errors in dual frame telephone surveys. Even though nonsampling errors are ignored in much of the dual frame literature, we find that under some conditions substantial biases may arise in dual frame telephone surveys due to these errors. We specifically explore biases due to nonresponse and measurement error in these telephone surveys. To reduce the bias resulting from these errors, we propose dual frame sampling and weighting methods. The compositing factor for combining the estimates from the two frames is shown to play an important role in reducing nonresponse bias.
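
The compositing factor can be illustrated with a standard composite estimator of a total for two overlapping frames. This is a generic sketch, not the weighting method proposed in the article, and all names and figures are hypothetical.

```python
def composite_dual_frame_total(landline_only, cell_only,
                               overlap_from_landline, overlap_from_cell,
                               compositing_factor):
    """Combine domain estimates from two overlapping frames.

    The overlap domain (people reachable on both frames) is estimated
    from each sample; the compositing factor weights the two estimates.
    """
    if not 0.0 <= compositing_factor <= 1.0:
        raise ValueError("compositing_factor must lie in [0, 1]")
    overlap = (compositing_factor * overlap_from_landline
               + (1.0 - compositing_factor) * overlap_from_cell)
    return landline_only + cell_only + overlap


print(composite_dual_frame_total(100.0, 50.0, 80.0, 120.0, 0.25))
```

If the two overlap estimates are affected differently by nonresponse, the choice of compositing factor changes how much of each bias enters the combined estimate.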

Release date: 2011-06-29

Articles and reports: 12-001-X201000111244
Description:

This paper considers the problem of selecting nonparametric models for small area estimation, which have recently received much attention. We develop a procedure based on the idea of the fence method (Jiang, Rao, Gu and Nguyen 2008) for selecting the mean function for the small areas from a class of approximating splines. Simulation results show impressive performance of the new procedure even when the number of small areas is fairly small. The method is applied to a hospital graft failure dataset for selecting a nonparametric Fay-Herriot type model.

Release date: 2010-06-29

Technical products: 11-522-X200800010970
Description:

RTI International is currently conducting a longitudinal education study. One component of the study involved collecting transcripts and course catalogs from high schools that the sample members attended. Information from the transcripts and course catalogs also needed to be keyed and coded. This presented a challenge because the transcripts and course catalogs were collected from different types of schools, including public, private, and religious schools, from across the nation and they varied widely in both content and format. The challenge called for a sophisticated system that could be used by multiple users simultaneously. RTI developed such a system possessing all the characteristics of a high-end, high-tech, multi-user, multitask, user-friendly and low maintenance cost high school transcript and course catalog keying and coding system. The system is web based and has three major functions: transcript and catalog keying and coding, transcript and catalog keying quality control (keyer-coder end), and transcript and catalog coding QC (management end). Given the complex nature of transcript and catalog keying and coding, the system was designed to be flexible and to have the ability to transport keyed and coded data throughout the system to reduce the keying time, the ability to logically guide users through all the pages that a type of activity required, the ability to display appropriate information to help keying performance, and the ability to track all the keying, coding, and QC activities. Hundreds of catalogs and thousands of transcripts were successfully keyed, coded, and verified using the system. This paper will report on the system needs and design, implementation tips, problems faced and their solutions, and lessons learned.

Release date: 2009-12-03

Technical products: 11-522-X200800011004
Description:

The issue of reducing the response burden is not new. Statistics Sweden works in different ways to reduce response burden and to decrease the administrative costs of data collection from enterprises and organizations. According to legislation, Statistics Sweden must reduce response burden for the business community. Therefore, this work is a priority. The Government has set a fixed target of decreasing the administrative costs of enterprises by twenty-five percent by 2010. This goal also applies to data collection for statistical purposes. The goal concerns surveys for which response is compulsory by law. In addition to these surveys there are many more surveys, and there is a need to measure and reduce the burden from these surveys as well. In order to help measure, analyze and reduce the burden, Statistics Sweden has developed the Register of Data providers concerning enterprises and organization (ULR). The purpose of the register is twofold: to measure and analyze the burden on an aggregated level, and to be able to inform each individual enterprise of the surveys in which it is participating.

Release date: 2009-12-03

Technical products: 11-522-X200800010968
Description:

Statistics Canada has embarked on a program of increasing and improving the usage of imaging technology for paper survey questionnaires. The goal is to make the process an efficient, reliable and cost-effective method of capturing survey data. The objective is to continue using Optical Character Recognition (OCR) to capture the data from questionnaires, documents and faxes received whilst improving the process integration and Quality Assurance/Quality Control (QC) of the data capture process. These improvements are discussed in this paper.

Release date: 2009-12-03

Technical products: 11-522-X200800010988
Description:

Online data collection emerged in 1995 as an alternative approach for conducting certain types of consumer research studies and has grown in 2008. This growth has been primarily in studies where non-probability sampling methods are used. While online sampling has gained acceptance for some research applications, serious questions remain concerning online samples' suitability for research requiring precise volumetric measurement of the behavior of the U.S. population, particularly their travel behavior. This paper reviews literature and compares results from studies using probability samples and online samples to understand whether results differ from the two sampling approaches. The paper also demonstrates that online samples underestimate critical types of travel even after demographic and geographic weighting.

Release date: 2009-12-03

Technical products: 11-522-X200800010993
Description:

Until now, years of experience in questionnaire design were required to estimate how long it would take a respondent, on average, to complete a CATI questionnaire for a new survey. This presentation focuses on a new method which produces interview time estimates for questionnaires at the development stage. The method uses Blaise Audit Trail data and previous surveys. It was developed, tested and verified for accuracy on some large-scale surveys.

First, audit trail data was used to determine the average time previous respondents have taken to answer specific types of questions. These would include questions that require a yes/no answer, scaled questions, "mark all that apply" questions, etc. Second, for any given questionnaire, the paths taken by population sub-groups were mapped to identify the series of questions answered by different types of respondents, and timed to determine what the longest possible interview time would be. Finally, the overall expected time it takes to complete the questionnaire is calculated using estimated proportions of the population expected to answer each question.

So far, we used paradata to accurately estimate average respondent interview completion times. We note that the method that we developed could also be used to estimate specific respondent interview completion times.
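
The last two steps described above (the longest path and the overall expected time) can be sketched as follows. The question identifiers, timings and proportions are invented for illustration, and the function names are hypothetical; actual inputs would come from audit trail data and the questionnaire's routing.

```python
def expected_interview_seconds(seconds_by_question, proportion_answering):
    """Expected completion time: each question's average time weighted by
    the proportion of the population expected to answer it."""
    return sum(seconds_by_question[q] * proportion_answering[q]
               for q in seconds_by_question)


def longest_path_seconds(paths, seconds_by_question):
    """Longest possible interview time over all mapped respondent paths."""
    return max(sum(seconds_by_question[q] for q in path) for path in paths)


# Hypothetical average times (seconds) by question, e.g. by question type.
seconds = {"Q1": 5.0, "Q2": 12.0, "Q3": 20.0}
# Hypothetical share of the population routed to each question.
proportions = {"Q1": 1.0, "Q2": 0.5, "Q3": 0.25}
# Hypothetical paths taken by two sub-groups of respondents.
paths = [["Q1", "Q2"], ["Q1", "Q3"]]

print(expected_interview_seconds(seconds, proportions))
print(longest_path_seconds(paths, seconds))
```

Restricting the sum to the questions on one respondent's path gives a respondent-specific estimate, in line with the closing remark of the description.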

Release date: 2009-12-03

Articles and reports: 12-001-X200800210761
Description:

Optimum stratification is the method of choosing the best boundaries that make strata internally homogeneous, given some sample allocation. In order to make the strata internally homogeneous, the strata should be constructed in such a way that the strata variances for the characteristic under study are as small as possible. This could be achieved effectively by knowing the distribution of the main study variable and creating strata by cutting the range of the distribution at suitable points. If the frequency distribution of the study variable is unknown, it may be approximated from past experience or some prior knowledge obtained from a recent study. In this paper the problem of finding Optimum Strata Boundaries (OSB) is considered as the problem of determining Optimum Strata Widths (OSW). The problem is formulated as a Mathematical Programming Problem (MPP), which minimizes the variance of the estimated population parameter under Neyman allocation subject to the restriction that the sum of the widths of all the strata is equal to the total range of the distribution. The distributions of the study variable are considered as continuous with Triangular and Standard Normal density functions. The formulated MPPs, which turn out to be multistage decision problems, can then be solved using the dynamic programming technique proposed by Bühler and Deutler (1975). Numerical examples are presented to illustrate the computational details. The results obtained are also compared with the method of Dalenius and Hodges (1959) with an example of normal distribution.

Release date: 2008-12-23

Articles and reports: 12-001-X200800110606
Description:

Data from election polls in the US are typically presented in two-way categorical tables, and there are many polls before the actual election in November. For example, in the Buckeye State Poll in 1998 for governor there are three polls, January, April and October; the first category represents the candidates (e.g., Fisher, Taft and other) and the second category represents the current status of the voters (likely to vote and not likely to vote for governor of Ohio). There is a substantial number of undecided voters for one or both categories in all three polls, and we use a Bayesian method to allocate the undecided voters to the three candidates. This method permits modeling different patterns of missingness under ignorable and nonignorable assumptions, and a multinomial-Dirichlet model is used to estimate the cell probabilities which can help to predict the winner. We propose a time-dependent nonignorable nonresponse model for the three tables. Here, a nonignorable nonresponse model is centered on an ignorable nonresponse model to induce some flexibility and uncertainty about ignorability or nonignorability. As competitors we also consider two other models, an ignorable and a nonignorable nonresponse model. These latter two models assume a common stochastic process to borrow strength over time. Markov chain Monte Carlo methods are used to fit the models. We also construct a parameter that can potentially be used to predict the winner among the candidates in the November election.

Release date: 2008-06-26

