Statistics by subject – Statistical methods

All (61) (25 of 61 results)

  • Articles and reports: 12-001-X201700114822
    Description:

    We use a Bayesian method to make inferences about a finite population proportion when binary data are collected using a two-fold sample design from small areas. The two-fold sample design has a two-stage cluster sample design within each area. A former hierarchical Bayesian model assumes that for each area the first-stage binary responses follow independent Bernoulli distributions, and the probabilities have beta distributions which are parameterized by a mean and a correlation coefficient. The means vary with areas but the correlation is the same over areas. However, to gain some flexibility we have now extended this model to accommodate different correlations. The means and the correlations have independent beta distributions. We call the former model a homogeneous model and the new model a heterogeneous model. All hyperparameters have proper noninformative priors. An additional complexity is that some of the parameters are weakly identified, making it difficult to use a standard Gibbs sampler for computation. We therefore use unimodal constraints for the beta prior distributions and a blocked Gibbs sampler to perform the computation. We compare the heterogeneous and homogeneous models using an illustrative example and a simulation study. As expected, the two-fold model with heterogeneous correlations is preferred.

    Release date: 2017-06-22

  • Articles and reports: 82-003-X201700614829
    Description:

    POHEM-BMI is a microsimulation tool that includes a model of adult body mass index (BMI) and a model of childhood BMI history. This overview describes the development of BMI prediction models for adults and of childhood BMI history, and compares projected BMI estimates with those from nationally representative survey data to establish validity.

    Release date: 2017-06-21

  • Articles and reports: 12-001-X201600114543
    Description:

    The regression estimator is extensively used in practice because it can improve the reliability of the estimated parameters of interest such as means or totals. It uses control totals of variables known at the population level that are included in the regression set up. In this paper, we investigate the properties of the regression estimator that uses control totals estimated from the sample, as well as those known at the population level. This estimator is compared to the regression estimators that strictly use the known totals both theoretically and via a simulation study.

    Release date: 2016-06-22
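
    A minimal sketch of the general idea behind the regression estimator discussed above (not the specific estimator studied in the paper): under simple random sampling, the sample mean of a study variable is adjusted using a known population control total for an auxiliary variable. All data below are simulated and the quantities are hypothetical.

    ```python
    import numpy as np

    rng = np.random.default_rng(42)

    # Simulated finite population: auxiliary variable x, study variable y
    N = 10_000
    x = rng.gamma(shape=2.0, scale=5.0, size=N)
    y = 3.0 + 1.5 * x + rng.normal(scale=4.0, size=N)
    x_bar_pop = x.mean()                       # known population control total (mean of x)

    # Simple random sample without replacement
    n = 400
    idx = rng.choice(N, size=n, replace=False)
    xs, ys = x[idx], y[idx]

    # Ordinary sample mean
    y_bar_srs = ys.mean()

    # Regression estimator: adjust the sample mean using the known control total
    b = np.cov(xs, ys, ddof=1)[0, 1] / np.var(xs, ddof=1)   # slope estimated from the sample
    y_bar_reg = y_bar_srs + b * (x_bar_pop - xs.mean())

    print(f"True population mean : {y.mean():.3f}")
    print(f"Sample mean          : {y_bar_srs:.3f}")
    print(f"Regression estimator : {y_bar_reg:.3f}")
    ```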

  • Technical products: 11-522-X201700014711
    Description:

    After the 2010 Census, the U.S. Census Bureau conducted two separate research projects matching survey data to databases. One study matched to the third-party database Accurint, and the other matched to U.S. Postal Service National Change of Address (NCOA) files. In both projects, we evaluated response error in reported move dates by comparing the self-reported move date to records in the database. We encountered similar challenges in the two projects. This paper discusses our experience using “big data” as a comparison source for survey data and our lessons learned for future projects similar to the ones we conducted.

    Release date: 2016-03-24

  • Technical products: 11-522-X201700014717
    Description:

    Files that link data from Statistics Canada's Postsecondary Student Information System (PSIS) with tax data can be used to examine the trajectories of students who pursue postsecondary education (PSE) programs and their post-schooling labour market outcomes. On one hand, administrative data on students linked longitudinally can provide aggregate information on student pathways during postsecondary studies, such as persistence rates, graduation rates and mobility. On the other hand, the tax data can supplement the PSIS data to provide information on employment outcomes, such as average and median earnings or earnings progression by employment sector (industry), field of study, education level and/or other demographic information, year over year after graduation. Two longitudinal pilot studies have been done using administrative data on postsecondary students of Maritime institutions, linked longitudinally and to Statistics Canada tax data (the T1 Family File) for the relevant years. This article first focuses on the quality of information in the administrative data and on the methodology used to conduct these longitudinal studies and derive indicators. Second, it focuses on some limitations of using administrative data, rather than a survey, to define certain concepts.

    Release date: 2016-03-24

  • Articles and reports: 12-001-X201500214249
    Description:

    The problem of optimal allocation of samples in surveys using a stratified sampling plan was first discussed by Neyman in 1934. Since then, many researchers have studied the problem of the sample allocation in multivariate surveys and several methods have been proposed. Basically, these methods are divided into two classes: The first class comprises methods that seek an allocation which minimizes survey costs while keeping the coefficients of variation of estimators of totals below specified thresholds for all survey variables of interest. The second aims to minimize a weighted average of the relative variances of the estimators of totals given a maximum overall sample size or a maximum cost. This paper proposes a new optimization approach for the sample allocation problem in multivariate surveys. This approach is based on a binary integer programming formulation. Several numerical experiments showed that the proposed approach provides efficient solutions to this problem, which improve upon a ‘textbook algorithm’ and can be more efficient than the algorithm by Bethel (1985, 1989).

    Release date: 2015-12-17
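
    The paper above proposes a binary integer programming formulation, which would require an integer solver to reproduce; as background only, the sketch below illustrates the classical single-variable Neyman allocation that such multivariate allocation methods generalize. Stratum sizes and standard deviations are invented.

    ```python
    import numpy as np

    # Hypothetical stratum population sizes and standard deviations of the study variable
    N_h = np.array([5_000, 12_000, 8_000, 2_000])
    S_h = np.array([1.2,   3.5,    2.0,   6.0])
    n_total = 1_000                               # overall sample size to allocate

    # Neyman allocation: n_h proportional to N_h * S_h
    weights = N_h * S_h
    n_h = n_total * weights / weights.sum()

    # Round while keeping the total fixed (largest-remainder rule)
    n_alloc = np.floor(n_h).astype(int)
    remainder = n_h - n_alloc
    shortfall = n_total - n_alloc.sum()
    n_alloc[np.argsort(-remainder)[:shortfall]] += 1

    for h, (size, n) in enumerate(zip(N_h, n_alloc), start=1):
        print(f"Stratum {h}: N = {size:6d}, allocated n = {n}")
    print("Total allocated:", n_alloc.sum())
    ```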

  • Articles and reports: 12-001-X201500114200
    Description:

    We consider the observed best prediction (OBP; Jiang, Nguyen and Rao 2011) for small area estimation under the nested-error regression model, where both the mean and variance functions may be misspecified. We show via a simulation study that the OBP may significantly outperform the empirical best linear unbiased prediction (EBLUP) method not just in the overall mean squared prediction error (MSPE) but also in the area-specific MSPE for every one of the small areas. A bootstrap method is proposed for estimating the design-based area-specific MSPE, which is simple and always produces positive MSPE estimates. The performance of the proposed MSPE estimator is evaluated through a simulation study. An application to the Television School and Family Smoking Prevention and Cessation study is considered.

    Release date: 2015-06-29

  • Articles and reports: 82-003-X201500314143
    Description:

    This study evaluates the representativeness of the pooled 2007/2009-2009/2011 Canadian Health Measures Survey immigrant sample by comparing it with socio-demographic distributions from the 2006 Census and the 2011 National Household Survey, and with selected self-reported health and health behaviour indicators from the 2009/2010 Canadian Community Health Survey.

    Release date: 2015-03-18

  • Articles and reports: 12-001-X201400214113
    Description:

    Rotating panel surveys are used to calculate estimates of gross flows between two consecutive periods of measurement. This paper considers a general procedure for the estimation of gross flows when the rotating panel survey has been generated from a complex survey design with random nonresponse. A pseudo maximum likelihood approach is considered through a two-stage Markov chain model for the allocation of individuals among the categories in the survey and for modeling nonresponse.

    Release date: 2014-12-19
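
    As a toy illustration of gross flows between two periods (not the pseudo maximum likelihood, nonresponse-adjusted procedure of the paper), the sketch below tabulates weighted flows between labour force states in a hypothetical panel and row-normalizes them into transition probabilities.

    ```python
    import numpy as np

    states = ["employed", "unemployed", "not in labour force"]
    rng = np.random.default_rng(0)

    # Hypothetical panel: state in month 1, state in month 2, survey weight
    n = 5_000
    s1 = rng.integers(0, 3, size=n)
    true_P = np.array([[0.92, 0.03, 0.05],
                       [0.30, 0.55, 0.15],
                       [0.06, 0.04, 0.90]])
    s2 = np.array([rng.choice(3, p=true_P[i]) for i in s1])
    w = rng.uniform(0.5, 2.0, size=n)            # survey weights

    # Weighted gross-flow table and estimated transition matrix
    flows = np.zeros((3, 3))
    np.add.at(flows, (s1, s2), w)
    P_hat = flows / flows.sum(axis=1, keepdims=True)

    print("Estimated weighted gross flows:\n", np.round(flows, 1))
    print("Estimated transition probabilities:\n", np.round(P_hat, 3))
    ```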

  • Technical products: 11-522-X201300014255
    Description:

    The Brazilian Network Information Center (NIC.br) has designed and carried out a pilot project to collect data from the Web in order to produce statistics about the characteristics of webpages. Studies on the characteristics and dimensions of the web require collecting and analyzing information from a dynamic and complex environment. The core idea was to collect data from a sample of webpages automatically by using software known as a web crawler. The motivation for this paper is to disseminate the methods and results of this study, as well as to show current developments related to sampling techniques in a dynamic environment.

    Release date: 2014-10-31

  • Technical products: 11-522-X201300014284
    Description:

    The decline in response rates observed by several national statistical institutes, their desire to limit response burden and the significant budget pressures they face support greater use of administrative data to produce statistical information. The administrative data sources they must consider have to be evaluated according to several aspects to determine their fitness for use. Statistics Canada recently developed a process to evaluate administrative data sources for use as inputs to the statistical information production process. This evaluation is conducted in two phases. The initial phase requires access only to the metadata associated with the administrative data considered, whereas the second phase uses a version of data that can be evaluated. This article outlines the evaluation process and tool.

    Release date: 2014-10-31

  • Technical products: 11-522-X201300014264
    Description:

    While wetlands represent only 6.4% of the world’s surface area, they are essential to the survival of terrestrial species. These ecosystems require special attention in Canada, since that is where nearly 25% of the world’s wetlands are found. Environment Canada (EC) has massive databases that contain all kinds of wetland information from various sources. Before the information in these databases could be used for any environmental initiative, it had to be classified and its quality had to be assessed. In this paper, we will give an overview of the joint pilot project carried out by EC and Statistics Canada to assess the quality of the information contained in these databases, which has characteristics specific to big data, administrative data and survey data.

    Release date: 2014-10-31

  • Articles and reports: 82-003-X201401014098
    Description:

    This study compares registry and non-registry approaches to linking 2006 Census of Population data for Manitoba and Ontario to Hospital data from the Discharge Abstract Database.

    Release date: 2014-10-15

  • Articles and reports: 82-003-X201301011873
    Description:

    A computer simulation model of physical activity was developed for the Canadian adult population using longitudinal data from the National Population Health Survey and cross-sectional data from the Canadian Community Health Survey. The model is based on the Population Health Model (POHEM) platform developed by Statistics Canada. This article presents an overview of POHEM and describes the additions that were made to create the physical activity module (POHEM-PA). These additions include changes in physical activity over time, and the relationship between physical activity levels and health-adjusted life expectancy, life expectancy and the onset of selected chronic conditions. Estimates from simulation projections are compared with nationally representative survey data to provide an indication of the validity of POHEM-PA.

    Release date: 2013-10-16

  • Articles and reports: 12-001-X201200111688
    Description:

    We study the problem of nonignorable nonresponse in a two dimensional contingency table which can be constructed for each of several small areas when there is both item and unit nonresponse. In general, the provision for both types of nonresponse with small areas introduces significant additional complexity in the estimation of model parameters. For this paper, we conceptualize the full data array for each area to consist of a table for complete data and three supplemental tables for missing row data, missing column data, and missing row and column data. For nonignorable nonresponse, the total cell probabilities are allowed to vary by area, cell and these three types of "missingness". The underlying cell probabilities (i.e., those which would apply if full classification were always possible) for each area are generated from a common distribution and their similarity across the areas is parametrically quantified. Our approach is an extension of the selection approach for nonignorable nonresponse investigated by Nandram and Choi (2002a, b) for binary data; this extension creates additional complexity because of the multivariate nature of the data coupled with the small area structure. As in that earlier work, the extension is an expansion model centered on an ignorable nonresponse model so that the total cell probability is dependent upon which of the categories is the response. Our investigation employs hierarchical Bayesian models and Markov chain Monte Carlo methods for posterior inference. The models and methods are illustrated with data from the third National Health and Nutrition Examination Survey.

    Release date: 2012-06-27

  • Articles and reports: 12-001-X201100211603
    Description:

    In many sample surveys there are items requesting a binary response (e.g., obese, not obese) from a number of small areas. Inference is required about the probability of a positive response (e.g., obese) in each area, the probability being the same for all individuals in each area and different across areas. Because of the sparseness of the data within areas, direct estimators are not reliable, and there is a need to use data from other areas to improve inference for a specific area. Essentially, a priori the areas are assumed to be similar, and a hierarchical Bayesian model, the standard beta-binomial model, is a natural choice. The innovation is that a practitioner may have much-needed additional prior information about a linear combination of the probabilities. For example, a weighted average of the probabilities is a parameter, and information can be elicited about this parameter, thereby making the Bayesian paradigm appropriate. We have modified the standard beta-binomial model for small areas to incorporate the prior information on the linear combination of the probabilities, which we call a constraint. Thus, there are three cases. The practitioner (a) does not specify a constraint, (b) specifies a constraint and the parameter completely, and (c) specifies a constraint and information which can be used to construct a prior distribution for the parameter. The griddy Gibbs sampler is used to fit the models. To illustrate our method, we use an example on obesity of children in the National Health and Nutrition Examination Survey in which the small areas are formed by crossing school (middle, high), ethnicity (white, black, Mexican) and gender (male, female). We use a simulation study to assess some of the statistical features of our method. We show that the gain in precision over (a) is larger under (b) than under (c).

    Release date: 2011-12-21
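
    The building block of the model described above is the beta-binomial pair for each small area. The sketch below shows only that conjugate building block, with a single fixed Beta prior standing in for the hierarchical borrowing of strength; it is not the constrained model or the griddy Gibbs sampler of the paper, and all counts are invented.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical small areas: number sampled and number with a positive response (e.g., obese)
    n_i = np.array([15, 8, 40, 5, 22])
    y_i = np.array([ 3, 1, 12, 0,  7])

    # A single common Beta(a, b) prior plays the role of the shared information across areas
    a, b = 2.0, 6.0

    for i, (n, y) in enumerate(zip(n_i, y_i), start=1):
        direct = y / n                                   # direct estimate, unstable when n is small
        post_mean = (a + y) / (a + b + n)                # beta-binomial conjugate posterior mean
        draws = rng.beta(a + y, b + n - y, size=10_000)  # posterior draws for an interval
        lo, hi = np.percentile(draws, [2.5, 97.5])
        print(f"Area {i}: direct = {direct:.3f}, posterior mean = {post_mean:.3f}, "
              f"95% interval = ({lo:.3f}, {hi:.3f})")
    ```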

  • Articles and reports: 12-001-X201100111443
    Description:

    Dual frame telephone surveys are becoming common in the U.S. because of the incompleteness of the landline frame as people transition to cell phones. This article examines nonsampling errors in dual frame telephone surveys. Even though nonsampling errors are ignored in much of the dual frame literature, we find that under some conditions substantial biases may arise in dual frame telephone surveys due to these errors. We specifically explore biases due to nonresponse and measurement error in these telephone surveys. To reduce the bias resulting from these errors, we propose dual frame sampling and weighting methods. The compositing factor for combining the estimates from the two frames is shown to play an important role in reducing nonresponse bias.

    Release date: 2011-06-29
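
    A minimal numeric sketch of the compositing idea mentioned above: a Hartley-style dual frame estimator combines the two overlap-domain estimates with a compositing factor lambda. The frame estimates and the values of lambda are made up for illustration.

    ```python
    # Hypothetical dual-frame estimates of a population total for three domains:
    # landline-only (a), overlap (ab, estimated from both frames), cell-only (b).
    Y_a_hat = 120_000.0        # estimated from the landline frame only
    Y_ab_landline = 310_000.0  # overlap domain, landline-frame estimate
    Y_ab_cell = 290_000.0      # overlap domain, cell-frame estimate
    Y_b_hat = 95_000.0         # estimated from the cell frame only

    def dual_frame_total(lam: float) -> float:
        """Composite estimator: weight the two overlap estimates by lam and 1 - lam."""
        return Y_a_hat + lam * Y_ab_landline + (1.0 - lam) * Y_ab_cell + Y_b_hat

    for lam in (0.3, 0.5, 0.7):
        print(f"lambda = {lam:.1f} -> estimated total = {dual_frame_total(lam):,.0f}")
    ```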

  • Articles and reports: 12-001-X201000111244
    Description:

    This paper considers the problem of selecting nonparametric models for small area estimation, which have recently received much attention. We develop a procedure based on the idea of the fence method (Jiang, Rao, Gu and Nguyen 2008) for selecting the mean function for the small areas from a class of approximating splines. Simulation results show impressive performance of the new procedure even when the number of small areas is fairly small. The method is applied to a hospital graft failure dataset for selecting a nonparametric Fay-Herriot type model.

    Release date: 2010-06-29

  • Technical products: 11-522-X200800010970
    Description:

    RTI International is currently conducting a longitudinal education study. One component of the study involved collecting transcripts and course catalogs from the high schools that the sample members attended. Information from the transcripts and course catalogs also needed to be keyed and coded. This presented a challenge because the transcripts and course catalogs were collected from different types of schools, including public, private and religious schools from across the nation, and they varied widely in both content and format. The challenge called for a sophisticated system that could be used by multiple users simultaneously. RTI developed such a system: a high-end, multi-user, multitask, user-friendly and low-maintenance web-based application for keying and coding high school transcripts and course catalogs. The system has three major functions: transcript and catalog keying and coding, transcript and catalog keying quality control (keyer-coder end), and transcript and catalog coding QC (management end). Given the complex nature of transcript and catalog keying and coding, the system was designed to be flexible, to transport keyed and coded data throughout the system to reduce keying time, to logically guide users through all the pages that a given type of activity requires, to display appropriate information to support keying performance, and to track all keying, coding and QC activities. Hundreds of catalogs and thousands of transcripts were successfully keyed, coded and verified using the system. This paper reports on the system needs and design, implementation tips, problems faced and their solutions, and lessons learned.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800011004
    Description:

    The issue of reducing response burden is not new. Statistics Sweden works in different ways to reduce response burden and to decrease the administrative costs of data collection from enterprises and organizations. Under its legislation, Statistics Sweden must reduce the response burden on the business community, so this work is a priority. The Government has set a target of decreasing the administrative costs of enterprises by twenty-five percent by 2010, and this goal also applies to data collection for statistical purposes. The target concerns surveys for which response is compulsory by legislation; beyond these, there are many more surveys whose burden also needs to be measured and reduced. To help measure, analyze and reduce the burden, Statistics Sweden has developed the Register of Data Providers concerning enterprises and organizations (ULR). The purpose of the register is twofold: to measure and analyze the burden at an aggregated level, and to be able to tell each individual enterprise which surveys it is participating in.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010968
    Description:

    Statistics Canada has embarked on a program of increasing and improving the use of imaging technology for paper survey questionnaires. The goal is to make the process an efficient, reliable and cost-effective method of capturing survey data. The objective is to continue using Optical Character Recognition (OCR) to capture the data from questionnaires, documents and faxes received, while improving the process integration and the quality assurance/quality control (QA/QC) of the data capture process. These improvements are discussed in this paper.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010988
    Description:

    Online data collection emerged in 1995 as an alternative approach for conducting certain types of consumer research studies and had grown considerably by 2008. This growth has been primarily in studies where non-probability sampling methods are used. While online sampling has gained acceptance for some research applications, serious questions remain concerning online samples' suitability for research requiring precise volumetric measurement of the behavior of the U.S. population, particularly their travel behavior. This paper reviews the literature and compares results from studies using probability samples and online samples to understand whether results from the two sampling approaches differ. The paper also demonstrates that online samples underestimate critical types of travel even after demographic and geographic weighting.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010993
    Description:

    Until now, years of experience in questionnaire design were required to estimate how long, on average, a respondent would take to complete a CATI questionnaire for a new survey. This presentation focuses on a new method which produces interview time estimates for questionnaires at the development stage. The method uses Blaise Audit Trail data and previous surveys. It was developed, tested and verified for accuracy on some large-scale surveys.

    First, audit trail data was used to determine the average time previous respondents have taken to answer specific types of questions. These would include questions that require a yes/no answer, scaled questions, "mark all that apply" questions, etc. Second, for any given questionnaire, the paths taken by population sub-groups were mapped to identify the series of questions answered by different types of respondents, and timed to determine what the longest possible interview time would be. Finally, the overall expected time it takes to complete the questionnaire is calculated using estimated proportions of the population expected to answer each question.

    So far, we have used paradata to accurately estimate average respondent interview completion times. We note that the method we developed could also be used to estimate interview completion times for specific respondents.

    Release date: 2009-12-03
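
    The expected completion time described above is essentially a weighted sum of per-question average times over the proportions of respondents routed to each question. A small arithmetic sketch with invented timings and routing proportions:

    ```python
    # Hypothetical per-question average times (seconds), estimated from audit-trail paradata
    avg_time = {"Q1_yes_no": 6.0, "Q2_scale": 11.0, "Q3_mark_all": 18.0, "Q4_open": 35.0}

    # Hypothetical proportion of respondents expected to be routed to each question
    prop_answering = {"Q1_yes_no": 1.00, "Q2_scale": 0.80, "Q3_mark_all": 0.45, "Q4_open": 0.10}

    expected_seconds = sum(avg_time[q] * prop_answering[q] for q in avg_time)
    longest_path_seconds = sum(avg_time.values())   # every respondent routed to every question

    print(f"Expected completion time : {expected_seconds:.1f} seconds")
    print(f"Longest possible path    : {longest_path_seconds:.1f} seconds")
    ```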

  • Articles and reports: 12-001-X200800210761
    Description:

    Optimum stratification is the method of choosing the best boundaries that make strata internally homogeneous, given some sample allocation. In order to make the strata internally homogeneous, the strata should be constructed in such a way that the strata variances for the characteristic under study are as small as possible. This can be achieved effectively if the distribution of the main study variable is known, by creating strata that cut the range of the distribution at suitable points. If the frequency distribution of the study variable is unknown, it may be approximated from past experience or from prior knowledge obtained in a recent study. In this paper the problem of finding Optimum Strata Boundaries (OSB) is considered as the problem of determining Optimum Strata Widths (OSW). The problem is formulated as a Mathematical Programming Problem (MPP), which minimizes the variance of the estimated population parameter under Neyman allocation subject to the restriction that the sum of the widths of all the strata equals the total range of the distribution. The distributions of the study variable are considered as continuous with Triangular and Standard Normal density functions. The formulated MPPs, which turn out to be multistage decision problems, can then be solved using the dynamic programming technique proposed by Bühler and Deutler (1975). Numerical examples are presented to illustrate the computational details. The results obtained are also compared with the method of Dalenius and Hodges (1959) using an example with a normal distribution.

    Release date: 2008-12-23
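
    The paper's own method is the dynamic programming solution of the MPP; as the classical point of comparison it cites, the sketch below implements only the Dalenius and Hodges (1959) cumulative square-root-of-frequency rule on simulated data.

    ```python
    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.normal(loc=50.0, scale=10.0, size=20_000)    # simulated study variable

    L = 4                                                 # number of strata
    counts, edges = np.histogram(x, bins=100)

    # Dalenius-Hodges rule: accumulate sqrt(frequency) and cut it into L equal parts
    cum_sqrt_f = np.cumsum(np.sqrt(counts))
    targets = cum_sqrt_f[-1] * np.arange(1, L) / L
    cut_bins = np.searchsorted(cum_sqrt_f, targets)
    boundaries = edges[cut_bins + 1]

    print("Approximate optimum strata boundaries:", np.round(boundaries, 2))
    strata = np.digitize(x, boundaries)
    for h in range(L):
        xh = x[strata == h]
        print(f"Stratum {h + 1}: n = {xh.size:6d}, within-stratum SD = {xh.std(ddof=1):.2f}")
    ```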

  • Articles and reports: 12-001-X200800110606
    Description:

    Data from election polls in the US are typically presented in two-way categorical tables, and there are many polls before the actual election in November. For example, in the 1998 Buckeye State Poll for governor there are three polls, January, April and October; the first category represents the candidates (e.g., Fisher, Taft and other) and the second category represents the current status of the voters (likely to vote and not likely to vote for governor of Ohio). There is a substantial number of undecided voters for one or both categories in all three polls, and we use a Bayesian method to allocate the undecided voters to the three candidates. This method permits modeling different patterns of missingness under ignorable and nonignorable assumptions, and a multinomial-Dirichlet model is used to estimate the cell probabilities, which can help to predict the winner. We propose a time-dependent nonignorable nonresponse model for the three tables. Here, a nonignorable nonresponse model is centered on an ignorable nonresponse model to induce some flexibility and uncertainty about ignorability or nonignorability. As competitors we also consider two other models, an ignorable and a nonignorable nonresponse model. These latter two models assume a common stochastic process to borrow strength over time. Markov chain Monte Carlo methods are used to fit the models. We also construct a parameter that can potentially be used to predict the winner among the candidates in the November election.

    Release date: 2008-06-26
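
    As a simplified, ignorable-nonresponse illustration of the multinomial-Dirichlet machinery (not the time-dependent nonignorable model proposed above), the sketch below draws posterior candidate shares from a single poll; under ignorability the undecided respondents carry no additional information about preferences. The poll counts and the prior are invented.

    ```python
    import numpy as np

    rng = np.random.default_rng(1998)

    candidates = ["Fisher", "Taft", "Other"]
    decided = np.array([310, 355, 60])     # hypothetical poll counts among decided voters
    undecided = 175                        # undecided respondents

    # Multinomial-Dirichlet: with a Dirichlet(1, 1, 1) prior the posterior for the candidate
    # shares is Dirichlet(1 + counts); under an ignorable model the undecided are implicitly
    # allocated in proportion to these shares.
    alpha_post = 1.0 + decided
    draws = rng.dirichlet(alpha_post, size=20_000)

    means = draws.mean(axis=0)
    p_taft_leads = (draws[:, 1] > draws[:, 0]).mean()

    print(f"Decided respondents: {decided.sum()}, undecided: {undecided}")
    for name, share in zip(candidates, means):
        print(f"{name:6s}: posterior mean share = {share:.3f}")
    print(f"Posterior probability that Taft leads Fisher: {p_taft_leads:.3f}")
    ```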

Analysis (36)

  • Articles and reports: 12-001-X200800110611
    Description:

    In finite population sampling prior information is often available in the form of partial knowledge about an auxiliary variable, for example its mean may be known. In such cases, the ratio estimator and the regression estimator are often used for estimating the population mean of the characteristic of interest. The Polya posterior has been developed as a noninformative Bayesian approach to survey sampling. It is appropriate when little or no prior information about the population is available. Here we show that it can be extended to incorporate types of partial prior information about auxiliary variables. We will see that it typically yields procedures with good frequentist properties even in some problems where standard frequentist methods are difficult to apply.

    Release date: 2008-06-26

  • Articles and reports: 12-001-X20070019856
    Description:

    The concept of 'nearest proportional to size sampling designs' originated by Gabler (1987) is used to obtain an optimal controlled sampling design, ensuring zero selection probabilities to non-preferred samples. Variance estimation for the proposed optimal controlled sampling design using the Yates-Grundy form of the Horvitz-Thompson estimator is discussed. The true sampling variance of the proposed procedure is compared with that of the existing optimal controlled and uncontrolled high entropy selection procedures. The utility of the proposed procedure is demonstrated with the help of examples.

    Release date: 2007-06-28

  • Articles and reports: 12-001-X20050029048
    Description:

    We consider a problem in which an analysis is needed for categorical data from a single two-way table with partial classification (i.e., both item and unit nonresponses). We assume that this is the only information available. A Bayesian methodology permits modeling different patterns of missingness under ignorability and nonignorability assumptions. We construct a nonignorable nonresponse model which is obtained from the ignorable nonresponse model via a model expansion using a data-dependent prior; the nonignorable nonresponse model robustifies the ignorable nonresponse model. A multinomial-Dirichlet model, adjusted for the nonresponse, is used to estimate the cell probabilities, and a Bayes factor is used to test for association. We illustrate our methodology using data on bone mineral density and family income. A sensitivity analysis is used to assess the effects of the data-dependent prior. The ignorable and nonignorable nonresponse models are compared using a simulation study, and there are subtle differences between these models.

    Release date: 2006-02-17

  • Articles and reports: 12-001-X20050018089
    Description:

    We use hierarchical Bayesian models to analyze body mass index (BMI) data of children and adolescents with nonignorable nonresponse from the Third National Health and Nutrition Examination Survey (NHANES III). Our objective is to predict the finite population mean BMI and the proportion of respondents for domains formed by age, race and sex (covariates in the regression models) in each of thirty-five large counties, accounting for the nonrespondents. Markov chain Monte Carlo methods are used to fit the models (two selection and two pattern mixture) to the NHANES III BMI data. Using a deviance measure and a cross-validation study, we show that the nonignorable selection model is the best among the four models. We also show that inference about BMI is not too sensitive to the model choice. An improvement is obtained by including a spline regression into the selection model to reflect changes in the relationship between BMI and age.

    Release date: 2005-07-21

  • Articles and reports: 12-001-X20040016998
    Description:

    The Canadian Labour Force Survey (LFS) was not designed to be a longitudinal survey. However, given that respondent households typically remain in the sample for six consecutive months, it is possible to reconstruct six-month fragments of longitudinal data from the monthly records of household members. Such longitudinal micro-data - altogether consisting of millions of person-months of individual and family level data - is useful for analyses of monthly labour market dynamics over relatively long periods of time, 25 years and more.

    We make use of these data to estimate hazard functions describing transitions among the labour market states: self-employed, paid employee and not employed. Data on job tenure, for employed respondents, and on the date last worked, for those not employed - together with the date of survey responses - allow the construction of models that include terms reflecting seasonality and macro-economic cycles as well as the duration dependence of each type of transition. In addition, the LFS data permits spouse labour market activity and family composition variables to be included in the hazard models as time-varying covariates. The estimated hazard equations have been incorporated in the LifePaths microsimulation model. In that setting, the equations have been used to simulate lifetime employment activity from past, present and future birth cohorts. Simulation results have been validated by comparison with the age profiles of LFS employment/population ratios for the period 1976 to 2001.

    Release date: 2004-07-14

  • Articles and reports: 12-001-X20020026428
    Description:

    The analysis of survey data from different geographical areas where the data from each area are polychotomous can be easily performed using hierarchical Bayesian models, even if there are small cell counts in some of these areas. However, there are difficulties when the survey data have missing information in the form of non-response, especially when the characteristics of the respondents differ from the non-respondents. We use the selection approach for estimation when there are non-respondents because it permits inference for all the parameters. Specifically, we describe a hierarchical Bayesian model to analyse multinomial non-ignorable non-response data from different geographical areas; some of them can be small. For the model, we use a Dirichlet prior density for the multinomial probabilities and a beta prior density for the response probabilities. This permits a 'borrowing of strength' of the data from larger areas to improve the reliability in the estimates of the model parameters corresponding to the smaller areas. Because the joint posterior density of all the parameters is complex, inference is sampling-based and Markov chain Monte Carlo methods are used. We apply our method to provide an analysis of body mass index (BMI) data from the third National Health and Nutrition Examination Survey (NHANES III). For simplicity, the BMI is categorized into 3 natural levels, and this is done for each of 8 age-race-sex domains and 34 counties. We assess the performance of our model using the NHANES III data and simulated examples, which show our model works reasonably well.

    Release date: 2003-01-29

  • Articles and reports: 12-001-X20010015851
    Description:

    We consider 'telesurveys' as surveys in which the predominant or unique mode of collection is based on some means of electronic telecommunications - including both the telephone and other more advanced technological devices such as e-mail, Internet, videophone or fax. We review, briefly, the early history of telephone surveys, and, in more detail, recent developments in the areas of sample design and estimation, coverage and nonresponse and evaluation of data quality. All these methodological developments have led the telephone survey to become the major mode of collection in the sample survey field in the past quarter of a century. Other modes of advanced telecommunication are fast becoming important supplements and even competitors to the fixed line telephone and are already being used in various ways for sample surveys. We examine their potential for survey work and the possible impact of current and future technological developments of the communications industry on survey practice and their methodological implications.

    Release date: 2001-08-22

  • Articles and reports: 12-001-X20000015178
    Description:

    Longitudinal observations consist of repeated measurements on the same units over a number of occasions, with fixed or varying time spells between the occasions. Each vector observation can be viewed therefore as a time series, usually of short length. Analyzing the measurements for all the units permits the fitting of low-order time series models, despite the short lengths of the individual series.

    Release date: 2000-08-30
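
    A minimal sketch of the idea that pooling many short series supports a low-order time series model: many units are simulated with only a few occasions each, and a common AR(1) coefficient is estimated by pooling lagged cross-products across units. This is purely illustrative, not an estimator from the paper.

    ```python
    import numpy as np

    rng = np.random.default_rng(3)

    n_units, n_occasions = 2_000, 5          # many units, very short series per unit
    phi_true, sigma = 0.6, 1.0

    # Simulate an AR(1) series of length n_occasions for each unit
    y = np.zeros((n_units, n_occasions))
    y[:, 0] = rng.normal(scale=sigma / np.sqrt(1 - phi_true**2), size=n_units)
    for t in range(1, n_occasions):
        y[:, t] = phi_true * y[:, t - 1] + rng.normal(scale=sigma, size=n_units)

    # Pool lagged cross-products across all units and occasions to estimate phi
    phi_hat = np.sum(y[:, 1:] * y[:, :-1]) / np.sum(y[:, :-1] ** 2)

    print(f"True AR(1) coefficient     : {phi_true:.3f}")
    print(f"Pooled estimate from panel : {phi_hat:.3f}")
    ```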

  • Articles and reports: 12-001-X19980024352
    Description:

    The National Population Health Survey (NPHS) is one of Statistics Canada's three major longitudinal household surveys providing an extensive coverage of the Canadian population. A panel of approximately 17,000 people is being followed up every two years for up to twenty years. The survey data are used for longitudinal analyses, although an important objective is the production of cross-sectional estimates. Each cycle, panel respondents provide detailed health information (H) while, to augment the cross-sectional sample, general socio-demographic and health information (G) are collected from all members of their households. This particular collection strategy presents several observable response patterns for Panel Members after two cycles: GH-GH, GH-G*, GH-**, G*-GH, G*-G* and G*-**, where "*" denotes a missing portion of data. The article presents the methodology developed to deal with these types of longitudinal nonresponse as well as with nonresponse from a cross-sectional perspective. The use of weight adjustments for nonresponse and the creation of adjustment cells for weighting using a CHAID algorithm are discussed.

    Release date: 1999-01-14

  • Articles and reports: 12-001-X19970023615
    Description:

    This paper demonstrates the utility of a multi-stage survey design that obtains a total count of health facilities and of the potential client population in an area. The design has been used for a state-level survey conducted in mid-1995 in Uttar Pradesh, India. The design involves a multi-stage, areal cluster sample, wherein the primary sampling unit is either an urban block or rural village. All health service delivery points, either self-standing facilities or distribution agents, in or formally assigned to the primary sampling unit are mapped, listed, and selected. A systematic sample of households is selected, and all resident females meeting predetermined eligibility criteria are interviewed. Sample weights for facilities and individuals are applied. For facilities, the weights are adjusted for survey response levels. The survey estimate of the total number of government facilities compares well against the total published counts. Similarly the female client population estimated in the survey compares well with the total enumerated in the 1991 census.

    Release date: 1998-03-12
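
    As a generic illustration of the weighting described above (not the actual weights of the Uttar Pradesh survey), the sketch below builds a household design weight from two stages of selection probabilities and applies a facility response-rate adjustment. All probabilities and rates are hypothetical.

    ```python
    # Hypothetical two-stage design: a rural village (PSU) and a household within it
    p_psu = 30 / 1_200          # 30 PSUs selected out of 1,200 in the stratum
    p_household = 25 / 180      # 25 households sampled from 180 listed in the PSU

    design_weight = 1.0 / (p_psu * p_household)

    # Facilities in or assigned to the PSU are all listed, so their selection probability is
    # that of the PSU; their weights are then adjusted for facility-level response.
    facility_response_rate = 0.85
    facility_weight = (1.0 / p_psu) / facility_response_rate

    print(f"Household design weight           : {design_weight:.1f}")
    print(f"Response-adjusted facility weight : {facility_weight:.1f}")
    ```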

Reference (25)

Reference (25) (25 of 25 results)

  • Technical products: 11-522-X201700014711
    Description:

    After the 2010 Census, the U.S. Census Bureau conducted two separate research projects matching survey data to databases. One study matched to the third-party database Accurint, and the other matched to U.S. Postal Service National Change of Address (NCOA) files. In both projects, we evaluated response error in reported move dates by comparing the self-reported move date to records in the database. We encountered similar challenges in the two projects. This paper discusses our experience using “big data” as a comparison source for survey data and our lessons learned for future projects similar to the ones we conducted.

    Release date: 2016-03-24

  • Technical products: 11-522-X201700014717
    Description:

    Files with linked data from the Statistics Canada, Postsecondary Student Information System (PSIS) and tax data can be used to examine the trajectories of students who pursue postsecondary education (PSE) programs and their post-schooling labour market outcomes. On one hand, administrative data on students linked longitudinally can provide aggregate information on student pathways during postsecondary studies such as persistence rates, graduation rates, mobility, etc. On the other hand, the tax data could supplement the PSIS data to provide information on employment outcomes such as average and median earnings or earnings progress by employment sector (industry), field of study, education level and/or other demographic information, year over year after graduation. Two longitudinal pilot studies have been done using administrative data on postsecondary students of Maritimes institutions which have been longitudinally linked and linked to Statistics Canada Ttx data (the T1 Family File) for relevant years. This article first focuses on the quality of information in the administrative data and the methodology used to conduct these longitudinal studies and derive indicators. Second, it will focus on some limitations when using administrative data, rather than a survey, to define some concepts.

    Release date: 2016-03-24

  • Technical products: 11-522-X201300014255
    Description:

    The Brazilian Network Information Center (NIC.br) has designed and carried out a pilot project to collect data from the Web in order to produce statistics about the webpages’ characteristics. Studies on the characteristics and dimensions of the web require collecting and analyzing information from a dynamic and complex environment. The core idea was collecting data from a sample of webpages automatically by using software known as web crawler. The motivation for this paper is to disseminate the methods and results of this study as well as to show current developments related to sampling techniques in a dynamic environment.

    Release date: 2014-10-31

  • Technical products: 11-522-X201300014284
    Description:

    The decline in response rates observed by several national statistical institutes, their desire to limit response burden and the significant budget pressures they face support greater use of administrative data to produce statistical information. The administrative data sources they must consider have to be evaluated according to several aspects to determine their fitness for use. Statistics Canada recently developed a process to evaluate administrative data sources for use as inputs to the statistical information production process. This evaluation is conducted in two phases. The initial phase requires access only to the metadata associated with the administrative data considered, whereas the second phase uses a version of the data that can be evaluated. This article outlines the evaluation process and tool.

    Release date: 2014-10-31

  • Technical products: 11-522-X201300014264
    Description:

    While wetlands represent only 6.4% of the world’s surface area, they are essential to the survival of terrestrial species. These ecosystems require special attention in Canada, since that is where nearly 25% of the world’s wetlands are found. Environment Canada (EC) has massive databases that contain all kinds of wetland information from various sources. Before the information in these databases could be used for any environmental initiative, it had to be classified and its quality had to be assessed. In this paper, we will give an overview of the joint pilot project carried out by EC and Statistics Canada to assess the quality of the information contained in these databases, which has characteristics specific to big data, administrative data and survey data.

    Release date: 2014-10-31

  • Technical products: 11-522-X200800010970
    Description:

    RTI International is currently conducting a longitudinal education study. One component of the study involved collecting transcripts and course catalogs from the high schools that sample members attended. Information from the transcripts and course catalogs also needed to be keyed and coded. This presented a challenge because the transcripts and course catalogs were collected from different types of schools, including public, private and religious schools from across the nation, and they varied widely in both content and format. The challenge called for a sophisticated system that could be used by multiple users simultaneously. RTI developed such a system: a high-end, multi-user, multitask, user-friendly and low-maintenance web-based application for keying and coding high school transcripts and course catalogs. The system has three major functions: transcript and catalog keying and coding; keying and coding quality control on the keyer-coder end; and coding quality control on the management end. Given the complex nature of transcript and catalog keying and coding, the system was designed to be flexible and to have the ability to transport keyed and coded data throughout the system to reduce keying time, to logically guide users through all the pages that a given activity requires, to display appropriate information to support keying performance, and to track all keying, coding and quality control activities. Hundreds of catalogs and thousands of transcripts were successfully keyed, coded and verified using the system. This paper reports on the system requirements and design, implementation tips, problems faced and their solutions, and lessons learned.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800011004
    Description:

    The issue of reducing response burden is not new. Statistics Sweden works in various ways to reduce response burden and to decrease the administrative costs of data collection from enterprises and organizations. Under its legislation, Statistics Sweden must reduce the response burden on the business community, so this work is a priority. The Government has set a fixed target of decreasing the administrative costs of enterprises by twenty-five percent by 2010, and this goal also applies to data collection for statistical purposes. The target concerns surveys for which response is compulsory under legislation, but there are many other surveys whose burden also needs to be measured and reduced. To help measure, analyze and reduce the burden, Statistics Sweden has developed the Register of Data Providers concerning enterprises and organizations (ULR). The purpose of the register is twofold: to measure and analyze the burden at an aggregate level, and to be able to tell each individual enterprise which surveys it is participating in.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010968
    Description:

    Statistics Canada has embarked on a program of increasing and improving the use of imaging technology for paper survey questionnaires. The goal is to make the process an efficient, reliable and cost-effective method of capturing survey data. The objective is to continue using Optical Character Recognition (OCR) to capture the data from questionnaires, documents and faxes received, while improving process integration and the Quality Assurance/Quality Control (QA/QC) of the data capture process. These improvements are discussed in this paper.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010988
    Description:

    Online data collection emerged in 1995 as an alternative approach for conducting certain types of consumer research studies, and its use has grown since then. This growth has occurred primarily in studies that use non-probability sampling methods. While online sampling has gained acceptance for some research applications, serious questions remain about the suitability of online samples for research requiring precise volumetric measurement of the behavior of the U.S. population, particularly their travel behavior. This paper reviews the literature and compares results from studies using probability samples and online samples to determine whether the two sampling approaches yield different results. The paper also demonstrates that online samples underestimate critical types of travel even after demographic and geographic weighting.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010993
    Description:

    Until now, years of experience in questionnaire design were required to estimate how long it would take a respondent, on average, to complete a CATI questionnaire for a new survey. This presentation focuses on a new method that produces interview time estimates for questionnaires at the development stage. The method uses Blaise Audit Trail data and previous surveys. It was developed, tested and verified for accuracy on several large-scale surveys.

    First, audit trail data was used to determine the average time previous respondents have taken to answer specific types of questions. These would include questions that require a yes/no answer, scaled questions, "mark all that apply" questions, etc. Second, for any given questionnaire, the paths taken by population sub-groups were mapped to identify the series of questions answered by different types of respondents, and timed to determine what the longest possible interview time would be. Finally, the overall expected time it takes to complete the questionnaire is calculated using estimated proportions of the population expected to answer each question.

    So far, we have used paradata to accurately estimate average respondent interview completion times. We note that the method we developed could also be used to estimate specific respondent interview completion times.
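
    A minimal sketch of the expected-time calculation outlined above: hypothetical average times per question type (standing in for Blaise Audit Trail estimates) are combined with the proportion of respondents expected to reach each question, and the longest possible path is timed separately.

# Minimal sketch: expected CATI completion time from per-question-type average
# times and the share of respondents expected to answer each question.
# Timings, question types and paths are hypothetical.

avg_seconds = {"yes_no": 6.0, "scaled": 12.0, "mark_all": 20.0}

questionnaire = [          # (question, type, estimated share of respondents reaching it)
    ("Q1", "yes_no",   1.00),
    ("Q2", "scaled",   1.00),
    ("Q3", "mark_all", 0.40),   # asked only of a subgroup
    ("Q4", "yes_no",   0.40),
]

expected = sum(avg_seconds[qtype] * p for _, qtype, p in questionnaire)
longest  = sum(avg_seconds[qtype] for _, qtype, _ in questionnaire)   # longest possible path

print(f"expected interview time: {expected:.0f} s; longest path: {longest:.0f} s")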

    Release date: 2009-12-03

  • Technical products: 11-522-X200600110404
    Description:

    Pursuing reduction in cost and response burden in survey programs has led to increased use of information available in administrative databases. Linkage between these two data sources is a way to exploit their complementary nature and maximize their respective usefulness. This paper discusses the various ways we have performed record linkage between the Canadian Community Health Survey (CCHS) and the Health Person-Oriented Information (HPOI) databases. The files resulting from selected linkage methods are used in an analysis of risk factors for having been hospitalized for heart disease. The sensitivity of the analysis with respect to the various linkage approaches is investigated.

    Release date: 2008-03-17

  • Technical products: 11-522-X200600110446
    Description:

    Immigrants have health advantages over native-born Canadians, but those advantages are threatened by specific risk situations. This study explores cardiovascular health outcomes in districts of Montréal classified by the proportion of immigrants in the population, using a principal component analysis. The first three components are immigration, degree of socio-economic disadvantage and degree of economic disadvantage. The incidence of myocardial infarction is lower in districts with large immigrant populations than in districts dominated by native-born Canadians. Mortality rates are associated with the degree of socio-economic disadvantage, while revascularization is associated with the proportion of seniors in the population.
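
    A minimal sketch of a district-level principal component analysis of this kind, run on invented indicators; the variables and values are assumptions, not the Montréal data.

# Minimal sketch: principal component analysis of invented district-level
# indicators, of the kind used to summarise immigration and disadvantage.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
districts = rng.normal(size=(40, 4))   # hypothetical columns: % immigrants,
                                       # % low income, unemployment rate, % seniors

pca = PCA(n_components=3).fit(StandardScaler().fit_transform(districts))
print("variance explained by the first three components:",
      pca.explained_variance_ratio_.round(2))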

    Release date: 2008-03-17

  • Technical products: 11-522-X200600110392
    Description:

    We use a robust Bayesian method to analyze data with possibly nonignorable nonresponse and selection bias. A robust logistic regression model is used to relate the response indicators (Bernoulli random variables) to the covariates, which are available for everyone in the finite population. This relationship can adequately explain the difference between respondents and nonrespondents for the sample. The robust model is obtained by expanding the standard logistic regression model to a mixture of Student's t distributions, thereby providing propensity scores (selection probabilities), which are used to construct adjustment cells. The nonrespondents' values are filled in by drawing a random sample from a kernel density estimator formed from the respondents' values within the adjustment cells. Prediction uses a linear spline rank-based regression of the response variable on the covariates by area, sampling the errors from another kernel density estimator, thereby further robustifying our method. We use Markov chain Monte Carlo (MCMC) methods to fit our model. The posterior distribution of a quantile of the response variable is obtained within each sub-area using the order statistic over all individuals (sampled and nonsampled). We compare our robust method with recent parametric methods.
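
    One ingredient of this approach lends itself to a compact illustration: forming adjustment cells from estimated response propensities and filling in nonrespondents' values by drawing from a kernel density estimate of respondents' values within each cell. The sketch below does this with an ordinary (non-robust) logistic regression and a Gaussian kernel on simulated data; it is not the authors' Student's-t mixture or MCMC procedure.

# Minimal sketch: propensity-based adjustment cells with kernel density
# imputation, on simulated data. Plain logistic regression stands in for the
# robust model described in the paper.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=(n, 1))                              # covariate known for everyone
respond = rng.binomial(1, 1 / (1 + np.exp(-x[:, 0])))    # response indicator
y = 2 * x[:, 0] + rng.normal(size=n)                     # outcome, observed only when respond == 1

propensity = LogisticRegression().fit(x, respond).predict_proba(x)[:, 1]
cells = np.digitize(propensity, np.quantile(propensity, [0.25, 0.5, 0.75]))

y_filled = y.copy()
for c in np.unique(cells):
    in_cell = cells == c
    donors = y[in_cell & (respond == 1)]
    need = in_cell & (respond == 0)
    if need.any() and donors.size > 1:
        # Draw nonrespondents' values from a KDE of respondents in the same cell.
        y_filled[need] = gaussian_kde(donors).resample(need.sum())[0]

print("mean with nonrespondents filled in:", round(float(y_filled.mean()), 3))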

    Release date: 2008-03-17

  • Technical products: 11-522-X20050019477
    Description:

    Using probabilistic data linkage, an integrated database of injuries is obtained by linking on some subset of various key variables or their derivatives: names (given names, surnames and alternative names), age, sex, birthdate, phone numbers, injury date, unique identification numbers and diagnosis. To assess the quality of the links produced, false positive rates and false negative rates are computed. These rates, however, do not indicate whether the databases used for linking have undercounted injuries (bias). Moreover, it is of interest to an injury researcher to have some idea of the margin of error for the figures generated from integrating various injury databases, similar to what one would get from a survey, for instance.
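
    A minimal sketch of how false positive and false negative rates might be computed once each candidate pair has been labelled against a clerically reviewed truth set; the pairs below are invented.

# Minimal sketch: false positive and false negative rates for a set of
# candidate links, against a hypothetical clerically reviewed truth set.
pairs = [
    # (declared_link, true_match)
    (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False),
]

fp = sum(1 for declared, truth in pairs if declared and not truth)
fn = sum(1 for declared, truth in pairs if not declared and truth)
declared_links = sum(1 for declared, _ in pairs if declared)
true_matches = sum(1 for _, truth in pairs if truth)

print("false positive rate:", fp / declared_links)   # share of declared links that are wrong
print("false negative rate:", fn / true_matches)     # share of true matches that were missed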

    Release date: 2007-03-02

  • Technical products: 11-522-X20040018752
    Description:

    This paper outlines some possible applications of the permanent sample of households ready to respond for surveying difficult-to-reach population groups.

    Release date: 2005-10-27

  • Technical products: 11-522-X20040018753
    Description:

    For the estimation of low-income households, a supplementary sample is selected within a limited number of geographic areas. This paper presents the dual sample design used, along with scenarios considered and some findings that led to the choices made.

    Release date: 2005-10-27

  • Technical products: 11-522-X20030017729
    Description:

    This paper describes the design of the samples and analyses factors that affect the scope of the direct data collection for the first Integrated Census (IC) experiment.

    Release date: 2005-01-26

  • Technical products: 11-522-X20030017716
    Description:

    This paper examines how risk and quality can be used to assist with investment decisions across the Office for National Statistics (ONS) in the United Kingdom. It discusses the construction of a table developed to provide measures of the strengths and weaknesses of statistical inputs and outputs.

    Release date: 2005-01-26

  • Technical products: 11-522-X20030017721
    Description:

    This paper discusses a regression model to estimate the variance components for a two-stage sample design.

    Release date: 2005-01-26

  • Technical products: 11-522-X20020016732
    Description:

    Analysis of dose-response relationships has long been important in toxicology. More recently, this type of analysis has been employed to evaluate public education campaigns. The data that are collected in such evaluations are likely to come from standard household survey designs with all the usual complexities of multiple stages, stratification and variable selection probabilities. On a recent evaluation, a system was developed with the following features: categorization of doses into three or four levels, propensity scoring of dose selection and a new jack-knifed Jonckheere-Terpstra test for a monotone dose-response relationship. This system allows rapid production of tests for monotone dose-response relationships that are corrected both for sample design and for confounding. The focus of this paper will be the results of a Monte-Carlo simulation of the properties of the jack-knifed Jonckheere-Terpstra.

    Moreover, there is no experimental control over dosages, and the possibility of confounding variables must be considered. Standard regressions in WESVAR and SUDAAN could be used to determine whether there is a linear dose-response relationship while controlling for confounders, but such an approach has low power to detect nonlinear but monotone dose-response relationships and is time-consuming to implement if there are a large number of possible outcomes of interest.
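
    A minimal sketch of the basic (unjackknifed, unweighted) Jonckheere-Terpstra statistic for an ordered dose grouping, computed directly; the dose groups and outcomes below are invented, and the jackknife variance and survey-design corrections described in the paper are not shown.

# Minimal sketch: the basic Jonckheere-Terpstra statistic for a monotone
# dose-response trend. Dose groups and outcomes are invented.
from itertools import combinations

dose_groups = [            # ordered from lowest to highest exposure
    [2.1, 1.8, 2.5, 2.0],  # low
    [2.4, 2.9, 2.6],       # medium
    [3.1, 2.8, 3.5, 3.3],  # high
]

jt = 0.0
for low, high in combinations(range(len(dose_groups)), 2):
    for a in dose_groups[low]:
        for b in dose_groups[high]:
            jt += (a < b) + 0.5 * (a == b)   # Mann-Whitney-type count for each ordered pair

print("Jonckheere-Terpstra statistic:", jt)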

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016744
    Description:

    A developmental trajectory describes the course of a behaviour over age or time. This technical paper provides an overview of a semi-parametric, group-based method for analysing developmental trajectories. This methodology provides an alternative to assuming a homogeneous population of trajectories, as is done in standard growth modelling.

    Four capabilities are described: (1) the capability to identify, rather than assume, distinctive groups of trajectories; (2) the capability to estimate the proportion of the population following each such trajectory group; (3) the capability to relate group membership probability to individual characteristics and circumstances; and (4) the capability to use the group membership probabilities for various other purposes, such as creating profiles of group members.

    In addition, two important extensions of the method are described: the capability to add time-varying covariates to trajectory models and the capability to estimate joint trajectory models of distinct but related behaviours. The former provides the statistical capacity for testing if a contemporary factor, such as an experimental intervention or a non-experimental event like pregnancy, deflects a pre-existing trajectory. The latter provides the capability to study the unfolding of distinct but related behaviours such as problematic childhood behaviour and adolescent drug abuse.
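
    A crude two-step stand-in for the group-based approach, offered only as a sketch: fit a polynomial trajectory to each individual's series, then cluster the fitted coefficients to identify trajectory groups and their population shares. The actual method fits a finite-mixture likelihood; the data below are simulated.

# Minimal sketch: identify trajectory groups by clustering per-person
# polynomial fits. Not the semi-parametric mixture model itself; data simulated.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
ages = np.arange(6, 16)

# Two latent groups: a stable-low group and a rising group.
trajectories = np.vstack([
    [0.5 + rng.normal(0, 0.2, ages.size) for _ in range(30)],
    [0.2 + 0.15 * (ages - 6) + rng.normal(0, 0.2, ages.size) for _ in range(30)],
])

coefs = np.array([np.polyfit(ages, y, deg=2) for y in trajectories])
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coefs)

for g in (0, 1):
    print(f"group {g}: estimated share of population = {(groups == g).mean():.2f}")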

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016739
    Description:

    The Labour Force Survey (LFS) was not designed to be a longitudinal survey. However, given that respondent households typically remain in the sample for six consecutive months, it is possible to reconstruct six-month fragments of longitudinal data from the monthly records of household members. Such longitudinal data (altogether consisting of millions of person-months of individual- and family-level data) is useful for analyses of monthly labour market dynamics over relatively long periods of time, 20 years and more.

    We make use of these data to estimate hazard functions describing transitions among the labour market states: self-employed, paid employee and not employed. Data on job tenure for the employed, and data on the date last worked for the not employed - together with the date of survey responses - permit the estimated models to include terms reflecting seasonality and macro-economic cycles, as well as the duration dependence of each type of transition. In addition, the LFS data permit spouse labour market activity and family composition variables to be included in the hazard models as time-varying covariates. The estimated hazard equations have been included in the LifePaths socio-economic microsimulation model. In this setting, the equations may be used to simulate lifetime employment activity from past, present and future birth cohorts. Cross-sectional simulation results have been used to validate these models by comparisons with census data from the period 1971 to 1996.
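
    A minimal sketch of a discrete-time hazard for a single transition (leaving paid employment), fitted as a logistic regression on simulated person-month records with duration, seasonal and spouse-activity terms; this is an illustration under simplifying assumptions, not the LifePaths specification.

# Minimal sketch: discrete-time hazard of leaving paid employment, as a
# logistic regression on simulated person-month records.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
tenure_months = rng.integers(1, 120, n)        # job tenure so far (duration dependence)
month = rng.integers(1, 13, n)                 # calendar month (seasonality)
spouse_employed = rng.binomial(1, 0.6, n)      # time-varying family covariate

logit = -2.0 - 0.01 * tenure_months + 0.3 * (month == 1) - 0.4 * spouse_employed
exit_event = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # 1 = leaves paid employment this month

X = np.column_stack([tenure_months, (month == 1).astype(int), spouse_employed])
model = LogisticRegression().fit(X, exit_event)

print("estimated coefficients (tenure, January, spouse employed):", model.coef_.round(3))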

    Release date: 2004-09-13

  • Technical products: 11-522-X20010016289
    Description:

    This paper discusses in detail issues dealing with the technical aspects of designing and conducting surveys. It is intended for an audience of survey methodologists.

    Increasing demand for electronic reporting in establishment surveys has placed additional emphasis on incorporating usability into electronic forms. We are just beginning to understand the implications surrounding electronic forms design. Cognitive interviewing and usability testing are analogous in that both types of testing have similar goals: to build an end instrument (paper or electronic) that reduces both respondent burden and measurement error. Cognitive testing has greatly influenced paper forms design and can also be applied towards the development of electronic forms. Usability testing expands on existing cognitive testing methodology to include examination of the interaction between the respondent and the electronic form.

    The upcoming U.S. 2002 Economic Census will offer businesses the ability to report information using electronic forms. The U.S. Census Bureau is creating an electronic forms style guide outlining the design standards to be used in electronic form creation. The style guide's design standards are based on usability principles, usability and cognitive test results, and Graphical User Interface standards. This paper highlights the major electronic forms design issues raised during the preparation of the style guide and describes how usability testing and cognitive interviewing resolved these issues.

    Release date: 2002-09-12

  • Technical products: 11-522-X20010016235
    Description:

    This paper discusses in detail issues dealing with the technical aspects of designing and conducting surveys. It is intended for an audience of survey methodologists.

    Police records collected by the Federal Bureau of Investigation (FBI) through the Uniform Crime Reporting (UCR) Program are the leading source of national crime statistics. Recently, audits to correct UCR records have raised concerns about how to handle the errors discovered in these files. Concerns centre on the methodology used to detect errors and on the procedures used to correct errors once they have been discovered. This paper explores these concerns, focusing on sampling methodology, establishment of a statistical-adjustment factor, and alternative solutions. The paper distinguishes between sample adjustment and sample estimates of an agency's data, and recommends sample adjustment as the most accurate way of dealing with errors.

    Release date: 2002-09-12

  • Technical products: 11-522-X19990015660
    Description:

    There are many different situations in which one or more files need to be linked. With one file, the purpose of the linkage is to locate duplicates within the file. When there are two files, the linkage is done to identify the units that are the same on both files and thus create matched pairs. Often, records that need to be linked do not have a unique identifier. Hierarchical record linkage, probabilistic record linkage and statistical matching are three methods that can be used when there is no unique identifier on the files to be linked. We describe the major differences between the methods, and consider how to choose variables to link, how to prepare files for linkage and how the links are identified. We also review tips and tricks used when linking files. Two examples will be illustrated: the probabilistic record linkage used in the reverse record check, and the hierarchical record linkage of the Business Number (BN) master file to the Statistical Universe File (SUF) of unincorporated tax filers (T1).
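
    A minimal sketch of the agreement-weight idea behind probabilistic record linkage, in the Fellegi-Sunter style: each comparison field contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, and pairs whose total weight exceeds a threshold are treated as links. The m- and u-probabilities, records and threshold below are invented.

# Minimal sketch: Fellegi-Sunter-style pairwise weights for probabilistic
# record linkage. The m- and u-probabilities and records are invented.
from math import log2

m_u = {  # field: (P(agree | true match), P(agree | non-match))
    "surname":    (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "sex":        (0.98, 0.50),
}

def link_weight(rec_a, rec_b):
    weight = 0.0
    for field, (m, u) in m_u.items():
        if rec_a[field] == rec_b[field]:
            weight += log2(m / u)               # agreement raises the weight
        else:
            weight += log2((1 - m) / (1 - u))   # disagreement lowers it
    return weight

a = {"surname": "Tremblay", "birth_year": 1968, "sex": "F"}
b = {"surname": "Tremblay", "birth_year": 1968, "sex": "M"}

w = link_weight(a, b)
print("pair weight:", round(w, 2), "-> link" if w > 5 else "-> no link")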

    Release date: 2000-03-02
