Statistics by subject – Statistical methods

All (90) (25 of 90 results)

  • Articles and reports: 11F0019M2004219
    Description:

    This study investigates trends in family income inequality in the 1980s and 1990s, with particular attention paid to the recovery period of the 1990s.

    Release date: 2004-12-16

  • Index and guides: 92-395-X
    Description:

    This report describes sampling and weighting procedures used in the 2001 Census. It reviews the history of these procedures in Canadian censuses, provides operational and theoretical justifications for them, and presents the results of the evaluation studies of these procedures.

    Release date: 2004-12-15

  • Index and guides: 92-394-X
    Description:

This report deals with coverage errors that occur when persons, households, dwellings or families are missed or enumerated in error by the census. After the 2001 Census was taken, a number of studies were carried out to estimate gross undercoverage, gross overcoverage and net undercoverage. This report presents the results of the Dwelling Classification Study, the Reverse Record Check Study, the Automated Match Study and the Collective Dwelling Study. The report first describes census universes, coverage error and the census collection and processing procedures that may result in coverage error. It then gives estimates of net undercoverage for a number of demographic characteristics. Next, after describing how the results of the various studies are combined, the technical report presents the methodology and results of each coverage study and the estimates of coverage error. A historical perspective completes the product.

    Release date: 2004-11-25

  • Articles and reports: 13-604-M2004045
    Description:

    How "good" are the National Tourism Indicators (NTI)? How can their quality be measured? This study looks to answer these questions by analysing the revisions to the NTI estimates for the period 1997 through 2001.

    Release date: 2004-10-25

  • Table: 53-500-X
    Description:

    This report presents the results of a pilot survey conducted by Statistics Canada to measure the fuel consumption of on-road motor vehicles registered in Canada. This study was carried out in connection with the Canadian Vehicle Survey (CVS) which collects information on road activity such as distance traveled, number of passengers and trip purpose.

    Release date: 2004-10-21

  • Surveys and statistical programs – Documentation: 31-533-X
    Description:

    Starting with the August 2004 reference month, the Monthly Survey of Manufacturing (MSM) is using administrative data (Goods and Services Tax files) to derive shipments for a portion of the small establishments in the sample. This document is being published to complement the release of MSM data for that month.

    Release date: 2004-10-15

  • Technical products: 12-002-X20040027035
    Description:

As part of the processing of the National Longitudinal Survey of Children and Youth (NLSCY) cycle 4 data, historical revisions have been made to the data of the first three cycles, either to correct errors or to update the data. During processing, particular attention was given to PERSRUK (Person Identifier) and FIELDRUK (Household Identifier). The same level of attention has not been given to the other identifiers included in the database, CHILDID (Person Identifier) and _IDHD01 (Household Identifier). These identifiers were created for the public files and can also be found in the master files by default. PERSRUK should be used to link records between files, and FIELDRUK to determine the household, when using the master files.

    Release date: 2004-10-05
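
The identifier guidance above translates directly into a linkage step. Below is a minimal, hypothetical sketch in Python/pandas; the file layouts and values are invented, and only the identifier names PERSRUK and FIELDRUK come from the note itself.

```python
import pandas as pd

# Hypothetical cycle extracts: only the identifier names PERSRUK (person)
# and FIELDRUK (household) come from the NLSCY note above.
cycle3 = pd.DataFrame({"PERSRUK": [101, 102, 103],
                       "FIELDRUK": [11, 11, 12],
                       "score_c3": [48, 52, 50]})
cycle4 = pd.DataFrame({"PERSRUK": [101, 102, 103],
                       "score_c4": [51, 55, 49]})

# Link person records across cycles on PERSRUK, as the note recommends...
linked = cycle3.merge(cycle4, on="PERSRUK", how="inner")

# ...and use FIELDRUK to group linked persons into households.
per_household = linked.groupby("FIELDRUK").size()
print(linked)
print(per_household)
```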

  • Technical products: 12-002-X20040027034
    Description:

The use of command files in Stat/Transfer can expedite the transfer of several data sets in an efficient, replicable manner. This note outlines a simple step-by-step method for creating command files and provides sample code.

    Release date: 2004-10-05

  • Technical products: 12-002-X20040027032
    Description:

This article examines why many Statistics Canada surveys supply bootstrap weights with their microdata for the purpose of design-based variance estimation. Bootstrap weights are not supported by commercially available software such as SUDAAN and WesVar, but there are ways to use these applications to produce bootstrap variance estimates.

    The paper concludes with a brief discussion of other design-based approaches to variance estimation as well as software, programs and procedures where these methods have been employed.

    Release date: 2004-10-05
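
To make the bootstrap-weight approach in the preceding item concrete, here is a minimal sketch assuming the usual replicate-weight variance formula v_B = (1/B) * sum_b (theta_b - theta_hat)^2. The data and replicate weights are synthetic stand-ins, not output from any Statistics Canada survey.

```python
import numpy as np

rng = np.random.default_rng(42)
n, B = 1000, 500
y = rng.normal(50, 10, size=n)                   # variable of interest
w = rng.uniform(1, 3, size=n)                    # full-sample survey weights
bw = w[:, None] * rng.uniform(0.5, 1.5, (n, B))  # stand-in bootstrap weights

theta_hat = np.sum(w * y) / np.sum(w)            # full-sample weighted mean
theta_b = (bw * y[:, None]).sum(axis=0) / bw.sum(axis=0)  # replicate estimates
var_boot = np.mean((theta_b - theta_hat) ** 2)   # bootstrap variance estimate
print(f"estimate {theta_hat:.2f}, bootstrap SE {np.sqrt(var_boot):.3f}")
```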

  • Technical products: 21-601-M2004072
    Description:

The Farm Product Price Index (FPPI) is a monthly series that measures the changes in prices that farmers receive for the agricultural commodities they produce and sell.

    The FPPI was discontinued in March 1995; it was revived in April 2001 owing to continued demand for an index of prices received by farmers.

    Release date: 2004-09-28

  • Surveys and statistical programs – Documentation: 62F0026M2004001
    Description:

    This report describes the quality indicators produced for the 2002 Survey of Household Spending. These quality indicators, such as coefficients of variation, nonresponse rates, slippage rates and imputation rates, help users interpret the survey data.

    Release date: 2004-09-15

  • Technical products: 11-522-X2002001
    Description:

    Since 1984, an annual international symposium on methodological issues has been sponsored by Statistics Canada. Proceedings have been available since 1987.

Symposium 2002 was the nineteenth in Statistics Canada's series of international symposia on methodological issues. Each year the symposium focuses on a particular theme. In 2002 the theme was: "Modelling Survey Data for Social and Economic Research".

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016744
    Description:

A developmental trajectory describes the course of a behaviour over age or time. This technical paper provides an overview of a semi-parametric, group-based method for analysing developmental trajectories. This methodology provides an alternative to assuming a homogeneous population of trajectories, as is done in standard growth modelling.

    Four capabilities are described: (1) the capability to identify, rather than assume, distinctive groups of trajectories; (2) the capability to estimate the proportion of the population following each such trajectory group; (3) the capability to relate group membership probability to individual characteristics and circumstances; and (4) the capability to use the group membership probabilities for various other purposes, such as creating profiles of group members.

    In addition, two important extensions of the method are described: the capability to add time-varying covariates to trajectory models and the capability to estimate joint trajectory models of distinct but related behaviours. The former provides the statistical capacity for testing if a contemporary factor, such as an experimental intervention or a non-experimental event like pregnancy, deflects a pre-existing trajectory. The latter provides the capability to study the unfolding of distinct but related behaviours such as problematic childhood behaviour and adolescent drug abuse.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016739
    Description:

    The Labour Force Survey (LFS) was not designed to be a longitudinal survey. However, given that respondent households typically remain in the sample for six consecutive months, it is possible to reconstruct six-month fragments of longitudinal data from the monthly records of household members. Such longitudinal data (altogether consisting of millions of person-months of individual- and family-level data) is useful for analyses of monthly labour market dynamics over relatively long periods of time, 20 years and more.

    We make use of these data to estimate hazard functions describing transitions among the labour market states: self-employed, paid employee and not employed. Data on job tenure for the employed, and data on the date last worked for the not employed - together with the date of survey responses - permit the estimated models to include terms reflecting seasonality and macro-economic cycles, as well as the duration dependence of each type of transition. In addition, the LFS data permit spouse labour market activity and family composition variables to be included in the hazard models as time-varying covariates. The estimated hazard equations have been included in the LifePaths socio-economic microsimulation model. In this setting, the equations may be used to simulate lifetime employment activity from past, present and future birth cohorts. Cross-sectional simulation results have been used to validate these models by comparisons with census data from the period 1971 to 1996.

    Release date: 2004-09-13
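
The hazard modelling sketched in the preceding item amounts to a discrete-time hazard regression on person-month records. The illustration below fits a logistic model with duration, a seasonal indicator and a time-varying spouse covariate to synthetic data; it is not the authors' LifePaths code, and all covariates are placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic person-month records: 'exit' flags a transition out of the
# current labour market state in that month.
rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "duration": rng.integers(1, 60, n),        # months spent in current state
    "month": rng.integers(1, 13, n),           # calendar month (seasonality)
    "spouse_employed": rng.integers(0, 2, n),  # time-varying covariate
})
lp = (-2.0 - 0.02 * df["duration"] + 0.3 * (df["month"] == 9)
      + 0.2 * df["spouse_employed"])
df["exit"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

X = sm.add_constant(pd.DataFrame({
    "duration": df["duration"],
    "september": (df["month"] == 9).astype(int),
    "spouse_employed": df["spouse_employed"],
}))
fit = sm.GLM(df["exit"], X, family=sm.families.Binomial()).fit()
print(fit.summary())
```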

  • Technical products: 11-522-X20020016735
    Description:

In the 2001 Canadian Census of Population, calibration or regression estimation was used to calculate a single set of household-level weights to be used for all census estimates based on a one-in-five national sample of more than two million households. Because many auxiliary variables were available, only a subset of them could be used; otherwise, some of the weights would have been smaller than one or even negative. In this technical paper, a forward selection procedure was used to discard auxiliary variables that caused weights to be smaller than one or that caused a large condition number for the calibration weight matrix being inverted. Also, two calibration adjustments were done to achieve close agreement between auxiliary population counts and estimates for small areas. Prior to 2001, the projection generalized regression (GREG) estimator was used and the weights were required to be greater than zero. For the 2001 Census, a switch was made to a pseudo-optimal regression estimator that kept more auxiliary variables and, at the same time, required that the weights be one or more.

    Release date: 2004-09-13
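
A minimal sketch of the calibration (GREG) weighting step discussed above, assuming the standard linear calibration solution w = d(1 + X(X'DX)^(-1)(T - X'd)); the sample, auxiliaries and control totals are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
d = np.full(n, 5.0)                       # design weights (one-in-five sample)
X = np.column_stack([np.ones(n),          # auxiliary variables
                     rng.integers(1, 6, n).astype(float)])
T = np.array([1000.0, 3000.0])            # known population totals

XtDX = X.T @ (d[:, None] * X)             # calibration weight matrix
lam = np.linalg.solve(XtDX, T - X.T @ d)
w = d * (1.0 + X @ lam)                   # calibrated weights

print(X.T @ w)   # reproduces T exactly
print(w.min())   # weights below one would trigger variable-dropping above
```

The final check mirrors the paper's constraint: auxiliary variables that drive weights below one, or inflate the condition number of the calibration weight matrix, are the ones the forward selection procedure discards.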

  • Technical products: 11-522-X20020016725
    Description:

    In 1997, the US Office of Management and Budget issued revised standards for the collection of race information within the federal statistical system. One revision allows individuals to choose more than one race group when responding to federal surveys and other federal data collections. This change presents challenges for analyses that involve data collected under both the old and new race-reporting systems, since the data on race are not comparable. The following paper discusses the problems encountered by these changes and methods developed to overcome them.

Since most people under both systems report only a single race, a common proposed solution is to try to bridge the transition by assigning a single-race category to each multiple-race reporter under the new system, and to conduct analyses using just the observed and assigned single-race categories. Thus, the problem can be viewed as a missing-data problem, in which single-race responses are missing for multiple-race reporters and need to be imputed.

    The US Office of Management and Budget suggested several simple bridging methods to handle this missing-data problem. Schenker and Parker (Statistics in Medicine, forthcoming) analysed data from the National Health Interview Survey of the US National Center for Health Statistics, which allows multiple-race reporting but also asks multiple-race reporters to specify a primary race, and found that improved bridging methods could result from incorporating individual-level and contextual covariates into the bridging models.

    While Schenker and Parker discussed only three large multiple-race groups, the current application requires predicting single-race categories for several small multiple-race groups as well. Thus, problems of sparse data arise in fitting the bridging models. We address these problems by building combined models for several multiple-race groups, thus borrowing strength across them. These and other methodological issues are discussed.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016747
    Description:

    This project seeks to shed light not only on the degree to which individuals are stuck in the low-income range, but also on those who have sufficient opportunity to move into the upper part of the income distribution. It also seeks to compare patterns of mobility through the income distribution in North America and Europe, shedding light on the impact of different models of integration. Cross-National Equivalent File data from the British Household Panel Survey (BHPS) for the United Kingdom, the German Socio-Economic Panel (GSOEP) for Germany, the Panel Study of Income Dynamics (PSID) for the United States and the Survey of Labour Income Dynamics (SLID) for Canada offer a comparative analysis of the dynamics of household income during the 1990s, paying particular attention to both low- and high-income dynamics. Canadian administrative data drawn from income tax files are also used. These panel datasets range in length from six years (for the SLID) to almost 20 years (for the PSID and the Canadian administrative data). The analysis focuses on developments during the 1990s, but also explores the sensitivity of the results to changes in the length of the period analysed.

The analysis begins by offering a broad descriptive overview of the major characteristics and events (demographic versus labour market) that determine levels and changes in adjusted household incomes. Attention is paid to movements into and out of low- and high-income ranges. A number of definitions are used, incorporating absolute and relative notions of poverty. The sensitivity of the results to the use of various equivalence scales is examined. An overview offers a broad picture of the state of household income in each country and the relative roles of family structure, the labour market and the welfare state in determining income mobility. The paper employs discrete-time hazard methods to model the dynamics of entry to and exit from both low and high income.

    Both observed and unobserved heterogeneity are controlled for with the intention of highlighting differences in the determinants of the transition rates between the countries. This is done in a way that assesses the importance of the relative roles of family, market and state. Attention is also paid to important institutional changes, most notably the increasing integration of product and labour markets in North America and Europe.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016721
    Description:

    This paper examines the simulation study that was conducted to assess the sampling scheme designed for the World Health Organization (WHO) Injection Safety Assessment Survey. The objective of this assessment survey is to determine whether facilities in which injections are given meet the necessary safety requirements for injection administration, equipment, supplies and waste disposal. The main parameter of interest is the proportion of health care facilities in a country that have safe injection practices.

    The objective of this simulation study was to assess the accuracy and precision of the proposed sampling design. To this end, two artificial populations were created based on the two African countries of Niger and Burkina Faso, in which the pilot survey was tested. To create a wide variety of hypothetical populations, the assignment of whether a health care facility was safe or not was based on the different combinations of the population proportion of safe health care facilities in the country, the homogeneity of the districts in the country with respect to injection safety, and whether the health care facility was located in an urban or rural district.

    Using the results of the simulation, a multi-factor analysis of variance was used to determine which factors affect the outcome measures of absolute bias, standard error and mean-squared error.

    Release date: 2004-09-13
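
The simulation logic described above can be sketched in a few lines: draw repeated samples from an artificial population of facilities, estimate the proportion with safe practices, and summarize accuracy by absolute bias, standard error and mean-squared error. The population values here are invented, not those of the artificial Niger or Burkina Faso populations.

```python
import numpy as np

rng = np.random.default_rng(6)
N, n, reps = 800, 80, 2000
safe = rng.random(N) < 0.6                # artificial population, 60% safe
p_true = safe.mean()

est = np.array([safe[rng.choice(N, n, replace=False)].mean()
                for _ in range(reps)])
bias = est.mean() - p_true
se = est.std(ddof=1)
mse = bias ** 2 + se ** 2                 # the paper's three outcome measures
print(abs(bias), se, mse)
```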

  • Technical products: 11-522-X20020016708
    Description:

    In this paper, we discuss the analysis of complex health survey data by using multivariate modelling techniques. Main interests are in various design-based and model-based methods that aim at accounting for the design complexities, including clustering, stratification and weighting. Methods covered include generalized linear modelling based on pseudo-likelihood and generalized estimating equations, linear mixed models estimated by restricted maximum likelihood, and hierarchical Bayes techniques using Markov Chain Monte Carlo (MCMC) methods. The methods will be compared empirically, using data from an extensive health interview and examination survey conducted in Finland in 2000 (Health 2000 Study).

    The data of the Health 2000 Study were collected using personal interviews, questionnaires and clinical examinations. A stratified two-stage cluster sampling design was used in the survey. The sampling design involved positive intra-cluster correlation for many study variables. For a closer investigation, we selected a small number of study variables from the health interview and health examination phases. In many cases, the different methods produced similar numerical results and supported similar statistical conclusions. Methods that failed to account for the design complexities sometimes led to conflicting conclusions. We also discuss the application of the methods in this paper by using standard statistical software products.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016430
    Description:

Linearization (or Taylor series) methods are widely used to estimate standard errors for the coefficients of linear regression models fit to multi-stage samples. When the number of primary sampling units (PSUs) is large, linearization can produce accurate standard errors under quite general conditions. However, when the number of PSUs is small or a coefficient depends primarily on data from a small number of PSUs, linearization estimators can have large negative bias.

In this paper, we characterize features of the design matrix that produce large bias in linearization standard errors for linear regression coefficients. We then propose a new method, bias reduced linearization (BRL), based on residuals adjusted to better approximate the covariance of the true errors. When the errors are independent and identically distributed (i.i.d.), the BRL estimator is unbiased for the variance. Furthermore, a simulation study shows that BRL can greatly reduce the bias, even if the errors are not i.i.d. We also propose using a Satterthwaite approximation to determine the degrees of freedom of the reference distribution for tests and confidence intervals about linear combinations of coefficients based on the BRL estimator. We demonstrate that the jackknife estimator also tends to be biased in situations where linearization is biased. However, the jackknife's bias tends to be positive. Our bias-reduced linearization estimator can be viewed as a compromise between the traditional linearization and jackknife estimators.

    Release date: 2004-09-13
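
For orientation, here is a sketch of the baseline the paper improves on: the standard linearization (cluster-robust sandwich) variance estimator for regression coefficients, applied to synthetic data with a small number of PSUs. The BRL residual adjustment itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)
G, m = 8, 25                               # few PSUs: the problematic case
X = np.column_stack([np.ones(G * m), rng.normal(size=G * m)])
u = np.repeat(rng.normal(0, 1, G), m)      # PSU-level random effects
y = X @ np.array([1.0, 0.5]) + u + rng.normal(size=G * m)
psu = np.repeat(np.arange(G), m)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                      # OLS coefficients
e = y - X @ b
meat = sum(np.outer(X[psu == g].T @ e[psu == g],
                    X[psu == g].T @ e[psu == g]) for g in range(G))
V = XtX_inv @ meat @ XtX_inv               # linearization variance estimate
print(np.sqrt(np.diag(V)))                 # SEs; biased downward with few PSUs
```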

  • Technical products: 11-522-X20020016717
    Description:

In the United States, the National Health and Nutrition Examination Survey (NHANES) is linked to the National Health Interview Survey (NHIS) at the primary sampling unit level (the same counties, but not necessarily the same persons, are in both surveys). The NHANES examines about 5,000 persons per year, while the NHIS samples about 100,000 persons per year. In this paper, we present and develop properties of models that allow NHIS and administrative data to be used as auxiliary information for estimating quantities of interest in the NHANES. The methodology, related to Fay-Herriot (1979) small-area models and to calibration estimators in Deville and Särndal (1992), accounts for the survey designs in the error structure.

    Release date: 2004-09-13
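
A minimal sketch of the Fay-Herriot style shrinkage the methodology relates to: each area's direct estimate is pulled toward a regression prediction from auxiliary data, with the model variance treated as known for simplicity (in practice it is estimated). All data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(7)
m = 30
x = rng.uniform(0, 1, m)                        # auxiliary covariate per area
theta = 2.0 + 3.0 * x + rng.normal(0, 0.5, m)   # true area means
psi = np.full(m, 0.25)                          # known sampling variances
y = theta + rng.normal(0, np.sqrt(psi))         # direct survey estimates

sigma2_v = 0.25                                 # model variance (assumed known)
X = np.column_stack([np.ones(m), x])
W = 1.0 / (psi + sigma2_v)
beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * y))
gamma = sigma2_v / (sigma2_v + psi)             # shrinkage factor per area
eblup = gamma * y + (1 - gamma) * (X @ beta)    # composite small-area estimate
print(eblup[:5])
```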

  • Technical products: 11-522-X20020016733
    Description:

    While censuses and surveys are often said to measure populations as they are, most reflect information about individuals as they were at the time of measurement, or even at some prior time point. Inferences from such data therefore should take into account change over time at both the population and individual levels. In this paper, we provide a unifying framework for such inference problems, illustrating it through a diverse series of examples including: (1) estimating residency status on Census Day using multiple administrative records, (2) combining administrative records for estimating the size of the US population, (3) using rolling averages from the American Community Survey, and (4) estimating the prevalence of human rights abuses.

    Specifically, at the population level, the estimands of interest, such as the size or mean characteristics of a population, might be changing. At the same time, individual subjects might be moving in and out of the frame of the study or changing their characteristics. Such changes over time can affect statistical studies of government data that combine information from multiple data sources, including censuses, surveys and administrative records, an increasingly common practice. Inferences from the resulting merged databases often depend heavily on specific choices made in combining, editing and analysing the data that reflect assumptions about how populations of interest change or remain stable over time.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016719
    Description:

This study examines modelling methods for public health data. Public health has a renewed interest in the impact of the environment on health. Ecological or contextual studies ideally investigate these relationships using public health data augmented with environmental characteristics in multilevel or hierarchical models. In these models, individual respondents in health data are the first level and community data are the second level. Most public health data use complex sample survey designs, which require analyses accounting for clustering, nonresponse and poststratification to obtain representative estimates of the prevalence of health risk behaviours.

This study uses the Behavioral Risk Factor Surveillance System (BRFSS), a state-specific US health risk factor surveillance system conducted by the Centers for Disease Control and Prevention, which assesses health risk factors in over 200,000 adults annually. BRFSS data are now available at the metropolitan statistical area (MSA) level and provide quality health information for studies of environmental effects. MSA-level analyses combining health and environmental data are further complicated by joint requirements of the survey sample design and the multilevel analyses.

    We compare three modelling methods in a study of physical activity and selected environmental factors using BRFSS 2000 data. Each of the methods described here is a valid way to analyse complex sample survey data augmented with environmental information, although each accounts for the survey design and multilevel data structure in a different manner and is thus appropriate for slightly different research questions.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016726
    Description:

    Although the use of school vouchers is growing in the developing world, the impact of vouchers is an open question. Any sort of long-term assessment of this activity is rare. This paper estimates the long-term effect of Colombia's PACES program, which provided over 125,000 poor children with vouchers that covered half the cost of private secondary school.

    The PACES program presents an unusual opportunity to assess the effect of demand-side education financing in a Latin American country where private schools educate a substantial proportion of pupils. The program is of special interest because many vouchers were assigned by lottery, so program effects can be reliably assessed.

    We use administrative records to assess the long-term impact of PACES vouchers on high school graduation status and test scores. The principal advantage of administrative records is that there is no loss-to-follow-up and the data are much cheaper than a costly and potentially dangerous survey effort. On the other hand, individual ID numbers may be inaccurate, complicating record linkage, and selection bias contaminates the sample of test-takers. We discuss solutions to these problems. The results suggest that the program increased secondary school completion rates, and that college-entrance test scores were higher for lottery winners than losers.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016750
    Description:

    Analyses of data from social and economic surveys sometimes use generalized variance function models to approximate the design variance of point estimators of population means and proportions. Analysts may use the resulting standard error estimates to compute associated confidence intervals or test statistics for the means and proportions of interest. In comparison with design-based variance estimators computed directly from survey microdata, generalized variance function models have several potential advantages, as will be discussed in this paper, including operational simplicity; increased stability of standard errors; and, for cases involving public-use datasets, reduction of disclosure limitation problems arising from the public release of stratum and cluster indicators.

    These potential advantages, however, may be offset in part by several inferential issues. First, the properties of inferential statistics based on generalized variance functions (e.g., confidence interval coverage rates and widths) depend heavily on the relative empirical magnitudes of the components of variability associated, respectively, with:

(a) the random selection of a subset of items used in estimation of the generalized variance function model;
(b) the selection of sample units under a complex sample design;
(c) the lack of fit of the generalized variance function model;
(d) the generation of a finite population under a superpopulation model.

Second, under certain conditions, one may link each of components (a) through (d) with different empirical measures of the predictive adequacy of a generalized variance function model. Consequently, these measures of predictive adequacy can offer some insight into the extent to which a given generalized variance function model may be appropriate for inferential use in specific applications.

Some of the proposed diagnostics are applied to data from the US Survey of Doctoral Recipients and the US Current Employment Survey. For the Survey of Doctoral Recipients, components (a), (c) and (d) are of principal concern. For the Current Employment Survey, components (b), (c) and (d) receive principal attention, and the availability of population microdata allows the development of especially detailed models for components (b) and (c).

    Release date: 2004-09-13
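
A minimal sketch of one common generalized variance function form, relvariance = a + b/X_hat, fitted by least squares to a set of direct variance estimates. The functional form is a textbook choice, not necessarily the one used for the two surveys named above.

```python
import numpy as np

rng = np.random.default_rng(11)
x_hat = rng.uniform(1e3, 1e6, 40)                       # estimated totals
relvar = 1e-4 + 50.0 / x_hat + rng.normal(0, 1e-5, 40)  # direct relvariances

A = np.column_stack([np.ones_like(x_hat), 1.0 / x_hat])
a, b = np.linalg.lstsq(A, relvar, rcond=None)[0]        # fit the GVF model

x_new = 5e4                                  # a new published estimate
se_new = x_new * np.sqrt(a + b / x_new)      # GVF-predicted standard error
print(a, b, se_new)
```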

Data (2) (2 results)

  • Table: 95F0495X2001012
    Description:

    This table contains information from the 2001 Census, presented according to the statistical area classification (SAC). The SAC groups census subdivisions according to whether they are a component of a census metropolitan area, a census agglomeration, a census metropolitan area and census agglomeration influenced zone (strong MIZ, moderate MIZ, weak MIZ or no MIZ) or of the territories (Northwest Territories, Nunavut and Yukon Territory). The SAC is used for data dissemination purposes.

    Data characteristics presented according to the SAC include age, visible minority groups, immigration, mother tongue, education, income, work and dwellings. Data are presented for Canada, provinces and territories. The data characteristics presented within this table may differ from those of other products in the "Profiles" series.

    Release date: 2004-02-27

Analysis (26) (25 of 26 results)

  • Articles and reports: 12-001-X20040016990
    Description:

    Survey statisticians have long known that the question-answer process is a source of response effects that contribute to non-random measurement error. In the past two decades there has been substantial progress toward understanding these sources of error by applying concepts from social and cognitive psychology to the study of the question-answer process. This essay reviews the development of these approaches, discusses the present state of our knowledge, and suggests some research priorities for the future.

    Release date: 2004-07-14

  • Articles and reports: 12-001-X20040016991
    Description:

    In survey sampling, Taylor linearization is often used to obtain variance estimators for calibration estimators of totals and nonlinear finite population (or census) parameters, such as ratios, regression and correlation coefficients, which can be expressed as smooth functions of totals. Taylor linearization is generally applicable to any sampling design, but it can lead to multiple variance estimators that are asymptotically design unbiased under repeated sampling. The choice among the variance estimators requires other considerations such as (i) approximate unbiasedness for the model variance of the estimator under an assumed model, (ii) validity under a conditional repeated sampling framework. In this paper, a new approach to deriving Taylor linearization variance estimators is proposed. It leads directly to a variance estimator which satisfies the above considerations at least in a number of important cases. The method is applied to a variety of problems, covering estimators of a total as well as other estimators defined either explicitly or implicitly as solutions of estimating equations. In particular, estimators of logistic regression parameters with calibration weights are studied. It leads to a new variance estimator for a general class of calibration estimators that includes generalized raking ratio and generalized regression estimators. The proposed method is extended to two-phase sampling to obtain a variance estimator that makes fuller use of the first phase sample data compared to traditional linearization variance estimators.

    Release date: 2004-07-14
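
As a worked illustration of the classical Taylor linearization the paper takes as its starting point (not the paper's new approach), here is the textbook linearized variance of a ratio estimator under simple random sampling, using z_i = (y_i - R_hat * x_i) / X_hat.

```python
import numpy as np

rng = np.random.default_rng(5)
N, n = 10000, 400
y_pop = rng.gamma(2.0, 10.0, N)
x_pop = y_pop + rng.normal(0, 5, N)
idx = rng.choice(N, n, replace=False)     # simple random sample
y, x = y_pop[idx], x_pop[idx]

Y_hat, X_hat = N * y.mean(), N * x.mean()
R_hat = Y_hat / X_hat                     # ratio estimator
z = (y - R_hat * x) / X_hat               # linearized variable
var_R = N**2 * (1 - n / N) * z.var(ddof=1) / n
print(R_hat, np.sqrt(var_R))              # estimate and linearization SE
```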

  • Articles and reports: 12-001-X20040019186
    Description:

In This Issue is a column in which the Editor briefly presents each paper appearing in the current issue of Survey Methodology. It also sometimes contains information on structural or management changes at the journal.

    Release date: 2004-07-14

  • Articles and reports: 12-001-X20040016992
    Description:

    In the U.S. Census of Population and Housing, a sample of about one-in-six of the households receives a longer version of the census questionnaire called the long form. All others receive a version called the short form. Raking, using selected control totals from the short form, has been used to create two sets of weights for long form estimation; one for individuals and one for households. We describe a weight construction method based on quadratic programming that produces household weights such that the weighted sum for individual characteristics and for household characteristics agree closely with selected short form totals. The method is broadly applicable to situations where weights are to be constructed to meet both size bounds and sum-to-control restrictions. Application to the situation where the controls are estimates with an estimated covariance matrix is described.

    Release date: 2004-07-14
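
A minimal sketch of the quadratic-programming idea, assuming the generic form: minimize ||w - d||^2 subject to size bounds and agreement with control totals. The controls and bounds are invented, and this is not the Census Bureau's production method.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
H = 60
d = np.full(H, 6.0)                                   # design weights (~1-in-6)
A = np.vstack([np.ones(H),                            # household count
               rng.integers(1, 6, H).astype(float)])  # persons per household
t = A @ (d * rng.uniform(0.97, 1.03, H))              # short-form control totals

res = minimize(lambda w: np.sum((w - d) ** 2), d,     # stay close to design weights
               method="SLSQP",
               bounds=[(1.0, 20.0)] * H,              # size bounds on weights
               constraints=[{"type": "eq", "fun": lambda w: A @ w - t}])
w = res.x
print(A @ w - t)             # ~0: weighted sums agree with the controls
print(w.min(), w.max())      # weights stay within the bounds
```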

  • Articles and reports: 12-001-X20040016996
    Description:

    This article studies the use of the sample distribution for the prediction of finite population totals under single-stage sampling. The proposed predictors employ the sample values of the target study variable, the sampling weights of the sample units and possibly known population values of auxiliary variables. The prediction problem is solved by estimating the expectation of the study values for units outside the sample as a function of the corresponding expectation under the sample distribution and the sampling weights. The prediction mean square error is estimated by a combination of an inverse sampling procedure and a re-sampling method. An interesting outcome of the present analysis is that several familiar estimators in common use are shown to be special cases of the proposed approach, thus providing them a new interpretation. The performance of the new and some old predictors in common use is evaluated and compared by a Monte Carlo simulation study using a real data set.

    Release date: 2004-07-14

  • Articles and reports: 12-001-X20040016995
    Description:

    One of the main objectives of a sample survey is the computation of estimates of means and totals for specific domains of interest. Domains are determined either before the survey is carried out (primary domains) or after it has been carried out (secondary domains). The reliability of the associated estimates depends on the variability of the sample size as well as on the y-variables of interest. This variability cannot be controlled in the absence of auxiliary information for subgroups of the population. However, if auxiliary information is available, the estimated reliability of the resulting estimates can be controlled to some extent. In this paper, we study the potential improvements in terms of the reliability of domain estimates that use auxiliary information. The properties (bias, coverage, efficiency) of various estimators that use auxiliary information are compared using a conditional approach.

    Release date: 2004-07-14

  • Articles and reports: 12-001-X20040016993
    Description:

    The weighting cell estimator corrects for unit nonresponse by dividing the sample into homogeneous groups (cells) and applying a ratio correction to the respondents within each cell. Previous studies of the statistical properties of weighting cell estimators have assumed that these cells correspond to known population cells with homogeneous characteristics. In this article, we study the properties of the weighting cell estimator under a response probability model that does not require correct specification of homogeneous population cells. Instead, we assume that the response probabilities are a smooth but otherwise unspecified function of a known auxiliary variable. Under this more general model, we study the robustness of the weighting cell estimator against model misspecification. We show that, even when the population cells are unknown, the estimator is consistent with respect to the sampling design and the response model. We describe the effect of the number of weighting cells on the asymptotic properties of the estimator. Simulation experiments explore the finite sample properties of the estimator. We conclude with some guidance on how to select the size and number of cells for practical implementation of weighting cell estimation when those cells cannot be specified a priori.

    Release date: 2004-07-14
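
A minimal sketch of the weighting cell estimator itself (the estimator whose robustness the paper studies): within each cell, respondent weights are inflated by the ratio of the cell's weighted sample count to its weighted respondent count. Data are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 2000
df = pd.DataFrame({
    "w": rng.uniform(1, 4, n),                 # survey weights
    "cell": rng.integers(0, 5, n),             # cells from an auxiliary variable
    "y": rng.normal(100, 15, n),
})
df["respond"] = rng.random(n) < 0.5 + 0.08 * df["cell"]  # response varies by cell

adj = (df.groupby("cell")["w"].sum()
       / df[df["respond"]].groupby("cell")["w"].sum())   # cell ratio corrections
df["w_adj"] = np.where(df["respond"], df["w"] * df["cell"].map(adj), 0.0)

est = (df["w_adj"] * df["y"]).sum() / df["w_adj"].sum()  # weighting cell estimate
print(est)
```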

  • Articles and reports: 12-001-X20040016998
    Description:

    The Canadian Labour Force Survey (LFS) was not designed to be a longitudinal survey. However, given that respondent households typically remain in the sample for six consecutive months, it is possible to reconstruct six-month fragments of longitudinal data from the monthly records of household members. Such longitudinal micro-data - altogether consisting of millions of person-months of individual and family level data - is useful for analyses of monthly labour market dynamics over relatively long periods of time, 25 years and more.

We make use of these data to estimate hazard functions describing transitions among the labour market states: self-employed, paid employee and not employed. Data on job tenure, for employed respondents, and on the date last worked, for those not employed - together with the date of survey responses - allow the construction of models that include terms reflecting seasonality and macro-economic cycles as well as the duration dependence of each type of transition. In addition, the LFS data permit spouse labour market activity and family composition variables to be included in the hazard models as time-varying covariates. The estimated hazard equations have been incorporated in the LifePaths microsimulation model. In that setting, the equations have been used to simulate lifetime employment activity from past, present and future birth cohorts. Simulation results have been validated by comparison with the age profiles of LFS employment/population ratios for the period 1976 to 2001.

    Release date: 2004-07-14

  • Articles and reports: 12-001-X20040016994
    Description:

    When imputation is used to assign values for missing items in sample surveys, naïve methods of estimating the variances of survey estimates that treat the imputed values as if they were observed give biased variance estimates. This article addresses the problem of variance estimation for a linear estimator in which missing values are assigned by a single hot deck imputation (a form of imputation that is widely used in practice). We propose estimators of the variance of a linear hot deck imputed estimator using a decomposition of the total variance suggested by Särndal (1992). A conditional approach to variance estimation is developed that is applicable to both weighted and unweighted hot deck imputation. Estimation of the variance of a domain estimator is also examined.

    Release date: 2004-07-14
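
A minimal sketch of single random hot deck imputation within cells, the form of imputation studied above; the paper's variance decomposition itself is not reproduced. Data are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 1000
df = pd.DataFrame({
    "cell": rng.integers(0, 4, n),             # imputation cells
    "y": rng.normal(50, 10, n),
})
df.loc[rng.random(n) < 0.3, "y"] = np.nan      # 30% item nonresponse

# Replace each missing y with the value of a random donor from the same cell.
for c, group in df.groupby("cell"):
    donors = group["y"].dropna().to_numpy()
    miss_idx = group.index[group["y"].isna()]
    df.loc[miss_idx, "y"] = rng.choice(donors, size=len(miss_idx))

# Treating these imputed values as observed is exactly what biases the
# naive variance estimators discussed above.
print(int(df["y"].isna().sum()))               # 0: all values filled
```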

  • Articles and reports: 12-001-X20040016997
    Description:

Multilevel models are often fitted to survey data gathered with a complex multistage sampling design. However, if such a design is informative, in the sense that the inclusion probabilities depend on the response variable even after conditioning on the covariates, then standard maximum likelihood estimators are biased. In this paper, following the Pseudo Maximum Likelihood (PML) approach of Skinner (1989), we propose a probability-weighted estimation procedure for multilevel ordinal and binary models which eliminates the bias generated by the informativeness of the design. The reciprocals of the inclusion probabilities at each sampling stage are used to weight the log likelihood function, and the weighted estimators obtained in this way are tested by means of a simulation study for the simple case of a binary random intercept model with and without covariates. The variance estimators are obtained by a bootstrap procedure. The maximization of the weighted log likelihood of the model is done with the NLMIXED procedure of SAS, which is based on adaptive Gaussian quadrature. The bootstrap estimation of variances is also implemented in the SAS environment.

    Release date: 2004-07-14

  • Articles and reports: 12-001-X20040016999
    Description:

Combining response data from the Belgian Fertility and Family Survey with individual-level and municipality-level data from the 1991 Census for both nonrespondents and respondents, multilevel logistic regression models for contact and cooperation propensity are estimated. The covariates introduced are a selection of indirect features, all outside the researchers' direct control. Contrary to previous research, socio-economic status is found to be positively related to cooperation. Another unexpected result is the absence of any considerable impact of ecological correlates such as urbanity.

    Release date: 2004-07-14

  • Articles and reports: 89-552-M2004011
    Description:

    This paper develops a measure of investment in education from the literacy level of labour market entrants, using the 1994 International Adult Literacy Survey.

    Release date: 2004-06-22

  • Articles and reports: 91F0015M2004006
    Description:

    The paper assesses and compares new and old methodologies for official estimates of migration within and among provinces and territories for the period 1996/97 to 2000/01.

    Release date: 2004-06-17

  • Articles and reports: 82-003-X20030036847
    Description:

    This paper examines whether accepting proxy- instead of self-responses results in lower estimates of some health conditions. It analyses data from the National Population Health Survey and the Canadian Community Health Survey.

    Release date: 2004-05-18

  • Articles and reports: 12-001-X20030026784
    Description:

    Skinner and Elliot (2002) proposed a simple measure of disclosure risk for survey microdata and showed how to estimate this measure under sampling with equal probabilities. In this paper we show how their results on point estimation and variance estimation may be extended to handle unequal probability sampling. Our approach assumes a Poisson sampling design. Comments are made about the possible impact of departures from this assumption.

    Release date: 2004-01-27

  • Articles and reports: 12-001-X20030026782
    Description:

    This paper discusses both the general question of designing a post-enumeration survey, and how these general questions were addressed in the U.S. Census Bureau's coverage measurement planned as part of Census 2000. It relates the basic concepts of the Dual System Estimator to questions of the definition and measurement of correct enumerations, the measurement of census omissions, operational independence, reporting of residence, and the role of after-matching reinterview. It discusses estimation issues such as the treatment of movers, missing data, and synthetic estimation of local corrected population size. It also discusses where the design failed in Census 2000.

    Release date: 2004-01-27
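
For readers new to the Dual System Estimator discussed above, the basic capture-recapture arithmetic is simple. The counts below are illustrative, not Census 2000 figures.

```python
# n1: persons counted in the census; n2: persons in the independent
# post-enumeration survey; m: persons matched to both. Under operational
# independence, the dual system estimate of the population is n1 * n2 / m.
n1, n2, m = 95_000, 9_800, 9_100
N_hat = n1 * n2 / m
undercount = 1 - n1 / N_hat
print(f"N_hat = {N_hat:,.0f}, net undercount = {undercount:.1%}")
```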

  • Articles and reports: 12-001-X20030026779
    Description:

    In link-tracing designs, social links are followed from one respondent to another to obtain the sample. For hidden and hard-to-access human populations, such sampling designs are often the only practical way to obtain a sample large enough for an effective study. In this paper, we propose a Bayesian approach for the estimation problem. For studies using link-tracing designs, prior information may be available on the characteristics that one wants to estimate. Using this information effectively via a Bayesian approach should yield better estimators. When the available information is vague, one can use noninformative priors and conduct a sensitivity analysis. In our example we found that the estimators were not sensitive to the specified priors. It is important to note that, under the Bayesian setup, obtaining interval estimates to assess the accuracy of the estimators can be done without much added difficulty. By contrast, such tasks are difficult to perform using the classical approach. In general, a Bayesian analysis yields one distribution (the posterior distribution) for the unknown parameters, and from this a vast number of questions can be answered simultaneously.

    Release date: 2004-01-27

  • Articles and reports: 12-001-X20030029054
    Description:

In This Issue is a column in which the Editor briefly presents each paper appearing in the current issue of Survey Methodology. It also sometimes contains information on structural or management changes at the journal.

    Release date: 2004-01-27

  • Articles and reports: 12-001-X20030026785
    Description:

    To avoid disclosures, one approach is to release partially synthetic, public use microdata sets. These comprise the units originally surveyed, but some collected values, for example sensitive values at high risk of disclosure or values of key identifiers, are replaced with multiple imputations. Although partially synthetic approaches are currently used to protect public use data, valid methods of inference have not been developed for them. This article presents such methods. They are based on the concepts of multiple imputation for missing data but use different rules for combining point and variance estimates. The combining rules also differ from those for fully synthetic data sets developed by Raghunathan, Reiter and Rubin (2003). The validity of these new rules is illustrated in simulation studies.

    Release date: 2004-01-27
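
A sketch of combining point and variance estimates across m partially synthetic data sets, assuming the combining rule T_p = u_bar + b_m / m associated with this line of work (note that it differs from the fully synthetic rule). The estimates are illustrative numbers only.

```python
import numpy as np

# q: point estimates from each of the m partially synthetic data sets;
# u: the corresponding within-data-set variance estimates.
q = np.array([10.2, 9.8, 10.5, 10.0, 10.1])
u = np.array([0.40, 0.38, 0.42, 0.41, 0.39])

m = len(q)
q_bar = q.mean()           # combined point estimate
b_m = q.var(ddof=1)        # between-imputation variance
u_bar = u.mean()           # average within-imputation variance
T_p = u_bar + b_m / m      # assumed total-variance rule for partial synthesis
print(q_bar, T_p)
```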

  • Articles and reports: 12-001-X20030026778
    Description:

Using both purely design-based and model-assisted arguments, it is shown that, under conditions of high entropy, the variance of the Horvitz-Thompson (HT) estimator depends almost entirely on first-order inclusion probabilities. Approximate expressions and estimators are derived for this "high entropy" variance of the HT estimator. Monte Carlo simulation studies are conducted to examine the statistical properties of the proposed variance estimators.

    Release date: 2004-01-27
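
A minimal sketch of the theme above, using Poisson sampling as the canonical high-entropy design: both the Horvitz-Thompson estimator and the variance estimator below depend only on first-order inclusion probabilities. The paper's approximations for other high-entropy designs are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(8)
N, n = 5000, 250
y = rng.gamma(2.0, 5.0, N)
size = np.clip(y + rng.normal(0, 2, N), 0.1, None)  # size measure
pi = n * size / size.sum()                # first-order inclusion probabilities

s = rng.random(N) < pi                    # Poisson sampling (high entropy)
t_ht = np.sum(y[s] / pi[s])               # Horvitz-Thompson total estimate

# Under Poisson sampling this variance estimator is unbiased and uses
# first-order probabilities only.
v_ht = np.sum((1 - pi[s]) * (y[s] / pi[s]) ** 2)
print(y.sum(), t_ht, np.sqrt(v_ht))
```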

  • Articles and reports: 12-001-X20030026781
    Description:

Census counts are known to be inexact, based on comparisons of Census and Post-Enumeration Survey (PES) figures. In Italy, the role of municipal administrations is crucial for both Census and PES field operations. In this paper we analyse the impact of municipality on Italian Census undercount rates by modelling data from the PES as well as from other sources, using Poisson regression trees and hierarchical Poisson models. The Poisson regression trees cluster municipalities into homogeneous groups. The hierarchical Poisson models can be considered tools for small area estimation.

    Release date: 2004-01-27

  • Articles and reports: 12-001-X20030026787
    Description:

Application of classical statistical methods to data from complex sample surveys without making allowance for the survey design features can lead to erroneous inferences. Methods have been developed that account for the survey design, but these methods require additional information such as survey weights, design effects or cluster identification for microdata. Inverse sampling (Hinkins, Oh and Scheuren 1997) provides an alternative approach by undoing the complex survey data structures so that standard methods can be applied. Repeated subsamples with unconditional simple random sampling structure are drawn, each subsample is analysed by standard methods, and the results are combined to increase efficiency. This method has the potential to preserve the confidentiality of microdata, although it is computer-intensive. We present some theory of inverse sampling and explore its limitations. A combined estimating equations approach is proposed for handling complex parameters such as ratios and "census" linear regression and logistic regression parameters. The method is applied to a cluster-correlated data set reported in Battese, Harter and Fuller (1988).

    Release date: 2004-01-27

  • Articles and reports: 12-001-X20030026780
    Description:

Coverage errors and other coverage issues related to population censuses are examined in the light of the recent literature. In particular, when actual population census counts of persons are matched with their corresponding post-enumeration survey counts, the aggregated results in a dual record system setting can provide coverage error statistics.

In this paper, coverage error issues are evaluated and alternative solutions are discussed in the light of results from the latest Population Census of Turkey. Using the census and post-enumeration survey data, a regional comparison of census coverage was also made and showed considerable variability among regions. Some methodological remarks are also made on possible improvements to the current enumeration procedures.

    Release date: 2004-01-27

Reference (62)

Reference (62) (25 of 62 results)

  • Index and guides: 92-395-X
    Description:

    This report describes sampling and weighting procedures used in the 2001 Census. It reviews the history of these procedures in Canadian censuses, provides operational and theoretical justifications for them, and presents the results of the evaluation studies of these procedures.

    Release date: 2004-12-15

  • Index and guides: 92-394-X
    Description:

    This report deals with coverage errors that occur when persons, households, dwellings or families are missed or enumerated in error by the census. After the 2001 Census was taken, a number of studies were carried out to estimate gross undercoverage, gross overcoverage and net undercoverage. This report presents the results of the Dwelling Classification Study, the Reverse Record Check Study, the Automated Match Study and the Collective Dwelling Study. The report first describes census universes, coverage error and census collection and processing procedures that may result in coverage error. Then it gives estimates of net undercoverage for a number of demographic characteristics. After, the technical report presents the methodology and results of each coverage study and the estimates of coverage error after describing how the results of the various studies are combined. A historical perspective completes the product.

    Release date: 2004-11-25

  • Surveys and statistical programs – Documentation: 31-533-X
    Description:

    Starting with the August 2004 reference month, the Monthly Survey of Manufacturing (MSM) is using administrative data (Goods and Services Tax files) to derive shipments for a portion of the small establishments in the sample. This document is being published to complement the release of MSM data for that month.

    Release date: 2004-10-15

  • Technical products: 12-002-X20040027035
    Description:

    As part of the processing of the National Longitudinal Survey of Children and Youth (NLSCY) cycle 4 data, historical revisions have been made to the data of the first 3 cycles, either to correct errors or to update the data. During processing, particular attention was given to the PERSRUK (Person Identifier) and the FIELDRUK (Household Identifier). The same level of attention has not been given to the other identifiers that are included in the data base, the CHILDID (Person identifier) and the _IDHD01 (Household identifier). These identifiers have been created for the public files and can also be found in the master files by default. The PERSRUK should be used to link records between files and the FIELDRUK to determine the household when using the master files.

    Release date: 2004-10-05

  • Technical products: 12-002-X20040027034
    Description:

    The use of command files in Stat/Transfer can expedite the transfer of several data sets in an efficient replicable manner. This note outlines a simple step-by-step method for creating command files and provides sample code.

    Release date: 2004-10-05

  • Technical products: 12-002-X20040027032
    Description:

    This article examines why many Statistics Canada surveys supply bootstrap weights with their microdata for the purpose of design-based variance estimation. Bootstrap weights are not supported by commercially available software such as SUDAAN and WesVar, but there are ways to use these applications to produce boostrap variance estimates.

    The paper concludes with a brief discussion of other design-based approaches to variance estimation as well as software, programs and procedures where these methods have been employed.

    Release date: 2004-10-05

  • Technical products: 21-601-M2004072
    Description:

    The Farm Product Price Index (FPPI) is a monthly series that measures the changes in prices that farmers receive for the agriculture commodities they produce and sell.

    The FPPI was discontinued in March 1995; it was revived in April 2001 owing to continued demand for an index of prices received by farmers.

    Release date: 2004-09-28

  • Surveys and statistical programs – Documentation: 62F0026M2004001
    Description:

    This report describes the quality indicators produced for the 2002 Survey of Household Spending. These quality indicators, such as coefficients of variation, nonresponse rates, slippage rates and imputation rates, help users interpret the survey data.

    Release date: 2004-09-15

  • Technical products: 11-522-X2002001
    Description:

    Since 1984, an annual international symposium on methodological issues has been sponsored by Statistics Canada. Proceedings have been available since 1987.

    Symposium 2002 was the nineteenth in Statistics Canada's series of international symposia on methodological issues. Each year the symposium focuses on a particular them. In 2002 the theme was: "Modelling Survey Data for Social and Economic Research".

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016744
    Description:

    A developmental trajectory describes the course of a behaviour over age or time. This technical paper provides an overview of a semi-parametric, group-based method for analysing developmental trajectories. This methodology provides an alternative to assuming a homogenous population of trajectories as is done in standard growth modelling.

    Four capabilities are described: (1) the capability to identify, rather than assume, distinctive groups of trajectories; (2) the capability to estimate the proportion of the population following each such trajectory group; (3) the capability to relate group membership probability to individual characteristics and circumstances; and (4) the capability to use the group membership probabilities for various other purposes, such as creating profiles of group members.

    In addition, two important extensions of the method are described: the capability to add time-varying covariates to trajectory models and the capability to estimate joint trajectory models of distinct but related behaviours. The former provides the statistical capacity for testing if a contemporary factor, such as an experimental intervention or a non-experimental event like pregnancy, deflects a pre-existing trajectory. The latter provides the capability to study the unfolding of distinct but related behaviours such as problematic childhood behaviour and adolescent drug abuse.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016739
    Description:

    The Labour Force Survey (LFS) was not designed to be a longitudinal survey. However, given that respondent households typically remain in the sample for six consecutive months, it is possible to reconstruct six-month fragments of longitudinal data from the monthly records of household members. Such longitudinal data (altogether consisting of millions of person-months of individual- and family-level data) is useful for analyses of monthly labour market dynamics over relatively long periods of time, 20 years and more.

    We make use of these data to estimate hazard functions describing transitions among the labour market states: self-employed, paid employee and not employed. Data on job tenure for the employed, and data on the date last worked for the not employed - together with the date of survey responses - permit the estimated models to include terms reflecting seasonality and macro-economic cycles, as well as the duration dependence of each type of transition. In addition, the LFS data permit spouse labour market activity and family composition variables to be included in the hazard models as time-varying covariates. The estimated hazard equations have been included in the LifePaths socio-economic microsimulation model. In this setting, the equations may be used to simulate lifetime employment activity from past, present and future birth cohorts. Cross-sectional simulation results have been used to validate these models by comparisons with census data from the period 1971 to 1996.
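
    As a rough sketch of how such a transition hazard can be estimated, the following Python example fits a discrete-time hazard as a logistic regression on person-month records, with duration and seasonality terms. The variables and simulated data are illustrative assumptions, not the LFS specification.

        # A minimal sketch of a discrete-time hazard model for one labour market
        # transition (e.g., paid employee -> not employed), fit as a logistic
        # regression on person-month records.
        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(1)
        n_spells, max_months = 1000, 6          # six-month LFS fragments

        rows = []
        for i in range(n_spells):
            tenure = rng.integers(1, 120)       # months in current job at entry
            for m in range(max_months):
                month = rng.integers(1, 13)     # calendar month, for seasonality
                # True hazard: duration dependence plus a January seasonal bump
                lin = -3.0 - 0.01 * (tenure + m) + 0.5 * (month == 1)
                event = rng.random() < 1 / (1 + np.exp(-lin))
                rows.append([1.0, np.log(tenure + m), float(month == 1), float(event)])
                if event:
                    break                       # the spell ends at the transition

        data = np.array(rows)
        X, y = data[:, :3], data[:, 3]
        fit = sm.Logit(y, X).fit(disp=0)
        print(fit.params)   # intercept, log-duration, January indicator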

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016735
    Description:

    In the 2001 Canadian Census of Population, calibration or regression estimation was used to calculate a single set of household level weights to be used for all census estimates based on a one in five national sample of more than two million households. Because many auxiliary variables were available, only a subset of them could be used. Otherwise, some of the weights would have been smaller than one or even negative. In this technical paper, a forward selection procedure was used to discard auxiliary variables that caused weights to be smaller than one or that caused a large condition number for the calibration weight matrix being inverted. Also, two calibration adjustments were done to achieve close agreement between auxiliary population counts and estimates for small areas. Prior to 2001, the projection generalized regression (GREG) estimator was used and the weights were required to be greater than zero. For the 2001 Census, a switch was made to a pseudo-optimal regression estimator that kept more auxiliary variables and, at the same time, required that the weights be one or more.
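
    The following minimal Python sketch shows linear (GREG-type) calibration: design weights are adjusted so that weighted auxiliary totals match known population counts, after which the minimum weight can be checked against the lower bound of one. The data, auxiliary variables and totals are illustrative assumptions.

        # A minimal sketch of linear calibration weighting.
        import numpy as np

        rng = np.random.default_rng(2)
        n = 500
        d = np.full(n, 5.0)                          # design weights (1-in-5 sample)
        X = np.column_stack([np.ones(n),             # auxiliaries: overall count,
                             rng.integers(0, 2, n),  # an age-group indicator,
                             rng.poisson(2.5, n)])   # household size
        totals = np.array([2500.0, 1300.0, 6400.0])  # known population counts

        # Linear calibration: w = d * (1 + X @ lam), where lam solves
        # (sum d x x') lam = totals - sum d x
        A = (X * d[:, None]).T @ X
        lam = np.linalg.solve(A, totals - X.T @ d)
        w = d * (1 + X @ lam)

        print("totals matched:", np.allclose(X.T @ w, totals))
        print("min weight:", w.min())  # if < 1, drop auxiliaries or change estimator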

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016725
    Description:

    In 1997, the US Office of Management and Budget issued revised standards for the collection of race information within the federal statistical system. One revision allows individuals to choose more than one race group when responding to federal surveys and other federal data collections. This change presents challenges for analyses that involve data collected under both the old and new race-reporting systems, since the data on race are not comparable. The following paper discusses the problems encountered by these changes and methods developed to overcome them.

    Since most people under both systems report only a single race, a common proposed solution is to try to bridge the transition by assigning a single-race category to each multiple-race reporter under the new system, and to conduct analyses using just the observed and assigned single-race categories. Thus, the problem can be viewed as a missing-data problem, in which single-race responses are missing for multiple-race reporters and need to be imputed.

    The US Office of Management and Budget suggested several simple bridging methods to handle this missing-data problem. Schenker and Parker (Statistics in Medicine, forthcoming) analysed data from the National Health Interview Survey of the US National Center for Health Statistics, which allows multiple-race reporting but also asks multiple-race reporters to specify a primary race, and found that improved bridging methods could result from incorporating individual-level and contextual covariates into the bridging models.

    While Schenker and Parker discussed only three large multiple-race groups, the current application requires predicting single-race categories for several small multiple-race groups as well. Thus, problems of sparse data arise in fitting the bridging models. We address these problems by building combined models for several multiple-race groups, thus borrowing strength across them. These and other methodological issues are discussed.
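
    As a rough illustration of a covariate-based bridging model, the following Python sketch trains a multinomial logistic regression on multiple-race reporters who specified a primary race, then assigns single-race categories to other reporters by drawing from the predicted probabilities. The covariates and data are illustrative assumptions, not the models fitted in the paper.

        # A minimal sketch of a bridging model via multinomial logistic regression.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(3)
        n = 2000
        age = rng.uniform(0, 90, n)
        urban = rng.integers(0, 2, n)            # a contextual covariate
        X = np.column_stack([age, urban])
        primary = rng.integers(0, 3, n)          # observed primary race (3 groups)

        model = LogisticRegression(max_iter=1000).fit(X, primary)

        # Assign single-race categories for new multiple-race reporters by drawing
        # from predicted probabilities (preserves uncertainty, unlike an argmax)
        X_new = np.column_stack([rng.uniform(0, 90, 5), rng.integers(0, 2, 5)])
        probs = model.predict_proba(X_new)
        assigned = np.array([rng.choice(3, p=p) for p in probs])
        print(assigned)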

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016747
    Description:

    This project seeks to shed light not only on the degree to which individuals are stuck in the low-income range, but also on those who have sufficient opportunity to move into the upper part of the income distribution. It also seeks to compare patterns of mobility through the income distribution in North America and Europe, shedding light on the impact of different models of integration. Cross-National Equivalent File data from the British Household Panel Survey (BHPS) for the United Kingdom, the German Socio-Economic Panel (GSOEP) for Germany, the Panel Study of Income Dynamics (PSID) for the United States and the Survey of Labour Income Dynamics (SLID) for Canada offer a comparative analysis of the dynamics of household income during the 1990s, paying particular attention to both low- and high-income dynamics. Canadian administrative data drawn from income tax files are also used. These panel datasets range in length from six years (for the SLID) to almost 20 years (for the PSID and the Canadian administrative data). The analysis focuses on developments during the 1990s, but also explores the sensitivity of the results to changes in the length of the period analysed.

    The analysis begins by offering a broad descriptive overview of the major characteristics and events (demographic versus labour market) that determine levels and changes in adjusted household incomes. Attention is paid to movements into and out of low- and high-income ranges. A number of definitions are used, incorporating absolute and relative notions of poverty. The sensitivity of the results to the use of various equivalence scales is examined. An overview offers a broad picture of the state of household income in each country and the relative roles of family structure, the labour market and welfare state in determining income mobility. The paper employs discrete-time hazard methods to model the dynamics of entry to and exit from both low and high income.

    Both observed and unobserved heterogeneity are controlled for with the intention of highlighting differences in the determinants of the transition rates between the countries. This is done in a way that assesses the importance of the relative roles of family, market and state. Attention is also paid to important institutional changes, most notably the increasing integration of product and labour markets in North America and Europe.
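
    As a small worked example of the equivalence-scale adjustment and the relative low-income definition used in analyses of this kind, the following Python sketch adjusts household income by the square-root scale and flags low income at 50% of the median; both conventions are common assumptions chosen here for illustration, and the data are simulated.

        # A minimal sketch: equivalence-scale adjustment and low-income dynamics.
        import numpy as np

        rng = np.random.default_rng(4)
        income = rng.lognormal(10.5, 0.6, 1000)   # household incomes
        size = rng.integers(1, 6, 1000)           # household sizes

        adjusted = income / np.sqrt(size)         # square-root equivalence scale
        low = adjusted < 0.5 * np.median(adjusted)
        print(f"low-income rate: {low.mean():.1%}")

        # Entry dynamics: compare flags across two years of panel data
        adjusted_t1 = adjusted * rng.lognormal(0.0, 0.2, 1000)  # next year
        low_t1 = adjusted_t1 < 0.5 * np.median(adjusted_t1)
        entry = (~low & low_t1).sum() / (~low).sum()
        print(f"low-income entry rate: {entry:.1%}")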

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016721
    Description:

    This paper examines the simulation study that was conducted to assess the sampling scheme designed for the World Health Organization (WHO) Injection Safety Assessment Survey. The objective of this assessment survey is to determine whether facilities in which injections are given meet the necessary safety requirements for injection administration, equipment, supplies and waste disposal. The main parameter of interest is the proportion of health care facilities in a country that have safe injection practices.

    The objective of this simulation study was to assess the accuracy and precision of the proposed sampling design. To this end, two artificial populations were created based on the two African countries of Niger and Burkina Faso, in which the pilot survey was tested. To create a wide variety of hypothetical populations, the assignment of whether a health care facility was safe or not was based on the different combinations of the population proportion of safe health care facilities in the country, the homogeneity of the districts in the country with respect to injection safety, and whether the health care facility was located in an urban or rural district.

    Using the results of the simulation, a multi-factor analysis of variance was used to determine which factors affect the outcome measures of absolute bias, standard error and mean-squared error.
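
    The following minimal Python sketch reproduces the general shape of such a design simulation: repeatedly draw two-stage samples (districts, then facilities) from an artificial population and summarize the estimator of the safe-facility proportion by absolute bias, standard error and mean-squared error. The population parameters and sample sizes are illustrative assumptions, not those of the WHO survey.

        # A minimal sketch of a sampling-design simulation study.
        import numpy as np

        rng = np.random.default_rng(5)
        n_districts, fac_per_district = 50, 40
        p_safe, spread = 0.6, 0.15          # overall proportion, district variation

        # Artificial population: district safety rates vary around p_safe
        district_p = np.clip(rng.normal(p_safe, spread, n_districts), 0, 1)
        pop = rng.random((n_districts, fac_per_district)) < district_p[:, None]

        estimates = []
        for _ in range(2000):
            dists = rng.choice(n_districts, 10, replace=False)       # stage 1
            sampled = [rng.choice(pop[d], 8, replace=False) for d in dists]  # stage 2
            estimates.append(np.mean(sampled))
        est = np.array(estimates)

        true_p = pop.mean()
        print(f"abs bias:  {abs(est.mean() - true_p):.4f}")
        print(f"std error: {est.std():.4f}")
        print(f"MSE:       {((est - true_p)**2).mean():.4f}")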

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016708
    Description:

    In this paper, we discuss the analysis of complex health survey data using multivariate modelling techniques. The main interest is in design-based and model-based methods that account for design complexities, including clustering, stratification and weighting. Methods covered include generalized linear modelling based on pseudo-likelihood and generalized estimating equations, linear mixed models estimated by restricted maximum likelihood, and hierarchical Bayes techniques using Markov chain Monte Carlo (MCMC) methods. The methods are compared empirically, using data from an extensive health interview and examination survey conducted in Finland in 2000 (the Health 2000 Study).

    The data of the Health 2000 Study were collected using personal interviews, questionnaires and clinical examinations. A stratified two-stage cluster sampling design was used in the survey. The sampling design involved positive intra-cluster correlation for many study variables. For a closer investigation, we selected a small number of study variables from the health interview and health examination phases. In many cases, the different methods produced similar numerical results and supported similar statistical conclusions. Methods that failed to account for the design complexities sometimes led to conflicting conclusions. We also discuss how the methods in this paper can be applied using standard statistical software products.
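
    As a rough illustration of one of the design-based methods compared, the following Python sketch computes a pseudo-maximum-likelihood fit of a logistic model, in which each unit's score contribution is multiplied by its survey weight; the data and weights are illustrative assumptions, not the Health 2000 design.

        # A minimal sketch of pseudo-likelihood (weighted) logistic regression,
        # solved by Newton-Raphson.
        import numpy as np

        rng = np.random.default_rng(6)
        n = 1000
        X = np.column_stack([np.ones(n), rng.normal(size=n)])
        y = (rng.random(n) < 1 / (1 + np.exp(-(0.2 + 0.8 * X[:, 1])))).astype(float)
        w = rng.uniform(0.5, 3.0, n)         # survey weights

        beta = np.zeros(2)
        for _ in range(25):
            p = 1 / (1 + np.exp(-X @ beta))
            score = X.T @ (w * (y - p))                     # weighted score
            info = (X * (w * p * (1 - p))[:, None]).T @ X   # weighted information
            beta += np.linalg.solve(info, score)

        print("pseudo-MLE:", beta.round(3))
        # Design-based standard errors would further require a linearization or
        # replication variance estimator reflecting clustering and stratification.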

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016430
    Description:

    Linearization (or Taylor series) methods are widely used to estimate standard errors for the coefficients of linear regression models fit to multi-stage samples. When the number of primary sampling units (PSUs) is large, linearization can produce accurate standard errors under quite general conditions. However, when the number of PSUs is small or a coefficient depends primarily on data from a small number of PSUs, linearization estimators can have large negative bias.

    In this paper, we characterize features of the design matrix that produce large bias in linearization standard errors for linear regression coefficients. We then propose a new method, bias reduced linearization (BRL), based on residuals adjusted to better approximate the covariance of the true errors. When the errors are independent and identically distributed (i.i.d.), the BRL estimator is unbiased for the variance. Furthermore, a simulation study shows that BRL can greatly reduce the bias, even if the errors are not i.i.d. We also propose using a Satterthwaite approximation to determine the degrees of freedom of the reference distribution for tests and confidence intervals about linear combinations of coefficients based on the BRL estimator. We demonstrate that the jackknife estimator also tends to be biased in situations where linearization is biased. However, the jackknife's bias tends to be positive. Our bias-reduced linearization estimator can be viewed as a compromise between the traditional linearization and jackknife estimators.
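
    The following Python sketch contrasts the standard cluster-robust linearization variance with a bias-reduced version in the spirit of BRL, rescaling each cluster's residuals by the inverse square root of (I - H_gg), where H_gg is that cluster's block of the hat matrix. This follows the general BRL idea rather than the paper's exact estimator; the data are illustrative assumptions.

        # A minimal sketch: linearization vs bias-reduced linearization SEs.
        import numpy as np

        rng = np.random.default_rng(7)
        G, m = 8, 5                          # few PSUs: where linearization biases
        X = np.column_stack([np.ones(G * m), rng.normal(size=G * m)])
        y = X @ np.array([1.0, 0.5]) + rng.normal(size=G * m)
        groups = np.repeat(np.arange(G), m)

        XtX_inv = np.linalg.inv(X.T @ X)
        beta = XtX_inv @ X.T @ y
        resid = y - X @ beta

        meat_lin, meat_brl = np.zeros((2, 2)), np.zeros((2, 2))
        for g in range(G):
            idx = groups == g
            Xg, eg = X[idx], resid[idx]
            # Standard linearization: raw cluster residuals
            s = Xg.T @ eg
            meat_lin += np.outer(s, s)
            # BRL-style adjustment: rescale residuals by (I - H_gg)^(-1/2)
            Hgg = Xg @ XtX_inv @ Xg.T
            vals, vecs = np.linalg.eigh(np.eye(m) - Hgg)
            Ag = vecs @ np.diag(vals**-0.5) @ vecs.T
            s = Xg.T @ (Ag @ eg)
            meat_brl += np.outer(s, s)

        se_lin = np.sqrt(np.diag(XtX_inv @ meat_lin @ XtX_inv))
        se_brl = np.sqrt(np.diag(XtX_inv @ meat_brl @ XtX_inv))
        print("linearization SEs:", se_lin.round(3), " BRL SEs:", se_brl.round(3))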

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016717
    Description:

    In the United States, the National Health and Nutrition Examination Survey (NHANES) is linked to the National Health Interview Survey (NHIS) at the primary sampling unit level (the same counties, but not necessarily the same persons, are in both surveys). The NHANES examines about 5,000 persons per year, while the NHIS samples about 100,000 persons per year. In this paper, we present and develop properties of models that allow NHIS and administrative data to be used as auxiliary information for estimating quantities of interest in the NHANES. The methodology, related to the small-area models of Fay and Herriot (1979) and to the calibration estimators of Deville and Särndal (1992), accounts for the survey designs in the error structure.
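
    As a rough illustration of the Fay-Herriot idea underlying such models, the following Python sketch computes a composite (EBLUP-style) estimate that shrinks each area's direct estimate toward a regression prediction from auxiliary data; the variances and data are illustrative assumptions, and the model variance is treated as known rather than estimated.

        # A minimal sketch of a Fay-Herriot-style composite estimator.
        import numpy as np

        rng = np.random.default_rng(8)
        m = 20                                  # small areas (e.g., PSUs/counties)
        x = rng.normal(size=m)                  # auxiliary data (NHIS/administrative)
        theta = 2.0 + 1.5 * x + rng.normal(0, 0.5, m)   # true area quantities
        D = rng.uniform(0.2, 1.0, m)            # known sampling variances
        direct = theta + rng.normal(0, np.sqrt(D))      # direct NHANES-type estimates

        A = 0.25                                # model variance, assumed known here
        X = np.column_stack([np.ones(m), x])
        W = 1 / (A + D)
        beta = np.linalg.solve((X * W[:, None]).T @ X, X.T @ (W * direct))

        gamma = A / (A + D)                     # shrinkage weight per area
        eblup = gamma * direct + (1 - gamma) * (X @ beta)
        print("MSE, direct vs composite:",
              np.mean((direct - theta)**2).round(3),
              np.mean((eblup - theta)**2).round(3))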

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016733
    Description:

    While censuses and surveys are often said to measure populations as they are, most reflect information about individuals as they were at the time of measurement, or even at some prior time point. Inferences from such data therefore should take into account change over time at both the population and individual levels. In this paper, we provide a unifying framework for such inference problems, illustrating it through a diverse series of examples including: (1) estimating residency status on Census Day using multiple administrative records, (2) combining administrative records for estimating the size of the US population, (3) using rolling averages from the American Community Survey, and (4) estimating the prevalence of human rights abuses.

    Specifically, at the population level, the estimands of interest, such as the size or mean characteristics of a population, might be changing. At the same time, individual subjects might be moving in and out of the frame of the study or changing their characteristics. Such changes over time can affect statistical studies of government data that combine information from multiple data sources, including censuses, surveys and administrative records, an increasingly common practice. Inferences from the resulting merged databases often depend heavily on specific choices made in combining, editing and analysing the data that reflect assumptions about how populations of interest change or remain stable over time.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016719
    Description:

    This study examines modelling methods for public health data. Public health has a renewed interest in the impact of the environment on health. Ecological or contextual studies ideally investigate these relationships using public health data augmented with environmental characteristics in multilevel or hierarchical models. In these models, individual respondents in the health data form the first level and community data form the second level. Most public health data use complex sample survey designs, which require analyses that account for clustering, nonresponse and poststratification to obtain representative estimates of the prevalence of health risk behaviours.

    This study uses the Behavioral Risk Factor Surveillance System (BRFSS), a state-specific US health risk factor surveillance system conducted by the Centers for Disease Control and Prevention, which assesses health risk factors in over 200,000 adults annually. BRFSS data are now available at the metropolitan statistical area (MSA) level and provide quality health information for studies of environmental effects. MSA-level analyses combining health and environmental data are further complicated by joint requirements of the survey sample design and the multilevel analyses.

    We compare three modelling methods in a study of physical activity and selected environmental factors using BRFSS 2000 data. Each of the methods described here is a valid way to analyse complex sample survey data augmented with environmental information, although each accounts for the survey design and multilevel data structure in a different manner and is thus appropriate for slightly different research questions.
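
    As a rough sketch of one of the model classes involved, the following Python example fits a two-level random-intercept model (respondents within MSAs) with statsmodels. The outcome, the environmental covariate and the data are illustrative assumptions, not BRFSS variables, and survey weights are ignored for brevity.

        # A minimal sketch of a two-level random-intercept model.
        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        rng = np.random.default_rng(9)
        n_msa, per_msa = 30, 50
        msa = np.repeat(np.arange(n_msa), per_msa)
        parks = np.repeat(rng.normal(size=n_msa), per_msa)   # MSA-level environment
        u = np.repeat(rng.normal(0, 0.4, n_msa), per_msa)    # MSA random intercepts
        active = 0.3 * parks + u + rng.normal(size=n_msa * per_msa)

        df = pd.DataFrame({"active": active, "parks": parks, "msa": msa})
        fit = smf.mixedlm("active ~ parks", df, groups=df["msa"]).fit()
        print(fit.summary())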

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016726
    Description:

    Although the use of school vouchers is growing in the developing world, their impact remains an open question, and long-term assessments of voucher programs are rare. This paper estimates the long-term effect of Colombia's PACES program, which provided over 125,000 poor children with vouchers that covered half the cost of private secondary school.

    The PACES program presents an unusual opportunity to assess the effect of demand-side education financing in a Latin American country where private schools educate a substantial proportion of pupils. The program is of special interest because many vouchers were assigned by lottery, so program effects can be reliably assessed.

    We use administrative records to assess the long-term impact of PACES vouchers on high school graduation status and test scores. The principal advantage of administrative records is that there is no loss-to-follow-up and the data are much cheaper than a costly and potentially dangerous survey effort. On the other hand, individual ID numbers may be inaccurate, complicating record linkage, and selection bias contaminates the sample of test-takers. We discuss solutions to these problems. The results suggest that the program increased secondary school completion rates, and that college-entrance test scores were higher for lottery winners than losers.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016750
    Description:

    Analyses of data from social and economic surveys sometimes use generalized variance function models to approximate the design variance of point estimators of population means and proportions. Analysts may use the resulting standard error estimates to compute associated confidence intervals or test statistics for the means and proportions of interest. In comparison with design-based variance estimators computed directly from survey microdata, generalized variance function models have several potential advantages, as will be discussed in this paper, including operational simplicity; increased stability of standard errors; and, for cases involving public-use datasets, reduction of disclosure limitation problems arising from the public release of stratum and cluster indicators.

    These potential advantages, however, may be offset in part by several inferential issues. First, the properties of inferential statistics based on generalized variance functions (e.g., confidence interval coverage rates and widths) depend heavily on the relative empirical magnitudes of the components of variability associated, respectively, with:

    (a) the random selection of a subset of items used in estimation of the generalized variance function model;
    (b) the selection of sample units under a complex sample design;
    (c) the lack of fit of the generalized variance function model;
    (d) the generation of a finite population under a superpopulation model.

    Second, under certain conditions, one may link each of components (a) through (d) with different empirical measures of the predictive adequacy of a generalized variance function model. Consequently, these measures of predictive adequacy can offer some insight into the extent to which a given generalized variance function model may be appropriate for inferential use in specific applications.

    Some of the proposed diagnostics are applied to data from the US Survey of Doctoral Recipients and the US Current Employment Survey. For the Survey of Doctoral Recipients, components (a), (c) and (d) are of principal concern. For the Current Employment Survey, components (b), (c) and (d) receive principal attention, and the availability of population microdata allows the development of especially detailed models for components (b) and (c).
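
    As a rough illustration, the following Python sketch fits a generalized variance function of the common form relvariance = a + b/x to a set of estimated totals and their direct variance estimates, then uses it to predict a standard error for a new estimate. The functional form and data are illustrative assumptions.

        # A minimal sketch of fitting a generalized variance function (GVF).
        import numpy as np

        rng = np.random.default_rng(10)
        x = rng.uniform(1e3, 1e6, 40)                 # estimated totals for 40 items
        relvar_true = 1e-4 + 50.0 / x
        v = relvar_true * x**2 * rng.uniform(0.7, 1.3, 40)  # direct variance estimates

        # Fit relvar = a + b/x by least squares on (1, 1/x)
        relvar = v / x**2
        Z = np.column_stack([np.ones_like(x), 1 / x])
        a, b = np.linalg.lstsq(Z, relvar, rcond=None)[0]

        x_new = 5e4                                   # a new estimate needing an SE
        se = x_new * np.sqrt(a + b / x_new)
        print(f"a={a:.2e}, b={b:.1f}, predicted SE at {x_new:.0f}: {se:.0f}")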

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016730
    Description:

    A wide class of models of interest in social and economic research can be represented by specifying a parametric structure for the covariances of observed variables. The availability of software, such as LISREL (Jöreskog and Sörbom 1988) and EQS (Bentler 1995), has enabled these models to be fitted to survey data in many applications. In this paper, we consider approaches to inference about such models using survey data derived by complex sampling schemes. We consider evidence of finite sample biases in parameter estimation and ways to reduce such biases (Altonji and Segal 1996) and associated issues of efficiency of estimation, standard error estimation and testing. We use longitudinal data from the British Household Panel Survey for illustration. As these data are subject to attrition, we also consider the issue of how to use nonresponse weights in the modelling.
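
    As a rough illustration of what "specifying a parametric structure for the covariances" means in practice, the following Python sketch fits a one-factor model, Sigma(theta) = lambda lambda' + diag(psi), by minimizing a least-squares discrepancy between the model and sample covariance matrices. A LISREL- or EQS-style analysis would use maximum likelihood or design-adjusted fit functions; the data here are illustrative assumptions.

        # A minimal sketch of fitting a covariance structure (one-factor) model.
        import numpy as np
        from scipy.optimize import minimize

        rng = np.random.default_rng(11)
        p, n = 4, 500
        lam_true = np.array([0.8, 0.7, 0.6, 0.5])
        f = rng.normal(size=n)
        Y = np.outer(f, lam_true) + rng.normal(0, 0.5, (n, p))
        S = np.cov(Y, rowvar=False)

        def discrepancy(theta):
            lam, psi = theta[:p], np.exp(theta[p:])  # exp keeps uniquenesses positive
            Sigma = np.outer(lam, lam) + np.diag(psi)
            return np.sum((S - Sigma)**2)

        res = minimize(discrepancy, x0=np.concatenate([np.full(p, 0.5), np.zeros(p)]))
        # Loadings are identified only up to a sign flip
        print("estimated loadings:", res.x[:p].round(2))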

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016748
    Description:

    Practitioners often use data collected from complex surveys (such as labour force and health surveys involving stratified cluster sampling) to fit logistic regression and other models of interest. A great deal of effort over the last two decades has been spent on developing methods to analyse survey data that take account of design features. This paper looks at an alternative method known as inverse sampling.

    Specialized programs, such as SUDAAN and WESVAR, are also available to implement some of the methods developed to take these design features into account. However, these methods require additional information, such as survey weights, design effects or cluster identifiers in the microdata; when that information is not released, another method is necessary.

    Inverse sampling (Hinkins et al., Survey Methodology, 1997) provides an alternative approach by undoing the complex data structures so that standard methods can be applied. Repeated subsamples with a simple random structure are drawn; each subsample is analysed by standard methods, and the results are combined to increase efficiency. Although computer-intensive, this method has the potential to preserve the confidentiality of microdata files. A drawback of the method is that it can lead to biased estimates of regression parameters when the subsample sizes are small (as in the case of stratified cluster sampling).

    In this paper, we propose using the estimating equation approach that combines the subsamples before estimation and thus leads to nearly unbiased estimates of regression parameters regardless of subsample sizes. This method is computationally less intensive than the original method. We apply the method to cluster-correlated data generated from a nested error linear regression model to illustrate its advantages. A real dataset from a Statistics Canada survey will also be analysed using the estimating equation method.
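
    The following Python sketch shows the basic inverse-sampling idea: from a clustered sample, draw repeated subsamples that behave like simple random samples (here, one unit per cluster), analyse each by standard methods, and combine the results. The combining step shown is the simple average of subsample estimates; the estimating-equation variant proposed in the paper combines the subsamples before estimation. The data are illustrative assumptions.

        # A minimal sketch of inverse sampling from a clustered sample.
        import numpy as np

        rng = np.random.default_rng(12)
        n_clusters, per_cluster = 40, 10
        cluster_eff = np.repeat(rng.normal(0, 1.0, n_clusters), per_cluster)
        y = 5.0 + cluster_eff + rng.normal(0, 1.0, n_clusters * per_cluster)
        y = y.reshape(n_clusters, per_cluster)       # the complex (clustered) sample

        R = 200
        subsample_means = []
        for _ in range(R):
            picks = rng.integers(0, per_cluster, n_clusters)  # one unit per cluster
            sub = y[np.arange(n_clusters), picks]             # behaves like an SRS
            subsample_means.append(sub.mean())                # standard analysis

        print("combined estimate:", np.mean(subsample_means).round(3))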

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016716
    Description:

    Missing data are a constant problem in large-scale surveys. Such incompleteness is usually dealt with either by restricting the analysis to the cases with complete records or by imputing, for each missing item, an efficiently estimated value. The deficiencies of these approaches will be discussed in this paper, especially in the context of estimating a large number of quantities. The main part of the paper will describe two examples of analyses using multiple imputation.

    In the first, the International Labour Organization (ILO) employment status is imputed in the British Labour Force Survey by a Bayesian bootstrap method. It is an adaptation of the hot-deck method, which seeks to fully exploit the auxiliary information. Important auxiliary information is given by the previous ILO status, when available, and the standard demographic variables.

    Missing data can be interpreted more generally, as in the framework of the expectation maximization (EM) algorithm. The second example is from the Scottish House Condition Survey, and its focus is on the inconsistency of the surveyors. The surveyors assess the sampled dwelling units on a large number of elements or features of the dwelling, such as internal walls, roof and plumbing, that are scored and converted to a summarizing 'comprehensive repair cost.' The level of inconsistency is estimated from the discrepancies between the pairs of assessments of doubly surveyed dwellings. The principal research questions concern the amount of information that is lost as a result of the inconsistency and whether the naive estimators that ignore the inconsistency are unbiased. The problem is solved by multiple imputation, generating plausible scores for all the dwellings in the survey.
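
    As a rough illustration of the multiple-imputation machinery both examples rely on, the following Python sketch imputes a missing item M times (by a simple hot-deck draw from observed values, standing in for the Bayesian bootstrap), estimates on each completed dataset, and pools the results with Rubin's combining rules. The data are illustrative assumptions.

        # A minimal sketch of multiple imputation with Rubin's combining rules.
        import numpy as np

        rng = np.random.default_rng(13)
        n, M = 500, 20
        y = rng.normal(10, 2, n)
        miss = rng.random(n) < 0.3              # 30% of items missing
        observed = y[~miss]

        est, var = [], []
        for _ in range(M):
            y_imp = y.copy()
            y_imp[miss] = rng.choice(observed, miss.sum())  # hot-deck style draw
            est.append(y_imp.mean())
            var.append(y_imp.var(ddof=1) / n)

        qbar = np.mean(est)                     # pooled point estimate
        within = np.mean(var)                   # average within-imputation variance
        between = np.var(est, ddof=1)           # between-imputation variance
        total = within + (1 + 1 / M) * between  # Rubin's total variance
        print(f"estimate {qbar:.3f}, SE {np.sqrt(total):.3f}")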

    Release date: 2004-09-13
