Statistics by subject – Statistical methods

All (25 of 90 results)

  • Articles and reports: 11F0019M2004219
    Description:

    This study investigates trends in family income inequality in the 1980s and 1990s, with particular attention paid to the recovery period of the 1990s.

    Release date: 2004-12-16

  • Index and guides: 92-395-X
    Description:

    This report describes sampling and weighting procedures used in the 2001 Census. It reviews the history of these procedures in Canadian censuses, provides operational and theoretical justifications for them, and presents the results of the evaluation studies of these procedures.

    Release date: 2004-12-15

  • Index and guides: 92-394-X
    Description:

    This report deals with coverage errors that occur when persons, households, dwellings or families are missed or enumerated in error by the census. After the 2001 Census was taken, a number of studies were carried out to estimate gross undercoverage, gross overcoverage and net undercoverage. This report presents the results of the Dwelling Classification Study, the Reverse Record Check Study, the Automated Match Study and the Collective Dwelling Study. The report first describes the census universes, coverage error, and the census collection and processing procedures that may result in coverage error. It then gives estimates of net undercoverage for a number of demographic characteristics. Next, after describing how the results of the various studies are combined, it presents the methodology and results of each coverage study and the resulting estimates of coverage error. A historical perspective completes the report.

    Release date: 2004-11-25

  • Articles and reports: 13-604-M2004045
    Description:

    How "good" are the National Tourism Indicators (NTI)? How can their quality be measured? This study looks to answer these questions by analysing the revisions to the NTI estimates for the period 1997 through 2001.

    Release date: 2004-10-25

  • Table: 53-500-X
    Description:

    This report presents the results of a pilot survey conducted by Statistics Canada to measure the fuel consumption of on-road motor vehicles registered in Canada. This study was carried out in connection with the Canadian Vehicle Survey (CVS) which collects information on road activity such as distance traveled, number of passengers and trip purpose.

    Release date: 2004-10-21

  • Surveys and statistical programs – Documentation: 31-533-X
    Description:

    Starting with the August 2004 reference month, the Monthly Survey of Manufacturing (MSM) is using administrative data (Goods and Services Tax files) to derive shipments for a portion of the small establishments in the sample. This document is being published to complement the release of MSM data for that month.

    Release date: 2004-10-15

  • Technical products: 12-002-X20040027032
    Description:

    This article examines why many Statistics Canada surveys supply bootstrap weights with their microdata for the purpose of design-based variance estimation. Bootstrap weights are not supported by commercially available software such as SUDAAN and WesVar, but there are ways to use these applications to produce bootstrap variance estimates.

    The paper concludes with a brief discussion of other design-based approaches to variance estimation as well as software, programs and procedures where these methods have been employed.

    Release date: 2004-10-05
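
    As a companion to this abstract, here is a minimal sketch, in Python, of how supplied bootstrap weights are typically used for design-based variance estimation: compute the estimate once with the full-sample weights, recompute it with each set of bootstrap weights, and take the mean squared deviation of the replicates. All inputs below are made-up stand-ins; in practice the bootstrap weights come with the microdata file.

    ```python
    import numpy as np

    # Made-up stand-ins: y is a survey variable, w the full-sample design
    # weights, and bw an (n x B) array playing the role of the bootstrap
    # weights that would be supplied with the microdata.
    rng = np.random.default_rng(42)
    n, B = 500, 200
    y = rng.normal(50, 10, n)
    w = rng.uniform(10, 30, n)
    bw = w[:, None] * rng.uniform(0.5, 1.5, (n, B))

    theta_hat = np.sum(w * y)                    # full-sample estimate of a total
    theta_b = (bw * y[:, None]).sum(axis=0)      # one estimate per replicate

    # Bootstrap variance: mean squared deviation of the replicate estimates
    # around the full-sample estimate.
    v_boot = np.mean((theta_b - theta_hat) ** 2)
    print(f"total = {theta_hat:.0f}, bootstrap SE = {np.sqrt(v_boot):.0f}")
    ```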

  • Technical products: 12-002-X20040027035
    Description:

    As part of the processing of the National Longitudinal Survey of Children and Youth (NLSCY) cycle 4 data, historical revisions have been made to the data of the first three cycles, either to correct errors or to update the data. During processing, particular attention was given to PERSRUK (Person Identifier) and FIELDRUK (Household Identifier). The same level of attention has not been given to the other identifiers included in the database, CHILDID (Person Identifier) and _IDHD01 (Household Identifier). These identifiers were created for the public files and can also be found in the master files by default. PERSRUK should be used to link records between files and FIELDRUK to determine the household when using the master files.

    Release date: 2004-10-05

  • Technical products: 12-002-X20040027034
    Description:

    The use of command files in Stat/Transfer can expedite the transfer of several data sets in an efficient, replicable manner. This note outlines a simple step-by-step method for creating command files and provides sample code.

    Release date: 2004-10-05

  • Technical products: 21-601-M2004072
    Description:

    The Farm Product Price Index (FPPI) is a monthly series that measures the changes in prices that farmers receive for the agricultural commodities they produce and sell.

    The FPPI was discontinued in March 1995; it was revived in April 2001 owing to continued demand for an index of prices received by farmers.

    Release date: 2004-09-28

  • Surveys and statistical programs – Documentation: 62F0026M2004001
    Description:

    This report describes the quality indicators produced for the 2002 Survey of Household Spending. These quality indicators, such as coefficients of variation, nonresponse rates, slippage rates and imputation rates, help users interpret the survey data.

    Release date: 2004-09-15

  • Technical products: 11-522-X2002001
    Description:

    Since 1984, an annual international symposium on methodological issues has been sponsored by Statistics Canada. Proceedings have been available since 1987.

    Symposium 2002 was the nineteenth in Statistics Canada's series of international symposia on methodological issues. Each year the symposium focuses on a particular theme. In 2002 the theme was: "Modelling Survey Data for Social and Economic Research".

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016726
    Description:

    Although the use of school vouchers is growing in the developing world, the impact of vouchers is an open question. Any sort of long-term assessment of this activity is rare. This paper estimates the long-term effect of Colombia's PACES program, which provided over 125,000 poor children with vouchers that covered half the cost of private secondary school.

    The PACES program presents an unusual opportunity to assess the effect of demand-side education financing in a Latin American country where private schools educate a substantial proportion of pupils. The program is of special interest because many vouchers were assigned by lottery, so program effects can be reliably assessed.

    We use administrative records to assess the long-term impact of PACES vouchers on high school graduation status and test scores. The principal advantage of administrative records is that there is no loss-to-follow-up and the data are much cheaper than a costly and potentially dangerous survey effort. On the other hand, individual ID numbers may be inaccurate, complicating record linkage, and selection bias contaminates the sample of test-takers. We discuss solutions to these problems. The results suggest that the program increased secondary school completion rates, and that college-entrance test scores were higher for lottery winners than losers.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016719
    Description:

    This study takes a look at the modelling methods used for public health data. Public health has a renewed interest in the impact of the environment on health. Ecological or contextual studies ideally investigate these relationships using public health data augmented with environmental characteristics in multilevel or hierarchical models. In these models, individual respondents in health data are the first level and community data are the second level. Most public health data use complex sample survey designs, which require analyses accounting for the clustering, nonresponse, and poststratification to obtain representative estimates of prevalence of health risk behaviours.

    This study uses the Behavioral Risk Factor Surveillance System (BRFSS), a state-specific US health risk factor surveillance system conducted by the Centers for Disease Control and Prevention, which assesses health risk factors in over 200,000 adults annually. BRFSS data are now available at the metropolitan statistical area (MSA) level and provide quality health information for studies of environmental effects. MSA-level analyses combining health and environmental data are further complicated by the joint requirements of the survey sample design and the multilevel analyses.

    We compare three modelling methods in a study of physical activity and selected environmental factors using BRFSS 2000 data. Each of the methods described here is a valid way to analyse complex sample survey data augmented with environmental information, although each accounts for the survey design and multilevel data structure in a different manner and is thus appropriate for slightly different research questions.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016745
    Description:

    The attractiveness of the Regression Discontinuity Design (RDD) rests on its close similarity to a normal experimental design. On the other hand, it is of limited applicability since it is not often the case that units are assigned to the treatment group on the basis of an observable (to the analyst) pre-program measure. Moreover, it only allows identification of the mean impact on a very specific subpopulation. In this technical paper, we show that the RDD straightforwardly generalizes to the instances in which the units' eligibility is established on an observable pre-program measure, with eligible units allowed to freely self-select into the program. This set-up also proves to be very convenient for building a specification test on conventional non-experimental estimators of the program mean impact. The data requirements are clearly described.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016723
    Description:

    Categorical outcomes, such as binary, ordinal and nominal responses, occur often in survey research. Logistic regression investigates the relationship between such categorical response variables and a set of explanatory variables. The LOGISTIC procedure can be used to perform a logistic analysis on data from a random sample. However, this approach is not valid if the data come from other sample designs, such as complex survey designs with stratification, clustering and/or unequal weighting. In these cases, specialized techniques must be applied in order to produce the appropriate estimates and standard errors.

    The SURVEYLOGISTIC procedure, experimental in Version 9, brings logistic regression for survey data to the SAS System and delivers much of the functionality of the LOGISTIC procedure. This paper describes the methodological approach and applications for this new software.

    Release date: 2004-09-13
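
    To make the idea concrete, here is a rough Python sketch of design-weighted logistic regression with Taylor-linearized (sandwich) standard errors, the general technique that procedures of this kind implement. It is a simplified illustration under an assumed stratified, clustered design, not a rendering of the SURVEYLOGISTIC procedure itself.

    ```python
    import numpy as np

    def survey_logit(X, y, w, strata, psu, n_iter=25):
        """Design-weighted logistic regression with a Taylor-linearized
        covariance: Newton-Raphson on the weighted score equations, then a
        stratified between-PSU variance of the score residuals."""
        n, p = X.shape
        beta = np.zeros(p)
        for _ in range(n_iter):
            pr = 1.0 / (1.0 + np.exp(-(X @ beta)))
            H = (X * (w * pr * (1 - pr))[:, None]).T @ X   # weighted information
            beta += np.linalg.solve(H, X.T @ (w * (y - pr)))
        u = X * (w * (y - pr))[:, None]                    # score residuals
        G = np.zeros((p, p))
        for h in np.unique(strata):
            in_h = strata == h
            psus = np.unique(psu[in_h])
            z = np.vstack([u[in_h & (psu == j)].sum(axis=0) for j in psus])
            zc = z - z.mean(axis=0)
            G += len(psus) / (len(psus) - 1) * zc.T @ zc   # between-PSU variance
        Hinv = np.linalg.inv(H)
        return beta, Hinv @ G @ Hinv                       # sandwich covariance
    ```

    Standard errors are the square roots of the diagonal of the returned covariance matrix; with a single-stage unclustered design, each unit is treated as its own PSU.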

  • Technical products: 11-522-X20020016731
    Description:

    Behavioural researchers use a variety of techniques to predict respondent scores on constructs that are not directly observable. Examples of such constructs include job satisfaction, work stress, aptitude for graduate study, children's mathematical ability, etc. The techniques commonly used for modelling and predicting scores on such constructs include factor analysis, classical psychometric scaling and item response theory (IRT), and for each technique there are often several different strategies that can be used to generate individual scores. However, researchers are seldom satisfied with simply measuring these constructs. They typically use the derived scores in multiple regression, analysis of variance and numerous multivariate procedures. Though using predicted scores in this way can result in biased estimates of model parameters, not all researchers are aware of this difficulty. The paper will review the literature on this issue, with particular emphasis on IRT methods. Problems will be illustrated, some remedies suggested, and areas for further research will be identified.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016752
    Description:

    Opening remarks

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016740
    Description:

    Controlling for differences in student populations, we examine the contribution of schools to provincial differences in the reading, math and science achievement of 15-year-olds in this paper. Using a semi-parametric decomposition technique developed by DiNardo, Fortin and Lemieux (1996) for differences in distributions, we find that school differences contribute to provincial differences in different parts of the achievement distribution and that the effect varies by province and by type of skill, even within province. For example, school differences account for about 32% of the difference in mean reading achievement between New Brunswick and Alberta, but reduce the difference in the proportion of students performing at the lowest reading proficiency level. By contrast, school differences account for 94% of the New Brunswick-Alberta gap in the 10th percentile of the science distribution. Our results demonstrate that school effectiveness studies that focus on the first moment of the achievement distribution miss potentially important impacts for specific students.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016712
    Description:

    In this paper, we consider the effect of the interval censoring of cessation time on intensity parameter estimation with regard to smoking cessation and pregnancy. The three waves of the National Population Health Survey allow the methodology of event history analysis to be applied to smoking initiation, cessation and relapse. One issue of interest is the relationship between smoking cessation and pregnancy. If a longitudinal respondent who is a smoker at the first cycle ceases smoking by the second cycle, we know the cessation time to within an interval of length at most a year, since the respondent is asked for the age at which she stopped smoking, and her date of birth is known. We also know whether she is pregnant at the time of the second cycle, and whether she has given birth since the time of the first cycle. For many such subjects, we know the date of conception to within a relatively small interval. If we knew the time of smoking cessation and pregnancy period exactly for each member who experienced one or other of these events between cycles, we could model their temporal relationship through their joint intensities.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016737
    Description:

    If the dataset available to machine learning results from cluster sampling (e.g., patients from a sample of hospital wards), the usual cross-validation error rate estimate can lead to biased and misleading results. In this technical paper, an adapted cross-validation is described for this case. Using a simulation, the sampling distribution of the generalization error rate estimate, under the cluster or simple random sampling hypothesis, is compared with the true value. The results highlight the impact of the sampling design on inference: clustering clearly has a significant impact, and the split between learning set and test set should result from a random partition of the clusters, not of the individual examples. With cluster sampling, standard cross-validation underestimates the generalization error rate and is deficient for model selection. These results are illustrated with a real application of automatic identification of spoken language.

    Release date: 2004-09-13
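
    The remedy the abstract describes - partitioning clusters rather than individual examples - is easy to implement. A minimal Python sketch with simulated ward-clustered data (all values invented for illustration):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GroupKFold

    # Simulated clustered data: 40 wards, 25 patients each, with a shared
    # within-ward effect that induces intra-cluster correlation.
    rng = np.random.default_rng(0)
    ward = np.repeat(np.arange(40), 25)
    ward_effect = rng.normal(0, 1, 40)[ward]
    X = rng.normal(size=(1000, 5)) + ward_effect[:, None]
    y = (X[:, 0] + ward_effect + rng.normal(size=1000) > 0).astype(int)

    # GroupKFold partitions the clusters, so no ward contributes to both the
    # learning set and the test set of the same fold.
    err = []
    for tr, te in GroupKFold(n_splits=5).split(X, y, groups=ward):
        model = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        err.append(1 - model.score(X[te], y[te]))
    print(f"cluster-level CV error rate: {np.mean(err):.3f}")
    ```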

  • Technical products: 11-522-X20020016746
    Description:

    In 1961, the European Commission launched a harmonized qualitative survey program directed at consumers and heads of companies (industry, services, construction, retail trade, investment) that today covers more than 40 countries. These qualitative surveys are aimed at understanding the economic situation of these companies. Results are available a few days after the end of the reference period, well before the results of the quantitative surveys.

    Although qualitative, these surveys have quickly become an essential tool for cyclical diagnosis and short-term economic forecasting. This product shows how these surveys are used by the European Commission, in particular by the Directorate-General for Economic and Financial Affairs (DG ECFIN) and the Statistical Office of the European Communities (EUROSTAT), to evaluate the economic situation of the Euro zone.

    The first part of this product briefly presents the harmonized European business and consumer survey program. In the second part, we look at how DG ECFIN calculates a coincident indicator of economic activity, using a dynamic factor analysis of the questions in the industry survey. This type of indicator also makes it possible to study the convergence of the economic cycles of the member states. The quantitative short-term indicators for the Euro zone are often criticized for the delay with which they are published. In the third part, we look at how EUROSTAT plans to publish flash estimates of the industrial product price index (IPPI) based on econometric models that integrate the business survey series. Lastly, we show how these surveys can be used to forecast gross domestic product (GDP) and to define proxies for some unavailable key indicators (new orders in industry, etc.).

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016734
    Description:

    According to recent literature, the calibration method has gained much popularity in survey sampling, and calibration estimators are routinely computed by many survey organizations. The choice of calibration variables for all existing approaches, however, remains ad hoc. In this article, we show that the model-calibration estimator for the finite population mean, which was proposed by Wu and Sitter (2001) through an intuitive argument, is indeed optimal among a class of calibration estimators. We further present optimal calibration estimators for the finite population distribution function, the population variance, the variance of a linear estimator and other quadratic finite population functions under a unified framework. A limited simulation study shows that the improvement of these optimal estimators over the conventional ones can be substantial. The question of when and how auxiliary information can be used for both the estimation of the population mean using a generalized regression estimator and the estimation of its variance through calibration is addressed clearly under the proposed general methodology. Constructions of the proposed estimators under two-phase sampling and some fundamental issues in using auxiliary information from survey data are also addressed in the context of optimal estimation.

    Release date: 2004-09-13
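
    For reference, the conventional chi-square-distance (linear) calibration that this paper's optimality results speak to can be written in a few lines. This sketch adjusts design weights so that the weighted auxiliary totals reproduce known population totals; all numbers are invented.

    ```python
    import numpy as np

    def calibrate_linear(d, x, totals):
        """Linear (chi-square distance) calibration: returns weights
        w_i = d_i * (1 + x_i' lambda), with lambda chosen so that the
        calibrated totals of x equal the known population totals."""
        T = (x * d[:, None]).T @ x
        lam = np.linalg.solve(T, totals - d @ x)
        return d * (1.0 + x @ lam)

    rng = np.random.default_rng(1)
    d = rng.uniform(5, 15, 200)                       # design weights
    x = np.column_stack([np.ones(200), rng.normal(100, 20, 200)])
    totals = np.array([2000.0, 201000.0])             # known N and x-total
    w = calibrate_linear(d, x, totals)
    print(np.allclose(w @ x, totals))                 # benchmarks met: True
    ```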

  • Technical products: 11-522-X20020016741
    Description:

    Linearization and the jack-knife method are widely used to estimate standard errors for the coefficients of linear regression models fit to multi-stage samples. With few primary sampling units (PSUs) or when a few PSUs have high leverage, linearization estimators can have large negative bias, while the jack-knife method has a correspondingly large positive bias. We characterize the design factors that produce large biases in these standard error estimators. In this technical paper, we propose an alternative estimator, bias reduced linearization (BRL), based on residuals adjusted to better approximate the covariance of the true errors.

    When errors are independently and identically distributed (iid), the BRL estimator is unbiased. The BRL method applies to stratified samples with non-constant selection weights and to generalized linear models such as logistic regression. We also discuss BRL standard error estimators for generalized estimating equation models that explicitly model the dependence among observations from the same PSU in data from complex sample designs. Simulation study results show that BRL standard errors, combined with the Satterthwaite approximation to determine the reference distribution, yield tests with Type I error rates near nominal values. We contrast our method with alternatives proposed by Kott (1994 and 1996) and Mancl and DeRouen (2001).

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016753
    Description:

    Keynote Address.

    Release date: 2004-09-13

Data (2 results)

  • Table: 53-500-X
    Description:

    This report presents the results of a pilot survey conducted by Statistics Canada to measure the fuel consumption of on-road motor vehicles registered in Canada. This study was carried out in connection with the Canadian Vehicle Survey (CVS) which collects information on road activity such as distance traveled, number of passengers and trip purpose.

    Release date: 2004-10-21

  • Table: 95F0495X2001012
    Description:

    This table contains information from the 2001 Census, presented according to the statistical area classification (SAC). The SAC groups census subdivisions according to whether they are a component of a census metropolitan area, a census agglomeration, a census metropolitan area and census agglomeration influenced zone (strong MIZ, moderate MIZ, weak MIZ or no MIZ) or of the territories (Northwest Territories, Nunavut and Yukon Territory). The SAC is used for data dissemination purposes.

    Data characteristics presented according to the SAC include age, visible minority groups, immigration, mother tongue, education, income, work and dwellings. Data are presented for Canada, provinces and territories. The data characteristics presented within this table may differ from those of other products in the "Profiles" series.

    Release date: 2004-02-27

Analysis (25 of 26 results)

  • Articles and reports: 11F0019M2004219
    Description:

    This study investigates trends in family income inequality in the 1980s and 1990s, with particular attention paid to the recovery period of the 1990s.

    Release date: 2004-12-16

  • Articles and reports: 13-604-M2004045
    Description:

    How "good" are the National Tourism Indicators (NTI)? How can their quality be measured? This study looks to answer these questions by analysing the revisions to the NTI estimates for the period 1997 through 2001.

    Release date: 2004-10-25

  • Articles and reports: 12-001-X20040016996
    Description:

    This article studies the use of the sample distribution for the prediction of finite population totals under single-stage sampling. The proposed predictors employ the sample values of the target study variable, the sampling weights of the sample units and possibly known population values of auxiliary variables. The prediction problem is solved by estimating the expectation of the study values for units outside the sample as a function of the corresponding expectation under the sample distribution and the sampling weights. The prediction mean square error is estimated by a combination of an inverse sampling procedure and a re-sampling method. An interesting outcome of the present analysis is that several familiar estimators in common use are shown to be special cases of the proposed approach, thus providing them a new interpretation. The performance of the new and some old predictors in common use is evaluated and compared by a Monte Carlo simulation study using a real data set.

    Release date: 2004-07-14

  • Articles and reports: 12-001-X20040016995
    Description:

    One of the main objectives of a sample survey is the computation of estimates of means and totals for specific domains of interest. Domains are determined either before the survey is carried out (primary domains) or after it has been carried out (secondary domains). The reliability of the associated estimates depends on the variability of the sample size as well as on the y-variables of interest. This variability cannot be controlled in the absence of auxiliary information for subgroups of the population. However, if auxiliary information is available, the estimated reliability of the resulting estimates can be controlled to some extent. In this paper, we study the potential improvements in terms of the reliability of domain estimates that use auxiliary information. The properties (bias, coverage, efficiency) of various estimators that use auxiliary information are compared using a conditional approach.

    Release date: 2004-07-14

  • Articles and reports: 12-001-X20040016993
    Description:

    The weighting cell estimator corrects for unit nonresponse by dividing the sample into homogeneous groups (cells) and applying a ratio correction to the respondents within each cell. Previous studies of the statistical properties of weighting cell estimators have assumed that these cells correspond to known population cells with homogeneous characteristics. In this article, we study the properties of the weighting cell estimator under a response probability model that does not require correct specification of homogeneous population cells. Instead, we assume that the response probabilities are a smooth but otherwise unspecified function of a known auxiliary variable. Under this more general model, we study the robustness of the weighting cell estimator against model misspecification. We show that, even when the population cells are unknown, the estimator is consistent with respect to the sampling design and the response model. We describe the effect of the number of weighting cells on the asymptotic properties of the estimator. Simulation experiments explore the finite sample properties of the estimator. We conclude with some guidance on how to select the size and number of cells for practical implementation of weighting cell estimation when those cells cannot be specified a priori.

    Release date: 2004-07-14
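
    The weighting cell adjustment itself is simple; the paper's contribution is its properties under a general response model. A minimal Python sketch with hypothetical inputs:

    ```python
    import numpy as np

    def weighting_cell_adjust(w, cell, respondent):
        """Within each cell, inflate respondent weights by (weighted sample
        count) / (weighted respondent count); nonrespondents get weight 0,
        so each cell's weighted total is preserved."""
        w_adj = np.where(respondent, w, 0.0)
        for c in np.unique(cell):
            rows = cell == c
            w_adj[rows] *= w[rows].sum() / w_adj[rows].sum()
        return w_adj
    ```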

  • Articles and reports: 12-001-X20040016998
    Description:

    The Canadian Labour Force Survey (LFS) was not designed to be a longitudinal survey. However, given that respondent households typically remain in the sample for six consecutive months, it is possible to reconstruct six-month fragments of longitudinal data from the monthly records of household members. Such longitudinal microdata - altogether consisting of millions of person-months of individual- and family-level data - are useful for analyses of monthly labour market dynamics over relatively long periods of time, 25 years and more.

    We make use of these data to estimate hazard functions describing transitions among the labour market states: self-employed, paid employee and not employed. Data on job tenure, for employed respondents, and on the date last worked, for those not employed - together with the date of survey responses - allow the construction of models that include terms reflecting seasonality and macro-economic cycles as well as the duration dependence of each type of transition. In addition, the LFS data permits spouse labour market activity and family composition variables to be included in the hazard models as time-varying covariates. The estimated hazard equations have been incorporated in the LifePaths microsimulation model. In that setting, the equations have been used to simulate lifetime employment activity from past, present and future birth cohorts. Simulation results have been validated by comparison with the age profiles of LFS employment/population ratios for the period 1976 to 2001.

    Release date: 2004-07-14
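
    One standard way to estimate hazard functions from person-month records of this kind is a discrete-time hazard model: a logistic regression of the monthly transition indicator on duration-dependence and seasonal terms. A toy Python sketch on simulated person-months (the covariates and effect sizes here are invented, not taken from the paper):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Simulated person-month records: one row per person-month at risk,
    # with a 0/1 flag for a labour market transition that month.
    rng = np.random.default_rng(7)
    n = 5000
    duration = rng.integers(1, 60, n)          # months in current state
    month = rng.integers(1, 13, n)             # calendar month (seasonality)
    season = np.eye(12)[month - 1]             # seasonal dummies
    X = np.column_stack([np.log(duration), season[:, 1:]])
    logit = -2.0 - 0.3 * np.log(duration) + 0.4 * (month == 6)
    trans = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

    # Discrete-time hazard: logistic regression of the monthly transition
    # indicator on duration dependence and seasonal terms.
    model = LogisticRegression(max_iter=1000).fit(X, trans)
    print(model.coef_.round(2))
    ```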

  • Articles and reports: 12-001-X20040016999
    Description:

    Combining response data from the Belgian Fertility and Family Survey with individual-level and municipality-level data from the 1991 Census for both nonrespondents and respondents, multilevel logistic regression models for contact and cooperation propensity are estimated. The covariates introduced are a selection of indirect features, all outside the researchers' direct control. Contrary to previous research, socio-economic status is found to be positively related to cooperation. Another unexpected result is the absence of any considerable impact of ecological correlates such as urbanity.

    Release date: 2004-07-14

  • Articles and reports: 12-001-X20040016997
    Description:

    Multilevel models are often fitted to survey data gathered with a complex multistage sampling design. However, if such a design is informative, in the sense that the inclusion probabilities depend on the response variable even after conditioning on the covariates, then standard maximum likelihood estimators are biased. In this paper, following the Pseudo Maximum Likelihood (PML) approach of Skinner (1989), we propose a probability-weighted estimation procedure for multilevel ordinal and binary models which eliminates the bias generated by the informativeness of the design. The reciprocals of the inclusion probabilities at each sampling stage are used to weight the log-likelihood function, and the weighted estimators obtained in this way are tested by means of a simulation study for the simple case of a binary random intercept model with and without covariates. The variance estimators are obtained by a bootstrap procedure. The maximization of the weighted log-likelihood of the model is done by the NLMIXED procedure of SAS, which is based on adaptive Gaussian quadrature. The bootstrap estimation of variances is also implemented in the SAS environment.

    Release date: 2004-07-14

  • Articles and reports: 12-001-X20040016994
    Description:

    When imputation is used to assign values for missing items in sample surveys, naïve methods of estimating the variances of survey estimates that treat the imputed values as if they were observed give biased variance estimates. This article addresses the problem of variance estimation for a linear estimator in which missing values are assigned by a single hot deck imputation (a form of imputation that is widely used in practice). We propose estimators of the variance of a linear hot deck imputed estimator using a decomposition of the total variance suggested by Särndal (1992). A conditional approach to variance estimation is developed that is applicable to both weighted and unweighted hot deck imputation. Estimation of the variance of a domain estimator is also examined.

    Release date: 2004-07-14
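
    For readers unfamiliar with the method, single random hot deck imputation within classes looks like the sketch below (the paper's subject is the variance of estimators computed from such imputed data, not the imputation itself):

    ```python
    import numpy as np

    def hot_deck_impute(y, cell, rng):
        """Single random hot deck: each missing value is replaced by the
        value of a randomly chosen respondent (donor) in the same cell."""
        y = y.copy()
        for c in np.unique(cell):
            rows = np.where(cell == c)[0]
            donors = rows[~np.isnan(y[rows])]
            holes = rows[np.isnan(y[rows])]
            y[holes] = y[rng.choice(donors, size=len(holes))]
        return y
    ```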

  • Articles and reports: 12-001-X20040019186
    Description:

    In this Issue is a column where the Editor briefly presents each paper of the current issue of Survey Methodology. It also sometimes contains information on structural or management changes in the journal.

    Release date: 2004-07-14

  • Articles and reports: 12-001-X20040016992
    Description:

    In the U.S. Census of Population and Housing, a sample of about one-in-six of the households receives a longer version of the census questionnaire called the long form. All others receive a version called the short form. Raking, using selected control totals from the short form, has been used to create two sets of weights for long form estimation; one for individuals and one for households. We describe a weight construction method based on quadratic programming that produces household weights such that the weighted sum for individual characteristics and for household characteristics agree closely with selected short form totals. The method is broadly applicable to situations where weights are to be constructed to meet both size bounds and sum-to-control restrictions. Application to the situation where the controls are estimates with an estimated covariance matrix is described.

    Release date: 2004-07-14
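
    The weight-construction problem described here - stay close to the design weights, respect size bounds, and reproduce both person-level and household-level controls - is a small quadratic program. A hedged Python sketch using SciPy's SLSQP solver on invented numbers:

    ```python
    import numpy as np
    from scipy.optimize import minimize

    # Find household weights w close to the design weights d that (i) respect
    # size bounds and (ii) reproduce household- and person-level controls.
    rng = np.random.default_rng(3)
    n = 100
    d = rng.uniform(4, 8, n)
    persons = rng.integers(1, 6, n)            # persons per household
    A = np.vstack([np.ones(n), persons])       # household count, person count
    T = np.array([650.0, 1900.0])              # hypothetical control totals

    res = minimize(
        fun=lambda w: np.sum((w - d) ** 2),                  # quadratic objective
        x0=d,
        bounds=[(1.0, 20.0)] * n,                            # size bounds
        constraints={"type": "eq", "fun": lambda w: A @ w - T},
        method="SLSQP",
    )
    print(res.success, np.round(A @ res.x - T, 4))           # controls met
    ```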

  • Articles and reports: 12-001-X20040016991
    Description:

    In survey sampling, Taylor linearization is often used to obtain variance estimators for calibration estimators of totals and nonlinear finite population (or census) parameters, such as ratios, regression and correlation coefficients, which can be expressed as smooth functions of totals. Taylor linearization is generally applicable to any sampling design, but it can lead to multiple variance estimators that are asymptotically design unbiased under repeated sampling. The choice among the variance estimators requires other considerations such as (i) approximate unbiasedness for the model variance of the estimator under an assumed model, (ii) validity under a conditional repeated sampling framework. In this paper, a new approach to deriving Taylor linearization variance estimators is proposed. It leads directly to a variance estimator which satisfies the above considerations at least in a number of important cases. The method is applied to a variety of problems, covering estimators of a total as well as other estimators defined either explicitly or implicitly as solutions of estimating equations. In particular, estimators of logistic regression parameters with calibration weights are studied. It leads to a new variance estimator for a general class of calibration estimators that includes generalized raking ratio and generalized regression estimators. The proposed method is extended to two-phase sampling to obtain a variance estimator that makes fuller use of the first phase sample data compared to traditional linearization variance estimators.

    Release date: 2004-07-14
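
    As a simple instance of Taylor linearization, here is the textbook variance estimator for a ratio R_hat = Y_hat / X_hat: apply a with-replacement variance formula to the linearized variable z_k = (y_k - R_hat * x_k) / X_hat. A Python sketch treating units as PSUs sampled with replacement; this illustrates the general technique, not the paper's new estimator:

    ```python
    import numpy as np

    def ratio_linearized_var(y, x, w):
        """Taylor linearization variance of R_hat = Y_hat / X_hat:
        with-replacement variance of the weighted linearized variable."""
        Xhat, Yhat = np.sum(w * x), np.sum(w * y)
        R = Yhat / Xhat
        z = (y - R * x) / Xhat        # linearized variable
        t = w * z                     # per-unit contributions
        n = len(y)
        v = n / (n - 1) * np.sum((t - t.mean()) ** 2)
        return R, v
    ```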

  • Articles and reports: 12-001-X20040016990
    Description:

    Survey statisticians have long known that the question-answer process is a source of response effects that contribute to non-random measurement error. In the past two decades there has been substantial progress toward understanding these sources of error by applying concepts from social and cognitive psychology to the study of the question-answer process. This essay reviews the development of these approaches, discusses the present state of our knowledge, and suggests some research priorities for the future.

    Release date: 2004-07-14

  • Articles and reports: 89-552-M2004011
    Description:

    This paper develops a measure of investment in education from the literacy level of labour market entrants, using the 1994 International Adult Literacy Survey.

    Release date: 2004-06-22

  • Articles and reports: 91F0015M2004006
    Description:

    The paper assesses and compares new and old methodologies for official estimates of migration within and among provinces and territories for the period 1996/97 to 2000/01.

    Release date: 2004-06-17

  • Articles and reports: 82-003-X20030036847
    Description:

    This paper examines whether accepting proxy- instead of self-responses results in lower estimates of some health conditions. It analyses data from the National Population Health Survey and the Canadian Community Health Survey.

    Release date: 2004-05-18

  • Articles and reports: 12-001-X20030026782
    Description:

    This paper discusses both the general question of designing a post-enumeration survey, and how these general questions were addressed in the U.S. Census Bureau's coverage measurement planned as part of Census 2000. It relates the basic concepts of the Dual System Estimator to questions of the definition and measurement of correct enumerations, the measurement of census omissions, operational independence, reporting of residence, and the role of after-matching reinterview. It discusses estimation issues such as the treatment of movers, missing data, and synthetic estimation of local corrected population size. It also discusses where the design failed in Census 2000.

    Release date: 2004-01-27
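
    The Dual System Estimator at the heart of this design is the classical capture-recapture (Petersen) formula: with N1 correct census enumerations, N2 coverage-survey enumerations and M matches, and assuming the two systems operate independently, the population is estimated as N1 * N2 / M. A toy illustration with invented counts:

    ```python
    # Minimal dual-system (Petersen) estimate; all counts are hypothetical.
    census_count = 9500          # correct enumerations in the census (N1)
    pes_count = 9700             # persons counted by the coverage survey (N2)
    matched = 9100               # persons found in both systems (M)

    # Under independence of the two systems, N_hat = N1 * N2 / M.
    n_hat = census_count * pes_count / matched
    print(f"DSE population: {n_hat:.0f}, "
          f"implied net undercount: {n_hat - census_count:.0f}")
    ```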

  • Articles and reports: 12-001-X20030026787
    Description:

    Application of classical statistical methods to data from complex sample surveys without making allowance for the survey design features can lead to erroneous inferences. Methods have been developed that account for the survey design, but these methods require additional information such as survey weights, design effects or cluster identification for microdata. Inverse sampling (Hinkins, Oh and Scheuren 1997) provides an alternative approach by undoing the complex survey data structures so that standard methods can be applied. Repeated subsamples with unconditional simple random sampling structure are drawn, each subsample is analysed by standard methods, and the results are combined to increase efficiency. This method has the potential to preserve the confidentiality of microdata, although it is computer-intensive. We present some theory of inverse sampling and explore its limitations. A combined estimating equations approach is proposed for handling complex parameters such as ratios and "census" linear regression and logistic regression parameters. The method is applied to a cluster-correlated data set reported in Battese, Harter and Fuller (1988).

    Release date: 2004-01-27

  • Articles and reports: 12-001-X20030026780
    Description:

    Coverage errors and other coverage issues related to population censuses are examined in the light of the recent literature. In particular, when actual population census counts of persons are matched with their corresponding post-enumeration survey counts, the aggregated results in a dual-record-system setting can provide coverage error statistics.

    In this paper, coverage error issues are evaluated and alternative solutions are discussed in the light of the results from the latest Population Census of Turkey. Using the census and post-enumeration survey data, a regional comparison of census coverage was also made, showing considerable variability among regions. Some methodological remarks are also made on possible improvements to the current enumeration procedures.

    Release date: 2004-01-27

  • Articles and reports: 12-001-X20030026784
    Description:

    Skinner and Elliot (2002) proposed a simple measure of disclosure risk for survey microdata and showed how to estimate this measure under sampling with equal probabilities. In this paper we show how their results on point estimation and variance estimation may be extended to handle unequal probability sampling. Our approach assumes a Poisson sampling design. Comments are made about the possible impact of departures from this assumption.

    Release date: 2004-01-27

  • Articles and reports: 12-001-X20030026777
    Description:

    The Accuracy and Coverage Evaluation survey was conducted to estimate the coverage in the 2000 U.S. Census. After field procedures were completed, several types of missing data had to be addressed to apply dual-system estimation. Some housing units were not interviewed. Two noninterview adjustments were devised from the same set of interviews, one for each of two points in time. In addition, the resident, match, or enumeration status of some respondents was not determined. Methods applied in the past were replaced to accommodate a tighter schedule to compute and verify the estimates. This paper presents the extent of missing data in the survey, describes the procedures applied, comparing them to past and current alternatives, and provides analytical summaries of the procedures, including comparisons of dual-system estimates of population under alternatives. Because the resulting levels of missing data were low, it appears that alternative procedures would not have affected the results substantially. However some changes in the estimates are noted.

    Release date: 2004-01-27

  • Articles and reports: 12-001-X20030026781
    Description:

    Census counts are known to be inexact, based on comparisons of census and Post Enumeration Survey (PES) figures. In Italy, the role of municipal administrations is crucial for both census and PES field operations. In this paper, we analyze the impact of municipality on Italian census undercount rates by modelling data from the PES as well as from other sources, using Poisson regression trees and hierarchical Poisson models. The Poisson regression trees cluster municipalities into homogeneous groups. The hierarchical Poisson models can be considered tools for small area estimation.

    Release date: 2004-01-27

  • Articles and reports: 12-001-X20030029054
    Description:

    In this Issue is a column where the Editor briefly presents each paper of the current issue of Survey Methodology. It also sometimes contains information on structural or management changes in the journal.

    Release date: 2004-01-27

  • Articles and reports: 12-001-X20030026785
    Description:

    To avoid disclosures, one approach is to release partially synthetic, public use microdata sets. These comprise the units originally surveyed, but some collected values, for example sensitive values at high risk of disclosure or values of key identifiers, are replaced with multiple imputations. Although partially synthetic approaches are currently used to protect public use data, valid methods of inference have not been developed for them. This article presents such methods. They are based on the concepts of multiple imputation for missing data but use different rules for combining point and variance estimates. The combining rules also differ from those for fully synthetic data sets developed by Raghunathan, Reiter and Rubin (2003). The validity of these new rules is illustrated in simulation studies.

    Release date: 2004-01-27
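
    The combining rules for partially synthetic data differ from ordinary multiple imputation mainly in the variance term. A sketch of the point and variance combination described by Reiter (2003), for a scalar estimand:

    ```python
    import numpy as np

    def combine_partial_synthetic(q, u):
        """q: point estimates from the m synthetic data sets; u: their
        estimated variances.  Total variance is u_bar + b/m, where b is the
        between-synthesis variance (contrast with (1 + 1/m) * b in ordinary
        multiple imputation for missing data)."""
        q, u = np.asarray(q, float), np.asarray(u, float)
        m = len(q)
        return q.mean(), u.mean() + q.var(ddof=1) / m
    ```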

  • Articles and reports: 12-001-X20030026778
    Description:

    Using both purely design-based and model-assisted arguments, it is shown that, under conditions of high entropy, the variance of the Horvitz-Thompson (HT) estimator depends almost entirely on first-order inclusion probabilities. Approximate expressions and estimators are derived for this "high entropy" variance of the HT estimator. Monte Carlo simulation studies are conducted to examine the statistical properties of the proposed variance estimators.

    Release date: 2004-01-27
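
    To fix ideas, the Horvitz-Thompson estimator and one simple Hajek-type "high entropy" variance estimator that uses only first-order inclusion probabilities can be sketched as follows. This is one common variant from the literature, not necessarily the approximation derived in the paper:

    ```python
    import numpy as np

    def ht_total(y, pi):
        """Horvitz-Thompson estimator of a population total."""
        return np.sum(y / pi)

    def ht_var_high_entropy(y, pi):
        """Hajek-type variance estimator using only first-order inclusion
        probabilities: (1 - pi)-weighted squared residuals of y_k / pi_k
        around a (1 - pi)-weighted mean."""
        a = 1.0 - pi
        ystar = np.sum(a * y / pi) / np.sum(a)
        n = len(y)
        return n / (n - 1) * np.sum(a * (y / pi - ystar) ** 2)
    ```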

Reference (25 of 62 results)

  • Index and guides: 92-395-X
    Description:

    This report describes sampling and weighting procedures used in the 2001 Census. It reviews the history of these procedures in Canadian censuses, provides operational and theoretical justifications for them, and presents the results of the evaluation studies of these procedures.

    Release date: 2004-12-15

  • Index and guides: 92-394-X
    Description:

    This report deals with coverage errors that occur when persons, households, dwellings or families are missed or enumerated in error by the census. After the 2001 Census was taken, a number of studies were carried out to estimate gross undercoverage, gross overcoverage and net undercoverage. This report presents the results of the Dwelling Classification Study, the Reverse Record Check Study, the Automated Match Study and the Collective Dwelling Study. The report first describes the census universes, coverage error, and the census collection and processing procedures that may result in coverage error. It then gives estimates of net undercoverage for a number of demographic characteristics. Next, after describing how the results of the various studies are combined, it presents the methodology and results of each coverage study and the resulting estimates of coverage error. A historical perspective completes the report.

    Release date: 2004-11-25

  • Surveys and statistical programs – Documentation: 31-533-X
    Description:

    Starting with the August 2004 reference month, the Monthly Survey of Manufacturing (MSM) is using administrative data (Goods and Services Tax files) to derive shipments for a portion of the small establishments in the sample. This document is being published to complement the release of MSM data for that month.

    Release date: 2004-10-15

  • Technical products: 12-002-X20040027032
    Description:

    This article examines why many Statistics Canada surveys supply bootstrap weights with their microdata for the purpose of design-based variance estimation. Bootstrap weights are not supported by commercially available software such as SUDAAN and WesVar, but there are ways to use these applications to produce bootstrap variance estimates.

    The paper concludes with a brief discussion of other design-based approaches to variance estimation as well as software, programs and procedures where these methods have been employed.

    Release date: 2004-10-05

  • Technical products: 12-002-X20040027035
    Description:

    As part of the processing of the National Longitudinal Survey of Children and Youth (NLSCY) cycle 4 data, historical revisions have been made to the data of the first three cycles, either to correct errors or to update the data. During processing, particular attention was given to PERSRUK (Person Identifier) and FIELDRUK (Household Identifier). The same level of attention has not been given to the other identifiers included in the database, CHILDID (Person Identifier) and _IDHD01 (Household Identifier). These identifiers were created for the public files and can also be found in the master files by default. PERSRUK should be used to link records between files and FIELDRUK to determine the household when using the master files.

    Release date: 2004-10-05

  • Technical products: 12-002-X20040027034
    Description:

    The use of command files in Stat/Transfer can expedite the transfer of several data sets in an efficient, replicable manner. This note outlines a simple step-by-step method for creating command files and provides sample code.

    Release date: 2004-10-05

  • Technical products: 21-601-M2004072
    Description:

    The Farm Product Price Index (FPPI) is a monthly series that measures the changes in prices that farmers receive for the agricultural commodities they produce and sell.

    The FPPI was discontinued in March 1995; it was revived in April 2001 owing to continued demand for an index of prices received by farmers.

    Release date: 2004-09-28

  • Surveys and statistical programs – Documentation: 62F0026M2004001
    Description:

    This report describes the quality indicators produced for the 2002 Survey of Household Spending. These quality indicators, such as coefficients of variation, nonresponse rates, slippage rates and imputation rates, help users interpret the survey data.

    Release date: 2004-09-15

  • Technical products: 11-522-X2002001
    Description:

    Since 1984, an annual international symposium on methodological issues has been sponsored by Statistics Canada. Proceedings have been available since 1987.

    Symposium 2002 was the nineteenth in Statistics Canada's series of international symposia on methodological issues. Each year the symposium focuses on a particular theme. In 2002 the theme was: "Modelling Survey Data for Social and Economic Research".

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016726
    Description:

    Although the use of school vouchers is growing in the developing world, the impact of vouchers is an open question. Any sort of long-term assessment of this activity is rare. This paper estimates the long-term effect of Colombia's PACES program, which provided over 125,000 poor children with vouchers that covered half the cost of private secondary school.

    The PACES program presents an unusual opportunity to assess the effect of demand-side education financing in a Latin American country where private schools educate a substantial proportion of pupils. The program is of special interest because many vouchers were assigned by lottery, so program effects can be reliably assessed.

    We use administrative records to assess the long-term impact of PACES vouchers on high school graduation status and test scores. The principal advantage of administrative records is that there is no loss-to-follow-up and the data are much cheaper than a costly and potentially dangerous survey effort. On the other hand, individual ID numbers may be inaccurate, complicating record linkage, and selection bias contaminates the sample of test-takers. We discuss solutions to these problems. The results suggest that the program increased secondary school completion rates, and that college-entrance test scores were higher for lottery winners than losers.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016719
    Description:

    This study takes a look at the modelling methods used for public health data. Public health has a renewed interest in the impact of the environment on health. Ecological or contextual studies ideally investigate these relationships using public health data augmented with environmental characteristics in multilevel or hierarchical models. In these models, individual respondents in health data are the first level and community data are the second level. Most public health data use complex sample survey designs, which require analyses accounting for the clustering, nonresponse, and poststratification to obtain representative estimates of prevalence of health risk behaviours.

    This study uses the Behavioral Risk Factor Surveillance System (BRFSS), a state-specific US health risk factor surveillance system conducted by the Centers for Disease Control and Prevention, which assesses health risk factors in over 200,000 adults annually. BRFSS data are now available at the metropolitan statistical area (MSA) level and provide quality health information for studies of environmental effects. MSA-level analyses combining health and environmental data are further complicated by the joint requirements of the survey sample design and the multilevel analyses.

    We compare three modelling methods in a study of physical activity and selected environmental factors using BRFSS 2000 data. Each of the methods described here is a valid way to analyse complex sample survey data augmented with environmental information, although each accounts for the survey design and multilevel data structure in a different manner and is thus appropriate for slightly different research questions.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016745
    Description:

    The attractiveness of the Regression Discontinuity Design (RDD) rests on its close similarity to a normal experimental design. On the other hand, it is of limited applicability since it is not often the case that units are assigned to the treatment group on the basis of an observable (to the analyst) pre-program measure. Moreover, it only allows identification of the mean impact on a very specific subpopulation. In this technical paper, we show that the RDD straightforwardly generalizes to the instances in which the units' eligibility is established on an observable pre-program measure, with eligible units allowed to freely self-select into the program. This set-up also proves to be very convenient for building a specification test on conventional non-experimental estimators of the program mean impact. The data requirements are clearly described.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016723
    Description:

    Categorical outcomes, such as binary, ordinal and nominal responses, occur often in survey research. Logistic regression investigates the relationship between such categorical response variables and a set of explanatory variables. The LOGISTIC procedure can be used to perform a logistic analysis on data from a random sample. However, this approach is not valid if the data come from other sample designs, such as complex survey designs with stratification, clustering and/or unequal weighting. In these cases, specialized techniques must be applied in order to produce the appropriate estimates and standard errors.

    The SURVEYLOGISTIC procedure, experimental in Version 9, brings logistic regression for survey data to the SAS System and delivers much of the functionality of the LOGISTIC procedure. This paper describes the methodological approach and applications for this new software.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016731
    Description:

    Behavioural researchers use a variety of techniques to predict respondent scores on constructs that are not directly observable. Examples of such constructs include job satisfaction, work stress, aptitude for graduate study, children's mathematical ability, etc. The techniques commonly used for modelling and predicting scores on such constructs include factor analysis, classical psychometric scaling and item response theory (IRT), and for each technique there are often several different strategies that can be used to generate individual scores. However, researchers are seldom satisfied with simply measuring these constructs. They typically use the derived scores in multiple regression, analysis of variance and numerous multivariate procedures. Though using predicted scores in this way can result in biased estimates of model parameters, not all researchers are aware of this difficulty. The paper will review the literature on this issue, with particular emphasis on IRT methods. Problems will be illustrated, some remedies suggested, and areas for further research will be identified.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016752
    Description:

    Opening remarks

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016740
    Description:

    Controlling for differences in student populations, we examine the contribution of schools to provincial differences in the reading, math and science achievement of 15-year-olds in this paper. Using a semi-parametric decomposition technique developed by DiNardo, Fortin and Lemieux (1996) for differences in distributions, we find that school differences contribute to provincial differences in different parts of the achievement distribution and that the effect varies by province and by type of skill, even within province. For example, school differences account for about 32% of the difference in mean reading achievement between New Brunswick and Alberta, but reduce the difference in the proportion of students performing at the lowest reading proficiency level. By contrast, school differences account for 94% of the New Brunswick-Alberta gap in the 10th percentile of the science distribution. Our results demonstrate that school effectiveness studies that focus on the first moment of the achievement distribution miss potentially important impacts for specific students.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016712
    Description:

    In this paper, we consider the effect of the interval censoring of cessation time on intensity parameter estimation with regard to smoking cessation and pregnancy. The three waves of the National Population Health Survey allow the methodology of event history analysis to be applied to smoking initiation, cessation and relapse. One issue of interest is the relationship between smoking cessation and pregnancy. If a longitudinal respondent who is a smoker at the first cycle ceases smoking by the second cycle, we know the cessation time to within an interval of length at most a year, since the respondent is asked for the age at which she stopped smoking, and her date of birth is known. We also know whether she is pregnant at the time of the second cycle, and whether she has given birth since the time of the first cycle. For many such subjects, we know the date of conception to within a relatively small interval. If we knew the time of smoking cessation and pregnancy period exactly for each member who experienced one or other of these events between cycles, we could model their temporal relationship through their joint intensities.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016737
    Description:

    If the dataset available for machine learning results from cluster sampling (e.g., patients from a sample of hospital wards), the usual cross-validation error rate estimate can lead to biased and misleading results. In this technical paper, an adapted cross-validation is described for this case. Using a simulation, the sampling distribution of the generalization error rate estimate, under the cluster or simple random sampling hypothesis, is compared with the true value. The results highlight the impact of the sampling design on inference: clustering clearly has a significant impact, and the split between learning set and test set should result from a random partition of the clusters, not from a random partition of the examples. With cluster sampling, standard cross-validation underestimates the generalization error rate and is deficient for model selection. These results are illustrated with a real application of automatic identification of spoken language.
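
    A minimal sketch of the point (a toy simulation of our own, not the paper's experiment): when examples come in clusters, the folds must be formed from whole clusters, as scikit-learn's GroupKFold does, or the error rate is badly underestimated.

        # Labels below are a cluster-level trait unrelated to the features, so
        # honest generalization to a new cluster is no better than chance.
        import numpy as np
        from sklearn.model_selection import KFold, GroupKFold, cross_val_score
        from sklearn.neighbors import KNeighborsClassifier

        rng = np.random.default_rng(0)
        n_clusters, per_cluster = 40, 25
        groups = np.repeat(np.arange(n_clusters), per_cluster)
        centroids = rng.normal(size=(n_clusters, 2))
        labels = rng.integers(0, 2, n_clusters)
        X = centroids[groups] + rng.normal(0, 0.2, (len(groups), 2))
        y = labels[groups]

        knn = KNeighborsClassifier(n_neighbors=5)
        acc_example = cross_val_score(knn, X, y, cv=KFold(5, shuffle=True, random_state=0))
        acc_cluster = cross_val_score(knn, X, y, cv=GroupKFold(5), groups=groups)
        print(f"example-level CV accuracy: {acc_example.mean():.2f}")  # optimistic (~1.0)
        print(f"cluster-level CV accuracy: {acc_cluster.mean():.2f}")  # honest (~0.5)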

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016746
    Description:

    In 1961, the European Commission launched a program of harmonized qualitative surveys of consumers and business managers (industry, services, construction, retail trade, investment) that today covers more than 40 countries. These qualitative surveys are designed to gauge the economic situation of these businesses. Results are available within a few days of the end of the reference period, well before the results of the quantitative surveys.

    Although qualitative, these surveys have quickly become an essential tool for cyclical diagnosis and short-term economic forecasting. This product shows how these surveys are used by the European Commission, in particular by the Directorate-General for Economic and Financial Affairs (DG ECFIN) and the Statistical Office of the European Communities (EUROSTAT), to evaluate the economic situation of the Euro zone.

    The first part of this product briefly presents the harmonized European business and consumer survey program. In the second part, we look at how DG ECFIN calculates a coincident indicator of economic activity using a dynamic factor analysis of the questions in the industry survey. An indicator of this type also makes it possible to study the convergence of the member states' economic cycles. The quantitative short-term indicators for the Euro zone are often criticized for the delay with which they are published; in the third part, we therefore look at how EUROSTAT plans to publish flash estimates of the industrial producer price index (IPPI) derived from econometric models that incorporate the business survey series. Lastly, we show how these surveys can be used to forecast gross domestic product (GDP) and to define proxies for key indicators that are not available (new orders in industry, etc.).
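
    The flavour of such a coincident indicator can be sketched with a static simplification (ordinary principal components on simulated balance series, standing in for the dynamic factor analysis actually used; all series below are artificial):

        # Extract the common factor from standardized survey balances as a
        # crude coincident indicator of activity.
        import numpy as np

        rng = np.random.default_rng(2)
        T = 120                                               # months
        cycle = np.sin(np.linspace(0, 4 * np.pi, T))          # common business cycle
        # Five survey balances = common cycle + idiosyncratic noise
        balances = cycle[:, None] * rng.uniform(0.5, 1.5, 5) + rng.normal(0, 0.5, (T, 5))

        Z = (balances - balances.mean(0)) / balances.std(0)   # standardize each series
        eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z.T))   # eigh sorts ascending
        indicator = Z @ eigvecs[:, -1]                        # first principal component
        # The sign of a factor is arbitrary; orient it to track the cycle upward
        indicator *= np.sign(np.corrcoef(indicator, cycle)[0, 1])
        print("correlation with the underlying cycle:",
              round(np.corrcoef(indicator, cycle)[0, 1], 2))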

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016734
    Description:

    According to recent literature, the calibration method has gained much popularity in survey sampling, and calibration estimators are routinely computed by many survey organizations. In all existing approaches, however, the choice of calibration variables remains ad hoc. In this article, we show that the model-calibration estimator for the finite population mean, proposed by Wu and Sitter (2001) through an intuitive argument, is indeed optimal among a class of calibration estimators. We further present optimal calibration estimators for the finite population distribution function, the population variance, the variance of a linear estimator and other quadratic finite population functions under a unified framework. A limited simulation study shows that the improvement of these optimal estimators over the conventional ones can be substantial. The question of when and how auxiliary information can be used for both the estimation of the population mean, using a generalized regression estimator, and the estimation of its variance through calibration is addressed clearly under the proposed general methodology. Constructions of the proposed estimators under two-phase sampling and some fundamental issues in using auxiliary information from survey data are also addressed in the context of optimal estimation.
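
    The basic calibration mechanics can be sketched as follows (a generic chi-square-distance calibration under simple random sampling, not Wu and Sitter's model-calibration estimator itself): design weights are adjusted, in closed form, so that the weighted auxiliary totals match known population totals.

        # Chi-square-distance calibration: w_i = d_i * (1 + x_i' lambda), with
        # lambda chosen so that sum(w_i * x_i) equals the known totals.
        import numpy as np

        rng = np.random.default_rng(3)
        N, n = 10000, 400
        x_pop = rng.gamma(2.0, 2.0, N)                 # auxiliary, known for everyone
        y_pop = 3.0 + 1.5 * x_pop + rng.normal(0, 2, N)
        idx = rng.choice(N, n, replace=False)          # simple random sample
        x, y, d = x_pop[idx], y_pop[idx], np.full(n, N / n)   # design weights

        X = np.column_stack([np.ones(n), x])           # calibrate on (count, x-total)
        totals = np.array([N, x_pop.sum()])            # known population totals
        lam = np.linalg.solve(X.T @ (d[:, None] * X), totals - d @ X)
        w = d * (1 + X @ lam)                          # calibrated weights
        print("calibration check:", w @ X, "vs", totals)
        print(f"HT mean {(d @ y) / N:.3f}  calibrated mean {(w @ y) / N:.3f}"
              f"  true {y_pop.mean():.3f}")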

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016741
    Description:

    Linearization and the jack-knife method are widely used to estimate standard errors for the coefficients of linear regression models fit to multi-stage samples. With few primary sampling units (PSUs) or when a few PSUs have high leverage, linearization estimators can have large negative bias, while the jack-knife method has a correspondingly large positive bias. We characterize the design factors that produce large biases in these standard error estimators. In this technical paper, we propose an alternative estimator, bias reduced linearization (BRL), based on residuals adjusted to better approximate the covariance of the true errors.

    When errors are independently and identically distributed (iid), the BRL estimator is unbiased. The BRL method applies to stratified samples with non-constant selection weights and to generalized linear models such as logistic regression. We also discuss BRL standard error estimators for generalized estimating equation models that explicitly model the dependence among observations from the same PSU in data from complex sample designs. Simulation study results show that BRL standard errors, combined with the Satterthwaite approximation to determine the reference distribution, yield tests with Type I error rates near nominal values. We contrast our method with alternatives proposed by Kott (1994 and 1996) and Mancl and DeRouen (2001).
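
    The adjustment at the heart of such estimators can be sketched generically (a CR2-type correction under assumptions of our own; this illustrates the idea and is not necessarily the authors' exact estimator): each cluster's residuals are inflated by (I - H_gg)^(-1/2) before forming the sandwich variance.

        # Bias-reduced cluster-robust standard errors for OLS, sketched with numpy.
        import numpy as np

        def brl_se(X, y, groups):
            XtX_inv = np.linalg.inv(X.T @ X)
            beta = XtX_inv @ X.T @ y
            resid = y - X @ beta
            meat = np.zeros((X.shape[1], X.shape[1]))
            for g in np.unique(groups):
                Xg, eg = X[groups == g], resid[groups == g]
                Hgg = Xg @ XtX_inv @ Xg.T
                vals, vecs = np.linalg.eigh(np.eye(len(eg)) - Hgg)
                Ag = vecs @ np.diag(vals ** -0.5) @ vecs.T   # (I - Hgg)^(-1/2)
                ug = Xg.T @ (Ag @ eg)                        # adjusted cluster score
                meat += np.outer(ug, ug)
            V = XtX_inv @ meat @ XtX_inv
            return beta, np.sqrt(np.diag(V))

        rng = np.random.default_rng(4)
        G, m = 12, 10                                        # few PSUs, as in the paper
        groups = np.repeat(np.arange(G), m)
        X = np.column_stack([np.ones(G * m), rng.normal(size=G * m)])
        y = X @ np.array([1.0, 0.5]) + rng.normal(0, 1, G)[groups] + rng.normal(0, 1, G * m)
        beta, se = brl_se(X, y, groups)
        print("coefficients:", beta.round(3), "BRL-type SEs:", se.round(3))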

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016753
    Description:

    Keynote Address.

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016722
    Description:

    Colorectal cancer (CRC) is the second leading cause of cancer death in Canada. Randomized controlled trials (RCTs) have shown the efficacy of screening using faecal occult blood tests (FOBT). A comprehensive evaluation of the costs and consequences of CRC screening for the Canadian population is required before implementing such a program. This paper evaluates whether or not CRC screening is cost-effective. The results of these simulations will be provided to the Canadian National Committee on Colorectal Cancer Screening to help formulate national policy recommendations for CRC screening.

    Statistics Canada's Population Health Microsimulation Model was updated to incorporate a comprehensive CRC screening module based on Canadian data and RCT efficacy results. The module incorporated the sensitivity and specificity of FOBT and colonoscopy, participation rates, incidence, staging, diagnostic and therapeutic options, disease progression, mortality and direct health care costs for different screening scenarios. The model was validated by reproducing the mortality reduction observed in the Funen screening trial.
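
    A toy cohort version of the screening logic might look like the sketch below (all rates and costs are made-up placeholders, not the module's Canadian inputs):

        # One screening round: FOBT with given sensitivity/specificity,
        # follow-up colonoscopy for positives, and a crude cost tally.
        import numpy as np

        rng = np.random.default_rng(5)
        n = 100_000
        prevalence, sens, spec = 0.003, 0.6, 0.9       # hypothetical values
        participation = 0.67                           # hypothetical uptake
        cost_fobt, cost_colonoscopy = 10.0, 800.0      # hypothetical unit costs

        has_crc = rng.random(n) < prevalence
        screened = rng.random(n) < participation
        positive = screened & np.where(has_crc, rng.random(n) < sens,
                                       rng.random(n) > spec)
        detected = positive & has_crc                  # colonoscopy assumed definitive
        cost = cost_fobt * screened.sum() + cost_colonoscopy * positive.sum()
        print(f"cancers detected: {detected.sum()} of {has_crc.sum()}")
        print(f"cost per cancer detected: ${cost / max(detected.sum(), 1):,.0f}")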

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016715
    Description:

    This paper will describe the multiple imputation of income in the National Health Interview Survey and discuss the methodological issues involved. In addition, the paper will present empirical summaries of the imputations as well as results of a Monte Carlo evaluation of inferences based on multiply imputed income items.

    Analysts of health data are often interested in studying relationships between income and health. The National Health Interview Survey, conducted by the National Center for Health Statistics of the U.S. Centers for Disease Control and Prevention, provides a rich source of data for studying such relationships. However, the nonresponse rates on two key income items, an individual's earned income and a family's total income, are over 20%. Moreover, these nonresponse rates appear to be increasing over time. A project is currently underway to multiply impute individual earnings and family income along with some other covariates for the National Health Interview Survey in 1997 and subsequent years.

    There are many challenges in developing appropriate multiple imputations for such large-scale surveys. First, there are many variables of different types, with different skip patterns and logical relationships. Second, it is not known what types of associations will be investigated by the analysts of multiply imputed data. Finally, some variables, such as family income, are collected at the family level and others, such as earned income, are collected at the individual level. To make the imputations for both the family- and individual-level variables conditional on as many predictors as possible, and to simplify modelling, we are using a modified version of the sequential regression imputation method described in Raghunathan et al. (Survey Methodology, 2001).

    Besides issues related to the hierarchical nature of the imputations just described, there are other methodological issues of interest such as the use of transformations of the income variables, the imposition of restrictions on the values of variables, the general validity of sequential regression imputation and, even more generally, the validity of multiple-imputation inferences for surveys with complex sample designs.
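
    A minimal chained-equations sketch in the spirit of sequential regression imputation (a generic illustration with artificial data, not the production NHIS procedure) is given below; each variable with missing values is regressed on the others in turn, with random draws added to preserve imputation variance. scikit-learn's experimental IterativeImputer implements a similar chained scheme.

        import numpy as np

        rng = np.random.default_rng(6)
        n = 2000
        age = rng.uniform(20, 65, n)
        earnings = 1000 + 500 * age + rng.normal(0, 5000, n)
        fam_income = earnings + rng.normal(20000, 8000, n)
        data = np.column_stack([age, earnings, fam_income])
        miss = rng.random((n, 3)) < [0.0, 0.25, 0.25]   # ~25% of incomes missing
        data_obs = np.where(miss, np.nan, data)

        def sequential_impute(X, n_iter=10):
            X = X.copy()
            nan_mask = np.isnan(X)
            col_means = np.nanmean(X, axis=0)
            X[nan_mask] = np.take(col_means, np.where(nan_mask)[1])  # initial fill
            for _ in range(n_iter):
                for j in range(X.shape[1]):
                    m = nan_mask[:, j]
                    if not m.any():
                        continue
                    A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
                    beta, *_ = np.linalg.lstsq(A[~m], X[~m, j], rcond=None)
                    resid_sd = np.std(X[~m, j] - A[~m] @ beta)
                    # draw, don't just predict: keeps imputation variance realistic
                    X[m, j] = A[m] @ beta + rng.normal(0, resid_sd, m.sum())
            return X

        imputed = sequential_impute(data_obs)
        print("true mean earnings:", earnings.mean().round(0),
              "imputed-data mean:", imputed[:, 1].mean().round(0))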

    Release date: 2004-09-13

  • Technical products: 11-522-X20020016729
    Description:

    For most survey samples, if not all, we have to deal with the problem of missing values. Missing values are usually caused by nonresponse (such as a participant's refusal or an interviewer's inability to contact the respondent), but can also be produced at the editing step of the survey in an attempt to resolve problems of inconsistent or suspect responses. The presence of missing values (nonresponse) generally leads to bias and uncertainty in the estimates. To treat this problem, the appropriate use of all available auxiliary information permits the maximum reduction of nonresponse bias and variance. During this presentation, we will define the problem, describe the methodology on which SEVANI is based and discuss potential uses of the system. We will end the discussion by presenting some examples based on real data to illustrate the theory in practice.

    In practice, it is very difficult to estimate the nonresponse bias. However, it is possible to estimate the nonresponse variance by assuming that the bias is negligible. Over the last decade, many methods have indeed been proposed to estimate this variance, and some of them have been implemented in the System for Estimation of Variance due to Nonresponse and Imputation (SEVANI).

    The methodology used to develop SEVANI is based on the theory of two-phase sampling, where we assume that the second phase of selection is nonresponse. Unlike in two-phase sampling, however, an imputation or nonresponse model is required for variance estimation. SEVANI also assumes that nonresponse is treated either by reweighting respondent units or by imputing their missing values. Three imputation methods are considered: imputation by an auxiliary variable, regression imputation (deterministic or random) and nearest-neighbour imputation.
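
    The problem SEVANI addresses can be seen in a small simulation (a generic setup of our own, not SEVANI's estimator): treating deterministically imputed values as real observations understates the variance of the estimated mean.

        # Compare the true variance of the mean after regression imputation with
        # the naive estimate that pretends all n values were observed.
        import numpy as np

        rng = np.random.default_rng(7)
        n, resp_rate, n_rep = 500, 0.5, 2000
        means, naive_vars = [], []
        for _ in range(n_rep):
            x = rng.normal(50, 1, n)                  # auxiliary, always observed
            y = 2 * x + rng.normal(0, 10, n)
            respond = rng.random(n) < resp_rate       # nonresponse completely at random
            b = np.polyfit(x[respond], y[respond], 1) # deterministic regression imputation
            y_imp = np.where(respond, y, np.polyval(b, x))
            means.append(y_imp.mean())
            naive_vars.append(y_imp.var(ddof=1) / n)  # ignores the imputation
        print(f"true variance of the mean: {np.var(means):.4f}")
        print(f"naive estimate (average):  {np.mean(naive_vars):.4f}")  # typically ~half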

    Release date: 2004-09-13
