Weighting and estimation

Results

All (93) (0 to 10 of 93 results)

  • Articles and reports: 11-522-X202200100004
    Description: In accordance with Statistics Canada’s long-term Disaggregated Data Action Plan (DDAP), several initiatives have been implemented in the Labour Force Survey (LFS). One of the more direct initiatives was a targeted increase in the size of the monthly LFS sample. Furthermore, a regular Supplement program was introduced, in which an additional series of questions is asked to a subset of LFS respondents and analyzed in a monthly or quarterly production cycle. Finally, the production of modelled estimates based on Small Area Estimation (SAE) methodologies resumed for the LFS, with a wider scope and more analytical value than in the past. This paper gives an overview of these three initiatives.
    Release date: 2024-03-25

  • Articles and reports: 12-001-X202300200004
    Description: We present a novel methodology to benchmark county-level estimates of crop area totals to a preset state total, subject to inequality constraints and random variances in the Fay-Herriot model. For the planted area estimates of the National Agricultural Statistics Service (NASS), an agency of the United States Department of Agriculture (USDA), it is necessary to incorporate the constraint that the estimated totals, derived from survey and other auxiliary data, are no smaller than the administrative planted area totals recorded by USDA agencies other than NASS. These administrative totals are treated as fixed and known, and this additional coherence requirement adds to the complexity of benchmarking the county-level estimates. A fully Bayesian analysis of the Fay-Herriot model offers an appealing way to incorporate the inequality and benchmarking constraints, and to quantify the resulting uncertainties, but sampling from the posterior densities involves difficult integration, and reasonable approximations must be made. First, we describe a single-shrinkage model, shrinking the means while the variances are assumed known. Second, we extend this model to accommodate double shrinkage, borrowing strength across means and variances. This extended model has two sources of extra variation, but because we are shrinking both means and variances, the second model is expected to perform better in terms of goodness of fit (reliability) and possibly precision. The computations are challenging for both models, which are applied to simulated data sets with properties resembling the Illinois corn crop. (A simplified sketch of the basic Fay-Herriot estimator with a benchmarking step appears after this results list.)
    Release date: 2024-01-03

  • Articles and reports: 12-001-X202300200015
    Description: This article discusses and provides comments on Ardilly, Haziza, Lavallée and Tillé’s summary presentation of Jean-Claude Deville’s work on survey theory. It sheds light on the context, applications and uses of his findings, and shows how these have become ingrained in the role of statisticians, in which Jean-Claude was a trailblazer. It also discusses other aspects of his career and his creative inventions.
    Release date: 2024-01-03

  • Articles and reports: 12-001-X202300200016
    Description: In this discussion, I will present some additional aspects of three major areas of survey theory developed or studied by Jean-Claude Deville: calibration, balanced sampling and the generalized weight-share method.
    Release date: 2024-01-03

  • Articles and reports: 12-001-X202300200018
    Description: Sample surveys, as a tool for policy development and evaluation and for scientific, social and economic research, have been employed for over a century. In that time, they have primarily served as tools for collecting data for enumerative purposes. Estimation of finite population characteristics has typically been based on weighting and repeated sampling, or design-based, inference. However, sample data have also been used for modelling the unobservable processes that gave rise to the finite population data. This type of use has been termed analytic, and often involves integrating the sample data with data from secondary sources.

    Alternative approaches to inference in these situations, drawing inspiration from mainstream statistical modelling, have been strongly promoted. The principal focus of these alternatives has been on allowing for informative sampling. Modern survey sampling, though, is more focussed on situations where the sample data are in fact part of a more complex set of data sources all carrying relevant information about the process of interest. When an efficient modelling method such as maximum likelihood is preferred, the issue becomes one of how it should be modified to account for both complex sampling designs and multiple data sources. Here application of the Missing Information Principle provides a clear way forward.

    In this paper I review how this principle has been applied to resolve so-called “messy” data analysis issues in sampling. I also discuss a scenario that is a consequence of the rapid growth in auxiliary data sources for survey data analysis. This is where sampled records from one accessible source or register are linked to records from another less accessible source, with values of the response variable of interest drawn from this second source, and where a key output is small area estimates for the response variable for domains defined on the first source.
    Release date: 2024-01-03

  • Articles and reports: 12-001-X202300100003
    Description: To improve the precision of inferences and reduce costs, there is considerable interest in combining data from several sources such as sample surveys and administrative data. Appropriate methodology is required to ensure satisfactory inferences, since the target populations and methods for acquiring data may be quite different. To provide improved inferences, we use methodology that has a more general structure than those in current practice. We start with the case where the analyst has only summary statistics from each of the sources. In our primary method, uncertain pooling, it is assumed that the analyst can regard one source, survey r, as the single best choice for inference. This method starts with the data from survey r and adds data from those other sources that are shown to form clusters that include survey r. We also consider Dirichlet process mixtures, one of the most popular nonparametric Bayesian methods. We use analytical expressions and the results from numerical studies to show properties of the methodology.
    Release date: 2023-06-30

  • Articles and reports: 12-001-X202300100011
    Description: The definition of statistical units is a recurring issue in the domain of sample surveys. Indeed, not all the populations surveyed have a readily available sampling frame. For some populations, the sampled units are distinct from the observation units, and producing estimates for the population of interest raises complex questions, which can be addressed by using the weight share method (Deville and Lavallée, 2006). However, the two populations considered in this approach are discrete. In some fields of study, the sampled population is continuous: this is, for example, the case for forest inventories, where the trees surveyed are frequently those located on plots whose centers are points randomly drawn in a given area. The production of statistical estimates from the sample of trees surveyed poses methodological difficulties, as do the associated variance calculations. The purpose of this paper is to generalize the weight share method to the case of a continuous sampled population and a discrete surveyed population, building on the extension proposed by Cordy (1993) of the Horvitz-Thompson estimator to points drawn in a continuous universe. (A simplified sketch of the weight share method in the discrete case appears after this results list.)
    Release date: 2023-06-30

  • Articles and reports: 12-001-X202200200012
    Description: In many applications, the population means of geographically adjacent small areas exhibit spatial variation. If the available auxiliary variables do not adequately account for the spatial pattern, the residual variation will be included in the random effects. As a result, the independent and identically distributed assumption on the random effects of the Fay-Herriot model will fail. Furthermore, limited resources often prevent numerous sub-populations from being included in the sample, resulting in non-sampled small areas. The problem can be exacerbated when predicting means of non-sampled small areas with the above Fay-Herriot model, as the predictions will be based solely on the auxiliary variables. To address this inadequacy, we consider Bayesian spatial random-effect models that can accommodate multiple non-sampled areas. Under mild conditions, we establish the propriety of the posterior distributions for various spatial models for a useful class of improper prior densities on model parameters. The effectiveness of these spatial models is assessed based on simulated and real data. Specifically, we examine predictions of statewide four-person family median incomes based on the 1990 Current Population Survey and the 1980 Census for the United States of America.

    Release date: 2022-12-15

  • Articles and reports: 12-001-X202200100004
    Description: When the sample size of an area is small, borrowing information from neighbors is a small area estimation technique used to provide more reliable estimates. One well-known model in small area estimation is the multinomial-Dirichlet hierarchical model for multinomial counts. Due to natural characteristics of the data, imposing a unimodal order restriction on the parameter spaces is relevant. In our application, body mass index is most likely to be at an overweight level, which means the unimodal order restriction may be reasonable. The same unimodal order restriction for all areas, however, may be too strong in some cases. To increase flexibility, we add uncertainty to the unimodal order restriction, so that each area has a similar, but not identical, unimodal pattern. Since the order restriction with uncertainty increases the difficulty of inference, we compare models using posterior summaries and the approximated log pseudo marginal likelihood.

    Release date: 2022-06-21

  • Articles and reports: 12-001-X202100200005
    Description: Variance estimation is a challenging problem in surveys because several nontrivial factors contribute to the total survey error, including sampling and unit non-response. Initially devised to capture the variance of nontrivial statistics based on independent and identically distributed data, the bootstrap method has since been adapted in various ways to address survey-specific factors. In this paper we look into one of those variants, the with-replacement bootstrap. We consider household surveys, with or without sub-sampling of individuals. We make explicit the benchmark variance estimators that the with-replacement bootstrap aims at reproducing. We explain how the bootstrap can be used to account for the impact that sampling, treatment of non-response and calibration have on the total survey error. For clarity, the proposed methods are illustrated on a running example. They are evaluated through a simulation study and applied to the French Panel for Urban Policy. Two SAS macros to perform the bootstrap methods are also developed. (A simplified sketch of with-replacement bootstrap replicate weights appears after this results list.)

    Release date: 2022-01-06
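
Several of the abstracts above (12-001-X202300200004 and 12-001-X202200200012) build on the Fay-Herriot area-level model, in which a direct survey estimate for each small area is shrunk toward a regression prediction. The sketch below is only a minimal illustration of that basic model, followed by simple ratio benchmarking to a fixed state total; the inequality constraints, double shrinkage and fully Bayesian computations described in those papers are not reproduced, the model variance A is taken as given rather than estimated, and all data and names (fay_herriot_eblup, ratio_benchmark) are illustrative.

    import numpy as np

    def fay_herriot_eblup(y, X, D, A):
        """Shrinkage predictions under the Fay-Herriot model
        y_i = x_i'beta + v_i + e_i, with Var(v_i) = A and Var(e_i) = D_i (known).
        The model variance A is taken as given; in practice it is estimated."""
        V = A + D                                           # total variance of each direct estimate
        W = np.diag(1.0 / V)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)    # GLS regression coefficients
        gamma = A / V                                       # shrinkage factor per area
        return gamma * y + (1.0 - gamma) * (X @ beta)

    def ratio_benchmark(estimates, fixed_total):
        """Rescale area-level estimates so they add up to a preset (state) total."""
        return estimates * (fixed_total / estimates.sum())

    # toy usage with simulated county-level data
    rng = np.random.default_rng(0)
    m = 20
    X = np.column_stack([np.ones(m), rng.normal(size=m)])
    D = rng.uniform(0.5, 2.0, size=m)                       # known sampling variances
    truth = X @ np.array([10.0, 2.0]) + rng.normal(size=m)
    y = truth + rng.normal(scale=np.sqrt(D))                # direct survey estimates
    est = ratio_benchmark(fay_herriot_eblup(y, X, D, A=1.0), fixed_total=truth.sum())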
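
The weight share method cited in 12-001-X202300100011 (Deville and Lavallée, 2006), and the generalised weight share method (GWSM) appearing in the Reference entries below, transfer design weights from sampled units to linked observation units. The sketch below covers only the standard discrete-discrete case, not the continuous extension developed in that paper; the link matrix, weights and function name (gwsm_weights) are illustrative, and the total number of links into each observation unit is assumed known for the whole frame.

    import numpy as np

    def gwsm_weights(w_sampled, links_sample, total_links):
        """Generalised weight share: the weight of observation unit j is the sum over
        sampled frame units i of w_i * l_ij / L_j, where l_ij indicates a link between
        sampled unit i and observation unit j, and L_j is the total number of links
        into j from the whole frame (assumed known)."""
        links_sample = np.asarray(links_sample, dtype=float)
        w_sampled = np.asarray(w_sampled, dtype=float)
        return (w_sampled @ links_sample) / np.asarray(total_links, dtype=float)

    # toy example: a frame of 5 units, 3 of which were sampled, linked to 4 observation units
    full_links = np.array([[1, 1, 0, 0],
                           [0, 1, 1, 0],
                           [0, 0, 0, 1],
                           [1, 0, 0, 0],
                           [0, 0, 1, 0]])
    sampled = [0, 1, 2]                       # indices of the sampled frame units
    w = np.array([10.0, 20.0, 30.0])          # their design weights
    L_j = full_links.sum(axis=0)              # total links into each observation unit
    obs_weights = gwsm_weights(w, full_links[sampled], L_j)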
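
The with-replacement bootstrap examined in 12-001-X202100200005 builds replicate weights by resampling primary sampling units within strata and rescaling the design weights. The sketch below shows one common rescaling (the Rao-Wu form that draws n_h - 1 units per stratum); it ignores the non-response and calibration steps the paper accounts for and is not a reproduction of its SAS macros. All names and data are illustrative.

    import numpy as np

    def bootstrap_replicate_weights(weights, strata, n_reps=500, seed=1):
        """With-replacement bootstrap: in each replicate, draw n_h - 1 units with
        replacement within stratum h and rescale the design weights by
        (n_h / (n_h - 1)) times the number of times each unit was drawn."""
        rng = np.random.default_rng(seed)
        weights = np.asarray(weights, dtype=float)
        reps = np.empty((n_reps, weights.size))
        for b in range(n_reps):
            adj = np.zeros(weights.size)
            for h in np.unique(strata):
                idx = np.flatnonzero(strata == h)
                n_h = idx.size
                draws = rng.choice(idx, size=n_h - 1, replace=True)
                counts = np.bincount(draws, minlength=weights.size)[idx]
                adj[idx] = counts * n_h / (n_h - 1)
            reps[b] = weights * adj
        return reps

    # toy usage: bootstrap variance of a weighted total over 2 strata of 4 units each
    y = np.array([3.0, 5.0, 2.0, 7.0, 1.0, 4.0, 6.0, 2.0])
    w = np.full(8, 12.5)
    strata = np.array([1, 1, 1, 1, 2, 2, 2, 2])
    replicate_totals = bootstrap_replicate_weights(w, strata) @ y
    boot_variance = replicate_totals.var(ddof=1)   # variance estimate for the total w @ y
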
Data (0) (0 results)

No content available at this time.

Analysis (86) (0 to 10 of 86 results)

The first 10 Analysis results are the same entries as those listed under All (93) above.

Reference (7) (7 results)

  • Surveys and statistical programs – Documentation: 13-604-M2003042
    Description: On May 31, 2001, the quarterly income and expenditure accounts adopted the Chain Fisher Index formula, chained quarterly, as the official measure of real gross domestic product (GDP) in terms of expenditures. This formula was also adopted for the Provincial Accounts on October 31, 2002.

    There were two reasons for adopting this formula: to provide users with a more accurate measure of real GDP growth between two consecutive periods, and to make the Canadian measure comparable with the Income and Product Accounts of the United States, which has used the Chain Fisher Index formula since 1996 to measure real GDP.

    Release date: 2003-11-06

  • Surveys and statistical programs – Documentation: 11-522-X19990015668
    Description: Following the problems with estimating underenumeration in the 1991 Census of England and Wales, the aim for the 2001 Census is to create a database that is fully adjusted for net underenumeration. To achieve this, the paper investigates a weighted donor imputation methodology that utilises information from both the census and the census coverage survey (CCS). The US Census Bureau has considered a similar approach for their 2000 Census (see Isaki et al 1998). The proposed procedure distinguishes between individuals who are not counted by the census because their household is missed and those who are missed in counted households. Census data are linked to data from the CCS. Multinomial logistic regression is used to estimate the probabilities that households are missed by the census and the probabilities that individuals are missed in counted households. Household and individual coverage weights are constructed from the estimated probabilities, and these feed into the donor imputation procedure. (A simplified sketch of the coverage-weighting step appears after this list.)

    Release date: 2000-03-02

  • Surveys and statistical programs – Documentation: 11-522-X19990015680
    Description: To augment the amount of available information, data from different sources are increasingly being combined. These databases are often combined using record linkage methods. When there is no unique identifier, a probabilistic linkage is used. In that case, a record on a first file is linked to a record on a second file with a given probability, and a decision is then taken on whether this possible link is a true link or not. This usually requires a non-negligible amount of manual resolution. It might then be legitimate to evaluate whether manual resolution can be reduced or even eliminated. This issue is addressed in this paper, where one tries to produce an estimate of a total (or a mean) for one population using a sample selected from another population that is linked in some way to the first. In other words, having two populations linked through probabilistic record linkage, we try to avoid any decision concerning the validity of links and still be able to produce an unbiased estimate for a total of one of the two populations. To achieve this goal, we suggest the use of the Generalised Weight Share Method (GWSM) described by Lavallée (1995).

    Release date: 2000-03-02

  • Surveys and statistical programs – Documentation: 11-522-X19980015019
    Description: The British Labour Force Survey (LFS) is a quarterly household survey with a rotating sample design that can potentially be used to produce longitudinal data, including estimates of labour force gross flows. However, these estimates may be biased due to the effect of non-response. Weighting adjustments are a commonly used method to account for non-response bias. We find that weighting may not fully account for the effect of non-response bias because non-response may depend on the unobserved labour force flows, i.e., the non-response is non-ignorable. To adjust for the effects of non-ignorable non-response, we propose a model for the complex non-response patterns in the LFS which controls for the correlated within-household non-response behaviour found in the survey. The results of modelling suggest that non-response may be non-ignorable in the LFS, causing the weighted estimates to be biased.

    Release date: 1999-10-22

  • Surveys and statistical programs – Documentation: 11-522-X19980015020
    Description: At the end of 1993, Eurostat launched a 'community' panel of households. The first wave, carried out in 1994 in the 12 countries of the European Union, included some 7,300 households in France and at least 14,000 adults aged 17 years or over. Each individual was then followed up and interviewed each year, even if they had moved. The individuals leaving the sample present a particular profile. In the first part, we present a sketch of how our sample evolves and an analysis of the main characteristics of the non-respondents. We then propose two models to correct for non-response within homogeneous categories. Next, we describe the longitudinal weight distribution obtained from the two models, as well as the cross-sectional weights obtained using the weight share method. Finally, we compare some indicators calculated using both weighting methods.

    Release date: 1999-10-22

  • Surveys and statistical programs – Documentation: 11-522-X19980015023
    Description: The study of social mobility, between labour market statuses or between income levels, for example, is often based on the analysis of mobility matrices. When comparing these transition matrices, with a view to evaluating behavioural changes, one often forgets that the data derive from a sample survey and are therefore affected by sampling variances. Similarly, it is assumed that the responses collected correspond to the 'true value.'

    Release date: 1999-10-22

  • Surveys and statistical programs – Documentation: 11-522-X19980015031
    Description: The U.S. Third National Health and Nutrition Examination Survey (NHANES III) was carried out from 1988 to 1994. This survey was intended primarily to provide estimates of cross-sectional parameters believed to be approximately constant over the six-year data collection period. However, for some variables (e.g., serum lead, body mass index and smoking behavior), substantive considerations suggest the possible presence of nontrivial changes in level between 1988 and 1994. For these variables, NHANES III is potentially a valuable source of time-change information, compared to other studies involving more restricted populations and samples. Exploration of possible change over time is complicated by two issues. First, some variables displayed substantial regional differences in level, which was of practical concern. Second, nontrivial changes in level over time can lead to nontrivial biases in some customary NHANES III variance estimators. This paper considers these two problems and discusses some related implications for statistical policy.

    Release date: 1999-10-22
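
The census coverage entry above (11-522-X19990015668) estimates the probability that a household or person is counted and turns it into a coverage weight. The sketch below illustrates only the individual-level step, using ordinary binary logistic regression in place of the multinomial model described in the abstract; the covariates, data and variable names are purely illustrative and do not come from that paper.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # illustrative coverage-survey data: covariates and an indicator of being counted by the census
    n = 1000
    age = rng.integers(16, 90, size=n)
    renter = rng.integers(0, 2, size=n)
    X = np.column_stack([age, renter])
    p_true = 1.0 / (1.0 + np.exp(-(2.0 - 0.01 * age - 0.8 * renter)))   # hypothetical true coverage probabilities
    counted = rng.binomial(1, p_true)

    # estimate coverage probabilities and form coverage weights 1 / p_hat for the counted records
    model = LogisticRegression().fit(X, counted)
    p_hat = model.predict_proba(X)[:, 1]
    coverage_weight = np.where(counted == 1, 1.0 / p_hat, 0.0)

    # the weighted count of covered persons estimates the total population size
    print(coverage_weight.sum())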