# Statistics by subject – Weighting and estimation


## All (411) (25 of 411 results)

• Articles and reports: 11-629-X2017009
Description:

Seasonal adjustment is a statistical technique used to remove fluctuations in economic data that occur every year at the same time and in a similar fashion. This video provides an overview of seasonal adjustment, how it is used and how it affects the economy.

Release date: 2017-11-22

• Articles and reports: 12-001-X201700114823
Description:

The derivation of estimators in a multi-phase calibration process requires a sequential computation of estimators and calibrated weights of previous phases in order to obtain those of later ones. Already after two phases of calibration, the estimators and their variances involve calibration factors from both phases, and the formulae become cumbersome and uninformative. As a consequence, the literature so far deals mainly with two phases, while three or more phases are rarely considered. The analysis in some cases is ad hoc for a specific design, and no comprehensive methodology has been developed for constructing calibrated estimators and, more challengingly, estimating their variances in three or more phases. We provide a closed-form formula for the variance of multi-phase calibrated estimators that holds for any number of phases. By specifying a new presentation of multi-phase calibrated weights, it is possible to construct calibrated estimators that have the form of multivariate regression estimators, which enables the computation of a consistent estimator for their variance. This new variance estimator is not only general for any number of phases but also has some favorable characteristics. A comparison to other estimators in the special case of two-phase calibration and another independent study for three phases are presented.

Release date: 2017-06-22

• Articles and reports: 12-001-X201700114819
Description:

Structural time series models are a powerful technique for variance reduction in the framework of small area estimation (SAE) based on repeatedly conducted surveys. Statistics Netherlands implemented a structural time series model to produce monthly figures about the labour force with the Dutch Labour Force Survey (DLFS). Such models, however, contain unknown hyperparameters that have to be estimated before the Kalman filter can be launched to estimate the state variables of the model. This paper describes a simulation aimed at studying the properties of hyperparameter estimators in the model. Simulating distributions of the hyperparameter estimators under different model specifications complements standard model diagnostics for state space models. Uncertainty around the model hyperparameters is another major issue. To account for hyperparameter uncertainty in the mean squared error (MSE) estimates of the DLFS, several estimation approaches known in the literature are considered in a simulation. Apart from the MSE bias comparison, this paper also provides insight into the variances and MSEs of the MSE estimators considered.

Release date: 2017-06-22

• Articles and reports: 12-001-X201600214677
Description:

How do we tell whether weighting adjustments reduce nonresponse bias? If a variable is measured for everyone in the selected sample, then the design weights can be used to calculate an approximately unbiased estimate of the population mean or total for that variable. A second estimate of the population mean or total can be calculated using the survey respondents only, with weights that have been adjusted for nonresponse. If the two estimates disagree, then there is evidence that the weight adjustments may not have removed the nonresponse bias for that variable. In this paper we develop the theoretical properties of linearization and jackknife variance estimators for evaluating the bias of an estimated population mean or total by comparing estimates calculated from overlapping subsets of the same data with different sets of weights, when poststratification or inverse propensity weighting is used for the nonresponse adjustments to the weights. We provide sufficient conditions on the population, sample, and response mechanism for the variance estimators to be consistent, and demonstrate their small-sample properties through a simulation study.

Release date: 2016-12-20
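The comparison described in the abstract above (12-001-X201600214677) can be illustrated with a small sketch. It assumes a toy sample in which the variable `x` is known for every sampled unit, and it uses a simple weighting-class adjustment as a stand-in for the poststratification adjustment. The function names and the data are hypothetical, and the paper's variance estimators are not reproduced here.

```python
from collections import defaultdict


def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def class_adjusted_weights(design_weights, classes, responded):
    """Inflate respondents' weights so that each class reproduces its
    full-sample weight sum; nonrespondents get weight 0.
    Assumes every class contains at least one respondent."""
    full = defaultdict(float)
    resp = defaultdict(float)
    for w, c, r in zip(design_weights, classes, responded):
        full[c] += w
        if r:
            resp[c] += w
    return [w * full[c] / resp[c] if r else 0.0
            for w, c, r in zip(design_weights, classes, responded)]


# Hypothetical toy sample: x is measured for everyone in the selected sample.
x = [10.0, 12.0, 30.0, 34.0]
design_w = [5.0, 5.0, 5.0, 5.0]
classes = ["a", "a", "b", "b"]
responded = [True, False, True, True]

# Estimate 1: full sample with design weights.
full_estimate = weighted_mean(x, design_w)

# Estimate 2: respondents only, with nonresponse-adjusted weights.
adjusted_w = class_adjusted_weights(design_w, classes, responded)
respondent_estimate = weighted_mean(x, adjusted_w)

# A nonzero difference hints at residual nonresponse bias for x.
difference = respondent_estimate - full_estimate
```

Whether a given difference is large enough to matter is a question for a variance estimator of that difference, which is the subject of the paper itself.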

• Articles and reports: 12-001-X201600214664
Description:

This paper draws statistical inference for a finite population mean based on judgment post-stratified (JPS) samples. The JPS sample first selects a simple random sample and then stratifies the selected units into H judgment classes based on their relative positions (ranks) in a small set of size H. This leads to a sample with random sample sizes in the judgment classes. The ranking process can be performed using either auxiliary variables or visual inspection to identify the ranks of the measured observations. The paper develops an unbiased estimator and constructs a confidence interval for the population mean. Since judgment ranks are random variables, by conditioning on the measured observations we construct Rao-Blackwellized estimators for the population mean. The paper shows that Rao-Blackwellized estimators perform better than the usual JPS estimators. The proposed estimators are applied to 2012 United States Department of Agriculture Census data.

Release date: 2016-12-20

• Articles and reports: 12-001-X201600214663
Description:

We present theoretical evidence that efforts during data collection to balance the survey response with respect to selected auxiliary variables will improve the chances for low nonresponse bias in the estimates that are ultimately produced by calibrated weighting. One of our results shows that the variance of the bias – measured here as the deviation of the calibration estimator from the (unrealized) full-sample unbiased estimator – decreases linearly as a function of the response imbalance that we assume measured and controlled continuously over the data collection period. An attractive prospect is thus a lower risk of bias if one can manage the data collection to get low imbalance. The theoretical results are validated in a simulation study with real data from an Estonian household survey.

Release date: 2016-12-20

• Articles and reports: 12-001-X201600214660
Description:

In an economic survey of a sample of enterprises, occupations are randomly selected from a list until a number r of occupations in a local unit has been identified. This is an inverse sampling problem for which we are proposing a few solutions. Simple designs with and without replacement are processed using negative binomial distributions and negative hypergeometric distributions. We also propose estimators for when the units are selected with unequal probabilities, with or without replacement.

Release date: 2016-12-20
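As a rough illustration of the with-replacement case in the abstract above (12-001-X201600214660), the sketch below draws units until `r` "successes" are observed and then applies the classical estimator (r - 1) / (n - 1) of the success proportion, which is unbiased under negative binomial sampling for r of at least 2. This is a textbook estimator used as an assumption here, not necessarily the one proposed in the paper; the function names are hypothetical.

```python
import random


def draws_until_r_successes(r, p, rng):
    """Draw Bernoulli(p) trials with replacement until r successes occur.
    Returns the total number of trials n (n >= r)."""
    successes = 0
    trials = 0
    while successes < r:
        trials += 1
        if rng.random() < p:
            successes += 1
    return trials


def inverse_sampling_proportion(r, n):
    """Classical estimator of p under inverse (negative binomial) sampling.
    Requires r >= 2 so that the denominator is positive."""
    if r < 2 or n < r:
        raise ValueError("need r >= 2 and n >= r")
    return (r - 1) / (n - 1)


rng = random.Random(12345)  # fixed seed so the sketch is repeatable
n_trials = draws_until_r_successes(r=5, p=0.3, rng=rng)
p_hat = inverse_sampling_proportion(5, n_trials)
```

The naive ratio r / n is biased upward under this stopping rule, which is why the adjusted form is used in the sketch.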

• Articles and reports: 12-001-X201600114540
Description:

In this paper, we compare the EBLUP and pseudo-EBLUP estimators for small area estimation under the nested error regression model and three area level model-based estimators using the Fay-Herriot model. We conduct a design-based simulation study to compare the model-based estimators for unit level and area level models under informative and non-informative sampling. In particular, we are interested in the confidence interval coverage rate of the unit level and area level estimators. We also compare the estimators if the model has been misspecified. Our simulation results show that estimators based on the unit level model perform better than those based on the area level. The pseudo-EBLUP estimator is the best among unit level and area level estimators.

Release date: 2016-06-22

• Articles and reports: 12-001-X201600114544
Description:

In the Netherlands, statistical information about income and wealth is based on two large scale household panels that are completely derived from administrative data. A problem with using households as sampling units in the sample design of panels is the instability of these units over time. Changes in the household composition affect the inclusion probabilities required for design-based and model-assisted inference procedures. Such problems are circumvented in the two aforementioned household panels by sampling persons, who are followed over time. At each period the household members of these sampled persons are included in the sample. This is equivalent to sampling with probabilities proportional to household size where households can be selected more than once but with a maximum equal to the number of household members. In this paper properties of this sample design are described and contrasted with the Generalized Weight Share method for indirect sampling (Lavallée 1995, 2007). Methods are illustrated with an application to the Dutch Regional Income Survey.

Release date: 2016-06-22

• Articles and reports: 12-001-X201600114543
Description:

The regression estimator is extensively used in practice because it can improve the reliability of the estimated parameters of interest such as means or totals. It uses control totals of variables known at the population level that are included in the regression set up. In this paper, we investigate the properties of the regression estimator that uses control totals estimated from the sample, as well as those known at the population level. This estimator is compared to the regression estimators that strictly use the known totals both theoretically and via a simulation study.

Release date: 2016-06-22
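The role of a control total in the regression estimator described above (12-001-X201600114543) can be sketched with a single auxiliary variable. The sketch assumes that the weights sum to the population size and uses a weighted least-squares slope; `x_total` may be a known population total or, as the paper investigates, one estimated from another sample. All names and numbers are hypothetical.

```python
def weighted_total(values, weights):
    """Weighted (Horvitz-Thompson type) estimate of a total."""
    return sum(w * v for w, v in zip(weights, values))


def regression_estimate(y, x, weights, x_total):
    """Regression estimator of the y total in difference form:
    weighted y total + slope * (control total - weighted x total)."""
    w_sum = sum(weights)
    x_bar = weighted_total(x, weights) / w_sum
    y_bar = weighted_total(y, weights) / w_sum
    sxy = sum(w * (xi - x_bar) * (yi - y_bar)
              for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = sxy / sxx  # assumes x is not constant in the sample
    return weighted_total(y, weights) + slope * (x_total - weighted_total(x, weights))


# Hypothetical sample in which y is exactly 2 * x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
w = [10.0, 10.0, 10.0, 10.0]

# The sample underestimates the x total (100 versus a control total of 110),
# so the y estimate is adjusted upward from 200.
estimate = regression_estimate(y, x, w, x_total=110.0)
```

When the control total is itself estimated, its sampling error feeds into the adjustment term, which is the trade-off the paper studies.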

• Technical products: 91-528-X
Description:

This manual provides detailed descriptions of the data sources and methods used by Statistics Canada to estimate population. These comprise postcensal and intercensal population estimates; base population; births and deaths; immigration; emigration; non-permanent residents; interprovincial migration; subprovincial estimates of population; population estimates by age, sex and marital status; and census family estimates. A glossary of principal terms is contained at the end of the manual, followed by the standard notation used.

Until now, documentation of methodological changes to the calculation of estimates has been spread throughout various Statistics Canada publications and background papers. This manual provides users of demographic statistics with a comprehensive compilation of the current procedures used by Statistics Canada to prepare population and family estimates.

Release date: 2016-03-03

• Articles and reports: 12-001-X201500214230
Description:

This paper develops allocation methods for stratified sample surveys where composite small area estimators are a priority, and areas are used as strata. Longford (2006) proposed an objective criterion for this situation, based on a weighted combination of the mean squared errors of small area means and a grand mean. Here, we redefine this approach within a model-assisted framework, allowing regressor variables and a more natural interpretation of results using an intra-class correlation parameter. We also consider several uses of power allocation, and allow the placing of other constraints such as maximum relative root mean squared errors for stratum estimators. We find that a simple power allocation can perform very nearly as well as the optimal design even when the objective is to minimize Longford’s (2006) criterion.

Release date: 2015-12-17

• Articles and reports: 12-001-X201500214231
Description:

Rotating panels are widely applied by national statistical institutes, for example, to produce official statistics about the labour force. Estimation procedures are generally based on traditional design-based procedures known from classical sampling theory. A major drawback of this class of estimators is that small sample sizes result in large standard errors and that they are not robust to measurement bias. Two examples showing the effects of measurement bias are rotation group bias in rotating panels, and systematic differences in the outcome of a survey due to a major redesign of the underlying process. In this paper we apply a multivariate structural time series model to the Dutch Labour Force Survey to produce model-based figures about the monthly labour force. The model reduces the standard errors of the estimates by taking advantage of sample information collected in previous periods, accounts for rotation group bias and autocorrelation induced by the rotating panel, and models discontinuities due to a survey redesign. Additionally, we discuss the use of correlated auxiliary series in the model to further improve the accuracy of the model estimates. The method is applied by Statistics Netherlands to produce accurate official monthly statistics about the labour force that are consistent over time, despite a redesign of the survey process.

Release date: 2015-12-17

• Articles and reports: 12-001-X201500214248
Description:

Unit level population models are often used in model-based small area estimation of totals and means, but the models may not hold for the sample if the sampling design is informative for the model. As a result, standard methods, assuming that the model holds for the sample, can lead to biased estimators. We study alternative methods that use a suitable function of the unit selection probability as an additional auxiliary variable in the sample model. We report the results of a simulation study on the bias and mean squared error (MSE) of the proposed estimators of small area means and on the relative bias of the associated MSE estimators, using informative sampling schemes to generate the samples. Alternative methods, based on modeling the conditional expectation of the design weight as a function of the model covariates and the response, are also included in the simulation study.

Release date: 2015-12-17

• Articles and reports: 12-001-X201500114199
Description:

In business surveys, it is not unusual to collect economic variables for which the distribution is highly skewed. In this context, winsorization is often used to treat the problem of influential values. This technique requires the determination of a constant that corresponds to the threshold above which large values are reduced. In this paper, we consider a method of determining the constant which involves minimizing the largest estimated conditional bias in the sample. In the context of domain estimation, we also propose a method of ensuring consistency between the domain-level winsorized estimates and the population-level winsorized estimate. The results of two simulation studies suggest that the proposed methods lead to winsorized estimators that have good bias and relative efficiency properties.

Release date: 2015-06-29
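The basic mechanics of winsorization mentioned in the abstract above (12-001-X201500114199) can be sketched as follows. This is a minimal sketch that simply caps values at a fixed threshold; the threshold here is a hypothetical input, and the paper's method of choosing it by minimizing the largest estimated conditional bias is not implemented. Function names and data are illustrative only.

```python
def winsorize(values, threshold):
    """Reduce every value above the threshold to the threshold itself."""
    return [min(v, threshold) for v in values]


def winsorized_total(values, weights, threshold):
    """Weighted estimate of a total computed from winsorized values."""
    capped = winsorize(values, threshold)
    return sum(w * v for w, v in zip(weights, capped))


# Hypothetical skewed business-survey data with one influential value.
revenue = [3.0, 5.0, 4.0, 120.0]
weights = [10.0, 10.0, 10.0, 10.0]

raw_total = sum(w * v for w, v in zip(weights, revenue))
capped_total = winsorized_total(revenue, weights, threshold=20.0)
```

Capping trades a reduction in variance for some downward bias, which is why the choice of threshold matters so much in practice.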

• Articles and reports: 12-001-X201500114150
Description:

An area-level model approach to combining information from several sources is considered in the context of small area estimation. At each small area, several estimates are computed and linked through a system of structural error models. The best linear unbiased predictor of the small area parameter can be computed by the general least squares method. Parameters in the structural error models are estimated using the theory of measurement error models. Estimation of mean squared errors is also discussed. The proposed method is applied to the real problem of labor force surveys in Korea.

Release date: 2015-06-29

• Articles and reports: 12-001-X201500114200
Description:

We consider the observed best prediction (OBP; Jiang, Nguyen and Rao 2011) for small area estimation under the nested-error regression model, where both the mean and variance functions may be misspecified. We show via a simulation study that the OBP may significantly outperform the empirical best linear unbiased prediction (EBLUP) method not just in the overall mean squared prediction error (MSPE) but also in the area-specific MSPE for every one of the small areas. A bootstrap method is proposed for estimating the design-based area-specific MSPE, which is simple and always produces positive MSPE estimates. The performance of the proposed MSPE estimator is evaluated through a simulation study. An application to the Television School and Family Smoking Prevention and Cessation study is considered.

Release date: 2015-06-29

• Articles and reports: 12-001-X201500114174
Description:

Matrix sampling, often referred to as split-questionnaire, is a sampling design that involves dividing a questionnaire into subsets of questions, possibly overlapping, and then administering each subset to one or more different random subsamples of an initial sample. This increasingly appealing design addresses concerns related to data collection costs, respondent burden and data quality, but reduces the number of sample units that are asked each question. A broadened concept of matrix design includes the integration of samples from separate surveys for the benefit of streamlined survey operations and consistency of outputs. For matrix survey sampling with overlapping subsets of questions, we propose an efficient estimation method that exploits correlations among items surveyed in the various subsamples in order to improve the precision of the survey estimates. The proposed method, based on the principle of best linear unbiased estimation, generates composite optimal regression estimators of population totals using a suitable calibration scheme for the sampling weights of the full sample. A variant of this calibration scheme, of more general use, produces composite generalized regression estimators that are also computationally very efficient.

Release date: 2015-06-29

• Articles and reports: 12-001-X201500114161
Description:

A popular area level model used for the estimation of small area means is the Fay-Herriot model. This model involves unobservable random effects for the areas apart from the (fixed) linear regression based on area level covariates. Empirical best linear unbiased predictors of small area means are obtained by estimating the area random effects, and they can be expressed as a weighted average of area-specific direct estimators and regression-synthetic estimators. In some cases the observed data do not support the inclusion of the area random effects in the model. Excluding these area effects leads to the regression-synthetic estimator, that is, a zero weight is attached to the direct estimator. A preliminary test estimator of a small area mean obtained after testing for the presence of area random effects is studied. On the other hand, empirical best linear unbiased predictors of small area means that always give non-zero weights to the direct estimators in all areas, together with alternative estimators based on the preliminary test, are also studied. The preliminary testing procedure is also used to define new mean squared error estimators of the point estimators of small area means. Results of a limited simulation study show that, for a small number of areas, the preliminary testing procedure leads to mean squared error estimators with considerably smaller average absolute relative bias than the usual mean squared error estimators, especially when the variance of the area effects is small relative to the sampling variances.

Release date: 2015-06-29
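The weighted-average form described in the abstract above (12-001-X201500114161) can be sketched for a single area. The sketch uses the standard Fay-Herriot shrinkage weight, the area-effect variance divided by the sum of that variance and the sampling variance, and treats both variances as known, which is a simplifying assumption (in practice the area-effect variance is estimated). Names and numbers are hypothetical.

```python
def shrinkage_weight(area_variance, sampling_variance):
    """Weight attached to the direct estimator in the composite form."""
    return area_variance / (area_variance + sampling_variance)


def composite_estimate(direct, synthetic, area_variance, sampling_variance):
    """Weighted average of the direct and regression-synthetic estimators."""
    gamma = shrinkage_weight(area_variance, sampling_variance)
    return gamma * direct + (1.0 - gamma) * synthetic


# Hypothetical area: noisy direct estimate, smoother synthetic estimate.
estimate = composite_estimate(direct=50.0, synthetic=40.0,
                              area_variance=4.0, sampling_variance=12.0)

# With no area random effect, the weight on the direct estimator is zero
# and the composite collapses to the regression-synthetic estimator.
no_effect = composite_estimate(direct=50.0, synthetic=40.0,
                               area_variance=0.0, sampling_variance=12.0)
```

The second call mirrors the case in the abstract where excluding the area effects leaves only the regression-synthetic estimator.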

• Articles and reports: 12-001-X201500114160
Description:

Composite estimation is a technique applicable to repeated surveys with controlled overlap between successive surveys. This paper examines the modified regression estimators that incorporate information from previous time periods into estimates for the current time period. The range of modified regression estimators is extended to the situation of business surveys with survey frames that change over time, due to the addition of “births” and the deletion of “deaths”. Since the modified regression estimators can deviate from the generalized regression estimator over time, it is proposed to use a compromise modified regression estimator, a weighted average of the modified regression estimator and the generalized regression estimator. A Monte Carlo simulation study shows that the proposed compromise modified regression estimator leads to significant efficiency gains in both the point-in-time and movement estimates.

Release date: 2015-06-29

• Articles and reports: 12-001-X201500114192
Description:

We are concerned with optimal linear estimation of means on subsequent occasions under sample rotation, where the evolution of samples in time is designed through a cascade pattern. It has been known since the seminal paper of Patterson (1950) that when units are not allowed to return to the sample after leaving it for a certain period (there are no gaps in the rotation pattern), a one-step recursion for the optimal estimator holds. However, in some important real surveys, e.g., the Current Population Survey in the US or the Labour Force Survey in many European countries, units return to the sample after being absent for several occasions (there are gaps in the rotation patterns). In such situations, determining the form of the recurrence for the optimal estimator becomes drastically more difficult. This issue has not yet been resolved. Instead, alternative sub-optimal approaches were developed, such as K-composite estimation (see e.g., Hansen, Hurwitz, Nisselson and Steinberg (1955)), AK-composite estimation (see e.g., Gurney and Daly (1965)) or the time series approach (see e.g., Binder and Hidiroglou (1988)).

In the present paper we overcome this long-standing difficulty, that is, we present analytical recursion formulas for the optimal linear estimator of the mean for schemes with gaps in rotation patterns. This is achieved under some technical conditions: ASSUMPTION I and ASSUMPTION II (numerical experiments suggest that these assumptions might be universally satisfied). To attain the goal we develop an algebraic operator approach which reduces the problem of recursion for the optimal linear estimator to two issues: (1) localization of the roots (possibly complex) of a polynomial Qp defined in terms of the rotation pattern (Qp happens to be conveniently expressed through Chebyshev polynomials of the first kind), and (2) the rank of a matrix S defined in terms of the rotation pattern and the roots of the polynomial Qp. In particular, it is shown that the order of the recursion is equal to one plus the size of the largest gap in the rotation pattern. Exact formulas for calculating the recurrence coefficients are given; of course, to use them one has to check (in many cases, numerically) that ASSUMPTIONs I and II are satisfied. The solution is illustrated through several examples of rotation schemes arising in real surveys.

Release date: 2015-06-29

• Technical products: 12-002-X
Description:

The Research Data Centres (RDCs) Information and Technical Bulletin (ITB) is a forum by which Statistics Canada analysts and the research community can inform each other on survey data uses and methodological techniques. Articles in the ITB focus on data analysis and modelling, data management, and best or ineffective statistical, computational, and scientific practices. Further, ITB topics will include essays on data content, implications of questionnaire wording, comparisons of datasets, reviews on methodologies and their application, data peculiarities, problematic data and solutions, and explanations of innovative tools using RDC surveys and relevant software. All of these essays may provide advice and detailed examples outlining commands, habits, tricks and strategies used to make problem-solving easier for the RDC user.

The main aims of the ITB are:

- the advancement and dissemination of knowledge surrounding Statistics Canada's data;
- the exchange of ideas among the RDC-user community;
- the support of new users;
- the co-operation with subject matter experts and divisions within Statistics Canada.

The ITB is interested in quality articles that are worth publicizing throughout the research community, and that will add value to the quality of research produced at Statistics Canada's RDCs.

Release date: 2015-03-25

• Index and guides: 99-002-X
Description:

This report describes sampling and weighting procedures used in the 2011 National Household Survey. It provides operational and theoretical justifications for them, and presents the results of the evaluation studies of these procedures.

Release date: 2015-01-28

• Index and guides: 99-002-X2011001
Release date: 2015-01-28

• Articles and reports: 12-001-X201400214113
Description:

Rotating panel surveys are used to calculate estimates of gross flows between two consecutive periods of measurement. This paper considers a general procedure for the estimation of gross flows when the rotating panel survey has been generated from a complex survey design with random nonresponse. A pseudo maximum likelihood approach is considered through a two-stage model of Markov chains for the allocation of individuals among the categories in the survey and for modeling nonresponse.

Release date: 2014-12-19


## Analysis (331) (25 of 331 results)

• Articles and reports: 11-629-X2017009
Description:

Seasonal adjustment is a statistical technique used to remove fluctuations in economic data that occur every year at the same time and in a similar fashion. This video provides an overview of seasonal adjustment, how it is used and how it affects the economy.

Release date: 2017-11-22

• Articles and reports: 12-001-X201700114823
Description:

The derivation of estimators in a multi-phase calibration process requires a sequential computation of estimators and calibrated weights of previous phases in order to obtain those of later ones. Already after two phases of calibration the estimators and their variances involve calibration factors from both phases and the formulae become cumbersome and uninformative. As a consequence the literature so far deals mainly with two phases while three phases or more are rarely being considered. The analysis in some cases is ad-hoc for a specific design and no comprehensive methodology for constructing calibrated estimators, and more challengingly, estimating their variances in three or more phases was formed. We provide a closed form formula for the variance of multi-phase calibrated estimators that holds for any number of phases. By specifying a new presentation of multi-phase calibrated weights it is possible to construct calibrated estimators that have the form of multi-variate regression estimators which enables a computation of a consistent estimator for their variance. This new variance estimator is not only general for any number of phases but also has some favorable characteristics. A comparison to other estimators in the special case of two-phase calibration and another independent study for three phases are presented.

Release date: 2017-06-22

• Articles and reports: 12-001-X201700114819
Description:

Structural time series models are a powerful technique for variance reduction in the framework of small area estimation (SAE) based on repeatedly conducted surveys. Statistics Netherlands implemented a structural time series model to produce monthly figures about the labour force with the Dutch Labour Force Survey (DLFS). Such models, however, contain unknown hyperparameters that have to be estimated before the Kalman filter can be launched to estimate state variables of the model. This paper describes a simulation aimed at studying the properties of hyperparameter estimators in the model. Simulating distributions of the hyperparameter estimators under different model specifications complements standard model diagnostics for state space models. Uncertainty around the model hyperparameters is another major issue. To account for hyperparameter uncertainty in the mean squared errors (MSE) estimates of the DLFS, several estimation approaches known in the literature are considered in a simulation. Apart from the MSE bias comparison, this paper also provides insight into the variances and MSEs of the MSE estimators considered.

Release date: 2017-06-22

• Articles and reports: 12-001-X201600214677
Description:

How do we tell whether weighting adjustments reduce nonresponse bias? If a variable is measured for everyone in the selected sample, then the design weights can be used to calculate an approximately unbiased estimate of the population mean or total for that variable. A second estimate of the population mean or total can be calculated using the survey respondents only, with weights that have been adjusted for nonresponse. If the two estimates disagree, then there is evidence that the weight adjustments may not have removed the nonresponse bias for that variable. In this paper we develop the theoretical properties of linearization and jackknife variance estimators for evaluating the bias of an estimated population mean or total by comparing estimates calculated from overlapping subsets of the same data with different sets of weights, when poststratification or inverse propensity weighting is used for the nonresponse adjustments to the weights. We provide sufficient conditions on the population, sample, and response mechanism for the variance estimators to be consistent, and demonstrate their small-sample properties through a simulation study.

Release date: 2016-12-20

• Articles and reports: 12-001-X201600214664
Description:

This paper draws statistical inference for finite population mean based on judgment post stratified (JPS) samples. The JPS sample first selects a simple random sample and then stratifies the selected units into H judgment classes based on their relative positions (ranks) in a small set of size H. This leads to a sample with random sample sizes in judgment classes. Ranking process can be performed either using auxiliary variables or visual inspection to identify the ranks of the measured observations. The paper develops unbiased estimator and constructs confidence interval for population mean. Since judgment ranks are random variables, by conditioning on the measured observations we construct Rao-Blackwellized estimators for the population mean. The paper shows that Rao-Blackwellized estimators perform better than usual JPS estimators. The proposed estimators are applied to 2012 United States Department of Agriculture Census Data.

Release date: 2016-12-20

• Articles and reports: 12-001-X201600214663
Description:

We present theoretical evidence that efforts during data collection to balance the survey response with respect to selected auxiliary variables will improve the chances for low nonresponse bias in the estimates that are ultimately produced by calibrated weighting. One of our results shows that the variance of the bias – measured here as the deviation of the calibration estimator from the (unrealized) full-sample unbiased estimator – decreases linearly as a function of the response imbalance that we assume measured and controlled continuously over the data collection period. An attractive prospect is thus a lower risk of bias if one can manage the data collection to get low imbalance. The theoretical results are validated in a simulation study with real data from an Estonian household survey.

Release date: 2016-12-20

• Articles and reports: 12-001-X201600214660
Description:

In an economic survey of a sample of enterprises, occupations are randomly selected from a list until a number r of occupations in a local unit has been identified. This is an inverse sampling problem for which we are proposing a few solutions. Simple designs with and without replacement are processed using negative binomial distributions and negative hypergeometric distributions. We also propose estimators for when the units are selected with unequal probabilities, with or without replacement.

Release date: 2016-12-20
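
As a rough illustration of the simplest case in the entry above (equal-probability draws with replacement), the sketch below simulates drawing until r units with the target characteristic are found and then applies the classical unbiased estimator (r − 1)/(n − 1) of a proportion under negative binomial sampling. It is not taken from the paper: the function names `draws_until_r` and `inverse_sampling_proportion`, and the modelling of each draw as an independent Bernoulli trial with probability `p`, are assumptions made for illustration.

```python
import random


def draws_until_r(p, r, rng):
    """Count independent draws (success probability p) needed to reach r successes."""
    n = 0
    successes = 0
    while successes < r:
        n += 1
        if rng.random() < p:
            successes += 1
    return n


def inverse_sampling_proportion(r, n):
    """Classical unbiased estimator of p under negative binomial sampling.

    r is the fixed number of successes, n the total number of draws (r >= 2).
    """
    if r < 2 or n < r:
        raise ValueError("need r >= 2 and n >= r")
    return (r - 1) / (n - 1)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    n = draws_until_r(p=0.3, r=5, rng=rng)
    print(n, inverse_sampling_proportion(5, n))
```

The naive ratio r/n is biased upward in this setting, which is why the estimator subtracts one from both counts.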

• Articles and reports: 12-001-X201600114540
Description:

In this paper, we compare the EBLUP and pseudo-EBLUP estimators for small area estimation under the nested error regression model and three area level model-based estimators using the Fay-Herriot model. We conduct a design-based simulation study to compare the model-based estimators for unit level and area level models under informative and non-informative sampling. In particular, we are interested in the confidence interval coverage rate of the unit level and area level estimators. We also compare the estimators if the model has been misspecified. Our simulation results show that estimators based on the unit level model perform better than those based on the area level. The pseudo-EBLUP estimator is the best among unit level and area level estimators.

Release date: 2016-06-22

• Articles and reports: 12-001-X201600114544
Description:

In the Netherlands, statistical information about income and wealth is based on two large scale household panels that are completely derived from administrative data. A problem with using households as sampling units in the sample design of panels is the instability of these units over time. Changes in the household composition affect the inclusion probabilities required for design-based and model-assisted inference procedures. Such problems are circumvented in the two aforementioned household panels by sampling persons, who are followed over time. At each period the household members of these sampled persons are included in the sample. This is equivalent to sampling with probabilities proportional to household size where households can be selected more than once but with a maximum equal to the number of household members. In this paper properties of this sample design are described and contrasted with the Generalized Weight Share method for indirect sampling (Lavallée 1995, 2007). Methods are illustrated with an application to the Dutch Regional Income Survey.

Release date: 2016-06-22

• Articles and reports: 12-001-X201600114543
Description:

The regression estimator is extensively used in practice because it can improve the reliability of the estimated parameters of interest such as means or totals. It uses control totals of variables known at the population level that are included in the regression set up. In this paper, we investigate the properties of the regression estimator that uses control totals estimated from the sample, as well as those known at the population level. This estimator is compared to the regression estimators that strictly use the known totals both theoretically and via a simulation study.

Release date: 2016-06-22
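
To make the role of control totals in the entry above concrete, here is a minimal sketch of a regression estimator of a total with a single auxiliary variable whose population total is known. It assumes a no-intercept working model and does not cover the paper's case of control totals estimated from a sample; the function name `regression_estimator_total` and its arguments are hypothetical.

```python
def regression_estimator_total(y, x, weights, x_total):
    """Regression estimate of the total of y using one auxiliary variable x.

    y, x    : sample values of the study and auxiliary variables
    weights : design weights of the sampled units
    x_total : known population (control) total of x
    """
    # Weighted expansion estimates of the two totals.
    y_hat = sum(w * yi for yi, w in zip(y, weights))
    x_hat = sum(w * xi for xi, w in zip(x, weights))
    # Weighted least-squares slope for the no-intercept working model.
    slope = (sum(w * xi * yi for yi, xi, w in zip(y, x, weights))
             / sum(w * xi * xi for xi, w in zip(x, weights)))
    # Adjust the expansion estimate by the shortfall in the auxiliary total.
    return y_hat + slope * (x_total - x_hat)


if __name__ == "__main__":
    print(regression_estimator_total([2, 4, 6], [1, 2, 3], [2, 2, 2], 15))
```

When y is exactly proportional to x, as in the usage line, the estimate reproduces the true total regardless of which units were sampled.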

• Articles and reports: 12-001-X201500214230
Description:

This paper develops allocation methods for stratified sample surveys where composite small area estimators are a priority, and areas are used as strata. Longford (2006) proposed an objective criterion for this situation, based on a weighted combination of the mean squared errors of small area means and a grand mean. Here, we redefine this approach within a model-assisted framework, allowing regressor variables and a more natural interpretation of results using an intra-class correlation parameter. We also consider several uses of power allocation, and allow the placing of other constraints such as maximum relative root mean squared errors for stratum estimators. We find that a simple power allocation can perform very nearly as well as the optimal design even when the objective is to minimize Longford’s (2006) criterion.

Release date: 2015-12-17

• Articles and reports: 12-001-X201500214231
Description:

Rotating panels are widely applied by national statistical institutes, for example, to produce official statistics about the labour force. Estimation procedures are generally based on traditional design-based procedures known from classical sampling theory. A major drawback of this class of estimators is that small sample sizes result in large standard errors and that they are not robust for measurement bias. Two examples showing the effects of measurement bias are rotation group bias in rotating panels, and systematic differences in the outcome of a survey due to a major redesign of the underlying process. In this paper we apply a multivariate structural time series model to the Dutch Labour Force Survey to produce model-based figures about the monthly labour force. The model reduces the standard errors of the estimates by taking advantage of sample information collected in previous periods, accounts for rotation group bias and autocorrelation induced by the rotating panel, and models discontinuities due to a survey redesign. Additionally, we discuss the use of correlated auxiliary series in the model to further improve the accuracy of the model estimates. The method is applied by Statistics Netherlands to produce accurate official monthly statistics about the labour force that are consistent over time, despite a redesign of the survey process.

Release date: 2015-12-17

• Articles and reports: 12-001-X201500214248
Description:

Unit level population models are often used in model-based small area estimation of totals and means, but the models may not hold for the sample if the sampling design is informative for the model. As a result, standard methods, assuming that the model holds for the sample, can lead to biased estimators. We study alternative methods that use a suitable function of the unit selection probability as an additional auxiliary variable in the sample model. We report the results of a simulation study on the bias and mean squared error (MSE) of the proposed estimators of small area means and on the relative bias of the associated MSE estimators, using informative sampling schemes to generate the samples. Alternative methods, based on modeling the conditional expectation of the design weight as a function of the model covariates and the response, are also included in the simulation study.

Release date: 2015-12-17

• Articles and reports: 12-001-X201500114199
Description:

In business surveys, it is not unusual to collect economic variables for which the distribution is highly skewed. In this context, winsorization is often used to treat the problem of influential values. This technique requires the determination of a constant that corresponds to the threshold above which large values are reduced. In this paper, we consider a method of determining the constant which involves minimizing the largest estimated conditional bias in the sample. In the context of domain estimation, we also propose a method of ensuring consistency between the domain-level winsorized estimates and the population-level winsorized estimate. The results of two simulation studies suggest that the proposed methods lead to winsorized estimators that have good bias and relative efficiency properties.

Release date: 2015-06-29
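
The basic operation described in the entry above, reducing values above a threshold, can be sketched as follows. This shows only a simple form of winsorization in which values above the constant are replaced by the constant; it does not implement the paper's method for choosing the constant, and the function name `winsorized_total` is hypothetical.

```python
def winsorized_total(values, weights, threshold):
    """Weighted total in which each value above `threshold` is cut back to it."""
    return sum(w * min(y, threshold) for y, w in zip(values, weights))


if __name__ == "__main__":
    values = [1, 2, 100]     # one influential value
    weights = [10, 10, 10]
    print(winsorized_total(values, weights, threshold=50))
```

A lower threshold reduces the influence of large values (less variance) at the cost of a larger downward bias, which is the trade-off the choice of constant has to balance.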

• Articles and reports: 12-001-X201500114150
Description:

An area-level model approach to combining information from several sources is considered in the context of small area estimation. At each small area, several estimates are computed and linked through a system of structural error models. The best linear unbiased predictor of the small area parameter can be computed by the general least squares method. Parameters in the structural error models are estimated using the theory of measurement error models. Estimation of mean squared errors is also discussed. The proposed method is applied to the real problem of labor force surveys in Korea.

Release date: 2015-06-29

• Articles and reports: 12-001-X201500114200
Description:

We consider the observed best prediction (OBP; Jiang, Nguyen and Rao 2011) for small area estimation under the nested-error regression model, where both the mean and variance functions may be misspecified. We show via a simulation study that the OBP may significantly outperform the empirical best linear unbiased prediction (EBLUP) method not just in the overall mean squared prediction error (MSPE) but also in the area-specific MSPE for every one of the small areas. A bootstrap method is proposed for estimating the design-based area-specific MSPE, which is simple and always produces positive MSPE estimates. The performance of the proposed MSPE estimator is evaluated through a simulation study. An application to the Television School and Family Smoking Prevention and Cessation study is considered.

Release date: 2015-06-29

• Articles and reports: 12-001-X201500114174
Description:

Matrix sampling, often referred to as split-questionnaire, is a sampling design that involves dividing a questionnaire into subsets of questions, possibly overlapping, and then administering each subset to one or more different random subsamples of an initial sample. This increasingly appealing design addresses concerns related to data collection costs, respondent burden and data quality, but reduces the number of sample units that are asked each question. A broadened concept of matrix design includes the integration of samples from separate surveys for the benefit of streamlined survey operations and consistency of outputs. For matrix survey sampling with overlapping subsets of questions, we propose an efficient estimation method that exploits correlations among items surveyed in the various subsamples in order to improve the precision of the survey estimates. The proposed method, based on the principle of best linear unbiased estimation, generates composite optimal regression estimators of population totals using a suitable calibration scheme for the sampling weights of the full sample. A variant of this calibration scheme, of more general use, produces composite generalized regression estimators that are also computationally very efficient.

Release date: 2015-06-29

• Articles and reports: 12-001-X201500114161
Description:

A popular area level model used for the estimation of small area means is the Fay-Herriot model. This model involves unobservable random effects for the areas apart from the (fixed) linear regression based on area level covariates. Empirical best linear unbiased predictors of small area means are obtained by estimating the area random effects, and they can be expressed as a weighted average of area-specific direct estimators and regression-synthetic estimators. In some cases the observed data do not support the inclusion of the area random effects in the model. Excluding these area effects leads to the regression-synthetic estimator, that is, a zero weight is attached to the direct estimator. A preliminary test estimator of a small area mean obtained after testing for the presence of area random effects is studied. On the other hand, empirical best linear unbiased predictors of small area means that always give non-zero weights to the direct estimators in all areas together with alternative estimators based on the preliminary test are also studied. The preliminary testing procedure is also used to define new mean squared error estimators of the point estimators of small area means. Results of a limited simulation study show that, for small number of areas, the preliminary testing procedure leads to mean squared error estimators with considerably smaller average absolute relative bias than the usual mean squared error estimators, especially when the variance of the area effects is small relative to the sampling variances.

Release date: 2015-06-29
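
The weighted-average form mentioned in the entry above can be sketched as below, using the standard shrinkage weight of the Fay-Herriot model: the area-effect variance divided by the sum of the area-effect variance and the sampling variance. The variance values are treated as given (in practice the area-effect variance is estimated), and the function name `fay_herriot_composite` and its argument names are hypothetical.

```python
def fay_herriot_composite(direct, synthetic, area_variance, sampling_variance):
    """Weighted average of a direct and a regression-synthetic estimate.

    area_variance     : variance of the area random effects (assumed given)
    sampling_variance : sampling variance of the direct estimator for the area
    """
    gamma = area_variance / (area_variance + sampling_variance)
    return gamma * direct + (1.0 - gamma) * synthetic


if __name__ == "__main__":
    # Equal variances give equal weight to both estimates.
    print(fay_herriot_composite(10.0, 20.0, 1.0, 1.0))
    # A zero area-effect variance puts all weight on the synthetic estimate.
    print(fay_herriot_composite(10.0, 20.0, 0.0, 1.0))
```

The second call corresponds to the case described in the abstract where the area effects are excluded and the direct estimator receives zero weight.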

• Articles and reports: 12-001-X201500114160
Description:

Composite estimation is a technique applicable to repeated surveys with controlled overlap between successive surveys. This paper examines the modified regression estimators that incorporate information from previous time periods into estimates for the current time period. The range of modified regression estimators is extended to the situation of business surveys with survey frames that change over time, due to the addition of “births” and the deletion of “deaths”. Since the modified regression estimators can deviate from the generalized regression estimator over time, it is proposed to use a compromise modified regression estimator, a weighted average of the modified regression estimator and the generalized regression estimator. A Monte Carlo simulation study shows that the proposed compromise modified regression estimator leads to significant efficiency gains in both the point-in-time and movement estimates.

Release date: 2015-06-29

• Articles and reports: 12-001-X201500114192
Description:

We are concerned with optimal linear estimation of means on subsequent occasions under sample rotation where the evolution of samples in time is designed through a cascade pattern. It has been known since the seminal paper of Patterson (1950) that when the units are not allowed to return to the sample after leaving it for a certain period (there are no gaps in the rotation pattern), a one-step recursion for the optimal estimator holds. However, in some important real surveys, e.g., the Current Population Survey in the US or the Labour Force Survey in many countries in Europe, units return to the sample after being absent from it for several occasions (there are gaps in the rotation patterns). In such situations, determining the form of the recurrence for the optimal estimator becomes drastically more difficult. This issue has not been resolved yet. Instead, alternative sub-optimal approaches were developed, such as K-composite estimation (see e.g., Hansen, Hurwitz, Nisselson and Steinberg (1955)), AK-composite estimation (see e.g., Gurney and Daly (1965)) or the time series approach (see e.g., Binder and Hidiroglou (1988)).

In the present paper we overcome this long-standing difficulty, that is, we present analytical recursion formulas for the optimal linear estimator of the mean for schemes with gaps in rotation patterns. It is achieved under some technical conditions: ASSUMPTION I and ASSUMPTION II (numerical experiments suggest that these assumptions might be universally satisfied). To attain the goal we develop an algebraic operator approach which allows us to reduce the problem of recursion for the optimal linear estimator to two issues: (1) localization of the roots (possibly complex) of a polynomial Qp defined in terms of the rotation pattern (Qp happens to be conveniently expressed through Chebyshev polynomials of the first kind), (2) the rank of a matrix S defined in terms of the rotation pattern and the roots of the polynomial Qp. In particular, it is shown that the order of the recursion is equal to one plus the size of the largest gap in the rotation pattern. Exact formulas for calculation of the recurrence coefficients are given; of course, to use them one has to check (in many cases, numerically) that ASSUMPTIONs I and II are satisfied. The solution is illustrated through several examples of rotation schemes arising in real surveys.

Release date: 2015-06-29

• Articles and reports: 12-001-X201400214113
Description:

Rotating panel surveys are used to calculate estimates of gross flows between two consecutive periods of measurement. This paper considers a general procedure for the estimation of gross flows when the rotating panel survey has been generated from a complex survey design with random nonresponse. A pseudo maximum likelihood approach is considered through a two-stage model of Markov chains for the allocation of individuals among the categories in the survey and for the modeling of nonresponse.

Release date: 2014-12-19

• Articles and reports: 12-001-X201400214097
Description:

When monthly business surveys are not completely overlapping, there are two different estimators for the monthly growth rate of the turnover: (i) one that is based on the monthly estimated population totals and (ii) one that is purely based on enterprises observed on both occasions in the overlap of the corresponding surveys. The resulting estimates and variances might be quite different. This paper proposes an optimal composite estimator for the growth rate as well as the population totals.

Release date: 2014-12-19
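
The two growth-rate estimators contrasted in the entry above can be sketched as follows. This only illustrates the two ingredients, not the optimal composite estimator the paper proposes. The growth rate is taken here to be the relative change between two months, sampling weights are omitted from the overlap version for brevity, and the function names and the dictionary layout (enterprise identifier mapped to turnover) are assumptions made for illustration.

```python
def growth_from_totals(total_previous, total_current):
    """Estimator (i): relative change between two estimated population totals."""
    return total_current / total_previous - 1.0


def growth_from_overlap(previous, current):
    """Estimator (ii): relative change using only enterprises observed in both months.

    previous, current : dicts mapping an enterprise identifier to its turnover
    """
    common = sorted(previous.keys() & current.keys())
    if not common:
        raise ValueError("the two samples do not overlap")
    return (sum(current[k] for k in common)
            / sum(previous[k] for k in common) - 1.0)


if __name__ == "__main__":
    previous = {"a": 100, "b": 200, "c": 50}
    current = {"b": 220, "c": 55, "d": 80}
    print(growth_from_totals(sum(previous.values()), sum(current.values())))
    print(growth_from_overlap(previous, current))
```

In the usage example the two estimators give noticeably different answers because enterprises "a" and "d" each appear in only one month, which is the situation the abstract describes.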

• Articles and reports: 12-001-X201400214118
Description:

Bagging is a powerful computational method used to improve the performance of inefficient estimators. This article is a first exploration of the use of bagging in survey estimation, and we investigate the effects of bagging on non-differentiable survey estimators including sample distribution functions and quantiles, among others. The theoretical properties of bagged survey estimators are investigated under both design-based and model-based regimes. In particular, we show the design consistency of the bagged estimators, and obtain the asymptotic normality of the estimators in the model-based context. The article describes how implementation of bagging for survey estimators can take advantage of replicates developed for survey variance estimation, providing an easy way for practitioners to apply bagging in existing surveys. A major remaining challenge in implementing bagging in the survey context is variance estimation for the bagged estimators themselves, and we explore two possible variance estimation approaches. Simulation experiments reveal the improvement of the proposed bagging estimator relative to the original estimator and compare the two variance estimation approaches.

Release date: 2014-12-19

• Articles and reports: 12-001-X201400114029
Description:

Fay and Train (1995) present a method called successive difference replication that can be used to estimate the variance of an estimated total from a systematic random sample from an ordered list. The estimator uses the general form of a replication variance estimator, where the replicate factors are constructed such that the estimator mimics the successive difference estimator. This estimator is a modification of the estimator given by Wolter (1985). The paper furthers the methodology by explaining the impact of the row assignments on the variance estimator, showing how a reduced set of replicates leads to a reasonable estimator, and establishing conditions for successive difference replication to be equivalent to the successive difference estimator.

Release date: 2014-06-27

• Articles and reports: 12-001-X201400114030
Description:

The paper reports the results of a Monte Carlo simulation study that was conducted to compare the effectiveness of four different hierarchical Bayes small area models for producing state estimates of proportions based on data from stratified simple random samples from a fixed finite population. Two of the models adopted the commonly made assumptions that the survey weighted proportion for each sampled small area has a normal distribution and that the sampling variance of this proportion is known. One of these models used a linear linking model and the other used a logistic linking model. The other two models both employed logistic linking models and assumed that the sampling variance was unknown. One of these models assumed a normal distribution for the sampling model while the other assumed a beta distribution. The study found that for all four models the credible interval design-based coverage of the finite population state proportions deviated markedly from the 95 percent nominal level used in constructing the intervals.

Release date: 2014-06-27

...

## Reference (80) (25 of 80 results)

• Technical products: 91-528-X
Description:

This manual provides detailed descriptions of the data sources and methods used by Statistics Canada to estimate population. They comprise Postcensal and intercensal population estimates; base population; births and deaths; immigration; emigration; non-permanent residents; interprovincial migration; subprovincial estimates of population; population estimates by age, sex and marital status; and census family estimates. A glossary of principal terms is contained at the end of the manual, followed by the standard notation used.

Until now, literature on the methodological changes for estimates calculations has always been spread throughout various Statistics Canada publications and background papers. This manual provides users of demographic statistics with a comprehensive compilation of the current procedures used by Statistics Canada to prepare population and family estimates.

Release date: 2016-03-03

• Technical products: 12-002-X
Description:

The Research Data Centres (RDCs) Information and Technical Bulletin (ITB) is a forum by which Statistics Canada analysts and the research community can inform each other on survey data uses and methodological techniques. Articles in the ITB focus on data analysis and modelling, data management, and best or ineffective statistical, computational, and scientific practices. Further, ITB topics will include essays on data content, implications of questionnaire wording, comparisons of datasets, reviews on methodologies and their application, data peculiarities, problematic data and solutions, and explanations of innovative tools using RDC surveys and relevant software. All of these essays may provide advice and detailed examples outlining commands, habits, tricks and strategies used to make problem-solving easier for the RDC user.

The main aims of the ITB are:

- the advancement and dissemination of knowledge surrounding Statistics Canada's data;
- the exchange of ideas among the RDC-user community;
- the support of new users;
- the co-operation with subject matter experts and divisions within Statistics Canada.

The ITB is interested in quality articles that are worth publicizing throughout the research community, and that will add value to the quality of research produced at Statistics Canada's RDCs.

Release date: 2015-03-25

• Index and guides: 99-002-X
Description:

This report describes sampling and weighting procedures used in the 2011 National Household Survey. It provides operational and theoretical justifications for them, and presents the results of the evaluation studies of these procedures.

Release date: 2015-01-28

• Index and guides: 99-002-X2011001
Release date: 2015-01-28

• Technical products: 11-522-X201300014281
Description:

Web surveys exclude the entire non-internet population and often have low response rates. Therefore, statistical inference based on Web survey samples will require availability of additional information about the non-covered population, careful choice of survey methods to account for potential biases, and caution with interpretation and generalization of the results to a target population. In this paper, we focus on non-coverage bias, and explore the use of weighted estimators and hot-deck imputation estimators for bias adjustment under the ideal scenario where covariate information was obtained for a simple random sample of individuals from the non-covered population. We illustrate empirically the performance of the proposed estimators under this scenario. Possible extensions of these approaches to more realistic scenarios are discussed.

Release date: 2014-10-31

• Technical products: 11-522-X201300014266
Description:

Monitors and self-reporting are two methods of measuring energy expended in physical activity, where monitor devices typically have much smaller error variances than do self-reports. The Physical Activity Measurement Survey was designed to compare the two procedures, using replicate observations on the same individual. The replicates permit calibrating the personal report measurement to the monitor measurement and make it possible to estimate components of the measurement error variances. Estimates of the variance components of measurement error in monitor and self-report energy expenditure are given for females in the Physical Activity Measurement Survey.

Release date: 2014-10-31

• Technical products: 11-522-X201300014286
Description:

The Étude Longitudinale Française depuis l’Enfance (ELFE) [French longitudinal study from childhood on], which began in 2011, involves over 18,300 infants whose parents agreed to participate when they were in the maternity hospital. This cohort survey, which will track the children from birth to adulthood, covers the many aspects of their lives from the perspective of social science, health and environmental health. In randomly selected maternity hospitals, all infants in the target population, who were born on one of 25 days distributed across the four seasons, were chosen. This sample is the outcome of a non-standard sampling scheme that we call product sampling. In this survey, it takes the form of the cross-tabulation between two independent samples: a sampling of maternity hospitals and a sampling of days. While it is easy to imagine a cluster effect due to the sampling of maternity hospitals, one can also imagine a cluster effect due to the sampling of days. The scheme’s time dimension therefore cannot be ignored if the desired estimates are subject to daily or seasonal variation. While this non-standard scheme can be viewed as a particular kind of two-phase design, it needs to be defined within a more specific framework. Following a comparison of the product scheme with a conventional two-stage design, we propose variance estimators specially formulated for this sampling scheme. Our ideas are illustrated with a simulation study.

Release date: 2014-10-31

• Technical products: 11-522-X201300014265
Description:

Release date: 2014-10-31

• Technical products: 12-002-X201400111901
Description:

This document is for analysts/researchers who are considering doing research with data from a survey where both survey weights and bootstrap weights are provided in the data files. This document gives directions, for some selected software packages, about how to get started in using survey weights and bootstrap weights for an analysis of survey data. We give brief directions for obtaining survey-weighted estimates, bootstrap variance estimates (and other desired error quantities) and some typical test statistics for each software package in turn. While these directions are provided just for the chosen examples, there will be information about the range of weighted and bootstrapped analyses that can be carried out by each software package.

Release date: 2014-08-07
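
For readers unfamiliar with the idea in the entry above, the following sketch shows one common way survey weights and bootstrap weights are used together: the point estimate is computed with the survey weights, recomputed once per set of bootstrap weights, and the variance is estimated as the average squared deviation of the replicate estimates from the point estimate. This is a generic illustration, not the procedure of any particular software package or survey. Some surveys require an additional adjustment factor that is not shown here, and the function names are hypothetical.

```python
def weighted_mean(values, weights):
    """Survey-weighted mean of `values`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def bootstrap_variance(values, survey_weights, bootstrap_weights):
    """Bootstrap variance estimate of the weighted mean.

    bootstrap_weights : list of weight vectors, one per bootstrap replicate
    """
    if not bootstrap_weights:
        raise ValueError("at least one set of bootstrap weights is required")
    estimate = weighted_mean(values, survey_weights)
    replicates = [weighted_mean(values, bw) for bw in bootstrap_weights]
    return sum((r - estimate) ** 2 for r in replicates) / len(replicates)


if __name__ == "__main__":
    values = [1, 3]
    survey_weights = [1, 1]
    bootstrap_weights = [[2, 0], [0, 2]]  # toy replicates for illustration
    print(bootstrap_variance(values, survey_weights, bootstrap_weights))
```

The same pattern applies to any statistic: swap `weighted_mean` for the estimator of interest and reuse the replicate loop.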

• Technical products: 75F0002M2012003
Description:

The release of the 2010 Survey of Labour and Income Dynamics (SLID) data coincided with a historical revision of the 2006 to 2009 results. The survey weights were updated to take into account new population estimates based on the 2006 Census rather than the 2001 Census. This paper presents a summary of the impact of this revision on the 2006-2009 survey estimates.

Release date: 2012-11-01

• Technical products: 12-002-X201200111642
Description:

It is generally recommended that weighted estimation approaches be used when analyzing data from a long-form census microdata file. Since such data files are now available in the RDCs, there is a need to provide researchers there with more information about doing weighted estimation with these files. The purpose of this paper is to provide some of this information, in particular, how the weight variables were derived for the census microdata files and what weight should be used for different units of analysis. For the 1996, 2001 and 2006 censuses the same weight variable is appropriate regardless of whether people, families or households are being studied. For the 1991 census, recommendations are more complex: a different weight variable is required for households than for people and families, and additional restrictions apply to obtain the correct weight value for families.

Release date: 2012-10-25

• Technical products: 92-568-X
Description:

This report describes sampling and weighting procedures used in the 2006 Census. It reviews the history of these procedures in Canadian censuses, provides operational and theoretical justifications for them, and presents the results of the evaluation studies of these procedures.

Release date: 2009-08-11

• Technical products: 92-568-X2006001
Release date: 2009-08-11

• Technical products: 11-522-X200600110409
Description:

In unequal-probability-of-selection samples, correlations between the probability of selection and the sampled data can induce bias. Weights equal to the inverse of the probability of selection are often used to counteract this bias. Highly disproportional sample designs have large weights, which can introduce unnecessary variability in statistics such as the population mean estimate. Weight trimming reduces large weights to a fixed cutpoint value and adjusts weights below this value to maintain the untrimmed weight sum. This reduces variability at the cost of introducing some bias. Standard approaches are not "data-driven": they do not use the data to make the appropriate bias-variance tradeoff, or else do so in a highly inefficient fashion. This presentation develops Bayesian variable selection methods for weight trimming to supplement standard, ad-hoc design-based methods in disproportional probability-of-inclusion designs where variances due to sample weights exceed bias correction. These methods are used to estimate linear and generalized linear regression model population parameters in the context of stratified and poststratified known-probability sample designs. Applications will be considered in the context of traffic injury survey data, in which highly disproportional sample designs are often utilized.

Release date: 2008-03-17
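
The standard trimming step described in the entry above (cap large weights at a cutpoint, then adjust the remaining weights so the total is unchanged) can be sketched as follows. This is the simple design-based approach the abstract contrasts with, not the Bayesian method it develops. The proportional redistribution rule and the function name `trim_weights` are assumptions made for illustration.

```python
def trim_weights(weights, cutpoint):
    """Cap weights at `cutpoint` and rescale smaller weights to keep the original sum.

    Single pass only: a rescaled weight can end up above the cutpoint, in which
    case the function can be applied again to the result.
    """
    # Total amount removed from the weights that exceed the cutpoint.
    excess = sum(w - cutpoint for w in weights if w > cutpoint)
    if excess == 0:
        return list(weights)
    below_sum = sum(w for w in weights if w < cutpoint)
    if below_sum == 0:
        raise ValueError("no weights below the cutpoint to absorb the excess")
    # Spread the removed amount proportionally over the smaller weights.
    factor = (below_sum + excess) / below_sum
    return [cutpoint if w >= cutpoint else w * factor for w in weights]


if __name__ == "__main__":
    trimmed = trim_weights([1, 1, 1, 9], cutpoint=6)
    print(trimmed, sum(trimmed))
```

The cutpoint itself is the bias-variance dial: a lower cutpoint removes more variability from the weights but moves the estimate further from the design-unbiased one.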

• Technical products: 11-522-X200600110417
Description:

The coefficients of regression equations are often parameters of interest for health surveys and such surveys are usually of complex design with differential sampling rates. We give estimators for the regression coefficients for complex surveys that are superior to ordinary expansion estimators under the subject matter model, but also retain desirable design properties. Theoretical and Monte Carlo properties are presented.

Release date: 2008-03-17

• Technical products: 75F0002M2007007
Description:

The Survey of Labour and Income Dynamics (SLID), introduced in the 1993 reference year, is a longitudinal panel survey of individuals. The purpose of the survey is to measure changes in the economic well-being of individuals and the factors that influence these changes. SLID's sample is divided into two overlapping panels, each six years in length. Longitudinal surveys like SLID are complex due to the dynamic nature of the sample, which in turn is due to the ever-changing composition of households and families over the years. For each reference year, SLID produces two sets of weights: one is representative of the initial population (the longitudinal weights), while the other is representative of the current population (the cross-sectional weights). Since 2002, SLID has been producing a third set of weights which combines two panels that overlap to form a new longitudinal sample. The new weights are referred to as combined longitudinal weights.

For the production of the cross-sectional weights, SLID combines two independent samples and assigns a probability of selection to individuals who joined the sample after the panel was selected. Like cross-sectional weights, longitudinal weights are adjusted for non-response and influential values. In addition, the sample is adjusted to make it representative of the target population. The purpose of this document is to describe SLID's methodology for the longitudinal and cross-sectional weights, as well as to present problems that have been encountered, and solutions that have been proposed. For the purpose of illustration, results for the 2003 reference year are used. The methodology used to produce the combined longitudinal weights will not be presented in this document as there is a complete description in Naud (2004).

Release date: 2007-10-18

• Technical products: 11-522-X20050019468
Description:

At the time of recruitment, the participants in a longitudinal survey are chosen to be representative of a population. As time goes on, typically some of the participants will drop out, and dropout may be informative in the sense of depending on the response variables of interest. However, even if dropout is minimal, the participants who continue to the second and third waves of a longitudinal survey may differ from those they supposedly represent in subtle ways. It is clearly important to take such possibilities into account when designing and analyzing longitudinal survey data before and after an intervention.

Release date: 2007-03-02

• Technical products: 11-522-X20050019447
Description:

To understand the selection biases in model estimation when using longitudinal survey panel microdata, we consider a structural model composed of three equations for non-attrition/response, employment and wages. The three equations are freely correlated.

Release date: 2007-03-02

• Technical products: 11-522-X20050019461
Description:

We propose a generalization of the usual coefficient of variation (CV) to address some known problems that arise when it is used to measure the quality of estimates. These problems include the difficulty of interpreting the CV when the estimate is near zero, and inconsistent conclusions about precision when the CV is computed for different one-to-one monotonic transformations of the estimate.
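The two problems with the usual CV can be seen with a few lines of arithmetic. The sketch below uses the standard definition (standard error divided by the absolute value of the estimate). The numbers are made up for illustration, and the generalized CV proposed in the paper is not reproduced here.

```python
def coefficient_of_variation(estimate, standard_error):
    """Usual CV: standard error divided by the absolute value of the estimate."""
    return standard_error / abs(estimate)


# Problem 1: with the same standard error, the CV grows without bound
# as the estimate approaches zero.
cv_far_from_zero = coefficient_of_variation(50.0, 1.0)
cv_near_zero = coefficient_of_variation(0.01, 1.0)

# Problem 2: a proportion p and its complement 1 - p are related by a
# one-to-one monotonic transformation and share the same standard error,
# yet their CVs suggest very different levels of precision.
p = 0.02
standard_error = 0.005
cv_p = coefficient_of_variation(p, standard_error)
cv_complement = coefficient_of_variation(1 - p, standard_error)

print(f"CV far from zero: {cv_far_from_zero:.4f}")
print(f"CV near zero:     {cv_near_zero:.4f}")
print(f"CV of p:          {cv_p:.4f}")
print(f"CV of 1 - p:      {cv_complement:.4f}")
```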

Release date: 2007-03-02

• Technical products: 11-522-X20050019444
Description:

There are several ways to improve data quality. One of them is to re-design and test questionnaires for ongoing surveys. The benefits of questionnaire re-design and testing include improved accuracy, achieved by ensuring that the questions collect the required data, and decreased response burden.

Release date: 2007-03-02

• Technical products: 11-522-X20050019462
Description:

The traditional approach to presenting variance information to data users is to publish estimates of variance or related statistics, such as standard errors, coefficients of variation, confidence limits or simple grading systems. The paper examines potential sources of variance, such as sample design, sample allocation, sample selection and non-response, and considers what might best be done to reduce variance. Finally, the paper briefly assesses the financial costs to producers and users of reducing or not reducing variance, and how the costs of producing more accurate statistics might be traded off against the financial benefits of greater accuracy.

Release date: 2007-03-02

• Technical products: 11-522-X20050019482
Description:

Health studies linking the administrative hospital discharge database by person can be used to describe disease/procedure rates and trends by person, place and time; investigate outcomes of disease, procedures or risk factors; and illuminate hospital utilization. The power and challenges of this work will be illustrated with examples from work done at Statistics Canada.

Release date: 2007-03-02

• Technical products: 11-522-X20050019467
Description:

This paper reviews techniques for dealing with missing data from complex surveys when conducting longitudinal analysis. In addition to incurring the same types of missingness as cross-sectional data, longitudinal observations also suffer from drop-out missingness. For the purpose of analyzing longitudinal data, random effects models are most often used to account for the longitudinal nature of the data. However, there are difficulties in incorporating the complex design into the typical multi-level models used in this type of longitudinal analysis, especially in the presence of drop-out missingness.

Release date: 2007-03-02

• Technical products: 11-522-X20050019476
Description:

The paper will show how, using data published by Statistics Canada and available from member libraries of the CREPUQ, a linkage approach using postal codes makes it possible to link the data from the outcomes file to a set of contextual variables. These variables could then contribute to producing, on an exploratory basis, a better index to explain the varied outcomes of students from schools. In terms of the impact, the proposed index could show more effectively the limitations of ranking students and schools when this information is not given sufficient weight.

Release date: 2007-03-02

• Technical products: 11-522-X20050019490
Description:

Using core survey, frame, and contact history data collected with the 2005 NHIS, a multi-purpose health survey conducted by the National Center for Health Statistics (NCHS), Centers for Disease Control and Prevention (CDC), a model of initial contact was developed and tested. The implications for survey procedures and field operations are discussed.

Release date: 2007-03-02
