# Statistics by subject – Statistical methods


## All (224) (25 of 224 results)

• Articles and reports: 12-001-X201700254871
Description:

This paper addresses the question of how alternative data sources, such as administrative and social media data, can be used in the production of official statistics. Since most surveys at national statistical institutes are conducted repeatedly over time, a multivariate structural time series modelling approach is proposed to model the series observed by a repeated survey together with related series obtained from such alternative data sources. Generally, this improves the precision of the direct survey estimates by using sample information observed in preceding periods and information from related auxiliary series. The model also makes it possible to exploit the higher frequency of the social media data to produce more precise estimates for the sample survey in real time, at moments when statistics for the social media are available but the sample data are not yet. The concept of cointegration is applied to address the extent to which the alternative series represent the same phenomena as the series observed with the repeated survey. The methodology is applied to the Dutch Consumer Confidence Survey and a sentiment index derived from social media.

Release date: 2017-12-21
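
The filtering step at the heart of such structural time series models can be illustrated with a univariate local-level model. The sketch below is a hypothetical simplification in Python: the paper's model is multivariate with auxiliary series, whereas this one tracks a single survey series with made-up variances `q` and `r`.

```python
# Minimal univariate local-level Kalman filter, a simplified stand-in for the
# multivariate structural time series model described in the abstract.
# State:       mu_t = mu_{t-1} + eta_t,  eta_t ~ N(0, q)
# Observation: y_t  = mu_t + eps_t,      eps_t ~ N(0, r)

def kalman_filter(ys, q=0.1, r=1.0, mu0=0.0, p0=10.0):
    """Return filtered level estimates for a series of survey estimates."""
    mu, p = mu0, p0
    filtered = []
    for y in ys:
        p = p + q                # predict: state variance grows
        k = p / (p + r)          # Kalman gain
        mu = mu + k * (y - mu)   # update with the new survey observation
        p = (1 - k) * p
        filtered.append(mu)
    return filtered

# illustrative consumer-confidence-style series (made-up numbers)
print(kalman_filter([1.0, 1.2, 0.9, 1.1, 1.0, 1.3]))
```

A full implementation would add the auxiliary series (e.g., the social media sentiment index) as a second observation equation sharing the survey series' trend.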

• Articles and reports: 12-001-X201700114823
Description:

The derivation of estimators in a multi-phase calibration process requires sequential computation of the estimators and calibrated weights of earlier phases in order to obtain those of later ones. Already after two phases of calibration, the estimators and their variances involve calibration factors from both phases, and the formulae become cumbersome and uninformative. As a consequence, the literature so far deals mainly with two phases, while three or more phases are rarely considered. The analysis in some cases is ad hoc for a specific design, and no comprehensive methodology has been developed for constructing calibrated estimators in three or more phases or, more challengingly, for estimating their variances. We provide a closed-form formula for the variance of multi-phase calibrated estimators that holds for any number of phases. By specifying a new presentation of multi-phase calibrated weights, it is possible to construct calibrated estimators that have the form of multivariate regression estimators, which enables computation of a consistent estimator of their variance. This new variance estimator is not only general for any number of phases but also has some favourable characteristics. A comparison to other estimators in the special case of two-phase calibration and another independent study for three phases are presented.

Release date: 2017-06-22
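
The single-phase building block of calibration can be sketched in a few lines. The example below is a hypothetical illustration of ratio calibration to one known auxiliary total (the paper treats the far harder multi-phase case); all numbers are made up.

```python
# Single-phase ratio calibration: design weights d_i are adjusted by a common
# factor g so that the weighted auxiliary total matches the known total X.

def calibrate(d, x, X):
    """Ratio calibration: w_i = d_i * X / sum(d_i * x_i)."""
    t_hat = sum(di * xi for di, xi in zip(d, x))
    g = X / t_hat                     # common calibration factor
    return [di * g for di in d]

d = [10.0, 10.0, 20.0]                # design weights
x = [1.0, 2.0, 3.0]                   # auxiliary variable
X = 100.0                             # known population total of x
w = calibrate(d, x, X)
# the calibrated weights reproduce X exactly, by construction
print(round(sum(wi * xi for wi, xi in zip(w, x)), 6))  # 100.0
```

In a multi-phase process, weights calibrated at one phase become the input weights of the next, which is exactly where the formulae the paper studies become cumbersome.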

• Articles and reports: 12-001-X201700114819
Description:

Structural time series models are a powerful technique for variance reduction in the framework of small area estimation (SAE) based on repeatedly conducted surveys. Statistics Netherlands implemented a structural time series model to produce monthly figures about the labour force with the Dutch Labour Force Survey (DLFS). Such models, however, contain unknown hyperparameters that have to be estimated before the Kalman filter can be launched to estimate state variables of the model. This paper describes a simulation aimed at studying the properties of hyperparameter estimators in the model. Simulating distributions of the hyperparameter estimators under different model specifications complements standard model diagnostics for state space models. Uncertainty around the model hyperparameters is another major issue. To account for hyperparameter uncertainty in the mean squared errors (MSE) estimates of the DLFS, several estimation approaches known in the literature are considered in a simulation. Apart from the MSE bias comparison, this paper also provides insight into the variances and MSEs of the MSE estimators considered.

Release date: 2017-06-22

• Articles and reports: 12-001-X201700114820
Description:

Measurement errors can induce bias in the estimation of transitions, leading to erroneous conclusions about labour market dynamics. Traditional literature on gross flows estimation is based on the assumption that measurement errors are uncorrelated over time. This assumption is not realistic in many contexts, because of survey design and data collection strategies. In this work, we use a model-based approach to correct observed gross flows for classification errors with latent class Markov models. We refer to data collected with the Italian Continuous Labour Force Survey, which is cross-sectional and quarterly, with a 2-2-2 rotating design. The questionnaire allows us to use multiple indicators of labour force conditions for each quarter: two collected in the first interview, and a third collected one year later. Our approach provides a method to estimate labour market mobility, taking into account correlated errors and the rotating design of the survey. The best-fitting model is a mixed latent class Markov model with covariates affecting latent transitions and correlated errors among indicators; the mixture components are of mover-stayer type. The better fit of the mixture specification is due to more accurately estimated latent transitions.

Release date: 2017-06-22
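
The basic reason classification error inflates estimated flows can be shown with a two-state toy correction. This is a hypothetical simplification (the paper uses latent class Markov models with correlated errors, not a known error matrix); the error rates below are made up.

```python
# Correcting an observed transition proportion for known classification error
# by inverting a 2x2 misclassification relation.

def correct_2x2(p_obs, e01, e10):
    """e01: P(observed 1 | true 0); e10: P(observed 0 | true 1).
    p_obs is the observed proportion in state 1; returns the true proportion,
    from p_obs = (1 - e10) * p_true + e01 * (1 - p_true)."""
    return (p_obs - e01) / (1 - e01 - e10)

# a 10% observed flow with 3% classification error in each direction
print(round(correct_2x2(0.10, 0.03, 0.03), 4))  # 0.0745
```

Even small error rates shrink the 10% observed flow to roughly 7.5%, which is why uncorrected gross flows overstate labour market mobility.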

• Articles and reports: 82-003-X201601214687
Description:

This study describes record linkage of the Canadian Community Health Survey and the Canadian Mortality Database. The article explains the record linkage process and presents results about associations between health behaviours and mortality among a representative sample of Canadians.

Release date: 2016-12-21

• Articles and reports: 12-001-X201600214662
Description:

Two-phase sampling designs are often used in surveys when the sampling frame contains little or no auxiliary information. In this note, we shed some light on the concept of invariance, which is often mentioned in the context of two-phase sampling designs. We define two types of invariant two-phase designs: strongly invariant and weakly invariant two-phase designs. Some examples are given. Finally, we describe the implications of strong and weak invariance from an inference point of view.

Release date: 2016-12-20
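
Under invariance, the overall inclusion probability of a second-phase unit factorizes into the two phase probabilities, which yields the double-expansion estimator. The sketch below is a hypothetical toy illustration with made-up probabilities.

```python
# Double-expansion estimator of a population total from a two-phase sample:
# each second-phase unit is weighted by 1 / (pi1_i * pi2_i).

def double_expansion_total(y, pi1, pi2):
    """y, pi1, pi2: values and phase-1/phase-2 inclusion probabilities
    for the units retained at the second phase."""
    return sum(yi / (p1 * p2) for yi, p1, p2 in zip(y, pi1, pi2))

y   = [4.0, 6.0, 5.0]
pi1 = [0.5, 0.5, 0.25]
pi2 = [0.5, 0.5, 0.5]
print(double_expansion_total(y, pi1, pi2))  # 4/0.25 + 6/0.25 + 5/0.125 = 80.0
```

The invariance conditions the note defines concern whether the second-phase design (the pi2 values) may depend on the realized first-phase sample, which matters for the validity of this factorization.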

• Articles and reports: 12-001-X201600114546
Description:

Adjusting the base weights using weighting classes is a standard approach for dealing with unit nonresponse. A common approach is to create nonresponse adjustments that are weighted by the inverse of the assumed response propensity of respondents within weighting classes under a quasi-randomization approach. Little and Vartivarian (2003) questioned the value of weighting the adjustment factor. In practice the assumed models are misspecified, so it is critical to understand the impact that weighting might have in this case. This paper describes the effects on nonresponse-adjusted estimates of means and totals, for the population and for domains, computed using the weighted and unweighted inverse of the response propensities in stratified simple random sample designs. The performance of these estimators under different conditions, such as different sample allocations, response mechanisms, and population structures, is evaluated. The findings show that for the scenarios considered the weighted adjustment has substantial advantages for estimating totals, and that using an unweighted adjustment may lead to serious biases except in very limited cases. Furthermore, unlike the unweighted estimates, the weighted estimates are not sensitive to how the sample is allocated.

Release date: 2016-06-22
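
The two adjustment factors being compared can be computed side by side for a single weighting class. This is a hypothetical illustration with made-up weights and response outcomes.

```python
# Weighted vs. unweighted nonresponse adjustment within one weighting class.
# Each tuple is (design_weight, responded).

cell = [(10.0, True), (20.0, False), (30.0, True), (40.0, False)]

d_all = sum(w for w, _ in cell)            # total design weight in the class
d_resp = sum(w for w, r in cell if r)      # design weight of respondents
n_all = len(cell)
n_resp = sum(1 for _, r in cell if r)

weighted_adj = d_all / d_resp      # inverse of weighted response propensity
unweighted_adj = n_all / n_resp    # inverse of unweighted response rate

print(weighted_adj, unweighted_adj)  # 2.5 2.0
```

When design weights and response behaviour are correlated, as here, the two factors diverge, which is the situation the paper's simulations examine.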

• Articles and reports: 12-001-X201600114539
Description:

Statistical matching is a technique for integrating two or more data sets when information available for matching records for individual participants across data sets is incomplete. Statistical matching can be viewed as a missing data problem where a researcher wants to perform a joint analysis of variables that are never jointly observed. A conditional independence assumption is often used to create imputed data for statistical matching. We consider a general approach to statistical matching using parametric fractional imputation of Kim (2011) to create imputed data under the assumption that the specified model is fully identified. The proposed method does not have a convergent EM sequence if the model is not identified. We also present variance estimators appropriate for the imputation procedure. We explain how the method applies directly to the analysis of data from split questionnaire designs and measurement error models.

Release date: 2016-06-22

• Technical products: 11-522-X201700014745
Description:

In the design of surveys, a number of parameters, such as contact propensities, participation propensities and costs per sample unit, play a decisive role. In ongoing surveys, these survey design parameters are usually estimated from previous experience and updated gradually with new experience. In new surveys, they are estimated from expert opinion and experience with similar surveys. Although survey institutes have considerable expertise and experience, the postulation, estimation and updating of survey design parameters is rarely done in a systematic way. This paper presents a Bayesian framework for including and updating prior knowledge and expert opinion about the parameters. The framework is set in the context of adaptive survey designs, in which different population units may receive different treatment given quality and cost objectives. For this type of survey, the accuracy of design parameters becomes even more crucial to effective design decisions. The framework allows for a Bayesian analysis of the performance of a survey during data collection and between waves of a survey. We demonstrate the Bayesian analysis using a realistic simulation study.

Release date: 2016-03-24
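
For a propensity parameter, the Bayesian updating described above has a particularly simple conjugate form. The sketch below is a hypothetical beta-binomial illustration; the prior and the wave's outcomes are made-up numbers, not taken from the paper.

```python
# Beta-binomial updating of a contact propensity: expert opinion is encoded
# as a Beta prior, and each wave's contact attempts update it.

def update_beta(alpha, beta, contacts, attempts):
    """Conjugate update: posterior is Beta(alpha + successes, beta + failures)."""
    return alpha + contacts, beta + (attempts - contacts)

alpha, beta = 8.0, 2.0            # prior belief: contact propensity around 80%
alpha, beta = update_beta(alpha, beta, contacts=60, attempts=100)
posterior_mean = alpha / (alpha + beta)
print(round(posterior_mean, 3))   # (8 + 60) / (10 + 100) = 0.618
```

The posterior mean sits between the optimistic prior (0.8) and the observed rate (0.6), with the data dominating as attempts accumulate.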

• Technical products: 11-522-X201700014722
Description:

Release date: 2016-03-24

• Technical products: 11-522-X201700014713
Description:

Big data is a term that means different things to different people. To some, it means datasets so large that our traditional processing and analytic systems can no longer accommodate them. To others, it simply means taking advantage of existing datasets of all sizes and finding ways to merge them with the goal of generating new insights. The former view poses a number of important challenges to traditional market, opinion, and social research. In either case, there are implications for the future of surveys that are only beginning to be explored.

Release date: 2016-03-24

• Technical products: 11-522-X201700014749
Description:

Release date: 2016-03-24

• Technical products: 11-522-X201700014725
Description:

Tax data are increasingly being used to measure and analyze the population and its characteristics. One of the issues raised by the growing use of this type of data is the definition of the concept of place of residence. While the census uses the traditional concept of place of residence, tax data provide information based on the mailing address of tax filers. Using record linkage between the census, the National Household Survey and tax data from the T1 Family File, this study examines the level of consistency between the place of residence in these two sources and its associated characteristics.

Release date: 2016-03-24

• Articles and reports: 82-003-X201600314338
Description:

This paper describes the methods and data used in the development and implementation of the POHEM-Neurological meta-model.

Release date: 2016-03-16

• Articles and reports: 82-003-X201600114307
Description:

Using the 2012 Aboriginal Peoples Survey, this study examined the psychometric properties of the 10-item Kessler Psychological Distress Scale (a short measure of non-specific psychological distress) for First Nations people living off reserve, Métis, and Inuit aged 15 or older.

Release date: 2016-01-20

• Articles and reports: 82-003-X201600114306
Description:

Release date: 2016-01-20

• Articles and reports: 12-001-X201500114199
Description:

In business surveys, it is not unusual to collect economic variables for which the distribution is highly skewed. In this context, winsorization is often used to treat the problem of influential values. This technique requires the determination of a constant that corresponds to the threshold above which large values are reduced. In this paper, we consider a method of determining the constant which involves minimizing the largest estimated conditional bias in the sample. In the context of domain estimation, we also propose a method of ensuring consistency between the domain-level winsorized estimates and the population-level winsorized estimate. The results of two simulation studies suggest that the proposed methods lead to winsorized estimators that have good bias and relative efficiency properties.

Release date: 2015-06-29
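
The winsorization treatment itself is simple once the threshold is chosen. The sketch below is a hypothetical illustration with made-up values; the paper's contribution is the method for choosing the constant K, which is not shown here.

```python
# One-sided winsorization of a skewed business variable at threshold K:
# values above K are reduced to K to limit the impact of influential units.

def winsorize(values, K):
    """Reduce values above the threshold K down to K."""
    return [min(v, K) for v in values]

revenues = [3.0, 5.0, 4.0, 120.0]        # one influential value
print(sum(winsorize(revenues, K=20.0)))  # 3 + 5 + 4 + 20 = 32.0
```

Lowering K reduces variance but increases bias; the paper picks K by minimizing the largest estimated conditional bias in the sample.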

• Technical products: 11-522-X201300014252
Description:

Although estimating finite population characteristics from probability samples has been very successful for large samples, inferences from non-probability samples may also be possible. Non-probability samples have been criticized due to self-selection bias and the lack of methods for estimating the precision of the estimates. Widespread access to the Web and the ability to do very inexpensive data collection on the Web have reinvigorated interest in this topic. We review non-probability sampling strategies and summarize some of the key issues. We then propose conditions under which non-probability sampling may be a reasonable approach. We conclude with ideas for future research.

Release date: 2014-10-31
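
One common adjustment for non-probability samples is post-stratification to known population cell counts. The sketch below is a hypothetical, simplified illustration; the cells, counts, and values are made up, and real adjustments typically use richer covariates.

```python
from collections import defaultdict

# Post-stratify a non-probability (e.g., web opt-in) sample to known
# population cell counts, weighting each cell's sample mean by its
# population share.

def poststratify_mean(sample, pop_counts):
    """sample: list of (cell, y); pop_counts: {cell: population size}."""
    tot, n = defaultdict(float), defaultdict(int)
    for cell, y in sample:
        tot[cell] += y
        n[cell] += 1
    N = sum(pop_counts.values())
    return sum(pop_counts[c] / N * (tot[c] / n[c]) for c in pop_counts)

sample = [("young", 2.0), ("young", 4.0), ("old", 10.0)]
pop = {"young": 500, "old": 500}       # known population cell sizes
print(poststratify_mean(sample, pop))  # 0.5 * 3.0 + 0.5 * 10.0 = 6.5
```

The adjustment removes bias only to the extent that self-selection operates through the post-stratification cells, which is one of the conditions the paper discusses.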

• Technical products: 11-522-X201300014253
Description:

New developments in computer technology, as well as new challenges in society such as increasing nonresponse rates and decreasing budgets, may lead to changes in survey methodology for official statistics. Web panels have become very popular in the world of market research, which raises the question of whether such panels can also be used for official statistics. Can they produce high-quality statistics about the general population? This paper attempts to answer this question by exploring methodological aspects such as undercoverage, sample selection, and nonresponse. Statistics Netherlands carried out a test with a web panel; some results are described.

Release date: 2014-10-31

• Technical products: 11-522-X201300014259
Description:

In an effort to reduce response burden on farm operators, Statistics Canada is studying alternative approaches to telephone surveys for producing field crop estimates. One option is to publish harvested area and yield estimates in September as is currently done, but to calculate them using models based on satellite and weather data, and data from the July telephone survey. However before adopting such an approach, a method must be found which produces estimates with a sufficient level of accuracy. Research is taking place to investigate different possibilities. Initial research results and issues to consider are discussed in this paper.

Release date: 2014-10-31
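
The model-based alternative amounts to regressing yield on remotely sensed predictors. The sketch below is a hypothetical, heavily simplified one-predictor version; the index values and yields are made up, and the actual research combines satellite, weather, and July survey data.

```python
# Simple least-squares regression of crop yield on a satellite vegetation
# index, as a toy stand-in for a model-based yield estimate.

def ols_fit(x, y):
    """Intercept and slope of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

# July vegetation-index readings and realized yields (illustrative numbers)
ndvi = [0.2, 0.4, 0.6, 0.8]
yields = [1.0, 2.0, 3.0, 4.0]
a, b = ols_fit(ndvi, yields)
print(round(a + b * 0.5, 2))   # predicted yield at index 0.5 -> 2.5
```

The open question the paper raises is whether such model predictions reach the accuracy of the current survey-based September estimates.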

• Technical products: 11-522-X201300014278
Description:

In January and February 2014, Statistics Canada conducted a test aimed at measuring the effectiveness of different collection strategies using an online self-reporting survey. Sampled units were contacted using mailed introductory letters and asked to complete the online survey without any interviewer contact. The objectives of this test were to measure the take-up rates for completing an online survey and to profile the respondents and non-respondents. Different samples and letters were tested to determine the relative effectiveness of the different approaches. The results of this project will be used to inform various social surveys that are preparing to include an internet response option. The paper presents the general methodology of the test as well as the results observed from collection and the analysis of respondent profiles.

Release date: 2014-10-31

• Technical products: 11-522-X201300014255
Description:

The Brazilian Network Information Center (NIC.br) has designed and carried out a pilot project to collect data from the Web in order to produce statistics about the characteristics of webpages. Studies of the characteristics and dimensions of the Web require collecting and analyzing information from a dynamic and complex environment. The core idea was to collect data from a sample of webpages automatically, using software known as a web crawler. The motivation for this paper is to disseminate the methods and results of this study, as well as to show current developments related to sampling techniques in a dynamic environment.

Release date: 2014-10-31

• Technical products: 11-522-X201300014291
Description:

Occupational coding in Germany is mostly done using dictionary approaches, with subsequent manual revision of cases that could not be coded. Since manual coding is expensive, it is desirable to assign a higher number of codes automatically. At the same time, the quality of the automatic coding must at least match that of the manual coding. As a possible solution, we employ different machine learning algorithms for the task, using a substantial number of manually coded occupations available from recent studies as training data. We assess the feasibility of these methods by evaluating the performance and quality of the algorithms.

Release date: 2014-10-31
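
A minimal text classifier makes the machine learning approach concrete. The sketch below is a hypothetical naive Bayes coder on invented titles and codes; it is not one of the algorithms evaluated in the paper, and real coders train on thousands of manually coded occupations.

```python
from collections import defaultdict
import math

# Tiny naive Bayes classifier for occupation titles.

def train(examples):
    """examples: list of (title, code). Returns word counts per code."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for title, code in examples:
        for word in title.lower().split():
            counts[code][word] += 1
            totals[code] += 1
    return counts, totals

def classify(title, counts, totals, vocab_size=100):
    """Pick the code maximizing the Laplace-smoothed log-likelihood."""
    best, best_score = None, -math.inf
    for code in counts:
        score = 0.0
        for word in title.lower().split():
            p = (counts[code][word] + 1) / (totals[code] + vocab_size)
            score += math.log(p)
        if score > best_score:
            best, best_score = code, score
    return best

data = [("software developer", "2512"), ("web developer", "2512"),
        ("truck driver", "8332"), ("bus driver", "8332")]
counts, totals = train(data)
print(classify("java developer", counts, totals))  # 2512
```

The quality question the paper raises then becomes: on held-out manually coded cases, does the classifier's accuracy match that of manual coding?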

• Articles and reports: 12-001-X201300211883
Description:

The history of survey sampling, dating from the writings of A.N. Kiaer, has been remarkably controversial. First, Kiaer himself had to struggle to convince his contemporaries that survey sampling was a legitimate procedure. He spent several decades in the attempt, and was an old man before survey sampling became a reputable activity. The first person to provide both a theoretical justification of survey sampling (in 1906) and a practical demonstration of its feasibility (in a survey conducted in Reading and published in 1912) was A.L. Bowley. In 1925, the ISI meeting in Rome adopted a resolution accepting the use of both randomization and purposive sampling; Bowley used both. However, the next two decades saw a steady tendency for randomization to become mandatory. In 1934, Jerzy Neyman used the relatively recent failure of a large purposive survey to ensure that subsequent sample surveys would employ random sampling only. He found apt pupils in M.H. Hansen, W.N. Hurwitz and W.G. Madow, who together published a definitive sampling textbook in 1953. This went effectively unchallenged for nearly two decades. In the 1970s, however, R.M. Royall and his coauthors challenged the use of random sampling inference and advocated model-based sampling instead. That in turn gave rise to the third major controversy within little more than a century. The present author, however, along with several others, believes that both design-based and model-based inference have a useful part to play.

Release date: 2014-01-15

• Articles and reports: 12-001-X201300211884
Description:

This paper offers a solution to the problem of finding the optimal stratification of the available population frame, so as to minimize the cost of the sample required to satisfy precision constraints on a set of different target estimates. The solution is sought by exploring the universe of all possible stratifications obtainable by cross-classifying the categorical auxiliary variables available in the frame (continuous auxiliary variables can be transformed into categorical ones by means of suitable methods). The approach is therefore multivariate with respect to both target and auxiliary variables. The proposed algorithm is based on a non-deterministic evolutionary approach, making use of the genetic algorithm paradigm. The key feature of the algorithm is that it treats each possible stratification as an individual subject to evolution, whose fitness is given by the cost of the associated sample required to satisfy a set of precision constraints, the cost being calculated by applying the Bethel algorithm for multivariate allocation. This optimal stratification algorithm, implemented in an R package (SamplingStrata), has so far been applied to a number of current surveys in the Italian National Institute of Statistics: the results always show significant improvements in the efficiency of the samples obtained, with respect to previously adopted stratifications.

Release date: 2014-01-15
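
The fitness evaluation at the heart of such a search can be sketched for a single target variable. The example below is a hypothetical simplification: it uses univariate Neyman allocation in place of the Bethel multivariate algorithm, and the strata sizes, means, and standard deviations are made up.

```python
# Fitness of a candidate stratification: the sample size needed to hit a
# target CV of the estimated total under Neyman allocation (no fpc).

def neyman_sample_size(strata, cv_target=0.05):
    """strata: list of (N_h, mean_h, sd_h) for one candidate stratification."""
    Y = sum(N * m for N, m, _ in strata)    # population total
    A = sum(N * s for N, _, s in strata)    # sum over strata of N_h * S_h
    return A ** 2 / (cv_target * Y) ** 2

# two candidate stratifications of the same 1,000-unit frame
coarse = [(1000, 10.0, 8.0)]                 # one big heterogeneous stratum
fine = [(600, 5.0, 1.0), (400, 17.5, 1.0)]   # split into homogeneous groups
print(round(neyman_sample_size(coarse)), round(neyman_sample_size(fine)))
```

A genetic algorithm evolves stratifications toward lower values of exactly this kind of fitness; here the finer split needs a far smaller sample for the same precision.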


## Analysis (120) (25 of 120 results)

• Articles and reports: 12-001-X201700254871
Description:

In this paper the question is addressed how alternative data sources, such as administrative and social media data, can be used in the production of official statistics. Since most surveys at national statistical institutes are conducted repeatedly over time, a multivariate structural time series modelling approach is proposed to model the series observed by a repeated surveys with related series obtained from such alternative data sources. Generally, this improves the precision of the direct survey estimates by using sample information observed in preceding periods and information from related auxiliary series. This model also makes it possible to utilize the higher frequency of the social media to produce more precise estimates for the sample survey in real time at the moment that statistics for the social media become available but the sample data are not yet available. The concept of cointegration is applied to address the question to which extent the alternative series represent the same phenomena as the series observed with the repeated survey. The methodology is applied to the Dutch Consumer Confidence Survey and a sentiment index derived from social media.

Release date: 2017-12-21

• Articles and reports: 12-001-X201700114823
Description:

The derivation of estimators in a multi-phase calibration process requires a sequential computation of estimators and calibrated weights of previous phases in order to obtain those of later ones. Already after two phases of calibration the estimators and their variances involve calibration factors from both phases and the formulae become cumbersome and uninformative. As a consequence the literature so far deals mainly with two phases while three phases or more are rarely being considered. The analysis in some cases is ad-hoc for a specific design and no comprehensive methodology for constructing calibrated estimators, and more challengingly, estimating their variances in three or more phases was formed. We provide a closed form formula for the variance of multi-phase calibrated estimators that holds for any number of phases. By specifying a new presentation of multi-phase calibrated weights it is possible to construct calibrated estimators that have the form of multi-variate regression estimators which enables a computation of a consistent estimator for their variance. This new variance estimator is not only general for any number of phases but also has some favorable characteristics. A comparison to other estimators in the special case of two-phase calibration and another independent study for three phases are presented.

Release date: 2017-06-22

• Articles and reports: 12-001-X201700114819
Description:

Structural time series models are a powerful technique for variance reduction in the framework of small area estimation (SAE) based on repeatedly conducted surveys. Statistics Netherlands implemented a structural time series model to produce monthly figures about the labour force with the Dutch Labour Force Survey (DLFS). Such models, however, contain unknown hyperparameters that have to be estimated before the Kalman filter can be launched to estimate state variables of the model. This paper describes a simulation aimed at studying the properties of hyperparameter estimators in the model. Simulating distributions of the hyperparameter estimators under different model specifications complements standard model diagnostics for state space models. Uncertainty around the model hyperparameters is another major issue. To account for hyperparameter uncertainty in the mean squared errors (MSE) estimates of the DLFS, several estimation approaches known in the literature are considered in a simulation. Apart from the MSE bias comparison, this paper also provides insight into the variances and MSEs of the MSE estimators considered.

Release date: 2017-06-22

• Articles and reports: 12-001-X201700114820
Description:

Measurement errors can induce bias in the estimation of transitions, leading to erroneous conclusions about labour market dynamics. Traditional literature on gross flows estimation is based on the assumption that measurement errors are uncorrelated over time. This assumption is not realistic in many contexts, because of survey design and data collection strategies. In this work, we use a model-based approach to correct observed gross flows from classification errors with latent class Markov models. We refer to data collected with the Italian Continuous Labour Force Survey, which is cross-sectional, quarterly, with a 2-2-2 rotating design. The questionnaire allows us to use multiple indicators of labour force conditions for each quarter: two collected in the first interview, and a third collected one year later. Our approach provides a method to estimate labour market mobility, taking into account correlated errors and the rotating design of the survey. The best-fitting model is a mixed latent class Markov model with covariates affecting latent transitions and correlated errors among indicators; the mixture components are of mover-stayer type. The better fit of the mixture specification is due to more accurately estimated latent transitions.

Release date: 2017-06-22

• Articles and reports: 82-003-X201601214687
Description:

This study describes record linkage of the Canadian Community Health Survey and the Canadian Mortality Database. The article explains the record linkage process and presents results about associations between health behaviours and mortality among a representative sample of Canadians.

Release date: 2016-12-21

• Articles and reports: 12-001-X201600214662
Description:

Two-phase sampling designs are often used in surveys when the sampling frame contains little or no auxiliary information. In this note, we shed some light on the concept of invariance, which is often mentioned in the context of two-phase sampling designs. We define two types of invariant two-phase designs: strongly invariant and weakly invariant two-phase designs. Some examples are given. Finally, we describe the implications of strong and weak invariance from an inference point of view.

Release date: 2016-12-20

• Articles and reports: 12-001-X201600114546
Description:

Adjusting the base weights using weighting classes is a standard approach for dealing with unit nonresponse. A common approach is to create nonresponse adjustments that are weighted by the inverse of the assumed response propensity of respondents within weighting classes under a quasi-randomization approach. Little and Vartivarian (2003) questioned the value of weighting the adjustment factor. In practice the models assumed are misspecified, so it is critical to understand the impact of weighting might have in this case. This paper describes the effects on nonresponse adjusted estimates of means and totals for population and domains computed using the weighted and unweighted inverse of the response propensities in stratified simple random sample designs. The performance of these estimators under different conditions such as different sample allocation, response mechanism, and population structure is evaluated. The findings show that for the scenarios considered the weighted adjustment has substantial advantages for estimating totals and using an unweighted adjustment may lead to serious biases except in very limited cases. Furthermore, unlike the unweighted estimates, the weighted estimates are not sensitive to how the sample is allocated.

Release date: 2016-06-22

• Articles and reports: 12-001-X201600114539
Description:

Statistical matching is a technique for integrating two or more data sets when information available for matching records for individual participants across data sets is incomplete. Statistical matching can be viewed as a missing data problem where a researcher wants to perform a joint analysis of variables that are never jointly observed. A conditional independence assumption is often used to create imputed data for statistical matching. We consider a general approach to statistical matching using parametric fractional imputation of Kim (2011) to create imputed data under the assumption that the specified model is fully identified. The proposed method does not have a convergent EM sequence if the model is not identified. We also present variance estimators appropriate for the imputation procedure. We explain how the method applies directly to the analysis of data from split questionnaire designs and measurement error models.

Release date: 2016-06-22

• Articles and reports: 82-003-X201600314338
Description:

This paper describes the methods and data used in the development and implementation of the POHEM-Neurological meta-model.

Release date: 2016-03-16

• Articles and reports: 82-003-X201600114307
Description:

Using the 2012 Aboriginal Peoples Survey, this study examined the psychometric properties of the 10-item Kessler Psychological Distress Scale (a short measure of non-specific psychological distress) for First Nations people living off reserve, Métis, and Inuit aged 15 or older.

Release date: 2016-01-20

• Articles and reports: 82-003-X201600114306
Description:

Release date: 2016-01-20

• Articles and reports: 12-001-X201500114199
Description:

In business surveys, it is not unusual to collect economic variables for which the distribution is highly skewed. In this context, winsorization is often used to treat the problem of influential values. This technique requires the determination of a constant that corresponds to the threshold above which large values are reduced. In this paper, we consider a method of determining the constant which involves minimizing the largest estimated conditional bias in the sample. In the context of domain estimation, we also propose a method of ensuring consistency between the domain-level winsorized estimates and the population-level winsorized estimate. The results of two simulation studies suggest that the proposed methods lead to winsorized estimators that have good bias and relative efficiency properties.

Release date: 2015-06-29

• Articles and reports: 12-001-X201300211883
Description:

The history of survey sampling, dating from the writings of A.N. Kiaer, has been remarkably controversial. First Kiaer himself had to struggle to convince his contemporaries that survey sampling itself was a legitimate procedure. He spent several decades in the attempt, and was an old man before survey sampling became a reputable activity. The first person to provide both a theoretical justification of survey sampling (in 1906) and a practical demonstration of its feasibility (in a survey conducted in Reading which was published in 1912) was A.L. Bowley. In 1925, the ISI meeting in Rome adopted a resolution giving acceptance to the use of both randomization and purposive sampling. Bowley used both. However the next two decades saw a steady tendency for randomization to become mandatory. In 1934 Jerzy Neyman used the relatively recent failure of a large purposive survey to ensure that subsequent sample surveys would need to employ random sampling only. He found apt pupils in M.H. Hansen, W.N. Hurwitz and W.G. Madow, who together published a definitive sampling textbook in 1953. This went effectively unchallenged for nearly two decades. In the 1970s, however, R.M. Royall and his coauthors did challenge the use of random sampling inference, and advocated that of model-based sampling instead. That in turn gave rise to the third major controversy within little more than a century. The present author, however, with several others, believes that both design-based and model-based inference have a useful part to play.

Release date: 2014-01-15

• Articles and reports: 12-001-X201300211884
Description:

This paper offers a solution to the problem of finding the optimal stratification of the available population frame, so as to ensure the minimization of the cost of the sample required to satisfy precision constraints on a set of different target estimates. The solution is searched for by exploring the universe of all possible stratifications obtainable by cross-classifying the categorical auxiliary variables available in the frame (continuous auxiliary variables can be transformed into categorical ones by means of suitable methods). The approach is therefore multivariate with respect to both target and auxiliary variables. The proposed algorithm is based on a non-deterministic evolutionary approach, making use of the genetic algorithm paradigm. The key feature of the algorithm is that it considers each possible stratification as an individual subject to evolution, whose fitness is given by the cost of the associated sample required to satisfy a set of precision constraints, the cost being calculated by applying the Bethel algorithm for multivariate allocation. This optimal stratification algorithm, implemented in an R package (SamplingStrata), has so far been applied to a number of current surveys in the Italian National Institute of Statistics: the results always show significant improvements in the efficiency of the obtained samples, with respect to previously adopted stratifications.

Release date: 2014-01-15

• Articles and reports: 82-003-X201300611796
Description:

The study assesses the feasibility of using statistical modelling techniques to fill information gaps related to risk factors, specifically, smoking status, in linked long-form census data.

Release date: 2013-06-19

• Articles and reports: 82-003-X201300111764
Description:

This study compares two sources of information about prescription drug use by people aged 65 or older in Ontario: the Canadian Community Health Survey and the drug claims database of the Ontario Drug Benefit Program. The analysis pertains to cardiovascular and diabetes drugs because they are commonly used, and almost all are prescribed on a regular basis.

Release date: 2013-01-16

• Articles and reports: 82-003-X201200311707
Description:

This study compares waist circumference measured using World Health Organization and National Institutes of Health protocols to determine if the results differ significantly, and whether equations can be developed to allow comparison between waist circumference taken at the two different measurement sites.

Release date: 2012-09-20

• Articles and reports: 12-001-X201100211605
Description:

Composite imputation is often used in business surveys. The term "composite" means that more than a single imputation method is used to impute missing values for a variable of interest. The literature on variance estimation in the presence of composite imputation is rather limited. To deal with this problem, we consider an extension of the methodology developed by Särndal (1992). Our extension is quite general and easy to implement provided that linear imputation methods are used to fill in the missing values. This class of imputation methods contains linear regression imputation, donor imputation and auxiliary value imputation, sometimes called cold-deck or substitution imputation. It thus covers the most common methods used by national statistical agencies for the imputation of missing values. Our methodology has been implemented in the System for the Estimation of Variance due to Nonresponse and Imputation (SEVANI) developed at Statistics Canada. Its performance is evaluated in a simulation study.

Release date: 2011-12-21

• Articles and reports: 82-003-X201100411598
Description:

With longitudinal data, lifetime health status dynamics can be estimated by modeling trajectories. Health status trajectories measured by the Health Utilities Index Mark 3 (HUI3) modeled as a function of age alone and also of age and socio-economic covariates revealed non-normal residuals and variance estimation problems. The possibility of transforming the HUI3 distribution to obtain residuals that approximate a normal distribution was investigated.

Release date: 2011-12-21

• Articles and reports: 12-001-X201100111447
Description:

This paper introduces an R package for the stratification of a survey population using a univariate stratification variable X and for the calculation of stratum sample sizes. Non-iterative methods, such as the cumulative root frequency method and geometric stratum boundaries, are implemented. Optimal designs, with stratum boundaries that minimize either the CV of the simple expansion estimator for a fixed sample size n, or the n value for a fixed CV, can be constructed. Two iterative algorithms are available for finding the optimal stratum boundaries. The design can feature a user-defined certainty stratum where all the units are sampled. Take-all and take-none strata can be included in the stratified design, as they might lead to smaller sample sizes. The sample size calculations are based on the anticipated moments of the survey variable Y, given the stratification variable X. The package handles conditional distributions of Y given X that are either a heteroscedastic linear model or a log-linear model. Stratum-specific non-response can be accounted for in the design construction and in the sample size calculations.
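For readers unfamiliar with the cumulative root frequency method mentioned above, a minimal sketch of the Dalenius-Hodges rule follows; the function name and toy data are illustrative, not taken from the package.

```python
import math

def cum_root_freq_boundaries(freqs, bin_uppers, n_strata):
    """Dalenius-Hodges cumulative root frequency rule: cumulate the
    square roots of the bin frequencies of X and place stratum
    boundaries at equal intervals of the cumulative total."""
    cum, total = [], 0.0
    for f in freqs:
        total += math.sqrt(f)
        cum.append(total)
    step = total / n_strata
    boundaries, k = [], 1
    for c, upper in zip(cum, bin_uppers):
        if c >= k * step and k < n_strata:
            boundaries.append(upper)  # boundary at this bin's upper end
            k += 1
    return boundaries

# four equal-frequency bins, two strata: the boundary falls after bin 2
print(cum_root_freq_boundaries([1, 1, 1, 1], [10, 20, 30, 40], 2))  # [20]
```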

Release date: 2011-06-29

• Articles and reports: 12-001-X201100111444
Description:

Release date: 2011-06-29

• Articles and reports: 12-001-X201100111443
Description:

Dual frame telephone surveys are becoming common in the U.S. because of the incompleteness of the landline frame as people transition to cell phones. This article examines nonsampling errors in dual frame telephone surveys. Even though nonsampling errors are ignored in much of the dual frame literature, we find that under some conditions substantial biases may arise in dual frame telephone surveys due to these errors. We specifically explore biases due to nonresponse and measurement error in these telephone surveys. To reduce the bias resulting from these errors, we propose dual frame sampling and weighting methods. The compositing factor for combining the estimates from the two frames is shown to play an important role in reducing nonresponse bias.
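The compositing factor discussed above enters a Hartley-type dual frame estimator roughly as follows; this is a sketch under simplified assumptions (the domain totals and names are invented), not the weighting method the paper proposes.

```python
def dual_frame_composite(t_a_only, t_b_only, t_ab_from_A, t_ab_from_B, lam):
    """Hartley-type composite estimator for a dual frame design: the
    overlap domain ab is estimated from both frame A (e.g. landline)
    and frame B (e.g. cell phone), and the two estimates are blended
    with compositing factor lam in [0, 1]."""
    return t_a_only + t_b_only + lam * t_ab_from_A + (1 - lam) * t_ab_from_B

# equal weighting of the two overlap-domain estimates
print(dual_frame_composite(100, 50, 200, 220, lam=0.5))  # 360.0
```

Choosing lam to account for the differing nonresponse and measurement error in each frame is what gives the compositing factor its bias-reduction role.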

Release date: 2011-06-29

• Articles and reports: 12-001-X201000211384
Description:

The current economic downturn in the US could challenge costly strategies in survey operations. In the Behavioral Risk Factor Surveillance System (BRFSS), ending the monthly data collection at 31 days could be a less costly alternative. However, this could exclude a portion of interviews completed after 31 days (late responders), whose characteristics could differ in many respects from those of respondents who completed the survey within 31 days (early responders). We examined whether there are differences between early and late responders in demographics, health-care coverage, general health status, health risk behaviors, and chronic disease conditions or illnesses. We used 2007 BRFSS data, in which a representative sample of the noninstitutionalized adult U.S. population was selected using a random digit dialing method. Late responders were significantly more likely to be male; to report race/ethnicity as Hispanic; to have annual income higher than \$50,000; to be younger than 45 years of age; to have less than high school education; to have health-care coverage; and to report good health; they were significantly less likely to report hypertension, diabetes, or being obese. The observed differences between early and late responders are unlikely to influence national and state-level estimates. As the proportion of late responders may increase in the future, its impact on surveillance estimates should be examined before they are excluded from analysis. Analyses of late responders alone should combine several years of data to produce reliable estimates.

Release date: 2010-12-21

• Articles and reports: 12-001-X201000211382
Description:

The size of the cell-phone-only population in the USA has increased rapidly in recent years and, correspondingly, researchers have begun to experiment with sampling and interviewing of cell-phone subscribers. We discuss statistical issues involved in the sampling design and estimation phases of cell-phone studies. This work is presented primarily in the context of a nonoverlapping dual-frame survey in which one frame and sample are employed for the landline population and a second frame and sample are employed for the cell-phone-only population. Additional considerations necessary for overlapping dual-frame surveys (where the cell-phone frame and sample include some of the landline population) are also discussed. We illustrate the methods using the design of the National Immunization Survey (NIS), which monitors the vaccination rates of children age 19-35 months and teens age 13-17 years. The NIS is a nationwide telephone survey, followed by a provider record check, conducted by the Centers for Disease Control and Prevention.

Release date: 2010-12-21

• Articles and reports: 12-001-X201000111245
Description:

Knowledge of the causes of measurement errors in business surveys is limited, even though such errors may compromise the accuracy of the micro data and economic indicators derived from them. This article, based on an empirical study with a focus from the business perspective, presents new research findings on the response process in business surveys. It proposes the Multidimensional Integral Business Survey Response (MIBSR) model as a tool for investigating the response process and explaining its outcomes, and as the foundation of any strategy dedicated to reducing and preventing measurement errors.

Release date: 2010-06-29

## Reference (104) (25 of 104 results)

• Technical products: 11-522-X201700014745
Description:

In the design of surveys, a number of parameters, like contact propensities, participation propensities and costs per sample unit, play a decisive role. In ongoing surveys, these survey design parameters are usually estimated from previous experience and updated gradually with new experience. In new surveys, they are estimated from expert opinion and experience with similar surveys. Although survey institutes have considerable expertise and experience, the postulation, estimation and updating of survey design parameters is rarely done in a systematic way. This paper presents a Bayesian framework for including and updating prior knowledge and expert opinion about the parameters. This framework is set in the context of adaptive survey designs, in which different population units may receive different treatment given quality and cost objectives. For this type of survey, the accuracy of design parameters becomes even more crucial to effective design decisions. The framework allows for a Bayesian analysis of the performance of a survey during data collection and in between waves of a survey. We demonstrate the Bayesian analysis using a realistic simulation study.
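As an illustration of the kind of updating such a framework formalizes, a conjugate Beta-Binomial update of a contact propensity can be sketched as follows; the prior values and counts are invented for the example, and the paper's framework is of course richer than this.

```python
def update_contact_propensity(prior_a, prior_b, contacts, attempts):
    """Combine a Beta(prior_a, prior_b) prior on the contact propensity
    (e.g. elicited from expert opinion or a similar survey) with observed
    contact outcomes; returns the posterior parameters and mean."""
    post_a = prior_a + contacts
    post_b = prior_b + (attempts - contacts)
    return post_a, post_b, post_a / (post_a + post_b)

# prior belief centred at 0.6 (Beta(6, 4)); 70 contacts in 100 attempts
a, b, mean = update_contact_propensity(6, 4, 70, 100)
print(round(mean, 3))  # 0.691
```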

Release date: 2016-03-24

• Technical products: 11-522-X201700014722
Description:

Release date: 2016-03-24

• Technical products: 11-522-X201700014713
Description:

Big data is a term that means different things to different people. To some, it means datasets so large that our traditional processing and analytic systems can no longer accommodate them. To others, it simply means taking advantage of existing datasets of all sizes and finding ways to merge them with the goal of generating new insights. The former view poses a number of important challenges to traditional market, opinion, and social research. In either case, there are implications for the future of surveys that are only beginning to be explored.

Release date: 2016-03-24

• Technical products: 11-522-X201700014749
Description:

Release date: 2016-03-24

• Technical products: 11-522-X201700014725
Description:

Tax data are being used more and more to measure and analyze the population and its characteristics. One of the issues raised by the growing use of these types of data relates to the definition of the concept of place of residence. While the census uses the traditional concept of place of residence, tax data provide information based on the mailing address of tax filers. Using record linkage between the census, the National Household Survey and tax data from the T1 Family File, this study examines the consistency of place of residence between these two sources and its associated characteristics.

Release date: 2016-03-24

• Technical products: 11-522-X201300014252
Description:

Although estimating finite population characteristics from probability samples has been very successful for large samples, inferences from non-probability samples may also be possible. Non-probability samples have been criticized due to self-selection bias and the lack of methods for estimating the precision of the estimates. Widespread access to the Web and the ability to do very inexpensive data collection on it have reinvigorated interest in this topic. We review non-probability sampling strategies and summarize some of the key issues. We then propose conditions under which non-probability sampling may be a reasonable approach. We conclude with ideas for future research.

Release date: 2014-10-31

• Technical products: 11-522-X201300014253
Description:

New developments in computer technology, as well as new challenges in society such as increasing nonresponse rates and decreasing budgets, may lead to changes in survey methodology for official statistics. Nowadays, web panels have become very popular in the world of market research. This raises the question of whether such panels can also be used for official statistics. Can they produce high-quality statistics about the general population? This paper attempts to answer this question by exploring methodological aspects like under-coverage, sample selection, and nonresponse. Statistics Netherlands carried out a test with a web panel. Some results are described.

Release date: 2014-10-31

• Technical products: 11-522-X201300014259
Description:

In an effort to reduce response burden on farm operators, Statistics Canada is studying alternative approaches to telephone surveys for producing field crop estimates. One option is to publish harvested area and yield estimates in September as is currently done, but to calculate them using models based on satellite and weather data, and data from the July telephone survey. However before adopting such an approach, a method must be found which produces estimates with a sufficient level of accuracy. Research is taking place to investigate different possibilities. Initial research results and issues to consider are discussed in this paper.

Release date: 2014-10-31

• Technical products: 11-522-X201300014278
Description:

In January and February 2014, Statistics Canada conducted a test aiming at measuring the effectiveness of different collection strategies using an online self-reporting survey. Sampled units were contacted using mailed introductory letters and asked to complete the online survey without any interviewer contact. The objectives of this test were to measure the take-up rates for completing an online survey, and to profile the respondents/non-respondents. Different samples and letters were tested to determine the relative effectiveness of the different approaches. The results of this project will be used to inform various social surveys that are preparing to include an internet response option in their surveys. The paper will present the general methodology of the test as well as results observed from collection and the analysis of profiles.

Release date: 2014-10-31

• Technical products: 11-522-X201300014255
Description:

The Brazilian Network Information Center (NIC.br) has designed and carried out a pilot project to collect data from the Web in order to produce statistics about the characteristics of webpages. Studies on the characteristics and dimensions of the web require collecting and analyzing information from a dynamic and complex environment. The core idea was to collect data from a sample of webpages automatically, using software known as a web crawler. The motivation for this paper is to disseminate the methods and results of this study, as well as to show current developments related to sampling techniques in a dynamic environment.

Release date: 2014-10-31

• Technical products: 11-522-X201300014291
Description:

Occupational coding in Germany is mostly done using dictionary approaches with subsequent manual revision of cases that could not be coded. Since manual coding is expensive, it is desirable to assign a higher number of codes automatically. At the same time, the quality of the automatic coding must at least reach that of the manual coding. As a possible solution, we employ different machine learning algorithms for the task, using a substantial amount of manually coded occupations available from recent studies as training data. We assess the feasibility of these methods by evaluating the performance and quality of the algorithms.

Release date: 2014-10-31

• Technical products: 11-522-X200800010981
Description:

One of the main characteristics of the 2001 Spanish Census of the Population was the use of an administrative Register of Population (El Padrón) for pre-printing the questionnaires and the enumerators' record books of the census sections. In this paper we present the main characteristics of the relationship between the Population Register and the Census of Population, and the main changes foreseen for the next Census, which will take place in 2011.

Release date: 2009-12-03

• Technical products: 11-522-X200800011000
Description:

The present report reviews the results of a mailing experiment that took place within a large scale demonstration project. A postcard and stickers were sent to a random group of project participants in the period between a contact call and a survey. The researchers hypothesized that, because of the additional mailing (the treatment), the response rates to the upcoming survey would increase. There was, however, no difference between the response rates of the treatment group that received the additional mailing and the control group. In the specific circumstances of the mailing experiment, sending project participants a postcard and stickers as a reminder of the upcoming survey and of their participation in the pilot project was not an efficient way to increase response rates.

Release date: 2009-12-03

• Technical products: 11-522-X200800010955
Description:

Survey managers are still discovering the usefulness of digital audio recording for monitoring and managing field staff. Its value so far has been for confirming the authenticity of interviews, detecting curbstoning, offering a concrete basis for feedback on interviewing performance and giving data collection managers an intimate view of in-person interviews. In addition, computer audio-recorded interviewing (CARI) can improve other aspects of survey data quality, offering corroboration or correction of response coding by field staff. Audio recordings may replace or supplement in-field verbatim transcription of free responses, and speech-to-text technology might make this technique more efficient in the future.

Release date: 2009-12-03

• Technical products: 11-522-X200800010996
Description:

In recent years, the use of paradata has become increasingly important to the management of collection activities at Statistics Canada. Particular attention has been paid to social surveys conducted over the phone, like the Survey of Labour and Income Dynamics (SLID). For recent SLID data collections, the number of call attempts was capped at 40 calls. Investigations of the SLID Blaise Transaction History (BTH) files were undertaken to assess the impact of the cap on calls. The purpose of the first study was to inform decisions on the capping of call attempts; the second study focused on the nature of nonresponse given the limit of 40 attempts.

The use of paradata as auxiliary information for studying and accounting for survey nonresponse was also examined. Nonresponse adjustment models using different paradata variables gathered at the collection stage were compared to the current models based on available auxiliary information from the Labour Force Survey.

Release date: 2009-12-03

• Technical products: 11-522-X200800011011
Description:

The Federation of Canadian Municipalities' (FCM) Quality of Life Reporting System (QOLRS) is a means by which to measure, monitor, and report on the quality of life in Canadian municipalities. To address the challenge of administrative data collection across member municipalities, the QOLRS technical team collaborated on the development of the Municipal Data Collection Tool (MDCT), which has become a key component of the QOLRS data acquisition methodology. Offered as a case study on administrative data collection, this paper argues that the recent launch of the MDCT has enabled the FCM to access reliable pan-Canadian municipal administrative data for the QOLRS.

Release date: 2009-12-03

• Technical products: 11-522-X200800010989
Description:

At first sight, web surveys seem to be an interesting and attractive means of data collection. They provide simple, cheap and fast access to a large group of people. However, web surveys also suffer from methodological problems. Outcomes of web surveys may be severely biased, particularly if self-selection of respondents is applied instead of proper probability sampling. Under-coverage is also a serious problem. This raises the question of whether web surveys can be used for data collection in official statistics. This paper addresses the problems of under-coverage and self-selection in web surveys, and attempts to describe how Internet data collection can be incorporated into the normal data collection practices of official statistics.

Release date: 2009-12-03

• Technical products: 11-522-X200800011001
Description:

Currently underway, the Québec Population Health Survey (EQSP), for which collection will wrap up in February 2009, provides an opportunity, because of the size of its sample, to assess the impact that sending out introductory letters to respondents has on the response rate in a controlled environment. Since this regional telephone survey is expected to have more than 38,000 respondents, it was possible to use part of its sample for this study without having too great an impact on its overall response rate. In random digit dialling (RDD) surveys such as the EQSP, one of the main challenges in sending out introductory letters is reaching the survey units. Doing so depends largely on our capacity to associate an address with the sample units and on the quality of that information.

This article describes the controlled study proposed by the Institut de la statistique du Québec to measure the effect that sending out introductory letters to respondents had on the survey's response rate.

Release date: 2009-12-03

• Technical products: 11-522-X200800010976
Description:

Many survey organizations use the response rate as an indicator for the quality of survey data. As a consequence, a variety of measures are implemented to reduce non-response or to maintain response at an acceptable level. However, the response rate is not necessarily a good indicator of non-response bias. A higher response rate does not imply smaller non-response bias. What matters is how the composition of the response differs from the composition of the sample as a whole. This paper describes the concept of R-indicators to assess potential differences between the sample and the response. Such indicators may facilitate analysis of survey response over time, between various fieldwork strategies or data collection modes. Some practical examples are given.
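One common formulation of an R-indicator is one minus twice the standard deviation of the estimated response propensities; a minimal sketch follows (the propensities here are invented, and in practice they would be estimated from a model of response on auxiliary variables).

```python
import math

def r_indicator(propensities):
    """R-indicator: R = 1 - 2 * sd(response propensities).
    R = 1 indicates a perfectly balanced response (identical
    propensities); lower values signal a more selective response."""
    n = len(propensities)
    mean = sum(propensities) / n
    sd = math.sqrt(sum((p - mean) ** 2 for p in propensities) / n)
    return 1 - 2 * sd

print(r_indicator([0.5, 0.5, 0.5, 0.5]))  # 1.0, fully balanced response
print(r_indicator([0.2, 0.8]))            # approx. 0.4, selective response
```

Note that both examples can have the same overall response rate (0.5), which is exactly the point: the response rate alone does not reveal the selectivity the R-indicator captures.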

Release date: 2009-12-03

• Technical products: 11-522-X200800010948
Description:

Past survey instruments, whether in the form of a paper questionnaire or telephone script, were their own documentation. Based on this, the ESRC Question Bank was created, providing free-access internet publication of questionnaires, enabling researchers to re-use questions, saving them trouble, whilst improving the comparability of their data with that collected by others. Today however, as survey technology and computer programs have become more sophisticated, accurate comprehension of the latest questionnaires seems more difficult, particularly when each survey team uses its own conventions to document complex items in technical reports. This paper seeks to illustrate these problems and suggest preliminary standards of presentation to be used until the process can be automated.

Release date: 2009-12-03

• Technical products: 11-522-X200800011008
Description:

In one sense, a questionnaire is never complete. Test results, paradata and research findings constantly provide reasons to update and improve the questionnaire. In addition, establishments change over time and questions need to be updated accordingly. In reality, it doesn't always work like this. At Statistics Sweden there are several examples of questionnaires that were designed at one point in time and rarely improved later on. However, we are currently trying to shift the perspective on questionnaire design from a linear to a cyclic one. We are developing a cyclic model in which the questionnaire can be improved continuously in multiple rounds. In this presentation, we will discuss this model and how we work with it.

Release date: 2009-12-03

• Technical products: 11-522-X200800010975
Description:

A major issue in official statistics is the availability of objective measures supporting fact-based decision making. Istat has developed an Information System to assess survey quality. Among other standard quality indicators, nonresponse rates are systematically computed and stored for all surveys. Such a rich information base permits analysis over time and comparisons among surveys. The paper focuses on the analysis of the interrelationships between data collection mode, other survey characteristics and total nonresponse. Particular attention is devoted to the extent to which multi-mode data collection improves response rates.

Release date: 2009-12-03

• Technical products: 11-522-X200800011004
Description:

The issue of reducing the response burden is not new. Statistics Sweden works in different ways to reduce response burden and to decrease the administrative costs of data collection from enterprises and organizations. According to legislation, Statistics Sweden must reduce the response burden for the business community, so this work is a priority. The Government has set a target of decreasing the administrative costs of enterprises by twenty-five percent by 2010, and this goal also applies to data collection for statistical purposes. The goal concerns surveys with a legal obligation to respond, but there are many more surveys whose burden needs to be measured and reduced as well. In order to help measure, analyze and reduce the burden, Statistics Sweden has developed the Register of Data Providers concerning enterprises and organizations (ULR). The purpose of the register is twofold: to measure and analyze the burden on an aggregated level, and to be able to inform each individual enterprise which surveys it is participating in.

Release date: 2009-12-03

• Technical products: 11-522-X200800010978
Description:

Census developers and social researchers are at a critical juncture in determining collection modes of the future. Internet data collection is technically feasible, but the initial investment in hardware and software is costly. Given the great divide in computer knowledge and access, internet data collection is viable for some, but not for all. Therefore, the internet cannot fully replace the existing paper questionnaire, at least not in the near future.

Canada, Australia and New Zealand are pioneers in internet data collection as an option for completing the census. This paper studies four driving forces behind this collection mode: 1) responding to social/public expectations; 2) longer term economic benefits; 3) improved data quality; and 4) improved coverage.

Issues currently being faced are: 1) estimating internet uptake and maximizing benefits without undue risk; 2) designing a questionnaire for multiple modes; 3) producing multiple public communication approaches; and 4) gaining positive public reaction and trust in using the internet.

This paper summarizes the countries' collective thinking and experiences on the benefits and limitation of internet data collection for a census of population and dwellings. It also provides an outline of where countries are heading in terms of internet data collection in the future.

Release date: 2009-12-03

• Technical products: 11-536-X200900110811
Description:

Composite imputation is often used in business surveys. It occurs when several imputation methods are used to impute a single variable of interest. The choice of one method instead of another depends on the availability or not of some auxiliary variables. For instance, ratio imputation could be used to impute a missing value when an auxiliary variable is available and, otherwise, mean imputation could be used.

Although composite imputation is frequent in practice, the literature on variance estimation when composite imputation is used is limited. We consider the general methodology proposed by Särndal et al. (1992), which requires the validity of an imputation model, i.e., a model for the variable being imputed. At first glance, the extension of this methodology to composite imputation seems quite tedious, until we notice that most imputation methods used in practice lead to imputed estimators that are linear in the observed values of the variable of interest. This considerably simplifies the derivation of a variance estimator, even when more than a single imputation method is used. Regarding the estimation of the sampling portion of the total variance, we use a methodology slightly different from the one proposed by Särndal et al. (1992). Our methodology is similar to the sampling variance estimator under multiple imputation with an infinite number of imputations.

This methodology is the central part of version 2.0 of the System for Estimation of Variance due to Nonresponse and Imputation (SEVANI), which is being developed at Statistics Canada. Using SEVANI, we will illustrate our method through an example based on real data.
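The ratio-or-mean composite imputation described above can be sketched in a few lines; this is a toy illustration, not SEVANI's implementation (`None` marks a missing value, and the example assumes at least one respondent with both y and x observed).

```python
def composite_impute(y, x):
    """Composite imputation sketch: ratio imputation when the auxiliary
    value x is available, mean imputation otherwise."""
    observed = [(yi, xi) for yi, xi in zip(y, x) if yi is not None]
    y_mean = sum(yi for yi, _ in observed) / len(observed)
    # ratio R = sum(y) / sum(x) over records where both y and x are observed
    pairs = [(yi, xi) for yi, xi in observed if xi is not None]
    ratio = sum(yi for yi, _ in pairs) / sum(xi for _, xi in pairs)
    return [yi if yi is not None
            else (ratio * xi if xi is not None else y_mean)
            for yi, xi in zip(y, x)]

# second record imputed by ratio (R = 2), last by the respondent mean
print(composite_impute([10, None, 30, None], [5, 10, 15, None]))
```

Because every imputed value is a linear combination of observed y values, the imputed estimator stays linear in those values, which is the property the variance derivation above exploits.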

Release date: 2009-08-11
