# Statistics by subject – Statistical methods


## All (167) (25 of 167 results)

• Articles and reports: 82-003-X201700614829
Description:

POHEM-BMI is a microsimulation tool that includes a model of adult body mass index (BMI) and a model of childhood BMI history. This overview describes the development of BMI prediction models for adults and of childhood BMI history, and compares projected BMI estimates with those from nationally representative survey data to establish validity.

Release date: 2017-06-21

• Articles and reports: 12-001-X201600214662
Description:

Two-phase sampling designs are often used in surveys when the sampling frame contains little or no auxiliary information. In this note, we shed some light on the concept of invariance, which is often mentioned in the context of two-phase sampling designs. We define two types of invariant two-phase designs: strongly invariant and weakly invariant two-phase designs. Some examples are given. Finally, we describe the implications of strong and weak invariance from an inference point of view.

Release date: 2016-12-20

• Articles and reports: 12-001-X201600114543
Description:

The regression estimator is used extensively in practice because it can improve the reliability of estimated parameters of interest such as means or totals. It uses control totals of variables, known at the population level, that are included in the regression setup. In this paper, we investigate the properties of the regression estimator that uses control totals estimated from the sample, as well as those known at the population level. This estimator is compared, both theoretically and via a simulation study, to regression estimators that use only the known totals.

Release date: 2016-06-22
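The regression (GREG) estimator discussed above can be sketched in a few lines. This is a minimal illustration assuming known population control totals; the paper's variant with estimated totals would simply substitute sample-based estimates for `X_totals`. The function name and toy data are illustrative, not from the paper.

```python
import numpy as np

def greg_total(y, X, w, X_totals):
    """GREG estimator of a total: the Horvitz-Thompson total plus a
    regression adjustment toward the known control totals X_totals."""
    y = np.asarray(y, float)
    X = np.asarray(X, float)
    w = np.asarray(w, float)
    # Weighted least-squares coefficient vector B-hat
    B = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return w @ y + (np.asarray(X_totals, float) - w @ X) @ B

# Toy sample: intercept plus one auxiliary x, with y = 1 + 2x exactly,
# so the GREG total should recover 1*N + 2*T_x = 50 from the controls.
y = [3, 5, 7]
X = [[1, 1], [1, 2], [1, 3]]   # columns: intercept, x
w = [2, 2, 2]                  # design weights
print(greg_total(y, X, w, X_totals=[10, 20]))
```

Because the fit is exact in this toy sample, the regression adjustment removes all dependence on the weights and the estimator reproduces the total implied by the control totals.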

• Articles and reports: 12-001-X201600114540
Description:

In this paper, we compare the EBLUP and pseudo-EBLUP estimators for small area estimation under the nested error regression model and three area level model-based estimators using the Fay-Herriot model. We conduct a design-based simulation study to compare the model-based estimators for unit level and area level models under informative and non-informative sampling. In particular, we are interested in the confidence interval coverage rate of the unit level and area level estimators. We also compare the estimators if the model has been misspecified. Our simulation results show that estimators based on the unit level model perform better than those based on the area level. The pseudo-EBLUP estimator is the best among unit level and area level estimators.

Release date: 2016-06-22

• Technical products: 11-522-X201700014726

Release date: 2016-03-24

• Technical products: 11-522-X201700014755
Description:

The National Children’s Study Vanguard Study was a pilot epidemiological cohort study of children and their parents, with measures to be taken from pre-pregnancy until adulthood. The use of extant data was planned to supplement direct data collection from respondents. Our paper outlines a strategy for cataloging and evaluating extant data sources for use with large-scale longitudinal studies. Through our review we selected five factors to guide researchers in evaluating available data sources: 1) relevance, 2) timeliness, 3) spatiality, 4) accessibility, and 5) accuracy.

Release date: 2016-03-24

• Technical products: 11-522-X201700014719
Description:

Open Data initiatives are transforming how governments and other public institutions interact and provide services to their constituents. They increase transparency and value to citizens, reduce inefficiencies and barriers to information, enable data-driven applications that improve public service delivery, and provide public data that can stimulate innovative business opportunities. As one of the first international organizations to adopt an open data policy, the World Bank has been providing guidance and technical expertise to developing countries that are considering or designing their own initiatives. This presentation will give an overview of developments in open data at the international level along with current and future experiences, challenges, and opportunities. Mr. Herzog will discuss the rationales under which governments are embracing open data, demonstrated benefits to both the public and private sectors, the range of different approaches that governments are taking, and the availability of tools for policymakers, with special emphasis on the roles and perspectives of National Statistics Offices within a government-wide initiative.

Release date: 2016-03-24

• Technical products: 11-522-X201700014749

Release date: 2016-03-24

• Technical products: 11-522-X201700014746

Release date: 2016-03-24

• Technical products: 11-522-X201700014729
Description:

The use of administrative datasets as a data source in official statistics has become much more common as there is a drive for more outputs to be produced more efficiently. Many outputs rely on linkage between two or more datasets, and this is often undertaken in a number of phases with different methods and rules. In these situations we would like to be able to assess the quality of the linkage, and this involves some re-assessment of both links and non-links. In this paper we discuss sampling approaches to obtain estimates of false negatives and false positives with reasonable control of both accuracy of estimates and cost. Approaches to stratification of links (non-links) to sample are evaluated using information from the 2011 England and Wales population census.

Release date: 2016-03-24
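The basic quantity such a clerical-review sample supports can be sketched simply: the weighted share of sampled links judged to be false. This is an illustrative function with toy data, not the paper's stratified designs, which would supply the appropriate design weights from the chosen sampling strata.

```python
def weighted_error_rate(is_error, w):
    """Estimated false-link rate from a clerically reviewed sample of
    links: the weighted proportion of sampled pairs judged erroneous."""
    return sum(wi for err, wi in zip(is_error, w) if err) / sum(w)

# Four sampled links with design weights; two judged false on review.
print(weighted_error_rate([True, False, False, True], [10, 10, 5, 5]))  # 0.5
```

The same function applied to a sample of non-links, with error meaning "should have been linked", gives the corresponding false-negative estimate.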

• Technical products: 11-522-X201700014725
Description:

Tax data are being used more and more to measure and analyze the population and its characteristics. One of the issues raised by the growing use of these type of data relates to the definition of the concept of place of residence. While the census uses the traditional concept of place of residence, tax data provide information based on the mailing address of tax filers. Using record linkage between the census, the National Household Survey and tax data from the T1 Family File, this study examines the consistency level of the place of residence of these two sources and its associated characteristics.

Release date: 2016-03-24

• Technical products: 11-522-X201700014742

Release date: 2016-03-24

• Technical products: 11-522-X201700014732
Description:

The Institute for Employment Research (IAB) is the research unit of the German Federal Employment Agency. Via the Research Data Centre (FDZ) at the IAB, administrative and survey data on individuals and establishments are provided to researchers. In cooperation with the Institute for the Study of Labor (IZA), the FDZ has implemented the Job Submission Application (JoSuA) environment which enables researchers to submit jobs for remote data execution through a custom-built web interface. Moreover, two types of user-generated output files may be distinguished within the JoSuA environment which allows for faster and more efficient disclosure review services.

Release date: 2016-03-24

• Technical products: 11-522-X201700014714
Description:

The Labour Market Development Agreements (LMDAs) between Canada and the provinces and territories fund labour market training and support services to Employment Insurance claimants. The objective of this paper is to discuss the improvements over the years in the impact assessment methodology. The paper describes the LMDAs and past evaluation work and discusses the drivers to make better use of large administrative data holdings. It then explains how the new approach made the evaluation less resource-intensive, while results are more relevant to policy development. The paper outlines the lessons learned from a methodological perspective and provides insight into ways for making this type of use of administrative data effective, especially in the context of large programs.

Release date: 2016-03-24

• Technical products: 11-522-X201700014740
Description:

In this paper, we discuss the impacts of Employment Benefit and Support Measures delivered in Canada under the Labour Market Development Agreements. We use rich linked longitudinal administrative data covering all LMDA participants from 2002 to 2005. We apply propensity score matching, as in Blundell et al. (2002), Gerfin and Lechner (2002), and Sianesi (2004), and produce national incremental impact estimates using difference-in-differences and the kernel matching estimator (Heckman and Smith, 1999). The findings suggest that both Employment Assistance Services and employment benefits such as Skills Development and Targeted Wage Subsidies had positive effects on earnings and employment.

Release date: 2016-03-24
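The matching-plus-difference-in-differences idea can be sketched with a one-to-one nearest-neighbour match on the propensity score. This is a simplification for illustration, not the kernel-matching estimator the abstract cites; names and toy numbers are hypothetical.

```python
import numpy as np

def psm_did(ps_t, dy_t, ps_c, dy_c):
    """Nearest-neighbour matching on the propensity score, then the
    average difference-in-differences: each participant's earnings
    change minus the matched comparison case's earnings change."""
    ps_c = np.asarray(ps_c, float)
    impacts = []
    for p, dy in zip(ps_t, dy_t):
        j = int(np.argmin(np.abs(ps_c - p)))   # closest control on the score
        impacts.append(dy - dy_c[j])
    return float(np.mean(impacts))

# Toy data: two participants, three comparison cases; dy = change in
# earnings from the pre- to the post-programme period.
print(psm_did([0.60, 0.30], [5.0, 2.0], [0.58, 0.31, 0.90], [3.0, 1.0, 8.0]))
# 1.5
```

Kernel matching would replace the single nearest neighbour with a weighted average over all comparison cases, with weights decaying in the propensity-score distance.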

• Technical products: 11-522-X201700014716
Description:

Administrative data, depending on its source and original purpose, can be considered a more reliable source of information than survey-collected data. It does not require a respondent to be present and understand question wording, and it is not limited by the respondent’s ability to recall events retrospectively. This paper compares selected survey data, such as demographic variables, from the Longitudinal and International Study of Adults (LISA) to various administrative sources for which LISA has linkage agreements in place. The agreement between data sources, and some factors that might affect it, are analyzed for various aspects of the survey.

Release date: 2016-03-24

• Technical products: 11-522-X201700014743
Description:

Probabilistic linkage is susceptible to linkage errors such as false positives and false negatives. In many cases, these errors may be reliably measured through clerical review, i.e., the visual inspection of a sample of record pairs to determine whether they are matched. A framework is described to carry out such reviews effectively, based on a probabilistic sample of pairs, repeated independent reviews of the same pairs, and latent class analysis to account for clerical errors.

Release date: 2016-03-24

• Technical products: 11-522-X201700014711
Description:

After the 2010 Census, the U.S. Census Bureau conducted two separate research projects matching survey data to databases. One study matched to the third-party database Accurint, and the other matched to U.S. Postal Service National Change of Address (NCOA) files. In both projects, we evaluated response error in reported move dates by comparing the self-reported move date to records in the database. We encountered similar challenges in the two projects. This paper discusses our experience using “big data” as a comparison source for survey data and our lessons learned for future projects similar to the ones we conducted.

Release date: 2016-03-24

• Technical products: 11-522-X201700014718
Description:

This study assessed whether starting participation in Employment Assistance Services (EAS) earlier after initiating an Employment Insurance (EI) claim leads to better impacts for unemployed individuals than participating later during the EI benefit period. As in Sianesi (2004) and Hujer and Thomsen (2010), the analysis relied on a stratified propensity score matching approach conditional on the discretized duration of unemployment until the program starts. The results showed that individuals who participated in EAS within the first four weeks after initiating an EI claim had the best impacts on earnings and incidence of employment while also experiencing reduced use of EI starting the second year post-program.

Release date: 2016-03-24

• Articles and reports: 12-001-X201500214229
Description:

Self-weighting estimation through equal probability selection methods (epsem) is desirable for variance efficiency. Traditionally, the epsem property for (one phase) two stage designs for estimating population-level parameters is realized by using each primary sampling unit (PSU) population count as the measure of size for PSU selection along with equal sample size allocation per PSU under simple random sampling (SRS) of elementary units. However, when self-weighting estimates are desired for parameters corresponding to multiple domains under a pre-specified sample allocation to domains, Folsom, Potter and Williams (1987) showed that a composite measure of size can be used to select PSUs to obtain epsem designs when besides domain-level PSU counts (i.e., distribution of domain population over PSUs), frame-level domain identifiers for elementary units are also assumed to be available. The term depsem-A will be used to denote such (one phase) two stage designs to obtain domain-level epsem estimation. Folsom et al. also considered two phase two stage designs when domain-level PSU counts are unknown, but whole PSU counts are known. For these designs (to be termed depsem-B) with PSUs selected proportional to the usual size measure (i.e., the total PSU count) at the first stage, all elementary units within each selected PSU are first screened for classification into domains in the first phase of data collection before SRS selection at the second stage. Domain-stratified samples are then selected within PSUs with suitably chosen domain sampling rates such that the desired domain sample sizes are achieved and the resulting design is self-weighting. In this paper, we first present a simple justification of composite measures of size for the depsem-A design and of the domain sampling rates for the depsem-B design. 
Then, for depsem-A and -B designs, we propose generalizations, first to cases where frame-level domain identifiers for elementary units are not available and domain-level PSU counts are only approximately known from alternative sources, and second to cases where PSU size measures are pre-specified based on other practical and desirable considerations of over- and under-sampling of certain domains. We also present a further generalization in the presence of subsampling of elementary units and nonresponse within selected PSUs at the first phase before selecting phase two elementary units from domains within each selected PSU. This final generalization of depsem-B is illustrated for an area sample of housing units.

Release date: 2015-12-17
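The composite measure of size behind the depsem-A design described above has a compact form: each PSU's size is the sum, over domains, of the target domain sampling fraction times the PSU's domain count. The sketch below uses toy fractions and counts (all values hypothetical) to show why the resulting design is self-weighting within each domain.

```python
def composite_mos(domain_counts, f):
    """Composite measure of size for one PSU: sum over domains of the
    target domain sampling fraction f[d] times the PSU's domain count."""
    return sum(f[d] * domain_counts[d] for d in domain_counts)

# Two PSUs, two domains; f[d] = desired overall fraction n_d / N_d.
f = {"A": 0.10, "B": 0.02}
S1 = composite_mos({"A": 100, "B": 500}, f)    # 10 + 10 = 20.0
S2 = composite_mos({"A": 300, "B": 1000}, f)   # 30 + 20 = 50.0

# Select m PSUs PPS to S_i; inside a selected PSU i, sample domain d at
# rate f[d] * (S1 + S2) / (m * S_i).  The overall inclusion probability
# for a domain-d element is then m*S_i/(S1+S2) * rate = f[d], i.e. the
# design is epsem within each domain.
m = 1
within_rate_A = f["A"] * (S1 + S2) / (m * S1)
print(S1, S2, within_rate_A)
```

Multiplying the PSU selection probability by the within-PSU rate confirms the cancellation of the PSU-specific size `S_i`, which is the whole point of the composite measure.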

• Articles and reports: 12-001-X201500214236
Description:

We propose a model-assisted extension of weighting design-effect measures. We develop a summary-level statistic for different variables of interest, in single-stage sampling and under calibration weight adjustments. Our proposed design effect measure captures the joint effects of a non-epsem sampling design, unequal weights produced using calibration adjustments, and the strength of the association between an analysis variable and the auxiliaries used in calibration. We compare our proposed measure to existing design effect measures in simulations using variables like those collected in establishment surveys and telephone surveys of households.

Release date: 2015-12-17
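The classical Kish weighting design effect, which the proposed model-assisted measure extends, is simple to compute: one plus the relative variance of the weights. The sketch below shows only that baseline quantity, not the paper's extension incorporating calibration and the analysis variable.

```python
def kish_deff(w):
    """Kish's weighting design effect: 1 + relvariance of the weights,
    computed equivalently as n * sum(w_i^2) / (sum(w_i))^2."""
    n = len(w)
    return n * sum(wi * wi for wi in w) / sum(w) ** 2

print(kish_deff([1, 1, 1, 1]))  # 1.0 -- equal weights, no variance inflation
print(kish_deff([1, 1, 1, 9]))  # ~2.33 -- unequal weights inflate variance
```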

• Articles and reports: 12-001-X201500214248
Description:

Unit level population models are often used in model-based small area estimation of totals and means, but the models may not hold for the sample if the sampling design is informative for the model. As a result, standard methods, assuming that the model holds for the sample, can lead to biased estimators. We study alternative methods that use a suitable function of the unit selection probability as an additional auxiliary variable in the sample model. We report the results of a simulation study on the bias and mean squared error (MSE) of the proposed estimators of small area means and on the relative bias of the associated MSE estimators, using informative sampling schemes to generate the samples. Alternative methods, based on modeling the conditional expectation of the design weight as a function of the model covariates and the response, are also included in the simulation study.

Release date: 2015-12-17

• Articles and reports: 12-001-X201500114199
Description:

In business surveys, it is not unusual to collect economic variables for which the distribution is highly skewed. In this context, winsorization is often used to treat the problem of influential values. This technique requires the determination of a constant that corresponds to the threshold above which large values are reduced. In this paper, we consider a method of determining the constant which involves minimizing the largest estimated conditional bias in the sample. In the context of domain estimation, we also propose a method of ensuring consistency between the domain-level winsorized estimates and the population-level winsorized estimate. The results of two simulation studies suggest that the proposed methods lead to winsorized estimators that have good bias and relative efficiency properties.

Release date: 2015-06-29
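Winsorization of an estimated total can be sketched in a couple of lines. Note that choosing the threshold by minimizing the largest estimated conditional bias is the paper's contribution and is not implemented here; in this sketch `K` is simply user-supplied, and the data are toy values.

```python
def winsorized_total(y, w, K):
    """Horvitz-Thompson total with large values capped at threshold K
    (type I winsorization: y_i replaced by min(y_i, K))."""
    return sum(wi * min(yi, K) for yi, wi in zip(y, w))

y = [10, 12, 15, 400]          # one influential value
w = [25, 25, 25, 25]           # design weights
print(sum(wi * yi for yi, wi in zip(y, w)))   # 10925 -- untreated total
print(winsorized_total(y, w, K=50))           # 2175 -- influence curbed
```

The cap introduces bias in exchange for variance reduction, which is why the choice of `K` is the central design question.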

• Articles and reports: 12-001-X201400214119
Description:

When considering sample stratification by several variables, we often face the case where the expected number of sample units to be selected in each stratum is very small and the total number of units to be selected is smaller than the total number of strata. These stratified sample designs are represented by tabular arrays of real numbers, called controlled selection problems, and are beyond the reach of conventional methods of allocation. Many algorithms for solving these problems have been studied over the roughly 60 years since Goodman and Kish (1950). Those developed more recently are especially computer intensive and always find a solution. However, an unanswered question remains: in what sense are the solutions to a controlled selection problem obtained from those algorithms optimal? We introduce a general concept of optimal solutions and propose a new controlled selection algorithm, based on typical distance functions, to achieve them. This algorithm can easily be run with new SAS-based software. This study focuses on two-way stratification designs. The controlled selection solutions from the new algorithm are compared with those from existing algorithms using several examples. The new algorithm successfully obtains robust solutions to two-way controlled selection problems that meet the optimality criteria.

Release date: 2014-12-19

• Technical products: 11-522-X201300014276
Description:

In France, budget restrictions are making it more difficult to hire casual interviewers to deal with collection problems. As a result, it has become necessary to adhere to a predetermined annual work quota. For surveys of the National Institute of Statistics and Economic Studies (INSEE), which use a master sample, problems arise when an interviewer is on extended leave throughout the entire collection period of a survey. When that occurs, an area may cease to be covered by the survey, which effectively introduces a bias. In response to this new problem, we have implemented two methods, depending on when the problem is identified:

If an area is ‘abandoned’ before or at the very beginning of collection, we carry out a ‘sub-allocation’ procedure: we interview a minimum number of households in each collection area at the expense of other areas in which no collection problems have been identified. The idea is to minimize the dispersion of weights while meeting collection targets.

If an area is ‘abandoned’ during collection, we prioritize the remaining surveys. Prioritization is based on a representativeness indicator (R indicator) that measures the degree of similarity between a sample and the base population. The goal of this prioritization is to get as close as possible to equal response probability across respondents. The R indicator is based on the dispersion of the estimated response probabilities of the sampled households, and it is composed of partial R indicators that measure representativeness variable by variable. These indicators can be used to analyze collection by isolating underrepresented population groups, and collection efforts can be increased for groups identified beforehand.

The oral presentation covered both points concisely; this paper deals exclusively with the first: sub-allocation. Prioritization is being implemented for the first time at INSEE for the assets survey, and it will be covered in a specific paper by A. Rebecq.

Release date: 2014-10-31
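The R indicator mentioned above has a simple standard form (due to Schouten, Cobben and Bethlehem): one minus twice the standard deviation of the estimated response propensities. A minimal sketch with toy propensities:

```python
import statistics

def r_indicator(propensities):
    """Representativeness indicator R = 1 - 2 * S(rho-hat), where S is
    the standard deviation of the estimated response propensities.
    R = 1 when every unit responds with the same probability."""
    return 1 - 2 * statistics.pstdev(propensities)

print(r_indicator([0.5, 0.5, 0.5, 0.5]))      # 1.0 -- fully representative
print(r_indicator([0.25, 0.75, 0.25, 0.75]))  # 0.5 -- uneven response
```

Partial R indicators apply the same dispersion idea to propensities modelled on a single auxiliary variable, which is what allows collection monitoring variable by variable.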


## Data (1) (1 result)

• Table: 62-010-X19970023422
Description:

The current official time base of the Consumer Price Index (CPI) is 1986=100. This time base was first used when the CPI for June 1990 was released. Statistics Canada is about to convert all price index series to the time base 1992=100. As a result, all constant dollar series will be converted to 1992 dollars. The CPI will shift to the new time base when the CPI for January 1998 is released on February 27th, 1998.

Release date: 1997-11-17


## Analysis (79) (25 of 79 results)

• Articles and reports: 82-003-X201401014098
Description:

This study compares registry and non-registry approaches to linking 2006 Census of Population data for Manitoba and Ontario to Hospital data from the Discharge Abstract Database.

Release date: 2014-10-15

• Articles and reports: 12-001-X201400114004
Description:

In 2009, two major surveys in the Governments Division of the U.S. Census Bureau were redesigned to reduce sample size, save resources, and improve the precision of the estimates (Cheng, Corcoran, Barth and Hogue 2009). The new design divides each of the traditional state by government-type strata with sufficiently many units into two sub-strata according to each governmental unit’s total payroll, in order to sample less from the sub-stratum with small size units. The model-assisted approach is adopted in estimating population totals. Regression estimators using auxiliary variables are obtained either within each created sub-stratum or within the original stratum by collapsing two sub-strata. A decision-based method was proposed in Cheng, Slud and Hogue (2010), applying a hypothesis test to decide which regression estimator is used within each original stratum. Consistency and asymptotic normality of these model-assisted estimators are established here, under a design-based or model-assisted asymptotic framework. Our asymptotic results also suggest two types of consistent variance estimators, one obtained by substituting unknown quantities in the asymptotic variances and the other by applying the bootstrap. The performance of all the estimators of totals and of their variance estimators are examined in some empirical studies. The U.S. Annual Survey of Public Employment and Payroll (ASPEP) is used to motivate and illustrate our study.

Release date: 2014-06-27

• Articles and reports: 12-001-X201300211885
Description:

Web surveys are generally associated with low response rates. Textbooks on Web survey research commonly highlight the importance of the welcome screen in encouraging respondents to take part, and its importance has been demonstrated empirically: most respondents break off at the welcome screen. However, there has been little research on how the design of this screen affects the breakoff rate. In a study conducted at the University of Konstanz, three experimental treatments were added to a survey of the first-year student population (2,629 students) to assess the impact of different design features of this screen on breakoff rates. The experiments varied the background color of the welcome screen, the promised task duration stated on this first screen, and the length of the information explaining the privacy rights of the respondents. The analyses show that the longer the stated duration and the more attention given to explaining privacy rights on the welcome screen, the fewer respondents started and completed the survey. The background color, however, did not produce the expected significant difference.

Release date: 2014-01-15

• Articles and reports: 12-001-X201300211887
Description:

Multi-level models are extensively used for analyzing survey data with the design hierarchy matching the model hierarchy. We propose a unified approach, based on a design-weighted log composite likelihood, for two-level models that leads to design-model consistent estimators of the model parameters even when the within-cluster sample sizes are small, provided the number of sample clusters is large. This method can handle both linear and generalized linear two-level models; it requires level 2 and level 1 inclusion probabilities and level 1 joint inclusion probabilities, where level 2 represents a cluster and level 1 an element within a cluster. Results of a simulation study demonstrating the superior performance of the proposed method relative to existing methods under informative sampling are also reported.

Release date: 2014-01-15

• Articles and reports: 12-001-X201300211869
Description:

The house price index compiled by Statistics Netherlands relies on the Sale Price Appraisal Ratio (SPAR) method. The SPAR method combines selling prices with prior government assessments of properties. This paper outlines an alternative approach where the appraisals serve as auxiliary information in a generalized regression (GREG) framework. An application on Dutch data demonstrates that, although the GREG index is much smoother than the ratio of sample means, it is very similar to the SPAR series. To explain this result we show that the SPAR index is an estimator of our more general GREG index and in practice almost as efficient.

Release date: 2014-01-15
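
To make the GREG idea concrete, here is a minimal sketch on synthetic data (not the Dutch registry data): prior appraisal values play the role of an auxiliary variable whose population total is known, and the GREG estimator adjusts the Horvitz-Thompson estimate of total selling price using a through-origin regression fit. All numbers and model choices below are illustrative assumptions, not the authors' specification.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic population: x = prior appraisal, y = selling price (illustrative only)
N = 5_000
x = rng.lognormal(mean=12.0, sigma=0.4, size=N)
y = 1.1 * x + rng.normal(0.0, 0.05 * x)    # prices track appraisals with noise
X_total = x.sum()                           # known from the appraisal register

# Simple random sample of "sold" dwellings
n = 200
idx = rng.choice(N, size=n, replace=False)
xs, ys = x[idx], y[idx]
w = N / n                                   # design weight under SRS

# GREG estimator of the total of y:
#   t_greg = t_HT(y) + B_hat * (X_total - t_HT(x)), slope from a through-origin fit
B_hat = np.sum(w * xs * ys) / np.sum(w * xs * xs)
t_ht = w * ys.sum()
t_greg = t_ht + B_hat * (X_total - w * xs.sum())

print(f"HT estimate:   {t_ht:,.0f}")
print(f"GREG estimate: {t_greg:,.0f}")
print(f"True total:    {y.sum():,.0f}")
```

Because selling prices closely track appraisals, the GREG correction removes most of the sampling error in the plain Horvitz-Thompson total, mirroring the smoothing effect the abstract describes.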

• Articles and reports: 12-001-X201300111830
Description:

We consider two different self-benchmarking methods for the estimation of small area means based on the Fay-Herriot (FH) area level model: the method of You and Rao (2002) applied to the FH model and the method of Wang, Fuller and Qu (2008) based on augmented models. We derive an estimator of the mean squared prediction error (MSPE) of the You-Rao (YR) estimator of a small area mean that, under the true model, is correct to second-order terms. We report the results of a simulation study on the relative bias of the MSPE estimator of the YR estimator and the MSPE estimator of the Wang, Fuller and Qu (WFQ) estimator obtained under an augmented model. We also study the MSPE and the estimators of MSPE for the YR and WFQ estimators obtained under a misspecified model.

Release date: 2013-06-28

• Articles and reports: 12-001-X201200111682
Description:

Sample allocation issues are studied in the context of estimating sub-population (stratum or domain) means as well as the aggregate population mean under stratified simple random sampling. A non-linear programming method is used to obtain "optimal" sample allocation to strata that minimizes the total sample size subject to specified tolerances on the coefficient of variation of the estimators of strata means and the population mean. The resulting total sample size is then used to determine sample allocations for the methods of Costa, Satorra and Ventura (2004) based on compromise allocation and Longford (2006) based on specified "inferential priorities". In addition, we study sample allocation to strata when reliability requirements for domains, cutting across strata, are also specified. Performance of the three methods is studied using data from Statistics Canada's Monthly Retail Trade Survey (MRTS) of single establishments.

Release date: 2012-06-27
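
When only the stratum-level CV tolerances bind, the minimum sample size separates stratum by stratum into a closed form; the joint constraint on the population mean, which makes the problem a genuine non-linear program in the paper, is omitted in this illustrative sketch. The strata figures are hypothetical, not MRTS data.

```python
import math

# Hypothetical strata: (N_h, stratum mean, stratum SD); common CV tolerance
strata = [(2_000, 120.0, 60.0), (800, 450.0, 300.0), (150, 2_300.0, 2_000.0)]
cv_max = 0.05

n_alloc = []
for N_h, ybar, S in strata:
    # Smallest n_h with CV(ybar_h) = sqrt((1/n_h - 1/N_h) * S^2) / ybar <= cv_max
    n_h = S**2 / (cv_max**2 * ybar**2 + S**2 / N_h)
    n_alloc.append(min(N_h, math.ceil(n_h)))

print("allocation:", n_alloc, " total:", sum(n_alloc))
```

Each closed-form `n_h` is the exact minimum meeting its stratum constraint; adding the aggregate-mean and cross-cutting domain constraints of the paper couples the strata and requires a numerical NLP solver instead.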

• Articles and reports: 82-003-X201200111625
Description:

This study compares estimates of the prevalence of cigarette smoking based on self-report with estimates based on urinary cotinine concentrations. The data are from the 2007 to 2009 Canadian Health Measures Survey, which included self-reported smoking status and the first nationally representative measures of urinary cotinine.

Release date: 2012-02-15

• Articles and reports: 12-001-X201000211385
Description:

In this short note, we show that simple random sampling without replacement and Bernoulli sampling have approximately the same entropy when the population size is large. An empirical example is given as an illustration.

Release date: 2010-12-21
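
The result is easy to verify numerically: SRSWOR of n from N places equal probability on each of the C(N, n) possible samples, so its entropy is log C(N, n), while Bernoulli sampling with inclusion rate p has entropy N times the binary entropy of p. A short sketch (illustrative, not the paper's example):

```python
import math

def srswor_entropy(N, n):
    # SRSWOR puts probability 1/C(N, n) on each of C(N, n) samples
    return math.log(math.comb(N, n))

def bernoulli_entropy(N, p):
    # N independent inclusion indicators, each contributing binary entropy H(p)
    h = -p * math.log(p) - (1 - p) * math.log(1 - p)
    return N * h

for N in (100, 10_000, 100_000):
    n = N // 10                     # sampling fraction p = 0.1
    ratio = srswor_entropy(N, n) / bernoulli_entropy(N, n / N)
    print(N, round(ratio, 5))
```

By Stirling's approximation, log C(N, pN) differs from N·H(p) only by a term of order log N, so the ratio tends to 1 as the population size grows.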

• Articles and reports: 12-001-X201000211384
Description:

The current economic downturn in the US could challenge costly strategies in survey operations. In the Behavioral Risk Factor Surveillance System (BRFSS), ending the monthly data collection at 31 days could be a less costly alternative. However, this could exclude a portion of interviews completed after 31 days (late responders), whose characteristics could differ in many respects from those of respondents who completed the survey within 31 days (early responders). We examined whether there are differences between early and late responders in demographics, health-care coverage, general health status, health risk behaviors, and chronic disease conditions or illnesses. We used 2007 BRFSS data, in which a representative sample of the noninstitutionalized adult U.S. population was selected using a random digit dialing method. Late responders were significantly more likely to be male; to report race/ethnicity as Hispanic; to have annual income higher than \$50,000; to be younger than 45 years of age; to have less than high school education; to have health-care coverage; and to report good health; they were significantly less likely to report hypertension, diabetes, or being obese. The observed differences between early and late responders are likely to have little influence on national and state-level estimates. However, as the proportion of late responders may increase in the future, the impact on surveillance estimates should be examined before late responders are excluded from analysis. Analyses of late responders alone should combine several years of data to produce reliable estimates.

Release date: 2010-12-21

• Articles and reports: 12-001-X201000211378
Description:

One key to poverty alleviation or eradication in the third world is reliable information on the poor and their location, so that interventions and assistance can be effectively targeted to the neediest people. Small area estimation is one statistical technique that is used to monitor poverty and to decide on aid allocation in pursuit of the Millennium Development Goals. Elbers, Lanjouw and Lanjouw (ELL) (2003) proposed a small area estimation methodology for income-based or expenditure-based poverty measures, which is implemented by the World Bank in its poverty mapping projects via the involvement of the central statistical agencies in many third world countries, including Cambodia, Lao PDR, the Philippines, Thailand and Vietnam, and is incorporated into the World Bank software program PovMap. In this paper, the ELL methodology which consists of first modeling survey data and then applying that model to census information is presented and discussed with strong emphasis on the first phase, i.e., the fitting of regression models and on the estimated standard errors at the second phase. Other regression model fitting procedures such as the General Survey Regression (GSR) (as described in Lohr (1999) Chapter 11) and those used in existing small area estimation techniques: Pseudo-Empirical Best Linear Unbiased Prediction (Pseudo-EBLUP) approach (You and Rao 2002) and Iterative Weighted Estimating Equation (IWEE) method (You, Rao and Kovacevic 2003) are presented and compared with the ELL modeling strategy. The most significant difference between the ELL method and the other techniques is in the theoretical underpinning of the ELL model fitting procedure. 
An example based on the Philippines Family Income and Expenditure Survey is presented to show the differences in both the parameter estimates and their corresponding standard errors, and in the variance components generated from the different methods; the discussion is extended to the effect of these on the estimated accuracy of the final small area estimates themselves. The need for sound estimation of variance components, as well as of regression estimates and their standard errors, for small area estimation of poverty is emphasized.

Release date: 2010-12-21

• Articles and reports: 12-001-X201000111246
Description:

Many surveys employ weight adjustment procedures to reduce nonresponse bias. These adjustments make use of available auxiliary data. This paper addresses the issue of jackknife variance estimation for estimators that have been adjusted for nonresponse. Using the reverse approach for variance estimation proposed by Fay (1991) and Shao and Steel (1999), we study the effect of not re-calculating the nonresponse weight adjustment within each jackknife replicate. We show that the resulting 'shortcut' jackknife variance estimator tends to overestimate the true variance of point estimators in the case of several weight adjustment procedures used in practice. These theoretical results are confirmed through a simulation study where we compare the shortcut jackknife variance estimator with the full jackknife variance estimator obtained by re-calculating the nonresponse weight adjustment within each jackknife replicate.

Release date: 2010-06-29

• Articles and reports: 12-001-X200900110885
Description:

Peaks in the spectrum of a stationary process are indicative of the presence of stochastic periodic phenomena, such as a stochastic seasonal effect. This work proposes to measure and test for the presence of such spectral peaks via assessing their aggregate slope and convexity. Our method is developed nonparametrically, and thus may be useful during a preliminary analysis of a series. The technique is also useful for detecting the presence of residual seasonality in seasonally adjusted data. The diagnostic is investigated through simulation and an extensive case study using data from the U.S. Census Bureau and the Organization for Economic Co-operation and Development (OECD).

Release date: 2009-06-22

• Articles and reports: 12-001-X200800110616
Description:

With complete multivariate data the BACON algorithm (Billor, Hadi and Vellemann 2000) yields a robust estimate of the covariance matrix. The corresponding Mahalanobis distance may be used for multivariate outlier detection. When items are missing the EM algorithm is a convenient way to estimate the covariance matrix at each iteration step of the BACON algorithm. In finite population sampling the EM algorithm must be enhanced to estimate the covariance matrix of the population rather than of the sample. A version of the EM algorithm for survey data following a multivariate normal model, the EEM algorithm (Estimated Expectation Maximization), is proposed. The combination of the two algorithms, the BACON-EEM algorithm, is applied to two datasets and compared with alternative methods.

Release date: 2008-06-26
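
For intuition, here is a heavily simplified, complete-data sketch of the BACON idea (no EM/EEM step and synthetic data, so it is not the paper's BACON-EEM): start from a small subset of points near the coordinatewise median, then repeatedly re-estimate the mean and covariance from the current subset, admitting points whose squared Mahalanobis distance falls below a chi-square cut-off.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data: 200 correlated inliers plus 10 gross outliers
X = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=200)
X = np.vstack([X, rng.multivariate_normal([8, -8], np.eye(2), size=10)])
n, p = X.shape

def mahalanobis_sq(X, mean, cov):
    d = X - mean
    return np.einsum("ij,jk,ik->i", d, np.linalg.inv(cov), d)

# BACON-style iteration: grow the "basic subset" until it stabilizes
d0 = np.sum((X - np.median(X, axis=0)) ** 2, axis=1)
subset = np.argsort(d0)[: 4 * p]            # small robust starting subset
threshold = 9.21                            # ~ chi-square(df=2) 0.99 quantile
for _ in range(50):
    mean = X[subset].mean(axis=0)
    cov = np.cov(X[subset], rowvar=False)
    d2 = mahalanobis_sq(X, mean, cov)
    new = np.flatnonzero(d2 < threshold)
    if np.array_equal(new, subset):
        break
    subset = new

outliers = np.setdiff1d(np.arange(n), subset)
print("flagged outliers:", outliers)
```

Because the initial subset is chosen robustly, the covariance estimate is never contaminated by the planted outliers, so their Mahalanobis distances stay far above the cut-off; the EEM extension in the paper replaces the subset moments with survey-weighted EM estimates when items are missing.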

• Articles and reports: 12-001-X200700210498
Description:

In this paper we describe a methodology for combining a convenience sample with a probability sample in order to produce an estimator with a smaller mean squared error (MSE) than estimators based on only the probability sample. We then explore the properties of the resulting composite estimator, a linear combination of the convenience and probability sample estimators with weights that are a function of bias. We discuss the estimator's properties in the context of web-based convenience sampling. Our analysis demonstrates that the use of a convenience sample to supplement a probability sample for improvements in the MSE of estimation may be practical only under limited circumstances. First, the remaining bias of the estimator based on the convenience sample must be quite small, equivalent to no more than 0.1 of the outcome's population standard deviation. For a dichotomous outcome, this implies a bias of no more than five percentage points at 50 percent prevalence and no more than three percentage points at 10 percent prevalence. Second, the probability sample should contain at least 1,000-10,000 observations for adequate estimation of the bias of the convenience sample estimator. Third, it must be inexpensive and feasible to collect at least thousands (and probably tens of thousands) of web-based convenience observations. The conclusions about the limited usefulness of convenience samples with estimator bias of more than 0.1 standard deviations also apply to direct use of estimators based on that sample.

Release date: 2008-01-03
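
The bias-dependent weighting can be sketched as follows. Assuming the bias and variances are known (in practice they must be estimated, which is the difficulty the paper addresses), the MSE-minimizing weight on the convenience estimator in a linear combination is var_p / (var_p + var_c + bias^2). The simulation below uses a dichotomous outcome at 50 percent prevalence with a three-point bias, echoing the magnitudes discussed in the abstract; it is an illustration, not the authors' estimator.

```python
import numpy as np

rng = np.random.default_rng(3)

mu = 0.50            # true prevalence of a dichotomous outcome
bias = 0.03          # assumed bias of the convenience estimator (3 points)
n_p, n_c = 2_000, 50_000

def mse(est, truth):
    return np.mean((est - truth) ** 2)

reps = 5_000
yp = rng.binomial(n_p, mu, size=reps) / n_p           # probability sample
yc = rng.binomial(n_c, mu + bias, size=reps) / n_c    # convenience sample

var_p = mu * (1 - mu) / n_p
var_c = (mu + bias) * (1 - mu - bias) / n_c
# Weight on the convenience estimator minimizing MSE of the combination
lam = var_p / (var_p + var_c + bias ** 2)
composite = lam * yc + (1 - lam) * yp

print(f"lambda = {lam:.3f}")
print(f"MSE probability-only: {mse(yp, mu):.2e}")
print(f"MSE composite:        {mse(composite, mu):.2e}")
```

Even with a very large convenience sample, the optimal weight stays small because the squared bias dominates its MSE; the gain over the probability-only estimator is modest, consistent with the abstract's conclusion that the approach pays off only when the residual bias is quite small.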

• Articles and reports: 12-001-X200700210495
Description:

The purpose of this work is to obtain reliable estimates in study domains when sample sizes are potentially very small and the sampling design stratum differs from the study domain. The population sizes are unknown for both the study domain and the sampling design stratum. In calculating parameter estimates in the study domains, a random sample size is often necessary. We propose a new family of generalized linear mixed models with correlated random effects when there is more than one unknown parameter. The proposed model estimates both the population size and the parameter of interest. General formulae for the full conditional distributions required for Markov chain Monte Carlo (MCMC) simulations are given for this framework, as are equations for Bayesian estimation and prediction at the study domains. We apply the model to the 1998 Missouri Turkey Hunting Survey, which stratified samples by the hunter's place of residence; estimates are required at the domain level, defined as the county in which the turkey hunter actually hunted.

Release date: 2008-01-03

Reference (87)

## Reference (87) (25 of 87 results)

• Technical products: 11-522-X201700014726
Description:

Release date: 2016-03-24

• Technical products: 11-522-X201700014755
Description:

The National Children’s Study Vanguard Study was a pilot epidemiological cohort study of children and their parents. Measures were to be taken from pre-pregnancy until adulthood. The use of extant data was planned to supplement direct data collection from the respondents. Our paper outlines a strategy for cataloging and evaluating extant data sources for use with large-scale longitudinal studies. Through our review we selected five factors to guide a researcher through the evaluation of available data sources: 1) relevance, 2) timeliness, 3) spatiality, 4) accessibility, and 5) accuracy.

Release date: 2016-03-24

• Technical products: 11-522-X201700014719
Description:

Open Data initiatives are transforming how governments and other public institutions interact and provide services to their constituents. They increase transparency and value to citizens, reduce inefficiencies and barriers to information, enable data-driven applications that improve public service delivery, and provide public data that can stimulate innovative business opportunities. As one of the first international organizations to adopt an open data policy, the World Bank has been providing guidance and technical expertise to developing countries that are considering or designing their own initiatives. This presentation will give an overview of developments in open data at the international level along with current and future experiences, challenges, and opportunities. Mr. Herzog will discuss the rationales under which governments are embracing open data, demonstrated benefits to both the public and private sectors, the range of different approaches that governments are taking, and the availability of tools for policymakers, with special emphasis on the roles and perspectives of National Statistics Offices within a government-wide initiative.

Release date: 2016-03-24

• Technical products: 11-522-X201700014749
Description:

Release date: 2016-03-24

• Technical products: 11-522-X201700014746
Description:

Release date: 2016-03-24

• Technical products: 11-522-X201700014729
Description:

The use of administrative datasets as a data source in official statistics has become much more common as there is a drive for more outputs to be produced more efficiently. Many outputs rely on linkage between two or more datasets, and this is often undertaken in a number of phases with different methods and rules. In these situations we would like to be able to assess the quality of the linkage, and this involves some re-assessment of both links and non-links. In this paper we discuss sampling approaches to obtain estimates of false negatives and false positives with reasonable control of both accuracy of estimates and cost. Approaches to stratification of links (non-links) to sample are evaluated using information from the 2011 England and Wales population census.

Release date: 2016-03-24
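
A stripped-down sketch of the sampling idea (with synthetic match scores and truth labels, not census data): stratify the linked pairs by match score, clerically review a fixed-size sample in each stratum, and combine the observed stratum false-link rates with population-share weights. The score bands and sample sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic linked pairs: a match score and (normally unknown) true status
n = 100_000
score = rng.random(n)
true_match = rng.random(n) < score          # higher score => more likely true

# Stratify links by score band; review a fixed clerical sample per stratum
bands = [(0.0, 0.5), (0.5, 0.8), (0.8, 1.0)]
m_h = 300                                    # clerical reviews per stratum
fp_hat, N_h = [], []
for lo, hi in bands:
    in_h = np.flatnonzero((score >= lo) & (score < hi))
    sample = rng.choice(in_h, size=m_h, replace=False)
    fp_hat.append(np.mean(~true_match[sample]))   # observed false-link rate
    N_h.append(len(in_h))

# Stratified estimate of the overall false-positive rate
N_h = np.array(N_h, float)
fp = np.sum(N_h / N_h.sum() * np.array(fp_hat))
print(f"estimated false-positive rate: {fp:.3f}")
print(f"actual rate:                   {np.mean(~true_match):.3f}")
```

Concentrating the fixed review budget in the bands where errors are common is exactly the cost-accuracy trade-off the paper evaluates; allocating `m_h` proportionally to stratum error variance rather than equally would tighten the estimate further.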

• Technical products: 11-522-X201700014725
Description:

Tax data are being used more and more to measure and analyze the population and its characteristics. One of the issues raised by the growing use of this type of data relates to the definition of the concept of place of residence. While the census uses the traditional concept of place of residence, tax data provide information based on the mailing address of tax filers. Using record linkage between the census, the National Household Survey and tax data from the T1 Family File, this study examines the consistency of the place of residence recorded in these two sources and its associated characteristics.

Release date: 2016-03-24

• Technical products: 11-522-X201700014742
Description:

Release date: 2016-03-24

• Technical products: 11-522-X201700014732
Description:

The Institute for Employment Research (IAB) is the research unit of the German Federal Employment Agency. Via the Research Data Centre (FDZ) at the IAB, administrative and survey data on individuals and establishments are provided to researchers. In cooperation with the Institute for the Study of Labor (IZA), the FDZ has implemented the Job Submission Application (JoSuA) environment which enables researchers to submit jobs for remote data execution through a custom-built web interface. Moreover, two types of user-generated output files may be distinguished within the JoSuA environment which allows for faster and more efficient disclosure review services.

Release date: 2016-03-24

• Technical products: 11-522-X201700014714
Description:

The Labour Market Development Agreements (LMDAs) between Canada and the provinces and territories fund labour market training and support services to Employment Insurance claimants. The objective of this paper is to discuss the improvements over the years in the impact assessment methodology. The paper describes the LMDAs and past evaluation work and discusses the drivers to make better use of large administrative data holdings. It then explains how the new approach made the evaluation less resource-intensive, while results are more relevant to policy development. The paper outlines the lessons learned from a methodological perspective and provides insight into ways for making this type of use of administrative data effective, especially in the context of large programs.

Release date: 2016-03-24

• Technical products: 11-522-X201700014740
Description:

In this paper, we discuss the impacts of Employment Benefit and Support Measures delivered in Canada under the Labour Market Development Agreements. We use rich linked longitudinal administrative data covering all LMDA participants from 2002 to 2005. We apply propensity score matching as in Blundell et al. (2002), Gerfin and Lechner (2002), and Sianesi (2004), and produce national incremental impact estimates using difference-in-differences and the kernel matching estimator (Heckman and Smith, 1999). The findings suggest that both Employment Assistance Services and employment benefits such as Skills Development and Targeted Wage Subsidies had positive effects on earnings and employment.

Release date: 2016-03-24
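
A toy version of this estimation strategy on synthetic panel data (not the LMDA data): fit a logistic propensity model, match each participant to the nearest non-participant on the estimated score, and take the difference-in-differences of pre/post earnings. All data-generating values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)

# Synthetic panel: pre/post earnings, selection into treatment depends on x
n = 2_000
x = rng.normal(size=n)
p_treat = 1 / (1 + np.exp(-(0.8 * x - 0.5)))
d = rng.binomial(1, p_treat)
effect = 3.0
y_pre = 30 + 5 * x + rng.normal(0, 2, n)
y_post = 32 + 5 * x + effect * d + rng.normal(0, 2, n)

# Logistic regression for the propensity score (Newton-Raphson / IRLS)
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(10):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve((X * W[:, None]).T @ X, X.T @ (d - p))
ps = 1 / (1 + np.exp(-X @ beta))

# 1-NN matching on the propensity score, then difference-in-differences
treated = np.flatnonzero(d == 1)
controls = np.flatnonzero(d == 0)
nn = controls[np.abs(ps[treated][:, None] - ps[controls][None, :]).argmin(axis=1)]
did = np.mean((y_post[treated] - y_pre[treated]) - (y_post[nn] - y_pre[nn]))
print(f"DiD estimate of the treatment effect: {did:.2f}  (true: {effect})")
```

Matching balances the covariate that drives both selection and earnings levels, and differencing removes any remaining time-invariant gap; the paper's kernel matching replaces the single nearest neighbour with a weighted average of comparable non-participants.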

• Technical products: 11-522-X201700014716
Description:

Administrative data, depending on its source and original purpose, can be considered a more reliable source of information than survey-collected data. It does not require a respondent to be present and understand question wording, and it is not limited by the respondent’s ability to recall events retrospectively. This paper compares selected survey data, such as demographic variables, from the Longitudinal and International Study of Adults (LISA) to various administrative sources for which LISA has linkage agreements in place. The agreement between data sources, and some factors that might affect it, are analyzed for various aspects of the survey.

Release date: 2016-03-24

• Technical products: 11-522-X201700014743
Description:

Probabilistic linkage is susceptible to linkage errors such as false positives and false negatives. In many cases, these errors may be reliably measured through clerical reviews, i.e., the visual inspection of a sample of record pairs to determine whether they are matched. A framework is described to carry out such reviews effectively, based on a probabilistic sample of pairs, repeated independent reviews of the same pairs, and latent class analysis to account for clerical errors.

Release date: 2016-03-24

• Technical products: 11-522-X201700014711
Description:

After the 2010 Census, the U.S. Census Bureau conducted two separate research projects matching survey data to databases. One study matched to the third-party database Accurint, and the other matched to U.S. Postal Service National Change of Address (NCOA) files. In both projects, we evaluated response error in reported move dates by comparing the self-reported move date to records in the database. We encountered similar challenges in the two projects. This paper discusses our experience using “big data” as a comparison source for survey data and our lessons learned for future projects similar to the ones we conducted.

Release date: 2016-03-24

• Technical products: 11-522-X201700014718
Description:

This study assessed whether starting participation in Employment Assistance Services (EAS) earlier after initiating an Employment Insurance (EI) claim leads to better impacts for unemployed individuals than participating later during the EI benefit period. As in Sianesi (2004) and Hujer and Thomsen (2010), the analysis relied on a stratified propensity score matching approach conditional on the discretized duration of unemployment until the program starts. The results showed that individuals who participated in EAS within the first four weeks after initiating an EI claim had the best impacts on earnings and incidence of employment while also experiencing reduced use of EI starting the second year post-program.

Release date: 2016-03-24

• Technical products: 11-522-X201300014276
Description:

In France, budget restrictions are making it more difficult to hire casual interviewers to deal with collection problems. As a result, it has become necessary to adhere to a predetermined annual work quota. For surveys of the National Institute of Statistics and Economic Studies (INSEE), which use a master sample, problems arise when an interviewer is on extended leave throughout the entire collection period of a survey. When that occurs, an area may cease to be covered by the survey, and this effectively generates a bias. In response to this new problem, we have implemented two methods, depending on when the problem is identified: If an area is ‘abandoned’ before or at the very beginning of collection, we carry out a ‘sub-allocation’ procedure. The procedure involves interviewing a minimum number of households in each collection area at the expense of other areas in which no collection problems have been identified. The idea is to minimize the dispersion of weights while meeting collection targets. If an area is ‘abandoned’ during collection, we prioritize the remaining surveys. Prioritization is based on a representativeness indicator (R indicator) that measures the degree of similarity between a sample and the base population. The goal of this prioritization process during collection is to get as close as possible to equal response probability for respondents. The R indicator is based on the dispersion of the estimated response probabilities of the sampled households, and it is composed of partial R indicators that measure representativeness variable by variable. These R indicators are tools that we can use to analyze collection by isolating underrepresented population groups. We can increase collection efforts for groups that have been identified beforehand. In the oral presentation, we covered these two points concisely. By contrast, this paper deals exclusively with the first point: sub-allocation. 
Prioritization is being implemented for the first time at INSEE for the assets survey, and it will be covered in a specific paper by A. Rebecq.

Release date: 2014-10-31
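
The R indicator mentioned above is commonly defined as R = 1 − 2·S(ρ̂), where S(ρ̂) is the standard deviation of the estimated response propensities (Schouten, Cobben and Bethlehem, 2009); a value of 1 indicates propensities that do not vary at all. A small sketch with hypothetical propensities, purely to illustrate the definition:

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimated response propensities under two hypothetical collection scenarios
n = 5_000
rho_uniform = np.clip(rng.normal(0.60, 0.02, n), 0, 1)   # near-equal propensities
rho_skewed = np.clip(rng.normal(0.60, 0.20, n), 0, 1)    # strongly varying

def r_indicator(rho):
    # R = 1 - 2 * S(rho): 1 = perfectly representative response,
    # lower values = response propensities that differ across groups
    return 1 - 2 * rho.std(ddof=1)

print(f"R (uniform propensities): {r_indicator(rho_uniform):.3f}")
print(f"R (skewed propensities):  {r_indicator(rho_skewed):.3f}")
```

Partial R indicators computed variable by variable decompose this dispersion, which is what lets collection effort be targeted at the underrepresented groups described above.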

• Technical products: 11-522-X201300014256
Description:

The American Community Survey (ACS) added an Internet data collection mode as part of a sequential mode design in 2013. The ACS currently uses a single web application for all Internet respondents, regardless of whether they respond on a personal computer or on a mobile device. As market penetration of mobile devices increases, however, more survey respondents are using tablets and smartphones to take surveys that are designed for personal computers. Using mobile devices to complete these surveys may be more difficult for respondents and this difficulty may translate to reduced data quality if respondents become frustrated or cannot navigate around usability issues. This study uses several indicators to compare data quality across computers, tablets, and smartphones and also compares the demographic characteristics of respondents that use each type of device.

Release date: 2014-10-31

• Technical products: 11-522-X201300014278
Description:

In January and February 2014, Statistics Canada conducted a test aiming at measuring the effectiveness of different collection strategies using an online self-reporting survey. Sampled units were contacted using mailed introductory letters and asked to complete the online survey without any interviewer contact. The objectives of this test were to measure the take-up rates for completing an online survey, and to profile the respondents/non-respondents. Different samples and letters were tested to determine the relative effectiveness of the different approaches. The results of this project will be used to inform various social surveys that are preparing to include an internet response option in their surveys. The paper will present the general methodology of the test as well as results observed from collection and the analysis of profiles.

Release date: 2014-10-31

• Technical products: 11-522-X200800010958
Description:

Telephone Data Entry (TDE) is a system by which survey respondents can return their data to the Office for National Statistics (ONS) using the keypad on their telephone; it currently accounts for approximately 12% of total responses to ONS business surveys. ONS is currently increasing the number of surveys that use TDE as the primary mode of response, and this paper gives an overview of the redevelopment project, covering the redevelopment of the paper questionnaire, enhancements made to the TDE system, and the results from piloting these changes. Improvements to the quality of the data received and increased response via TDE as a result of these developments suggest that data quality improvements and cost savings are possible by promoting TDE as the primary mode of response to short-term surveys.

Release date: 2009-12-03

• Technical products: 11-522-X200800010940
Description:

Data Collection Methodology (DCM) enables the collection of good quality data by providing expert advice and assistance on questionnaire design, methods of evaluation and respondent engagement. DCM assists in the development of client skills, undertakes research and leads innovation in data collection methods. This is done in a challenging environment of organisational change and limited resources. This paper will cover how DCM does business with clients and the wider methodological community to achieve its goals.

Release date: 2009-12-03

• Technical products: 11-522-X200800010952
Description:

In a survey whose results were estimated by simple averages, we compare the effect on the results of a follow-up among non-respondents with that of weighting based on the last ten percent of the respondents. The data come from the Survey of Living Conditions among Immigrants in Norway, carried out in 2006.

Release date: 2009-12-03

• Technical products: 11-522-X200800010956
Description:

The use of Computer Audio-Recorded Interviewing (CARI) as a tool to identify interview falsification is quickly growing in survey research (Biemer, 2000, 2003; Thissen, 2007). Similarly, survey researchers are starting to expand the usefulness of CARI by combining recordings with coding to address data quality (Herget, 2001; Hansen, 2005; McGee, 2007). This paper presents results from a study included as part of the establishment-based National Center for Health Statistics' National Home and Hospice Care Survey (NHHCS), which used CARI behavior coding and CARI-specific paradata to: 1) identify and correct problematic interviewer behavior or question issues early in the data collection period, before either could negatively impact data quality; and 2) identify ways to diminish measurement error in future implementations of the NHHCS. During the first 9 weeks of the 30-week field period, CARI recorded a subset of questions from the NHHCS application for all interviewers. Recordings were linked with the interview application and output, and then coded in one of two modes: Code by Interviewer or Code by Question. The Code by Interviewer method provided visibility into problems specific to an interviewer as well as more generalized problems potentially applicable to all interviewers. The Code by Question method yielded data on the understandability of the questions and other response problems. In this mode, coders coded multiple implementations of the same question across multiple interviewers. Using the Code by Question approach, researchers identified issues with three key survey questions in the first few weeks of data collection and provided guidance to interviewers on how to handle those questions as data collection continued.
Results from coding the audio recordings (which were linked with the survey application and output) will inform question wording and interviewer training in the next implementation of the NHHCS, and guide future enhancement of CARI and the coding system.

Release date: 2009-12-03

• Technical products: 11-522-X200800010948
Description:

Past survey instruments, whether in the form of a paper questionnaire or telephone script, were their own documentation. On this basis, the ESRC Question Bank was created, providing free-access internet publication of questionnaires so that researchers could re-use questions, saving them trouble whilst improving the comparability of their data with that collected by others. Today, however, as survey technology and computer programs have become more sophisticated, accurately understanding the latest questionnaires has become more difficult, particularly when each survey team uses its own conventions to document complex items in technical reports. This paper illustrates these problems and suggests preliminary standards of presentation to be used until the process can be automated.

Release date: 2009-12-03

• Technical products: 11-522-X200800010984
Description:

The Enterprise Portfolio Manager (EPM) Program at Statistics Canada demonstrated the value of employing a "holistic" approach to managing the relationships we have with our largest and most complex business respondents.

Understanding that different types of respondents should receive different levels of intervention, and having learnt the value of employing an "enterprise-centric" approach to managing relationships with important, complex data providers, STC has embraced a response management strategy that divides its business population into four tiers based on size, complexity and importance to survey estimates. With the population thus segmented, different response management approaches have been developed, each appropriate to the relative contribution of its segment. This allows STC to target resources to the areas where it stands to achieve the greatest return on investment. Tiers I and II have been defined as critical to survey estimates.

Tier I represents the largest, most complex businesses in Canada and is managed through the Enterprise Portfolio Management Program.

Tier II represents businesses that are smaller or less complex than Tier I but still significant in developing accurate measures of the activities of individual industries.

Tier III includes medium-sized businesses, which form the bulk of survey samples.

Tier IV represents the smallest businesses, which are excluded from collection; for these, STC relies entirely on tax information.
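The four-tier segmentation can be sketched as a simple classification rule. The thresholds and inputs below are illustrative assumptions only; the actual STC criteria also weigh importance to survey estimates and are not given in this abstract:

```python
def assign_tier(revenue_musd, complexity_score):
    """Illustrative tier assignment by size and complexity.

    revenue_musd: annual revenue in millions of dollars (hypothetical proxy for size).
    complexity_score: 0-10 rating of organizational complexity (hypothetical scale).
    """
    if revenue_musd >= 1000 or complexity_score >= 9:
        return 1  # largest, most complex: Enterprise Portfolio Management Program
    if revenue_musd >= 100 or complexity_score >= 6:
        return 2  # smaller/less complex, but significant to industry estimates
    if revenue_musd >= 1:
        return 3  # medium-sized: the bulk of survey samples
    return 4  # smallest: excluded from collection, tax information used instead
```

The point of the sketch is the design choice, not the numbers: a deterministic rule like this is the "less subjective, methodological approach" the strategy aims for, replacing case-by-case judgment by experienced staff.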

The presentation will outline:

• It works! Results and metrics from the programs that have operationalized the Holistic Response Management strategy.

• Developing a less subjective, methodological approach to segmenting the business survey population for HRM.

• The project team's work to capture the complexity factors intrinsically used by experienced staff to rank respondents.

• What our so-called "problem" respondents have told us about the issues underlying non-response.

Release date: 2009-12-03

• Technical products: 11-522-X200800011008
Description:

In one sense, a questionnaire is never complete. Test results, paradata and research findings constantly provide reasons to update and improve the questionnaire. In addition, establishments change over time and questions need to be updated accordingly. In reality, it doesn't always work like this. At Statistics Sweden there are several examples of questionnaires that were designed at one point in time and rarely improved later on. However, we are currently trying to shift the perspective on questionnaire design from a linear to a cyclic one. We are developing a cyclic model in which the questionnaire can be improved continuously in multiple rounds. In this presentation, we will discuss this model and how we work with it.

Release date: 2009-12-03
