Statistics by subject – Statistical techniques

All (115) (25 of 115 results)

  • Journals and periodicals: 11-633-X
    Description:

    Papers in this series provide background discussions of the methods used to develop data for economic, health, and social analytical studies at Statistics Canada. They are intended to provide readers with information on the statistical methods, standards and definitions used to develop databases for research purposes. All papers in this series have undergone peer and institutional review to ensure that they conform to Statistics Canada's mandate and adhere to generally accepted standards of good professional practice.

    Release date: 2018-01-11

  • Articles and reports: 11-626-X2017077
    Description:

    On April 13, 2017, the Government of Canada tabled legislation to legalize the recreational use of cannabis by adults. This will directly impact Canada’s statistical system. The focus of this Economic Insights article is to provide experimental estimates for the volume of cannabis consumption, based on existing information on the prevalence of cannabis use. The article presents experimental estimates of the number of tonnes of cannabis consumed by age group for the period from 1960 to 2015. The experimental estimates rely on survey data from multiple sources, statistical techniques to link the sources over time, and assumptions about consumption behaviour. They are subject to revision as improved or additional data sources become available.

    Release date: 2017-12-18

  • The Daily
    Description: Release published in The Daily – Statistics Canada’s official release bulletin
    Release date: 2017-12-18

  • Technical products: 84-538-X
    Description:

    This document presents the methodology underlying the production of the life tables for Canada, provinces and territories, from reference period 1980/1982 and onward.

    Release date: 2017-11-16

  • Articles and reports: 11-633-X2017009
    Description:

    This document describes the procedures for using linked administrative data sources to estimate paid parental leave rates in Canada and the issues surrounding this use.

    Release date: 2017-08-29

  • Journals and periodicals: 12-605-X
    Description:

    The Record Linkage Project Process Model (RLPPM) was developed by Statistics Canada to identify the processes and activities involved in record linkage. The RLPPM applies to linkage projects conducted at the individual and enterprise level using diverse data sources to create new data sources to meet analytical and operational needs.

    Release date: 2017-06-05

  • Articles and reports: 11-633-X2017006
    Description:

    This paper describes a method of imputing missing postal codes in a longitudinal database. The 1991 Canadian Census Health and Environment Cohort (CanCHEC), which contains information on individuals from the 1991 Census long-form questionnaire linked with T1 tax return files for the 1984-to-2011 period, is used to illustrate and validate the method. The cohort contains up to 28 consecutive fields for postal code of residence, but because of frequent gaps in postal code history, missing postal codes must be imputed. To validate the imputation method, two experiments were devised in which 5% and 10% of all postal codes from a subset with full history were randomly removed and imputed. A schematic version of this mask-and-impute validation follows this entry.

    Release date: 2017-03-13
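
    Illustrative sketch (not the paper's actual imputation model): the validation idea above can be mimicked by randomly masking a share of postal codes in records with complete history, filling the gaps with a simple rule, and checking how many masked values are recovered. The data layout and the forward-fill rule below are assumptions for illustration only.

        import numpy as np
        import pandas as pd

        def mask_and_validate(history: pd.DataFrame, frac: float, seed: int = 0) -> float:
            """Hide a random fraction of postal codes, impute them, and return the
            share of hidden cells recovered exactly.
            history: one row per person, one column per year, complete postal codes."""
            rng = np.random.default_rng(seed)
            mask = rng.random(history.shape) < frac    # cells to hide
            hidden = history.where(~mask)              # NaN where hidden
            imputed = hidden.ffill(axis=1)             # placeholder rule: carry the last known code forward
            correct = imputed.values[mask] == history.values[mask]
            return float(np.mean(correct))

        # mask_and_validate(history, 0.05) and mask_and_validate(history, 0.10)
        # mirror the 5% and 10% experiments described above.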

  • Articles and reports: 11-633-X2017005
    Description:

    Hospitalization rates are among the most commonly reported statistics related to health-care service use. The variety of methods for calculating confidence intervals for these and other health-related rates suggests a need to classify, compare and evaluate these methods. Zeno is a tool developed to calculate confidence intervals of rates based on several formulas available in the literature. This report describes the contents of the main sheet of the Zeno Tool and indicates which formulas are appropriate, based on users’ assumptions and scope of analysis. A schematic example of one such confidence-interval formula follows this entry.

    Release date: 2017-01-19
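
    As an illustration of the kind of formula such a tool collects (this is one textbook interval, not necessarily one of Zeno's): treating the event count as Poisson, a large-sample confidence interval for a rate can be built on the log scale.

        import math

        def rate_ci(events: int, person_time: float, z: float = 1.96):
            """Log-normal approximation to a confidence interval for an event rate;
            assumes the event count is Poisson, so SE(log rate) is roughly 1/sqrt(events)."""
            rate = events / person_time
            half_width = z / math.sqrt(events)
            return rate * math.exp(-half_width), rate * math.exp(half_width)

        # Hypothetical example: 120 hospitalizations over 85,000 person-years.
        low, high = rate_ci(120, 85_000)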

  • Articles and reports: 12-001-X201600214684
    Description:

    This paper introduces an incomplete adaptive cluster sampling design that is easy to implement, controls the sample size well, and does not need to follow the neighbourhood. In this design, an initial sample is first selected, using one of the conventional designs. If a cell satisfies a prespecified condition, a specified radius around the cell is sampled completely. The population mean is estimated using the π-estimator. If all the inclusion probabilities are known, then an unbiased π-estimator is available; if, depending on the situation, the inclusion probabilities are not known for some of the final sample units, then they are estimated. To estimate the inclusion probabilities, a biased estimator is constructed. However, the simulations show that if the sample size is large enough, the error of the inclusion probabilities is negligible, and the relative π-estimator is almost unbiased. This design rivals adaptive cluster sampling because it controls the final sample size and is easy to manage. It rivals adaptive two-stage sequential sampling because it considers the cluster form of the population and reduces the cost of moving across the area. Using real data on a bird population and simulations, the paper compares the design with adaptive two-stage sequential sampling. The simulations show that the design has significant efficiency in comparison with its rival. A schematic example of the π-estimator follows this entry.

    Release date: 2016-12-20
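
    A minimal sketch of the π-estimator (Horvitz-Thompson estimator) referred to above, assuming the inclusion probabilities are known; when some are unknown, the paper estimates them, which is the source of the small bias discussed in the abstract.

        import numpy as np

        def pi_estimator_mean(y, pi, population_size):
            """Horvitz-Thompson (pi) estimator of the population mean:
            each sampled value is weighted by the inverse of its inclusion probability."""
            y = np.asarray(y, dtype=float)
            pi = np.asarray(pi, dtype=float)
            return np.sum(y / pi) / population_size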

  • Articles and reports: 12-001-X201600214664
    Description:

    This paper draws statistical inference for a finite population mean based on judgment post-stratified (JPS) samples. A JPS sample first selects a simple random sample and then stratifies the selected units into H judgment classes based on their relative positions (ranks) in a small set of size H. This leads to a sample with random sample sizes in the judgment classes. The ranking process can be performed using either auxiliary variables or visual inspection to identify the ranks of the measured observations. The paper develops an unbiased estimator and constructs a confidence interval for the population mean. Since judgment ranks are random variables, by conditioning on the measured observations we construct Rao-Blackwellized estimators for the population mean. The paper shows that the Rao-Blackwellized estimators perform better than the usual JPS estimators. The proposed estimators are applied to 2012 United States Department of Agriculture Census Data.

    Release date: 2016-12-20

  • Articles and reports: 12-001-X201600214676
    Description:

    Winsorization procedures replace extreme values with less extreme values, effectively moving the original extreme values toward the center of the distribution. Winsorization therefore both detects and treats influential values. Mulry, Oliver and Kaputa (2014) compare the performance of the one-sided Winsorization method developed by Clark (1995) and described by Chambers, Kokic, Smith and Cruddas (2000) to the performance of M-estimation (Beaumont and Alavi 2004) in highly skewed business population data. One aspect of particular interest for methods that detect and treat influential values is the range of values designated as influential, called the detection region. The Clark Winsorization algorithm is easy to implement and can be extremely effective. However, the resultant detection region is highly dependent on the number of influential values in the sample, especially when the survey totals are expected to vary greatly by collection period. In this note, we examine the effect of the number and magnitude of influential values on the detection regions from Clark Winsorization using data simulated to realistically reflect the properties of the population for the Monthly Retail Trade Survey (MRTS) conducted by the U.S. Census Bureau. Estimates from the MRTS and other economic surveys are used in economic indicators, such as the Gross Domestic Product (GDP). A schematic one-sided Winsorization rule follows this entry.

    Release date: 2016-12-20
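
    A schematic one-sided Winsorization rule, assuming a fixed cutoff; in Clark's method the cutoff itself is estimated (roughly, to minimize the estimated mean squared error of the total), which is not shown here.

        import numpy as np

        def winsorize_one_sided(y, cutoff):
            """Replace values above `cutoff` with the cutoff.
            The detection region is simply {y_i > cutoff}."""
            y = np.asarray(y, dtype=float)
            flagged = y > cutoff                      # values treated as influential
            return np.where(flagged, cutoff, y), flagged

        # Hypothetical monthly sales with one extreme unit:
        treated, flagged = winsorize_one_sided([12.0, 9.5, 11.0, 240.0, 10.2], cutoff=50.0)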

  • Articles and reports: 12-001-X201600214663
    Description:

    We present theoretical evidence that efforts during data collection to balance the survey response with respect to selected auxiliary variables will improve the chances for low nonresponse bias in the estimates that are ultimately produced by calibrated weighting. One of our results shows that the variance of the bias – measured here as the deviation of the calibration estimator from the (unrealized) full-sample unbiased estimator – decreases linearly as a function of the response imbalance that we assume measured and controlled continuously over the data collection period. An attractive prospect is thus a lower risk of bias if one can manage the data collection to get low imbalance. The theoretical results are validated in a simulation study with real data from an Estonian household survey.

    Release date: 2016-12-20

  • Articles and reports: 11-633-X2016003
    Description:

    Large national mortality cohorts are used to estimate mortality rates for different socioeconomic and population groups, and to conduct research on environmental health. In 2008, Statistics Canada created a cohort linking the 1991 Census to mortality. The present study describes a linkage of the 2001 Census long-form questionnaire respondents aged 19 years and older to the T1 Personal Master File and the Amalgamated Mortality Database. The linkage tracks all deaths over a 10.6-year period (to the end of 2011).

    Release date: 2016-10-26

  • The Daily
    Description: Release published in The Daily – Statistics Canada’s official release bulletin
    Release date: 2016-10-26

  • Articles and reports: 11-633-X2016002
    Description:

    Immigrants comprise an ever-increasing percentage of the Canadian population—at more than 20%, which is the highest percentage among the G8 countries (Statistics Canada 2013a). This figure is expected to rise to 25% to 28% by 2031, when at least one in four people living in Canada will be foreign-born (Statistics Canada 2010).

    This report summarizes the linkage of the Immigrant Landing File (ILF) for all provinces and territories, excluding Quebec, to hospital data from the Discharge Abstract Database (DAD), a national database containing information about hospital inpatient and day-surgery events. A deterministic exact-matching approach was used to link data from the 1980-to-2006 ILF and from the DAD (2006/2007, 2007/2008 and 2008/2009) with the 2006 Census, which served as a “bridge” file. This was a secondary linkage in that it used linkage keys created in two previous projects (primary linkages) that separately linked the ILF and the DAD to the 2006 Census. The ILF–DAD linked data were validated by means of a representative sample of 2006 Census records containing immigrant information previously linked to the DAD.

    Release date: 2016-08-17

  • Articles and reports: 12-001-X201600114539
    Description:

    Statistical matching is a technique for integrating two or more data sets when information available for matching records for individual participants across data sets is incomplete. Statistical matching can be viewed as a missing data problem where a researcher wants to perform a joint analysis of variables that are never jointly observed. A conditional independence assumption is often used to create imputed data for statistical matching. We consider a general approach to statistical matching using parametric fractional imputation of Kim (2011) to create imputed data under the assumption that the specified model is fully identified. The proposed method does not have a convergent EM sequence if the model is not identified. We also present variance estimators appropriate for the imputation procedure. We explain how the method applies directly to the analysis of data from split questionnaire designs and measurement error models.

    Release date: 2016-06-22

  • Articles and reports: 12-001-X201600114540
    Description:

    In this paper, we compare the EBLUP and pseudo-EBLUP estimators for small area estimation under the nested error regression model, and three area-level model-based estimators using the Fay-Herriot model. We conduct a design-based simulation study to compare the model-based estimators for unit-level and area-level models under informative and non-informative sampling. In particular, we are interested in the confidence interval coverage rate of the unit-level and area-level estimators. We also compare the estimators when the model is misspecified. Our simulation results show that estimators based on the unit-level model perform better than those based on the area-level model. The pseudo-EBLUP estimator is the best among the unit-level and area-level estimators. A schematic sketch of the Fay-Herriot shrinkage step follows this entry.

    Release date: 2016-06-22
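
    Under the Fay-Herriot area-level model, the direct estimate for area i is modelled as x_i'β plus an area effect with variance σ²_v plus a sampling error with known variance ψ_i, and the EBLUP shrinks the direct estimate toward the regression-synthetic estimate. A minimal sketch of that shrinkage step, assuming β and σ²_v have already been estimated; the paper's comparisons and simulation design are not reproduced here.

        import numpy as np

        def fay_herriot_eblup(theta_direct, X, beta_hat, sigma2_v_hat, psi):
            """EBLUP under the Fay-Herriot model:
            gamma_i * direct + (1 - gamma_i) * synthetic,
            with gamma_i = sigma2_v / (sigma2_v + psi_i)."""
            synthetic = X @ beta_hat
            gamma = sigma2_v_hat / (sigma2_v_hat + np.asarray(psi, dtype=float))
            return gamma * np.asarray(theta_direct, dtype=float) + (1.0 - gamma) * synthetic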

  • Articles and reports: 82-003-X201600414489
    Description:

    Using accelerometry data for children and youth aged 3 to 17 from the Canadian Health Measures Survey, the probability of adherence to physical activity guidelines is estimated using a conditional probability, given the number of active and inactive days, distributed as a beta-binomial. A schematic version of this calculation follows this entry.

    Release date: 2016-04-20
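
    A schematic version of the calculation described above, assuming the number of active days out of n monitored days follows a beta-binomial distribution and that "adherence" means at least k active days; the parameter values below are hypothetical, not estimates from the survey.

        from scipy.stats import betabinom

        def prob_adherence(n_days: int, k_required: int, a: float, b: float) -> float:
            """P(at least k_required active days out of n_days) under a
            beta-binomial(n_days, a, b) model for the number of active days."""
            return float(betabinom.sf(k_required - 1, n_days, a, b))

        # Hypothetical illustration: 7 monitored days, adherence = at least 6 active days.
        p = prob_adherence(7, 6, a=2.0, b=3.0)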

  • Technical products: 11-522-X201700014729
    Description:

    The use of administrative datasets as a data source in official statistics has become much more common as there is a drive for more outputs to be produced more efficiently. Many outputs rely on linkage between two or more datasets, and this is often undertaken in a number of phases with different methods and rules. In these situations we would like to be able to assess the quality of the linkage, and this involves some re-assessment of both links and non-links. In this paper we discuss sampling approaches to obtain estimates of false negatives and false positives with reasonable control of both accuracy of estimates and cost. Approaches to stratification of links (non-links) to sample are evaluated using information from the 2011 England and Wales population census.

    Release date: 2016-03-24

  • Technical products: 11-522-X201700014728
    Description:

    Record linkage joins together two or more sources. The product of record linkage is a file with one record per individual containing all the information about the individual from the multiple files. The problem is difficult when a unique identification key is not available, there are errors in some variables, some data are missing, and files are large. Probabilistic record linkage computes a probability that records on different files pertain to a single individual. Some true links are given low probabilities of matching, whereas some non-links are given high probabilities. Errors in linkage designations can cause bias in analyses based on the composite database. The SEER cancer registries contain information on breast cancer cases in their registry areas. A diagnostic test based on the Oncotype DX assay, performed by Genomic Health, Inc. (GHI), is often performed for certain types of breast cancers. Record linkage using personally identifiable information was conducted to associate Oncotype DX assay results with SEER cancer registry information. The software Link Plus was used to generate a score describing the similarity of records and to identify the apparent best match of SEER cancer registry individuals to the GHI database. Clerical review was used to check samples of likely matches, possible matches, and unlikely matches. Models are proposed for jointly modeling the record linkage process and subsequent statistical analysis in this and other applications. A schematic example of one common way to score candidate record pairs follows this entry.

    Release date: 2016-03-24
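
    Scores of the kind mentioned above are often built from Fellegi-Sunter agreement weights: each field contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, where m and u are the field's agreement probabilities among true matches and among non-matches. A minimal sketch of that idea; it is not Link Plus's actual scoring.

        import math

        def fs_score(record_a: dict, record_b: dict, m: dict, u: dict) -> float:
            """Sum of Fellegi-Sunter field weights for a candidate record pair.
            m[f]: P(field f agrees | true match); u[f]: P(field f agrees | non-match)."""
            score = 0.0
            for field in m:
                if record_a.get(field) == record_b.get(field):
                    score += math.log2(m[field] / u[field])
                else:
                    score += math.log2((1 - m[field]) / (1 - u[field]))
            return score

        # Pairs with high scores are treated as likely matches; middling scores go to clerical review.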

  • Articles and reports: 13-604-M2015077
    Description:

    This new dataset increases the information available for comparing the performance of provinces and territories across a range of measures. It combines provincial time series data that are often fragmented and, as such, of limited utility for examining the evolution of provincial economies over extended periods. More advanced statistical methods, and models with greater breadth and depth, are difficult to apply to existing fragmented Canadian data. The longitudinal nature of the new provincial dataset remedies this shortcoming. This report explains the construction of the latest vintage of the dataset, which contains the most up-to-date information available.

    Release date: 2015-02-12

  • Articles and reports: 12-001-X201400214110
    Description:

    In developing the sample design for a survey we attempt to produce a good design for the funds available. Information on costs can be used to develop sample designs that minimise the sampling variance of an estimator of total for fixed cost. Improvements in survey management systems mean that it is now sometimes possible to estimate the cost of including each unit in the sample. This paper develops relatively simple approaches to determine whether the potential gains arising from using this unit level cost information are likely to be of practical use. It is shown that the key factor is the coefficient of variation of the costs relative to the coefficient of variation of the relative error on the estimated cost coefficients.

    Release date: 2014-12-19

  • Technical products: 11-522-X201300014268
    Description:

    Information collection is critical for chronic-disease surveillance to measure the scope of diseases, assess the use of services, identify at-risk groups and track the course of diseases and risk factors over time with the goal of planning and implementing public-health programs for disease prevention. It is in this context that the Quebec Integrated Chronic Disease Surveillance System (QICDSS) was established. The QICDSS is a database created by linking administrative files covering the period from 1996 to 2013. It is an attractive alternative to survey data, since it covers the entire population, is not affected by recall bias and can track the population over time and space. In this presentation, we describe the relevance of using administrative data as an alternative to survey data, the methods selected to build the population cohort by linking various sources of raw data, and the processing applied to minimize bias. We will also discuss the advantages and limitations associated with the analysis of administrative files.

    Release date: 2014-10-31

  • Technical products: 11-522-X201300014270
    Description:

    There is a wide range of character-string comparators in the record-linkage field. Comparison problems arise when factors affect the composition of the strings (for example, the use of a nickname instead of a given name, or typographical errors). In these cases, more sophisticated comparators must be used. Such tools help to reduce the number of potentially missed links, although some of the gains may be false links. To improve the matches, three sophisticated string comparators were developed and are described in this paper: the Lachance comparator and its derivatives, the multi-word comparator and the multi-type comparator. This set of tools is currently available in a deterministic record-linkage prototype known as MixMatch, which can use prior knowledge to reduce the volume of false links generated during matching. The paper also proposes a link-strength indicator. A generic string-similarity sketch follows this entry.

    Release date: 2014-10-31
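
    The Lachance comparators themselves are not reproduced here, but the general idea of a character-string comparator can be shown with a generic similarity measure; production linkage systems use more specialized comparators (Jaro-Winkler, multi-word handling, and so on). A minimal sketch:

        from difflib import SequenceMatcher

        def string_similarity(a: str, b: str) -> float:
            """Generic similarity in [0, 1]; tolerates typos better than exact comparison."""
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()

        string_similarity("Jonathon Smith", "Jonathan Smith")   # close to 1 despite the typo
        string_similarity("Bill Smith", "William Smith")        # lower: nickname vs. given name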

  • Articles and reports: 12-001-X201400114004
    Description:

    In 2009, two major surveys in the Governments Division of the U.S. Census Bureau were redesigned to reduce sample size, save resources, and improve the precision of the estimates (Cheng, Corcoran, Barth and Hogue 2009). The new design divides each of the traditional state-by-government-type strata with sufficiently many units into two sub-strata according to each governmental unit’s total payroll, in order to sample less from the sub-stratum with small-size units. The model-assisted approach is adopted in estimating population totals. Regression estimators using auxiliary variables are obtained either within each created sub-stratum or within the original stratum by collapsing the two sub-strata. A decision-based method was proposed in Cheng, Slud and Hogue (2010), applying a hypothesis test to decide which regression estimator is used within each original stratum. Consistency and asymptotic normality of these model-assisted estimators are established here, under a design-based or model-assisted asymptotic framework. Our asymptotic results also suggest two types of consistent variance estimators, one obtained by substituting unknown quantities in the asymptotic variances and the other by applying the bootstrap. The performance of all the estimators of totals, and of their variance estimators, is examined in some empirical studies. The U.S. Annual Survey of Public Employment and Payroll (ASPEP) is used to motivate and illustrate our study.

    Release date: 2014-06-27

Data (0) (0 results)


Analysis (76)

  • Articles and reports: 11F0027M2014092
    Description:

    Using data from the Provincial KLEMS database, this paper asks whether provincial economies have undergone structural change in their business sectors since 2000. It does so by applying a measure of industrial change (the dissimilarity index) to measures of output (real GDP) and hours worked. The paper also develops a statistical methodology to test whether the shifts in the industrial composition of output and hours worked over the period are due to random year-over-year changes in industrial structure or to long-term systematic change in the structure of provincial economies. The paper is designed to inform discussion and analysis of recent changes in industrial composition at the national level, notably the decline in manufacturing output and the concomitant rise of resource industries, and the implications of this change for provincial economies. A schematic calculation of the dissimilarity index follows this entry.

    Release date: 2014-05-07
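
    The dissimilarity index used in this kind of analysis is typically half the sum of absolute differences between two composition vectors whose shares each sum to one. A minimal sketch, with hypothetical industry shares:

        import numpy as np

        def dissimilarity_index(shares_base, shares_current) -> float:
            """0 = identical industrial composition; 1 = completely different.
            Both arguments are industry shares that each sum to 1."""
            s0 = np.asarray(shares_base, dtype=float)
            s1 = np.asarray(shares_current, dtype=float)
            return 0.5 * float(np.abs(s1 - s0).sum())

        # A shift of output shares from manufacturing toward resource industries raises the index:
        dissimilarity_index([0.30, 0.10, 0.60], [0.22, 0.18, 0.60])   # = 0.08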

  • Articles and reports: 12-001-X201300211870
    Description:

    At national statistical institutes, experiments embedded in ongoing sample surveys are frequently conducted, for example, to test the effect of modifications in the survey process on the main parameter estimates of the survey, to quantify the effect of alternative survey implementations on these estimates, or to obtain insight into the various sources of non-sampling errors. A design-based analysis procedure for factorial completely randomized designs and factorial randomized block designs embedded in probability samples is proposed in this paper. Design-based Wald statistics are developed to test whether estimated population parameters (means, totals and ratios of two population totals) observed under the different treatment combinations of the experiment are significantly different. The methods are illustrated with a real-life application of an experiment embedded in the Dutch Labour Force Survey.

    Release date: 2014-01-15

  • Articles and reports: 12-001-X201300111823
    Description:

    Although weights are widely used in survey sampling, their ultimate justification from the design perspective is often problematic. Here we argue for a stepwise Bayes justification for weights that does not depend explicitly on the sampling design. This approach makes use of the standard kind of information present in auxiliary variables; however, it does not assume a model relating the auxiliary variables to the characteristic of interest. The resulting weight for a unit in the sample can be given the usual interpretation as the number of units in the population which it represents.

    Release date: 2013-06-28

  • Articles and reports: 82-003-X201300611796
    Description:

    The study assesses the feasibility of using statistical modelling techniques to fill information gaps related to risk factors, specifically, smoking status, in linked long-form census data.

    Release date: 2013-06-19

  • Articles and reports: 12-001-X201200111685
    Description:

    Survey data are often used to fit linear regression models. The values of covariates used in modeling are not controlled as they might be in an experiment. Thus, collinearity among the covariates is an inevitable problem in the analysis of survey data. Although many books and articles have described the collinearity problem and proposed strategies to understand, assess and handle its presence, the survey literature has not provided appropriate diagnostic tools to evaluate its impact on regression estimation when the survey complexities are considered. We have developed variance inflation factors (VIFs) that measure the amount by which the variances of parameter estimators are increased due to having non-orthogonal predictors. The VIFs are appropriate for survey-weighted regression estimators and account for complex design features such as weights, clusters and strata. Illustrations of these methods are given using a probability sample from a household survey of health and nutrition. A sketch of the standard (unweighted) VIF calculation follows this entry.

    Release date: 2012-06-27
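
    For reference, the standard (unweighted) variance inflation factor for predictor j is 1/(1 - R²_j), where R²_j comes from regressing predictor j on the other predictors; the paper's contribution is to extend this idea to survey-weighted regression with weights, clusters and strata, which the sketch below does not do.

        import numpy as np

        def vif(X: np.ndarray) -> np.ndarray:
            """Unweighted variance inflation factors, one per column of X
            (columns are predictors; do not include an intercept column)."""
            n, p = X.shape
            out = np.empty(p)
            for j in range(p):
                y = X[:, j]
                Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
                beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
                resid = y - Z @ beta
                r2 = 1.0 - resid.var() / y.var()
                out[j] = 1.0 / (1.0 - r2)
            return out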

Reference (39)

  • Technical products: 11-522-X2009000
    Description:

    Symposium 2009 was the twenty-fifth in Statistics Canada's series of international symposia on methodological issues. Each year the symposium focuses on a particular theme. In 2009, the theme was: "Longitudinal Surveys: From Design to Analysis".

    Release date: 2012-10-03

  • Technical products: 11-522-X2010000
    Description:

    Since 1984, an annual international symposium on methodological issues has been sponsored by Statistics Canada. Proceedings have been available since 1987. Symposium 2010 was entitled "Social Statistics: The Interplay among Censuses, Surveys and Administrative Data".

    Release date: 2011-09-15

  • Technical products: 11-522-X2008000
    Description:

    Since 1984, an annual international symposium on methodological issues has been sponsored by Statistics Canada. Proceedings have been available since 1987. Symposium 2008 was the twenty-fourth in Statistics Canada's series of international symposia on methodological issues. Each year the symposium focuses on a particular theme. In 2008, the theme was: "Data Collection: Challenges, Achievements and New Directions".

    Release date: 2009-12-03

  • Technical products: 11-522-X200800011003
    Description:

    This study examined the feasibility of developing correction factors to adjust self-reported measures of body mass index (BMI) to more closely approximate measured values. Data are from the 2005 Canadian Community Health Survey, where respondents were asked to report their height and weight and were subsequently measured. Regression analyses were used to determine which socio-demographic and health characteristics were associated with the discrepancies between reported and measured values. The sample was then split into two groups. In the first, the self-reported BMI and the predictors of the discrepancies were regressed on the measured BMI. Correction equations were generated using all predictor variables that were significant at the p < 0.05 level. These correction equations were then tested in the second group to derive estimates of sensitivity, specificity and obesity prevalence. Logistic regression was used to examine the relationship between measured, reported and corrected BMI and obesity-related health conditions. Corrected estimates provided more accurate measures of obesity prevalence, mean BMI and sensitivity levels. Self-reported data exaggerated the relationship between BMI and health conditions, while in most cases the corrected estimates provided odds ratios that were more similar to those generated with the measured BMI. A schematic version of the split-sample correction procedure follows this entry.

    Release date: 2009-12-03
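
    A schematic version of the split-sample procedure described above: estimate a correction equation on one half of the sample (measured BMI regressed on self-reported BMI and covariates), apply it to the other half, and compare corrected values against measured ones. Variable names are hypothetical and the significance-screening step is omitted.

        import numpy as np

        def fit_correction(self_report, covariates, measured):
            """Least-squares correction equation: measured ~ intercept + self-report + covariates."""
            X = np.column_stack([np.ones(len(self_report)), self_report, covariates])
            beta, *_ = np.linalg.lstsq(X, measured, rcond=None)
            return beta

        def apply_correction(beta, self_report, covariates):
            X = np.column_stack([np.ones(len(self_report)), self_report, covariates])
            return X @ beta

        # Fit on the first half, predict on the second half, then compare, for example,
        # obesity prevalence (corrected BMI >= 30) with prevalence based on measured BMI.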

  • Technical products: 11-522-X200800010971
    Description:

    Keynote address

    Release date: 2009-12-03

  • Technical products: 11-522-X200800011002
    Description:

    Based on a representative sample of the Canadian population, this article quantifies the bias resulting from the use of self-reported rather than directly measured height, weight and body mass index (BMI). Associations between BMI categories and selected health conditions are compared to see if the misclassification resulting from the use of self-reported data alters associations between obesity and obesity-related health conditions. The analysis is based on 4,567 respondents to the 2005 Canadian Community Health Survey (CCHS) who, during a face-to-face interview, provided self-reported values for height and weight and were then measured by trained interviewers. Based on self-reported data, a substantial proportion of individuals with excess body weight were erroneously placed in lower BMI categories. This misclassification resulted in elevated associations between overweight/obesity and morbidity.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010959
    Description:

    The Unified Enterprise Survey (UES) at Statistics Canada is an annual business survey that unifies more than 60 surveys from different industries. Two types of collection follow-up score functions are currently used in the UES data collection. The objective of using a score function is to maximize the economically weighted response rates of the survey in terms of the primary variables of interest, under the constraint of a limited follow-up budget. Since the two types of score functions are based on different methodologies, they could have different impacts on the final estimates.

    This study generally compares the two types of score functions based on the collection data obtained from the two recent years. For comparison purposes, this study applies each score function method to the same data respectively and computes various estimates of the published financial and commodity variables, their deviation from the true pseudo value and their mean square deviation, based on each method. These estimates of deviation and mean square deviation based on each method are then used to measure the impact of each score function on the final estimates of the financial and commodity variables.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010953
    Description:

    As survey researchers attempt to maintain traditionally high response rates, reluctant respondents have resulted in increasing data collection costs. This respondent reluctance may be related to the amount of time it takes to complete an interview in large-scale, multi-purpose surveys, such as the National Survey of Recent College Graduates (NSRCG). Recognizing that respondent burden or questionnaire length may contribute to lower response rates, in 2003, following several months of data collection under the standard data collection protocol, the NSRCG offered its nonrespondents monetary incentives about two months before the end of data collection. In conjunction with the incentive offer, the NSRCG also offered persistent nonrespondents an opportunity to complete a much-abbreviated interview consisting of a few critical items. The late respondents who completed the interviews as a result of the incentive and critical-items-only questionnaire offers may provide some insight into the issue of nonresponse bias and the likelihood that such interviewees would have remained survey nonrespondents if these refusal conversion efforts had not been made.

    In this paper, we define "reluctant respondents" as those who responded to the survey only after extra efforts were made beyond the ones initially planned in the standard data collection protocol. Specifically, reluctant respondents in the 2003 NSRCG are those who responded to the regular or shortened questionnaire following the incentive offer. Our conjecture was that the behavior of the reluctant respondents would be more like that of nonrespondents than of respondents to the surveys. This paper describes an investigation of reluctant respondents and the extent to which they are different from regular respondents. We compare different response groups on several key survey estimates. This comparison will expand our understanding of nonresponse bias in the NSRCG, and of the characteristics of nonrespondents themselves, thus providing a basis for changes in the NSRCG weighting system or estimation procedures in the future.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010960
    Description:

    Non-response is inevitable in any survey, despite all the effort put into reducing it at the various stages of the survey. In particular, non-response can cause bias in the estimates. In addition, non-response is an especially serious problem in longitudinal studies because the sample shrinks over time. France's ELFE (Étude Longitudinale Française depuis l'Enfance) is a project that aims to track 20,000 children from birth to adulthood using a multidisciplinary approach. This paper is based on the results of the initial pilot studies conducted in 2007 to test the survey's feasibility and acceptance. The participation rates are presented (response rate, non-response factors) along with a preliminary description of the non-response treatment methods being considered.

    Release date: 2009-12-03

  • Technical products: 11-522-X200800010951
    Description:

    Missing values caused by item nonresponse represent one type of non-sampling error that occurs in surveys. When cases with missing values are discarded in statistical analyses, estimates may be biased because of differences between responders with missing values and responders that do not have missing values. Also, when variables in the data have different patterns of missingness among sampled cases, and cases with missing values are discarded in statistical analyses, those analyses may yield inconsistent results because they are based on different subsets of sampled cases that may not be comparable. However, analyses that discard cases with missing values may be valid provided those values are missing completely at random (MCAR). Are those missing values MCAR?

    To compensate, missing values are often imputed or survey weights are adjusted using weighting class methods. Subsequent analyses based on those compensations may be valid provided that missing values are missing at random (MAR) within each of the categorizations of the data implied by the independent variables of the models that underlie those adjustment approaches. Are those missing values MAR?

    Because missing values are not observed, MCAR and MAR assumptions made by statistical analyses are infrequently examined. This paper describes a selection model from which statistical significance tests for the MCAR and MAR assumptions can be examined although the missing values are not observed. Data from the National Immunization Survey conducted by the U.S. Department of Health and Human Services are used to illustrate the methods.

    Release date: 2009-12-03

  • Technical products: 12-002-X200900110692
    Description:

    By examining responses to questions asked repeatedly of the same respondents over several cycles of longitudinal data, researchers can study how trends change over time. Working with these repeatedly measured responses can often be challenging. This article examines trends in youths' volunteering activities, using data from the National Longitudinal Survey of Children and Youth, to highlight several issues that researchers should consider when working with repeated measures.

    Release date: 2009-04-22

  • Technical products: 12-002-X200900110693
    Description:

    This article summarizes a set of procedures, developed initially for the author's research on Unemployment Insurance (UI), for constructing customized duration data from the Survey of Labour and Income Dynamics (SLID) using SPSS software. These procedures could be used to merge, deduce, or match multiple duration datasets.
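
    The article itself works in SPSS; purely as an illustration of the general idea, the Python sketch below derives spell durations from person-month records. The toy data and column names are hypothetical and do not come from SLID.

        # Derive unemployment spell durations from person-month status records.
        import pandas as pd

        months = pd.DataFrame({
            "person_id":  [1] * 6 + [2] * 6,
            "month":      list(range(1, 7)) * 2,
            "unemployed": [0, 1, 1, 1, 0, 0,   1, 1, 0, 0, 1, 1],
        })

        months = months.sort_values(["person_id", "month"])
        # A new spell starts whenever the status changes within a person
        # (each person's first month also starts a new spell).
        change = months.groupby("person_id")["unemployed"].diff().ne(0)
        months["spell_id"] = change.cumsum()

        durations = (months[months["unemployed"] == 1]
                     .groupby(["person_id", "spell_id"])
                     .size()
                     .rename("duration_months")
                     .reset_index())
        print(durations)

    The same run-length logic underlies most duration constructions: identify where spells begin, number them, and count the records within each spell.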

    Release date: 2009-04-22

  • Technical products: 11-522-X2006001
    Description:

    Since 1984, an annual international symposium on methodological issues has been sponsored by Statistics Canada. Proceedings have been available since 1987. Symposium 2006 was the twenty-third in Statistics Canada's series of international symposia on methodological issues. Each year the symposium focuses on a particular theme. In 2006 the theme was: "Methodological Issues in Measuring Population Health".

    Release date: 2008-03-17

  • Technical products: 11-522-X200600110404
    Description:

    Pursuing reductions in cost and response burden in survey programs has led to increased use of information available in administrative databases. Linkage between these two data sources is a way to exploit their complementary nature and maximize their respective usefulness. This paper discusses the various ways we have performed record linkage between the Canadian Community Health Survey (CCHS) and the Health Person-Oriented Information (HPOI) databases. The files resulting from selected linkage methods are used in an analysis of risk factors for having been hospitalized for heart disease. The sensitivity of the analysis with respect to the various linkage approaches is investigated.
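
    As a very rough illustration of what a multi-pass deterministic linkage can look like (the paper compares several strategies, which are not reproduced here), the Python sketch below links two files first on a shared identifier and then on a cruder composite key. All file names, keys and columns are hypothetical.

        # Two-pass deterministic record linkage between a survey file and an
        # administrative file. Every name here is a placeholder.
        import pandas as pd

        survey = pd.read_csv("survey_respondents.csv")
        admin = pd.read_csv("admin_hospitalizations.csv")

        # Pass 1: exact match on a shared identifier where it is available.
        pass1 = survey.merge(admin, on="health_number", how="inner",
                             suffixes=("_svy", "_adm"))

        # Pass 2: for survey records still unlinked, match on date of birth,
        # sex and postal code (a cruder key, more prone to false matches).
        unlinked = survey[~survey["health_number"].isin(pass1["health_number"])]
        pass2 = unlinked.merge(admin, on=["birth_date", "sex", "postal_code"],
                               how="inner", suffixes=("_svy", "_adm"))

        linked = pd.concat([pass1, pass2], ignore_index=True)
        print(f"linked {len(linked)} of {len(survey)} survey records")

    Probabilistic methods typically layer agreement weights and review of doubtful pairs on top of deterministic passes like these; the analytic question in the paper is how sensitive downstream estimates are to those choices.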

    Release date: 2008-03-17

  • Technical products: 11-522-X200600110391
    Description:

    Small area estimation using linear area-level models typically assumes normality of the area-level random effects (model errors) and of the survey errors of the direct survey estimates. Outlying observations can be a concern, and can arise from outliers in either the model errors or the survey errors, two possibilities with very different implications. We consider both possibilities here and investigate empirically how a Bayesian approach with a t-distribution assumed for one of the error components can address potential outliers. The empirical examples use models for U.S. state poverty ratios from the U.S. Census Bureau's Small Area Income and Poverty Estimates program, extending the usual Gaussian models to assume a t-distribution for the model error or survey error. Results are examined to see how they are affected by varying the (assumed known) number of degrees of freedom of the t-distribution. We find that using a t-distribution with low degrees of freedom can diminish the effects of outliers, but in the examples discussed the results stop short of outright rejection of observations.
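
    For readers unfamiliar with the setup, a schematic version of the area-level model being extended is written out below; the notation is generic and is not taken from the paper.

        \hat{\theta}_i = \theta_i + e_i, \qquad
        \theta_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + u_i, \qquad i = 1, \dots, m,

    with survey errors e_i \sim N(0, \psi_i) (the sampling variances \psi_i treated as known) and model errors u_i \sim N(0, \sigma_u^2) in the usual Gaussian case. The robust variants described above replace one of these assumptions with a scaled t-distribution, for example u_i \sim t_{\nu}(0, \sigma_u^2) with low degrees of freedom \nu, so that an outlying area is attributed a heavy-tailed model error and its direct estimate is shrunk less aggressively toward the regression-synthetic value.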

    Release date: 2008-03-17

  • Technical products: 11-522-X200600110410
    Description:

    The U.S. Survey of Occupational Illnesses and Injuries (SOII) is a large-scale establishment survey conducted by the Bureau of Labor Statistics to measure incidence rates and impact of occupational illnesses and injuries within specified industries at the national and state levels. This survey currently uses relatively simple procedures for detection and treatment of outliers. The outlier-detection methods center on comparison of reported establishment-level incidence rates to the corresponding distribution of reports within specified cells defined by the intersection of state and industry classifications. The treatment methods involve replacement of standard probability weights with a weight set equal to one, followed by a benchmark adjustment.

    One could use more complex methods for detection and treatment of outliers for the SOII, e.g., detection methods that use influence functions, probability weights and multivariate observations; or treatment methods based on Winsorization or M-estimation. Evaluation of the practical benefits of these more complex methods requires one to consider three important factors. First, severe outliers are relatively rare, but when they occur, they may have a severe impact on SOII estimators in cells defined by the intersection of states and industries. Consequently, practical evaluation of the impact of outlier methods focuses primarily on the tails of the distributions of estimators, rather than on standard aggregate performance measures like variance or mean squared error. Second, the analytic and data-based evaluations focus on the incremental improvement obtained through use of the more complex methods, relative to the performance of the simple methods currently in place. Third, development of the abovementioned tools requires somewhat nonstandard asymptotics that reflect trade-offs in the effects associated with, respectively, increasing sample sizes, increasing numbers of publication cells, and changing tails of the underlying distributions of observations.
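
    The two ideas in the first paragraph, flagging rates that are extreme within their state-by-industry cell and then capping them, can be sketched as below. The cell-level quantile rule and the simple Winsorization are illustrative stand-ins, not the SOII production procedure, and all file and column names are hypothetical.

        # Flag extreme incidence rates within state x industry cells and apply a
        # simple Winsorization-style cap. Illustrative only.
        import pandas as pd

        soii = pd.read_csv("establishment_rates.csv")  # hypothetical microdata

        def flag_and_winsorize(cell, upper_q=0.99):
            """Within one cell, flag and cap rates above the upper_q quantile."""
            cell = cell.copy()
            cap = cell["incidence_rate"].quantile(upper_q)
            cell["outlier"] = cell["incidence_rate"] > cap
            cell["rate_winsorized"] = cell["incidence_rate"].clip(upper=cap)
            return cell

        soii = (soii.groupby(["state", "industry"], group_keys=False)
                    .apply(flag_and_winsorize))

        print(soii["outlier"].sum(), "establishment reports flagged")

    Because severe outliers are rare but extreme, assessing rules like this mainly means examining what they do to the upper tail of the cell-level estimates rather than to average error measures, which is the evaluation point made above.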

    Release date: 2008-03-17

  • Technical products: 11-522-X200600110431
    Description:

    We describe statistical disclosure control (SDC) methods developed for a public-release Canadian Hospitals Injury Reporting and Prevention Program (CHIRPP) microdata file. CHIRPP is a national injury surveillance database managed by the Public Health Agency of Canada (PHAC). After describing CHIRPP, the paper gives a brief overview of basic SDC concepts as an introduction to the process of selecting and developing appropriate SDC methods for CHIRPP, given its specific challenges and requirements. We then summarize some key results. The paper concludes with a discussion of the implications of this work for the health information field and closing remarks on some methodological issues for consideration.

    Release date: 2008-03-17

  • Technical products: 11-522-X200600110408
    Description:

    Despite advances that have improved the health of the United States population, disparities in health remain among various racial/ethnic and socio-economic groups. Common data sources for assessing the health of a population of interest include large-scale surveys that often pose questions requiring a self-report, such as, "Has a doctor or other health professional ever told you that you have [health condition of interest]?" Answers to such questions might not always reflect the true prevalences of health conditions (for example, if a respondent does not have access to a doctor or other health professional). Similarly, self-reported data on quantities such as height and weight might be subject to reporting errors. Such "measurement error" in health data could affect inferences about measures of health and health disparities. In this work, we fit measurement-error models to data from the National Health and Nutrition Examination Survey, which asks self-report questions during an interview component and also obtains physical measurements during an examination component. We then develop methods for using the fitted models to improve on analyses of self-reported data from another survey that does not include an examination component. The methods, which involve multiply imputing examination-based data values for the survey that has only self-reported data, are applied to the National Health Interview Survey in examples involving diabetes, hypertension, and obesity. Preliminary results suggest that the adjustments for measurement error can result in non-negligible changes in estimates of measures of health.
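
    The multiple-imputation idea can be illustrated with a deliberately simplified sketch: fit a model of the examination measure given the self-report on data that contain both, then draw several plausible examination values for a file that has only the self-report. File and column names are hypothetical, the linear model is a placeholder, and a fully proper procedure would also draw the model parameters from their posterior distribution and combine variances with Rubin's rules.

        # Simplified multiple imputation of an examination-based measure.
        import numpy as np
        import pandas as pd

        rng = np.random.default_rng(0)

        calib = pd.read_csv("calibration_survey.csv")   # has self_report_bmi and exam_bmi
        target = pd.read_csv("target_survey.csv")       # has self_report_bmi only

        # Linear measurement-error model fitted on the calibration data.
        X = np.column_stack([np.ones(len(calib)), calib["self_report_bmi"]])
        beta, *_ = np.linalg.lstsq(X, calib["exam_bmi"], rcond=None)
        resid_sd = (calib["exam_bmi"] - X @ beta).std()

        # Draw M completed datasets and estimate obesity prevalence in each.
        M, estimates = 20, []
        Xt = np.column_stack([np.ones(len(target)), target["self_report_bmi"]])
        for _ in range(M):
            imputed_bmi = Xt @ beta + rng.normal(0.0, resid_sd, size=len(target))
            estimates.append((imputed_bmi >= 30).mean())

        print("MI prevalence estimate:", np.mean(estimates))

    The contrast of interest is then between an adjusted estimate of this kind and the naive prevalence computed directly from the self-reported values.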

    Release date: 2008-03-17

  • Technical products: 11-522-X200600110432
    Description:

    The use of discrete variables with known statistical distributions to mask data on other discrete variables has been under study for some time. This paper presents a few results from our research on this topic. The consequences of sampling with and without replacement from finite populations are one principal interest. Estimates of first- and second-order moments that attenuate or adjust for the additional variation due to masking of a known type are developed. The impact of masking the original data on the correlation structure of concomitantly measured discrete variables is considered, and the need for further development of results for analyses of multivariate data is discussed.
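
    A generic additive-masking case (not necessarily the scheme studied in the paper) shows what such moment adjustments look like. If the released value is Z = X + U, with the masking variable U drawn independently from a known distribution with mean \mu_U and variance \sigma_U^2, then

        \hat{E}(X) = \bar{Z} - \mu_U, \qquad
        \widehat{\mathrm{Var}}(X) = s_Z^2 - \sigma_U^2,

    are the natural adjusted estimators. For two variables masked with independent noise, the sample covariance of the released values is unchanged in expectation while both variances are inflated, so the unadjusted correlation is attenuated; adjusting the variances as above restores a consistent estimate of the correlation structure.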

    Release date: 2008-03-17

  • Technical products: 11-522-X200600110442
    Description:

    The District of Columbia Healthy Outcomes of Pregnancy Education (DC-HOPE) project is a randomized trial funded by the National Institute of Child Health and Human Development to test the effectiveness of an integrated education and counseling intervention (INT) versus usual care (UC) to reduce four risk behaviors among pregnant women. Participants were interviewed at baseline and three additional time points. Multiple imputation (MI) was used to estimate data for missing interviews. MI was done twice: once with all data imputed simultaneously, and once with data for women in the INT and UC groups imputed separately. Analyses of both imputed data sets and the pre-imputation data are compared.
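
    The design choice being compared, imputing once on the pooled file versus imputing separately within each randomization arm, can be sketched as follows. Column names are hypothetical and a simple mean fill stands in for a full multiple-imputation procedure.

        # Pooled vs. group-specific imputation of a missing outcome.
        import pandas as pd

        trial = pd.read_csv("trial_interviews.csv")   # hypothetical file

        # Pooled: a single fill value estimated from all participants.
        pooled = trial.copy()
        pooled["risk_score"] = pooled["risk_score"].fillna(pooled["risk_score"].mean())

        # Group-specific: fill within each arm (INT vs. UC), so imputed values
        # cannot borrow the other arm's outcome distribution.
        by_arm = trial.copy()
        by_arm["risk_score"] = (by_arm.groupby("arm")["risk_score"]
                                      .transform(lambda s: s.fillna(s.mean())))

        print(pooled.groupby("arm")["risk_score"].mean())
        print(by_arm.groupby("arm")["risk_score"].mean())

    Imputing separately by arm preserves any intervention-specific structure in the imputed outcomes, which is one reason the two approaches can yield different estimates.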

    Release date: 2008-03-17
