Statistical techniques

Results

All (7)

  • Articles and reports: 11-522-X202100100013
    Description: Statistics Canada’s Labour Force Survey (LFS) plays a fundamental role in the mandate of Statistics Canada. The labour market information provided by the LFS is among the most timely and important measures of the Canadian economy’s overall performance. An integral part of the LFS monthly data processing is the coding of each respondent’s industry according to the North American Industry Classification System (NAICS), occupation according to the National Occupational Classification (NOC) and the Primary Class of Workers (PCOW). Each month, up to 20,000 records are coded manually. In 2020, Statistics Canada worked on developing machine learning models using fastText to code responses to the LFS questionnaire according to the three classifications mentioned previously. This article provides an overview of the methodology developed and the results obtained from a potential application of fastText in the LFS coding process.

    Key Words: Machine Learning; Labour Force Survey; Text classification; fastText.

    Release date: 2021-11-05
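
The coding task described above is a supervised text-classification problem. The sketch below is a toy stdlib bag-of-words scorer, not fastText itself (fastText additionally learns word and character n-gram embeddings); all free-text responses and NAICS-style labels are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy training data: free-text industry responses with invented NAICS-style labels.
TRAIN = [
    ("wheat and canola farm", "111_crop_production"),
    ("grain farming operation", "111_crop_production"),
    ("residential house construction", "236_construction"),
    ("building new homes", "236_construction"),
    ("elementary school teaching", "611_education"),
    ("high school teacher", "611_education"),
]

def train(examples):
    """Count word frequencies per label (a naive bag-of-words model;
    fastText instead averages learned word/n-gram embeddings)."""
    model = defaultdict(Counter)
    for text, label in examples:
        model[label].update(text.lower().split())
    return model

def predict(model, text):
    """Score each label by word overlap with the response and return the best."""
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
    return max(scores, key=scores.get)

model = train(TRAIN)
print(predict(model, "teacher at a high school"))  # 611_education
```

In a production setting the scorer would be replaced by a trained fastText model and the label set by the full NAICS/NOC/PCOW code lists; low-confidence predictions would still be routed to manual coders.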

  • Articles and reports: 12-001-X201900200003
    Description:

    Merging available sources of information is becoming increasingly important for improving estimates of population characteristics in a variety of fields. In the presence of several independent probability samples from a finite population, we investigate options for a combined estimator of the population total, based either on a linear combination of the separate estimators or on the combined sample approach. A linear combination estimator based on estimated variances can be biased, as the separate estimators of the population total can be highly correlated with their respective variance estimators. We illustrate the possibility of using the combined sample to estimate the variances of the separate estimators, which results in general pooled variance estimators. These pooled variance estimators use all available information and have the potential to significantly reduce the bias of a linear combination of separate estimators.

    Release date: 2019-06-27
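
The linear-combination approach can be sketched with inverse-variance weights. Note that the bias the abstract warns about arises when the variances are themselves estimated from the same samples; this illustration sidesteps that by treating the variances as known, and all figures are invented.

```python
# Two independent estimates of the same population total, with their
# (assumed known) variances; values are invented for illustration.
estimates = [10500.0, 9800.0]
variances = [250_000.0, 640_000.0]

# Inverse-variance weights minimize the variance of the linear combination
# when the estimators are independent and unbiased.
inv = [1.0 / v for v in variances]
weights = [w / sum(inv) for w in inv]

combined = sum(w * t for w, t in zip(weights, estimates))
combined_var = 1.0 / sum(inv)  # variance of the optimal combination
print(round(combined, 1), round(combined_var, 1))  # 10303.4 179775.3
```

The combined variance is below either separate variance, which is the motivation for combining; the paper's pooled variance estimators aim to supply the `variances` inputs without inducing correlation with the estimates.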

  • Articles and reports: 11-633-X2018014
    Description:

    The Canadian Mortality Database (CMDB) is an administrative database that collects information on cause of death from all provincial and territorial vital statistics registries in Canada. The CMDB lacks subpopulation identifiers to examine mortality rates and disparities among groups such as First Nations, Métis, Inuit and members of visible minority groups. Linkage between the CMDB and the Census of Population is an approach to circumvent this limitation. This report describes a linkage between the CMDB (2006 to 2011) and the 2006 Census of Population, which was carried out using hierarchical deterministic exact matching, with a focus on methodology and validation.

    Release date: 2018-02-14
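
A hierarchical deterministic exact-matching pass structure like the one described can be sketched as below; the field names, pass ordering and records are hypothetical, not the actual CMDB-Census linkage keys.

```python
# Hypothetical census records to link a mortality record against.
census = [
    {"id": "C1", "surname": "tremblay", "dob": "1950-03-12", "postal": "H2X1Y4"},
    {"id": "C2", "surname": "tremblay", "dob": "1950-03-12", "postal": "G1R2B5"},
]

# Passes ordered from strictest to loosest: a record links at the first
# pass where exactly one census record matches on all listed keys.
PASSES = [
    ("surname", "dob", "postal"),
    ("surname", "dob"),
]

def link(death, census, passes=PASSES):
    for keys in passes:
        hits = [c for c in census if all(c[k] == death[k] for k in keys)]
        if len(hits) == 1:          # unique exact match: accept and stop
            return hits[0]["id"], keys
    return None, None               # no unique match at any pass

death = {"surname": "tremblay", "dob": "1950-03-12", "postal": "H2X1Y4"}
print(link(death, census))  # ('C1', ('surname', 'dob', 'postal'))
```

Recording which pass produced each link (as the tuple of keys here) supports the kind of validation the report focuses on, since looser passes carry a higher false-match risk.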

  • Articles and reports: 11-522-X201300014268
    Description:

    Information collection is critical for chronic-disease surveillance to measure the scope of diseases, assess the use of services, identify at-risk groups and track the course of diseases and risk factors over time with the goal of planning and implementing public-health programs for disease prevention. It is in this context that the Quebec Integrated Chronic Disease Surveillance System (QICDSS) was established. The QICDSS is a database created by linking administrative files covering the period from 1996 to 2013. It is an attractive alternative to survey data, since it covers the entire population, is not affected by recall bias and can track the population over time and space. In this presentation, we describe the relevance of using administrative data as an alternative to survey data, the methods selected to build the population cohort by linking various sources of raw data, and the processing applied to minimize bias. We will also discuss the advantages and limitations associated with the analysis of administrative files.

    Release date: 2014-10-31
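
Building a population cohort by linking yearly administrative extracts can be sketched as follows; the person IDs, years and disease flags are invented, and the real QICDSS links several distinct file types (health insurance, hospitalization, death) rather than one flag.

```python
# Invented yearly administrative extracts keyed by a hypothetical person ID.
files = {
    2011: {"P1": {"diabetes": False}, "P2": {"diabetes": True}},
    2012: {"P1": {"diabetes": True},  "P2": {"diabetes": True}, "P3": {"diabetes": False}},
}

# Build a cohort: one record per person, tracking status across years.
cohort = {}
for year in sorted(files):
    for pid, rec in files[year].items():
        cohort.setdefault(pid, {})[year] = rec["diabetes"]

# Prevalence per year over the full linked population (no recall bias:
# status comes from the files, not from respondents' memory).
prevalence = {y: sum(files[y][p]["diabetes"] for p in files[y]) / len(files[y])
              for y in files}
print(cohort["P1"], round(prevalence[2012], 3))
```

Because every person in the administrative files enters the cohort, this design covers the entire population and supports following individuals over time, which is the advantage over survey data the abstract highlights.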

  • Articles and reports: 11-522-X200600110410
    Description:

    The U.S. Survey of Occupational Illnesses and Injuries (SOII) is a large-scale establishment survey conducted by the Bureau of Labor Statistics to measure incidence rates and impact of occupational illnesses and injuries within specified industries at the national and state levels. This survey currently uses relatively simple procedures for detection and treatment of outliers. The outlier-detection methods center on comparison of reported establishment-level incidence rates to the corresponding distribution of reports within specified cells defined by the intersection of state and industry classifications. The treatment methods involve replacement of standard probability weights with a weight set equal to one, followed by a benchmark adjustment.

    One could use more complex methods for detection and treatment of outliers for the SOII, e.g., detection methods that use influence functions, probability weights and multivariate observations; or treatment methods based on Winsorization or M-estimation. Evaluation of the practical benefits of these more complex methods requires one to consider three important factors. First, severe outliers are relatively rare, but when they occur, they may have a severe impact on SOII estimators in cells defined by the intersection of states and industries. Consequently, practical evaluation of the impact of outlier methods focuses primarily on the tails of the distributions of estimators, rather than standard aggregate performance measures like variance or mean squared error. Second, the analytic and data-based evaluations focus on the incremental improvement obtained through use of the more complex methods, relative to the performance of the simple methods currently in place. Third, development of the abovementioned tools requires somewhat nonstandard asymptotics that reflect trade-offs in effects associated with, respectively, increasing sample sizes; increasing numbers of publication cells; and changing tails of underlying distributions of observations.

    Release date: 2008-03-17
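
The current weight-replacement treatment might be sketched as below: reports far from their state-by-industry cell's distribution keep their value but have their probability weight replaced by one. The median/MAD detection rule and cutoff here are illustrative assumptions, not the SOII's actual procedure, and all rates and weights are invented.

```python
import statistics

# Invented establishment incidence rates and probability weights within one
# state-by-industry cell.
rates   = [2.1, 1.8, 2.4, 2.0, 40.0]
weights = [12.0, 15.0, 9.0, 11.0, 14.0]

med = statistics.median(rates)
mad = statistics.median(abs(r - med) for r in rates)  # robust spread measure

def treat(rate, weight, cutoff=5.0):
    """Replace the probability weight of an outlying report with 1
    (a benchmark adjustment would follow in the survey's processing)."""
    if mad > 0 and abs(rate - med) / mad > cutoff:
        return 1.0
    return weight

treated = [treat(r, w) for r, w in zip(rates, weights)]
print(treated)  # the 40.0 report's weight drops to 1.0
```

The more complex alternatives the abstract mentions would replace the detection step (influence functions, multivariate rules) or the treatment step (Winsorizing the reported value itself, or M-estimation) while keeping this overall flag-then-treat shape.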

  • Articles and reports: 11-522-X20010016259
    Description:

    This paper discusses technical aspects of designing and conducting surveys in detail. It is intended for an audience of survey methodologists.

    In cut-off sampling, part of the target population is deliberately excluded from selection. In business statistics, the frame and the sample are typically restricted to enterprises of at least a given size (e.g., a certain number of employees). The response burden is eliminated for the small enterprises, but assumptions must be made for the non-sampled part of the population. Cut-off sampling has merits, but it requires care in measuring size and methodological work with models.

    This paper presents some empirical Swedish results based on one survey and administrative data. Different error sources and their effects on the overall accuracy are discussed.

    Release date: 2002-09-12
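
A cut-off design can be sketched in two steps: restrict the frame to enterprises at or above a size threshold, then cover the excluded part with a model assumption. The enterprise figures, the threshold and the 5% cut-off share below are all invented for illustration.

```python
# Invented enterprise frame: (employees, revenue). Under cut-off sampling,
# only enterprises at or above the size threshold are eligible for selection.
frame = [(120, 50.0), (45, 18.0), (30, 11.0), (8, 2.0), (5, 1.5), (3, 0.8)]
THRESHOLD = 10  # employees; the actual cut-off is survey-specific

covered  = [rev for emp, rev in frame if emp >= THRESHOLD]
excluded = [rev for emp, rev in frame if emp < THRESHOLD]

# Model assumption for the non-sampled part: administrative data suggest the
# cut-off enterprises hold a known share of the total (taken here as 5%).
CUTOFF_SHARE = 0.05
estimate = sum(covered) / (1 - CUTOFF_SHARE)

print(round(sum(covered), 1), round(estimate, 2))  # 79.0 83.16
```

The accuracy of the final estimate hinges on how well the assumed share matches reality, which is why the paper stresses validating such model assumptions against administrative data.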

  • Articles and reports: 12-001-X198800214583
    Description:

    This note describes SQL, highlighting its strengths and weaknesses.

    Release date: 1988-12-15
