Statistical techniques


Results

All (33) (0 to 10 of 33 results)

  • Stats in brief: 89-20-00062022003
    Description:

    By the end of this video you will understand what confidence intervals are, why we use them, and what factors have an impact on them.

    Release date: 2022-05-24
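The factors that affect a confidence interval can be illustrated with a minimal sketch. It assumes a normal-approximation interval for a sample mean, which is one common textbook construction and not necessarily the one presented in the video. The function name `mean_confidence_interval` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for a sample mean.

    Assumption: the sample is large enough for the normal approximation;
    this is an illustrative sketch only.
    """
    # Critical value: a higher confidence level gives a larger z, so a wider interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = mean(sample)
    # More variability widens the interval; a larger sample narrows it.
    half_width = z * stdev(sample) / sqrt(len(sample))
    return centre - half_width, centre + half_width


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
low95, high95 = mean_confidence_interval(sample, level=0.95)
low99, high99 = mean_confidence_interval(sample, level=0.99)
```

In this sketch the interval widens when the confidence level or the sample variability increases, and narrows as the sample size grows.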

  • Articles and reports: 11-522-X202100100010
    Description:

    As part of processing for the 2021 Canadian Census, the write-in responses to 31 census questions must be coded. Up to and including 2016, this was a three-stage process, with an "interactive (human) coding" step as the second stage. This human coding step is both lengthy and expensive, spanning many months and requiring the hiring and training of a large number of temporary employees. With this in mind, for 2021, this stage was either augmented with or replaced entirely by machine learning models using the "fastText" algorithm. This presentation will discuss the implementation of this algorithm and the challenges and decisions taken along the way.

    Key Words: Natural Language Processing, Machine Learning, fastText, Coding

    Release date: 2021-11-05

  • Articles and reports: 11-522-X202100100012
    Description:

    The modernization of price statistics by National Statistical Offices (NSO) such as Statistics Canada focuses on the adoption of alternative data sources that include the near-universe of all products sold in the country, a scale that requires machine learning classification of the data. The process of evaluating classifiers to select appropriate ones for production, as well as monitoring classifiers once in production, needs to be based on robust metrics to measure misclassification. As commonly utilized metrics, such as the Fβ-score, may not take into account key aspects applicable to price statistics in all cases, such as unequal importance of categories, a careful consideration of the metric space is necessary to select appropriate methods to evaluate classifiers. This working paper provides insight on the metric space applicable to price statistics and proposes an operational framework to evaluate and monitor classifiers, focusing specifically on the needs of the Canadian Consumer Price Index and demonstrating discussed metrics using a publicly available dataset.

    Key Words: Consumer price index; supervised classification; evaluation metrics; taxonomy

    Release date: 2021-11-05
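To make the point about the Fβ-score and unequal category importance concrete, here is a minimal sketch. It is not the framework proposed in the working paper. The per-category weights (for example, something like expenditure shares) and the counts are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives and false negatives.

    beta > 1 weights recall more heavily; beta < 1 weights precision more heavily.
    """
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def weighted_f_beta(counts, weights, beta=1.0):
    """Average of per-category F-beta scores using category importance weights.

    counts:  {category: (tp, fp, fn)}
    weights: {category: importance}, an assumed input, not defined in the source
    """
    total_weight = sum(weights[c] for c in counts)
    return sum(weights[c] * f_beta(*counts[c], beta=beta) for c in counts) / total_weight


# Hypothetical confusion counts for two product categories.
counts = {"bread": (8, 2, 2), "coffee": (6, 2, 4)}
weights = {"bread": 3.0, "coffee": 1.0}

unweighted = sum(f_beta(*counts[c]) for c in counts) / len(counts)
weighted = weighted_f_beta(counts, weights)
```

With these made-up numbers the weighted score sits closer to the score of the heavily weighted category, which is the kind of difference an unweighted average would hide.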

  • Articles and reports: 11-522-X202100100027
    Description:

    Privacy concerns are a barrier to applying remote analytics, including machine learning, on sensitive data via the cloud. In this work, we use a leveled fully homomorphic encryption scheme to train an end-to-end supervised machine learning algorithm to classify texts while protecting the privacy of the input data points. We train our single-layer neural network on a large simulated dataset, providing a practical solution to a real-world multi-class text classification task. To improve both accuracy and training time, we train an ensemble of such classifiers in parallel using ciphertext packing.

    Key Words: Privacy Preservation, Machine Learning, Encryption

    Release date: 2021-10-29

  • Articles and reports: 12-001-X202000200005
    Description:

    In surveys, text answers from open-ended questions are important because they allow respondents to provide more information without constraints. When classifying open-ended questions automatically using supervised learning, often the accuracy is not high enough. Alternatively, a semi-automated classification strategy can be considered: answers in the easy-to-classify group are classified automatically, answers in the hard-to-classify group are classified manually. This paper presents a semi-automated classification method for multi-label open-ended questions where text answers may be associated with multiple classes simultaneously. The proposed method effectively combines multiple probabilistic classifier chains while avoiding prohibitive computational costs. The performance evaluation on three different data sets demonstrates the effectiveness of the proposed method.

    Release date: 2020-12-15
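The semi-automated strategy described above (classify easy answers automatically, send hard ones to manual coding) can be sketched as a confidence threshold. This is a deliberately simplified single-label illustration and not the paper's multi-label method based on probabilistic classifier chains. The function name `route_answers`, the threshold value, and the example probabilities are all hypothetical.

```python
def route_answers(probabilities, threshold=0.9):
    """Split answers into automatically classified and manually reviewed groups.

    probabilities: {answer_id: {label: predicted probability}}
    Assumption: an answer is "easy" when its top predicted probability
    reaches the threshold; the source does not specify this rule.
    """
    automatic = {}  # answer_id -> label assigned by the classifier
    manual = []     # answer_ids routed to human coders
    for answer_id, class_probs in probabilities.items():
        label, top_probability = max(class_probs.items(), key=lambda item: item[1])
        if top_probability >= threshold:
            automatic[answer_id] = label
        else:
            manual.append(answer_id)
    return automatic, manual


# Hypothetical classifier output for three open-ended answers.
predicted = {
    "a1": {"health": 0.97, "work": 0.03},
    "a2": {"health": 0.55, "work": 0.45},
    "a3": {"health": 0.08, "work": 0.92},
}
automatic, manual = route_answers(predicted, threshold=0.9)
```

Raising the threshold sends more answers to manual coding and lowers the risk of automatic misclassification; lowering it does the reverse.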

  • Articles and reports: 11-633-X2019003
    Description:

    This report provides an overview of the definitions and competency frameworks of data literacy, as well as the assessment tools used to measure it. These are based on the existing literature and current practices around the world. Data literacy, or the ability to derive meaningful information from data, is a relatively new concept. However, it is gaining increasing recognition as a vital skillset in the information age. Existing approaches to measuring data literacy—from self-assessment tools to objective measures, and from individual to organizational assessments—are discussed in this report to inform the development of an assessment tool for data literacy in the Canadian public service.

    Release date: 2019-08-14

  • Articles and reports: 12-001-X201900200003
    Description:

    Merging available sources of information is becoming increasingly important for improving estimates of population characteristics in a variety of fields. In the presence of several independent probability samples from a finite population, we investigate options for a combined estimator of the population total, based on either a linear combination of the separate estimators or on the combined sample approach. A linear combination estimator based on estimated variances can be biased, as the separate estimators of the population total can be highly correlated with their respective variance estimators. We illustrate the possibility of using the combined sample to estimate the variances of the separate estimators, which results in general pooled variance estimators. These pooled variance estimators use all available information and have the potential to significantly reduce the bias of a linear combination of separate estimators.

    Release date: 2019-06-27
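A linear combination of separate estimators can be sketched with inverse-variance weights, a standard textbook choice. This is an illustration under stated assumptions, not the paper's pooled variance estimators: the variances are treated here as given inputs, and the function name and numbers are hypothetical.

```python
def combine_estimates(estimates, variances):
    """Inverse-variance weighted linear combination of independent estimators.

    Assumption: the estimators are independent and the variances are known.
    When the variances are themselves estimated from the same samples, the
    combination can be biased, which is the issue the description raises.
    """
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    weights = [w / total for w in inverse]  # weights sum to 1
    combined = sum(w * e for w, e in zip(weights, estimates))
    combined_variance = 1.0 / total  # holds only under the assumptions above
    return combined, combined_variance, weights


# Hypothetical population-total estimates from two independent samples.
combined, combined_variance, weights = combine_estimates(
    estimates=[100.0, 110.0],
    variances=[4.0, 1.0],
)
```

The more precise estimator receives the larger weight, so the combined estimate lies closer to it than a simple average would.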

  • Articles and reports: 12-001-X201900200008
    Description:

    High nonresponse occurs in many sample surveys today, including important surveys carried out by government statistical agencies. Adaptive data collection can be advantageous in those conditions: lower nonresponse bias in survey estimates can be gained, up to a point, by producing a well-balanced set of respondents. Auxiliary variables serve a twofold purpose. Used in the estimation phase, through calibrated adjustment weighting, they reduce, but do not entirely remove, the bias. In the preceding adaptive data collection phase, auxiliary variables also play a major role: they are instrumental in reducing the imbalance in the ultimate set of respondents. For such combined use of auxiliary variables, the article studies the deviation of the calibrated estimate from the unbiased estimate (under full response). We show that this deviation is a sum of two components. The reducible component can be decreased through adaptive data collection, all the way to zero if a perfectly balanced response is realized with respect to a chosen auxiliary vector. By contrast, the resisting component changes little or not at all with a better-balanced response; it represents a part of the deviation that adaptive design does not remove. The relative size of the former component is an indicator of the potential payoff from an adaptive survey design.

    Release date: 2019-06-27

  • Articles and reports: 11-633-X2019002
    Description:

    Survey data collection through mobile devices, such as tablets and smartphones, is underway in Canada. However, little is known about the representativeness of the data collected through these devices. In March 2017, Statistics Canada commissioned survey data collection through the Carrot Rewards Application and included 11 questions on the Carrot Rewards Mobile App Survey (Carrot) drawn from the 2017 Canadian Community Health Survey (CCHS).

    Release date: 2019-06-04

  • Articles and reports: 11-633-X2018016
    Description:

    Record linkage has been identified as a potential mechanism to add treatment information to the Canadian Cancer Registry (CCR). The purpose of the Canadian Cancer Treatment Linkage Project (CCTLP) pilot is to add surgical treatment data to the CCR. The Discharge Abstract Database (DAD) and the National Ambulatory Care Reporting System (NACRS) were linked to the CCR, and surgical treatment data were extracted. The project was funded through the Cancer Data Development Initiative (CDDI) of the Canadian Partnership Against Cancer (CPAC).

    The CCTLP was developed as a feasibility study in which patient records from the CCR would be linked to surgical treatment records in the DAD and NACRS databases, maintained by the Canadian Institute for Health Information. The target cohort to whom surgical treatment data would be linked was patients aged 19 or older registered on the CCR (2010 through 2012). The linkage was completed in Statistics Canada’s Social Data Linkage Environment (SDLE).

    Release date: 2018-03-27
