Statistical techniques

Results

All (11) (0 to 10 of 11 results)

  • Articles and reports: 12-001-X202100200002
    Description:

    When linking massive data sets, blocking is used to select a manageable subset of record pairs at the expense of losing a few matched pairs. This loss is an important component of the overall linkage error, because blocking decisions are made early in the linkage process, with no way to revise them in subsequent steps. Yet, measuring this contribution is still a major challenge because of the need to model all the pairs in the Cartesian product of the sources, not just those satisfying the blocking criteria. Unfortunately, previous error models are of little use because they typically do not meet this requirement. This paper addresses the issue with a new finite mixture model, which dispenses with clerical reviews, training data, or the assumption that the linkage variables are conditionally independent. The model applies when a standard blocking procedure is used to link a file to a register or a census with complete coverage, where both sources are free of duplicate records.

    Release date: 2022-01-06

  • Articles and reports: 11-522-X202100100012
    Description:

    The modernization of price statistics by National Statistical Offices (NSOs) such as Statistics Canada focuses on the adoption of alternative data sources that include the near-universe of all products sold in the country, a scale that requires machine learning classification of the data. The process of evaluating classifiers to select appropriate ones for production, as well as monitoring classifiers once in production, needs to be based on robust metrics to measure misclassification. Because commonly used metrics such as the Fβ-score may not always take into account key aspects of price statistics, such as the unequal importance of categories, a careful consideration of the metric space is necessary to select appropriate methods to evaluate classifiers. This working paper provides insight on the metric space applicable to price statistics and proposes an operational framework to evaluate and monitor classifiers, focusing specifically on the needs of the Canadian Consumer Price Index and demonstrating the discussed metrics using a publicly available dataset.

    Key Words: Consumer price index; supervised classification; evaluation metrics; taxonomy

    Release date: 2021-11-05

  • Articles and reports: 11-522-X202100100006
    Description:

    In the context of its "admin-first" paradigm, Statistics Canada is prioritizing the use of non-survey sources to produce official statistics. This paradigm critically relies on non-survey sources that may have nearly perfect coverage of some target populations, including administrative files or big data sources. Yet, this coverage must be measured, e.g., by applying the capture-recapture method, in which these sources are compared to other sources with good coverage of the same populations, including a census. However, this is a challenging exercise in the presence of linkage errors, which arise inevitably when the linkage is based on quasi-identifiers, as is typically the case. To address the issue, a new methodology is described where the capture-recapture method is enhanced with a new error model that is based on the number of links adjacent to a given record. It is applied in an experiment with public census data.

    Key Words: dual system estimation, data matching, record linkage, quality, data integration, big data.

    Release date: 2021-10-22

  • Articles and reports: 12-001-X202000200005
    Description:

    In surveys, text answers from open-ended questions are important because they allow respondents to provide more information without constraints. When open-ended questions are classified automatically using supervised learning, the accuracy is often not high enough. Alternatively, a semi-automated classification strategy can be considered: answers in the easy-to-classify group are classified automatically, while answers in the hard-to-classify group are classified manually. This paper presents a semi-automated classification method for multi-label open-ended questions where text answers may be associated with multiple classes simultaneously. The proposed method effectively combines multiple probabilistic classifier chains while avoiding prohibitive computational costs. The performance evaluation on three different data sets demonstrates the effectiveness of the proposed method.

    Release date: 2020-12-15

  • Articles and reports: 12-001-X201900200003
    Description:

    Merging available sources of information is becoming increasingly important for improving estimates of population characteristics in a variety of fields. In the presence of several independent probability samples from a finite population, we investigate options for a combined estimator of the population total, based either on a linear combination of the separate estimators or on the combined sample approach. A linear combination estimator based on estimated variances can be biased, as the separate estimators of the population total can be highly correlated with their respective variance estimators. We illustrate the possibility of using the combined sample to estimate the variances of the separate estimators, which results in general pooled variance estimators. These pooled variance estimators use all available information and have the potential to significantly reduce the bias of a linear combination of separate estimators.

    Release date: 2019-06-27

  • Articles and reports: 11-633-X2018017
    Description:

    Understanding women’s business ownership and the performance of women-owned enterprises is important for designing policies to promote gender equality in leadership, economic empowerment of women and inclusive growth. However, evidence on business ownership by gender remains scarce because of the lack of comprehensive data. The study, Women-owned Enterprises in Canada (Grekou, Li and Liu, 2018), fills the data gap by identifying business ownership by gender using a newly developed administrative dataset—the Canadian Employer–Employee Dynamics Database (CEEDD). The dataset contains business owner information for all unincorporated enterprises and private corporations in Canada. This paper discusses the methodology adopted to establish the gender structure of business ownership. It then presents estimates of business ownership by gender (men or women majority ownership and equal ownership). Finally, it analyzes the sensitivity of these estimates and compares them with those calculated using other data sources.

    Release date: 2018-09-24

  • Articles and reports: 11-522-X200600110402
    Description:

    This paper explains how to append census area-level summary data to survey or administrative data. It uses examples from survey datasets present in Statistics Canada Research Data Centres, but the methods also apply to external datasets, including administrative datasets. Four examples illustrate common situations faced by researchers: (1) when the survey (or administrative) and census data both contain the same level of geographic identifiers, coded to the same year standard ("vintage") of census geography (for example, if both have 2001 DA); (2) when the two files contain geographic identifiers of the same vintage, but at different levels of census geography (for example, 1996 EA in the survey, but 1996 CT in the census data); (3) when the two files contain data coded to different vintages of census geography (such as 1996 EA for the survey, but 2001 DA for the census); (4) when the survey data are lacking in geographic identifiers, and those identifiers must first be generated from postal codes present on the file. The examples are shown using SAS syntax, but the principles apply to other programming languages or statistical packages.

    Release date: 2008-03-17

  • Articles and reports: 12-002-X20060019254
    Description:

    This article explains how to append census area-level summary data to survey or administrative data. It uses examples from datasets present in Statistics Canada Research Data Centres, but the methods also apply to external datasets. Four examples illustrate common situations faced by researchers: (1) when the survey (or administrative) and census data both contain the same level of geographic identifiers, coded to the same year standard ("vintage") of census geography; (2) when the two files contain geographic identifiers of the same vintage, but at different levels of census geography; (3) when the two files contain data coded to different vintages of census geography; (4) when the survey data are lacking in geographic identifiers, and those identifiers must first be generated from postal codes present on the file. The examples are shown using SAS syntax, but the principles apply to other programming languages or statistical packages.

    Release date: 2006-07-18

  • Journals and periodicals: 84F0013X
    Geography: Canada, Province or territory
    Description:

    This study was initiated to test the validity of probabilistic linkage methods used at Statistics Canada. It compared the results of data linkages on infant deaths in Canada with infant death data from Nova Scotia and Alberta. It also compared the availability of fetal deaths on the national and provincial files.

    Release date: 1999-10-08

  • Articles and reports: 75F0002M1996011
    Description:

    This paper looks at the family data from the Survey of Labour and Income Dynamics (SLID). It also provides an explanation of the approach used in the SLID to convey changes in the family as well as examples to indicate how family data can be analysed longitudinally.

    Release date: 1997-12-31
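
The blocking trade-off described in entry 12-001-X202100200002 can be illustrated with a small sketch. It is not the paper's finite mixture model; it only counts how many record pairs blocking keeps and how many true matches it loses. The records, the `true_id` field and the exact-surname blocking rule are all made-up assumptions.

```python
from itertools import product

# Toy records: (record id, surname, birth year, true person id).
# All values are invented for illustration only.
file_a = [
    ("a1", "smith", 1980, 1),
    ("a2", "smyth", 1975, 2),  # surname spelled differently than in the register
    ("a3", "jones", 1990, 3),
    ("a4", "brown", 1962, 4),
]
register = [
    ("r1", "smith", 1980, 1),
    ("r2", "smith", 1975, 2),
    ("r3", "jones", 1990, 3),
    ("r4", "brown", 1962, 4),
    ("r5", "white", 1955, 5),
]

# Cartesian product of the two sources: every possible record pair.
all_pairs = list(product(file_a, register))

# Hypothetical blocking rule: keep a pair only if the surnames agree exactly.
blocked_pairs = [(a, r) for a, r in all_pairs if a[1] == r[1]]

def is_match(pair):
    """A pair is a true match when both records refer to the same person."""
    a, r = pair
    return a[3] == r[3]

total_matches = sum(is_match(p) for p in all_pairs)
kept_matches = sum(is_match(p) for p in blocked_pairs)
lost_matches = total_matches - kept_matches

print(f"pairs before blocking: {len(all_pairs)}")
print(f"pairs after blocking:  {len(blocked_pairs)}")
print(f"true matches lost:     {lost_matches} of {total_matches}")
```

In real linkages the true match status is unknown, which is why the loss has to be estimated with a model rather than counted as it is here.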
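
Entry 11-522-X202100100012 notes that the Fβ-score may ignore the unequal importance of categories. The sketch below computes the standard per-category Fβ-score and compares a plain average with an importance-weighted average. The weighting scheme, the category names, the counts and the weights are assumptions made for illustration; they are not the framework proposed in the paper.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Standard F-beta score from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator

# Hypothetical per-category confusion counts: (tp, fp, fn).
counts = {
    "food": (8, 2, 2),
    "clothing": (1, 1, 3),
}

# Hypothetical importance weights (they sum to 1), for example budget shares.
weights = {"food": 0.9, "clothing": 0.1}

scores = {cat: f_beta(*c, beta=1.0) for cat, c in counts.items()}

# Unweighted (macro) average treats every category as equally important.
macro = sum(scores.values()) / len(scores)

# Weighted average lets important categories dominate the summary metric.
weighted = sum(weights[cat] * scores[cat] for cat in scores)

print(scores)
print(f"macro average:    {macro:.4f}")
print(f"weighted average: {weighted:.4f}")
```

With these made-up numbers the weighted average is higher than the macro average, because the poorly classified category carries little weight.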
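
Entry 11-522-X202100100006 explains that linkage errors complicate coverage measurement by capture-recapture. A minimal sketch of why, using the classical two-source (Lincoln-Petersen) estimator rather than the paper's enhanced error model: when some true links are missed, the number of linked records shrinks and the population estimate is inflated. All counts below are invented.

```python
def lincoln_petersen(n1, n2, m):
    """Classical dual-system estimate of population size.

    n1: records in the first source
    n2: records in the second source
    m:  records found in both sources (linked records)
    """
    if m <= 0:
        raise ValueError("at least one linked record is required")
    return n1 * n2 / m

# Hypothetical counts for an administrative file and a census.
n_admin, n_census = 900, 800
true_links = 720

# Estimate with error-free linkage.
estimate_clean = lincoln_petersen(n_admin, n_census, true_links)

# Suppose 5% of the true links are missed (false negatives) in the linkage.
observed_links = 684
estimate_with_errors = lincoln_petersen(n_admin, n_census, observed_links)

print(f"error-free linkage: {estimate_clean:.1f}")
print(f"with missed links:  {estimate_with_errors:.1f}")
```

False links would push the estimate in the opposite direction, so both error types need to be accounted for.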
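
The semi-automated strategy in entry 12-001-X202000200005 splits answers into an easy-to-classify group and a hard-to-classify group. One simple way to sketch that split, assuming each answer already has a predicted label set and a confidence score from some classifier, is a threshold rule. The threshold value, the answer identifiers and the scores are hypothetical, and the paper's combination of probabilistic classifier chains is not reproduced here.

```python
def route_answers(scored_answers, threshold=0.8):
    """Split answers into automatically coded and manually coded groups.

    scored_answers: iterable of (answer_id, predicted_labels, confidence)
    threshold:      minimum confidence for accepting the automatic labels
    """
    automatic, manual = [], []
    for answer_id, labels, confidence in scored_answers:
        if confidence >= threshold:
            automatic.append((answer_id, labels))  # accept predicted label set
        else:
            manual.append(answer_id)               # send to a human coder
    return automatic, manual

# Hypothetical multi-label predictions with confidence scores.
scored = [
    ("ans1", {"price", "quality"}, 0.93),
    ("ans2", {"service"}, 0.55),
    ("ans3", {"price"}, 0.81),
    ("ans4", {"quality", "delivery"}, 0.40),
]

automatic, manual = route_answers(scored, threshold=0.8)
print("classified automatically:", [a for a, _ in automatic])
print("sent for manual coding:  ", manual)
```

Raising the threshold sends more answers to manual coding in exchange for higher accuracy among the automatically coded ones.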
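
Entry 12-001-X201900200003 discusses a linear combination of separate estimators of a population total. A common form of such a combination, shown here as a sketch, weights each estimate by the inverse of its variance. The variances are treated as known, which sidesteps the bias the abstract attributes to using estimated variances; the numbers are made up and the pooled variance estimators of the paper are not implemented.

```python
def combine_estimates(estimates, variances):
    """Inverse-variance weighted combination of independent estimates.

    Returns the combined estimate, its variance, and the weights used.
    """
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    weights = [w / total for w in inverse]
    combined = sum(w * e for w, e in zip(weights, estimates))
    combined_variance = 1.0 / total  # valid for independent estimators
    return combined, combined_variance, weights

# Hypothetical totals from two independent samples and their variances.
totals = [1000.0, 1100.0]
variances = [100.0, 400.0]

combined, combined_variance, weights = combine_estimates(totals, variances)
print(f"weights:           {weights}")
print(f"combined estimate: {combined:.1f}")
print(f"combined variance: {combined_variance:.1f}")
```

The more precise estimate receives the larger weight, and the combined variance is smaller than either separate variance.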
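
Entry 11-522-X200600110402 describes appending census area-level summary data to survey records, including the case where the two files use different levels of census geography. The paper's examples use SAS; the sketch below shows the same idea in Python with plain dictionaries. The identifiers, the EA-to-CT correspondence table and the `median_income` variable are all hypothetical.

```python
# Survey records carrying an enumeration area (EA) code.
survey = [
    {"resp_id": 1, "ea": "EA001"},
    {"resp_id": 2, "ea": "EA002"},
    {"resp_id": 3, "ea": "EA999"},  # EA code missing from the correspondence table
]

# Hypothetical correspondence table from EA to census tract (CT).
ea_to_ct = {"EA001": "CT10", "EA002": "CT20"}

# Hypothetical census summary data at the CT level.
census_ct = {
    "CT10": {"median_income": 52000},
    "CT20": {"median_income": 61000},
}

def append_area_data(records, crosswalk, area_summary):
    """Attach area-level summary variables to each record via a crosswalk."""
    linked = []
    for rec in records:
        ct = crosswalk.get(rec["ea"])        # None when the EA is not found
        summary = area_summary.get(ct, {})   # empty when no CT data exist
        linked.append({**rec, "ct": ct, **summary})
    return linked

linked = append_area_data(survey, ea_to_ct, census_ct)
for row in linked:
    print(row)
```

Records that cannot be mapped keep their survey fields but receive no area-level variables, so unmatched cases stay visible instead of being dropped.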