Data science resources

Join the Data Science Network for the Federal Public Service

Calling all data science enthusiasts! Subscribe to the Data Science Network for the Federal Public Service newsletter to discover the world of data science and find opportunities to collaborate with peers.

Learn about data science

Learn about digital government

Training and tools

Training

Tools

Communities of practice

Data science projects

Data science plays an important role at Statistics Canada. All across the agency, new data science methods are being used to make our projects more efficient and provide better data insights to Canadians.

Project categories

Contact the Centre for Artificial Intelligence Research and Excellence (CAIRE) for more information on Statistics Canada's data science projects.

Natural language processing

Event Detection and Sentiment Indicators

Statistics Canada is developing a tool to detect specific economic events by analyzing millions of news articles. The tool uses machine learning algorithms to research and summarize information from the articles and organize the data into an informative dashboard. Time that was once spent on research can now be spent analyzing the reasons for economic changes.

The agency is also exploring the development of sentiment indicators to measure economic tendencies and their connection with key economic variables. Based on positive and negative interpretations of economic-related news articles, these indicators could allow subject matter experts to gain better insights into economic trends by industry, and support the publication of near real-time economic indicators.

Retail Scanner Data

Statistics Canada publishes the total amount of products sold, as classified by the North American Product Classification System (NAPCS). Large scanner databases, with millions of records, are currently available from major retailers. Previously, products were assigned to NAPCS according to their description and other indicators, using dictionary-based coding in combination with manual coding when required. Statistics Canada now uses machine learning to classify all of the product descriptions in the scanner data to the NAPCS and then obtains aggregate sales for each area. This approach has resulted in a higher degree of automation, as well as in accurate, detailed retail data and a reduced response burden for major retailers.

Survey of Sexual Misconduct in the Canadian Armed Forces Comment Classification

Data scientists at Statistics Canada created a machine learning model to automatically classify the electronic comments from respondents of the Survey of Sexual Misconduct in the Canadian Armed Forces (SSMCAF). The SSMCAF required automated classification of comments into five categories: personal story, negative, positive, advice for content, and other. The machine learning model coded 6,000 comments from the first 2018 survey cycle with 89% accuracy for French and English comments. This approach will be expanded to other surveys at Statistics Canada.

Census 2021 Comments Classification

Statistics Canada has developed a machine learning algorithm to classify 1.8 million French and English respondent comments from the 2021 Census. This algorithm quickly and objectively classifies comments into different classes. The model is trained on comments from the 2016 Census and the 2019 Census test. Respondent feedback is used to support decision making regarding content determination for the next census and to monitor factors such as respondent burden. Visit 2021 Census Comment Classification for more information about this project.

Canadian Coroner and Medical Examiner Database (CCMED) Dynamic Topic Modelling

Statistics Canada has designed and deployed a dynamic topic modelling system. This system uses data from the Canadian Coroner and Medical Examiner Database to detect emerging narratives on causes of death. The objective is to provide analysts with patterns of death over time. For more information, please visit Topic Modelling and Dynamic Topic Modelling: A Technical Review.

Canadian Export Reporting System Text Classification

The Canada Border Services Agency (CBSA) and Statistics Canada recently developed the Canadian Export Reporting System (CERS), a new web-based reporting tool for Canadian exporters to non-US countries. CERS requires exporters to self-code the Harmonized System (HS) code of their goods and to provide an additional text description for the CBSA. The Data Science Division, in partnership with the International Accounts and Trade Division (IATD), developed a FastText machine learning model that classifies the additional text descriptions of the exported commodities to HS codes, so that IATD can use the predictions to validate the self-coded HS codes provided by the exporters. The motivation for adding this validation is that analysis of the data from the previous systems revealed inconsistencies between the product description and the code chosen by the exporter. With the move to CERS, electronic reporting is now mandatory and may result in an increase of cases with such inconsistencies, which is why an automated solution for review is being developed.
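The validation step described above, comparing a predicted code with the exporter's self-coded one, could be sketched roughly as follows. This is a minimal illustration only: the keyword lookup is a hypothetical stand-in for the FastText model, and the four-digit headings, keyword lists and records are invented for the example.

```python
# Hypothetical keyword lists standing in for a trained text classifier.
# The HS headings are used purely as illustrative labels.
KEYWORDS = {
    "1001": {"wheat", "durum"},
    "8703": {"car", "sedan", "vehicle"},
}


def predict_hs(description):
    """Return the heading whose keywords overlap the description most, or None."""
    tokens = set(description.lower().split())
    best_code, best_hits = None, 0
    for code, words in KEYWORDS.items():
        hits = len(tokens & words)
        if hits > best_hits:
            best_code, best_hits = code, hits
    return best_code


def flag_inconsistencies(records):
    """Return records whose self-coded heading disagrees with the prediction."""
    flagged = []
    for description, self_code in records:
        predicted = predict_hs(description)
        # Records with no prediction are left alone rather than flagged.
        if predicted is not None and predicted != self_code:
            flagged.append((description, self_code, predicted))
    return flagged


# Made-up records: (text description, self-coded heading).
records = [
    ("durum wheat in bulk", "1001"),
    ("used passenger car sedan", "1001"),
    ("assorted goods", "8703"),
]
flagged = flag_inconsistencies(records)
```

Only the second record is flagged here, because its description points to a different heading than the one reported. In practice a review threshold on model confidence would likely be needed; that choice is not described in the source.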

Image classification

In-Season Crop Classification

Monitoring the production of farms in Canada is an important but costly undertaking. Surveys and in-person inspections require a large amount of resources, and the current approach to predicting crop yields is time-consuming. For these reasons, Statistics Canada is modernizing crop classification using an image classification approach. An automated pipeline is used to download and process freely available Landsat-8 satellite imagery throughout the crop season.

Crop types are predicted using satellite imagery and the application of neural networks. The new model estimates are then used to update a database, allowing end users to acquire the most up-to-date estimates throughout the crop season. Initial results show that this approach is much faster and will reduce the survey response burden for farm owners, especially during the busy times of the year.

Geo-Spatial Construction Starts Using Satellite Images

Canada Mortgage and Housing Corporation tracks the starts and completions of residential building construction projects across Canada, and results are used by Statistics Canada to calibrate estimates for its Investment in Building Construction program. Statistics Canada has been employing various data science methods to detect construction starts from satellite images, such as using image augmentation to diversify and increase the data set. These methods enabled data scientists to detect the area of the building in the pre-foundation and foundation phases. The pre-foundation process consists of excavation and the creation of footings and concrete slabs to support the foundation walls. The foundation is part of a structural system that supports and anchors the superstructure of a building. AI model building and evaluation required the processing of more than 1,400 km² of imagery at 50 cm resolution over many months, for which a highly scalable and efficient processing pipeline was created. The developed artificial intelligence algorithms might eventually lead to more accurate and timely data, while helping to eliminate existing data gaps for the non-residential sector and small or remote communities excluded from the current survey.
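Image augmentation, mentioned above as a way to diversify and increase the data set, can be illustrated with a few geometric transforms on a tiny grid of pixel values. This is a sketch under assumptions: the source does not say which transforms were used, and the function names and the 2×2 "tile" are illustrative only.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the order of the rows (copying each row)."""
    return [row[:] for row in image[::-1]]


def rotate_90(image):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the original tile plus three transformed copies."""
    return [image, flip_horizontal(image), flip_vertical(image), rotate_90(image)]


# A made-up 2x2 tile of pixel values.
tile = [[1, 2],
        [3, 4]]
variants = augment(tile)
```

Each labelled tile yields four training examples instead of one. When the target is a building footprint, the same transform would have to be applied to the label mask so that image and label stay aligned.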

Agriculture Greenhouse Detection Using Aerial Images

The greenhouse project has been using Earth Observation data to identify greenhouses and measure their total area in Canada. It also includes a proof of concept to determine whether greenhouses can be classified by the produce grown inside them and by their type (glass or plastic cover). In an effort to produce more timely estimates and reduce the need for survey respondents, data scientists at Statistics Canada are working to automate the identification process using machine learning, administrative data sources and other technologies, such as satellite imagery and high-resolution aerial imagery.

PDF extraction

Extraction of Economic Variables from Financial Reports

Statistics Canada has been applying data science solutions to extract information from PDFs and other documents in a more timely and efficient manner. For example, Statistics Canada has been experimenting with the historical dataset for SEDAR, a system used by publicly traded Canadian companies to file securities documents with various Canadian securities commissions.

Statistics Canada's data scientists developed a state-of-the-art machine learning pipeline that correctly identifies and extracts key financial variables (e.g. total assets) from the correct table (e.g. balance sheet) in a company's annual financial statement. The algorithm used for table extraction, called SLICE (Spatial Layout-Based Information and Content Extraction), was developed within Statistics Canada and made open source under the MIT license. SLICE is a unique computer vision algorithm that simultaneously uses textual, visual and layout information to segment pages into a tabular structure. The pipeline therefore turns a large number of unstructured public documents from SEDAR into structured datasets, allowing the automation of information extraction related to Canadian companies. This significantly reduces the manual hours spent identifying and capturing the required information and reduces data redundancy within the organization by providing a one-point solution to access information.

Public Sector Statistics Division Scanned PDF Extraction

Public Sector Statistics Division (PSSD) at Statistics Canada receives financial statements from provincial governments and their respective municipalities on a quarterly and annual basis. These statements are in text-based and scanned PDF format, and store valuable information in tables. Each row of the table contains numerical values which must be manually extracted and stored in a database for further analysis, but this manual process is time-consuming and subject to human error. Data scientists at Statistics Canada developed a proof-of-concept that involves extracting financial data from reported financial statements using an in-house machine learning algorithm and displaying them in a tabular format that can be edited by the analysts. Additionally, the data is auto-coded and records of previous and current year numerical values are provided. Once the project transitions to production, it will reduce data redundancy within the organization by providing a one-point solution to access information, as well as save manual work hours identifying and capturing required information by analysts in the PSSD.

Predictive analytics

Nowcasting of Economic Indicators

Many initiatives at Statistics Canada work towards near real-time estimates and the production of advanced indicators for many of the agency's key data series. In the Investment in Building Construction Program, building permit values are a key series for which an early indicator could be produced via nowcasting. To facilitate the effort, an analytical cloud environment was created which allows analysts to leverage timely external data and advanced time series models. An extensive time series database was created, containing economic time series (from Statistics Canada programs), external open data, temperature sensor data and stock market data. This environment may pave the way towards a generalized Nowcasting System at Statistics Canada. Exploratory analysis was conducted to apply nowcasting models, including ARIMA-X, PROPHET and the machine learning algorithm XGBoost, to several economic indicators, including monthly building permit values. It was found that ARIMA-X and PROPHET performed similarly in terms of mean absolute percentage error and mean directional accuracy, while XGBoost with external open data did not perform as well.
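The two comparison metrics named above can be computed as in the following sketch. The definitions are the commonly used ones (the source does not spell out its exact formulas), and the series values are made up for illustration.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero.
    """
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def sign(x):
    """Return 1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def mean_directional_accuracy(actual, predicted):
    """Share of periods where the predicted change from the last observed
    value has the same sign as the actual change."""
    hits = 0
    for t in range(1, len(actual)):
        actual_change = actual[t] - actual[t - 1]
        predicted_change = predicted[t] - actual[t - 1]
        if sign(actual_change) == sign(predicted_change):
            hits += 1
    return hits / (len(actual) - 1)


# Made-up monthly values and nowcasts.
actual = [100, 110, 105, 120]
predicted = [98, 108, 112, 118]

error_pct = mape(actual, predicted)
direction_share = mean_directional_accuracy(actual, predicted)
```

With these numbers the MAPE is about 3.04% and the direction is right in two of three periods. The two metrics answer different questions: a model can be close in level yet miss turning points, which is why both are reported.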

Crop Yield Predictions

Statistics Canada recently completed a research project for the Field Crop Reporting Series (FCRS) on the use of machine learning, specifically supervised regression techniques, for early-season crop yield prediction. The objective was to investigate whether this could be used to improve the precision of the existing crop yield prediction method, while also reducing the survey response burden for busy farm operators. The main contribution of the research project was the adaptation of rolling window forward validation (RWFV) as the validation protocol. RWFV is a special case of forward validation, a family of validation protocols designed to prevent temporal information leakage for supervised learning based on time series data. Our adaptation of RWFV enabled a customized validation protocol that realistically reflects the statistical production context of the FCRS. Visit Use of Machine Learning for Crop Yield Prediction for more details on the technical side of this project.
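The idea behind rolling window forward validation can be sketched as a generator of train/validation splits in which the training window always ends before the validation period. This is an illustration under assumptions: the window length, the yearly granularity and the function name are hypothetical, and the FCRS adaptation is not reproduced here.

```python
def rolling_window_splits(periods, window):
    """Yield (training_periods, validation_period) pairs in time order.

    Each validation period is preceded by exactly `window` training
    periods, so no information from the future leaks into training.
    """
    ordered = sorted(periods)
    for i in range(window, len(ordered)):
        yield ordered[i - window:i], ordered[i]


# Made-up range of crop years and an assumed window of three years.
years = list(range(2010, 2016))
splits = list(rolling_window_splits(years, window=3))
# splits[0] is ([2010, 2011, 2012], 2013): train on three years, validate on the next.
```

Unlike ordinary k-fold cross-validation, no split ever trains on data from a later period than the one it is evaluated on, which mirrors how a model would be used in production.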

Hospital Occupancy Forecast

Data scientists at Statistics Canada are helping in the fight against COVID-19 by creating short-term hospital occupancy forecasts based on two daily inputs, using Ottawa hospital data as a test case. The inputs are daily new hospital admission counts and hospital midnight in-patient headcounts. Admission forecasts are determined by using two hierarchical Bayesian models. The first model describes the random delay between the unobserved event of COVID-19 infection and hospital admission, for the subgroup of infected individuals who will be hospitalized due to COVID-19. The second model describes the random delay between hospital admission and discharge or death.

A series of 25 consecutive weeks of mock forecasts based on real data was performed to assess the effectiveness of the forecast model. The resulting credible bands consistently encompassed the real hospitalization counts within one week after the respective training data cut-offs, and were also sufficiently narrow to be informative. The results of this project strongly suggest the feasibility of accurate and informative hospitalization forecasts at the municipality level, provided timely hospital admission and discharge/death data are available.
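The link between admissions, the admission-to-discharge delay and occupancy can be illustrated with a simple expected-value calculation. This is not the hierarchical Bayesian model described above: it is a deterministic sketch, and the admission counts and the length-of-stay probabilities are made up.

```python
def expected_occupancy(admissions, still_in_hospital):
    """Expected in-patient count per day.

    admissions[d] is the number admitted on day d.
    still_in_hospital[k] is the assumed probability that a patient is
    still in hospital k days after admission (zero beyond the list).
    """
    occupancy = []
    for day in range(len(admissions)):
        total = 0.0
        for admit_day in range(day + 1):
            elapsed = day - admit_day
            if elapsed < len(still_in_hospital):
                total += admissions[admit_day] * still_in_hospital[elapsed]
        occupancy.append(total)
    return occupancy


# Made-up inputs for three days.
admissions = [10, 0, 5]
still_in_hospital = [1.0, 0.5, 0.2]
occupancy = expected_occupancy(admissions, still_in_hospital)
```

Occupancy on a given day is the sum of earlier admissions, each weighted by the chance that the patient has not yet been discharged. A Bayesian treatment would replace the fixed probabilities with distributions, which is what produces credible bands rather than a single line.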

High Pandemic Hubs

Data scientists at Statistics Canada created a research project using a general machine learning framework to identify and predict health regions that could be considered vulnerable or at high risk of increased COVID-19 infection rates. By identifying these regions, federal and provincial health authorities would be able to divert public health resources such as PPE or frontline workers from lower-risk regions to higher-risk regions, and would also be able to contain cases in higher-risk areas sooner with contact tracing and quarantine measures.

This effort also contributed to the creation of an interactive dashboard that could allow users to monitor COVID-19 cases and deaths at the health-region level and to choose among multiple risk prediction models and approaches.

Using COVID-19 Epidemiological Modelling to Inform Personal Protective Equipment Supply and Demand in Canada

At the beginning of the pandemic, there were concerns surrounding Personal Protective Equipment (PPE) preparedness in Canada and whether there was enough supply to support the healthcare sector, and other sectors of the economy throughout the pandemic. In response to this emerging need, Statistics Canada customized an existing epidemiological model to allow policy makers to stress-test the PPE supply under various epidemiological scenarios. Projections generated from this epidemiological model have been used by the PPE supply and demand model to compare on-hand and in-bound supplies with demand projections over twelve months. For more information, please visit Modelling SARS-CoV-2 Dynamics to Forecast PPE Demand.

Optimal Social Distancing Policy via Reinforcement Learning

Data scientists at Statistics Canada collaborated with the Public Health Agency of Canada to develop a novel epidemiological modelling framework optimizing Non-Pharmaceutical Interventions using Reinforcement Learning. This model determines the best set of population behaviours to minimize the spread of an infection within simulations. Visit Non-Pharmaceutical Intervention and Reinforcement Learning for more details on this project.

Research

Statistics Canada's First Quantum Machine Learning Project: A collaboration with Université de Sherbrooke

Quantum computing—a new way of computing that uses principles of quantum mechanics to store and process information—holds a lot of promise as a solution for some computationally heavy processes and algorithms. Increasingly, governments and major companies are working to assess how quantum computing will impact their businesses in the near future.

As of June 2021, Statistics Canada is collaborating with the Université de Sherbrooke to explore the potential of quantum computing and identify opportunities early on in its development. The six-month project marks the first collaboration between Statistics Canada and the Quantum Hub at Université de Sherbrooke's Institut quantique (IQ). The Quantum Hub offers its members cloud-based access to advanced quantum computing systems, as well as a community of experts to support quantum research projects.

The project will explore ways to optimize the agency's machine learning processes and text classification computations, and how this technology could be used to support Statistics Canada's goal of providing high-quality data and insights to Canadians.

Homomorphic Encryption

Data security remains one of the highest priorities at Statistics Canada. Our data scientists are training a machine learning text classifier to use homomorphic encryption to protect data while they're being processed. The data are protected at two points. The first point, located at their ingestion points, allows data files to be processed remotely or on the cloud. The second point is at their dissemination points–this allows accredited external researchers at virtual labs access to more data in a secure manner. The use of homomorphic encryption not only ensures data protection but also acts as a solution for outsourcing computation. Visit A Brief Survey of Privacy Preserving Technologies for more information on homomorphic encryption and other privacy-preserving approaches.
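The core property of homomorphic encryption, computing on data without decrypting them, can be illustrated with a toy version of the Paillier scheme, which supports addition on ciphertexts. This is a sketch under assumptions: the source does not say which scheme the agency uses, the primes are far too small to be secure, and the example shows only encrypted addition, not a text classifier.

```python
import math
import random


def generate_keys(p, q):
    """Build a toy Paillier key pair from two small primes (not secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)  # modular inverse (Python 3.8+)
    return (n, g), (lam, mu)


def encrypt(public_key, message, rng=random):
    """Encrypt an integer 0 <= message < n with fresh randomness."""
    n, g = public_key
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, message, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(public_key, private_key, ciphertext):
    """Recover the plaintext integer."""
    n, _ = public_key
    lam, mu = private_key
    return ((pow(ciphertext, lam, n * n) - 1) // n * mu) % n


def add_encrypted(public_key, c1, c2):
    """Multiply ciphertexts, which adds the underlying plaintexts modulo n."""
    n, _ = public_key
    return (c1 * c2) % (n * n)


public_key, private_key = generate_keys(61, 53)
total = add_encrypted(public_key, encrypt(public_key, 120), encrypt(public_key, 45))
result = decrypt(public_key, private_key, total)  # 165
```

The party that calls `add_encrypted` never sees 120, 45 or 165. It only handles ciphertexts, which is what makes outsourcing the computation possible.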

A Novel Estimation Method for Non-Probability Samples

Probability samples allow reliable estimation of population characteristics and have been successfully used in statistics for many decades. However, due to rising costs and declining response rates, researchers have begun to develop theory for reliable estimation based on alternative data sources. Non-probability samples, such as web-based opt-in panels, are often relatively easy and inexpensive to obtain, but may suffer from severe self-selection bias where traditional estimation techniques cannot be applied. To help with this, researchers at Statistics Canada have developed nppCART, a novel estimation methodology for non-probability samples. nppCART attempts to correct for the self-selection bias by incorporating additional information from an auxiliary probability sample. nppCART is a variant of the well-known CART algorithm, and may be considered a nonparametric method. It was conceived with the hope that its nonparametric nature may be more useful against nonlinearity or complex interactions among predictor variables than existing non-probability sample estimation techniques. Visit the 2019 Annual Meeting in Calgary site for resources on the project.

Framework

Framework for Responsible Machine Learning Processes

Machine learning is becoming an increasingly integral part of many projects at Statistics Canada. Data scientists are looking to implement a responsible framework for machine learning and artificial intelligence applications that are transitioning to production. The framework includes an evaluation of the project through the use of a checklist, followed by a peer review of the project. As a final step, the methodology is presented to the Scientific Review Committee. The goal of this project is to establish a review process that ensures responsible machine learning processes are put into production while promoting good and ethical data science practices. This framework will also guide data scientists as they develop new projects. For more information, please visit Responsible Machine Learning at Statistics Canada.

Data science expertise

The agency's data scientists are experts in artificial intelligence and machine learning, leading the agency in data science-related research and development.

The data scientists are pioneering new technologies and innovative data science methods, offering expertise in image processing, natural language processing, integration of cloud tools, traceability methods, privacy preserving techniques, information retrieval and much more!

These experts have many areas of specialization, including supervised and unsupervised learning, artificial neural networks, reinforcement learning, data cloud engineering and more.

At Statistics Canada, these innovative methods are used to make more meaningful, powerful data insights.

Mission: building data science capacity

Statistics Canada's data science mission is to expand the capacity of data science and analytics within the Government of Canada and beyond.

What are the keys to building data science capacity?

Trust—Deliver concrete results while adhering to high ethical standards at all times to build trust in data science methods.

Innovation—Statistics Canada's data scientists are committed to identifying and adopting the latest data science practices to deliver fast results.

Quality—Statistics Canada's data science methods follow rigorous practices, including internal reviews of projects, to ensure high-quality results and valid statistical inference.

Collaboration—The agency is working with partners across the Government of Canada, academia, international partners and other members of the data science community to learn from one another and share leading-edge data science methods.

What are the benefits of data science for Canadians?

Data science allows Statistics Canada to better serve Canadians by creating high-value products and services. By applying the latest machine learning and artificial intelligence practices, the agency is able to process large data sets in shorter periods of time, supporting the need for increasingly nuanced data to better understand our country and our economy.

Machine learning can also be used to make sense of unstructured data such as images or sensor data, quickly classify large amounts of information, summarize and extract key information from narratives, provide predictions and assist with research.

Providing timely, high-quality information

As information needs continue to expand, it is critical for national statistical agencies to apply these innovative solutions to support evidence-based decision making. There are many benefits to data science for Canadians, including:

  • faster, timelier access to data products
  • more accurate results
  • more detailed, granular data
  • reduced response burden on households and businesses.

These solutions also benefit Statistics Canada by giving data scientists the ability to process large amounts of unstructured data, eliminating manual work and reducing costs without compromising data quality.

Data science at Statistics Canada

As the world around us continues to evolve and change rapidly in the digital age, data and how they are used are critically important.

Data science is a rapidly evolving field that can tap into the power of data and empower governments to serve citizens more effectively and efficiently. As the role of national statistical organizations continues to change and expand, these organizations must adapt and embrace new technologies and innovative thinking to support the information needs of society.

Statistics Canada is one of the leaders in the Government of Canada's adoption of data science and artificial intelligence. By taking a collaborative approach to data science, the agency is pushing the boundaries of modernization and harnessing the power of new approaches and technologies to better serve Canadians.

Data science supporting the COVID-19 response

Data science allows statistical agencies to respond quickly to changing economic and social situations. Statistics Canada is using the power of data science to support the COVID-19 response in Canada.

The agency collaborated with Health Canada to visualize the supply and demand information for Personal Protective Equipment (PPE). Before the data visualization could begin, the data needed to be extracted and ingested. The data were coming daily from many different sources (different provincial/territorial governments, other federal departments and private sector companies that had been hired to help source the PPE) and in many different formats (e.g. Word documents, Excel files, PDFs) and required a significant amount of manual work to create standardized reports.

To improve this process, data scientists at Statistics Canada created an algorithm that parses the data into different data entries. Machine learning was used to identify numbers and dates within the text. The structured data were then presented in a PowerBI dashboard that was shared with other government departments to meet their information needs and better understand the supply and demand for PPE in Canada.

For more information on Statistics Canada's response to COVID-19, visit COVID-19: A data perspective portal.

Commitment to privacy and security

As Statistics Canada continues to implement new technologies and innovations, the agency's commitment to protecting privacy and security remains the highest priority. The agency has rigorous measures in place to preserve confidentiality and privacy in the modern digital era.

The amount of data we gather and use and the power of the insights they generate are increasing rapidly. It is known that data are vulnerable throughout their lifecycle: at rest, in transit and during computation or processing. While the security mechanisms for data protection at rest (e.g. Symmetric Key Encryption) and in transit (e.g. Transport Layer Security) are well studied, Privacy Preserving Technologies have emerged in recent years to provide data protection while enabling data processing, such as in statistical analyses.

Privacy Preserving Technologies, or Privacy Preserving Computation Techniques, is a generic term that covers a broad range of approaches that promise to provide protection while collecting the data, processing it and disseminating the results. These approaches are homomorphic encryption, secure multi-party computation, differential privacy, trusted execution environments and zero-knowledge proofs. The data scientists at Statistics Canada are exploring the use of these existing and emerging privacy preserving technologies to continuously address the privacy preservation needs for highly sensitive data. This will also allow for alternative storage options to permit secure remote computing on encrypted data, to benefit from potential multi-party computation opportunities and to derive insights from distributed and inaccessible data.

For more information on how Statistics Canada protects data, visit Statistics Canada's Trust Centre.

Visit Data science projects at Statistics Canada to see data science in action!

Data Science Centre

In this rapidly-changing digital era, statistical agencies need to find innovative ways to harness the power of data. Statistics Canada is embracing the possibilities of data science to better serve the information needs of Canadians.

Data science at Statistics Canada

Statistics Canada is one of the leaders in the Government of Canada’s adoption of data science and artificial intelligence. Find out about the benefits of data science and how it is being used at Canada’s national statistical agency.

Data Science Network for the Federal Public Service

Join a community of data science enthusiasts to learn all about data science in the public service, collaborate on projects, share information on the latest tools, and much more.

Mission: building data science capacity

Learn about Statistics Canada’s mission to expand the capacity for data science within the Government of Canada and beyond.

Data science expertise

Discover the various areas of expertise of Statistics Canada’s data scientists who are leading the way with cutting-edge research and development.

Data science projects

Explore some of the agency’s innovative projects that are fueled by data science using natural language processing, satellite images, neural networks and other cutting-edge techniques.

Data science resources

Learn more about data science with these helpful resources.

Contact

Contact the Centre for Artificial Intelligence Research and Excellence (CAIRE) for more information about data science at Statistics Canada.

Canadian Centre for Energy Information (CCEI)

Consultation objectives

The Canadian Centre for Energy Information (CCEI) is an independent one-stop shop for comprehensive energy data and expert analysis. The centre compiles, reconciles and integrates energy data from a number of Canadian sources and makes data from multiple providers available free of charge on a user-friendly website. It works collaboratively to harmonize energy definitions, measurements and standards, and improve completeness, coherence and timeliness of Canada's energy information.

The CCEI is being developed by Statistics Canada in partnership with Canada Energy Regulator (CER), Natural Resources Canada (NRCan) and Environment and Climate Change Canada (ECCC). Statistics Canada launched the CCEI to expand publicly available data and analysis, and ensure all Canadians have access to centralized energy information.

The consultations ensured that the CCEI meets users' needs and identified any potential usability issues.

Consultation methodology

Statistics Canada conducted remote usability testing in both official languages with participants from across the country. Participants were asked to complete a series of tasks and to provide feedback on the product.

How participants got involved

This consultation is now closed.

Individuals who wished to obtain more information or to take part in a consultation were asked to contact Statistics Canada by sending an email to statcan.consultations@statcan.gc.ca.

Statistics Canada is committed to respecting the privacy of consultation participants. All personal information created, held or collected by the Agency is protected by the Privacy Act. For more information on Statistics Canada's privacy policies, please consult the Privacy notice.

Results

Overall, the beta version of the CCEI website was well-received by participants. They reported that it was easy to navigate and that it provided easy access to a variety of information.

Participants noted that the following areas worked:

  • The overall look and feel of the website
  • The icons and subjects on the home page
  • The inclusion of interactive features, such as data visualizations

Participants suggested that the following areas could be improved:

  • The use of space in the search results
  • The contextual information provided in the indicators
  • The organization of lists of resources throughout the website

After analysis, recommendations include:

  • Condense the search results as much as possible to allow users to easily browse through them
  • Ensure that relevant contextual information is available for the indicators
  • Ensure that the lists of datasets and publications allow users to easily sort through the content by organizing the lists logically and adding a sort or filter function

Statistics Canada thanks participants for taking part in this consultation. Their insights will guide the agency's web development and ensure that the final products meet users' expectations.

Retail Commodity Survey: CVs for Total Sales (April 2020)

NAPCS-CANADA | Month 202001 | Month 202002 | Month 202003 | Month 202004
Total commodities, retail trade commissions and miscellaneous services | 0.58 | 0.60 | 0.53 | 0.56
Retail Services (except commissions) [561] | 0.58 | 0.60 | 0.52 | 0.56
Food at retail [56111] | 0.86 | 0.54 | 0.50 | 0.78
Soft drinks and alcoholic beverages, at retail [56112] | 0.51 | 0.42 | 0.45 | 0.57
Cannabis products, at retail [56113] | 0.00 | 0.00 | 0.00 | 0.00
Clothing at retail [56121] | 1.01 | 0.72 | 0.94 | 1.64
Footwear at retail [56122] | 1.17 | 1.27 | 1.80 | 3.64
Jewellery and watches, luggage and briefcases, at retail [56123] | 5.07 | 5.19 | 10.71 | 31.84
Home furniture, furnishings, housewares, appliances and electronics, at retail [56131] | 0.90 | 0.67 | 0.64 | 0.78
Sporting and leisure products (except publications, audio and video recordings, and game software), at retail [56141] | 2.60 | 3.68 | 3.45 | 3.78
Publications at retail [56142] | 8.20 | 6.64 | 8.24 | 12.62
Audio and video recordings, and game software, at retail [56143] | 5.38 | 4.88 | 0.99 | 0.84
Motor vehicles at retail [56151] | 1.79 | 1.98 | 2.11 | 2.39
Recreational vehicles at retail [56152] | 3.98 | 4.74 | 4.73 | 4.70
Motor vehicle parts, accessories and supplies, at retail [56153] | 1.46 | 1.51 | 1.71 | 2.03
Automotive and household fuels, at retail [56161] | 2.34 | 2.50 | 1.98 | 1.95
Home health products at retail [56171] | 2.91 | 2.81 | 2.28 | 2.66
Infant care, personal and beauty products, at retail [56172] | 2.69 | 2.77 | 2.66 | 3.40
Hardware, tools, renovation and lawn and garden products, at retail [56181] | 2.61 | 2.49 | 1.69 | 1.97
Miscellaneous products at retail [56191] | 2.35 | 1.89 | 2.25 | 2.47
Total retail trade commissions and miscellaneous services (see Footnote 1) | 1.41 | 1.47 | 1.62 | 1.79

Footnotes

Footnote 1

Comprises the following North American Product Classification System (NAPCS) codes: 51411, 51412, 53112, 56211, 57111, 58111, 58121, 58122, 58131, 58141, 72332, 833111, 841, 85131 and 851511.


Data science terminology

Application Programming Interface (API)
Collection of software routines, protocols, and tools which provide a programmer with all the building blocks for developing an application program for a specific platform (environment). An API also provides an interface that allows a program to communicate with other programs, running in the same environment. (BusinessDictionary.com)
Artificial Intelligence (AI)

Artificial intelligence is a field of computer science dedicated to solving cognitive problems commonly associated with human intelligence such as learning, problem solving, visual perception and speech and pattern recognition.

Artificial Intelligence System

A technological system that uses a model to make inferences to generate output, including predictions, recommendations or decisions.

Corpus
In linguistics, a corpus is a large, structured set of texts. In the context of topic modelling, a corpus is a set of documents, and each document is viewed as a mixture of the topics present in the corpus. (wikipedia.org)
Data Science
Data Science is an interdisciplinary field that uses scientific methods and algorithms to extract information and insights from diverse data types. It combines domain expertise, programming skills and knowledge of mathematics and statistics to solve analytically complex problems.
Deep Learning
Subset of machine learning that imitates the workings of the human brain in processing data and improves performance. Typically, a multi-level algorithm that gradually identifies things at higher levels of abstraction. For example, the first level may identify certain lines, then the next level identifies combinations of lines as shapes, and then the next level identifies combinations of shapes as specific objects. Deep learning is popular for image classification. (www.datascienceglossary.org)
Event
An event in the Unified Modeling Language (UML) is a notable occurrence at a particular point in time. Events can, but do not necessarily, cause transitions from one state to another in a state machine. (wikipedia.org)
Latent variables
Latent variables are variables that are not directly observed but are rather inferred (through a mathematical model) from other variables that are observed (directly measured). Mathematical models that aim to explain observed variables in terms of latent variables are called latent variable models. (datascienceglossary.org)
Machine Learning (ML)

"Machine learning is the science of getting computers to automatically learn from experience instead of relying on explicitly programmed rules, and generalize the acquired knowledge to new settings."

United Nations Economic Commission for Europe's Machine Learning Team (2018 report)
The use of machine learning in official statistics.

In essence, Machine Learning automates analytical model building through optimization algorithms and parameters that can be modified and fine-tuned.

Machine Learning Algorithms
Machine learning algorithms use computational methods to "learn" information directly from data without relying on a predetermined equation as a model. The algorithms adaptively improve their performance as the number of samples available for learning increases. (Mathworks.com)
Machine Learning Model
A digital representation of patterns identified in data through automated processing using an algorithm designed to enable the recognition or replication of those patterns.
Natural Language Processing (NLP)
Natural language processing (NLP) is a method to translate between computer and human languages. It is a method of getting a computer to understandably read a line of text without the computer being fed some sort of clue or calculation. In other words, NLP automates the translation process between computers and humans. (techopedia.com)
One-hot vector
In NLP, a one-hot vector is a 1 x N matrix (vector), made of 0 and 1, used to distinguish each word in a vocabulary from every other word in the vocabulary. One-hot encoding ensures that machine learning does not assume that higher numbers are more important. For example, 'laughter' is not more important than 'laugh' when both words are represented in the vector. (wikipedia.org)
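The definition above can be illustrated with a minimal Python sketch. The three-word vocabulary and the `one_hot` helper are hypothetical and chosen only for illustration; they are not taken from any Statistics Canada project.

```python
def one_hot(word, vocabulary):
    """Return a 1 x N list with a single 1 at the word's position."""
    if word not in vocabulary:
        raise ValueError(f"{word!r} is not in the vocabulary")
    return [1 if entry == word else 0 for entry in vocabulary]

# Hypothetical three-word vocabulary, used only for illustration.
vocabulary = ["laugh", "laughter", "smile"]

laugh_vector = one_hot("laugh", vocabulary)        # [1, 0, 0]
laughter_vector = one_hot("laughter", vocabulary)  # [0, 1, 0]
```

Each vector contains exactly one 1, so no word is represented by a larger number than any other.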
Parsing
Breaking a data block into smaller chunks by following a set of rules, so that it can be more easily interpreted, managed, or transmitted by a computer. Spreadsheet programs, for example, parse data to fit it into a cell of a certain size (businessdictionary.com). ML algorithms can also be used to parse data.
Poisson process

A Poisson Process is a model for a series of discrete events where the average time between events is known, but the exact timing of events is random. A Poisson Process meets the following criteria: (towardsdatascience.com)

  • Events are independent of each other. The occurrence of one event does not affect the probability another event will occur.
  • The average rate (events per time period) is constant.
  • Two events cannot occur at the same time.
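One common way to simulate a process that meets these criteria is to draw the waiting time between consecutive events from an exponential distribution. The sketch below assumes a hypothetical rate of 2 events per time unit; the function name and parameters are illustrative only.

```python
import random

def simulate_poisson_process(rate, duration, seed=None):
    """Return increasing event times in [0, duration) for a constant average rate."""
    rng = random.Random(seed)
    event_times = []
    # Waiting times between events are independent exponential draws.
    current_time = rng.expovariate(rate)
    while current_time < duration:
        event_times.append(current_time)
        current_time += rng.expovariate(rate)
    return event_times

# Hypothetical settings: on average 2 events per time unit, observed for 1000 units.
events = simulate_poisson_process(rate=2.0, duration=1000.0, seed=42)
observed_rate = len(events) / 1000.0
```

Over a long observation window, `observed_rate` should fall close to the chosen rate, even though the timing of each individual event is random.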
Python
A programming language first released in 1991 that is popular with people doing data science. Python is noted for ease of use among beginners, and great power when used by advanced users, especially when taking advantage of specialized libraries such as those designed for machine learning and graph generation. (datascienceglossary.org)
R
An open-source programming language and environment for statistical computing and graph generation available for Linux, Windows, and Mac. (datascienceglossary.org)
Reinforcement Learning (RL)
Reinforcement Learning (RL) is a sub-field of Machine Learning involving a controller (termed an agent) capable of taking actions in the form of decisions within a system. After each decision is made by the controller, the system evolves to a new state and the controller receives a measure of utility. By trial and error, the controller learns from its experience to optimize an action selection strategy that maximizes the expected cumulative utility within the system. RL is typically used to solve problems that can be modelled as sequential decision processes.
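The trial-and-error loop described above can be sketched with tabular Q-learning, one well-known RL method that is used here only as an example. The five-position corridor, the utility of 1 for reaching the last position, and all parameter values are hypothetical assumptions made for illustration.

```python
import random

N_STATES = 5        # corridor positions 0..4; position 4 is the goal
ACTIONS = (-1, +1)  # index 0 steps left, index 1 steps right

def step(state, action):
    """Move within the corridor and return (next_state, utility, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def train_q_table(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values by trial and error with an epsilon-greedy agent."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                choice = rng.randrange(2)  # explore: pick a random action
            else:
                best = max(q[state])       # exploit, breaking ties at random
                choice = rng.choice([a for a in range(2) if q[state][a] == best])
            next_state, utility, done = step(state, ACTIONS[choice])
            # Move the estimate toward the utility plus the discounted future value.
            target = utility + (0.0 if done else gamma * max(q[next_state]))
            q[state][choice] += alpha * (target - q[state][choice])
            state = next_state
    return q

q_table = train_q_table()
policy = ["right" if right > left else "left" for left, right in q_table[:-1]]
```

Here `policy` lists the action the trained agent prefers in each non-goal position; the agent is never told that moving right is better and has to discover it from the utilities it receives.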
Robotic Process Automation (RPA)
Robotic process automation (RPA) is the term used for software tools that partially or fully automate human activities that are manual, rule-based, and repetitive. They work by replicating the actions of an actual human interacting with one or more software applications to perform tasks such as data entry, process standard transactions, or respond to simple customer service queries. (aiim.org)
Semantic
Semantics can address meaning at the levels of words, phrases, sentences, or larger units of discourse. In machine learning, semantic analysis of a corpus is the task of building structures that approximate concepts from a large set of documents. It generally does not involve prior semantic understanding of the documents. (wikipedia.org)
Stochastic optimization
Stochastic optimization methods are optimization methods that generate and use random variables. For stochastic problems, the random variables appear in the formulation of the optimization problem itself, which involves random objective functions or random constraints. Stochastic optimization methods also include methods with random iterates. (wikipedia.org)
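A method with random iterates can be sketched as a simple random search: propose a randomly perturbed candidate and keep it only if it improves the objective. The objective function, step size and iteration count below are hypothetical choices made for illustration.

```python
import random

def random_search(objective, start, step_size=0.5, iterations=2000, seed=0):
    """Minimize `objective` by proposing random moves and keeping improvements."""
    rng = random.Random(seed)
    best_x = start
    best_value = objective(start)
    for _ in range(iterations):
        # The next iterate is random: a Gaussian step away from the current best.
        candidate = best_x + rng.gauss(0.0, step_size)
        value = objective(candidate)
        if value < best_value:
            best_x, best_value = candidate, value
    return best_x, best_value

def objective(x):
    # Hypothetical objective with its minimum at x = 3.
    return (x - 3.0) ** 2

best_x, best_value = random_search(objective, start=0.0)
```

The search never uses the derivative of the objective; it relies only on random proposals and comparisons of objective values.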
Supervised Learning
A type of machine learning algorithm in which a system is taught via examples. For instance, a supervised learning algorithm can be taught to classify input into specific, known classes. The classic example is sorting email into spam versus non-spam. (datascienceglossary.org)
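The spam example can be sketched with a small naive Bayes classifier, a standard supervised method chosen here for illustration. The four labelled messages are invented training examples, and the function names are hypothetical.

```python
import math
from collections import Counter

def train_spam_filter(examples):
    """Count word frequencies per label from (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the label with the highest naive Bayes log-score."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented labelled examples: the "teaching" data for the algorithm.
training_examples = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("please review the quarterly report", "not spam"),
]
word_counts, label_counts = train_spam_filter(training_examples)
prediction = classify("claim your free prize", word_counts, label_counts)  # "spam"
```

The classes ("spam" and "not spam") are known in advance and supplied with every training example, which is what makes the approach supervised.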
Unsupervised Learning
A class of machine learning algorithms designed to identify groupings of data without knowing in advance what the groups will be. (datascienceglossary.org)
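Clustering is a typical example. The sketch below uses a one-dimensional version of k-means, one common clustering method chosen for illustration; the measurements and starting centroids are hypothetical.

```python
def assign_to_nearest(values, centroids):
    """Place each value in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for value in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))
        clusters[nearest].append(value)
    return clusters

def k_means_1d(values, initial_centroids, iterations=10):
    """Alternate between assigning values to clusters and updating centroids."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        clusters = assign_to_nearest(values, centroids)
        # Each centroid moves to the mean of its cluster; empty clusters keep theirs.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, assign_to_nearest(values, centroids)

# Hypothetical unlabelled measurements with two visible groups.
measurements = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids, clusters = k_means_1d(measurements, initial_centroids=[0.0, 5.0])
```

No labels are supplied: the algorithm is told only how many groups to look for and discovers the groupings from the data.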
Web scraping
Web scraping is a term for various methods used to collect information from across the Internet. Generally, this is done with software that simulates human Web surfing to collect specified bits of information from different websites. (techopedia.com)

2021 Census: 2A

Message from the Chief Statistician of Canada

Thank you for taking a few minutes to participate in the 2021 Census. The information you provide is converted into statistics used by communities, businesses and governments to plan services and make informed decisions about employment, education, health care, market development and more.

Your answers are collected under the authority of the Statistics Act and kept strictly confidential. By law, every household must complete a 2021 Census of Population questionnaire.

Statistics Canada makes use of existing sources of information such as immigration, income tax and benefits data to ensure the least amount of burden is placed on households.

The information that you provide may be used by Statistics Canada for other statistical and research purposes or may be combined with other survey or administrative data sources.

Make sure you count yourself into Canada's statistical portrait, and complete your census questionnaire today.

Thank you,

Anil Arora
Chief Statistician of Canada

Complete your census questionnaire:

  • Online: at www.census.gc.ca by using the secure access code printed above.
    or
  • On paper: please print using CAPITAL LETTERS.

Any questions?

  • www.census.gc.ca
  • Call us free of charge at 1-855-340-2021
  • TTY: 1-833-830-3109

Ce questionnaire est disponible en français (1-855-340-2021)

Confidential when completed

This information is collected under the authority of the Statistics Act, R.S.C. 1985, c. S-19.

Step A

1. What is your telephone number?

2. What is the address of this dwelling?

  • Number (and suffix, if applicable)
    (e.g., 302, 151 B, 16 1/2)
  • Street name, street type (e.g., DR = Drive), direction (e.g., N = North)
  • Apartment/unit
  • City, municipality, town, village, Indian reserve
  • Province/territory
  • Postal code

3. What is the mailing address of this dwelling, if different from above?
(e.g., Rural Route, PO Box, General Delivery)

Step B

1. Including yourself, how many persons usually live at this address on May 11, 2021?

Include: all persons who have their main residence at this address, even if they are temporarily away.

See the instructions on page 3 (joint custody, students, landed immigrants, secondary residence, etc.).

  • Number of persons

2. Including yourself, list all persons who usually live here on May 11, 2021.

Important: Begin the list with an adult followed, if applicable, by that person's spouse or common-law partner and by their children. Continue with all other persons who usually live at this address.

  • Person 1: Family name(s), Given name(s)
  • Person 2: Family name(s), Given name(s)
  • Person 3: Family name(s), Given name(s)
  • Person 4: Family name(s), Given name(s)
  • Person 5: Family name(s), Given name(s)
  • Person 6: Family name(s), Given name(s)
  • Person 7: Family name(s), Given name(s)
  • Person 8: Family name(s), Given name(s)
  • Person 9: Family name(s), Given name(s)
  • Person 10: Family name(s), Given name(s)

Step C

Did you leave anyone out of Step B because you were not sure the person should be listed?

For example, a student, a child in joint custody, a person temporarily away, a person who lives here temporarily, a resident from another country with a work or study permit, a refugee claimant, etc.

  • No
  • Yes
    • Specify the name and the relationship:
    • Specify the reason:

Step D

Copy the names in Step B to question 1, at the top of page 4.

Keep the same order.

If more than six persons live here, you will need an extra questionnaire; call 1-855-340-2021.

  1. Whom to include in Step B
    • All persons who have their main residence at this address on May 11, 2021, including newborn babies, roommates and persons who are temporarily away
    • Canadian citizens, landed immigrants (permanent residents), persons who have claimed refugee status (asylum seekers), persons from another country with a work or study permit and family members living here with them
    • Persons staying at this address temporarily on May 11, 2021 who have no main residence elsewhere.
  2. Where to include persons with more than one residence
    • Children in joint custody should be included in the home of the parent where they live most of the time. Children who spend equal time with each parent should be included in the home of the parent with whom they are staying on May 11, 2021.
    • Students who return to live with their parents during the year should be included at their parents' address, even if they live elsewhere while attending school or working at a summer job.
    • Spouses or common-law partners temporarily away who stay elsewhere while working or studying should be listed at the main residence of their family, if they return periodically.
    • Persons in an institution for less than six months (for example, in a home for the aged, a hospital or a prison) should be listed at their usual residence.

If this address is:

  • a secondary residence (for example, a cottage) for all persons who stayed here on May 11, 2021 (all these persons have their main residence elsewhere in Canada), mark this circle. Print your name, your telephone number and your main residence address at the bottom of this page. Do not answer other questions.
  • a dwelling occupied only by residents of another country visiting Canada (for example, on vacation or on a business trip), mark this circle. Print your name, your telephone number and your country of residence at the bottom of this page. Do not answer other questions.
  • the home of a government representative of another country (for example, an embassy or a high commission) and family members, mark this circle. Print your name, your telephone number and the country that you represent at the bottom of this page. Do not answer other questions.
  • Name
  • Telephone number
  • Number (and suffix, if applicable)
    (e.g., 302, 151 B, 16 1/2)
  • Street name, street type (e.g., DR = Drive), direction (e.g., N = North)
  • Apartment/unit
  • City, municipality, town, village, Indian reserve
  • Province/territory
  • Postal code
  • Country

Mail this questionnaire in the enclosed envelope today.

1. Name

In the spaces provided, copy the names in the same order as in Step B. Then answer the following questions for each person.

Person 1

  • Family name
  • Given name

The following questions refer to each person's situation on May 11, 2021, unless otherwise specified.

2. What was this person's sex at birth?

Sex refers to sex assigned at birth.

  • Male
  • Female

3. What is this person's gender?

Refers to current gender which may be different from sex assigned at birth and may be different from what is indicated on legal documents.

  • Male
  • Female
  • Or please specify this person's gender:

4. What are this person's date of birth and age?

If exact date of birth is not known, enter best estimate. For children less than 1 year old, enter 0 for age.

  • Day
  • Month
  • Year
  • Age

5. What is this person's marital status?

Mark "x" one circle only.

  • Never legally married
  • Legally married (and not separated)
  • Separated, but still legally married
  • Divorced
  • Widowed

6. Is this person living with a common-law partner?

Common-law refers to two people who live together as a couple but who are not married, regardless of the duration of the relationship.

  • Yes
  • No

7. What is the relationship of this person to Person 1?

If none of the responses in the list describes this person's relationship to Person 1, then specify a response under "Other relationship".

Person 1

  • Person 1

Person 2

  • Husband or wife of Person 1
  • Common-law partner of Person 1
  • Son or daughter of Person 1 only
  • Grandchild of Person 1
  • Son-in-law or daughter-in-law of Person 1
  • Father or mother of Person 1
  • Father-in-law or mother-in-law of Person 1
  • Brother or sister of Person 1
  • Foster child
  • Roommate, lodger or boarder
  • Other relationship — specify:

Persons 3-6

  • Son or daughter of both Persons 1 and 2
  • Son or daughter of Person 1 only
  • Son or daughter of Person 2 only
  • Grandchild of Person 1
  • Son-in-law or daughter-in-law of Person 1
  • Father or mother of Person 1
  • Father-in-law or mother-in-law of Person 1
  • Brother or sister of Person 1
  • Foster child
  • Roommate, lodger or boarder
  • Other relationship — specify:

8. Can this person speak English or French well enough to conduct a conversation?

Mark "x" one circle only.

  • English only
  • French only
  • Both English and French
  • Neither English nor French

9. a) What language(s) does this person speak on a regular basis at home?

  • English
  • French
  • Other language(s) — specify:

If this person indicates only one language in question 9. a), go to question 10.

9. b) Of these languages, which one does this person speak most often at home?

Indicate more than one language only if they are spoken equally at home.

  • English
  • French
  • Other language — specify:

10. What is the language that this person first learned at home in childhood and still understands?

If this person no longer understands the first language learned, indicate the second language learned.

  • English
  • French
  • Other language — specify:

11. Has this person ever served in the Canadian military?

Canadian military service includes service with the Regular Force or Primary Reserve Force as an Officer or Non-Commissioned Member. It does not include service with the Cadets (COATS), the Supplementary Reserve or the Canadian Rangers.

Mark "x" one circle only.

  • Yes, currently serving in the Regular Force or the Primary Reserve Force
  • Yes, but no longer serving in the Regular Force or the Primary Reserve Force
  • No

The following questions collect information in accordance with the Canadian Charter of Rights and Freedoms to support education programs in English and French in Canada.

12. Is this dwelling located in Quebec?

  • No
    • Continue with question 13.
  • Yes
    • Go to question 16.

13. Did this person do any of their primary or secondary schooling in French in Canada (including immersion)?

Mark "x" one circle only.

  • Yes (previously or currently attending)
  • No
    • Go to Step E

14. In which type of program was this schooling in French done?

  • A regular French program in a French-language school
  • A French immersion program in an English-language school
    • Go to Step E
  • Both types of programs
  • Other program — specify:

15. For how many years did this person attend a regular French program in a French-language school?

  • Number of years in primary schooling (including kindergarten and middle school)
    • Number of years
      • Go to Step E
  • Number of years in secondary schooling
    • Number of years
      • Go to Step E

16. Did this person do any of their primary or secondary schooling in an English-language school in Canada (including immersion)?

Mark "x" one circle only.

  • Yes (previously or currently attending)
  • No
    • Go to Step E

17. For how many years did this person do their schooling in an English-language school in Canada (including immersion)?

  • Number of years in primary schooling (including kindergarten)
    • Number of years
  • Number of years in secondary schooling
    • Number of years

Step E

Comments

Please use the space provided below if you have concerns, suggestions or comments to make about:

  • the steps to follow or the content of this questionnaire (for example, a question that was difficult to understand or to answer)
  • the characteristics of the questionnaire (for example, the design, the format, the size of the text).

Step F

If more than six persons live here, you will need an extra questionnaire; call 1-855-340-2021.

You have now completed your questionnaire. Please mail it today. If you have lost the return envelope, please mail the questionnaire to:

Statistics Canada
PO BOX 99996 STN FED-GOVT
Ottawa, ON K1A 9Z6

Thank you for your cooperation.

Reasons why we ask the questions

Steps A to C and question 1 are used to collect contact information and determine who should be included on the questionnaire. They help us ensure that we have counted everyone we need to count and that no one is counted twice.

Questions 2 to 7 provide information about the living arrangements of people in Canada, the family size, the number of children living with one parent or two parents, and the number of people who live alone. This information is used for planning social programs, such as Old Age Security and the Canada Child Benefit. It is also used by municipalities to plan a variety of services such as day care centres, schools, police, fire protection and residences for seniors.

Questions 8 to 10 are used to provide a profile of the linguistic diversity of Canada's population. This information is used to estimate the need for services in English and French, and to better understand the current state and the evolution of Canada's various language groups.

Question 11 provides information on the number of people with Canadian military experience. Governments will use this information to develop programs and services to meet the changing needs of the Veteran population.

Questions 12 to 17 collect information in accordance with the Canadian Charter of Rights and Freedoms to support education programs in English and French in Canada.

    The law protects what you tell us

    The confidentiality of your responses is protected by law. All Statistics Canada employees have taken an oath of secrecy. Your personal information cannot be given to anyone outside Statistics Canada without your consent. This is your right.