Data science projects

Data science plays an important role at Statistics Canada. All across the agency, new data science methods are being used to make our projects more efficient and provide better data insights to Canadians.

Contact the Data Science Centre for more information on Statistics Canada’s data science projects.

Event detection and sentiment indicators

Statistics Canada is developing a tool to detect specific economic events by analyzing millions of news articles. The tool uses machine learning algorithms, trained with input from analysts, to find and summarize information from the articles and organize the data into an informative dashboard. This tool saves considerable research time, which analysts can now spend on understanding the reasons behind these economic changes.

The agency is also producing new sentiment indicators to measure economic tendencies and their connection with key economic variables. Based on positive and negative interpretations of economy-related news articles, these indicators will allow subject matter experts to gain better insights into economic variables by industry, and will support the publication of near real-time economic confidence indicators.
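The text above does not say how the sentiment indicators are calculated. As a minimal sketch only, one common way to turn positive and negative article labels into an indicator by industry is a net balance score: the share of positive articles minus the share of negative ones. The function name, the label values and the sample data below are all hypothetical.

```python
from collections import defaultdict


def sentiment_indicator(articles):
    """Return a net sentiment score per industry.

    `articles` is an iterable of (industry, label) pairs, where label is
    "positive", "negative" or anything else (counted as neutral).
    The score is (positive - negative) / total, so it lies in [-1, 1].
    This formula is an illustrative assumption, not the agency's method.
    """
    counts = defaultdict(lambda: {"positive": 0, "negative": 0, "total": 0})
    for industry, label in articles:
        counts[industry]["total"] += 1
        if label in ("positive", "negative"):
            counts[industry][label] += 1
    return {
        industry: (c["positive"] - c["negative"]) / c["total"]
        for industry, c in counts.items()
    }


# Hypothetical, hand-labelled sample data.
sample = [
    ("retail", "positive"),
    ("retail", "negative"),
    ("retail", "positive"),
    ("energy", "negative"),
]
indicator = sentiment_indicator(sample)
```

In this sketch, a score near 1 means coverage of an industry is mostly positive and a score near -1 means it is mostly negative.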

Crop classification

Monitoring the production of farms in Canada is an important but costly undertaking. Surveys and in-person inspections require a large amount of resources, and the current approach to predicting crop yields is time-consuming. For these reasons, Statistics Canada is modernizing crop classification using an image classification approach.

Crop types are predicted using satellite imagery and neural networks. Initial results show that this approach is much faster and will reduce the survey response burden for farm owners, especially during the busy times of the year.

PDF extraction

Extracting information from PDFs and other documents can be a time-consuming process. Statistics Canada has been applying data science solutions to this challenge to make information available in a timelier manner.

For example, Statistics Canada acquired the historical dataset for SEDAR, a system used by publicly traded Canadian companies to file securities documents with various Canadian securities commissions. Statistics Canada employees use the SEDAR database for research, data confrontation, validation, and frame maintenance, among other purposes. The extraction of public securities documents such as financial statements, annual reports, and annual information forms is currently done manually, which is very time-consuming.

To increase efficiency, Statistics Canada's data scientists developed a state-of-the-art machine learning algorithm that identifies and extracts key financial variables (e.g. total assets) from the correct table (e.g. Balance Sheet) in a company's annual financial statement (a PDF document). They also transformed a large number of unstructured public documents from SEDAR into structured datasets, allowing automated extraction of information related to Canadian companies.

This algorithm automates the financial variable extraction process for up to 70,000 PDFs per year in near real time, significantly reducing the manual hours spent identifying and capturing the required information. This project also helps reduce data redundancy within the organization by providing a single point of access to information. Statistics Canada also developed an interactive web application that allows analysts across the organization to visualize and automatically extract variables for multiple purposes.
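To make the extraction step concrete, the sketch below pulls one variable (e.g. total assets) out of the text of a table that has already been extracted from a PDF. It uses a simple regular expression rather than machine learning, so it is a stand-in for illustration and not the agency's algorithm. The function name and the sample table are hypothetical.

```python
import re


def extract_variable(table_text, label):
    """Return the numeric value on the row that starts with `label`.

    Assumes one "label  value" row per line, with optional "$" and
    thousands separators. Returns None when no such row is found.
    """
    pattern = re.compile(
        rf"^\s*{re.escape(label)}\s+\$?\s*([\d,]+(?:\.\d+)?)\s*$",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(table_text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Hypothetical table text, as it might look after PDF-to-text conversion.
balance_sheet = """Balance Sheet
Cash              12,500
Total assets      1,250,000
Total liabilities 400,000
"""

total_assets = extract_variable(balance_sheet, "Total assets")
missing = extract_variable(balance_sheet, "Goodwill")
```

A rule like this breaks as soon as layouts vary between filers, which is one reason a learned model is attractive for documents at this scale.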

Retail scanner data

Statistics Canada publishes the total amount of products sold, as classified by the North American Product Classification System (NAPCS). Large scanner databases, with millions of records, are currently available from major retailers. Previously, products were assigned to NAPCS codes according to their description and other indicators, using dictionary-based coding combined with manual coding when required. Statistics Canada now uses machine learning to classify all of the product descriptions in the scanner data to the NAPCS and then obtains aggregate sales for each area. This approach has resulted in a higher degree of automation, as well as accurate, detailed retail data and a reduced response burden for major retailers.
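The classify-then-aggregate flow described above can be sketched as follows. The classifier here is a deliberately simple nearest-match rule based on token overlap, standing in for the machine learning model, whose details are not given in the text. The training examples, the product codes (placeholders, not real NAPCS codes) and the choice to aggregate by assigned code are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical labelled examples; codes are placeholders, not real NAPCS codes.
TRAINING = [
    ("whole milk 2l", "CODE_DAIRY"),
    ("cheddar cheese block", "CODE_DAIRY"),
    ("white bread loaf", "CODE_BAKERY"),
    ("whole wheat bagels", "CODE_BAKERY"),
]


def tokens(text):
    """Lower-case a description and split it into a set of words."""
    return set(text.lower().split())


def classify(description):
    """Return the code of the training example with the highest
    Jaccard token overlap with `description`."""
    words = tokens(description)

    def jaccard(example):
        other = tokens(example[0])
        return len(words & other) / len(words | other)

    return max(TRAINING, key=jaccard)[1]


def aggregate_sales(records):
    """Sum sales by assigned code for (description, sales) records."""
    totals = defaultdict(float)
    for description, sales in records:
        totals[classify(description)] += sales
    return dict(totals)


# Hypothetical scanner records: (product description, sales amount).
scanner_records = [
    ("Skim Milk 1L", 3.50),
    ("Cheddar Cheese Slices", 6.25),
    ("Bread Loaf Rye", 4.00),
]
totals = aggregate_sales(scanner_records)
```

Unlike a fixed dictionary lookup, a similarity-based or learned classifier can still assign a code to a description it has never seen verbatim, which is what reduces the need for manual coding.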

Survey comment classification

Data scientists at Statistics Canada created a machine learning model to automatically classify the electronic comments from respondents of the Survey of Sexual Misconduct in the Canadian Armed Forces (SSMCAF). The SSMCAF required automated classification of comments into five categories: personal story, negative, positive, advice for content, and other. The machine learning model coded 6,000 comments from the first 2018 survey cycle with 89% accuracy for French and English comments. This approach will be expanded to other surveys at Statistics Canada.
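For readers unfamiliar with the accuracy figure quoted above, the sketch below shows how such a figure is typically computed: the share of comments for which the model's category matches a human-assigned category. The five category names come from the text. The function name and the sample labels are hypothetical, and the text does not say how the agency validated its model.

```python
CATEGORIES = {
    "personal story",
    "negative",
    "positive",
    "advice for content",
    "other",
}


def accuracy(predicted, actual):
    """Return the fraction of positions where predicted equals actual."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    unknown = (set(predicted) | set(actual)) - CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Hypothetical labels for five comments.
human_labels = ["negative", "positive", "other", "personal story", "negative"]
model_labels = ["negative", "positive", "other", "personal story", "positive"]
score = accuracy(model_labels, human_labels)
```

Here the model agrees with the human coder on four of the five comments.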
