Supplement to Statistics Canada's Generic Privacy Impact Assessment related to web-scraping and other web-based collection activities for company-specific COVID-19-related information

October 2020

Program manager: Director, Mining, Manufacturing and Wholesale Trade Division

Reference to Personal Information Bank (PIB)

Not applicable as there are no direct personal identifiers being collected and retained.

Description of statistical activity

Statistics Canada will be automating web-scraping and other web-based collection activities in order to more expediently and efficiently gather web-based, public information required to analyse the impact of the COVID-19 pandemic on Canadian economic activity.

This initiative will automate three methods for collecting web-based, public information that are currently performed manually:

  • Scraping of Canadian companies' websites and of provincial and territorial government websites that provide information on COVID-19-specific essential services
  • Collecting information posted by these companies on their social media accounts (LinkedIn and Twitter)
  • Collecting company-specific information from news aggregator services (Government of Canada NewsDesk and Google News).

The information to be retrieved includes company name, date of access or date of publication, information source and “snippets” (paragraphs of text) that contain COVID-19-related keywords of interest. These snippets provide information on pandemic-related closures, changes in products or production processes, lay-offs, etc.
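As an illustration only, the record described above could be produced along the following lines. This is a minimal sketch, not the agency's implementation: the keyword list, the `Snippet` fields, the `extract_snippets` function and the sample page text are all hypothetical, and a "paragraph" is assumed to be a block of text separated by blank lines.

```python
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical keyword list; the actual keywords of interest are not given in this document.
KEYWORDS = ("covid-19", "coronavirus", "pandemic", "closure", "lay-off")


@dataclass
class Snippet:
    company: str    # company name
    accessed: date  # date of access (or date of publication)
    source: str     # information source, e.g. a URL
    text: str       # paragraph containing at least one keyword


def extract_snippets(company, source, page_text, accessed, keywords=KEYWORDS):
    """Return one Snippet per paragraph that mentions any keyword (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", page_text) if p.strip()]
    return [Snippet(company, accessed, source, p) for p in paragraphs if pattern.search(p)]


# Invented sample text for a fictional company.
page = (
    "Example Manufacturing Inc. makes industrial fasteners.\n\n"
    "Due to COVID-19, our Ontario plant will pause production for two weeks.\n\n"
    "Contact our sales team for a quote."
)
snippets = extract_snippets(
    "Example Manufacturing Inc.", "https://example.com/news", page, date(2020, 10, 1)
)
```

Only the second paragraph is kept, because it is the only one that matches a keyword; the surrounding text is discarded rather than stored.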

Web-Scraping

Web-scraping is carried out by using automated programs, or "bots", to access specific parts of company websites containing news on current activities.

Each day, Statistics Canada will scrape the websites of the Canadian manufacturers with the largest sales for mentions of COVID-19-related events such as closures, changes in products or production processes, and lay-offs. Provincial and territorial government websites that post information about essential services will also be scraped daily.

All scraping will be done in compliance with the site owners' terms and conditions.

Social Media

Relevant announcements by manufacturing companies on their LinkedIn and Twitter accounts will also be retrieved, either directly through an application programming interface (API), or indirectly through NewsDesk (which provides this service in addition to news aggregation).

News Service

NewsDesk and Google News will also be accessed, using company names together with keywords as search terms.
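A search term of this kind might be assembled as in the sketch below. The `build_queries` function, the sample company and the keywords are hypothetical, and the quoted-phrase and `OR` syntax is an assumption about what a given news search accepts; it is not taken from NewsDesk or Google News documentation.

```python
def build_queries(companies, keywords):
    """Pair each company name with the full keyword list in one query string."""
    # Assumption: the search service accepts quoted phrases joined by OR.
    keyword_clause = " OR ".join(f'"{k}"' for k in keywords)
    return [f'"{c}" ({keyword_clause})' for c in companies]


# Invented sample inputs.
queries = build_queries(["Example Manufacturing Inc."], ["COVID-19", "closure"])
```

Quoting the company name keeps the search tied to the exact name rather than to its individual words.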

While the company and provincial / territorial websites will be scraped on a daily basis, information from the social media accounts and news aggregators will be retrieved monthly.

The information collected from all three sources (web-scraping, social media and news services) will be processed, combined and stored in a database for access by Statistics Canada employees only, to assist with analysis of the economic impacts of COVID-19.

These activities are not meant to collect, create or use personal information. Should any personal information or personal identifiers – such as account name, handle, or any other piece of personal information relating to an individual – be inadvertently collected, this personal information will be stripped from the data and deleted.

Reason for supplement

The Generic Privacy Impact Assessment (PIA) addresses most of the privacy and security risks related to statistical activities conducted by Statistics Canada.

The purpose of this supplement is to address any privacy risks associated with the inadvertent collection of personal information, such as social media account names or handles relating to an individual, during the web-scraping and other web-based collection activities. If applicable, any personal information inadvertently collected will be stripped from the data and deleted.

Necessity and Proportionality

The automated web-scraping and web-based collection activities for the study of the impact of COVID-19 on Canadian economic activity are not meant to collect, create or use personal information. Any personal information inadvertently collected during these activities will be stripped from the data and deleted.

Furthermore, this project has been assessed against Statistics Canada's Necessity and Proportionality Framework:

  1. Necessity: This information is needed to measure the impact of COVID-19 on the manufacturing sector and to generate flash estimates of monthly GDP, a new statistical product put out by Statistics Canada.

    This activity will augment coverage and allow for high quality information on the impact of COVID-19 on Canadian economic activity for the benefit of Statistics Canada stakeholders, including the public, and will inform government policy and decision-making.

    This information will help provide more accurate data that will enable Canadians to have a much better understanding of how the COVID-19 pandemic is affecting various industries across Canada. Examples include the impacts of lock-down measures and plant closures on employment, how certain manufacturers modified their production lines to produce personal protective equipment, respirators or hand sanitizer, and cases where employment is on the rise.

    This type of web-based information is used by economic programs to validate, augment and analyze the information collected by other instruments: surveys or administrative data. Analysts use this information to ensure the quality of statistical products and to gain an understanding of the economic phenomena being measured.

  2. Effectiveness (Working assumptions): In the current pandemic context where economic activity is impacted and quickly evolving, the automation of this web-scraping activity provides the means of obtaining information on emerging or current issues regarding economic activity in a systematic, efficient and timely manner.

    When deployed in an interactive environment where information can be collected and presented on a daily basis, these tools will support the agency in meeting one of its stated objectives: the near real-time release of statistical information.

    Automating the data collection process is expected to result in measurable time and resource savings. In addition, automation makes it easier to share information across programs, which will ensure coherence of analysis across the agency.

    As a proof of concept, this initiative provides a test case of IT environments, machine learning, programming applications, and processes for the acquisition of information, all of which will allow the agency to modernize its processes for information collection, processing, reporting, and visualization.

  3. Proportionality: Measuring the impact of COVID-19 on Canadian economic activity does not require any personal information or personal identifiers. Only the necessary information about Canadian manufacturers will be collected. The data will be used only to enhance the agency's analysis and to replace what is currently collected manually. There is no intent to release this information to other departments or agencies, or to the public.

    The personal information that might be collected inadvertently is already in the public domain. Furthermore, since the privacy settings of the social media platforms being used (Twitter and LinkedIn) are well understood by users, especially when compared to the privacy settings of Facebook, users are disclosing this information knowingly.

  4. Alternatives: The aim of this project is to automate processes and present the information in a usable format.

    The alternative is to collect social media information (the only source under consideration that may contain personal identifiers) on a manual and intermittent basis, which is the current process. In comparison to current methods, this project has the potential to generate considerable time savings and automatically track company-related developments in real time. A survey was also considered, but it would not achieve the main goal, which is to produce real-time information.

    Finally, in terms of privacy, this project is not accessing any information that isn't currently available to analysts using manual processes. Just as analysts don't currently retain personal identifiers contained in social media, this project will remove any such identifiers before further processing.

Mitigation factors

Any personal information that is inadvertently collected will be identified, removed and destroyed immediately. An application will be set up to automatically identify and remove user account IDs and similar identifiers that are not associated with the companies whose information is being sought.

Using Twitter as an example, tweets and the re-tweets that they include are presented as separate database records. These individual records contain fields with personal identifiers such as the user ID and handle. As the data are being captured, the contents of these fields can be deleted for all users other than the companies whose information is being sought.
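The field-level deletion described above can be sketched as follows. This is an illustration under stated assumptions rather than the application itself: the field names `user_id` and `handle`, the set of target company account IDs, the `scrub_record` function and the sample records are all hypothetical.

```python
# Hypothetical account IDs of the companies whose information is being sought.
TARGET_ACCOUNT_IDS = {"1001"}

# Hypothetical names of the fields that hold personal identifiers.
IDENTIFIER_FIELDS = ("user_id", "handle")


def scrub_record(record, target_ids=TARGET_ACCOUNT_IDS, fields=IDENTIFIER_FIELDS):
    """Return a copy of the record with identifier fields blanked for non-target users."""
    if record.get("user_id") in target_ids:
        # Records from a target company account keep their identifiers.
        return dict(record)
    # For every other user, delete the contents of the identifier fields.
    return {k: (None if k in fields else v) for k, v in record.items()}


# Invented sample records: one company tweet and one re-tweet by an individual.
records = [
    {"user_id": "1001", "handle": "ExampleMfg", "text": "Plant reopening Monday."},
    {"user_id": "2002", "handle": "some_person", "text": "RT: Plant reopening Monday."},
]
scrubbed = [scrub_record(r) for r in records]
```

Applying the function at capture time, before anything is written to the database, means the identifiers of individuals are never stored. Note that this sketch only clears dedicated fields; identifiers that appear inside the text of a post would need separate handling.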

Conclusion

This assessment did not identify any privacy risks that cannot be managed using existing safeguards.

Formal approval

This Supplementary Privacy Impact Assessment has been reviewed and recommended for approval by Statistics Canada's Chief Privacy Officer, Director General for Modern Statistical Methods and Data Science, and Assistant Chief Statistician for Social, Health and Labour Statistics.

The Chief Statistician of Canada has the authority for section 10 of the Privacy Act for Statistics Canada, and is responsible for the Agency's operations, including the program area mentioned in this Supplementary Privacy Impact Assessment.

This Privacy Impact Assessment has been approved by the Chief Statistician of Canada.
