In this information era, it is more important than ever to provide Canadians with reliable and timely data in order to enable informed decision-making.
Statistics Canada is using web scraping to gather data efficiently. Web scraping is a process by which information is collected and copied from the Internet for analysis.
The use of web scraping is part of a broader effort to reduce burden on businesses and organizations while continuing to provide high-quality, timely data in a cost-effective manner.
In the spirit of openness and transparency, Statistics Canada is committed to respecting the following best practices when conducting web scraping.
Statistics Canada will:
- Transparency
  - carry out web scraping activities in a transparent, consistent and ethical manner;
  - notify the relevant companies that web scraping activities will be taking place;
  - publish the results of the web scraping activities on its website;
  - conduct all web scraping activities on Statistics Canada authorized computer equipment connected to its highly secure networks and secure the data on encrypted servers.
- Ethics
  - use web scraped data appropriately and responsibly in statistical programs in order to facilitate fulfilment of its mandate;
  - collect only data available to the public from businesses and organizations for use in its statistical and research programs;
  - take steps to minimize burden on the websites, such as scraping during off-peak hours and only as needed, and coordinating data requirements across statistical programs to avoid duplicating efforts;
  - use an application programming interface (API) when possible in lieu of web scraping;
  - limit collection to only what is necessary and proportional for the production of the required statistical outputs.
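
As an illustration only, the burden-minimizing practices listed above (scraping during off-peak hours and only as needed) could be sketched in Python along the following lines. This is not Statistics Canada's implementation. The off-peak window, the `example-stats-bot` user agent, the default delay and all function names are hypothetical, and consulting a site's `robots.txt` file is an assumption added for the example, not a practice stated in this document.

```python
from datetime import time
from urllib.robotparser import RobotFileParser

# Hypothetical values chosen for illustration; not taken from any policy.
OFF_PEAK_START = time(1, 0)    # 01:00 local time of the target site
OFF_PEAK_END = time(5, 0)      # 05:00 local time of the target site
USER_AGENT = "example-stats-bot"
DEFAULT_DELAY_SECONDS = 5.0    # pause between requests if the site states none


def is_off_peak(now: time, start: time = OFF_PEAK_START, end: time = OFF_PEAK_END) -> bool:
    """Return True if `now` falls inside the off-peak window [start, end)."""
    if start <= end:
        return start <= now < end
    # Window wraps past midnight, e.g. 23:00 to 04:00.
    return now >= start or now < end


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, now: time) -> bool:
    """Allow a request only during off-peak hours and only if robots.txt permits it."""
    return is_off_peak(now) and parser.can_fetch(USER_AGENT, url)


def request_delay(parser: RobotFileParser) -> float:
    """Use the site's Crawl-delay if it declares one, else the default pause."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS
```

A scraper built on this sketch would call `may_fetch` before each request and sleep for `request_delay` seconds between requests. Where the site offers an API, that API would be used instead, and none of this logic would be needed.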
Statistics Canada will not:
- scrape personal information about individuals from any website;
- scrape personal information that could establish a profile of individuals;
- resell web scraped data or use them for commercial purposes;
- scrape any information that will not be used to produce statistical outputs.