As Canadians grew increasingly concerned about the impact of COVID-19 on our society and our economy in March 2020, Statistics Canada set to work collecting vital information to support citizens and critical government operations during unprecedented times.
At the same time, analysts, researchers and data scientists across the Government of Canada faced another pressing concern: how could they provide much-needed information quickly and securely while working remotely, with limited access to their usual tools and computing infrastructure?
As the need for analytical capabilities became increasingly urgent, a team of experts at Statistics Canada came together to fast-track the Data Analytics as a Service (DAaaS) project and explore open source data solutions. The aim was to equip data scientists with the work environment they need to conduct deeper analysis and to provide insights on the impact of COVID-19 in Canada.
The result is the COVID-19 cloud platform for advanced analytics: a virtual data science workspace that integrates data from reliable StatCan sources, extracts information and presents it in a central platform that includes robust presentation and dissemination options.
Not only does this solution meet the needs of data scientists, but it also drives forward modernization at the national statistical agency by helping meet the strategic objectives of the Statistics Canada Data Strategy—including an enhanced focus on data science—at an expedited pace.
A multi-disciplinary tiger team creating a "dream" data science environment
The analytical platform is the result of a collaboration between Statistics Canada's Data Science Division, the IT DAaaS team, the Cloud Team and partners at Microsoft.
Each group had an important role to play. The cloud team laid the groundwork, providing a robust containerized foundation built on Kubernetes and the underlying Azure infrastructure as a service (IaaS). The DAaaS team integrated the service components, including the portal, on top of those underlying services. The Data Science team worked with the other teams to identify the open source software to install and to define pipelines and data flows. By having data science experts work alongside cloud and platform experts, the team was able to deliver a scalable, accessible platform that met data science needs. The result is an environment with a variety of advanced tools for satellite image processing, natural language processing and automation.
By breaking down barriers internally and externally, the team was able to create a cohesive workbench in a matter of weeks, all while working securely from home. The team took a user-centric approach to modernize the experience for data users and better meet their rapidly evolving needs, while providing end-to-end data science support.
"The platform has had a major, positive impact on the way we work. We are able to get better results, work in an agile way and see the benefits of modernization in action," explains Sarah MacKinnon, Assistant Director of the Information Technology Project Delivery team at Statistics Canada.
Inside the workbench you will find a state-of-the-art platform, a "dream data science environment," says Sevgui Erman, Director of the Data Science Division at Statistics Canada. "This environment addresses the high-capacity computing needs of data scientists and meets our needs for collaborative workspaces and tools. The workbench is equipped with tools for continuous integration and continuous development that allow for scalable and reproducible data pipelines, as well as advanced data and model management capabilities."
"You can also build out your workflows using GitHub Actions and Kubeflow Pipelines. With templates for training, validation, preprocessing, and RESTful model serving, and with integrations with Platform as a service offerings like Databricks or managed Data Lakes, the advanced analytics workspace gives you the freedom to harness whatever tools you want, and it gives you a unified layer to use them from," adds Blair Drummond, an analyst with StatCan's Data Science Division and a member of the tiger team.
Peek inside the workbench
The team gathered the best available open source tools to create a workbench that allows users to remotely access data loaded by Statistics Canada, with a focus on COVID-19. This powerful environment employs a full suite of data science and analytics tools, including:
- Jupyter Notebooks for R and Python
- Linux remote desktop
- Power BI
- R Shiny
- Pachyderm (data lineage and pipelines)
- Kubeflow Pipelines
- MLflow for model tracking
- custom web applications
- self-serve sharable storage
- and more.
The platform also includes support channels for user feedback and guidance.
The result is that data users are better equipped to analyze the impacts of COVID-19 and share their findings in a secure, confidential manner.
Why open source software? As Blair explains, "Open source software tools give users more flexibility and autonomy over their own work. These tools are accessible and crowd-sourced, meaning that users can also get support and help with analysis." Furthermore, results are reproducible by colleagues in other departments. An approach that incorporates open source software supports collaboration between data scientists that benefits all users.
The platform in action
Data scientists at StatCan have been hard at work putting the platform and its computing resources to use.
One example is the work of Kenneth Chu, a Senior Methodologist with StatCan's Data Science Division and one of the early adopters of the new platform. He tested its capabilities by performing a massively parallelized statistical analysis that would not have been feasible with the pre-existing computing infrastructure.
Kenneth fitted a hierarchical Bayesian model to provincial COVID-19 death count time series to estimate the effects of social distancing measures on COVID-19 transmissibility. The model, however, depended on certain crucial but unknown input parameters, namely the provincial COVID-19 infection fatality rates (IFR, defined as the conditional probability of dying of COVID-19 given that one is infected with it). In theory, each IFR can be estimated simply as the provincial ratio of the number of COVID-19 deaths to the true number of COVID-19 infections. Unfortunately, the near-complete lack of knowledge of the latter, especially during the early phase of the pandemic, rendered the IFRs highly uncertain.
The parallelized sensitivity analysis involved executing the Bayesian analysis independently a reasonably large number of times (200, to be precise), each time sampling the provincial IFRs randomly from the full range of plausible values. Each independent execution required approximately eight hours to complete, using two computing cores. The full sensitivity analysis, executed on DAaaS, thus required 3,200 CPU core hours in total (200 runs × 8 hours × 2 cores), which would have been impossible with the pre-existing infrastructure.
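The structure of such a sensitivity analysis can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code that was run on the platform: the province list, the IFR bounds and the `fit_model` function are hypothetical placeholders, and uniform sampling is assumed because the article does not specify the sampling distribution. Only the run count, hours per run and cores per run come from the description above.

```python
import concurrent.futures as cf
import random

N_RUNS = 200        # independent executions (from the article)
HOURS_PER_RUN = 8   # approximate hours per execution (from the article)
CORES_PER_RUN = 2   # computing cores per execution (from the article)

# Assumptions for illustration only: province codes and the plausible IFR range.
PROVINCES = ["BC", "ON", "QC"]
IFR_RANGE = (0.001, 0.02)


def sample_ifrs(rng):
    """Draw one IFR per province uniformly from the assumed plausible range."""
    low, high = IFR_RANGE
    return {province: rng.uniform(low, high) for province in PROVINCES}


def fit_model(ifrs):
    """Hypothetical stand-in for a single Bayesian model fit.

    The real fit took about eight hours; this placeholder only summarizes
    its input so that the sketch runs instantly.
    """
    return {"ifrs": ifrs, "mean_ifr": sum(ifrs.values()) / len(ifrs)}


def run_sensitivity_analysis(n_runs, executor, seed=0):
    """Sample the IFRs n_runs times and fit each draw independently."""
    rng = random.Random(seed)
    draws = [sample_ifrs(rng) for _ in range(n_runs)]
    # The runs share no state, so they can be mapped onto any pool of workers.
    return list(executor.map(fit_model, draws))


def total_core_hours(n_runs, hours_per_run, cores_per_run):
    """Total compute cost of the analysis in CPU core hours."""
    return n_runs * hours_per_run * cores_per_run
```

Passing a `cf.ProcessPoolExecutor()` as `executor` would spread the fits across local cores; on a cluster, each draw would more likely become its own job. Because the runs are independent, the elapsed time shrinks roughly in proportion to the number of workers, while the total of 3,200 core hours stays the same.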
The capacity to execute distributed/massively parallel workflows contributes to StatCan's Big Data infrastructure. In addition, such computing capacity also enables the use of many distribution-free statistical methods (e.g. resampling-, permutation-based ones), which are highly computationally intensive but complement modern complex analytical techniques from machine learning or Bayesian statistics.
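To give a sense of why permutation-based methods are computationally demanding, here is a minimal sketch of a two-sample permutation test in Python. It is a generic textbook version, not a method attributed to StatCan, and the function name and defaults are illustrative choices. The statistic is recomputed for every reshuffling of the data, so the cost grows with the number of permutations; each permutation is independent, which is what makes the method a natural fit for parallel infrastructure.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns an estimated p-value: the share of random relabellings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Shuffling the pooled data simulates "no difference between groups".
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)
```

No distributional assumption is made about the data; the price is that the loop body runs thousands of times, and far more when the statistic being permuted is itself an expensive model fit.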
Overall, the increased computing capabilities support the agency's mission to provide timely, critical information to Canadians during the unprecedented challenges of the COVID-19 pandemic.
A secure, phased approach
Currently, the COVID-19 analytical platform is accessible to Statistics Canada employees and to other Government of Canada departments that have research data partnerships with the agency. If you are a data scientist interested in this platform, please reach out to get involved and experience the platform by emailing firstname.lastname@example.org.
This is part of Statistics Canada's phased approach to granting access to the platform in a secure manner. In the first phase, access to the platform was limited to internal StatCan employees working with publicly available data only. The second phase was still limited to unclassified (publicly available) data, but access was extended to select Government of Canada employees by invitation. The third phase will feature protected data and use a mix of public and other data sets. Access to this platform will be promoted externally on the StatCan website. Each phase will include the necessary safeguards, including regular security assessments, to ensure that a secure environment is maintained at all times.
As this project continues to progress, Statistics Canada looks forward to engaging with the data science community and continuing to provide vital information to all Canadians.