The proceedings of the Symposium are available. Please visit the Statistics Canada International Symposium Series: Proceedings catalogue page to access the papers for the presentations.
All times listed in the schedule refer to Eastern Daylight Time (EDT): UTC-4
Friday October 15, 2021
09:15 – 09:30
Opening Remarks
- Anil Arora, Chief Statistician of Canada, Statistics Canada, Canada
09:30 – 10:30
Session 1 -- Keynote Address
Chairperson: Sevgui Erman
- Recent progress and upcoming challenges for research in machine learning
Yoshua Bengio, Mila - Québec Artificial Intelligence Institute, Canada
Abstract
Statistical learning methodologies such as deep learning, inspired by human cognition and neurosciences, have made tremendous progress in the last decade and are being deployed on a large scale throughout society. What may be the reasons for this success, which sometimes even seems to contradict common wisdom in statistical learning theory? As more and more products incorporate these algorithms, many questions are being raised about potential biases and other ethical concerns about their social impact. Looking forward, the danger of nefarious uses grows and calls for internationally coordinated regulation. On the other hand, the problem-solving and robust generalization abilities of state-of-the-art machine learning systems are still far from those of humans. What are promising areas of investigation aiming at bridging that gap towards human-level intelligence? We will discuss fundamental questions arising in this context regarding out-of-distribution generalization and the hope that a causal perspective and cognitive inspiration can help us bridge that gap.
10:30 – 10:45
Morning Break
10:45 – 12:00
Session 2A -- Inference from non-probability samples
Chairperson: Jean-François Beaumont
- Survey data integration for regression analysis using model calibration
Jae-Kwang Kim, Iowa State University, USA
Abstract
Data integration is an emerging area of research in survey sampling. By incorporating the partial information from external sources, we can improve the efficiency of the resulting estimator and obtain more reliable parameter analysis.
In this paper, we consider regression analysis in the context of data integration. To combine partial information from external sources, we employ the idea of model calibration which introduces a “working” reduced model based on the observed covariates. The working reduced model is not necessarily specified correctly, but can be a useful device to incorporate the partial information. The actual implementation is based on a novel application of the empirical likelihood method. The proposed method is particularly attractive for combining information from several sources with different missing patterns.
- Robust Bayesian inference for count data with varying exposures in non-probability samples using Gaussian processes of propensity prediction
Ali Rafei, University of Michigan, USA
Abstract
The ubiquitous availability of large-scale unstructured data has led to a growing interest in the use of such data for producing official statistics. However, the non-probabilistic nature of the sampling mechanism raises serious concern over the potential selection bias when making finite population inference. In the presence of a relevant probability sample, one can use augmented inverse propensity weighting to make doubly robust inference by combining the idea of propensity modeling with that of prediction modeling. However, this method may no longer be applicable when the rate of a rare event is of interest per unit exposure and the average exposure is substantially different across the two datasets. In addition, as a general drawback of design-based approaches, this adjustment method may perform poorly when there is evidence of influential pseudo-weights. We propose an alternative model-based approach using a partially linear Gaussian process regression, which remains doubly robust under these circumstances. It utilizes a negative binomial prediction model with a flexible function of the estimated pseudo-selection probabilities as a predictor to impute the rare outcome for non-sampled units of the population, and the varying exposure is treated as offset in the model. We show that the Gaussian process regression behaves as a non-parametric matching technique based on the estimated propensity scores. As a second advantage, our proposed method can be implemented under a Bayesian framework. Using the 2017 National Household Travel Survey as a benchmark, we apply our method to the naturalistic driving data from the second phase of the Strategic Highway Research Program (SHRP2) to estimate the traffic accident rates per driven mile and per calendar year in the United States.
- Advances in the use of auxiliary information for estimation from nonprobability samples
Ramon Ferri Garcia, Universidad de Granada, Spain
Abstract
Recent developments in questionnaire administration modes and data extraction have favored the use of nonprobability samples, which are often affected by selection bias that arises from the lack of a sample design or self-selection of the participants. This bias can be addressed by several adjustments, whose applicability depends on the type of auxiliary information available. Calibration weighting can be used when only population totals of auxiliary variables are available. If a reference survey that followed a probability sampling design is available, several methods can be applied, such as Propensity Score Adjustment, Statistical Matching or Mass Imputation, and doubly robust estimators. In the case where a complete census of the target population is available for some auxiliary covariates, estimators based on superpopulation models (often used in probability sampling) can be adapted to the nonprobability sampling case. We studied the combination of some of these methods in order to produce less biased and more efficient estimates, as well as the use of modern prediction techniques (such as Machine Learning classification and regression algorithms) in the modelling steps of the adjustments described. We also studied the use of variable selection techniques prior to the modelling step in Propensity Score Adjustment. Results show that adjustments based on the combination of several methods might improve the efficiency of the estimates, and that the use of Machine Learning and variable selection techniques can contribute to reducing the bias and the variance of the estimators to a greater extent in several situations.
10:45 – 12:00
Session 2B -- Visualization and Mapping of Image Data
Chairperson: Hélène Bérard
- Modernizing Construction Indicators Through Machine Learning and Satellite Imagery
Aidan Smith, U.S. Census Bureau and Hector Ferronato, Reveal Global Consulting, USA
Abstract
Official statistical agencies must continually seek new methods and techniques that can increase both program efficiency and product relevance. The U.S. Census Bureau’s measurement of construction activity is currently a resource-intensive endeavor, relying heavily on monthly survey response via questionnaires and extensive field data collection. While our data users continually require more timely and granular data products, the traditional survey approach and the associated collection cost and respondent burden limit our ability to meet that need. The availability of satellite imagery and advancements in data science techniques present a unique opportunity to overcome these limitations.
Since 1959, the Census Bureau has conducted the Survey of Construction to produce monthly estimates of housing starts and completions as part of the New Residential Construction principal federal economic indicator. In 2019, we began research on whether the application of machine learning techniques to satellite imagery could accurately estimate housing starts and completions while meeting our existing monthly indicator timelines at a cost equal to or less than existing methods. Using historical Census construction survey data in combination with targeted satellite imagery, the team trained, tested, and validated two convolutional neural networks capable of classifying images by their stage of construction. Used in conjunction with construction-boundary and change-detection models, the project is demonstrating the viability of a data science-based approach to producing official measures of construction activity.
- Statistics Canada's Seasonal Adjustment Dashboard
François Verret, Statistics Canada, Canada
Abstract
Seasonal adjustment at Statistics Canada is performed using the X-12-ARIMA method. For most statistical programs performing seasonal adjustment, subject matter experts (SMEs) are responsible for managing the program and for validation, analysis and dissemination of the data, while methodologists from the Time Series Research and Analysis Center (TSRAC) are responsible for developing and maintaining the seasonal adjustment process and for providing support on seasonal adjustment to SMEs. A visual summary report called the seasonal adjustment dashboard has been developed in R Shiny by the TSRAC to build capacity to interpret seasonally adjusted data and to reduce the resources needed to support seasonal adjustment. It is currently being made available internally to assist SMEs in interpreting and explaining seasonally adjusted results. The summary report includes graphs of the series across time, as well as summaries of individual seasonal and calendar effects and patterns. Additionally, key seasonal adjustment diagnostics are presented and the net effect of seasonal adjustment is decomposed into its various components. During this presentation, the visual representation of the seasonal adjustment process will be shown and a demonstration of the report and its interactive functionality will be provided.
- Diagnosis of Connectivity in Brazilian Education, an approach to support the formulation of public policies for connectivity in education.
Paulo Kuester Neto, Brazilian Network Information Center, Brazil
Abstract
Bearing in mind that in recent years we have seen a profound digital transformation in various sectors of society, and an increase in datafication, we must pay attention to the importance of proper data curation. At the same time, there is an opportunity to use such data in conjunction with official statistics in order to provide a holistic view to decision makers and public policy makers.
Promoting the opening of data and indicators, whether by government agencies, national statistical institutes or the organized third sector, presents challenges, but at the same time a gigantic opportunity: to obtain a less fragmented portrait of the reality or object observed. This work aims to contribute to this vision by making publicly available a Shiny (R) WebApp that combines official statistical bases, namely the Brazilian school census (INEP), geographic objects (IBGE), data from the national regulatory agency (ANATEL) and Internet quality metrics (NIC.br), to provide a less fragmented diagnosis of the condition of connectivity in 144,000 Brazilian public schools. For this purpose, a measurement agent was developed in partnership with the Brazilian Ministry of Education, which collects connectivity metrics in these public schools and, through summaries and statistical compositions, allows a view across the 27 states and 5,572 Brazilian municipalities. The main goal is to support state and municipal education secretaries, school directors and formulators of connectivity and education policies, who can filter the data in order to quantitatively and geographically check the condition of their education network.
- Multifactor Productivity Interactive Tool
Ken Peng, Ryan Macdonald and Claudiu Motoc, Statistics Canada, Canada
Abstract
The Multifactor Productivity application is an analytical tool that provides custom aggregation and custom tabulation of productivity statistics based on a series in CODR table 36-10-0211. The application permits custom aggregation across industries for all published variables, data transformations such as growth rate calculations, log-transformations and index re-basing, correlation and density analysis and visualization of retrieved outputs. The output values from calculations are available as .csv files. Visualizations can be downloaded as .png files. The data in the application has annual values beginning in 1961 and ending with the most recent data available. The application was developed in R on a Windows platform and uses a number of packages.
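For readers unfamiliar with the transformations mentioned above, the following minimal R sketch illustrates a growth-rate calculation, a log-transformation and an index re-basing on a toy annual series; the values, column names and base year are hypothetical and are not taken from CODR table 36-10-0211 or from the application's code.

# Hypothetical annual productivity index (illustration only)
mfp <- data.frame(
  year  = 1961:1970,
  index = c(100, 101.8, 103.1, 105.0, 106.2, 105.7, 107.4, 109.0, 110.3, 111.9)
)

# Growth rate: annual percentage change
mfp$growth_pct <- c(NA, 100 * diff(mfp$index) / head(mfp$index, -1))

# Log-transformation
mfp$log_index <- log(mfp$index)

# Re-base the index so that a chosen reference year equals 100
base_year <- 1965
mfp$rebased <- 100 * mfp$index / mfp$index[mfp$year == base_year]

head(mfp)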
12:00 – 12:30
Afternoon Break
12:30 – 13:45
Session 3A -- Quality Considerations when Using Machine Learning in the Production of Statistics
Chairperson: Wesley Yung
- With machine learning comes great power, let's be responsible!
Keven Bosa, Statistics Canada, Canada
Abstract
A framework for the responsible use of machine learning processes has been developed at Statistics Canada. The framework includes guidelines for the responsible use of machine learning and an associated checklist, which are organized into four themes: respect for people, respect for data, sound methods, and sound application. These four themes together ensure the ethical use of algorithms and machine learning results.
The framework is anchored in a vision that seeks to create a modern workplace and provide direction and support to those who use machine learning techniques. It applies to all statistical programs and projects conducted by Statistics Canada that use machine learning algorithms. This includes supervised and unsupervised learning algorithms.
During the presentation, the framework and supporting guidelines will be presented first. The machine learning project review process, which is how the framework is applied to Statistics Canada projects, will be explained. Finally, future work for improving the framework will be described.
- Design-unbiased statistical learning
Li-Chun Zhang, University of Southampton, United Kingdom
Abstract
A basic problem with supervised machine learning (ML) is that one needs to be able to ‘extrapolate’ the model learned from the available sample to the out-of-sample units, in order for learning to have any value at all. No matter how learning is organised given the sample, one cannot ensure it is valid outside the sample, unless the sample is selected from the population in some controlled manner. This well-known problem in statistical inference is sometimes recast as the problem of concept drift in the ML literature.
We develop a subsampling Rao-Blackwell method for exactly design-unbiased estimation using any ML technique, by combining three classic ideas from ML and Statistical Science: sample training-test split, Rao-Blackwellisation and design-based model-assisted estimation. For instance, by our approach, one can be certain that replacing linear regression by random forest would still lead to valid model-assisted estimation. Thus, whenever rich feature data are available, the method allows one to adopt any flexible ML technique automatically for estimating descriptive statistics at the aggregated level, “irrespectively of the unknown properties of the population” (Neyman, 1934).
In addition to design-unbiasedness, we develop stability conditions for design-consistency under both simple random sampling and arbitrary unequal probability sampling designs.
- Random Forest models, a proposal for the analysis of selective editing strategies
Roberta Varriale, ISTAT, Italy
Abstract
ISTAT has started a new project for its Short Term statistical processes to satisfy the forthcoming EU Regulation, which requires estimates to be released in a shorter time. The assessment and analysis of the current Short Term Survey on Turnover in Services (FAS) process aims at identifying how the best features of the current methods and practices can be exploited to design a more “efficient” process. In particular, the project is expected to deliver methods that would allow important economies of scale, scope and knowledge to be applied in general to the STS production context, which usually works with a limited number of resources.
The analysis of the AS-IS process revealed that the FAS survey incurs substantial E&I costs, especially due to intensive follow-up and interactive editing that is used for every type of detected error.
With this in mind, we tried to exploit the lessons learned by participating in the work of the High-Level Group for the Modernisation of Official Statistics (HLG-MOS, UNECE) on the Use of Machine Learning in Official Statistics. In this work, we present a first experiment using Random Forest models to (i) predict which units represent “suspicious” data, (ii) assess the potential use of the predictions on new data and (iii) explore the data to identify hidden rules and patterns. In particular, we focus on the use of Random Forest modelling to compare some alternative methods in terms of error prediction efficiency and to address the major aspects of the new design of the E&I scheme.
12:30 – 13:45
Session 3B -- Innovative solutions in social applications
Chairperson: Martin Renaud
- Leveraging the power of administrative data through the Longitudinal Social Data Development Program
Larry MacNabb and Jenneke Le Moullec, Statistics Canada, Canada
Abstract
Increased demands for disaggregated information on increasingly complex subjects, in conjunction with concerns around response burden and decreasing survey response rates, have made it necessary to look beyond household surveys to address the questions of today and the future. The Longitudinal Social Data Development Program (LSDDP) is a new program currently looking at how administrative data can meet these challenges. The presentation will provide an overview of how the LSDDP is exploring an expanded use of administrative data. Innovative areas being explored include methods for variable replacement on social surveys, cross-domain analysis in response to urgent demands for insights, such as on the opioid crisis, and how administrative data can be used in multisectoral life-course cohort analysis. Items covered will include progress, innovative techniques and challenges still to be resolved.
- Predicting transitions into and out of poverty using machine learning
Joep Burger and Jan van der Laan, Statistics Netherlands, the Netherlands
Abstract
The first sustainable development goal set by the United Nations in 2015 is to ‘end poverty in all its forms everywhere’ by 2030. To end poverty it is important to be able to identify risk factors that drive transitions into and out of poverty. Well-known risk factors for poverty are individual and macro-economic characteristics that affect income and expenditure. Policy makers have to move beyond averages, however, to effectively combat so-called micro-poverty traps. Supervised machine learning allows for a more flexible mapping of non-linear relationships and complex interactions than traditional regression techniques. The Dutch system of social statistical datasets provides the necessary data both in terms of population coverage and feature space. In this paper we address two research questions: 1) How well can individual (transitions into and out of) poverty be estimated from registered life histories using supervised machine learning? 2) Does the approach reveal new insight into risk factors for poverty? Two gradient boosting models have been developed: one to estimate the probability that a person who is not poor in one year becomes poor the next year, and one to estimate the probability that a person who is poor in one year stays poor the next year. Over five hundred features have been derived, on persons, households, dwellings and neighborhoods, about the past three years, covering demography, economy, crime and health. In addition to model performance, we studied feature importance, effects and interactions using SHAP and partial dependence. To move beyond the well-known risk factors we also studied subpopulations that differ in the mean observed poverty rate defined by the models. We will discuss the strengths and weaknesses of the applied approach.
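As a rough illustration of the modelling approach described in this abstract (gradient boosting with SHAP-type feature contributions), here is a minimal R sketch using the xgboost package; the simulated variables, sample size and parameter values are hypothetical and this is not the authors' data or model.

library(xgboost)

# Simulated toy data: predict whether a non-poor person becomes poor next year
set.seed(42)
n <- 1000
X <- data.frame(
  income   = rnorm(n, 30000, 8000),
  hh_size  = sample(1:6, n, replace = TRUE),
  employed = rbinom(n, 1, 0.8)
)
y <- rbinom(n, 1, plogis(-2 - 0.0001 * (X$income - 30000) - X$employed))

dtrain <- xgb.DMatrix(data = as.matrix(X), label = y)
fit <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 50)

# Per-observation feature contributions (SHAP values); averaging their absolute
# values gives a simple global importance measure
shap <- predict(fit, dtrain, predcontrib = TRUE)
colMeans(abs(shap))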
- Bridging the gap between the displaced and in-demand occupations
Vishal Subramanian Balashankar, Badri Venkataraman and Chris Astle, Cybera, Canada
Abstract
With labour market uncertainty increasing across Canada, there is a need for innovative ways to help displaced workers re-skill/up-skill and potentially pivot to in-demand occupations. In our study, we present a unique approach to bridge the gap between displaced and in-demand occupations and also provide a machine learning framework to forecast employment by NAICS sector six months ahead. To achieve this, we combined monthly employment data from Statistics Canada’s Labour Force Survey with monthly job ad counts from Burning Glass.
Our approach consists of the following three steps.
- Finding the displaced occupations in Alberta over the last 7 years based on the integrated actual employment and job ads count data. Validation is performed to establish the correlation between the two data sets in this step.
- Using the list of displaced occupations, a unique pivot graph is developed to map a displaced occupation to a list of in-demand occupations which are similar to the chosen displaced occupation. To establish similarity between occupations, a similarity score (from Burning Glass) is used. Once a prospective in-demand occupation is selected, the skill gap is computed and presented to the user.
- Applying SARIMA and SARIMAX models to forecast employment six months ahead. Across all the NAICS sectors, the models have a mean absolute percentage error of 1.4% and 10.76% in the 2019 and 2020 test sets, respectively. The monthly predictions have errors of less than 0.5%.
The above approaches are aimed at helping the government with public policy design and planning.
- Machine Learning for estimating heterogeneous treatment effects in program evaluations
Yves Gingras, Leeroy Tristan Rikhi and Andy Handouyahia, Employment and Social Development Canada, Canada
Abstract
Our study will show how the Evaluation directorate at Employment and Social Development Canada uses rich administrative data and Modified Causal Forests (MCF), a causal machine learning estimator, to inform policy development through impact evaluations. We will illustrate our implementation of the innovative MCF algorithm to estimate individualized treatment effects, thereby learning what works for whom. This endeavour is fully aligned with the Government of Canada’s commitment to implement a Gender-Based Analysis+ lens in evaluation work, ensuring that differential impacts on people of various sociodemographic backgrounds are considered in policy and program development.
13:45 – 14:15
Networking Event
Friday October 22, 2021
10:00 – 11:00
Session 4 -- In Memoriam of Professor Chris Skinner
Chairperson: Danny Pfeffermann
- Statistical Disclosure Control and Developments in Formal Privacy – Notes from Chris Skinner's Waksberg Lecture
Natalie Shlomo, University of Manchester, United Kingdom
Abstract
Chris Skinner was the honouree of the Waksberg Award in 2019 and sadly never got a chance to present his lecture at the Canadian International Methodology Symposium. Based on his notes sent to me by his son, Tom Skinner, I will give an overview of Statistical Disclosure Control (SDC) over the last decades and how it has evolved to more formal definitions of privacy. I will also emphasize Chris’s many contributions in the area of SDC. I will review his seminal research, starting in the 1990s with his work on the release of UK Census sample microdata. This led to a wide range of research on measuring the risk of re-identification in survey microdata through probabilistic models. Chris was deeply knowledgeable and expanded the depth and breadth of SDC research with publications on disclosure risk and harm, disclosure risk and record linkage, disclosure risk and forensic science, and most recently disclosure risk and differential privacy. Chris’s decades of research in SDC made him the definitive voice of a generation.
11:00 – 11:15
Morning Break
11:15 – 12:30
Session 5A -- Issues of Ethics and Privacy in the Application of Data Science in Official Statistics
Chairperson: Martin Beaulieu
- Explaining Explanations for Trustworthy Decision Making
Leilani Hendrina Gilpin, Sony AI / MIT Computer Science and Artificial Intelligence Laboratory, USA
Abstract
There has recently been a surge in the area of eXplainable artificial intelligence (XAI), which strives to create human-understandable or interpretable mechanisms by design or after the fact. However, these promises are not aligned with the technical capabilities of explanations, which are largely produced after the fact, without measuring completeness or whether the explanation is true to the processing of the underlying (possibly opaque) mechanism. In this talk, I will review the current capabilities of XAI, and focus on the necessity of explainability for developing XAI systems that are ethical and trustworthy decision-makers.
- Mitigating Algorithmic Discrimination in AI
Golnoosh Farnadi, HEC Montreal, Canada
Abstract
AI and machine learning tools are being used with increasing frequency for decision-making in domains that affect people's lives, such as employment, education, policing, and loan approval. These uses raise concerns about biases and algorithmic discrimination and have motivated the development of fairness-aware mechanisms in the machine learning (ML) community. In this talk, I will show how to measure bias and define fairness and why this is a challenging task. Then, I will present some techniques from my group to ensure fairness in different stages of the ML/AI pipeline. I will conclude my talk with takeaways, open questions, and future directions towards building a trustworthy AI system.
- Empowering analysts to consider the ethics of their work: A case study of the UK Statistics Authority's data ethics framework
Simon Whitworth, United Kingdom Statistics Authority, United Kingdom
Abstract
Our increasingly digital society provides multiple opportunities to maximise our use of data for the public good – using a range of sources, data types and technologies to enable us to better inform the public about social and economic matters and contribute to the effective development and evaluation of public policy. Ensuring use of data in ethically appropriate ways is an important enabler for realising the potential to use data for public good research and statistics. Earlier this year the UK Statistics Authority launched the Centre for Applied Data Ethics to provide applied data ethics services, advice, training and guidance to the analytical community across the United Kingdom. The Centre has developed a framework and portfolio of services to empower analysts to consider the ethics of their research quickly and easily at the research design phase, thus promoting a culture of ethics by design. This session will provide an overview of this framework, the accompanying user support services, the impact of this work and future plans for the work of the Centre.
11:15 – 12:30
Session 5B -- Quality and Measurement Error
Chairperson: Fritz Pierre
- Creation of a Composite Quality Indicator for Estimates Based on Administrative Data Using Clustering
Roxanne Gagnon, Martin Beaulieu, Danielle Lebrasseur, Wei Qian and Anthony Yeung, Statistics Canada, Canada
Abstract
Measuring and communicating quality is a challenge for statistical programs using administrative data only. Quality indicators such as coding rates, reported rates or linkage error rates are useful information to assess the accuracy of variables obtained from administrative data. What is not so clear is how these indicators can also be used to communicate information about quality to users along with clear recommendations on how to use the published estimates.
One example of programs using administrative data sources exclusively to produce their estimates is the Canadian Housing Statistics Program (CHSP). This program provides comprehensive information to monitor and analyze the Canadian housing market by combining multiple sources of administrative data. These sources have varying quality levels at the moment of their acquisition and the different steps to process these data and produce final estimates can potentially introduce errors.
Unsupervised machine learning is one way of building a composite quality indicator to describe the accuracy of different estimates in a multidimensional table. In this presentation, we will describe how a clustering algorithm was used to group domains that are similar in terms of the quality indicators derived for different post-acquisition steps, such as linkage, geo-coding and imputation. This analysis was used to assign labels to the resulting clusters and to inform users on their relative global quality.
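The abstract does not name the clustering algorithm used; purely as an illustration of the general idea, the R sketch below (with simulated indicator values) groups domains on standardized quality indicators with k-means and uses cluster membership as a composite quality label.

# Simulated post-acquisition quality indicators for 50 domains (illustration only)
set.seed(1)
ind <- data.frame(
  linkage_rate    = runif(50, 0.80, 1.00),
  geocoding_rate  = runif(50, 0.85, 1.00),
  imputation_rate = runif(50, 0.00, 0.20)
)

# Cluster domains on standardized indicators and label each domain by its cluster
km <- kmeans(scale(ind), centers = 3, nstart = 25)
ind$quality_cluster <- factor(km$cluster)
table(ind$quality_cluster)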
- Urban Tree Measurement Error and the Additional Uncertainty in Estimates of Ecosystem Services
James Westfall, Jason G. Henning and Christopher B. Edgar, U.S. Forest Service, The Davey Institute and University of Minnesota, USA
Abstract
The collection and analysis of urban forest inventory data has been steadily increasing in recent decades. In addition to typical assessments of structure and composition, amount and value of ecosystem services are estimated as indicators of benefits to anthropologic populations. As urban inventories are sample-based, sources of uncertainty and their magnitude provide important information for judging the reliability of estimated population parameters. Most analytical tools provide a sampling error statistic, but other types of uncertainty due to measurements or statistical models are not accounted for. In this study, measurement variation for a suite of urban tree attributes was examined and measurements were found to be equally or less variable than those taken on forest-grown trees. The prominent exception was tree diameter, which was more highly variable. In addition to quantifying the measurement variability, simulations that propagate the variation assessed the additional variance incurred for estimates of ecosystem services and associated valuations. Generally, there was an increase of about 1% or less in the standard error for most ecosystem services and their value. Measurement variation may contribute larger amounts of uncertainty for urban inventories lacking adequate field crew training and quality assurance processes.
- Administrative data for the estimation of population: statistical learning from the first waves of the Italian Permanent Population Census
Angela Chieppa, Nicoletta Cibella, Antonella Bernardini, Silvia Farano and Giampaolo de Matteis, ISTAT, Italy
Abstract
The Permanent Census of Population and Housing is the new census strategy adopted in Italy in 2018: it is based on statistical registers combined with data collected through surveys specifically designed to improve registers quality and assure Census outputs. The register at the core of the Permanent Census is the Population Base Register (PBR), whose main administrative sources are the Local Population Registers.
The population counts are determined by correcting the PBR data with coefficients based on the coverage errors estimated from survey data, but the need for additional administrative sources clearly emerged while processing the data collected in the first round of the Permanent Census. The suspension of surveys due to the global pandemic emergency, together with a serious reduction in the census budget for the coming years, makes a change in the estimation process more urgent, so that administrative data can be used as the main source.
A thematic register has been set up to exploit all the additional administrative sources: knowledge discovery from this database is essential to extract relevant patterns and to build new dimensions called ‘signs of life’, useful for population estimation.
The availability of the data collected in the first two waves of the Census offers a unique and valuable set for statistical learning: associations between survey results and ‘signs of life’ could be used to build classification models to predict coverage errors in the PBR.
This paper presents the results of the process to produce ‘signs of life’ that proved to be significant in population estimation.
- Measuring the Undercoverage of Two Data Sources with a Nearly Perfect Coverage through Capture and Recapture in the Presence of Linkage Errors
Abel Dasylva, Arthur Goussanou and Christian Olivier Nambeu, Statistics Canada, Canada
Abstract
In the context of its "admin-first" paradigm, Statistics Canada is prioritizing the use of non-survey sources to produce official statistics. This paradigm critically relies on non-survey sources that may have a nearly perfect coverage of some target populations, including administrative files or big data sources. Yet, this coverage must be measured, e.g., by applying the capture-recapture method, where they are compared to other sources with good coverage of the same populations, including a census. However, this is a challenging exercise in the presence of linkage errors, which arise inevitably when the linkage is based on quasi-identifiers, as is typically the case. To address the issue, a new methodology is described where the capture-recapture method is enhanced with a new error model that is based on the number of links adjacent to a given record. It is applied in an experiment with synthetic data generated from public census data from Canada and the United States.
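For context, the classic two-source capture-recapture (Lincoln-Petersen) estimator that this work builds on can be written in a few lines of R; the counts below are hypothetical, and the sketch deliberately ignores the linkage errors that the proposed methodology is designed to handle.

# Hypothetical record counts (illustration only)
n1 <- 95000   # records in source 1 (e.g., an administrative file)
n2 <- 97000   # records in source 2 (e.g., a census)
m  <- 92000   # records linked in both sources

# Lincoln-Petersen estimate of the population size and implied coverage of source 1
N_hat <- n1 * n2 / m
coverage_source1 <- n1 / N_hat
c(N_hat = N_hat, coverage_source1 = coverage_source1)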
12:30 – 13:00
Afternoon Break
13:00 – 14:15
Session 6A -- Data Visualization for Official Statistics
Chairperson: France Labrecque
- Find, explore and export data with the Canadian Statistical Geospatial Explorer
France Labrecque, Statistics Canada, Canada
Abstract
The Canadian Statistical Geospatial Explorer (CSGE) is a web mapping application that empowers users to discover Statistics Canada’s geo-enabled data at different geographic levels of detail on a thematic map. Using a hierarchical list of dynamic filters, users can access thousands of health, demographic and socio-economic indicators from the Census, other surveys and datasets produced and collected by the agency. Users can also customize the view of the thematic map (colors and data distribution), change the basemaps (satellite imagery, topography, etc.) to view the data in a different context, then export the map or the data selected in various formats to use in their own workflows. In short, the application is meant to be a tool to quickly find, explore and export data from a single screen, accessible from any device.
- Ireland's innovative approach to monitoring the SDGs and the COVID-19 Outbreak through geospatial visualisation
Kevin McCormack, Central Statistics Office, Ireland
Abstract
Ireland’s innovative approach for monitoring the national indicators for the UN Sustainable Development Goals (SDG) and the COVID-19 Outbreak, using geographic information systems, will be discussed. The Global Frameworks, statistical and geospatial, underpinning Ireland’s work will be referenced. Details of Ireland’s SDG and COVID-19 reporting ecosystems, with the National Statistics Office having a central role, will be presented.
It will be demonstrated that the development of a close and successful relationship between the statistical and geospatial communities in Ireland has facilitated the rapid development of Ireland’s National SDG and COVID-19 HUBs, which are geospatially enabled dashboards. These dashboards are recognised nationally as important dissemination and communication channels. Within these dashboards, the geospatially enabled data are visualised at a number of national geographies.
- INEGI's strategies towards a user-centric approach to dissemination
Andrea Fernandez Conde, Instituto Nacional de Estadística, Geografía e Informática, Mexico
Abstract
A 2017 national study revealed that 88.5% of Mexico’s economic units trust its National Institute of Statistics and Geography (i.e., Instituto Nacional de Estadística, Geografía e Informática; INEGI). Nonetheless, only 10.3% of them reported using its data to inform their business activity. The gap between trust and usage has been linked to challenges with the accessibility of INEGI’s data.
Some difficulties with accessibility include siloed approaches towards dissemination (e.g., by domain or statistical program), which led to the emergence of multiple tools with different subsets of data and independent architectures. Hence, support and updates of the ecosystem are the main focus of development time. Furthermore, users were treated as a single homogeneous entity. Thus, tools did not necessarily match a specific usage purpose, complicating their usability. Lastly, there was no feedback loop between the intended user and tool design.
To address accessibility challenges, management created an organization-wide Dissemination Unit (DU). The DU was created with the long-term mandate of improving accessibility and serviceability. The Generic Statistical Business Process Model (GSBPM) was the framework that allowed us to separate the responsibility of producers and disseminators while creating a space for multidisciplinary collaboration. For this presentation, the DU processes will be presented, as well as the quality assurance framework that guides their work plan.
Since the DU’s creation, users of INEGI’s website have increased from 10 million in 2016 to 13.3 million in 2020; visits from 62.5 to 70.1 million; and downloads from 3.3 million to 7.2 million. Over the same period, user satisfaction with the quality of navigation and organization, on a scale of zero to 100, rose from 60.3 to 83.8. It is our belief that other National Statistical Offices around the world could benefit from our framework.
13:00 – 14:15
Session 6B -- Health and COVID-19
Chairperson: Julie Bernier
- Physician experiences during the COVID-19 pandemic in the United States: Adapting an annual survey to assess pandemic-related challenges
Zachary J. Peters and Danielle Davis, National Center for Health Statistics, USA
Abstract
The U.S. National Center for Health Statistics (NCHS) annually administers the National Ambulatory Medical Care Survey (NAMCS) to assess practice characteristics and care provided by office-based physicians in the United States, including interviews with sampled physicians. This presentation will describe challenges, opportunities, and methodological adjustments in administering the 2020 NAMCS during the COVID-19 pandemic.
After the onset of the pandemic, NCHS adapted NAMCS methodology to assess the impacts of COVID-19 on office-based physicians. Specifically, partway through 2020, NCHS introduced questions to the NAMCS physician interview that assessed physician experiences related to COVID-19, including: shortages of personal protective equipment; COVID-19 testing in physician offices; providers testing positive for COVID-19; and telemedicine use during the pandemic.
NCHS also introduced novel analytic and dissemination strategies to capitalize on these adjustments to survey methodology. To enhance timeliness, quarterly weights were developed to allow for the early release of nationally representative physician estimates as each interview period was completed. Estimates of physicians’ experiences will be disseminated via data dashboards on the NCHS website (first release in Summer 2021), updated quarterly, and accompanied by corresponding data files for public use. Presenters will discuss the development and utility of these dashboards and will detail measures of physician experiences during the COVID-19 pandemic.
Although COVID-19 posed challenges, NCHS adapted and modernized NAMCS, producing more open and timely statistics and disseminating more interactive and user-centric data.
- Applying the data science approach to COVID-19 epidemiological modelling to inform PPE demand and supply in Canada
Deirdre Hennessy, Jihoon Choi, Joel Barnes, Christina Tucker, Kayle Hatt, Gillian Dawson and James Van Loon, Statistics Canada and Health Canada, Canada
Abstract
The global severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic continues to pose a serious threat to the health of Canadians. As of April 2021, over one million diagnosed cases and twenty thousand deaths have been reported in Canada.
The SARS-CoV-2 pandemic has put unprecedented demands on the Government of Canada to provide timely, accurate and relevant information to inform policy-making around a host of issues, including personal protective equipment (PPE) procurement and PPE deployment to the provinces and territories. The application of data science techniques, including the automation of data capture, processing and reporting, has allowed the Government of Canada to quickly stand up the Pan-Canadian PPE Demand and Supply project.
A key part of this PPE project was to model the demand for PPE in the health care system, which is particularly sensitive to the epidemiology of SARS-CoV-2. Epidemiological models can be used to project the trajectory of epidemics, under different future assumptions, allowing policy-makers to consider a range of scenarios.
Our team applied important elements of the data science approach to quickly iterate epidemiological scenarios and respond to emerging trends in the epidemic, such as vaccination and the emergence of variants. We will describe how we developed the epidemiological model from an existing open-source code base and optimized calculation processes by utilizing the power of multi-core processors in the Azure cloud environment, which allows parallel execution of multiple scenarios. In addition, we will describe how we created visualization tools to automate reporting of the model outputs for validation and communication.
- Harnessing Natural Language Processing and Machine Learning to Enhance Identification of Opioid-Involved Health Outcomes in the National Hospital Care Survey
Amy M. Brown and Nikki Adams, National Center for Health Statistics and Centers for Disease Control and Prevention, USA
Abstract
Electronic data collection has increasingly been utilized by national surveillance systems to reduce respondent burden and improve efficiency. Thus, there is a need to incorporate data science methods into systems that are growing in volume and complexity. The National Hospital Care Survey collects data from a nationally representative sample of hospitals in the U.S., including patient information from administrative claims and electronic health records. The National Center for Health Statistics has received funding from the Department of Health and Human Services’ Patient-Centered Research Trust Fund to enhance medical code-based algorithms that incorporate newly available data and data science methods. In this session, we will describe the use of natural language processing and machine learning techniques to search unstructured data (i.e., clinical text notes) to complement searches of structured data (i.e., diagnosis, procedure, medication, and laboratory codes). The presentation will include an overview of algorithms that identify evidence of opioid use, type of opioid agent taken, opioid overdose, and the presence of co-occurring substance use disorders and selected mental health issues. Methods used include keyword searches, negation detection, and named entity recognition to find drug name misspellings. Algorithm performance is also evaluated against an annotated dataset developed in-house. The presentation will also discuss challenges faced in integrating data science methods at a federal statistical agency, including various technological and data security limitations and how they were overcome. The presentation will conclude by describing efforts to make the algorithms and analytic data files accessible to researchers.
- The importance of data integration and automation for interactive web applications
Peter Solymos and Khalid Lemzouji, Analythium Solutions Inc., Canada
Abstract
The COVID-19 pandemic brought real-time data analytics and visualization into the forefront of news and public discussion. Shortly after the first dashboard by Johns Hopkins University, we started to build our own COVID-19 web app that pulls together various data sources as part of daily automated data updates (https://hub.analythium.io/covidapp/). The app condenses a lot of information about COVID-19 worldwide, in Canada, and in Alberta. Driven by our own interest in looking at case counts close to our homes, we decided to drill down into the Alberta data utilizing the space-time information we have available for 132 local areas in the province. Alberta Health regularly updates case numbers, including active cases, recovered cases, and deaths. We have recorded space-time data at the level of these local areas every day since March 2020. Using this information, we made an interactive map that is interlinked with the time series graph next to it. Besides the cumulative case numbers, we also looked at incidences. In our presentation we explain the app functionality and also the automated data ingestion process behind the app, which takes data from wide-format, long-format, and unstructured data sources. We’ll explain the challenges we encountered during 400 days of the pandemic and how critical well-maintained data processing pipelines are for decision making.
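To illustrate the kind of wide-to-long harmonization step mentioned above, here is a minimal R sketch using tidyr; the region names, dates and counts are hypothetical and this is not the app's actual pipeline.

library(tidyr)
library(dplyr)

# A wide-format source in which each reporting date is a separate column
wide <- data.frame(
  region       = c("Alberta", "Ontario"),
  `2020-03-01` = c(1, 5),
  `2020-03-02` = c(3, 9),
  check.names  = FALSE
)

# Reshape to the long format typically used for time-series plots and maps
long <- wide %>%
  pivot_longer(cols = -region, names_to = "date", values_to = "cases") %>%
  mutate(date = as.Date(date)) %>%
  arrange(region, date)

long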
Friday October 29, 2021
10:00 – 11:00
Session 7 -- Waksberg Award Winner Address
Chairperson: Bob Fay
- Multiple-Frame Surveys for a Multiple-Data-Source World
Sharon L. Lohr, Arizona State University, USA
Abstract
Multiple-frame surveys, in which independent probability samples are selected from each of Q sampling frames, have long been used to improve coverage, to reduce costs, or to increase sample sizes for subpopulations of interest. Much of the theory has been developed assuming that (1) the union of the frames covers the population of interest, (2) a full-response probability sample is selected from each frame, (3) the variables of interest are measured in each sample with no measurement error, and (4) sufficient information exists to account for frame overlap when computing estimates. After reviewing design, estimation, and calibration for traditional multiple-frame surveys, I consider modifications of the assumptions that allow a multiple-frame structure to serve as an organizing principle for other data combination methods such as record linkage, mass imputation, sample matching, small area estimation, and capture-recapture estimation. Finally, I discuss how results from multiple-frame survey research can be used when designing and evaluating data collection systems that integrate multiple sources of data.
11:00 – 11:15
Morning Break
11:15 – 12:30
Session 8A -- Integrating Multiple Data Sources
Chairperson: François Brisebois
- Methodological challenges of smart surveys – some case studies
Barry Schouten, Statistics Netherlands/Utrecht University, the Netherlands
Abstract
Smart surveys employ the potential of smart devices such as computing power, local data storage, sensor measurements, and linkage of public and personal online data. The main motivations for smart surveys are reduction of respondent burden, improvement of data quality and more accurate proxy measures of the statistical concepts of interest. Smart surveys form a bridge to big data and administrative data, but still treat the respondents as central persons in data collection.
The use of multiple data sources leads to a hybrid form of data collection. Since sensor data and other forms of data are subject to representation and measurement error themselves, and since smart surveys lean heavily on respondent willingness and motivation, there are various new methodological challenges. Perhaps the most challenging is the trade-off between passive measurement and active involvement of respondents, a trade-off that concerns respondent burden, data quality, respondent involvement and privacy. Efficient fieldwork employing planned missing designs and adaptive recruitment and motivation strategies is also an important open issue.
In the presentation, methodological challenges will be discussed on the basis of various case studies currently being developed or evaluated at Statistics Netherlands.
- On a Bayesian approach to improving probability sample estimators using a supplementary non-probability sample
Abel Dasylva, Yong You and Jean-François Beaumont, Statistics Canada, Canada
Abstract
Non-probability samples are being combined with probability samples to reduce survey costs and provide more timely estimates. This paper describes a Bayesian methodology for doing so when a finite population mean is estimated with a non-probability sample and a probability sample, where both sources contain the variables of interest and the auxiliary variables, the population mean is unknown for the latter, the sample design is possibly non-ignorable and the probability sample is without an indicator of inclusion in the non-probability sample. The proposed methodology may be used to improve the quality or reduce the costs of an existing probability survey through the acquisition of inexpensive non-probability sample data. It is evaluated in a simulation study featuring different priors and sample designs that are ignorable and non-ignorable.
- Imputation Methods for the Experimental Monthly State Retail Sales Report
Stephen J. Kaputa, US Census Bureau, USA
Abstract
On September 30, 2020, the Census Bureau began producing new monthly retail sales estimates for the experimental Monthly State Retail Sales (MSRS) report. These measures are composite estimates combining independently obtained synthetic estimates and hybrid estimates comprising third-party and directly collected establishment (point of sale) sales data and modeled establishment data. This presentation focuses on the hybrid estimator, specifically walking through the imputation model development and validation procedures. The imputation model is a Bayesian formulation of a linear mixed model that uses regression and random effects parameters to predict an establishment’s monthly retail sales; it is developed from blended data combining administrative records, survey data, and third-party data, and validated against national industry-level estimates from the Monthly Retail Trade Survey (MRTS). State-level geographic variation is modeled with an Intrinsic Conditional Auto-Regressive (ICAR) prior, which smooths estimates by modeling correlation between adjacent states. Multiple imputations from the posterior predictive distribution are combined with survey and third-party data to estimate state-level sales totals. The model parameters are estimated using Bayesian inference with the open-source probabilistic programming language “Stan” in R.
11:15 – 12:30
Session 8B -- Response Burden, Synthetic Data and Privacy Protection
Chairperson: Steven Thomas
- Growing Regression Trees that Use Sampling Frame Covariates to Explore Response Burden for Use in Survey Design
Yeng Xiong, Laura Bechtel, Diane Willimack and Colt Viehdorfer, US Census Bureau, USA
Abstract
The Economic Directorate of the U.S. Census Bureau is developing coordinated design and sample selection procedures for an integrated Annual Survey System. The unified sample will replace the directorate’s existing practice of independently developing sampling frames and sampling procedures for a suite of separate annual surveys, which optimizes sample design features at the cost of increased response burden. Size attributes of business populations, e.g., revenues and employment, are highly skewed. A high percentage of companies operate in more than one industry. Therefore, many companies are sampled into multiple surveys, compounding the response burden, especially for “medium-sized” companies.
This component of response burden is reduced by selecting a single coordinated sample but will not be completely alleviated. Response burden is a function of several factors, including (1) questionnaire length and complexity, (2) accessibility of data, (3) expected number of repeated measures, and (4) frequency of collection. The sample design can have profound effects on the third and fourth factors. To help inform decisions about the integrated sample design, we use regression trees to identify covariates from the sampling frame that are related to response burden. Using historic frame and response data from four independently sampled surveys, we test a variety of algorithms, then grow regression trees that explain relationships between expected levels of response burden (as measured by response rate) and frame covariates common to more than one survey. We validate initial findings by cross-validation, examining results over time. Finally, we make recommendations on how to incorporate our robust findings into the coordinated sample design.
- Evaluation of respondents' participation in the survey of Information and Communication Technologies usage in Enterprises (ICT)
Samanta Pietropaoli, Damiana Cardon, Claudio Ceccarelli, Gabriella Fazzi and Alessandra Nurra, ISTAT, Italy
Abstract
We propose a longitudinal analysis with a point of view connected to the organizational changes that have taken place in the Italian National Institute of Statistics. In 2016 the Institute introduced a new Directorate, intending to standardize and generalize the business process of Data Collection according to the European standard of the GAMSO model. The paper discusses the pros and cons of this change from the perspective of the survey's participation. The ICT survey response rate analysis demonstrates an increase of around 20% since the beginning of the new organization: the paper tries to focus on the impact of the changes introduced with the new organization. We used the data on response burden, collected in a specific section of the ICT questionnaire, paradata collected during the online compilation, and metadata.
This analysis suggests some actions that could be taken to improve respondents' participation, data quality, and respondents' perception of the official statistics. We focused our attention on a specific subset of respondents - the so-called "wanted" - the ones who have never responded to an ICT survey or any other Istat survey.
The paper aims to illustrate how an efficient organization of data collection reflects its benefits on survey results and what kind of actions should be taken to catch the attention of the "wanted".
- Generating smart deep files: the example of synthesizing hierarchical data
Héloïse Gauvin, Statistics Canada, Canada
Abstract
The Government of Canada’s Directive on Open Government aims to ensure Canadians get access to as much government information and data as possible. One solution for open data is smart synthetic files, which retain as much analytical value as possible and address the confidentiality issues posed by personal information.
Statistics Canada has acquired recognized expertise in producing such synthetic data files of high analytic value. For an ongoing project, we tackle a new challenge: preserving through synthesis the hierarchical structure created by family relations, which translates into common traits that must be maintained. Similar challenges arise when synthesizing other structured data, such as business data.
The presentation will illustrate the challenges and solutions set in place to build smart synthetic files for hierarchical data. An application of this strategy will be shown with the development of a synthetic database that supports the development of policies about retirement income. This database includes over 20 variables for 8 million records structured in about 4 million family units. We will present how the family structure was preserved, discuss the practical and technical challenges inherent to the development of such a large and complex file, evaluate the file’s risk and utility, and present future research avenues.
- Supervised Text Classification with Leveled Homomorphic Encryption
Zachary Zanussi, Benjamin Santos and Saeid Molladavoudi, Statistics Canada, Canada
Abstract
Privacy concerns are a barrier to applying remote analytics, including machine learning, on sensitive data via the cloud. In this work, we use a leveled fully Homomorphic Encryption scheme to train an end-to-end supervised machine learning algorithm to classify texts while protecting the privacy of the input data points. We train our single-layer neural network on a large simulated dataset, providing a practical solution to a real-world multi-class text classification task. To improve both accuracy and training time, we train an ensemble of such classifiers in parallel using ciphertext packing.
-
12:30 – 13:00
Afternoon Break
13:00 – 14:15
Session 9A -- Making Official Statistics More Open
Chairperson: Claude Julien
- Building better data to build a better future
Darren Barnes, Office for National Statistics, United Kingdom-
Abstract
Imagine a government data landscape that is simple to navigate, where content and data are easy to discover. Imagine an approach that gives users a better point of access to the portfolio of data the government produces. Imagine data and metadata built on consistency and standards.
The Integrated Data Programme (IDP) Dissemination work in the UK offers an approach that can make this a reality. It will be a game-changer. For statistical producers, we will build frameworks around recognised data and metadata standards and support tooling that helps produce data that is part of the web, not just on the web. We aim to develop exciting new products that enable more engaging content and world-class visualisations. For users, it offers a gateway to regularly published outputs and sophisticated faceted search options to discover the data and content they need, whatever the source. This work opens up new avenues for the distribution of government data and builds a world-leading online presence that will keep the UK relevant to users for years to come.
-
- The Linkable Open Data Environment: harmonizing open microdata from heterogeneous sources
Alessandro Alasia and Joseph Kuchar, Statistics Canada, Canada-
Abstract
The Linkable Open Data Environment (LODE) is an exploratory initiative that aims to enhance the use and harmonization of open microdata, primarily from municipal, provincial and federal sources. The results are a collection of datasets released under a single open data license, open source tools used to process the data, and collaborations in an open space. For example, the LODE team recently released the Open Database of Addresses (ODA), as well as two open source scripts for the automatic merging and processing of over 13 million building footprints and 10 million address points. This presentation will outline the vision of the LODE, detailing its goals, processes and outputs, including an interactive open source web map for visualizing the georeferenced data, known as the LODE Viewer.
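As an illustration of the kind of merge mentioned above, the sketch below joins open address points to building footprints with GeoPandas; the file and column names are hypothetical and this is not the LODE project's actual processing script.
```python
# Illustrative point-in-polygon merge of open address points onto building footprints.
import geopandas as gpd

footprints = gpd.read_file("building_footprints.geojson")   # polygons (hypothetical file)
addresses = gpd.read_file("address_points.geojson")         # points (hypothetical file)

# Put both layers in a common coordinate reference system before joining.
addresses = addresses.to_crs(footprints.crs)

# Spatial join: each address point receives the id of the footprint it falls within.
merged = gpd.sjoin(addresses, footprints[["footprint_id", "geometry"]],
                   how="left", predicate="within")

print(merged[["address_id", "footprint_id"]].head())
```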
-
- Development of R libraries for common tasks with open Canada data
Dmitry Gorodnichy, Canada Border Services Agency, Canada-
Abstract
Many Government of Canada groups are developing code to load, clean, transform, analyze and visualize various open Canada data, often duplicating each other’s efforts and with limited peer review of code quality. This project aims to develop a unified set of R packages that anyone can use to perform these data science tasks. To achieve this objective, data professionals from across the government have been invited to share their experiences and contribute related code at weekly “Lunch and Learn R” meet-ups. A dedicated GC Code team (r4gc) and GC Collab group (Use R!) have been created to facilitate the exchange and development of code and a knowledge base. This presentation provides an overview of the package development methodologies applied and the results obtained to date.
-
13:00 – 14:15
Session 9B -- Use of Data Science for Modeling
Chairperson: Jean LeMoullec
- Nowcasting Finnish real economic activity using traffic loop data
Pontus Lindroos, Henri Luomaranta and Paolo Fornaro, Statistics Finland, Finland-
Abstract
Statistics Finland started publishing an early version of the trend indicator of output (TIO) to answer users’ needs during the COVID-19 pandemic. The indicator was first published in April 2020, at the very beginning of the pandemic in Finland, and has since been published monthly, close to the end of the reference month. Nowcasting the TIO reduces the publication lag compared with the flash estimate (t+18) and the first official release (t+45), and provides a quick response to user needs during exceptional times.
The nowcasted TIO is produced using open source data on truck traffic volumes at about 100 automatic measuring points (Traffic Monitoring System, TMS) in the Helsinki-Uusimaa region and the Economic Sentiment Indicator (ESI) for Finland, published by Eurostat. The traffic data are updated continuously at t+1 day, which makes it possible to estimate the indicator practically in real time. Estimation uses a machine learning approach, and the methodology builds on previous work by Statistics Finland and ETLA Economic Research.
The nowcasted TIO is a real example of how new methodologies and data help improve the production of statistics, for example by reducing lags and thereby better supporting policymaking. The indicator has been used by both public and private actors during the pandemic, and publication will continue at least until the pandemic has settled. Statistics Finland is also exploring ways of including new data and methodologies in the regular production of statistics, and the nowcasted TIO provides a good example for the future.
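A minimal sketch of this kind of nowcasting setup is given below: monthly aggregates of truck-traffic counts and the ESI are used as regressors for the TIO in a scikit-learn model. The file names, feature construction and model choice are illustrative assumptions, not Statistics Finland's production methodology.
```python
# Sketch of a machine-learning nowcast: regress the monthly trend indicator of output
# on aggregated truck-traffic volumes and the Economic Sentiment Indicator.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical input files: monthly TIO target, daily TMS truck counts, monthly ESI.
tio = pd.read_csv("tio_monthly.csv", parse_dates=["month"], index_col="month")
tms = pd.read_csv("tms_daily_truck_counts.csv", parse_dates=["date"])
esi = pd.read_csv("esi_finland.csv", parse_dates=["month"], index_col="month")

# Aggregate daily traffic counts to monthly features (total and per-day mean volumes).
tms["month"] = tms["date"].dt.to_period("M").dt.to_timestamp()
traffic = tms.groupby("month")["truck_count"].agg(total="sum", daily_mean="mean")

X = traffic.join(esi, how="inner")
y = tio["tio"].reindex(X.index)

# Fit on months where the official TIO is already available...
train = y.notna()
model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X[train], y[train])

# ...and nowcast the reference month(s) where only traffic and the ESI are observed.
nowcast = pd.Series(model.predict(X[~train]), index=X.index[~train])
print(nowcast)
```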
-
- Relative Performance of Methods Based on Model-Assisted Survey Regression Estimation
Erin Lundy and J.N.K. Rao, Statistics Canada and Carleton University, Canada-
Abstract
Use of auxiliary data to improve the efficiency of estimators of totals and means through model-assisted survey regression estimation has received considerable attention in recent years. Generalized regression (GREG) estimators, based on a working linear regression model, are currently used in establishment surveys at Statistics Canada and several other statistical agencies. GREG estimators use common survey weights for all study variables and calibrate to known population totals of auxiliary variables. Increasingly, many auxiliary variables are available, some of which may be extraneous. This leads to unstable GREG weights when all the available auxiliary variables, including interactions among categorical variables, are used in the working linear regression model. On the other hand, new machine learning methods, such as regression trees and lasso, automatically select significant auxiliary variables and lead to stable nonnegative weights and possible efficiency gains over GREG. In this talk, a simulation study, based on a real business survey sample data set treated as the target population, is conducted to study the relative performance of GREG, regression trees and lasso in terms of efficiency of the estimators.
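For readers unfamiliar with GREG, the sketch below computes a model-assisted GREG estimate of a population total under a working linear model, with a lasso-assisted variant for automatic variable selection; the data are simulated and the code is illustrative, not the estimator implemented in Statistics Canada's production systems.
```python
# Worked sketch of a model-assisted GREG estimator of a population total:
#   t_hat = sum over population of fitted values + design-weighted sum of sample residuals.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

def greg_total(y, X, d, t_x, model):
    """y, X: sample study variable and auxiliary matrix; d: design weights
    (1 / inclusion probability); t_x: known population totals of the auxiliaries,
    with t_x[0] = population size N for the intercept."""
    model.fit(X, y, sample_weight=d)
    fitted_total = t_x[0] * model.intercept_ + np.dot(model.coef_, t_x[1:])
    residual_total = np.sum(d * (y - model.predict(X)))
    return fitted_total + residual_total

rng = np.random.default_rng(1)
n, N = 400, 100_000
X = rng.lognormal(size=(n, 3))                   # sampled auxiliary data
d = np.full(n, N / n)                            # equal design weights for illustration
y = 2.0 + X @ np.array([5.0, 0.0, 1.5]) + rng.normal(scale=2.0, size=n)
t_x = np.concatenate(([N], N * X.mean(axis=0)))  # stand-in for known population totals

print("GREG  :", greg_total(y, X, d, t_x, LinearRegression()))
print("Lasso :", greg_total(y, X, d, t_x, Lasso(alpha=0.1)))
```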
-
- On the path to more timely economic indicators: A comparison of traditional and new machine learning nowcasting methods
Christian Ritter and Zdenek Patak, Statistics Canada, Canada-
Abstract
This talk will present the results of a comparative study that evaluated several models in the context of nowcasting, with the goal of producing more timely estimates of economic indicators. A case study on nowcasting two macroeconomic indicators, Canadian GDP and building permits, is used to contrast models based on machine learning approaches with more traditional time series models. The talk will also discuss how official statistics and alternative data sources were identified and evaluated for the predictive models, as well as potential data pipelines for generating the nowcasts in a production scenario.
-
- Automation of Information Extraction from Financial Statements in SEDAR System using Spatial Layout based Techniques
Anurag Bejju, Statistics Canada, Canada-
Abstract
The Portable Document Format (PDF) is the format most commonly used by companies for financial reporting. The absence of effective means to extract data from these highly unstructured PDF files in a layout-aware manner presents a significant challenge for financial analysts seeking to analyze and process information in a timely manner. In this project, we introduce ‘Spatial Layout based Information and Content Extraction’ (SLICE), a computer vision algorithm that simultaneously uses textual, visual, and layout information to segment data points into a tabular structure. The proposed solution significantly reduces the manual work and hours spent identifying and capturing the required information by automating the financial variable extraction process for close to 70,000 PDFs per year in near real-time. The project also includes the development of a robust metadata management system that indexes close to 150 variables for each financial document, as well as a web application that allows users to interact with the extracted data points.
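A simplified sketch of layout-aware extraction is shown below: words and their bounding boxes are read with pdfplumber and grouped into rows by vertical position. This illustrates only the general idea; the SLICE algorithm's combined visual, textual and layout segmentation is not reproduced here.
```python
# Illustrative layout-aware extraction: read words with bounding boxes and bucket them
# into rows by vertical position, yielding a crude tabular structure.
import pdfplumber
from collections import defaultdict

def words_to_rows(pdf_path, page_number=0, y_tolerance=3.0):
    with pdfplumber.open(pdf_path) as pdf:
        words = pdf.pages[page_number].extract_words()   # dicts with text, x0, x1, top, bottom
    rows = defaultdict(list)
    for w in words:
        # Words whose vertical positions fall within y_tolerance share a row.
        rows[round(w["top"] / y_tolerance)].append(w)
    # Sort rows top-to-bottom and words left-to-right.
    return [
        [w["text"] for w in sorted(row, key=lambda w: w["x0"])]
        for _, row in sorted(rows.items())
    ]

# Hypothetical usage on a filed financial statement:
# for row in words_to_rows("financial_statement.pdf"):
#     print(row)
```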
-
Friday November 5, 2021
10:00 – 11:00
Session 10 – Poster Session
- What can we learn from missing data? Examining nonreporting patterns of height, weight, and BMI among Canadian youth
Amanda Doggett, Ashok Chaurasia, Jean-Phillipe Chaput and Scott Leatherdale, University of Waterloo, University of Ottawa and Children’s Hospital of Eastern Ontario Research Institute, Canada-
Abstract
Youth body mass index (BMI) derived from self-reported height and weight tends to suffer greatly from non-reporting. However, missing data examinations are rare in this domain, and mishandling or ignoring missingness can bias research results and conclusions. The objective of this research is to examine the patterns and predictors of missing data in youth overweight and obesity research. Using data from the 74,501 Canadian secondary school students who participated in the COMPASS study in 2018/19, descriptive statistics and data visualization were used to understand the degree and characteristics of missingness. In order to understand predictors of missingness, two approaches were used: sex-stratified generalized linear mixed models selected using an adapted pseudo-likelihood model selection framework, and classification trees. In this sample, 31% of BMI data were missing. Females were more likely to leave their weight unreported, whereas males were more likely to leave their height unreported. Preliminary models indicate a variety of diet, exercise, mental health, and substance use variables are associated with missingness. Perceiving oneself as overweight and having weight loss intentions were positively associated with BMI missingness among females, while self-perception as underweight and reporting weight gain intentions were positively associated with BMI missingness among males. These preliminary findings suggest that missingness in youth BMI is unlikely to be missing at random, highlighting the importance of using appropriate missing data methodology to limit potential bias in research which utilizes youth BMI. The predictors of missingness identified in this study can be used as the foundation for future research to identify auxiliary variables for maximum likelihood or multiple imputation approaches.
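A sketch of one of the two approaches named above, a classification tree for a BMI-missingness indicator, is given below using scikit-learn; the variable names and tuning parameters are hypothetical and this is not the authors' analysis code.
```python
# Classification tree predicting an indicator of BMI missingness from behavioural covariates.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("compass_2018_19.csv")           # hypothetical analytic file
df["bmi_missing"] = df["bmi"].isna().astype(int)  # outcome: 1 if BMI could not be derived

predictors = ["sex", "grade", "weight_perception", "weight_intention",
              "physical_activity", "mental_health_score", "substance_use"]
X = pd.get_dummies(df[predictors], drop_first=True)
y = df["bmi_missing"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=200, random_state=0)
tree.fit(X_train, y_train)

print("held-out accuracy:", tree.score(X_test, y_test))
# Inspect which covariates drive missingness.
print(export_text(tree, feature_names=list(X.columns)))
```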
-
- A bridging model to reconcile statistics based on data from multiple sources
Andreea Luisa Erciulescu, Jean D. Opsomer and F. Jay Breidt, Westat and Colorado State University, USA-
Abstract
Surveys designed to collect data on similar variables using samples representing the same population may still result in different estimates due, for example, to differences in sample designs or modes of data collection. Considered in this paper is the case where two surveys were conducted concurrently, with one using the same methodology as used in prior rounds of the survey and the other using an updated methodology, resulting in substantial differences in several key estimates. Due to differences in sample size, only the latter survey was detailed enough for disaggregated-level estimates of publishable quality. We propose a hierarchical model to account for discrepancies in the estimates from the two surveys and a Bayesian approach for producing reliable estimates at various levels of aggregation. The model relies on a common latent structure at the disaggregated level to allow “bridging” between the two surveys. The methodology is applied to the 2016 National Survey of Fishing, Hunting and Wildlife-Associated Recreation and the 2016 50-State Surveys of Fishing, Hunting and Wildlife-Related Recreation. Aligning these two surveys is critical to extend the series of related statistics that have been published since 1955, allowing for meaningful comparisons over time despite the change in survey methodology.
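A toy version of such a bridging structure is sketched below in PyMC: both surveys' direct estimates are treated as noisy observations of a common latent disaggregated-level mean, with an additive survey effect capturing the methodology change. The actual model in the paper is richer; everything here, including the simulated inputs, is illustrative.
```python
# Toy hierarchical "bridging" model: two surveys observe the same latent state-level means,
# with an additive effect for the prior methodology. Illustrative only.
import numpy as np
import pymc as pm

n_states = 50
y_old = np.random.normal(10, 2, n_states)   # hypothetical estimates, prior methodology
y_new = np.random.normal(11, 2, n_states)   # hypothetical estimates, updated methodology
se_old = np.full(n_states, 1.0)             # design-based standard errors (treated as known)
se_new = np.full(n_states, 0.5)

with pm.Model() as bridge:
    mu = pm.Normal("mu", 0, 100)                          # overall mean
    tau = pm.HalfNormal("tau", 10)
    theta = pm.Normal("theta", mu, tau, shape=n_states)   # common latent state-level means
    delta = pm.Normal("delta", 0, 10)                     # bridging effect of the old methodology

    pm.Normal("obs_new", theta, se_new, observed=y_new)
    pm.Normal("obs_old", theta + delta, se_old, observed=y_old)

    idata = pm.sample(1000, tune=1000, target_accept=0.9)

# Posterior means of theta give reconciled state-level estimates on the new basis.
```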
-
- Combining rules for F- and Beta-statistics from multiply-imputed data
Ashok K Chaurasia, University of Waterloo, Canada-
Abstract
Missing values in data impede inference for population parameters of interest. Multiple Imputation (MI) is a popular method for handling missing data since it accounts for the uncertainty of missing values. Inference in MI involves combining point and variance estimates from each imputed dataset via Rubin's rules. A sufficient condition for these rules is that the estimator is approximately (multivariate) normally distributed. However, these traditional combining rules become computationally cumbersome for multicomponent parameters of interest, and unreliable at high rates of missingness (due to an unstable variance matrix).
New combining rules for univariate F- and Beta-statistics from multiply-imputed data are proposed for decisions about multicomponent parameters. The proposed combining rules have the advantage of being computationally convenient, since they involve only univariate F- and Beta-statistics, while providing the same inferential reliability as the traditional multivariate combining rules. A simulation study demonstrates that the proposed method maintains low type I and type II error rates at relatively large proportions of missingness. The general applicability of the proposed method is demonstrated within a lead exposure study assessing the association between lead exposure and neurological motor function.
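For reference, the sketch below implements the traditional Rubin's rules for a scalar parameter, the baseline that the proposed univariate F- and Beta-statistic rules are designed to improve on; the proposed rules themselves are not reproduced here.
```python
# Traditional Rubin's rules for a scalar estimate from m imputed datasets.
import numpy as np
from scipy import stats

def rubin_scalar(estimates, variances):
    """estimates, variances: length-m arrays of the point estimate and its variance
    computed on each imputed dataset."""
    m = len(estimates)
    q_bar = np.mean(estimates)                 # combined point estimate
    u_bar = np.mean(variances)                 # within-imputation variance
    b = np.var(estimates, ddof=1)              # between-imputation variance
    t = u_bar + (1 + 1 / m) * b                # total variance
    r = (1 + 1 / m) * b / u_bar                # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2            # Rubin's degrees of freedom
    ci = q_bar + np.array([-1, 1]) * stats.t.ppf(0.975, df) * np.sqrt(t)
    return q_bar, t, df, ci

# Hypothetical use: five imputed-data analyses of a regression coefficient.
est = np.array([0.42, 0.47, 0.39, 0.45, 0.44])
var = np.array([0.010, 0.011, 0.009, 0.012, 0.010])
print(rubin_scalar(est, var))
```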
-
- Quality Assurance in Emergencies: Developing a Framework for Reporting on Emergency Performance Statistics in Response to the COVID-19 Pandemic
Simon Rioux, Anuoluwa Iyaniwura and Chimaobi Amadi, Employment and Social Development Canada, Canada-
Abstract
The emergency situation related to the spread of COVID-19 led the Government of Canada to take unprecedented measures to help the population cope with the economic impact of this pandemic. With the almost immediate introduction of emergency benefits, new information needs emerged and there were several calls for official statistics on the number of people receiving benefits and the amounts spent. Shortly after the introduction of the Canada Emergency Response Benefit, the Office of the Chief Data Officer (OCDO) of Employment and Social Development Canada (ESDC) became the central point for producing evidence to respond to multiple requests from all sides: politicians, media, other government departments, other teams within ESDC, provincial partners, etc. The first version of the database combining employment insurance and Canada Revenue Agency data had just been built, combining data from two different universes, each with its own structure and standards. In this context, the data quality team at the OCDO was responsible for the quality assurance of the data requests, the data for the website and the database itself. The presentation aims to show both the process and the evidence that demonstrates the net benefits of such a process, and will demonstrate how the urgency to act can be an opportunity to do better.
-
- Innovative Use of Mapping Applications to Support Recruitment and Collection Activities of the 2021 Census of Population
Mark Oswald, Kimberley Easter and Jacob MacLean, Statistics Canada, Canada-
Abstract
Mapping applications were developed to facilitate data integration, data insights and decision making for the implementation of targeted communication, collection and recruitment activities.
The Census operations web mapping application was designed to provide greater spatial and geographical context to inform decision making regarding collection activities, whether by visualizing a more systemic issue, such as the effect of internet connectivity on certain collection methods, or more acute factors, such as COVID interfering with the ability to deploy personnel to certain regions.
Mapping applications can provide indicators via timely and adaptive data updates to facilitate decision making. Their strength resides in the capacity to create linkages between different data points (around collection units) and to demonstrate potential dependencies and relationships to guide decision making.
-
- Improved decision-making in imputation design through data visualization
Darren Gray, Statistics Canada-
Abstract
A number of decisions must be made before approving an imputation method for production. Apart from choosing a specific method (and associated parameters), one must also determine whether an approach meets acceptable quality thresholds, or if more resources (particularly time) are required to investigate improvements or alternatives. Data visualization offers a number of tools that can facilitate this process, offering quick and efficient approaches to assess and compare imputation methods, identify potential issues through exploratory analysis, and incorporate uncertainty into decision-making. In particular, we attempt to integrate modern concepts of uncertainty visualization into our outputs.
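One such visualization is sketched below: overlaying the distribution of observed values with the values produced by two candidate imputation methods. The data and method labels are simulated placeholders, not output from an actual production system.
```python
# Illustrative diagnostic: compare observed values with values from two imputation methods.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
observed = rng.lognormal(mean=3.0, sigma=0.4, size=2000)
imputed_a = rng.lognormal(mean=3.0, sigma=0.25, size=500)   # e.g., nearest-neighbour donor
imputed_b = rng.lognormal(mean=3.2, sigma=0.40, size=500)   # e.g., regression imputation

fig, ax = plt.subplots(figsize=(7, 4))
bins = np.linspace(0, 80, 40)
ax.hist(observed, bins=bins, density=True, alpha=0.4, label="observed")
ax.hist(imputed_a, bins=bins, density=True, histtype="step", linewidth=2, label="method A")
ax.hist(imputed_b, bins=bins, density=True, histtype="step", linewidth=2, label="method B")
ax.set_xlabel("variable of interest")
ax.set_ylabel("density")
ax.legend()
plt.show()
```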
-
- Estimating hog inventories using traceability data: A feasibility study
Joshua Gutoskie, Jeremie Spagnolo and Herbert Nkwimi Tchahou, Statistics Canada-
Abstract
Statistics Canada’s AgZero initiative aims to reduce the response burden on Canadian farmers by replacing survey-based estimates with model-based estimates that use alternative data sources. One of these AgZero projects seeks to replace the hog inventory estimates from Statistics Canada’s Livestock Survey by leveraging hog traceability data obtained from the Canadian Pork Council. The PigTRACE dataset tracks all hog movements within Canada. The goal of this project is to determine the feasibility of using the PigTRACE dataset and historical survey estimates to produce provincial-level inventory estimates, birth estimates, and interprovincial movement estimates. This presentation will describe the methods investigated in this study, including preprocessing, classification, and estimation.
-
11:00 – 11:15
Morning Break
11:15 -- 12:30
Session 11A -- Applying Data Science and Machine Learning Methods in Official Statistics: Opportunities and Challenges
Chairperson: Saeid Molladavoudi
- Data science for faster, richer insights: opportunities and challenges
Louisa Nolan, Office for National Statistics, United Kingdom-
Abstract
The appetite for faster, richer information has never been greater. We face two global challenges: the impact of the COVID-19 pandemic and climate change.
In this presentation, we will discuss how the UK’s Office for National Statistics has been responding to the demand for faster, richer data, illustrated with examples from the ONS Data Science Campus. We have been applying data science tools and technology to new sources of data, such as mobility data, ship tracking, traffic cameras and Earth Observation, to better understand our economy, society, and environment. This supplements our traditional survey data.
The use of new data sources and new tools comes with challenges if we are to use them confidently and maintain trust in our official statistics and in our use of personal data. We will discuss those challenges, and the progress being made – domestically and internationally – in addressing them.
-
- Fair and explainable AI from an official statistics perspective
M.P.W. (May) Offermans and Barteld Braaksma, Statistics Netherlands, the Netherlands-
Abstract
The European strategy for artificial intelligence (AI) puts a lot of emphasis on fairness, transparency and explainability. How to operationalise such abstract notions remains a challenge. In particular, awareness is growing that it all starts with understanding the characteristics of the underlying data sets. When a machine learning method is trained on a selective or biased data set, it takes effort to make sure its results do not display undesired discrimination. In addition, phenomena such as feedback loops and concept drift may strengthen such effects when algorithms are repeated over time. A natural question is thus how to deal with data for AI from an official statistics perspective. In fact, our skill in understanding and processing data is appreciated beyond our own statistical world. Government institutions come to us for advice when considering sensitive AI applications. In the quickly developing world of AI it is not always clear what the role of a national statistical institute should be in such cases, but only by doing hands-on research and discussing results with stakeholders can we define it better.
This contribution discusses work done at Statistics Netherlands on fair and explainable AI, both for internal use in statistics production and for use cases in other government bodies. We discuss fairness models based on counterfactual fairness, examples of applications in official statistics that display concept drift, and a dashboard and AI starter kit we developed for use by civil servants, all aimed at understanding what it takes to create fair AI applications.
-
- Data science pipelines @ Istat: challenges and solutions
Monica Scannapieco, ISTAT, Italy-
Abstract
In line with the path taken by the European Statistical System, Istat is investing in innovative methods to harness Big Data sources and to use them for the production of new and enriched Official Statistics products. Big Data sources are not, in general, directly tractable with traditional statistical techniques; consider, for example, data types such as images and text, which exemplify the Variety dimension of Big Data. This motivates and justifies the growing interest of National Statistical Institutes in Machine Learning techniques.
Istat is currently using Machine Learning techniques in innovation projects and for the publication of experimental statistics. This paper provides an overview of Istat's main current projects and focuses on two specific Big Data-based production pipelines, related to the processing of text sources and imagery sources, respectively. The paper highlights the main challenges related to the Machine Learning tasks within these two pipelines and the solutions put in place to address them.
-
11:15 -- 12:30
Session 11B -- Machine Learning and Modeling for Classification
Chairperson: Steve Matthews
- Need for Speed: Using fastText (Machine Learning) to Code the Labour Force Survey
Justin Evans and Javier Oyarzun, Statistics Canada, Canada-
Abstract
Statistics Canada’s Labour Force Survey (LFS) plays a fundamental role in the mandate of Statistics Canada. The labour market information provided by the LFS is among the most timely and important measures of the Canadian economy’s overall performance. An integral part of the LFS monthly data processing is the coding of respondents' industry according to the North American Industry Classification System (NAICS), occupation according to the National Occupational Classification System (NOC) and the Primary Class of Workers (PCOW). Each month, up to 20,000 records are coded manually. In 2020, Statistics Canada developed Machine Learning models using fastText to code responses to the LFS questionnaire according to the three classifications mentioned above. This presentation will provide an overview of the methodology developed and the results obtained from a potential application of fastText in the LFS coding process.
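A minimal sketch of supervised coding with fastText is shown below; the training-file layout follows fastText's standard `__label__` convention, while the hyperparameters, file names and confidence-threshold rule are illustrative assumptions rather than the production LFS configuration.
```python
# Minimal sketch of supervised industry coding with fastText.
import fasttext

# train_naics.txt contains lines such as:
# __label__NAICS_4411 new and used car dealership sales
# __label__NAICS_6111 elementary school teacher public board
model = fasttext.train_supervised(
    input="train_naics.txt",   # hypothetical training file of labelled write-in responses
    epoch=25,
    lr=0.5,
    wordNgrams=2,
    loss="softmax",
)

labels, probs = model.predict("long haul truck driver for freight company", k=3)
print(list(zip(labels, probs)))   # top-3 candidate NAICS codes with confidences

# Illustrative coding rule: auto-code when the top probability clears a threshold,
# otherwise send the record to manual coding.
```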
-
- Testing Covariate Effects for Differences in Text Reviews of Canadian Beers
Dave Campbell and Gabriel Phelan, Carleton University and Simon Fraser University, Canada-
Abstract
Text provides rich opportunities for respondents to provide data unbounded by numeric or categorical constraints. Although rich in information, text documents are unstructured, which complicates analysis and inference. Typical strategies involve converting text into binary variables of word mentions, but inconsistent terminology hinders automation of this approach. Converting words into numeric vectors through embeddings, or clustering documents into topics, have excellent use cases but are limited when considering statistical inference for covariate effects on topics of discourse. In this talk we consider product reviews for Canadian beers. The reviews are augmented by covariates such as geography and beer style. Given the regionality of ingredient production, there should be geography-induced differences in beer flavours. Formally, we wish to produce point and interval estimates for covariate effects on the language used to describe flavours. This talk showcases non-negative matrix factorization with anchor words to provide a deterministic conversion from text to topic. Permutation tests are then used for estimating effect sizes and hypothesis testing.
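The sketch below shows the general shape of such a pipeline, pairing an NMF topic decomposition of review term counts with a permutation test for a regional effect on one topic's weights; the anchor-word construction that makes the authors' text-to-topic conversion deterministic is not implemented, and the data and column names are hypothetical.
```python
# NMF topic decomposition of review term counts, followed by a permutation test for a
# covariate (region) effect on one topic's weights. Illustrative only.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

reviews = pd.read_csv("beer_reviews.csv")          # hypothetical columns: text, region, style
counts = CountVectorizer(stop_words="english", min_df=5).fit_transform(reviews["text"])

nmf = NMF(n_components=10, init="nndsvda", random_state=0)
topics = nmf.fit_transform(counts)                 # document-by-topic weights

# Test statistic: difference in mean weight of one topic between two groups of reviews.
topic_k = 3
mask = (reviews["region"] == "Ontario").values
observed = topics[mask, topic_k].mean() - topics[~mask, topic_k].mean()

rng = np.random.default_rng(0)
perm_stats = np.empty(5000)
for i in range(5000):
    shuffled = rng.permutation(mask)
    perm_stats[i] = topics[shuffled, topic_k].mean() - topics[~shuffled, topic_k].mean()

p_value = np.mean(np.abs(perm_stats) >= np.abs(observed))
print(f"observed effect = {observed:.4f}, permutation p-value = {p_value:.3f}")
```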
-
- Machine Learning Classifier Acceptance Criteria: application to price statistics
Serge Goussev, William Spackman and Daniel Ma, Statistics Canada, Canada-
Abstract
As part of Statistics Canada’s initiative to modernize price indices such as the Consumer Price Index (CPI), traditional in-field price data collection is being augmented with alternative sources, such as scanner data, Application Programming Interface data, or web-scraped data, to enhance quality and timeliness and to lessen collection costs. The use of these data sources requires a robust supervised classification framework, as products must be accurately categorized in order to be aggregated through an applicable taxonomy. In a production setting, a highly accurate classifier also minimizes the human effort required by National Statistical Institute (NSI) officers to quality-assure the classified data every month prior to price index publication. Therefore, only optimal classification models can be accepted for production purposes. Selecting such a model requires a detailed and methodologically effective evaluation framework.
This paper proposes a series of systematically defined evaluation criteria for assessing and selecting the optimal classification model for use in price indices, focusing on the needs of the Canadian CPI. Specifically, rigorous scoring criteria for model effectiveness used with traditional flat classification methods are combined with novel hierarchical metrics, as well as other criteria applicable to the price statistics context. Compared with the traditional metrics, hierarchical metrics align well with the taxonomy structure adopted by NSIs. These novel metrics, although rarely used in the context of price indices, provide a new perspective on assessing and comparing the severity of incorrectly classified samples.
The research combines applicable methods and metrics into a holistic framework for model evaluation that can also be used to weigh the trade-offs faced in using a classifier to calculate price statistics. The proposed framework is evaluated on a publicly available dataset applicable to the calculation of price statistics, to showcase the method and to support replication by NSIs on their own data.
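As one example of a hierarchical metric of the kind discussed above, the sketch below computes ancestor-augmented hierarchical precision, recall and F1 over a toy taxonomy; whether this matches the exact criteria proposed in the paper is an assumption.
```python
# Ancestor-augmented hierarchical precision/recall/F1: each predicted and true code is
# expanded to the set of its ancestors before measuring overlap, so errors in a nearby
# branch are penalized less than errors far across the taxonomy.
def ancestors(code, parent):
    """Return the code plus all of its ancestors, given a child -> parent mapping."""
    out = set()
    while code is not None:
        out.add(code)
        code = parent.get(code)
    return out

def hierarchical_f1(y_true, y_pred, parent):
    tp = pred_size = true_size = 0
    for t, p in zip(y_true, y_pred):
        t_set, p_set = ancestors(t, parent), ancestors(p, parent)
        tp += len(t_set & p_set)
        pred_size += len(p_set)
        true_size += len(t_set)
    h_precision = tp / pred_size
    h_recall = tp / true_size
    return 2 * h_precision * h_recall / (h_precision + h_recall)

# Toy CPI-like taxonomy: food -> {fruit, dairy}; fruit -> {apples}; dairy -> {milk}.
parent = {"food": None, "fruit": "food", "dairy": "food", "apples": "fruit", "milk": "dairy"}
print(hierarchical_f1(["apples", "milk"], ["apples", "fruit"], parent))
# Predicting "fruit" for "milk" still shares the "food" ancestor, so it scores above zero.
```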
-
- Integrating machine learning into coding of the 2021 Canadian Census using fasttext
Andrew Stelmack, Statistics Canada, Canada-
Abstract
As part of processing for the 2021 Canadian Census, the write-in responses to 31 Census questions must be coded. Up until 2016, this was a three-stage process that included an “interactive (human) coding” step as the second stage. This human coding step is both lengthy and expensive, spanning many months and requiring the hiring and training of a large number of temporary employees. With this in mind, for 2021 this stage will either be augmented with or replaced entirely by machine learning models using the "fasttext" algorithm. In this presentation we will discuss the implementation of this algorithm and the challenges encountered and decisions taken along the way.
-
12:30 – 12:45
Afternoon Break
12:45 – 14:00
Session 12 – Panel Discussion
- Using data science to innovate and address emerging needs in official statistics
Panelists: Eric Deeben, Office for National Statistics, Data Science Campus, United Kingdom; Wendy Martinez, Bureau of Labor Statistics, USA; and Danny Pfeffermann, Central Bureau of Statistics, Israel
Moderator: Eric Rancourt, Statistics Canada, Canada-
Abstract
This session will offer a discussion by three experts on the following themes:
- Leveraging the power of data science to produce more timely and granular statistics and improve on existing methods to create new high-quality solutions for our data needs.
- Striking the balance in using real-time, open, and unstructured data sources with advanced modeling techniques to complement traditional methods and produce defensible, user-centric results faster and at a lower cost.
-
14:00 – 14:15
Closing Remarks
- André Loranger, Assistant Chief Statistician, Statistics Canada, Canada