The proceedings of the Symposium are available. Please visit the Statistics Canada International Symposium Series: Proceedings catalogue page to access the papers for the presentations.
All times listed in the schedule refer to Eastern Daylight Time (EDT): UTC-4
Thursday, November 3, 2022
09:45 – 10:00
Opening Remarks
- Eric Rancourt, Director General, Modern Statistical Methods and Data Science Branch, Statistics Canada, Canada
10:00 – 11:00
Session 1 -- Keynote Address
Chairperson: Eric Rancourt
- Breaking the Cycle of Invisibility in Data
Grace Sanico Steffan, United Nations Office of the High Commissioner for Human Rights (OHCHR), Switzerland
Abstract
For too long, many States have been reluctant to collect disaggregated data on many human rights issues and marginalized groups. For example, the lack of data, disaggregated by race or ethnic origin, as well as by gender, age, and other factors, hides the disproportionate impact of certain laws, policies and practices on racial or ethnic groups in all areas of life, from housing and education to employment, health and the criminal justice system. It also hinders the development of legal and policy responses that speak to the lived experiences of racial or ethnic groups and the intersectional forms of racial and other forms of discrimination that they face. Lack of disaggregated data on delivery of basic rights, such as food, water and sanitation, housing and health, illustrates the hidden ways marginalized groups are forgotten.
The work of UN Human Rights on human rights indicators seeks to make available relevant, robust, and internationally comparable indicators on progress (as well as lags) in the enjoyment of human rights by all. The UN's guidance note on Human Rights-Based Approach to Data (HRBAD) demonstrates how data can be produced following international human rights and statistical standards while putting people at the center. This work draws attention to human rights and their practical and normative contributions to ensuring meaningful participation, especially by vulnerable and at-risk groups, in all stages of the data life cycle. The approach also improves visibility around groups left behind and reinforces equality and non-discrimination. HRBAD highlights the nexus between human rights standards and data-specific ethical and professional principles, particularly the Fundamental Principles of Official Statistics. It espouses six key principles that national statistical systems need to operationalize: participation, self-identification, data disaggregation, privacy, transparency, and accountability. Through HRBAD, the statistical community can provide access to meaningful statistics, which are a public good and essential to meeting people's right to information.
11:00 – 11:15
Morning Break
11:15 -- 12:45
Session 2A -- Sampling Hard-to-Reach Populations
Chairperson: François Brisebois
- Sampling Hard-to-Reach Populations
Mark S. Handcock, University of California Los Angeles, USA
Abstract
In many situations, standard survey sampling strategies fail because the target populations cannot be accessed through well-defined sampling frames. Typically, a sampling frame for the target population is not available, and its members are rare or stigmatized in the larger population so that it is prohibitively expensive to contact them through the available frames. We discuss statistical issues in studying hard-to-reach or "hidden" populations. These populations are characterized by the difficulty in survey sampling from them using standard probability methods. Examples in a demographic setting include unregulated workers and migrants. Examples of such populations in a behavioral and social setting include injection drug users, men who have sex with men, and female sex workers. Hard-to-reach populations are under-served by current sampling methodologies mainly due to the lack of practical alternatives to address these methodological difficulties. We will focus on populations where some form of social network information can be used to assist the data collection. In such situations sophisticated statistical methods are needed to allow the characteristics of the population to be inferred from the collected data. We review time-location sampling, adaptive network sampling, including respondent-driven sampling, as well as indirect and meta-methods. We also discuss model-assisted methods and capture-recapture ideas. This is joint work with Ian E. Fellows, Krista J. Gile, and Henry F. Raymond.
11:15 -- 12:45
Session 2B -- Disclosure Control Strategies for Disaggregated Data
Chairperson: Steven Thomas
- Statistical Disclosure Control and special focus groups: a European perspective
Peter-Paul de Wolf, Statistics Netherlands, Netherlands
Abstract
With the availability of larger and more diverse data sources such as administrative data, Statistical Institutes in Europe are inclined to publish statistics on smaller groups than they used to. Moreover, high impact global events like the Covid crisis and the situation in Ukraine may also call for statistics on specific groups of people. Publishing on small, targeted groups not only raises questions about the statistical quality of the figures, it also raises issues concerning statistical disclosure risk. The principle of statistical disclosure control does not depend on the size of the groups the statistics are based on. However, the risk of disclosure does depend on the group size: the smaller a group, the higher the risk. Traditional ways to deal with statistical disclosure control and small group sizes include suppressing information and coarsening (e.g. combining) categories. These methods essentially increase the (mean) group sizes. More recent approaches include perturbative methods that aim to keep the group sizes small while reducing the disclosure risk sufficiently. In this paper we will mention some European examples of special focus group statistics and discuss the implications for statistical disclosure control. Additionally, we will discuss some issues that the use of perturbative methods brings along: their impact on disclosure risk and utility, as well as the challenges of communicating them properly.
- Automated checking methodology to control disclosure risks of microdata research outputs
Joseph Chien, Australian Bureau of Statistics, Australia
Abstract
The Australian Bureau of Statistics (ABS) is committed to improving access to personal and business microdata through its virtual DataLab. The DataLab supports researchers in undertaking complex research, and the number of sessions has increased substantially since 2019–20, with 15,520 sessions accessed in 2020–21 and 24,037 in 2021–22. To ensure privacy and confidentiality rules are followed, DataLab outputs need to follow strict procedures to minimise disclosure risks. Currently, output clearance is a manual checking process that is not scalable, cost effective or free from human error. There is also a risk that the increasing number of outputs from different projects could potentially introduce differencing risks even though these outputs have individually met the strict output criteria. To automate the process and make output checking scalable, the ABS has been exploring the possibility of providing output checking tools that use the ABS perturbation methodology to ensure that outputs across different projects are protected consistently, thereby minimising differencing risks and reducing the costs associated with output checking.
- Statistical Disclosure Limitation and Equity-enhancing Publications: Some Examples from the 2020 Census
John M. Abowd, United States Census Bureau, United States
Abstract
One of the primary statutory uses of decennial census data in the United States is to provide population, race, and ethnicity data granular enough to support legislative redistricting. The primary accuracy challenges are the requirements of "one person, one vote" (equal population counts within districts) and "equal protection" (districts that support minority voting rights as defined in the Voting Rights Act of 1965). The primary disclosure limitation challenge is the requirement to release the data with sufficient geographic granularity such that new voting districts, which can have widely heterogeneous populations from less than 1,000 to more than 1,000,000, can be drawn from the same data release. The U.S. Census Bureau works with a nonpartisan organization (National Conference of State Legislatures) throughout the decade preceding a population census to define the data requirements. Since 1990, these have included publication at the census block level, a geographic disaggregation with an average population of about 50 persons (in blocks with at least one housing unit or group quarters). This talk discusses how the 2020 Census Redistricting Data (P.L. 94-171) Summary File addressed these dual challenges.
- Reconstruction Attack Risk using Statistics Canada Census Data
Matthew Abado and George Steffan, Statistics Canada, Canada
Abstract
The publication of more disaggregated data can increase transparency and provide important information on underrepresented groups. Developing more readily available access options increases the amount of information available to and produced by researchers. Increasing the breadth and depth of the information released allows for a better representation of the Canadian population, but also puts a greater responsibility on Statistics Canada to do this in a way that preserves confidentiality, and thus it is helpful to develop tools which allow us to quantify the risk from the additional data granularity.
In an effort to evaluate the risk of a database reconstruction attack on Statistics Canada's published census data, we follow the strategy of the US Census Bureau, who outlined a method to use a boolean satisfiability (SAT) solver to reconstruct individual attributes of residents of a hypothetical US Census block, based just on a table of summary statistics. We plan to expand this technique to attempt to reconstruct a small fraction of Statistics Canada's census microdata. In this presentation, we will discuss our findings, the challenges involved in mounting a reconstruction attack, and the effect of our existing confidentiality measures in mitigating these attacks. Furthermore, we plan to compare our current strategy to other potential methods used to protect data -- in particular, releasing tabular data perturbed by some random mechanism, such as differential privacy.
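As a purely illustrative sketch of the kind of attack described above (invented numbers, and not the Statistics Canada or US Census Bureau implementation), the example below encodes a few published statistics for a hypothetical three-person block as constraints for the open-source Z3 SAT/SMT solver and asks it for a set of individual records consistent with them.

```python
# Toy database-reconstruction sketch: recover individual ages in a hypothetical
# 3-person block from published summary statistics (pip install z3-solver).
from z3 import Ints, Solver, And, Sum, If, sat

ages = Ints("age1 age2 age3")            # unknown individual-level values
s = Solver()
for a in ages:                           # plausibility bounds on each record
    s.add(And(a >= 0, a <= 115))

# Published (invented) statistics for the block: 3 people, mean age 30,
# two adults (18+), no seniors (65+).
s.add(Sum(ages) == 90)
s.add(Sum([If(a >= 18, 1, 0) for a in ages]) == 2)
s.add(Sum([If(a >= 65, 1, 0) for a in ages]) == 0)

if s.check() == sat:
    # With so few constraints the solution is not unique; a real attack combines
    # many published cross-tabulations to pin individual records down.
    print("One consistent reconstruction:", s.model())
```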
11:15 -- 12:45
Session 2C -- Addressing Disaggregated Data in Practice
Chairperson: Hélène Bérard
- The Trajectories and Origins 2 Survey: Challenges of conducting a survey on population diversity in France
Odile Rouhban and Jérôme Lê, Institut National de la Statistique et des Études Économiques, France
Abstract
The Trajectories and Origins 2 (TeO2) survey was conducted by INSEE (National Institute of Statistics and Economic Studies) and INED (National Institute of Demographic Studies) in 2019/2020 to gain a better understanding of the trajectories and living conditions of immigrants and descendants of immigrants in France. Like the first edition (2008/2009), the survey strives to represent the major immigration groups present in France and their descendants. Conducted 10 years ago, the survey generated many multi-focused studies and research projects: migration trajectories, access to employment, education, marital life, political and religious opinions, and experiences with discrimination. The challenge facing this new edition was to remain consistent with the first survey and to observe changes in these issues while accounting for new, emerging immigration groups in France and identifying the grandchildren of immigrants. Completing the TeO2 reflects the many issues covered by this survey and the complexity involved in identifying and targeting a population with ties to migration for official statistics. From the start of the survey, the validation process involving the Conseil National de l'Information Statistique (National council for statistical information) led to numerous debates on ethnic statistics in France, and later on the identification of the grandchildren of immigrants. From a methodological perspective, the difficulty in identifying the populations of interest in the existing data required innovative sampling strategies. During collection, the specificity of the populations surveyed (more mobile, non-francophone) required specific protocols to contact individuals who were sometimes difficult to reach. Finally, the correction and weighting chain had to consider the complexity of the samples built for the survey. Despite the challenges encountered during implementation and the sensitivity of the topic, the TeO2 Survey actively contributes to knowledge about the diversity of the French population through official statistics.
- Children Born into Vulnerability: Challenges encountered in a Quebec Longitudinal Study
Catherine Fontaine, Karine Dion, Institut de la statistique du Québec, Canada
Abstract
Grandir au Québec (Growing up in Québec), also known as the Quebec Longitudinal Study of Child Development, 2nd edition (QLSCD 2), is a probabilistic population survey that began in the spring of 2021. Its purpose is to monitor the development of approximately 4,500 Quebec children from the age of 5 months. The first edition of the study (QLSCD 1) began in 1998 and identified factors that explain the developmental vulnerability seen in some children. For example, Desrosiers and Ducharme (2006) showed that children from economically disadvantaged families or whose mothers had a low level of education were more likely to have delayed vocabulary in kindergarten. Based on the findings of the QLSCD 1, along with consideration of the literature and consultations with child development experts, the following direction was taken for the second study: studying a subpopulation of interest composed of vulnerable children over time with a good level of accuracy. This subpopulation, referred to as "children born into socioeconomic poverty," posed many challenges in terms of definition, measurement and recruitment during the first year of data collection for the new study. The presentation will show how Growing up in Québec made this subgroup of children the central focus of its concerns, its methodological design and data collection strategies, adjusted to account for the limitations due to the COVID-19 pandemic.
- Toward a system of integrated statistical data on education and training
Giovanna Brancato, Donatella Grassi, Claudia Busetti, Italian National Statistical Institute (ISTAT), Italy
Abstract
Education and training (E&T) is a key factor in the growth of a society and one of the sectors with high investments in the Italian National Recovery and Resilience Plan established to support recovery from the COVID-19 pandemic crisis. It is a complex phenomenon whose determinants are ascribable to several interrelated family and socio-economic conditions, and it therefore requires supporting statistical information for policymaking and monitoring. The Italian National Statistical Institute (Istat) is designing a thematic statistical register on education and training (TRE&T). It tracks individuals' paths from pre-primary to tertiary education, marking the relevant events in a student's life (attainments, changes of program, internships, drop-outs, etc.). It also includes data on the factors affecting E&T, e.g. learning skills, characteristics of E&T institutions and their staff, and socio-economic conditions.
The TRE&T is part of a wider Istat system of statistical registers, which allows relating data on E&T to information from other registers, e.g. demographic events, occupation and income. Several methodological and quality issues have to be faced when designing and implementing the TRE&T. First, coverage has to be assessed and adjusted by integrating register microdata with macro-data from other sources. Tracking individuals backward and forward depends on correct record linkage and is subject to privacy requirements. Integrating sources can lead to a lack of microdata consistency and to incoherence among estimates, which have to be properly managed during the statistical production process.
- New Disaggregated Measures of Health Disparity Between Groups in Complex Survey Data
Mark Louie F. Ramos, Barry I. Graubard, Joseph L. Gastwirth, National Cancer Institute, USA
Abstract
Health disparities between racially/ethnically or socioeconomically advantaged and disadvantaged groups can be particularly difficult to measure using summary statistics of central distributional location such as the mean or median. For instance, even small group differences in the means of relevant health variables, such as Body Mass Index (BMI), can imply substantial disparity experienced systemically between the groups' members as reflected by differences in the relative quantiles of the distribution of BMI. In this study, we adapt transformations of the Lorenz curve and an analog of the Gini index, proposed by Gastwirth (2016) for the analysis of income inequality, to provide graphical and analytical measures of health disparities between such groups. Akin to the idea behind the classical Peters-Belson regression method for partitioning disparity using explanatory covariates, this approach describes the behavior of the health variable for the disadvantaged group when applied to the advantaged group and quantifies the extent of the disparity potentially attributable to group membership. The estimating equations for the Lorenz curve and Gini index for data obtained from complex sample surveys, as derived by Binder and Kovacevic (1995), are modified to account for using the advantaged group as the reference distribution. Approaches to the estimation of the variances of the proposed measures are explored through simulation studies. The new graph and related measures are used to compare the BMI of women across different racial/ethnic groups in the US National Health and Nutrition Examination Survey (NHANES). If time permits, similar analyses of other health disparities will be described.
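A minimal numerical sketch of the general idea only (simulated values, no complex-survey weights or variance estimation, and not the authors' estimator): place the disadvantaged group's BMI values against the advantaged group's quantiles and summarize the departure from equality with a simple Gini-type area statistic.

```python
import numpy as np

rng = np.random.default_rng(0)
bmi_advantaged = rng.normal(27, 5, 5000)      # reference (advantaged) group
bmi_disadvantaged = rng.normal(29, 6, 5000)   # comparison (disadvantaged) group

# For each probability level p, take the advantaged group's p-th quantile and ask
# what share of the disadvantaged group lies at or below it.  With no disparity
# this curve coincides with the 45-degree line.
p = np.linspace(0.01, 0.99, 99)
ref_quantiles = np.quantile(bmi_advantaged, p)
share_below = np.array([(bmi_disadvantaged <= q).mean() for q in ref_quantiles])

# Gini-type summary: twice the average absolute gap from the 45-degree line
# (zero means the two distributions coincide).
disparity_index = 2 * np.mean(np.abs(share_below - p))
print(f"Disparity index: {disparity_index:.3f}")
```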
- Building a Panel to Better Understand the Experiences of Diverse Populations in Canada
Agnes Waye and Cilanne Boulet, Statistics Canada, Canada
Abstract
In 2021, Statistics Canada initiated the Disaggregated Data Action Plan, a multi-year initiative to support more representative data collection methods, enhance statistics on diverse populations to allow for intersectional analyses, and support government and societal efforts to address known inequalities and bring considerations of fairness and inclusion into decision making. As part of this initiative, we are building the Survey Series on People and their Communities, a new probabilistic panel specifically designed to collect data that can be disaggregated according to visible minority group. This new tool will allow us to address data gaps and emerging questions related to diversity. This talk will give an overview of the design of the Survey Series on People and their Communities.
12:45 – 13:15
Afternoon Break
13:15 – 14:45
Session 3A -- Next-Generation Data-Driven Methods for Equity Science – Panel session
Moderator: Andrew Gibson, Public Health Agency of Canada, Canada
- Next-Generation Data-Driven Methods for Equity Science – Panel session
Abstract
Inequities stemming from colonization, racism, poverty, gendered impacts, and other oppressive and discriminatory societal structures are on the forefront of many people's minds. Various initiatives are being launched to address the consequences and underlying causes of these inequities. What researchers and policymakers are finding, however, is that methods to both measure and address inequities are not yet adequate to fully support this effort. Significant work is underway to improve this situation, of which advancing methods around data disaggregation is a key component. But what lies beyond the horizon? What are the next-generation data-driven methods for equity science that we are not yet using, but that have great potential for the short, medium, and long term, and what is needed to enable use of these methods? In this panel discussion, we will highlight and discuss what we see in the future of this field of study. The panel will be introduced and moderated by Andrew Gibson, Executive Director for Data Science at the Public Health Agency of Canada. Our discussants are:
- Fatima Mussa (Project Manager, Institute of Population and Public Health, Canadian Institutes of Health Research), who will discuss what is needed to advance methods for equity science, and how research funders can support innovations in this space.
- Eric Rancourt (Director General, Modern Statistical Methods and Data Science Branch, Statistics Canada), who will discuss how traditional frameworks and approaches need to be adapted and/or expanded to include modern methods, while continuing to ensure that data can lead to valid conclusions.
- Dr. Ayaz Hyder (Assistant Professor, Division of Environmental Health Sciences, College of Public Health, and Core Faculty, Translational Data Analytics Institute, The Ohio State University), who will discuss how to operationalize the core value of equity to engage communities, build and translate data analytics tools and computational modeling in public health practice.
- Dr. Wanda Phillips-Beck (Seven Generations Scholar, First Nations Health and Social Secretariat of Manitoba) will discuss how the application of alternative frameworks, such as Indigenous Research Methodologies (IRMs), and the operationalization and manifestation of the core values of respect, honesty, and humility in the context of big data linkage and the development of disease modeling platforms is equity in action.
Each speaker will present their subject for 10 minutes, followed by a panel discussion.
13:15 – 14:45
Session 3B -- Data Integration
Chairperson: Wesley Yung
- Secondary analysis of linked categorical data
Li-Chun Zhang, University of Southampton, United Kingdom
Abstract
Linkage is important for integration of data residing across different sources. However, false links and missing links are generally unavoidable without identification keys, which may cause misleading inference if the linked data are treated as if they were true joint observations. We consider analysis of linked categorical data from the secondary user perspective, where the linked data have to be prepared by someone else, and neither the match-key variables nor the unlinked records are available to the analyst. In particular, our approach allows for the probabilities of correct linkage to vary across the records, without assuming that one is able to estimate this probability for individual linked records, and it accommodates the general situation where the separate files to be linked are of different sizes and each of them contains records that cannot possibly be linked correctly. All one needs from the linker is an estimate of the overall false linkage rate in the given linked dataset. Methods for logistic regression, tests of independence and log-linear modelling of contingency tables are developed, illustrated and applied.
- Multiply bias calibrated data integration
Jae-Kwang Kim, Iowa State University, USA
Abstract
Valid statistical inference is notoriously challenging when the sample is subject to selection bias. We approach this difficult problem by employing multiple candidate models for the propensity score function combined with empirical likelihood. By incorporating the multiple propensity score (PS) models into the internal bias calibration constraint in the empirical likelihood setup, the selection bias can be safely eliminated so long as the multiple candidate models contain the true PS model. The bias calibration constraint for the multiple PS models in the empirical likelihood is called the multiple bias calibration. The multiple PS models can include both ignorable and nonignorable models. In the data integration setup, the conditions for multiple bias calibration can be achieved. Asymptotic properties are discussed, and some limited simulation studies are presented to compare the proposed approach with existing methods.
- Modelling intra-annual measurement errors in linked administrative and survey data
Arnout van Delden, Statistics Netherlands, Netherlands
Abstract
Statistics Netherlands (CBS) produces monthly output based on survey data and quarterly output based on administrative data. Both outputs include the variable turnover. Earlier studies have shown that the quarterly turnover based on administrative tax data has a relatively larger value in the fourth quarter of the year than the survey-based turnover. Ideally, those two estimates are made consistent with each other. A first step is the availability of an analysis instrument that can explain the quarterly differences in outcomes. Van Delden et al. (2020) developed a model for this purpose.
This original model describes the population as a mixture of (groups of) units. Each group of units has different systematic and random measurement errors with respect to the administrative and survey data. Some units report nearly the same values for both sources while others have quarterly differences. We found that the original model only captured part of the seasonal effects and did not yet provide sufficient results for all economic sectors and years. Recently we have developed an adapted model which estimates relative turnover levels within a year, whereas the original model estimated absolute turnover levels. The use of relative turnover levels leads to so-called compositional data, which we again model as a mixture of groups of units. Using simulated data we tested to what extent the adapted mixture model was estimated reliably. We are currently applying the model to real data to estimate the unknown group proportions to see if the quarterly effects can be explained. Furthermore, we test whether the estimated quarterly group effects stabilise over years.
13:15 – 14:45
Session 3C -- Adaptations to Survey Methods for Difficult-to-reach Populations
Chairperson: Peter Wright
- Bayesian model assisted design-based estimators of the size, total and mean of a hard-to-reach population from a link-tracing sample with initial cluster sample
Martin Humberto Félix Médina, Universidad Autonoma de Sinaloa, Mexico
Abstract
In this work we present design-based Horvitz-Thompson-like and multiplicity-like estimators of the size, total and mean of a response variable associated with the elements of a hidden population, such as drug users and sex workers, to be used with the link-tracing sampling variant which uses an initial cluster sample (Félix-Medina and Thompson, Jour. Official Stat., 2004). In this sampling variant a frame of venues where the elements of the population tend to gather is constructed. The frame does not need to cover the whole population. An initial sample of venues is selected and people in those sites are asked to name other members of the population. Since computing the design-based estimators requires knowing the number of venues in the frame that are linked to each sampled person, and this information is not observable, we consider a Bayesian model which allows us to estimate that number for each person in the sample and consequently to compute the Horvitz-Thompson and multiplicity estimators. The estimation of the number of sites linked to each sampled person, as well as the estimation of the size, total and mean, is carried out by means of Gibbs sampling; however, inference is carried out under the design-based approach. The results of a small numerical study indicate that the performance of the proposed estimators is acceptable.
- Evaluating Sampling Methods for Ethnic Minorities
Mariel McKone Leonard, Deutsches Zentrum für Integrations- und Migrationsforschung (DeZIM), Germany
Abstract
For much of the history of social science research, survey methodologists and statisticians have focused on developing and refining methods of sampling members of the "general" population. While supposedly representative of all non-institutionalised adults within a society, it is tacitly acknowledged that members of demographic subgroups are "hard-to-reach" for numerous reasons, including sampling, linguistic, and access barriers. While their exclusion from social science research is detrimental to both survey data quality (Willis et al. 2014) and human rights (European Commission 2021), many studies have struggled to apply resource-intensive probability-based sampling methods to these populations, resulting in their continuing exclusion.
Fortunately, methodologists and statisticians - particularly those working within the field of public health and epidemiology - have proposed and developed a number of sampling methods for improving representation of demographic subgroups in population surveys (Reichel & Morales 2017). While an ever-expanding body of literature discusses and advances these methods, few studies have directly compared these methods empirically. As a result, researchers seeking to improve representation of demographic subgroups in their studies may be overwhelmed by the methodological possibilities yet unsure how best to evaluate their options.
In this paper, I will present a comparative discussion of several of the most commonly recommended methods with evaluation of representativity of each method as well as cost effectiveness. I will also present lessons learned from several studies which have developed samples of ethnic and racial minorities in Germany.
- Measuring Women's and Youths' Informal Work in Non-Urban Settings: Evidence from El Salvador
Ivette Contreras Gonzalez, Valentina Costa and Amparo Palacios-Lopez, World Bank, and Lelys Dinarte-Diaz, World Bank and CESifo, USA
Abstract
Measuring informal work is crucial for policy making, especially in low-income countries where informal work represents a high share of total employment. For example, the informal sector accounts for roughly 70% of employment in emerging market and developing economies, a share that has increased over time (World Bank, 2018). Despite the relevance of informal work in the labor market, there is still a lack of a universally accepted description or definition of 'informality' and of accurate survey tools for capturing it. First, many surveys rely on the notion of the main and second work activity and do not use appropriate screening questions to define 'activity,' which might be problematic for capturing household members engaging in atypical work, such as self-employed workers in the informal market. Second, data on youth and women may disproportionally suffer from 'proxy response' bias because the respondent may accurately report their own activities but under- or overreport the activities of other household members, especially of household members who primarily engage in atypical work. This research proposal provides experimental evidence to overcome these limitations and to improve data collection on informal work, with a focus on women and youth in rural and peri-urban areas in El Salvador. We design a methodological experiment that aims to assess how the measurement of informal labor is affected by the screening list of activities and/or the 'proxy response' bias. As part of the experiment, we conduct qualitative field activities with women and youths to create the list of paid and unpaid activities defined as 'work' in rural and peri-urban communities. Finally, we explore women's and youths' preferences on non-informal work attributes through a discrete choice experiment inspired by Datta (2019) and the associations between these and risk preferences using incentivized field experiments.
- Number of food aid recipients estimated by the "Aide alimentaire 2021" survey (2021 food aid survey)
Aliocha Accardo, Institut National de la Statistique et des Études Économiques, France
Abstract
In France, many associations provide in-kind food assistance to people in serious financial difficulty. The food is distributed through thousands of centres across the country. Until last year, the recipient population was surveyed only in public statistics household surveys through retrospective questions (such as, "have you received food assistance in the last 12 months") administered to samples of households living in regular housing. In November 2021, a new food aid survey was conducted "on site," i.e., during a visit to a food distribution centre, with a sample of 4,515 respondents. Its sampling, based on that of "homeless" surveys, had to resolve a number of problems: highly variable information, before the survey, on which centres were in operation; the non-availability at most centres of a list of the people who use their services; a high proportion of recipients who use several different centres; and finally, the lack of calibration data for a very poorly understood population. The final weighting, albeit not completely accurate given the required assumptions and simplifications, produced an estimated total for the food aid recipient population considered nonetheless plausible by the associations themselves. This estimate is far higher than estimates reported by surveys based on retrospective questions. This suggests a major effect of respondents' likely reluctance to admit to an interviewer that they had to resort to this kind of food aid.
- From theory to practice: Some lessons learned from implementing the "network sampling with memory" method used to survey Chinese immigrants in Ile-de-France
Geraldine Charrance, Institut national d'études démographiques, France
Abstract
To overcome the traditional drawbacks of chain sampling (respondent-driven sampling) methods, a team from the University of North Carolina developed a sampling method known as "network sampling with memory." Its unique feature is to recreate, gradually in the field, a frame for the target population composed of individuals identified by respondents and to randomly draw future respondents from this frame, thereby minimizing selection bias. The algorithm includes an initial exploratory or "Search" phase that looks for new network segments, followed by a phase of random draws within the new network.
The method was used for the first time in France between September 2020 and June 2021 for a survey of Chinese immigrants in Ile-de-France (ChIPRe), and implementing it proved extremely difficult. Specifically, we discovered a paradox inherent in the "Search" algorithm: it shows a "preference" for small rosters (those with few referrals), which it treats as an opportunity to reach unexplored parts of the network, as opposed to large rosters that comprise more people already referred (duplicates) and are associated with parts of the network that have already been explored. As a result, the efforts of "good" interviewers were not always rewarded with the selection of the rosters they had collected after time-consuming negotiations in the field; since those rosters represented potential future questionnaires, the interviewers' remuneration depended on this random draw. In all, 501 questionnaires and 1,698 referrals were collected.
14:45 - End
Friday, November 4, 2022
09:00 – 10:00
Session 4 – Poster Session
- Poverty Imputation in Contexts without Consumption Data: A Revisit with Further Refinements
Kseniya Abanokova, World Bank, USA
Abstract
A key challenge with poverty measurement is that household consumption data are often unavailable or infrequently collected or may be incomparable over time. In a development project setting, it is seldom feasible to collect full consumption data for estimating the poverty impacts. While survey-to-survey imputation is a cost-effective approach to address these gaps, its effective use calls for a combination of both ex-ante design choices and ex-post modeling efforts that are anchored in validated protocols. This paper refines various aspects of existing poverty imputation models using 14 multi-topic household surveys conducted over the past decade in Ethiopia, Malawi, Nigeria, Tanzania, and Vietnam. The analysis reveals that including an additional predictor that captures household utility consumption expenditures—as part of a basic imputation model with household-level demographic and employment variables—provides poverty estimates that are not statistically significantly different from the true poverty rates. In many cases, these estimates even fall within one standard error of the true poverty rates. Adding geospatial variables to the imputation model improves imputation accuracy on a cross-country basis. Bringing in additional community-level predictors (available from survey and census data in Vietnam) related to educational achievement, poverty, and asset wealth can further enhance accuracy. Yet, there is within-country spatial heterogeneity in model performance, with certain models performing well for either urban areas or rural areas only. The paper provides operationally-relevant and cost-saving inputs into the design of future surveys implemented with a poverty imputation objective and suggests directions for future research.
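A stripped-down sketch of the survey-to-survey imputation workflow (invented variables, a plain OLS model in place of the paper's validated imputation models, and no survey weights): fit a consumption model in the survey that observes consumption, then repeatedly impute into the survey that lacks it and average the resulting poverty rates.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000

def make_predictors(n):
    # Predictors assumed common to both surveys, including the utility
    # expenditure variable discussed above (all values simulated).
    return pd.DataFrame({
        "hh_size": rng.integers(1, 9, n),
        "head_educ_years": rng.integers(0, 16, n),
        "utility_expenditure": rng.gamma(2.0, 30.0, n),
    })

survey_a = make_predictors(n)                       # has consumption data
survey_a["log_pc_cons"] = (6.0 - 0.08 * survey_a["hh_size"]
                           + 0.05 * survey_a["head_educ_years"]
                           + 0.004 * survey_a["utility_expenditure"]
                           + rng.normal(0, 0.4, n))
survey_b = make_predictors(n)                       # consumption must be imputed

X_a = sm.add_constant(survey_a.drop(columns="log_pc_cons"))
model = sm.OLS(survey_a["log_pc_cons"], X_a).fit()

# Stochastic imputation: add a residual draw so the imputed distribution, not
# just its mean, is realistic; repeat and average the implied poverty rates.
poverty_line_log = np.log(400.0)                    # hypothetical poverty line
X_b = sm.add_constant(survey_b)
rates = [
    (model.predict(X_b) + rng.normal(0, np.sqrt(model.scale), n) < poverty_line_log).mean()
    for _ in range(50)
]
print(f"Imputed poverty rate in survey B: {np.mean(rates):.3f}")
```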
- Bespoke automated linkage to enable analysis of Covid-19 deaths in England and Wales by ethnicity
Mary Cleaton, Office for National Statistics, United Kingdom
Abstract
In early 2020, there was intense speculation that ethnicity and Covid-19 deaths were correlated. However, the UK's existing method of adding ethnicity to death data resulted in low linkage rates for recent deaths and there was concern that certain ethnic groups were particularly affected. This prevented the Office for National Statistics from publishing real-time statistics on Covid-19-related mortality by ethnicity.
We developed a bespoke linkage in three days, using deterministic linkage based on personal identifiers, with matchkeys tested via clerical review. Our best source of ethnicity information was the 2011 England and Wales Census. To solve the issue of information changing since 2011, we took an innovative approach. We linked death records to 2019 National Health Service (NHS) data, then used the NHS Number to access individuals' 2011 NHS records and thus their 2011 information. This was then linked to the Census ethnicity information.
The previous method of adding ethnicity data had an overall linkage rate of ~90%. However, for recent deaths (since March 2020, when the UK Covid-19 pandemic began) the rate was ~30%. Our method improved the rate for recent deaths to ~90% without impacting accuracy: the false positive rate was ~0.2%. This enabled analysis demonstrating the risk of Covid-19-involved death was significantly higher among certain ethnic groups.
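A simplified illustration of deterministic linkage with matchkeys (invented identifiers and rules, not the ONS specification): each matchkey concatenates partially standardized identifiers, and keys are applied from strictest to loosest so that records that fail an exact match on one key can still link on a weaker one.

```python
import pandas as pd

deaths = pd.DataFrame({
    "id_d": [1, 2, 3],
    "first_name": ["Anna", "Ben", "Chloe"],
    "surname": ["Smith", "Jones", "Khan"],
    "dob": ["1950-02-01", "1948-07-19", "1962-11-30"],
    "postcode": ["AB1 2CD", "EF3 4GH", "IJ5 6KL"],
})
nhs_2019 = pd.DataFrame({
    "nhs_number": ["111", "222", "333"],
    "first_name": ["Anna", "Benjamin", "Chloe"],
    "surname": ["Smith", "Jones", "Khan"],
    "dob": ["1950-02-01", "1948-07-19", "1962-11-30"],
    "postcode": ["AB1 2CD", "EF3 4GH", "ZZ9 9ZZ"],   # Chloe has moved since
})

def add_keys(df):
    df = df.copy()
    clean = lambda c: df[c].str.upper().str.replace(" ", "", regex=False)
    # Matchkey 1 (strictest): full name + date of birth + postcode.
    df["mk1"] = clean("first_name") + "|" + clean("surname") + "|" + df["dob"] + "|" + clean("postcode")
    # Matchkey 2: first initial + surname + date of birth (tolerates name forms and moves).
    df["mk2"] = clean("first_name").str[0] + "|" + clean("surname") + "|" + df["dob"]
    return df

deaths, nhs_2019 = add_keys(deaths), add_keys(nhs_2019)

linked, unmatched = [], deaths
for key in ["mk1", "mk2"]:                            # strongest key first
    hit = unmatched.merge(nhs_2019[[key, "nhs_number"]], on=key, how="inner")
    linked.append(hit)
    unmatched = unmatched[~unmatched["id_d"].isin(hit["id_d"])]

# Each death record now carries an NHS number for onward linkage to 2011 records.
print(pd.concat(linked)[["id_d", "nhs_number"]])
```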
- Complete case analysis and multiple imputation: assessing the impact of missing data within youth overweight and obesity research
Amanda Doggett, University of Waterloo, Canada
Abstract
Missing data is a problem in most applied research, but particularly for epidemiological studies that utilize surveys or questionnaires as data collection instruments. Complete case analysis (CCA) is the most common technique for handling missing data but has been shown to introduce bias in situations where there are large amounts of non-random missing data. Youth overweight and obesity (OWO) research, which primarily uses body mass index (BMI) as the main indicator of body adiposity, often suffers from high proportions of missing data, yet CCA is common in this domain. This study will use BMI and related covariate data from 74 501 Canadian youth who participated in the COMPASS study in 2018/19, where 31% of BMI data are missing. Analyses which examine the predictors of BMI through generalized linear mixed models will be performed using CCA and multiple imputation (MI), and results and associated inferences will be compared between the two approaches. Results are hypothesized to show that some Type I or II errors occur when using the CCA approach compared to MI. The implications of this study are expected to highlight that appropriate methodological choices on the handling of missing data are essential in youth OWO research, and that such choices can impact research inference and concomitant policy and programming recommendations.
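A schematic comparison of the two workflows (simulated data, a plain linear model rather than the study's generalized linear mixed models, and statsmodels' MICE routine standing in for whatever software the authors use): complete-case analysis drops records with missing BMI, while multiple imputation repeatedly fills BMI in, borrowing strength from an auxiliary variable, and pools the analyses with Rubin's rules.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation import mice

rng = np.random.default_rng(2)
n = 5000
activity = rng.normal(0, 1, n)
bmi = 22 - 1.0 * activity + rng.normal(0, 2, n)          # true slope on activity is -1.0
waist = 80 + 2.5 * (bmi - 22) + rng.normal(0, 1, n)      # auxiliary variable, fully observed
df = pd.DataFrame({"bmi": bmi, "activity": activity, "waist": waist})

# Make roughly 30% of BMI missing, more often at higher BMI (non-random missingness).
df.loc[rng.random(n) < 0.15 + 0.3 * (bmi > 23), "bmi"] = np.nan

# 1. Complete-case analysis: drop every record with a missing BMI.
cca = sm.OLS.from_formula("bmi ~ activity", data=df.dropna()).fit()
print(f"CCA slope: {cca.params['activity']:.2f} (SE {cca.bse['activity']:.2f})")

# 2. Multiple imputation by chained equations: the imputation model also uses
#    waist, and the 20 completed-data analyses are pooled with Rubin's rules.
imp = mice.MICEData(df)
mi = mice.MICE("bmi ~ activity", sm.OLS, imp).fit(n_burnin=10, n_imputations=20)
print(mi.summary())
```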
- The Social Data Linkage Environment at Statistics Canada
Goldwyn Millar, Statistics Canada, Canada
Abstract
The Social Data Linkage Environment (SDLE) at Statistics Canada is a secure record linkage environment that expands the potential of data integration of existing administrative and survey data to address research questions and inform socio-economic policy. The design of the SDLE incorporates strong governance practices. The premise behind the SDLE is that a file is linked once within its infrastructure and then used for multiple purposes by analysts from various domains (e.g. health, justice, education, income) thus reducing duplication of work and standardizing the record linkage process and results. Probabilistic record linkage using Statistics Canada's generalized record linkage software G-Link, which employs the Fellegi-Sunter methodology, is the main tool for data integration within the SDLE. This presentation will provide an overview of the SDLE and will include information on the governance in place to adhere to policy and privacy requirements, the structure of the SDLE, data sources, record linkage methods including the calculation of linkage error rates, and the linked analytical data files produced.
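The Fellegi-Sunter scoring at the heart of that methodology can be sketched in a few lines (illustrative m- and u-probabilities, not G-Link's implementation): each comparison field contributes log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, and the total weight for a candidate pair is compared against upper and lower thresholds to accept, send to review, or reject the link.

```python
import math

# Hypothetical m-probabilities, P(field agrees | true match), and
# u-probabilities, P(field agrees | non-match), for three comparison fields.
fields = {
    "surname":     {"m": 0.95, "u": 0.01},
    "birth_year":  {"m": 0.98, "u": 0.05},
    "postal_code": {"m": 0.90, "u": 0.002},
}

def pair_weight(agreements):
    """Total Fellegi-Sunter weight for one candidate record pair.
    `agreements` maps field name -> True (agree) or False (disagree)."""
    total = 0.0
    for name, p in fields.items():
        if agreements[name]:
            total += math.log2(p["m"] / p["u"])
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))
    return total

# A pair agreeing on surname and birth year but not postal code:
w = pair_weight({"surname": True, "birth_year": True, "postal_code": False})
print(f"Composite weight: {w:.1f}")
```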
10:00 – 11:00
Session 5 – Waksberg Award Winner Address
Chairperson: Jean Opsomer
- Bayes, buttressed by design-based ideas, is the best overarching paradigm for sample survey inference
Roderick J. Little, University of Michigan, USA
Abstract
Conceptual arguments and examples are presented suggesting that the Bayesian approach to survey inference can address the many and varied challenges of survey analysis. Bayesian models that incorporate features of the complex design can yield inferences that are relevant for the specific data set obtained, but also have good repeated-sampling properties. Examples focus on the role of auxiliary variables and sampling weights, and methods for handling nonresponse. The article offers ten top reasons for favoring the Bayesian approach to survey inference.
11:00 – 11:15
Morning Break
11:15 -- 12:45
Session 6A -- Small Area Estimation
Chairperson: Jean-François Beaumont
- Multivariate Mixture Model for Small Area Estimation of Poverty Indicators
Isabel Molina, Universidad Complutense de Madrid, Spain
Abstract
For small area estimation of general indicators, including poverty indicators, in the presence of heterogeneous areas, we propose a mixture model of multivariate normal distributions. This model considers that there is a latent cluster structure on the areas, such that the area vectors of interest follow a nested error linear regression model, but where all the model parameters (regression coefficients and variance components) vary by cluster. Under this model, we propose two types of predictors of the area indicators of interest; the first is obtained by predicting the cluster to which the area belongs using the posterior probabilities of belonging to each cluster, and the second by averaging across all the possible clusters with weights given by these posterior probabilities. We also propose a parametric bootstrap procedure for mean squared error estimation. We study the performance of the proposed predictors compared with the usual empirical best predictors based on a nested error model and with direct estimators, and we apply our methodology to the estimation of mean expenditure and poverty rates and gaps in Palestinian localities.
- Model-based imputation methods for small area estimation
Aditi Sen and Partha Lahiri, University of Maryland, College Park, USA
Abstract
There is a growing demand to produce reliable estimates of different characteristics of interest for small geographical areas (e.g., states) or domains obtained by a cross-classification of different demographic factors such as age, sex, and race/ethnicity. The information on the outcome variable(s) of interest often comes from a sample survey that targets reliable estimation for large areas (e.g., national level). In this talk, I will discuss how model-based imputation methods can be used to improve inferences about different small area or domain parameters. The proposed method essentially uses suitable statistical models that can be used to extract information from multiple data sources. We illustrate the proposed methodology in the context of election projection for small areas. The talk is based on collaborative research with UMD students Aditi Sen and Zhenyu Yue.
- Small area benchmarked estimation under the basic unit level model when the sampling rates are non-negligible
Mike Hidiroglou, Statistics Canada (retired), Canada
Abstract
We consider the estimation of a small area mean under the basic unit-level model. The sum of the resulting model-dependent estimators may not add up to estimates obtained with a direct survey estimator that is deemed to be accurate for the union of these small areas. Benchmarking forces the model-based estimators to agree with the direct estimator at the aggregated area level. The generalized regression estimator is the direct estimator that we benchmark to. In this paper we compare small area benchmarked estimators based on four procedures. The first procedure produces benchmarked estimators by ratio adjustment. The second procedure is based on the empirical best linear unbiased estimator obtained under the unit-level model augmented with a suitable variable that ensures benchmarking. The third procedure uses pseudo-empirical estimators constructed with suitably chosen sampling weights so that, when aggregated, they agree with the reliable direct estimator for the larger area. The fourth procedure produces benchmarked estimators that are the result of a minimization problem subject to the constraint given by the benchmark condition. These benchmark procedures are applied to the small area estimators when the sampling rates are non-negligible.
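A minimal numerical sketch of the first (ratio-adjustment) procedure, with made-up figures: the model-based small area means are scaled by a single factor so that their population-weighted total reproduces the reliable direct estimate for the larger area.

```python
import numpy as np

# Hypothetical model-based estimates of small area means and the area population sizes.
model_means = np.array([12.1, 9.8, 15.4, 11.0])
pop_sizes = np.array([800, 1200, 500, 1500])

direct_total = 47000.0                                  # reliable direct (GREG-type) estimate for the union of areas

ratio = direct_total / np.sum(pop_sizes * model_means)  # single multiplicative adjustment
benchmarked_means = ratio * model_means

print(np.sum(pop_sizes * benchmarked_means))            # reproduces 47000 exactly
```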
11:15 -- 12:45
Session 6B -- Measuring and projecting diversity
Chairperson: Scott Meyer
- Projecting ethnic populations in New Zealand
Melissa Adams, Statistics New Zealand, New Zealand
Abstract
The subnational ethnic population projections give an indication of the future size and composition of four broad and overlapping ethnic groups – Māori, Pacific, Asian, and 'European or Other' – living in all areas of New Zealand. These projections are part of a suite of population projections produced by Stats NZ, and they assist local and ethnic communities, as well as central government, in planning and policy-making. The projections are developed using a cohort-component method, requiring assumptions for fertility, mortality, migration, and inter-ethnic mobility at a local level. They are produced as deterministic projections, with a low, medium, and high growth scenario and complement the national ethnic population projections and subnational total population projections. This presentation will discuss the methodology used for these projections, and some of the reasons why these ethnic projections are more uncertain than projections of the total population.
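A minimal one-year cohort-component update for a single group (invented rates, and none of the inter-ethnic mobility or low/medium/high scenario machinery described above): survive each cohort forward one year of age, add births from age-specific fertility, and add net migration.

```python
import numpy as np

ages = np.arange(0, 91)                          # single years of age, 90 = open-ended group
pop = np.full(ages.size, 1000.0)                 # hypothetical starting population
survival = np.linspace(0.999, 0.90, ages.size)   # probability of surviving one more year, by age
net_migration = np.full(ages.size, 5.0)          # assumed net migrants per age per year
fertility = np.zeros(ages.size)
fertility[15:50] = 0.06                          # simplified: births per person aged 15-49

def project_one_year(pop):
    new_pop = np.zeros_like(pop)
    new_pop[1:] = pop[:-1] * survival[:-1]       # age everyone forward with survival
    new_pop[-1] += pop[-1] * survival[-1]        # survivors stay in the open-ended 90+ group
    new_pop[0] = np.sum(pop * fertility) * survival[0]   # newborns surviving the year
    return new_pop + net_migration               # add net migration to every age group

pop_next = project_one_year(pop)
print(f"Projected total population after one year: {pop_next.sum():,.0f}")
```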
- Projecting Racial and Ethnic Diversity: Methods, Assumptions, and Limitations from the U.S. Census Bureau's Population Projections
Sandra Johnson, United States Census Bureau, USA
Abstract
The U.S. Census Bureau regularly produces population projections by demographic characteristics, including race and ethnicity. Our projections offer insight into how the population might look in the future and are used for planning purposes by a variety of audiences in the public and private sectors. This presentation will provide an overview of the methods that we used to project the population in our last series of population projections, the 2017 National Series, with a particular focus on how we assigned race and ethnicity in each of the components of population change - births, deaths, and international migration.
Categorizing data by race and ethnicity is important for advancing racial equity and offering support for underserved communities, but it is not without challenges. The racial and ethnic categories included in our projections are based on the standards set by the United States Office of Management and Budget in 1997. However, government standards for measuring race change over time as do societal definitions. Long-term projections of the racial and ethnic composition of the population include implicit assumptions about how race and ethnicity will be measured in the future. These will be discussed during the presentation, along with the methods, to provide a greater understanding of what the projections can tell us about the future U.S. population.
- Demosim: a powerful microsimulation tool for disaggregated population projections and nowcasting exercises
Samuel Vézina, Statistics Canada, Canada
Abstract
In response to ever-increasing needs for more accurate and disaggregated population projection data, projection models must be able to generate robust results for small sub-populations and for a large number of characteristics, including at the regional level. For more than two decades, Statistics Canada has developed a microsimulation tool – Demosim – to project not only the entire Canadian population but also multiple, targeted subgroups of the population in a coherent manner. In this paper, we present recent projections for Canada, provinces and more than 50 smaller geographical units, focusing on different ethnocultural groups such as Indigenous populations, racialized populations, or linguistic groups. We show how microsimulation offers a much more flexible way to compute detailed population projections than other types of projection models. We also demonstrate how Demosim has become a powerful nowcasting tool providing an up-to-date and detailed portrait of the Canadian population between censuses, and how the model was recently used to fill data gaps related to initiatives such as the Call for Action on Anti-Racism, Equity, and Inclusion. Finally, we discuss possible new developments in upcoming years, notably related to the production of projections for the gender diverse population.
- Understanding data collection quality, inclusivity and representativeness at source
Ella Williams Davies and Karina Williams, Office for National Statistics, United Kingdom
Abstract
In the Methodology and Quality directorate at the Office for National Statistics, we aim to optimise the collection of data to better inform our society through producing statistics for the public good. We will present:
- Exploring inclusivity and representativeness in administrative data.
- Understanding administrative data quality at the start of the data journey.
We are carrying out innovative research on, and placing importance on, collecting and assessing administrative data input quality, inclusivity, and representativeness at source. We are exploring inclusivity and representativeness both with group representatives (as gatekeepers) and directly with members of the public identified as vulnerable or having protected characteristics. We conduct qualitative interviews to gain an in-depth understanding of how these groups interact with the services that contribute to administrative data, which gives insight into how inclusive and representative these sources are.
Assessing quality further along the data journey, we are conducting research to understand the quality of specific administrative data from the perspective of the administrative staff who collate and process the data. Products from our research programme, for use across statistical organisations and more widely, include tools and frameworks to aid the assessment of administrative data quality and to support conversations with data suppliers.
11:15 -- 12:45
Session 6C -- Machine Learning and Data Integration Strategies for Disaggregated Data
Chairperson: Michelle Simard
- Machine Learning for estimating heterogeneous treatment effects in program evaluations
Andy Handouyahia and Leeroy Tristan Rikhi, Employment and Social Development Canada, Canada
Abstract
The study shows how the Evaluation directorate at Employment and Social Development Canada uses rich administrative data and Modified Causal Forests, a causal machine-learning estimator, to inform policy development through impact evaluations. The study also illustrates the implementation of the innovative Modified Causal Forests algorithm to estimate individualized treatment effects, thereby informing what works for whom. This study lays the foundation for conducting evaluations from a Gender-Based Analysis+ perspective with a view to informing on the differential impacts of policies and programs on people of various sociodemographic backgrounds. In particular, it provides a distribution of net impacts for key sub-groups of participants in addition to the average program impact.
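The Modified Causal Forests estimator itself is not reproduced here; as a conceptual stand-in on simulated data, the sketch below estimates individualized treatment effects with a simple two-model ("T-learner") approach and then summarizes them by subgroup, which is the kind of disaggregated "what works for whom" output the abstract describes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n = 8000
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "female": rng.integers(0, 2, n),
    "prior_earnings": rng.gamma(2.0, 10000.0, n),
})
treated = rng.integers(0, 2, n).astype(bool)           # randomized participation (simulated)
true_effect = 1500 + 2000 * X["female"]                # effect is larger for women by construction
earnings = 20000 + 0.5 * X["prior_earnings"] + treated * true_effect + rng.normal(0, 5000, n)

# T-learner: fit separate outcome models for participants and non-participants,
# then take the difference in predictions as the individualized impact estimate.
m1 = GradientBoostingRegressor().fit(X[treated], earnings[treated])
m0 = GradientBoostingRegressor().fit(X[~treated], earnings[~treated])
cate = m1.predict(X) - m0.predict(X)

# Disaggregate the estimated impacts by a sociodemographic subgroup.
print(pd.DataFrame({"female": X["female"], "impact": cate}).groupby("female")["impact"].mean())
```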
- Representative Absolute Risk Estimation from Combining Individual Data from Epidemiologic Cohort Studies and Representative Surveys with Summary Statistics from Disease Registries
Lingxiao Wang, Barry Graubard, Hormuzd Katki, National Cancer Institute, and Yan Li, University of Maryland, College Park, USA
Abstract
Epidemiologic cohort studies follow samples of individuals over time to study risk of disease or mortality associated with biomarkers and socio-demographic/behavioral factors. These cohorts are generally not sampled with probability sampling and thus usually lack population representativeness, which can invalidate absolute risk estimation (e.g., the probability of dying in 5 years). Current methods of improving the external validity of cohort absolute risk estimation are "model-based" approaches that aim to build the risk model for outcome prediction, under which the estimated model parameters are assumed to be unbiased and thus transportable between a cohort study and the target population. However, this transportability assumption can be violated if the risk model is misspecified or the cohorts are not target-population representative. We propose a "design-based" method with a two-step weighting procedure to estimate absolute risk in the target population without transportability assumptions. The first step improves external validity for the cohort by creating "pseudoweights" for the cohort using a propensity-based kernel-weighting method, which fractionally distributes sample weights from external probability reference survey units to cohort units, according to their kernel-smoothed distance in propensity score. The second step uses poststratification by event status and categories of variables available in the population-based disease/mortality registry to further weight-adjust the pseudoweighted cohort to the target population. Our approach produces finite population consistent absolute risks under a correctly specified propensity model. Poststratification improves efficiency and further reduces bias of absolute risk estimates overall and for subgroups of the population specified by the poststratification variables when the true propensity model is misspecified. We apply our methods to develop a representative all-cause mortality risk model by combining data from the non-representative US National Institutes of Health–American Association of Retired Persons cohort, the US-representative National Health Interview Survey, and mortality data from the US National Vital Statistics System.
- Integration of existing data to develop an indicator of ethnicity status in the LSDDP
Aziz Farah, Bassirou Diagne, Abdelnasser Saidi, Statistics Canada, Canada
Abstract
The Longitudinal Social Data Development Program (LSDDP) is a social data integration approach aimed at providing longitudinal analytical opportunities from an exploratory perspective, without imposing any additional burden on respondents. The LSDDP takes advantage of a multitude of signals that come from different sources for the same individual, making it possible to better understand their interactions and follow the evolution of events.
In this presentation, we will show how we were able to integrate already existing administrative data in order to reconstruct characteristics of the Canadian population without having to conduct a new survey. In particular, we will discuss how we can estimate the ethnicity status of people in Canada at the lowest disaggregated level using both the results of a variety of business rules and algorithms applied to existing linkages and the LSDDP population. We will finish with the improvements obtained in our modelling algorithms using machine learning methods such as decision trees and random forest techniques.
-
- Correcting Selection Bias in Big Data by Pseudo Weighting
An-Chiao Liu and Ton de Waal, Tilburg University and Statistics Netherlands, Sander Scholtus, Statistics Netherlands, Netherlands-
Abstract
Non-probability samples do not come from a sampling design and may therefore suffer from selection bias. To correct for this bias, Elliott and Valliant (2017) (EV) proposed a pseudo-weight estimation method that uses a two-sample setup: in addition to the target non-probability sample, a probability sample that shares some common auxiliary variables with the non-probability sample is used. By estimating the propensities of inclusion in the non-probability sample given the two samples, we can correct the selection bias through (pseudo) design-based approaches. However, the EV method is not suitable when the inclusion fraction of the population is large or when units have high inclusion probabilities in either sample, which is often the case in administrative data sets and is increasingly common for Big Data.
We extend the EV method to be suitable for all ranges of inclusion probabilities, while retaining the attractive properties of the original method. Any model suitable for propensity estimation, for instance a machine learning model, can easily be applied. Furthermore, the possible dependency between the selection of the non-probability sample and that of the probability sample is discussed, to deal with the scenario where inclusion in the non-probability sample is affected by inclusion in the probability sample. For variance estimation, two finite population bootstrap algorithms are proposed that account for the two-sample setup. We show in a simulation study based on a real data set that the proposed method outperforms competing methods, and that the pseudo population bootstrap algorithms give reasonable variance estimates.
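For concreteness, the sketch below shows one common formulation of propensity-based pseudo-weighting in the two-sample setup; it is a simplified illustration, not the authors' extension, and it omits the adjustments for large inclusion fractions and the bootstrap variance estimators.

```python
# A minimal sketch of pseudo-weighting in the two-sample setup (one common
# formulation; simplified relative to the extension described in the abstract).
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_weights(x_np, x_prob, w_prob):
    """x_np: covariates of the non-probability sample;
    x_prob, w_prob: covariates and design weights of the probability sample."""
    X = np.vstack([x_np, x_prob])
    s = np.concatenate([np.ones(len(x_np)), np.zeros(len(x_prob))])
    w = np.concatenate([np.ones(len(x_np)), w_prob])   # reference units carry their design weights
    p = LogisticRegression().fit(X, s, sample_weight=w).predict_proba(x_np)[:, 1]
    return (1.0 - p) / p                                # pseudo-weights for the non-probability units

# A pseudo-weighted estimate of a population mean from the non-probability sample:
# np.average(y_np, weights=pseudo_weights(x_np, x_prob, w_prob))
```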
-
- Privacy Protection, Measurement Error, and the Integration of Remote Sensing and Socioeconomic Survey Data
Talip Kilic and Siobhan Murray, World Bank, Anna Josephson and Jeffrey D. Michler, University of Arizona, USA-
Abstract
When publishing socioeconomic survey data, survey programs implement a variety of statistical methods designed to preserve privacy, but these come at the cost of distorting the data. We explore the extent to which the spatial anonymization methods used to preserve privacy in the large-scale surveys supported by the World Bank Living Standards Measurement Study - Integrated Surveys on Agriculture (LSMS-ISA) introduce measurement error in econometric estimates when the survey data are integrated with remote sensing weather data. Guided by a pre-analysis plan, we produce 90 linked weather-household datasets that vary by spatial anonymization method and remote sensing weather product. By varying the data along with the econometric model, we quantify the magnitude and significance of the measurement error arising from the loss of accuracy introduced by privacy protection measures. We find that spatial anonymization techniques currently in general use have, on average, limited to no impact on estimates of the relationship between weather and agricultural productivity. However, the degree to which spatial anonymization introduces mismeasurement is a function of which remote sensing weather product is used in the analysis. We conclude that care must be taken in choosing a remote sensing weather product when integrating it with publicly available survey data.
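For readers unfamiliar with spatial anonymization, the sketch below shows one common technique, random displacement ("jittering") of cluster coordinates; the displacement limit is illustrative and is not the LSMS-ISA specification.

```python
# Illustrative sketch of coordinate jittering; the 5 km limit is an assumption,
# not the LSMS-ISA anonymization rule.
import numpy as np

def jitter(lat, lon, max_km=5.0, seed=0):
    rng = np.random.default_rng(seed)
    angle = rng.uniform(0, 2 * np.pi, size=np.shape(lat))
    dist = rng.uniform(0, max_km, size=np.shape(lat))
    dlat = (dist / 111.32) * np.cos(angle)                           # km to degrees of latitude
    dlon = (dist / (111.32 * np.cos(np.radians(lat)))) * np.sin(angle)
    return lat + dlat, lon + dlon

# Extracting remote sensing weather at the jittered rather than the true coordinates
# is the source of the measurement error the paper quantifies.
```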
-
12:45 – 13:15
Afternoon Break
13:15 – 14:45
Session 7A -- Data collection and other perspectives from Indigenous populations – Panel session
Moderator: Timothy Leonard, Statistics Canada, Canada
- Data collection and other perspectives from Indigenous populations – Panel session
-
Abstract
The First Nations Information Governance Centre, the Australian Bureau of Statistics and Statistics Canada will present how each organization conducts surveys with Indigenous populations. Each organization has its own unique perspective and faces its own challenges when it comes to survey design and data collection. Information will be shared in the spirit of learning from each other and having an open discussion. The speakers are:
- Katie Wood, First Nations Information Governance Centre, Canada
- John Boxsell and Tamie Anakotta, Australian Bureau of Statistics, Australia
- Danielle Léger, Statistics Canada, Canada
-
13:15 – 14:45
Session 7B -- A range of international experiences in record linkage: techniques, tools and usage in statistical agencies
Chairperson: Abdelnasser Saidi
- Probabilistic or deterministic? Linkage methods tested for the RéSIL program
Olivier Haag, Institut national de la statistique et des études économiques, France-
Abstract
The purpose of the RéSIL program (Répertoires Statistiques d'Individus et de Logements, or statistical directories of individuals and dwellings) is to build a sustainable and evergreen system of statistical directories containing data on individuals, households and dwellings that are updated using a variety of administrative sources. In this context, linkages are essential not only to compile the directories, but also because the directory system provides a framework for the DSDS information system. In fact, linkage with other sources, such as survey data, administrative data or even private data, will be possible to the extent that they contain an identifier common to the directory in question, either directly or through prior identification. To define the identification proposed by RéSIL, different linkage methods were tested in order to select those that seemed most efficient in terms of statistical quality, but also from an IT performance perspective (essential given the volumes to be processed).
This article compares the results of linkages between individuals in the tax source (individual tax file) and the 2019 annual survey (census) obtained using different methods. Three methods were tested (a generic sketch contrasting the deterministic and probabilistic approaches follows the list):
- Rapsodie: Developed internally by INSEE, this tool uses a deterministic linkage method;
- Relais: Developed by Istat, this tool uses Fellegi and Sunter's probabilistic linkage method;
- R and Python packages: these use Fellegi and Sunter's probabilistic methods.
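To make the contrast concrete, a minimal, generic sketch of the two families of rules follows; it does not reproduce Rapsodie, Relais or the tested packages, and the field names, m/u values and thresholds are illustrative.

```python
# Generic contrast between a deterministic rule and a Fellegi-Sunter score
# (illustrative only; field names, m/u values and thresholds are assumptions).
import numpy as np

def deterministic_match(a, b):
    # Deterministic rule: exact agreement on a fixed set of normalized identifiers.
    keys = ["last_name", "first_name", "birth_date", "municipality"]
    return all(a[k].strip().lower() == b[k].strip().lower() for k in keys)

def fellegi_sunter_score(gamma, m, u):
    # Probabilistic rule: sum log(m/u) over agreeing fields and log((1-m)/(1-u))
    # over disagreeing fields; pairs above a threshold are accepted as links.
    gamma, m, u = map(np.asarray, (gamma, m, u))
    return float(np.sum(np.where(gamma == 1, np.log(m / u), np.log((1 - m) / (1 - u)))))
```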
-
- A proposal for the problem of matching probabilities estimation in record linkage
Mauro Scanu, Italian National Institute of Statistics (ISTAT), Italy-
Abstract
Record linkage aims at identifying the record pairs related to the same unit that are observed in two different data sets, say A and B. Fellegi and Sunter (1969) suggest testing, for each record pair, whether it was generated from the set of matched or unmatched pairs. The decision function consists of the ratio between m(g) and u(g), the probabilities of observing a comparison g on a set of k>3 key identifying variables in a record pair under the assumption that the pair is a match or a non-match, respectively. These parameters are usually estimated by means of the EM algorithm, using as data the comparisons on all the pairs of the Cartesian product Ω=A×B. These observations (on the comparisons and on the pairs' status as match or non-match) are assumed to be generated independently of the other pairs, an assumption characterizing most of the literature on record linkage and implemented in software tools (e.g. RELAIS, Cibella et al. 2012). On the contrary, comparisons g and matching status in Ω are deterministically dependent. As a result, the EM-based estimates of m(g) and u(g) are usually poor. This fact jeopardizes the effective application of the Fellegi-Sunter method, as well as the automatic computation of quality measures and the possibility of applying efficient methods for model estimation on linked data (e.g. regression functions), as in Chambers et al. (2015). We propose to explore Ω by means of a set of samples, each one drawn so as to preserve the independence of the comparisons among the selected record pairs. Simulation results are encouraging.
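The sampling idea can be pictured with a short sketch: pairs are drawn from Ω so that no record of A or B appears in more than one selected pair, which removes the dependence induced by shared records. This is an illustration only, not the author's exact sampling scheme.

```python
# Illustrative sketch: sample record pairs from Omega = A x B with each record of
# A and of B used at most once, so the selected comparisons are not linked through
# shared records.
import numpy as np

def sample_independent_pairs(n_a, n_b, rng=None):
    rng = rng or np.random.default_rng()
    k = min(n_a, n_b)
    rows = rng.choice(n_a, size=k, replace=False)   # each A-record appears at most once
    cols = rng.choice(n_b, size=k, replace=False)   # each B-record appears at most once
    return list(zip(rows, cols))

# Repeating the draw gives a set of samples on which m(g) and u(g) can be estimated
# without the dependence that degrades the usual EM fit on all of Omega.
```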
-
- Record linkage techniques to identify 2021 Canadian Census dwellings in the new Statistical Building Register
Martin Lachance, Statistics Canada, Canada-
Abstract
The reconciliation of 2021 Census dwellings with the new Statistical Building Register (SBgR) presented linkage challenges. The Census of Population collected information from various dwelling types. For a large proportion of the population, mailing addresses were central: they were used to reach people and were collected as contact information. In parallel, the register environment has been evolving: the agency is transitioning from the Address Register (AR) to the SBgR, which holds both mailing and location addresses and also covers non-residential buildings. The reconciliation was conducted using a combination of systems, notably the new Register Matching Engine (RME) for difficult cases. The RME offers an interesting range of sophisticated string comparators. A deterministic linkage approach was used, incorporating data knowledge such as entropy. Through metadata, the matching expert could also reduce the number of false positives and false negatives.
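As a purely illustrative aside (not the RME's comparators or rules), the sketch below combines a simple string comparator with entropy-based field weighting, so that agreement on a high-entropy field such as a civic number counts for more than agreement on a low-entropy field such as a province code.

```python
# Illustrative only: a simple string comparator with entropy-weighted fields;
# not the Register Matching Engine's comparators or rules.
from collections import Counter
from difflib import SequenceMatcher
import math

def field_entropy(values):
    # Shannon entropy of a field's observed value distribution.
    counts, n = Counter(values), len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def address_score(rec_a, rec_b, entropies):
    # Entropy-weighted average of per-field similarities; a deterministic rule then
    # accepts the pair when the score exceeds a tuned threshold.
    total = sum(entropies.values())
    return sum(entropies[f] * similarity(rec_a[f], rec_b[f]) for f in entropies) / total
```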
-
- Using Splink for Census overcount estimation
Kristina Xhaferaj and Rachel Shipsey, Office for National Statistics, United Kingdom-
Abstract
Splink is an implementation of the Fellegi-Sunter model and the Expectation-Maximisation (EM) algorithm developed by the UK Government Ministry of Justice (MoJ). The 2021 England and Wales Census dataset provided the Office for National Statistics (ONS) with an excellent opportunity to test the capabilities of Splink and to provide feedback to MoJ, enabling the development of further functionality. The ONS team used Splink to link the 2021 Census dataset (approximately 58 million records) to itself using the integrated deduplication feature. This resulted in a dataset where each person in the census was scored against a set of relevant candidate records, showing how likely each person is to be a duplicate. To estimate overcount in the census, ONS previously linked samples of the 2021 Census to the full census using a mixture of probabilistic, deterministic, associative and clerical methods. We therefore had a 'gold standard' dataset that we could use for comparison. Initially we used this gold standard to calculate the m and u input parameters for the Splink global model. At a later stage, we used the Splink local models (an implementation of EM) to generate m and u, thereby demonstrating that Splink can be used on datasets where no previous linkage has been carried out. Splink provides several benefits over existing methodology, including comprehensive data visualisations and greater customisation. Our results indicate that Splink is computationally fast, methodologically accurate, and allows the user to perform analyses, visualisations and linkage in an all-in-one solution. To work effectively, Splink requires experienced data linkage (DL) users who are able to set optimal parameters and write meaningful case statements. It is not a plug-and-play DL solution. However, Splink is an excellent tool which can be used to standardise DL across and between government departments.
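The gold-standard step mentioned above can be pictured with a short, generic sketch: given labelled candidate pairs and binary agreement indicators, m is the agreement rate among matches and u the agreement rate among non-matches. This is plain pandas code, not Splink's API, and the file and column names are assumptions.

```python
# Generic sketch of deriving m and u from a gold-standard linked dataset
# (not Splink's API; file and column names are hypothetical).
import pandas as pd

pairs = pd.read_csv("gold_standard_pairs.csv")                    # one row per candidate pair
agreement_cols = ["agree_name", "agree_dob", "agree_postcode"]    # 0/1 agreement indicators

m = pairs.loc[pairs["is_match"] == 1, agreement_cols].mean()      # P(agreement | match)
u = pairs.loc[pairs["is_match"] == 0, agreement_cols].mean()      # P(agreement | non-match)
print(pd.DataFrame({"m": m, "u": u}))
```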
-
13:15 – 14:45
Session 7C -- New Developments and Applications of Small Area Estimation on Disaggregated Data
Chairperson: Abel Dasylva
- Multiple deprivation index using small area estimation methods: an application for the adult population in Colombia
Alejandra Arias-Salazar, Freie Universität Berlin, Germany; Andrés Gutiérrez, Stalyn Guerrero-Gómez, Xavier Mancero, Economic Commission for Latin America and the Caribbean; Natalia Rojas-Perilla, United Arab Emirates University and Hanwen Zhang, Universidad Autónoma de Chile, Chile-
Abstract
Based on the new multiple deprivation index for Latin American countries produced by the Economic Commission for Latin America and the Caribbean, this paper presents a case study of Colombia to obtain small area estimates. The country has a recent population census that provides most of the information required to compute the multiple deprivation index in small domains. However, unit-level information is not available for two of the indicators, so small area estimation methods are implemented for them. A parametric bootstrap algorithm is used to provide uncertainty measures.
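A generic parametric bootstrap for an area-level model can be sketched as follows; it is a simplified illustration (model parameters are held fixed rather than re-estimated on each replicate, which a full implementation would do) and it is not the authors' model or index definition.

```python
# Simplified parametric bootstrap sketch for area-level small area estimates.
# Assumed inputs: X (area covariates), beta_hat, sigma2_u (random-effect variance)
# and psi (known sampling variances) from a previously fitted area-level model.
import numpy as np

def parametric_bootstrap_mse(X, beta_hat, sigma2_u, psi, B=500, seed=1):
    rng = np.random.default_rng(seed)
    gamma = sigma2_u / (sigma2_u + psi)              # per-area shrinkage factor
    mse = np.zeros(X.shape[0])
    for _ in range(B):
        u = rng.normal(0.0, np.sqrt(sigma2_u), size=X.shape[0])
        theta = X @ beta_hat + u                      # bootstrap "true" area values
        y = theta + rng.normal(0.0, np.sqrt(psi))     # bootstrap direct estimates
        theta_hat = gamma * y + (1 - gamma) * (X @ beta_hat)
        mse += (theta_hat - theta) ** 2 / B
    return mse                                        # bootstrap MSE for each small area
```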
-
- Labour Force Survey initiatives under Statistics Canada's Disaggregated Data Action Plan
Martin Pantel, Yelly Camara, Andrew Brennan, Tom Haymes, François Verret, Statistics Canada, Canada-
Abstract
In accordance with Statistics Canada's long-term Disaggregated Data Action Plan (DDAP), a number of initiatives have been implemented in the Labour Force Survey (LFS). One of the more direct initiatives was a targeted increase in the size of the monthly LFS sample. The presentation will describe how this additional sample was incorporated into an existing complex panel design and will present some early analyses of the impact on data quality. Furthermore, a regular Supplement Program was introduced, in which an additional series of questions is asked of a subset of LFS respondents and analyzed in a monthly or quarterly production cycle. This new analytical product focuses on various labour market indicators and can be adapted to address emerging trends. Recent Supplement objectives include providing labour market data on people identifying as belonging to a visible minority group, people with disabilities, and people who work from home. As part of the DDAP initiative, the questions on visible minority status have in fact been moved from the Supplement content to the main LFS questionnaire. Finally, estimates based on Small Area Estimation methodologies are being reintroduced for the LFS and will have a wider scope and more analytical value than what existed in the past.
-
- Application of sampling variance smoothing methods for small area proportion estimation
Yong You and Mike Hidiroglou, Statistics Canada, Canada-
Abstract
Sampling variance smoothing is an important topic in small area estimation. In this paper, we propose sampling variance smoothing methods for small area proportion estimation. In particular, we consider the generalized variance function (GVF) and design effect (DEFF) methods for sampling variance smoothing. We evaluate and compare the smoothed sampling variances and the small area estimates based on the smoothed variance estimates through analyses of data from different Statistics Canada surveys, including the Canadian Community Health Survey (CCHS), the Participation and Activity Limitation Survey (PALS) and the Labour Force Survey (LFS).
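As a rough illustration of the design effect (DEFF) idea, the sketch below replaces unstable direct variances of proportions with smoothed ones built from an averaged design effect; the paper's GVF and DEFF methods may differ in detail.

```python
# Generic sketch of design-effect-based variance smoothing for proportions
# (illustrative; details of the paper's GVF and DEFF methods may differ).
import numpy as np

def smooth_variance_deff(p_hat, v_direct, n):
    srs_var = p_hat * (1 - p_hat) / n      # variance under simple random sampling
    deff = v_direct / srs_var              # area-level design effects
    deff_smooth = np.mean(deff)            # smooth, e.g. by averaging across areas
    return deff_smooth * srs_var           # smoothed sampling variances

# The smoothed variances then replace the unstable direct ones when fitting the
# small area model.
```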
-
- A model-based disaggregation method for small area estimation
Andreea Erciulescu, Weijia Ren, Tom Krenzke, Leyla Mohadjer and Bob Fay, Westat, and Jianzhu Li, FINRA, USA-
Abstract
Estimation at fine levels of aggregation is necessary to better describe society. Small area estimation model-based approaches that combine sparse survey data with rich data from auxiliary sources have proven useful for improving the reliability of estimates for small domains. Considered here is a scenario where small area model-based estimates, produced at a given aggregation level, needed to be disaggregated to better describe the social structure at finer levels. For this scenario, an allocation method was developed to implement the disaggregation, overcoming the challenges associated with data availability and model development at such fine levels. The method is applied to adult literacy and numeracy estimation at the county-by-group level, using data from the U.S. Program for the International Assessment of Adult Competencies (PIAAC). In this application the groups are defined in terms of age or education, but the method could be applied to the estimation of other equity-deserving groups.
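A heavily simplified sketch of the allocation idea is given below: a county-level estimated total is distributed to groups in proportion to auxiliary group shares, so the disaggregated figures respect the county total. This is illustrative only and is not the production method used for PIAAC.

```python
# Minimal sketch of proportional allocation of a county-level estimated total
# to groups (illustrative; not the PIAAC production method).
import numpy as np

def allocate(county_total, group_shares):
    shares = np.asarray(group_shares, dtype=float)
    shares = shares / shares.sum()          # e.g. age- or education-group shares
    return county_total * shares            # group-level figures summing to the county total

# allocate(1200, [0.30, 0.45, 0.25]) -> array([360., 540., 300.])
```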
-
14:45 – 15:00
Closing Remarks
- André Loranger, Assistant Chief Statistician, Statistics Canada, Canada