The organizers of the 2016 International Methodology Symposium would like to thank the presenters who submitted a paper, agreed to have their presentation slides published or provided a link to an external paper.
The contributions below (see links) include submitted conference papers, conference presentations or external papers.
Tuesday, March 22, 2016
8:00 – 17:00
Registration – Third floor
8:45 – 9:00
Opening remarks
Sylvie Michaud, Assistant Chief Statistician, Statistics Canada
9:00 – 10:00
Session 1 – Keynote address
- Methodological Issues and Challenges in the Production of Official Statistics
Danny Pfeffermann, Government Statistician of Israel, Hebrew University of Jerusalem, Israel, Southampton Statistical Sciences Research Institute, United Kingdom-
Contribution
Abstract
Rapid advances in technology, coupled with the increased availability of 'big data' and growing demand for more accurate, more detailed and more timely official statistics under tightened budgets, pose enormous challenges to producers of official statistics across the world. In this presentation I shall discuss some of the major challenges as I see them and, in some cases, offer ways of dealing with them. Examples include the potential use of big data; privacy and confidentiality; the possible use of data obtained from web panels; accounting for mode effects; and the integration of administrative data and small area estimation for future censuses. In the last part of my talk I shall confront the question of whether universities train their students to work at National Statistical Offices.
-
10:30 – 12:00
Session 2A – Big Data in Official Statistics
- Methodological Challenges in Official Statistics
Kees Zeelenberg, Statistics Netherlands, Netherlands-
Abstract
We identify several research areas and topics for methodological research in official statistics. We argue why these are important, and why they are the most important ones for official statistics. We describe the main topics in these research areas and sketch what seem to be the most promising ways to address them. Here we focus on: (i) the quality of national accounts, in particular the rate of growth of GNI; and (ii) big data, in particular how to create representative estimates and how to make the most of big data when this is difficult or impossible. We also touch upon: (i) increasing the timeliness of preliminary and final statistical estimates; and (ii) statistical analysis, in particular of complex and coherent phenomena. These topics are elements of the Strategic Methodological Research Program recently adopted at Statistics Netherlands.
- Contribution (PDF, 472.9 KB) Archived
-
- Profiling of Twitter data: a Big Data selectivity study
Joep Burger, Quan Le, Olav ten Bosch and Piet Daas, Statistics Netherlands, Netherlands-
Abstract
An ever-increasing amount of data about human behavior and economic activity is automatically logged by social media, road sensors, mobile phones and the like. This so-called Big Data is a potential data source for official statistics. Its high volume and rapid availability could be exploited for quick indicators on a diverse range of topics and for more precise estimates about smaller areas. One of the major challenges is to infer unbiased estimates from Big Data. Unlike in sample surveys, the mechanism generating Big Data is not a probability sample; as a result, Big Data typically covers a selective part of the target population. Auxiliary information explaining the missingness could be used to correct for this selectivity. Auxiliary variables might be linked from administrative registers, but often this is not feasible because the units in Big Data sources are difficult to relate to the units in administrative data. We wondered whether it would be possible to obtain auxiliary information in another way. In this presentation, we will show how auxiliary information can be derived from the Big Data source itself, an approach called profiling, using Twitter as an example. We will show that we can reliably determine gender from Twitter accounts using the user name, the bio information, the profile picture and public tweets. Several additional characteristics can be derived from an associated LinkedIn account.
- Contribution
-
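The profiling idea in the abstract above can be illustrated with a toy example. The Python sketch below infers gender from an account's display name against a small reference list of first names; the name lists and accounts are invented, and the actual study combined this signal with the bio, profile picture and public tweets.

```python
# Toy illustration of one profiling signal: inferring gender from the first
# token of a Twitter display name using a reference list of first names.
FEMALE_NAMES = {"anna", "sophie", "maria", "emma"}
MALE_NAMES = {"jan", "piet", "kees", "joep"}

def gender_from_name(display_name: str) -> str:
    tokens = display_name.strip().split()
    first = tokens[0].lower() if tokens else ""
    if first in FEMALE_NAMES:
        return "female"
    if first in MALE_NAMES:
        return "male"
    return "unknown"  # defer to other signals: bio, picture, public tweets

for account in ["Anna de Vries", "Kees Jansen", "xX_gamer_Xx"]:
    print(account, "->", gender_from_name(account))
```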
- The Alternative Data Solution – Experience of the Producer Prices Division
Gaétan Garneau and Mary Beth Garneau, Statistics Canada-
Abstract
Over the last decade, Statistics Canada's Producer Prices Division has expanded its service producer price indexes program and continued to improve its goods and construction producer price indexes program. While the majority of price indexes are based on traditional survey methods, efforts were made to increase the use of administrative data and alternative data sources in order to reduce burden on our respondents. This paper focuses mainly on producer price programs, but also provides information on the growing importance of alternative data sources at Statistics Canada. In addition, it presents the operational challenges and risks that statistical offices could face when relying more and more on third-party outputs. Finally, it presents the tools being developed to integrate alternative data while collecting metadata.
- Contribution (PDF, 424.22 KB) Archived
-
10:30 – 12:00
Session 2B – Applications Related to Growth in Statistical Information
- Challenges and results in using Audit Trail data to monitor Labour Force Survey data quality
Justin Francis and Yves Lafortune, Statistics Canada-
Abstract
The Labour Force Survey (LFS) is a monthly household survey of about 56,000 households that provides information on the Canadian labour market. Audit Trail is a Blaise programming option for computer-assisted interviewing (CAI) surveys such as the LFS; it creates files containing a timestamped record of every keystroke and edit made during every data collection attempt on all households. Combining such a large survey with such a complete source of paradata opens the door to in-depth data quality analysis, but also quickly leads to Big Data challenges. How can meaningful information be extracted from this large set of keystrokes and timestamps? How can it help assess the quality of LFS data collection? The presentation will describe some of the challenges that were encountered, the solutions that were used to address them, and the results of the analysis on data quality.
- Contribution (PDF, 724.48 KB) Archived
-
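As a rough illustration of the paradata reduction involved, the sketch below collapses keystroke-level events into per-question active time. The three-column record layout (household, field, timestamp) is a hypothetical stand-in, not the actual Blaise Audit Trail format.

```python
# Reduce keystroke-level events (one row per keystroke, with a timestamp and
# the active questionnaire field) to per-question durations. Data are invented.
from datetime import datetime

events = [
    ("HH001", "Q_AGE", "2016-03-22 09:00:01"),
    ("HH001", "Q_AGE", "2016-03-22 09:00:07"),
    ("HH001", "Q_EMPSTAT", "2016-03-22 09:00:09"),
    ("HH001", "Q_EMPSTAT", "2016-03-22 09:00:25"),
]

first_seen, last_seen = {}, {}
for hh, field, ts in events:
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    key = (hh, field)
    first_seen.setdefault(key, t)  # first keystroke on this field
    last_seen[key] = t             # most recent keystroke on this field

for (hh, field), t0 in first_seen.items():
    dur = (last_seen[(hh, field)] - t0).total_seconds()
    print(hh, field, f"{dur:.0f} s")
```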
- Statistics Canada's Household Survey Frames Programme – Strategic Research Enabling a Shift to Increased Use of Administrative Data as Input to the Social Statistics Program
Tim Werschler, Edward Chen, Kim Charland and Crystal Sewards, Statistics Canada-
Abstract
Statistics Canada's Household Survey Frames (HSF) Programme provides various universe files that can be used alone or in combination to improve survey design, sampling, collection and processing in the traditional "need to contact a household" model. Even as surveys are migrating onto this core suite of products, the HSF is starting to plan the changes to infrastructure, organisation and linkages with other data assets in Statistics Canada that will help enable a shift to increased use of a wide variety of administrative data as input to the social statistics programme. The presentation will provide an overview of the HSF Programme, describe the foundational concepts that will need to be implemented to expand linkage potential, and identify the strategic research being undertaken toward 2021.
- Contribution (PDF, 534.99 KB) Archived
-
- Road Congestion Measures Using Instantaneous Information from the Canadian Vehicle Use Study (CVUS)
Émile Allie, Transport Canada, Canada-
Abstract
Traffic congestion is not limited to large cities; it is also becoming a problem in medium-size cities and on roads passing through cities. Among a large variety of congestion measures, six were selected for their ease of aggregation and their capacity to use the instantaneous information from the light component of the 2014 CVUS. Of the selected measures, the Index of Congestion is potentially the only unbiased one. This measure is used to illustrate different dimensions of congestion on the road network.
- Contribution (PDF, 402.29 KB) Archived
-
- The Data Warehouse and analytical tools to facilitate the integration of the Canadian Macroeconomic Accounts
Alistair Macfarlane and Jordan-Daniel Sabourin, Statistics Canada-
Abstract
The Data Warehouse has modernized the way the Canadian System of Macroeconomic Accounts (MEA) is produced and analyzed today. Its continuing evolution expands the amount and types of analytical work that can be done within the MEA. It brings in the needed elements of harmonization and confrontation as the macroeconomic accounts move toward full integration. The improvements in quality, transparency and timeliness have strengthened the statistics that are being disseminated.
- Contribution (PDF, 197.23 KB) Archived
-
13:30 – 15:00
Session 3A – Total Survey Error
- Using Administrative Records to Evaluate Survey Data
Mary H. Mulry, Elizabeth M. Nichols and Jennifer Hunter Childs, U.S. Census Bureau, USA-
Abstract
After the 2010 Census, the U.S. Census Bureau conducted two separate research projects matching survey data to databases. One study matched to the third-party database Accurint, and the other matched to U.S. Postal Service National Change of Address (NCOA) files. In both projects, we evaluated response error in reported move dates by comparing the self-reported move date to records in the database. We encountered similar challenges in the two projects. This paper discusses our experience using “big data” as a comparison source for survey data and our lessons learned for future projects similar to the ones we conducted.
- Contribution (PDF, 456.73 KB) Archived
-
- An Evaluation of Panel Nonresponse and Linkage Consent Bias in a Survey of Employees in Germany
Joseph Sakshaug, University of Manchester, United Kingdom and Martina Huber, Institute for Employment Research, Germany-
Abstract
Surveys are susceptible to multiple error sources which threaten the validity of inferences drawn from them. While much of the survey methods literature has focused on identifying errors in cross-sectional surveys, errors in panel surveys have received less attention. Administrative records linked to the entire sample (respondents and nonrespondents) can be useful for studying various errors in panel surveys, including nonresponse, which tends to accumulate over multiple waves of the study. Record data can also be used to study errors due to linkage consent, which is commonly asked for in panel surveys but not provided by all respondents. In this paper, we present bias estimates for both error sources from a panel survey in Germany. The bias estimates are derived from administrative data collected on a sample of employees who were invited to participate in the panel. We find evidence of increasing nonresponse bias over time for both cross-sectional and longitudinal outcome measures. The opposite pattern is observed for linkage consent bias, which decreases over time when respondents who do not provide consent in a prior wave are asked to reconsider their decision in subsequent waves. We conclude the paper with a discussion of the practical implications of these findings and suggestions for future research.
- Contribution
-
- Big Data: A Survey Research Perspective
Reg Baker, Marketing Research Institute International, USA-
Abstract
Big data is a term that means different things to different people. To some, it means datasets so large that our traditional processing and analytic systems can no longer accommodate them. To others, it simply means taking advantage of existing datasets of all sizes and finding ways to merge them with the goal of generating new insights. The former view poses a number of important challenges to traditional market, opinion, and social research. In either case, there are implications for the future of surveys that are only beginning to be explored.
- Contribution (PDF, 313.05 KB) Archived
-
13:30 – 15:00
Session 3B – Alternative Data Sources to Replace or Complement Survey Data
- A Case Study in Administrative Data Informing Policy Development
Yves Gingras, Tony Haddad, Stéphanie Roberge, Georges Awad and Andy Handouyahia, Employment and Social Development Canada, Canada-
Abstract
The Labour Market Development Agreements (LMDAs) between Canada and the provinces and territories fund labour market training and support services to Employment Insurance claimants. The objective of this paper is to discuss the improvements over the years in the impact assessment methodology. The paper describes the LMDAs and past evaluation work and discusses the drivers for making better use of large administrative data holdings. It then explains how the new approach made the evaluation less resource-intensive while producing results that are more relevant to policy development. The paper outlines the lessons learned from a methodological perspective and provides insight into ways of making this type of use of administrative data effective, especially in the context of large programs.
- Contribution (PDF, 281.62 KB) Archived
-
- Towards an integrated census – administrative data approach to item-level imputation for the 2021 UK Census
Steven Rogers and Fern Leather, Office for National Statistics, United Kingdom-
Abstract
In preparation for the 2021 UK Census, the ONS has committed to an extensive research programme exploring how linked administrative data can be used to support conventional statistical processes. Item-level edit and imputation (E&I) will play an important role in adjusting the 2021 Census database. However, uncertainty associated with the accuracy and quality of available administrative data renders the efficacy of an integrated census-administrative data approach to E&I unclear. Current constraints that dictate an anonymised 'hash-key' approach to record linkage to ensure confidentiality add to that uncertainty. Here, we provide preliminary results from a simulation study comparing the predictive and distributional accuracy of the conventional E&I strategy implemented in CANCEIS for the 2011 UK Census to that of an integrated approach using synthetic administrative data with systematically increasing error as auxiliary information. In this initial phase of research we focus on imputing single year of age. The aim of the study is to gain insight into whether auxiliary information from administrative data can improve imputation estimates and where the different strategies fall on a continuum of accuracy.
- Contribution (PDF, 278.66 KB) Archived
-
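In the same spirit as the simulation study described above, the toy sketch below imputes a missing single year of age from a synthetic administrative value whose error rate is increased systematically, and compares the exact-match rate against a no-auxiliary benchmark. The error model and imputation rules are simplified stand-ins, not the CANCEIS strategy.

```python
# Toy simulation: impute age with and without an administrative auxiliary
# value whose error rate we dial up. All data are synthetic.
import random

random.seed(42)
N = 10_000
true_age = [random.randint(0, 90) for _ in range(N)]
missing = [random.random() < 0.10 for _ in range(N)]  # 10% item nonresponse

for error_rate in (0.0, 0.2, 0.4):
    # Admin age equals true age except a share of records get a wrong value.
    admin_age = [a if random.random() > error_rate else random.randint(0, 90)
                 for a in true_age]
    observed = [a for a, m in zip(true_age, missing) if not m]
    donor_mean = round(sum(observed) / len(observed))

    hits_admin = hits_donor = n_missing = 0
    for a, adm, m in zip(true_age, admin_age, missing):
        if not m:
            continue
        n_missing += 1
        hits_admin += (adm == a)         # impute from the admin source
        hits_donor += (donor_mean == a)  # naive no-auxiliary benchmark
    print(f"admin error {error_rate:.0%}: "
          f"exact-match rate admin={hits_admin/n_missing:.2f}, "
          f"mean-impute={hits_donor/n_missing:.2f}")
```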
- Comparing Survey Data to Administrative Sources: Immigration, Labour, and Demographic data from the Longitudinal and International Study of Adults
James Hemeon, Statistics Canada-
Abstract
Administrative data, depending on its source and original purpose, can be considered a more reliable source of information than survey-collected data. It does not require a respondent to be present and understand question wording, and it is not limited by the respondent's ability to recall events retrospectively. This paper compares selected survey data, such as demographic variables, from the Longitudinal and International Study of Adults (LISA) to various administrative sources for which LISA has linkage agreements in place. The agreement between data sources, and some factors that might affect it, are analyzed for various aspects of the survey.
- Contribution (PDF, 215.51 KB) Archived
-
- Student Pathways and Graduate Outcomes
Aimé Ntwari, Éric Fecteau, Rubab Arim, Christine Hinchley and Sylvie Gauthier, Statistics Canada-
Abstract
Linked files combining data from Statistics Canada's Postsecondary Student Information System (PSIS) with tax data can be used to examine the trajectories of students who pursue postsecondary education (PSE) programs and their post-schooling labour market outcomes. On one hand, administrative data on students linked longitudinally can provide aggregate information on student pathways during postsecondary studies, such as persistence rates, graduation rates and mobility. On the other hand, the tax data can supplement the PSIS data with information on employment outcomes, such as average and median earnings or earnings progression by employment sector (industry), field of study, education level and/or other demographic information, year over year after graduation. Two longitudinal pilot studies have been conducted using administrative data on postsecondary students of Maritime institutions, linked longitudinally and to Statistics Canada tax data (the T1 Family File) for the relevant years. This article first focuses on the quality of information in the administrative data and the methodology used to conduct these longitudinal studies and derive indicators. Second, it focuses on some limitations of using administrative data, rather than a survey, to define some concepts.
- Contribution (PDF, 459.1 KB) Archived
-
- Estimating the effects related to the timing of participation in employment assistance services using rich administrative data
Stéphanie Roberge, Andy Handouyahia, Tony Haddad, Georges Awad and Yves Gingras, Employment and Social Development Canada, Canada-
Abstract
This study assessed whether starting participation in Employment Assistance Services (EAS) earlier after initiating an Employment Insurance (EI) claim leads to better impacts for unemployed individuals than participating later during the EI benefit period. As in Sianesi (2004) and Hujer and Thomsen (2010), the analysis relied on a stratified propensity score matching approach conditional on the discretized duration of unemployment until the program start. The results showed that individuals who participated in EAS within the first four weeks after initiating an EI claim saw the largest impacts on earnings and incidence of employment, and also made less use of EI starting in the second year post-program.
- Contribution (PDF, 341.44 KB) Archived
-
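A minimal sketch of the stratified matching idea from the abstract above: participants are matched to comparison cases within strata defined by discretized unemployment duration, using nearest-neighbour distance on a propensity score. Scores and outcomes below are invented; in practice the scores would be estimated from rich administrative covariates.

```python
# Stratified nearest-neighbour propensity score matching (toy data).
# Tuples are (stratum = discretized weeks unemployed, propensity score, outcome).
participants = [(1, 0.61, 5200), (1, 0.34, 4800), (2, 0.55, 4500)]
comparisons = [(1, 0.58, 4700), (1, 0.31, 4600), (2, 0.50, 4100), (2, 0.60, 4300)]

def att(participants, comparisons):
    gaps = []
    for stratum, score, y1 in participants:
        pool = [(abs(score - s), y0) for st, s, y0 in comparisons if st == stratum]
        if not pool:
            continue  # no common support in this stratum
        _, y0 = min(pool)  # nearest neighbour on the propensity score
        gaps.append(y1 - y0)
    return sum(gaps) / len(gaps)

print("ATT estimate:", att(participants, comparisons))
```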
15:30 – 17:00
Session 4A – Open Data
- An International Overview of Open Data Experiences
Timothy Herzog, World Bank, USA-
Abstract
Open Data initiatives are transforming how governments and other public institutions interact and provide services to their constituents. They increase transparency and value to citizens, reduce inefficiencies and barriers to information, enable data-driven applications that improve public service delivery, and provide public data that can stimulate innovative business opportunities. As one of the first international organizations to adopt an open data policy, the World Bank has been providing guidance and technical expertise to developing countries that are considering or designing their own initiatives. This presentation will give an overview of developments in open data at the international level along with current and future experiences, challenges, and opportunities. Mr. Herzog will discuss the rationales under which governments are embracing open data, demonstrated benefits to both the public and private sectors, the range of different approaches that governments are taking, and the availability of tools for policymakers, with special emphasis on the roles and perspectives of National Statistics Offices within a government-wide initiative.
- Contribution (PDF, 1.24 MB) Archived
-
- Statistics Canada and Open Data
Bill Joyce, Statistics Canada-
Abstract
This paper is intended to give a brief overview of Statistics Canada's involvement with open data. It first discusses how the principles of open data are being adopted in the agency's ongoing dissemination practices. It then discusses the agency's involvement with the whole-of-government open data initiative. This involvement is twofold: Statistics Canada is the major data contributor to the Government of Canada Open Data portal, but it also plays an important behind-the-scenes role as the service provider responsible for developing and maintaining the Open Data portal (which is now part of the wider Open Government portal).
- Contribution (PDF, 149.99 KB) Archived
-
- Exploring Canada's Open Government Portal
Ashley Casovan, Treasury Board of Canada Secretariat, Canada-
Abstract
Open data is becoming an increasingly important expectation of Canadians, researchers, and developers. Learn how and why the Government of Canada has centralized the distribution of all Government of Canada open data through Open.Canada.ca and how this initiative will continue to support the consumption of statistical information.
- Contribution (PDF, 1.26 MB) Archived
-
15:30 – 17:00
Session 4B – Quality of Administrative Data
- Assimilation and Coverage of the Foreign-Born Population in Administrative Records
Renuka Bhaskar, Leticia Fernandez and Sonya Rastogi, U.S. Census Bureau, USA-
Abstract
The U.S. Census Bureau is researching ways to incorporate administrative data in decennial census and survey operations. Critical to this work is an understanding of the coverage of the population by administrative records. Using federal and third-party administrative data linked to the American Community Survey (ACS), we evaluate the extent to which administrative records provide data on foreign-born individuals in the ACS and employ multinomial logistic regression techniques to evaluate the characteristics of those who are in administrative records relative to those who are not. We find that, overall, administrative records provide high coverage of foreign-born individuals in our sample for whom a match can be determined. The odds of being in administrative records are found to be tied to the processes of immigrant assimilation: naturalization, higher English proficiency, educational attainment, and full-time employment are associated with greater odds of being in administrative records. These findings suggest that as immigrants adapt and integrate into U.S. society, they are more likely to be involved in the government and commercial processes and programs from which the administrative data are drawn. We further explore administrative records coverage for the two largest race/ethnic groups in our sample, Hispanic and non-Hispanic single-race Asian foreign born, finding again that characteristics related to assimilation are associated with administrative records coverage for both groups. However, we observe that neighborhood context impacts Hispanics and Asians differently.
- Contribution (PDF, 1.65 MB) Archived
-
- When Race and Hispanic Origin Reporting are Discrepant Across Administrative Records Sources: Exploring Methods to Assign Responses
Sharon R. Ennis, Sonya Rastogi and James Noon, U.S. Census Bureau, USA-
Abstract
The U.S. Census Bureau is researching uses of administrative records in survey and decennial operations in order to reduce costs and respondent burden while preserving data quality. One potential use of administrative records is to utilize the data when race and Hispanic origin responses are missing. When federal and third party administrative records are compiled, race and Hispanic origin responses are not always the same for an individual across different administrative records sources. We explore different sets of business rules used to assign one race and one Hispanic response when these responses are discrepant across sources. We also describe the characteristics of individuals with matching, non-matching, and missing race and Hispanic origin data across several demographic, household, and contextual variables. We find that minorities, especially Hispanics, are more likely to have non-matching Hispanic origin and race responses in administrative records than in the 2010 Census. Hispanics are less likely to have missing Hispanic origin data but more likely to have missing race data in administrative records. Non-Hispanic Asians and non-Hispanic Pacific Islanders are more likely to have missing race and Hispanic origin data in administrative records. Younger individuals, renters, individuals living in households with two or more people, individuals who responded to the census in the nonresponse follow-up operation, and individuals residing in urban areas are more likely to have non-matching race and Hispanic origin responses.
- Contribution (PDF, 634.89 KB) Archived
-
- The challenges of linking and using administrative data from different sources
Philippe Gamache, Institut national de santé publique du Québec, Canada-
Abstract
At the Institut national de santé publique du Québec, the Quebec Integrated Chronic Disease Surveillance System (QICDSS) has been used daily for approximately four years. The benefits of this system are numerous for measuring the extent of diseases more accurately, evaluating the use of health services properly and identifying certain groups at risk. However, in the past months, various problems have arisen that have required a great deal of careful thought. The problems have affected various areas of activity, such as data linkage, data quality, coordinating multiple users and meeting legal obligations. The purpose of this presentation is to describe the main challenges associated with using QICDSS data and to present some possible solutions. In particular, this presentation discusses the processing of five data sources that not only come from five different sources, but also are not mainly used for chronic disease surveillance. The varying quality of the data, both across files and within a given file, will also be discussed. Certain situations associated with the simultaneous use of the system by multiple users will also be examined. Examples will be given of analyses of large data sets that have caused problems. As well, a few challenges involving disclosure and the fulfillment of legal agreements will be briefly discussed.
- Contribution (PDF, 329.35 KB) Archived
-
- Using data linkage to evaluate the consistency of place of residence between census data and tax data
Julien Bérard-Chagnon and Georgina House, Statistics Canada-
Abstract
Tax data are being used more and more to measure and analyze the population and its characteristics. One of the issues raised by the growing use of this type of data relates to the definition of the concept of place of residence. While the census uses the traditional concept of place of residence, tax data provide information based on the mailing address of tax filers. Using record linkage between the census, the National Household Survey and tax data from the T1 Family File, this study examines the consistency of place of residence between these two sources and the characteristics associated with discrepancies.
- Contribution (PDF, 133 KB) Archived
-
- Estimating internal migration: Issues related to using tax data
Guylaine Dubreuil and Georgina House, Statistics Canada-
Abstract
Internal migration is one of the components of population growth estimated at Statistics Canada. It is estimated by comparing individuals' addresses at the beginning and end of a given period. The Canada Child Tax Benefit and T1 Family File are the primary data sources used. Address quality and coverage of more mobile subpopulations are crucial to producing high-quality estimates. The purpose of this article is to present the results of evaluations of these elements, drawing on the additional tax data sources now accessible at Statistics Canada.
- Contribution (PDF, 372.67 KB) Archived
-
Wednesday, March 23, 2016
8:00 – 17:00
Registration – Third floor
8:45 – 9:45
Session 5 – Waksberg Award Winner Address
- Towards a Quality Framework for Blends of Designed and Organic Data
Robert Groves, Georgetown University, USA-
Abstract
Probability samples of near-universal frames of households and persons, administered standardized measures, yielding long multivariate data records, and analyzed with statistical procedures reflecting the design – these have been the cornerstones of the empirical social sciences for 75 years. That measurement structure has given the developed world almost all of what we know about our societies and their economies. The stored survey data form a unique historical record. We live now in a different data world than that in which the leadership of statistical agencies and the social sciences were raised. High-dimensional data are ubiquitously being produced from Internet search activities, mobile Internet devices, social media, sensors, retail store scanners, and other devices. Some estimate that these data sources are increasing in size at the rate of 40% per year. Together their sizes swamp that of the probability-based sample surveys. Further, the state of sample surveys in the developed world is not healthy. Falling rates of survey participation are linked with ever-inflated costs of data collection. Despite growing needs for information, the creation of new survey vehicles is hampered by strained budgets for official statistical agencies and social science funders.
These combined observations are unprecedented challenges for the basic paradigm of inference in the social and economic sciences. This paper discusses alternative ways forward at this moment in history.
- Contribution (PDF, 1.65 MB) Archived
-
9:45 – 10:00
Speed Advertisement for Posters and Software Demonstration
10:30 – 12:00
Session 6A – New Advancements in Record Linkage
- Statistical Modeling for Errors in Record Linkage Applied to SEER Cancer Registry Data
Michael D. Larsen, The George Washington University, USA-
Abstract
Record linkage joins together two or more sources. The product of record linkage is a file with one record per individual containing all the information about the individual from the multiple files. The problem is difficult when a unique identification key is not available, there are errors in some variables, some data are missing, and the files are large. Probabilistic record linkage computes a probability that records on different files pertain to a single individual. Some true links are given low probabilities of matching, whereas some non-links are given high probabilities. Errors in linkage designations can cause bias in analyses based on the composite database.
The SEER cancer registries contain information on breast cancer cases in their registry areas. A diagnostic test based on the Oncotype DX assay, performed by Genomic Health, Inc. (GHI), is often conducted for certain types of breast cancer. Record linkage using personally identifiable information was conducted to associate Oncotype DX assay results with SEER cancer registry information. The software Link Plus was used to generate a score describing the similarity of records and to identify the apparent best match of SEER cancer registry individuals to the GHI database. Clerical review was used to check samples of likely matches, possible matches and unlikely matches.
Models are proposed for jointly modeling the record linkage process and subsequent statistical analysis in this and other applications.
- Contribution (PDF, 659.73 KB) Archived
-
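As a hypothetical sketch of the scoring that underlies probabilistic linkage of the Fellegi-Sunter type (the general family the abstract above builds on), each comparison field contributes log2(m/u) on agreement and log2((1-m)/(1-u)) on disagreement, and the total weight ranks candidate pairs. The m/u probabilities below are invented, not those used with Link Plus.

```python
# Fellegi-Sunter style match weights with illustrative m/u probabilities.
from math import log2

fields = {            # field: (m, u) = P(agree | match), P(agree | non-match)
    "surname":    (0.95, 0.01),
    "birth_year": (0.97, 0.05),
    "zip":        (0.90, 0.10),
}

def match_weight(agreements: dict) -> float:
    w = 0.0
    for f, (m, u) in fields.items():
        w += log2(m / u) if agreements[f] else log2((1 - m) / (1 - u))
    return w

print(match_weight({"surname": True, "birth_year": True, "zip": False}))
print(match_weight({"surname": False, "birth_year": True, "zip": False}))
```

Pairs with weights above an upper cutoff are declared links, below a lower cutoff non-links, and those in between go to clerical review.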
- Sampling Procedures for Assessing Accuracy of Record Linkage
Paul Smith, University of Southampton, United Kingdom; Shelley Gammon, Sarah Cummins, Christos Chatzoglou and Dick Heasman, Office for National Statistics, United Kingdom-
Abstract
The use of administrative datasets as a data source in official statistics has become much more common as there is a drive for more outputs to be produced more efficiently. Many outputs rely on linkage between two or more datasets, and this is often undertaken in a number of phases with different methods and rules. In these situations we would like to be able to assess the quality of the linkage, and this involves some re-assessment of both links and non-links. In this paper we discuss sampling approaches to obtain estimates of false negatives and false positives with reasonable control of both accuracy of estimates and cost. Approaches to stratification of links (non-links) to sample are evaluated using information from the 2011 England and Wales population census.
- Contribution (PDF, 116.84 KB) Archived
-
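A minimal sketch of the kind of stratified assessment discussed above: declared links are stratified by match score, a clerical sample is reviewed in each stratum, and the stratum false-link rates are combined with population weights. All counts below are invented.

```python
# Stratified estimate of the false-positive (false-link) rate among links.
strata = [
    # (label, links in stratum, reviewed, found false among reviewed)
    ("score 20+",   50_000, 200, 1),
    ("score 15-20", 20_000, 200, 9),
    ("score 10-15",  5_000, 200, 41),
]

total_links = sum(n for _, n, _, _ in strata)
fp_rate = sum((n / total_links) * (false / reviewed)
              for _, n, reviewed, false in strata)
print(f"Estimated false-positive rate among links: {fp_rate:.3%}")
```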
- Bayesian Estimation of Bipartite Matchings for Record Linkage
Mauricio Sadinle, Duke University, USA-
Abstract
The bipartite record linkage task consists of merging two disparate datafiles containing information on two overlapping sets of entities. This is non-trivial in the absence of unique identifiers and it is important for a wide variety of applications given that it needs to be solved whenever we have to combine information from different sources. Most statistical techniques currently used for record linkage are derived from a seminal paper by Fellegi and Sunter (1969). These techniques usually assume independence in the matching statuses of record pairs to derive estimation procedures and optimal point estimators. We argue that this independence assumption is unreasonable and instead target a bipartite matching between the two datafiles as our parameter of interest. Bayesian implementations allow us to quantify uncertainty on the matching decisions and derive a variety of point estimators using different loss functions. We propose partial Bayes estimates that allow uncertain parts of the bipartite matching to be left unresolved. We evaluate our approach to record linkage using a variety of challenging scenarios and show that it outperforms the traditional methodology. We illustrate the advantages of our methods by merging two datafiles on casualties from the civil war in El Salvador.
- Contribution
-
10:30 – 12:00
Session 6B – Confidentiality
- Finding a Needle in a Haystack: The Theoretical and Empirical Foundations of Assessing Disclosure Risk for Contextualized Microdata
Kevin T. Leicht, University of Illinois, USA-
Abstract
Our study describes various factors that are of concern when evaluating disclosure risk of contextualized microdata and some of the empirical steps that are involved in their assessment. Utilizing synthetic sets of survey respondents, we illustrate how different postulates shape the assessment of risk when considering: (1) estimated probabilities that unidentified geographic areas are represented within a survey; (2) the number of people in the population who share the same personal and contextual identifiers as a respondent; and (3) the anticipated amount of coverage error in census population counts and extant files that provide identifying information (like names and addresses).
- Contribution (PDF, 235.15 KB) Archived
-
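Point (2) of the abstract above can be made concrete: count how many population records share each combination of personal and contextual identifiers (the "equivalence class" size), and flag classes of size one as population uniques. The sketch below uses invented data.

```python
# Equivalence-class sizes for combinations of identifying attributes.
from collections import Counter

population = [  # (age band, sex, county) -- illustrative records
    ("30-39", "F", "county_a"), ("30-39", "F", "county_a"),
    ("30-39", "M", "county_a"), ("70-79", "F", "county_b"),
]
class_size = Counter(population)

for key, n in class_size.items():
    flag = "  <- population unique" if n == 1 else ""
    print(key, "class size:", n, flag)
```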
- A Modern Job Submission Application to Access IAB’s Confidential Administrative and Survey Research Data
Johanna Eberle, Jörg Heining, Dana Müller and David Schiller, Institute for Employment Research, Germany-
Abstract
The Institute for Employment Research (IAB) is the research unit of the German Federal Employment Agency. Via the Research Data Centre (FDZ) at the IAB, administrative and survey data on individuals and establishments are provided to researchers. In cooperation with the Institute for the Study of Labor (IZA), the FDZ has implemented the Job Submission Application (JoSuA) environment which enables researchers to submit jobs for remote data execution through a custom-built web interface. Moreover, two types of user-generated output files may be distinguished within the JoSuA environment which allows for faster and more efficient disclosure review services.
- Contribution (PDF, 161.94 KB) Archived
-
- Enhancing data sharing via "safe designs"
Kristine Witkowski, University of Michigan, USA-
Abstract
The social value of data collections is dramatically enhanced by the broad dissemination of research files and the resulting increase in scientific productivity. Currently, most studies are designed with a focus on collecting information that is analytically useful and accurate, with little forethought as to how it will be shared. Both literature and practice also presume that disclosure analysis will take place after data collection. But to produce public-use data of the highest analytical utility for the largest user group, disclosure risk must be considered at the beginning of the research process. Drawing upon economic and statistical decision-theoretic frameworks and survey methodology research, this study seeks to enhance the scientific productivity of shared research data by describing how disclosure risk can be addressed in the earliest stages of research with the formulation of "safe designs" and "disclosure simulations", where an applied statistical approach has been taken in: (1) developing and validating models that predict the composition of survey data under different sampling designs; (2) selecting and/or developing measures and methods used in the assessments of disclosure risk, analytical utility, and survey costs that are best suited for evaluating sampling and database designs; and (3) conducting simulations to gather estimates of risk, utility, and cost for studies with a wide range of sampling and database design characteristics.
- Contribution (PDF, 1.25 MB) Archived
-
- Privacy and Security Aspects Related to the Use of Big Data – progress of work in the European Statistical System (ESS)
Pascal Jacques, EUROSTAT, Luxembourg-
Abstract
Data protection and privacy are key challenges that need to be tackled with high priority in order to enable the use of Big Data in the production of Official Statistics. This was emphasized in 2013 by the Directors of the National Statistical Institutes (NSIs) of the European Statistical System Committee (ESSC) in the Scheveningen Memorandum. The ESSC requested Eurostat and the NSIs to elaborate an action plan with a roadmap for following up the implementation of the Memorandum. At the Riga meeting on September 26, 2014, the ESSC endorsed the Big Data Action Plan and Roadmap 1.0 (BDAR) presented by the Eurostat Task Force on Big Data (TFBD) and agreed to integrate it into the ESS Vision 2020 portfolio.
Eurostat also collaborates in this field with external partners such as the United Nations Economic Commission for Europe (UNECE). The big data project of the UNECE High-Level Group is an international project on the role of big data in the modernization of statistical production. It comprised four 'task teams' addressing different aspects of Big Data issues relevant for official statistics: Privacy, Partnerships, Sandbox, and Quality. The Privacy Task Team finished its work in 2014 and gave an overview of the existing tools for risk management regarding privacy issues, described how risk of identification relates to Big Data characteristics and drafted recommendations for National Statistical Offices (NSOs). It mainly concluded that extensions to existing frameworks, including use of new technologies were needed in order to deal with privacy risks related to the use of Big Data.
The BDAR builds on the work achieved by the UNECE task teams. Specifically, it recognizes that a number of big data sources contain sensitive information, that their use for official statistics may induce negative perceptions among the general public and other stakeholders, and that this risk should be mitigated in the short to medium term. It proposes launching several actions, such as a review of the ethical principles governing the roles and activities of the NSIs and a strong communication strategy.
The paper presents the different actions undertaken within the ESS and in collaboration with UNECE, as well as potential technical and legal solutions to be put in place in order to address the data protection and privacy risks in the use of Big Data for Official Statistics.
- Contribution (PDF, 862.46 KB) Archived
-
- Practical Applications of Secure Computation for Disclosure Control
Luk Arbuckle, Children's Hospital of Eastern Ontario Research Institute, Canada and Khaled El Emam, Children's Hospital of Eastern Ontario Research Institute, University of Ottawa, Canada-
Abstract
Microdata dissemination normally requires that data reduction and modification methods be applied, and the degree to which these methods are applied depends on the control methods that will be required to access and use the data. An approach that is, in some circumstances, more suitable for accessing data for statistical purposes is secure computation, which involves computing analytic functions on encrypted data without the need to decrypt the underlying source data to run a statistical analysis. This approach also allows multiple sites to contribute data while providing strong privacy guarantees. This way the data can be pooled, and contributors can compute analytic functions without any party knowing the others' inputs. We explain how secure computation can be applied in practical contexts, with some theoretical results and real healthcare examples.
- Contribution (PDF, 483.82 KB) Archived
-
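A minimal sketch of one secure-computation building block, an additive secret-sharing sum: each contributing site splits its value into random shares modulo a public prime, so no single party learns another's input, yet the shares together reconstruct the pooled total. This illustrates the general idea only; the applications described in the talk rest on fuller protocols.

```python
# Secure sum via additive secret sharing (toy protocol, invented values).
import random

P = 2**61 - 1  # public modulus

def make_shares(value: int, n_parties: int) -> list[int]:
    """Split value into n random shares that sum to it modulo P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

site_values = [120, 340, 75]  # e.g., case counts at three hospitals
n = len(site_values)
all_shares = [make_shares(v, n) for v in site_values]

# Party j sums the j-th share of every site; each partial sum reveals nothing
# on its own, but the partial sums together reconstruct the pooled count.
partials = [sum(all_shares[i][j] for i in range(n)) % P for j in range(n)]
print("Pooled total:", sum(partials) % P)  # 535
```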
13:30 – 15:00
Session 7A – Non-traditional Methods for Analysis of Survey Data
- Empirical Likelihood Confidence Intervals for Finite Population Proportions
Changbao Wu, University of Waterloo, Canada-
Abstract
Empirical likelihood (EL) ratio confidence intervals are very attractive for parameters with range restrictions such as population proportions or distribution functions. Wu and Rao (2006) studied the pseudo EL confidence intervals using complex survey data and Rao and Wu (2010) developed a Bayesian approach based on the pseudo empirical likelihood function. In this paper, we examine the performance of the pseudo EL and Bayesian EL intervals for finite population proportions using complex survey data. We also address a practically important scenario where the basic design weights and second-order inclusion probabilities are not available, but instead the final adjusted or calibrated weights with suitable replication weights are provided by the data file producers. Results from simulation studies will be reported. The research is joint work with J.N.K. Rao of Carleton University.
-
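As a sketch of the machinery involved (notation simplified from Wu and Rao, 2006; treat this as an assumption-laden outline rather than the paper's exact formulation), the pseudo empirical log-likelihood and its profile ratio for a proportion can be written as:

```latex
% Pseudo empirical log-likelihood with normalized survey weights
\[
  \hat{\ell}(\mathbf{p}) \;=\; n^{*} \sum_{i \in s} \tilde{w}_i \log p_i,
  \qquad
  \tilde{w}_i \;=\; w_i \Big/ \textstyle\sum_{j \in s} w_j,
\]
% Profile pseudo-EL ratio for a proportion \theta = P(y = 1)
\[
  r(\theta) \;=\; 2 \left\{ \hat{\ell}\bigl(\hat{\mathbf{p}}\bigr)
                   - \hat{\ell}\bigl(\hat{\mathbf{p}}(\theta)\bigr) \right\},
\]
% where \hat{\mathbf{p}}(\theta) maximizes \hat{\ell} subject to
% \sum_{i \in s} p_i = 1 and \sum_{i \in s} p_i y_i = \theta.
```

A confidence interval is then the set of theta values with r(theta) below a (design-effect-scaled) chi-squared quantile with one degree of freedom, which automatically respects the [0, 1] range restriction.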
- Hypotheses Testing from Categorical Survey Data Using Bootstrap Weights
J. N. K. Rao, Carleton University, Canada and Jae Kwang Kim, Iowa State University, USA-
Abstract
Standard statistical methods that do not take proper account of the complexity of the survey design can lead to erroneous inferences when applied to survey data. In particular, the actual type I error rates of tests of hypotheses based on standard tests can be much larger than the nominal level. Methods that take account of survey design features in testing hypotheses have been proposed, including Wald tests and quasi-score tests (Rao, Scott and Skinner, 1998) that involve the estimated covariance matrices of parameter estimates. The bootstrap method of Rao and Wu (1983) is often applied at Statistics Canada to estimate the covariance matrices, using a data file containing columns of bootstrap weights. Standard statistical packages often permit the use of survey-weighted test statistics, and it is attractive to approximate their distributions under the null hypothesis by their bootstrap analogues computed from the bootstrap weights supplied in the data file. Beaumont and Bocci (2009) applied this bootstrap method to testing hypotheses on regression parameters under a linear regression model, using weighted F statistics. In this paper, we present a unified approach to the above method by constructing bootstrap approximations to weighted likelihood ratio statistics and weighted quasi-score statistics. We report the results of a simulation study on testing independence in a two-way table of categorical survey data, comparing the performance of the proposed method with alternatives including the Rao-Scott corrected chi-squared statistic.
- Contribution (PDF, 301.87 KB) Archived
-
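The sketch below illustrates the general bootstrap-weight testing idea in a simplified form, not the authors' exact statistics: the observed independence statistic uses the full-sample weights, and its null distribution is approximated by recomputing the statistic on each bootstrap-weight column after centring the cell residuals at the full-sample estimates. Data are invented, and the bootstrap weights are simulated stand-ins for the columns a real data file would supply.

```python
# Bootstrap-weight approximation to the null distribution of an
# independence statistic in a 2x3 table (simplified illustration).
import numpy as np

rng = np.random.default_rng(1)
n = 500
rows = rng.integers(0, 2, n)   # categorical variable 1
cols = rng.integers(0, 3, n)   # categorical variable 2
w = rng.uniform(0.5, 2.0, n)   # final survey weights
B = 200
wboot = w[:, None] * rng.exponential(1.0, (n, B))  # stand-in bootstrap weights

def residuals(weights):
    """Weighted cell proportions minus the independence fit."""
    p = np.zeros((2, 3))
    for r, c, wt in zip(rows, cols, weights):
        p[r, c] += wt
    p /= weights.sum()
    return p - np.outer(p.sum(axis=1), p.sum(axis=0))

def stat(d):
    return n * (d ** 2).sum()

d_hat = residuals(w)
observed = stat(d_hat)
# Centring at d_hat makes each replicate mimic the statistic under H0.
null_draws = [stat(residuals(wboot[:, b]) - d_hat) for b in range(B)]
p_value = np.mean([x >= observed for x in null_draws])
print(f"X2 = {observed:.2f}, bootstrap p-value = {p_value:.3f}")
```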
- Better Adjusted Weights for Respondents in Skewed Populations
Glen Meeden, University of Minnesota, USA-
Abstract
In the standard design approach to missing observations, the construction of weight classes and calibration are used to adjust the design weights for the respondents in the sample. Here we use these adjusted weights to define a Dirichlet distribution which can be used to make inferences about the population. Examples show that the resulting procedures have better performance properties than the standard methods when the population is skewed.
- Contribution (PDF, 220.28 KB) Archived
-
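A minimal sketch of the inferential step described above: the adjusted respondent weights act as the parameter of a Dirichlet distribution over the respondents, and draws from it yield a posterior for the population mean. The data below are invented and skewed to match the setting.

```python
# Dirichlet-weight inference for a skewed population mean (toy data).
import numpy as np

rng = np.random.default_rng(7)
y = np.exp(rng.normal(10, 1, 120))   # skewed respondent outcomes
w = rng.uniform(1, 6, 120)           # adjusted respondent weights

draws = rng.dirichlet(w, size=5000)  # one probability vector per simulation
posterior_means = draws @ y
lo, hi = np.percentile(posterior_means, [2.5, 97.5])
print(f"Point estimate {np.average(y, weights=w):.0f}, "
      f"95% interval ({lo:.0f}, {hi:.0f})")
```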
13:30 – 15:00
Session 7B – Applications of record linkage and statistical matching
- Record Linkage between the 2006 Census of the Population and the Canadian Mortality Database
Mohan Kumar and Rose Evra, Statistics Canada-
Abstract
Vital statistics datasets such as the Canadian Mortality Database lack identifiers for certain populations of interest such as First Nations, Métis and Inuit. Record linkage between vital statistics and survey or other administrative datasets can circumvent this limitation. This paper describes a linkage between the Canadian Mortality Database and the 2006 Census of the Population and the planned analysis using the linked data.
- Contribution (PDF, 186.99 KB) Archived
-
- Estimating the Impact of Active Labour Market Programs using Administrative Data and Matching Methods
Andy Handouyahia, Tony Haddad, Stéphanie Roberge and Georges Awad, Employment and Social Development Canada, Canada-
Abstract
In this paper, we discuss the impacts of Employment Benefit and Support Measures delivered in Canada under the Labour Market Development Agreements. We use rich linked longitudinal administrative data covering all LMDA participants from 2002 to 2005. We apply propensity score matching as in Blundell et al. (2002), Gerfin and Lechner (2002), and Sianesi (2004), and produce national incremental impact estimates using difference-in-differences and the kernel matching estimator (Heckman and Smith, 1999). The findings suggest that both Employment Assistance Services and employment benefits such as Skills Development and Targeted Wage Subsidies had positive effects on earnings and employment.
- Contribution (PDF, 282.26 KB) Archived
-
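A toy sketch combining the two estimators named in the abstract above: each participant's counterfactual earnings change is a kernel-weighted average of comparison cases' changes (weights decaying with propensity-score distance), and the treated-minus-counterfactual differences average to an incremental impact. All figures are invented.

```python
# Kernel-matching difference-in-differences on invented earnings data.
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) < 1, 0.75 * (1 - u ** 2), 0.0)

# Columns: propensity score, pre-programme earnings, post-programme earnings.
treated = np.array([[0.62, 30_000, 34_000], [0.41, 25_000, 27_500]])
controls = np.array([[0.60, 29_000, 30_500], [0.45, 26_000, 26_800],
                     [0.38, 24_000, 24_900], [0.70, 31_000, 32_000]])
h = 0.10  # bandwidth on the propensity-score scale

effects = []
for ps, pre, post in treated:
    k = epanechnikov((controls[:, 0] - ps) / h)
    if k.sum() == 0:
        continue  # off common support
    cf_change = np.average(controls[:, 2] - controls[:, 1], weights=k)
    effects.append((post - pre) - cf_change)

print(f"Incremental impact on earnings (ATT): {np.mean(effects):.0f}")
```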
- An Overview of Business Record Linkage at Statistics Canada: How to link the "Unlinkable"
Javier Oyarzun and Laura Wile, Statistics Canada-
Abstract
Statistics Canada's mandate includes producing statistical data to shed light on current business issues. The linking of business records is an important technique used in the development, production, evaluation and analysis of these statistical data. As record linkage can intrude on one's privacy, Statistics Canada uses the technique only when the public good is clearly evident and outweighs the intrusion.
Record linkage is experiencing a revival triggered by a greater use of administrative data in many statistical programs. There are many challenges to business record linkage: for example, many administrative files lack common identifiers, information is recorded in non-standardized formats, entries contain typographical errors, and the files are usually large, so evaluating every possible record pairing is impractical and sometimes impossible.
Due to the importance and challenges associated with record linkage, Statistics Canada has been developing a record linkage standard to help users optimize their business record linkage process. For example, this process includes building on a record linkage blocking strategy that will reduce the number of record pairs to compare and match, and creating a standard business name that will be available on Statistics Canada's Business Register. This presentation will give an overview of the business record linkage methodology and will look at the various economic projects which use record linkage at Statistics Canada; these include projects in the National Accounts, International Trade, Agriculture and the Business Register.
- Contribution (PDF, 570.42 KB) Archived
-
- Linking Canadian Patent Records from the U.S. Patent Office to Statistics Canada's Business Register, 2000 to 2011
Paul Holness, Statistics Canada-
Abstract
This paper describes the Quick Match System (QMS), an in-house application designed to match business microdata records, and the methods used to link the United States Patent and Trademark Office (USPTO) dataset to Statistics Canada's Business Register (BR) for the period from 2000 to 2011. The paper illustrates the record-linkage framework and outlines the techniques used to prepare and classify each record and evaluate the match results. The USPTO dataset consisted of 41,619 U.S. patents granted to 14,162 distinct Canadian entities. The record-linkage process matched the names, city, province and postal codes of the patent assignees in the USPTO dataset with those of businesses in the January editions of the Generic Survey Universe File (GSUF) from the BR for the same reference period. As the vast majority of individual patent assignees are not engaged in commercial activity to provide taxable property or services, they tend not to appear in the BR. The relatively poor match rate of 24.5% among individuals, compared to 84.7% among institutions, reflects this tendency. Although the 8,844 individual patent assignees outnumbered the 5,318 institutions, the institutions accounted for 73.0% of the patents, compared to 27.0% held by individuals. Consequently, this study and its conclusions focus primarily on institutional patent assignees.
The linkage of the USPTO institutions to the BR is significant because it provides access to business micro-level data on firm characteristics, employment, revenue, assets and liabilities. In addition, the retrieval of robust administrative identifiers enables subsequent linkage to other survey and administrative data sources. The integrated dataset will support direct and comparative analytical studies on the performance of Canadian institutions that obtained patents in the United States between 2000 and 2011.
- Contribution (PDF, 893.56 KB) Archived
-
- Measuring the Quality of a Probabilistic Linkage through Clerical-Reviews
Abel Dasylva, Melanie Abeysundera, Blache Akpoué, Mohammed Haddou and Abdelnasser Saïdi, Statistics Canada-
Abstract
Probabilistic linkage is susceptible to linkage errors such as missed links and false links. In many cases, these errors may be reliably measured through clerical reviews, i.e., the visual inspection of a sample of record pairs to determine whether they are matched. A framework is described to carry out such clerical reviews effectively. It is based on a probabilistic sample of pairs, repeated independent reviews of some pairs, and latent class analysis to account for clerical errors.
- Contribution (PDF, 479.95 KB) Archived
-
15:30 – 17:00
Session 8A – Paradata
- On the Utility of Paradata in Major National Surveys: Challenges and Benefits
Brady West, University of Michigan, USA and Frauke Kreuter, University of Maryland, USA-
Abstract
This presentation will begin with Dr. West providing a summary of research that has been conducted on the quality and utility of paradata collected as part of the United States National Survey of Family Growth (NSFG). The NSFG is the major national fertility survey in the U.S., and an important source of data on sexual activity, sexual behavior, and reproductive health for policy makers. For many years, the NSFG has been collecting various forms of paradata, including keystroke information (e.g., Couper and Kreuter 2013), call record information, detailed case disposition information, and interviewer observations related to key NSFG measures (e.g., West 2013). Dr. West will discuss some of the challenges of working with these data, in addition to evidence of their utility for nonresponse adjustment, interviewer evaluation, and/or responsive survey design purposes. Dr. Kreuter will then present research done using paradata collected as part of two panel surveys: the Medical Expenditure Panel Survey (MEPS) in the United States, and the Panel Labour Market and Social Security (PASS) in Germany. In both surveys, information from contacts in prior waves was experimentally used to improve contact and response rates in subsequent waves. In addition, research from PASS will be presented where interviewer observations on key outcome variables were collected for use in nonresponse adjustment or responsive survey design decisions. Dr. Kreuter will present not only the research results but also the practical challenges in implementing the collection and use of both sets of paradata.
- Contribution (PDF, 590.35 KB) Archived
-
- A Bayesian analysis of survey design parameters
Barry Schouten, Joep Burger, Lisette Bruin and Nini Mushkudiani, Statistics Netherlands, Netherlands-
Abstract
In the design of surveys, a number of parameters like contact propensities, participation propensities and costs per sample unit play a decisive role. In ongoing surveys, these survey design parameters are usually estimated from previous experience and updated gradually with new experience. In new surveys, they are estimated from expert opinion and experience with similar surveys. Although survey institutes have considerable expertise and experience, the postulation, estimation and updating of survey design parameters is rarely done in a systematic way.
This paper presents a Bayesian framework to include and update prior knowledge and expert opinion about the parameters. This framework is set in the context of adaptive survey designs in which different population units may receive different treatment given quality and cost objectives. For this type of survey, the accuracy of design parameters becomes even more crucial to effective design decisions.
The framework allows for a Bayesian analysis of the performance of a survey during data collection and in between waves of a survey. We demonstrate the Bayesian analysis using a realistic simulation study.
- Contribution (PDF, 1.04 MB) Archived
-
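One concrete instance of the ingredients in such a framework, under strong simplifying assumptions: a single contact propensity with a Beta prior elicited from expert opinion, updated conjugately as contact attempts accumulate during fieldwork. The numbers below are invented.

```python
# Conjugate Beta-Binomial updating of a contact propensity (toy fieldwork data).
alpha, beta_ = 6.0, 4.0  # prior: expert expects roughly a 60% contact rate

fieldwork = [(40, 22), (35, 18), (50, 33)]  # (attempts, contacts) per day
for attempts, contacts in fieldwork:
    alpha += contacts
    beta_ += attempts - contacts
    mean = alpha / (alpha + beta_)
    print(f"posterior mean contact propensity: {mean:.3f}")
```

Adaptive designs can then re-evaluate, after each update, which subgroups should receive additional or different treatment given the quality and cost objectives.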
- Statistics Canada's Experiences in Using Paradata to Manage Responsive Collection Design for Computer-Assisted Telephone Interview Household Surveys
François Laflamme, Sylvain Hamel and Dominique Chabot-Hallé, Statistics Canada-
Abstract
Paradata research has focused on identifying opportunities for strategic improvement in data collection that could be operationally viable and lead to enhancements in quality or cost efficiency. To that end, Statistics Canada has developed and implemented a responsive collection design (RCD) strategy for computer-assisted telephone interview (CATI) household surveys to maximize quality and efficiency and to potentially reduce costs. RCD is an adaptive approach to survey data collection that uses information available prior to and during data collection to adjust the collection strategy for the remaining in-progress cases. In practice, the survey managers monitor and analyze collection progress against a predetermined set of indicators for two purposes: to identify critical data-collection milestones that require significant changes to the collection approach and to adjust collection strategies to make the most efficient use of remaining available resources. In the RCD context, numerous considerations come into play when determining which aspects of data collection to adjust and how to adjust them. Paradata sources play a key role in the planning, development and implementation of active management for RCD surveys. Since 2009, Statistics Canada has conducted several RCD surveys. This paper describes Statistics Canada's experiences in implementing and monitoring this type of survey.
- Contribution (PDF, 509.81 KB) Archived
-
15:30 – 17:00
Session 8B – Use of Administrative Data
- Redesign of the longitudinal immigration database (IMDB)
Rose Evra, Statistics Canada-
Abstract
The Longitudinal Immigration Database (IMDB) combines the Immigrant Landing File (ILF) with annual tax files. This record linkage is performed using a tax filer database. The ILF includes all immigrants who have landed in Canada since 1980. In looking to enhance the IMDB, the possibility of adding temporary residents (TR) and immigrants who landed between 1952 and 1979 (PRE80) was studied. Adding this information would give a more complete picture of the immigrant population living in Canada. To integrate the TR and PRE80 files into the IMDB, record linkages between these two files and the tax filer database were performed. This exercise was challenging in part due to the presence of duplicates in the files and conflicting links between the different record linkages.
- Contribution (PDF, 298.82 KB) Archived
-
- Creating a longitudinal database based on linked administrative registers: An example
Philippe Wanner, Université de Genève et NCCR On The Move, Switzerland and Ilka Steiner, Université de Genève, Switzerland-
Abstract
This paper describes the creation of a database developed in Switzerland to analyze migration and the structural integration of the foreign national population. The database is created from various registers (register of residents, social insurance, unemployment) and surveys, and covers 15 years (1998 to 2013). Information on migration status and socioeconomic characteristics is also available for nearly 4 million foreign nationals who lived in Switzerland between 1998 and 2013. This database is the result of a collaboration between the Federal Statistics Office and researchers from the National Center of Competence in Research (NCCR)–On the Move.
- Contribution (PDF, 167.29 KB) Archived
-
- Use of Administrative Data to Increase the Efficiency of the Sample Design for the New National Travel Survey
Charles Choi, Statistics Canada-
Abstract
As part of the Tourism Statistics Program redesign, Statistics Canada is developing the National Travel Survey (NTS) to collect travel information from Canadian travellers. This new survey will replace the Travel Survey of Residents of Canada and the Canadian resident component of the International Travel Survey. The NTS will take advantage of Statistics Canada's common sampling frames and common processing tools while maximizing the use of administrative data. This paper discusses the potential uses of administrative data such as Passport Canada files, Canada Border Service Agency files and Canada Revenue Agency files, to increase the efficiency of the NTS sample design.
- Contribution (PDF, 221.23 KB) Archived
-
- Using Administrative Data to Study Education in Canada
Martin Pantel, Statistics Canada-
Abstract
The Educational Master File (EMF) system was built to allow the analysis of educational programs in Canada. At the core of the system are administrative files that record all of the registrations to post-secondary and apprenticeship programs in Canada. New administrative files become available on an annual basis. Once a new file becomes available, a first round of processing is performed, which includes linkage to other administrative records. This linkage yields information that can improve the quality of the file, allows further linkages to other data describing labour market outcomes, and constitutes the first step in adding the file to the EMF. Once part of the EMF, information from the file can be included in cross-sectional and longitudinal projects, to study academic pathways and labour market outcomes after graduation. The EMF currently consists of data from 2005 to 2013, but it evolves as new data become available. This paper gives an overview of the mechanisms used to build the EMF, with a focus on the structure of the final system and some of its analytical potential.
- Contribution (PDF, 201.86 KB) Archived
-
Thursday, March 24, 2016
8:00 – 12:00
Registration – Third floor
8:45 – 10:15
Session 9A – Scanner Data
- Challenges Associated with Using Scanner Data for the Consumer Price Index
Catherine Deshaies-Moreault and Nelson Émond, Statistics Canada-
Abstract
Practically all major retailers use scanners to record the information on their transactions with clients (consumers). These data normally include the product code, a brief description, the price and the quantity sold. This is an extremely relevant data source for statistical programs such as Statistics Canada's Consumer Price Index (CPI), one of Canada's most important economic indicators. Using scanner data could improve the quality of the CPI by increasing the number of prices used in calculations, expanding geographic coverage and including the quantities sold, among other things, while lowering data collection costs. However, using these data presents many challenges. An examination of scanner data from an initial retailer revealed a high rate of change in product identification codes over a one-year period. The effects of these changes pose challenges from a product classification and estimate quality perspective. This article focuses on the issues associated with acquiring, classifying and examining these data to assess their quality for use in the CPI.
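The rate of change in product codes can be quantified directly, for example as the share of codes observed in one month that are still observed twelve months later. A toy calculation on made-up data (real scanner files are far larger and messier):

```python
# Toy measure of product-code churn: one-year survival rate of codes
# (hypothetical data, for illustration only).

codes_jan_2015 = {"A1", "A2", "B7", "C3", "D9"}
codes_jan_2016 = {"A1", "B7", "E4", "F2", "G8"}

survivors = codes_jan_2015 & codes_jan_2016
survival_rate = len(survivors) / len(codes_jan_2015)
print(f"One-year code survival: {survival_rate:.0%}")  # 40% here
```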
- Contribution (PDF, 475.07 KB) Archived
-
- The QU-method: A new methodology for processing scanner data
Antonio G. Chessa, Statistics Netherlands, Netherlands-
Abstract
This paper presents a new price index method for processing electronic transaction (scanner) data. Price indices are calculated as the ratio of a turnover index and a weighted quantity index. Product weights of the quantities sold are computed from the deflated prices of each month in the current publication year. New products can be incorporated in a timely manner, without price imputation, so that all transactions can be processed. Product weights are updated monthly and are used to calculate direct indices with respect to a fixed base month. By construction, the price indices are free of chain drift. The results are robust under departures from the methodological choices. The method has been part of the Dutch CPI since January 2016, when it was first applied to mobile phones.
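Schematically, the abstract's description corresponds to an index of the following form (a reconstruction in my own notation, not the paper's): with $p_{i,t}$ and $q_{i,t}$ the price and quantity of product $i$ in month $t$, base month $0$, and product weights $v_i$ derived from deflated prices over the publication year,

$$
P_{0,t} \;=\; \frac{\sum_i p_{i,t}\, q_{i,t} \,\Big/\, \sum_i p_{i,0}\, q_{i,0}}
{\sum_i v_i\, q_{i,t} \,\Big/\, \sum_i v_i\, q_{i,0}}
\;=\; \frac{\text{turnover index}}{\text{weighted quantity index}}.
$$

Since the $v_i$ depend on the deflated prices $p_{i,t}/P_{0,t}$, the weights and indices are, on this reading, computed together as the monthly updates proceed; new products simply enter the sums once they are observed, with no imputed prices required.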
- Contribution (PDF, 421.01 KB) Archived
-
- A look into the future
Muhanad Sammar, Statistics Sweden, Sweden-
Abstract
The fact that the world is in continuous change and that new technologies are becoming widely available creates new opportunities and challenges for National Statistical Institutes (NSIs) worldwide. What if NSIs could access vast amounts of sophisticated data for free (or at low cost) from enterprises? Could this enable NSIs to disseminate more accurate indicators for policy-makers and users, significantly reduce the response burden on companies, reduce costs for the NSIs and, in the long run, improve the living standards of the people in a country? The time has now come for NSIs to find the best practice to align legislation, regulations and practices in relation to scanner data and big data. Without common ground, the prospect of reaching consensus is unlikely. The discussions need to start with how to define quality. If NSIs define and approach quality differently, this will lead to a highly undesirable situation, as NSIs will move further away from harmonisation. Sweden was one of the leading countries in putting these issues on the agenda for European cooperation: in 2012, Sweden implemented scanner data in its national Consumer Price Index after research studies and statistical analyses showed that scanner data were significantly better than the manually collected data.
- Contribution (PDF, 397.68 KB) Archived
-
8:45 – 10:15
Session 9B – Health Data
- Comparing Canada's Healthcare System: Benefits and Challenges
Katerina Gapanenko, Grace Cheung, Deborah Schwartz and Mark McPherson, Canadian Institute for Health Information (CIHI), Canada-
Abstract
Background: There is increasing interest in measuring and benchmarking health system performance. We compared Canada's health system with those of other countries in the Organisation for Economic Co-operation and Development (OECD), at both the national and provincial levels, across 50 indicators of health system performance. This analysis can help provinces identify potential areas for improvement and choose an appropriate comparator for international comparisons.
Methods: OECD Health Data from 2013 was used to compare Canada's results internationally. We also calculated provincial results for the OECD indicators of health system performance, using OECD methodology. We normalized the indicator results to present multiple indicators on the same scale and compared them with the OECD average and the 25th and 75th percentiles.
Results: Presenting normalized values allows Canada's results to be compared across multiple OECD indicators on the same scale. No country or province consistently has higher results than the others. For most indicators, Canadian results are similar to those of other countries, but there remain areas where Canada performs particularly well (e.g., smoking rates) or poorly (e.g., patient safety). This data was presented in an interactive eTool.
Conclusion: Comparing Canada's provinces internationally can highlight areas where improvement is needed and help identify potential strategies for improvement.
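The normalization step in the Methods can be made concrete. One common choice, assumed here for illustration (the abstract does not specify CIHI's exact transformation), is to standardize each result against the OECD distribution so that all indicators share a scale:

```python
# Sketch of putting indicator results on a common scale against the OECD
# distribution (the z-score transformation is an illustrative assumption;
# the abstract does not specify CIHI's exact normalization).

import statistics

def normalize(value, oecd_values):
    """Express a result in standard deviations from the OECD mean."""
    mean = statistics.mean(oecd_values)
    sd = statistics.stdev(oecd_values)
    return (value - mean) / sd

oecd_smoking = [14.0, 16.5, 19.2, 22.1, 24.8, 18.3]   # hypothetical rates
canada = 15.7                                         # hypothetical result
q1, _, q3 = statistics.quantiles(oecd_smoking, n=4)
print(f"z = {normalize(canada, oecd_smoking):+.2f}; "
      f"OECD 25th-75th percentile band = [{q1:.1f}, {q3:.1f}]")
```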
- Contribution (PDF, 2.55 MB) Archived
-
- A Systematic Review: Evaluating Extant Data Sources for Potential Linkage
Erin Tanenbaum, NORC at the University of Chicago, USA; Michael Sinclair, Mathematica Policy Research, USA; Jennifer Hasche, NORC at the University of Chicago, USA and Christina Park, National Institute of Child Health and Human Development (NICHD), USA-
Abstract
The National Children's Study Vanguard Study was a pilot epidemiological cohort study of children and their parents. Measures were to be taken from pre-pregnancy until adulthood. The use of extant data was planned to supplement direct data collection from the respondents. Our paper outlines a strategy for cataloging and evaluating extant data sources for use with large-scale longitudinal studies. Through our review, we selected five evaluation factors to guide researchers through available data sources: 1) relevance, 2) timeliness, 3) spatiality, 4) accessibility and 5) accuracy.
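The five factors lend themselves to a simple per-source scoring record; the miniature below is a hypothetical sketch (the 1-to-5 scale and the averaging rule are mine, not the paper's):

```python
# Hypothetical scoring record for one extant data source on the five
# factors named above (the 1-5 scale is illustrative, not the paper's).

from dataclasses import dataclass

@dataclass
class SourceEvaluation:
    name: str
    relevance: int       # 1 (poor) to 5 (excellent)
    timeliness: int
    spatiality: int
    accessibility: int
    accuracy: int

    def overall(self) -> float:
        scores = (self.relevance, self.timeliness, self.spatiality,
                  self.accessibility, self.accuracy)
        return sum(scores) / len(scores)

ev = SourceEvaluation("county vital statistics", 5, 3, 4, 2, 4)
print(f"{ev.name}: {ev.overall():.1f}/5")   # 3.6/5
```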
- Contribution (PDF, 144.13 KB) Archived
-
- Providing Meaningful and Actionable Health System Performance Information: CIHI's ‘Your Health System’ Tools
Jeanie Lacroix and Kristine Cooper, Canadian Institute for Health Information (CIHI), Canada-
Abstract
How can we bring together multidimensional health system performance data in a simplified way that is easy to access and provides comparable and actionable information to accelerate improvements in health care? The Canadian Institute for Health Information has developed a suite of tools to meet performance measurement needs across different audiences, to identify improvement priorities, understand how health regions and facilities compare with peers and support transparency and accountability. The pan-Canadian Your Health System (YHS) tools consolidate reporting of 45 key performance indicators in a structured way and are comparable over time and at different geographic levels. This paper outlines the development of the tools and the methodological approaches and considerations taken to create a dynamic tool that facilitates benchmarking and meaningful comparisons for health system performance improvement.
- Contribution (PDF, 452.61 KB) Archived
-
- Epidemiological observatory on Brazilian health data
Raphael de Freitas Saldanha and Ronaldo Rocha Bastos, Universidade Federal de Juiz de Fora, Brazil-
Abstract
The Unified Brazilian Health System (SUS) was created in 1988 and, with the aim of organizing the health information systems and databases already in use, a unified databank (DataSUS) was created in 1991. DataSUS files are freely available via the Internet. Access to and visualization of these data are done through a limited number of customized tables and simple diagrams, which do not entirely meet the needs of health managers and other users for a flexible and easy-to-use tool that can tackle the different aspects of health relevant to their purposes of knowledge-seeking and decision-making. We propose the interactive monthly generation of synthetic epidemiological reports that are not only easily accessible but also easy to interpret and understand. Emphasis is placed on data visualization through more informative diagrams and maps.
- Contribution (PDF, 211.79 KB) Archived
-
- Data surveillance on the clinical data used for health system funding in Ontario
Lori Kirby and Maureen Kelly, Canadian Institute for Health Information (CIHI), Canada-
Abstract
Several Canadian jurisdictions, including Ontario, are using patient-based healthcare data in their funding models. These initiatives can influence the quality of this data both positively and negatively, as people tend to pay more attention to the data and its quality when financial decisions are based upon it.
Ontario's funding formula uses data from several national databases housed at the Canadian Institute for Health Information (CIHI). These databases provide information on patient activity and clinical status across the continuum of care. As funding models may influence coding behaviour, CIHI is collaborating with the Ontario Ministry of Health and Long-Term Care to assess and monitor the quality of this data.
CIHI is using data mining software and modelling techniques (often associated with "big data") to identify data anomalies across multiple factors. The models identify the "typical" clinical coding patterns for key patient groups (for example, patients seen in special care units or discharged to home care) so that outliers, where patients do not fit the expected pattern, can be identified. A key component of the modelling is segmenting the data by patient, provider and hospital characteristics to take into account key differences in the delivery of health care and in patient populations across the province.
CIHI's analysis identified several hospitals with coding practices that appear to be changing or significantly different from their peer group. Further investigation is required to understand why these differences exist and to develop appropriate strategies to mitigate variations.
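As a generic illustration of the segmentation-plus-outlier idea (not CIHI's actual models, which are far more elaborate), one can compare each hospital's coding rate with its peer group using a robust distance:

```python
# Generic sketch of peer-group outlier detection for coding rates
# (illustrative rule only; not CIHI's production models).

import statistics

def flag_outliers(rates_by_hospital, k=3.0):
    """Flag hospitals more than k robust deviations from the peer-group
    median, using a MAD-based rule."""
    values = list(rates_by_hospital.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    return {h: abs(r - med) / (1.4826 * mad) > k
            for h, r in rates_by_hospital.items()}

peer_group = {"H1": 0.12, "H2": 0.11, "H3": 0.13, "H4": 0.31}
print(flag_outliers(peer_group))  # only H4 is flagged
```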
- Contribution (PDF, 1.46 MB) Archived
-
10:45 – 11:45
Session 10 – Plenary Session
- Data Science for Dynamic Data Systems: Implications for Official Statistics
Mary E. Thompson, University of Waterloo, Canada-
Abstract
Many of the challenges and opportunities of modern data science have to do with dynamic aspects: evolving populations, the growing volume of administrative and commercial data on individuals and establishments, continuous flows of data and the capacity to analyze and summarize them in real time, and the deterioration of data absent the resources to maintain them. With its emphasis on data quality and supportable results, the domain of Official Statistics is ideal for highlighting statistical and data science issues in a variety of contexts. The messages of the talk include the importance of population frames and their maintenance; the potential for use of multi-frame methods and linkages; how the use of large scale non-survey data as auxiliary information shapes the objects of inference; the complexity of models for large data sets; the importance of recursive methods and regularization; and the benefits of sophisticated data visualization tools in capturing change.
- Contribution (PDF, 309.76 KB) Archived
-
11:45 – 12:00
Session 11 – Closing remarks
Claude Julien, Director General of the Methodology Branch, Statistics Canada
Poster session
Wednesday, March 23, 2016
10:00 – 10:30, 13:00 – 13:30, 15:00 – 15:30
- Handling survey feedback in business statistics
Jörgen Brewitz, Eva Elvers and Fredrik Jonsson, Statistics Sweden, Sweden-
Abstract
The Swedish business register is updated continuously from several different sources. Coordinated samples are drawn from frames that are built from the business register. Coordination is done over time and between different surveys, using permanent random numbers. This technique has many benefits but suffers from a drawback concerning the updating of register data. Survey feedback means information on sampled units fed back from a sample survey to a register that is used to build a frame for future surveys. It may seem obvious that a register should be updated with survey feedback in order to make it as accurate as possible. However, methodological problems may occur due to the dependence between samples.
This paper explores the bias that survey feedback introduces into estimators and points out some ways to reduce it. The feedback comprises data on status, business size, sector and industry, as well as contact information. We study the effect of survey feedback on the sampling design, on the auxiliary information used for estimation and on the distribution by domains of study. Industry updates in the register, and thereby in the frame, are shown to bias the estimators.
It seems hard to adjust the estimators for the presence of survey feedback. Another approach is to implement source and time stamps in the business register. Survey feedback can then be used for contact information and distribution by domains of study, but removed when frames are created for sampling purposes. Estimation is then not disturbed by survey feedback.
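Both ingredients of the proposal can be sketched compactly: each unit keeps a permanent random number across frames, and feedback-stamped values are set aside when the frame is built. The sketch below uses a data layout and field names of my own, not Statistics Sweden's system.

```python
# Minimal sketch of PRN-coordinated sampling with survey feedback excluded
# at frame creation (hypothetical data layout, not Statistics Sweden's system).

import random

rng = random.Random(2016)

# The register stores a source stamp alongside each register value.
register = {
    "unit_A": {"prn": rng.random(), "industry": ("47", "admin")},
    "unit_B": {"prn": rng.random(), "industry": ("56", "feedback")},
    "unit_C": {"prn": rng.random(), "industry": ("25", "admin")},
}

def build_frame(register):
    """Carry register values into the frame only if they did not come from
    survey feedback; feedback-sourced values are set aside for sampling."""
    frame = {}
    for unit, rec in register.items():
        value, source = rec["industry"]
        frame[unit] = {"prn": rec["prn"],
                       "industry": value if source != "feedback" else None}
    return frame

def coordinated_sample(frame, fraction):
    """PRN sampling: every survey selecting units with PRN below the same
    threshold hits the same units, which coordinates the samples."""
    return sorted(u for u, rec in frame.items() if rec["prn"] < fraction)

print(coordinated_sample(build_frame(register), fraction=0.5))
```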
-
- Measuring Data Quality of Price Indexes: the Producer Prices Division's Performance Measure Grading Scheme
Kate Burnett-Isaacs, Statistics Canada-
Abstract
The Producer Prices Division (PPD) has developed a Performance Measure Grading Scheme to evaluate each PPD index on key performance indicators, in order to promote sound methodological practices and convey the overall quality and reliability of published index numbers. This grading scheme was developed to meet the recommendations of the agency-wide Quality Assurance Review Committee and to support the divisional Performance Measurement Strategy. Its components were drawn from the OECD Generic Statistical Business Process Model and Statistics Canada's six dimensions of quality. Assessing the quality of an index is multi-faceted because of the complexities of index numbers and calculations and the different components of index compilation. An index number comprises price relatives, weights and a variety of treatments applied to these data. The quality of an index must be assessed on the individual parts as well as on the whole. The grading scheme intends to capture this and provide a measure of quality for the entire index as well as its individual components, ranging from a qualitative conceptual assessment to a quantitative processing perspective. PPD produces 25 indexes that cover a wide scope of the business sector, including goods production and manufacturing, construction, and financial, transportation and professional services. These industries each have their own sources of data and standards of price measurement. The diversity of PPD's index coverage brings complexity to the development of a standard method of assessing data quality. This paper will discuss the complexities of measuring data quality for indexes, explain the development of the grading scheme and the choice of performance measures, and discuss the difficulties and considerations involved in making a standard measure of quality that covers a broad range of industries and data sources.
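Mechanically, a grading scheme of this kind reduces to scoring components and aggregating them. A hypothetical miniature follows; the component names, weights and grade bands are mine, not PPD's.

```python
# Hypothetical miniature of a component-based index grading scheme
# (components, weights and grade cut-offs are illustrative, not PPD's).

components = {"concepts": 4.5, "price_relatives": 3.8,
              "weights": 4.0, "processing": 3.2}    # scores out of 5
weights = {"concepts": 0.2, "price_relatives": 0.3,
           "weights": 0.3, "processing": 0.2}

overall = sum(components[c] * weights[c] for c in components)
grade = "A" if overall >= 4.0 else "B" if overall >= 3.0 else "C"
print(f"overall = {overall:.2f} -> grade {grade}")   # 3.88 -> grade B
```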
-
- Creation and use of large synthetic data sets in the 2021 Census Transformation Programme
Cal Ghee, Rob Rendell, Orlaith Fraser, Steve Rogers, Fern Leather, Keith Spicer and Peter Youens, Office for National Statistics, United Kingdom-
Abstract
The Census Transformation Programme of the Office for National Statistics in England and Wales is investigating the use of synthetic data for a number of purposes. One potential application is the creation of a household microdata sample. Due to disclosure concerns, standard disclosure control techniques have not allowed us to provide a 2011 Census household microdata sample that is accessible outside secure research environments and still has sufficient utility for users.
The method being tested to create a household microdata file punches holes in a microdata sample and uses the edit and imputation process from the 2011 Census (using CANCEIS) to fill in the holes. This attempts to preserve the relationships between variables within households and individuals, while introducing sufficient uncertainty to mitigate disclosure risk. Methods are being investigated to test utility and risk in the resulting data.
This poster will demonstrate the issues we have to overcome and the methods we are investigating to provide useful, non-disclosive microdata to a wider range of users. Given these issues, this is just a proof of concept at this stage, to test whether the approach is feasible.
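The hole-punching step itself is straightforward to illustrate. In the sketch below, a trivial random-donor imputation stands in for the CANCEIS edit-and-imputation step, and the variables are made up:

```python
# Sketch of "punching holes" in a microdata sample and refilling them from
# donors (a toy stand-in for CANCEIS edit and imputation; the variables and
# the donor rule are illustrative only).

import random

random.seed(42)
sample = [{"age": 34, "tenure": "own"}, {"age": 61, "tenure": "rent"},
          {"age": 28, "tenure": "rent"}, {"age": 45, "tenure": "own"}]

HOLE_RATE = 0.5
for record in sample:
    for var in list(record):
        if random.random() < HOLE_RATE:
            record[var] = None                    # punch a hole

for record in sample:
    for var, value in list(record.items()):
        if value is None:
            donors = [r[var] for r in sample if r[var] is not None]
            if donors:                            # leave the hole if no donor
                record[var] = random.choice(donors)

print(sample)
```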
-
Software demonstration
Wednesday, March 23, 2016
10:00 – 10:30, 13:00 – 13:30, 15:00 – 15:30
- The use of a SAS Grid at Statistics Canada
Yves Deguire, Statistics Canada-
Abstract
A SAS grid is a sophisticated computing platform that provides load balancing, high availability, and scalability. This presentation will demystify the SAS grid and show how it has been deployed at Statistics Canada to support a large number of users as well as to perform huge amounts of statistical processing. Several use cases will also be proposed.
-
- SAS® High-Performance Forecasting Software at Statistics Canada
Frédéric Picard, Statistics Canada-
Abstract
Statistics Canada recently started to use SAS® High-Performance Forecasting (HPF). SAS® HPF is a large-scale automatic system that can evaluate and select appropriate models and rapidly generate a large number of time series forecasts. It can be used by writing SAS code or through its graphical user interface. We will give an overview of some of the system's features and useful options, along with a presentation of the graphical user interface. We will also briefly describe examples of projects at Statistics Canada that have already benefitted from the software.
-
- High Performance Analytics – How SAS can help you save time and make better decisions with modern analytics!
Steve Holder, SAS Canada, Canada-
Abstract
What would you do with an extra 269 minutes? The SAS High-Performance Analytics (HPA) framework helps organizations move from dated and inefficient processes to modern analytics; in one case, it reduced the time needed to make critical business decisions from 4.5 hours to just 60 seconds. Grab your coffee and come meet Steve Holder, National Lead, Analytics, SAS Canada. As an analytics practitioner, you will find out how you can make decisions in real time, transform your big data into relevant business value, and do all of this in an easy-to-use, governed way with the SAS analytics portfolio.
-
- Machine learning in the service of official statistics
Valentin Todorov, United Nations Industrial Development Organization (UNIDO), Austria-
Abstract
Machine learning (ML) is a very popular, data-intensive computer science discipline. It is fairly generic and can be applied in various settings; however, applications in official statistics have become known only recently. To shed light on this, to identify the techniques that have been explored, and to investigate the opportunities for extending the links between official statistics and machine learning in particular, and data science in general, a survey across national statistical offices was recently conducted by Statistics Canada. A paper presented at the workshop of the Modernisation Committee on Production and Methods in 2014 gave an overview of the machine learning techniques currently in use or under consideration at statistical agencies worldwide, and outlined the main reasons why statistical agencies should start exploring the use of machine learning techniques. Among the best choices of software tools for the practical implementation of ML algorithms are Python and R. The purpose of this contribution is to present an update of the survey mentioned above, to map its findings to R packages currently available in the public domain and to sketch a possible way forward. A mini-tutorial of R packages will be presented, with illustrations from several applications: automatic coding of item responses, outlier detection and imputation, and record linkage, all of which reduce the manual examination of records.
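Of the applications listed, automatic coding of item responses is the easiest to illustrate compactly. The sketch below uses Python and scikit-learn (rather than R), with made-up occupation descriptions and codes; it stands in for the general technique, not any particular agency's coder.

```python
# Minimal sketch of automatic coding of free-text item responses
# (hypothetical training data and model choice; real coders use far
# larger corpora and controlled classifications).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

responses = ["drives a delivery truck", "teaches high school math",
             "long-haul truck driver", "primary school teacher"]
codes = ["7511", "4031", "7511", "4031"]   # made-up occupation codes

coder = make_pipeline(TfidfVectorizer(), LogisticRegression())
coder.fit(responses, codes)
print(coder.predict(["secondary school mathematics teacher"]))
```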
-
- Common Statistical Production Architecture and "confidentiality-on-the-fly"
Robert McLellan and Predrag Mizdrak, Statistics Canada-
Abstract
The Common Statistical Production Architecture (CSPA) is a framework for "plug and play" statistical components. This presentation will showcase the work done at Statistics Canada in the CSPA context on a confidentiality tool developed by the Australian Bureau of Statistics (ABS). "Confid-on-the-fly" is an analytical tool in which disclosure control is applied to results automatically and instantly. The presenters will illustrate how they turned the R-code implementation from the ABS into a set of services that allow model exploration and generation by researchers with access to confidential microdata.
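The "confidentiality on the fly" idea can be conveyed generically: a service intercepts an aggregate before release and suppresses or perturbs it. The sketch below is a generic stand-in with rules of my own choosing; it is not the ABS algorithm or the Statistics Canada services.

```python
# Generic stand-in for a confidentiality-on-the-fly service: suppress or
# perturb an aggregate before it is returned to the researcher
# (illustrative rules only; not the ABS method or Statistics Canada's services).

import random

MIN_CELL = 5      # suppress cells with too few contributors
NOISE_SD = 2.0    # magnitude of the random perturbation

def confidentialize(count, contributors, rng=random.Random(1)):
    if contributors < MIN_CELL:
        return None                       # primary suppression
    noisy = count + rng.gauss(0, NOISE_SD)
    return max(0, round(noisy))           # perturbed, non-negative integer

print(confidentialize(120, contributors=48))  # released, slightly perturbed
print(confidentialize(3, contributors=3))     # suppressed -> None
```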
-