01A-1 - Statistics without surveys? About the past, present and future of data collection in The Netherlands
02A-1 - The OCTOPUSSE System of the New Master Sample of the Institut National de la Statistique et des Études Économiques (INSEE)
02A-2 - The National Statistical Registry on Persons in Italy: the project, the potentiality
02A-3 - Cycle 2 of the Canadian Health Measures Survey: Combining Census and administrative data to improve the efficiency of the survey frame
02B-1 - Accounting for temporal effects, sampling variability, incomplete data and reporting error in the integration of consumer expenditure data from survey and administrative-record sources
02B-2 - The potential of administrative education data used to measure subnational migration in England
02B-3 - From Neighbourhood Statistics to Beyond 2011: Making better use of administrative data
02B-4 - Validating financial data using tax registers
02C-2 - Using administrative data to find the best medium: Examples of mixed sources and mixed modes
02C-3 - Accounting for the effects of data collection modes in population surveys
03A-1 - How survey data and LifePaths' microsimulation model can interact to produce policy-relevant results
03A-2 - Using multiple data sources to project U.S. Social Security finances and the federal budget in the long term
03A-3 - Integrating infectious and chronic disease information to model cervical cancer
03B-1 - Building the Dwelling Frame in Gauteng Province using a combination of statistical methods, namely census, administrative and survey data
03B-2 - Improving and using the Statistical Administrative Records System
03B-3 - Use of the Canada Child Tax Benefit as a frame for the Survey of Young Canadians
03B-4 - Targeting populations of interest in the sample design using the Census of Population: An example from the Ontario Survey on the Prevalence of Hypertension
03C-2 - Using a ‘community survey’ and previous census estimates to produce composite estimators for the South African Index of Multiple Deprivation 2007
03C-3 - Using complete administration data for non-response analysis: The PASS survey of low-income households in Germany
04A-1 - Operational Needs versus Data Requirements: Challenges in Providing Policing Data to Statistics Canada
04A-2 - The Impact of Changing Concepts on Estimates from Administrative Data: The Case of the Integrated Criminal Courts Survey
04A-3 - Opportunities and Challenges in Using Administrative Justice Data for Longitudinal and Network Research
04B-2 - Use of address databases as a basis for household frames: Confirmation bias in housing unit listing
04B-3 - Experiences using the Census as a sampling frame: The case of the Access and Support to Education and Training Survey
04B-4 - Using administrative data sources to target sampling in the Wealth and Assets Survey in the UK
04C-1 - Impact of identifying information requested on consent rates
04C-2 - When the Census and administrative data combine to illustrate social inequalities in health in Canada
04C-3 - Indigenous life expectancy using multiple Australian data sources
04C-4 - Census data linkage: 1991 Census of Population, Canadian Mortality Database, and Canadian Cancer Database
05A-1 - The organisation of statistical methodology and methodological research in national statistical offices
06A-1 - Glink - A Probabilistic Record Linkage System
06A-2 - Record Linkage Methods and Techniques as Proposed in RELAIS
06A-3 - Obtaining Estimates of Parameters for Probabilistic Record Linkage Using the EM Algorithm
06B-1 - The optimal estimator in dual frame surveys using indirect sampling
06B-2 - Demographic analysis in 2010: Methodological challenges
06B-3 - Dual-frame weighting and estimation challenges for the Canadian Community Health Survey
06C-1 - The Research Data Centre-in-Research Data Centre approach
06C-2 - The Canadian Forces Portrait: A case study on integrating administrative and survey data to develop a personnel management resource
07A-1 - Imputing Underreported Treatments Using Multiple Sources of Treatment Information in a Cancer Services Study
07A-2 - Proxy Pattern-Mixture Analysis for Survey Non-response
07A-3 - Calibrated Robust Imputation in Surveys
07B-1 - Methodology in the Swedish register-based census
07B-2 - Making better use of administrative data: How far can New Zealand go?
07B-3 - First register-based census in Slovenia: How to convert administrative sources to statistics
07C-1 - Historical linkage of tax data on labour (1990-2007): a case applied to the pilot survey of the Canadian Household Panel Survey
07C-2 - Creation of a child-centric inter-agency data warehouse: The Longitudinal Study of Early Development
07C-3 - The interplay between administrative data and survey data in the Early School Leaver Monitor for the City of Rotterdam
08A-1 - Combining Information from Two Independent Surveys through the Pseudo Empirical Likelihood Method
08A-2 - Estimation of Correlations between Cross-sectional Estimates from Repeated Surveys: An Application to the Variance of Change and the Variance of the Composite Estimator
08A-3 - Use of Auxiliary Sources in Weighting: The Case of Integrating Social Survey Data
08B-1 - Integrating census and household surveys listing operations
08B-2 - Outline of the post-censal surveys program
08B-3 - An Integrated Approach to Collecting Census Data in Canada
08B-4 - The 2009 Census Test supplementary study of the new census collection methodology
08C-1 - An empirical comparison of approaches to approximate string matching in private record linkage
08C-2 - GLink – Constructing an avatar
08C-3 - Privacy-preserving record linkage using Bloom filters
09A-1 - Estimation of Average Design-based Mean Squared Error of Synthetic Small Area Estimators
09A-2 - Estimation of Poverty Measures for Small Areas
09A-3 - Alternative Forms of Small Area Models for Using Survey, Census and Administrative Records Data
09B-1 - Combining administrative and survey data in a German research data centre: Linkage, quality and future developments
09B-2 - Employee flows to study firm and employment dynamics
09B-3 - Dealing with inconsistencies in linking administrative data
09B-4 - You can match my data! Biasing effect in the use of linked administrative and survey data
09C-1 - Toward a more robust non-response propensity model: The interplay among NAEP restricted-use data, contact history information and school administrative data
09C-2 - Multivariate fractional imputation in survey sampling
09C-3 - Low response rates and sample representativeness in an online survey of college students
10A-1 - A Platform of Services Allowing Remote Access to Research Data of Interest
10A-2 - The Real Time Remote Access at Statistics Canada
10A-3 - Methods for Permitting Differential Access to Sensitive Data while Maintaining Confidentiality
10B-1 - Measuring Alberta’s shadow populations
10B-2 - More for less? Using statistical modelling to combine existing data sources to produce sounder, more detailed, and less expensive Official Statistics
10B-3 - Estimation of monthly unemployment figures in a rotating panel; the use of auxiliary series in a structural time series model
10B-4 - Comparison of intercensal updating techniques for local level poverty statistics
10C-1 - Use of mental health services in Québec according to population survey data and administrative data: Study on the under-reporting and over-reporting of services
10C-2 - Comparing the Uniform Crime Reporting Survey and the General Social Survey on Victimization, 2010
10C-3 - School education data in India: Emerging issues in increasing multiplication of sources of data
10C-4 - Reconciling conflicting administrative and survey data reports of diabetes status in a Longitudinal Study of Older Americans
11A-1 - Use of the Census as a Dual Survey Frame in the Canadian Survey of Household Spending
11A-2 - The Contribution of Hashing to the Treatment of Surveys and Administrative Files in France
11A-3 - American Community Survey Design and Statistical Methodology
11C-2 - Linking the National Health and Nutrition Examination Survey to Food Stamp Assistance Program administrative data: A one-state pilot study
11C-3 - Quality and quantity: Using administrative data for scientific purposes in labour market research
12A-1 - A Strategic Vision for the use of Administrative Data at Statistics Canada
Jelke Bethlehem, Statistics Netherlands
Registers play an increasingly important role in the production of statistics by Statistics Netherlands. From the very beginning, the population register has formed the backbone of a system of social statistics. It is used as a source of data for population statistics, as a sampling frame and as a source of auxiliary information for weighting adjustment.
The census of 1971 was the last traditional census in The Netherlands. Since then, virtual censuses have been conducted. Statistics Netherlands has developed the Social Statistical Database (SSD), in which the population register, other registers and surveys are combined. The SSD is the central data source for the virtual censuses and for many of the social statistics published by Statistics Netherlands. The SSD has also turned out to be very useful for nonresponse analysis and nonresponse correction.
Political pressure to reduce the response burden, the high costs of surveys and nonresponse problems have forced Statistics Netherlands to rethink its data collection strategy. The focus changed from surveys to registers. This approach has advantages, but there are also potential risks. The future of data collection is not clear. Creative thinking may be required to find new and smart solutions.
Marc Christine, INSEE (France)
Sébastien Faivre, INSEE (France)
Since the 1960s, the two-stage sample design of INSEE household surveys has traditionally used exhaustive population censuses as a survey frame.
In January 2004, France implemented a radically different census system, accomplished each year by surveying a fraction of the territory, with a five-year rotation of the census zones (called “rotation groups”).
This change required the redefinition of the methodology used to compose household survey samples. The resulting system is named OCTOPUSSE (Organisation coordonnée de tirages optimisés pour une utilisation statistique des échantillons).
The primary innovation consisted of benefiting from the “freshness” of the new census, that is, using the lists of lodgings surveyed in year n as a sampling frame for surveys conducted in year n+1.
As a result, the primary units had to be redefined to respect the freshness principle and to allow an enumerator assigned to a specific zone to conduct a survey without incurring excessive costs for travelling to respondents’ homes. Therefore, only the fraction of the primary unit (or EAZ, enumerator action zone) belonging to the last rotation group surveyed is drawn for a given survey.
The EAZ is constituted by aggregating communes, under minimal constraints, while minimizing their geographical range. An innovative automated solution was implemented to this effect.
A sample of EAZs was then drawn with appropriate stratification and balancing conditions. The new system has been operational since mid-2009. It should be noted that most surveys use this new system; the exception is the employment survey, which selects clusters from sampling frames built from tax files.
Roberta Vivio, Italian National Statistical Institute, ISTAT, Rome (Italy)
The “Statistical Registry on Persons in Italy” is now about to enter production. It will be the first statistical register covering the entire Italian population, and it will have important consequences for official statistics, whether used alone (as raw data) or together with other archives (as matched data).
Why now? The General Government’s administrative registers, and consequently their statistical potential, have recently grown. The new norms regulating access to and use of such registers have established the preconditions for cooperation between administrative authorities and ISTAT.
Which sources? The Register consists of multiple nationwide administrative sources held by central authorities: e.g. the Registry Database of physical persons of the Tax Registry’s information system (Ministry of Finances, 70 million records), the Pensioner centre (INPS, 16 million records), etc.
Which challenges?
What do we expect to do? We are designing and testing several applications of our register, the most important of which will be cooperation with the next population census in order to control the undercount rate. Other applications could be:
Suzelle Giroux, Statistics Canada
France Labrecque, Statistics Canada
The Canadian Health Measures Survey (CHMS) uses a multi-stage sample design. For each sampled collection site, dwellings were selected from the 2006 Census using the household composition to better reach the target age groups. This sample design was a success for Cycle 1, hence was used again for Cycle 2 of the survey. Cycle 2 targets people aged 3 to 79 years old. Since its collection is taking place from fall 2009 to fall 2011, the 2006 Census frame deteriorates and must be updated to cover new dwellings and to be able to identify dwellings with youths 3 to 5 years old that are no longer identifiable using the Census. This presentation will begin with an overview of the survey design of the CHMS. Next, the update of the frame with the Address Register and the T1 Family File to improve coverage and reach the target population will be explained. Finally, results on the efficiency of this approach will be presented based on completed sites of Cycle 2.
John L. Eltinge, Bureau of Labor Statistics, USA
A primary goal of the U.S. Consumer Expenditure Survey (CE) is to produce estimates of mean consumer expenditures at a relatively fine level of aggregation defined by the Universal Classification Code. At present, CE produces its estimates based on data collected through diary and interview surveys. As part of the effort to improve the balance of cost, burden and quality, one could consider supplementing the abovementioned survey sources with administrative-record data. Two examples of prospective administrative data sources are large retail outlets and financial transaction intermediaries. Evaluation of that balance involves three principal issues: (1) for surveys - costs, burden and error sources (including population variability, incomplete data and measurement error); (2) for administrative records - the same factors, plus additional factors related to operational risk; (3) for both - variability of factors in (1) and (2) over time.
This paper presents a general methodological framework for the evaluation of issues (1) through (3). This framework leads to several suggestions regarding components of cost, burden, quality and risk for which it is especially important to collect solid empirical information.
Stephen Jivraj, University of Manchester, UK
Migration is an inherently difficult phenomenon to measure as people who move are difficult to track. In the absence of a population register in the UK, two main datasets have been used to measure subnational migration: the national Census and the National Health Service Central Register (NHSCR) of all patients registered with a general practitioner. This paper introduces and examines the potential of the English School Census, an administrative source of subnational migration data that can provide more up-to-date information than the Census and more socioeconomic and geographical detail than the NHSCR. Since 2002, the School Census data has been derived from an electronic form completed by each state school in England to cover all enrolled pupils in January of each year. Through the inclusion of a unique pupil identifier, which remains the same throughout a pupil’s school career, the data can be matched over time and a change of home address can indicate migrant pupils. Empirical comparison with the 2001 Census and the NHSCR shows that the School Census provides similar levels and patterns of subnational migration over time when allowing for the different way in which migration is measured in each dataset. The paper concludes that the data has potential for monitoring subnational migration, a potential reflected in the Office for National Statistics in England and Wales exploring its use for local population estimation.
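The matching idea described in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical record layout of (pupil identifier, census year, home area code); the field names, the area codes and the function find_movers are illustrative assumptions and are not taken from the School Census itself or from the paper's method.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout (an assumption for illustration only):
# (pupil_id, census_year, home_area_code)
Record = Tuple[str, int, str]


def find_movers(
    records: List[Record], year_from: int, year_to: int
) -> List[Tuple[str, str, str]]:
    """Return (pupil_id, origin_area, destination_area) for pupils who appear
    in both years and whose home area differs between them."""
    by_year: Dict[int, Dict[str, str]] = {year_from: {}, year_to: {}}
    for pupil_id, year, area in records:
        if year in by_year:
            by_year[year][pupil_id] = area

    movers = []
    for pupil_id, origin in sorted(by_year[year_from].items()):
        destination = by_year[year_to].get(pupil_id)
        # Pupils missing from the later year cannot be classified here.
        if destination is not None and destination != origin:
            movers.append((pupil_id, origin, destination))
    return movers


# Made-up example data: three pupils, two areas ("A" and "B").
sample = [
    ("P001", 2007, "A"), ("P001", 2008, "A"),  # stayed in area A
    ("P002", 2007, "A"), ("P002", 2008, "B"),  # moved from A to B
    ("P003", 2007, "B"),                       # not enrolled in 2008
]
movers = find_movers(sample, 2007, 2008)
```

In this sketch a pupil who leaves the state school system between the two years is simply dropped, which is one reason why migration counted this way may differ from Census or NHSCR measures.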
Louise Morris, ONS, U.K.
Minda Phillips, ONS, U.K.
The paper will outline both the overall approach and specific measures taken by the Office for National Statistics (ONS) to make better use of data from administrative sources. Whereas initial efforts focused on the provision of aggregate outputs, since 2008 priority has been given to accessing record level data to support work on improving population statistics.
Early work concentrated on the acquisition, processing and dissemination of information to support the National Strategy for Neighbourhood Renewal. The development of new policies, procedures and statistical tools enabled ONS to provide reliable and consistent information from a local to a national level via the Neighbourhood Statistics Service.
Building on these foundations and new legislative opportunities, efforts are being directed at obtaining access to record level information from administrative sources to take forward a major programme of improvements to population and migration statistics. As part of this work the Beyond 2011 Project has been established to assess whether it will be feasible, in the longer term, to use administrative data, either alone, or in conjunction with extended survey approaches or alternative census methods, to meet ongoing requirements for both population and wider socio-demographic statistics.
This new phase of work has created further challenges, most notably on data quality, the interplay between administrative data and other data sources, and the linkage of data in a country without a single personal identifier. The challenges and experiences of using administrative sources for statistical purposes will be discussed.
Philippe Wanner, University of Geneva, Switzerland
The capture of detailed financial dimensions from computer-assisted telephone interview (CATI) surveys is often difficult due to problems with estimating income, such as income on assets, income declared from memory, social desirability or sample representativity. For this reason, this information must be validated prior to use. The use of administrative data to control and correct survey data is thus indispensable. Based on a study we conducted in 2009-2010 under the mandate of the Federal Office of Statistics, our presentation will show the matching done between the 2007 Swiss Statistics on Income and Living Conditions (SILC) and the fiscal registers of three cantons. This matching aims to compare, validate or correct financial data (salaried and independent income, rents, distributions, income on assets, etc.) collected over the course of the survey. In the methodology section, we will discuss the matching methods used, the results of the matching, and problems with coherence between concepts that needed to be resolved before linking the survey and the administrative registers. In the second section, we will describe the principal sources of errors and discrepancies identified. In the last section, we will present a few recommendations based on the work done, related to both the flow of the survey and the validation work that follows it.
Jannes Hartkamp, DESAN Research Solutions (The Netherlands)
Hans Rutjes, DESAN Research Solutions (The Netherlands)
The use of administrative or register data as a sampling frame for surveys is a common practice in the worlds of both official and non-official statistics. The rise of mixed mode surveys has raised the question of how information from registers can be used optimally to determine what would be the best survey mode or the best combination of options for different groups. Personal characteristics available beforehand can be used to make fairly accurate response estimations for different modes for each group. If, as is the case in some common survey designs, telephone interviews are used as a ‘last resort’ after approach by post and/or email has failed to yield a response, one needs to estimate in advance for what proportion of a given group a correct postal address, e-mail address and telephone number will be available, not just what the willingness to respond will be among those who can be reached.
Notwithstanding the recent debate about the possible negative effect that offering a choice could have on response rates in some cases, on the whole, clever mixed mode designs certainly do optimize response rates. However, exactly what constitutes the best mode mix varies from survey to survey and within a survey from group to group. Moreover, what may have been the best mix three years ago may no longer be the best mix now. In most Western societies, the proportion of the general population for which a telephone number can be found in a publicly available telephone register is steadily decreasing (in the Netherlands, by a few percentage points each year). Internet coverage is still expanding, but now that the novelty of web surveys has worn off, some respondents would rather return to paper. E-mail addresses are hard to come by and notoriously volatile. Finally, population characteristics sometimes change quite rapidly in some areas or some sectors. Survey experts often have good reasons for opposing unnecessary changes; however, in survey design, stagnation nowadays soon means decline.
Y. Celia Huang, University of Waterloo, Canada
Mary E. Thompson, University of Waterloo, Canada
Christian Boudreau, University of Waterloo, Canada
Geoffrey T. Fong, University of Waterloo, Canada
Increasingly, survey data is being collected by more than one mode. Some surveys use a combination of web (CAWI) and telephone (CATI) collection; others may use a combination of telephone and face-to-face interviewing. For many types of questions, the distribution of response options may be different in the two (or more) mode samples. The difference is partly attributable to selection effects, for example, due to web and telephone respondents being recruited in different ways, or using different frames. Typically, web and telephone samples differ with respect to their distributions of age, sex, education and variables related to personal outlook. Another source of difference in the response option distribution is "technical" in origin, having to do with respondents’ tendency to process the options differently depending on whether the options are heard or seen. Thus, for example, it is often found that telephone respondents are more likely than web respondents to use the extremes of a Likert scale. The purpose of this presentation is to illustrate an approach to modeling in a mixed mode survey that takes into account both selection and technical mode effects, using data from the International Tobacco Control (ITC) surveys in the Netherlands and Malaysia. The model uses a propensity score for selection effects, and up to two parameters for technical effects. It can be extended to apply to cross-country comparison data, or to longitudinal data where the collection mode may change from wave to wave.
Jacques Légaré, Université de Montréal
The goal of this presentation is to show from our research how Statistics Canada survey data, such as the General Social Survey and health surveys, can be used in conjunction with LifePaths' projections to obtain useful projections in the domain of home care and services in Canada.
Jonathan A. Schwabish, Congressional Budget Office (USA)
The United States Congressional Budget Office’s Long-Term (CBOLT) microsimulation model projects individual demographic and economic behaviour of the U.S. population, the finances of the U.S. Social Security system and the finances of the rest of the U.S. federal government for more than 75 years into the future. For each individual in the model, CBOLT simulates a wide range of demographic and economic characteristics, including birth, death, immigration and emigration, labour force participation, earnings, and Social Security taxes paid and benefits received.
The core individual-level data used in CBOLT come from the Continuous Work History Sample (CWHS), an administrative data set provided by the U.S. Social Security Administration (SSA) that contains administrative longitudinal earnings and Social Security benefit information. Because the CWHS data include limited demographic and labour market information, CBO uses data from other sources to expand the CWHS record and project behaviour in the future. Additional information comes from other data sets such as the Survey of Income and Program Participation (SIPP), the SIPP matched to other administrative earnings records, the Panel Study of Income Dynamics and the Current Population Survey. CBOLT also incorporates a wide range of aggregate data provided by the U.S. Social Security Administration and the Centers for Medicare and Medicaid Services. CBOLT also uses economic projections and projections of federal outlays and revenues produced by other divisions within CBO as part of its mandate to provide the U.S. Congress with neutral estimates of the federal budget and forecasts of the economy.
Michael Wolfson, University of Ottawa
There is now clear evidence that Human Papilloma Virus (HPV) is causal for cervical cancer. As a result, there is considerable interest in the optimal strategy for screening not only for cervical cancer, but also for HPV infection, in conjunction with the best strategy for HPV immunization. A very powerful method for answering this set of questions is microsimulation modeling, where the simulation model serves as the platform for integrating essential data from diverse sources. Indeed, such data integration requires the construction of a simulation model of some sort.
However, modeling in this case is challenging analytically because the usual approaches to modeling infectious and chronic diseases are quite different. For infectious diseases, it is best to model contacts and transmission of the pathogen between individuals explicitly. This entails a microsimulation model with co-evolving agents. For chronic disease incidence, progression and case fatality, on the other hand, inter-individual interactions are not important. Rather, the emphasis is more appropriately on survival times and sojourn times in various disease states, and their implications for health care utilization.
This paper will describe current work on a model of HPV infection and cervical cancer, being undertaken on behalf of the Canadian Partnership Against Cancer, that not only integrates data from a wide variety of sources, but also involves the construction of two distinct submodels and their integration. The first sub-model simulates the spread of HPV and its control by screening and vaccination, while the second sub-model draws on the outputs of the first and then models cervical cancer incidence, progression and treatment.
Mahlape Mohale, Statistics South Africa
Statistics South Africa is developing a Dwelling Frame (DF), a comprehensive source of information on dwelling units that forms the benchmark for statistical planning in a country. Under the conditions of stable human settlements, the task would not be insurmountable; however, Gauteng Province is characterized by a fast-growing population, a high rate of informal settlements and a hive of economic activity, and the DF project presented serious challenges as a result. The 2001 Census had a high undercount and therefore incomplete information on housing; the quality of the administrative data on housing is poor, incomplete and often outdated; and the data collected directly from households in the DF project was incomplete and of poor quality.
Lessons learned in collecting data from the three sources and the analysis thereof provided valuable information for future statistical planning such as Census 2011. The contributing factors to poor quality, incomplete data, high data collection costs and undercount from each data source were identified. Collection methods were reviewed and modified, which included using cheaper modes of transport, using several teams to sweep through an Enumeration Area (EA) instead of one team per EA, mapping the province and systematically identifying and annotating all the structures in the province, and modifying the data collection method to the requirements of each of the four settlement types: farms, traditional, formal and informal settlements.
By using an integrated data collection method, the quality of DF information in Gauteng Province was greatly improved and more efficient data collection techniques were developed.
Amy O’Hara, U.S. Census Bureau, USA
The U.S. Census Bureau uses an integrated system of administrative records to improve censuses and surveys. I will describe the demographic side of the system, containing data on persons and households, and our plans to improve the data system. I will also describe planned indirect and direct uses of the data. The administrative record research program, established over 15 years ago, benefits analysts developing survey frames and questions and is used to evaluate survey responses. Successful projects include interagency research on public health insurance programs and tax credits for the working poor using linked survey and administrative data. A new large-scale evaluation is the 2010 Census Match Study, which examines person and housing unit coverage of administrative data. All projects rely on the availability of quality source data and record linkage techniques. I will discuss the challenges of acquiring data from federal, state, and private providers. Our assessments of source data timeliness and completeness will also be described. I will provide an overview of our research and production record linkage techniques, noting how our expertise has benefitted other federal agencies. The Census Bureau is poised to assist other agencies, particularly other statistical agencies, with data for program evaluation.
Martin Pantel, Statistics Canada
The Canada Child Tax Benefit (CCTB) is a non-taxable amount paid monthly to help eligible families with the cost of raising children under 18 years of age. The Canada Revenue Agency (CRA) manages this program and maintains a file identifying all applicants to the CCTB. Interest in this file as a potential sampling frame for the Survey of Young Canadians (SYC) has grown as few other options are available for this difficult to target population. Given Statistics Canada’s limited experience with the CCTB data, two concerns with the use of this file were potential coverage issues (for example, undercoverage of higher income families) and the quality of the contact and auxiliary information. The coverage aspect of the CCTB will be presented first where population and subpopulation totals from the CCTB and the Canadian Census of Population are compared. Distributions for variables common to both sources are also compared. Then, using probabilistic linkages, profiles of Census families that could not be linked to the CCTB will be drawn. The quality of contact and auxiliary information was assessed via a SYC Contact Test that was carried out in February 2010 with a sample of 1000 children drawn from the CCTB. The presentation will discuss the design, implementation and results of both the probabilistic linkage and the contact test. In light of the encouraging results, final recommendations on the use of the CCTB as a frame for the SYC will be made.
Lori Stratychuk, Statistics Canada
Jean Dumais, Statistics Canada
When designing a survey, there are two traditional options for targeting specific populations. The first option is two-phase sampling, which tends to be expensive. The other option is to enrich the frame using external administrative data. For the Ontario Survey on the Prevalence of Hypertension (OSPH), the latter method was used. One of the main objectives of the OSPH was to look at the relationship between hypertension and ethnicity. Given the rarity of the ethnic groups of interest, special attention was necessary in the sample design in order to allow comparison of the ethnic groups of interest to the rest of the population. A multi-stage sampling plan was created using traditional area methods. When possible, ethnic strata were created at the second stage of stratification using the information from the 2001 Census of Population, because the 2006 Census data had not yet been disseminated. One of the challenges in the sample design was creating rules to classify the ethnicity of the primary sampling unit, in this case, the census dissemination area. In order to augment the ethnic diversity of the final sample, the ethnic strata of interest were then oversampled. Ultimately, the sampling plan succeeded at enhancing the ethnic representation of the sample.
Chris Dibben, University of St Andrews, U.K.
Gemma Wright, University of Oxford, U.K.
Michael Noble, University of Oxford, U.K.
A number of countries carry out large sample ‘community surveys’ to enable small area estimates to be made for inter-census periods. This paper discusses the use of the South African Community Survey, a survey covering some 274,000 dwelling units, to produce the South African Index of Multiple Deprivation (SAIMD) 2007. The SAIMD, a small area index, was originally census-based and produced by the University of Oxford’s Centre for the Analysis of South African Social Policy (CASASP) for the South African Department for Social Development. However, because of the desire to assess change since the census was carried out in 2001, a methodology capable of updating the previous index using information from the South African Community Survey was explored and developed. The potential for using 2007 administrative data was also examined but no suitably robust datasets were found. A composite estimator was, therefore, produced within a multilevel logistic modeling framework using both the original 2001 census variables and the 2007 community survey. Measures of uncertainty were also derived from the model, with some estimators of the different dimensions of the index being stronger than others. The estimators were then combined to produce an overall index. The pattern revealed underlines the continuing problem of inequality across South Africa and especially the clear spatial legacy of apartheid.
Tobias Gramlich, University of Duisburg-Essen (Germany)
Rainer Schnell, University of Duisburg-Essen (Germany)
Alexander Mosthaf, Institute for Employment Research (Germany)
Stefan Bender, Institute for Employment Research (Germany)
In order to study non-response effects in a German large-scale mixed mode (CATI/CAPI) survey of welfare receiving households (PASS), complete social security data on households and individuals for both respondents and non-respondents were linked to the survey data. Characteristics of participating and non-responding households and individuals were examined. Finally, weighted and imputed survey data were compared with administrative data. For the first time in Germany, the effect of compensatory techniques for non-response is evaluated and compared with known administrative data for non-respondents. About 26% of the 49,215 persons contacted responded to the survey. However, the differences between respondents and non-respondents were small. Bias due to refusals (51% of all non-respondents) was even lower. Refusals tend to end the receipt of benefits slightly faster than non-respondents due to language problems and the receipt of social welfare benefits significantly slower. Estimated propensities (for contact and cooperation) using complete coverage data show typical effects on survey participation (e.g., with respect to age and household size). Facilitated by the unique availability of complete covariates, some surprising results are reported (e.g. no effect of employment status on participation, but significant effects of recent changes in household composition).
Tracesandra J. McDonald, Royal Canadian Mounted Police
The Canadian Centre for Justice Statistics (CCJS), in partnership with Canadian police agencies, collects and disseminates police-reported crime statistics with the ultimate objective of providing academics, governments, the media and citizens in general with a local and national picture of the nature and extent of crime in Canada. The data is collected via the Uniform Crime Reporting (UCR) Survey and Homicide Survey. The data collected from these surveys represent substantiated reports of criminal activity that include descriptive information surrounding the incident, the victim (in cases of crimes against persons) and the offender.
At the national, provincial and municipal levels, crime statistics are used to support a variety of purposes including policy direction, program evaluation, legislative development, allocation of criminal justice resources and performance measurement. This also holds true from a law enforcement perspective with Intelligence-Led policing, an approach whereby key intelligence data and statistics are analyzed and used to inform management of specific problem trends within their jurisdiction, as well as the most efficient and effective operational response.
That being said, operational conditions in a policing environment are not always conducive to the legislated requirement to provide Statistics Canada with timely and accurate crime data. This paper explores the various issues and challenges to the collection of crime data from the perspective of the police at the respondent level. It will also explore ongoing strategies to enhance the quality and relevance of this very important source of information for both law enforcement and the general public.
Anthony Matarazzo, Statistics Canada
The Integrated Criminal Court Survey (ICCS) is a national criminal court database containing statistical information on appearances, charges, and cases for youths and adults for all Criminal Code and other federal statute charges heard in criminal courts in Canada. In any given year, approximately 13 million court appearance records are received by the Canadian Centre for Justice Statistics (CCJS) from the provinces and territories. Basic units of count derived by the ICCS from these microdata include court appearances, criminal charges (approximately 1.5 million) and, ultimately, criminal cases (approximately 500,000).
This presentation describes the basic structure of the ICCS and highlights the process by which charge and case estimates are derived from the microdata. Particular attention is paid to recent changes to the ICCS and its extraction process, which has moved from an “end-date” case to a “snapshot” person-based definition. The impact of these changes on derived estimates as well as the various mechanisms in place to address them are also discussed.
Peter J. Carrington, University of Waterloo (Canada)
This paper explores ways in which data on the administration of justice held by the Canadian Centre for Justice Statistics (CCJS) have been and could be used for longitudinal and network research on delinquency and crime. It also explores some of the problems associated with such research. The paper focuses mainly on the Uniform Crime Reporting Incident-Based Survey (UCR), but also considers the Youth Court Survey (YCS) and Adult Criminal Court Survey (ACCS).
Although none of these surveys has a unique identifier for individuals, longitudinal research on the recorded “criminal careers” of accused offenders has been done by linking the records over multiple years for each accused person, matching on soundex code, sex and date of birth. This has been done using the UCR, YCS, and ACCS. Some results are presented. This record matching method could also be used to study “victimization careers” of serial victims. Problems with this kind of record matching are discussed.
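The matching key described above can be sketched as follows. This is only an illustration under stated assumptions: the standard American Soundex rules are used, and the record field names (`surname`, `sex`, `dob`, `year`, `incident_id`) are hypothetical rather than the actual UCR, YCS or ACCS layouts.

```python
from collections import defaultdict

# Standard American Soundex digit groups.
_CODES = {c: d for d, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
    "4": "L", "5": "MN", "6": "R"}.items() for c in letters}


def soundex(name):
    """Return the four-character Soundex code of a surname."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev to "", so a repeat is coded again
    return (result + "000")[:4]


def link_careers(records):
    """Group records by (Soundex, sex, date of birth), sorted by year."""
    careers = defaultdict(list)
    for rec in records:
        key = (soundex(rec["surname"]), rec["sex"], rec["dob"])
        careers[key].append((rec["year"], rec["incident_id"]))
    return {key: sorted(events) for key, events in careers.items()}
```

Because the key is not a true unique identifier, two different people who share a Soundex code, sex and date of birth would be merged, and one person whose name is recorded very differently would be split. These are the kinds of matching problems the paper refers to.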
Organizing the data for research on co-offending, or commission of crimes by multiple accused persons, is straightforward with the UCR because there is a unique incident identifier that can be used to identify co-accused in an incident. Research on delinquent and criminal networks would require linking together incidents involving overlapping sets of co-accused in different incidents. Some results of this research and examples of potential network analyses are presented. The feasibility of this kind of record matching is discussed.
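A minimal sketch of how co-accused pairs and wider networks could be derived from incident identifiers is given below. The input layout (pairs of incident identifier and person key) and the function names are assumptions for illustration; the person key would itself come from a record-matching step.

```python
from collections import defaultdict
from itertools import combinations


def co_offending_edges(accused_rows):
    """Count shared incidents for every pair of co-accused persons.

    accused_rows: iterable of (incident_id, person_key) pairs.
    """
    by_incident = defaultdict(set)
    for incident_id, person_key in accused_rows:
        by_incident[incident_id].add(person_key)
    edges = defaultdict(int)
    for persons in by_incident.values():
        for a, b in combinations(sorted(persons), 2):
            edges[(a, b)] += 1
    return dict(edges)


def networks(edges):
    """Return connected groups of persons linked through co-offending."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over the component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups
```

Persons accused only in single-offender incidents produce no edges and therefore do not appear in any network.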
Stephanie Eckman
Frauke Kreuter
Field listing of housing units is an expensive and time-consuming stage of the survey process. In recent years, survey researchers have begun using external data sources, such as administrative records or commercial address databases, as a basis for housing unit listing. Listers update the existing list in the field, adding and deleting units as necessary. This method of listing, which we call dependent listing, is believed to be more accurate and less expensive than listing from scratch, and is used by several survey organizations including the U.S. Census Bureau. This paper uses an experimental repeated listing design to demonstrate the presence of confirmation bias in dependent listing. We find that when provided with a list to update in the field, listers tend not to add missing units or delete inappropriate units. The listers are biased towards confirming the initial list as correct. We call this phenomenon confirmation bias. Frames developed with dependent listing are for this reason much more reliant on the underlying quality of the initial listing than has been thought. Furthermore, if the kinds of units undercovered and overcovered by the input list are different than those which are properly covered, confirmation bias can contribute to coverage bias in survey estimates. Survey researchers should have a good understanding of the source and quality of the records they use as a basis for dependent listing, as listers are not able to fix all of the errors in the field.
Elisabeth Neusy, Statistics Canada
Yves Bélanger, Statistics Canada
The purpose of the 2008 Access and Support to Education and Training Survey (ASETS) was to provide information on access to post-secondary education including the role of student loans and savings in the financing of post-secondary education, as well as information on participation in adult education and training. A telephone list frame was constructed from two sources: telephone numbers from the 2006 Census of Population, supplemented with residential numbers from administrative files to improve coverage. The Census portion of the frame was stratified according to the age composition of the households, allowing households containing youths to be oversampled. Interviews were conducted with the households reached by the telephone numbers selected in the sample. Following collection, the survey design was evaluated. In the presentation, we will discuss the frame, sample design, and post-collection evaluation results.
Paul Smith, Office for National Statistics, U.K.
Karl Ashworth, Office for National Statistics, U.K.
Charles Lound, Office for National Statistics, U.K.
The Wealth and Assets Survey is a recent addition to the UK’s social survey portfolio, and is a longitudinal survey designed to follow a panel of respondents. In order to facilitate oversampling of certain categories with particular wealth characteristics, sample areas were first selected from the UK’s household survey frame, the Postal Address File, and the selected areas were matched to administrative data to indicate households likely to have certain characteristics. These households were then sampled at different rates. The survey was weighted to population totals (derived from census and administrative sources in a standard way) to compensate for non-response. The paper will describe the survey design and the use of administrative data, and evaluate how well the sampling worked in practice using information from the first wave of the survey. It will also comment on how effective the weighting adjustments were for this type of data.
Jenna Fulton, University of Maryland, U.S.A.
As administrative records become more accessible, surveys are increasingly requesting the permission of respondents to access and link their administrative data. Access to these records can only occur with respondents’ consent and, in most cases, in the U.S., respondents’ agreement to provide linking information such as their Social Security Number. With evidence from several national surveys indicating that the public is becoming less willing to provide personally-identifying information in the survey context, researchers are developing techniques to link without identifiers. There is potential for bias in the resulting linked dataset if some respondents fail to consent, and if those who consent are not representative of the entire sample. To date, no research has investigated the impact of the specific information requested to facilitate record linkage on consent rates, and how they might differ from surveys that do not request an identifier. Using a meta-analysis, this research will investigate the effect of the specific information requested on the proportion of respondents who give consent. This will compare surveys that request personal identifiers, such as Social Security Number and Medicare Number, as well as those which are able to link without requesting any identifying information. Existing surveys that request respondent consent to link records will be included in the analyses, for example, the Health and Retirement Survey, National Health Interview Survey, and Panel Survey of Income Dynamics. This research will help inform future linkage efforts by informing researchers which requests elicit the highest rates of consent to linkage.
Denis Hamel, Institut national de santé publique du Québec, Canada
Robert Pampalon, Institut national de santé publique du Québec, Canada
Philippe Gamache, Institut national de santé publique du Québec, Canada
Most health administrative databases in Canada contain no socioeconomic information that allows social inequalities in health to be tracked. One solution often proposed in the literature is to resort to an ecological proxy, using census data for small areas. The creation of a Canadian deprivation index on the scale of the dissemination area, the smallest territorial unit used in the Census, follows from this perspective. After briefly introducing the concept of deprivation, we will discuss the methodological aspects of the construction of such an index. After an analysis of the primary components integrating six census indicators, we will see that this index has two dimensions: one material and one social. We will also treat the problems related to matching administrative databases and census data, notably as the result of lacking key variables in common. Finally, we will see how the extent of health inequalities in Canada can vary depending on whether the deprivation is measured on the territorial or individual scale. For this purpose, the results of a comparison illustrating the variations in premature mortality, according to ecological and individual versions of the Canadian deprivation index, will be presented. A website containing all of the products linked to the deprivation index will be shown briefly.
Richard Madden, University of Sydney, Australia
Leonie Tickle, Macquarie University, Australia
Lisa Jackson Pulver, University of NSW, Australia
Ian Ring, University of Wollongong, Australia
Lee Taylor, NSW Department of Health, Australia
In 2009, the Australian Bureau of Statistics (ABS) released new estimates of Indigenous life expectancy for Australia. The estimates, which were substantially higher than previously published estimates, were based on linkage between Indigenous deaths registered in the period from August 7, 2006 to June 30, 2007 and 2006 census records. State estimates were also produced for some states, showing substantial variations between states; the life expectancies are inversely related to the calculated completeness of Indigenous identification of death registrations.
Analysis based on more comprehensive linkage of death records in New South Wales (NSW) over 5 years suggests that the ABS methods have understated Indigenous deaths and so overstated life expectancy.
The paper will report the NSW results, based on several linkage algorithms, including a comparison of the algorithms. Resulting changes in life expectancy estimates will be reported. Suggested improvements to ABS methods will be discussed.
Paul A. Peters, Statistics Canada
Michael Tjepkema, Statistics Canada
Census mortality linkages are proven to be powerful tools for analysing the mortality differences for numerous population groups. In a recently approved record linkage, the 1991 Census of Population, Canadian Mortality Database, and Canadian Cancer Database will be linked in order to examine cancer incidence and causes of death in conjunction with socio-demographic and neighbourhood characteristics. The linkage of the 1991 Census cohort to these databases will allow for the analysis of mortality using the CMDB in conjunction with the extensive information from the 1991 Census of Population long forms (2B and 2D), the recording of individual mobility over time using postal codes of tax filers from the Tax Summary Files, and the inclusion of important analyses of cancer morbidity via the CCDB. This presentation reviews the previous census mortality linkage, describes the new linkage, outlines the linkage process, and presents some initial linkage results.
Ivan Fellegi, Statistics Canada
The paper explores and assesses the approaches used by statistical offices to ensure effective methodological input into their statistical practice. The tension between independence and relevance is a common theme: generally, methodologists have to work closely with the rest of the statistical organisation for their work to be relevant; but they also need to have a degree of independence to question the use of existing methods and to lead the introduction of new ones where needed. And, of course, there is a need for an effective research program which, on the one hand, has a degree of independence needed by any research program, but which, on the other hand, is sufficiently connected so that its work is both motivated by and feeds back into the daily work of the statistical office.
The paper explores alternative modalities of organisation; leadership; planning and funding; the role of project teams; career development; external advisory committees; interaction with the academic community; and research.
Antoine Chevrette, Statistics Canada
Michael Wenzowski, Statistics Canada
It is commonly the case that data originating from disparate sources, and even from the same source over time, exists in incompatible formats. The consequence is that the analysis of such data must often be preceded by some form of record linkage operation in order to assemble the data into a well-structured form. When reliable and invariant identifiers are available, such linkages become a relatively easy task to perform. In other cases, the identifier values may vary, requiring the reformatting of the data and possibly the execution of a fuzzy match. In both of these scenarios, “off the shelf” software is readily available that is well-suited to the task at hand. However, in cases in which a greater degree of variance must be accommodated, especially in cases in which multiple fields must be examined in order to identify a linkage, the task falls to software capable of performing a sophisticated and complex probabilistic record linkage. Unfortunately, software capable of this degree of sophistication is very highly specialized and only sparsely available.
We present the results of a recent Statistics Canada initiative to re-engineer our generalized record linkage system in order to enhance its applicability across a wide range of processing problems and subject-matter domains. The software faithfully implements the probabilistic record linkage methodology first described by Fellegi and Sunter, and includes many extensions and enhancements to increase the utility of the application. We will demonstrate how we have improved the software by offering more intuitive controls over managing the complexity of internal processing; by extending and enhancing the software’s capabilities; and by simplifying the installation, setup and processing models.
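For readers unfamiliar with the Fellegi and Sunter methodology, the core scoring step can be sketched as below. This is a generic illustration, not the interface of the Statistics Canada software: the function names, the field names in the usage and the two thresholds are assumptions, and the m- and u-probabilities (agreement probabilities among true matches and true non-matches) are taken as given.

```python
import math


def agreement_weights(m, u):
    """Return (agreement, disagreement) log2 weights for one field."""
    return math.log2(m / u), math.log2((1 - m) / (1 - u))


def pair_weight(agreements, m_probs, u_probs):
    """Sum field weights for one record pair.

    agreements: dict mapping field name to True/False (fields agree or not).
    """
    total = 0.0
    for field, agrees in agreements.items():
        w_agree, w_disagree = agreement_weights(m_probs[field], u_probs[field])
        total += w_agree if agrees else w_disagree
    return total


def classify(weight, lower, upper):
    """Three-way decision: link, possible link (clerical review), non-link."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"
```

Summing the field weights assumes that agreement on one field is independent of agreement on the others within each match status, which is the usual simplifying assumption.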
Nicoletta Cibella, ISTAT, Italy
Marco Fortini, ISTAT, Italy
In official statistics, bringing together large amounts of data from different sources for statistical purposes is now widespread, in a context of increasing demand for statistical information on the one hand and stricter budgetary constraints on the other.
Record linkage techniques are a multidisciplinary set of methods and practices aimed at identifying the same real-world entity at the individual micro level, even when it is represented differently in data sources. The complexity of the whole linkage process stems from several aspects: for example, the lack of unique identifiers requires sophisticated statistical procedures; the huge amount of data to process involves complex IT solutions; and constraints related to a specific application may require the solution of difficult linear programming problems.
Several tools have been proposed to deal with record linkage in both the academic and private sectors. In this paper, we propose the RELAIS (Record Linkage at Istat) system. The basic idea of RELAIS is to handle the record linkage complexity by decomposing the whole problem into its constituent phases and dynamically adopting the most appropriate technique for each step, in order to define the most suitable strategy based on application and data requirements. RELAIS is configured as an open source project, a winning choice for sharing techniques and software. The methodological core of RELAIS is based on the well-known Fellegi-Sunter theory, allowing its usage by both researchers (who can easily enrich it) and non-experts (who have it embedded in the software).
William E. Yancey, US Census Bureau, Washington, DC (USA)
Following Fellegi and Sunter, probabilistic record linkage uses the conditional probabilities for agreement patterns. Since these probabilities are conditioned on the true match status of the agreement pattern types, they are presumably unknown, so that implementation of agreement weight calculations requires estimating these probability parameters. The EM algorithm enables us to calculate maximum likelihood estimates for these parameters conditioned on the latent classes of true matches and true non-matches without the use of training data. If a sufficiently large proportion of matching record pairs is present in the total sample of record pairs, then under the conditional independence assumption, the EM algorithm is straightforward to implement, converges rapidly, and provides useful parameter estimates that are specific to the particular set of record pairs to be linked. In the case of census data person records, improved parameter estimates can be obtained by using a three latent class model. It is possible to include frequency data to adjust the parameter values. The conditional independence assumption can also be replaced by including interactions between the matching fields within each of the latent classes.
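The two-class case under conditional independence can be sketched as follows. This is a minimal illustration, not the Census Bureau implementation: the function name, the starting values (0.9 and 0.1) and the clamping constant `eps` are assumptions, and the three-class model, frequency adjustments and interaction terms mentioned above are not covered.

```python
def em_match_parameters(pattern_counts, p=0.1, n_iter=200, eps=1e-6):
    """Estimate match proportion p and per-field m- and u-probabilities.

    pattern_counts: dict mapping an agreement pattern (tuple of 0/1 values,
    one per matching field) to the number of record pairs showing it.
    """
    patterns = list(pattern_counts)
    k = len(patterns[0])
    n = sum(pattern_counts.values())
    m, u = [0.9] * k, [0.1] * k  # assumed starting values

    def clamp(value):
        return min(max(value, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that a pattern comes from a match.
        post = {}
        for gamma in patterns:
            pm, pu = p, 1.0 - p
            for j, agrees in enumerate(gamma):
                pm *= m[j] if agrees else 1.0 - m[j]
                pu *= u[j] if agrees else 1.0 - u[j]
            post[gamma] = pm / (pm + pu)
        # M-step: update parameters from expected class memberships.
        n_match = sum(pattern_counts[g] * post[g] for g in patterns)
        p = clamp(n_match / n)
        for j in range(k):
            agree_m = sum(pattern_counts[g] * post[g] * g[j] for g in patterns)
            agree_u = sum(pattern_counts[g] * (1.0 - post[g]) * g[j]
                          for g in patterns)
            m[j] = clamp(agree_m / n_match)
            u[j] = clamp(agree_u / (n - n_match))
    return p, m, u
```

Working with pattern counts rather than individual pairs keeps each iteration cheap, since there are at most 2^k distinct patterns for k matching fields. No training data with known match status is needed, as noted above.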
Manuela Maia, Catholic University of Portugal
Paula Vicente, Lisbon University Institute (Portugal)
Undercoverage is one of the most common problems with sampling frames and a likely cause of coverage error. To reduce the impact of undercoverage on survey estimates, several frames can be combined in order to achieve a complete coverage of the target population. Multiple frame estimators have been developed to be used in the context of multiple frame surveys. Sampling frames may overlap in some situations, which is the case when some units from the different sampling frames are related to the same units of the target population. Indirect sampling (Lavallée, 1995) is an alternative approach to classical sampling theory in dealing with the overlapping issue of sampling frames on survey estimates.
In this paper, a new class of estimators is presented that is the result of merging dual frame estimators with indirect sampling estimators in order to bring together in a single estimator the effect of several frames on survey estimates. These estimators are compared with the optimal estimator from Deville and Lavallée (2006), and we try to obtain the general formula for the variance of these estimators. Additionally, a practical case of estimation in the context of the Eurobarometer is presented.
Kirsten West, U.S. Census Bureau, USA
J. Gregory Robinson, U.S. Census Bureau, USA
Jason Devine, U.S. Census Bureau, USA
Renuka Bhaskar, U.S. Census Bureau, USA
In the United States, demographic analysis methods have historically been used to develop estimates of the population for comparison with decennial census counts. The estimates are developed from various types of demographic data independent of the census, such as administrative statistics (births, deaths, and Medicare data) and estimates of immigration and emigration. The paper focuses on the methodological challenges when following birth cohorts since 1935 to create an estimate of the size of the U.S. population on April 1, 2010. Specifically, the paper addresses the uncertainties in the administrative birth and death records, the reliance on survey data to estimate immigration, and the use of the Medicare enrollment database to estimate the elderly population. Also, we discuss the key issues associated with assigning race and ethnicity to the multiple data sources so that the estimates are consistent with the race categories developed for the decennial census.
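The cohort accounting behind demographic analysis can be illustrated with the basic balancing equation. The sketch below is an assumption-laden simplification: the function and argument names are hypothetical, the figures in any usage are invented, and the uncertainty and race-assignment issues discussed in the paper are not modelled.

```python
def cohort_population(births, deaths, immigrants, emigrants):
    """Balancing equation for one birth cohort at the reference date."""
    return births - deaths + immigrants - emigrants


def demographic_analysis_total(cohorts, elderly_from_enrollment):
    """Sum the cohort estimates and add an enrollment-based elderly count.

    cohorts: list of dicts with keys births, deaths, immigrants, emigrants.
    elderly_from_enrollment: population estimate for the oldest ages taken
    from an enrollment file instead of from birth records.
    """
    younger = sum(cohort_population(**c) for c in cohorts)
    return younger + elderly_from_enrollment
```

For example, a cohort with 1,000 births, 50 deaths, 30 immigrants and 10 emigrants contributes 970 persons to the total.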
Kate Wilder, Statistics Canada
Steven Thomas, Statistics Canada
The Canadian Community Health Survey (CCHS) is a continuous-collection cross-sectional survey with an annual sample of 65,000 respondents. It uses a complex multi-stage dual-frame design. In most strata, both an area frame and a telephone list frame are used to select the sample, with the corresponding interviews being personal (CAPI) or telephone (CATI) interviews. This use of multiple sources for the sampling frames means design weights must be calculated separately for each frame and then combined through a process called integration. The integration of the survey weights poses several challenges. First, the telephone frame covers only a portion of the target population, while the area frame covers most (98%) of it. Only the units that are common to both frames should have their weights adjusted during integration, but for area frame units that do not provide a phone number, it is not known whether they are also covered by the telephone frame. Another challenge stems from the use of multiple collection modes. The challenge is to produce consistent and interpretable estimates in the presence of a mode effect bias. Previous studies have shown that the data collected through personal and telephone interviews may differ for certain variables. Overall, the integration process should ensure that the weights are adjusted to properly reflect the target population and take into account any mode effect bias. This paper will focus on the CCHS’s response to these and other challenges.
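One common way to combine weights from two overlapping frames is to scale the weights of units in the overlap by a fixed factor on each frame. The sketch below illustrates that idea only; it is not the CCHS integration method, and the factor `alpha`, the `in_overlap` flag and the field names are assumptions. It also sidesteps the difficulty raised above that overlap membership is unknown for area frame units without a phone number.

```python
def integrate_weights(units, alpha=0.5):
    """Return units with an integrated weight added.

    units: list of dicts with keys
      frame         "area" or "phone"
      design_weight weight computed separately within the unit's frame
      in_overlap    True if the unit is covered by both frames
    alpha: share of the overlap domain assigned to the area frame sample.
    """
    integrated = []
    for unit in units:
        weight = unit["design_weight"]
        if unit["in_overlap"]:
            # Both samples estimate the overlap domain, so share it out.
            weight *= alpha if unit["frame"] == "area" else 1.0 - alpha
        integrated.append({**unit, "weight": weight})
    return integrated
```

Units covered only by the area frame keep their design weight, so the overlap domain is counted once rather than twice in the combined estimate.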
Stefan Bender, Institute for Employment Research, Germany
Jörg Heining, Institute for Employment Research, Germany
Remote data access, defined as the possibility for a researcher to access and evaluate even weakly anonymised data via a secure Internet connection from his home desktop computer at any time, has not been implemented by a German RDC so far. Privacy regulations and especially the problem of admission control are reasons why German RDCs are not able to offer their data via remote data access to the research community. Therefore, weakly anonymised data may still only be accessed through on-site use with the consequence of time-consuming and costly guest stays at an RDC.
In order to facilitate data accessibility, the Research Data Centre of the German Federal Employment Agency (BA) at the Institute for Employment Research (IAB) in Nuremberg, Germany, in cooperation with the Research Data Centres of the statistical offices of the Länder, developed the so-called Research Data Centre-in-Research Data Centre (RDC-in-RDC) approach. The basic idea of the RDC-in-RDC approach is to offer researchers the ability to access Research Data Centre data via a secure Internet connection from locations outside of Nuremberg. Vice versa, the microdata of the statistical offices of the Länder may be accessed through a guest stay at the Research Data Centre. Moreover, a branch of the Research Data Centre at the Institute for Social Research (ISR) of the University of Michigan in Ann Arbor is planned to enable more researchers from North America to access Research Data Centre data.
Karen Daley, National Defence, Ottawa (Canada)
The Canadian Forces (CF) has a unique workforce, with its own governance, policy and occupational requirements. As such, the personnel management system must place the right person, with the right qualifications, in the right place at the right time. In order to do this, decision-makers, planners and policy developers must have accurate, timely and appropriate personnel data at their disposal. The CF Portrait was designed to provide these data in a single-source, high-level, sociodemographic summary. The CF Portrait combines administrative and survey data on Regular Force members on a broad spectrum of personnel dimensions to inform policy and program development. Compiling and integrating these data presented several methodological challenges that required a variety of solutions. This presentation will discuss the CF Portrait as a case study of using both survey and administrative data to create a practical resource in an applied setting.
Alan M. Zaslavsky, Harvard Medical School (USA)
Yulei He, Harvard Medical School (USA)
Cancer registry records, patient surveys, and administrative systems record adjuvant therapies (chemotherapy and radiation) for cancer patients; however, these sources are subject to underreporting, which could bias analyses. We propose to impute true treatment status using sample validation data from medical records and analyze the imputed data. We extend earlier studies with a single outcome (provision of chemotherapy) and base data system (the registry), to multiple measures (provision of chemotherapy and radiation therapy) and multiple data systems (the registry, a patient survey, and Medicare claims). Bayesian hierarchical models for provision and reporting of multiple cancer therapies take into account their associations and multilevel structure, using related multivariate probit models for reporting of each therapy. The methodology is applied to data for patients with colorectal cancer in California.
Roderick Little, University of Michigan School of Public Health (USA)
We propose proxy pattern-mixture analysis (PPMA), a simple method for assessing the impact of non-response for the mean of a survey variable Y subject to non-response, when there is a set of covariates observed for non-respondents and respondents. The covariates are reduced to a proxy variable X that has the highest correlation with Y, estimated from a regression analysis of respondent data. The impact of non-response depends primarily on three factors: the non-response rate, the strength of the proxy variable in predicting Y, and the difference in proxy mean for respondents and non-respondents. The PPMA method combines all three elements in an intuitively reasonable way. Adjusted estimators of the mean of Y are based on a pattern-mixture model with different mean and covariance matrix of Y and X for respondents and non-respondents, assuming missingness to be an arbitrary function of a known linear combination X + lambda(Y) of X and Y. The method does not assume the missing-data mechanism is missing at random (lambda=0), and provides a sensitivity analysis for different values of lambda. Maximum likelihood, Bayesian and multiple imputation versions of PPMA are described. Properties are examined through simulation and with data from the third National Health and Nutrition Examination Survey. This is joint work with Rebecca Andridge of Ohio State University.
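The sensitivity analysis over lambda can be sketched with a closed-form adjusted mean. The adjustment factor (lambda + rho) / (lambda * rho + 1) used below is an assumption drawn from the published description of PPMA rather than from this abstract, and the function name is hypothetical. The input `x` is taken to be the proxy already built from the regression on respondents, so its correlation with Y is positive.

```python
import math


def ppma_mean(y_resp, x_resp, x_nonresp, lam):
    """Adjusted mean of Y for a given sensitivity parameter lam.

    y_resp, x_resp: Y and proxy X for respondents (same order).
    x_nonresp: proxy X for non-respondents.
    lam: 0 corresponds to missing at random; math.inf to missingness
         depending on Y only.
    """
    n_r = len(x_resp)
    xbar_r = sum(x_resp) / n_r
    ybar_r = sum(y_resp) / n_r
    xbar = (sum(x_resp) + sum(x_nonresp)) / (n_r + len(x_nonresp))
    s_xx = sum((x - xbar_r) ** 2 for x in x_resp) / n_r
    s_yy = sum((y - ybar_r) ** 2 for y in y_resp) / n_r
    s_xy = sum((x - xbar_r) * (y - ybar_r)
               for x, y in zip(x_resp, y_resp)) / n_r
    rho = s_xy / math.sqrt(s_xx * s_yy)
    # Assumed adjustment factor; equals rho at lam = 0 and 1/rho at infinity.
    factor = 1.0 / rho if math.isinf(lam) else (lam + rho) / (lam * rho + 1.0)
    return ybar_r + factor * math.sqrt(s_yy / s_xx) * (xbar - xbar_r)
```

Evaluating the function at lam = 0, 1 and infinity gives a simple range of estimates. When the proxy is strong (rho near 1) or the respondent and overall proxy means are close, the three values nearly coincide, which matches the three factors listed above.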
David Haziza, Université de Montréal (Canada)
Pierre Duchesne, Université de Montréal (Canada)
Deterministic regression imputation within classes, which includes ratio and mean imputation within classes as special cases, is widely used in surveys. It consists of replacing a missing value by its predicted value obtained under an assumed linear regression model. However, in the presence of outliers, deterministic regression imputation will potentially lead to very unstable estimators. To overcome this problem, we propose to derive a set of imputed values such that the imputed estimator (defined as the weighted sum of the observed and imputed values) is calibrated on an estimator that is known to have good properties in the presence of outliers. More precisely, the idea is to start with initial imputed values and find a final set of imputed values as close as possible to the initial ones so that the imputed estimator is calibrated on an appropriate estimator. This is closely related to the reverse calibration proposed by Chambers and Ren (2003). Results from an empirical study on the performance of the resulting estimator will be shown.
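The starting point of the proposal, deterministic imputation within classes, can be illustrated with its ratio special case. The sketch below covers only these initial imputed values, not the calibration step, whose details are not given in the abstract. The field names (`cls`, `w`, `x`, `y`) are assumptions.

```python
from collections import defaultdict


def ratio_impute_within_classes(units):
    """Fill missing y with (weighted respondent ratio of y to x) times x.

    units: list of dicts with keys
      cls  imputation class
      w    survey weight
      x    auxiliary variable, observed for every unit
      y    study variable, or None when missing
    Each class is assumed to contain at least one respondent.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # class -> [sum w*y, sum w*x]
    for unit in units:
        if unit["y"] is not None:
            sums[unit["cls"]][0] += unit["w"] * unit["y"]
            sums[unit["cls"]][1] += unit["w"] * unit["x"]
    completed = []
    for unit in units:
        y, imputed = unit["y"], False
        if y is None:
            sum_wy, sum_wx = sums[unit["cls"]]
            y, imputed = sum_wy / sum_wx * unit["x"], True
        completed.append({**unit, "y": y, "imputed": imputed})
    return completed


def imputed_total(completed):
    """Imputed estimator: weighted sum of observed and imputed values."""
    return sum(unit["w"] * unit["y"] for unit in completed)
```

A single extreme respondent value of y in a class changes the class ratio and hence every imputed value in that class, which illustrates the instability in the presence of outliers noted above.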
Martin Axelson, Statistics Sweden
Dan Hedlin, Statistics Sweden
Anders Holmberg, Statistics Sweden
Ingegerd Jansson, Statistics Sweden
All European Union members will conduct a census with a reference day in 2011. Statistics Sweden faces the challenge of conducting Sweden’s first fully register-based census. The existing population register, the real property register and the new register of dwellings, which is not yet complete, will be matched to allow us to estimate, for example, distributions of variables such as the ratio of living space to the number of occupants. Issues include statistical matching, disclosure control, effective editing methods for categorical data and ‘unit editing’ (i.e., whether the right units have been identified) and the evaluation of model assumptions and other quality aspects of the register-based methodology.
Christine Bycroft, Statistics New Zealand
The traditional model of census and household surveys supported by administrative data is coming under increasing funding pressure. Statistics New Zealand is developing a long-term view of the overall design or architecture of data sources as a key step toward securing ongoing sustainable funding for social and population statistics. A major theme of the proposed architecture is making more use of data we already have, especially using administrative data to enhance or supplement the information from census and social surveys, through combining data directly with unit record linkage or through statistical models that combine aggregate data.
Several data integration projects such as Linked Employer Employee Data (LEED) have already proven to be of high value, and we are looking to extend these. In contrast, while we have carried out some exploratory work on small area estimation, none has so far proceeded to implementation. At least two examples appear likely to offer excellent results. We are also exploring the use of hierarchical Bayesian models for the estimation of sub-national populations between censuses.
It is the census, however, that is the major budget item. We are considering options for Census 2016 that include changes to the frequency or content, as well as whether some kind of register-based system could be feasible for New Zealand. While there is no Population Register in NZ, we are challenged as to why we cannot use the several administrative registers or lists that do exist (e.g. from tax, health and education systems) in place of a census.
The paper will give an overview of the interactions between the census, surveys and administrative data that exist now in New Zealand, outline where we see the best opportunities for extending these interactions, and report on progress in the statistical modelling and other initiatives that are underway.
Danilo Dolenc, Statistical Office of the Republic of Slovenia
The first register-based census in Slovenia will depend on compiling about 30 administrative and statistical sources. The main disadvantage of this method is the fact that data have to rely exclusively on the content, methodology and quality of data in registers and other sources. The presentation will focus on three basic administrative registers that form the framework of the input data (Central Population Register, Household Register and Real Estate Register). The last two will be used in the statistical process for the first time. The statistical definition of population itself differs significantly from the administrative one. A new concept of population was developed in 2008 (permanent vs. temporary residence is the main issue). In addition, in Slovenia a household register is available, which is as far as we know rather unique even for register countries. However, several methodological solutions are being prepared to overcome the administrative concepts (for example, legislation does not determine institutional households, data on households are not available for temporarily present population). A new definition of households has had to be formulated that differs from the one used in previous censuses, and the consequences are structural changes in the number and composition of households and families. All household and family data are derived from only one variable of the Household Register. From the quality point of view, the weakest source is the Real Estate Register, which was established as late as 2008. The new approach to census-taking implies the intensified use of statistical methods in the process and overall quality assurance.
Andrew Heisz, Statistics Canada
Manon Langevin, Statistics Canada
Jeff Randle, Statistics Canada
Cathy Underhill, Statistics Canada
Matching data is a common practice that reduces response burden and improves the quality of the information collected from respondents, provided the linkage method does not introduce a bias. However, historical linkage, which consists of linking external records from past years with the initial wave of a survey, is a relatively unknown method that had never before been used at Statistics Canada. The present paper describes the method used for linking the records from the pilot survey of the Canadian Household Panel Survey (CHPS) with historical tax data on labour (T4 forms). We will also discuss the characteristics of the records for which linkage did or did not succeed, and the impact of these characteristics on the linkage rates through time. To demonstrate the new possibilities of analysis brought by historical data matching, the study also compares the earnings profiles (by age and sex) of workers with different education and family-history paths, using labour information already available in the pilot survey of the CHPS together with the linked labour information.
Melissa Riley Pfeiffer, NYC Dept of Health and Mental Hygiene
M. Mavinkurve, M.E. Slopen, S. Sedlar, A.E. Curry, J. Ho and K.H. McVeigh
Government agencies are increasingly seeking to leverage existing administrative data for use in research, programming and policy. Complex data linkages have been conducted previously, but linkage methodologies have rarely been published. We present methods employed to link five administrative data sets from the New York City (NYC) Department of Health and Mental Hygiene and Department of Education (DOE).
We utilized probabilistic matching technology to link children and siblings across five data sources: Early Intervention Program (n=156,834), DOE (n=617,934), Lead Poisoning Prevention Program Registry (n=1,469,265), Birth Certificate Registry (n=1,380,608) and Death Certificate Registry (n=8,331). Statistical sampling techniques with human reviews were used to evaluate the match. The Longitudinal Study of Early Development (LSED) relational data warehouse was created from the child and sibling linkages. We describe the linkage process, including data preparation, threshold setting and quality assurance, and results including match rates, comparisons to expected yields and cohort characteristics.
The LSED data warehouse contains data for 1,942,942 children born 1994–2004, with an estimated false match rate of 0.6%. Over half (57%) were found in more than one source, and 20% have at least one sibling.
The matching process created the LSED data warehouse for a diverse population of NYC children, which will enable research on factors of early childhood development and health and educational outcomes. Techniques developed for this project, such as the threshold setting process and false match rate estimation, may be replicated with slight modifications to link existing data sources pertaining to various populations.
Jannes Hartkamp, DESAN Research Solutions, The Netherlands
At the Lisbon Summit of the European Union in 2000, the ministers of education of the member states agreed to aim for a 50% reduction in the number of early school leavers in 10 years. By a standard definition, early school leavers are youngsters who have not (yet) completed upper secondary education and are currently not enrolled in school. The ‘Lisbon target’ has not only been adopted by national governments throughout the European Union, but also by several regions and cities. The City of Rotterdam, the second largest city in the Netherlands, followed the ‘Lisbon line’ in its local policy goals and launched an annual early school leaver monitor in 2000.
The Rotterdam Early School Leaver (ESL) monitor combines administrative data from the municipal pupil administration system and data from an additional survey among youngsters for whom the information available from the register is insufficient. The deficiencies in the administrative data mostly apply to youngsters who moved to Rotterdam at a higher age, from abroad or from elsewhere in the Netherlands. Interplay between administrative data and survey data in the ESL-monitor is threefold. Firstly, an analysis of the data (or rather lack of data) in the municipal pupil administration system reveals which youngsters should be contacted for the additional survey. Secondly, the administrative data and the (weighted) survey data are combined into one integrated research dataset to arrive at all relevant statistics and background analyses. Thirdly, the information obtained through the survey is used to complement and update the administrative data. Because of this feedback, the scale of the additional survey could decrease year by year.
Changbao Wu, University of Waterloo (Canada)
Jae-kwang Kim, Iowa State University (USA)
Recent work on using a pseudo empirical likelihood (PEL) approach to finite population inferences with complex survey data focused primarily on a single survey sample, non-stratified or stratified, with considerable effort devoted to computational procedures. This paper presents PEL methods for combining information from two independent surveys. We consider two scenarios commonly encountered in practice: (1) the study variable y and the vector of auxiliary variables x are observed from the sampled units for both surveys; and (2) the vector of auxiliary variables x is observed from a large survey while both the study variable y and auxiliary variables x are observed from a smaller survey. Our main focus is on the optimal point estimation of finite population parameters and the construction of PEL ratio confidence intervals, using either a chi square approximation or a bootstrap calibration method. The proposed approach has several advantages over the traditional approach and also provides a general tool for survey weight construction using auxiliary sources of information. Simulation results on the performance of the PEL method relative to the traditional method will be reported.
This is joint work with Jae-kwang Kim of Iowa State University and J.N.K. Rao of Carleton University.
Yves G. Berger, University of Southampton (UK)
Estimation of correlations would be relatively straightforward if cross-sectional estimates were based upon the same sample. Unfortunately, samples at different waves are usually not completely overlapping sets of units because of rotations used in repeated surveys. This implies that cross-sectional estimates are not independent. Correlation plays an important role in estimating the variance of a change between cross-sectional estimates. The unbiasedness of an estimator of a correlation is crucial, because a small bias can lead to a significant overestimate or underestimate of the variance of change. Calibration on control totals is commonly used for survey weighting. It is usually assumed that these totals are values known without sampling errors. However, they can be estimated from a previous wave. For example, the regression composite estimator can be viewed as an estimator calibrated on control totals estimated from previous wave totals. Variance estimation of the composite estimator depends on correlation between cross-sectional estimates. Several methods can be used to estimate correlations, some of which use re-sampling and/or Taylor linearization. We propose to compare existing methods used for variance estimation, with a focus on social survey data. The methods will be evaluated via simulation based upon the UK Labour Force Survey.
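The role of the correlation can be made concrete with the standard identity for the variance of a difference, var(t2 - t1) = var(t1) + var(t2) - 2 * rho * sqrt(var(t1) * var(t2)). The short sketch below only evaluates that identity; the function name and the numbers are illustrative and are not taken from the study.

```python
import math

def variance_of_change(var_wave1, var_wave2, correlation):
    """Variance of the difference between two cross-sectional estimates."""
    covariance = correlation * math.sqrt(var_wave1 * var_wave2)
    return var_wave1 + var_wave2 - 2.0 * covariance


# With equal unit variances, moving the correlation from 0.8 to 0.7
# raises the variance of change from about 0.4 to about 0.6.
v_high = variance_of_change(1.0, 1.0, 0.8)
v_low = variance_of_change(1.0, 1.0, 0.7)
```

In this illustration an error of 0.1 in the estimated correlation changes the variance of change by about half, which is why a small bias in the correlation estimator matters.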
Takis Merkouris, Athens University of Economics and Business (Greece)
Various forms of integrating social survey data are on the rise in contemporary survey practice. They all involve some combination of information from different survey sources that are used as auxiliaries to each other. The auxiliary sources may be other surveys, subsamples of a single survey, or past data of a repeated survey. The auxiliary information used in the data integration may be auxiliary variables of the surveys involved or common target variables, and may be in the form of sample microdata or sample estimates. Objectives of survey integration may include reduced cost, reduced response burden, improved data quality, data consistency, improved estimation and analysis. In this talk, a compilation of examples of integration of social surveys will be given, categorized by survey objective, type of auxiliary source and sample dependence. The focus will be on the improvement of estimates of interest, and more specifically on the incorporation of auxiliary information into the weighting structure of the integrated surveys. Micro-integration through the adjustment of survey weights can be accomplished by suitable calibration schemes, which are equivalent to regression procedures based on the principle of best linear unbiased estimation. Issues related to the planning of survey integration, the harmonization of the integrated auxiliary data and the practicality of weighting procedures will also be discussed.
Tom Haymes, Statistics Canada
Jean-Luc Bernier, Statistics Canada
The 2011 Census will once again utilize a mailout methodology and the Address Register will be the source of the list of addresses. The major change over the 2006 Census is that a 100% field verification of the address lists will not be done. Instead, field verification will be done on only a portion of the listing units. A new targeting methodology was developed to select the listing units that will undergo field verification with the goal of minimizing dwelling undercoverage. The main component of the score function developed is the growth in addresses resulting from updating the Address Register with administrative sources. Another change for 2011 is that the listing activity will be conducted over a two-year period prior to the Census instead of the one-time approach utilized for 2006.
The Household Survey Program also conducts a field activity to create and/or verify address lists for in-sample clusters. In order to avoid both programs conducting listing activities in the same area, listing activities have been coordinated and the verified address lists are shared between the programs. This has involved advancing the schedule of some Census Listing Units so that the results can be used by household surveys, avoiding clusters that have already been listed by household surveys, and updating the Address Register with the cluster address lists. In addition, refinements to the listing units were required to handle the spatial intersection between listing units and clusters.
Éric Langlet, Statistics Canada
Since 1986, a post-censal surveys program has been in place in Canada in order to conduct follow-up interviews on a particular theme among a sample of respondents who completed the long census questionnaire. The long questionnaire is administered to one out of five households in Canada except in remote areas and Indian reserves, where all households must complete it. After this first stage, a stratified sample of respondents to the long questionnaire is then selected according to the characteristics observed at the first stage. Using the census as a survey frame enables the study of rare populations belonging to small areas and provides a rich set of characteristics that can be used during sample selection, non-response adjustments and poststratification. However, this two-stage survey design poses specific challenges for variance estimation when the sampling fraction of the first stage is not negligible. This article gives a brief overview of the post-censal surveys program since its creation and presents in more detail the methodological challenges related to Aboriginal post-censal surveys.
Pamela Tallon, Statistics Canada
Linda Ramsey, Statistics Canada
The 2006 Canadian Census saw the successful implementation of the biggest methodological changes since 1971. The 2011 Census will largely repeat the 2006 approach; however, methodologies must continue to be enhanced to address the challenges that arise and to work towards new efficiencies and streamlining of processes. Firstly, to further reduce the reliance on a large and decentralized workforce, the plan is to increase the target for mail-out from the 70% of dwellings achieved in 2006 to 80% in 2011. Secondly, based on the strength of the Internet responses in 2006, the plan is to reduce printing and handling of paper questionnaires by targeting a 40% Internet take-up rate in 2011 compared to the world-leading 18% achieved in 2006. This will be facilitated by mailing a letter instead of a questionnaire to areas indicating a high probability of internet connectivity, as well as implementing a “Wave Methodology” whereby questionnaires and letters are sent out at predefined times coordinated tightly with a corresponding communication strategy. A number of new tools and strategies are being implemented to help achieve these goals, and all of this has meant an integrated approach to collecting Census data in Canada. This paper presents an overview of the 2011 Canadian Census. It focuses on the various response channels for collecting data, how they are integrated into a single strategy, and the innovative tools and methodologies that are being introduced to conduct the first Canadian Census where the majority of Canadians may never receive a paper questionnaire.
Jennifer Taylor, Statistics Canada
A new collection methodology is planned for Canada's 2011 Census of Population in order to increase the response by Internet without increasing the amount of non-response. The wave collection methodology is an integrated collection strategy involving a series of response stimuli. Under this plan, the majority of dwellings will receive a letter with a unique access code for on-line completion of the questionnaire rather than a paper copy. Non-responding households will receive reminders and, if needed, a paper questionnaire. This plan requires efficient delivery of communication material and fast notification that responses are received. This is feasible when, based on Statistics Canada's Address Register, material is mailed rather than hand-delivered. The Address Register is a list of addresses that is updated periodically by administrative sources. During the 2009 Census Test, we observed the course of collection operations under the new collection methodology. In addition to the two main test sites, a supplementary sample of 25,000 dwellings was selected across Canada. Its purpose was the evaluation of the wave collection methodology for dwellings that are initially sent a letter promoting on-line completion of the questionnaire. The study compared two applications of the wave collection methodology and two versions of the first wave letter. This paper presents an overview of the collection methodology planned for the 2011 Census, a description of the five panels of the supplementary sample and the results of the test. The impact of these findings on the 2011 Census plan is discussed.
Tobias Bachteler, University of Duisburg-Essen, Germany
Rainer Schnell, University of Duisburg-Essen, Germany
Jörg Reiher, University of Duisburg-Essen, Germany
Due to the frequency of spelling and typographical errors in practical applications, record linkage algorithms have to use string similarity functions. In many legal contexts, identifiers such as names have to be encrypted before a record linkage can be attempted. Therefore, algorithms for computing string similarity functions with encrypted identifiers are essential for approximate string matching in private record linkage.
This study reports an empirical evaluation of three promising approaches to compute similarities between identifiers in a privacy-preserving manner (Pang & Hansen 2006, Scannapieco et al. 2007, Schnell et al. 2009). The performances of these algorithms were compared with each other and with those of a hashed phonetic encoding and an edit distance. Using experimentally generated human data (100 different audio-presented names keyed by 280 students, producing a dataset of 28,000 names), the performance of the proposed methods was compared by examining precision-recall plots.
The overall performance of the Bloom filter method (Schnell et al. 2009) is acceptable for private record linkage. The performance of the method of Pang and Hansen depends strongly on the chosen comparison strings.
Antoine Chevrette, Statistics Canada
At Statistics Canada, matching data without unique identifiers is a common practice. The probabilistic method for record linkage developed by Ivan Fellegi and Allan Sunter is the principal method recommended by Statistics Canada for doing this type of matching.
Over the last few decades, work has been undertaken to generalize the Fellegi-Sunter algorithm in order to offer our community the ability to use this methodology within a computer application. The most recent version of this application is called GLink and is part of Statistics Canada’s package of generalized systems.
By definition, a generalized system must be user-friendly, robust, flexible and able to respond adequately to user demands. It will be interesting to discover from reading this article how it was possible to achieve these last criteria using the latest user interface and development technologies.
To do this, a global review of the Fellegi-Sunter algorithm will be done to put it in perspective with its computer avatar. Certain critical methodology components present a challenge, as much in terms of the technology as of the user interface, and these will be studied in detail. The solution used in GLink for each challenge studied will also be presented in a detailed manner.
Finally, we will review practical techniques regarding the use of GLink in order to facilitate its use. We will then begin a discussion on future computational developments in GLink so that it achieves even greater heights.
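For readers unfamiliar with the Fellegi-Sunter approach, the core scoring rule can be sketched in a few lines. This is a generic illustration and not GLink's interface: the m- and u-probabilities, the thresholds and all function names are hypothetical values chosen for the example.

```python
import math

def match_weight(agreements, m_probs, u_probs):
    """Sum of log2 likelihood ratios over the compared fields.

    agreements : dict field -> True if the two records agree on it
    m_probs    : dict field -> P(agree | records are a true match)
    u_probs    : dict field -> P(agree | records are not a match)
    """
    total = 0.0
    for field, agrees in agreements.items():
        m, u = m_probs[field], u_probs[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))
    return total


def classify(weight, lower, upper):
    """Three-way decision: link, non-link, or possible link for review."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Illustrative parameters only.
m = {"surname": 0.90, "birth_year": 0.95}
u = {"surname": 0.01, "birth_year": 0.05}
w = match_weight({"surname": True, "birth_year": False}, m, u)
decision = classify(w, lower=-3.0, upper=8.0)
```

Pairs falling between the two thresholds are typically sent to clerical review, which is where the user-interface challenges mentioned above tend to arise.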
Rainer Schnell, University of Duisburg-Essen, Germany
Tobias Bachteler, University of Duisburg-Essen, Germany
Jörg Reiher, University of Duisburg-Essen, Germany
In many record linkage applications, identifiers have to be encrypted to preserve privacy. Therefore, a method for approximate string comparison in privacy-preserving record linkage is needed. Although some intriguing approaches are proposed in the literature, they suffer from high computing demands or high error rates. We describe a new method of approximate string comparison in private record linkage. The main idea is to store q-gram sets derived from identifier values in Bloom filters and compare them bitwise across databases. This exploits the cryptographic features of Bloom filters while nevertheless allowing one to calculate string similarities.
For experimental evaluation, we used a list of 1,000 German surnames and introduced typographical errors artificially. In a second simulation study we matched 500,000 artificially generated data records with a database containing 125,000 modified records. We conducted an additional test matching two German private administration databases containing identifiers of about 15,000 people each.
We show that the proposed method compares quite well to evaluating string comparison functions with plain text values of identifiers and outperforms an alternative method using hashed phonetic encodings of identifier values.
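The main idea can be sketched as follows. This is a minimal illustration, assuming bigrams, a keyed HMAC-SHA256 hash family, a 1,000-bit filter and the Dice coefficient as similarity measure; the filter length, number of hash functions, secret key and function names are all assumptions made for the example rather than the settings used in the study.

```python
import hashlib
import hmac

def qgrams(name, q=2):
    """Set of q-grams of a padded, lower-cased identifier."""
    padded = "_" * (q - 1) + name.lower() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def bloom_filter(name, size=1000, num_hashes=15, key=b"shared-secret"):
    """Positions set to 1 in the Bloom filter, stored as a set of indices."""
    bits = set()
    for gram in qgrams(name):
        for k in range(num_hashes):
            # Keyed hash: only holders of the key can recompute positions.
            message = f"{k}:{gram}".encode("utf-8")
            digest = hmac.new(key, message, hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:8], "big") % size)
    return bits


def dice_similarity(bits_a, bits_b):
    """Dice coefficient computed on the set bits of two filters."""
    return 2.0 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


similar = dice_similarity(bloom_filter("Smith"), bloom_filter("Smyth"))
different = dice_similarity(bloom_filter("Smith"), bloom_filter("Jones"))
```

Because names that share many q-grams set many of the same bits, `similar` comes out well above `different`, even though neither party ever sees the other's plain-text identifiers.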
P. Lahiri, University of Maryland, USA
S. Pramanik, NORC at the University of Chicago, USA
Many survey organizations use synthetic methods to produce estimates for small areas because of their simplicity and wide applicability. A synthetic method generally assumes an implicit or explicit model that links the main variable(s) from the primary survey data to a set of auxiliary variables that are available from the sampling frame or from various administrative and census databases. The assumed model is used to produce estimates for different small areas. One advantage of the synthetic method is that it can produce estimates even if the survey does not provide data for the small area of interest. However, the estimates can be subject to high bias due to model misspecification and/or the unavailability or poor quality of auxiliary information. Design-based mean square error (MSE) is a robust measure of uncertainty for the synthetic estimator, since it incorporates both the variance and the bias of the estimator without relying on any model. However, the estimation of design-based MSE for a single small area seems difficult, primarily because the bias estimate is either non-existent or highly imprecise. To get around the problem, we consider the estimation of average design-based MSE, where the average is taken over all similar small areas. We propose a new estimator of average design-based MSE and establish its superiority over an existing method through simulation.
J. N. K. Rao, Carleton University, Ottawa (Canada)
Eradication of extreme poverty and hunger is the first of the Millennium Development Goals established by the United Nations. Availability of reliable statistics on people’s living conditions is a basic requirement for the achievement of this goal. However, information collected from national surveys is often limited and allows reliable estimation only for large regions or large population subgroups. For areas with small (or even zero) sample sizes it is necessary to employ indirect estimation methods that can lead to reliable estimates by borrowing information across related areas through linking models. In this talk, our focus is on small area estimation of poverty measures that are complex non-linear functions of the values of a welfare variable, in particular on a class of measures used by the World Bank that covers poverty incidence, poverty gap and poverty severity as special cases. We use the empirical Bayes (EB) method to obtain efficient EB estimators of small area poverty measures, based on a nested error linear regression linking model relating a transformed welfare variable to auxiliary variables available for all population units. These estimators are calculated through simulation methods, as they do not admit closed form expressions. The mean squared prediction error (MSPE) of the EB estimators is estimated through a parametric bootstrap method. We study the performance of the proposed estimators relative to direct area-specific estimators and widely used synthetic estimators, based on a “simulated census” approach, using model-based and design-based simulation studies. We also apply the proposed method to estimate poverty incidences and poverty gaps in Spanish provinces by gender. The proposed methodology is applicable to general non-linear parameters.
This talk is based on joint work with Isabel Molina, Universidad Carlos III de Madrid, Getafe (Madrid), Spain.
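The class of poverty measures referred to above is assumed here to be the family commonly attributed to Foster, Greer and Thorbecke, in which an exponent alpha of 0, 1 and 2 gives poverty incidence, gap and severity respectively. The sketch below computes these measures for a fully observed area; it does not implement the EB estimator, and the function name and data are illustrative.

```python
def poverty_measure(welfare, poverty_line, alpha):
    """Average of ((z - E_i) / z) ** alpha over units with E_i below z.

    alpha = 0 : poverty incidence (share of units below the line)
    alpha = 1 : poverty gap
    alpha = 2 : poverty severity
    """
    total = 0.0
    for value in welfare:
        if value < poverty_line:
            total += ((poverty_line - value) / poverty_line) ** alpha
    return total / len(welfare)


# Illustrative welfare values for one small area, poverty line z = 100.
area = [50.0, 80.0, 120.0, 200.0]
incidence = poverty_measure(area, 100.0, 0)
gap = poverty_measure(area, 100.0, 1)
severity = poverty_measure(area, 100.0, 2)
```

The indicator and the power make these measures non-linear in the welfare variable, which is why simulation from the fitted linking model is needed when most units in an area are unobserved.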
William R. Bell, U.S. Census Bureau, Washington D.C. (USA)
Elizabeth T. Huang, U.S. Census Bureau, Washington D.C. (USA)
We start with the basic Fay-Herriot area level model for small area estimation and then review its generalizations to bivariate, multivariate, and measurement error models. A primary motivation for using a more general model may be the potential for achieving greater improvements in small area estimators by making use of additional data. Our focus, however, is more on reviewing (1) the assumptions made by these models about the various data sources (survey, census and administrative records) being used, (2) how the models thus use the various types of data, and (3) what assumptions seem generally appropriate for which types of data sources. Conversely, we also consider how choosing a model whose assumptions do not match the data could lead to problems, or at least to suboptimal data usage. We provide illustrative examples using data from the U.S. Census Bureau’s Small Area Income and Poverty Estimates (SAIPE) program.
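As a reminder of how the basic Fay-Herriot model uses its two data sources, the sketch below forms the usual shrinkage estimate: a weighted average of the direct survey estimate and a regression prediction, with weight gamma = model variance / (model variance + sampling variance). The regression predictions and the model variance are treated as given, which is a simplifying assumption; in practice both are estimated. The function name and numbers are illustrative.

```python
def fay_herriot_estimates(direct, sampling_var, synthetic, model_var):
    """Shrink each direct estimate toward its regression prediction.

    direct       : direct survey estimates, one per area
    sampling_var : known sampling variances of the direct estimates
    synthetic    : regression predictions from auxiliary data
    model_var    : variance of the area-level random effects
    """
    estimates = []
    for y, d, s in zip(direct, sampling_var, synthetic):
        gamma = model_var / (model_var + d)  # weight on the direct estimate
        estimates.append(gamma * y + (1.0 - gamma) * s)
    return estimates


# Area 1 has a precise direct estimate, area 2 an imprecise one.
est = fay_herriot_estimates(
    direct=[10.0, 20.0], sampling_var=[1.0, 9.0],
    synthetic=[12.0, 12.0], model_var=3.0,
)
```

The area with the larger sampling variance is pulled further toward the prediction built from census or administrative covariates, which illustrates why the assumptions made about each data source matter.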
Stefan Bender, Institute for Employment Research, Germany
Jörg Heining, Institute for Employment Research, Germany
Tanja Hethey, Institute for Employment Research, Germany
Patrycja Scioch, Institute for Employment Research, Germany
One of the challenges in data production is to get richer data sets by linking different data sources. The Research Data Centre (FDZ) of the German Federal Employment Agency at the Institute for Employment Research tries to link administrative data to other data sources like survey data. Linkage has been done at the individual, establishment and regional level.
Jointly with the RWI Essen and the SOEP (Socio Economic Panel) group, the FDZ currently works on the linkage of labour market microdata to the SOEP on a disaggregated level, i.e. postal code areas. One aim of this project is the evaluation of non-market interactions and neighborhood effects.
One of the most visible data sets is the Linked-Employer-Employee data set called LIAB. In this data set, administrative individual employment history data are linked to an establishment survey. Because neither the survey nor the administrative data are able to fulfill all of the researchers' needs, the FDZ constructed a double linked employer employee data set for further training where both dimensions (individuals and establishment) are measured with survey and administrative data at the same time (WeLL). By comparing the same variables stemming from different data generating processes, we also learn more about the quality of data.
Future developments will not end at linking survey with administrative data. In the near future, German administrative data will be linked across different data producers. For example, the KombiFiD project aims to combine firm data from different public data producers. Another project focuses on the linkage of patent data (innovators) with administrative employment data.
Karen Geurts, HIVA – Katholieke Universiteit Leuven, Belgium
Access to large administrative business registers has opened up enormous possibilities for micro economic research on labor market dynamics. The use of these data sets, however, also raises some problems. A major drawback arises from failures in the longitudinal linking of firm records. This results in the false identification of firm openings and closings and leads to an upward bias in dynamics measures.
Traditional methods to overcome longitudinal record linkage problems, such as probabilistic matching, require careful manual review. Moreover, they do not capture firm restructuring events. We propose an alternative method based on information on the continuity of the firm’s work force. The method allows more effective longitudinal linking and is easily reproducible. Our point of departure is a linked employer-employee dataset covering all private employment in Belgium. We present a longitudinal linkage algorithm that makes use of information on clustered employee flows between firms. This allows the re-establishment of broken links between records of the same firm and of parts of the same firm. We then propose a correction formula to adjust measures of job creation and the destruction of the firms involved.
The main result of the employee flow approach is a substantial quality improvement of the measures on labour market dynamics. The upward bias in statistics of firm demography and of job creation and destruction is significantly reduced and strong annual fluctuations are considerably flattened. An additional application of the method is the identification of firm restructurings such as mergers and acquisitions, as well as the estimation of the number of employees involved.
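The employee-flow idea can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it only repairs one-to-one links for firm identifiers that disappear, and the thresholds `min_share` and `min_size`, like the function name, are hypothetical.

```python
from collections import Counter, defaultdict

def link_by_employee_flows(year1, year2, min_share=0.5, min_size=5):
    """Re-link vanished firm identifiers using clustered employee flows.

    year1, year2 : dict employee_id -> firm_id in two consecutive years
    Returns a dict old_firm_id -> new_firm_id for identifiers that
    disappear in year2 but whose workforce largely reappears together
    under a single other identifier.
    """
    firms_in_year2 = set(year2.values())
    workforce = defaultdict(list)
    for employee, firm in year1.items():
        workforce[firm].append(employee)

    links = {}
    for firm, employees in workforce.items():
        if firm in firms_in_year2 or len(employees) < min_size:
            continue  # identifier survives, or firm too small to judge
        destinations = Counter(year2[e] for e in employees if e in year2)
        if not destinations:
            continue  # nobody from this firm is observed in year2
        new_firm, moved = destinations.most_common(1)[0]
        if moved / len(employees) >= min_share:
            links[firm] = new_firm  # treat as the same firm, not a closure
    return links


# Firm A "closes" but five of its six employees reappear together in C.
y1 = {"e1": "A", "e2": "A", "e3": "A", "e4": "A", "e5": "A", "e6": "A",
      "e7": "B", "e8": "B"}
y2 = {"e1": "C", "e2": "C", "e3": "C", "e4": "C", "e5": "C", "e6": "D",
      "e7": "B", "e8": "B"}
repaired = link_by_employee_flows(y1, y2)
```

Once such links are restored, the apparent closure of A and opening of C no longer count as job destruction and job creation, which is the source of the upward bias described above.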
Daniela Hochfellner, Institute for Employment Research, (Germany)
Axel Voigt, Institute for Employment Research, (Germany)
Using administrative data for research is becoming more and more popular because of their richness of information on individuals. Due to this rising interest, we have started a project to merge German administrative data of two social security agencies, the Federal Employment Agency and the Pension Insurance, to improve the quality of register-based data. In Germany, these agencies collect data on the employment histories of individuals who are relevant to their own field of activity. The linkage will lead to a unique employment biography dataset for researchers worldwide.
Although there is a uniform identifier for the linkage, serious specific problems occur. There are inconsistencies, which result either from different editing procedures at the agencies or possible mistakes made during data collection that lead to the existence of multiple states in the data. The cleansing procedures for the first type of inconsistencies can be easily executed with mathematical conversion algorithms, whereas the second type is much harder to handle. To cope with these multiple states, we analyse the sequences of all possible states to identify misfits in the combined data. Finally, these identified misfits are corrected with heuristics based on certain assumptions regarding the Code of Social Law.
Our presentation will deal with how we cope with these various types of inconsistencies and provide information on the cleansing procedures we use to improve the content of administrative data.
Martina Huber, Institute for Employment Research, Germany
Alexandra Schmucker, Institute for Employment Research, Germany
Stefan Bender, Institute for Employment Research, Germany
Surveys often face particular problems: gaps appear in retrospective reports, or respondents cannot provide details. Sometimes these problems can be solved by using additional information from process-generated data. These administrative data offer valid and exact information, but also include potentially less valid variables with a higher share of missing values. By linking the data, the data quality can be improved by creating a dataset that balances the disadvantages of the administrative and survey data using the advantages of these types of data.
The survey data contain information about 6400 employees from 150 establishments and are sampled from and linked to administrative data via social security records.
The following topics will be analyzed:
Asaph Young Chun, University of Chicago (USA)
Fritz Scheuren, University of Chicago (USA)
The U.S. National Assessment of Educational Progress (NAEP) conducted by the National Center for Education Statistics (NCES) is the only nationally representative and continuing assessment of what USA students know and can do in various subject areas, including mathematics and science. NAEP data at grade 12, in particular, is subject to potential non-response bias due to both a low participation rate and the correlation between NAEP survey variables of interest and non-participation likelihood.
The purpose of this paper is to evaluate the impact of nonparticipation bias on estimates of student performance in NAEP. To address this purpose we will utilize a non-response propensity model inspired by Groves and Couper (1998) and an analytical approach by Abraham, Maitland and Bianchi (2006).
We construct non-response propensity models by taking advantage of the interplay among the three datasets: NAEP restricted-use data at grade 12, NAEP survey contact history data and school administrative data from the High School Transcript Studies (HSTS). We merge them all at an individual student level. Because the transcripts for the HSTS are collected from all students in the same NAEP sample of schools, regardless of the individual student’s participation status in NAEP, the data merged between NAEP and HSTS provide an extended list of key correlates of non-response (including course-taking pattern and academic background). Thus, a more robust modeling of non-response propensity is feasible. We demonstrate how alternative non-response weighting adjustments derived from non-response propensity models affect NAEP estimates of student performance, and evaluate potential improvement over the current practice of NCES adjustment for non-response, relying, as now, just on sampling frame variables in NAEP.
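The general mechanics of a propensity-based weighting adjustment can be sketched as follows. This is not the model-based approach of the paper: instead of a regression model with transcript covariates, the sketch estimates the response propensity as the weighted response rate within cells of a single auxiliary variable (a simpler, swapped-in technique). The record layout, the field names and the cell labels are hypothetical.

```python
from collections import defaultdict


def propensity_adjusted_weights(records):
    """Return one adjusted weight per record (0.0 for non-respondents).

    records: list of dicts with keys 'cell', 'base_weight' and 'responded'.
    The propensity in each cell is the weighted share of respondents, and each
    respondent's base weight is divided by the propensity of its cell.
    """
    total = defaultdict(float)
    responding = defaultdict(float)
    for r in records:
        total[r["cell"]] += r["base_weight"]
        if r["responded"]:
            responding[r["cell"]] += r["base_weight"]

    propensity = {cell: responding[cell] / total[cell] for cell in total}
    return [
        r["base_weight"] / propensity[r["cell"]] if r["responded"] else 0.0
        for r in records
    ]


# Hypothetical sample: cells could be course-taking patterns taken from transcripts.
sample = (
    [{"cell": "advanced", "base_weight": 1.0, "responded": i < 2} for i in range(4)]
    + [{"cell": "basic", "base_weight": 1.0, "responded": i < 1} for i in range(4)]
)
adjusted = propensity_adjusted_weights(sample)
```

Within each cell the adjusted weights of the respondents add up to the base-weight total of the whole cell, so cells with lower participation receive larger adjustments. Richer auxiliary data, such as the transcript variables described above, allow finer cells or a model with more covariates.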
Jae-kwang Kim, Iowa State University, U.S.A.
Wayne Fuller, Iowa State University, U.S.A.
Imputation is a frequently used technique for handling missing data in survey sampling. Fractional imputation is proposed as a way of achieving efficient estimates and efficient variance estimation. We consider multivariate data with arbitrary missing patterns and show that calibration techniques can be used to create fractional weights. The proposed imputation method provides efficient estimates for the parameters of the imputation model and also provides reasonable estimates for other parameters. The proposed method is applicable to two-phase sampling where the second phase can be treated as the respondents. Variance estimation using a replication method is discussed and results from simulation studies are presented.
M. Louise Lawson, Kennesaw State University, U.S.A.
Erin O’Connor, Daniel Street and Crystal Rouse
A recent article in the Chronicle of Higher Education indicating that non-response bias is unrelated to response rates has led some college administrators to insist that any survey response rate should be considered valid, particularly if the distribution of respondent demographics is similar to the college demographics. We recently conducted an email invitation online survey of our enrolled college students with a response rate of 2%. The distribution of respondents was very similar to our college’s distribution in terms of classification (freshman-graduate) and gender. The largest difference was that, in the survey, 20.3% of respondents reported being seniors, while records show that 27% of students were seniors. We calculated the Schouten/Cobben/Bethlehem R-indicator based on known response propensities. As expected for a low response rate, the R-indicator was close to 1 (.9901 for class level and .9984 for gender). Subsequent to the volunteer sample, we are conducting the same survey in classrooms when the instructor is absent, with a response rate of close to 100%. Interim analysis indicates that the opinions on key questions are significantly different in the classroom survey, confirming that this particular R-indicator is not useful in determining sample representativeness when the response rate is low. In addition, as the responses to the classroom survey would lead to a different policy decision than was made based on the email survey, our results indicate that extremely low rates and volunteer response can lead to severe non-response bias even when the sample appears to match demographically to the source population.
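Why the R-indicator sits close to 1 at very low response rates can be seen from its definition, R = 1 - 2 S(ρ), where S(ρ) is the standard deviation of the response propensities. The sketch below estimates propensities as group response rates; the counts are hypothetical (chosen only to resemble a response rate of about 2%), the variance uses an N rather than N - 1 denominator for simplicity, and the result is not meant to reproduce the values reported in the abstract.

```python
import math


def r_indicator(pop_counts, resp_counts):
    """R = 1 - 2 * S(rho), with rho estimated as the response rate of each group."""
    n_pop = sum(pop_counts.values())
    rho = {g: resp_counts[g] / pop_counts[g] for g in pop_counts}
    mean_rho = sum(pop_counts[g] * rho[g] for g in pop_counts) / n_pop
    var_rho = sum(pop_counts[g] * (rho[g] - mean_rho) ** 2 for g in pop_counts) / n_pop
    return 1 - 2 * math.sqrt(var_rho)


# Hypothetical counts: 10,000 enrolled students, 200 respondents (2%).
population = {"senior": 2700, "other": 7300}
respondents = {"senior": 41, "other": 159}
r_value = r_indicator(population, respondents)

# Propensities lie in [0, 1], so S(rho) cannot exceed sqrt(mean * (1 - mean)).
mean_rate = sum(respondents.values()) / sum(population.values())
lower_bound = 1 - 2 * math.sqrt(mean_rate * (1 - mean_rate))
```

With a mean propensity of 0.02 the indicator cannot fall below about 0.72, whatever the composition of the respondents, and modest differences between groups leave it very close to 1. The indicator also reflects only the auxiliary variables used to estimate the propensities, which is consistent with the finding that opinions differed even though the demographics matched.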
Annie Giguère, Institut de la statistique du Québec (Canada)
Annie Bélanger, Institut de la statistique du Québec, Canada
Carol Gilbert, Institut de la statistique du Québec, Canada
An important challenge has appeared over the course of the last few years, in Québec, Canada and elsewhere in the world. It concerns access—for research or governing purposes—to files derived from matching administrative files or from surveys already available from public organizations. Such files contain high analytical potential, but are underutilized due to the complexity involved in treating them, administrative measures limiting access and the delays that result.
In Québec, this confidential information can be accessed under the Loi sur l’accès à l’information and, in certain cases, under other laws as well. This is a relatively simple process when access is requested from a single file-owning organization. In contrast, the process is complex for access to data that are crossed or have no unique identifier. In addition to technical pitfalls with file matching, retrieval of pertinent information and quality evaluation, there are also legal, ethical, administrative and institutional problems. In short, the process is long and laborious, with no guarantee of the quality of the results obtained.
The EPSEBE (Environnement pour la promotion de la santé et du bien-être) was created in this context, to facilitate access to data held by different departments or public organizations in Québec, perform matching and retain the time needed for analysis, all while respecting the principles of protecting personal information and data security. The EPSEBE is not a data warehouse but rather a secure work environment allowing remote access. The types of data include data held by departments or public organizations, data from surveys done by the Institut de la statistique du Québec and even data held by individual researchers. The EPSEBE is aimed at researchers from all disciplines, from both universities and the public service, and from Québec as well as Canada or other countries when requested. The EPSEBE aims particularly at supporting social programs in Québec by reusing information and expertise while making the information publicly available.
The EPSEBE is a centre of expertise coupled with an infrastructure for treating information. Its aims are as follows:
David Price, Statistics Canada
Martin Lessard, Statistics Canada
Patrick Gallifa, Statistics Canada
Like many national statistical organizations (NSOs), Statistics Canada is facing increasing national and international demands from researchers for access to detailed microdata. Statistics Canada has recently put in place a new service that will improve access while at the same time protecting the confidentiality of the various data. The Real Time Remote Access system is essentially an on-line remote access facility that allows users to create tables, more or less in real time, on microdata or lightly masked microdata sets kept in a central and secure location. There are many challenges in the development of an on-line remote access tool, particularly regarding how to maintain confidentiality knowing the various output products released with their different disclosure rules.
Peter Meyer, National Center for Health Statistics, Hyattsville MD (USA), http://www.cdc.gov/nchs/r&d/rdc.htm
The Research Data Center (RDC) at the National Center for Health Statistics (NCHS) was developed to provide researchers access to data that are too sensitive to be released publicly because they carry a degree of disclosure risk that is considered too dangerous by the owner of the data system or data product. This risk stems from various characteristics of the data, including the potential for reidentification of subjects, respondents, or institutions. This paper describes the major data systems at NCHS, identifies some of the confidentiality issues associated with each system, and then focuses on the methods developed to provide researchers access to these data while maintaining confidentiality. We also introduce the data product and explain the difference between linking and merging data. For instance, linked data products, e.g., National Health Interview Survey Mortality files, are a probabilistic match between survey and administrative data to provide more information at the unit level. Alternatively, merged data are typically survey data that have been appended with community or some other contextual variables. Finally, we will discuss the interplay between mathematical, sociological, and technological dimensions in protecting different types of sensitive data.
Judy Lee, Alberta Office of Statistics and Information, (Canada)
Michael Haan, University of Alberta, (Canada)
Bradley Brooks, Statistics Canada
Alberta has long had ‘shadow populations’ in its midst, a non-permanent portion of the population that spends a significant amount of time in the province but is not counted in provincial or national headcounts.
In principle, a person is a member of a shadow population if their actual or ‘de facto’ location is within Alberta, but their regular or ‘de jure’ location is elsewhere. Although relatively easy to define, the shadow population is much more difficult to measure because of the dynamic nature of its members and their purposes for migration (labour, education, etc.). As a result, shadow populations may not be fully accounted for in terms of planning and development, thereby creating a gap between existing infrastructure and the people that use and depend on it.
Given that workers in a global economy are increasingly mobile, we argue that some of the issues surrounding shadow populations in Alberta also apply to a growing part of Canada’s labour force. In this presentation, we discuss some of the theoretical and methodological complexities of Alberta’s shadow population. We then present a methodology jointly developed by the Alberta Office of Statistics and Information, the University of Alberta Population Research Laboratory, and Special Surveys Division of Statistics Canada for measuring shadow populations.
Alasdair Noble, Massey University, New Zealand
Stephen Haslett, Massey University, New Zealand
Geoff Jones, Massey University, New Zealand
Dimitris Ballas, University of Sheffield, United Kingdom
The three techniques of small area estimation, spatial microsimulation and mass imputation are generally not seen as being similar. They are not all used within the same discipline – the technical literature on the first and last is mostly in Statistics, while the second has mostly been used by Human Geographers. However, there is the similarity that, in certain forms, all aim to provide predictions, which can be amalgamated to form estimates for subgroups. In an effort to better understand the links and differences between the three techniques, an extensive simulation study was carried out. The results of this study will be reported, followed by a discussion of the approaches to the simulations that helped gain insight into deeper similarities between what seemed, initially, very different methods.
Jan van den Brakel, Statistics Netherlands, Netherlands
Sabine Krieg, Statistics Netherlands, Netherlands
The Dutch Labor Force Survey (LFS) is based on a rotating panel design. Each month, a sample of addresses is drawn and data are collected by means of computer assisted personal interviewing of the residing households. The sampled households are re-interviewed by telephone four times at quarterly intervals. There are two major problems with the design of the LFS. First there are substantial systematic differences between the subsequent waves of the panel due to mode and panel effects (rotation group bias). A second problem is that the monthly sample size of the LFS is too small to rely on the generalized regression estimator to produce sufficiently reliable official statistics about monthly employment and unemployment. A multivariate structural time series model is developed to improve the accuracy of the monthly estimates. This model explicitly estimates the rotation group bias between the first wave and the four other waves. Furthermore, the time series model borrows strength from data observed in preceding periods via the assumed time series model for the population parameter and the autocorrelation between the subsequent waves of the panel. A further reduction of the standard error can be accomplished by using auxiliary information available from the register of the Office for Employment and Income. In this paper, the time series model is extended by incorporating an auxiliary series about the registered unemployed labor force. The effects on the estimates and standard errors for the monthly unemployed labor force are studied.
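One common way to write the measurement equation of such a model is sketched below. The notation is an assumption made for illustration and is not taken from the abstract: the wave-specific survey estimate for month t is decomposed into the population parameter, a rotation group bias term and a survey error.

```latex
% j = 1, ..., 5 indexes the panel waves, t indexes months (notation assumed)
\hat{y}_t^{(j)} = \theta_t + \lambda_t^{(j)} + e_t^{(j)},
\qquad j = 1, \dots, 5,
\qquad \lambda_t^{(1)} = 0
```

Here θ_t is the monthly population parameter, λ_t^(j) is the rotation group bias of wave j relative to the first wave, and e_t^(j) is the survey error, which is correlated across waves because the same households are re-interviewed. Setting λ_t^(1) = 0 treats the first wave as the benchmark, in line with estimating the bias between the first wave and the four other waves.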
Marissa Isidro
Stephen Haslett, Massey University, New Zealand
Geoffrey Jones, Massey University, New Zealand
The World Bank has implemented poverty mapping projects in collaboration with national statistical agencies in various developing countries employing a small area estimation method known as ELL (Elbers et al., 2003). This method and its variants require a survey, census and/or administrative data and assume that the data sets are gathered at the same time period. In most developing countries, a census is only conducted once in every decade. This poses a problem in the generation of updated small area estimates during non-census or intercensal years. It is therefore important to develop an updating method to provide policymakers and/or stakeholders in developing countries with updated estimates of local level poverty statistics.
We propose an updating method called Extended Structure PREserving Estimation (ESPREE), an extension of the Structural PREserving Estimation (SPREE) method. In this paper, we compare the ESPREE with the ELL-based updating method used by the World Bank for generating updated small area estimates in the Philippines. Substantial differences in the estimates generated from the ESPREE and ELL-based updating methods were observed at the small area level (municipality) and at higher levels of aggregation, e.g., provincial/regional levels. The ESPREE method seemed to work better than the ELL-based method in that the ESPREE method generated unbiased estimates. An in-country validation exercise has been conducted to give a better assessment of the acceptability and consistency of the estimates generated with the available indicators at the municipal level as well as with the expert opinion of key informants.
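The classical SPREE idea on which the proposed method builds can be sketched with iterative proportional fitting: a census cross-tabulation (for example, small areas by poverty status) is rescaled to match more recent margins while its association structure is preserved. The sketch below illustrates only this basic step, with hypothetical numbers; the ESPREE extension itself is not described in the abstract and is not reproduced.

```python
def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Scale a positive two-way table to given row and column totals.

    seed: list of rows (lists of positive numbers), e.g. census counts.
    row_targets / col_targets: updated margins with the same grand total.
    """
    table = [row[:] for row in seed]  # work on a copy
    for _ in range(max_iter):
        # Row step: scale each row to its target total.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            table[i] = [value * target / row_sum for value in table[i]]
        # Column step: scale each column to its target total.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / col_sum
        # Columns now match exactly; stop once the rows match as well.
        if all(abs(sum(row) - t) <= tol for row, t in zip(table, row_targets)):
            break
    return table


# Hypothetical census table: two areas (rows) by poor / non-poor (columns).
census = [[40.0, 60.0], [30.0, 70.0]]
updated = ipf(census, row_targets=[120.0, 90.0], col_targets=[80.0, 130.0])
```

Because every step multiplies whole rows or whole columns by a constant, the cross-product ratio of the census table is carried over unchanged to the updated table. This is the sense in which the structure is preserved while the margins are brought up to date.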
Aline Drapeau, Université de Montréal, Canada
In the mental health domain, the validity of data collected in a survey is admittedly compromised by the prejudices surrounding mental illness and the memory bias causing under-reporting of the use of services. The problem of over-reporting is itself rarely studied.
This seminar is based on the results of a recent study that compared the information on mental health services provided by respondents in cycle 1.2 of the Canadian Community Health Survey (CCHS 1.2) with data from the Régie de l’assurance maladie du Québec (RAMQ). The sample was composed of 4,459 respondents aged 18 years and over and residing in Québec.
According to the CCHS 1.2 data, 5.8% of respondents received mental health services in the previous year, compared to 15.0% according to the RAMQ data. The under-reporting affected 75.5% of people who received services according to the RAMQ, while the over-reporting represented 36.6% of respondents who mentioned using services in CCHS 1.2. Under-reporting was linked to the context of the medical consultation according to the RAMQ (following a chronic physical illness or a hospitalization) and to certain respondent characteristics in CCHS 1.2. The majority of over-reporting cases (83.6%) were explained by telescoping (i.e. service received before the year preceding the survey). The results of this study highlight the importance of taking into consideration the context of data collection to appreciate the information yielded by a population survey and an administrative register.
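The reported percentages can be checked for internal consistency with a short calculation. It assumes, purely for illustration, that the percentages can be applied directly to the 4,459 sampled respondents; the abstract does not say whether they are weighted, so the resulting counts are approximate.

```python
n = 4459  # respondents in the linked sample

survey_users = 0.058 * n  # used services according to CCHS 1.2
ramq_users = 0.150 * n    # used services according to the RAMQ

# Respondents identified as users by both sources, derived in two ways.
both_from_under = ramq_users * (1 - 0.755)   # RAMQ users who did report use
both_from_over = survey_users * (1 - 0.366)  # survey users confirmed by the RAMQ

gap = abs(both_from_under - both_from_over)
```

Both routes give roughly 164 respondents identified as service users by the two sources, so the under-reporting and over-reporting rates are mutually consistent under this assumption.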
Amanda Halladay, Statistics Canada
Maire Sinha, Statistics Canada
Why are victimization rates going up while police-reported crime is steadily decreasing? This is a question that many Canadians may be asking as they compare trends from the General Social Survey (GSS) on victimization with trends in police-reported crime indicators. The answer stems from the fact that these two measures of crime come from two very different surveys, each with its own strengths and weaknesses. The GSS is a sample survey conducted every five years that collects self-reported victimization data from approximately 25,000 households across Canada. On the other hand, the Uniform Crime Reporting (UCR) survey is an administrative census that collects police-reported criminal activity from virtually every police department across the country. The purpose of this talk is to reduce the level of confusion arising from the use of crime data from these two sources and to highlight the major differences between the two surveys. We will discuss items such as the scope of each survey, the target populations, sources of error, non-response bias, frame issues and survey definitions.
Anugula N. Reddy, National University of Educational Planning and Administration, India
In India, data on school education are being collected and collated by multiple agencies. The Ministry of Education responsible for the administration of education at both the federal and provincial levels publishes annual data on various dimensions. An autonomous government organization, the National Council of Educational Research and Training (NCERT), was also entrusted with the responsibility of providing somewhat detailed data on school education every five years. Many household surveys like the Census, National Sample Surveys (NSSs) and National Family and Health Surveys (NFHSs) also provide periodic data on education, particularly on attendance status. In the recent past, owing to several factors, many agencies such as civil society/NGOs and autonomous research/advisory organizations of government including the National University of Educational Planning and Administration (NUEPA) are also beginning to collect and collate data on school education through household and school surveys. The emergence of multiple agencies supplying data on school education raises several questions. Are the gaps in data filled and time lag reduced by multiplying the sources of data? Are the data compatible? How do we reconcile the wide divergence between administrative data on the one hand and household data on the other, particularly with respect to enrolment and attendance? How can we supplement the administrative and household data on school education in monitoring the progress towards goals like Education for All, Education MDGs and the Right to Education in the Indian context? The present paper gives a panoramic view of educational statistics in India and examines the context in which the divergence of sources of data is taking place.
Joseph W. Sakshaug, University of Michigan, U.S.A.
David Weir, University of Michigan, U.S.A.
Lauren H. Nicholas, University of Michigan, U.S.A.
Administrative data records are increasingly being linked to sample survey records to consider new research questions and validate survey responses. Researchers often treat administrative data as a “gold standard” to which survey responses can be compared; however, there is little guidance on how to proceed when administrative and survey data provide conflicting reports of health status. This paper combines recently released survey and biomarker data from the Health and Retirement Study (HRS) with administrative Medicare claims data. The HRS is a nationally-representative panel study of older Americans conducted since 1992. Longitudinal Centers for Medicare and Medicaid Services (CMS) claims data are appended to the HRS records of consenting participants. We validate self-reported and claims data measures of diabetes with measured Hemoglobin A1c levels. HRS respondents self-report in each wave whether they have ever been diagnosed with diabetes. Chronic condition warehouse algorithms are used to identify diabetics in claims data based on inpatient and outpatient diagnostic codes. We use blood sample data collected on two separate sub-samples of respondents in 2006 and 2008 to reconcile discrepancies in diabetes status between the HRS self-reports and Medicare claims. Specifically, we examine 1) which data source provides more accurate diabetes reporting based on the blood sample data, and 2) the characteristics of study participants most likely to be misclassified in either the self-reported or administrative claims data. The findings have important implications for researchers integrating survey and administrative data or considering relying on only one measure of diabetes.
Denis Malo, Statistics Canada
The Survey of Household Spending (SHS) aims mainly to provide reliable provincial estimates on household spending. The multi-stage stratified sample is selected from an area frame covering the entire population, and data are collected annually through personal interviews. Improving the quality of estimates on spending on renovations and home repair has long been a key challenge for the Canada System of National Economic Accounts. The difficult economic situation in 2009, as well as the fiscal incentives offered to Canadians in this area, contributed to the quick implementation of the Renovation and Home Repair Survey, which was introduced in February and March 2010. To complete the area frame used for the SHS, a supplementary sample of high-income households was selected from the 2006 Census database and administrative data were used to update the frame. We present the different scenarios considered, the sampling plan used and the collection results, as well as the preliminary results regarding the improvement in data quality.
Benoît Riandey, Institut National d’Études Démographiques (France)
In the 1970s, the computerization of administrative files gave hope for major progress in statistics due to matching between registers based on unique identifiers. Also expected was a resultant relief of respondent burden and improvement in data quality. This was the case in northern European countries; however, the Informatique et Libertés law blocked this statistical progress in France. Fear of the “big brother” demonized the notion of matching up until the current technological revolution.
Today, we know how to match files anonymously thanks to hashing techniques for identifiers. Epidemiologists have access to these files to count AIDS infections anonymously without duplication, as well as for interhospital statistics and the enrichment of patient cohorts.
French public statistics are now interested in these techniques but without any real movement toward implementing them. We give examples of potentialities thus offered in surveys on job searching, health or discriminatory measures in businesses and administrations: matching administrative files, anonymously enriching surveys with administrative information, enriching through sensitive data an anonymous statistical file issued from administrative data or management. This is one way to resolve the dilemma frequently faced by statisticians, that is, providing results for data that they are forbidden from collecting due to the protection of private life.
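Anonymous matching with hashed identifiers can be sketched as follows. The abstract does not specify the hashing scheme; the use of a keyed hash (HMAC with SHA-256), the normalization step, the shared key and all identifiers and field names below are assumptions made for illustration.

```python
import hashlib
import hmac


def pseudonym(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash so files can be matched without it."""
    normalized = identifier.strip().upper()
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key held by a trusted party, never by the analyst.
key = b"example-key-held-by-a-trusted-third-party"

# Each data holder hashes its own identifiers before releasing the file.
survey_file = {pseudonym(i, key): answer
               for i, answer in [("ab123", "job seeker"), ("cd456", "employed")]}
admin_file = {pseudonym(i, key): benefit
              for i, benefit in [("AB123 ", 950), ("EF789", 0)]}

# The analyst matches on the hashed key only.
matched = {h: (survey_file[h], admin_file[h])
           for h in survey_file.keys() & admin_file.keys()}
```

The same person receives the same pseudonym in both files, so records can be matched and duplicates removed without the analyst ever seeing the original identifier. A keyed hash is used in this sketch because an unkeyed hash of a short, structured identifier could be reversed by hashing every possible value.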
Alfredo Navarro, U.S. Census Bureau, Washington, DC (USA)
The American Community Survey (ACS) is the Census Bureau's alternative for replacing the decennial long form in the 2010 Census. New ventures are usually accompanied by challenges. Census long form data are heavily used for model development by transportation planners and grants allocation, as well as in a variety of other economic and social applications. Data users will now have to start developing these applications based on ACS estimates. The ACS design, data collection, and statistical methodology are somewhat different from those used for the traditional census long form. The paper will touch on key design decisions, such as residence rules, data collection methodology, and the construction of the sampling frame. It also includes a description of key aspects of the ACS statistical methodology for the production of multi-year estimates. The switch to continuous measurement and the production of multiple sets of estimates result in a whole set of new issues regarding the use and interpretation of these estimates. The paper addresses some of these issues from a data user perspective.
Lisa B. Mirel, Centers for Disease Control and Prevention, U.S.A.
Vicki Burt, Donna Miller, Michael Wiseman, John Kirlin and Christine Cox
The National Health and Nutrition Examination Survey (NHANES) is a United States nationally representative survey designed to assess the health and nutritional status of adults and children in the United States. The NHANES household interview includes food security questions about the receipt of Supplemental Nutrition Assistance Program (SNAP) benefits, formerly called Food Stamp Program, and can be used to estimate the prevalence of food stamp receipt. However, estimates of receiving food stamp benefits based upon reported NHANES information are lower than those indicated by the United States Department of Agriculture administrative database. To examine potential reasons for discordances in the estimates, the 2005-06 NHANES data were linked to SNAP administrative records for one U.S. state. From these data, 36% (n=321) of eligible NHANES participants were linked to the administrative database. Eligible participants were defined as all 2005-06 NHANES participants in the state used for linkage. We assessed the demographic characteristics of individuals reporting and not reporting SNAP benefits in NHANES and the matched administrative record database. These results will be discussed in the context of challenges encountered in the linkage, including accommodating changes in survey questionnaire design over the study period and linking survey and administrative records with missing or incomplete identification information. The results of this pilot study provide a basis for evaluating the utility of NHANES food security data and offer preliminary insights into issues of accurately collecting this information in a self-reported survey.
Patrycja Scioch, IAB - Institute for Employment Research, Germany
Administrative data are gaining increasing attention in labour market research, in combination or as sole base data. Despite their advantages, such as large sample sizes, long time periods and extreme wealth of information, they suffer from such shortcomings as inconsistencies and missing values. Unfortunately, research on the quality and improvement of these data is scarce. I focus on the education variable in widely used German administrative data, the Integrated Employment Biographies (IEB), which are very important for analyses of wage inequalities and opportunities for employment. The quality of this variable has decreased significantly over the last 10 years, with 35% of individuals having no information on educational attainment. Some studies have tried to improve quality by developing imputation rules that fill the gaps on the basis of existing information, without using additional sources. I use an additional data source: German patent data. The educational information in these data is very reliable and detailed. Using record linkage methods (via name, address) these data are linked to the IEB to verify the information. The highly educated are thus identifiable via their name or titles (professor, doctor or diploma). The educational information is compared to the original information and to the outcomes of the replicated imputation rules.
Multivariate models explain the mechanism of measurement errors of the original and the various imputed education variables. For the first time, the quality of an administrative data set will be quantified by highly reliable external data. To conclude the presentation, practical recommendations for a better quality in administrative data will be given.
Michel Cloutier, Statistics Canada
Over the past ten years, and in particular in the last five years, Statistics Canada has made important progress in the greater use of administrative data. These data are currently used in a very large number of economic and social programs. Administrative data currently in use include tax records, customs data, health files and various registries, license databases of all kinds, education records, employment insurance records, address information and billing files, justice records and public accounts files. More sources are being explored by various programs (for example, credit card data).
This paper presents a proposed vision to guide the future use of administrative data. The Vision paper starts with some background and definitions. It then outlines a possible strategic direction for Statistics Canada, including increased coordination, further research and possible efficiencies. Short-term recommendations are presented including the development of a policy for the use of administrative data, an inventory of administrative data, a research program and governance. Finally, some long-term options are discussed.