Statistics Canada
Abstracts and proceedings

Proceedings: Available fall 2010

Abstracts: PDF

Workshops

(F) Workshop 2: Over 15 Years of Longitudinal Surveys at Statistics Canada: Lessons and Innovations

Michelle Simard and François Brisebois, Statistics Canada

Longitudinal surveys are relatively recent at Statistics Canada. In fact, it wasn’t until the mid-1990s that we saw development of major projects such as the National Longitudinal Survey of Children and Youth (NLSCY), the National Population Health Survey (NPHS) and the Survey of Labour and Income Dynamics (SLID), three surveys that are still active today. Each survey has particular objectives. The NLSCY tries to identify the factors influencing the development of Canadians, from birth to adulthood. The NPHS collects information related to the health of the Canadian population and related socio-demographic information. SLID looks at the economic well-being of Canadians.

At the turn of the current century, Statistics Canada developed three new longitudinal surveys, all of which focus on specific populations: the Youth in Transition Survey (YITS), the Longitudinal Survey of Immigrants to Canada (LSIC), and the Workplace and Employee Survey (WES). LSIC studies how new immigrants adapt to Canadian society and identifies the factors that support, as well as hinder, their efforts to integrate. The main objective of YITS is to document the transition of young adults between school and the labour market. WES explores a wide range of topics related to labour (employers and employees) to determine how employers and their personnel react and adapt to changes in a technology-based competitive environment.

This workshop reviews Statistics Canada’s six longitudinal surveys and puts the emphasis on lessons the organization has learned over the years. It also looks at innovations arising from the design of these surveys and their challenges.

The day begins with a review of the basic principles of longitudinal surveys and the main difficulties involved. Next, each of the six surveys is described individually: a summary of the objectives and the sampling frame, followed by a detailed presentation of the lessons and innovations for each survey.

Each of these surveys is distinct due to its mandate and its methodology. The themes studied during the day will cover various steps in survey methodology, and therefore, should meet the needs of all participants.

(E) Workshop 3: Multilevel Modeling of Longitudinal Data

Sophia Rabe-Hesketh from the Graduate School of Education, University of California (Berkeley), U.S.A. and
Anders Skrondal from the Division of Epidemiology, Norwegian Institute of Public Health, Norway

In longitudinal studies the same individuals provide responses at several occasions or panel waves. If each person-occasion combination is viewed as a unit of observation, it is straightforward to regress the corresponding response variable on both time-varying and time-constant covariates. However, even after controlling for covariates, some unobserved between-person heterogeneity typically remains, leading to within-person dependence. This unobserved heterogeneity can be accommodated by including person-specific intercepts and possibly person-specific regression coefficients in the model. Such an approach can easily be extended to handle further levels of nesting, for instance individuals nested in schools.
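As an illustration of the person-specific intercept described above, a two-level random-intercept model can be written as follows. The notation is an assumption made for this sketch and is not taken from the workshop materials: $y_{ij}$ is the response of person $j$ at occasion $i$, $x_{ij}$ a time-varying covariate, $z_j$ a time-constant covariate, $\zeta_j$ the person-specific intercept and $\epsilon_{ij}$ an occasion-level residual, with $\zeta_j$ and $\epsilon_{ij}$ assumed independent of each other and of the covariates.

```latex
\begin{align*}
y_{ij} &= \beta_0 + \beta_1 x_{ij} + \beta_2 z_j + \zeta_j + \epsilon_{ij}, \\
\zeta_j &\sim N(0, \psi), \qquad \epsilon_{ij} \sim N(0, \theta), \\
\operatorname{Cor}(y_{ij}, y_{i'j} \mid x, z) &= \frac{\psi}{\psi + \theta}, \qquad i \neq i'.
\end{align*}
```

Under these assumptions the shared intercept $\zeta_j$ is what induces the within-person dependence: two responses from the same person remain correlated after controlling for the covariates whenever $\psi > 0$.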

We start by discussing random effects (or multilevel) and fixed effects approaches for different types of responses including continuous, dichotomous, ordinal and counts. These models allow investigation of mean growth curves as well as between-person variability in different aspects of the growth trajectories. Fixed and random effects models can be viewed as conditional or person-specific models. Such models are compared with marginal or population averaged models. Different methods for accommodating or modelling within-subject dependence in the marginal approach are outlined and the distinction between conditional and marginal effects is discussed. Finally, we consider the incorporation of sampling weights in complex survey data.

Session 1 – Keynote Address

(E) A Methodological Research Agenda for Longitudinal Surveys

Peter Lynn, University of Essex, UK

This presentation will propose an agenda for future methodological research on issues pertinent to longitudinal surveys. The agenda will be informed by a consideration of the methodological challenges that are unique to longitudinal surveys and a review of research designed to address those challenges. There will be a particular focus on recent and current research and discussion of the implications of recent research findings, of technological changes and of other innovations. The objectives are both to stimulate methodological research and to raise awareness of the limitations of current methodological knowledge.

Topics discussed will include sample design, between-wave intervals, keeping track of sample members and maintaining co-operation, adjustment for non-response and panel attrition, panel conditioning, and instrument design to minimise measurement error in measures of change.

Session 2 – Event History Data Collection

(E) A Triangulated Approach to Evaluating ELSA’s Event History Calendar

Alice McGee and Hayley Cripps, National Centre for Social Research, UK
Joanne Pascale, U.S. Census Bureau, U.S.A.

While the Event History Calendar (EHC) technique has been used in surveys for several decades, its popularity has begun to increase, in part owing to recent research demonstrating that it can produce higher data quality than conventional questionnaires for some topic areas. However, there has been a call for systematic methodological research addressing two main areas. The first is how the EHC works in practice, particularly with regard to the interviewer-respondent interaction and the use of, and receptivity to, the technique. The second area relates to landmarks, both internal and external. Little is known about the ‘mechanics’ of these landmarks: when they are introduced as memory aids, who introduces them, and how successful they are in helping respondents recall dates accurately.

This paper details a study that sought to address these unknowns through the evaluation of an EHC used as part of the English Longitudinal Study of Ageing (ELSA). A sample of 124 interviews was audio recorded and analysed using a range of quantitative and qualitative evaluation methods: an interviewer questionnaire; a respondent debriefing questionnaire; an interviewer debriefing; and behaviour coding.

This paper draws together and triangulates findings from these four evidence sources, addressing six specific research questions. It delves into the precise nature of the interaction between interviewer and respondent, examining receptivity to the method and how particular features played out in the field. It also discusses how landmarks were used, whether they assisted respondent recall and the nature of their use.

(E) Tracing Life Courses with Prospective Panel Surveys – Lessons from the German Family Panel Study

Josef Brüderl, Laura Castiglioni, Ulrich Krieger, Volker Ludwig and Klaus Pforr, University of Mannheim, Germany

Collecting valid information on the occurrence, timing and spacing of events during the life course is a crucial issue for life course researchers. A prospective panel design can help to improve data quality, because only short periods of time between panel waves have to be recalled. At the "seam" of the two panel waves, however, misstated episodes are a common problem (seam effect).

In this presentation we argue that the seam effect can be reduced considerably by combining life history calendar and dependent interviewing techniques. Drawing on pretest data from the recently started German Family Panel Study, we provide evidence on the power of this approach in reducing the seam effect.

In wave two of our pretest study we administered a paper-and-pencil retrospective life history calendar from the age of 14 up until the date of the interview, covering relationship biography, fertility history, and residential mobility. In wave three, we asked for the events that occurred in between interviews. Using a split-ballot design we either presented a blank calendar or we presented a calendar, where the status at wave two was filled in (dependent interviewing). Differences between the two groups reveal how dependent interviewing helps to reduce the seam effect and thus can improve data quality.

(E) A Comparison of Survey Reports Obtained via Standard Questionnaire and Event History Calendar

Jeffrey Moore, Jason Fields, Joanne Pascale, Gary Benedetto, Martha Stinson and Anna Chan, U.S. Census Bureau, U.S.A.

The US Census Bureau’s Survey of Income and Program Participation (SIPP) provides monthly information about the nation’s income, wealth, and program usage. Currently, SIPP administers an interview three times a year to each sample member; each interview’s reference period covers the preceding four calendar months. In 2006 the Census Bureau initiated a SIPP re-engineering effort, a key component of which is a shift to a single annual interview covering the preceding calendar year. To accomplish this shift, the Census Bureau proposes to employ event history calendar (EHC) methods. Prior research, however, has raised some questions about EHC data quality for topics of key importance to SIPP, such as need-based program participation. In addition, the research base does not address the main SIPP design issue – the proposed shift from a four-month to a twelve-month reference period. To examine the implications of the switch to EHC methods and the expansion of the survey’s reference period, the Census Bureau implemented the SIPP EHC Field Test in the spring of 2008. The essential feature of the test was an EHC reinterview of expired SIPP 2004 panel households. The reference period for the reinterview was calendar year 2007; the primary sample component consisted of cases which had already provided information about calendar year 2007 in the normal course of their final three SIPP interviews. The field test thus permits a direct comparison of standard questionnaire and EHC reports by the same people, about the same characteristics, and for the same period. The subject of this paper is the main component of the evaluation of the field test results: an examination of the correspondence of the two reports – one obtained from a standard questionnaire, the other from an EHC instrument – for several of the key characteristics of interest to SIPP, and for each month of 2007.

Session 3 – Attrition Bias and Nonresponse Weighting

(E) Evaluation and Selection of Models for Attrition Nonresponse Adjustment

Eric Slud and Leroy Bailey, U.S. Census Bureau, U.S.A.

In this talk, we consider a longitudinal survey like the U.S. Survey of Income and Program Participation (SIPP), with successive "waves" of data collection from sampled individuals, in which nonresponse attrition occurs and is treated by weighting adjustment, either through adjustment cells or a model like logistic regression in terms of auxiliary covariates. We measure the biases in estimated initial-wave ('Wave 1') attribute totals between the survey-weighted estimator in the first wave and the weight-adjusted estimator for the same Wave-1 item total based on later-wave respondents. Three new metrics of quality are defined for models used to adjust a longitudinal survey for attrition. The metrics combine estimated between-wave adjustment biases based on subsets of the sample, relative to the estimated total, for various survey items. For each survey item, the maximum bias is calculated over the weight-adjusted subtotals of the first j sample units, as j ranges from 1 to the size of the entire (Wave-1) sample, after a random re-ordering either of the whole sample or of the units within distinguished cells (which are then also randomly re-ordered). The average over re-orderings of the maximal adjustment bias is then divided by the estimated Wave-1 attribute total to give the metric value. Confidence bands for the metrics are estimated, and the metrics are applied to judge the quality of and to select among a collection of logistic-regression models for attrition nonresponse adjustment in SIPP 96.
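The construction described above can be sketched in Python. This is one possible reading of the abstract, not the authors' implementation: the abstract gives no formulas, so the bias is assumed here to be the absolute difference between the weight-adjusted subtotal (later-wave respondents only) and the Wave-1 weighted subtotal over the first j units. Only the whole-sample re-ordering variant is shown, and all function and argument names are hypothetical.

```python
import random


def max_partial_bias(values, w1, w_adj, responded, order):
    """Largest absolute gap between adjusted and Wave-1 subtotals over the first j units."""
    base = adj = worst = 0.0
    for i in order:
        base += w1[i] * values[i]          # Wave-1 weighted subtotal
        if responded[i]:                   # only later-wave respondents contribute
            adj += w_adj[i] * values[i]    # weight-adjusted subtotal
        worst = max(worst, abs(adj - base))
    return worst


def attrition_metric(values, w1, w_adj, responded, n_reorderings=200, seed=0):
    """Average maximal partial-sum bias over random re-orderings, relative to the Wave-1 total.

    Assumes the estimated Wave-1 total is non-zero.
    """
    rng = random.Random(seed)
    total = sum(w * y for w, y in zip(w1, values))
    order = list(range(len(values)))
    acc = 0.0
    for _ in range(n_reorderings):
        rng.shuffle(order)                 # random re-ordering of the whole sample
        acc += max_partial_bias(values, w1, w_adj, responded, order)
    return acc / n_reorderings / total
```

Under this reading, smaller values indicate an adjustment model whose weights reproduce the Wave-1 totals more closely across arbitrary subsets of the sample, which is how such a metric could be used to compare candidate models.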

(E) Nonresponse Weight Adjustments Using Multiple Imputation for the UK Millennium Cohort Study

John W. McDonald and Sosthenes C. Ketende, University of London, UK

This paper discusses nonresponse weight adjustments for sweep 3 of the UK Millennium Cohort Study (MCS). Weight adjustments are available for monotone patterns of nonresponse, where the nonresponse weight is the inverse of the estimated probability of response based on a logistic regression model, which uses data from previous sweeps to predict response at the current sweep. For non-monotone patterns, some cases have missing data for previous sweeps and this approach cannot be easily applied. For MCS, 7.5% of the families took part in sweeps 1 and 3, but not sweep 2, i.e., a non-monotonic pattern of nonresponse for 1444 families.

Our approach to estimating a nonresponse weight for MCS sweep 3 was to use multiple imputation to impute the required missing values at sweep 2 for these 1444 families for the logistic model for response at sweep 3. This imputation used information from sweeps 1 and 3 and only involved imputing the missing values for time-varying variables shown to be predictive of nonresponse in MCS. This resulted in the multiple imputation of nonresponse weights at sweep 3, which can be averaged to produce a single nonresponse weight, or the 10 imputed nonresponse weights can be used for separate analyses and the results combined using Rubin’s rules. We discuss the advantages and disadvantages of both approaches.
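The two ways of using the imputed weights can be sketched as follows. This is a minimal illustration, not the MCS procedure: it assumes the response probabilities have already been estimated by a logistic model fitted to each imputed data set, and the function names and example numbers are hypothetical. Rubin’s rules are shown in their standard scalar form (pooled estimate = mean of the estimates; total variance = mean within-imputation variance + (1 + 1/M) × between-imputation variance).

```python
from statistics import mean, variance


def inverse_probability_weights(response_probs):
    """Nonresponse weight = 1 / estimated probability of response."""
    if any(p <= 0 or p > 1 for p in response_probs):
        raise ValueError("response probabilities must lie in (0, 1]")
    return [1.0 / p for p in response_probs]


def average_weights(weight_sets):
    """Option 1: average the M imputed weights unit by unit into a single weight."""
    return [mean(ws) for ws in zip(*weight_sets)]


def rubin_combine(estimates, within_variances):
    """Option 2: pool M separate analyses (M >= 2) with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(within_variances)
    between = variance(estimates)          # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical response probabilities for 3 families under M = 2 imputations.
probs_by_imputation = [[0.5, 0.8, 0.25], [0.5, 0.8, 0.2]]
weight_sets = [inverse_probability_weights(p) for p in probs_by_imputation]
single_weights = average_weights(weight_sets)
```

Averaging yields one weight per family that can be shipped with the data file, whereas pooling keeps the imputation uncertainty in the final variance at the cost of running each analysis M times.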

(E) Analysis of Nonresponse in the National Longitudinal Survey of Children and Youth

Mike Tam and Agnes Waye, Statistics Canada

In this paper, we examine the nonresponse of the original cohort of the National Longitudinal Survey of Children and Youth (NLSCY). The NLSCY is a longitudinal survey that collects information on characteristics and factors which may affect the development and well-being of Canadian children and youth over time. An original cohort of children aged between 0 and 11 years old was sampled in 1994, with follow-ups occurring every two years. In view of the amount of nonresponse that has occurred over time, an extensive analysis is under way to determine the extent of nonresponse bias that may be present and how such bias may be affecting analyses based on NLSCY data. We are also interested in investigating whether the initial nonresponse differs from subsequent nonresponse. To address these two issues, we identify potential determinants of nonresponse for different patterns of attrition, as well as of different components of nonresponse, such as non-contact or refusal. We also examine the extent to which the nonresponse adjustments to the survey weights are able to correct for bias. Identifying predictors of nonresponse and evaluating our current weighting methodology with respect to those characteristics established to be related to nonresponse will be helpful to us when we implement the weighting for cycle 8.

Session 4 – Data Collection and Linkage

(E) Maintaining Contact with PSID Families between Waves: An Experimental Test of a New Strategy

Katherine McGonagle, Mick Couper and Robert Schoeni, University of Michigan, U.S.A.

Since 1997 the PSID has conducted biennial interviews. Families are routinely sent a “contact information update” mailing to keep track of their whereabouts between waves. Families that provide an update or verification are sent $10 after returning the postcard. Slightly more than half of families typically respond to this mailing. Analysis shows that having updated information prior to the start of data collection is cost effective. During 2007 production interviewing, families providing an update were much less likely to require tracking or special refusal conversion efforts, and needed half as many contacts to be interviewed. Given these advantages, a study was designed in advance of 2009 production to improve the response rate of the contact update mailing. Families were randomly assigned to several conditions including: pre- or post-paid incentive, mailing design (old vs. updated), whether they were also sent a respondent report, and timing of the mailing (Summer versus Fall). This paper reports on the initial findings with regard to response rates to the contact update mailing by these different conditions. Overall, there is no effect of incentive type. Across all conditions, the old design performed better than the new design. Families who received a second mailing had significantly higher response rates than those in the one-time mailing only conditions. There were some interaction effects by timing-of-mailing, such that families in the Fall mailing condition had higher response rates when they also received a prepaid incentive. Various hypotheses for these findings and next steps for analysis are described.

(E) Mixed and Multiple Collection Modes: The HILDA Survey Experience

Mark Wooden, University of Melbourne, Australia

Like other household panel surveys, the Household, Income and Labour Dynamics in Australia (HILDA) Survey collects data via multiple modes. Moreover, the composition of these modes has been gradually changing over time; by wave 8 about 10% of all interviews were conducted by telephone, whereas in wave 1 the incidence of telephone interviews was negligible. This mixed mode approach raises a number of issues, not least of which is the concern that changes in data collection method may be introducing errors in the measurement of change.

This paper uses the HILDA Survey experience to examine how serious these issues might be. More specifically, it examines five key questions. First, what reasons explain the decision to employ mixed (and multiple) modes of data collection? Second, what characteristics distinguish those interviewed by telephone from those interviewed in person? Third, does mode have any effects, either favourable or harmful, on sample attrition? Fourth, do mixed and multiple modes affect the quantity of data collected? Finally, is there any evidence that the mixed mode approach has seriously affected the accuracy of responses, and more importantly the longitudinal consistency of those responses?

(E) Managing Complex Inexact Matching in Coding and Linkage Applications

Michael J. Wenzowski, Statistics Canada, Canada

As the pressure increases to indirectly acquire data from administrative and other sources, the requirement to identify, and possibly even link, these records expands dramatically. For example, data collected for other purposes may not be coded as required, and may also be incomplete without performing a record linkage. Both of these activities fall into the realm of inexact and probabilistic matching. As examples: the data present may not have been elicited in a manner which facilitates coding to the appropriate standard; similarly, we are rarely presented with a unique key with which to perform a deterministic linkage. These scenarios lead us to a requirement to perform inexact, or “fuzzy”, matching, coupled with a probability-based method for identifying “correct” matches.

We present the results of a recent Statistics Canada initiative to re-engineer our generalized coding and record linkage systems in order to enhance their applicability across a wide range of processing problems and subject-matter domains. These systems are typically adapted for a particular use by the methodology team responsible for creating the application, and are routinely run in production by survey operations staff. We will demonstrate how we have increased the general usability of these packages by offering more intuitive controls for managing the complexity of their internal processing, and have simplified their installation, set-up and processing models. The approach taken will be to highlight the “end-user” perspective on using this software, whether that user be IT, methodology or survey operations staff.

(E) Managing Respondent Relations on the National Population Health Survey

Andrew MacKenzie and Natasha Zaletel, Statistics Canada

The National Population Health Survey (NPHS) is a longitudinal survey that has been collecting information on the health of the Canadian population and related socio-demographic information since 1994. By the fall of 2009, the NPHS project will have collected eight cycles of data spanning 15 years and will be preparing to collect the ninth cycle in 2010. Response rates to the NPHS have remained high throughout the first eight cycles of data collection but, as with most other social surveys, have trailed off in recent years. For a longitudinal survey, maintaining a sample of willing participants is critical, particularly as respondents are lost through mortality, relocation and other forms of non-response. NPHS has enjoyed very positive respondent relations, and the purpose of this presentation is to share the methods that have been used to foster and develop these ongoing relations with respondents. NPHS has invested heavily in respondent relations over the years, including focus group testing of introductory letters and brochures as well as gifts and thank-you letters for survey participants. This presentation will also discuss the balance sought by NPHS in deciding when to remove respondents from the collection sample when they are untraceable or have repeatedly refused to participate.

Session 5 – Analysis of Longitudinal Survey Data

(E) A Simulation Study of Calibration Methods for Estimation of Gross Flows

Marcel de Toledo Vieira, Federal University of Juiz de Fora, Brazil
Gad Nathan, Hebrew University of Jerusalem, Israel

Methods traditionally used for the analysis of longitudinal data, such as those based on the application of generalized linear models or multilevel models to repeated measures and the use of generalized estimating equations, are primarily model-based. We consider the application of calibration methods for the estimation of gross flows from longitudinal data. Calibration can be carried out to known totals of cross-sectional variables or to longitudinal auxiliary variables, and the choice of suitable distance functions allows a wide range of both design-based and model-based estimators, such as GREG estimators. The simulation study is based on data from the British Household Panel Survey and is designed to compare the efficiency of calibration to cross-sectional variables with that of calibration to longitudinal auxiliary variables, as well as with that of traditional estimators of gross flows.

(E) Loss to Follow-up and Cox PH Modeling of Jobless Spells from SLID

Dagmar Mariaca Hajducek and Jerry Lawless, University of Waterloo, Canada

The Survey of Labour and Income Dynamics (SLID) provides information to help understand the dynamics in the Canadian population pertaining to employment, income, health, and many other aspects of human life. As a multipurpose survey intended to provide information on a diverse spectrum of interests, SLID offers many statistical challenges and, in particular, an opportunity to extend duration analysis theory to longitudinal survey data. As an illustration, the study of the duration of jobless spells from SLID requires the extension of classical duration analysis techniques to survey data. One feature that characterizes these data is loss to follow-up (LTF), which may be related to the durations under study. For example, the probability that an individual is lost to follow-up may be related to their unemployment experience. On the one hand, Cox PH modeling techniques deserve attention due to their appeal, both in the classical and survey settings. Conditioning on past event history may be used to deal with within-individual correlation in spell durations. On the other hand, weighting techniques such as inverse probability weighting (IPC) based on LTF modeling allow us to accommodate dependent LTF. This talk will discuss the use of LTF-based IPC weights in Cox PH conditional modeling of jobless spell durations.

(E) Issues in the Use of Structural Equation Modeling with Longitudinal Public-Release Data Files

Laura Stapleton, University of Maryland, U.S.A.

This presentation will outline the issues faced by the applied researcher when analyzing longitudinal public-release data. Research questions that address the amount of growth over time, the shape of growth, and differences in growth across groups can be answered with such data; however the applied researcher must first determine how to accommodate the sampling design when undertaking analyses. Public-release data may come with specific sampling information on the data file: stratum indicator(s), primary sampling unit indicator, one or more panel weights (depending on the number of data collection waves), and replicate weights. Additionally, if more than one type of respondent was contacted at each wave, panel weights might exist for each respondent type (e.g. parent and child). Making appropriate use of this information can be difficult for the applied researcher. This presentation will cover the issues of multistage sampling (including effects on sampling variance estimates) and design- and model-based approaches to modeling the sampling design, as well as the use of panel weights and how to choose among a set of possible weights depending on the analytic strategy chosen to handle any missing data. Additionally, weights are typically calculated to reflect the inverse of the probability of an individual being selected but, with multilevel modeling, choices in weight scaling are available to reflect the cluster-level selection probability as well. The issues will be addressed within the context of conducting latent growth modeling and current software options for accommodating the sampling design within structural equation models will be highlighted.

Session 6 – Issues in Economic Surveys

(E) The Construction of a Prototype of the Italian LEED Based on Administrative Data: Main Methodological Aspects

Carla Congia and Roberta Rizzi, ISTAT, Italy

The construction of the first prototype of an (official) Italian Linked Employer Employees Data base (LEED) by the Italian Institute of Statistics followed a study of the previous experiences of many other countries, such as the USA, Canada, France, Denmark and New Zealand.

The administrative archive of the declarations that employers must transmit every year to the Italian Tax Office, reporting tax deductions from wages, social security contributions and insurance paid for each employee, was the first administrative source on which the design of the LEED was based; it covers the entire population of employers and employees in both the private and public sectors. The LEED is the result of a complex integration process with other relevant administrative sources. Furthermore, linkage with the Italian Statistical Business Register through a unique code (the fiscal code of the employer) has given the project official statistical relevance.

As the entities within the LEED are linked longitudinally, it is possible to track workers over time and link this to longitudinal firm dynamics. The LEED can therefore meet many different information needs. In particular, it makes it possible to study important aspects of the labour market such as worker and job flows, employment tenure, multiple jobholding, wage dynamics, etc.

This paper describes the relevant methodological aspects faced in the construction of a first prototype of this Italian LEED. In particular, it covers the normalization of the administrative data, which have to be checked and translated into statistical variables. Next, the problem of the longitudinal identification of enterprises, taking into account information on mergers, acquisitions and splits, is analysed. Finally, the issues related to the longitudinal identification of jobs, especially when dealing with multiple job holders, are discussed.

(E) Are prices surveys sample designs robust to aging weights? A simulation study

Zdenek Patak, Statistics Canada
Daniele Toninelli, University of Bergamo, Italy

The study of price movements has grown rapidly in importance in recent years, especially in a global economy affected by the world financial crisis. Many national statistical agencies have started developing new projects based on longitudinal studies to measure movements in the prices of products and services; the measurement of service prices in particular is a relatively new field, but its importance is increasing quickly. In this context, Statistics Canada is currently implementing new projects within the Service Producer Price Indexes (SPPI).

Many methodological issues regarding data collection are the subject of intensive research, aimed mainly at improving the quality of the whole index production process. The aim of this work is to contribute, in particular, to improving the quality of the data collection process. This is done by focusing on the earliest stage of the research and studying the temporal evolution of the survey data, using the size of the selected units as an experimental factor.

Starting from a simulated population generated from data collected through the wholesale services price survey, the objective of this work is to make a comparative study of the main sample selection methods, underlining their relative efficiencies in terms of the precision of the estimates. The results obtained using different kinds of Probability Proportional to Size methods are compared with the results obtained with SRS and with judgmental sampling, evaluating how, and how much, the change in the size measure over time affects the estimates and the bias of the results.

(E) Adding a longitudinal Component to the Statistics Canada Agriculture Tax Data Program

Terri Blanchard and Peter Xiao, Statistics Canada

The Agriculture Tax Data Program (TDP) at Statistics Canada is primarily designed to produce cross-sectional estimates on financial variables such as average operating revenues and expenses, net operating income, and off-farm income for Canadian farms, farm operators and farm families. The administrative data are obtained from tax forms sent to the Canada Revenue Agency (CRA) by unincorporated (T1), incorporated (T2) and communal farms in electronic or paper format.

A longitudinal component was added to the TDP starting with tax year 2001. The goal of the new component is to follow individual farms over time and provide an understanding of the characteristics of farms that undergo various types of changes. The 2001 tax year panel has been followed on a yearly basis. Annual panels of cohorts have been created starting in tax year 2006 and have also been followed yearly. We currently have a longitudinal database for the 2001 panel units with data from tax years 2001 to 2006.

In this talk we describe the longitudinal component of the TDP: the impact on the cross-sectional sample of following more than one cohort, the imputation strategy designed specifically for the longitudinal units, and the weighting methodology.

(E) Longitudinal Surveys on Hard-to-Trace Populations

E.J. Reedy, Kauffman Foundation, U.S.A.

Drawing on the Foundation’s position as the largest global foundation focused on entrepreneurship and its decade of experience working with scholars to measure this process, this presentation focuses on what the Foundation has learned about longitudinal surveys of young and emerging firms, with particular attention to the Kauffman Firm Survey (KFS) and the Panel Study on Entrepreneurial Dynamics (PSED).

The KFS is the world’s largest longitudinal survey of new businesses and will span eight years on completion. The KFS is a panel that includes new businesses founded by a person or team of people, purchases of existing businesses by a new ownership team, and purchases of franchises. Baseline interviews were completed with principals of 4,928 businesses that started operations in 2004. A self-administered Web survey and Computer Assisted Telephone Interviewing (CATI) were used for data collection, and KFS respondents were paid $50 to complete the interview. The KFS has maintained high retention of its baseline sample, with weighted response rates of approximately 80% or higher, calculated using AAPOR Response Rate 1.
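For readers unfamiliar with the measure, the following is a minimal sketch of how AAPOR Response Rate 1 is defined: completed interviews divided by all interviews, non-interviews and cases of unknown eligibility. The function name and the counts in the usage note are hypothetical, and the sketch is unweighted, whereas the rates quoted in the abstract are weighted.

```python
def aapor_rr1(complete, partial, refusal, non_contact, other,
              unknown_household, unknown_other):
    """Unweighted AAPOR Response Rate 1 from final case disposition counts.

    RR1 = I / ((I + P) + (R + NC + O) + (UH + UO)), so partial interviews
    and cases of unknown eligibility all stay in the denominator.
    """
    denominator = ((complete + partial)
                   + (refusal + non_contact + other)
                   + (unknown_household + unknown_other))
    if denominator == 0:
        raise ValueError("no sampled cases")
    return complete / denominator
```

With hypothetical counts of 800 completes, 50 partials, 100 refusals, 30 non-contacts, 10 other non-interviews and 10 cases of unknown eligibility, the function returns 0.8.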

The PSED provides valid and reliable data on the process of business formation based on nationally representative samples of nascent entrepreneurs, that is, those active in business creation. PSED I began with screening in 1998-2000 to select a cohort of 830 with three follow-up interviews. PSED II began with screening in 2005-2006 with two follow-up interviews. The information obtained includes data on the nature of those active as nascent entrepreneurs, the activities undertaken during the start-up process, and the characteristics of start-up efforts that become new firms.

Session 7 – Synthetic Data Approaches to Confidentiality

(E) Analytical Validity and Confidentiality Protection in Longitudinally Integrated Statistical Data Systems

John M. Abowd, Cornell University, U.S.A.

This paper summarizes the results of six different synthetic data projects conducted with the support of the National Science Foundation and using longitudinally integrated statistical data from censuses, surveys, and administrative record systems. The systems were all designed to produce statistically valid releasable micro-data protected by synthetic data techniques. The systems that were studied included longitudinal establishment data, longitudinally integrated administrative employer-employee data, geo-spatially integrated residence/workplace data, and household surveys integrated with longitudinal administrative data. Analytical validity and confidentiality protection results from these projects will be summarized.

(E) Summary of Methods and Preliminary Assessment of the SIPP Synthetic Beta, version 5.0

Gary Benedetto and Martha Stinson, U.S. Census Bureau, U.S.A.
Melissa Bjelland, Cornell University, U.S.A.

This paper summarizes the methodology and quality assessment of the most recent version of the SIPP Synthetic Beta (SSB v5.0), a public use dataset that combines variables from the Census Bureau's Survey of Income and Program Participation (SIPP), the historical earnings data from Internal Revenue Service (IRS) tax forms, and the Social Security Administration's (SSA) individual retirement and disability benefit data. Multiple imputation and partial data synthesis were used to complete and perturb the data so that the final data product (multiple data sets, called implicates, which have the same structure as the underlying confidential data) would not compromise data confidentiality. The benefits of the methods used in this project are that the data users can run their analyses on each synthetic implicate exactly as they would have if they had access to the single, confidential data set. After getting results for each synthetic implicate, relatively simple formulae exist to combine these results to get proper point estimates and measures of variance that take into account the uncertainty introduced by the modeling. Moreover, since every value of the vast majority of variables on the file has been replaced by random draws from a probability distribution, the partially synthetic data offer a very high level of confidentiality protection. We also attempt to assess the analytic validity of the partially synthetic data and quantify the disclosure risk of making such data available to the public.
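As an illustration of the "relatively simple formulae" mentioned above, here is a minimal sketch of the standard combining rule for partially synthetic data: the point estimate is the mean of the per-implicate estimates, and its variance is the average within-implicate variance plus the between-implicate variance divided by the number of implicates. This is an assumption about the general technique, not the SSB documentation; a file that also multiply imputes missing values, as the SSB does, generally calls for a more elaborate two-stage rule. The function name is hypothetical.

```python
from statistics import mean, variance


def combine_partially_synthetic(estimates, variances):
    """Combine per-implicate results from m partially synthetic data sets.

    estimates: point estimate of the same quantity from each implicate
    variances: estimated variance of that estimate within each implicate
    Returns (combined point estimate, combined variance).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two implicates and matching inputs")
    q_bar = mean(estimates)       # combined point estimate
    b_m = variance(estimates)     # between-implicate variance (m - 1 divisor)
    u_bar = mean(variances)       # average within-implicate variance
    return q_bar, u_bar + b_m / m
```

An analyst would run the same model on every implicate, collect the estimate and its variance from each run, and pass the two lists to this function.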

(E) Synthetic Data Creation for the Cross National Equivalent File

Jean-François Beaumont and Cynthia Bocci, Statistics Canada

The creation of synthetic data as a method of disclosure avoidance has gained popularity over the last 15 years. Statistics Canada has recently started investigating techniques to create synthetic data for the Canadian portion of the Cross National Equivalent File (CNEF). The CNEF combines data on labour and income and involves six countries. The Canadian portion of the CNEF is derived from a subset of variables from Statistics Canada’s Survey of Labour and Income Dynamics (SLID), a longitudinal survey. Due to confidentiality constraints, however, the Canadian data can only be accessed through special arrangements, unlike data from other countries that are collected by universities or private institutes. As a result, the Canadian data are sometimes omitted from analyses. The creation of synthetic data would give a larger number of researchers access to these data and hopefully increase their use.

In this presentation, we describe the methodology used to create longitudinal synthetic data for the Canadian portion of CNEF and discuss the challenges of creating consistent households that preserve as much as possible the relationships in the original data while keeping the risk of divulging confidential information at a low level. We present some preliminary cross-sectional results.

Session 8 – Waksberg Award Winner Address

(E) Methods for Oversampling Rare Subpopulations in Social Surveys

Graham Kalton, Westat, U.S.A.

Increasingly, social surveys are required to produce estimates for subpopulations, often rare subpopulations. Sometimes a survey focuses on a single subpopulation, but often the survey is required to produce estimates for several subpopulations and also for the total population. When membership of a rare subpopulation can be determined from the sampling frame, selecting a sample of the required size for the subpopulation is relatively straightforward. In this case the main issue is the extent of oversampling to employ when the survey aims to produce estimates for several subpopulations and the total population. Oversampling a rare subpopulation that cannot be identified from the sampling frame presents a major challenge. Methods to perform this oversampling include disproportionate stratified sampling, two-phase sampling, the use of multiple frames, multiplicity sampling, panel surveys, and the use of multi-purpose surveys. This paper will describe these methods and illustrate their application in a range of surveys.

Session 9 – Longitudinal Health Data: Issues and Challenges

(E) Establishing a Longitudinal Community Health Research Methodology: Issues and Challenges

David Marshall, University of Queensland, Australia

Longitudinal methods have enormous potential for better understanding health and wellbeing at a community level. In particular, awareness of local level social, economic and environmental conditions and their influence on the burden of chronic disease could be greatly enhanced by local level place-based longitudinal studies. However, to date they have been relatively under-utilised in this field.

This paper outlines some of the core methodological issues being considered in the establishment of ‘The Ipswich Study’, a proposed place-based study of health and wellbeing in an outer suburban community of approximately 150,000 persons in South East Queensland, Australia. Given the rapid changes and forecast growth anticipated for the region over the next two decades, a unique opportunity to analyse the impact of such change on community health has emerged. Identification of the issues and challenges generated by the proposed research agenda, the options considered to address them and the likely approach adopted to overcome them will be outlined. In particular, the paper focuses on the potential problems in studying a region which is undergoing enormous social, economic and environmental change.

(F) Analysis of the Longitudinal Health Approach Implemented in Belgium

Ann Ingenbleek, Yves Coppieters and Alain Levêque, Université Libre de Bruxelles, Belgium
Lies Lammens and Patrick Deboosere, Vrije Universiteit Brussel, Belgium
Florence Cols and William D’hoore, Université Catholique de Louvain, Belgium

The implementation of e-government strategies in Belgium created opportunities for the modernisation of the health sector. More specifically, a reorganisation of the health information flow was decided on in order to improve access to health and patient safety.

Taking advantage of this context, the national health information system will be supplemented with a component whose objective is to build a coherent and effective longitudinal perspective out of the existing health data sources.

Among the initiatives recently implemented in Belgium that are particularly valuable to the longitudinal approach are the following: the establishment of electronic health records, which leads to the collection of patient-centred health histories; the setting up of registers that gather both clinical information (for the use of health professionals) and administrative data (required for the management of the health care system), which optimizes the human, technical and financial resources needed to generate health data on a recurring basis; and, besides follow-up surveys, the constitution of a ‘permanent sample’ based on care consumption data recorded within the framework of the compulsory sickness insurance.

In this paper, we analyse the country-level circumstances that favour or hamper the implementation of the longitudinal health approach.

Favourable factors consist of revisions to the legislation, the common use of a unique personal identification number and the availability of substantial means for the health sector. Relevant challenges involve societal choices, the development of a global vision on health and the assent of health providers to changes.

(E) Ethical Implications of Longitudinal Data Collection on Both the Individual and the Societal Level

Lies Lammens and Patrick Deboosere, Vrije Universiteit Brussel, Belgium
Florence Cols and William D’hoore, Université Catholique de Louvain, Belgium
Ann Ingenbleek, Yves Coppieters and Alain Levêque, Université Libre de Bruxelles, Belgium

Due to technological advances, researchers in industrialized societies are confronted with a fundamental ethical dilemma: the ‘knowledge’ versus ‘privacy’ dilemma. Longitudinal data especially could be a threat to individuals’ private life, or at least be perceived as threatening, since they contain detailed information on the characteristics and behaviors of individuals, and reveal patterns that may disclose a personal identity. Moreover, to construct longitudinal data individual data often need to be linked repeatedly. This also increases the risk of identification considerably.

In an earlier paper we developed a conceptual framework on the ethical implications of (health) data collecting projects. We reflected on health policy goals within the context of an ever more detailed collection of personal data, which might become a potential source of negative consequences for other societal goals. We pointed to the potentially abusive use of collected data in the short (threatening people’s privacy) and in the long term (threatening democracy).

In this paper we evaluate two countries with different approaches to organizing their statistical systems by confronting them with our theoretical framework. More specifically, we compare the decentralised UK statistical system with the statistical system of Denmark, which is internationally recognized for its extensive use of public registers and highly centralised production of statistics. We question how the organisation of these statistical systems influences the degree to which people’s privacy is protected in the short run and the degree to which a democratic use of collected data is guaranteed in the long run.

(F) Contribution of administrative and medical administrative databases to the Constances cohort

A. Guéguen, R. Sitta, JL. Lanoe, M. Goldberg and M. Zins, INSERM, France
L. Bénézet and G. Santin, Institut de veille sanitaire, France

CONSTANCES (www.constances.fr) is an epidemiological cohort of 200,000 participants who will be followed for several decades to study the effects that risk factors such as health-care disparities have on various health conditions and to provide public health-related information.

On selection, a random sample of general social security system participants (80% of the French population) will be invited to Social Security’s health examination centres in 17 departments for a complete medical check-up. Follow-up is carried out annually by questionnaire and by individual matching in a number of administrative databases: employment data from the national pension fund (labour force participation, non-participation, occupation), health data from the health insurance databases (hospitalizations, health-care utilization, medical procedures) and causes of death.

The invitation to a health examination centre will probably lead to a low participation rate, which will result in selection effects. Conventionally, in analytic epidemiology, such effects are taken into account by including the potential participation factors in the model of the exposure-disease relation, but this approach may be inadequate [Hernan et al., Epidemiology, 2004], and estimation of the frequencies of exposure or disease requires reweighting adjusted for non-participation. For this reason, the cohort of participants will be twinned with a cohort of non-participants that will be tracked in exactly the same way through the administrative databases. The latter will provide data going back up to two years before selection.

Session 10 – Data Collection Issues in Longitudinal Surveys

(E) Keeping in Touch with Mobile Families in the UK Millennium Cohort Study

Lisa Calderwood, University of London, UK

Minimising attrition is a key concern for longitudinal studies. One of the main reasons for this is the risk that if the sample members who are lost to the study are systematically different from those who remain, the study will become less representative of the population of interest over time. Recent research has focused on examining the different sources of attrition: location, contact and co-operation.

This paper will focus on the problem of locating mobile families in the UK Millennium Cohort Study (MCS). The MCS is following over 19,000 children born in the UK in 2000/1. So far there have been four sweeps of the study at 9 months, 3 years, 5 years and 7 years.

This paper will examine what proportion of families who move between sweeps are successfully located through the study’s tracking procedures. It will examine the effectiveness of techniques designed to pick up address changes prior to the start of fieldwork for a particular survey compared with interviewer tracing in the field. In particular, it will focus on the utility of using administrative data from social security records of child benefit receipt for tracking. It will also examine some of the factors associated with success or failure to locate mobile families.

(F) Organization and monitoring of the survey area: Impact on estimator quality for a rotating household panel

Thomas Christin, Stéphane Fleury and Johan Pea, Federal Statistical Office, Switzerland

Total non-response is non-negligible and imperfectly modelled. It increases the variance of our estimators and causes bias despite weighting adjustments. While the probability of response is usually explained and corrected by the respondent’s individual characteristics, we plan to assess the extent to which organizing and monitoring the survey area can also significantly affect response rates.

If we look at the survey area as a complex interview production plant, it seems reasonable to assume that a suitable instrument panel that provides information about its key processes and the means to control them would help us improve its performance. We will describe the monitoring instruments used in Switzerland’s Statistics on Income and Living Conditions (SILC) survey, and we will present a critical assessment of their effectiveness. The SILC survey is a household panel with a four-year turnover period. The data are collected by a private social research institute using a CATI process. The initial sample of more than 10,000 households is divided into several subsamples that are activated in succession. Because Switzerland is a multilingual country, the interviews are conducted in French, German or Italian at two offices that are managed somewhat independently.

Because of its longitudinal component, the SILC provides the opportunity in the second and subsequent waves to anticipate a household’s cooperativeness regarding surveys. We will present various options for estimating cooperativeness and an assessment of their coherence. We will describe the option selected for the SILC survey in Switzerland and the steps taken in the field to boost the response probability of uncooperative households.

We will conclude by examining the extent to which careful survey monitoring designed to optimize the correlation between respondents’ characteristics and their probability of responding may have the effect of replacing one problem with another. What we gain in estimator quality by minimizing total non-response might be partly offset by losses due to partial non-response and response error.

(E) Responsive Design for the Survey of Labour and Income Dynamics

Owen Phillips and Tracy Tabuchi, Statistics Canada

The Survey of Labour and Income Dynamics (SLID) is a longitudinal survey used to measure changes in the economic wellbeing of Canadians and the factors that may influence these changes. Interviews are conducted by means of a computer-assisted telephone interview (CATI). A number of initiatives have been put in place over the years to better manage collection resources and effort. SLID has established an Active Collection Management (ACM) group, whose goal is to monitor collection progress and identify collection problems and corrective actions. Additionally, in pre- and post-collection, the group investigates improvements that might be implemented in the next collection and sees that proposed changes are integrated and tested in order to minimize the potential for problems. Beginning with the 2007 data collection, the SLID adopted a maximum of 40 call attempts to a given household, in an effort to diminish respondent burden and reduce collection costs.

Despite efforts to better manage collection, the SLID has seen a consistent drop in response rates over recent years. As a longitudinal survey, the SLID could benefit from contact history from previous cycles to better develop a contact strategy for the current cycle. This paper presents such an analysis and the application of the results in developing a responsive design (Groves and Heeringa 2006) for the 2010 SLID data collection.

Session 11 – Weighting and Estimation

(E) Propensity Score Weight Adjustment for Dual Sampling Frame

C. Boudreau, M.E. Thompson and M. Iraniparast, University of Waterloo, Canada

The International Tobacco Control Policy (ITC) Netherlands Survey is an ongoing longitudinal survey of more than 2,200 smokers, which began in 2008. It aims to study smoking behaviours and the impacts of national-level tobacco control policies. It is the first survey of the ITC Project (which now has surveys in over 18 countries) to utilize a dual web+RDD sampling frame. The web sampling frame is a database that consists of more than 200,000 Dutch respondents who have agreed to participate, on a regular basis, in research studies conducted by the international survey firm TNS NIPO. Although great care was taken by TNS NIPO to ensure that their database is an accurate representation of the Dutch population, some groups remain underrepresented and others overrepresented. The sampling weights were constructed to adjust as much as possible for this, but some selection bias remained.

In this talk we present a method that utilizes the smaller RDD sample of the ITC Netherlands Survey in conjunction with a propensity score (PS) to adjust the weights of the larger web sample. The method consists of fitting a PS model where response corresponds to survey mode, and then performing a post-stratification adjustment of the web sample weights using estimated PS's. The method is easily modified to handle missing values. We illustrate with data from Wave 1 of the ITC Netherlands Survey, but the method can also be used with other dual frame surveys where one frame can be considered unbiased.
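The post-stratification step can be sketched as follows. This is one common variant and not necessarily the authors' exact procedure. It assumes the propensity scores have already been estimated (for example by a logistic regression of survey mode on covariates) and that the class boundaries are chosen by the analyst. The function name and data layout are hypothetical.

```python
def ps_poststratify(web, rdd, cuts):
    """Rescale web-sample weights so that the weighted distribution of the
    web sample over propensity-score classes matches that of the RDD sample.

    web, rdd: lists of (weight, estimated_propensity_score) pairs
    cuts:     sorted class boundaries, e.g. quintiles of the pooled scores
    Returns the adjusted web weights in the original order.
    """
    def ps_class(score):
        # Index of the class the score falls into.
        return sum(score > c for c in cuts)

    n_classes = len(cuts) + 1
    web_totals = [0.0] * n_classes
    rdd_totals = [0.0] * n_classes
    for weight, score in web:
        web_totals[ps_class(score)] += weight
    for weight, score in rdd:
        rdd_totals[ps_class(score)] += weight

    web_sum, rdd_sum = sum(web_totals), sum(rdd_totals)
    adjusted = []
    for weight, score in web:
        k = ps_class(score)
        # Ratio of the reference (RDD) share to the web share in class k.
        factor = (rdd_totals[k] / rdd_sum) / (web_totals[k] / web_sum)
        adjusted.append(weight * factor)
    return adjusted
```

The adjustment leaves the total of the web weights unchanged as long as every class that contains RDD cases also contains web cases. Classes with no RDD cases receive a factor of zero, so in practice sparse classes would be collapsed first.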

(E) Longitudinal Estimation in the European Survey of Income and Living Conditions

Ralf Münnich and Stefan Zins, University of Trier, Germany

In March 2000, the Lisbon European Council asked the Member States of the European Union to take steps towards improving social cohesion and combating poverty by 2010. In order to adequately measure poverty and social cohesion, the Laeken indicators were agreed on as a set of indicators to be measured and published each year by all member states. These indicators include mainly cross-sectional measures like the at-risk-of-poverty rate or the Gini coefficient. However, to monitor poverty developments some indicators also include longitudinal aspects. As the data source, the European Survey of Income and Living Conditions (EU-SILC) was introduced as a rotational panel survey.

The aim of the paper is to estimate selected poverty measures over time and to assess their accuracy. The research focuses mainly on the estimation of the highly non-linear indicators and their accuracy. Further emphasis is placed on the survey design as a rotational sampling scheme. The EU-SILC consists of four rotational quarters, one of which is substituted each year. Hence, only three quarters overlap from one year to the next. The use of the information from non-overlapping quarters over time will be considered in order to improve the inference. The study will include a Monte-Carlo simulation in a close-to-reality framework.
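The rotation pattern described above can be illustrated with a small sketch. It assumes each rotation group is labelled by the year it enters the sample and stays for four consecutive years. The function names are hypothetical.

```python
def rotation_groups(year, n_groups=4):
    """Entry years of the rotation groups that are in the sample in `year`."""
    return set(range(year - n_groups + 1, year + 1))


def overlap_share(year_a, year_b, n_groups=4):
    """Share of rotation groups common to the samples of two survey years."""
    shared = rotation_groups(year_a, n_groups) & rotation_groups(year_b, n_groups)
    return len(shared) / n_groups
```

Under these assumptions, consecutive years share three of four groups, years two apart share two, and years four or more apart share none, which is why the longest individual trajectory such a panel can observe covers four years.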

The research is done within the AMELI project which is supported by the European Commission within the 7th Framework Programme (cf. http://ameli.surveystatstics.net).

(E) Weighting and Variance Estimation for the German Dual Frame Household Panel Survey “PASS”

Hans Kiesl, Institute for Employment Research (IAB), Germany

The German Institute for Employment Research has just completed the second wave of a new (annual) panel survey focusing on low income households, which is designed as a dual frame survey. The first frame is a register of households that are currently receiving some kind of unemployment benefits; the second frame consists of an address register of the whole population. In the beginning, 6,000 households were selected from each frame (with PPS sampling of zip codes in the first stage), resulting in a large variance of design weights between the two samples. For both subsamples and for the combined sample, weights for households as well as individuals are provided, resulting in six different sets of cross sectional weights for each wave.

In this paper, we describe the challenges of weighting and variance estimation for the first two waves of our survey, including trimming of extreme weights, coordination of household and individual weights, nonresponse adjustment strategies for households and individuals, a comparison of different weight share methods to tackle changes in household composition over time, and the use of convex weighting to integrate a second wave sample of births from the smaller sampling frame (i.e. new households in need of benefits).

Session 12 – Accommodating Missing Data in Longitudinal Survey Data Analysis

(E) Modelling and Analysis of Durations Based on Longitudinal Survey Data

Jerry Lawless and Dagmar Mariaca Hajducek, University of Waterloo, Canada

Data on education, employment, health and other life processes are collected in longitudinal surveys. The durations of spells that persons spend in specific life states are often of interest, for example, the durations of jobless spells, episodes of illness, or periods of social assistance. In spite of many advances in the analysis of life history data over the past 20 years, numerous challenges remain in dealing with data from longitudinal surveys, whose features include complex sampling designs, intermittent ascertainment of data through interviews, frequent missing data, and losses to follow-up through panel attrition. In addition, successive durations for an individual are typically correlated, as are durations of individuals within clusters. This talk will discuss the modelling and analysis of duration times under such conditions. Specific issues include the use of marginal duration models as well as models that condition on individuals' previous life history, the need to allow for state-dependent loss to follow-up, dealing with induced dependent censoring due to within-individual correlation of spell durations, and ways to deal with mismeasured or missing data on durations or other variables. The modelling and analysis of jobless spells from Statistics Canada's Survey of Labour and Income Dynamics will be used for illustration.

(E) Analysis of Longitudinal Surveys with Missing Responses

Changbao Wu, University of Waterloo, Canada
Ivan Carrillo Garcia, Statistics Canada

Longitudinal surveys have emerged in recent years as an important data collection tool for population studies where the primary interest is to examine population changes over time at the individual level. The generalized estimating equation (GEE) approach is the most popular statistical inference tool for longitudinal studies. The vast majority of existing literature on the GEE method, however, uses the method for non-survey settings, and issues related to complex sampling designs are ignored.

We propose methods for the analysis of longitudinal surveys when the response variable contains missing values. Our methods are built within the GEE framework, with a major focus on using the GEE method when missing responses are handled through imputation.

We first argue why, and then show how, the survey weights can be incorporated into the so-called Pseudo GEE method under a joint randomization framework, with the missing responses handled either by a re-weighting method or by imputation. Consistency of the resulting GEE estimators of the regression coefficients is established under certain regularity conditions. Linearization variance estimators are developed under the assumption that the finite population sampling fraction is small or negligible, a condition that often holds for large-scale population surveys. Finite sample performances of the proposed estimators are investigated through a simulation study. The results show that the proposed GEE estimators and the linearization variance estimators perform well under several sampling designs for both continuous and binary responses.

(E) Longitudinal Studies with Missing Response and Missing Covariate: An Application to the ITC4 Survey Study

Baojiang Chen, University of Washington, U.S.A.
Mary Thompson, University of Waterloo, Canada

Data from longitudinal studies often feature both incomplete response and incomplete covariate data. The impact of missing data often depends on the frequency with which data are missing and the strength of the association between the missing data indicators and the response variables. When both response and covariate data may be incomplete it is important to take the association between the missing data indicators for these two processes into account through joint models. Inverse probability weighted generalized estimating equations are developed to deal with data that are missing at random. Empirical studies demonstrate that the consistent estimators arising from the proposed methods have very small empirical biases in moderate samples, and are more efficient than alternative methods which ignore the association between the missing data processes. An application to the International Tobacco Control (ITC) Four Country Survey demonstrates the usefulness of the proposed method.

Session 13 – Factors and Impacts of Non-Response

(E) Factors Associated with Different Patterns of Non-Response in the English Longitudinal Study of Ageing (ELSA)

Hayley Cheshire and David Hussey, National Centre for Social Research, UK

Understanding the factors associated with attrition is of key relevance for those managing longitudinal studies. If analysts can identify those groups most likely to stay in or drop out of the study, the survey design process can be tailored (e.g. through fieldwork practices) to maximise likely response.

A review of the current literature relating to people aged 55 and over (Bhamra et al, 2008) has identified some factors related to attrition – for example being older, cognitively impaired, having lower socio-economic status, and being less well educated.

We propose to study the factors associated with different patterns of participation from wave 1 to wave 3 of the English Longitudinal Study of Ageing (ELSA). ELSA is a study of people aged 50 and over and their younger partners. A total of 12,100 individuals were included at baseline and subsequently followed up as part of the study.

The following comparison groups will be used:

  • Completed interviews at all waves
  • Dropped out at wave 2, but came back at wave 3
  • Dropped out at wave 2
  • Dropped out at wave 3

Our analyses will expand the current focus on demographic factors to include survey variables which can be used to provide some indication of level of commitment to the study at wave 1. Of key interest are factors relating to response behaviour, for example – the use of extremes or mid-points on answer scales, item non-response, willingness to consult documents during interview, and consent to linkage of government administrative data.

(E) Empirical Investigation of Nonresponse Bias Due to Attrition in the National Survey of College Graduates (NSCG)

Donsig Jang, Mathematica Policy Research, U.S.A.
John Finamore and David Hall, U.S. Census Bureau, U.S.A.
Steve Cohen, Flora Lan and Fan Zhang, National Science Foundation, U.S.A.

The NSCG is a biennial survey whose main goal is to produce estimates representing the target population of U.S. scientists and engineers at a fixed reference date. The Decennial Census long form is used as its sampling frame. Because the complete sampling frame is available only once a decade, NSCG uses the data for several rounds of its survey with periodic supplemental samples of new graduates in sciences and engineering fields.

In a longitudinal survey like the NSCG, follow-up surveys should include nonrespondents in their samples to minimize bias. However, most initial nonrespondents become persistent refusals, and it is therefore difficult to gain their cooperation in the next round. For this reason, the NSCG sample had been followed in three subsequent rounds only if individuals continued to respond to the survey. However, it is anticipated that the inclusion of only respondents in the next round of the NSCG would cause a substantial survey bias even if customary nonresponse weighting adjustments were made. To understand the nonresponse bias due to the longitudinal nonrespondents, we compared estimates from the sample based on 1990s Decennial long form respondents collected in 2003 to those from a new sample drawn from Census 2000 initially surveyed in 2003. In this paper, we will present results from this investigation that offer empirical insights into the effects of attrition on survey bias.

(F) Factors associated with participation in the GAZEL cohort

Marie Zins, Jean François Chastang, Mireille Coeuret-Pellicer, Annette Leclerc, Sébastien Bonenfant, Alice Guéguen, Anna Ozguler and Marcel Goldberg, INSERM, France

Background: The GAZEL cohort was selected in 1989 from Électricité de France-Gaz de France employees aged 35 to 50. A postal questionnaire was used for selection, and 20,625 subjects (15,011 men and 5,614 women) agreed to take part. Follow-up consisted of an annual postal questionnaire and an invitation to go to a health examination centre for a medical check-up. The initial participation rate was 44.5%. Each year about 75% of the subjects mailed back the questionnaire, and 44.7% visited a health examination centre.

Objectives: Study the socio-demographic, behavioural, occupational and health-related factors associated with the effects that being selected had on participation in the annual follow-up and the health examination, and quantify their role.

Methods: On selection, volunteers were compared with non-participants using variables collected systematically from the company’s medical administrative databases (absenteeism, mortality, occupational exposure) in logistic regression models. Mixed models were used to study the probability of responding to the annual questionnaires during follow-up, while logistic regression models were used to study the probability of participation in the health examination.

Results: The various steps – selection, follow-up and visit to a health centre – are not always affected by the same factors. The magnitude of the selection effects varies from step to step. This study describes the potential biases that this can generate.

(F) Strategies for studying non-response bias in the Coset (Cohorte santé et travail) and Constances (Cohorte des consultants des centres d’examens de santé) cohorts

Laetitia Bénézet, Gaëlle Santin, Stéphanie Gauvin, Hélène Sarter and Béatrice Geoffroy-Perez, Institut de veille sanitaire, France
Alice Guéguen, Rémi Sitta, Marie Zins and Marcel Goldberg, INSERM, France
Nicolas Razafindratsima, Institut National d’études démographiques, France

Non-response causes recurring problems in longitudinal surveys (biased estimates, inflated variances). Such problems will arise in the Coset (workplace epidemiological monitoring) cohort and the Constances (health examination centre consultants) cohort. One of the purposes of these cohorts, currently in pilot mode, is to describe and track changes in the health status of the target populations, particularly in relation to employment. The expected participation rate is unlikely to exceed 20% to 30%. A common strategy for correcting non-response bias was developed for the two cohorts. It uses administrative databases containing information about the healthcare claim payments and employment histories of participants and non-participants. To assess the strategy’s effectiveness, a telephone post-survey will be conducted during the pilot selection phase. A sample of non-participants will be surveyed using a short questionnaire, with the aim of obtaining a response rate close to 90%. Following the survey, the estimates produced by the two methods will be compared.

The description of the strategies will be based on their use in the pilot phase of the Coset cohort’s farm worker component, which will begin in the second half of 2009. In that phase, 10,000 members of the labour force will be selected at random and surveyed with a self-administered postal questionnaire regarding their health and employment history.

Session 14 – General Methodological Issues

(E) Sample Allocation for the 2010 Decade of the National Survey of College Graduates

John Finamore and David Hall, U.S. Census Bureau, U.S.A.
Donsig Jang, Mathematica Policy Research, U.S.A.
Stephen Cohen, Flora Lan and Fan Zhang, National Science Foundation, U.S.A.

The National Survey of College Graduates (NSCG) is a biennial longitudinal survey that derived its current sample from the 2000 decennial census long form. With the American Community Survey (ACS) replacing the long form, we are planning to use the ACS as a sampling frame for the 2010 decade of the NSCG. In the planning phase, we examined NSCG design options for the 2010 decade and decided on a rotating panel design. To transition into this rotating panel design, part of the 2010 NSCG sample will be selected from the ACS frame and part will be carried forward from the 2000-decade NSCG sample. In subsequent survey cycles, the ACS-based sample will be carried forward, but the 2000-decade cases will rotate out and will be replaced by sample from more recent ACS years.

The 2010 decade of the NSCG will be designed to use the ACS sampling frame to provide statistically reliable estimates for the key NSCG analytical domains. Based on current funding, the 2010 and 2012 NSCG will select approximately 130,000 sample cases from the ACS-based frame over the course of two NSCG survey cycles. This paper will discuss our research to identify the NSCG key analytical domains, establish the reliability thresholds for these NSCG domains, and develop an algorithm to determine the sample allocation under the reliability thresholds. The sample allocation algorithm will initially focus on the 2012 NSCG design (the first complete ACS-based design), but will also investigate allocating the NSCG sample in 2010, 2014, and beyond.

(E) The Life Pathways Project: Design and Methodological Issues

Trivina Kang, Melvin Chan, Tan Teck Kiang and David Hogan, Nanyang Technological University, Singapore

This paper discusses the research design and methodological issues related to the Life Pathways Project, which was conducted from 2004 to 2008 at the Center for Research in Pedagogy and Practice. This study of academic and non-academic outcomes of 30,000 students from three cohorts (Grade 4-6, Grade 7-10, and Post Secondary 1-3) is the largest project of its kind in Singapore. The range of outcomes studied included 21st century economic skills, subjective well-being, citizenship, life goals and aspirations. Students were selected through a stratified random sample of national schools, and in each selected school the entire cohort of students participated in the online survey. In total, 38 primary, 37 secondary and 27 post-secondary institutions were involved in the study. In addition, the secondary cohort students also participated annually in a pen and paper assessment in English and Mathematics, and their cognitive scores were merged with their self-report survey data.

In the presentation, in addition to discussing the design of the project from conception to analysis, we will discuss the issues we faced as we piloted and validated our instruments and tracked this large group of students over the years, the measures taken to minimize attrition, and the issues we faced as we cleaned our dataset and ensured that it was ready for analysis. We will also showcase some preliminary results from our first cut of longitudinal analysis.

Although this paper does not include Canadian data, it is our hope that this presentation will allow us to share and learn from longitudinal studies that are being conducted in different contexts as well as explore opportunities for future international collaborations.

(F) Using tax and social insurance data to measure living conditions in Switzerland

Philippe Wanner, University of Geneva, Switzerland

Our paper presents the methodological approach used to construct a longitudinal database and carry out a series of analyses concerning income, wealth and poverty in Switzerland. Conducted by the federal social insurance office, the study uses cantonal tax data to describe the population’s economic situation, examine the financial consequences of such changes as retirement and disability, and set priorities for old age security.

The data cover the period from 2003 to 2007 and are taken from tax records. They describe how taxpayers’ employment income, other sources of income and wealth varied over time in 9 of Switzerland’s 26 cantons. When compared with social insurance records using deterministic matching (based on social insurance number), the data also provide indicators concerning the variation in income subject to social contribution over the last decade, pension payments, widowhood and disability status. We will show how the tax data, which are comprehensive and of very high quality since they have been validated by the tax collector and the taxpayer, help identify the groups at risk of financial insecurity and reveal financial practices associated with retirement.

In particular, our paper will demonstrate the advantages and limitations of these administrative data (compared with other sources of data about income, such as surveys) and the precautions to be taken in using them.

(E) Experiences with the Design and Analysis of Longitudinal Data at Statistics New Zealand

Deborah Brunning, Statistics New Zealand, New Zealand

Prior to the turn of the century, Statistics New Zealand had little experience with the design and analysis of longitudinal data. However, over the last decade, the interest of policy makers in New Zealand, like their counterparts in many other countries, has identified a need for more information to enable them to study patterns and dynamics beyond what is achievable with repeated cross-sectional snapshots. To respond to this need, over the past 10 years Statistics New Zealand has: designed and run 7 waves of an 8 wave survey to measure income and employment dynamics (known as SoFIE); conducted, in partnership with the Department of Labour, 2 waves of a 3 wave longitudinal survey to measure migrants' settlement experiences in New Zealand (LISNZ); developed a longitudinal business database by combining data from a number of sources; and developed a dataset which enables us to examine the longitudinal patterns and dynamics of both employers and employees using administrative data from the taxation system (LEED).

In this paper we will discuss our experiences with these developments. There have been both achievements and significant challenges encountered in this work, in areas such as design and implementation of collection methodologies, use of computer assisted methods and confidentiality and data access. We will discuss how these experiences are influencing our thinking about future longitudinal data collections in Statistics New Zealand.

Session 15 – Redesign of Large-scale Longitudinal Surveys

(E) Continuity and Innovation in the Design of Understanding Society: The UK Household Longitudinal Study

Heather Laurie, University of Essex, UK

The British Household Panel Survey (BHPS) is a household panel survey of around 8,000 households in the UK which has completed 18 annual waves of data collection. The BHPS has been the major source of panel data for the UK since 1991 and is widely used by academic and policy researchers. Following extensive consultation with the user community, a contract to establish a new and larger household panel of 40,000 households, called Understanding Society: the UK Household Longitudinal Study (UKHLS), was awarded to the research team responsible for the BHPS. Wave 1 of the UKHLS began in January 2009 with wave 2 commencing in January 2010. The design of the UKHLS includes the incorporation of existing BHPS sample members from wave 2 of the new study. This paper outlines the design of the UKHLS and how experience from the BHPS informed design and implementation decisions in setting up the new study. The rationales for decisions made in incorporating the BHPS into the new study are considered. These include reconciling the competing demands for continuity and innovation, decisions on the timing of interviews throughout the year, questionnaire content, and maintaining panel loyalty while making the transition to a different fieldwork organisation.

(E) Survival and Revival of the Survey of Income and Program Participation

S. Johnson, U.S. Census Bureau, U.S.A.

For the past two decades, the Survey of Income and Program Participation (SIPP) has been the leading source of data about the economic well-being of Americans. SIPP has been used to evaluate the effectiveness of government programs by many federal, state, and local agencies, academic institutions, and private research and policy study bodies.

Recently, the U.S. Census Bureau initiated a project to reengineer the SIPP in order to provide crucial information in a timely manner and at reduced cost through reengineered survey design, improvements in processing efficiency, and a focused content scope. The main purpose of SIPP is to provide a nationally representative sample that can be used to evaluate the annual and sub-annual dynamics of income, the movements into and out of government transfer programs, and the effect on family and social context of individuals and households. The main activities of this reengineering process include: (1) improvements in the collection instrument and processing system; (2) development of an Event History Calendar in the instrument; (3) use of administrative records data to supplement and evaluate survey data; and (4) development of survey content and use of reimbursable supplements, through interactions with stakeholders.

An important activity initiated with the development of reengineered SIPP improvements was the interaction and consultation with stakeholders on both content and design of the proposed improvements for the re-engineered SIPP. In addition, the Bureau began fielding the current SIPP collection in September, 2008. Most recently, other activities in the re-engineering process have included:

  • Evaluating a test of a paper version of an Event History Calendar (EHC) questionnaire.
  • Planning for a larger scale test of an automated EHC questionnaire in early 2010.
  • Reconstitution of an American Statistical Association/Statistical Research Methodology SIPP advisory subcommittee.
  • Open meetings to discuss information needed to obtain recommendations from a National Academy of Sciences Committee on National Statistics panel commissioned to advise on plans for, and research on, the use of administrative records in the re-engineered SIPP.
  • Procurement of administrative record files and consultation services for some national level and selected state level data on government programs to assess data quality of both the paper and automated test data.
  • Plan for in depth consultation on training for Field Representatives in the Event History Calendar methodology of interviewing.

(E) Results from the Canadian Household Panel Survey Pilot

Andrew Heisz, Statistics Canada

In January 2006, a conference on longitudinal surveys hosted by Statistics Canada, the Social Sciences and Humanities Research Council of Canada (SSHRC) and the Canadian Institutes of Health Research (CIHR) concluded that Canada lacks a longitudinal survey which collects information on multiple subjects such as family, human capital, labour and health, and follows respondents for a long period of time. Following this conference, funds were received from the Policy Research Data Gaps fund (PRDG) to support a pilot survey for a new Canadian Household Panel Survey (CHPS-Pilot). Consultations on the design and content were held with academic and policy experts in 2007 and 2008, and a pilot survey was conducted in the fall of 2008. The objectives of the pilot survey were to (1) test a questionnaire, evaluate interview length and measure the quality of data collected; (2) evaluate several design features; and (3) test reactions to the survey from respondents and field workers. The pilot survey achieved a response rate of 76%, with a median household interview time of 64 minutes. Several innovative design features were tested and found to be viable. Response to the survey, whether from respondents or interviewers, was generally positive. This paper highlights these and other results from the CHPS-Pilot.

Session 16 – Latent Models and Bayesian Estimation

(E) Latent Growth Curve Modelling of Life Satisfaction Trajectories in the British Household Panel Survey

Maria de Fátima Salgueiro, ISCTE Business School, Portugal
Marcel de Toledo Vieira, Federal University of Juiz de Fora, Brazil
Peter W. F. Smith, University of Southampton, UK

Recent years have seen a growth in interest in subjective well-being (SWB) by social scientists. Several measures have been proposed, the choice of measurement instrument influencing the assessment of SWB and its determinants (Peasgood, 2007). The British Household Panel Survey (BHPS) is a nationally representative survey conducted on an annual basis since 1991. Several SWB measures are available in the BHPS. Since wave 6, in addition to a question on overall life satisfaction, respondents have been asked to rate their satisfaction levels with eight domain dimensions (health, income, house/flat, spouse/partner, job, social life, amount of leisure time, and its use). Statistical approaches adopted in the literature to model SWB often include ordered probit models and fixed effects models. Random effects models and cross-lagged structural equation models have also been proposed to model longitudinal survey data (see, e.g., Berrington et al., 2008).

In the current paper trajectories of life satisfaction are modelled using BHPS data. Employees who were interviewed in all waves 1 to 15 and fully answered all life satisfaction variables in all waves are considered. Latent growth curve modelling is used to model both within-individual and between-individual level variation in the two perceived life satisfaction latent factors considered. Possible determinants of life satisfaction include age, gender, having children, family income, level of education and number of working hours. The advantages of the proposed statistical approach over the more traditional longitudinal regression modelling approaches are highlighted. The benefits of taking into account the complex survey design are discussed.

(E) A Latent Transition Analysis Approach to Modeling Unobserved Population Heterogeneity over Time

Andy Ross, National Centre for Social Research, UK

This paper explores the usefulness of latent transition analysis (LTA) for modeling unobserved population heterogeneity across time. Specifically, we use a latent class framework to capture subgroups of young people disengaged/engaged with education at ages 14 to 16, and model transitions across these various groups over time. Previous quantitative research in this area has often used narrow, single-dimension definitions of disengagement such as truancy or underachievement. Our study goes beyond this, using a statistical approach that captures the multidimensional nature of disengagement by drawing on information from a range of measures. Subgroups of disengaged/engaged young people are defined by their combined responses to questions measuring aspirations, attitudes and behaviour.

The analysis proceeds in stages. In a first step we estimate the latent subgroups for three waves of data. We then employ latent transitions analysis to test the stability of these subgroups and measure transitions across time. In a final step covariates (both fixed and time-varying) are added, measuring characteristics of the individual and their experiences within the home and school in order to explore when and why some young people disengage from education. Data for the study come from the Longitudinal Study of Young People in England (LSYPE) a contemporary panel study of 15,000 young people through years nine to eleven, and into the early years following post-compulsory schooling.

(E) Longitudinal Mixed-Membership Models for Survey Data on Disability

Daniel Manrique-Vallier and Stephen E. Fienberg, Carnegie Mellon University, U.S.A.

When analyzing longitudinal data we need to balance our understanding of individual variability with the production of meaningful and interpretable summaries of overall population tendencies. This is especially true when those in the target population are known to be heterogeneous in their ways of progressing over time due to unobserved individual traits. Additional complications arise when the data are discrete and multivariate so that the resulting contingency tables are very sparse.

We propose a new family of models to analyze such data by combining features from a version of the cross-sectional Grade of Membership Model (Erosheva et al., 2007) and from the longitudinal Multivariate Latent Trajectory Model (Connor, 2006). These models assume the existence of a small number of “typical” or “extreme” classes of individuals and model their evolution over time. We regard individuals as belonging to all of these classes to different degrees, by considering them as convex weighted combinations of the extreme classes. In this way, we are able to describe distinct general tendencies (the extreme cases) while accounting for individual variability. We propose a full Bayesian specification and estimation methods based on Markov Chain Monte Carlo sampling.

We apply our method to data from the National Long Term Care Survey (NLTCS), a longitudinal survey with six completed waves aimed at assessing the state and characteristics of disability among U.S. citizens age 65 and above. A simple extension of our methods allows us to answer some relevant questions about the changes in disability across generations.

(E) Analyzing Longitudinal Mixed Categorical Outcomes with Potential Missing Data Using a Bayesian Approach

Z. Rezaei Ghahroodi and S. Eftekhari, Statistical Research and Training Center, Iran
M. Ganjali, Shahid Beheshti University, Iran

In panel studies, mixed categorical outcome measures are often collected over time on the same individual, together with time-stationary and time-varying explanatory variables, to investigate the effects of the explanatory variables on the responses. A regression analysis for these types of data must allow for the correlation between variables over time and also for the correlation between mixed responses for each individual at a specific time. In this paper, a transition Markov model with random effects for analyzing longitudinal mixed ordinal and nominal responses with missing values in both responses is used to investigate between-individual and within-individual changes over time. Since missing data are unavoidable in such studies, we propose a method which is able to consider these longitudinal variables and their potential missingness simultaneously using a data augmentation algorithm. A Bayesian approach is therefore used to estimate the model parameters, with Gibbs sampling used to perform parameter estimation and data augmentation. Results of using a full random effect transition model are compared with those of three other models which exclude the random effect and/or the transition effect. The approach is applied to British Household Panel Survey (BHPS) data, where the two correlated response variables of interest are life satisfaction as an ordinal response and current economic activity status as a nominal response. It is shown that the full model gains more interpretability due to the consideration of all aspects of the collected data.

Session 17 – Measurement Errors

(E) Nonresponse and Measurement Error in Employment Research

Frauke Kreuter, JPSM University of Maryland, U.S.A.
Gerrit Mueller and Mark Trappmann, IAB Institute for Employment Research, Germany

Survey methodologists are increasingly concerned with the interaction of multiple error sources. Particularly prominent are discussions about nonresponse and measurement error. One hypothesis that is often found among practitioners is that sample cases that are brought into the survey only after repeated attempts and alternated recruitment strategies, are more likely to provide low quality data (e.g. Groves and Couper 1998). Data quality is often internally assessed through the proportion of missing items, proportion of don’t knows and the like (e.g. Fricker 2007). Rarely, in these studies, are external data available to evaluate the quality of respondents’ answers (e.g. Cannell & Fowler 1963, Olsen 2006).

The panel study PASS (Trappmann et al. 2009) is a novel dataset in the field of labor market, welfare state and poverty research in Germany. With almost 19,000 interviewed persons in more than 12,500 households, PASS is currently one of the most comprehensive panel surveys in Germany. The first round of data collection started in 2006. In PASS, survey data on the employment and unemployment history, income and education of participants can be linked to corresponding data from respondents' administrative records.

Based on this study, we give an assessment of data quality as a function of contactability and response propensity. For only some variables does the measurement error (variance or bias), as assessed through the administrative records, increase with decreasing contactability and response propensity of the target persons. In particular, this is found in the case of retrospective questions. Here, the differing length of time between date of interview and event explains a large part of the difference in measurement error between respondents with high vs. low response propensity.

(E) Inconsistencies in Reported Job Characteristics among Employed Stayers: Evidence from a Series of Two-Wave Panels from the Italian Labour Force Survey, 1993-2003

Francesca Bassi and Ugo Trivellato, University of Padova, Italy
Alessandra Padoan, ISTAT, Italy

In this paper we deal with measurement error, and its potentially distorting role, in information on industry and professional status. As a case study we consider two-wave panels one year apart collected by the Italian Quarterly Labour Force Survey in the period from April 1993 to April 2003. The focus of our analyses is on inconsistent information on employment characteristics – industry and professional status – resulting from yearly transition matrices for workers who reported that they were continuously employed over the year and did not change job.

First, we compute and comment upon some usual indicators of disagreement. We find clear evidence that there is sizable measurement error in both industry and professional status. Then, we test whether the consistency of repeated information significantly increases when the number of categories is collapsed. Aggregating categories improves agreement. For professional status the best level of aggregation is the binary one (Employee/Self-employed); for industry, two classifications minimize inconsistencies, with 5 or 6 classes. We further explore the patterns of inconsistencies among categories of variables by testing several specifications of Goodman’s quasi-independence model. The model is almost always rejected, which points to the fact that even cross-section information is affected by non-random measurement error. Lastly, we consider and compare alternative 4-category classifications obtained by collapsing professional status and industry into a single variable. In this case the best level of aggregation is given by a non-standard 4-category classification, which distinguishes employees in the market services on one hand and in the industrial sector and private services on the other.
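The effect of collapsing categories on agreement can be illustrated with a small sketch. The abstract does not say which indicators of disagreement the authors use, so the two shown here (the raw agreement rate and Cohen's kappa) are assumptions chosen for illustration, as are the function names and the toy 3x3 transition matrix.

```python
def agreement_indicators(matrix):
    """Return (observed agreement, Cohen's kappa) for a square count matrix.

    Rows are categories reported at wave 1, columns at wave 2.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(k)) / total
    row_share = [sum(matrix[i]) / total for i in range(k)]
    col_share = [sum(matrix[i][j] for i in range(k)) / total for j in range(k)]
    # Agreement expected by chance if the two reports were independent.
    expected = sum(r * c for r, c in zip(row_share, col_share))
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


def collapse(matrix, groups):
    """Aggregate categories; `groups` lists the original indices in each new class."""
    return [
        [sum(matrix[i][j] for i in g_row for j in g_col) for g_col in groups]
        for g_row in groups
    ]


# Hypothetical counts for stayers: categories 0 and 1 are often confused.
detailed = [
    [30, 5, 0],
    [5, 30, 0],
    [0, 0, 30],
]
coarse = collapse(detailed, [[0, 1], [2]])

obs_detailed, kappa_detailed = agreement_indicators(detailed)
obs_coarse, kappa_coarse = agreement_indicators(coarse)
```

In this toy example every inconsistency lies between the two merged categories, so agreement on the collapsed classification is perfect. Real transition matrices would show a smaller gain.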

(E) Challenges and Insights from Overlapping Seams in the HILDA Survey

Nicole Watson, University of Melbourne, Australia

An issue unique to longitudinal surveys is seam effects. These occur when there is a tendency for changes in the data to unusually concentrate in adjoining periods from different interviews. One component of the Household, Income and Labour Dynamics in Australia (HILDA) Survey subject to seam effects is the labour market activity calendar. In this calendar respondents are asked to recall the various jobs they have had over a 14 to 18 month period, the time spent in unemployment and the time spent outside the labour force. As the calendar is administered every wave, an overlap of 2 to 6 months results depending on when the respondent is interviewed.

In this paper, we separately model the likelihood that respondents will make three types of errors in the activity calendar: i) reporting a spell in the first version of events and not in the second; ii) reporting a spell in the second version of events and not in the first; and iii) misplacing a spell in the second version of events compared to the first. The characteristics considered in the model include the various causes of errors in dating events, such as spell length, spell type, duration of the overlapping seam, recall ability of the respondent, and characteristics of the interview that may affect the respondent’s recall. The overlapping seam also permits the study of measurement error over time to identify whether the same people continually make the same mistakes.
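A seam effect of the kind described above can be detected with a simple comparison of transition rates. The sketch below is not the HILDA methodology. It assumes a hypothetical layout in which each respondent has one stitched monthly status history and the seam falls at the same position for everyone.

```python
def transition_rates(histories, seam_index):
    """Compare the rate of status change at the seam with the rate elsewhere.

    histories: list of monthly status sequences, one per respondent.
    seam_index: position i such that months i-1 and i were reported in
        different interviews (assumed identical for all respondents).
    Returns (rate at the seam, rate between other adjacent months).
    """
    seam_changes = seam_pairs = 0
    other_changes = other_pairs = 0
    for states in histories:
        for i in range(1, len(states)):
            changed = states[i] != states[i - 1]
            if i == seam_index:
                seam_pairs += 1
                seam_changes += changed
            else:
                other_pairs += 1
                other_changes += changed
    if seam_pairs == 0 or other_pairs == 0:
        raise ValueError("histories too short to compare seam and non-seam months")
    return seam_changes / seam_pairs, other_changes / other_pairs


# Hypothetical data: E = employed, U = unemployed; seam between months 1 and 2.
histories = [
    ["E", "E", "U", "U"],
    ["E", "E", "E", "E"],
    ["U", "U", "E", "E"],
    ["E", "U", "U", "U"],
]
seam_rate, other_rate = transition_rates(histories, seam_index=2)
```

A seam rate well above the within-interview rate suggests that reported changes are concentrating at the interview boundary.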

Session 18 – Imputation

(E) Usefulness of Imputation in Longitudinal Surveys

Roberto Gismondi, ISTAT, Italy

A well known problem concerning business surveys is the evaluation of non response treatment. Many recent methodological contributions – mostly driven by empirical simulations and comparisons – underline the general idea that weighting should be preferred to imputation. However, one may easily show that each imputation implicitly corresponds to a weighting process and vice-versa: the real problem consists in making the best use of auxiliary information for non response correction, such as the databases of historical data often available in longitudinal surveys.

The non response problem is particularly crucial in census or cut-off surveys (for instance, the monthly industrial production and turnover indexes in Italy and many EU countries), where the presence of attrition can adversely affect survey quality, thus leading to recourse to massive imputation or re-weighting of respondents’ data. Note that a common case falling in this category is late reporting of business data that are required to be available soon after the reference period, in order to guide users and decision makers efficiently.

According to a model based approach, in this context we evaluate the model variance of estimators of the population total based both on true and imputed data. Particular issues concern these topics: 1) links between imputation and weighting; 2) comparison among variances of estimators based on: simple expansion of respondents’ data, model based imputation of non respondents and donor imputation techniques; 3) theoretical conditions under which each estimation strategy can be preferred to the others. We also present an empirical application to the estimation of wholesale trade sector turnover in Italy, on the basis of quarterly sample data collected by ISTAT for the period 2003-2007. Different estimation techniques have been compared, in a framework where the whole population and the sample of respondents are given by, respectively, the sample of final respondents (responses after 180 days from the end of the reference quarter) and the “quick” respondents’ sample (responses after 60 days).

(E) On Balanced Random Imputation in Surveys

David Haziza, Université de Montréal, Canada
Guillaume Chauvet and Jean-Claude Deville, Laboratoire de Statistique d’Enquête (CREST/ENSAI), France

Random imputation methods are often used in practice because they tend to preserve the distribution of the variable being imputed, which is an important property when the goal is to estimate quantiles. A special case of random imputation, random hot-deck imputation, is often used in practice if the variable being imputed is categorical because it eliminates the possibility of impossible values. It is also used when more than one variable is to be imputed at a time, because the same donor can be used to impute all the missing values, which helps preserve the relationships between variables. However, random imputation methods introduce an additional amount of variability, called the imputation variance, due to the random selection of residuals. In this presentation, adapting the Cube method (Deville and Tillé, 2004) for selecting balanced samples, we propose a class of random balanced imputation methods which reduce or eliminate the imputation variance while preserving the distribution of the variable being imputed. The proposed class of imputation methods can be applied to both categorical and continuous variables. It can also be used for any type of sampling design. The results of a limited simulation study will be presented.
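To make the notion of imputation variance concrete, here is a minimal sketch of plain random hot-deck imputation. It is not the balanced method proposed by the authors and does not use the Cube method. The data, the function names and the use of seeds are all illustrative assumptions.

```python
import random


def random_hot_deck(values, rng):
    """Replace each missing value (None) with a randomly chosen observed donor value."""
    donors = [v for v in values if v is not None]
    if not donors:
        raise ValueError("no donors available")
    return [v if v is not None else rng.choice(donors) for v in values]


def imputed_mean(values, seed):
    """Mean of the completed data for one random realisation of the imputation."""
    completed = random_hot_deck(values, random.Random(seed))
    return sum(completed) / len(completed)


# Hypothetical sample with two missing values.
observed = [12.0, None, 15.0, 9.0, None, 11.0]

# Repeating the imputation with different seeds changes the estimate even
# though the sample is fixed: that spread is the imputation variance.
means = [imputed_mean(observed, seed) for seed in range(5)]
```

Every imputed value is a value actually observed for some donor, which is why hot-deck imputation cannot produce impossible values.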

(E) Testing New Imputation Methods for Earnings in the Survey of Income and Program Participation

Martha Stinson and Gary Benedetto, U.S. Census Bureau, U.S.A.

This paper explores the feasibility and effectiveness of three significant changes to standard Census Bureau methods of imputing earnings in the Survey of Income and Program Participation (SIPP). Currently imputation is performed by stratifying the data based on a set of analyst-chosen characteristics, randomly sorting within each sub-group, and choosing a donor based on the nearest neighbor. We investigate the possibility of using a model-based approach, supplementing survey-collected job and demographic characteristics with administrative earnings data, and using multiple imputation as proposed by Rubin. We will model monthly earnings from January 2004 to December 2005 using the SIPP 2004 panel linked to W-2 tax records extracted from the Social Security Master Earnings file. We will use linear regression techniques to estimate a posterior predictive distribution that is the distribution of earnings conditional on all observed characteristics (including administrative earnings). From this distribution, we will take four draws to create four imputed values per case with missing earnings. After thus "completing" the missing data, we will compare results using original versus new imputed values from several standard analyses in order to assess the impact of our new method. In particular, we will look at coefficients in a classic earnings regression, trends in earning changes over time, the moments of the cross-sectional earnings distribution for a particular month, and poverty levels as based on family income, of which earnings are an important component. The four imputed values will allow us to calculate variance estimates using Rubin's multiple imputation variance formulae and to assess the impact of imputation on the significance of regression coefficients, the shape of the earnings distribution, and the margin of error on poverty estimates.
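The variance calculation referred to above can be sketched with Rubin's standard combining rules: the pooled estimate is the mean of the m completed-data estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The Python sketch below assumes m = 4 as in the abstract; the function name and the numbers are hypothetical and are not results from the paper.

```python
from statistics import mean, variance

def rubin_combine(estimates, within_variances):
    """Combine m completed-data estimates with Rubin's rules.

    Returns (pooled estimate, total variance), where
    total variance = W + (1 + 1/m) * B,
    W = mean of the within-imputation variances,
    B = sample variance of the m estimates (denominator m - 1).
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two estimates, one variance each")
    pooled = mean(estimates)
    w = mean(within_variances)
    b = variance(estimates)
    return pooled, w + (1 + 1 / m) * b

# Hypothetical estimates of one regression coefficient from four
# imputed data sets, each with its own estimated sampling variance.
pooled, total_var = rubin_combine([10.0, 12.0, 11.0, 11.0],
                                  [4.0, 4.0, 4.0, 4.0])
```

With only four imputations the (1 + 1/m) factor equals 1.25, so the between-imputation component is inflated noticeably compared with a large number of draws.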

Session 19 – Edit and Imputation

(E) EU-SILC in Slovenia – Experiences so far

Rudi Seljak, Statistical Office of the Republic of Slovenia, Slovenia

The Survey on Income and Living Conditions (SILC) is a harmonized European survey that aims to provide data on the living conditions of household members and selected individuals and on how they integrate into society. The survey is designed as a panel survey, meaning that each selected household should be followed for four consecutive years. Since the survey is output harmonized, the output variables are prescribed by the European Regulation, while the method of collecting the input micro-data is largely left to each country.

In Slovenia the micro-data for the EU-SILC survey are gathered from two kinds of sources. The first part of the data is collected by the “classical” survey, while the second part is gathered from registers and administrative sources. The extensive use of administrative sources has the obvious advantage of reducing response burden and survey costs. In addition, because the questionnaire is shorter, unit and item non-response are lower, which can significantly improve quality. However, such usage can also have disadvantages, the most obvious being the increased amount of data editing work.

In the paper we will summarize the four-year “EU-SILC experience” of merging data from different sources and point out the main advantages and disadvantages of such an approach. A significant part of the paper will be devoted to the new application for data editing, which should improve the efficiency of our editing procedure and consequently improve the timeliness of the final results.

(E) Longitudinal Data Editing for the Italian LFS

Simona Rosati and Barbara Boschetto, ISTAT, Italy

The Italian Labour Force Survey (LFS) is a rotating panel survey carried out by means of different computer-assisted interviewing (CAI) techniques. CAPI is used for the first interview, while CATI is used for subsequent interviews. More precisely, the CATI interviews use dependent interviewing, in which answers from the previous wave are used in formulating the questions in order to remind the respondent of previous responses.

Although the main objective of the survey is to produce quarterly and annual estimates, longitudinal analysis is also an important aim. The longitudinal dimension allows a deeper analysis of the structure of the labour market in terms of its dynamic components. Nevertheless, the time dimension makes it more difficult to develop a strategy for dealing with item non-response as well as unit non-response. This paper is mainly devoted to the longitudinal imputation method applied to correct item non-response on single questions in the LFS. The two main approaches for editing are introduced and the imputation procedure is presented. Special emphasis is given to the issues regarding data inconsistencies and difficulties that arise when longitudinal data are used. The paper concludes with a discussion of the main outcomes and some issues concerning data editing in CAI surveys. A short description of the strategy for the record linkage is also reported. For brevity, all the results presented in the paper refer to the period 2007(1)-2008(1).

(E) Imputation of Longitudinal Registers: The Households Case

D.J. (Jan) van der Laan and Léander Kuijvenhoven, Statistics Netherlands, The Netherlands

Registers are potentially a rich source for longitudinal analyses. However, many of the editing and imputation strategies focus on cross-sectional analyses, creating longitudinal inconsistencies at a micro-level. In order to make the data suitable for longitudinal analyses, it is necessary to take into account information from other time periods in the derivation and imputation of variables.

At Statistics Netherlands, households have been derived from municipal population registers since 2000. Using the (family) relations present in these registers, the household composition can be uniquely determined for approximately 93% of the addresses. The remaining 7% of the addresses are imputed using a stochastic imputation model that takes background properties of the persons living at the addresses into account. However, as the derivation and imputation procedure only takes information from the current time period into account, the households are not suitable for longitudinal analyses. We will present modifications to the present methodology that will result in households suitable for longitudinal analyses. Special care is taken to ensure that both cross-sectional estimates and change estimates are accurate.

A complication is that statistical offices are periodically supplied with new data, while at the same time they have to publish periodically. These new data might contain better information about previously derived data. To obtain longitudinally consistent data it is generally necessary to correct previously derived data. Plans on how to deal with this at Statistics Netherlands are presented.

Session 20 – Application: Longitudinal Analysis of Health and Business Data

(E) The Children of Older First-time Mothers in Canada: a Longitudinal Analysis of their Health and Development

Tracey Bushnik and Rochelle Garner, Statistics Canada

Using a national sample of first-born children from the National Longitudinal Survey of Children and Youth (NLSCY), this study examined the relationship between late childbearing and three facets of child development: (i) physical health and development, (ii) behaviour, and (iii) cognitive development. Late childbearing is defined as having a first child at age 35 or older, and children born to mothers aged 25 to 29 comprised the reference group. Children were selected at ages 0 to 1 for inclusion in the study, and followed up at ages 2 to 3 and 4 to 5. Children’s outcomes were measured between the ages of 0 and 5. Due to the longitudinal nature of the data, several methodological issues were identified, including the need for pooling of the data, how to choose between the various survey weights, and the need to assess potential non-response bias. The presentation will discuss both the findings of the study and how the various methodological issues were addressed.

(E) Life Course BMI and Height Trajectories: A Comparison of Two British Birth Cohorts

Leah Li, Rebecca Hardy, Diana Kuh and Chris Power, University College London, UK

Obesity continues to increase worldwide. The development of BMI trajectories may have changed across recent generations experiencing the obesity epidemic at different life stages. Other components of physical development may also have changed. We compared child-to-adult growth trajectories across two British birth cohorts, born in 1946 (n=5,300) and 1958 (n=17,000), followed up to ages 53y and 45y, respectively.

Individuals born in 1958 were not heavier at birth than the 1946 cohort, but were taller in early childhood by 1cm, grew faster and were 3-4cm taller by adolescence. The 1958 cohort achieved adult height earlier and were taller by 1cm, an increase entirely due to their longer leg length. We fitted linear spline models to repeated BMI measures (at 7, 11, 15, 20, 26, 36, 43, and 53y for the 1946 cohort; at 7, 11, 16, 23, 33, and 45y for the 1958 cohort), corresponding to distinct BMI trajectories for “childhood” and “adulthood”. BMI trajectories diverged from early adulthood, with a faster growth rate in the 1958 cohort than in the 1946 cohort, although mean BMI at 7y and the rate of childhood gain had not increased between the two cohorts. By mid-adulthood the 1958 cohort had a greater BMI (1-2 kg/m²), larger waist (7-8cm) and hip (5cm) circumferences, and a higher prevalence of obesity (24% vs 12%). These changes over a relatively short period of 12 years suggest the likelihood of opposing trends of influences on later disease risk in these populations.

(F) Impact of training on the productivity of Canadian businesses in a longitudinal context: Comparison of an additive model and an interactive model

Amélie Bernier and Jean-Michel Cousineau, Université de Montréal, Canada

This paper examines the effects of training on the productivity of Canadian businesses using employer data from the Workplace and Employee Survey (WES) between 1999 and 2005. Of all the studies consulted that dealt with the possible impact of training on productivity, a growing number discussed the longitudinal nature of the data, but few estimated the delayed effects of training. In our research, we attempted to exploit the advantages of the longitudinal data by estimating two types of models: an additive model and an interactive model. The former estimates the impact that the delayed effects of training investments might have on business productivity by adding them to investments in physical capital. The latter model focuses instead on the interaction between the two types of investment. In both cases, our results show that training investments made with a three-year lag have significant positive effects on productivity. In contrast to the additive model, however, the interaction between capital investment and training investment reflects the variation between firms in utilizing the factors of production and thus makes it possible to differentiate between companies by return on investment. On the other hand, our results do not indicate that one model is superior to the other. These findings may have important implications in the evaluation process or in companies’ decisions concerning training investment.

(A) Workers’ mobility: A Review and Some New Results from the Workplace and Employee Survey

Yves Decady, Statistics Canada

In this paper, longitudinal data from the Workplace and Employee Survey (WES) are used to derive a wide array of labour mobility indicators. Data from the three WES panels will be used to first describe the incidence of job mobility, occupational mobility and the direction of job and occupational mobility. Then, the paper will focus on explaining job and occupational mobility using a handful of models designed for the analysis of longitudinal data.

Using the International Labour Office (ILO) approach in the Key Indicators of the Labour Market (KILM) publication, the magnitudes of labour market flows will be explored as the first research question. Inflow into paid employment and outflow from paid employment will be examined.

Along with job mobility, job immobility or job stability will be studied. From a review of the literature, it appears that job immobility/stability may widen earnings inequalities when workers are trapped in low-paid jobs. The literature also indicates that upward occupational mobility of lower-wage workers may mitigate cross-sectional wage inequalities among workers. Hence, the second research question explored in this paper is whether there is a reward or return associated with job and occupational mobility.

Preliminary research results from the WES employee panel data show that, following a decrease from the first to the second panel, occupational mobility increased moderately in the third employee panel. Therefore, the responsiveness of job and occupational mobility to shocks imposed by the economic conditions is our third research question.

Session 21 – Longitudinal Data Analysis Techniques

(E) On the Use of Exploratory and Confirmatory Longitudinal Data Analysis Techniques

Marcel de Toledo Vieira, Ronaldo Rocha Bastos and Henrique Steinherz Hippert, Federal University of Juiz de Fora, Brazil
Augusto Carvalho Souza, Federal University of Minas Gerais, Brazil

This paper discusses the use of various approaches for analysing longitudinal survey data, including alternative exploratory data analysis techniques and different regression modelling strategies, to address longitudinal analyses of the British Household Panel Survey data on attitudes to gender roles and their relation to demographic and economic variables. The general question in this article is: would one draw different conclusions and inferences depending on the approach one chooses? Both exploratory and confirmatory longitudinal data analyses have been performed, by the adoption of correspondence analysis (CA) and regression modelling techniques for the analysis of adaptive relationships. Results from the CA have generally been confirmed by the regression models' parameter estimates, which have often agreed in sign with the relationships displayed in the CA maps. Empirical evidence has shown that the selection of the analysis approach and modelling strategy is an important issue in the longitudinal data analysis context. We therefore recommend that the choice be made taking into consideration the aims of the longitudinal analysis.

(E) Goodness-of-Fit Measures for Models Based on Generalized Estimating Equations Approach

Punam Pahwa, University of Saskatchewan, Canada

An important part of any model selection process is the assessment of how well the model fits the data (goodness-of-fit). In the last two decades, many analytical methods have been developed for longitudinal data analysis; however, there is still a lack of standard, reasonable goodness-of-fit measures for such models. For longitudinal data, we need goodness-of-fit statistics for selecting not only a correct response function but also an appropriate within-subject correlation/covariance structure. Goodness-of-fit statistics based on likelihood methods, such as the likelihood ratio test and Akaike's information criterion, (i) require repeated fittings of the data to a family of nested models, (ii) require complete specification of the likelihood function, and (iii) cannot be used to assess the adequacy of models fitted using the generalized estimating equations (GEE) approach. Vonesh et al. developed three goodness-of-fit statistics: (i) rc, a concordance coefficient to measure concordance between fitted and observed responses; (ii) r(ω̂), a measure of concordance between assumed and true covariance structures; and (iii) a statistic to test the equality between assumed and true covariance structures (indirectly, by testing the equality between the ‘sandwich’ and assumed covariance structures). These three measures are based exclusively on the model at hand. I propose to utilize these measures to assess the goodness-of-fit of models fitted using the GEE approach to analyze longitudinal data collected (based on a non-survey design) on the respiratory health of Canadian grain elevator workers. An attempt will be made to modify these measures to assess the adequacy of models for longitudinal complex-survey data.
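To make the idea of a concordance coefficient between fitted and observed responses concrete, the following Python sketch computes a concordance correlation of the usual Lin type. It is an assumption that this matches the exact form of the rc statistic of Vonesh et al.; the abstract does not give the formula, and the function name and data are hypothetical.

```python
def concordance_coefficient(observed, fitted):
    """Concordance correlation between observed and fitted responses:

        1 - sum((y - f)^2) /
            (sum((y - ybar)^2) + sum((f - fbar)^2) + n * (ybar - fbar)^2)

    It equals 1 for perfect agreement and falls as the fitted values
    depart from the 45-degree line through the observed values.
    """
    n = len(observed)
    if n == 0 or n != len(fitted):
        raise ValueError("observed and fitted must be non-empty, same length")
    y_bar = sum(observed) / n
    f_bar = sum(fitted) / n
    squared_diff = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    denominator = (sum((y - y_bar) ** 2 for y in observed)
                   + sum((f - f_bar) ** 2 for f in fitted)
                   + n * (y_bar - f_bar) ** 2)
    if denominator == 0:
        raise ValueError("coefficient undefined for constant, equal series")
    return 1.0 - squared_diff / denominator

# Hypothetical responses and fitted values from some marginal model.
rc_perfect = concordance_coefficient([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
rc_shifted = concordance_coefficient([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Unlike an ordinary correlation, the coefficient penalizes a constant shift: the second call compares perfectly correlated series that differ by one unit and returns a value below 1.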

(E) Fitting general linear model for longitudinal survey data under informative sampling

Abdulhakeem A.H. Eideh, Al-Quds University, Palestine

Data collected by sample surveys, and in particular by longitudinal surveys, are used extensively to make inferences on assumed population models. Often, survey design features (clustering, stratification, unequal probability selection, etc.) are ignored and the longitudinal sample data are then analyzed using classical methods based on simple random sampling. This approach can, however, lead to erroneous inference because of sample selection bias implied by informative sampling. To overcome the difficulties associated with the use of classical inference procedures for cross sectional survey data, Pfeffermann, Krieger and Rinott (1998) proposed the use of the sample distribution induced by the assumed population models, under informative sampling, and developed expressions for its calculation. Similarly, Eideh and Nathan (2006) fitted time series models for longitudinal survey data under informative sampling.

In this paper we fit the general linear model for longitudinal survey data under informative sampling using different covariance structures: the exponential correlation model and the uniform correlation model (see Diggle, Liang and Zeger, 1994), and the random effect model (see Skinner and Holmes, 2003).
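For readers unfamiliar with the first two covariance structures, the Python sketch below builds the within-subject covariance matrix under each, using their common textbook forms: under the uniform correlation model every pair of distinct occasions has covariance sigma2 * rho, and under the exponential correlation model the covariance decays as sigma2 * exp(-phi * |t_j - t_k|). The parameter names and values are hypothetical, and the parameterization used in the paper may differ.

```python
import math

def uniform_covariance(n_times, sigma2, rho):
    """Uniform (exchangeable) structure: variance sigma2 on the diagonal,
    covariance sigma2 * rho between any two distinct occasions."""
    return [[sigma2 * (1.0 if j == k else rho) for k in range(n_times)]
            for j in range(n_times)]

def exponential_covariance(times, sigma2, phi):
    """Exponential structure: covariance decays with the time separation,
    cov(j, k) = sigma2 * exp(-phi * |t_j - t_k|)."""
    return [[sigma2 * math.exp(-phi * abs(tj - tk)) for tk in times]
            for tj in times]

# Hypothetical example: three measurement occasions per subject.
v_uniform = uniform_covariance(3, sigma2=2.0, rho=0.5)
v_exponential = exponential_covariance([0.0, 1.0, 3.0], sigma2=1.0, phi=0.5)
```

The uniform structure ignores how far apart the occasions are, whereas the exponential structure lets unequally spaced waves have different correlations.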

Session 22 – Adjusting for Non-Response and Attrition

(E) Sample Loss from Cohort Studies: Patterns, Characteristics and Adjustments

Ian Plewis, University of Manchester, UK
Lisa Calderwood, Sosthenes Ketende and Rebecca Taylor, Institute of Education, UK

Research on sample loss from longitudinal studies is motivated by two particular concerns: (1) reallocating fieldwork resources to potentially ‘frail’ respondents and (2) generating statistical adjustments for missing data. Information about the characteristics of those lost from these kinds of studies over time has increased in recent years, and there is a recognition that a more complete understanding of sample loss comes from separating wave non-respondents from attrition cases, and by separating the reasons for loss – not located, not contacted and refusal – in the analysis. It is also clear that, although cases lost from a study are usually systematically different from those that remain, the ability to discriminate between different response patterns tends to be low. This paper is based on a project – Predicting and Preventing Non-response in Cohort Studies – funded by the UK Economic and Social Research Council as part of its Survey Design and Measurement Initiative. It presents evidence on the patterns and characteristics of different kinds of sample loss from the first four waves of the Millennium Cohort Study – the fourth in the series of UK birth cohort studies. It assesses the strength of the relationships found in terms of predicting sample loss, and then addresses the question of how this evidence might be efficiently used for statistical adjustment either by weighting or in multiple imputation.

(F) Analysis of attrition in the Longitudinal Study of Child Development in Quebec (ÉLDEQ) from 1998 to 2008

Catherine Fontaine and Robert Courtemanche, Institut de la statistique du Québec, Canada

Launched in 1998, the ÉLDEQ was designed to identify early-childhood factors that affect young Quebecers’ social adjustment and academic performance. The sample was intended to be representative of children born in Quebec in 1997/1998 (single births). Thirty-four percent of those who responded in the initial collection wave (1998) did not respond in the tenth wave (2008).

There is currently a willingness to continue the survey for the period during which the children would be in secondary school (2011-2015). One question that this raises is whether the current sample size (1,402 respondents in 2008), which has been affected by attrition, is large enough to support high-quality analyses at the end of the additional years of collection. It was therefore decided to determine how the loss of units since the ÉLDEQ’s inception affects the quality of its estimates and to suggest options for the future.

We describe the various steps in the attrition analysis: the selection of the characteristics to be studied, the methods used and the conclusions produced by the analysis. We will then present an alternative weighting method, which is used to mitigate some of the effects of attrition. Lastly, we will discuss various other proposals for future ÉLDEQ collection waves.

(E) Modelling non-response for a longitudinal survey using paradata: Application to the Survey of Labour and Income Dynamics

Beatrice Baribeau and Wisner Jocelyn, Statistics Canada

We have accumulated a large amount of data about the collection process (known as paradata) in a longitudinal survey in which some non-respondents from one wave are included in the next wave. We have also discovered the limitations of the data used so far for non-response adjustment: these data are collected for all individuals in the sample at the beginning of the panel, and their relevance diminishes as the panel ages. We therefore feel it is time to re-evaluate the current non-response adjustment methodology. With that in mind, we undertook a comparative study of the current methodology used in the Survey of Labour and Income Dynamics and a new methodology based on paradata for the same survey.

Our approach involves deriving the best set of adjustment variables from the paradata, implementing the new non-response methodology based on paradata in a simulated production environment and using various statistical and graphic measures to compare the two methodologies. We will report and comment on the study’s principal findings and the solutions employed in the production simulation.
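One common way to use adjustment variables of this kind is a weighting-class adjustment, in which respondents' weights are inflated within classes so that they also represent the non-respondents of the same class. The Python sketch below illustrates that generic technique only; the abstract does not say which adjustment method the survey uses, and the class labels (based on a hypothetical paradata item, the number of call attempts) and the weights are invented for illustration.

```python
from collections import defaultdict

def weighting_class_adjust(units):
    """Return non-response-adjusted weights, in the order of `units`.

    Each unit is a dict with keys 'weight', 'cls' and 'responded'.
    Within each class, respondent weights are multiplied by
    (sum of all weights) / (sum of respondent weights); non-respondents
    get weight 0. A class with no respondents would lose its weight and
    should be collapsed with another class before calling this function.
    """
    total = defaultdict(float)
    responding = defaultdict(float)
    for u in units:
        total[u["cls"]] += u["weight"]
        if u["responded"]:
            responding[u["cls"]] += u["weight"]
    return [u["weight"] * total[u["cls"]] / responding[u["cls"]]
            if u["responded"] else 0.0
            for u in units]

# Hypothetical sample: classes formed from the number of call attempts.
sample = [
    {"weight": 10.0, "cls": "few_calls", "responded": True},
    {"weight": 10.0, "cls": "few_calls", "responded": True},
    {"weight": 10.0, "cls": "few_calls", "responded": False},
    {"weight": 20.0, "cls": "many_calls", "responded": True},
    {"weight": 20.0, "cls": "many_calls", "responded": False},
]
adjusted = weighting_class_adjust(sample)
```

The adjustment preserves the total weight of each class, so estimated population totals are unchanged in size while the respondents' contribution is redistributed.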