Statistics Canada

Research Projects

Research, development and consultation in SRID
MITACS
Imputation and robust estimation
Sampling and estimation
Small area estimation
Data analysis research (DAR)
Data collection
Disclosure control methods

For more information on the program as a whole, contact:
Mike Hidiroglou (613-951-0251, mike.hidiroglou@statcan.gc.ca).

Research, development and consultation in SRID

The Statistical Research and Innovation Division (SRID) was created within the Methodology Branch on June 21, 2006. SRID is responsible for researching, developing, promoting, monitoring, and guiding the adoption of new and innovative techniques in statistical methodology in support of Statistics Canada’s statistical programs. Its mandate also includes the provision of technical leadership, advice and guidance to employees elsewhere in the Program. This assistance takes the form of advice on methodological problems that arise in existing projects or during the development of new projects. SRID is also jointly involved with other employees of the Program via research projects sponsored by the Methodology Research and Development Program on specific topics, e.g., estimation methods, imputation methods, small area estimation, use of administrative data, and measurement of nonsampling errors.

The Statistical Research and Innovation Division (SRID) was involved in many research, development and consultation projects in 2009-2010. In particular, SRID staff made significant contributions in small area estimation, imputation and robust estimation, and synthetic data generation. This progress is reflected in the reviews of the research topics reported below.

In addition to participating in the research activities of the Methodology Research and Development Program (MRDP) as research project managers and active researchers, SRID staff was involved in the following activities:

  • Staff advised members of the other divisions on technical matters both on an ad hoc basis and on a formal basis. Examples of ad hoc advice include variance estimation for complex surveys, small area estimation, and variance estimation for imputed data.
  • Research project managers within SRID provided ideas to team members of projects sponsored by the Methodology block fund on how to advance their research, and regularly reviewed their work.
  • SRID consulted with members of the Methods and Standards Committee, as well as with a number of other Statistics Canada managers in determining the priorities for the research program.
  • Several members participated in two courses given at the 57th Session of the International Statistical Institute Meetings, held in Durban, South Africa: M. Hidiroglou and Wesley Yung on Business Surveys; and John Kovar, Eric Rancourt and Jean-François Beaumont on editing and imputation.
  • A special invited presentation was given at the Statistical Society of Canada (SSC) meeting in Vancouver by Jean-François Beaumont. The presentation was entitled “A new approach to weighting and inference for samples drawn from a finite population” and was based on the paper by Beaumont (2008).
  • SRID continued to actively support the Survey Methodology Journal. Mike Hidiroglou became editor of the Survey Methodology Journal in January 2010, and three members of SRID are assistant editors.
  • The final report and recommendations of the Committee on Quality Measures were completed and released. The most important sections are the summary at the beginning and the recommendations in section 7. The report can be accessed at: http://method/bibliostat/research/techcom/dataintegrationproject/documents/FINAL_CMQ-CQM_FinalReport_E.pdf
  • Staff continued their activities on various Branch committees, such as the Branch Learning and Development Committee, the Strategic Thinking Group and the Methodology Branch Informatics Committee. In particular, they participated actively in finding and discussing the papers of the month.
  • SRID (Jean-François Beaumont and Cynthia Bocci) organized the most recent interchange with the U.S. Census Bureau, held in Ottawa in April 2009.
  • SRID (Harold Mantel) hosted a MITACS student.
  • Mike Hidiroglou was elected Council Member of the International Association of Survey Statisticians and became Editor of Survey Methodology.
  • Staff refereed several papers in statistical journals.
  • Staff authored or co-authored 21 papers, many of which were presented at conferences (SSC, JSM, ISI) or published in statistical journals and volumes such as the Canadian Journal of Statistics, Survey Methodology, Handbook of Statistics, Statistica Sinica, Journal of Multivariate Analysis, and the Journal of the Royal Statistical Society.

For further information, contact:
Mike Hidiroglou (613-951-0251, mike.hidiroglou@statcan.gc.ca).

Reference

Beaumont, J.-F. (2008). A new approach to weighting and inference in sample surveys. Biometrika, vol. 95, no 3, 539-553.

MITACS

Statistics Canada was involved in the National Program on Complex Data Structures by participating in the MITACS internship program. This program pairs academic statistics researchers and their doctoral students with mentors at Statistics Canada so that the students can conduct research on unsolved statistical problems that have been identified in the area of complex survey data. As has been the case for the past few years, Statistics Canada agreed to take up to three students, each for a period of approximately four months. In the spring of 2009, advertisements were circulated nationally for these placements. One student, Haocheng Li from the University of Waterloo, was situated in SSMD from September to December 2009, researching the topic of analysis methods for longitudinal health survey data with missing observations. Two other students arrived in January 2010: Chen Xu from the University of British Columbia, researching penalized likelihood methods for variable selection in high dimensional regression analysis, and Dongo Jiongo Valery from the University of Montreal, researching robust inference in the presence of influential units in surveys. All three students will be presenting contributed papers at the SSC in May 2010, as well as giving presentations and writing reports at Statistics Canada.

The three MITACS students, Zhijian Chen, Dagmar Mariaca-Hajducek, and Yan Liu, from the previous fiscal year, 2008-2009, presented papers on their work at a topic-contributed paper session at the 2009 SSC organized by Milorad Kovačević and Changbao Wu.

  1. Zhijian Chen (Chen and Mantel, 2009) discussed estimation of logistic regression coefficients when measurement error or misclassification of an ordinal covariate is dependent on other variables. An expected score method was proposed that uses a parametric assumption for the measurement error process. The method was illustrated with data from CCHS cycle 3.1, where the association of heart disease with BMI categories is of interest. The presentation was written up and appears in the Survey Methods Section Proceedings for 2009.
  2. Dagmar Mariaca-Hajducek (Mariaca-Hajducek and Lawless, 2009) discussed the fitting of Cox PH models to jobless spell durations for individuals from a six-year panel of the Survey of Labour and Income Dynamics. Within-individual and within-cluster association in spell durations, dependent loss to follow-up and a non-ignorable sampling design were considered.
  3. Yan Liu (Liu, Kovačević, Saidi and Zumbo, 2009) demonstrated and compared two analytical methods that adopt the SEM approach for analyzing data from a cohort-sequential design, illustrated with longitudinal NLSCY data. The talk also demonstrated how to apply sampling weights in this kind of modeling.

For further information, contact:
Georgia Roberts (613 951-1471, georgia.roberts@statcan.gc.ca).

Reference

Mariaca-Hajducek, D., and Lawless, J. (2009). Fitting Cox models to jobless spell durations in SLID. Presentation in a Topic Contributed Paper Session at the 2009 SSC Annual Meeting.

Imputation and robust estimation

SEVANI

SEVANI is the System for the Estimation of Variance due to Non-response and Imputation. In the current year, we have made progress both in the development of the system and in the associated research. Regarding development, we have accomplished the following:

  1. Completed and released version 2.1, which handles variance estimation when the missing y-value of a given unit is imputed by the product of a known adjustment variable x for that unit and an imputed value (using donor, regression or auxiliary value imputation) for the unknown ratio y/x (i.e., y is not imputed directly; the ratio y/x is first imputed before deriving an imputed value for y);
  2. Developed and released version 2.2, which is more modular than version 2.1 and, more importantly, much faster (sometimes, version 2.2 is 100 times faster than version 2.1); and
  3. Developed version 2.3, which handles variance estimation when hierarchical imputation classes have been used (more testing is still needed before releasing this version).

Note that the above development work has been undertaken following users’ requests. We have also continued to offer support to our users on the methodology and use of the system.
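
As a minimal illustration of the ratio-based imputation handled by version 2.1 (item 1 above), the R sketch below imputes the ratio y/x and then derives the imputed y; the data and the ratio-of-means donor value are hypothetical and the code is not part of SEVANI.

  # Sketch only: impute the ratio y/x, then derive the imputed y (hypothetical data).
  set.seed(1)
  n <- 10
  x <- runif(n, 50, 150)              # known adjustment variable for all units
  y <- 2 * x + rnorm(n, sd = 5)       # study variable
  y[c(3, 7)] <- NA                    # nonrespondents

  r_hat <- mean(y / x, na.rm = TRUE)  # imputed value for the unknown ratio y/x
                                      # (could also come from donor or regression imputation)
  y_imp <- ifelse(is.na(y), x * r_hat, y)   # y is derived as x times the imputed ratio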

On the research side, we have completed papers on the following:

  1. The methodology of SEVANI for composite imputation (Beaumont and Bissonnette, 2010), which has been submitted for publication in a refereed journal; and
  2. Variance estimation under auxiliary value imputation (Beaumont, Haziza and Bocci, 2010), which has just been accepted for publication in Statistica Sinica.

SAS tool for detecting outliers

We have continued to develop an outlier detection system programmed in SAS. The primary objective of the system is to help methodologists conduct exploratory analysis of their survey data. It can be used to visualize the data and compare various methods. A number of methods are already programmed in the system: the Hidiroglou-Berthelot method, the sigma deviation method, the interquartile method, a number of classic linear regression methods (with or without y intercept), including Cook’s distance, and a multivariate method developed for the monthly Wholesale and Retail Trade Survey.

During the period, we finalized the addition of a robust multivariate method based on the Mahalanobis distance, for which the methodology is described in Patak (1990) and Franklin, Thomas and Brodeur (2000). We are also in the process of implementing an outlier detection method that uses the M‑estimation technique. We have also added a graphic function that can be used to show outlier identifiers for the Hidiroglou-Berthelot and sigma deviation methods. In addition to the Survey of Employment, Payrolls and Hours and the Unified Enterprise Survey, the Industrial Water Survey is now among the system’s users. There are also plans to use the system on the T1 data (a presentation has been given on this subject).
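
As a rough illustration of the robust multivariate approach (not the production SAS code), the following R sketch combines a robust covariance estimate with the Mahalanobis distance; the MCD estimator and the chi-square cutoff are assumptions made for the example.

  # Sketch: flag multivariate outliers via robust Mahalanobis distances (illustrative only).
  library(MASS)                                   # for cov.rob()
  set.seed(1)
  dat <- matrix(rnorm(200), ncol = 2)
  dat[1:3, ] <- dat[1:3, ] + 6                    # plant a few outliers

  rob  <- cov.rob(dat, method = "mcd")            # robust location and scatter
  d2   <- mahalanobis(dat, center = rob$center, cov = rob$cov)
  flag <- d2 > qchisq(0.99, df = ncol(dat))       # assumed chi-square cutoff
  which(flag)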

Robust estimation using conditional bias

We have developed a unified approach to robust estimation in finite population sampling. We use the conditional bias of a unit as a measure of influence because it is closely related to the influence function used in classical statistics. Unlike the influence function, the concept of conditional bias extends naturally to the design-based framework. Our robust estimators are obtained by reducing the influence of the largest sample units on the sampling error. Our general estimator reduces to the estimator developed by Kokic and Bell (1994) when a stratified simple random sampling design is used. We have written a paper (Beaumont, Haziza and Ruiz-Gazen, 2009), which was presented at the International Statistical Institute Conference in Durban in 2009. We have started refining this paper and are currently conducting a simulation study; once the study is complete, the paper will be submitted to a refereed journal.
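
A minimal sketch of the idea under simple random sampling without replacement is shown below on simulated data; the conditional-bias expression and the min/max correction follow the general approach described above, but the code is illustrative only.

  # Sketch: robust expansion (HT) total under SRSWOR using estimated conditional biases.
  set.seed(1)
  N <- 1000; n <- 100
  y_pop <- rlnorm(N, meanlog = 4, sdlog = 1)     # skewed population (illustrative)
  s     <- sample(N, n)
  y     <- y_pop[s]

  t_ht  <- (N / n) * sum(y)                      # expansion (HT) estimator

  # Estimated conditional bias of the HT estimator attached to each sampled unit
  B_i   <- (N / (N - 1)) * (N / n - 1) * (y - mean(y))

  # Robust estimator: shift the HT estimate by minus the midpoint of the extreme conditional biases
  t_rob <- t_ht - (min(B_i) + max(B_i)) / 2
  c(HT = t_ht, robust = t_rob)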

For further information, contact:
Jean-François Beaumont (613-951-1479, jean-francois.beaumont@statcan.gc.ca).

References

Franklin, S., Thomas, T. and Brodeur, M. (2000). Robust multivariate outlier detection using Mahalanobis’ distance and modified Stahel-Donoho estimators. Proceedings of the Second International Conference on Establishment Surveys, American Statistical Association, 697-706.

Kokic, P.N., and Bell, P.A. (1994). Optimal Winsorizing cutoffs for a stratified finite population estimator. Journal of Official Statistics, vol.10, no 4, 419-435.

Patak, Z. (1990). Robust principal component analysis via projection pursuit. Master’s thesis, University of British Columbia, Canada.

Sampling and estimation

Sample Coordination

A study comparing the overlap between the Ontario Crops Survey (CS) and Livestock Survey (LS) samples produced by two different sample coordination methodologies was completed. The two methodologies are as follows:

1. The current sample coordination methodology implemented after the 2006 redesign.
2. The sequential SRSWOR methodology as described by Ohlsson (1995).

The current methodology uses collocated sampling and rebalances the permanent random numbers (PRNs) several times. The sequential SRSWOR methodology, proposed by us as an alternative, is simpler, yields constant stratum sample sizes and does not require rebalancing of PRNs. For the current method, the total overlap between the Ontario CS and LS samples is 2,619 farms, while it is 1,885 farms for the proposed method. As the objective is to minimize the CS and LS overlap, the sequential SRSWOR method performs better. Even though we had intended to, we did not use our method of optimal coordination to minimize the overlap between the Ontario CS and LS samples because it would have been quite complex due to the current design requirements (e.g., annual rotation, replicates used by CS). The study and its results are documented in Mach (2009).
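
The basic mechanics of sequential SRSWOR with permanent random numbers can be sketched as follows for a single stratum; the convention used here for negatively coordinating the two surveys (starting from opposite ends of the PRN scale) is only one simple choice, and the production implementation (collocated sampling, PRN rebalancing, rotation) is more involved.

  # Sketch: sequential SRSWOR with permanent random numbers (PRNs) in one stratum.
  set.seed(1)
  frame <- data.frame(id = 1:500, prn = runif(500))    # PRN assigned once per unit

  n_cs <- 60; n_ls <- 80                               # illustrative sample sizes

  # Survey 1 (e.g., CS): take the n_cs units with the smallest PRNs
  s_cs <- frame$id[order(frame$prn)][1:n_cs]

  # Survey 2 (e.g., LS): for negative coordination, start from the other end of the PRN scale
  s_ls <- frame$id[order(frame$prn, decreasing = TRUE)][1:n_ls]

  length(intersect(s_cs, s_ls))                        # overlap between the two samples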

We have been working on coordination methodology that achieves the first-order inclusion probabilities for each survey but not necessarily the higher order probabilities; thus, it solves a less constrained problem than the optimal sample coordination method. We are searching for a method that has good conditional properties and, for positive coordination, yields efficient estimators of change between the two survey occasions. This work will be presented at the 2010 Joint Statistical Meetings.

Generalization of the Lavallée-Hidiroglou algorithm: Implementing new developments

The Lavallée-Hidiroglou (LH) algorithm, which stratifies and allocates samples for business surveys, was generalized. Some of the new functionalities include the following:

  • The ability to model the response variable Y on the stratification variable X in the case where it is known that the two quantities are not equal (Rivest, 2002);
  • An option in the model to account for the unknown death of units (where X > 0 yet the true Y may be 0; Baillargeon, Rivest and Ferland, 2007);
  • The possibility of creating a Take-None stratum and the option to compensate for anticipated non-response at the stratification step (Baillargeon and Rivest, 2009).

Furthermore, an alternative to the LH algorithm has been proposed (Kozak, 2004) that uses randomized search methods to optimize the sample size, rather than the iterative deterministic algorithm at the heart of the LH algorithm. All of these developments (including Kozak’s algorithm) have been incorporated in a software package written in R. While LH-related functionalities developed prior to 2008 were tested on the Monthly Wholesale and Retail Trade Survey (MWRTS), it was of interest to consider the new package including Kozak’s algorithm and its ability to optimally stratify and allocate samples for the Unified Enterprise Survey (UES) as well as potentially other business surveys.
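
For readers who wish to experiment, a call of the following form reproduces an LH-type stratification with the Kozak search in R; the data are simulated, and the argument and component names reflect our reading of the package documentation and should be checked against the installed version.

  # Sketch: stratify a skewed size variable with the R "stratification" package.
  # Argument and component names follow the package documentation as we recall it; please verify.
  library(stratification)
  set.seed(1)
  x <- rlnorm(2000, meanlog = 5, sdlog = 1.2)     # simulated size variable

  res <- strata.LH(x = x, CV = 0.05, Ls = 4,      # target CV, 4 strata
                   alloc = list(q1 = 0.5, q2 = 0, q3 = 0.5),   # Neyman-type allocation
                   algo = "Kozak")                # randomized search instead of the Sethi iterations
  res$bh    # stratum boundaries
  res$nh    # stratum sample sizes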

A program was written in SAS to interface with the R program and use the algorithms in the software package to successfully stratify and allocate the sample from the October 2009 G-SUF. The results of the basic implementation of the R version of the LH algorithm were generally comparable to the results of the RY2009 sampling production run, aside from the software package’s inability to consider the impact of pre-specified “must-take” units that were to be sampled with certainty. The process was then adjusted to take these pre-specified units into account when computing CVs. The software developers are considering including this functionality in future versions of the package.

The Kozak algorithm was compared to the standard LH algorithm. Given the same fixed CV and allocation scheme, as well as parameters that gave the Kozak algorithm the freedom to find the optimal solution, the stratifications proposed by the two algorithms resulted in roughly similar sample sizes.

Studies were done to compare compensating for non-response during the stratification and allocation steps with compensating a posteriori (the approach taken in production). Future work includes an assessment of the degree to which the Take-None stratum creation mechanism in the R package is appropriate for UES sampling. It should be noted that the R package penalizes the inclusion of units in a Take-None stratum in a way that is not completely in line with the reasoning behind the creation of the UES Take-None strata.

Generalized bootstrap for Poisson sampling in Prices surveys

We have studied the generalized bootstrap technique for variance estimation and the construction of confidence intervals under general sampling designs. In the context of this methodology, bootstrap weights are defined so that the first two (or more) design moments of the sampling error are tracked by the corresponding bootstrap moments. Most bootstrap methods in the literature can be viewed as special cases. We have investigated issues such as the choice of the distribution used to generate bootstrap weights, the choice of the number of bootstrap replicates and the potential occurrence of negative bootstrap weights. We have also developed two ways of bootstrapping the generalized regression estimator of a population total. We have studied in greater depth the case of Poisson sampling, which is often used to select samples in Price Index surveys. For Poisson sampling, we have considered a pseudo-population approach and shown that the resulting bootstrap weights capture the first three design moments of the sampling error. We have confirmed the theory using a simulation study as well as data from the Commercial Rent Index. Our methodology was presented at the International Statistical Institute Conference in Durban in 2009 (Patak and Beaumont, 2009). A more detailed version of our paper has been submitted to a refereed journal (Beaumont and Patak, 2010). Our paper will also be presented at the next Advisory Committee on Statistical Methods in April 2010.
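
To fix ideas, the sketch below generates bootstrap weights for Poisson sampling so that the first two design moments of the Horvitz-Thompson sampling error are tracked; it is only an illustration and omits the pseudo-population step used to match the third moment.

  # Sketch: generalized bootstrap weights for Poisson sampling (first two moments only).
  set.seed(1)
  N  <- 5000
  x  <- rgamma(N, shape = 2, scale = 50)           # size measure
  pi <- pmin(1, 300 * x / sum(x))                  # Poisson inclusion probabilities
  s  <- which(runif(N) < pi)                       # Poisson sample
  y  <- 3 * x[s] + rnorm(length(s), sd = 20)       # study variable for sampled units
  w  <- 1 / pi[s]                                  # design weights

  B <- 1000
  t_boot <- replicate(B, {
    a  <- sqrt(1 - pi[s]) * rnorm(length(s))       # E(a)=0, Var(a)=1-pi: matches the HT variance
    ws <- w * (1 + a)                              # bootstrap weights (may occasionally be negative)
    sum(ws * y)
  })
  c(boot_var = var(t_boot), ht_var_est = sum((1 - pi[s]) * (y / pi[s])^2))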

SAS tool for generating bootstrap weights under a stratified three-stage design with simple random sampling without replacement at each stage

Generalized bootstrap (for example, Beaumont and Patak, 2010) is a practical method for generating bootstrap weights for any sampling design even when sampling fractions are not negligible. Specifications have been written (Beaumont, 2010) for generalized bootstrap for a three-stage design with simple random sampling without replacement at each stage. Joël Bissonnette has used these specifications to create a SAS macro that implements generalized bootstrap for this sampling plan. This SAS macro was written for use in a survey on Aboriginal peoples in the spring of 2010. The survey uses a three-stage stratified design with non-negligible sampling fractions at the first stage. Accordingly, the Rao-Wu standard bootstrap, which is based on an assumption that sampling fractions at the first stage are small, is not valid.

Mean bootstrap

Mean bootstrap is a resampling method that is similar to the Rao-Wu standard bootstrap. In this method, R × B bootstrap replicates are first generated using the Rao-Wu method. Each of the B mean bootstrap weights is then calculated by taking the mean of R Rao-Wu bootstrap weights. Finally, the B mean bootstrap weights are used to estimate the variance. This technique makes it possible to avoid nil weights and thus a potential division by 0 for a ratio in a small area. More generally, it avoids the problems with matrix inversion that can occur for certain bootstrap replicates when estimating equations are solved.

A theoretical formula that has been developed shows that the convergence rate of the variance estimator based on mean bootstrap weights is the same as that based on standard bootstrap weights. In other words, the stability of the variance estimator does not depend on the choice of R, but only on the number B of bootstrap weights. Simulations for estimating the variance of totals and medians were also carried out under a simple random sampling design without replacement. They indicate that the standard bootstrap is superior to the mean bootstrap for estimating the variance of a median. If we want to avoid nil bootstrap weights, this suggests using a value of R that is as small as possible (but still large enough to eliminate them). Simulations were also carried out using the method described in Saavedra (2001). This method is an alternative to the mean bootstrap that also avoids nil weights and seems highly promising. Finally, we have started working on a simulation in which a two-stage design is considered. This simulation will be used to study how the various bootstrap methods behave when the parameters of a regression model are estimated.
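
A small sketch of the mean bootstrap construction for a single stratum is given below, assuming negligible sampling fractions and the simplified Rao-Wu adjustment; production systems apply this by stratum with the full design weights.

  # Sketch: mean bootstrap weights obtained by averaging R Rao-Wu adjustments (one stratum).
  set.seed(1)
  n <- 8                                  # sampled units (or PSUs) in the stratum
  w <- rep(25, n)                         # design weights
  R <- 10; B <- 500

  mean_boot_w <- replicate(B, {
    adj <- replicate(R, {
      m <- tabulate(sample(n, n - 1, replace = TRUE), nbins = n)   # Rao-Wu resample of n-1 units
      (n / (n - 1)) * m                                            # Rao-Wu weight adjustment
    })
    w * rowMeans(adj)                     # average the R adjustments: nil weights become unlikely
  })
  # mean_boot_w is an n x B matrix of mean bootstrap weights used for variance estimation
  range(mean_boot_w)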

SAS tool for ridge calibration

Ridge calibration is a methodology that can be used to control the maximum and minimum weight adjustments resulting from the use of auxiliary information (e.g., Beaumont and Bocci, 2008). It is particularly useful when there are a large number of auxiliary variables. Specifications (Beaumont, 2009) have been written for the implementation of this methodology in the Travel Survey of Residents of Canada. Based on these specifications, a SAS macro was developed by Joël Bissonnette and used in this survey.
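
A minimal matrix-algebra sketch of ridge (penalized) calibration under a chi-square distance is shown below; the data, penalty matrix and targets are illustrative and this is not the production macro.

  # Sketch: ridge calibration with a chi-square distance (illustrative data).
  set.seed(1)
  n <- 200
  X <- cbind(1, matrix(rnorm(n * 3), n, 3))        # auxiliary variables (with intercept)
  d <- runif(n, 10, 40)                            # design weights
  Tx <- c(sum(d) * 1.02, 50, -30, 80)              # "known" population totals (made up)

  lambda <- diag(c(0, rep(5, 3)))                  # penalty: 0 means that constraint is met exactly
  beta <- solve(t(X) %*% (d * X) + lambda, Tx - colSums(d * X))
  w    <- d * (1 + X %*% beta)                     # calibrated weights

  range(w / d)                                     # weight adjustments are damped by the penalty
  colSums(as.vector(w) * X) - Tx                   # penalized totals are only approximately met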

Take-None estimation

We reviewed the strategies implemented by various sub-annual (both monthly and quarterly) surveys to estimate the Take-None portion of the population. The surveys that we reviewed and considered for evaluating the proposed Take-None estimation strategies included the Monthly Retail Trade Survey, the Monthly Wholesale Survey, the Monthly Survey of Manufacturers (MSM) and the redesigned Quarterly Trucking Financial Survey. A draft report documenting the evaluation plan was completed.

We used data from the 12 monthly Retail Trade Surveys and the 12 monthly Wholesale Surveys for the calendar year 2008 to match both the frame and survey files with the historical GST files that would have been available at the time each survey month was processed. We used the data for the large Take-Some strata to estimate the regression model and then used the model to predict the contribution from the corresponding small Take-Some strata. For those businesses that did not link to the GST, the GST value was imputed via hot-deck imputation so that estimates could be produced for the entire survey population. We then computed the Absolute Relative Deviation (ARD) between the Take-Some contributions from the direct survey estimates and the model estimates at the publication level, i.e., province and industry group. We used this approach because there are no survey data for the Take-None strata from which to compute direct survey estimates. Our assumption was that if the model estimated from the large Take-Some strata can adequately predict the small Take-Some values, then the model estimated from the small Take-Some strata would also be adequate to predict the Take-None values. We ignored the contribution of the non-linked businesses during this evaluation but plan to include them using imputation. In general, the evaluation indicated that the model derived from the large Take-Some strata could be applied to the small Take-Some strata. Thus, there is no evidence that a similar approach would not work for the Take-None strata.

The process was repeated using data from the redesigned Quarterly Trucking Survey (QTS). The results are similar to those obtained using data from wholesale and retail surveys. The report has been updated with these results.
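
The core of the evaluation described above can be pictured with the following sketch; the GST variable, the large/small split and the linear model are all placeholders rather than the production specification.

  # Sketch: use large Take-Some strata to fit a model and predict small Take-Some totals.
  set.seed(1)
  gst   <- rlnorm(500, 10, 1)                       # administrative (GST) revenue, made up
  sales <- 0.9 * gst * exp(rnorm(500, sd = 0.2))    # surveyed variable
  dat   <- data.frame(gst = gst, sales = sales,
                      large = gst > quantile(gst, 0.4))   # "large" vs "small" split (illustrative)

  fit    <- lm(sales ~ gst, data = dat, subset = large)   # model estimated on large Take-Some strata
  pred   <- sum(predict(fit, newdata = dat[!dat$large, ]))
  direct <- sum(dat$sales[!dat$large])                    # direct "survey" total for the small strata

  ARD <- abs(pred - direct) / direct                      # absolute relative deviation
  ARD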

Variance estimation with one PSU per stratum

This project is motivated by the Canadian Health Measures Survey (CHMS) where, for operational and cost reasons, a small number (15) of PSUs is selected from five strata across the country. One of the strata is quite small and has only one PSU selected. Furthermore, for preliminary estimates based on the first half of the data collection, three of the five strata have this problem. Different analytical approximations to the variance in pps sampling, particularly those based on Brewer and Donadio (2003, Survey Methodology) and Hartley and Rao (1962, The Annals of Mathematical Statistics), are also considered. To handle the one-PSU-per-stratum problem, relevant ideas from the literature, particularly the collapsing of strata, are reviewed. A new approach based on components of variance from the different stages of sampling is then proposed. These different variance estimators, along with jackknife and bootstrap resampling methods, are compared empirically using data from the first half of the CHMS collection. They all give similar results, except that the estimates are somewhat larger when the Horvitz-Thompson estimator (with approximate joint inclusion probabilities $\pi_{ij}$) is used for the first stage of variance.
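
For readers unfamiliar with the collapsed-stratum device mentioned above, a minimal sketch of its classical paired form is given below (simulated stratum totals and an assumed pairing; this is not the new variance-components approach proposed in the project).

  # Sketch: classical collapsed-stratum variance estimator for one-PSU-per-stratum designs.
  set.seed(1)
  t_h <- c(120, 95, 210, 180, 300, 260)   # estimated totals from 6 strata, each with one sampled PSU

  pairs <- matrix(1:6, ncol = 2, byrow = TRUE)     # collapse similar strata into pairs (assumed pairing)
  v_collapsed <- sum(apply(pairs, 1, function(g) (t_h[g[1]] - t_h[g[2]])^2))

  # The (t_1 - t_2)^2 term per pair tends to overstate the variance because
  # between-stratum differences are absorbed into the "within" term.
  v_collapsed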

Different applications of calibration

Calibration is commonly used in survey sampling to include auxiliary information at the estimation stage. Calibration adjusts the design weights $d_i = \pi_i^{-1}$, where $\pi_i$ is the inclusion probability of unit $i$ in the population. The calibration weights $w_i(s)$ are obtained by minimizing a suitable distance measure between the design weights and the calibration weights, subject to the benchmark constraints $\sum_{i \in s} w_i(s)\,\mathbf{x}_i = \mathbf{X}$, where $\mathbf{X}$ is the vector of known population totals of the auxiliary variables $\mathbf{x}_i$. Huang and Fuller (1978) proposed a scaled modified chi-squared distance measure and obtained the calibration weights through an iterative solution that satisfies the benchmark constraints. Deville and Särndal (1992) proposed a number of distance measures subject to the above restriction and also constrained the weights to be bounded above and below.

The Lagrange multiplier method is used to determine the solution for the calibration weights. This method, named after Joseph-Louis Lagrange, provides a strategy for finding the maximum or minimum of a function subject to constraints: to maximize $f(\mathbf{x})$ subject to $g(\mathbf{x}) = c$, we introduce the Lagrange multiplier $\lambda$ and optimize $h(\mathbf{x}) = f(\mathbf{x}) + \lambda\,(g(\mathbf{x}) - c)$. We showed that the Lagrange procedure can be used for a variety of constrained optimization problems, including sample size determination and benchmarking a time series. The resulting equations are similar in form to the equations that define calibration estimation. The work was presented by M. Hidiroglou at the survey sampling workshop held in honour of Jean-Claude Deville in Neuchâtel on June 26, 2009.
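
For completeness, the familiar chi-square distance case can be written out in this notation (a standard derivation, not specific to the presentation):

  \min_{w}\; \sum_{i \in s} \frac{(w_i - d_i)^2}{2\,d_i q_i}
  \quad \text{subject to} \quad
  \sum_{i \in s} w_i \mathbf{x}_i = \mathbf{X}.

  The Lagrangian is
  h(\mathbf{w}, \boldsymbol{\lambda}) = \sum_{i \in s} \frac{(w_i - d_i)^2}{2\,d_i q_i}
    + \boldsymbol{\lambda}^{T}\Big(\mathbf{X} - \sum_{i \in s} w_i \mathbf{x}_i\Big),

  and setting its derivatives to zero gives
  w_i = d_i\big(1 + q_i\,\mathbf{x}_i^{T}\boldsymbol{\lambda}\big),
  \qquad
  \boldsymbol{\lambda} = \Big(\sum_{i \in s} d_i q_i \mathbf{x}_i \mathbf{x}_i^{T}\Big)^{-1}
    \Big(\mathbf{X} - \sum_{i \in s} d_i \mathbf{x}_i\Big),

where the $q_i$ are user-specified scale factors (often $q_i = 1$).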

Panel effects for the Labour Force Survey

This project looks at the monthly series of panel level (rotation groups) estimates from the LFS, by major industrial (16 groups) and occupational (10 groups) classifications. The question of interest is whether there are rotation group effects or month-in-sample effects in the series. A second question is whether estimates for industry groups are more stable than those for occupational groups, since employment by industry group is used as a calibration total in the current composite estimator. Estimators of rotation group and month-in-sample effects, along with appropriate variance estimators, are derived. These are then evaluated for both the employment and unemployment series. Although there are some significant effects, there is no discernible pattern. For example, rotation group effects that are significant for total employment may not be significant for most industries, or may be significant but in the opposite direction for some industries. Furthermore, this study finds no evidence of higher instability for estimates at the occupational group level. A report outlining the methodology and results is being finalized.

For further information, contact:
Wesley Yung (613-951-4699, wesley.yung@statcan.gc.ca).

References

Beaumont, J.-F., and Bocci, C. (2008). Another look at ridge calibration. Metron, vol. LXVI, no 1, 5-20.

Baillargeon, S., and Rivest, L.-P. (2009). A general algorithm for univariate stratification. International Statistical Review, vol. 77, no 3, 331-344.

Baillargeon, S., Rivest, L.-P. and Ferland, M. (2007). Stratification en enquêtes entreprises : une revue et quelques avancées. 2007 Annual meeting of SSC. Proceedings of the Survey Methods Section, Statistical Society of Canada (available online at http://www.ssc.ca/survey/documents/SSC2007_S_Baillargeon.pdf).

Deville, J.-C., and Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, vol. 87, 376-382.

Huang, E.T., and Fuller, W.A. (1978). Nonnegative regression estimation for sample survey data. Proceedings of the Social Statistics Section, American Statistical Association, 300-305.

Kozak, M. (2004). Optimal stratification using random search method in agricultural surveys. Statistics in Transition, vol. 6, no 5, 797-806.

Lavallée, P., and Hidiroglou, M. (1988). On the stratification of skewed populations. Survey Methodology, vol. 14, 33-43.

Ohlsson, E. (1995). Coordination of samples using permanent random numbers. In Business Survey Methods, (Eds., B.G. Cox, D.A. Binder, D.N. Chinnappa, A. Christianson, M.J. Colledge, and P.S. Kott), New York: John Wiley & Sons, Inc., 153-169.

Rivest, L.-P. (2002). A generalization of the Lavallée and Hidiroglou algorithm for stratification in business surveys. Survey Methodology, vol. 28, no 2, 191-198.

Saavedra, P. (2001). An extension of Fay’s method for variance estimation to the bootstrap. Proceedings of the Survey Research Methods Section, American Statistical Association.

Small area estimation

Small area estimation using area level models with applications

Area level models are widely used in practice to improve the direct survey estimates for small areas. In this project, we have studied various area level models for small area estimation, including the basic Fay-Herriot model and extended models such as unmatched nonlinear models, time series models and spatial correlation models. We investigate the problems of sampling variance smoothing and modeling, as well as benchmarking small area estimates, under the EBLUP and hierarchical Bayes (HB) approaches. We have applied the HB approach with Gibbs sampling to various models and applications at Statistics Canada. We have studied the use of small area models in various surveys, including the Labour Force Survey, the Canadian Community Health Survey, the Participation and Activity Limitation Survey, and the Reverse Record Check for Census undercoverage estimation. The area level modeling methods and related application results were presented at the 2009 SSC annual meeting in Vancouver, and a conference paper has been completed and submitted (You, 2009). We also plan to complete a working paper on area level modeling approaches for small area estimation, with applications to different surveys at Statistics Canada. This document can serve as a user guide for model-based small area estimation using area level models at Statistics Canada, and as a reference for the small area estimation system under development.
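
For reference, the basic Fay-Herriot area level model and the resulting EBLUP can be written, in standard notation, as:

  \hat{\theta}_i = \theta_i + e_i, \qquad e_i \sim N(0, \psi_i) \quad \text{(sampling model)},
  \theta_i = \mathbf{x}_i^{T}\boldsymbol{\beta} + v_i, \qquad v_i \sim N(0, \sigma_v^2) \quad \text{(linking model)},

  \tilde{\theta}_i = \hat{\gamma}_i\,\hat{\theta}_i + (1 - \hat{\gamma}_i)\,\mathbf{x}_i^{T}\hat{\boldsymbol{\beta}},
  \qquad \hat{\gamma}_i = \frac{\hat{\sigma}_v^2}{\hat{\sigma}_v^2 + \psi_i},

where $\psi_i$ is the (smoothed or estimated) sampling variance and $\sigma_v^2$ is the model variance; the estimation of $\sigma_v^2$ is the subject of the next project.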

Evaluation of variance component estimation methods for the Fay-Herriot model

The Fay-Herriot model is the most popular model used in small area estimation to improve the direct survey estimates. The model has two variances, namely a model variance associated with the area random effect and a sampling variance associated with the sampling errors. The sampling variance can be estimated from the survey data and can also be smoothed by various methods. The model variance is unknown and needs to be estimated from the model. Various methods have been proposed to estimate the model variance. In this project, we study several of them, including the fitting-of-constants (FC) method, the REML method, the FHI (Fay-Herriot iterative) method, Wang and Fuller’s method (Wang and Fuller, 2003), and the recently developed ADM (adjusted density maximization) method (Li and Lahiri, 2010). We conduct a simulation study along the lines of Rivest and Vandal (2003) to evaluate the methods for estimating variance components and small area parameters under the EBLUP approach. The sampling variances are generated from a chi-square model as in Rivest and Vandal (2003), and the small area data are then generated from a Fay-Herriot model. We compare the different methods and their impact on MSE estimation. We also study the effect of the input sampling variances under these different estimation methods for the model variance, particularly for MSE estimation when the direct sampling variance estimates are used, as in Wang and Fuller (2003). This will give us an indication of the proper use of estimation methods under sampling variance smoothing and modeling. The estimation methods are coded in S-Plus and some preliminary results have been obtained. A draft paper (You, 2010) is in progress. We will continue the project and write up a paper, and we plan to present the results at conferences next year. The results of this study can be used as a reference guide for choosing the method used to estimate the model variance in practice, and also for the small area estimation system currently under development at SRID.
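
As an illustration of one of the methods under study, a bare-bones version of the Fay-Herriot iterative (FHI) moment estimator of the model variance is sketched below in R on simulated data; the project's implementation is in S-Plus and handles many details omitted here.

  # Sketch: Fay-Herriot iterative moment estimator of the model variance (simulated data).
  set.seed(1)
  m   <- 40
  X   <- cbind(1, runif(m))                        # area level covariates
  psi <- runif(m, 0.5, 2)                          # known/smoothed sampling variances
  theta     <- X %*% c(2, 3) + rnorm(m, sd = 1)    # true area means (sigma_v^2 = 1)
  theta_dir <- theta + rnorm(m, sd = sqrt(psi))    # direct estimates

  p    <- ncol(X)
  sig2 <- median(psi)                              # starting value
  for (it in 1:50) {
    V    <- sig2 + psi
    beta <- solve(t(X) %*% (X / V), t(X) %*% (theta_dir / V))   # WLS given current sig2
    res2 <- as.vector(theta_dir - X %*% beta)^2
    g    <- sum(res2 / V) - (m - p)                # Fay-Herriot moment equation
    sig2 <- max(0, sig2 + g / sum(res2 / V^2))     # Newton-type update, truncated at zero
  }
  sig2                                             # estimate of the model variance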

Estimation for occupation counts using the Labour Force Survey

The Canadian Labour Force Survey (LFS) uses a probability sample that is based on a stratified multi-stage design. The LFS is the only source of monthly estimates of total employment including the self-employed, full and part-time employment, and unemployment. It publishes monthly standard labour market indicators such as the unemployment rate, the employment rate and the participation rate. Employment estimates include detailed breakdowns by demographic characteristics, industry and occupation, job tenure, and usual and actual hours worked. Information for occupational detail for employed persons is available on Statistics Canada’s computerized database and information retrieval service (Canadian Socio-economic Information Management System or CANSIM), at the one and two-digit level within each province for time periods chosen by the CANSIM user.

It is possible to generate three- and four-digit level occupational detail at the provincial level. However, the data quality (in terms of statistical reliability or coefficient of variation) is not adequate for most users. We used the SPREE procedure to improve the reliability and smoothness of the two-digit occupation series published by the LFS, and subsequently of the lower level occupations. The resulting work, summarized in Hidiroglou and Patak (2009), was presented at the SSC meetings held in Vancouver.

Issues in the application of Penalized Spline models for small area estimation

SAE models can be stated either at the area level, where the observations are domain survey estimates of a study variable, or at the unit level, where the observations are the survey units of study. Often the model depends on the level at which good data are available. Sometimes, however, the correlation between direct estimates of the domains and corresponding auxiliary variables is considerably higher than the correlation between unit level quantities. Hence, though auxiliary data for business surveys is usually available at the unit level, in this project we use area level models with the expectation that by aggregation the model fits better at the area level. We consider the Penalized Splines (PS) model, which consists of a linear sampling model for the direct survey estimates, and a linking spline model that includes a random error to account for the small area effects not explained by the auxiliary variables. It is also a special case of the General Linear Mixed Model. We derived the Mean Squared Prediction Error (MSPE) of the BLUP estimators of small area means or totals in the frequentist framework, when the variance components are known. The more realistic case of unknown variance components yields the empirical best linear unbiased prediction (EBLUP) estimators. However, we cannot apply existing theory to estimate the MSPE because the variance structure is not block diagonal.
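
In one common truncated-linear-basis formulation (our notation, shown only for orientation; the project's exact specification may differ), the PS area level model is:

  \hat{\theta}_i = \theta_i + e_i, \qquad
  \theta_i = \beta_0 + \beta_1 x_i + \sum_{k=1}^{K} \gamma_k (x_i - \kappa_k)_{+} + v_i,

  \gamma_k \sim N(0, \sigma_\gamma^2), \qquad
  v_i \sim N(0, \sigma_v^2), \qquad
  e_i \sim N(0, \psi_i),

where $\kappa_1 < \dots < \kappa_K$ are fixed knots, the penalized spline coefficients $\gamma_k$ are treated as random effects, $v_i$ is the small area effect and $e_i$ is the sampling error with variance $\psi_i$.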

We have adapted parametric and non-parametric bootstrap procedures to estimate the MSPE of the EBLUP estimators for area level models and have continued investigating their properties. Below, we summarize our main accomplishments, some of which were also reported under the SAE account.

  • We proved that the method of moments yields a consistent estimator of the model variance of the random effect as the number of areas increases to infinity, and a consistent estimator of the spline-effect variance as the number of knots increases to infinity.
  • We tested the method empirically and found that convergence is very slow, so a spline function with a very large number of knots is needed to ensure a good enough proportion of positive spline variance estimates. The method is not practical when the number of areas is small.
  • The properties of the EBLUP estimators for various combinations of true linking functions and fitting splines were studied empirically, by calculating the true MSPE via Monte Carlo simulations. We used the method of restricted maximum likelihood (REML) to estimate the model variance components for this study.
  • The properties of the bootstrap estimators of the MSPE of EBLUPs were also studied via Monte Carlo simulations in terms of their relative bias.
  • An invited paper with the results of this investigation was presented at the Small Area Estimation 2009 ISI Satellite Conference (SAE2009) in Elche, Spain.
  • A long abstract (proceedings paper) for SAE2009 was written (Rubin-Bleuer, Dochitoiu and Rao, 2009a).
  • In addition to the bootstrap estimator proposed for the SAE2009 presentation, we proposed several other variants of the bootstrap estimator (Rubin-Bleuer, Dochitoiu and Rao, 2009b).

Issues in the application of a time series and cross-sectional model (Yu-Rao type)

The Survey of Employment, Payrolls and Hours (SEPH) is a monthly survey designed to produce estimates of levels and month-to-month trends of payrolls, employment, paid hours and earnings. The target population is composed of all employees in Canada except for those in a few select industries. The survey makes extensive use of administrative data with the aid of a monthly survey. The two data sources are currently combined through the use of the Generalized Regression (GREG) estimator. In an attempt to improve the precision of the GREG estimator, the SEPH program is planning to use a composite estimator RC which is a GREG estimator that borrows strength across time by using information from previous occasions of the survey as auxiliary data. However, there are some domains of interest where the sample will not be large enough to produce estimates with an acceptable precision, such as particular NAICS4 by province domains. The purpose of this project is to examine different models and variance estimation techniques for small area estimation in SEPH. Below, we go over our main achievements, some of which were also reported under the SAE account.

  • We developed a time series and cross-sectional area level model in which the vector of composite estimators at time t and time t-1 is a function of the Average Monthly Earnings (cross-sectional data), and both the sampling errors and the model errors are correlated over time. The model is a modification of the Yu-Rao model (in Rao, 2003, pages 158-160); a generic form of that model is sketched after this list.
  • We compared the Empirical Best Linear Prediction (EBLUP) estimator obtained from the proposed model and the GREG estimator via Monte Carlo simulations using data from a finite population that had been created using actual 2005 SEPH data.
  • The variances $\psi_i$ of the sampling errors were obtained by drawing 500 independent samples following the SEPH design, calculating the corresponding 500 direct sample estimates and computing their standard deviation about the true mean.
  • We estimated the variance components by the method of moments, proved their consistency as the number of small areas increases and calculated the variance of the variance estimators (Rubin-Bleuer, 2009a). We used the variance of the variance estimators in the implementation as a method to deal with negative estimates of the variance components (Wang and Fuller, 2003).
  • In order to compare the design-based properties of the estimators, a Monte Carlo procedure based on 1,000 independent samples was used: for each of these samples, we calculated the EBLUP estimator. Subsequently, we calculated the design-based Monte Carlo average relative bias and the average MSE.
  • The results of this study were presented in an invited paper at the SSC’s 2009 annual conference in Vancouver and a proceedings paper was written (Yung, Rubin-Bleuer and Landry, 2009).
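
For orientation, the Yu-Rao type time series and cross-sectional area level model referred to above can be written, in one standard form (Rao, 2003), as follows; the project's model is a modification of this form:

  \hat{\theta}_{it} = \theta_{it} + e_{it},
  \qquad
  \theta_{it} = \mathbf{x}_{it}^{T}\boldsymbol{\beta} + v_i + u_{it},
  \qquad
  u_{it} = \rho\, u_{i,t-1} + \varepsilon_{it}, \quad |\rho| < 1,

  v_i \sim N(0, \sigma_v^2), \qquad \varepsilon_{it} \sim N(0, \sigma^2),

where the $e_{it}$ are the sampling errors with a covariance structure over time, $v_i$ is the area effect and $u_{it}$ is an AR(1) area-by-time effect.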

Robust small area estimation for business surveys

Data from business surveys contain many outliers and influential points, and we rely on robust estimation methodology to tackle this problem. In this project, we proposed to model the small area direct survey estimators as a function of area level covariates obtained from administrative tax data. We adapted the methodology for robust estimation developed by Sinha and Rao (2008) to area level models and wrote the specifications for the SAE system being developed at Statistics Canada (Rubin-Bleuer, 2009b).

Small area estimation under informative sampling

Population unit level models are often used in model-based small area estimation of totals and means; however, the models may not hold for the sample if the sampling design is informative. As a result, standard methods that assume the model holds for the sample can lead to biased estimators. We propose to study alternative methods that use survey weights as an additional auxiliary variable in the sampling model and/or in the estimation of means and MSEs, using the pseudo-EBLUP approach proposed by You and Rao (2002). Francois Verret, M. Hidiroglou and J.N.K. Rao have studied the properties (bias and MSE) of these alternative methods via a simulation study, using informative sampling schemes to generate the samples. The results of this research are to be presented at the Statistical Society of Canada meetings to be held in Laval, Quebec, in June 2010.

Design-based procedures have been added to the unit level model via the survey weights and the estimated between-area and within-area variance components. Their properties have been investigated in conjunction with the informativeness of the sampling plan. These procedures are being compared to those developed by Pfeffermann and Sverchkov (2007). Preliminary results show that efficiency gains can be achieved by incorporating the design.

Development of a small area estimation system

This project has involved a review and documentation (Estevao, You and Hidiroglou, 2009) of small area estimation methods and their implementation in a SAS prototype program. The current system produces small area estimates based on the Fay-Herriot model, using direct survey estimates, direct sampling variance estimates and related auxiliary variables as input values. The direct sampling variance estimates can be used directly in the system or can be smoothed and then treated as known in the model. We reviewed the published methodology for small area estimation and implemented four methods to estimate the model variance under the Fay‑Herriot area level model: Adjusted Density Maximization (ADM), Restricted Maximum Likelihood (REML), Fay-Herriot (FH) and Wang-Fuller (WF). We created a Methodology Specifications document from this review. This document describes the methodology and outlines the steps and various issues in the implementation of each method. From this document, we created a prototype (a SAS macro program) to produce small area estimates under the four methods. The prototype produces a SAS data set with the required estimates and displays the details of the calculations in the SAS log window. The prototype validates the inputs and has a message system to display errors and warnings; this is all described in a separate document titled Validation of the Inputs for the Small Area Estimation Prototype. We have increased the efficiency of this development by using many of the macros available in StatMx. The logical extension is to move this development into StatMx so that it becomes a fully supported corporate product.

In summary, we have finished the following steps for the system:

  1. Implementation of four different methods including REML, FHI, WFI and ADM to estimate the model variance with a general random coefficient in the Fay-Herriot model;
  2. Verification of the variance component estimation under REML, FHI, WFI and ADM;
  3. MSE estimator for the EBLUP estimators including an extra g4 term to account for the variability when the direct sampling variance estimates are used as input values in the model;
  4. Verification of the MSE estimation and its components g1, g2, g3 and g4 terms;
  5. Development of a GVF (generalized variance function) procedure to smooth the direct sampling variance estimates if needed;
  6. Development of benchmarking procedures to ensure the robustness of the model-based estimates; and
  7. Specification of model evaluation procedures, including various diagnostics and tests, which are under development.

We are currently doing a final review of the documentation and completing the testing of the functionality of the prototype. We are also reviewing the various tests and diagnostics that form the evaluation of the small area model; their implementation will be done in the next fiscal year. Small area estimation based on unit level models, with and without weights, will be added to the system next year. The hierarchical Bayes (HB) inference method with Gibbs sampling will also be introduced to handle more complex models and to give users more options in choosing small area estimation methods and models to improve the direct survey estimates at both the area level and the unit level.

Small area estimation: the use of auxiliary data for informative designs

In the context of small area estimation with a nested error regression model under non-informative sampling, You and Rao (2002) compared the (unit-level) empirical best linear unbiased predictor (EBLUP) with two design-consistent alternatives: an aggregate-level pseudo-EBLUP (Prasad and Rao, 1999) and a better estimator they developed that makes use of the survey weights and both unit and aggregate-level versions of the model. You, Rao and Kovačević (2003) further extended the use of sampling weights to iterative weighted estimating equations to estimate both fixed effects and variance components.

In this research project, the EBLUP and the You-Rao estimator were compared through simulations under various degrees of informativeness of the (Rao-Sampford) sampling design, using a setting similar to the one in Asparouhov (2006). Different methods of estimating the variance components of the model were also compared (namely REML, fitting-of-constants and You-Rao-Kovačević). Small area mean estimators were evaluated in terms of absolute bias, mean squared error (MSE), relative bias and bias ratio. Fixed effects and variance components estimators were also compared in terms of bias, MSE and relative bias to shed some light on multi-level estimation problems. It is planned to extend this simulation from the evaluation of point estimates to the evaluation of the corresponding MSE estimators. Comparisons with other estimation methods better suited to informative sampling designs, such as those of Asparouhov (2006), Pfeffermann and Sverchkov (2007) and Huang and Hidiroglou (2010), are also being considered.

A contributed paper authored by Francois Verret, Mike Hidiroglou and J.N.K. Rao will be presented at the upcoming 2010 Statistical Society of Canada Meetings.

For further information, contact:
Susana Rubin-Bleuer (613 951-6941, susana.rubin-bleuer@statcan.gc.ca).

References

Asparouhov, T. (2006). General multi-level modeling with sampling weights. Communications in Statistics–Theory and Methods, vol. 35.

Huang, R., and Hidiroglou, M. (2010). Design consistent estimators for a mixed linear model fitted to survey data. To be published in Journal of the Royal Statistical Society, Series B.

Li, H., and Lahiri, P. (2010). An adjusted maximum likelihood method for solving small area estimation problems. Journal of Multivariate Analysis, vol. 101, 882-892.

Pfeffermann, D., and Sverchkov, M. (2007). Small-area estimation under informative probability sampling of areas and within the selected areas. Journal of the American Statistical Association, vol. 102, no 480.

Prasad, N.G.N., and Rao, J.N.K. (1999). On robust small area estimation using a simple random effects model. Survey Methodology, vol. 25, no 1, 67-72.

Rao, J.N.K. (2003). Small Area Estimation. New York: John Wiley & Sons, Inc.

Rivest, L.-P., and Vandal, N. (2003). Mean squared error estimation for small areas when the small area variances are estimated. Proceedings of the International Conference on Recent Advances in Survey Sampling, (Ed., J.N.K. Rao).

Wang, J., and Fuller, W.A. (2003). The mean squared error of small area predictors constructed with estimated area variances. Journal of the American Statistical Association, vol. 98, 716-723.

You, Y., and Rao, J.N.K. (2002). Small area estimation using unmatched sampling and linking models. The Canadian Journal of Statistics, vol. 30, no 1, 3-15.

You, Y., Rao, J.N.K. and Kovačević, M. (2003). Estimating fixed effects and variance components in a random intercept model using survey data. Proceedings: Symposium 2003, Challenges in Survey Taking for the Next Decade.

Data analysis research (DAR)

The Data Analysis Research Group conducts research on methodological problems that have been identified by analysts and methodologists and involve statistical inference about finite population parameters or issues arising from the modeling of survey data. The Group is also involved in the transfer and exchange of knowledge and experience through reviewing and publishing technical papers and by giving seminars, presentations and courses.

The data analysis research conducted in 2009/10 involved a number of independent projects. In this report we present only those that made progress during the reporting period.

Selected topics in design-based methods for analysis of survey data

1.  Role of Weights in Descriptive and Analytical Inferences from Survey Data: An Overview. Statistical agencies generally collect data from samples drawn from well-defined finite populations, using complex sampling procedures that may include stratification, clustering, multi-stage sampling and unequal probabilities of selection. Sample design weights, defined from the sampling procedures, are often adjusted to account for non-responding units and to calibrate to known population totals of auxiliary variables. Once adjusted, these ‘final’ weights are included on the survey datasets. There has been some debate on the necessity of using these weights in the estimation of descriptive statistics and in the analysis of data from these surveys. A paper co-authored with Jon Rao, Mike Hidiroglou, Wesley Yung and Milorad Kovačević was written to discuss the role of weights in descriptive and analytical inference. It was presented at the 57th Session of the International Statistical Institute Meetings, held in Durban, South Africa.

2.  Analysis of data aggregated from more than one survey source. When analysing special subpopulations, researchers frequently find that any single data source has a limited sample in the subpopulation, but that there is more than one data source containing the same variables and for which the target populations are the same or related (e.g., different cycles of a repeated cross-sectional survey or different cycles of a longitudinal survey). In this research we address a variety of topics that researchers should be aware of, such as the comparability of the variables across surveys and the suitability of positing a model for the variables in the different surveys. We investigate possible approaches for combining the information. A presentation was made at JSM 2009 and a paper entitled “Analyses Based on Combining Similar Information from Multiple Surveys” (Roberts and Binder, 2009) was submitted to the conference proceedings. More research and writing on this topic is planned, focusing on problems encountered with Statistics Canada survey data.

3.  Methods for Microsimulation Modeling. Research on the methodology of microsimulation models concentrated on the identification and validation of candidate models for predicting the probabilities of a variety of events, as well as on the statistical properties of the resulting estimates used to advance individuals along a simulated life course. A presentation titled “Challenges in Methodology of Microsimulation Models” (Kovačević, 2009) was made to the Advisory Committee in May 2009. A contributed paper was presented at the 2009 SSC meetings, and a paper titled “Estimation of Relative Risks for Population Projections Using Microsimulation” (Carrillo-Garcia and Kovačević, 2009) was submitted to the Proceedings. Results of the research have been used in the DemoSim demographic microsimulation model.

4.  A model-design randomization framework for analytical methods for complex survey data. No research was done in this year, but the paper based on earlier research, titled “Design- and Model-Based Inference for Model Parameters” (Binder and Roberts, 2009) was published in the Handbook of Statistics.

5.  Adjustment for Record Linkage Errors in Analysis. Record linkage is defined as the bringing together of two or more micro-records to form a composite record (Newcombe, 1988; Fellegi and Sunter, 1969). For example, record linkage is performed in epidemiological mortality and morbidity studies between cohort data sets and vital statistics. Any errors in the linkage are a potential problem for subsequent statistical analysis. In the majority of record linkages completed at Statistics Canada, these mismatch errors (missed links or false links) are ignored when the linked file is analyzed. Unlike imputation, where the impact on both estimation and variance is currently under study, the impact of linkage errors has not received a great deal of attention, in spite of a 1965 study by Neter, Maynes and Ramanathan that showed how a relatively small mismatch error could result in substantial bias. Analysts are left wondering what the potential impact is and what the most appropriate corrections for this measurement error are. A draft of a 10-page article and a presentation for the meeting of the Advisory Committee on Statistical Methods in April 2010 are being prepared.

6.  Meta-Analysis of Survey Data. The recent increase in straightforward access to large survey datasets has led to a proliferation of analyses that group or pool multiple survey results. This ready access, combined with user-friendly computer software, has led researchers to try to borrow strength from the many surveys in an attempt either to improve precision or to address research questions for which the original studies were not designed. Researchers have started to employ many different techniques to combine data from complex national surveys. However, the analysis is often done without reference to a generalized framework or a systematic review, and often without an understanding of the assumptions used in survey methodology. Simulation studies have been programmed and are in the process of being verified. Analysis of the simulation results is nearing completion. The first draft of the initial chapter of Karla Fox’s PhD thesis, based on this research, has been externally reviewed and revisions are underway.

7.  Transition models. Transition models are under-utilized in the analysis of longitudinal data. The focus of this research is on investigating methods appropriate to longitudinal survey data that account for the erosion of the sample over time due to non-response. In this fiscal year, the literature was reviewed and SAS programs were developed and tested for some simple transition models using dummy NPHS data.
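
A minimal illustration of a simple transition model is given below (in Python rather than SAS): a two-state, first-order transition probability matrix is estimated from wave-to-wave pairs, optionally with survey weights. The states, weights and dummy data are invented for illustration.

import numpy as np

def transition_matrix(states, weights=None, n_states=2):
    """states: (n_subjects, n_waves) integer array of observed states;
    returns P[a, b] = estimated Pr(state b at wave t+1 | state a at wave t)."""
    states = np.asarray(states)
    if weights is None:
        weights = np.ones(states.shape[0])
    counts = np.zeros((n_states, n_states))
    for t in range(states.shape[1] - 1):
        for a in range(n_states):
            for b in range(n_states):
                mask = (states[:, t] == a) & (states[:, t + 1] == b)
                counts[a, b] += np.sum(weights[mask])
    return counts / counts.sum(axis=1, keepdims=True)

rng = np.random.default_rng(3)
dummy = rng.integers(0, 2, size=(500, 4))        # 500 subjects, 4 waves of dummy data
print(transition_matrix(dummy, weights=rng.uniform(1, 3, 500)))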

Multi-level analysis of survey data

This year, the main work was the improvement of a SAS macro for calling the HLM6 software so that survey bootstrap weights can be used to estimate the variances of the parameters of a hierarchical model. A paper titled “BHLMSAS_V0: Nouveau “logiciel” pour l’estimation de la variance par les méthodes de rééchantillonnage dans les modèles multiniveaux appliqués aux données d’enquêtes” (BHLMSAS_V0: new “software” for resampling-based variance estimation in multilevel models applied to survey data) (Pierre and Saidi, 2009), submitted to the RDC Information and Technical Bulletin, has been modified in response to reviewers’ comments and resubmitted. Experience from this work will allow for the assessment of a bootstrap procedure about to be released in the Mplus software, at the request of the Mplus developers.
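
The general recipe that the macro automates can be sketched as follows: refit the model once per set of survey bootstrap weights and take the empirical variance of the resulting estimates. In this hedged illustration the "model" is simply a weighted mean, and the bootstrap weights are a crude independent-multiplier stand-in for production Rao-Wu weights; in the macro, the refit is a call to HLM6.

import numpy as np

def bootstrap_variance(fit, y, full_weights, bootstrap_weights):
    """bootstrap_weights: (B, n) array of survey bootstrap weights."""
    theta_full = fit(y, full_weights)
    theta_b = np.array([fit(y, w_b) for w_b in bootstrap_weights])
    return theta_full, np.mean((theta_b - theta_full) ** 2)

weighted_mean = lambda y, w: np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(7)
y = rng.normal(10, 2, 300)
w = rng.uniform(50, 150, 300)
bw = w * rng.choice([0.0, 2.0], size=(500, 300))   # stand-in for B = 500 bootstrap weight sets
print(bootstrap_variance(weighted_mean, y, w, bw))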

Design-consistent estimators for a mixed linear model fitted to survey data

Investigations associated with large government surveys typically require statistical analysis for populations that have a complex hierarchical structure. Classical analyses often fail to account for the nature of complex sampling designs and may result in biased estimators of the parameters of interest. Linear mixed models can be used to analyze survey data collected from such populations in order to incorporate the complex hierarchical design structure. We developed a general method for estimating the parameters of the linear mixed model that accounts for the sampling design. We obtained the best linear unbiased estimators for the fixed and random effects by solving sample estimating equations, and the use of survey weights results in design-consistent estimation. We also derived estimators of the variance components for the nested error linear regression model. We compared the efficiency of the proposed estimators with that of existing estimators in a simulation study carried out using a two-stage sampling design, in which several informative and non-informative sampling schemes were considered. The resulting work has been submitted to the Journal of the Royal Statistical Society Series B (Statistical Methodology), and a revised version was resubmitted to address the referees’ comments.
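
A heavily simplified sketch of a survey-weighted (pseudo) GLS estimating equation for the fixed effects of a nested error model, y_ij = x_ij'beta + u_i + e_ij, is given below, with the variance components taken as known. Real design-consistent estimation also uses within-cluster weights and estimates the variance components from weighted estimating equations; none of that detail, nor the paper's actual estimators, is reproduced here.

import numpy as np

def weighted_gls_fixed_effects(X_by_cluster, y_by_cluster, w_cluster,
                               sigma2_u, sigma2_e):
    p = X_by_cluster[0].shape[1]
    A = np.zeros((p, p))
    b = np.zeros(p)
    for X_i, y_i, w_i in zip(X_by_cluster, y_by_cluster, w_cluster):
        n_i = len(y_i)
        V_i = sigma2_e * np.eye(n_i) + sigma2_u * np.ones((n_i, n_i))
        Vinv = np.linalg.inv(V_i)
        A += w_i * X_i.T @ Vinv @ X_i      # cluster-level design weight w_i
        b += w_i * X_i.T @ Vinv @ y_i
    return np.linalg.solve(A, b)

rng = np.random.default_rng(5)
Xc = [np.column_stack([np.ones(8), rng.normal(size=8)]) for _ in range(30)]
yc = [X @ np.array([1.0, 2.0]) + rng.normal(0, 1, 8) + rng.normal() for X in Xc]
print(weighted_gls_fixed_effects(Xc, yc, rng.uniform(1, 3, 30), 1.0, 1.0))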

Regression analysis of data obtained by record linkage

With the availability of data from different sources, such as complex surveys and administrative files, many richer synthetic data sets have been created using computerized record linkage methodology. The creation and use of such files is on the rise in Canada, especially in areas related to population health studies. Linkage, however, introduces errors due to mismatching an individual from one data set with a different individual from the other data set. Some authors have pointed out that the resulting estimates can be subject to bias and additional variability in the presence of linkage errors. The issue of jointly accounting for the linkage errors and the complex sample design when analyzing linked data sets has not been studied so far. Our idea is to extend the results of Lahiri and Larsen (2005) to the complex sample design setting and to provide appropriate modifications to correct for the linkage/matching error. A presentation titled “Regression Analysis of Record-Linked Survey Data” (Kovačević, 2009) was made at the 2009 SSC meeting in Vancouver. A paper titled “Inference Based on Estimating Equations and Probability-Linked Data” (Chambers, Chipperfield, Davis and Kovačević, 2009) has been submitted to a journal for publication.
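
The starting point of this extension can be sketched as follows. If Q[i, j] denotes the probability that the y-value attached to record i actually belongs to record j, then E(y_linked | X) = Q X beta, so regressing y_linked on QX instead of X removes the linkage-error bias, as in Lahiri and Larsen (2005). In the sketch Q is taken as known and no survey weights are used; both are simplifying assumptions.

import numpy as np

def linkage_corrected_ols(X, y_linked, Q):
    QX = Q @ X                                   # expected design matrix under linkage errors
    return np.linalg.solve(QX.T @ QX, QX.T @ y_linked)

# Toy check: correct links with probability 0.9, otherwise a random record.
rng = np.random.default_rng(2)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 3.0]) + rng.normal(size=n)
Q = np.full((n, n), 0.1 / (n - 1))
np.fill_diagonal(Q, 0.9)
perm = np.where(rng.random(n) < 0.9, np.arange(n), rng.integers(0, n, n))
y_linked = y[perm]
print("naive OLS:     ", np.linalg.solve(X.T @ X, X.T @ y_linked))
print("bias-corrected:", linkage_corrected_ols(X, y_linked, Q))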

Bootstrap for model parameters

The Rao-Wu bootstrap is often used to estimate the design variance of estimators of finite population parameters. When estimating model parameters, two sources of variability must normally be taken into account for variance estimation: the sampling design and the model that is assumed to have generated the finite population. If the sampling fraction is negligible, the model variability can in principle be ignored, as is often done in practice, and the Rao-Wu bootstrap remains valid. However, this simplification is not always appropriate. We have thus developed a generalized bootstrap method that correctly takes both sources of variability into account. This general procedure may be used for any parameter defined through an estimating equation, as long as the observations are assumed to be independent under the model. It is simple to apply once bootstrap weights that capture the first two design moments have been obtained (e.g., using the Rao-Wu bootstrap method). We have conducted a simulation study, and the results show that design bootstrap weights lead to underestimation of the variance, even when finite population corrections are ignored; our proposed generalized bootstrap weights correct this deficiency. We have also demonstrated the practical usefulness of the methodology using data from the Aboriginal Children's Survey. We have written a paper (Beaumont and Charest, 2010), which has been submitted for publication in a refereed journal and which will be presented at the SSC meeting in 2010 in Québec City.
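
The motivation can be illustrated with a small Monte Carlo sketch: for a simple random sample without replacement, the design variance of the sample mean (with its finite population correction) understates the total model-plus-design variance of the sample mean as an estimator of the model mean. This toy setting only illustrates why both sources matter when the sampling fraction is not negligible; it does not reproduce the generalized bootstrap weights of Beaumont and Charest (2010).

import numpy as np

rng = np.random.default_rng(11)
mu, sigma, N, n, R = 10.0, 4.0, 1000, 200, 4000
est, design_var = [], []
for _ in range(R):
    pop = rng.normal(mu, sigma, N)              # model step: generate a finite population
    s = rng.choice(pop, size=n, replace=False)  # design step: SRSWOR of size n
    est.append(s.mean())
    design_var.append((1 - n / N) * s.var(ddof=1) / n)

print("average design variance:       ", np.mean(design_var))  # about (1 - f) * sigma^2 / n
print("total (model + design) variance:", np.var(np.array(est)))  # about sigma^2 / n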

Spatial analysis of geocoded data

In spatial data, observations from nearby locations may not be independent of one another, much as observations recorded over time in a time series are autocorrelated. If these spatial effects are ignored when performing regression analysis, the residual error terms may be spatially autocorrelated, violating the assumption of independence. There is a variety of techniques for analysing spatial data, depending on the structure of the data and the desired analysis. Research into the basic forms of spatial autoregressive models for data aggregated to regions has already been carried out, and a method for estimating the spatial model parameters with optimal instrumental variables was developed. Results were reported in a paper titled “Spatial Modeling of Geocoded Crime Data” presented at the 2009 SSC meeting, and a paper of the same name (Collins, 2009) was submitted to the proceedings of the Survey Methods Section. In this fiscal year, simulated data with a larger sample size were examined to assess the impact on the results and to compare the method with other instrumental-variable approaches in the literature. The results are being written up for submission for publication.
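
For concreteness, the sketch below estimates the spatial lag model y = rho*W*y + X*beta + e by two-stage least squares with the familiar instrument set [X, WX, W²X]. The optimal-instrument construction developed in the project is not reproduced, and the row-standardized ring contiguity matrix W is a toy assumption.

import numpy as np

def sar_2sls(y, X, W):
    Wy = W @ y
    Z = np.column_stack([Wy, X])                 # endogenous spatial lag + exogenous regressors
    H = np.column_stack([X, W @ X, W @ W @ X])   # instruments
    P = H @ np.linalg.pinv(H.T @ H) @ H.T        # projection onto the instrument space
    return np.linalg.solve(Z.T @ P @ Z, Z.T @ P @ y)   # (rho, beta)

# Toy data on a ring of regions, each with two neighbours.
rng = np.random.default_rng(8)
n = 200
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
rho, beta = 0.4, np.array([1.0, 2.0])
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + rng.normal(size=n))
print(sar_2sls(y, X, W))                         # roughly [0.4, 1.0, 2.0]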

Using copula theory in multivariate usual quantities distribution estimation

A major objective of Cycle 2.2 of the Canadian Community Health Survey (CCHS) is to estimate the distributions of the usual intakes of a number of different nutrients and food groups at the provincial level for 15 age-sex groups. At the moment, the Software for Intake Distribution Estimation (SIDE) is used by the vast majority of analysts, but it only allows the study of one nutrient at a time. The purpose of this research is to determine whether copula theory could be used to extend the estimation to the joint distributions of usual intakes. The theory of copulas has been studied, and a seminar has been prepared that presents the theory, measures of dependence and an example of applying the theory to data from a complex survey. The purpose of the seminar is to provide other methodologists with a basic knowledge of the theory; it will be given near the start of the next fiscal year. Other applications of the theory may also be possible.
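
The core copula idea can be sketched as follows for two intake variables: model each margin separately, transform the observations to normal scores, and let the correlation of the scores capture the dependence (a Gaussian copula). New joint values can then be simulated and mapped back through the marginal quantiles. Survey weights and the measurement-error (usual-intake) modelling central to a SIDE-style analysis are deliberately left out; the two intake variables are invented.

import numpy as np
from scipy import stats

def fit_gaussian_copula(u, v):
    """Return the normal-score correlation of two samples."""
    zu = stats.norm.ppf((stats.rankdata(u) - 0.5) / len(u))
    zv = stats.norm.ppf((stats.rankdata(v) - 0.5) / len(v))
    return np.corrcoef(zu, zv)[0, 1]

def simulate_joint(u, v, rho, size, rng):
    z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size)
    p = stats.norm.cdf(z)
    return np.quantile(u, p[:, 0]), np.quantile(v, p[:, 1])   # empirical marginal quantiles

rng = np.random.default_rng(9)
calcium = rng.gamma(4.0, 200.0, 2000)
vitamin_d = 0.01 * calcium + rng.gamma(2.0, 3.0, 2000)        # correlated toy intakes
rho = fit_gaussian_copula(calcium, vitamin_d)
sim_ca, sim_vd = simulate_joint(calcium, vitamin_d, rho, 5000, rng)
print(rho, np.corrcoef(sim_ca, sim_vd)[0, 1])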

Analysis of incomplete survey data: Using survey weights when estimating the response model

Progress was made in the research on imputation for missing responses in longitudinal surveys. It has been shown (Carrillo-Garcia, 2008) that the solution to the survey-weighted generalized estimating equations (GEE) is consistent under hot-deck imputation for missing responses. Two papers (Carrillo-Garcia, Wu and Chen, 2010a, b) based on the results of this thesis have been written and submitted. A talk entitled “A Comparison of Re-Weighting and Imputation Approaches for Missing Responses in Longitudinal Surveys”, co-authored by Ivan Carrillo-Garcia, Milorad Kovačević and Changbao Wu, is being prepared for the 2010 SSC meeting.
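
As a reminder of the kind of donor imputation referred to above, the sketch below performs random hot-deck imputation within classes: for each nonrespondent, a respondent donor is drawn at random from the same class and its value is copied over. The class definitions, donor selection and data are illustrative assumptions, and the longitudinal GEE analysis step is omitted.

import numpy as np

def hot_deck_impute(y, observed, classes, rng):
    y_imp = y.copy()
    for c in np.unique(classes):
        donors = np.where(observed & (classes == c))[0]
        recipients = np.where(~observed & (classes == c))[0]
        if len(donors) and len(recipients):
            y_imp[recipients] = y[rng.choice(donors, size=len(recipients))]
    return y_imp

rng = np.random.default_rng(4)
y = rng.normal(60, 15, 1000)
classes = rng.integers(0, 5, 1000)               # e.g., age-sex imputation classes
observed = rng.random(1000) > 0.2                # 20% nonresponse
print(hot_deck_impute(y, observed, classes, rng)[:10])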

Asymptotic properties of the proportional hazards model for survey data

This project tackles the problem of fitting the proportional hazards (PH) regression model to survey data. Standard asymptotic results may not apply and could yield misleading inferences. Consequently, many authors have provided methods for either sampling design-based inference or joint sampling design and model-based inference. In previous studies with joint design and model-based inference in mind, the exact design and model conditions were not clearly spelled out and formal proofs were not given. In the case where the super-population is composed of independent censored failure times, we used the central limit theorem (CLT) for martingales in the joint sampling design and model space to obtain results for a general multistage sampling design under mild and easy-to-verify conditions. Working directly in the joint space yielded CLTs (in the joint probability space) for data from many sampling designs that do not admit a (design) CLT; as a consequence, the results are more general than those previously obtained. This was the outcome of an investigation done during 2004-2006. During the 2009-2010 period, we proved the asymptotic normality of the Sample Partial Likelihood Score (SPLS) function and the Sample Maximum Partial Likelihood (SMPL) estimator for data from stratified clustered super-populations. In this case, we could not apply the CLT for martingales; instead, we employed a technique based on the existence of both design-based and model-based CLTs. We therefore required that the design admit a CLT for the sample mean, and worked with a stratified two-stage probability proportional to size with replacement (PPSWR) design. The main results (contained in Rubin-Bleuer, 2010) are listed below, followed by a small numerical sketch of the survey-weighted score:

  • Asymptotic normality of the Partial Likelihood Score function for clustered super-populations under the PH model with a single baseline hazard function. The proof presented is straightforward and based on counting-process tools.
  • Asymptotic normality of the centralized SPLS function and the SMPL.
  • A consistent estimator of the variance of the SMPL.
  • Model and design conditions under which these asymptotic properties hold, clearly spelled out, with formal proofs given.
  • A paper containing results for both independent and clustered super-populations, written and submitted for peer review.
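
The sketch below only evaluates the survey-weighted (sample) partial likelihood score for the PH model with a single covariate, U(b) = sum_i w_i d_i [x_i - S1(t_i; b)/S0(t_i; b)], where S_k(t; b) = sum_j w_j 1{t_j >= t} exp(b x_j) x_j^k. The toy data generation (exponential failure times, an independent censoring flag, uniform weights) is an assumption; the asymptotic results summarized above concern the distribution of this score and of the resulting SMPL estimator, which are not reproduced here.

import numpy as np

def sample_partial_score(beta, time, event, x, w):
    u = 0.0
    risk = np.exp(beta * x)
    for i in range(len(time)):
        if event[i]:
            at_risk = time >= time[i]
            s0 = np.sum(w[at_risk] * risk[at_risk])
            s1 = np.sum(w[at_risk] * risk[at_risk] * x[at_risk])
            u += w[i] * (x[i] - s1 / s0)
    return u

rng = np.random.default_rng(6)
n = 400
x = rng.normal(size=n)
t = rng.exponential(scale=np.exp(-0.5 * x))      # hazard exp(0.5 * x), true beta = 0.5
event = rng.random(n) > 0.2                      # toy censoring indicator
w = rng.uniform(1, 4, n)                         # toy survey weights
print(sample_partial_score(0.5, t, event, x, w)) # near zero on average at the true beta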

For further information, contact:
Georgia Roberts (613-951-1471, georgia.roberts@statcan.gc.ca).

References

Carrillo-Garcia, I.A. (2008). Analysis of Longitudinal Surveys with Missing Responses. PhD thesis, University of Waterloo, ON, Canada.

Fellegi, I.P., and Sunter, A.B. (1969). A theory for record linkage. Journal of the American Statistical Association, vol. 64, no 328, 1183-1210.

Lahiri, P., and Larsen, M.D. (2005). Regression analysis of data files that are computer matched. Journal of the American Statistical Association, vol. 100, no 469, 222-230.

Neter, J., Maynes, E.S. and Ramanathan, R. (1965). The effect of mismatching on the measurement of response error. Journal of the American Statistical Association, vol. 60, no 312, 1005-1027.

Newcombe, H.B. (1988). Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business. Oxford, U.K.: Oxford University Press.

Data collection

The goal of this project is to gather knowledge of best practices in questionnaire design for mixed-mode surveys. The progress made in 2009-2010 is summarized below:

  • The paper entitled “A framework for research on the effects of collection modes on the quality of survey data” by Yiptong and Houle (2009) was finalized.
  • The database of the predetermined questionnaires was completed. Questions related to Health, Education, Income and Employment Status are part of the database. The purpose of this exercise was to identify differences in the number of questions and in the wording of questions for the same items, as well as differences in meaning between the French and English versions. There was also a need to identify the collection tools for these data items in order to determine
    1. whether the survey data collected are comparable;
    2. what effect question wording has on responses;
    3. whether the mode has an effect on responses and data quality.
  • Common questions found on different questionnaires using different or the same collection modes were identified and chosen. Access to the identified data was requested and granted by the appropriate survey areas.
  • An analysis plan of the identified variables is being drafted.
  • Discussions have taken place regarding a future empirical study to measure mode effects for personal-interview and self-administered collection modes. Survey managers who plan to use the Internet as an additional mode of collection will be contacted at the beginning of the new fiscal year; the surveys considered are the Labour Force Survey (LFS) and the General Social Survey (GSS).
  • A literature review on mixed-mode methods and the effect of mixed-mode collection on the quality of survey data is still ongoing.
  • A proposal for future research has been written and an application for funding for 2010-2011 has been submitted.

For further information, contact:
Jackie Yiptong (613-951-3146, jackie.yiptong@statcan.gc.ca) or
Patricia Houle (613-951-4554, patricia.houle@statcan.gc.ca).

Disclosure control methods

As part of its mandate, the DCRC provided advice and assistance to Statistics Canada programs on methods of disclosure risk assessment and control. It also shared information or provided advice on disclosure control practices with other departments and agencies, including the U.S. Census Bureau, National Center for Health Statistics and Bureau of Labor Statistics, Statistics New Zealand, Health Canada, Industry Canada, Natural Resources Canada, the Treasury Board Secretariat, the Ontario Ministry of Community Safety and Correctional Services, the B.C. Ministry of Advanced Education and Labour Market Development, the Children’s Hospital of Eastern Ontario Research Institute and Canada Privacy Services Inc.

Support for disclosure control methods

Continuing support on disclosure control methods is also given to the Agency’s Research Data Centres (RDC) Program. Most of this support involves assistance with the implementation and interpretation of disclosure control rules for RDC data holdings. In particular, the confidentiality of business outputs was examined in more detail, and rules were developed for business data (Tambay, 2009a, 2009b).

Confid2

Confid2 is Statistics Canada’s generalized software for the suppression of sensitive cells in magnitude tables. It aims to preserve the confidentiality of respondent data while minimizing the amount of information that is suppressed. Version 1.02 of Confid2 was released in January 2010. Enhancements include a new in-house dominance rule (the C2 rule), intended eventually to replace the current standard (the Duffett rules), and the ability to use user-specified variables as cost variables to prioritize cells for suppression (previously, only pre-specified functions of the study variable could be used as cost variables). The software was the subject of a presentation at the SSC annual meeting (Frolova, Fillion and Tambay, 2009).
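
To make the idea of a dominance rule concrete, the sketch below applies a generic textbook (n, k) rule: a magnitude cell is flagged as sensitive if its top n contributors account for more than k percent of the cell total. The values n = 2 and k = 85 are arbitrary illustrations; this is neither the Duffett rules nor the C2 rule implemented in Confid2.

import numpy as np

def dominance_sensitive(contributions, n=2, k=85.0):
    """Flag a cell as sensitive if its top n contributors exceed k% of the total."""
    c = np.sort(np.asarray(contributions, dtype=float))[::-1]
    return c[:n].sum() > (k / 100.0) * c.sum()

print(dominance_sensitive([500, 40, 30, 20, 10]))   # True: one dominant respondent
print(dominance_sensitive([120, 110, 100, 95, 90])) # False: no dominance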

Papers presented

An invited paper on the development of a Real Time Remote Access Infrastructure at Statistics Canada was presented at the Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality in Bilbao, Spain (Simard, 2009).

Creation of synthetic data

There is an emerging literature on methods for the creation of synthetic or simulated data. These methods attempt to preserve the relationships in the original data as much as possible while keeping the risk of divulging confidential information low. Current methods involve modelling the multivariate relationships in the collected data so as to reproduce these observed relationships in the synthetic data.
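
One common way of carrying out such modelling, sketched below for illustration, is sequential regression synthesis: variables are synthesized one at a time, each drawn from a model fitted on the previously synthesized variables, so that the main multivariate relationships are reproduced. The linear model with normal residual draws is a stand-in assumption; the actual CNEF work described below uses variable-specific models and a calibration step for skewed income variables.

import numpy as np

def synthesize(data):
    """data: (n, p) array; returns a synthetic array of the same shape."""
    rng = np.random.default_rng(123)
    n, p = data.shape
    synth = np.empty_like(data, dtype=float)
    synth[:, 0] = rng.choice(data[:, 0], size=n)            # first variable: resample its margin
    for j in range(1, p):
        X = np.column_stack([np.ones(n), data[:, :j]])
        coef, *_ = np.linalg.lstsq(X, data[:, j], rcond=None)
        resid_sd = np.std(data[:, j] - X @ coef)
        Xs = np.column_stack([np.ones(n), synth[:, :j]])     # condition on synthetic values
        synth[:, j] = Xs @ coef + rng.normal(0, resid_sd, n)
    return synth

rng = np.random.default_rng(10)
real = rng.multivariate_normal([0, 0, 0], [[1, .6, .3], [.6, 1, .5], [.3, .5, 1]], 2000)
print(np.corrcoef(real, rowvar=False).round(2))              # correlations in the real data
print(np.corrcoef(synthesize(real), rowvar=False).round(2))  # roughly reproduced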

We have continued synthesizing data from the Cross-National Equivalent File (CNEF), a project in which five other countries are involved. The Canadian portion of the CNEF is a subset of about 40 variables from the Survey of Labour and Income Dynamics. Below is a summary of the main accomplishments during the year:

  • The first year (1999) of the six-year panel of the CNEF has been synthesized. Basic validations have been made for each synthesized variable (e.g., comparisons of cross-tabulations between the synthetic and real data), and complete documentation has been written describing the strategy and the issues related to each variable.
  • A calibration strategy has been developed for the skewed income variables to ensure that the first and second moments are captured as much as possible.
  • We also conducted a more thorough validation of the 1999 synthetic file. This involved calculating a recently developed global measure of utility as well as some analysis-specific measures. The results indicate that our synthetic data lead to reasonably accurate estimates (although not perfect in all scenarios).
  • We have started work on the creation of synthetic data for the second year of the panel. Some major issues such as deaths and household splits have been discussed and a strategy put in place.
  • A presentation of our synthetic data methodology and results was given at the Census Bureau interchange in April 2009 in Ottawa.
  • Our methodology and results have also been presented at the last Statistics Canada Symposium in October 2009 and a paper for the proceedings has been written (Bocci and Beaumont, 2009). A more complete paper is being written for possible publication in a refereed journal.

For further information, contact:
Jean-Louis Tambay (613-951-6959, Jean-Louis.Tambay@statcan.gc.ca).