1. Introduction

Takis Merkouris

Matrix sampling is a sampling design in which a long questionnaire is divided into subsets of questions (items), possibly overlapping, and each subset is then administered to one or more distinct random subsamples of an initial sample. In its various forms this design may serve a variety of purposes, such as reducing the length and cost of the survey process and addressing concerns related to respondent burden and data quality associated with a long questionnaire. Matrix sampling has been applied or explored in various fields, primarily in educational assessment and public health studies. A review of previous research on matrix sampling, with discussion of the issues arising in its implementation in surveys, is given in Gonzalez and Eltinge (2007). For recent work on design and estimation for matrix survey sampling, motivated by the potential benefits of such sampling schemes in large-scale surveys, see Raghunathan and Grizzle (1995), Thomas, Raghunathan, Schenker, Katzoff and Johnson (2006), Gonzalez and Eltinge (2008), Chipperfield and Steel (2009, 2011), and references therein. Among the many matrix sampling designs explored in the literature, we distinguish the following four principal designs, which vary in the number of subsamples and the number of sub-questionnaires (overlapping or not) administered to each subsample.

  (a) Different (non-overlapping) sets of questions are administered to different subsamples.
  (b) An additional core set of questions is administered to all subsamples in design (a). There are several reasons for including a core set of items in all subsamples: high precision may be required for some items of special interest; some other items (e.g., demographic characteristics) define subpopulations and may be used in cross-tabulations of survey results; the correlation of the core items with the rest of the items may be used to enhance the precision of estimates for all items.
  (c) A variant of design (a) involving an additional subsample that receives the full questionnaire. It may be viewed as a generalization of the two-phase sampling design. The motivation for this design is to allow for analysis of interaction between sets of questions, by having responses to all questions from the units of the additional subsample, and to enable more efficient estimation.
  (d) An extension of design (c), in which the core set of questions is administered to all subsamples. It embodies all features of the previous three designs.
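
As an illustration only, the four designs (a) to (d) can be laid out as a mapping from subsamples to the blocks of questions they receive. The sketch below is not part of the paper: the function name `matrix_design` and the labels `S1`, `B1` and `core` are hypothetical, and it assumes one subsample per non-overlapping block of questions.

```python
def matrix_design(design, n_blocks=2):
    """Return {subsample label: list of question blocks} for designs (a)-(d).

    Hypothetical helper for illustration: blocks B1..Bn are the
    non-overlapping sets of questions, "core" is the common core set.
    """
    if design not in {"a", "b", "c", "d"}:
        raise ValueError("design must be one of 'a', 'b', 'c', 'd'")
    blocks = [f"B{k}" for k in range(1, n_blocks + 1)]
    # Designs (b) and (d) administer a core set to every subsample.
    core = ["core"] if design in {"b", "d"} else []
    # One subsample per non-overlapping block, as in design (a).
    layout = {f"S{k}": core + [blocks[k - 1]] for k in range(1, n_blocks + 1)}
    # Designs (c) and (d) add a subsample that receives the full questionnaire.
    if design in {"c", "d"}:
        layout[f"S{n_blocks + 1}"] = core + blocks
    return layout


if __name__ == "__main__":
    for d in "abcd":
        print(d, matrix_design(d))
```

With two blocks, design (d) gives `S1` the core and `B1`, `S2` the core and `B2`, and `S3` the core together with both blocks, which makes visible how (d) combines the features of the other three designs.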

A current trend in survey planning relates to a variant of matrix sampling in which a number of distinct surveys with overlapping content are integrated for the benefit of streamlined survey operations, harmonized survey content and data consistency, as well as improved estimation. In this nonstandard matrix sampling framework, the distinct surveys may use subsamples of a large master sample or independent samples from the same population. Such sampling schemes are actively being researched or implemented in various statistical agencies; see, for example, the integration of household surveys in the British Office for National Statistics (Smith 2009) and in the Australian Bureau of Statistics (2011). Although such integration may be viewed as a reverse process to splitting a questionnaire, the structure of the design with respect to the collection of different subsets of data items from different samples is essentially the same as in the standard framework. In the particular case where the samples from constituent surveys are independent, possibly with different sampling designs, the designs (b), (c) and (d) could be characterized as non-nested matrix sampling designs. Note that the advantages of matrix sampling are not always contingent on using subsamples (necessarily dependent) of an initial sample. It may be more practical in certain situations to use independent samples, notwithstanding the possibility of a negligible sample overlap.

In this paper we address the estimation problem in matrix sampling, namely the loss of precision of survey estimates due to not collecting all data items from all sample units. In the nonstandard matrix sampling of the preceding paragraph, the estimation problem is the improvement of the precision of estimates for each constituent survey. For matrix sampling designs (b), (c) and (d), involving overlapping subsets of questions, a dual estimation task is to combine data on common items from different subsamples for improved estimation, and to exploit correlations among items surveyed in different subsamples for more efficient estimation for all items. To this end, estimation involving imputation of the missing values caused by the omitted items in each sub-questionnaire has been explored in Raghunathan and Grizzle (1995) and Thomas et al. (2006). Estimation using a simple weighting adjustment that combines data on common items has been considered by Gonzalez and Eltinge (2008). In the particular case of non-nested design (b), the estimation problem of combining data from independent samples has also been dealt with in the literature; see, for example, Renssen and Nieuwenbroek (1997), Houbiers (2004), Merkouris (2004, 2010), Wu (2004) and Kim and Rao (2012). Non-nested design (d) has been considered in Renssen (1998). We propose an efficient estimation method, based on the principle of best linear unbiased estimation, which produces composite optimal regression estimators of totals by means of a suitable calibration procedure for the sampling weights of the combined sample, when the second-order sample inclusion probabilities are known. A variant of this calibration procedure of more general applicability produces composite generalized regression estimators, which for certain sampling settings are optimal regression estimators. The method exploits correlations of items across the subsamples to improve the efficiency of estimators even for items surveyed in all subsamples. It is also operationally very convenient, producing estimates for all items at population or domain level by means of a simple adaptation of the standard calibration system commonly used in statistical agencies. In introducing the method here, we study in detail the principal designs (c) and (d). Adaptations to more general designs are fairly straightforward.
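
To make the idea of combining data on a common item concrete, the following minimal sketch shows the simplest scalar case of best linear unbiased estimation: two independent, unbiased estimates of the same total, with known variances, are combined with weights inversely proportional to their variances. This is not the calibration procedure proposed in the paper, which handles several items and their correlations; the function name `composite_estimate` and the numerical inputs are assumptions made purely for illustration.

```python
def composite_estimate(t1, v1, t2, v2):
    """Combine two independent unbiased estimates of the same total.

    t1, t2: estimates of the total from two independent samples.
    v1, v2: their (assumed known, positive) variances.
    Returns the minimum-variance linear unbiased combination and its variance.
    """
    if v1 <= 0 or v2 <= 0:
        raise ValueError("variances must be positive")
    # Weight on t1 is proportional to the precision 1/v1 of the first estimate.
    w = v2 / (v1 + v2)
    estimate = w * t1 + (1 - w) * t2
    # Variance of the composite is never larger than min(v1, v2).
    variance = v1 * v2 / (v1 + v2)
    return estimate, variance


if __name__ == "__main__":
    # Made-up numbers: equal variances give the simple average.
    print(composite_estimate(100.0, 4.0, 110.0, 4.0))  # (105.0, 2.0)
```

In this hypothetical example the composite variance (2.0) is half that of either input estimate, which illustrates why pooling information on items common to several subsamples can offset part of the precision lost by not collecting every item from every unit.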

In Sections 2 and 3 we describe the proposed method for design (c). The application of the method to design (d) is described in Section 4. Domain estimation is dealt with in Section 5. A simulation study is presented in Section 6. We conclude with a discussion in Section 7.
