A comparison between nonparametric estimators for finite population distribution functions

1. Introduction

Since the seminal paper of Chambers and Dunstan (1986), several estimators for finite population distribution functions have been proposed. Most of them are based either on different types of fitted values or on different ways of combining them into an estimator. The estimator proposed by Chambers and Dunstan (1986), for example, is based on fitted values derived from a superpopulation model in which the study variable and an auxiliary variable are linked by a linear regression model with independent error components whose variances are assumed to be known. Substituting the fitted values for the unobserved indicator functions in the definition of the population distribution function of the study variable yields the Chambers and Dunstan estimator. Rao, Kovar and Mantel (1990) incorporate design weights into the fitted values of Chambers and Dunstan and use them in a generalized difference estimator. Kuo (1988) uses nonparametric regression to estimate directly the regression relationship between the indicator functions and the auxiliary variable, and obtains fitted values that accommodate virtually any superpopulation model. Like Chambers and Dunstan, she replaces the unobserved indicator functions with their corresponding fitted values and obtains a model-based estimator. Chambers, Dorfman and Wehrly (1993) combine the fitted values of Chambers and Dunstan (1986) and of Kuo (1988) and propose yet another model-based estimator, which aims to be more efficient than the Kuo estimator if the linear superpopulation model assumed by Chambers and Dunstan is true, and which does not suffer from model misspecification bias otherwise.
Following these early works, a large number of proposals have aimed to achieve some gain in efficiency with respect to the Horvitz-Thompson estimator while preserving robustness and, in some cases, one or both of the following desirable properties of the Horvitz-Thompson estimator: (i) it is a linear combination of the sample indicator functions with coefficients that do not depend on the study variable, and (ii) it always gives rise to nondecreasing estimates of the distribution function.
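The following Python sketch illustrates properties (i) and (ii) for the Horvitz-Thompson estimator and for a model-based estimator of the Kuo type. It is a minimal illustration under assumptions that are not taken from the works cited above: the function names are hypothetical, a Gaussian kernel with Nadaraya-Watson weights stands in for the nonparametric smoother, and `x_r` denotes the auxiliary values of the nonsampled units.

```python
import numpy as np

def horvitz_thompson_cdf(y_s, pi_s, N, t):
    """Horvitz-Thompson estimate: (1/N) * sum over the sample of I(y_i <= t) / pi_i."""
    return np.sum((y_s <= t) / pi_s) / N

def nw_weights(x_s, x0, h):
    """Nadaraya-Watson weights at x0 (Gaussian kernel, bandwidth h).
    They are nonnegative, sum to one, and depend on the auxiliary variable only."""
    k = np.exp(-0.5 * ((x_s - x0) / h) ** 2)
    return k / k.sum()

def kuo_type_cdf(y_s, x_s, x_r, t, h):
    """Model-based estimate: observed sample indicators plus fitted values
    for the nonsampled units, divided by the population size N."""
    ind = (y_s <= t).astype(float)
    # Fitted value for unit j: kernel-weighted mean of the sample indicators.
    fitted = np.array([nw_weights(x_s, x0, h) @ ind for x0 in x_r])
    N = len(y_s) + len(x_r)
    return (ind.sum() + fitted.sum()) / N
```

In this sketch both estimates are linear in the sample indicators with coefficients that do not involve the study variable, and both are nondecreasing in `t` because all coefficients are nonnegative. Smoothers with possibly negative weights, such as local linear regression, do not guarantee the second property.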

The present work originates from the idea of improving upon the fitted values proposed by Kuo (1988) by incorporating an estimate of the mean regression function (see Section 2). This idea has been put forward in a recent textbook by Chambers and Clark (2012). It is based on the assumption of an underlying superpopulation model with a smooth regression relationship between the study variable and an auxiliary variable and with smoothly varying error component distributions. According to this idea, the fitted values are the outcome of a two-step procedure. In the first step, the mean regression function is estimated through either parametric or nonparametric regression. In the second step, the distribution functions of the error components are estimated by applying nonparametric regression to the residuals from the first step, in order to accommodate the possibility of smoothly varying error component distributions. Combining both estimates, one may compute fitted values for the indicator functions in the definition of the finite population distribution function of the study variable. Chambers and Clark (2012) analyze the model-based estimator obtained by replacing the unobserved indicator functions with their corresponding fitted values, and they sketch a proof that leads to an expression for the model variance of the resulting estimator. In that proof they assume that the mean regression function is estimated by a consistent estimator and that the contribution of its estimation error to the model variance of the final distribution function estimator can be neglected. In the present work we consider local linear regression for estimating both the model mean regression function and the error component distributions. We provide asymptotic expansions for the model bias and the model variance of the resulting estimator and compare them with those of the Kuo estimator based on local linear regression.
It turns out that the leading terms of the model variances are the same and that, for appropriately chosen bandwidth sequences, the squared model bias of both estimators goes to zero faster than the model variance. Establishing which estimator is asymptotically more efficient from the model-based perspective thus requires knowledge of the second order terms of the model variances. These terms, however, depend on more specific assumptions than those considered in the present work and, at least for the estimator based on the modified fitted values, determining them seems to be no easy task. Which estimator is more efficient from the model-based perspective thus remains an open question.
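The two-step construction of the modified fitted values described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the estimator as formally defined in Section 2: the function names, the Gaussian kernel, and the separate bandwidths `h_mean` and `h_err` are choices made here for the example, and `x_r` again denotes the auxiliary values of the nonsampled units.

```python
import numpy as np

def local_linear(x_s, z_s, x0, h):
    """Local linear regression of z on x, evaluated at x0
    (Gaussian kernel, bandwidth h): intercept of the kernel-weighted
    least-squares line fitted to the points (x_i - x0, z_i)."""
    k = np.exp(-0.5 * ((x_s - x0) / h) ** 2)
    d = x_s - x0
    s0, s1, s2 = k.sum(), (k * d).sum(), (k * d * d).sum()
    return np.sum(k * (s2 - s1 * d) * z_s) / (s0 * s2 - s1 ** 2)

def modified_cdf(y_s, x_s, x_r, t, h_mean, h_err):
    """Model-based estimate of F(t) using two-step fitted values."""
    # Step 1: estimate the mean regression function at the sampled
    # and at the nonsampled values of the auxiliary variable.
    m_s = np.array([local_linear(x_s, y_s, x0, h_mean) for x0 in x_s])
    m_r = np.array([local_linear(x_s, y_s, x0, h_mean) for x0 in x_r])
    resid = y_s - m_s
    # Step 2: for each nonsampled unit j, smooth the residual indicators
    # I(e_i <= t - m_hat(x_j)) against x and evaluate the fit at x_j.
    fitted = np.array([
        local_linear(x_s, (resid <= t - mj).astype(float), xj, h_err)
        for xj, mj in zip(x_r, m_r)
    ])
    N = len(y_s) + len(x_r)
    return (np.sum(y_s <= t) + fitted.sum()) / N
```

Replacing the second step by a direct local linear regression of the indicators I(y_i <= t) on x would give a Kuo-type estimator, which makes the difference between the two kinds of fitted values explicit. Because local linear weights can be negative, the fitted values in this sketch are not constrained to lie in [0, 1].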

In addition to the above model-based estimators, we also analyze the generalized difference estimators based on the design-weighted versions of both types of fitted values. The results in Section 3 show that the convergence rates of their model biases and model variances are the same as those of their model-based counterparts. Design-based properties are discussed to some extent in Section 4, along with the issue of variance estimation. It would of course be of interest to derive and compare asymptotic expansions for the design biases and the design variances. Breidt and Opsomer (2000) derive, under mild conditions, a general expression for the first order term of the design mean square error of local polynomial regression estimators, of which the generalized difference estimator based on the fitted values of Kuo is a special case. The generalized difference estimator based on the modified fitted values does not, however, fall into this class. In line with Särndal, Swensson and Wretman (1992), we conjecture that under broad conditions the first order term of its design mean square error is the same as that of the generalized difference estimator based on the fitted values of Kuo. Formal proofs could perhaps be obtained by adapting and extending some of the results in Wang and Opsomer (2011). To test this conjecture and to compare the performance of the generalized difference and the model-based estimators in various settings, we performed a simulation study whose results are presented in Section 5.
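As a rough illustration of the generalized difference form mentioned above, the sketch below combines design-weighted fitted values of the Kuo type with a design-weighted correction for the sample residuals. The details are assumptions made for this example and are not taken from Sections 3 and 4: the function names are hypothetical, the design weights enter the local linear fit as the kernel weights divided by the inclusion probabilities `pi_s`, and `x_pop` holds the auxiliary values of all population units, sampled or not.

```python
import numpy as np

def weighted_local_linear(x_s, z_s, pi_s, x0, h):
    """Design-weighted local linear fit of z on x at x0: Gaussian kernel
    weights are divided by the inclusion probabilities pi_i."""
    k = np.exp(-0.5 * ((x_s - x0) / h) ** 2) / pi_s
    d = x_s - x0
    s0, s1, s2 = k.sum(), (k * d).sum(), (k * d * d).sum()
    return np.sum(k * (s2 - s1 * d) * z_s) / (s0 * s2 - s1 ** 2)

def generalized_difference_cdf(y_s, x_s, pi_s, x_pop, t, h):
    """Generalized difference estimate of F(t):
    (1/N) * [ sum over the population of the fitted values
              + sum over the sample of (indicator - fitted value) / pi_i ]."""
    ind = (y_s <= t).astype(float)
    fit_pop = np.array([weighted_local_linear(x_s, ind, pi_s, x0, h)
                        for x0 in x_pop])
    fit_s = np.array([weighted_local_linear(x_s, ind, pi_s, x0, h)
                      for x0 in x_s])
    N = len(x_pop)
    return (fit_pop.sum() + np.sum((ind - fit_s) / pi_s)) / N
```

The second sum is a Horvitz-Thompson estimate of the total of the fitting errors, which is what distinguishes this form from the purely model-based estimators. Using the two-step fitted values in place of `fit_pop` and `fit_s` would give the corresponding estimator based on the modified fitted values.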
