1. Introduction

Jianqiang C. Wang, Jean D. Opsomer and Haonan Wang

Bagging, short for “bootstrap aggregating”, is a resampling method originally introduced to improve “weak” learning algorithms. Bagging was proposed by Breiman (1996), who heuristically demonstrated how it improved the performance of tree-based predictors. Since then, bagging has been applied to a wide range of settings and analyzed by many authors. Bühlmann and Yu (2002) showed the smoothing effect of bagging and its variations on hard-decision classification algorithms, and formalized the notion of “unstable predictors”. Chen and Hall (2003) derived theoretical results on bagging estimators defined by estimating equations. Buja and Stuetzle (2006) considered bagging U-statistics, and claimed that bagging “often but not always decreases variance, whereas it always increases bias”. Friedman and Hall (2007) examined the impact of bagging on nonlinear estimators. More recently, Hall and Robinson (2009) discussed the effects of bagging on cross-validation choice of smoothing parameters, and presented intriguing results on improving the order of the cross-validation selected kernel bandwidth by bagging.

The aforementioned literature studied the effects of bagging on various estimators, especially nonlinear, non-differentiable estimators, under the iid (independent and identically distributed) sampling assumption. For dependent data, Lee and Yang (2006) and Inoue and Kilian (2008) studied the effects of bagging on economic time series. The former authors studied the bagging effect on non-differentiable predictors such as sign functions and quantiles, and the latter focused on bagging pretest predictors with application to U.S. consumer price inflation forecasting.

As this brief literature review shows, bagging is a promising method for improving the efficiency of estimators. To date, however, bagging for survey estimators has not been considered. This article is a first exploration of the use of bagging in the survey context, including an evaluation of the potential efficiency gain, a number of theoretical results, and a discussion of implementation and variance estimation issues. In keeping with general survey practice, we will only consider estimators that can be written as functions of Horvitz-Thompson (HT) estimators. More specifically, we will consider the following three types of estimators. Firstly, many commonly used survey estimators can be written as differentiable functions of HT estimators. For instance, the Hájek estimator, the ratio estimator and the generalized regression estimator can all be regarded as differentiable functions of HT estimators. Secondly, there are other survey estimators that are non-differentiable, including the Dunstan and Chambers estimator (Dunstan and Chambers 1986), the Rao-Kovar-Mantel estimator (Rao, Kovar, and Mantel 1990), the endogenous post-stratification estimator (Breidt and Opsomer 2008), and estimators of low-income proportion (Berger and Skinner 2003), among others. Thirdly, other estimators are only defined as solutions to weighted estimating equations. For more information on estimating equations in the survey context, see Godambe and Thompson (2009), Fuller (2009), and references therein.
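To make the first type of estimator concrete, the following is a minimal sketch of how the Hájek and ratio estimators can be assembled from HT estimators. The function names, the toy data and the inclusion probabilities are illustrative assumptions and do not come from the paper.

```python
def ht_total(y, pi):
    """Horvitz-Thompson estimator of a population total: sum of y_i / pi_i."""
    return sum(yi / p for yi, p in zip(y, pi))


def hajek_mean(y, pi):
    """Hajek estimator of a mean: HT total of y divided by HT estimate of N."""
    ones = [1.0] * len(y)
    return ht_total(y, pi) / ht_total(ones, pi)


def ratio_total(y, x, pi, x_total):
    """Ratio estimator of the y total, given a known population total of x."""
    return x_total * ht_total(y, pi) / ht_total(x, pi)


# Toy sample (assumed values): study variable, auxiliary variable,
# and first-order inclusion probabilities.
y = [10.0, 20.0, 30.0]
x = [1.0, 2.0, 3.0]
pi = [0.5, 0.25, 0.25]

t_hat = ht_total(y, pi)                # 10/0.5 + 20/0.25 + 30/0.25
y_bar = hajek_mean(y, pi)              # t_hat divided by (2 + 4 + 4)
t_ratio = ratio_total(y, x, pi, 20.0)  # 20.0 is an assumed known x total
```

Both `hajek_mean` and `ratio_total` are smooth functions of the HT totals they combine, which is what places them in the first class described above.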

While bagging can be considered a type of replication method, it is quite different from bootstrapping and other replication methods that are designed for variance estimation. Unlike these other methods, bagging is introduced to improve the actual estimator itself. The bagging method can be naturally embedded in large-scale complex surveys, since we can take advantage of replication weights that are readily available in many practical surveys. In this paper, we will show how replicates created for bootstrap variance estimation can be modified and used in bagging the original estimator. Unfortunately, one difficulty in implementing bagging in surveys is the lack of a design-based variance estimator. We will discuss a number of proposals on how to estimate the variance of bagged survey estimators, but further work is still required in this area.
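The general idea of reusing replicate weights for bagging can be sketched as follows: evaluate the estimator once per set of replicate weights and average the results. This is only an illustration under simplifying assumptions. The replicate weights here are generated by a naive with-replacement resampling of sample units, which ignores the sampling design, and the weighted median stands in for a generic non-smooth estimator. All function names are hypothetical, and the construction actually used in the paper is the one given in Section 4.

```python
import random


def weighted_median(y, w):
    """Smallest y whose cumulative weight reaches half of the total weight."""
    pairs = sorted(zip(y, w))
    half = sum(w) / 2.0
    cum = 0.0
    for yi, wi in pairs:
        cum += wi
        if cum >= half:
            return yi
    return pairs[-1][0]


def bootstrap_replicate_weights(w, n_reps, rng):
    """Assumed scheme: base weight times a with-replacement resampling count."""
    n = len(w)
    reps = []
    for _ in range(n_reps):
        counts = [0] * n
        for _ in range(n):
            counts[rng.randrange(n)] += 1
        reps.append([wi * c for wi, c in zip(w, counts)])
    return reps


def bagged_estimate(y, replicate_weights, estimator):
    """Average the estimator over all sets of replicate weights."""
    estimates = [estimator(y, rw) for rw in replicate_weights]
    return sum(estimates) / len(estimates)


# Toy sample (assumed values) with base design weights.
rng = random.Random(0)
y = [3.0, 7.0, 1.0, 9.0, 4.0, 6.0]
w = [2.0, 4.0, 4.0, 2.0, 3.0, 5.0]

reps = bootstrap_replicate_weights(w, 200, rng)
unbagged = weighted_median(y, w)
bagged = bagged_estimate(y, reps, weighted_median)
```

The unbagged weighted median can only take values observed in the sample, whereas the bagged version averages over replicates and can fall between observed values, which illustrates the smoothing effect mentioned in the literature review.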

The remainder of this paper is organized as follows. We define our target survey estimators and introduce the bagged version of each estimator in Section 2. We then present the theoretical properties of the bagged estimators in Section 3. Section 4 shows how to use survey replicates to implement bagged versions of estimators, and addresses variance estimation for the resulting bagged estimators. We report on simulation results in Section 5, and conclude the paper with some final remarks in Section 6.
