2. Bagging survey estimators
Jianqiang C. Wang, Jean D. Opsomer and Haonan Wang
2.1 General approach
In this section, we discuss the implementation of bagging in the context of survey estimation. We first introduce the necessary notation. Let $U = \{1, 2, \ldots, N\}$ represent a finite population of size $N$ in which each element $i \in U$ is associated with a vector of measurements, $\mathbf{y}_i$, in the $p$-dimensional Euclidean space $\mathbb{R}^p$. The sampling design $p(\cdot)$ is used to draw a random sample $A \subset U$ of sample size $n$. We denote by $Y_A = \{\mathbf{y}_i : i \in A\}$ the collection of sample observations. Here, the sampling design could be simple random sampling without replacement (SRSWOR), Poisson sampling or a complex design with stratification and/or multiple stages. Under each design, the probability of an element $i$ being included in the sample is denoted by $\pi_i = \Pr(i \in A)$. The population mean of the measurement vector $\mathbf{y}_i$ is denoted by $\bar{\mathbf{y}}_N = N^{-1} \sum_{i \in U} \mathbf{y}_i$. It can be estimated by the Horvitz-Thompson (HT) estimator defined as

$$\hat{\bar{\mathbf{y}}}_\pi = \frac{1}{N} \sum_{i \in U} \frac{\mathbf{y}_i I_i}{\pi_i}, \qquad (2.1)$$

where $I_i$ is the sample membership indicator for the $i$th element. More generally, let $\theta_N$ denote a population quantity of interest, and let $\hat\theta(Y_A)$ be the estimator of $\theta_N$ based on the sample observations $Y_A$. The estimator $\hat\theta(Y_A)$ will be abbreviated as $\hat\theta$ when there is no confusion. As noted in the previous section, we assume that $\hat\theta$ can be written as a function of simpler estimators of the form (2.1).
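In its most general form, the bagging algorithm for survey estimation is as follows:
- For $b = 1, 2, \ldots, B$:
  - Draw the $b$th resample $A_b^*$ from the random sample $A$, and denote the observations in the resample as $Y_{A_b^*}$.
  - Calculate the parameter estimate based on the resample $A_b^*$, denoted by $\hat\theta(Y_{A_b^*})$.
- Average over the replicated estimates $\hat\theta(Y_{A_1^*}), \ldots, \hat\theta(Y_{A_B^*})$ to obtain the bagged survey estimator,

$$\hat\theta_{\mathrm{bag}} = \frac{1}{B} \sum_{b=1}^{B} \hat\theta(Y_{A_b^*}). \qquad (2.2)$$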
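The algorithm above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `bagged_estimate`, `srswor` and `resample_median`, the lognormal toy data and the choices of $k$ and $B$ are all assumptions made for the example, and the sample is treated as an equal-probability sample so that no weights are needed.

```python
import numpy as np

def bagged_estimate(draw_resample, estimate, B, rng):
    """Average B replicated estimates, mirroring the averaging step in (2.2).

    draw_resample(rng) returns integer indices into the sample (one resample).
    estimate(idx) returns the scalar estimate computed from that resample.
    """
    replicates = np.array([estimate(draw_resample(rng)) for _ in range(B)])
    return replicates.mean(), replicates

# Toy usage (hypothetical data): SRSWOR resamples of size k, sample median.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=10.0, sigma=0.5, size=50)  # assumed sample values
k = 25                                            # assumed resample size

def srswor(rng):
    # Draw k of the n sample positions without replacement.
    return rng.choice(len(y), size=k, replace=False)

def resample_median(idx):
    # Replicated estimate: the median of the resampled observations.
    return np.median(y[idx])

theta_bag, replicates = bagged_estimate(srswor, resample_median, B=200, rng=rng)
```

Passing the resampling scheme as a callable keeps the averaging step independent of the design used to draw the resamples.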
In the bagging literature, the resamples $A_b^*$ are often referred to as bootstrap samples (Breiman 1996), and we will do the same here despite the fact that we will not use them for variance estimation.
In the algorithm, the bootstrap samples could be drawn according to the sampling design rather than the empirical distribution of the sample observations, which is more commonly used in the ordinary bagging literature (Breiman 1996) and equivalent to simple random sampling (with or without replacement). For example, if the sample $A$ is drawn using stratified or cluster sampling, such a design scheme could be taken into consideration when selecting the resamples. More generally in the survey context, step 1 of the proposed bagging algorithm can be treated in the framework of two-phase sampling: the first phase corresponds to the original sample $A$ and the second phase to the resample $A_b^*$. Thus the classical expansion estimator for two-phase designs (Särndal, Swensson and Wretman 1997) is implemented in calculating the replicated estimator $\hat\theta(Y_{A_b^*})$. In the resample $A_b^*$, the pseudo inclusion probability $\pi_i^*$ for the $i$th element is $\pi_i^* = \pi_i \, \pi_{i \mid A}$, where $\pi_{i \mid A}$ is the inclusion probability of the $i$th element in resample $A_b^*$ given that it is included in sample $A$. Hence, the bagged estimator is an approximation to the expectation of the two-phase estimator with respect to the second sampling phase, which is also referred to as the bootstrap expectation in ordinary bagging methods (Bühlmann and Yu 2002).
Although a general design for the bootstrap samples is possible, in the theoretical portions of this article we will restrict ourselves to SRSWOR. To broaden the scope of our discussion, in the variance estimation and numerical sections we introduce the case in which the bootstrap samples are drawn by stratified SRSWOR with the same strata as the original sample $A$, which is a useful and realistic extension.
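To make the stratified case concrete, the following sketch draws one resample by SRSWOR within each stratum and forms the pseudo inclusion probabilities as the product of the first-phase probability and the conditional second-phase probability. The helper name `stratified_srswor`, the stratum labels, the first-phase probabilities and the per-stratum resample sizes are hypothetical choices made for illustration.

```python
import numpy as np

def stratified_srswor(strata, k_by_stratum, rng):
    """Draw k_h units by SRSWOR within each stratum h of the sample.

    Returns the resample indices and, for each drawn unit, the conditional
    inclusion probability k_h / n_h given membership in the original sample.
    """
    idx, cond = [], []
    for h, k_h in k_by_stratum.items():
        members = np.flatnonzero(strata == h)      # sample units in stratum h
        idx.append(rng.choice(members, size=k_h, replace=False))
        cond.append(np.full(k_h, k_h / members.size))
    return np.concatenate(idx), np.concatenate(cond)

rng = np.random.default_rng(5)
strata = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # assumed stratum labels
pi = np.where(strata == 0, 0.10, 0.02)             # assumed first-phase probabilities
idx, cond = stratified_srswor(strata, {0: 2, 1: 3}, rng)
pi_star = pi[idx] * cond                           # pseudo inclusion probabilities
```

With these assumed sizes each stratum is resampled at rate one half, so every pseudo inclusion probability is half of the corresponding first-phase probability.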
As an example, we consider the HT estimator as defined in (2.1). The bootstrap resample $A^*$ is drawn from the realized sample $A$ under SRSWOR of size $k$. Under this resampling scheme, the replicated sample estimator is defined as

$$\hat{\bar{\mathbf{y}}}_\pi^* = \frac{1}{N} \sum_{i \in A^*} \frac{\mathbf{y}_i}{\pi_i^*}, \qquad (2.3)$$

where the pseudo inclusion probability is $\pi_i^* = \pi_i k n^{-1}$. Then the bagged version of the classical $\pi$-estimator can be calculated using (2.2). Straightforward calculation shows that the bagged estimator is identical to the original HT estimator if all SRSWOR samples of size $k$ are enumerated in calculating (2.2). The same result holds for any other linear survey estimator. In general, the calculation of the bagged estimator $\hat\theta_{\mathrm{bag}}$ is not as easy. In the rest of this section, we will focus on such calculations for the three types of nonlinear survey estimators discussed in Section 1.
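The equivalence between the bagged and original HT estimators under full enumeration can be checked numerically on a small example. The sketch below uses assumed toy values for $N$, $n$, $k$, the sampled $y_i$ and the inclusion probabilities; it enumerates every SRSWOR resample of size $k$ and averages the replicated estimators of the form (2.3).

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
N, n, k = 100, 6, 3                       # assumed population, sample, resample sizes
y = rng.normal(50.0, 10.0, size=n)        # assumed sampled values y_i, i in A
pi = rng.uniform(0.02, 0.10, size=n)      # assumed first-phase inclusion probabilities

ht = np.sum(y / pi) / N                   # HT estimator of the mean, as in (2.1)

pi_star = pi * k / n                      # pseudo inclusion probabilities pi_i * k / n
replicates = [
    np.sum(y[list(s)] / pi_star[list(s)]) / N   # replicated estimator, as in (2.3)
    for s in combinations(range(n), k)          # every resample of size k
]
ht_bag = np.mean(replicates)              # average over all C(n, k) resamples
```

Each unit appears in a fraction $k/n$ of the enumerated resamples, which cancels the factor $n/k$ in its replicate weight, so `ht_bag` and `ht` agree up to floating-point rounding.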
2.2 Bagging differentiable survey estimators
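For survey estimators that are differentiable functions of HT estimators, the population quantity of interest can also be written as a differentiable function of population means; that is, $\theta_d = m(\bar{\mathbf{y}}_N)$, where $m(\cdot)$ is a known differentiable function. The subscript “$d$” stands for differentiable, in contrast to the non-differentiable (“$nd$”) and estimating equation (“$ee$”) cases coming later. A direct plug-in estimator of $\theta_d$ based on the sample observations $Y_A$ can be written as $\hat\theta_d = m(\hat{\bar{\mathbf{y}}}_\pi)$, where $\hat{\bar{\mathbf{y}}}_\pi$ is defined in (2.1). Thus, the replicated sample version of $\hat\theta_d$ can be expressed as $\hat\theta_d^* = m(\hat{\bar{\mathbf{y}}}_\pi^*)$, where $\hat{\bar{\mathbf{y}}}_\pi^*$ is defined by (2.3). Then the bagged estimator of $\theta_d$, denoted by $\hat\theta_{d,\mathrm{bag}}$, is defined using (2.2).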
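As an illustration of the differentiable case, take a ratio of two means, which is a differentiable function of HT estimators. The sketch below is a hypothetical example, not taken from the article: the helper names `ht_mean` and `ratio`, the simulated data, the equal inclusion probabilities and the values of $k$ and $B$ are all assumptions, and the bagged value is a Monte Carlo approximation based on $B$ SRSWOR resamples rather than a full enumeration.

```python
import numpy as np

def ht_mean(v, p, N):
    # HT estimator of a population mean from values v and inclusion probabilities p.
    return np.sum(v / p) / N

def ratio(y, x, p, N):
    # Plug-in estimator m(HT means) with m(a, b) = a / b.
    return ht_mean(y, p, N) / ht_mean(x, p, N)

rng = np.random.default_rng(2)
N, n, k, B = 1000, 40, 20, 500                 # assumed sizes
x = rng.uniform(1.0, 5.0, size=n)              # assumed auxiliary variable
y = 2.0 * x + rng.normal(0.0, 0.5, size=n)     # assumed study variable
pi = np.full(n, n / N)                         # equal-probability design (assumed)
pi_star = pi * k / n                           # pseudo inclusion probabilities

theta_hat = ratio(y, x, pi, N)                 # original plug-in estimator
replicates = np.empty(B)
for b in range(B):
    idx = rng.choice(n, size=k, replace=False) # SRSWOR resample of size k
    replicates[b] = ratio(y[idx], x[idx], pi_star[idx], N)
theta_bag = replicates.mean()                  # Monte Carlo version of (2.2)
```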
2.3 Bagging explicitly defined non-differentiable estimators
As an example of this type of estimator, consider the estimation of the proportion of households with income below the poverty line for a population. Such a quantity can be written as

$$\theta_{nd} = \frac{1}{N} \sum_{i \in U} I(y_i \le \lambda_N),$$

where $y_i$ is the income value for the $i$th household in the population, and $\lambda_N$ is the population poverty line. It can be seen that this quantity of interest is the mean of indicator kernel functions, and the kernel function is non-differentiable with respect to $\lambda_N$. Here, we consider a more general class in which the kernel is an arbitrary non-differentiable but bounded function. This type of population quantity can be expressed as

$$\theta_{nd} = \frac{1}{N} \sum_{i \in U} h(\mathbf{y}_i, \lambda_N),$$

where $\lambda_N$ is an unknown population parameter, for example, the mean, a quantile or other population quantity, and $h(\mathbf{y}, \lambda)$ is a non-differentiable function of $\lambda$. The population quantity $\theta_{nd}$ generalizes the notion of the proportion below an estimated level and resembles the general form of a U-statistic.
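Wang and Opsomer (2011) studied a class of U-statistic-like estimators, namely, non-differentiable survey estimators,

$$\hat\theta_{nd} = \frac{1}{N} \sum_{i \in A} \frac{1}{\pi_i} h(\mathbf{y}_i, \hat\lambda), \qquad (2.5)$$

where $\hat\lambda$ is a design-based estimator of $\lambda_N$. In the non-survey context, estimators of this type are regarded as “non-differentiable functions of the empirical distribution” (Bickel, Götze and van Zwet 1997). The study of appropriate bootstrap procedures for such estimators was carried out by Beran and Srivastava (1985) and Dümbgen (1993), among others. We define the replicated version of $\hat\theta_{nd}$ based on resample $A^*$ as

$$\hat\theta_{nd}^* = \frac{1}{N} \sum_{i \in A^*} \frac{1}{\pi_i^*} h(\mathbf{y}_i, \hat\lambda^*),$$

where $\hat\lambda^*$ solely depends on the bootstrap resample $A^*$, and the bagged estimator is then defined by averaging replicated estimators. Suppose that the resampling process is SRSWOR of size $k$ and every subsample is selected in calculating the bagging estimator; then the bagging estimator takes the following form after manipulation,

$$\hat\theta_{nd,\mathrm{bag}} = \frac{1}{N} \sum_{i \in A} \frac{1}{\pi_i} \left\{ \binom{n-1}{k-1}^{-1} \sum_{A^* \ni i} h(\mathbf{y}_i, \hat\lambda^*) \right\},$$

which replaces $h(\mathbf{y}_i, \hat\lambda)$ in (2.5) by a “smoothed” quantity obtained by averaging the “jumps” in the estimator. Very often, variance reduction can be achieved by this replacement. The summand in braces is the bootstrap expectation of $h(\mathbf{y}_i, \hat\lambda^*)$ over the resamples that contain element $i$, and can be approximated using the convolution of $h$ with the sampling distribution of $\hat\lambda^*$. Study of the theoretical aspects of $\hat\theta_{nd,\mathrm{bag}}$ is deferred until Section 3.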
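The poverty example can be turned into a small numerical sketch. Here the poverty line is assumed, purely for illustration, to be a fixed fraction of the estimated median income, so that the kernel is an indicator evaluated at an estimated level; the helper names `weighted_median` and `poverty_rate`, the fraction 0.6, the simulated incomes and the values of $k$ and $B$ are all hypothetical. The bagged value is approximated with $B$ SRSWOR resamples rather than a full enumeration.

```python
import numpy as np

def weighted_median(y, w):
    # Smallest y whose weighted cumulative share reaches one half.
    order = np.argsort(y)
    cum = np.cumsum(w[order])
    return y[order][np.searchsorted(cum, 0.5 * cum[-1])]

def poverty_rate(y, pi, N, frac=0.6):
    # Estimated share of units with y below frac * (estimated median).
    w = 1.0 / pi
    lam_hat = frac * weighted_median(y, w)     # estimated poverty line (assumed rule)
    return np.sum(w * (y <= lam_hat)) / N      # weighted mean of indicator kernels

rng = np.random.default_rng(3)
N, n, k, B = 5000, 100, 50, 400                # assumed sizes
y = rng.lognormal(10.0, 0.8, size=n)           # assumed household incomes
pi = np.full(n, n / N)                         # equal-probability design (assumed)
pi_star = pi * k / n                           # pseudo inclusion probabilities

theta_hat = poverty_rate(y, pi, N)             # original estimator
replicates = np.empty(B)
for b in range(B):
    idx = rng.choice(n, size=k, replace=False) # SRSWOR resample of size k
    replicates[b] = poverty_rate(y[idx], pi_star[idx], N)
theta_bag = replicates.mean()                  # Monte Carlo bagged estimator
```

In this equal-probability setting the original estimator can only take values that are multiples of $1/n$, whereas the bagged value averages many such step values, which is one way to see the smoothing of the jumps.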
2.4 Bagging estimators defined by non-differentiable estimating equations
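Finally, we explain how to bag estimators defined by non-differentiable estimating equations. For ease of presentation, we consider a one-dimensional parameter of interest. The population parameter $\theta_{ee}$ of interest is defined as

$$\theta_{ee} = \inf\{\gamma : S_N(\gamma) \ge 0\},$$

where $S_N(\gamma) = N^{-1} \sum_{i \in U} \psi(y_i - \gamma)$ and $\psi(\cdot)$ is a non-differentiable real function. We can estimate the population parameter $\theta_{ee}$ by

$$\hat\theta_{ee} = \inf\{\gamma : \hat S(\gamma) \ge 0\},$$

where $\hat S(\gamma) = \hat N^{-1} \sum_{i \in A} \pi_i^{-1} \psi(y_i - \gamma)$ with $\hat N = \sum_{i \in A} \pi_i^{-1}$. A frequently encountered estimator of this type is the sample quantile defined by inverting the sample cumulative distribution function (Francisco and Fuller 1991), where $\psi(y_i - \gamma) = I(y_i - \gamma \le 0) - \alpha$ for the $\alpha$-quantile.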
Conceptually, there are two versions of bagging $\hat\theta_{ee}$: one is to solve the “bagged estimating equation” defined by bagging the score function, and another is to average over resampled estimates of $\theta_{ee}$. Similarly to the discussion in Section 2.1, the first version results in an estimator equivalent to the original estimator, because the bootstrap expectation of bootstrap versions of $\hat S(\gamma)$ is equal to $\hat S(\gamma)$ for fixed $\gamma$. We therefore only consider the latter version. To define the bagged estimating equation estimator, we first define the replicated score function based on resample $A^*$ as

$$\hat S^*(\gamma) = \frac{1}{\hat N^*} \sum_{i \in A^*} \frac{1}{\pi_i^*} \psi(y_i - \gamma), \qquad \hat N^* = \sum_{i \in A^*} \frac{1}{\pi_i^*}.$$

Then the replicated estimator based on $A^*$ is defined as

$$\hat\theta_{ee}^* = \inf\{\gamma : \hat S^*(\gamma) \ge 0\}.$$

Thus the overall bagging estimator is defined as

$$\hat\theta_{ee,\mathrm{bag}} = \binom{n}{k}^{-1} \sum_{A^*} \hat\theta_{ee}^*,$$

where the average is over all possible without-replacement samples of size $k$ selected from $A$. Chen and Hall (2003) discussed bagging estimators defined by nonlinear estimating equations under the i.i.d. setup, and they stated that bagging does not always improve the precision of estimators under study.
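For the quantile case, averaging the resampled estimates can be sketched as follows. The sketch assumes the indicator score $\psi(u) = I(u \le 0) - \alpha$ and a weighted cumulative distribution function normalized by the sum of the weights; the helper name `ee_quantile`, the simulated data and the sizes $n$ and $k$ are hypothetical. The sample is kept small so that every without-replacement resample of size $k$ can be enumerated.

```python
import numpy as np
from itertools import combinations

def ee_quantile(y, pi, alpha):
    """Smallest gamma at which the weighted score is non-negative.

    With psi(u) = I(u <= 0) - alpha, this is the smallest sampled value whose
    weighted cumulative distribution function reaches alpha.
    """
    w = 1.0 / pi
    order = np.argsort(y)
    cdf = np.cumsum(w[order]) / w.sum()        # weighted CDF at the sorted values
    return y[order][np.searchsorted(cdf, alpha)]

rng = np.random.default_rng(4)
n, k, alpha = 8, 5, 0.5                        # assumed sizes and quantile level
y = rng.exponential(1.0, size=n)               # assumed sampled values
pi = rng.uniform(0.05, 0.20, size=n)           # assumed inclusion probabilities
pi_star = pi * k / n                           # pseudo inclusion probabilities

theta_hat = ee_quantile(y, pi, alpha)          # original estimating equation estimator
replicates = [
    ee_quantile(y[list(s)], pi_star[list(s)], alpha)
    for s in combinations(range(n), k)         # all resamples of size k from A
]
theta_bag = np.mean(replicates)                # average of the replicated estimators
```

Each replicated estimate is one of the sampled values, while their average generally is not, so the bagged estimator takes values on a finer grid than the original quantile estimator.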