3. Theoretical results
Jianqiang C. Wang, Jean D. Opsomer and Haonan Wang
We begin by
briefly describing the asymptotic analysis of the bagging estimators under a
general sampling design from a finite population, i.e., the design-based
setting. We do this under the usual increasing-population framework, where we
consider an increasing sequence of nested populations, say
with finite population means
Associated with the sequence of
populations is a sequence of sampling designs used to draw a random sample
of sample size
with associated inclusion
probabilities
As commonly done in the survey
literature, we suppress the
subscript in the sample
the sample size
and the inclusion probabilities
For the sake of brevity, only the
design-based asymptotic results for bagging the differentiable estimator
and the non-differentiable estimator
are provided. The formal
assumptions under which the results are obtained and the theorems for
differentiable and non-differentiable estimators are in Appendix A.1. The main
result we are able to obtain in this design-based setting is that, if we are
starting from a design-consistent estimator and we let the number of bootstrap
samples
grow with
the bagged versions of the
estimators are also design consistent. This is a key property, since there
would be little reason to consider these estimators if they were not design
consistent.
Unfortunately, the
above design-based results are quite limited and, in particular, do not provide
an asymptotic distribution with which one might be able to perform inference,
another highly desirable property of survey estimators. We therefore also
consider a model-based setting, under which we are able to obtain an asymptotic
variance approximation. In presenting model-based results, we assume the
sampling design selecting the original sample
is an equal probability design,
and the population characteristics can be regarded as an i.i.d.
sample from a superpopulation
distribution. Under this framework, the bagging estimator can be treated as a
U-statistic, so we can apply the theory of U-statistics to obtain an asymptotic
expansion of the bagging estimators. The analysis parallels that of Bühlmann and Yu
(2002) and Buja and Stuetzle (2006). For the current paper, we restrict
ourselves to bootstrap samples of size
where
is bounded and fixed. Under this
assumption, the bagging estimators can be regarded as fixed-degree U-statistics,
for which asymptotic theory has been well developed. A more interesting case is
when the resample size
grows with the sample size
and this leads to infinite-degree
U-statistics. Infinite-degree U-statistics have applications in studying the
Kaplan-Meier estimator and
bootstrap estimators;
readers are referred to Frees (1989), Heilig (1997), Heilig and Nolan (2001),
and the references therein for their statistical properties. Schick and Wefelmeyer
(2004) studied the statistical properties of infinite-degree U-statistics
constructed from moving averages of innovations in time series. Studying
bagging estimators as infinite-degree U-statistics is beyond
the scope of the current paper, so we limit ourselves to a
fixed and bounded bootstrap sample size in the model-based setting.
We first consider
the bagged estimator (2.5). Under SRSWOR, estimator (2.5) can be simplified to
and the bagged version of
is defined as
where
only depends on resample
For ease of presentation, we take
as the sample mean. In this case,
straightforward algebra reveals that
where
is the collection of subsets of
size
from set
The estimator
is a
U-statistic with kernel
provided that
remains finite.
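To make the U-statistic form concrete, the following minimal Python sketch averages a statistic over every without-replacement subset of a fixed size, and compares that with the usual Monte Carlo version based on a finite number of random resamples. The names `kernel`, `bagged_complete`, `bagged_monte_carlo`, the symbols `k` (resample size) and `B` (number of resamples), and the particular non-smooth statistic used as the kernel are all assumptions made for illustration; they are not taken from the paper and need not coincide with estimator (2.5).

```python
from itertools import combinations
from math import comb
import random


def kernel(values):
    """Hypothetical non-smooth statistic (an assumption, not the paper's):
    share of values falling below 60% of the subsample mean."""
    m = sum(values) / len(values)
    return sum(v < 0.6 * m for v in values) / len(values)


def bagged_complete(y, k, stat=kernel):
    """Average `stat` over all size-k subsets of y (a complete U-statistic)."""
    total = 0.0
    for subset in combinations(y, k):
        total += stat(subset)
    return total / comb(len(y), k)


def bagged_monte_carlo(y, k, B, stat=kernel, seed=0):
    """Approximate the complete average with B random SRSWOR resamples."""
    rng = random.Random(seed)
    return sum(stat(rng.sample(y, k)) for _ in range(B)) / B
```

When `stat` is the plain mean, `bagged_complete` reduces to the sample mean of `y`, which is a convenient sanity check. Complete enumeration is only feasible for small samples, which is why the Monte Carlo version is the practical choice.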
One can see that
the bagging estimator
is a symmetric statistic of
and standard theory on symmetric
statistics (Lee 1990) applies. The results are stated in Theorem 1, with assumptions
and proofs in Appendix A.2.
Theorem 1 Under
Assumptions M.1-M.4 on the superpopulation distribution, sampling and
resampling designs,
where the limiting value
the asymptotic variance
and
As indicated by (3.3),
the asymptotic variance of the bagging estimator depends on unknown functions
and
which are expectations of
with respect to the
superpopulation distribution. In
and
is calculated from
together with an arbitrary vector
The expectation is with respect
to the distribution of
random vectors
This high-dimensional expectation
is difficult to calculate and may not have an explicit expression in general.
The exact form of
and
cannot be obtained but can be
approximated via a resampling-based approach. The unknown functions
and
are defined as expectations of the
respective quantities with respect to the superpopulation distribution, which
can be approximated by expectations with respect to the empirical
distribution.
The model-based
asymptotic variance can be estimated as part of the bagging process. We can
calculate the integrands
and
based on each bootstrap sample,
with
denoting where we want to
evaluate
and
and
denoting resampled values. Then
we can average each quantity to approximate the expectation. Finally, the
variance can be estimated by computing the sample variance of the expectations
evaluated at each of the sample points. For nonsmooth estimators such as the ones
considered here, the smoothed bootstrap is often recommended for
variance approximation (Efron 1979; Davison and Hinkley 1997). We apply the
smoothed bootstrap, adding a small amount of noise to each resampled value to
smooth the underlying function. The detailed algorithm is explained through an
example in Section 5.
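The resampling recipe described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm of Section 5: it uses the generic first-order (Hájek projection) variance approximation for a U-statistic of degree `k`, namely `k**2 * Var(h1) / n`, which may differ from expression (3.3). The function name `projection_variance`, the Gaussian smoothing noise, and the `bandwidth` parameter are hypothetical choices.

```python
import random
import statistics


def projection_variance(y, k, stat, B=200, bandwidth=0.0, seed=0):
    """Resampling estimate of k^2 * Var(h1(Y)) / n, where h1(y_i) is the
    expectation of stat(y_i, Y_2, ..., Y_k) under the empirical distribution.

    Assumption: Gaussian noise with standard deviation `bandwidth` is added
    to each resampled value (a simple smoothed-bootstrap variant)."""
    rng = random.Random(seed)
    n = len(y)
    h1 = []
    for i, yi in enumerate(y):
        others = y[:i] + y[i + 1:]           # resample from the remaining points
        vals = []
        for _ in range(B):
            draw = rng.sample(others, k - 1)  # SRSWOR resample of size k - 1
            if bandwidth > 0:
                draw = [v + rng.gauss(0.0, bandwidth) for v in draw]
            vals.append(stat([yi] + draw))    # integrand evaluated at y_i
        h1.append(sum(vals) / B)              # average approximates the expectation
    # sample variance of the approximated expectations across sample points
    return k ** 2 * statistics.variance(h1) / n
```

With `k = 1` and the mean as `stat`, the function returns the usual variance estimate of a sample mean, `statistics.variance(y) / len(y)`, which gives a quick check that the scaling is right.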
We now study the
model-based results for bagging estimators defined by estimating equations (2.7).
A special case in this framework is bagging sample quantiles, which was studied
by Knight and Bassett (2002). They considered both
bootstrap and SRSWOR for resampling, and studied the effects of bagging on the
remainder term in the Bahadur representation of quantiles (Bahadur 1966). We
take a slightly different perspective and treat the bagging estimator as a
U-statistic. Assumptions and proof are again in Appendix A.2. Note that
Assumption M.5 requires that the non-differentiable estimating function have a
smooth limit. In the next theorem, we provide a linearization of the bagged
estimating-equation estimator and give an expression for its asymptotic
variance.
Theorem 2 Under
Assumptions M.1-M.3 and M.5, the following asymptotic result holds for the
bagged estimating equation estimator (2.7),
where
denotes the asymptotic limit of population
quantity
the asymptotic variance of
is
and
As we saw for the
bagged estimator (3.1), the asymptotic results in Theorem 2 involve an unknown
function. This function can again be approximated by resampling, taking
advantage of the available replicate samples.
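As an illustration of the sample-quantile special case mentioned earlier in this section, the sketch below bags a quantile by averaging the quantiles of random resamples, drawn either with replacement (bootstrap) or without (SRSWOR). This is one simple reading of "bagging sample quantiles" and is an assumption: the estimator defined through (2.7) may instead bag the estimating function itself. The function names and the parameters `k` and `B` are hypothetical.

```python
import math
import random


def sample_quantile(values, p):
    """Smallest order statistic whose empirical CDF is at least p."""
    ordered = sorted(values)
    idx = max(math.ceil(p * len(ordered)) - 1, 0)
    return ordered[idx]


def bagged_quantile(y, p, k, B=500, replace=False, seed=0):
    """Average the p-th quantile over B resamples of size k.

    replace=False draws SRSWOR resamples; replace=True draws bootstrap
    resamples. Averaging resample quantiles is an assumed bagging rule."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(B):
        if replace:
            draw = rng.choices(y, k=k)
        else:
            draw = rng.sample(y, k)
        total += sample_quantile(draw, p)
    return total / B
```

With `replace=False` and `k` equal to the sample size, every resample is the full sample, so the bagged value coincides with the ordinary sample quantile; smaller `k` averages over more distinct order statistics and so smooths the estimator.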