2. Bagging survey estimators

Jianqiang C. Wang, Jean D. Opsomer and Haonan Wang

2.1 General approach

In this section, we discuss the implementation of bagging in the context of survey estimation. We first introduce the necessary notation. Let $U$ represent a finite population of size $N,$ in which each element $i \in U$ is associated with a vector of measurements, $y_{i},$ in the $q$-dimensional Euclidean space $ℝ^{q}.$ The sampling design $p (\cdot)$ is used to draw a random sample $A \subseteq U$ of size $n.$ We denote by $Y = \{y_{i} | i \in A\}$ the collection of sample observations. Here, the sampling design could be simple random sampling without replacement (SRSWOR), Poisson sampling or a complex design with stratification and/or multiple stages. Under each design, the probability of an element $i$ being included in the sample is denoted by $π_{i}.$

The population mean of the measurement vector $y$ is denoted by $μ .$ It can be estimated by the Horvitz-Thompson (HT) estimator defined as

$\hat{μ} = \frac{1}{N} \sum_{i \in A} \frac{y_{i}}{π_{i}} = \frac{1}{N} \sum_{i = 1}^{N} \frac{y_{i}}{π_{i}} I_{i}, (2.1)$

where $I_{i}$ is the sample membership indicator for the $i$-th element. More generally, let $θ$ denote a population quantity of interest, and let $\hat{θ} (Y)$ be the estimator of $θ$ based on the sample observations $Y.$ The estimator $\hat{θ} (Y)$ will be abbreviated as $\hat{θ}$ when there is no risk of confusion. As noted in the previous section, we assume that $\hat{θ}$ can be written as a function of simpler estimators of the form (2.1).
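The HT estimator (2.1) can be sketched in a few lines of Python. This is a minimal illustration assuming NumPy; the function name `ht_mean` and its argument names are hypothetical and chosen only for this example.

```python
import numpy as np

def ht_mean(y, pi, N):
    """Horvitz-Thompson estimator of the population mean, as in (2.1).

    y  : (n,) or (n, q) array of observations for the sampled elements
    pi : (n,) array of first-order inclusion probabilities pi_i
    N  : population size
    """
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    # Weight each sampled observation by 1/pi_i, sum over the sample,
    # and divide by the population size N.
    return (y.T / pi).T.sum(axis=0) / N
```

Under SRSWOR, where $π_{i} = n / N$ for every element, this reduces to the ordinary sample mean.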

In its most general form, the bagging algorithm for survey estimation is as follows:

For $b = 1, 2, \dots, B$:
1. Draw resample $A_{b}$ from the random sample $A,$ and denote the observations in the resample as $Y_{b}^{*} = \{y_{i} | i \in A_{b}\}.$
2. Calculate the parameter estimate based on the resample $A_{b},$ denoted by $\hat{θ} (Y_{b}^{*}) .$
Average over the replicated estimates $\hat{θ} (Y_{1}^{*}), \hat{θ} (Y_{2}^{*}), \dots, \hat{θ} (Y_{B}^{*})$ to obtain the bagged survey estimator,
${\hat{θ}}_{b a g} = \frac{1}{B} \sum_{b = 1}^{B} \hat{θ} (Y_{b}^{*}) . (2.2)$

In the bagging literature, the resamples $A_{b}$ are often referred to as bootstrap samples (Breiman 1996), and we will do the same here despite the fact that we will not use them for variance estimation.

In the algorithm, the bootstrap samples could be drawn according to the sampling design rather than from the empirical distribution of the sample observations; the latter is more commonly used in the ordinary bagging literature (Breiman 1996) and is equivalent to simple random sampling (with or without replacement). For example, if the sample $A$ is drawn using stratified or cluster sampling, such a design scheme could be taken into consideration when selecting the resamples. More generally in the survey context, step 1 of the proposed bagging algorithm can be treated in the framework of two-phase sampling: the first phase corresponds to the original sample $A$ and the second phase to the resample $A_{b}.$ Thus the classical expansion estimator for two-phase designs (Särndal, Swensson and Wretman 1997) is used to calculate the replicated estimator $\hat{θ} (Y_{b}^{*}).$ In the resample $A_{b},$ the pseudo inclusion probability for the $i$-th element is $π_{i}^{*} = π_{i} π_{i | A},$ where $π_{i | A} = \Pr (i \in A_{b} | i \in A)$ is the inclusion probability of the $i$-th element in resample $A_{b}$ given that it is included in sample $A.$ Hence, the bagged estimator is an approximation to the expectation of the two-phase estimator with respect to the second sampling phase, which is also referred to as the bootstrap expectation in ordinary bagging methods (Bühlmann and Yu 2002). Although a general design for the bootstrap samples is possible, in the theoretical portions of this article we restrict ourselves to SRSWOR. To broaden the scope of our discussion, in the variance estimation and numerical sections we introduce the case in which the bootstrap samples are drawn by stratified SRSWOR with the same strata as the original sample $A,$ which is a useful and realistic extension.
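The algorithm leading to (2.2) can be sketched as follows for the SRSWOR resampling case, in which $π_{i | A} = k / n$ for a resample of size $k.$ This is a hedged illustration assuming NumPy, not a reference implementation: the function name `bagged_estimate`, its arguments, and the convention that `estimator` takes `(y, pi, N)` are all assumptions made for this example.

```python
import numpy as np

def bagged_estimate(estimator, y, pi, N, k, B, rng=None):
    """Monte Carlo version of the bagged estimator (2.2) under SRSWOR resampling.

    estimator : callable (y, pi, N) -> estimate (hypothetical interface)
    y, pi     : sample observations and their inclusion probabilities
    N         : population size
    k         : resample size (k <= n)
    B         : number of resamples
    rng       : seed or numpy Generator (optional)
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    n = len(pi)
    replicates = []
    for _ in range(B):
        # Step 1: draw a resample A_b of size k from A without replacement.
        idx = rng.choice(n, size=k, replace=False)
        # Pseudo inclusion probabilities pi_i * pi_{i|A}, with pi_{i|A} = k/n.
        pi_star = pi[idx] * k / n
        # Step 2: evaluate the estimator on the resample.
        replicates.append(estimator(y[idx], pi_star, N))
    # Average the B replicated estimates.
    return np.mean(replicates, axis=0)
```

With a finite $B$ the result is a Monte Carlo approximation to the average over all possible resamples; a stratified resampling design would require replacing the `rng.choice` step and the computation of `pi_star`.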

As an example, we consider the HT estimator as defined in (2.1). The bootstrap resamples are drawn from the realized sample $A$ by SRSWOR of size $k.$ Under this resampling scheme, the replicated sample estimator is defined as

$\hat{μ} (Y_{b}^{*}) = \frac{1}{N} \sum_{i \in A_{b}} \frac{y_{i}}{π_{i}^{*}}, (2.3)$

where the pseudo inclusion probability $π_{i}^{*} = π_{i} π_{i | A} = k π_{i} / n .$ Then the bagged version of the classical $π^{*}$ -estimator can be calculated using (2.2). Straightforward calculation shows that the bagged estimator is identical to the original HT estimator if all SRSWOR samples of size $k$ are enumerated in calculating (2.2). The same result holds for any other linear survey estimator. In general, the calculation of the bagged estimator ${\hat{θ}}_{b a g}$ is not as easy. In the rest of this section, we will focus on such calculations for the three types of nonlinear survey estimators discussed in Section 1.
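The equivalence between the bagged and original HT estimators can be checked numerically on a small sample by enumerating every SRSWOR resample of size $k.$ The sketch below assumes NumPy; the data values and the function names are made up purely for illustration. The identity follows because each element of $A$ appears in $\binom{n - 1}{k - 1}$ of the $\binom{n}{k}$ resamples and $\binom{n}{k} k / n = \binom{n - 1}{k - 1}.$

```python
from itertools import combinations
from math import comb

import numpy as np

def ht_mean(y, pi, N):
    """HT estimator of the population mean, as in (2.1) and (2.3)."""
    return np.sum(y / pi) / N

def bagged_ht_all_resamples(y, pi, N, k):
    """Bagged HT estimator (2.2), averaging over all SRSWOR resamples of size k."""
    n = len(y)
    total = 0.0
    for subset in combinations(range(n), k):
        idx = list(subset)
        # Pseudo inclusion probability pi_i^* = k * pi_i / n.
        total += ht_mean(y[idx], pi[idx] * k / n, N)
    return total / comb(n, k)

# Illustrative (made-up) sample of size n = 6 from a population of size N = 30.
y = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
pi = np.array([0.2, 0.1, 0.3, 0.1, 0.25, 0.4])
N = 30

for k in range(1, len(y) + 1):
    # The two printed values should agree up to floating-point error.
    print(k, ht_mean(y, pi, N), bagged_ht_all_resamples(y, pi, N, k))
```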

2.2 Bagging differentiable survey estimators

For survey estimators that are differentiable functions of HT estimators, the population quantity of interest can also be written as a differentiable function of population means; that is, $θ_{d} = m (μ),$ where $m (\cdot)$ is a known differentiable function. The subscript “$d$” stands for differentiable, in contrast to the non-differentiable $(θ_{n d})$ and estimating equation $(θ_{e e})$ cases introduced later. A direct plug-in estimator of $θ_{d},$ based on the sample observations $Y,$ can be written as

${\hat{θ}}_{d} = m (\hat{μ}), (2.4)$

where $\hat{μ}$ is defined in (2.1). Thus, the replicated sample version of ${\hat{θ}}_{d}$ can be expressed as

${\hat{θ}}_{d} (Y_{b}^{*}) = m (\hat{μ} (Y_{b}^{*})),$

where $\hat{μ} (Y_{b}^{*})$ is defined by (2.3). Then the bagged estimator of $θ_{d},$ denoted by ${\hat{θ}}_{d, b a g},$ is defined using (2.2).
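As an illustration, take $m (\cdot)$ to be the ratio of two population means, a common differentiable function of HT estimators. The sketch below assumes NumPy and SRSWOR resampling of size $k$; the choice of the ratio and the names `ratio` and `bagged_plugin` are assumptions made for this example only.

```python
import numpy as np

def ht_mean(y, pi, N):
    """HT estimator of a q-dimensional mean; y has shape (n, q)."""
    return (y / pi[:, None]).sum(axis=0) / N

def ratio(mu):
    """Example of a differentiable m: ratio of the first to the second mean."""
    return mu[0] / mu[1]

def bagged_plugin(m, y, pi, N, k, B, seed=0):
    """Monte Carlo bagged plug-in estimator: average of m(mu_hat(Y_b^*)) over B resamples."""
    rng = np.random.default_rng(seed)
    n = len(pi)
    reps = np.empty(B)
    for b in range(B):
        # SRSWOR resample of size k from the sample.
        idx = rng.choice(n, size=k, replace=False)
        # Replicated HT estimator (2.3) with pi_i^* = k * pi_i / n,
        # then the plug-in step m(.).
        reps[b] = m(ht_mean(y[idx], pi[idx] * k / n, N))
    return reps.mean()
```

Because `ratio` is nonlinear, the bagged value generally differs from the plug-in estimate $m (\hat{μ})$ when $k < n,$ unlike the linear case.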

2.3 Bagging explicitly defined non-differentiable estimators

As an example of this type of estimator, consider the estimation of the proportion of households with income below the poverty line for a population. Such a quantity can be written as $(1 / N) \sum_{i = 1}^{N} I (y_{i} \leq λ_{N}),$ where $y_{i}$ is the income value for the $i$-th household in the population, and $λ_{N}$ is the population poverty line. This quantity of interest is the mean of indicator kernel functions, and the kernel function is non-differentiable with respect to $λ_{N}.$ Here, we consider a more general class in which the kernel is an arbitrary non-differentiable but bounded function. This type of population quantity can be expressed as

$θ_{n d} = \frac{1}{N} \sum_{i = 1}^{N} h (y_{i} - λ_{N}),$

where $λ_{N}$ is an unknown population parameter, for example, the mean, a quantile or other population quantity, and $h (y - λ) : ℝ^{p} \to ℝ$ is a non-differentiable function of $λ .$ The population quantity $θ_{n d}$ generalizes the notion of the proportion below an estimated level and resembles the general form of a U-statistic.

Wang and Opsomer (2011) studied a class of U-statistics-like estimators, namely, non-differentiable survey estimators,

${\hat{θ}}_{n d} = \frac{1}{N} \sum_{i \in A} \frac{1}{π_{i}} h (y_{i} - \hat{λ}), (2.5)$

where $\hat{λ}$ is a design-based estimator of $λ_{N} .$ In the non-survey context, estimators of this type are regarded as “non-differentiable functions of the empirical distribution” (Bickel, Götze and van Zwet 1997). The study of appropriate bootstrap procedures for such estimators was carried out by Beran and Srivastava (1985) and Dümbgen (1993), among others. We define the replicated version of ${\hat{θ}}_{n d}$ based on resample $A_{b}$ as

${\hat{θ}}_{n d} (Y_{b}^{*}) = \frac{1}{N} \sum_{i \in A_{b}} \frac{1}{π_{i}^{*}} h (y_{i} - \hat{λ} (Y_{b}^{*})),$

where $\hat{λ} (Y_{b}^{*})$ depends solely on the bootstrap resample $A_{b},$ and the bagged estimator is then defined by averaging the replicated estimators. If the resampling process is SRSWOR of size $k$ and every possible subsample is used in calculating the bagged estimator, then after some manipulation the bagged estimator takes the form

${\hat{θ}}_{n d, b a g} = \frac{1}{N} \sum_{i \in A} \frac{1}{π_{i} \binom{n - 1}{k - 1}} \sum_{A_{b} \ni i} h (y_{i} - \hat{λ} (Y_{b}^{*})), (2.6)$

which replaces $h (y_{i} - \hat{λ})$ in (2.5) by a “smoothed” quantity $\sum_{A_{b} \ni i} h (y_{i} - \hat{λ} (Y_{b}^{*})) / \binom{n - 1}{k - 1},$ obtained by averaging the “jumps” in the estimator. Very often, variance reduction can be achieved by this replacement. The summand $\sum_{A_{b} \ni i} h (y_{i} - \hat{λ} (Y_{b}^{*})) / \binom{n - 1}{k - 1}$ is the bootstrap expectation of $h (y_{i} - \cdot)$ and can be approximated using the convolution of $h (y_{i} - \cdot)$ with the sampling distribution of $\hat{λ} (Y_{b}^{*}).$ Study of the theoretical aspects of ${\hat{θ}}_{n d, b a g}$ is deferred until Section 3.
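The agreement between the direct average of replicated estimators and the smoothed form (2.6) can be checked by full enumeration on a small sample. In the sketch below, which assumes NumPy, the kernel is the indicator $h (u) = I (u \leq 0)$ and $\hat{λ}$ is taken to be a fixed fraction $c$ of the HT mean; both choices, the value $c = 0.6,$ and all function names are assumptions made for illustration.

```python
from itertools import combinations
from math import comb

import numpy as np

def h(u):
    """Indicator kernel I(u <= 0): 1 when y_i is at or below the threshold."""
    return (u <= 0).astype(float)

def lam_hat(y, pi, N, c=0.6):
    """Assumed threshold estimator: a fraction c of the HT mean."""
    return c * np.sum(y / pi) / N

def bagged_direct(y, pi, N, k):
    """Average of the replicated estimators over all SRSWOR resamples of size k."""
    n = len(y)
    total = 0.0
    for subset in combinations(range(n), k):
        idx = list(subset)
        pi_star = pi[idx] * k / n          # pseudo inclusion probabilities
        lam_b = lam_hat(y[idx], pi_star, N)
        total += np.sum(h(y[idx] - lam_b) / pi_star) / N
    return total / comb(n, k)

def bagged_smoothed(y, pi, N, k):
    """Form (2.6): HT-type sum of the smoothed kernel values."""
    n = len(y)
    smooth = np.zeros(n)
    for subset in combinations(range(n), k):
        idx = list(subset)
        lam_b = lam_hat(y[idx], pi[idx] * k / n, N)
        # Accumulate h(y_i - lam_hat(Y_b^*)) over resamples containing i.
        smooth[idx] += h(y[idx] - lam_b)
    smooth /= comb(n - 1, k - 1)
    return np.sum(smooth / pi) / N
```

The array `smooth` holds, for each sampled element, a value between 0 and 1 in place of the original 0/1 indicator, which is the smoothing effect described above.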

2.4 Bagging estimators defined by non-differentiable estimating equations

Finally, we explain how to bag estimators defined by non-differentiable estimating equations. For ease of presentation, we consider a one-dimensional parameter of interest. The population parameter $θ_{e e}$ of interest is defined as

$θ_{e e} = \inf \{γ : S (γ) \geq 0\},$

where

$S (γ) = \frac{1}{N} \sum_{i = 1}^{N} ψ (y_{i} - γ),$

and $ψ (\cdot)$ is a non-differentiable real function. We can estimate the population parameter $θ_{e e}$ by ${\hat{θ}}_{e e},$ where

${\hat{θ}}_{e e} = \inf \{γ : \hat{S} (γ) \geq 0\}$

with

$\hat{S} (γ) = \frac{1}{N} \sum_{i \in A} \frac{1}{π_{i}} ψ (y_{i} - γ) .$

A frequently encountered estimator of this type is the sample quantile defined by inverting the sample cumulative distribution function (Francisco and Fuller 1991), where $ψ (y_{i} - γ) = I_{(y_{i} \leq γ)} - α$ for the $α$ -quantile.
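For the quantile case, $\hat{S} (γ)$ is a non-decreasing step function of $γ$ that jumps only at sampled values, so the infimum is the smallest sampled value at which the cumulative design weight reaches a fraction $α$ of the total weight. A minimal sketch assuming NumPy follows; the function name `ee_quantile` is hypothetical. The factor $1 / N$ is omitted because it does not affect the sign of $\hat{S} (γ).$

```python
import numpy as np

def ee_quantile(y, pi, alpha):
    """Solve inf{gamma : S_hat(gamma) >= 0} with psi(y - gamma) = I(y <= gamma) - alpha."""
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(pi, dtype=float)     # design weights 1/pi_i
    order = np.argsort(y)
    y_sorted = y[order]
    cum_w = np.cumsum(w[order])               # weighted count of y_i <= gamma
    target = alpha * cum_w[-1]                # alpha times the total weight
    # First sorted position where the cumulative weight reaches the target.
    j = np.searchsorted(cum_w, target, side="left")
    # Guard against floating-point overshoot when alpha is close to 1.
    j = min(j, len(y_sorted) - 1)
    return y_sorted[j]
```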

Conceptually, there are two ways of bagging ${\hat{θ}}_{e e}$: one is to solve the “bagged estimating equation” defined by bagging the score function, and the other is to average over resampled estimates of ${\hat{θ}}_{e e}.$ As in the discussion of linear estimators in Section 2.1, the first version results in an estimator equivalent to the original estimator, because the bootstrap expectation of the replicated score function is equal to $\hat{S} (γ)$ for fixed $γ.$ We therefore only consider the latter version. To define the bagged estimating equation estimator, we first define the replicated score function ${\hat{S}}_{b} (γ)$ based on resample $A_{b}$ as

${\hat{S}}_{b} (γ) = \frac{1}{N} \sum_{i \in A_{b}} \frac{1}{π_{i}^{*}} ψ (y_{i} - γ) .$

Then the replicated estimator based on $A_{b}$ is defined as ${\hat{θ}}_{e e} (Y_{b}^{*}) = \inf \{γ : {\hat{S}}_{b} (γ) \geq 0\}.$ Thus the overall bagging estimator is defined as

${\hat{θ}}_{e e, b a g} = \frac{1}{\binom{n}{k}} \sum {\hat{θ}}_{e e} (Y_{b}^{*}), (2.7)$

where the average is over all possible without-replacement samples of size $k$ selected from $A.$ Chen and Hall (2003) discussed bagging estimators defined by nonlinear estimating equations under the i.i.d. setup, and they stated that bagging does not always improve the precision of the estimators under study.
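For a small sample, (2.7) can be evaluated exactly by enumerating all resamples. The sketch below assumes NumPy and uses the quantile score $ψ (y_{i} - γ) = I_{(y_{i} \leq γ)} - α$ as the example; the function names are hypothetical.

```python
from itertools import combinations
from math import comb

import numpy as np

def ee_quantile(y, pi, alpha):
    """inf{gamma : S_hat(gamma) >= 0} for the alpha-quantile score function."""
    order = np.argsort(y)
    cum_w = np.cumsum(1.0 / pi[order])        # cumulative design weights
    j = np.searchsorted(cum_w, alpha * cum_w[-1], side="left")
    return y[order][min(j, len(y) - 1)]

def bagged_ee_quantile(y, pi, alpha, k):
    """Bagged estimator (2.7): average over all SRSWOR resamples of size k."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    n = len(y)
    total = 0.0
    for subset in combinations(range(n), k):
        idx = list(subset)
        # Replicated estimator using pseudo inclusion probabilities k * pi_i / n.
        total += ee_quantile(y[idx], pi[idx] * k / n, alpha)
    return total / comb(n, k)
```

Full enumeration is feasible only for small $n$; for larger samples, a random subset of $B$ resamples would give a Monte Carlo approximation as in (2.2).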

Date modified: 2017-09-20

Survey Methodology