3. Variance estimation for the one-step calibration estimator
Phillip S. Kott and Dan Liao
In this section,
we let
be the calibration-weighted estimator for
where
when
is the calibration weight, and
is conveniently defined to be 0 when
The weight-adjustment function
is defined implicitly by equation (2.4), and
is again chosen so that the calibration
equation (2.5) holds for either
or
We propose the
following estimator for the variance
where
is the joint selection probability of
and
under the original sampling design,
when
and 0 otherwise,
and
We will show that
in equation (3.1) can be nearly unbiased in
some sense if either a response model (Section 3.1) or prediction model
holds (Section 3.2).
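To make the weighting mechanism concrete, here is a minimal generic sketch of calibration weighting under nonresponse. The exponential (raking-type) adjustment function, the Newton solver, and all simulated data are assumptions made only for illustration; the paper's weight-adjustment function is defined implicitly by its equation (2.4) and need not be exponential.

```python
import numpy as np

# Minimal sketch of calibration weighting under nonresponse (illustrative
# only).  We assume an exponential, raking-type adjustment alpha(v) = exp(v);
# the paper's alpha is defined implicitly by equation (2.4).
# d: design weights, x: calibration variables, r: response indicators.

def calibrate_weights(x, d, r, tol=1e-8, max_iter=100):
    """Find g so that sum_k w_k x_k, with w_k = d_k exp(x_k'g) for
    respondents and w_k = 0 otherwise, matches the full-sample weighted
    x-totals (Newton's method)."""
    target = (d[:, None] * x).sum(axis=0)      # full-sample x-totals
    g = np.zeros(x.shape[1])
    for _ in range(max_iter):
        w = d * np.exp(x @ g) * r              # calibration weights (0 if r=0)
        resid = target - (w[:, None] * x).sum(axis=0)
        if np.max(np.abs(resid)) < tol:
            break
        J = (w[:, None] * x).T @ x             # Jacobian of the w-totals in g
        g = g + np.linalg.solve(J, resid)
    return d * np.exp(x @ g) * r

rng = np.random.default_rng(0)
n = 200
x = np.column_stack([np.ones(n), rng.normal(size=n)])
d = rng.uniform(1.0, 3.0, size=n)              # hypothetical design weights
r = (rng.uniform(size=n) < 0.7).astype(float)  # hypothetical response flags
y = 2.0 + 3.0 * x[:, 1] + rng.normal(size=n)

w = calibrate_weights(x, d, r)
t_hat = float((w * y).sum())                   # calibration-weighted total
```

The returned weights are zero for nonrespondents, matching the convention above, and reproduce the full-sample weighted x-totals at convergence.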
The variance
estimator in equation (5.2) of Kott (2006) is identical to
in equation (3.1) when
The variance estimator in Kim and Haziza (2014) is also similar, although their prediction model is more general than the linear prediction model considered here.
This variance
estimator
presupposes that the original
sampling design is such that each element can be drawn at most once. In Section
3.1, we see that when the probabilities of response are independent (Poisson),
then under mild assumptions,
is a nearly unbiased estimator of
the mean squared error of
under the quasi-sampling design
whether or not the prediction model,
holds.
In Section 3.2,
is shown to be a nearly unbiased
estimator for the combined prediction-model and original-sampling-design
variance of
as an estimator for
whether or not the response model
in equation (2.4) holds. Thus,
can be called a “simultaneous
variance estimator”.
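For the design component of such variance estimators, the workhorse is a double sum over sampled pairs weighted by joint selection probabilities. The sketch below is a standard Horvitz-Thompson-type form, not the paper's exact equation (3.1); as a sanity check, it reproduces the textbook variance-estimator formula under simple random sampling without replacement.

```python
import numpy as np

# Generic double-sum, Horvitz-Thompson-type design-variance estimator:
#   sum_j sum_k (pi_jk - pi_j pi_k) / pi_jk * (e_j / pi_j) (e_k / pi_k),
# where pi_k are inclusion probabilities and pi_jk joint inclusion
# probabilities (with pi_kk = pi_k).  Illustrative, not the paper's (3.1).

def ht_double_sum(e, pi, pij):
    z = e / pi
    return float(z @ ((pij - np.outer(pi, pi)) / pij) @ z)

# Sanity check under simple random sampling without replacement (SRSWOR):
# the double sum collapses to the textbook estimator N^2 (1 - n/N) s^2 / n.
N, n = 1000, 100
rng = np.random.default_rng(1)
y = rng.normal(10.0, 2.0, size=n)
pi = np.full(n, n / N)
pij = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pij, n / N)
v = ht_double_sum(y, pi, pij)
s2 = y.var(ddof=1)
assert np.isclose(v, N**2 * (1 - n / N) * s2 / n)
```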
3.1 Variance estimation under the response model
For ease of
exposition we will assume that the response model in equation (2.4) with a
finite
holds. Sufficient conditions for
to be a nearly unbiased estimator
for the mean squared error of
(by which we mean that the bias converges to 0
as the sample size grows arbitrarily large) are
and
is of full rank and is
bounded in probability as the sample size grows arbitrarily large.
From these,
being bounded when
is finite, and the Cauchy-Schwarz
inequality,
it is
not hard to see not only that
is a consistent estimator for
but also that
in equation (3.2) (which can be rendered
has a probability limit, call it
whether or not the prediction
model holds. Moreover, both
and
are
Observe that
where
The insertion of the
into the “regression coefficient”
allows us to ignore the contribution to
quasi-design mean squared error of the second term in this sum,
That is because
is true by definition, which implies
is
under our assumptions. Moreover, since
is also
is
which is asymptotically ignorable relative to
the two
components of
With the
contribution of
eliminated from consideration, an
idealized, but not calculable, nearly unbiased estimator for the quasi-design
mean squared error of
is
where the first term on the right estimates the
mean squared error before nonresponse (if any) and the second the added
variance due to nonresponse.
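The "added variance due to nonresponse" has a simple closed form when response is independent across elements. The Monte Carlo sketch below checks that mechanism on simulated data; the response probabilities, weights, and use of raw y-values in place of the paper's residuals are all illustrative assumptions.

```python
import numpy as np

# Monte Carlo check of the added variance under an independent (Poisson)
# response model (illustrative data throughout): conditional on the realized
# sample, t = sum_k (d_k / p_k) R_k y_k has variance
#   sum_k d_k^2 (1 - p_k) / p_k * y_k^2,
# since the R_k are independent Bernoulli(p_k).  The paper applies this idea
# to calibration residuals rather than raw y-values.

rng = np.random.default_rng(2)
n = 300
d = rng.uniform(1.0, 4.0, size=n)          # design weights (fixed sample)
y = rng.normal(5.0, 1.0, size=n)
p = rng.uniform(0.4, 0.9, size=n)          # response probabilities

analytic = float(np.sum(d**2 * (1 - p) / p * y**2))

reps = 20000
R = rng.uniform(size=(reps, n)) < p        # independent response indicators
t = (R * (d / p) * y).sum(axis=1)          # inverse-probability estimates
```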
An alternative
nearly unbiased idealized mean squared error estimator, closer to being
calculable, is
where again
when
otherwise. Since the
are independent under the response model with
mean
and variance
when
By contrast, the following holds when
The first summation on the right-hand side of equation (3.7) has terms where
and
terms where
the latter of which causes the second summation in (3.7)
to differ from the second summation on the right-hand side of equation (3.6). Note that the expectation under the response
model of
in the second summation on the right-hand side
of (3.7) is
Finally,
can be replaced by the
asymptotically identical, but computable,
in equation (3.1) since
is bounded for all
under assumptions (3.3) and (3.4),
allowing
and
to be substituted for the unknown
and
respectively (because
and
are
for all
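Putting the pieces together, a computable estimator of the general two-part shape discussed in this section, a double-sum design component plus a nonresponse component of the form sum_k w_k (w_k - d_k) e_k^2, can be sketched as follows. The specific weighting of each term in the paper's equation (3.1) is not reproduced; every input below is a placeholder.

```python
import numpy as np

# Sketch of a computable two-part variance estimate (all inputs are
# placeholders): a Horvitz-Thompson-type double sum evaluated at weighted
# residuals, plus a nonresponse component sum_k w_k (w_k - d_k) e_k^2, which
# vanishes under full response (w_k = d_k) and for nonrespondents (w_k = 0).

def two_part_variance(e, pi, pij, w, d):
    z = w * e
    design = float(z @ ((pij - np.outer(pi, pi)) / pij) @ z)
    nonresponse = float(np.sum(w * (w - d) * e**2))
    return design + nonresponse

# Illustrative inputs: SRSWOR-style inclusion probabilities, Poisson
# response, inverse-probability weights for respondents.
rng = np.random.default_rng(5)
n, Npop = 50, 500
pi = np.full(n, n / Npop)
pij = np.full((n, n), n * (n - 1) / (Npop * (Npop - 1)))
np.fill_diagonal(pij, n / Npop)
d = 1.0 / pi
p_hat = rng.uniform(0.5, 0.9, size=n)
resp = rng.uniform(size=n) < p_hat
w = np.where(resp, d / p_hat, 0.0)         # 0 for nonrespondents
e_hat = rng.normal(size=n)                 # stand-in residuals
v = two_part_variance(e_hat, pi, pij, w, d)
```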
3.2 Variance estimation under the prediction model
Matters are a bit
simpler when we assume a prediction model holds but not necessarily the
response model in equation (2.4). Suppose
whether or not
is
sampled or responds when sampled, and the
are
uncorrelated random variables with variances equal to
where
need not be specified other than
having finite components.
The mean squared
error of
as an estimator for
under that prediction model is
the sum of the prediction variance of
as an estimator for
(see, for example, Kott 2009, page
69), and the squared bias,
the latter being zero when
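The prediction-variance calculation can be checked numerically. In the sketch below, the weights are calibrated to the population x-totals with a linear (GREG-type) adjustment, so the error of the weighted total reduces to a weighted sum of the model errors; the population, model variances, and respondent set are all illustrative assumptions.

```python
import numpy as np

# Numerical check of the prediction variance (illustrative assumptions
# throughout): y_k = x_k' beta + eps_k with uncorrelated eps_k of variance
# sigma2_k.  If the weights are calibrated to the population x-totals, the
# error of sum_R w_k y_k as an estimator of the population total is
#   sum_{k in R} (w_k - 1) eps_k  -  sum_{k not in R} eps_k,
# with variance sum_R (w_k - 1)^2 sigma2_k + sum_{U \ R} sigma2_k.

rng = np.random.default_rng(3)
N = 300
x = np.column_stack([np.ones(N), rng.uniform(1.0, 2.0, size=N)])
beta = np.array([1.0, 2.0])
sigma2 = 0.5 + x[:, 1]                     # heteroscedastic model variances

resp = np.zeros(N, dtype=bool)
resp[:100] = True                          # hypothetical respondent set
xr = x[resp]

# Linear (GREG-type) calibration of uniform starting weights to the
# population x-totals.
tx = x.sum(axis=0)
w0 = np.full(resp.sum(), N / resp.sum())
lam = np.linalg.solve(xr.T @ (w0[:, None] * xr),
                      tx - (w0[:, None] * xr).sum(axis=0))
w = w0 * (1.0 + xr @ lam)

analytic = np.sum((w - 1.0) ** 2 * sigma2[resp]) + np.sum(sigma2[~resp])

reps = 20000
eps = rng.normal(size=(reps, N)) * np.sqrt(sigma2)
y = x @ beta + eps                         # model realizations, shape (reps, N)
err = (w * y[:, resp]).sum(axis=1) - y.sum(axis=1)
```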
The combined variance of
as an estimator for
under the prediction model and
original sample design is
where the subscript
denotes that the operation (variance or
expectation) is with respect to the original sampling design. Recall
for
To see that
in equation (3.1) provides a
nearly unbiased estimator for
observe first that
Let
when
and
otherwise. Because the
are uncorrelated, and
it is now not hard to show that
for almost every
pair under the prediction model when
converges to an invertible matrix, and assumptions
(3.3), (3.4), and
hold. Observe that the change from the assumptions
in (3.5) to (3.8) makes the relative bias of
as an estimator for
rather than
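The "combined" variance in this section rests on the usual decomposition of total variance over the sampling design and the prediction model jointly. The toy Monte Carlo below verifies that decomposition for an expansion estimator under SRSWOR with uncorrelated model errors; the design, model, and estimator are illustrative, not the paper's.

```python
import numpy as np

# Toy verification of the combined-variance decomposition (illustrative):
#   Var(t) = Var_S( E_eps[t | S] ) + E_S( Var_eps[t | S] ),
# for the expansion estimator t = (N/n) sum_{k in S} y_k with
# y_k = mu_k + eps_k, eps_k uncorrelated with variance sigma2_k, and S an
# SRSWOR sample.  This only exercises the decomposition, not the paper's
# estimator.

rng = np.random.default_rng(4)
N, n = 40, 10
mu = rng.normal(10.0, 3.0, size=N)         # "x_k' beta" components
sigma2 = rng.uniform(0.5, 2.0, size=N)     # model variances

reps = 20000
samples = np.array([rng.choice(N, size=n, replace=False) for _ in range(reps)])
eps = rng.normal(size=(reps, n)) * np.sqrt(sigma2[samples])
t = (N / n) * (mu[samples] + eps).sum(axis=1)

# Analytic pieces: design variance of (N/n) sum mu_k under SRSWOR, plus the
# expected model variance (N/n)^2 * n * mean(sigma2_k).
var_design = N**2 * (1 - n / N) * mu.var(ddof=1) / n
e_model = (N / n) ** 2 * n * sigma2.mean()
```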