
2. Motivation

Andrés Gutiérrez, Leonardo Trujillo and Pedro Luis do Nascimento Silva

2.1 Sampling designs and estimators

Consider a finite population as a set of $N$ units, where $N < \infty$, forming the universe of study. $N$ is known as the population size. Each element belonging to the population can be identified with an index $k$. Let $U$ be the index set given by $U = \{1, \dots, k, \dots, N\}$. The selection of a sample $s = \{k_{1}, k_{2}, \dots, k_{n(s)}\}$ is done according to a sampling design, defined as a multivariate probability distribution $p(\cdot)$ over a support $Q$ such that $p(s) > 0$ for every $s \in Q$ and

$\sum_{s \in Q} p (s) = 1.$

Under a sampling design $p(\cdot)$, every element in the population is assigned an inclusion probability, that is, the probability that the element belongs to the sample. For the $k$-th element in the population this probability is denoted $π_{k}$ and is known as the first-order inclusion probability, given by

$π_{k} = \Pr(k \in S) = \Pr(I_{k} = 1) = \sum_{s ∍ k} p(s)$

where $I_{k}$ is an indicator random variable denoting the membership of element $k$ in the sample, and the subscript $s ∍ k$ indicates that the sum runs over all possible samples containing the $k$-th element. Analogously, $π_{k l}$ is known as the second-order inclusion probability. It denotes the probability that elements $k$ and $l$ both belong to the sample and is given by

$π_{k l} = \Pr(k \in S; l \in S) = \Pr(I_{k} = 1; I_{l} = 1) = \sum_{s ∍ k, l} p(s).$
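The two definitions above can be checked by brute force on a tiny population. The following is a minimal sketch, assuming a hypothetical design (simple random sampling without replacement of $n = 2$ units from $N = 4$) that is not taken from the paper; the names `support`, `p`, `pi` and `pi2` are illustrative only.

```python
from fractions import Fraction
from itertools import combinations

N, n = 4, 2                      # hypothetical population and sample sizes
U = range(1, N + 1)              # index set U = {1, ..., N}

# Assumed design: every subset of size n is equally likely (SRS without replacement).
support = list(combinations(U, n))                    # the support Q
p = {s: Fraction(1, len(support)) for s in support}   # p(s) > 0 on Q

# The design is a probability distribution over Q.
total_probability = sum(p.values())

# First-order inclusion probability: sum of p(s) over samples containing k.
pi = {k: sum(p[s] for s in support if k in s) for k in U}

# Second-order inclusion probability: sum of p(s) over samples containing k and l.
pi2 = {(k, l): sum(p[s] for s in support if k in s and l in s)
       for k in U for l in U}

print(pi[1], pi2[(1, 2)], pi2[(1, 1)])
```

Under this assumed design the enumeration reproduces the familiar closed forms $π_{k} = n / N$ and $π_{k l} = n (n - 1) / [N (N - 1)]$ for $k \neq l$, and $π_{k k} = π_{k}$.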

The aim of the sample survey is to study a characteristic of interest $y$ associated with every unit in the population and to estimate a function of interest $T$, called a parameter:

$T = f (y_{1}, \dots, y_{k}, \dots, y_{N}) .$

This inferential approach is known as design-based inference. Under this approach, the estimates of the parameters and their properties depend directly on the discrete probability measure related to the chosen sampling design and do not take into account the properties of the finite population. The value $y_{k}$ is the observation of the characteristic of interest $y$ for individual $k$, and it is treated as a fixed quantity rather than a random variable.

Then, the Horvitz-Thompson (HT) estimator can be defined as:

${\hat{t}}_{y, π} = \sum_{k \in s} \frac{y_{k}}{π_{k}} = \sum_{k \in s} d_{k} y_{k}$

where $d_{k} = 1 / π_{k}$ is the reciprocal of the first-order inclusion probability, known as the expansion factor or basic design weight. The HT estimator is unbiased for the population total $t_{y} = \sum_{k \in U} y_{k}$ (assuming all the first-order inclusion probabilities are greater than zero) and its variance is given by

$\mathrm{Var}({\hat{t}}_{y, π}) = \sum_{k \in U} \sum_{l \in U} Δ_{k l} \frac{y_{k}}{π_{k}} \frac{y_{l}}{π_{l}}, \qquad (2.1)$

where $Δ_{k l} = \mathrm{Cov}(I_{k}, I_{l}) = π_{k l} - π_{k} π_{l}$. If all the second-order inclusion probabilities are greater than zero, an unbiased estimator of (2.1) is given by

$\widehat{\mathrm{Var}}({\hat{t}}_{y, π}) = \sum_{k \in s} \sum_{l \in s} \frac{Δ_{k l}}{π_{k l}} \frac{y_{k}}{π_{k}} \frac{y_{l}}{π_{l}}.$
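The unbiasedness statements above can be verified exactly by enumerating every sample of a small design. The sketch below assumes a hypothetical population of four $y$ values and simple random sampling without replacement of two units; neither comes from the paper, and the helper names (`incl`, `ht_total`, `ht_var_est`) are illustrative.

```python
from fractions import Fraction
from itertools import combinations

y = {1: 3, 2: 5, 3: 8, 4: 10}      # hypothetical fixed population values y_k
n = 2
support = list(combinations(y, n))                    # all samples of size n
p = {s: Fraction(1, len(support)) for s in support}   # assumed SRS design

def incl(*units):
    """Probability that all the given units are in the sample."""
    return sum(p[s] for s in support if all(k in s for k in units))

pi = {k: incl(k) for k in y}
pi2 = {(k, l): incl(k, l) for k in y for l in y}
delta = {(k, l): pi2[k, l] - pi[k] * pi[l] for k in y for l in y}

def ht_total(s):
    """Horvitz-Thompson estimator of the total for sample s."""
    return sum(y[k] / pi[k] for k in s)

def ht_var_est(s):
    """Variance estimator of the HT total for sample s (needs pi_kl > 0)."""
    return sum(delta[k, l] / pi2[k, l] * (y[k] / pi[k]) * (y[l] / pi[l])
               for k in s for l in s)

t_y = sum(y.values())
expected_ht = sum(p[s] * ht_total(s) for s in support)
var_formula = sum(delta[k, l] * (y[k] / pi[k]) * (y[l] / pi[l])
                  for k in y for l in y)                     # equation (2.1)
var_direct = sum(p[s] * (ht_total(s) - t_y) ** 2 for s in support)
expected_var_est = sum(p[s] * ht_var_est(s) for s in support)

print(expected_ht == t_y, var_formula == var_direct == expected_var_est)
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the equalities hold without rounding error; with floating point the same checks would need a tolerance.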

Gambino and Silva (2009) suggest that in a household survey the main interest is in characteristics of particular household members, which may relate to health, education, income and expenses, employment status, etc. In general, the sampling designs used for this kind of survey are complex and use techniques such as stratification, clustering or unequal probabilities of selection. Typical targets of repeated surveys include the estimation of the level at a particular point in time, the estimation of changes between two survey rounds, and the estimation of average level parameters over repeated rounds of a survey. The rotation scheme and the frequency of the survey can considerably affect the precision of the estimators.

2.2 Pseudo-likelihood

Some authors such as Fuller (2009), Chambers and Skinner (2003, p. 179), and Pessoa and Silva (1998, chapter 5) consider the problem where maximum likelihood estimation is appropriate for simple random samples, as is the case in Stasny (1987), but not for samples resulting from a complex survey design. Under this scheme, it is assumed that the population density function is $f(y, θ)$, where $θ$ is the parameter of interest. If information for the whole population were available, through a census, the maximum likelihood estimator of $θ$ could be obtained by maximizing

$L (θ) = \sum_{k \in U} \log f (y_{k}, θ)$

with respect to $θ$. We denote by $θ_{N}$ the value maximizing this expression. The likelihood equations for the population are given by

$\sum_{k \in U} u_{k} (θ) = 0.$

The $u_{k}$ are known as scores and they are defined as

$u_{k} (θ) = \frac{\partial \log f (y_{k}, θ)}{\partial θ} .$

The pseudo-likelihood approach takes $θ_{N}$ as the parameter of interest, to be estimated from the information collected in a complex sample. For a fixed $θ$, the sum $\sum_{k \in U} u_{k}(θ)$ is a population total, so it can be estimated using a weighted linear estimator

$\sum_{k \in s} d_{k} u_{k} (θ)$

where $d_{k}$ is a sampling design weight such as the inverse of the inclusion probability of individual $k$. An estimator for $θ_{N}$ is then obtained by solving the resulting system of equations.

Definition 2.1 A maximum pseudo-likelihood estimator ${\hat{θ}}_{s}$ for $θ_{N}$ corresponds to the solution of the pseudo-likelihood equations given by

$\sum_{k \in s} d_{k} u_{k} (θ) = 0.$
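As an illustration of Definition 2.1, the sketch below solves the pseudo-likelihood equation numerically for a Bernoulli model, $f(y, θ) = θ^{y} (1 - θ)^{1 - y}$. The model, the five sample values and the weights are assumptions made for this example and do not come from the paper. For this model the score is $u_{k}(θ) = y_{k} / θ - (1 - y_{k}) / (1 - θ)$, and the weighted equation has the closed-form solution $\sum_{s} d_{k} y_{k} / \sum_{s} d_{k}$, which is used as a check.

```python
def score(y_k, theta):
    """Bernoulli score u_k(theta) = d/dtheta log f(y_k, theta)."""
    return y_k / theta - (1 - y_k) / (1 - theta)

def pseudo_mle(y, d, lo=1e-9, hi=1 - 1e-9, tol=1e-12):
    """Solve sum_k d_k u_k(theta) = 0 by bisection on (0, 1).

    Assumes the sample contains both zeros and ones, so the weighted
    score is strictly decreasing and changes sign exactly once.
    """
    def weighted_score(theta):
        return sum(d_k * score(y_k, theta) for y_k, d_k in zip(y, d))

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if weighted_score(mid) > 0:
            lo = mid      # root lies to the right
        else:
            hi = mid      # root lies to the left (or at mid)
    return (lo + hi) / 2

# Hypothetical sample: responses y_k and design weights d_k = 1 / pi_k.
y_s = [1, 0, 1, 1, 0]
d_s = [10.0, 40.0, 10.0, 20.0, 40.0]

theta_hat = pseudo_mle(y_s, d_s)
closed_form = sum(d * y for d, y in zip(d_s, y_s)) / sum(d_s)
unweighted = sum(y_s) / len(y_s)

print(theta_hat, closed_form, unweighted)
```

With these made-up weights the weighted solution differs markedly from the unweighted sample proportion, which is the situation where ignoring the design would matter.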

Using the Taylor linearization method, the asymptotic variance of a maximum pseudo-likelihood estimator based on the sampling design is given by

$V_{p} ({\hat{θ}}_{s}) \approx {[J (θ_{N})]}^{- 1} V_{p} [\sum_{k \in s} d_{k} u_{k} (θ_{N})] {[J (θ_{N})]}^{- 1}$

where $V_{p} [\sum_{k \in s} d_{k} u_{k} (θ_{N})]$ is the variance of the estimator for the population total of the scores based on the sampling design and

$J (θ_{N}) = {\frac{\partial \sum_{k \in U} u_{k} (θ)}{\partial θ} |}_{θ = θ_{N}} .$

An estimator for $V_{p} ({\hat{θ}}_{s})$ is given by

${\hat{V}}_{p} ({\hat{θ}}_{s}) = {[\hat{J} ({\hat{θ}}_{s})]}^{- 1} {\hat{V}}_{p} [\sum_{k \in s} d_{k} u_{k} ({\hat{θ}}_{s})] {[\hat{J} ({\hat{θ}}_{s})]}^{- 1}$

where ${\hat{V}}_{p} [\sum_{k \in s} d_{k} u_{k} ({\hat{θ}}_{s})]$ is a consistent estimator for the variance of the estimator of the population total of the scores and

$\hat{J} ({\hat{θ}}_{s}) = {\frac{\partial \sum_{k \in s} d_{k} u_{k} (θ)}{\partial θ} |}_{θ = {\hat{θ}}_{s}} .$

Then, following Binder (1983), the asymptotic distribution of ${\hat{θ}}_{s}$ is normal since

${\hat{V}}_{p} {({\hat{θ}}_{s})}^{- 1 / 2} ({\hat{θ}}_{s} - θ_{N}) \sim N (0,1) .$
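The sandwich form of ${\hat{V}}_{p}({\hat{θ}}_{s})$ can be made concrete with a scalar parameter. The sketch below is only one possible instance: it assumes a Bernoulli model and a Poisson sampling design (units included independently), neither of which is specified in the paper. Under that assumed design $Δ_{k l} = 0$ for $k \neq l$ and $Δ_{k k} / π_{k k} = 1 - π_{k}$, so the variance estimator of the weighted score total reduces to $\sum_{s} (1 - π_{k}) [d_{k} u_{k}({\hat{θ}}_{s})]^{2}$. The data are hypothetical, and a sample of five units is far too small for the asymptotic normality to be trusted; it only keeps the arithmetic visible.

```python
import math

# Hypothetical sample: responses, design weights and inclusion probabilities.
y_s = [1, 0, 1, 1, 0]
d_s = [10.0, 40.0, 10.0, 20.0, 40.0]
pi_s = [1.0 / d for d in d_s]

def score(y_k, theta):
    """Bernoulli score u_k(theta)."""
    return y_k / theta - (1 - y_k) / (1 - theta)

def score_derivative(y_k, theta):
    """Derivative of the Bernoulli score with respect to theta."""
    return -y_k / theta ** 2 - (1 - y_k) / (1 - theta) ** 2

# Pseudo-MLE for the Bernoulli model (closed form of the weighted score equation).
theta_hat = sum(d * y for d, y in zip(d_s, y_s)) / sum(d_s)

# J-hat: derivative of the weighted score total, evaluated at theta_hat.
j_hat = sum(d * score_derivative(y, theta_hat) for d, y in zip(d_s, y_s))

# V-hat_p of the weighted score total under the assumed Poisson design.
v_score = sum((1 - p) * (d * score(y, theta_hat)) ** 2
              for p, d, y in zip(pi_s, d_s, y_s))

# Sandwich estimator; for a scalar parameter J^{-1} V J^{-1} = V / J^2.
v_theta = v_score / j_hat ** 2
std_error = math.sqrt(v_theta)

print(theta_hat, v_theta, std_error)
```

For a vector parameter, `j_hat` becomes a matrix and the division is replaced by the two matrix inverses in the formula above.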

These definitions offer a solid background for correct inference when using large samples, as is the case in labour force surveys.

2.3 Nonresponse

Särndal and Lundström (2005) state that nonresponse has been a topic of increasing interest in national statistical offices during the last decades. Attention to this topic has also increased considerably in the survey sampling literature. Nonresponse is a common and undesirable issue in the conduct of a survey that can considerably affect the quality of the estimates.

Lohr (1999) discusses several types of nonresponse mechanisms:

The nonresponse mechanism is ignorable when the probability of an individual responding to the survey does not depend on the characteristic of interest. Note that the word "ignorable" makes reference to a model explaining the mechanism.
On the other hand, the nonresponse mechanism is nonignorable when the probability of an individual responding to the survey depends on the characteristic of interest. For example, in a labour survey, the possibility of response may depend on the labour force classification of the individuals in a household.
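The practical difference between the two mechanisms can be seen with a small numerical sketch. The population, the variable ($y_{k} = 1$ for an unemployed person) and the response probabilities below are all hypothetical and chosen only for illustration. The function computes the ratio $\sum_{k} ρ_{k} y_{k} / \sum_{k} ρ_{k}$, where $ρ_{k}$ is an assumed response probability; this is the value around which the unadjusted respondent mean concentrates when the number of respondents is large.

```python
def respondent_mean(y, rho):
    """Large-sample value of the unadjusted mean among respondents.

    y   : values of the characteristic of interest
    rho : assumed response probability of each unit
    """
    return sum(r * v for v, r in zip(y, rho)) / sum(rho)

# Hypothetical population: 100 unemployed (y = 1) and 900 employed (y = 0).
y = [1] * 100 + [0] * 900
true_rate = sum(y) / len(y)

# Ignorable mechanism: the response probability does not depend on y.
rho_ignorable = [0.7] * len(y)

# Nonignorable mechanism: unemployed people respond less often (assumed values).
rho_nonignorable = [0.5 if v == 1 else 0.8 for v in y]

rate_ignorable = respondent_mean(y, rho_ignorable)
rate_nonignorable = respondent_mean(y, rho_nonignorable)

print(true_rate, rate_ignorable, rate_nonignorable)
```

In this made-up setting the ignorable mechanism leaves the respondent rate at the population value, while the nonignorable one pulls it below the population value, which is the kind of bias that adjustment methods try to control.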

Lumley (2009, chapter 9) analyses individual nonresponse with partial data for a respondent, taking a design-based approach that adjusts the sampling weights. Fuller (2009, chapter 5) considers some imputation techniques for the treatment of nonresponse through probabilistic models and sampling weights. Särndal (2011) considers a model-based approach through balanced sets in order to achieve higher representativeness of the estimates. In the same way, Särndal and Lundström (2010) propose a set of indicators to judge the effectiveness of auxiliary information in controlling the bias generated by nonresponse. Särndal and Lundström (2005) give a large number of references about nonresponse. These references examine two main complementary aspects of a survey: prevention of nonresponse (before it happens) and estimation techniques that take nonresponse into account in the inference process. This second aspect is known as adjustment for nonresponse.


Date modified: 2017-09-20


Survey Methodology
