2 Joint inference and super-population

Chen Xu, Jiahua Chen and Harold Mantel

The random behavior of an inference procedure is mostly inherited from the randomness in the data. In the context of surveys, the set of sampled units is random because of the probabilistic sampling design. At the same time, the value of each sampling unit may be regarded as a random outcome from some conceptual infinite super-population (Royall 1976).

In a design-based analysis, the finite population is regarded as nonrandom and all measurements of sampling units are constants. The parameters of interest are finite population quantities such as the population total or the population median. The statistical inference is evaluated based on the randomness from the probability design.

Alternatively, one may regard the design-induced randomness as an artifact. In this view, the measurements of sampled units are independent realizations of a random variable from a probability model for the postulated super-population. The parameters of interest are those of the assumed model, and model-based inferences are evaluated solely on the basis of the randomization introduced by the model.

A third approach is called model-design-based inference; it incorporates the randomization from both design and model. In such a joint randomization mechanism, the finite population is regarded as a random sample from a super-population. The survey sample is then regarded as a second-phase sample from the super-population. The parameters of interest can be either model or finite-population parameters. In this mechanism, inferences on the finite-population parameters are motivated by the super-population model. Model-design-based inference can be more efficient than pure design-based approaches when the finite population is well described by the super-population model. Compared with pure model-based approaches, it protects against model violation and is therefore more robust in general (see, e.g., Binder and Roberts 2003; Kalton 1983).
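The two sources of randomness in the joint mechanism can be made concrete with a small simulation. The following is a minimal sketch only, not the authors' procedure: it assumes a logistic super-population model, simple random sampling without replacement as the design, and hypothetical values for the coefficients, the population size, the sample size and the seed.

```python
import math
import random


def draw_finite_population(N, beta, rng):
    """First phase (model): N independent (y, x) draws from an assumed logistic super-population."""
    population = []
    for _ in range(N):
        x = [rng.gauss(0.0, 1.0) for _ in beta]
        theta = sum(b * xj for b, xj in zip(beta, x))
        mu = 1.0 / (1.0 + math.exp(-theta))
        y = 1 if rng.random() < mu else 0
        population.append((y, x))
    return population


def draw_survey_sample(population, n, rng):
    """Second phase (design): simple random sampling without replacement, design weight N / n."""
    N = len(population)
    labels = rng.sample(range(N), n)
    weight = N / n
    return [(i, weight) for i in labels]


rng = random.Random(2017)          # hypothetical seed, for reproducibility only
beta = [1.0, 0.0, -0.5]            # hypothetical super-population coefficients
population = draw_finite_population(N=5000, beta=beta, rng=rng)
sample = draw_survey_sample(population, n=500, rng=rng)

# Finite-population parameter: the total of y, fixed once the population is drawn.
true_total = sum(y for y, _ in population)
# Design-weighted (Horvitz-Thompson) estimate of that total from the survey sample.
ht_total = sum(w * population[i][0] for i, w in sample)
print(true_total, ht_total)
```

Rerunning only `draw_survey_sample` varies the design randomness with the finite population held fixed, which is the design-based view; rerunning both functions varies the model and the design together, which is the joint view.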

We study the variable selection problem under the joint randomization mechanism. Let $D = \{1, \dots, N\}$ be a finite population consisting of $N$ sampling units. The measurements on the $i$th unit are denoted $(y_{i}, x_{i}),$ where $y_{i}$ is the response of interest and $x_{i} = {(x_{i 1}, \dots, x_{i p})}^{T}$ is a $p$-dimensional explanatory vector (covariate vector). These are regarded as independent realizations of $(Y, X)$ from a super-population. We postulate a generalized linear model (GLM) on the super-population as follows. Conditional on $X,$ the distribution of $Y$ belongs to a natural exponential family, the density of which takes the form

$f (y; θ) = c (y) \exp \{θ y - b (θ)\}. \qquad (2.1)$

Here $θ$ is the natural parameter of $f (y; θ),$ which satisfies $b' (θ) = E [Y | X] \equiv μ$ and $b'' (θ) = \mathrm{Var} [Y | X] \equiv σ^{2},$ and $c (y)$ is a non-negative base measure. The influence of the explanatory vector $X$ on $Y$ is expressed through $g (μ) = X^{T} β$ for some assumed link function $g (\cdot),$ where the vector $β = {(β_{1}, \dots, β_{p})}^{T}$ is the $p$-dimensional regression coefficient. If $g (\cdot)$ is the canonical link, i.e., $g (μ) = θ,$ then we have $θ = X^{T} β.$ For simplicity, we focus on the canonical link in this paper.
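As an illustration of these identities, two standard members of the family (2.1) can be written out explicitly. They are textbook examples chosen here for concreteness and are not taken from the paper.

For a Poisson response with mean $μ$:

$c (y) = 1 / y!, \quad θ = \log μ, \quad b (θ) = e^{θ}, \quad b' (θ) = e^{θ} = μ, \quad b'' (θ) = e^{θ} = μ = σ^{2}.$

For a Bernoulli response with success probability $μ$:

$c (y) = 1, \quad θ = \log \{μ / (1 - μ)\}, \quad b (θ) = \log (1 + e^{θ}), \quad b' (θ) = e^{θ} / (1 + e^{θ}) = μ, \quad b'' (θ) = μ (1 - μ) = σ^{2}.$

The canonical links are therefore $g (μ) = \log μ$ and $g (μ) = \log \{μ / (1 - μ)\},$ so that $θ = X^{T} β$ gives Poisson log-linear regression and logistic regression, respectively.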

Under this model, the effect of each explanatory variable is characterized by the size of the corresponding regression coefficient. In applications, a complex model with many variables often leads to over-fitting and poor interpretability. Hence, it is desirable to fit the data with a parsimonious model in which many regression coefficients are estimated to be zero. Explanatory variables with nonzero coefficients are then considered to be influential on the response. To this end, we assume that $β$ is ideally sparse, and address the variable selection problem by identifying the sparse model formed by the covariates with nonzero coefficients.


Date modified: 2017-09-20
