2 Joint inference and super-population
Chen Xu, Jiahua Chen and Harold Mantel
The random behavior of an inference procedure is mostly
inherited from the randomness in the data. In the context of surveys, the set
of sampled units is random because of the probabilistic sampling design. At the
same time, the value of each sampling unit may be regarded as a random outcome
from some conceptual infinite super-population (Royall 1976).
In a design-based analysis, the finite population is
regarded as nonrandom and all measurements of sampling units are constants. The
parameters of interest are finite population quantities such as the population
total or the population median. Statistical inference is evaluated with respect
to the randomness induced by the probability sampling design.
Alternatively, one may regard the design-induced randomness as an
artifact. In this model-based view, the measurements of sampled units are
independent realizations of a random variable from a probability model for the
postulated super-population. The parameters of interest are related to the
assumed model, and model-based inferences are evaluated solely on the basis of
the randomization introduced by the model.
A third approach is called model-design-based inference;
it incorporates the randomization from both design and model. In such a joint
randomization mechanism, the finite population is regarded as a random sample
from a super-population. The survey sample is regarded as a second-phase
sample from the super-population. The parameters of interest can be either
model or finite-population parameters. In this mechanism, inferences on the
finite-population parameters are motivated by the super-population model.
Model-design-based inference can be more efficient than pure design-based
approaches when the finite population is well described by the super-population
model. Compared with pure model-based approaches, it protects against model
violation and is therefore more robust in general (see, e.g., Binder and Roberts 2003; Kalton 1983).
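The two layers of randomness in the joint mechanism can be made concrete with a small simulation. The sketch below is illustrative only and makes several assumptions that are not in the text: a linear super-population model with standard normal covariate and error, hypothetical coefficients `beta0 = 1.0` and `beta1 = 2.0`, simple random sampling without replacement as the design, and the expansion estimator `(N / n) * sum(y)` for the finite-population total.

```python
import random


def draw_finite_population(N, beta0, beta1, rng):
    """Phase 1: the finite population as N independent draws from an
    assumed super-population model y = beta0 + beta1 * x + error."""
    population = []
    for _ in range(N):
        x = rng.gauss(0.0, 1.0)
        y = beta0 + beta1 * x + rng.gauss(0.0, 1.0)
        population.append((x, y))
    return population


def draw_survey_sample(population, n, rng):
    """Phase 2: the survey sample, here a simple random sample
    without replacement (an assumed design)."""
    return rng.sample(population, n)


rng = random.Random(2013)  # fixed seed so the run is reproducible
population = draw_finite_population(N=5000, beta0=1.0, beta1=2.0, rng=rng)
sample = draw_survey_sample(population, n=200, rng=rng)

N, n = len(population), len(sample)

# Finite-population parameter: fixed once the population has been drawn.
finite_total = sum(y for _, y in population)

# Design-based estimate of that total from the sample alone.
ht_total = (N / n) * sum(y for _, y in sample)

# Model parameter: the super-population mean E(Y) = beta0 + beta1 * E(X).
model_mean = 1.0

print(f"finite-population total : {finite_total:.1f}")
print(f"estimated total         : {ht_total:.1f}")
print(f"finite-population mean  : {finite_total / N:.3f} (model mean {model_mean})")
```

Rerunning phase 2 alone varies the estimate around a fixed finite-population total, which is the design-based view. Rerunning both phases also varies the total itself around its model expectation, which is the joint view.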
We study the variable selection problem under the joint
randomization mechanism. Let $\mathcal{U}$ be a finite population consisting of $N$ sampled units. The measurements on the $i$th unit are denoted $(y_i, x_i)$, where $y_i$ is the response of interest and $x_i$ is a $p$-dimensional explanatory vector (covariate
vector). These are regarded as independent realizations of $(Y, X)$ from a super-population. We postulate a
generalized linear model (GLM) on the super-population as follows. Conditional
on $X = x$, the distribution of $Y$ belongs to a natural exponential family, the
density of which takes the form

$$f(y; \theta) = c(y) \exp\{\theta y - b(\theta)\}.$$

Here $\theta$ is known as the natural parameter of $f$, such that $E(Y \mid x) = b'(\theta)$ and $\mathrm{Var}(Y \mid x) = b''(\theta)$, and $c(y)$ is a non-negative base measure. The influence
of the explanatory variable $x$ on $Y$ is expressed through $g(\mu) = x^{\top}\beta$, with $\mu = E(Y \mid x)$, for some assumed link function $g$, where the vector $\beta$ is the $p$-dimensional regression coefficient. If $g$ is the canonical link, i.e., $g = (b')^{-1}$, then we have $\theta = x^{\top}\beta$. For simplicity, we focus on the canonical link
in this paper.
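As a numerical illustration of the canonical link, the sketch below writes a natural exponential family density in the common form c(y) exp{theta * y - b(theta)}, with theta the natural parameter, b the cumulant function and c the base measure. It then checks for two familiar members, Poisson and Bernoulli, that the derivative of b at theta equals the mean, so that setting theta to the linear predictor corresponds to the log and logit links respectively. The covariate vector `x`, the coefficient vector `beta` and all helper names are hypothetical, chosen only for the example.

```python
import math


def nef_density(y, theta, b, c):
    """Natural exponential family density c(y) * exp(theta * y - b(theta))."""
    return c(y) * math.exp(theta * y - b(theta))


def numeric_derivative(f, t, h=1e-5):
    """Central-difference approximation to f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)


# Poisson: b(theta) = exp(theta), c(y) = 1 / y!
def poisson_b(theta):
    return math.exp(theta)


def poisson_c(y):
    return 1.0 / math.factorial(y)


# Bernoulli: b(theta) = log(1 + exp(theta)), c(y) = 1
def bernoulli_b(theta):
    return math.log1p(math.exp(theta))


def bernoulli_c(y):
    return 1.0


# Hypothetical covariates and a sparse coefficient vector (second entry is zero).
x = [1.0, 0.5, -2.0]
beta = [0.3, 0.0, -0.4]

# Under the canonical link the natural parameter is the linear predictor.
theta = sum(xj * bj for xj, bj in zip(x, beta))

# The mean is b'(theta): exp(theta) for Poisson, the logistic function for Bernoulli.
poisson_mean = numeric_derivative(poisson_b, theta)
bernoulli_mean = numeric_derivative(bernoulli_b, theta)

print(f"theta = {theta:.4f}")
print(f"Poisson mean   b'(theta) = {poisson_mean:.6f}, exp(theta) = {math.exp(theta):.6f}")
print(f"Bernoulli mean b'(theta) = {bernoulli_mean:.6f}, "
      f"expit(theta) = {1.0 / (1.0 + math.exp(-theta)):.6f}")
```

The zero entry in `beta` leaves `theta` unchanged whatever the value of the second covariate, which is the sense in which a covariate with a zero coefficient has no influence on the response under this model.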
Based on this model, the effect of each explanatory
variable is characterized by the size of the corresponding regression
coefficient. In applications, a complex model with many variables often leads
to over-fitting and poor interpretive value. Hence, it is desirable to fit
the data with a parsimonious model in which many regression coefficients are
estimated to be zero. Explanatory variables with nonzero coefficients are then
considered to be influential on the response. To this end, we assume that the regression coefficient vector $\beta$ is ideally sparse, and address the variable
selection problem by identifying a sparse model formed by the covariates
with nonzero coefficients.