1 Introduction

Chen Xu, Jiahua Chen and Harold Mantel

In many areas of scientific research, a common interest is to identify the influential factors associated with certain behavioral, social, or economic indices within a target population. For example, sociologists would like to identify important factors that affect the unemployment rate in a specific region, and epidemiologists are interested in identifying risk behaviors for diseases. In such studies, researchers often start with a survey of the target population (e.g., Rahiala and Teräsvirta 1993; Korn and Graubard 1999; Wolfson 2004). A representative sample is selected, and measurements of the variables of interest are collected for the sampled units. A regression model is routinely employed to summarize the information contained in the data. It explains variation in the response variable through a simple function of explanatory variables (covariates). When they lack prior knowledge, researchers may collect information on many potential explanatory variables. The goal of identifying influential factors can then be achieved through a variable selection procedure.

Variable selection is fundamental in statistical modeling. In non-survey settings, classical selection criteria have been developed to assess and select candidate variables. Examples include Mallows' $C_p$ statistic (Mallows 1973), (generalized) cross-validation (CV/GCV; Stone 1974; Craven and Wahba 1979), the Akaike information criterion (AIC; Akaike 1973) and the Bayesian information criterion (BIC; Schwarz 1978). All of these criteria are useful and can provide meaningful inferences in practice.

Despite the abundant literature on variable selection, the topic has received little attention in the context of survey sampling. When variable selection methods are applied to survey data, many potential complications arise. We focus on issues related to the special features of surveys. First, data collected through survey sampling are usually obtained from a finite population without replacement, and hence they have an intrinsic dependence structure. Second, in complex survey designs, the inclusion probabilities of sampling units often vary over the target population. Consequently, the correlation between the response and the covariates observed in the sample can be distorted relative to that in the population. This can happen when some parts of the population are sampled more intensively than others. Ignoring the survey design in the selection process may lead to biased selection results for the target population.

In the literature, sampling weights are often used when estimating the parameters of regression models from survey data. Weighted estimates of the regression coefficients help to avoid biased inference under informative sampling (Pfeffermann 1993; Fuller 2009, Section 6.3; Skinner 2012). Although model estimation and model selection serve their own purposes, they are often closely linked in a modeling process. It is therefore natural to conjecture that using sampling weights is also beneficial for variable selection.
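The effect of informative sampling on regression estimates can be illustrated with a small simulation. The sketch below is not taken from the paper: the population model (intercept 1, slope 2), the two inclusion probabilities, the use of Poisson sampling, and the function names are all assumptions chosen for illustration. Units with a large response are sampled ten times more intensively, and the fit is computed once with equal weights and once with the design weights (the reciprocals of the inclusion probabilities).

```python
import random


def weighted_simple_regression(x, y, w):
    """Return (intercept, slope) minimizing sum_i w_i * (y_i - a - b * x_i)**2."""
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def simulate(seed=0, pop_size=100_000):
    """Draw an informative sample and return (unweighted_fit, weighted_fit).

    Hypothetical setup: y = 1 + 2 * x + noise in the finite population, and
    each unit is included independently (Poisson sampling) with probability
    0.10 if y > 1 and 0.01 otherwise, so inclusion depends on the response.
    """
    rng = random.Random(seed)
    xs, ys, ws = [], [], []
    for _ in range(pop_size):
        x = rng.gauss(0.0, 1.0)
        y = 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)
        pi = 0.10 if y > 1.0 else 0.01      # inclusion probability
        if rng.random() < pi:
            xs.append(x)
            ys.append(y)
            ws.append(1.0 / pi)             # design weight
    unweighted = weighted_simple_regression(xs, ys, [1.0] * len(xs))
    weighted = weighted_simple_regression(xs, ys, ws)
    return unweighted, weighted


if __name__ == "__main__":
    unweighted, weighted = simulate()
    print("unweighted (intercept, slope):", unweighted)
    print("weighted   (intercept, slope):", weighted)
```

Under these assumptions, the equally weighted fit tends to be pulled away from the population coefficients because the sample over-represents large responses, while the design-weighted fit should stay close to them. The same distortion can affect which covariates appear influential, which is the concern raised above for variable selection.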

In this spirit, we investigate the use of pseudo-likelihood to take account of the sampling weights, and derive a pseudo-likelihood-based BIC for variable selection with survey data. A penalized pseudo-likelihood (PPL) procedure is further proposed for the numerical implementation of the proposed criterion. Under a joint randomization framework, we prove that the new procedure consistently identifies the influential variables. The weighted selection method is assessed through simulation studies and applied to data from the 2009 Survey on Living with Chronic Diseases in Canada.
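For orientation, a pseudo-log-likelihood in the survey literature is commonly written as a weighted sum of log-density contributions over the sampled units. The display below is a generic sketch of that form, not the paper's own definition (which is given in Section 3); the symbols $s$, $w_i$, $\pi_i$, $f$ and $\theta$ are notation assumed here for illustration.

```latex
% Generic weighted (pseudo) log-likelihood; notation assumed for illustration.
%   s     : set of sampled units
%   \pi_i : inclusion probability of unit i
%   w_i   : sampling weight of unit i
%   f     : density of the response given covariates under the model
\ell_w(\theta) \;=\; \sum_{i \in s} w_i \, \log f(y_i \mid x_i; \theta),
\qquad w_i = 1/\pi_i .
```

With equal weights this reduces to the ordinary log-likelihood of the sample, so the weights are the only place where the sampling design enters.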

The paper is organized as follows. In Section 2, we introduce the joint randomization mechanism and the super-population model. In Section 3, we derive the pseudo-likelihood-based BIC for the analysis of survey data and propose its implementation via the PPL procedure. In Section 4, we investigate the asymptotic behavior of the proposed BIC procedure. In Section 5, we use numerical studies to further assess the performance of our approach, and in Section 6 we provide concluding remarks. The proofs of the theorems are given in a separate technical supplement, Xu and Chen (2012), where the derivation of the proposed BIC can also be found.
