2 Joint inference and super-population

Chen Xu, Jiahua Chen and Harold Mantel


The random behavior of an inference procedure is mostly inherited from the randomness in the data. In the context of surveys, the set of sampled units is random because of the probabilistic sampling design. At the same time, the value of each sampling unit may be regarded as a random outcome from some conceptual infinite super-population (Royall 1976).

In a design-based analysis, the finite population is regarded as nonrandom and all measurements of sampling units are constants. The parameters of interest are finite population quantities such as the population total or the population median. The statistical inference is evaluated based on the randomness from the probability design.

One may also regard the design-induced randomness as an artifact. The measurements of sampled units are independent realizations of a random variable from a probability model for the postulated super-population. The parameters of interest are related to the assumed model and model-based inferences are evaluated solely based on the randomization introduced from the model.

A third approach is called model-design-based inference; it incorporates the randomization from both the design and the model. In such a joint randomization mechanism, the finite population is regarded as a random sample from a super-population, and the survey sample is regarded as a second-phase sample from that super-population. The parameters of interest can be either model parameters or finite-population parameters. In this mechanism, inferences on the finite-population parameters are motivated by the super-population model. Model-design-based inference can be more efficient than pure design-based approaches when the finite population is well described by the super-population model. Compared with pure model-based approaches, it protects against model violation and is therefore more robust in general (see, e.g., Binder and Roberts 2003; Kalton 1983).
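The two layers of randomness described above can be mimicked in a small simulation. The sketch below is illustrative only and rests on assumptions that are not in the paper: a linear super-population model y = 10 + 2x + error, simple random sampling without replacement as the design, and the hypothetical helper names `draw_finite_population`, `draw_srs` and `horvitz_thompson_total`.

```python
import random


def draw_finite_population(N, rng):
    """Model randomness: N independent draws (y_i, x_i) from an assumed
    super-population model y = 10 + 2*x + error (illustrative choice)."""
    population = []
    for _ in range(N):
        x = rng.gauss(0.0, 1.0)
        y = 10.0 + 2.0 * x + rng.gauss(0.0, 1.0)
        population.append((y, x))
    return population


def draw_srs(N, n, rng):
    """Design randomness: simple random sample without replacement,
    so every unit has inclusion probability n / N."""
    return rng.sample(range(N), n)


def horvitz_thompson_total(population, sample, N):
    """Design-weighted estimate of the finite-population total of y."""
    weight = N / len(sample)  # design weight = 1 / inclusion probability
    return sum(weight * population[i][0] for i in sample)


rng = random.Random(2024)
N, n = 5000, 500
population = draw_finite_population(N, rng)  # first phase: model
sample = draw_srs(N, n, rng)                 # second phase: design

true_total = sum(y for y, _ in population)   # finite-population parameter
estimated_total = horvitz_thompson_total(population, sample, N)
model_mean = 10.0                            # super-population parameter E[Y]

print(f"finite-population mean: {true_total / N:.3f}")
print(f"design-based estimate : {estimated_total / N:.3f}")
print(f"super-population mean : {model_mean:.3f}")
```

The finite-population mean differs from the super-population mean only through the model randomness, while the estimate differs from the finite-population mean only through the design randomness; a joint analysis accounts for both.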

We study the variable selection problem under the joint randomization mechanism. Let $\mathcal{D} = \{1, \ldots, N\}$ be a finite population consisting of $N$ sampling units. The measurements on the $i^{\text{th}}$ unit are denoted $(y_i, \mathbf{x}_i)$, where $y_i$ is the response of interest and $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^T$ is a $p$-dimensional explanatory vector (covariate vector). These are regarded as independent realizations of $(Y, \mathbf{X})$ from a super-population. We postulate a generalized linear model (GLM) on the super-population as follows. Conditioning on $\mathbf{X}$, the distribution of $Y$ belongs to a natural exponential family, the density of which takes the form

$$f(y; \theta) = c(y) \exp\{\theta y - b(\theta)\}. \qquad (2.1)$$

Here $\theta$ is known as the natural parameter of $f(y; \theta)$ such that $b'(\theta) = E[Y \mid \mathbf{X}] \equiv \mu$ and $b''(\theta) = \mathrm{Var}[Y \mid \mathbf{X}] \equiv \sigma^2$, and $c(y)$ is a non-negative base measure. The influence of the explanatory variable $\mathbf{X}$ on $Y$ is expressed through $g(\mu) = \mathbf{X}^T \boldsymbol{\beta}$ for some assumed link function $g(\cdot)$, where the vector $\boldsymbol{\beta} = \{\beta_1, \ldots, \beta_p\}^T$ is the $p$-dimensional regression coefficient. If $g(\cdot)$ is the canonical link, i.e., $g(\mu) = \theta$, then we have $\theta = \mathbf{X}^T \boldsymbol{\beta}$. For simplicity, we focus on the canonical link in this paper.
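As a concrete check of these identities, consider two standard members of the natural exponential family: the Poisson distribution, with $b(\theta) = e^{\theta}$ and $c(y) = 1/y!$, and the Bernoulli distribution, with $b(\theta) = \log(1 + e^{\theta})$ and $c(y) = 1$. The sketch below differentiates $b$ numerically and compares the results with the known mean and variance under the canonical link. The covariate and coefficient values, and the helper names, are hypothetical and chosen only for illustration.

```python
import math


def poisson_b(theta):
    # Cumulant function of the Poisson family: b(theta) = exp(theta).
    return math.exp(theta)


def bernoulli_b(theta):
    # Cumulant function of the Bernoulli family: b(theta) = log(1 + exp(theta)).
    return math.log1p(math.exp(theta))


def first_derivative(f, t, h=1e-5):
    # Central finite difference approximating f'(t).
    return (f(t + h) - f(t - h)) / (2.0 * h)


def second_derivative(f, t, h=1e-4):
    # Central finite difference approximating f''(t).
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)


def linear_predictor(x, beta):
    # Canonical link: theta = x^T beta.
    return sum(xj * bj for xj, bj in zip(x, beta))


x = [1.0, 0.5, -2.0]     # hypothetical covariate vector
beta = [0.2, 1.0, 0.3]   # hypothetical regression coefficients
theta = linear_predictor(x, beta)

# Poisson: mean and variance both equal exp(theta).
print(first_derivative(poisson_b, theta), math.exp(theta))
print(second_derivative(poisson_b, theta), math.exp(theta))

# Bernoulli: mean is the logistic function of theta, variance is mu * (1 - mu).
mu = 1.0 / (1.0 + math.exp(-theta))
print(first_derivative(bernoulli_b, theta), mu)
print(second_derivative(bernoulli_b, theta), mu * (1.0 - mu))
```

For the Poisson family the canonical link is the log link, and for the Bernoulli family it is the logit link, which is why $\theta = \mathbf{X}^T \boldsymbol{\beta}$ can be plugged directly into $b'$ and $b''$.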

Based on this model, the effect of each explanatory variable is characterized by the size of its regression coefficient. In applications, a complex model with many variables often leads to over-fitting and poor interpretability. Hence, it is desirable to fit the data with a parsimonious model in which many regression coefficients are estimated to be zero. Explanatory variables with nonzero coefficients are then considered to be influential on the response. To this end, we assume that $\boldsymbol{\beta}$ is ideally sparse, and address the variable selection problem through identifying a sparse model formed by the covariates with nonzero coefficients.
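To make the notion of a sparse fitted model concrete, the sketch below uses an L1-penalized (lasso) logistic regression solved by proximal-gradient iterations. This is a generic, swapped-in technique chosen because it sets some coefficients exactly to zero; it is not claimed to be the selection procedure studied in this paper, and it ignores the survey design altogether. The simulated data, the penalty level `lam`, the step size and all function names are assumptions made for illustration.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function (inverse of the logit link).
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def gradient(X, y, beta):
    """Gradient of the average negative log-likelihood of a logistic GLM."""
    n, p = len(X), len(beta)
    grad = [0.0] * p
    for xi, yi in zip(X, y):
        theta = sum(a * b for a, b in zip(xi, beta))  # theta = x^T beta
        resid = sigmoid(theta) - yi                   # b'(theta) - y
        for j in range(p):
            grad[j] += resid * xi[j] / n
    return grad


def soft_threshold(z, t):
    # Proximal operator of t * |z|: shrinks z and maps small values to 0.
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_logistic(X, y, lam, step=1.0, iters=300):
    """Proximal-gradient iterations for L1-penalized logistic regression.
    step=1.0 is an assumption suited to roughly standardized covariates."""
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        g = gradient(X, y, beta)
        beta = [soft_threshold(b - step * gj, step * lam)
                for b, gj in zip(beta, g)]
    return beta


# Simulated data with a sparse true coefficient vector (illustrative values).
rng = random.Random(7)
n, p = 300, 6
true_beta = [1.5, -1.5, 0.0, 0.0, 0.0, 0.0]
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [1.0 if rng.random() < sigmoid(sum(a * b for a, b in zip(xi, true_beta)))
     else 0.0 for xi in X]

beta_hat = lasso_logistic(X, y, lam=0.05)
support = [j for j, b in enumerate(beta_hat) if b != 0.0]
print("estimated coefficients:", [round(b, 3) for b in beta_hat])
print("selected covariates   :", support)
```

The selected model is the set of covariates in `support`. Larger values of `lam` give sparser models, and any `lam` above the largest absolute gradient component at zero returns the empty model.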
