6 Concluding remarks

Chen Xu, Jiahua Chen and Harold Mantel

In this paper, we have addressed the variable selection problem in the analysis of complex surveys. When units are selected through disproportionate sampling, the correlation structure of the data can be distorted in the sample. Incorporating sampling weights in the selection process protects against biased selection results. In this spirit, we derived a survey-weighted BIC criterion based on the pseudo-likelihood and further proposed an efficient procedure (PPL) for its implementation. Under some regularity conditions, we showed that our criterion consistently identifies the influential variables under a joint randomization framework. The good performance of the proposed method was confirmed by numerical studies.
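
As a rough illustration of what a survey-weighted, pseudo-likelihood-based BIC-type score can look like in code, the following Python sketch evaluates one for a logistic regression model. It is not the criterion derived in this paper: the logistic model, the rescaling of the weights to sum to the sample size, the log(n) penalty per nonzero coefficient, and the function names `pseudo_loglik` and `weighted_bic` are all assumptions made for illustration, and the PPL procedure itself is not shown.

```python
import math


def pseudo_loglik(beta, X, y, w):
    """Survey-weighted (pseudo) log-likelihood of a logistic model.

    beta: coefficients; X: rows of covariates; y: 0/1 responses;
    w: sampling weights. Weights are rescaled to sum to n (an assumption).
    """
    n = len(y)
    scale = n / sum(w)
    total = 0.0
    for xi, yi, wi in zip(X, y, w):
        eta = sum(b * x for b, x in zip(beta, xi))
        # log(1 + exp(eta)) computed in a numerically stable way
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += scale * wi * (yi * eta - log1pexp)
    return total


def weighted_bic(beta, X, y, w):
    """BIC-type score: -2 * pseudo-log-likelihood + k * log(n),
    where k counts nonzero coefficients (assumed penalty form)."""
    n = len(y)
    k = sum(1 for b in beta if b != 0.0)
    return -2.0 * pseudo_loglik(beta, X, y, w) + k * math.log(n)
```

Under these assumptions, candidate models would be compared by evaluating `weighted_bic` at each fitted coefficient vector and preferring the smaller score.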

Acknowledgements

The authors are grateful to the associate editor and the two anonymous referees for their insightful comments and valuable suggestions. The authors are indebted to Professor J.N.K. Rao of Carleton University for his constructive comments on an earlier version of the manuscript. This work was supported by Statistics Canada and MITACS.

Appendix

Table A.1
Variables for analysis of SLCDC data with non-response adjustments: A: allocate to other categories; D: delete from the data; M: impute by mean values; NA: no adjustment applied.

| No. | Variable | Description | Levels | Missing | Adjust |
| --- | --- | --- | --- | --- | --- |
| 1 | BMHX_02 | Blood pressure control status | 2 | 1.60% | |
| 2 | GEO_QB | Provinces grouped by region - QC | 2 | - - | NA |
| 3 | GEO_ON | Provinces grouped by region - ON | 2 | - - | NA |
| 4 | GEO_BC | Provinces grouped by region - BC | 2 | - - | NA |
| 5 | GEO_PR | Provinces grouped by region - PR | 2 | - - | NA |
| 6 | DHHX_AGE | Age | Cont. | - - | NA |
| 7 | DHHX_SEX | Sex | 2 | - - | NA |
| 8 | GENXDHMH | Perceived mental health | 2 | 0.20% | |
| 9 | CNHX_05 | High blood pressure - age when diagnosed | Cont. | 2.70% | |
| 10 | MEHX_02 | No. of medications taken | Cont. | 0.30% | |
| 11 | MEHX_03 | No. of times per day medications taken | Cont. | 0.10% | |
| 12 | MEHXGMED | No. of medications for high blood pressure | Cont. | 2.00% | |
| 13 | MEHX_06 | No. of times per day bp medication taken | Cont. | 1.00% | |
| 14 | MEHXDMCO | Medication compliance - overall | 2 | 0.20% | |
| 15 | HUHXDHP | Consulted family doctor about hbp | 2 | 0.10% | |
| 16 | SMHX_11A | Smoked at any time since being diagnosed | 2 | 0.10% | |
| 17 | SMHX_13A | Drank alcohol since being diagnosed | 2 | 0.20% | |
| 18 | SMHXDSLT | Daily salt intake | 2 | 0.20% | |
| 19 | SMHXDFDC | Dietary foods | 2 | 0.10% | |
| 20 | SMHXDPAC | Exercise/physical activity | 2 | 0.10% | |
| 21 | SMHXDBW | Body weight control | 2 | 0.20% | |
| 22 | MOHXDBPM | Self-monitoring of blood pressure | 2 | 0.30% | |
| 23 | MOHX_02 | Correct use of bp measurement device | 2 | 0.50% | |
| 24 | INHX_01A | Info from family doctor | 2 | 2.40% | |
| 25 | INHX_01F | Info from family member/friend | 2 | 2.40% | |
| 26 | INHX_02A | Info from book, pamphlet, brochure | 2 | 1.50% | |
| 27 | INHX_02C | Info from package insert with medication | 2 | 1.50% | |
| 28 | INHX_02G | Info from media | 2 | 1.50% | |
| 29 | INHX_02H | Info from internet | 2 | 1.50% | |
| 30 | INHX_04 | Info received - emotional impact of hbp | 2 | 0.80% | |
| 31 | INHX_06 | Info received - correct use of medication | 2 | 0.60% | |
| 32 | INHX_07 | Info received - additional information | 2 | 0.90% | |
| 33 | CPGFGAM | Gambling activity | 2 | 0.50% | |
| 34 | DHHDECF | Household type | 2 | 0.20% | |
| 35 | EDUDH04 | Highest level of education in household | 2 | 3.40% | |
| 36 | FVCGTOT | Daily consumption - fruits and vegetables | 2 | 5.20% | |
| 37 | GEODUR2 | Urban and rural areas | 2 | - - | NA |
| 38 | HWTDBMI | Body mass index (BMI) self-report | Cont. | 2.10% | |
| 39 | INCDRPR | Household income - provincial level | 10 | 9.60% | |
| 40 | SACDTOT | Total number hours - sedentary activities | Cont. | 1.50% | |
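
The adjustment codes in the caption of Table A.1 (A, D, M, NA) can be read as simple record-level operations. The Python sketch below shows one possible reading under stated assumptions: missing values are represented by `None`, "M" uses an unweighted mean of the observed values, and "A" assigns missing values to a level chosen by the caller. The function name `adjust`, the parameter `other_level`, and the toy field names are hypothetical, and the sketch does not claim to reproduce the processing actually applied to the SLCDC data.

```python
def adjust(records, variable, code, other_level=None):
    """Apply one non-response adjustment to a list of dict records.

    code: "NA" (no adjustment), "D" (delete records with a missing value),
          "M" (impute by the unweighted mean of observed values, assumed),
          "A" (allocate missing values to `other_level`, assumed reading).
    Returns a new list; the input records are left unchanged.
    """
    if code == "NA":
        return [dict(r) for r in records]
    if code == "D":
        return [dict(r) for r in records if r[variable] is not None]
    if code == "M":
        observed = [r[variable] for r in records if r[variable] is not None]
        fill = sum(observed) / len(observed)
    elif code == "A":
        fill = other_level
    else:
        raise ValueError(f"unknown adjustment code: {code!r}")
    return [
        dict(r, **{variable: fill}) if r[variable] is None else dict(r)
        for r in records
    ]


# Toy records with hypothetical field names "x" (continuous) and "g" (categorical).
toy = [{"x": 1.0, "g": "a"}, {"x": None, "g": "b"}, {"x": 3.0, "g": None}]
imputed = adjust(toy, "x", "M")                    # missing x replaced by the mean
complete = adjust(toy, "x", "D")                   # record with missing x dropped
allocated = adjust(toy, "g", "A", other_level="b")  # missing g assigned to level "b"
```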

 

Table A.2
Influential and design variables in simulation settings: * - influential variable to the response; • - design variable affecting sampling probabilities in the 1st plan; ◊ - design variable affecting sampling probabilities in the 2nd plan.

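| No. | Variable | Model 1 | Model 2 | Model 3 | Model 4 |
| --- | --- | --- | --- | --- | --- |
| 6 | DHHX_AGE | *•◊ | *•◊ | •◊ | •◊ |
| 7 | DHHX_SEX | •◊ | •◊ | *•◊ | *•◊ |
| 8 | GENXDHMH | | | *◊ | *◊ |
| 10 | MEHX_02 | * | * | | |
| 18 | SMHXDSLT | *◊ | *◊ | | |
| 22 | MOHXDBPM | *◊ | *◊ | | |
| 26 | INHX_02A | | | *◊ | *◊ |
| 28 | INHX_02G | | | | * |
| 30 | INHX_04 | | * | | |
| 34 | DHHDECF | | * | | |
| 36 | FVCGTOT | | | | * |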

