6 Concluding remarks
Chen Xu, Jiahua Chen and Harold Mantel
In this paper, we have addressed the variable selection problem in the analysis of complex survey data. When units are selected through disproportionate sampling, the correlation structure of the data reflected in the sample can be distorted. Incorporating sampling weights into the selection process protects against biased selection results. In this spirit, we derived a survey-weighted BIC based on the pseudo-likelihood and proposed an efficient procedure (PPL) for its implementation. Under some regularity conditions, we showed that our criterion consistently identifies the influential variables under a joint randomization framework. The good performance of the proposed method was confirmed by numerical studies.
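To make the idea of a survey-weighted, BIC-type criterion concrete, the following is a minimal sketch and not the criterion or the PPL procedure derived in this paper. It assumes a Gaussian linear working model, rescales the sampling weights to sum to the sample size, uses the classical `k * log(n)` penalty, and replaces the penalized search with a brute-force enumeration of subsets. The function names, the simulated data and the weights are all hypothetical.

```python
from itertools import combinations

import numpy as np


def weighted_gaussian_bic(X, y, w):
    """BIC-type score from a survey-weighted Gaussian pseudo-log-likelihood.

    Assumptions (illustrative, not taken from the paper): weights are rescaled
    to sum to n, and the penalty is k * log(n) with k the number of regression
    coefficients (the variance parameter is common to all models and ignored).
    """
    n, k = X.shape
    w = np.asarray(w, dtype=float) * n / np.sum(w)
    sw = np.sqrt(w)
    # Weighted least squares maximises the weighted Gaussian pseudo-likelihood.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    resid = y - X @ beta
    sigma2 = np.sum(w * resid ** 2) / n
    # Pseudo-log-likelihood evaluated at its maximiser (beta, sigma2).
    pseudo_loglik = -0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    return -2.0 * pseudo_loglik + k * np.log(n)


def select_by_weighted_bic(Z, y, w):
    """Return the column indices of Z that minimise the weighted BIC.

    An intercept is always included. All subsets are enumerated, which is
    only feasible for a handful of candidate variables.
    """
    n, p = Z.shape
    ones = np.ones((n, 1))
    best_score, best_subset = np.inf, ()
    for size in range(p + 1):
        for subset in combinations(range(p), size):
            X = np.hstack([ones, Z[:, list(subset)]])
            score = weighted_gaussian_bic(X, y, w)
            if score < best_score:
                best_score, best_subset = score, subset
    return best_subset


# Simulated illustration: only the first two of four candidates affect y.
rng = np.random.default_rng(0)
n = 400
Z = rng.normal(size=(n, 4))
y = 1.0 + 2.0 * Z[:, 0] - 1.5 * Z[:, 1] + rng.normal(size=n)
w = rng.uniform(1.0, 5.0, size=n)  # hypothetical sampling weights
print(select_by_weighted_bic(Z, y, w))
```

In this toy example the weights are unrelated to the response, so it only exercises the mechanics. The weighting matters when selection probabilities depend on the outcome, which is the situation the paper is concerned with.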
Acknowledgements
The authors are grateful to the associate editor and the two anonymous referees for their insightful comments and valuable suggestions. The authors are indebted to Professor J.N.K. Rao of Carleton University for his constructive comments on an earlier version of the manuscript. This work was supported by Statistics Canada and MITACS.
Appendix
Variable | Description | Levels | Missing | Adjust |
---|---|---|---|---|
1 BMHX_02 | Blood pressure control status | 2 | 1.60% | S |
2 GEO_QB | Provinces grouped by region - QC | 2 | - - | NA |
3 GEO_ON | Provinces grouped by region - ON | 2 | - - | NA |
4 GEO_BC | Provinces grouped by region - BC | 2 | - - | NA |
5 GEO_PR | Provinces grouped by region - PR | 2 | - - | NA |
6 DHHX_AGE | Age | Cont. | - - | NA |
7 DHHX_SEX | Sex | 2 | - - | NA |
8 GENXDHMH | Perceived mental health | 2 | 0.20% | A |
9 CNHX_05 | High blood pressure - age when diagnosed | Cont. | 2.70% | S |
10 MEHX_02 | No. of medications taken | Cont. | 0.30% | M |
11 MEHX_03 | No. of times per day medications taken | Cont. | 0.10% | M |
12 MEHXGMED | No. of medications for high blood pressure | Cont. | 2.00% | M |
13 MEHX_06 | No. of times per day bp medication taken | Cont. | 1.00% | M |
14 MEHXDMCO | Medication compliance - overall | 2 | 0.20% | A |
15 HUHXDHP | Consulted family doctor about hbp | 2 | 0.10% | A |
16 SMHX_11A | Smoked at any time since being diagnosed | 2 | 0.10% | A |
17 SMHX_13A | Drank alcohol since being diagnosed | 2 | 0.20% | A |
18 SMHXDSLT | Daily salt intake | 2 | 0.20% | A |
19 SMHXDFDC | Dietary foods | 2 | 0.10% | A |
20 SMHXDPAC | Exercise/physical activity | 2 | 0.10% | A |
21 SMHXDBW | Body weight control | 2 | 0.20% | A |
22 MOHXDBPM | Self-monitoring of blood pressure | 2 | 0.30% | A |
23 MOHX_02 | Correct use of bp measurement device | 2 | 0.50% | A |
24 INHX_01A | Info from family doctor | 2 | 2.40% | A |
25 INHX_01F | Info from family member/friend | 2 | 2.40% | A |
26 INHX_02A | Info from book, pamphlet, brochure | 2 | 1.50% | A |
27 INHX_02C | Info from package insert with medication | 2 | 1.50% | A |
28 INHX_02G | Info from media | 2 | 1.50% | A |
29 INHX_02H | Info from internet | 2 | 1.50% | A |
30 INHX_04 | Info received - emotional impact of hbp | 2 | 0.80% | A |
31 INHX_06 | Info received - correct use of medication | 2 | 0.60% | A |
32 INHX_07 | Info received - additional information | 2 | 0.90% | A |
33 CPGFGAM | Gambling activity | 2 | 0.50% | A |
34 DHHDECF | Household type | 2 | 0.20% | A |
35 EDUDH04 | Highest level of education in household | 2 | 3.40% | A |
36 FVCGTOT | Daily consumption - fruits and vegetables | 2 | 5.20% | A |
37 GEODUR2 | Urban and rural areas | 2 | - - | NA |
38 HWTDBMI | Body mass index (BMI) self-report | Cont. | 2.10% | M |
39 INCDRPR | Household income - provincial level | 10 | 9.60% | A |
40 SACDTOT | Total number hours - sedentary activities | Cont. | 1.50% | M |
Variable | Model 1 | Model 2 | Model 3 | Model 4 |
---|---|---|---|---|
6 DHHX_AGE | *•◊ | *•◊ | •◊ | •◊ |
7 DHHX_SEX | •◊ | •◊ | *•◊ | *•◊ |
8 GENXDHMH | *◊ | *◊ | | |
10 MEHX_02 | * | * | | |
18 SMHXDSLT | *◊ | *◊ | | |
22 MOHXDBPM | *◊ | *◊ | | |
26 INHX_02A | *◊ | *◊ | | |
28 INHX_02G | * | | | |
30 INHX_04 | * | | | |
34 DHHDECF | * | | | |
36 FVCGTOT | * | | | |