6 Concluding remarks

Chen Xu, Jiahua Chen and Harold Mantel

In this paper, we have addressed the variable selection problem in the analysis of complex surveys. When units are selected through disproportionate sampling, the correlation structure of the data can be distorted in the sample. Incorporating sampling weights into the selection process protects against biased selection results. In this spirit, we derived a survey-weighted BIC based on the pseudo-likelihood and further proposed an efficient procedure (PPL) for its implementation. Under some regularity conditions, we showed that our criterion consistently identifies the influential variables under a joint randomization framework. The good performance of the proposed method was confirmed by numerical studies.
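To make the idea of a weighted, BIC-type selection concrete, the following is a minimal sketch and not the PPL procedure or the exact criterion derived in this paper. It assumes a Gaussian working model, rescales the sampling weights to sum to the sample size n, charges a penalty of log(n) per fitted coefficient, and searches all subsets exhaustively, which is only feasible for a handful of candidate variables. The names `solve`, `weighted_bic` and `select_subset`, and the simulated data, are hypothetical and introduced only for illustration.

```python
import itertools
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def weighted_bic(X, y, w, subset):
    """BIC-type score from a weighted Gaussian pseudo-log-likelihood (assumed form)."""
    n = len(y)
    scale = n / sum(w)
    w = [wi * scale for wi in w]  # assumption: weights rescaled to sum to n
    rows = [[1.0] + [row[j] for j in subset] for row in X]  # intercept + chosen columns
    p = len(subset) + 1
    xtwx = [[sum(wi * r[i] * r[j] for wi, r in zip(w, rows)) for j in range(p)]
            for i in range(p)]
    xtwy = [sum(wi * r[i] * yi for wi, r, yi in zip(w, rows, y)) for i in range(p)]
    beta = solve(xtwx, xtwy)  # weighted least-squares coefficients
    rss = sum(wi * (yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
              for wi, r, yi in zip(w, rows, y))
    sigma2 = rss / n
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return -2 * loglik + p * math.log(n)


def select_subset(X, y, w):
    """Return the subset of column indices with the smallest weighted BIC-type score."""
    k = len(X[0])
    candidates = [s for r in range(k + 1) for s in itertools.combinations(range(k), r)]
    return min(candidates, key=lambda s: weighted_bic(X, y, w, s))


# Simulated illustration: only column 0 influences the response.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [2.0 * x[0] + rng.gauss(0, 0.5) for x in X]
w = [rng.uniform(1, 5) for _ in range(200)]  # hypothetical sampling weights
best = select_subset(X, y, w)
print(best)
```

With many candidate variables the exhaustive search above is impractical, which is the motivation for an efficient implementation such as the one proposed in the paper.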

Acknowledgements

The authors are grateful to the associate editor and the two anonymous referees for their insightful comments and valuable suggestions. The authors are indebted to Professor J.N.K. Rao of Carleton University for his constructive comments on an earlier version of the manuscript. This work was supported by Statistics Canada and MITACS.

Appendix

Table A.1
Variables for analysis of SLCDC data with non-response adjustments: A: allocate to other categories; D: delete from the data; M: impute by mean values; NA: no adjustment applied.
Variable	Description	Levels	Missing	Adjust
1 BMHX_02	Blood pressure control status	2	1.60%	S
2 GEO_QB	Provinces grouped by region - QC	2	- -	NA
3 GEO_ON	Provinces grouped by region - ON	2	- -	NA
4 GEO_BC	Provinces grouped by region - BC	2	- -	NA
5 GEO_PR	Provinces grouped by region - PR	2	- -	NA
6 DHHX_AGE	Age	Cont.	- -	NA
7 DHHX_SEX	Sex	2	- -	NA
8 GENXDHMH	Perceived mental health	2	0.20%	A
9 CNHX_05	High blood pressure - age when diagnosed	Cont.	2.70%	S
10 MEHX_02	No. of medications taken	Cont.	0.30%	M
11 MEHX_03	No. of times per day medications taken	Cont.	0.10%	M
12 MEHXGMED	No. of medications for high blood pressure	Cont.	2.00%	M
13 MEHX_06	No. of times per day bp medication taken	Cont.	1.00%	M
14 MEHXDMCO	Medication compliance - overall	2	0.20%	A
15 HUHXDHP	Consulted family doctor about hbp	2	0.10%	A
16 SMHX_11A	Smoked at any time since being diagnosed	2	0.10%	A
17 SMHX_13A	Drank alcohol since being diagnosed	2	0.20%	A
18 SMHXDSLT	Daily salt intake	2	0.20%	A
19 SMHXDFDC	Dietary foods	2	0.10%	A
20 SMHXDPAC	Exercise/physical activity	2	0.10%	A
21 SMHXDBW	Body weight control	2	0.20%	A
22 MOHXDBPM	Self-monitoring of blood pressure	2	0.30%	A
23 MOHX_02	Correct use of bp measurement device	2	0.50%	A
24 INHX_01A	Info from family doctor	2	2.40%	A
25 INHX_01F	Info from family member/friend	2	2.40%	A
26 INHX_02A	Info from book, pamphlet, brochure	2	1.50%	A
27 INHX_02C	Info from package insert with medication	2	1.50%	A
28 INHX_02G	Info from media	2	1.50%	A
29 INHX_02H	Info from internet	2	1.50%	A
30 INHX_04	Info received - emotional impact of hbp	2	0.80%	A
31 INHX_06	Info received - correct use of medication	2	0.60%	A
32 INHX_07	Info received - additional information	2	0.90%	A
33 CPGFGAM	Gambling activity	2	0.50%	A
34 DHHDECF	Household type	2	0.20%	A
35 EDUDH04	Highest level of education in household	2	3.40%	A
36 FVCGTOT	Daily consumption - fruits and vegetables	2	5.20%	A
37 GEODUR2	Urban and rural areas	2	- -	NA
38 HWTDBMI	Body mass index (BMI) self-report	Cont.	2.10%	M
39 INCDRPR	Household income - provincial level	10	9.60%	A
40 SACDTOT	Total number hours - sedentary activities	Cont.	1.50%	M

Table A.2
Influential and design variables in simulation settings: * - influential variable to the response; • - design variable affecting sampling probabilities in the 1st plan; ◊ - design variable affecting sampling probabilities in the 2nd plan.
Variable	Model 1	Model 2	Model 3	Model 4
6 DHHX_AGE	*•◊	*•◊	•◊	•◊
7 DHHX_SEX	•◊	•◊	*•◊	*•◊
8 GENXDHMH			*◊	*◊
10 MEHX_02	*	*
18 SMHXDSLT	*◊	*◊
22 MOHXDBPM	*◊	*◊
26 INHX_02A			*◊	*◊
28 INHX_02G				*
30 INHX_04		*
34 DHHDECF		*
36 FVCGTOT				*

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, (Eds., B.N. Petrov and F. Csáki), 267-281.

Binder, D. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51, 279-292.

Binder, D., and Roberts, G. (2003). Analysis of Survey Data, Chapter: Design-based and model-based methods for estimating model parameters. Wiley Series in Survey Methodology, Chichester.

Craven, P., and Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31, 377-403.

Fan, J., and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348-1360.

Frank, I.E., and Friedman, J.H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35, 109-148.

Fuller, W.A. (2009). Sampling Statistics. Wiley, Hoboken.

Gelber, R.P., Gaziano, J.M., Manson, J.E., Buring, J.E. and Sesso, H.D. (2007). A prospective study of body mass index and the risk of developing hypertension in men. American Journal of Hypertension, 20, 370-377.

Godambe, V.P., and Thompson, M.E. (1986). Parameters of superpopulation and survey population: Their relationship and estimation. International Statistical Review, 54, 127-138.

Kalton, G. (1983). Models in the practice of survey sampling. International Statistical Review, 51, 175-188.

Korn, E.L., and Graubard, B.I. (1999). Analysis of Health Surveys. New York: John Wiley & Sons, Inc.

Kott, P.S. (1991). A model-based look at linear regression with survey data. The American Statistician, 45, 107-112.

Liu, X., Wang, L. and Liang, H. (2011). Estimation and variable selection for semiparametric additive partial linear models. Statistica Sinica, 21, 1225-1248.

Lohr, S.L., and Liu, J. (1994). A comparison of weighted and unweighted analyses in the NCVS. Journal of Quantitative Criminology, 10, 343-360.

Mallows, C.L. (1973). Some Comments on C_p. Technometrics, 15, 661-675.

Molina, E.A., and Skinner, C.J. (1992). Pseudo-likelihood and quasi-likelihood estimation for complex sampling schemes. Computational Statistics & Data Analysis, 13, 395-405.

Pfeffermann, D. (1993). The role of sampling weights when modeling survey data. International Statistical Review, 61, 317-337.

Pfeffermann, D., and Holmes, D.J. (1985). Robustness considerations in the choice of a method of inference for regression analysis of survey data. Journal of the Royal Statistical Society, Series A, 148, 268-278.

Rahiala, M., and Teräsvirta, T. (1993). Business survey data in forecasting the output of Swedish and Finnish metal and engineering industries: A Kalman filter approach. Journal of Forecasting, 12, 255-271.

Royall, R.M. (1976). The linear least-squares prediction approach to two-stage sampling. Journal of the American Statistical Association, 71, 657-664.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461-464.

She, Y. (2011). An iterative algorithm for fitting nonconvex penalized generalized linear models with grouped predictors. Computational Statistics and Data Analysis, in press.

Skinner, C. (2012). Weighting in the regression analysis of survey data with a cross-national application. Canadian Journal of Statistics, manuscript.

Statistics Canada (2009). Survey on living with chronic diseases in Canada 2009: User guide. Supplementary documentation.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society, Series B, 36, 111-147.

Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.

Wang, H., Li, R. and Tsai, C. (2007). Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika, 94, 553-568.

Wolfson, W.G. (2004). Analysis of labour force survey data for the information technology occupations 2000-2003. Report for the Software Human Resource Council, WGW Services Ltd., Ottawa, Ontario.

Xie, B., Pan, W. and Shen, X. (2008). Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics, 64, 921-930.

Xu, C., and Chen, J. (2012). Technical supplement to "Pseudo-Likelihood-Based Bayesian Information Criterion for Variable Selection in Survey Data". Available from the first author.

Yan, L.L., Liu, K., Matthews, K.A., Daviglus, M., Ferguson, T.F. and Kiefe, C.I. (2003). Psychosocial factors and risk of hypertension: The coronary artery risk development in young adults (CARDIA) study. The Journal of the American Medical Association, 290, 2138-2148.

Date modified: 2017-09-20

Survey Methodology
