12-539 Data Quality Guidelines

Estimation

Scope and purpose

Estimation is a process that approximates unknown population parameters using only that part of the population that is included in a sample. Inferences about these unknown parameters are then made, using the sampled data and associated design. Where population parameters are functions of population totals, their estimators are generally corresponding functions of the estimated population totals. Examples of parameters include simple descriptive statistics such as totals, means, ratios and percentiles, as well as more complicated analytical statistics such as regression coefficients.

Measures of precision are usually computed to evaluate the quality of a population parameter estimate and to obtain valid inferences. Although the quality of the computed estimates is in large part dependent on the preceding survey steps, the choice of an estimation method also plays an important role. In particular, auxiliary data can be used judiciously to improve the precision of these estimates.

Principles

A typical survey objective is to estimate a descriptive population quantity using the sample. The total survey error in the estimate is the amount by which the estimate differs from the true value of the quantity for the survey population (Thompson, 1997). The total survey error can be written as the sum of the sampling error and nonsampling error. The sampling error represents the error associated with estimating a parameter of interest using data from only a sample. Nonsampling errors reflect other reasons for having an imperfect estimator. These include coverage errors (imperfect frame), measurement errors and nonresponse errors.

The estimation method and the sampling design determine the properties of the sampling error. Criteria to evaluate the magnitude of the sampling error include the sampling bias and the sampling variance. Estimation methods that result in both the smallest bias and the smallest sampling variance should be chosen. Design consistency is another desirable property of an estimator.
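
One common way to combine the two criteria is the mean square error of an estimator. The identity below is a standard result, not something stated in this guideline; the symbols (the estimator \hat{\theta}, the parameter \theta, and the subscript p for expectation over the sampling design) are notation assumed here for illustration.

```latex
\mathrm{MSE}_p(\hat{\theta})
  = E_p\big[(\hat{\theta} - \theta)^2\big]
  = V_p(\hat{\theta}) + \big[B_p(\hat{\theta})\big]^2,
\qquad
B_p(\hat{\theta}) = E_p(\hat{\theta}) - \theta .
```

An estimator with a small sampling bias and a small sampling variance therefore has a small mean square error.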

The basic design-consistent Horvitz-Thompson estimator is the most natural estimator to use if there is no auxiliary information available at the estimation stage. It weights data with the inverses of the inclusion probabilities of the sampled units. Such a weight is called a sampling weight. The sampling weight can be interpreted as the number of times that each sampled unit should be replicated to represent the full population.
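
As a minimal sketch of the weighting described above, the following Python fragment computes sampling weights and a Horvitz-Thompson estimate of a total. The function name and the three-unit sample are hypothetical and chosen only for illustration; a real survey would also need a variance estimator that reflects the design.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total as the sum of y_k / pi_k over the sample."""
    if len(values) != len(inclusion_probs):
        raise ValueError("values and inclusion_probs must have the same length")
    total = 0.0
    for y, pi in zip(values, inclusion_probs):
        if not 0.0 < pi <= 1.0:
            raise ValueError("inclusion probabilities must lie in (0, 1]")
        total += y / pi  # y_k times the sampling weight 1 / pi_k
    return total


# Hypothetical sample: observed values and known inclusion probabilities.
y = [10.0, 20.0, 30.0]
pi = [0.5, 0.25, 0.125]

# Sampling weights: each unit "represents" 2, 4 and 8 population units.
sampling_weights = [1.0 / p for p in pi]

ht_estimate = horvitz_thompson_total(y, pi)  # 10*2 + 20*4 + 30*8 = 340
```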

The properties of the Horvitz-Thompson estimator can be improved when auxiliary information is available. Calibration is a procedure that can be used to incorporate auxiliary data. This procedure adjusts the sampling weights by multipliers known as calibration factors that make the estimates agree with known totals. The resulting weights are called calibration weights or final estimation weights. These calibration weights will generally result in estimates that are design consistent, and that have a smaller sampling variance than the Horvitz-Thompson estimator.
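
The simplest special case of calibration uses a single auxiliary variable with a known population total, which gives one common calibration factor (the ratio estimator). The sketch below assumes that case only; the function name and the numbers are hypothetical, and general calibration with several auxiliary variables or bounded weights requires more machinery than is shown here.

```python
def ratio_calibrate(sampling_weights, aux_values, known_aux_total):
    """Scale sampling weights so the weighted auxiliary total matches a known total.

    Single-variable (ratio) calibration: every unit gets the same factor
    g = X / sum(d_k * x_k).
    """
    weighted_aux = sum(d * x for d, x in zip(sampling_weights, aux_values))
    if weighted_aux <= 0:
        raise ValueError("weighted auxiliary total must be positive")
    factor = known_aux_total / weighted_aux  # calibration factor
    return [d * factor for d in sampling_weights]


# Hypothetical data: sampling weights d, auxiliary variable x, study variable y.
d = [10.0, 10.0, 10.0]
x = [2.0, 3.0, 5.0]
y = [4.0, 6.0, 11.0]
known_x_total = 120.0  # assumed to be known from an external source

w = ratio_calibrate(d, x, known_x_total)

# The calibration weights reproduce the known auxiliary total (up to rounding).
calibrated_x_total = sum(wk * xk for wk, xk in zip(w, x))
calibrated_y_total = sum(wk * yk for wk, yk in zip(w, y))
```

Here the sampling weights give a weighted auxiliary total of 100 against a known total of 120, so every weight is multiplied by a factor of about 1.2, and the estimate of the study variable total moves from 210 to about 252.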

If there is nonresponse, the observed sample is smaller in size than the original sample selected. To compensate for nonresponse, imputation (see section on Imputation) or reweighting can be performed. Reweighting consists of adjusting the sampling weights by nonresponse adjustment factors before applying the calibration technique. The basic principle in computing the nonresponse adjustment factors is to use the inverse of the response probabilities. However, response probabilities are unknown and must be estimated, as opposed to inclusion probabilities, which are known. The key to reducing nonresponse bias and nonresponse variance is to obtain a useful nonresponse model by taking advantage of the auxiliary information available, as much as possible.
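
A minimal sketch of reweighting, assuming the response probability is estimated by the weighted response rate within a nonresponse adjustment class. The record layout (`weight`, `cls`, `responded`), the function name and the data are hypothetical; using a weighted rather than unweighted response rate is also an assumption of this example.

```python
from collections import defaultdict


def nonresponse_adjusted_weights(units):
    """Return adjusted weights for respondents, in input order.

    Within each class the adjustment factor is the inverse of the weighted
    response rate: (sum of weights, all sampled) / (sum of weights, respondents).
    """
    sampled = defaultdict(float)
    responding = defaultdict(float)
    for u in units:
        sampled[u["cls"]] += u["weight"]
        if u["responded"]:
            responding[u["cls"]] += u["weight"]

    # A class with no respondents cannot be adjusted; it should be collapsed
    # with another class before calling this function.
    for cls in sampled:
        if responding.get(cls, 0.0) <= 0.0:
            raise ValueError(f"class {cls!r} has no respondents")

    return [
        u["weight"] * sampled[u["cls"]] / responding[u["cls"]]
        for u in units
        if u["responded"]
    ]


# Hypothetical sample: class A has a 50% response rate, class B 100%.
sample = [
    {"weight": 10.0, "cls": "A", "responded": True},
    {"weight": 10.0, "cls": "A", "responded": False},
    {"weight": 10.0, "cls": "A", "responded": True},
    {"weight": 10.0, "cls": "A", "responded": False},
    {"weight": 5.0, "cls": "B", "responded": True},
    {"weight": 5.0, "cls": "B", "responded": True},
]

adjusted = nonresponse_adjusted_weights(sample)
```

In this example the respondents in class A have their weights doubled, those in class B are unchanged, and the adjusted weights of the respondents sum to the same total as the sampling weights of the full sample. Calibration would then be applied to these adjusted weights.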

Guidelines

Proper estimation conforms to the sampling design. To that end, incorporate sampling weights in the estimation process. This implies that aspects of the sampling design such as stratification, clustering, and multi-phase or multi-stage information are reflected in the estimation of parameters and their associated variance estimators.
Use auxiliary data whenever possible to improve the reliability of the estimates. Evaluate the use of the auxiliary data. This can be done by exploration, using, for example, Statistics Canada’s Generalized Estimation System (GES), which is based on regression fitting techniques.
Whenever auxiliary data are available for sample units, together with known population totals for such data, consider using calibration estimation so that the weighted auxiliary data add up to these known totals. This may result in improved precision and lead to greater consistency between estimates from various sources. Try to constrain the range of the weights resulting from the calibration. A large heterogeneity of weights can lead to an increase in the variance of the estimates, and hence a decrease in their precision. Reducing the range of the weights can be achieved by bounding the weights (Huang and Fuller, 1978; Deville and Särndal, 1992). These bounding methods can also be used to avoid negative or excessively large weights. Singh and Mohl (1996), Stukel et al. (1996), and Fuller (2002) discuss the use of auxiliary data in detail.
When the original classification of sampling units has changed between the time of sample selection and estimation, consider domain estimation so that the new classification is reflected in the estimates. Domain estimation refers to estimation for specified subsets of the population (or domains) of interest. Often the units falling in these subsets have not been, or could not have been, identified before sampling. Estimation in the presence of dead or out-of-scope units in the sample is an example of domain estimation. These units are assigned a value of zero in the estimation process (Hidiroglou and Laniel, 2001).
Since the quality of nonresponse adjustment factors in the weights depends on model assumptions, validate the chosen model through several diagnostics and be sure to include auxiliary variables that are correlated with the propensity to respond. This provides some protection against nonresponse bias. To obtain some robustness against a model failure, form nonresponse adjustment classes and estimate the response probabilities by the response rates within these classes. Use auxiliary variables correlated with the propensity to respond in the formation of these classes. Some methods for forming homogeneous classes are discussed in Eltinge and Yansaneh (1997). Two-phase sampling theory can be used to estimate the variance for various estimators incorporating the nonresponse adjustments. For example, such procedures are provided in chapter 15 of Särndal, Swensson and Wretman (1992). Knowledge Seeker is a software package that can be used to form homogeneous classes using the methodology described in Kass (1980).
When appropriate, use double sampling to improve estimation by incorporating auxiliary data. These are data that are available for the universe and/or the larger sample (the first-phase sample in the case of two-phase sampling). Double sampling can be used (a) to stratify the second-phase sample, (b) to improve the estimate using a difference, ratio or regression estimator, or (c) to draw a subsample of nonresponding units. A fairly general approach to two-phase sampling when auxiliary data are incorporated in the estimation process via the Generalized Regression Estimator (GREG) of a total is presented in Hidiroglou and Särndal (1998). In the case of double sampling, Hidiroglou (2001) provides a general theory when auxiliary data are incorporated in the estimation process via optimal regression estimators of totals.
The sampling of units of interest may be indirect. That is, the sample of a frame of interest (representing the survey population) may be selected only via units belonging to another frame. If linkages can be established between the units of the two frames, obtain inference about the survey population by computing estimation weights for the surveyed units. These weights can be computed using the Generalized Weight Share Method given in Lavallée (2002).
Keep in mind that for longitudinal surveys, two sets of estimation weights are usually provided: the longitudinal weights and the cross-sectional weights. The longitudinal weights refer to the population at the initial selection of the longitudinal sample. These weights are usually adjusted to take into account the attrition of the sample over time. The longitudinal weights are used when performing analysis of the longitudinal data. The cross-sectional weights are related to the population established at each survey wave. These weights are normally used to produce point estimates, or differences of point estimates between two time periods. Because of the changes in the population through time, the cross-sectional weights are generally different from the longitudinal weights.
In periodic surveys with a large sample overlap between occasions, consider the use of estimation methods that exploit the correlation over time (Binder and Hidiroglou, 1988; Singh, Kennedy and Wu, 2001). One of these estimation methods is referred to as composite estimation. These methods basically treat the data from previous occasions as auxiliary variables.
Incorporate the requirements of small domains of interest at the sampling design and sample allocation stages (Singh, Gambino and Mantel, 1994). If this is not possible at the design stage, or if the domains are only specified at a later stage, consider special estimation methods (small area estimators) at the estimation stage. These methods “borrow strength” from related areas (or domains) to minimize the mean square error of the resulting estimator (Platek et al., 1987; Ghosh and Rao, 1994; Rao, 1999).
Outliers often lead to unreliable estimates for continuous variables. Outliers might be due either to extreme values measured for some characteristics, or to very large weights attached to the outlying elements, or both. Consider using objective procedures such as outlier-resistant (robust) estimators (Hidiroglou and Srinath, 1981; Fuller, 1991; Lee, 1995; Duchesne, 1999; Gwet and Lee, 2000; Chambers, Kokic, Smith and Crudas, 2000). In the case of multivariate outliers, the use of Mahalanobis' distance and Stahel-Donoho estimators, adapted to the survey design, is recommended (Patak, 1990; Franklin, Thomas and Brodeur, 2000).
Whenever possible, use generalized estimation software instead of tailor-made systems. Possible software packages to use include GES (Estevao, Hidiroglou and Särndal, 1995), SUDAAN 8.0 (Shah, et al., 1997), PC CARP (Schnell, et al., 1988), WesVar PC (Brick, et al., 2000), STATA (1997), and SAS 8.0. By using generalized systems, one can expect fewer programming errors, as well as some reduction in development costs and time.
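
The domain estimation guideline above can be illustrated with a short sketch: units outside the domain, including dead or out-of-scope units, stay in the sample with their weights but contribute a value of zero. The function name and the four-unit sample are hypothetical, and the sketch covers the point estimate only, not its variance.

```python
def domain_total(weights, values, in_domain):
    """Estimate a domain total; units outside the domain contribute zero."""
    if not (len(weights) == len(values) == len(in_domain)):
        raise ValueError("all inputs must have the same length")
    # Assign zero to units outside the domain rather than dropping them,
    # so that the full sample is still reflected in the estimation.
    domain_values = [y if d else 0.0 for y, d in zip(values, in_domain)]
    return sum(w * yd for w, yd in zip(weights, domain_values))


# Hypothetical sample: the last unit was found to be out of scope at
# collection time, and the second unit belongs to a different domain.
weights = [2.0, 4.0, 8.0, 4.0]
values = [10.0, 20.0, 30.0, 40.0]
in_domain = [True, False, True, False]

estimate = domain_total(weights, values, in_domain)  # 2*10 + 8*30 = 260
```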

References

Binder, D.A. and Hidiroglou, M.A. (1988). Sampling in time. Handbook of Statistics, P.R. Krishnaiah and C.R. Rao (eds.), 187-211.

Brick, J., Morganstein, D. and Valliant, R. (2000). Analysis of complex sample data using replication. See also http://www.westat.com/wesvar/techpapers.

Brogan, D. (1998). Software for sample survey data, misuse of standard packages. In Encyclopedia of Biostatistics, Volume 5 (P. Armitage and T. Colton, Eds.). Wiley, New York, 4167-4174. See also http://www.rti.org/sudaan/homeabout.cfm?aboutfile=whySUDAAN.cfm.

Chambers, R.L., Kokic, P., Smith, P. and Crudas, M. (2000). Winsorization for identifying and treating outliers in business surveys. Proceedings of the Second International Conference on Establishment Surveys, June 17-21, 2000, Buffalo, New York, 717-726.

Cochran, W.G. (1977). Sampling Techniques. Wiley, New York.

Deville, J.-C. and Särndal, C.E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, 376-382.

Duchesne, P. (1999). Robust calibration estimators. Survey Methodology, 25, 43-56.

Eltinge, J.L. and Yansaneh, I.S. (1997). Diagnostics for formation of nonresponse adjustment cells, with an application to income nonresponse in the U.S. Consumer Expenditure Survey. Survey Methodology, 23, 33-40.

Estevao, V., Hidiroglou, M.A. and Särndal, C.E. (1995). Methodological principles for a generalized estimation system at Statistics Canada. Journal of Official Statistics, 11, 181-204.

Franklin, Sarah F., Thomas, S. and Brodeur, M. (2000). Robust multivariate outlier detection using Mahalanobis’ distance and modified Stahel-Donoho estimators. Proceedings of the Second International Conference on Establishment Surveys, June 17-21, 2000, Buffalo, New York, 697-706.

Fuller, W.A. (1991). Simple estimators of the mean of skewed populations. Statistica Sinica, 1, 137-158.

Fuller, W.A. (2002). Regression estimation for survey samples. Survey Methodology, 28, 5-23.

Fuller, W.A. and Rao, J.N.K. (2001). A regression composite estimator with application to the Canadian Labour Force Survey. Survey Methodology, 27, 45-52.

Gambino, J., Kennedy, B. and Singh, M.P. (2001). Regression composite estimation for the Canadian Labour Force Survey: evaluation and implementation. Survey Methodology, 27, 65-74.

Ghosh, M. and Rao, J.N.K. (1994). Small area estimation: an appraisal. Statistical Science, 9, 55-93.

Gwet, J.-P. and Lee, H. (2000). An evaluation of outlier-resistant procedures in establishment surveys. Proceedings of the Second International Conference on Establishment Surveys, June 17-21, 2000, Buffalo, New York, 707-716.

Hidiroglou, M.A. (2001). Double sampling. Survey Methodology, 27, 143-154.

Hidiroglou, M.A. and Laniel, N. (2001). Sampling and estimation issues for annual and sub-annual Canadian business surveys. International Statistical Review, 69, 487-504.

Hidiroglou, M.A. and Särndal, C.E. (1998). Use of auxiliary information for two-phase sampling. Survey Methodology, 24, 11-20.

Hidiroglou, M.A. and Srinath, K.P. (1981). Some estimators of population total containing large units. Journal of the American Statistical Association, 47, 663-685.

Holt, D. and Smith, T.M.F. (1979). Post stratification. Journal of the Royal Statistical Society, A 142, 33-46.

Huang, E. and Fuller, W.A. (1978). Nonnegative regression estimation for sample survey data. Proceedings of the Social Statistics Section, American Statistical Association, 300-303.

Hulliger, B. (1995). Outlier robust Horvitz-Thompson estimators. Survey Methodology, 21, 79-87.

Kass, G.V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29, 119-127.

Lohr, S. (1999). Sampling: Design and Analysis. Duxbury Press.

Lavallée, P. (2001). La méthode généralisée du partage des poids et le calage sur marges. In Enquêtes, modèles et applications, J.-J. Droesbeke and L. Lebart (eds), Paris: Dunod, 396-403.

Lavallée, P. (2002). Le sondage indirect. Éditions de l’Université de Bruxelles.

Lavallée, P. and Caron, P. (2001). Estimation using the generalized weight share method: the case of record linkage. Survey Methodology, 27, 155-170.

Lee, H. (1995). Outliers in business surveys. In Business Survey Methods, B.G. Cox et al. (eds). Wiley, New York, 503-526.

Lemaître, G. and Dufour, J. (1987). An integrated method for weighting persons and families. Survey Methodology, 13, 199-207.

Patak, Z. (1990). Robust principal component analysis via projection pursuit. Master’s thesis, University of British Columbia, Canada.

Platek, R., Rao, J.N.K., Särndal, C.E. and Singh, M.P. (eds.) (1987). Small Area Statistics. Wiley, New York.

Rao, J.N.K. (1996). On the estimation with imputed survey data. Journal of the American Statistical Association, 91, 499-506.

Rao, J.N.K. (1999). Some recent advances in model-based small area estimation. Survey Methodology, 25, 175-186.

Särndal, C.E., Swensson, B. and Wretman, J.H. (1992). Model Assisted Survey Sampling. Springer-Verlag, New York.

Schnell, D., Kennedy, W.J., Sullivan, G., Park, H.J. and Fuller, W.A. (1988). Personal computer variance software for complex surveys. Survey Methodology, 14, 59-69. See also http://www.statlab.iastate.edu/survey/software/pccarp.html.

Shah, B.V., Barnwell, B.G. and Bieler, G.S. (1997). SUDAAN User’s Manual Release 7.5. Research Triangle Institute, North Carolina. See also http://www.rti.org/sudaan/home.cfm.

Singh, A.C., Kennedy, B. and Wu, S. (2001). Regression composite estimation for the Canadian Labour Force Survey with a rotating design. Survey Methodology, 27, 33-44.

Singh, A.C. and Mohl, C. (1996). Understanding calibration estimators in survey sampling. Survey Methodology, 22, 107-115.

Singh, M.P., Gambino, J. and Mantel, H. (1994). Issues and strategies for small area data. Survey Methodology, 20, 3-14.

Singh, M.P., Hidiroglou, M.A., Gambino, J. and Kovacevic, M. (2001). Estimation methods and related systems at Statistics Canada. International Statistical Review, 69, 461-486.

STATA Corporation (1997). STATA User’s Guide. STATA Press, College Station, TX. See http://www.fas.harvard.edu/~stats/survey-soft/iass.html#stata.

Stukel, D., Hidiroglou, M.A., and Särndal, C.-E. (1996). Variance estimation for calibration estimators: a comparison of Jackknifing versus Taylor series linearization. Survey Methodology, 22, 117-125.

Thompson, M.E. (1997). Theory of Sample Surveys. Chapman and Hall.

Valliant, R., Dorfman, A.H., and Royall, R.M. (2000). Finite Population Sampling and Inference. Wiley, New York.

Yung, W. and Rao, J.N.K. (1996). Jackknife linearization variance estimators under multistage sampling. Survey Methodology, 22, 23-32.

Date Modified: 2014-04-10