Integer programming formulations applied to optimal allocation in stratified sampling

1. Introduction

A large part of the statistics produced by official statistics agencies in many countries comes from sample surveys. Such surveys have a well-defined survey population to be covered, specified by geographic location and other eligibility criteria, use appropriate frames to guide the sample selection, and apply well-specified sample selection procedures. The use of ‘standard’ probability sampling procedures makes it possible to estimate the target population parameters with controlled precision from samples that are typically small relative to the population, at a fraction of the cost of the corresponding censuses.

When designing the sampling strategy, the survey planner often seeks to optimize precision for the most important survey estimates given an available survey budget. Stratification is an important tool for exploiting prior auxiliary information available for all the population units: groups of homogeneous units are formed, and samples are then selected independently within each group. For this reason, stratification is used in a wide range of sample surveys.

Here we focus on element sampling designs (Särndal, Swensson and Wretman 1992) where the frame consists of one record per population unit, and besides identification and location information, some auxiliary information is also available for each population unit. Stratified sampling involves dividing the $N$ units in a population $U$ into $H$ homogeneous groups, called strata. These groups are formed considering one (or more) stratification variable(s), and such that variance within groups is small (the stratum formation problem).

Given a sample size $n$, once the strata are defined the next problem consists of specifying how many sample units should be selected in each stratum such that the variance of a specified estimator is minimized (the optimal sample allocation problem). When interest is restricted to estimating the population total (or mean) for a single survey variable, the well-known Neyman allocation (see e.g., Cochran 1977) may be used to decide on the sample allocation. Although surveys which have a single target variable are rare, Neyman’s simple allocation formula may still be useful because the allocation which is optimal for a target variable may still be reasonable for other survey variables which are positively correlated with the one used to drive the optimal allocation.
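As a minimal sketch of the Neyman allocation mentioned above, the stratum sample sizes are taken proportional to $N_h S_h$, where $N_h$ is the stratum size and $S_h$ the stratum standard deviation of the survey variable. The function name and the three-stratum population below are hypothetical and serve only as an illustration; the result is real-valued, so it would still need rounding and a check that no $n_h$ exceeds $N_h$.

```python
from math import fsum


def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Real-valued Neyman allocation: n_h proportional to N_h * S_h.

    Hypothetical helper for illustration; it does not round the result
    or enforce n_h <= N_h.
    """
    products = [N_h * S_h for N_h, S_h in zip(stratum_sizes, stratum_sds)]
    total = fsum(products)
    if total <= 0:
        raise ValueError("at least one stratum must have N_h * S_h > 0")
    return [n * p / total for p in products]


# Hypothetical population with three strata (illustrative numbers only).
allocation = neyman_allocation(100, [500, 300, 200], [2.0, 5.0, 10.0])
print([round(n_h, 2) for n_h in allocation])
```

The non-integer output is one reason an integer formulation of the allocation problem can be attractive.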

When a survey must produce estimates with specified levels of precision for a number of survey variables, and these variables are not strongly correlated, a method of sample allocation that enables producing estimates with the required precision for all the survey variables is needed. In this case, we have a problem of multivariate optimal sample allocation.

According to the literature, in such cases the allocation of the overall sample size $n$ to the strata may seek one of the following goals:

  1. the total variable survey cost $C$ is minimized, subject to having Coefficients of Variation (CVs) for the estimates of totals of the $m$ survey variables below specified thresholds; or
  2. a weighted sum of variances (or relative variances) of the estimates of totals for the $m$ survey variables is minimized.
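The two goals can be written schematically as follows. This is only a sketch under stated assumptions, not the formulation developed later in the paper: it assumes simple random sampling without replacement within strata, and the symbols $c_h$ (unit cost in stratum $h$), $S_{hj}^2$ (stratum variance of variable $j$), $Y_j$ (population total of variable $j$), $cv_j$ (target CV) and $w_j$ (weights), as well as the bounds on $n_h$, are notation introduced here for illustration.

```latex
% Assumed variance of the estimated total of variable j
V_j(n_1,\dots,n_H) = \sum_{h=1}^{H} N_h^2 \left( \frac{1}{n_h} - \frac{1}{N_h} \right) S_{hj}^2

% Goal 1: minimum cost subject to CV thresholds
\min_{n_1,\dots,n_H} \; C = \sum_{h=1}^{H} c_h n_h
\quad \text{s.t.} \quad
\frac{\sqrt{V_j(n_1,\dots,n_H)}}{Y_j} \le cv_j, \; j = 1,\dots,m,
\qquad 1 \le n_h \le N_h, \; n_h \in \mathbb{Z}

% Goal 2: minimum weighted sum of variances for a fixed sample size n
\min_{n_1,\dots,n_H} \; \sum_{j=1}^{m} w_j V_j(n_1,\dots,n_H)
\quad \text{s.t.} \quad
\sum_{h=1}^{H} n_h = n,
\qquad 1 \le n_h \le N_h, \; n_h \in \mathbb{Z}
```

For the relative-variance version of the second goal, each $V_j$ would be divided by $Y_j^2$.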

Note that the CV is simply the square root of the relative variance.
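The relation between the variance, the relative variance and the CV can be checked numerically. The sketch below assumes simple random sampling without replacement within strata, for which the variance of the estimated total is $\sum_h N_h^2 (1/n_h - 1/N_h) S_h^2$; the function names and the two-stratum figures are hypothetical.

```python
from math import sqrt


def variance_of_total(stratum_sizes, stratum_variances, sample_sizes):
    """Variance of the estimated total, assuming SRS without replacement
    within each stratum (an assumption of this sketch)."""
    return sum(
        N_h**2 * (1.0 / n_h - 1.0 / N_h) * S2_h
        for N_h, S2_h, n_h in zip(stratum_sizes, stratum_variances, sample_sizes)
    )


def cv_of_total(stratum_sizes, stratum_variances, sample_sizes, population_total):
    """CV of the estimated total: square root of the relative variance."""
    variance = variance_of_total(stratum_sizes, stratum_variances, sample_sizes)
    relative_variance = variance / population_total**2
    return sqrt(relative_variance)


# Hypothetical two-stratum population (illustrative numbers only).
sizes, variances, total = [100, 50], [4.0, 9.0], 1000.0
variance = variance_of_total(sizes, variances, [10, 5])
cv = cv_of_total(sizes, variances, [10, 5], total)
print(variance, cv)
```

Taking a census in every stratum ($n_h = N_h$) makes each term vanish, so the variance and the CV are both zero in that case.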

This paper presents a new approach based on developing and applying two binary integer programming formulations, one for each of these two goals, while ensuring that the resulting allocation is a global optimum. The paper is organized as follows. Section 2 reviews some key stratified sampling concepts and definitions. Section 3 describes the new approach proposed here. Section 4 provides results for a subset of numerical experiments carried out to test the proposed approach using selected population datasets. Section 5 gives some final remarks and concludes the paper. Appendix A provides information about three populations used in the numerical experiments presented in Section 4.
