Publications

Survey Methodology

2 Formalization of the optimization problem

Marco Ballin and Giulio Barcaroli

Universe of alternative stratifications

We define the sampling frame $F$ as a set of $N$ records containing information (organised in variables) on the $N$ individuals of the reference population. Some variables serve to identify units, while others can be used to define the sampling strategy. The values of the latter (from now on: auxiliary variables) can be observed by means of a census, or obtained from other sources such as administrative registers.

We assume that a set of $M$ auxiliary variables $X_{m} \; (m = 1, \dots, M)$ is available in the frame. This set may contain variables of different types (nominal, ordinal, or continuous). We also assume that continuous auxiliary variables are split into classes by applying suitable transformation algorithms.

All such variables can potentially be used to stratify the units in the frame.

Under these assumptions, we can associate with each auxiliary variable a vector $d_{m} = \{x_{1}, \dots, x_{k_{m}}\}$ of contiguous integer values, each of them representing an original value in the domain set.

Then, the most detailed stratification of $F$ can be considered as the result of the Cartesian product $C P = X_{1} \times X_{2} \times \dots \times X_{M} .$

The maximum number of strata will be $K = \prod_{m = 1}^{M} k_{m} - I^{*},$ where $I^{*}$ is the number of impossible or absent combinations of values in the frame. So, the most detailed stratification of the frame is such that it contains $K$ strata, corresponding to all possible combinations of values in the $M$ auxiliary variables. We call atomic strata the strata belonging to this particular stratification. Each atomic stratum is characterised by a unique combination of values of the $M$ auxiliary variables. We can assign a label $l_{k} (k = 1, \dots, K)$ to each atomic stratum.
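As an illustration of the definitions above, the following minimal sketch derives the atomic strata of a small frame. The frame contents and the names `frame`, `atomic`, `absent` and `labels` are hypothetical. The sketch also assumes that every domain value of each auxiliary variable is observed at least once, so that $k_{m}$ can be read off the data.

```python
from collections import Counter
from math import prod

# Hypothetical frame: one tuple of (already categorised) auxiliary values per unit.
frame = [
    (1, 1), (1, 1), (1, 2), (2, 1),
    (2, 3), (2, 3), (3, 2), (3, 3),
]

M = len(frame[0])
# k_m: number of distinct values of each auxiliary variable
# (assumption: every domain value occurs at least once in the frame)
k = [len({unit[m] for unit in frame}) for m in range(M)]

# Atomic strata are the value combinations that actually occur in the frame.
atomic = Counter(frame)      # combination -> number of units in that atomic stratum
K = len(atomic)              # number of atomic strata
absent = prod(k) - K         # I*: combinations with no unit in the frame

# Assign a label l_k (k = 1, ..., K) to each atomic stratum.
labels = {combo: i + 1 for i, combo in enumerate(sorted(atomic))}

print(K, absent)
```

Here two variables with three values each give nine possible combinations, of which six occur in the frame, so $K = 6$ and $I^{*} = 3$.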

If we consider the labelled set of atomic strata $L = \{l_{1}, l_{2}, \dots, l_{K}\},$ we can define the set of all its possible partitions $P_{1}, P_{2}, \dots, P_{B},$ where $B$ can be calculated by using the Bell formula:

$B_{K} = \sum_{i = 0}^{K - 1} \binom{K - 1}{i} B_{i}, \quad B_{0} = 1$

We define the set $\{P_{1}, P_{2}, \dots, P_{B}\}$ of partitions of $L$ as the universe (or space) of stratifications.
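To give a sense of the size of this universe, the Bell recurrence can be evaluated directly. The sketch below is only an illustration; the function name `bell_numbers` is our own.

```python
from math import comb

def bell_numbers(k_max):
    """Return [B_0, ..., B_k_max] using B_K = sum_{i=0}^{K-1} C(K-1, i) * B_i."""
    bell = [1]  # B_0 = 1
    for K in range(1, k_max + 1):
        bell.append(sum(comb(K - 1, i) * bell[i] for i in range(K)))
    return bell

B = bell_numbers(10)
print(B[4], B[10])  # 15 115975
```

Python integers have arbitrary precision, so the same function also handles larger values of $K$ exactly.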

Assessment of a given stratification

Given a partition $P_{i}$ of $L,$ characterized by $H$ strata, let $N_{h}$ and $S_{h, g}^{2} \; (h = 1, \dots, H; \; g = 1, \dots, G)$ be, respectively, the number of units in stratum $h$ and the variance in stratum $h$ of each of the $G$ survey target variables $Y_{1}, \dots, Y_{G}.$ Assuming simple random sampling of $n_{h}$ units without replacement in each stratum, the variance of the Horvitz-Thompson estimator of the total of the $g^{th}$ target variable $({\hat{T}}_{g})$ is

$\mathrm{Var} ({\hat{T}}_{g}) = \sum_{h = 1}^{H} N_{h}^{2} \left(1 - \frac{n_{h}}{N_{h}}\right) \frac{S_{h, g}^{2}}{n_{h}}, \quad g = 1, \dots, G \qquad (2.1)$

Consider the following cost function

$C (n_{1}, \dots, n_{H}) = C_{0} + \sum_{h = 1}^{H} C_{h} n_{h} \qquad (2.2)$

where $C_{0}$ indicates a fixed cost (not dependent on the sample size) and $C_{h}$ represents the average cost of observing a unit in stratum $h .$
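Equations (2.1) and (2.2) are straightforward to evaluate for a given allocation. The following sketch does so for a single target variable; the function names and the numerical inputs are hypothetical.

```python
def ht_variance(N, S2, n):
    """Equation (2.1) for one target variable; N, S2, n are lists over strata."""
    return sum(Nh**2 * (1 - nh / Nh) * s2 / nh for Nh, s2, nh in zip(N, S2, n))

def survey_cost(n, C, C0=0.0):
    """Equation (2.2): fixed cost C0 plus per-unit costs C_h times n_h."""
    return C0 + sum(ch * nh for ch, nh in zip(C, n))

# Hypothetical two-stratum example
N = [100, 200]     # stratum sizes N_h
S2 = [4.0, 9.0]    # stratum variances S_h^2
n = [10, 20]       # sample sizes n_h
C = [1.0, 2.0]     # unit costs C_h

var_total = ht_variance(N, S2, n)
cost_total = survey_cost(n, C, C0=5.0)
print(var_total, cost_total)
```

With these inputs the two strata contribute 3,600 and 16,200 to the variance, and the cost is $5 + 10 + 40 = 55.$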

Given $V_{g} \; (g = 1, \dots, G),$ the upper bounds for the expected sampling variances of ${\hat{T}}_{1}, \dots, {\hat{T}}_{G},$ the classical optimal multivariate allocation problem (Bethel 1985) can be defined as the search for the minimum (with respect to $n_{h}$) of the linear function $C$ under the convex constraints $\mathrm{Var} ({\hat{T}}_{g}) \leq V_{g}, \; g = 1, \dots, G:$

$\begin{cases} \min C (n_{1}, \dots, n_{H}) = C_{0} + \sum_{h = 1}^{H} C_{h} n_{h} \\ \mathrm{Var} ({\hat{T}}_{g}) = \sum_{h = 1}^{H} N_{h}^{2} \left(1 - \frac{n_{h}}{N_{h}}\right) \frac{S_{h, g}^{2}}{n_{h}} \leq V_{g}, \quad g = 1, \dots, G \end{cases} \qquad (2.3)$

Bethel (1989) suggested that the problem can be more easily solved by considering the following function of $n_{h} :$

$x_{h} = \begin{cases} 1 / n_{h} & \text{if } n_{h} \geq 1 \\ \infty & \text{otherwise} \end{cases} \qquad (2.4)$

Using $x_{h}$ the cost function can be written as

$C (x_{1}, \dots, x_{H}) = C_{0} + \sum_{h = 1}^{H} \frac{C_{h}}{x_{h}} \qquad (2.5)$

and the variances as

$\mathrm{Var} ({\hat{T}}_{g}) = \sum_{h = 1}^{H} N_{h}^{2} \left(1 - \frac{1}{x_{h} N_{h}}\right) S_{h, g}^{2} x_{h} = \sum_{h = 1}^{H} \left(N_{h}^{2} S_{h, g}^{2} x_{h} - N_{h} S_{h, g}^{2}\right), \quad g = 1, \dots, G \qquad (2.6)$

Consequently, the multivariate allocation problem can be defined as the search for the minimum (with respect to $x_{h}$ ) of the convex function (2.5) under a set of linear constraints

$\sum_{h = 1}^{H} \left(N_{h}^{2} S_{h, g}^{2} x_{h} - N_{h} S_{h, g}^{2}\right) \leq V_{g}, \quad g = 1, \dots, G \qquad (2.7)$

Bethel provided an algorithm, proved to converge to the solution (if one exists), by applying the method of Lagrange multipliers to this problem. A simpler algorithm had previously been proposed by Chromy (1987); as Bethel pointed out, the Chromy algorithm works in most practical cases, but there is no proof that it converges when a solution exists.

The optimization approach illustrated here yields a continuous solution, which must be rounded to provide integer stratum sample sizes. Our implementation of the Bethel algorithm provides the $n_{h}$ values as the values $1 / x_{h}$ rounded up to the next integer.
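The multivariate algorithm itself is not reproduced here. As a hedged illustration of the structure of the problem and of the rounding step, the sketch below treats only the special case of a single target variable $(G = 1),$ for which the Lagrangian conditions give the standard closed form $n_{h} = \frac{N_{h} S_{h}}{\sqrt{C_{h}}} \cdot \frac{\sum_{k} N_{k} S_{k} \sqrt{C_{k}}}{V + \sum_{k} N_{k} S_{k}^{2}}.$ This is not the Bethel algorithm; the function names and inputs are hypothetical, and the bound $n_{h} \leq N_{h}$ is not enforced.

```python
import math

def univariate_allocation(N, S2, C, V):
    """Minimise sum(C_h * n_h) subject to Var <= V for ONE target variable.

    Closed-form solution for G = 1 only (not the Bethel algorithm).
    Assumption: the bound n_h <= N_h is ignored.
    """
    a = [Nh * math.sqrt(s2) for Nh, s2 in zip(N, S2)]            # N_h * S_h
    v_prime = V + sum(Nh * s2 for Nh, s2 in zip(N, S2))          # V + sum N_h S_h^2
    root_sum = sum(ah * math.sqrt(ch) for ah, ch in zip(a, C))   # sum N_h S_h sqrt(C_h)
    n_cont = [ah / math.sqrt(ch) * root_sum / v_prime for ah, ch in zip(a, C)]
    # Round up to the next integer, with at least one unit per stratum.
    n_int = [max(1, math.ceil(x)) for x in n_cont]
    return n_cont, n_int

def ht_variance(N, S2, n):
    """Variance of the estimated total under stratified simple random sampling."""
    return sum(Nh**2 * (1 - nh / Nh) * s2 / nh for Nh, s2, nh in zip(N, S2, n))

# Hypothetical two-stratum example with unit costs
N, S2, C, V = [100, 200], [4.0, 9.0], [1.0, 1.0], 10000.0
n_cont, n_int = univariate_allocation(N, S2, C, V)
print(n_int)                       # [14, 40]
print(ht_variance(N, S2, n_int))   # below the bound V
```

The continuous solution meets the variance bound with equality. Rounding up can only lower the variance, so the integer allocation remains feasible at a slightly higher cost.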

It should be noted that the same approach can be used to deal with the multidomain problem. Let us consider the usual transformation for the domain estimation problem:

$Y_{i}^{d} = \begin{cases} Y_{i} & \text{if unit } i \text{ belongs to domain } d \\ 0 & \text{otherwise} \end{cases}$

If the quantities previously defined to describe the Bethel approach are computed using the variables $Y^{d} (d = 1, \dots, D),$ then the multivariate allocation solution is the solution for the multidomain case.
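The domain transformation simply zeroes out a target variable outside the domain of interest. A minimal sketch, with hypothetical data and a hypothetical function name:

```python
def domain_variables(y, domain, domains):
    """Return {d: [Y_i^d]} with Y_i^d = Y_i if unit i is in domain d, else 0."""
    return {
        d: [yi if di == d else 0 for yi, di in zip(y, domain)]
        for d in domains
    }

# Hypothetical target variable and domain membership for four units
y = [10, 20, 30, 40]
domain = ["north", "south", "north", "south"]

yd = domain_variables(y, domain, ["north", "south"])
print(yd["north"])  # [10, 0, 30, 0]
print(yd["south"])  # [0, 20, 0, 40]
```

Each derived variable $Y^{d}$ can then be treated as one more target variable, with its own stratum variances and its own precision constraint.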

Selection of the best stratification on the basis of a complete enumeration

In order to choose the best stratification of a given frame, i.e., the one that ensures the minimum cost $C (n_{1}, \dots, n_{H})$ of a sample whose total size and allocation comply with the precision constraints, it is possible to proceed as follows:

generate the most detailed stratification associated with $F,$ that is the set $L$ of atomic strata;
enumerate all partitions $P_{i}$ of $L;$
for each partition $P_{i},$ solve the corresponding allocation problem, which amounts to determining the vector $(n_{1}, \dots, n_{H}),$ and calculate the value $C_{i} (n_{1}, \dots, n_{H})$ associated with $P_{i};$
choose the partition $P_{i}$ for which $C_{i} (n_{1}, \dots, n_{H})$ is minimized.

By so doing, the optimization of the solution is obtained by considering the whole universe of stratifications.
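For a very small number of atomic strata, the procedure above can be sketched in a few lines. The example below is an illustration under simplifying assumptions that are ours, not the paper's: a single target variable, $C_{0} = 0$ and $C_{h} = 1$ (so the cost is the total sample size), the closed-form univariate allocation rather than the Bethel algorithm, and no enforcement of $n_{h} \leq N_{h}.$ The data and all names are hypothetical.

```python
import math
from statistics import variance

def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists of labels)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        # put `first` into each existing block in turn
        for i in range(len(partial)):
            yield partial[:i] + [[first] + partial[i]] + partial[i + 1:]
        # or open a new block containing only `first`
        yield [[first]] + partial

def cost_of(partition, atoms, v_max):
    """Total sample size needed so that the variance bound v_max holds."""
    strata = []
    for block in partition:
        y = [value for label in block for value in atoms[label]]
        s2 = variance(y) if len(y) > 1 else 0.0
        strata.append((len(y), s2))               # (N_h, S_h^2)
    root_sum = sum(N * math.sqrt(s2) for N, s2 in strata)
    v_prime = v_max + sum(N * s2 for N, s2 in strata)
    # closed-form univariate allocation, rounded up, at least 1 unit per stratum
    return sum(max(1, math.ceil(N * math.sqrt(s2) * root_sum / v_prime))
               for N, s2 in strata)

# Hypothetical target-variable values of the units in four atomic strata
atoms = {
    "l1": [1.0, 2.0, 3.0],
    "l2": [2.0, 3.0, 4.0],
    "l3": [10.0, 12.0, 14.0],
    "l4": [20.0, 25.0, 30.0],
}
labels = list(atoms)
v_max = 50.0

best = min(set_partitions(labels), key=lambda p: cost_of(p, atoms, v_max))
print(best, cost_of(best, atoms, v_max))
```

With four atomic strata the generator yields 15 partitions, in agreement with $B_{4}.$ The recursion mirrors the Bell recurrence, which is also why this approach stops being practical very quickly as $K$ grows.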

Unfortunately, this procedure is applicable only when the dimension $K$ of $L$ is low, because the number of partitions (given by the Bell formula) grows very rapidly: for example, $B_{4} = 15,$ $B_{10} = 115{,}975$ and $B_{100} \approx 4.76 \times 10^{115}.$ Therefore, in most cases, complete enumeration of the solution space is not feasible. The present proposal, based on a genetic algorithm, makes it possible to explore the universe of stratifications and to identify one that is expected to be close to the optimum.

The genetic algorithm

A genetic algorithm (GA) is a search technique used in computing to find exact or approximate solutions to optimization and search problems. Genetic algorithms are a particular class of evolutionary algorithms that make use of techniques inspired by evolutionary biology, such as inheritance, mutation, selection and crossover (also called recombination) (Vose 1999; Schmitt 2001, 2004).

A GA is implemented as an iterative computer simulation in which an initial set of individuals, each one being a potential solution to the current problem (represented by a vector called the genome), evolves by inheritance, mutation, selection and crossover, increasing the average fitness of successive generations. Here, the fitness corresponds to the objective function defined in the optimization problem, so that the evolution results in the maximization (or minimization) of the objective function.

The set of individuals treated in each iteration of the GA is called a generation. The evolution is the set of changes that occur in producing consecutive generations by iterating the process.

At each iteration of the GA, after the fitness of every individual in the generation has been evaluated, a set of individuals is stochastically selected (favouring those with higher fitness) and modified (recombined and sometimes randomly mutated) to form a new generation. This new generation is then evaluated in the next iteration of the algorithm. As individuals with the best fitness are more likely to be selected to generate individuals for the next generation, the GA produces an increase in average fitness in the course of the evolution.

The mutation rate parameter is the proportion of chromosomes (the genome elements) that may be mutated in each individual when children are generated for the next generation. A high value produces large differences between successive generations. A high mutation rate makes the GA less likely to stagnate at local optima, at the price of slower convergence to the optimal solution, whereas a low value speeds up convergence but increases the risk of being trapped in a local optimum.

Usually, the algorithm terminates when either a maximum number of iterations has been reached, or the current solution is not improved by continuing the iteration. In both cases, the optimal solution may or may not have been reached.
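The loop described in this section can be sketched as follows. This is a generic, minimal GA and not the implementation used by the authors: the genome encoding (one group label per atomic stratum), rank-based selection, one-point crossover, elitism, the parameter defaults and the toy cost function are all assumptions made for illustration. In the stratification problem, the cost would instead be the sample cost obtained by solving the allocation problem for the partition encoded by the genome.

```python
import random

def genetic_search(K, cost, pop_size=20, generations=50,
                   mutation_rate=0.05, elite=2, seed=0):
    """Minimise cost(genome); genome[k] is the group label of atomic stratum k.

    Requires K >= 2 (one-point crossover needs an interior cut point).
    """
    rng = random.Random(seed)
    population = [[rng.randrange(K) for _ in range(K)] for _ in range(pop_size)]
    history = []                                   # best cost per generation
    for _ in range(generations):
        population.sort(key=cost)                  # lower cost = higher fitness
        history.append(cost(population[0]))
        # elitism: the best individuals pass unchanged to the next generation
        children = [genome[:] for genome in population[:elite]]
        # rank-based selection: better-ranked individuals get larger weights
        weights = [pop_size - rank for rank in range(pop_size)]
        while len(children) < pop_size:
            mother, father = rng.choices(population, weights=weights, k=2)
            cut = rng.randrange(1, K)              # one-point crossover
            child = mother[:cut] + father[cut:]
            for k in range(K):                     # mutate each chromosome
                if rng.random() < mutation_rate:
                    child[k] = rng.randrange(K)
            children.append(child)
        population = children
    best = min(population, key=cost)
    history.append(cost(best))
    return best, history

# Toy stand-in for the allocation cost (hypothetical): within-group spread of
# atomic-stratum means plus a fixed penalty per group.
means = [1.0, 1.2, 5.0, 5.3, 9.0, 9.1]

def toy_cost(genome):
    groups = {}
    for value, g in zip(means, genome):
        groups.setdefault(g, []).append(value)
    within = sum(sum((v - sum(vals) / len(vals)) ** 2 for v in vals)
                 for vals in groups.values())
    return within + 0.5 * len(groups)

best, history = genetic_search(len(means), toy_cost)
print(best, history[-1])
```

Because the elite individuals are copied unchanged, the best cost never worsens from one generation to the next. As noted above, this does not guarantee that the optimum is reached when the iteration limit is hit.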

Date modified: 2017-09-20