Survey Methodology

3 Application of the genetic algorithm to the optimal stratification problem

Marco Ballin and Giulio Barcaroli

In the genetic algorithm (GA) setting, the optimal stratification problem can be represented as follows:

a given stratification is considered as an individual;
the genome of an individual is a vector whose dimension is given by the number $K$ of atomic strata; each element $i$ $(i = 1, \dots, K)$ of the vector takes an integer value $v_{i}$ $(1 \leq v_{i} \leq U),$ with $U \leq K,$ where $U$ is the maximum number of aggregate strata. In this way, a stratification $P (v)$ can be identified by a vector $v = [v_{1}, \dots, v_{K}],$ where each value $v_{i}$ is positionally associated with the atomic stratum identified by the label $l_{i}$ and can take any integer value in the interval $[1, U].$ The space of all potential stratifications (or partitions) $P (v)$ (the space of solutions) is given by all possible vectors $v;$
the fitness function of an individual $P (v)$ is the value of the cost function $C (n_{1}, \dots, n_{H_{P (v)}}) = C_{0} + \sum_{h = 1}^{H_{P (v)}} C_{h} n_{h},$ where the terms $C_{0}$ and $C_{h}$ are given constants, and the $n_{1}, \dots, n_{H_{P (v)}}$ are calculated by applying the Bethel algorithm to the stratification, under precision constraints set on the target variables.

It is worth noting that, if we set $C_{0} = 0$ and $C_{h} = 1$ for all the atomic strata, then the value of the cost function simply coincides with the sample size required to satisfy the precision constraints.
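The representation above can be illustrated with a minimal Python sketch. The function names `decode_genome` and `total_cost` are hypothetical, and the sample sizes passed to `total_cost` are made-up numbers standing in for the output of the Bethel algorithm, which is not implemented here.

```python
from collections import defaultdict


def decode_genome(genome):
    """Group atomic strata (0-based positions) by their aggregate-stratum label.

    genome[i] is the value v_i assigned to the atomic stratum in position i.
    """
    groups = defaultdict(list)
    for atomic_index, label in enumerate(genome):
        groups[label].append(atomic_index)
    return dict(groups)


def total_cost(sample_sizes, unit_costs, fixed_cost=0.0):
    """Cost function C = C_0 + sum_h C_h * n_h."""
    return fixed_cost + sum(c * n for c, n in zip(unit_costs, sample_sizes))


# K = 5 atomic strata, labels drawn from [1, U] with U = 3.
genome = [1, 3, 1, 2, 3]
strata = decode_genome(genome)

# Hypothetical allocation for the three aggregate strata; with C_0 = 0 and
# C_h = 1 the cost is just the total sample size.
cost = total_cost(sample_sizes=[10, 20, 5], unit_costs=[1, 1, 1])
```

Here atomic strata 0 and 2 are merged into aggregate stratum 1, atomic strata 1 and 4 into aggregate stratum 3, and atomic stratum 3 forms aggregate stratum 2 on its own.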

Having defined a suitable representation of the domain of all possible solutions, and the fitness function to be calculated for each solution, we now describe how the GA operates.

Step 0: Creation of the initial generation of individuals

The first step consists in forming an initial set of different stratifications (the initial generation of individuals): $p$ different individuals are generated, where $p$ is the value of the parameter size of the generations. For the $j^{th}$ individual, $K$ integer values (one for each element of the vector representing the genome) are randomly generated from a uniform distribution on the interval $[1, U].$ Fixing $U \leq K$ sets an upper limit on the number of distinct aggregate strata.
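Step 0 can be sketched as follows. This is an illustration under stated assumptions, not the implementation used by the authors: the function name `initial_generation` and the `seed` argument are hypothetical, and distinctness of the $p$ individuals is enforced by simple rejection of duplicates.

```python
import random


def initial_generation(p, K, U, seed=None):
    """Draw p distinct genomes of length K with integer values in [1, U]."""
    if not 1 <= U <= K:
        raise ValueError("U must satisfy 1 <= U <= K")
    if p > U ** K:
        raise ValueError("cannot draw p distinct genomes: only U**K exist")
    rng = random.Random(seed)
    seen = set()
    population = []
    while len(population) < p:
        genome = tuple(rng.randint(1, U) for _ in range(K))
        if genome not in seen:  # keep only genomes not drawn before
            seen.add(genome)
            population.append(list(genome))
    return population


population = initial_generation(p=20, K=6, U=3, seed=1)
```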

Step 1: Evaluation of fitness for each individual in the population

For each individual in the population (that is, for each of the $p$ stratifications), the related fitness is evaluated by calculating the total cost required to satisfy precision constraints on the $G$ different ${\hat{T}}_{g}$ estimates. In order to remove the dependence on the scale (or range) of the values of the $G$ target variables, instead of expressing the constraints in (2.7) as upper limits on the variances, we set constraints on the coefficients of variation $CV ({\hat{T}}_{g}) = \sqrt{var ({\hat{T}}_{g})} / {\hat{T}}_{g}.$ The evaluation is carried out by applying the Bethel algorithm, which requires as input, for each stratum of the current solution:

means and standard deviations of target variables;
cost of interviewing per unit;
population (number of units).

Each one of the above items is computed on the basis of corresponding values in the atomic strata.
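As an illustration of how a precision constraint expressed as a coefficient of variation can be checked for a given allocation, the sketch below computes the CV of an estimated total. It assumes the usual variance formula for stratified simple random sampling without replacement, which is not restated in this section; the function name `cv_of_total` and the numbers are hypothetical.

```python
import math


def cv_of_total(N, n, means, sds):
    """CV of the estimated total for one target variable.

    N, n, means, sds are per-stratum population sizes, sample sizes,
    means and standard deviations.
    Assumed variance: sum_h N_h^2 * (1 - n_h / N_h) * S_h^2 / n_h.
    """
    total = sum(Nh * m for Nh, m in zip(N, means))
    variance = sum(
        Nh ** 2 * (1 - nh / Nh) * s ** 2 / nh
        for Nh, nh, s in zip(N, n, sds)
    )
    return math.sqrt(variance) / total


cv = cv_of_total(N=[100, 200], n=[10, 20], means=[50, 80], sds=[10, 20])
meets_constraint = cv <= 0.05  # hypothetical 5% precision constraint
```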

Let us consider a particular partition $P (v)$ of $L$ determined by a given solution $v = [v_{1}, \dots, v_{K}].$ Let $D_{i} (i = 1, 2, \dots, Q_{P (v)})$ be one stratum in this partition. There are two possibilities:

$D_{i}$ coincides with an atomic stratum $l_{k};$
$D_{i} = \{l_{j}^{i}, \dots, l_{l}^{i}\}$ is the result of the aggregation of a subset $\{l_{j}^{i}, \dots, l_{l}^{i}\}$ of the atomic strata.

In the first case, means and variances of target variables in the stratum are known. In the second case, means and variances in $D_{i}$ may be calculated by using the following formulas:

${\bar{Y}}_{g, D_{i}} = \frac{\sum_{l_{k} \in D_{i}} {\bar{Y}}_{g, l_{k}} N_{l_{k}}}{\sum_{l_{k} \in D_{i}} N_{l_{k}}} \qquad (3.1)$

$S_{g, D_{i}}^{2} = \left( \sum_{l_{k} \in D_{i}} N_{l_{k}} - 1 \right)^{- 1} \left\{ \sum_{l_{k} \in D_{i}} (N_{l_{k}} - 1) S_{g, l_{k}}^{2} + \sum_{l_{k} \in D_{i}} N_{l_{k}} \left( {\bar{Y}}_{g, l_{k}} - {\bar{Y}}_{g, D_{i}} \right)^{2} \right\} \qquad (3.2)$

where:

${\bar{Y}}_{g, D_{i}}$ and ${\bar{Y}}_{g, l_{k}}$ are the mean values in aggregated stratum $D_{i}$ and atomic stratum $l_{k};$

$N_{l_{k}}$ is the number of units in atomic stratum $l_{k};$

$S_{g, D_{i}}^{2}$ and $S_{g, l_{k}}^{2}$ are the variances in aggregated stratum $D_{i}$ and atomic stratum $l_{k}.$

The expected cost of observing a unit in a given aggregate stratum is calculated by averaging the costs in each contributing atomic stratum, weighted by their population:

$C_{D_{i}} = \frac{\sum_{l_{k} \in D_{i}} C_{l_{k}} N_{l_{k}}}{\sum_{l_{k} \in D_{i}} N_{l_{k}}} \qquad (3.3)$

Finally, we can compute the population in any aggregate stratum as the sum of the units in the contributing atomic strata:

$N_{D_{i}} = \sum_{l_{k} \in D_{i}} N_{l_{k}} \qquad (3.4)$
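Formulas (3.1) to (3.4) can be sketched in Python for a single target variable. The function name `aggregate_stratum` and the toy data are hypothetical. The example uses two atomic strata with values {1, 2, 3} and {10, 20}, so that the result can be compared with the mean and variance computed directly on the five pooled values.

```python
def aggregate_stratum(N, means, variances, costs):
    """Combine atomic strata into one aggregate stratum.

    N, means, variances, costs are per-atomic-stratum lists for one target
    variable. Returns (N_D, mean_D, variance_D, cost_D).
    """
    N_D = sum(N)                                             # (3.4)
    mean_D = sum(Nk * m for Nk, m in zip(N, means)) / N_D    # (3.1)
    within = sum((Nk - 1) * s2 for Nk, s2 in zip(N, variances))
    between = sum(Nk * (m - mean_D) ** 2 for Nk, m in zip(N, means))
    variance_D = (within + between) / (N_D - 1)              # (3.2)
    cost_D = sum(Ck * Nk for Ck, Nk in zip(costs, N)) / N_D  # (3.3)
    return N_D, mean_D, variance_D, cost_D


# Atomic stratum 1 holds the values 1, 2, 3 (mean 2, variance 1);
# atomic stratum 2 holds the values 10, 20 (mean 15, variance 50).
N_D, mean_D, variance_D, cost_D = aggregate_stratum(
    N=[3, 2], means=[2.0, 15.0], variances=[1.0, 50.0], costs=[1.0, 2.0]
)
```

The within term pools the variances of the atomic strata, while the between term accounts for the spread of their means around the aggregate mean.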

Thus, for each potential solution, we can dynamically calculate all the information required to apply the optimal allocation algorithm, which produces the total cost

$C (n_{1}, \dots, n_{H_{P (v)}}) = C_{0} + \sum_{h = 1}^{H_{P (v)}} C_{h} n_{h},$

that is, the fitness of the individual.

Step 2: Breeding a new generation

Once the fitness of each individual has been evaluated, a proportion of the individuals are selected to breed a new generation. Selection is fitness-based: fitter individuals are more likely to be selected, while only a small proportion of less fit individuals are selected. The presence of this second component helps to keep the diversity of the generation large enough, preventing premature convergence on poor solutions. There is also the option of indicating the number of best individuals (expressed as a percentage of the size $p$ of the generation) that must in any case also be present in the next generation (parameter elitism).
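The selection step can be sketched as follows. Because the fitness here is a cost to be minimised, this illustration uses rank-based weights (the cheapest individual gets the largest weight), which is an assumption and not necessarily the selection rule of the actual implementation. The function names `select_elites` and `select_parents` are hypothetical.

```python
import random


def select_elites(population, costs, elitism_rate):
    """Return the best individuals, carried unchanged into the next generation."""
    ranked = sorted(range(len(population)), key=lambda i: costs[i])
    n_elite = round(elitism_rate * len(population))
    return [population[i] for i in ranked[:n_elite]]


def select_parents(population, costs, rng):
    """Pick two parents; lower cost means a higher selection weight.

    Weights are p, p-1, ..., 1 by rank, so less fit individuals can still
    be chosen, only less often. The two parents may coincide.
    """
    ranked = sorted(range(len(population)), key=lambda i: costs[i])
    weights = [len(ranked) - rank for rank in range(len(ranked))]
    i, j = rng.choices(ranked, weights=weights, k=2)
    return population[i], population[j]


population = [[1, 1, 2], [1, 2, 2], [2, 1, 1], [1, 2, 1], [2, 2, 1]]
costs = [50, 10, 40, 20, 30]
elites = select_elites(population, costs, elitism_rate=0.4)
parent_a, parent_b = select_parents(population, costs, random.Random(0))
```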

The next generation will thus be composed of a number of individuals from the previous generation (the best ones), plus a number of "children", obtained by selecting and crossing "parents" from the current generation. In the GA approach, the genome of a "child" individual is formed using the crossover and mutation operators:

crossover: many crossover techniques exist for GAs, which use different data structures and different criteria for selecting chromosomes, but the general approach is to exchange a subset of chromosomes between two parents. In our implementation, once two parents have been selected with probability proportional to their fitness, a crossover point is generated at random. This crossover point is an integer belonging to the interval $[1, K].$ Let $c$ be the generated crossover point: the child individual is then formed by inheriting the first $c$ chromosomes from the first parent, and the remaining $(K - c)$ chromosomes from the second parent;
mutation: given the probability that an arbitrary value in a genetic sequence will be changed from its original state (mutation chance), the GA draws, for each chromosome in the genome, a random value to decide whether the value will be changed or not.

By applying the above methods of crossover and mutation, a new individual is created which typically shares many of the characteristics of its "parents". New parents are selected to produce new children, and the process continues until a new generation of individuals (stratifications) of appropriate size is generated.
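The two operators can be sketched as follows. The single-point crossover follows the description above. For mutation, the text does not say how the new value is chosen, so drawing it uniformly from $[1, U]$ is an assumption of this sketch. The function names are hypothetical.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: first c genes from parent_a, the rest from parent_b."""
    K = len(parent_a)
    c = rng.randint(1, K)  # crossover point in [1, K]; c = K copies parent_a
    return parent_a[:c] + parent_b[c:]


def mutate(genome, mut_chance, U, rng):
    """Replace each gene, with probability mut_chance, by a uniform draw from [1, U].

    The redraw may return the same value, leaving that gene unchanged.
    """
    return [
        rng.randint(1, U) if rng.random() < mut_chance else value
        for value in genome
    ]


rng = random.Random(42)
child = crossover([1, 1, 2, 3, 3, 2], [2, 3, 3, 1, 1, 2], rng)
child = mutate(child, mut_chance=0.05, U=3, rng=rng)
```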

Step 3: Iteration and stopping criteria

Usually, the average fitness improves from one generation to the next. Steps 1 and 2 are repeated until a termination condition is reached. Common termination conditions are:

the maximum number of iterations has been reached;
a "plateau" has been reached, such that successive iterations no longer produce better results;
a combination of the above.

In our case, the termination condition can be considered a combination of the above. In practice, the rule used is the maximum number of iterations, but this number is determined by analysing previous runs, in order to detect the "plateau" and to be reasonably sure that additional iterations are unlikely to improve the final solution.
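One possible way of detecting a plateau in the record of a previous run is sketched below. The function name `plateau_start`, the `patience` argument and the cost history are hypothetical; the history is assumed to hold the best cost found at each iteration.

```python
def plateau_start(best_costs, patience, tol=0.0):
    """Locate the start of a plateau in a sequence of best costs.

    Returns the index of the last improving iteration once it has been
    followed by at least `patience` iterations without an improvement
    larger than `tol`; returns None if no such plateau is found.
    """
    best = float("inf")
    last_improvement = None
    for i, cost in enumerate(best_costs):
        if cost < best - tol:
            best = cost
            last_improvement = i
        elif last_improvement is not None and i - last_improvement >= patience:
            return last_improvement
    return None


history = [120, 100, 95, 95, 90, 90, 90, 90]  # made-up best cost per iteration
start = plateau_start(history, patience=3)
```

If a plateau is found, a maximum number of iterations somewhat larger than the returned index would be a cautious choice for subsequent runs.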

Critical parameters of the optimal stratification algorithm

Here a distinction is made between the parameters that are common to any genetic algorithm, and those that are specific to the particular problem to which it is applied, i.e., the optimal stratification of a population frame (the names of the parameters are those used in the R package SamplingStrata).

The first group comprises:

size of generation of individuals (pop);
number of iterations (iterations);
mutation chance (mut_chance);
elitism (elitism_rate).

The context parameters are:

minimum number of units per stratum (minnumstrat) (the Bethel algorithm is forced to allocate in each stratum at least the number of units indicated by this parameter);
initial number of strata (initialStrata);
possibility to increase the maximum number of strata (addStrataFactor).

As for the first group, there are no strict rules to assign values to these parameters. Given a particular problem, it is suggested to carry out a number of trials in order to assess the sensitivity of the solutions to the values of the parameters.

It is important to take into account that parameters such as size of generation and elitism generally affect the speed of convergence rather than the final solution, provided that a "reasonable" number of iterations is allowed.

Whether the number of iterations is reasonable can be assessed by analysing the behaviour of the fitness function: if the values of this function are no longer decreasing after a certain number of iterations, it is reasonable to expect that increasing the number of iterations will not produce better results.

By contrast, the value of mutation chance affects both the speed of convergence and the quality of the final solution: a high mutation chance helps to avoid local minima, at the cost of slower convergence.

Conversely, parameters of the second group should be given on the basis of practical considerations, related to the characteristics and requirements of the survey that is under design.

As for the parameter minimum number of units per stratum, if an adequate number of observations is to be ensured in all strata (to take into account expected nonresponse, the need to calculate sampling variances, fieldwork reasons, etc.), a value higher than the default (which is 2) can be set.

The parameter initial number of strata is very important. First of all, its value, when combined with a value of addStrataFactor equal to zero, determines the maximum acceptable number of strata in the final solution. This may be useful not only for fieldwork reasons (if, for example, the number of strata has to be limited for organizational reasons), but especially because the final solution is very sensitive to the value of this parameter. In our experience, running the algorithm with different values of initialStrata, from low values up to the maximum given by the number of atomic strata, can yield very different solutions. It is also possible to let the algorithm choose for us: we assign a low value to initialStrata, together with a high value of addStrataFactor. The parameter addStrataFactor (equal to 0 by default) is used to increase dynamically the value set by initialStrata: each time a mutation takes place, a random number between 0 and 1 is generated and, if it is greater than the quantity (1 - addStrataFactor), the maximum number of strata is increased by one unit. By adjusting these two parameters, there are different possibilities:

for any given value of initialStrata, if addStrataFactor is set equal to 0, then the algorithm has to consider that value as a fixed limit, and all solutions to be explored will be characterised by that maximum number of strata;
otherwise, if addStrataFactor is set to a value greater than 0, then the algorithm may explore solutions with a varying number of strata, from an initial value given by initialStrata up to a maximum given by the number of atomic strata.
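The rule governing addStrataFactor can be sketched as follows. This is an illustration of the mechanism as described above, not the code of the SamplingStrata package; the function name `maybe_add_stratum` and its arguments are hypothetical.

```python
import random


def maybe_add_stratum(max_strata, add_strata_factor, n_atomic, rng):
    """Possibly raise the maximum number of strata by one after a mutation.

    A uniform draw in [0, 1) is compared with 1 - add_strata_factor; the
    limit never exceeds the number of atomic strata.
    """
    if max_strata < n_atomic and rng.random() > 1.0 - add_strata_factor:
        return max_strata + 1
    return max_strata


rng = random.Random(0)
max_strata = 3  # plays the role of initialStrata
for _ in range(100):  # one call per mutation event
    max_strata = maybe_add_stratum(max_strata, 0.1, n_atomic=20, rng=rng)
```

With `add_strata_factor` equal to 0 the draw can never exceed 1, so the limit set by the initial value stays fixed, which corresponds to the first of the two cases above.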


Date modified: 2017-09-20
