3 Application of the genetic algorithm to the optimal stratification problem
Marco Ballin and Giulio Barcaroli
Previous | Next
On the basis of the
setting, the stratification allocation problem
can be represented as follows:
-
a given stratification is considered as an individual;
- the genome of an individual is a vector
whose dimension is given by the number
of atomic strata;
each position
in the vector is associated to a given atomic
stratum, and contains an integer value with whereis defined as the maximum number of strata in
the final solution: if some elements of the vector have the same value, it
means that the corresponding atomic strata collapse into a new stratum
identified by this value;
-
in this way, a
stratification can be identified by a vector where each value is positionally associated to the atomic
stratum identified by the label and can assume an integer value internal to an
interval The space
of all potential stratifications (or partitions) (space of solutions) is given by all possible
vectors
-
the fitness function of an individual is the value of the cost function where the terms and are given constants, and the
are calculated by applying the Bethel
algorithm to the stratification, under precision constraints set on the target
variables.
It is worth while noting that, if we set
and
for all the atomic strata, then the value of
the cost function simply coincides with the sample size required to satisfy
precision constraints.
Having defined a suitable representation of the domain
of all possible solutions, and the fitness function to be calculated for each
solution, in the following it is reported how
operates.
Step 0: Creation of the initial generation of individuals
The first step consists in forming an initial set of
different stratifications (the initial generation of individuals): on the basis
of the value of the parameter size of the
generations,
different individuals are generated. This
means that, for the
individual,
integer values (one for each element of the
vector representing the genome)
are randomly generated from a uniform distribution in the interval
Fixing
we can set an upper limit to the maximum
number of distinct aggregate strata.
Step 1: Evaluation of fitness for each individual in the population
For each individual in the population (that is for each
one of
stratifications), its related fitness is
evaluated by calculating the total cost required to satisfy precision
constraints on the
different
estimates (in order to remove the dependence
on the scale (or range) of the values associated with the
target variables, instead of considering the
constraints expressed in the (2.7) as an upper limit to the variance of the
target variables, we set constraints on their coefficient of variation
). The evaluation is carried out by applying
the Bethel algorithm, that requires as input, for each stratum of the current
solution:
- means and
standard deviations of target variables;
- cost of
interviewing per unit;
- population
(number of units).
Each one of the above items is
computed on the basis of corresponding values in the atomic strata.
Let us consider a particular partition
of
determined by a given solution
Let
be one stratum in this partition. There are
two possibilities:
-
coincides with an atomic stratum
-
is the result of the aggregation of a subset
of the atomic strata.
In the first case, means and variances of target
variables in the stratum are known. In the second case, means and variances in
may be calculated by using the following
formulas:
where:
and
are the mean values in aggregated stratum
and atomic strata
is the number of units in atomic stratum
and
are the variances in aggregated stratum
and atomic strata
The expected cost of observing a unit in a given
aggregate stratum is calculated by averaging the costs in each contributing
atomic stratum, weighted by their population:
Finally, we can compute the population in any aggregate
stratum as the sum of the units in the contributing atomic strata:
So, in correspondence of each potential solution, we are
able to calculate dynamically all the information required to apply the optimal
allocation algorithm, that produces the total cost
that is the fitness
of the individual.
Step 2: Breeding a new generation
Once the fitness of each individual is evaluated, a
proportion of them are selected to breed a new generation. Individuals are
selected through this fitness-based
process, where fitter individuals are more likely to be selected, while only a
small proportion of less fit individuals are selected. The presence of this
second component helps to keep the diversity of the generation large enough,
preventing premature convergence on poor solutions. There
is also the option of indicating the number of the best individuals (expressed
as a percentage of the
size of the generation) that in any case must
be present also in the next generation (parameter elitism).
The next generation will thus be composed by a number of
individuals from the previous generation (the best ones), plus a number of
"children�, obtained by selecting and crossing "parents� from the current
generation. In the
approach, the genome of a "child� individual is formed using the crossover and mutation operators:
- crossover: many crossover techniques exist for
which use different data structures and
different criteria of chromosomes selection, but the general approach is to
exchange a subset of chromosomes between two parents. In our implementation,
once two parents have been selected with probability proportional to their
fitness, a crossover-point is
generated, still on a random basis. This crossover-point is an integer
belonging to the interval
Let
be this generated crossover-point: then, the
child individual will be formed by inheriting the first
chromosomes from the first parent, and the
remaining
chromosomes from the second parent;
- mutation: given the probability that an arbitrary value in a
genetic sequence will be changed from its original state (mutation chance),
proceeds to draw, for each chromosome in the
genome, a random value to decide if the value will be changed or not.
By applying the above methods of crossover and mutation,
a new individual is created which typically shares many of the characteristics
of its "parents�. New parents are selected to produce new children, and the
process continues until a new generation of individuals (stratifications) of
appropriate size is generated.
Step 3: Iteration and stopping criteria
Usually, the average fitness is increased moving from
one generation to the next. Steps 1 and 2 are repeated
until a termination condition has been reached. Common terminating conditions
are:
- the maximum number of iterations has been reached;
- a "plateau� has
been reached, such that successive iterations no longer produce better results;
- a combination of the above.
In our case, the terminating condition can be considered
as a combination of the above. Actually, the used rule is the maximum number of
iterations, but this number is determined by analysing previous runs, in order
to detect the "plateau� and be sure that additional iterations are not likely
to improve the final solution.
Critical parameters of the optimal stratification
algorithm
Here a distinction is made between the parameters that
are common to genetic algorithm, and the ones that are peculiar to the
particular problem to which it is applied, i.e., the optimal stratification of a population frame (the names of the parameters
are those used in the R package SamplingStrata).
Among the first we
list:
- size of
generation of individuals (pop);
- number of
iterations (iterations);
-
mutation chance (mut_chance);
-
elitism
(elitism_rate).
Instead, the context
parameters are:
-
minimum number of units per stratum (minnumstrat) (the Bethel algorithm is forced to allocate in each
stratum at least the number of units indicated by this parameter);
-
initial number of
strata (initialStrata);
-
possibility to
increase the maximum number of strata (addStrataFactor).
As for the first group, there are no strict rules to
assign values to these parameters. Given a particular problem, it is suggested
to carry out a number of trials in order to assess the sensitivity of the
solutions to the values of the parameters.
It is important to take into account that parameters as size of generation and elitism are in general influent on the
rapidity of convergence, and not so much on the final solution, given that a
"reasonable� number of iterations is given.
The reasonability of the parameter number of iterations can be assessed by analysing the behaviour of
the fitness function: if the values of this function are no longer decreasing
after a certain number of iterations, it is reasonable to expect that to
increase the number of iterations will not produce better results.
On the contrary, the value of mutation chance has effects on both rapidity of convergence and the
goodness of the final solution: a high mutation chance allows to avoid local
minima, at the cost of a slower convergence.
Conversely, parameters of the second group should be
given on the basis of practical considerations, related to the characteristics
and requirements of the survey that is under design.
As for the parameter minimum number of units per
stratum, if an adequate number of
observations in all strata is to be ensured (in order to take into account the
expected non response, the need of calculating sampling variance, fieldwork
reasons, etc.), a value can be set
higher than the default one (which is set to 2).
The parameter initial
number of strata is very important. First of all, its value, if associated
with a value of the parameter addStrataFactor equal to zero, determines the maximum acceptable number of strata in the final
solution. This possibility may be useful not only for fieldwork reasons (if,
for example, for organizational considerations the number of strata is to be
limited), but especially because the final solution is very sensitive to the
value of this parameter. We have experimented that if the algorithm with
different values of initialStrata is
run, from low values up to the maximum given by the number of atomic strata,
solutions can be very different. It is possible to let the algorithm to choose
for us, in this way: we set initialStrata by assigning a low value to it, together with a high value of parameter addStrataFactor (the parameter addStrataFactor is used to increase
dynamically the value set by parameter initialStrata: each time a mutation takes place, a random number between 0 and 1 is generated,
and if it is greater than the quantity (1-addStrataFactor),
the maximum number of strata is increased of one unit) (by default, it is equal
to 0). Manoeuvring these two parameters, there are different possibilities:
- for any given value of initialStrata, if addStrataFactor is set equal to 0, then the algorithm has to consider that value as a fixed
limit, and all solutions to be explored will be characterised by that maximum
number of strata;
- otherwise, if addStrataFactor is set to a value greater then 0, then the algorithm may explore solutions
varying the number of strata, from an initial value given by initialStrata, up to a maximum number
given by the number of atomic strata.
Previous | Next