2 Formalization of the optimization problem
Marco Ballin and Giulio Barcaroli
Previous | Next
Universe of alternative stratifications
We define as sampling
frame
a set of
records containing information (organised in
variables) related to
individuals of the reference population. Some
variables are useful for the identification of units, while some other can be
used in order to define the sampling strategy. The values of the latter (from
now on: auxiliary variables) can be
observed by means of a census, or from other sources as administrative
registers.
We assume that in the frame a set of
auxiliary variables
are available. This set may contain different
typologies of variables (nominal, ordinal, or continuous). We assume also that
continuous auxiliary variables are split into classes by applying suitable
transformation algorithms.
All such variables can potentially be used to stratify
the units in the frame.
Under these assumptions, we can associate to each
auxiliary variable a vector
of contiguous integer values, each of them
representing an original value in the domain set.
Then, the most detailed stratification of
can be considered as the result of the
Cartesian product
The maximum number of strata will be
where
is the number of impossible or absent
combinations of values in the frame. So, the most detailed stratification of
the frame is such that it contains
strata, corresponding to all possible
combinations of values in the
auxiliary variables. We call atomic strata the strata belonging to
this particular stratification. Each atomic stratum is characterised by a unique combination of values of the
auxiliary variables. We can assign a label
to each atomic stratum.
If we consider the labelled set of atomic strata
we can define the set of all its possible
partitions
where
can be calculated by using the Bell formula:
We define the set
of partitions of
as the universe
(or space) of stratifications.
Assessment of a given stratification
Given a partition
of
characterized by
strata, let
and
be respectively the number of units and variances
in stratum
of the
different survey target variables
Assuming a simple random sampling of
units without replacement in each stratum, the
variance of the Horvitz-Thompson estimator of the total of the
target variable
is
Consider the following cost function
where
indicates a fixed cost (not dependent on the
sample size) and
represents the average cost of observing a
unit in stratum
Given
the upper bounds for the expected sampling
variance for
the classical optimal multivariate allocation
problem (Bethel 1985) can be defined as the search for the solution of the
minimum (with respect to
) of the linear function
under the convex
constraints
Bethel (1989) suggested that the problem can be more
easily solved by considering the following function of
Using
the cost function can be written as
and the variances as
Consequently, the multivariate allocation problem can be
defined as the search for the minimum (with respect to
) of the convex function (2.5) under a set of linear
constraints
An algorithm, that is proved to converge to the solution
(if it exists), was provided by Bethel by applying the Lagrangian multipliers
method to this problem (an easier algorithm was previously proposed by Chromy
(1987); as Bethel pointed out, the Chromy algorithm works in most of the
practical cases but there is no proof that it converges if a solution exists).
The optimization approach here illustrated yields a
continuous solution, which must be rounded to provide integer stratum sample sizes.
The implementation we made of the Bethel algorithm provides the
values as the values
rounded up to the upper integer.
It should be noted that the same approach can be used to
deal with the multidomain problem. Let us consider the usual transformation for
the domain estimation problem:
If the quantities previously defined to describe the
Bethel approach are computed using the variables
then the multivariate allocation solution is
the solution for the multidomain case.
Selection of the best stratification on the basis of a
complete enumeration
In order to choose the best stratification of a given
frame, i.e., the one that ensures the
minimum cost
associated to a sample whose total size and
allocation are compliant to precision constraints, it is possible to proceed as
follows:
-
generate the most
detailed stratification associated with
that is the set
of atomic strata;
-
enumerate all
partitions
of
-
partition
solve the corresponding allocation problem,
that is equivalent to determine the vector
and calculate the value
associated to
-
choose the
partition
for which
is minimized.
By so doing, the optimization of the solution is
obtained by considering the whole universe of stratifications.
Unfortunately, this procedure is applicable only in
situations where the dimension
of
is low: in fact, the number of partitions
(given by the Bell formula) grows very rapidly (for example,
15,
115,975 and
). Therefore, in most cases, the complete
enumeration of the space of the solutions is not feasible. The present
proposal, based on the genetic
algorithm, allows to explore the universe of stratifications and to
identify the one that is expected not to be far from the optimal.
The genetic algorithm
A genetic algorithm
is a search technique used in computing to
find exact or approximate solutions to optimization and search problems.
Genetic algorithms are a particular class of evolutionary algorithms that make
use of techniques inspired by evolutionary biology, such as inheritance, mutation, selection and crossover (also called recombination)
(Vose 1999) (Schmitt 2001 and 2004).
A
is implemented as an iterative computer
simulation, in which an initial set of individuals, each one being a potential solution to the current problem (represented by a
vector called genome), evolves by inheritance, mutation, selection and crossover, increasing the average fitness of next generations.
Here, the fitness corresponds to the
objective function defined in the optimization problem so that the evolution
results into the maximization (or minimization) of the objective function.
The set of individuals treated in each iteration of the
is called generation.
The evolution is the set of changes
that occurs in producing consecutive generations by iterating the process.
At each iteration of the
after having evaluated the fitness of every
individual in the generation, a set of individuals are stochastically selected
(privileging those with higher fitness),
and modified (recombined and sometimes randomly mutated) to form a new
generation. This new generation is then evaluated in the next iteration of the
algorithm. As individuals with the best fitness are more likely to be selected
for generating individuals for the next generation, the
produces an increase of average fitness in the
course of the evolution.
The parameter mutation
rate is expressed as the rate of chromosomes (the genome elements) that can be
mutated for each individual at the moment of the generation of children for the next generation. A high
value guarantees large differences between successive generations. It should be
noted that a high mutation rate makes the
more likely to avoid stagnating at local
optima, at the price of a slower convergence to the optimal solution; whilst a low
value accelerates the convergence speed, increasing the risk of local optima.
Usually, the algorithm terminates when either a maximum
number of iterations has been reached, or the current solution is not improved
by continuing the iteration. In both cases, the optimal solution may or may not
have been reached.
Previous | Next