3. Markov models for contingency tables with nonresponse
Andrés Gutiérrez, Leonardo Trujillo and Pedro Luis do Nascimento Silva
Consider the problem of estimating gross flows between two consecutive periods of time using categorical data obtained from a panel survey subject to nonresponse. Suppose that the outcome of every interview is the classification of the respondent into one of $G$ possible pairwise disjoint categories, and that the aim is to estimate the gross flows between these categories using the information from individuals who were interviewed at two consecutive periods of time. Individuals who did not respond in one or both periods, or who were included in the survey for only one of the two periods, do not have a definite classification among the categories. There are therefore three groups: individuals classified at both periods, individuals with information for only one of the two periods, and individuals who did not respond at either period of the survey.
For those individuals responding at times $t-1$ and $t$, the classification data can be summarized in a matrix of dimension $G \times G$. The available information for those individuals not responding to the survey at time $t-1$ but responding at time $t$ can be summarized in a column complement; the information for those individuals not responding at time $t$ but responding at time $t-1$ can be summarized in a row complement. Finally, individuals not responding at either of the two times are included in a single cell counting the number of individuals with missing data at both times.
The whole matrix is illustrated in Table 3.1, where $N_{ij}$ ($i, j = 1, \ldots, G$) denotes the number of individuals in the population having classification $i$ at time $t-1$ and classification $j$ at time $t$, $R_i$ denotes the number of individuals not responding at time $t$ and having classification $i$ at time $t-1$, $C_j$ denotes the number of individuals not responding at time $t-1$ and having classification $j$ at time $t$, and $M$ denotes the number of individuals in the sample not responding at either of the two times. This analysis does not take into account nonresponse due to the rotation of the survey; it considers only individuals belonging to the matched sample and ignores individuals who did not respond because they were not selected in the sample.
Table 3.1
Gross flows at two consecutive periods of time.
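| Time $t-1$ / Time $t$ | 1 | 2 | $\cdots$ | $G$ | Row complement |
|---|---|---|---|---|---|
| 1 | $N_{11}$ | $N_{12}$ | $\cdots$ | $N_{1G}$ | $R_1$ |
| 2 | $N_{21}$ | $N_{22}$ | $\cdots$ | $N_{2G}$ | $R_2$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
| $G$ | $N_{G1}$ | $N_{G2}$ | $\cdots$ | $N_{GG}$ | $R_G$ |
| Column complement | $C_1$ | $C_2$ | $\cdots$ | $C_G$ | $M$ |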
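As a minimal sketch of how such a table could be assembled from paired interview records, the following assumes categories coded as integers from 0 to $G-1$ and nonresponse coded as `None`. The function name `observed_table` and the toy data are hypothetical and serve only as illustration.

```python
import numpy as np

def observed_table(prev, curr, G):
    """Tabulate paired classifications; None marks nonresponse (assumed coding)."""
    N = np.zeros((G, G), dtype=int)  # classified at both times
    R = np.zeros(G, dtype=int)       # row complement: classified at the first time only
    C = np.zeros(G, dtype=int)       # column complement: classified at the second time only
    M = 0                            # missing at both times
    for a, b in zip(prev, curr):
        if a is not None and b is not None:
            N[a, b] += 1
        elif a is not None:
            R[a] += 1
        elif b is not None:
            C[b] += 1
        else:
            M += 1
    return N, R, C, M

# Toy records for six individuals and two categories (made-up data).
prev = [0, 0, 1, None, 1, None]
curr = [0, 1, None, 1, 1, None]
N, R, C, M = observed_table(prev, curr, G=2)
```

Every record falls in exactly one of the four blocks, so the counts in `N`, `R`, `C` and `M` add up to the number of records.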
This paper draws on ideas from Stasny (1987) and Chen and Fienberg (1974), namely a maximum likelihood approach for contingency tables with partially classified data, and treats the data as resulting from a two-stage process as follows:
- In the first stage (nonobservable), the individuals are located among the cells of a $G \times G$ matrix according to the probabilities of a Markov chain process. Let $\eta_i$ be the initial probability of an individual being in category $i$ at time $t-1$, with $\sum_{i} \eta_i = 1$, and let $p_{ij}$ be the transition probability from category $i$ to category $j$, where $\sum_{j} p_{ij} = 1$ for every $i$.
- In the second stage (observable) of the process, every individual in cell $(i, j)$ can either be nonrespondent at time $t-1$, losing the classification by row; nonrespondent at time $t$, losing the classification by column; or nonrespondent at both times, losing both classifications.
- Let $\psi(i, j)$ be the initial probability of an individual in cell $(i, j)$ responding at time $t-1$.
- Let $\rho_{RR}(i, j)$ be the transition probability of an individual in cell $(i, j)$ responding at time $t-1$ and responding at time $t$.
- Let $\rho_{MM}(i, j)$ be the transition probability of an individual in cell $(i, j)$ being a nonrespondent at time $t-1$ and remaining a nonrespondent at time $t$.

These probabilities do not depend on the classification of the individual, so they are written simply as $\psi$, $\rho_{RR}$ and $\rho_{MM}$.
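The two stages can be illustrated with a small simulation. This is only a sketch under stated assumptions: the response probabilities are taken as constant across cells, the names `simulate_two_stage`, `eta`, `P`, `psi`, `rho_rr` and `rho_mm` are illustrative, and the numerical values are made up.

```python
import numpy as np

def simulate_two_stage(n, eta, P, psi, rho_rr, rho_mm, seed=0):
    """Simulate both stages for n individuals (illustrative helper)."""
    rng = np.random.default_rng(seed)
    eta = np.asarray(eta, dtype=float)
    P = np.asarray(P, dtype=float)
    G = len(eta)
    # Stage 1 (nonobservable): category at the first time, then a Markov transition.
    first = rng.choice(G, size=n, p=eta)
    second = np.array([rng.choice(G, p=P[a]) for a in first])
    # Stage 2 (observable): response chain, independent of the classification.
    resp_first = rng.random(n) < psi
    u = rng.random(n)
    # Respondents keep responding with probability rho_rr;
    # nonrespondents remain nonrespondents with probability rho_mm.
    resp_second = np.where(resp_first, u < rho_rr, u >= rho_mm)
    return first, second, resp_first, resp_second

first, second, resp_first, resp_second = simulate_two_stage(
    20000, eta=[0.6, 0.4], P=[[0.9, 0.1], [0.2, 0.8]],
    psi=0.8, rho_rr=0.9, rho_mm=0.5)
```

In an observed data set, `first` would be masked wherever `resp_first` is false and `second` wherever `resp_second` is false, which is what produces the row complement, the column complement and the doubly missing cell.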
Data are observed only after the second stage. The aim is to make inferences about the probabilities of the Markov chain generating the data and also about those of the chain generating the nonresponse mechanism. In the context of this two-stage model, the corresponding probabilities are shown in Table 3.2.
Table 3.2
Gross flow probabilities at two consecutive times.
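| Time $t-1$ / Time $t$ | 1 | 2 | $\cdots$ | $G$ | Row complement |
|---|---|---|---|---|---|
| 1 | $\psi\rho_{RR}\eta_1 p_{11}$ | $\psi\rho_{RR}\eta_1 p_{12}$ | $\cdots$ | $\psi\rho_{RR}\eta_1 p_{1G}$ | $\sum_{j}\psi(1-\rho_{RR})\eta_1 p_{1j}$ |
| 2 | $\psi\rho_{RR}\eta_2 p_{21}$ | $\psi\rho_{RR}\eta_2 p_{22}$ | $\cdots$ | $\psi\rho_{RR}\eta_2 p_{2G}$ | $\sum_{j}\psi(1-\rho_{RR})\eta_2 p_{2j}$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
| $G$ | $\psi\rho_{RR}\eta_G p_{G1}$ | $\psi\rho_{RR}\eta_G p_{G2}$ | $\cdots$ | $\psi\rho_{RR}\eta_G p_{GG}$ | $\sum_{j}\psi(1-\rho_{RR})\eta_G p_{Gj}$ |
| Column complement | $\sum_{i}(1-\psi)(1-\rho_{MM})\eta_i p_{i1}$ | $\sum_{i}(1-\psi)(1-\rho_{MM})\eta_i p_{i2}$ | $\cdots$ | $\sum_{i}(1-\psi)(1-\rho_{MM})\eta_i p_{iG}$ | $\sum_{i}\sum_{j}(1-\psi)\rho_{MM}\eta_i p_{ij}$ |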
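A quick numerical check of these cell probabilities can be sketched as follows. It assumes response probabilities that are constant across cells; the function name `cell_probabilities`, the argument names and the parameter values are illustrative and not taken from the paper.

```python
import numpy as np

def cell_probabilities(eta, P, psi, rho_rr, rho_mm):
    """Return the four blocks of observed-cell probabilities (illustrative)."""
    eta = np.asarray(eta, dtype=float)
    P = np.asarray(P, dtype=float)
    joint = eta[:, None] * P                                  # eta_i * p_ij
    both = psi * rho_rr * joint                               # responded at both times
    row_comp = psi * (1 - rho_rr) * joint.sum(axis=1)         # responded at the first time only
    col_comp = (1 - psi) * (1 - rho_mm) * joint.sum(axis=0)   # responded at the second time only
    missing = (1 - psi) * rho_mm * joint.sum()                # responded at neither time
    return both, row_comp, col_comp, missing

both, row_comp, col_comp, missing = cell_probabilities(
    eta=[0.6, 0.4], P=[[0.9, 0.1], [0.2, 0.8]],
    psi=0.8, rho_rr=0.9, rho_mm=0.5)
total = both.sum() + row_comp.sum() + col_comp.sum() + missing
```

The four blocks partition the possible response patterns, so `total` should equal 1 up to floating-point error for any valid parameter values.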
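Under this two-stage model, the likelihood function for the observed data is proportional to

$$
\prod_{i}\prod_{j}\left[\psi\rho_{RR}\eta_i p_{ij}\right]^{N_{ij}} \times \prod_{i}\left[\sum_{j}\psi(1-\rho_{RR})\eta_i p_{ij}\right]^{R_i} \times \prod_{j}\left[\sum_{i}(1-\psi)(1-\rho_{MM})\eta_i p_{ij}\right]^{C_j} \times \left[\sum_{i}\sum_{j}(1-\psi)\rho_{MM}\eta_i p_{ij}\right]^{M},
$$

where the exponents are the counts of Table 3.1 and the bracketed terms are the corresponding cell probabilities of Table 3.2.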
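The logarithm of a likelihood of this form can be evaluated numerically as in the sketch below. It assumes a multinomial kernel built from the observed counts and the cell probabilities, with response probabilities constant across cells and all cell probabilities strictly positive. The function name `log_likelihood` and the counts are made up for illustration.

```python
import numpy as np

def log_likelihood(N, R, C, M, eta, P, psi, rho_rr, rho_mm):
    """Log-likelihood kernel; assumes strictly positive cell probabilities."""
    N, R, C = (np.asarray(a, dtype=float) for a in (N, R, C))
    joint = np.asarray(eta, dtype=float)[:, None] * np.asarray(P, dtype=float)
    ll = np.sum(N * np.log(psi * rho_rr * joint))
    ll += np.sum(R * np.log(psi * (1 - rho_rr) * joint.sum(axis=1)))
    ll += np.sum(C * np.log((1 - psi) * (1 - rho_mm) * joint.sum(axis=0)))
    ll += M * np.log((1 - psi) * rho_mm * joint.sum())
    return ll

# Made-up counts for two categories.
N = [[50, 5], [8, 30]]
R = [6, 4]
C = [3, 2]
M = 2
eta = [0.6, 0.4]
P = [[0.9, 0.1], [0.2, 0.8]]
ll = log_likelihood(N, R, C, M, eta, P, psi=0.9, rho_rr=0.9, rho_mm=0.3)
```

A function like this could be handed to a general-purpose optimizer, although the constraints that `eta` and each row of `P` sum to one would still need to be enforced separately.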
3.1 Parameters of interest
Data are only observed after the second stage, and the aim is to make inferences about both the probabilities of the Markov chain generating the data and those of the chain generating nonresponse. Under this two-stage model, the probabilities for the data matrix are shown in Table 3.2, and they constitute some of the parameters of interest.
Other parameters of interest arise from the non-observable process. Suppose a finite population exists in which every individual has a classification at each of the two periods of time. This process is non-observable because, even if census data were obtained, a complete classification would not be possible, since not all individuals would be willing to respond. Considering this non-observable process and assuming that there are $G$ possible classifications at each time, the distribution of the gross flows at the population level is shown in Table 3.3, where $X_{ij}$ is the number of units in the finite population with classification $i$ at time $t-1$ and classification $j$ at time $t$ ($i, j = 1, \ldots, G$). The population size, $N$, must satisfy the expression:

$$N = \sum_{i}\sum_{j} X_{ij}.$$
Table 3.3
Population gross flows (non-observable process) at two consecutive periods of time.
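| Time $t-1$ / Time $t$ | 1 | 2 | $\cdots$ | $j$ | $\cdots$ | $G$ |
|---|---|---|---|---|---|---|
| 1 | $X_{11}$ | $X_{12}$ | $\cdots$ | $X_{1j}$ | $\cdots$ | $X_{1G}$ |
| 2 | $X_{21}$ | $X_{22}$ | $\cdots$ | $X_{2j}$ | $\cdots$ | $X_{2G}$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\ddots$ | $\vdots$ |
| $i$ | $X_{i1}$ | $X_{i2}$ | $\cdots$ | $X_{ij}$ | $\cdots$ | $X_{iG}$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\ddots$ | $\vdots$ |
| $G$ | $X_{G1}$ | $X_{G2}$ | $\cdots$ | $X_{Gj}$ | $\cdots$ | $X_{GG}$ |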
Following the non-observable process described above, the vector of entries of this last contingency table is assumed to follow a multinomial distribution with a probability vector containing the values $\eta_i p_{ij}$ ($i, j = 1, \ldots, G$). This assumes a superpopulation model in which the contingency table counts are considered random. The probability measure governing these counts is denoted with the subindex $\xi$. Then, the probability of classification in cell $(i, j)$ for the $k$-th individual is

$$P_{\xi}\left(\text{individual } k \text{ is classified in cell } (i, j)\right) = \eta_i p_{ij}.$$

This treats $X_{ij}$ as a random variable and, if the finite population has $N$ individuals, its expected value under the model is given by

$$\mu_{ij} = E_{\xi}(X_{ij}) = N \eta_i p_{ij}.$$
Note that this expected value $\mu_{ij}$ is one of the most important parameters to be estimated in this paper, as it corresponds to the expected value of the gross flows in the population of interest between the two consecutive periods. It is also important to understand that $\mu_{ij}$ is a parameter of the two-stage model. Also, the estimators for $\eta_i$ and $p_{ij}$ are interdependent and determined by the estimates of the parameters defined at the second stage. Let $\boldsymbol{\eta}$ be the vector containing the parameters $\eta_i$, and $\mathbf{p}$ be the vector containing the parameters $p_{ij}$, for every $i, j = 1, \ldots, G$. The final parameters of interest are:
- the model parameters, determined by the vector $\boldsymbol{\theta} = (\psi, \rho_{RR}, \rho_{MM}, \boldsymbol{\eta}', \mathbf{p}')'$;
- the expected value vector of the population counts, defined as $\boldsymbol{\mu} = (\mu_{11}, \ldots, \mu_{ij}, \ldots, \mu_{GG})'$.
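As a small worked example, the expected population gross flows can be computed from the Markov chain parameters, assuming that the expected count in each cell is the population size times the initial probability times the transition probability. The variable names and the numerical values below are made up for illustration.

```python
import numpy as np

# Illustrative values: two categories and a population of 1000 units.
eta = np.array([0.6, 0.4])      # initial probabilities, summing to one
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])      # transition probabilities, each row summing to one
N_pop = 1000                    # assumed finite population size

# Expected count in cell (i, j): N_pop * eta_i * p_ij.
mu = N_pop * eta[:, None] * P

# Stack the cells row by row into a single vector of expected counts.
mu_vector = mu.ravel()
```

Here `mu` equals `[[540, 60], [80, 320]]`, and its entries add up to the population size because the cell probabilities sum to one.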