3. Markov models for contingency tables with nonresponse

Andrés Gutiérrez, Leonardo Trujillo and Pedro Luis do Nascimento Silva

Consider the problem of estimating gross flows between two consecutive periods of time using categorical data obtained from a panel survey subject to nonresponse. Suppose that the outcome of every interview is the classification of the respondent into one of $G$ pairwise disjoint categories, and that the aim is to estimate the gross flows between these categories using the information from individuals who were interviewed at two consecutive periods. Individuals who did not respond in one or both periods, or who were included in the sample for only one of the two periods, do not have a definite classification among the categories. There are therefore three groups: individuals classified in both periods, individuals with information for only one of the two periods, and individuals who did not respond in either period.

For individuals responding at both times $t - 1$ and $t$ , the classification data can be summarized in a matrix of dimension $G \times G$ . The information available for individuals who did not respond to the survey at time $t - 1$ but responded at time $t$ can be summarized in a column complement; the information for individuals who responded at time $t - 1$ but not at time $t$ can be summarized in a row complement. Finally, individuals who did not respond at either time are included in a single cell counting the number of individuals with missing data at both times.

The whole matrix is illustrated in Table 3.1, where $N_{i j}$ ( $i, j = 1, \dots, G$ ) denotes the number of individuals in the population having classification $i$ at time $t - 1$ and classification $j$ at time $t$ , $R_{i}$ denotes the number of individuals not responding at time $t$ and having classification $i$ at time $t - 1$ , $C_{j}$ denotes the number of individuals not responding at time $t - 1$ and having classification $j$ at time $t$ , and $M$ denotes the number of individuals in the sample not responding at either time. Note that this analysis does not take into account nonresponse due to rotation in the survey; it considers only individuals belonging to the matched sample and ignores individuals who did not respond because they were not selected in the sample.

Table 3.1
Gross flows at two consecutive periods of time.
Time $t - 1$	Time $t$
Time $t - 1$	1	2	$\dots$	$G$	Row complement
1	$N_{11}$	$N_{12}$	$\dots$	$N_{1 G}$	$R_{1}$
2	$N_{21}$	$N_{22}$	$\dots$	$N_{2 G}$	$R_{2}$
$⋮$	$⋮$	$⋮$	$⋮$	$⋮$	$⋮$
$G$	$N_{G 1}$	$N_{G 2}$	$\dots$	$N_{G G}$	$R_{G}$
Column complement	$C_{1}$	$C_{2}$	$\dots$	$C_{G}$	$M$
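
As an illustration of how the layout of Table 3.1 arises from matched records, the following minimal sketch tabulates pairs of classifications into $N_{i j}$ , $R_{i}$ , $C_{j}$ and $M$ . The function name `build_gross_flow_table`, the use of `None` to mark nonresponse and the example records are assumptions made for illustration only; they are not part of the original paper.

```python
def build_gross_flow_table(pairs, G):
    """Tabulate (time t-1, time t) category pairs into N, R, C and M.

    Categories are coded 1..G; None marks nonresponse at that time.
    """
    N = [[0] * G for _ in range(G)]  # classified at both times
    R = [0] * G                      # row complement: classified at t-1 only
    C = [0] * G                      # column complement: classified at t only
    M = 0                            # not classified at either time
    for prev, curr in pairs:
        if prev is not None and curr is not None:
            N[prev - 1][curr - 1] += 1
        elif prev is not None:
            R[prev - 1] += 1
        elif curr is not None:
            C[curr - 1] += 1
        else:
            M += 1
    return N, R, C, M


# Hypothetical matched sample with G = 2 categories.
sample = [(1, 1), (1, 2), (2, 2), (1, None), (None, 2), (None, None), (2, 2)]
N, R, C, M = build_gross_flow_table(sample, G=2)
```

Every record falls into exactly one of the four blocks, so the counts in the table add up to the size of the matched sample.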

This paper draws on ideas from Stasny (1987) and Chen and Fienberg (1974), namely a maximum likelihood approach to contingency tables with partially classified data, and treats the data as resulting from a two-stage process as follows:

In the first stage (nonobservable), the individuals are allocated among the cells of a $G \times G$ matrix according to the probabilities of a Markov chain process. Let $η_{i}$ be the initial probability of an individual being in category $i$ at time $t - 1$ , with $\sum_{i} η_{i} = 1,$ and let $p_{i j}$ be the transition probability from category $i$ to category $j$ , where $\sum_{j} p_{i j} = 1$ for every $i$ .
In the second stage (observable) of the process, every individual in cell $i j$ can either be nonrespondent at time $t - 1$ , losing the classification by row; nonrespondent at time $t$ , losing the classification by column; or nonrespondent at both times, losing both classifications.
- Let $ψ$ be the initial probability of an individual in cell $i j$ responding at time $t - 1.$
- Let $ρ_{R R}$ be the transition probability that an individual in cell $i j$ who responded at time $t - 1$ also responds at time $t .$
- Let $ρ_{M M}$ be the transition probability that an individual in cell $i j$ who was a nonrespondent at time $t - 1$ remains a nonrespondent at time $t .$
These probabilities do not depend on the classification of the individual.
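
The two-stage process described above can be sketched as a small simulation. This is only an illustrative sketch using the Python standard library: the function name `simulate_individual`, the coding of nonresponse as `None` and the numerical parameter values are assumptions, not quantities taken from the paper.

```python
import random


def simulate_individual(eta, p, psi, rho_RR, rho_MM, rng):
    """Draw one observed (time t-1, time t) pair; None marks nonresponse."""
    G = len(eta)
    categories = list(range(1, G + 1))
    # Stage 1 (nonobservable): Markov chain over the G categories.
    i = rng.choices(categories, weights=eta)[0]
    j = rng.choices(categories, weights=p[i - 1])[0]
    # Stage 2 (observable): response chain, independent of (i, j).
    responds_prev = rng.random() < psi
    if responds_prev:
        # A respondent at t-1 responds again with probability rho_RR.
        responds_curr = rng.random() < rho_RR
    else:
        # A nonrespondent at t-1 responds at t with probability 1 - rho_MM.
        responds_curr = rng.random() >= rho_MM
    return (i if responds_prev else None, j if responds_curr else None)


# Hypothetical parameter values for G = 2 categories.
eta = [0.6, 0.4]
p = [[0.9, 0.1], [0.2, 0.8]]
rng = random.Random(2017)
draws = [simulate_individual(eta, p, 0.8, 0.9, 0.5, rng) for _ in range(5)]
```

Only the pairs in `draws` would be available to the analyst; the underlying categories of nonrespondents are lost, which is what produces the row complement, the column complement and the doubly missing cell.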

Data are observed only after the second stage. The aim is to make inferences about the probabilities of the Markov chain process generating the data and also about those of the chain generating the nonresponse mechanism. In the context of this two-stage model, the corresponding probabilities are shown in Table 3.2.

Table 3.2
Gross flow probabilities at two consecutive times.
Time $t - 1$	Time $t$
Time $t - 1$	1	2	$\dots$	$j$	$\dots$	$G$	Row complement
1
2
$⋮$
$i$	${η_{i} p_{i j} ψ ρ_{R R}}$						${\sum_{j} η_{i} p_{i j} ψ (1 - ρ_{R R})}$
$⋮$
$G$
Column complement	${\sum_{i} η_{i} p_{i j} (1 - ψ) (1 - ρ_{M M})}$						${\sum_{i} \sum_{j} η_{i} p_{i j} (1 - ψ) ρ_{M M}}$
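
As a check on Table 3.2, the following sketch computes every cell probability for a given set of parameters and confirms that they add up to one. The function name `cell_probabilities` and the numerical values are illustrative assumptions.

```python
def cell_probabilities(eta, p, psi, rho_RR, rho_MM):
    """Return the probabilities of Table 3.2 as (both, row, col, miss)."""
    G = len(eta)
    # Interior cells: classified at both times.
    both = [[eta[i] * p[i][j] * psi * rho_RR for j in range(G)]
            for i in range(G)]
    # Row complement: responded at t-1 only, so the column index is summed out.
    row = [sum(eta[i] * p[i][j] * psi * (1 - rho_RR) for j in range(G))
           for i in range(G)]
    # Column complement: responded at t only, so the row index is summed out.
    col = [sum(eta[i] * p[i][j] * (1 - psi) * (1 - rho_MM) for i in range(G))
           for j in range(G)]
    # Corner cell: nonrespondent at both times.
    miss = sum(eta[i] * p[i][j] * (1 - psi) * rho_MM
               for i in range(G) for j in range(G))
    return both, row, col, miss


# Hypothetical parameter values for G = 2 categories.
eta = [0.6, 0.4]
p = [[0.9, 0.1], [0.2, 0.8]]
both, row, col, miss = cell_probabilities(eta, p, psi=0.8, rho_RR=0.9, rho_MM=0.5)
total = sum(map(sum, both)) + sum(row) + sum(col) + miss
```

Because $\sum_{j} p_{i j} = 1$ and $\sum_{i} η_{i} = 1$ , the row complement for category $i$ reduces to $η_{i} ψ (1 - ρ_{R R})$ and the corner cell reduces to $(1 - ψ) ρ_{M M}$ .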

In this way, the likelihood function for the observed data under this two-stage model is proportional to

$\begin{array}{l} \prod_{i} \prod_{j} {[ψ ρ_{R R} η_{i} p_{i j}]}^{N_{i j}} \times \prod_{i} {[\sum_{j} ψ (1 - ρ_{R R}) η_{i} p_{i j}]}^{R_{i}} \\ \times \prod_{j} {[\sum_{i} (1 - ψ) (1 - ρ_{M M}) η_{i} p_{i j}]}^{C_{j}} \times {[\sum_{i} \sum_{j} (1 - ψ) ρ_{M M} η_{i} p_{i j}]}^{M} . (3.1) \end{array}$
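
For numerical work it is usually more convenient to evaluate the logarithm of (3.1). The sketch below does so term by term, up to the additive constant implied by the proportionality. The function name `log_likelihood` and the example counts are assumptions for illustration; this is not the estimation procedure of the paper, only an evaluation of the expression above.

```python
import math


def log_likelihood(N, R, C, M, eta, p, psi, rho_RR, rho_MM):
    """Logarithm of expression (3.1), omitting the additive constant."""
    G = len(eta)
    ll = 0.0
    # Cells with zero counts contribute nothing and are skipped,
    # which also avoids evaluating log(0) for empty cells.
    for i in range(G):
        for j in range(G):
            if N[i][j]:
                ll += N[i][j] * math.log(psi * rho_RR * eta[i] * p[i][j])
    for i in range(G):
        if R[i]:
            prob = sum(psi * (1 - rho_RR) * eta[i] * p[i][j] for j in range(G))
            ll += R[i] * math.log(prob)
    for j in range(G):
        if C[j]:
            prob = sum((1 - psi) * (1 - rho_MM) * eta[i] * p[i][j]
                       for i in range(G))
            ll += C[j] * math.log(prob)
    if M:
        prob = sum((1 - psi) * rho_MM * eta[i] * p[i][j]
                   for i in range(G) for j in range(G))
        ll += M * math.log(prob)
    return ll


# Hypothetical counts and parameter values for G = 2 categories.
N = [[50, 5], [8, 30]]
R = [4, 3]
C = [6, 5]
M = 9
ll = log_likelihood(N, R, C, M, [0.6, 0.4], [[0.9, 0.1], [0.2, 0.8]],
                    psi=0.8, rho_RR=0.9, rho_MM=0.5)
```

A function of this form could be passed to a general-purpose optimizer, subject to the constraints $\sum_{i} η_{i} = 1$ and $\sum_{j} p_{i j} = 1$ for every $i$ .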

3.1 Parameters of interest

Data are observed only after the second stage, and the aim is to make inferences about both the probabilities of the Markov chain generating the data and those of the chain generating nonresponse. Under this two-stage model, the probabilities of the matrix of data are shown in Table 3.2 and they constitute some of the parameters of interest.

On the other hand, the non-observable process gives rise to other parameters of interest, as follows. Suppose a finite population $U$ exists, with a classification in two periods of time for all its individuals. This is a non-observable process because, even if census data were obtained, it would not be possible to have a complete classification, since not all individuals would be willing to respond. Considering this non-observable process and assuming that there are $G$ possible classifications at each time, the distribution of the gross flows at the population level is shown in Table 3.3.

$X_{i j}$ is the number of units in the finite population with classification $i$ at time $t - 1$ and classification $j$ at time $t$ ( $i, j = 1, \dots, G$ ). The population size, $N$ , must satisfy the expression:

$N = \sum_{i} \sum_{j} X_{i j} .$

Table 3.3
Population gross flows (non-observable process) at two consecutive periods of time.
Time $t - 1$	Time $t$
Time $t - 1$	1	2	$\dots$	$j$	$\dots$	$G$
1	$X_{11}$	$X_{12}$	$\dots$	$X_{1 j}$	$\dots$	$X_{1 G}$
2	$X_{21}$	$X_{22}$	$\dots$	$X_{2 j}$	$\dots$	$X_{2 G}$
$⋮$	$⋮$	$⋮$	$⋱$	$⋮$	$⋱$	$⋮$
$i$	$X_{i 1}$	$X_{i 2}$	$\dots$	$X_{i j}$	$\dots$	$X_{i G}$
$⋮$	$⋮$	$⋮$	$⋱$	$⋮$	$⋱$	$⋮$
$G$	$X_{G 1}$	$X_{G 2}$	$\dots$	$X_{G j}$	$\dots$	$X_{G G}$

Following the non-observable process from the last section, it is supposed that the vector of entries of the preceding contingency table follows a multinomial distribution with a probability vector containing the values $\{η_{i} p_{i j}\}_{i, j = 1, \dots, G}$ . This assumes a superpopulation model where the contingency table counts are considered random. In terms of notation, the probability measure governing these counts will be denoted with the subindex $ξ .$ Then, the probability of classification in cell $i j$ for the $k$ -th individual is

$\begin{array}{l} P_{ξ} (k \text{ has classification } i \text{ at } t - 1 \text{ and classification } j \text{ at } t) \\ = P_{ξ} (k \text{ has classification } i \text{ at } t - 1) \\ \times P_{ξ} (k \text{ has classification } j \text{ at } t \mid k \text{ has classification } i \text{ at } t - 1) \\ = η_{i} p_{i j} . \end{array}$

This treats $X_{i j}$ as a random variable and if the finite population has $N$ individuals, its expected value based on the model is given by

$E_{ξ} (X_{i j}) = N η_{i} p_{i j} = μ_{i j} . (3.2)$

Note that this expected value $μ_{i j}$ is one of the most important parameters to be estimated in this paper, as it corresponds to the expected value of the gross flows in the population of interest at the two consecutive periods. It is also important to understand that $μ_{i j}$ is a parameter of the two-stage model. Moreover, the estimators for $η_{i}$ and $p_{i j}$ are interdependent and are determined by the estimates of the parameters defined at the second stage. Let $η$ be the vector containing the parameters $η_{i}$ , and let $p$ be the vector containing the parameters $p_{i j}$ , for every $i, j = 1, \dots, G$ . The final parameters of interest are:

- the model parameters, determined by the vector

$θ = {(ψ^{'}, {ρ^{'}}_{R R}, {ρ^{'}}_{M M}, η^{'}, p^{'})}^{'};$

- the expected value vector of the population counts defined as

$μ = {(μ_{11}, \dots, μ_{i j}, \dots, μ_{G G})}^{'} .$
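
To make the two sets of parameters concrete, the following sketch computes the expected gross flows of (3.2) and stacks the model parameters into a single vector. The function name `expected_gross_flows`, the row-major ordering of the stacked vectors and all numerical values are assumptions for illustration only.

```python
def expected_gross_flows(N_pop, eta, p):
    """Return the matrix with entries mu_ij = N * eta_i * p_ij, as in (3.2)."""
    G = len(eta)
    return [[N_pop * eta[i] * p[i][j] for j in range(G)] for i in range(G)]


# Hypothetical values for a population of size N = 1000 and G = 2 categories.
N_pop = 1000
psi, rho_RR, rho_MM = 0.8, 0.9, 0.5
eta = [0.6, 0.4]
p = [[0.9, 0.1], [0.2, 0.8]]

mu = expected_gross_flows(N_pop, eta, p)
# Vector mu = (mu_11, ..., mu_GG)', stacked row by row.
mu_vec = [value for row in mu for value in row]
# Vector theta = (psi, rho_RR, rho_MM, eta', p')', with p stacked row by row.
theta = [psi, rho_RR, rho_MM] + eta + [value for row in p for value in row]
```

Since the probabilities $η_{i} p_{i j}$ add up to one over all cells, the entries of `mu_vec` add up to the population size $N$ .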

Date modified: 2017-09-20
