Combining link-tracing sampling and cluster sampling to estimate the size of a hidden population in presence of heterogeneous link-probabilities
7. Conclusions and suggestions for future research
The
results of the simulation studies carried out in this research indicate that
the two proposed estimators of
and
perform
reasonably well in different situations. This is evidence of their robustness to several
types of deviations from the assumed model. (Although
seems to be
sensitive to deviations from the assumed normal distribution of the
On the other
hand, the two proposed estimators of
and
present problems
of bias and especially problems of instability if the sampling fraction in
is not large
enough, say not larger than 50%. In addition, small sampling fractions
along with deviations from the assumed model for the link-probabilities
increase the risk of numerical convergence problems. The two proposed
estimators of
and
perform
similarly to the estimators
and
if
is much greater
than
(as in the case
of the artificial populations), perform similarly to the estimators
and
if
is much greater
than
(as in the case
of the Colorado Springs population), and perform as a combination of the
performance of the estimators of
and
if the values of
these parameters are not very different from each other. Finally, the
estimators derived under the assumption of homogeneous link-probabilities
present serious problems of bias if this assumption is not satisfied.
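The effect of ignoring heterogeneity can be illustrated with the classical two-occasion capture-recapture setting. This is only an analogy, not the estimators studied in this paper. The sketch below assumes hypothetical capture probabilities `probs` that are the same on both occasions and independent between occasions, and it plugs expected counts into the Lincoln-Petersen formula n1 * n2 / m. By the Cauchy-Schwarz inequality the resulting value cannot exceed the true population size, with equality only when all probabilities are equal.

```python
def lincoln_petersen_limit(probs):
    """Large-sample value of the Lincoln-Petersen estimator n1 * n2 / m.

    probs -- hypothetical capture probability of each population element,
             assumed equal on both occasions and independent between them
    """
    expected_captures = sum(probs)                  # E[n1] = E[n2]
    expected_recaptures = sum(p * p for p in probs)  # E[m]
    return expected_captures * expected_captures / expected_recaptures


# Two made-up populations of 1000 elements with the same mean probability 0.3.
homogeneous = [0.3] * 1000
heterogeneous = [0.1] * 500 + [0.5] * 500

limit_homogeneous = lincoln_petersen_limit(homogeneous)
limit_heterogeneous = lincoln_petersen_limit(heterogeneous)

print(f"homogeneous probabilities:   {limit_homogeneous:.1f}")
print(f"heterogeneous probabilities: {limit_heterogeneous:.1f}")
```

In this analogue the homogeneous-probability formula recovers the population size only when the probabilities really are equal, and falls well short of it otherwise.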
It
is worth noting that our conclusion about the proposed estimators of
is based on the
results of several small simulation studies that we carried out using sampling
fractions greater than those used in the Monte Carlo studies reported in this
paper. In one study carried out with the artificial populations, we increased
the values of the link-probabilities
so that their
average values were
and
and kept the
sizes of the initial samples at
These changes
yielded the sampling fractions
and
In another study
also with the artificial populations, we reduced the values of the
so that
and increased
the sizes of the initial samples to
in Populations
and to
in Population
IV. These changes yielded the sampling fractions
and
In both studies
the estimators
and
performed
acceptably well. (The results are not shown.) These outcomes indicate that
these estimators also appear to be robust to deviations from
the assumed models, provided that large sampling fractions are used.
However, in a study with the Colorado Springs
population using initial samples of sizes
which yielded
sampling fractions
and
the estimators
and
presented
serious problems of bias
and
which affected
the values of their
and mdare
Why did these
estimators not perform well even with large sampling fractions? We think
that the poor performance of these estimators is a consequence of the very small
average value of the
and the way the
Monte Carlo studies were carried out. To clarify this statement, note the
following. When
is very small,
say less than 0.02, the expected number of elements in
that are linked
to at least one site in the frame is much less than
For instance, in
Population I when
this expected
number was about
Therefore, if
the sampling fraction
is large enough,
the estimates
and
will be close to
and will be much
greater than the expected number of elements linked to at least one site in the
frame. Thus, if we supposed that the Colorado Springs population was generated
by a random process, then the 794 contacts linked to at least one site in the
frame, and which we used as the value of
would be a much
smaller value than the actual size of
Consequently,
the performance of
and
as estimators of
the assumed value 794 of
would be very
poor. This explanation was suggested and confirmed by the results of a small
simulation study in which we considered Population I (in which every one of the
assumptions is satisfied), but instead of carrying out the study as described in
Subsection 5.1, we generated the complete set of values
of
by sampling from
the Bernoulli distributions with means
and we kept them
fixed. Then, we defined the value of
as the number of
elements of
linked to at
least one site in the frame. We considered two cases: large values of
and small values
In the first
case
whereas in the
second one
To have
comparable results the sizes of the initial samples were set to
in the first
case and to
in the second
case, so that in both cases the number of sampled elements was about 220. The
results of the numerical study showed that in the case of large values of
the estimators
and
performed well
(because
whereas in the
case of small values of
these estimators
performed badly (because
We think that
the results obtained in the last case are illustrative and explain the ones
obtained in the Colorado Springs population.
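The gap between the size of a population and the expected number of its elements linked to at least one site can be checked with a short calculation. The sketch below is a hypothetical illustration, not the computation used in this paper: the function `expected_linked`, the population size, the number of sites and the link-probabilities are all made-up, and the link-probabilities are assumed to be identical across elements and independent across sites. Under these assumptions an element has no link with probability equal to the product of (1 - p) over the sites.

```python
def expected_linked(size, link_probs):
    """Expected number of elements linked to at least one site.

    size       -- number of elements in the (hypothetical) population
    link_probs -- probability that an element is linked to each site,
                  assumed identical across elements and independent across sites
    """
    prob_no_link = 1.0
    for p in link_probs:
        prob_no_link *= 1.0 - p
    return size * (1.0 - prob_no_link)


# Hypothetical values chosen only for illustration.
size = 1000
n_sites = 100

small = expected_linked(size, [0.01] * n_sites)
large = expected_linked(size, [0.05] * n_sites)

print(f"p = 0.01: about {small:.0f} of {size} elements are linked")
print(f"p = 0.05: about {large:.0f} of {size} elements are linked")
```

With these made-up inputs, a link-probability of 0.01 leaves a little over a third of the elements without any link to the frame, whereas 0.05 leaves fewer than 1% unlinked. This is the kind of gap described in the text for very small link-probabilities.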
With
respect to the two types of proposed CIs, profile likelihood and bootstrap CIs,
we can conclude that they need larger sample sizes than the point estimators to
perform reasonably well. They are more sensitive to deviations from the assumed
models than the point estimators. In addition, if small sampling fractions are
used and deviations from the assumed model for the
are present, numerical
convergence problems will occur more often than in the case
of the point estimators.
From the previous observations we can conclude that in
actual applications of this sampling methodology, a good strategy is to
construct a sampling frame that covers the largest possible portion of the
target population. This way,
would be close
to
and the
estimates of
could be used as
estimates of
The advantage of
this strategy is that the estimators
and
perform better
than the estimators
and
because the
first ones incorporate the information about the cluster sizes
Furthermore,
this strategy makes it possible to use the design-based estimator
as an estimator
of
The other factor
that must be taken into account to have good estimates is to use large sampling
fractions, say larger than 0.5. This suggestion is in agreement with the result
reported by Xi, Watson and Yip (2008), who in the context of capture-recapture
studies indicate that, in the presence of heterogeneous capture probabilities, a
population size between 300 and 500, and a number of sampling occasions between
10 and 20, the sampling fraction needed to obtain reliable estimates is at least
60%. Since the estimation of
is basically the
same problem as that of estimating the population size in a capture-recapture
study, we think that this conclusion also applies to our situation. In this
line, we have developed a method to determine the size of the initial sample in
order to have desired values of
and
Although this
procedure is based on stringent assumptions such as the homogeneity of the
effects
associated with
the sites and the necessity of large values of the
the results seem
to be satisfactory. For instance, the situation illustrated at the end of
Section 5 corresponds to that of the artificial populations considered in the
Monte Carlo study and we can see that the results obtained by our procedure are
very close to those reported for Populations I and II (Table 6.2), where the
estimators of
and
performed
acceptably.
Finally, despite the drawbacks of the proposed point
and interval estimators, they are a better alternative for making inferences
about the population size than those based on the assumption of homogeneous
link-probabilities. Obviously, our proposal needs to be improved. The two major
problems that need to be considered in future research are the instability of
the estimators of
when the
sampling fraction is not large enough and the unsatisfactory performance of
the confidence intervals. A possible solution to both problems is to use the
Bayesian approach to construct estimators that incorporate the information
that the researcher has about the parameters prior to sampling. The point and interval
estimators obtained by this approach might be more stable than those proposed
in this paper because of the additional information used to construct them.
Another possible solution to the problem of lack of robustness of the confidence
intervals is to replace the assumption of the Poisson distribution of the
by a more
flexible distribution such as the negative binomial, and the assumption of the
normal distribution of the effects
by one of the
distributions ordinarily used to increase the robustness of the estimators such
as a mixture of normal distributions or a Student’s t distribution.
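The extra flexibility of the negative binomial relative to the Poisson can be seen in a small simulation. The sketch below is illustrative only and is not part of the methodology of this paper: the helper functions, the seed, and the values `mean = 4.0` and `shape = 2.0` are hypothetical. It draws negative binomial counts as a gamma-mixed Poisson, for which the variance is mean + mean**2 / shape, whereas a Poisson variable has variance equal to its mean.

```python
import math
import random
import statistics


def poisson_draw(rng, mean):
    """Draw one Poisson variate by Knuth's multiplication method (small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def negative_binomial_draw(rng, mean, shape):
    """Draw one negative binomial variate as a gamma-mixed Poisson.

    The gamma-distributed rate has mean `mean` and variance mean**2 / shape.
    """
    rate = rng.gammavariate(shape, mean / shape)  # (shape, scale) parametrization
    return poisson_draw(rng, rate)


rng = random.Random(2024)             # fixed seed so the run is reproducible
mean, shape, draws = 4.0, 2.0, 20000  # hypothetical values

poisson_sample = [poisson_draw(rng, mean) for _ in range(draws)]
negbin_sample = [negative_binomial_draw(rng, mean, shape) for _ in range(draws)]

poisson_var = statistics.variance(poisson_sample)
negbin_var = statistics.variance(negbin_sample)

print(f"Poisson:           mean {statistics.mean(poisson_sample):.2f}, variance {poisson_var:.2f}")
print(f"negative binomial: mean {statistics.mean(negbin_sample):.2f}, variance {negbin_var:.2f}")
```

Both samples have roughly the same mean, but the negative binomial sample is markedly more dispersed. That additional dispersion parameter is what would let a model absorb counts that vary more than a Poisson assumption allows.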
Acknowledgements
This
research was supported by grant PROFAPI 2008/054 of the Universidad Autónoma de Sinaloa and grant APOY-COMPL-2008/89777 of
the Consejo Nacional de Ciencia y
Tecnología, Mexico. We thank John Potterat and Steve Muth for allowing us
to use the data from the Colorado Springs study. We also thank the reviewers
for their helpful suggestions and comments which improved this work.
References
Agresti, A. (2002). Categorical Data Analysis, Second
edition. New York: John Wiley & Sons, Inc.
Booth, J.G., Butler, R.W.
and Hall, P. (1994). Bootstrap methods for finite populations. Journal of the American Statistical
Association, 89, 1282-1289.
Cormack, R.M. (1992).
Interval estimation for mark-recapture studies of closed populations. Biometrics, 48, 567-576.
Coull, B.A., and Agresti,
A. (1999). The use of mixed logit models to reflect heterogeneity in
capture-recapture studies. Biometrics, 55, 294-301.
Dávid, B., and Snijders,
T.A.B. (2002). Estimating the size of the homeless population in Budapest,
Hungary. Quality & Quantity, 36,
291-303.
Davison, A.C., and
Hinkley, D.V. (1997). Bootstrap Methods
and their Applications. New York: Cambridge University Press.
Evans, M.A., Kim, H.-M.
and O’Brien, T.E. (1996). An application of profile-likelihood based confidence
interval to capture-recapture estimators. Journal
of Agricultural, Biological and Environmental Statistics, 1, 131-140.
Félix-Medina, M.H., and
Monjardin, P.E. (2006). Combining link-tracing sampling and cluster sampling to
estimate the size of hidden populations: A Bayesian-assisted approach. Survey Methodology, 32, 2, 187-195.
Félix-Medina, M.H., and
Monjardin, P.E. (2010). Combining link-tracing sampling and cluster sampling to
estimate totals and means of hidden human populations. Journal of Official Statistics, 26, 603-631.
Félix-Medina, M.H., and
Thompson, S.K. (2004). Combining cluster sampling and link-tracing sampling to
estimate the size of hidden populations. Journal
of Official Statistics, 20, 19-38.
Frank, O., and Snijders,
T.A.B. (1994). Estimating the size of hidden populations using snowball
sampling. Journal of Official Statistics, 10, 53-67.
Gimenes, O., Choquet, R.,
Lamor, R., Scofield, P., Fletcher, D., Lebreton, J.-D. and Pradel, R. (2005).
Efficient profile-likelihood confidence intervals for capture-recapture models. Journal of Agricultural, Biological and
Environmental Statistics, 10, 1-13.
Heckathorn, D.D. (2002).
Respondent-driven sampling II: deriving valid population estimates from
chain-referral samples of hidden populations. Social Problems, 49, 11-34.
Johnston, L.G., and
Sabin, K. (2010). Sampling hard-to-reach populations with respondent driven
sampling. Methodological Innovations
Online, 5, 2, 38-48.
Kalton, G. (2009).
Methods for oversampling rare populations in social surveys. Survey Methodology, 35, 2, 125-141.
Karon, J.M., and Wejnert,
C. (2012). Statistical methods for the analysis of time-location sampling data. Journal of Urban Health, 89, 565-586.
MacKellar, D., Valleroy,
L., Karon, J., Lemp, G. and Janssen, R. (1996). The young men’s survey: Methods
for estimating HIV sero-prevalence and risk factors among young men who have
sex with men. Public Health Reports, 111, supplement 1, 138-144.
Magnani, R., Sabin, K.,
Saidel, T. and Heckathorn, D. (2005). Review of sampling hard-to-reach
populations for HIV surveillance. AIDS, 19, S67-S72.
McKenzie, D.J., and
Mistiaen, J. (2009). Surveying migrant households: a comparison of
census-based, snowball and intercept point surveys. Journal of the Royal Statistical Society, Series A, 172, 339-360.
Muhib, F.B., Lin, L.S.,
Stueve, A., Miller, R.L., Ford, W.L., Johnson, W.D. and Smith, P. (2001). A
venue-based method for sampling hard-to-reach populations. Public Health Reports, 116, supplement 1, 216-222.
Pledger, S. (2000).
Unified maximum likelihood estimates for closed capture-recapture models using
mixtures. Biometrics, 56, 434-442.
Potterat, J.J.,
Woodhouse, D.E., Muth, S.Q., Rothenberg, R.B., Darrow, W.W., Klovdahl, A.S. and
Muth, J.B. (2004). Network dynamism: History and lessons of the Colorado
Springs study. In Network Epidemiology: A
Handbook for Survey Design and Data Collection, (Ed., M. Morris), New
York: Oxford University Press, 87-114.
Potterat,
J.J., Woodhouse, D.E., Rothenberg, R.B., Muth, S.Q., Darrow, W.W., Muth, J.B.
and Reynolds, J.U. (1993). AIDS in Colorado Springs: Is there an epidemic? AIDS, 7, 1517-1521.
R Development Core Team
(2013). R: A Language and Environment for
Statistical Computing. Vienna, Austria: R Foundation for Statistical
Computing.
Ratkowsky, D.A. (1988). Handbook of Nonlinear Regression Models.
New York: Marcel Dekker.
Rothenberg, R.B.,
Woodhouse, D.E., Potterat, J.J., Muth, S.Q., Darrow, W.W. and Klovdahl, A.S.
(1995). Social networks in disease transmission: The Colorado Springs study. In Social Networks, Drug Abuse, and HIV
Transmission, (Eds., R.H. Needle, S.G. Genser and R.T. II Trotter)
NIDA Research Monograph 151. Rockville, MD: National Institute of Drug Abuse,
3-19.
Sanathanan, L. (1972).
Estimating the size of a multinomial population. Annals of Mathematical Statistics, 43, 142-152.
Semaan, S. (2010).
Time-space sampling and respondent-driven sampling with hard-to-reach
populations. Methodological Innovations
Online, 5, 2, 60-75.
Spreen, M. (1992). Rare
populations, hidden populations, and link-tracing designs: What and why? Bulletin de Méthodologie Sociologique, 36, 34-58.
Staudte, R.G., and Sheather, S.J. (1990). Robust Estimation and Testing. New York: John Wiley & Sons, Inc.
Thompson, S.K., and
Frank, O. (2000). Model-based estimation with link-tracing sampling designs. Survey Methodology, 26, 1, 87-98.
Venzon, D.J., and
Moolgavkar, S.H. (1988). A method for computing profile-likelihood-based
confidence intervals. Applied Statistics, 37, 87-94.
Williams, B.K., Nichols,
J.D. and Conroy, M.J. (2002). Analysis
and Management of Animal Populations. San Diego, California: Academic
Press.
Xi, L., Watson, R. and
Yip, P.S.F. (2008). The minimum capture proportion for reliable estimation in
capture-recapture models. Biometrics, 64, 242-249.