Combining link-tracing sampling and cluster sampling to estimate the size of a hidden population in presence of heterogeneous link-probabilities

7. Conclusions and suggestions for future research

The results of the simulation studies carried out in this research indicate that the two proposed estimators of $τ_{1}$, ${\hat{τ}}_{1}^{(U)}$ and ${\hat{τ}}_{1}^{(C)}$, perform reasonably well in different situations, which is evidence of their robustness to several types of deviations from the assumed model (although ${\hat{τ}}_{1}^{(C)}$ seems to be sensitive to deviations from the assumed normal distribution of the $β_{j}^{(k)}$’s). On the other hand, the two proposed estimators of $τ_{2}$, ${\hat{τ}}_{2}^{(U)}$ and ${\hat{τ}}_{2}^{(C)}$, present problems of bias and, especially, of instability if the sampling fraction in $U_{2}$ is not large enough, say not larger than 50%. In addition, small sampling fractions along with deviations from the assumed model for the link-probabilities increase the risk of numerical convergence problems. The two proposed estimators of $τ$, ${\hat{τ}}^{(U)}$ and ${\hat{τ}}^{(C)}$, perform similarly to ${\hat{τ}}_{1}^{(U)}$ and ${\hat{τ}}_{1}^{(C)}$ if $τ_{1}$ is much greater than $τ_{2}$ (as in the case of the artificial populations), similarly to ${\hat{τ}}_{2}^{(U)}$ and ${\hat{τ}}_{2}^{(C)}$ if $τ_{2}$ is much greater than $τ_{1}$ (as in the case of the Colorado Springs population), and as a combination of the estimators of $τ_{1}$ and $τ_{2}$ if the values of these parameters are not very different from each other. Finally, the estimators derived under the assumption of homogeneous link-probabilities present serious problems of bias if this assumption is not satisfied.

It is worth noting that our conclusion about the proposed estimators of $τ_{2}$ is based on the results of several small simulation studies that we carried out using sampling fractions greater than those used in the Monte Carlo studies reported in this paper. In one study carried out with the artificial populations, we increased the values of the link-probabilities $p_{i j}^{(k)}$ so that their average values were ${\bar{p}}_{i j}^{(1)} \approx 0.088$ and ${\bar{p}}_{i j}^{(2)} \approx 0.071$, and kept the sizes of the initial samples at $n = 15$. These changes yielded the sampling fractions $f_{1} \approx 0.65$ and $f_{2} \approx 0.55$. In another study, also with the artificial populations, we reduced the values of the $p_{i j}^{(k)}$ so that ${\bar{p}}_{i j}^{(1)} = {\bar{p}}_{i j}^{(2)} \approx 0.016$ and increased the sizes of the initial samples to $n = 78$ in Populations I-III and to $n = 67$ in Population IV. These changes yielded the sampling fractions $f_{1} \approx 0.78$ and $f_{2} \approx 0.55$. In both studies the estimators ${\hat{τ}}_{2}^{(U)}$ and ${\hat{τ}}_{2}^{(C)}$ performed acceptably well (results not shown). These outcomes indicate that these estimators also seem to be robust to deviations from the assumed models, provided that large sampling fractions are used.
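As a rough check on how $n$ and the link-probabilities drive the sampling fraction in $U_{2}$, the sketch below assumes, purely for illustration, that every element of $U_{2}$ is linked to each sampled site independently with one common probability $\bar{p}$. It also reads $f_{2}$ as the proportion of elements linked to at least one of the $n$ sampled sites, which under that assumption is $1 - (1 - \bar{p})^{n}$. The function name `homogeneous_fraction` is hypothetical, and homogeneity is exactly the assumption this paper relaxes, so the output is only a reference value.

```python
def homogeneous_fraction(p_bar: float, n: int) -> float:
    """Probability that an element is linked to at least one of n sampled
    sites when every link has the same probability p_bar (an assumption)."""
    return 1.0 - (1.0 - p_bar) ** n


# Average link-probabilities and initial sample sizes of the two
# additional studies described above (values for U_2).
for p_bar, n in [(0.071, 15), (0.016, 78)]:
    print(f"p_bar={p_bar}, n={n}: {homogeneous_fraction(p_bar, n):.2f}")
```

The homogeneous reference values (about 0.67 and 0.72) differ from the reported $f_{2} \approx 0.55$. A gap is to be expected, since the link-probabilities in the studies are heterogeneous, so the formula should be read only as a crude planning guide.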

However, in a study with the Colorado Springs population using initial samples of size $n = 42$, which yielded sampling fractions $f_{1} \approx 0.64$ and $f_{2} \approx 0.56$, the estimators ${\hat{τ}}_{2}^{(U)}$ and ${\hat{τ}}_{2}^{(C)}$ presented serious problems of bias (r-bias $\approx 1.0$ and mdre $\approx 0.85$), which affected the values of their r-mse ($\sqrt{\text{r-mse}} \approx 1.0$) and mdare (mdare $\approx 0.85$). Why did these estimators not perform well even with large sampling fractions? We think that their bad performance is a consequence of the very small average value of the $p_{i j}^{(2)}$’s (${\bar{p}}_{i j}^{(2)} \approx 0.018$) and of the way the Monte Carlo studies were carried out. To clarify this statement, note the following. When ${\bar{p}}_{i j}^{(2)}$ is very small, say less than 0.02, the expected number of elements in $U_{2}$ that are linked to at least one site in the frame is much less than $τ_{2}$. For instance, in Population I when ${\bar{p}}_{i j}^{(2)} = 0.015$, this expected number was about $300 < 400 = τ_{2}$. Therefore, if the sampling fraction $f_{2}$ is large enough, the estimates ${\hat{τ}}_{2}^{(U)}$ and ${\hat{τ}}_{2}^{(C)}$ will be close to $τ_{2}$ and will be much greater than the expected number of elements linked to at least one site in the frame. Thus, if we suppose that the Colorado Springs population was generated by a random process, then the 794 contacts linked to at least one site in the frame, which we used as the value of $τ_{2}$, would be much smaller than the actual size of $U_{2}$. Consequently, the performance of ${\hat{τ}}_{2}^{(U)}$ and ${\hat{τ}}_{2}^{(C)}$ as estimators of the assumed value 794 of $τ_{2}$ would be very bad.

This explanation was suggested and confirmed by the results of a small simulation study in which we considered Population I (in which every one of the assumptions is satisfied). Instead of carrying out the study as described in Subsection 5.1, we generated the complete set of values $x_{i j}^{(2)}$ of $X_{i j}^{(2)}$ by sampling from the Bernoulli distributions with means $p_{i j}^{(2)}$, $i = 1, \dots, N$; $j = 1, \dots, 400$, and kept them fixed. Then, we defined the value of $τ_{2}$ as the number of elements of $U_{2}$ linked to at least one site in the frame. We considered two cases: large values of $p_{i j}^{(2)}$ (${\bar{p}}_{i j}^{(2)} = 0.071$) and small values (${\bar{p}}_{i j}^{(2)} = 0.015$). In the first case $τ_{2} = 388$, whereas in the second one $τ_{2} = 300$. To have comparable results, the sizes of the initial samples were set to $n = 15$ in the first case and to $n = 78$ in the second case, so that in both cases the number of sampled elements was about 220. The results of the numerical study showed that in the case of large values of $p_{i j}^{(2)}$ the estimators ${\hat{τ}}_{2}^{(U)}$ and ${\hat{τ}}_{2}^{(C)}$ performed well (because $388 \approx 400$), whereas in the case of small values of $p_{i j}^{(2)}$ these estimators performed badly (because $300 \ll 400$). We think that the results obtained in the last case are illustrative and explain the ones obtained in the Colorado Springs population.
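The mechanism behind this explanation can be sketched in a few lines: draw the link indicators once from Bernoulli distributions, keep them fixed, and count how many elements are linked to at least one site. The code below is a minimal illustration, not the simulation program used in the paper. The function names and the toy setting (50 sites and a single common link-probability) are assumptions made here for the example, and only the 400 elements and the value 0.015 echo the text.

```python
import random


def draw_links(p, seed=0):
    """Draw x[i][j] ~ Bernoulli(p[i][j]) once; the result is then kept fixed."""
    rng = random.Random(seed)
    return [[1 if rng.random() < pij else 0 for pij in row] for row in p]


def linked_count(x):
    """Number of elements (columns) linked to at least one site (row)."""
    n_elements = len(x[0])
    return sum(1 for j in range(n_elements) if any(row[j] for row in x))


def expected_linked(p):
    """Expected value of linked_count over repeated draws of the links."""
    total = 0.0
    for j in range(len(p[0])):
        prob_unlinked = 1.0
        for row in p:
            prob_unlinked *= 1.0 - row[j]  # element j missed by this site
        total += 1.0 - prob_unlinked
    return total


# Hypothetical toy setting: 50 sites, 400 elements, one common probability.
n_sites, n_elements, p_common = 50, 400, 0.015
p = [[p_common] * n_elements for _ in range(n_sites)]
x = draw_links(p, seed=1)
print("realised number linked:", linked_count(x))
print("expected number linked:", round(expected_linked(p), 1))
```

When the link-probabilities are small, both counts fall well below the nominal 400 elements, which is the gap between the "linked" population and the full $U_{2}$ that the paragraph above describes.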

With respect to the two types of proposed CIs, profile likelihood and bootstrap, we can conclude that they need larger sample sizes than the point estimators to perform reasonably well, and that they are more sensitive than the point estimators to deviations from the assumed models. In addition, if small sampling fractions are used and deviations from the assumed model for the $p_{i j}^{(k)}$ are present, numerical convergence problems will occur more often than in the case of the point estimators.

From the previous observations we can conclude that in actual applications of this sampling methodology, a good strategy is to construct a sampling frame that covers the largest possible portion of the target population. This way, $τ_{1}$ would be close to $τ$, and the estimates of $τ_{1}$ could be used as estimates of $τ$. The advantage of this strategy is that the estimators ${\hat{τ}}_{1}^{(U)}$ and ${\hat{τ}}_{1}^{(C)}$ perform better than the estimators ${\hat{τ}}_{2}^{(U)}$ and ${\hat{τ}}_{2}^{(C)}$ because the former incorporate the information about the cluster sizes $m_{i}$. Furthermore, this strategy makes it possible to use the design-based estimator $N \sum_{i=1}^{n} m_{i} / n$ as an estimator of $τ$. The other factor that must be taken into account to obtain good estimates is the use of large sampling fractions, say larger than 0.5. This suggestion is in agreement with the result reported by Xi, Watson and Yip (2008), who, in the context of capture-recapture studies, indicate that in the presence of heterogeneous capture probabilities, a population size between 300 and 500, and a number of sampling occasions between 10 and 20, the minimum sampling fraction needed to have reliable estimates is at least 60%. Since the estimation of $τ_{2}$ is basically the same problem as that of estimating the population size in a capture-recapture study, we think that this conclusion also applies to our situation.

In this line, we have developed a method to determine the size of the initial sample in order to have desired values of $\sqrt{V ({\hat{τ}}_{k}) / τ_{k}^{2}}$, $k = 1, 2$, and $\sqrt{V (\hat{τ}) / τ^{2}}$. Although this procedure is based on stringent assumptions, such as the homogeneity of the effects $α_{i}^{(k)}$ associated with the sites and the need for large values of the $m_{i}$’s, the results seem to be satisfactory. For instance, the situation illustrated at the end of Section 5 corresponds to that of the artificial populations considered in the Monte Carlo study, and we can see that the results obtained by our procedure are very close to those reported for Populations I and II (Table 6.2), where the estimators of $τ_{2}$ and $τ$ performed acceptably.
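The design-based estimator $N \sum_{i=1}^{n} m_{i} / n$ mentioned above is simple enough to sketch directly. The snippet assumes that the $n$ sampled sites are a simple random sample without replacement from the $N$ sites in the frame, and it uses the textbook variance estimator for an expansion estimator under that design. That variance formula, the function names and the data values are illustrative assumptions, not the sample-size procedure developed in the paper.

```python
import math


def design_based_total(m, N):
    """Expansion estimator N * (sum of m_i) / n from the n sampled sites."""
    n = len(m)
    return N * sum(m) / n


def relative_standard_error(m, N):
    """Estimated sqrt(V(tau_hat)) / tau_hat, assuming simple random
    sampling of sites without replacement (an assumption of this sketch)."""
    n = len(m)
    mean = sum(m) / n
    s2 = sum((mi - mean) ** 2 for mi in m) / (n - 1)  # sample variance
    var_hat = N ** 2 * (1.0 - n / N) * s2 / n         # with fpc (1 - n/N)
    return math.sqrt(var_hat) / (N * mean)


# Hypothetical cluster sizes observed in n = 5 of N = 150 sites.
m_sampled = [12, 7, 15, 9, 11]
print("estimate:", design_based_total(m_sampled, N=150))
print("relative s.e.:", round(relative_standard_error(m_sampled, N=150), 3))
```

Because the relative standard error shrinks roughly like $1 / \sqrt{n}$, a calculation of this kind gives a quick sense of how large the initial sample must be before the design-based estimate becomes usable.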

Finally, despite the drawbacks of the proposed point and interval estimators, they are a better alternative for making inferences about the population size than those based on the assumption of homogeneous link-probabilities. Obviously, our proposal needs to be improved. The two major problems that need to be considered in future research are the instability of the estimators of $τ_{2}$ when the sampling fraction is not large enough and the unsatisfactory performance of the confidence intervals. A possible solution to both problems is to use the Bayesian approach to construct estimators that incorporate the prior information that the researcher has about the parameters. The point and interval estimators obtained by this approach might be more stable than those proposed in this paper because of the additional information used to construct them. Another possible solution to the lack of robustness of the confidence intervals is to replace the assumed Poisson distribution of the $M_{i}$’s by a more flexible distribution, such as the negative binomial, and the assumed normal distribution of the effects $β_{j}^{(k)}$ by one of the distributions ordinarily used to increase the robustness of estimators, such as a mixture of normal distributions or a Student’s t distribution.
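To see why the negative binomial distribution is more flexible than the Poisson for the $M_{i}$’s, it may help to compare their variances. Under one common parameterization, with mean $μ$ and size parameter $r > 0$ (both symbols are introduced here only for illustration and are not notation from the paper):

$$
\text{Poisson: } \operatorname{Var}(M_{i}) = μ, \qquad \text{negative binomial: } \operatorname{Var}(M_{i}) = μ + \frac{μ^{2}}{r}.
$$

The extra term $μ^{2} / r$ lets the cluster sizes be more variable than a Poisson model with the same mean would allow, and the Poisson case is recovered in the limit $r \to \infty$.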

Acknowledgements

This research was supported by grant PROFAPI 2008/054 of the Universidad Autónoma de Sinaloa and grant APOY-COMPL-2008/89777 of the Consejo Nacional de Ciencia y Tecnología, Mexico. We thank John Potterat and Steve Muth for allowing us to use the data from the Colorado Springs study. We also thank the reviewers for their helpful suggestions and comments which improved this work.

References

Agresti, A. (2002). Categorical Data Analysis, Second edition. New York: John Wiley & Sons, Inc.

Booth, J.G., Butler, R.W. and Hall, P. (1994). Bootstrap methods for finite populations. Journal of the American Statistical Association, 89, 1282-1289.

Cormack, R.M. (1992). Interval estimation for mark-recapture studies of closed populations. Biometrics, 48, 567-576.

Coull, B.A., and Agresti, A. (1999). The use of mixed logit models to reflect heterogeneity in capture-recapture studies. Biometrics, 55, 294-301.

Dávid, B., and Snijders, T.A.B. (2002). Estimating the size of the homeless population in Budapest, Hungary. Quality & Quantity, 36, 291-303.

Davison, A.C., and Hinkley, D.V. (1997). Bootstrap Methods and their Application. New York: Cambridge University Press.

Evans, M.A., Kim, H.-M. and O’Brien, T.E. (1996). An application of profile-likelihood based confidence interval to capture-recapture estimators. Journal of Agricultural, Biological and Environmental Statistics, 1, 131-140.

Félix-Medina, M.H., and Monjardin, P.E. (2006). Combining link-tracing sampling and cluster sampling to estimate the size of hidden populations: A Bayesian-assisted approach. Survey Methodology, 32, 2, 187-195.

Félix-Medina, M.H., and Monjardin, P.E. (2010). Combining link-tracing sampling and cluster sampling to estimate totals and means of hidden human populations. Journal of Official Statistics, 26, 603-631.

Félix-Medina, M.H., and Thompson, S.K. (2004). Combining cluster sampling and link-tracing sampling to estimate the size of hidden populations. Journal of Official Statistics, 20, 19-38.

Frank, O., and Snijders, T.A.B. (1994). Estimating the size of hidden populations using snowball sampling. Journal of Official Statistics, 10, 53-67.

Gimenez, O., Choquet, R., Lamor, R., Scofield, P., Fletcher, D., Lebreton, J.-D. and Pradel, R. (2005). Efficient profile-likelihood confidence intervals for capture-recapture models. Journal of Agricultural, Biological and Environmental Statistics, 10, 1-13.

Heckathorn, D.D. (2002). Respondent-driven sampling II: deriving valid population estimates from chain-referral samples of hidden populations. Social Problems, 49, 11-34.

Johnston, L.G., and Sabin, K. (2010). Sampling hard-to-reach populations with respondent driven sampling. Methodological Innovations Online, 5, 2, 38-48.

Kalton, G. (2009). Methods for oversampling rare populations in social surveys. Survey Methodology, 35, 2, 125-141.

Karon, J.M., and Wejnert, C. (2012). Statistical methods for the analysis of time-location sampling data. Journal of Urban Health, 89, 565-586.

MacKellar, D., Valleroy, L., Karon, J., Lemp, G. and Janssen, R. (1996). The young men’s survey: Methods for estimating HIV sero-prevalence and risk factors among young men who have sex with men. Public Health Reports, 111, supplement 1, 138-144.

Magnani, R., Sabin, K., Saidel, T. and Heckathorn, D. (2005). Review of sampling hard-to-reach populations for HIV surveillance. AIDS, 19, S67-S72.

McKenzie, D.J., and Mistiaen, J. (2009). Surveying migrant households: a comparison of census-based, snowball and intercept point surveys. Journal of the Royal Statistical Society, Series A, 172, 339-360.

Muhib, F.B., Lin, L.S., Stueve, A., Miller, R.L., Ford, W.L., Johnson, W.D. and Smith, P. (2001). A venue-based method for sampling hard-to-reach populations. Public Health Reports, 116, supplement 1, 216-222.

Pledger, S. (2000). Unified maximum likelihood estimates for closed capture-recapture models using mixtures. Biometrics, 56, 434-442.

Potterat, J.J., Woodhouse, D.E., Muth, S.Q., Rothenberg, R.B., Darrow, W.W., Klovdahl, A.S. and Muth, J.B. (2004). Network dynamism: History and lessons of the Colorado Springs study. In Network Epidemiology: A Handbook for Survey Design and Data Collection, (Ed., M. Morris), New York: Oxford University Press, 87-114.

Potterat, J.J., Woodhouse, D.E., Rothenberg, R.B., Muth, S.Q., Darrow, W.W., Muth, J.B. and Reynolds, J.U. (1993). AIDS in Colorado Springs: Is there an epidemic? AIDS, 7, 1517-1521.

R Development Core Team (2013). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Ratkowsky, D.A. (1988). Handbook of Nonlinear Regression Models. New York: Marcel Dekker.

Rothenberg, R.B., Woodhouse, D.E., Potterat, J.J., Muth, S.Q., Darrow, W.W. and Klovdahl, A.S. (1995). Social networks in disease transmission: The Colorado Springs study. In Social Networks, Drug Abuse, and HIV Transmission, (Eds., R.H. Needle, S.G. Genser and R.T. Trotter II), NIDA Research Monograph 151. Rockville, MD: National Institute on Drug Abuse, 3-19.

Sanathanan, L. (1972). Estimating the size of a multinomial population. Annals of Mathematical Statistics, 43, 142-152.

Semaan, S. (2010). Time-space sampling and respondent-driven sampling with hard-to-reach populations. Methodological Innovations Online, 5, 2, 60-75.

Spreen, M. (1992). Rare populations, hidden populations, and link-tracing designs: What and why? Bulletin de Méthodologie Sociologique, 36, 34-58.

Staudte, R.G., and Sheather, S.J. (1990). Robust Estimation and Testing. New York: John Wiley & Sons, Inc.

Thompson, S.K., and Frank, O. (2000). Model-based estimation with link-tracing sampling designs. Survey Methodology, 26, 1, 87-98.

Venzon, D.J., and Moolgavkar, S.H. (1988). A method for computing profile-likelihood-based confidence intervals. Applied Statistics, 37, 87-94.

Williams, B.K., Nichols, J.D. and Conroy, M.J. (2002). Analysis and Management of Animal Populations. San Diego, California: Academic Press.

Xi, L., Watson, R. and Yip, P.S.F. (2008). The minimum capture proportion for reliable estimation in capture-recapture models. Biometrics, 64, 242-249.

Copyright

Published by authority of the Minister responsible for Statistics Canada.

Catalogue no. 12-001-X

Frequency: semi-annual

Ottawa

Date modified: 2017-09-20
