Combining link-tracing sampling and cluster sampling to estimate the size of a hidden population in presence of heterogeneous link-probabilities

1. Introduction

Conventional sampling methods are not appropriate for sampling hidden or hard-to-reach human populations, such as drug users, sex workers and homeless people, because of the lack of suitable sampling frames. For this reason, several specific sampling methods for this type of population have been proposed. See Magnani, Sabin, Saidel and Heckathorn (2005) and Kalton (2009) for reviews of some of them. According to Heckathorn (2002), two types of sampling methods for hidden populations are the most commonly used in actual studies. One is location sampling, also known as time-and-space sampling, aggregation point sampling or intercept point sampling. The other is snowball sampling, also known as link-tracing sampling (LTS) or chain referral sampling.

In location sampling a frame of primary units is constructed. The primary units are combinations of places and time segments where the elements of the population tend to gather. The frame is not assumed to cover the whole population. A probability sample of primary units is selected, and from each sampled unit a sort of systematic sample of elements is drawn. Although design-based estimators of different parameters can be constructed, the main drawback of location sampling is that inferences are valid only for the part of the population covered by the frame. For reviews of this method see MacKellar, Valleroy, Karon, Lemp and Janssen (1996), Muhib, Lin, Stueve, Miller, Ford, Johnson and Smith (2001), McKenzie and Mistiaen (2009), Semaan (2010) and Karon and Wejnert (2012). Location sampling has been used in the Young Men’s Survey to estimate HIV seroprevalence in young men who have sex with men (see MacKellar et al. 1996). In this study the primary units were venues attended by young men, such as dance clubs, bars and street locations.

In LTS an initial sample of members of the population is selected, and the sample size is increased by asking the sampled people to name or refer other members of the population to be included in the sample. The named people who are not in the initial sample might be asked to refer other persons, and the process might continue in this way until a specified stopping rule is satisfied. For reviews of several variants of LTS see Spreen (1992), Thompson and Frank (2000) and Johnston and Sabin (2010). LTS was used in the Colorado Springs study on heterosexual transmission of HIV/AIDS (see Potterat, Woodhouse, Rothenberg, Muth, Darrow, Muth and Reynolds 1993; Rothenberg, Woodhouse, Potterat, Muth, Darrow and Klovdahl 1995; and Potterat, Woodhouse, Muth, Rothenberg, Darrow, Klovdahl and Muth 2004). In this research an initial nonprobabilistic sample of people presumably at high risk of acquiring and transmitting HIV was obtained, and its members were asked for a complete enumeration of their personal contacts, who were also included in the sample.

Frank and Snijders (1994) proposed a variant of LTS that allows the sampler to estimate the population size. In their variant they assume an initial Bernoulli sample, that is, that every element of the population has the same probability of being included in the sample and that the inclusions are independent. In addition, they assume that the probability that person $i$ in the initial sample refers person $j$ in the population, which we will call the link-probability, is a constant, and that the referrals are independent. We will call the first of these two additional assumptions the assumption of homogeneity of the link-probabilities. Based on these assumptions, the authors derive several estimators of the population size. They indicate that their method yielded reasonable estimates of the number of heroin users in Groningen. However, Dávid and Snijders (2002) reported an underestimate of the number of homeless people in Budapest using this method. They indicate that the underestimation might be caused by deviations from the assumption of an initial Bernoulli sample.
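To make the two assumptions concrete, the following minimal sketch simulates only the sampling mechanism just described: a Bernoulli initial sample followed by independent referrals with a constant link-probability. It does not implement any of the estimators of Frank and Snijders (1994). The function name, the parameter names and the numerical values are hypothetical choices made for illustration.

```python
import random


def bernoulli_link_tracing(n_pop, incl_prob, link_prob, seed=0):
    """Simulate one wave of link-tracing from a Bernoulli initial sample.

    Illustrative assumptions (not taken from the paper):
    n_pop is the population size, incl_prob the common inclusion
    probability, link_prob the constant link-probability.
    """
    rng = random.Random(seed)

    # Bernoulli sample: each person is included independently
    # with the same probability.
    initial = [i for i in range(n_pop) if rng.random() < incl_prob]

    # Homogeneous links: every sampled person i refers every other
    # person j independently with the same probability.
    referred = set()
    for i in initial:
        for j in range(n_pop):
            if j != i and rng.random() < link_prob:
                referred.add(j)

    # People who were named but are not in the initial sample.
    newly_found = referred - set(initial)
    return initial, newly_found


initial, newly_found = bernoulli_link_tracing(200, 0.10, 0.02, seed=1)
```

In an actual survey the sampler observes only the initial sample and the referrals, not `n_pop`; the simulation fixes `n_pop` so that the behaviour of an estimator could be checked against a known truth.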

The difficulty of satisfying the assumption of an initial Bernoulli sample in actual applications of LTS motivated Félix-Medina and Thompson (2004) to develop a variant of LTS in which the initial sample is selected by a probabilistic design. To do this they assume, as in location sampling, that the sampler can construct a sampling frame of sites or venues where the members of the population tend to gather, such as bars, parks and blocks. The frame is not assumed to cover the whole population, but only a portion of it. Then, a simple random sample without replacement (SRSWOR) of sites is selected and the members of the population who belong to the sampled sites are identified. Finally, as in ordinary LTS, the people in the initial sample are asked to name other members of the population.
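The design just described can be sketched as a small simulation. This is an illustration under stated assumptions, not the authors' implementation: the referral step is modelled by a single link-probability per sampled site, and all names (`site_sizes`, `n_outside`, `link_probs`) and values are hypothetical.

```python
import random


def site_based_link_tracing(site_sizes, n_outside, n_sampled_sites,
                            link_probs, seed=0):
    """Simulate an SRSWOR of sites followed by one wave of referrals.

    Illustrative assumptions (not taken from the paper):
    site_sizes[s] is the number of people at frame site s,
    n_outside is the number of people not covered by the frame,
    link_probs[s] is the probability that site s names a given person.
    """
    rng = random.Random(seed)

    # Give consecutive ids to the people at each site, then to the
    # people outside the frame.
    members, next_id = [], 0
    for size in site_sizes:
        members.append(list(range(next_id, next_id + size)))
        next_id += size
    population = range(next_id + n_outside)

    # Simple random sample of sites without replacement.
    sampled_sites = rng.sample(range(len(site_sizes)), n_sampled_sites)

    # The initial sample is everyone who belongs to a sampled site.
    initial = {p for s in sampled_sites for p in members[s]}

    # Each sampled site names each person outside that site
    # independently with the site's link-probability.
    named = set()
    for s in sampled_sites:
        own = set(members[s])
        for p in population:
            if p not in own and rng.random() < link_probs[s]:
                named.add(p)

    return sampled_sites, initial, named - initial


sites, initial, newly_found = site_based_link_tracing(
    site_sizes=[3, 5, 2, 4], n_outside=10, n_sampled_sites=2,
    link_probs=[0.3, 0.3, 0.3, 0.3], seed=2)
```

Because the sites are drawn by SRSWOR from a known frame, the inclusion probability of each site is known by design, which is what distinguishes this variant from one that relies on an assumed Bernoulli initial sample.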

These authors propose maximum likelihood estimators (MLEs) of the population size derived under a probability model that describes the numbers of people found in the sampled sites and a model that assumes that the link-probabilities between the elements of the population and the sampled sites are homogeneous, that is, that they depend on the sampled sites, but not on the potentially named people. Later, Félix-Medina and Monjardin (2006) consider this same variant of LTS and propose estimators of the population size derived also under the assumption of homogeneity, but using a Bayesian-assisted approach, that is, the functional forms of the estimators are obtained using the Bayesian approach, but inferences are made under the frequentist approach.

Although the variant of LTS proposed by Félix-Medina and Thompson (2004) has not been used in any actual study, we would expect estimators of the population size derived under the assumption of homogeneity to underestimate the population size when this assumption is not satisfied, as occurs in capture-recapture studies. We expect this because these estimators resemble those used in that field.
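The capture-recapture analogy can be illustrated with a short simulation. The sketch below uses the classical two-occasion Lincoln-Petersen estimator, $n_1 n_2 / m$, which is a stand-in chosen for illustration and is not one of the estimators discussed in this paper. The population size and the capture probabilities are hypothetical values.

```python
import random


def lincoln_petersen_mean(capture_probs, n_reps=200, seed=0):
    """Average Lincoln-Petersen estimate over repeated simulated studies.

    capture_probs[k] is the probability (an illustrative assumption)
    that person k is captured on each of two independent occasions.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        first = [rng.random() < p for p in capture_probs]
        second = [rng.random() < p for p in capture_probs]
        n1, n2 = sum(first), sum(second)
        # Number of people captured on both occasions.
        m = sum(a and b for a, b in zip(first, second))
        if m > 0:  # the estimator is undefined without recaptures
            estimates.append(n1 * n2 / m)
    return sum(estimates) / len(estimates)


# Both populations have 1000 people and an average capture
# probability of 0.3; only the second one is heterogeneous.
homogeneous = [0.3] * 1000
heterogeneous = [0.1] * 500 + [0.5] * 500

mean_homogeneous = lincoln_petersen_mean(homogeneous)
mean_heterogeneous = lincoln_petersen_mean(heterogeneous)
```

Under heterogeneity the people who are easy to capture are over-represented among the recaptures, so `m` is larger than the homogeneous model predicts and the estimate falls below the true size of 1000, while the homogeneous case stays close to it.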

In this paper, we extend the work by Félix-Medina and Thompson (2004) to the case in which the link-probabilities depend on the named people, that is, we assume heterogeneous link-probabilities. The structure of the paper is as follows. In Section 2 we introduce the LTS variant proposed by Félix-Medina and Thompson (2004). In Section 3 we present a model for the link-probabilities that takes into account their heterogeneity and derive unconditional and conditional MLEs of the population size. In Section 4 we construct profile likelihood and bootstrap confidence intervals for the population size. In Section 5 we present a procedure for determining the size of the initial sample in order to achieve a specified value of the relative error of the estimation. In Section 6 we describe the results of two simulation studies, and finally, in Section 7 we present some conclusions and suggestions for future research.
