{
"responseHeader":{
"status":0,
"QTime":0,
"params":{
"q":"{!q.op=AND}id:\"704164\"",
"fq":"!embargo_i:1",
"wt":"json"}},
"response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
{
"thumb_s":"/9b/7a/9b7ac6b215bf5b75f1ace256710650ff0f6060bd.jpg",
"description_t":"Population geneticists work with a nonrandom sample of the human genome. Conventional practice ensures that unusually variable loci are most likely to be discovered and thus included in the sample of loci. Consequently, estimates of average heterozygosity are biased upward. In what follows we describe a model of this bias. When the mutation rate varies among loci, bias is increased. This effect is only moderate, however, so that a model of invariant mutation rates provides a reasonable approximation. Bias is pronounced when estimated heterozygosity is < approximately 35% Consequently, it probably affects estimates from classical polymorphisms as well as from restriction-site polymorphisms. Estimates from short-tandem-repeat polymorphisms have negligible bias, because of their high heterozygosity. Bias should vary not only among categories of polymorphism but also among populations. It should be largest in European populations, since these are the populations in which most polymorphisms were discovered. As this argument predicts, European estimates exceed those of Africa and Asia at systems with large bias. The magnitude of this European excess is consistent with the version of our model in which mutation rates vary across loci.",
"metadata_cataloger_t":"nc; mfb",
"restricted_i":0,
"rights_management_t":"(c) University of Chicago Press",
"ark_t":"ark:/87278/s67h22w3",
"identifier_t":"ir-main,1849",
"creator_t":"Rogers, Alan R.; Jorde, Lynn B.",
"subject_mesh_t":"Models, Genetic; Genetics, Population",
"parent_i":0,
"format_medium_t":"application/pdf",
"first_page_t":"1033",
"unid_t":"28949",
"publisher_t":"University of Chicago Press",
"file_s":"/bf/7e/bf7e2e48175cea9c4bb94decbad09fcaf9cd468e.pdf",
"date_digital_t":"2007-10-31",
"date_t":"1996-05",
"type_t":"Text",
"created_tdt":"2012-06-13T00:00:00Z",
"publication_type_t":"Journal Article",
"subject_t":"Bias (Epidemiology); Biometry; Heterozygote",
"format_extent_t":"1,286,118 bytes",
"mass_i":1515011812,
"title_t":"Ascertainment bias in estimates of average heterozygosity",
"setname_s":"ir_uspace",
"department_t":"Anthropology; Human Genetics",
"bibliographic_citation_t":"Rogers, A. R., & Jorde, L. B. (1996). Ascertainment bias in estimates of average heterozygosity. American Journal of Human Genetics, 58, 1033-41.",
"language_t":"eng",
"id":704164,
"oldid_t":"uspace 1979",
"format_t":"application/pdf",
"date_modified_t":"2008-11-25",
"modified_tdt":"2012-06-13T00:00:00Z",
"last_page_t":"1041",
"school_or_college_t":"School of Medicine; College of Social & Behavioral Science",
"_version_":1694699588763516928,
"ocr_t":"Am. ]. Hum. Genet. 5 8 :1 0 3 3 -1 0 4 1 , 1996 Ascertainment Bias in Estimates of Average Heterozygosity Alan R. Rogers1 and Lynn B. Jorde2 Departments of 'Anthropology and 2Human Genetics, University of Utah, Salt Lake City Summary Population geneticists work with a nonrandom sample of the human genome. Conventional practice ensures that unusually variable loci are most likely to be discovered and thus included in the sample of loci. Consequently, estimates of average heterozygosity are biased upward. In what follows we describe a model of this bias. When the mutation rate varies among loci, bias is increased. This effect is only moderate, however, so that a model of invariant mutation rates provides a reasonable approximation. Bias is pronounced when estimated heterozygosity is <~35%. Consequently, it probably affects estimates from classical polymorphisms as well as from restriction- site polymorphisms. Estimates from short-tandem-repeat polymorphisms have negligible bias, because of their high heterozygosity. Bias should vary not only among categories of polymorphism but also among populations. It should be largest in European populations, since these are the populations in which most polymorphisms were discovered. As this argument predicts, European estimates exceed those of Africa and Asia at systems with large bias. The magnitude of this European excess is consistent with the version of our model in which mutation rates vary across loci. The Problem Students of human population genetics are seldom lucky enough to work with loci drawn at random from the genome. More often, we work with loci chosen for their variability. Our sample of loci is therefore unusually variable, and estimates of heterozygosity are biased upward. This bias interferes with inference in various ways. It confounds comparisons of human heterozygosity with that of other species; it also confounds comparisons among human populations. 
Biased estimates of average heterozygosity also generate biased estimates of effective population size. Received March 10, 1995; accepted for publication January 29, 1996. Address for correspondence and reprints: Dr. Alan R. Rogers, Department of Anthropology, University of Utah, 102 Stewart Hall, Salt Lake City, UT 84112. E-mail: rogers@anthro.utah.edu © 1996 by The American Society of Human Genetics. All rights reserved. 0002-9297/96/5805-0016$02.00 Several mechanisms have introduced bias into the sample of human polymorphisms. Early work relied on blood groups, which are recognized by antigen-antibody reactions. Since reactions occur only between individuals who carry different alleles, polymorphic loci are most likely to be discovered and therefore included in the sample of loci. This inclusion introduces an ascertainment bias, which inflates estimates of heterozygosity. Lewontin (1967) pointed out that this bias would have been largest in the earliest studies, since they compared only limited numbers of individuals: \"Rare variants will be seen only as the number of bloods examined becomes larger and larger, so that at any particular time the sample of loci is biased toward polymorphic loci; but this bias will grow smaller as the number of bloods examined grows larger. Eventually, when all antigen-specifying loci are known, the bias would disappear\" (Lewontin 1967, p. 681). Lewontin used this argument to interpret the data in figure 1. There, \"cumulative heterozygosity\" in year x refers to the average heterozygosity over loci that had been discovered by year x. Cumulative heterozygosity declines with time, as Lewontin observed. Although the graph is nearly flat just before 1962, subsequent years saw a continued decline (Nei and Roychoudhury 1974, 1982). Ascertainment bias is also a problem in table 1, which uses various categories of data to compare heterozygosity estimates from different human populations. 
The columns are arranged from left to right in order of increasing European heterozygosity. Since these loci were nearly all ascertained using European subjects, the European estimates should have the largest bias (Bowcock et al. 1991; Cavalli-Sforza et al. 1994, pp. 141-42). That may explain the high European values in columns a and c-e. It is intriguing that the heterozygosity estimate for protein systems (column b) is also highest among Europeans, since many of these systems were ascertained not on the basis of variability but rather in an effort to estimate the overall heterozygosity level in humans (Harris and Hopkinson 1972). Indeed, the data set includes 18 monomorphic loci. However, Nei and Roychoudhury suggest that \"it is also possible that monomorphic loci have not been reported as often as have polymorphic loci in recent research, the reason being that many investigators are primarily interested in polymorphism\" (Nei and Roychoudhury 1982, p. 8). If so, then ascertainment bias may account for the elevated European value in column b as well as those in columns a and c-e. [Figure 1: Cumulative heterozygosity as a function of time (Lewontin 1967). Axes: year, heterozygosity.] The European excess disappears in columns f-h. To understand why, one must consider two opposing effects. First, there is sample size. Lewontin's argument implies that by the 1950s ascertainment of classical polymorphisms had come to involve large samples. Modern molecular polymorphisms, on the other hand, are ascertained using small samples. This distinction reflects their primary function: mapping disease genes. Since highly polymorphic loci are most useful in gene mapping, these polymorphisms are ascertained using a small number of subjects, usually no more than eight. Loci are ascertained as polymorphic only if there is some polymorphism in these small samples (Mountain and Cavalli-Sforza 1994). 
These considerations suggest that ascertainment samples were larger for classical polymorphisms (columns a-c of table 1) than for molecular polymorphisms (columns d-h). Since bias is greatest when ascertainment samples are small, we might expect the greatest bias in molecular polymorphisms (columns d-h in the table) and predict a pattern unlike that in the table. If bias were most pronounced in molecular polymorphisms, the European estimates should be large in columns d-h rather than in columns a-e. In addition to this sample-size effect, there is also an effect of heterozygosity. Bias results when loci with low heterozygosity are excluded. But short-tandem-repeat (STR) loci are so extremely variable that few loci may be excluded. If so, ascertainment bias should be weak in STR loci. This argument is consistent with the pattern in table 1. It suggests that the high European heterozygosity seen in columns a-e reflects ascertainment bias, which is important in those columns because of their relatively low heterozygosity. Mountain and Cavalli-Sforza (1994) used computer simulation to show that this idea is plausible. Yet several questions remain. First, it is not yet clear that the effect of heterozygosity on bias outweighs the effect of sample size; Mountain and Cavalli-Sforza did not consider the two effects separately. Neither is it clear that the crossover from high European values to high African values occurs at the right level of heterozygosity. After all, the RFLP and RSP loci (RFLPs consisting solely of restriction-site polymorphisms) in table 1 are much more heterozygous than the classical polymorphisms in columns a-c. Perhaps the heterozygosity effect would predict a crossover between columns c and d rather than between columns e and f. To answer such questions, we need a model relating heterozygosity to bias and to the size of samples used in ascertainment. 
In what follows, we describe such a model and apply it to the data of figure 1 and table 1. In building such a model, one must assume something about the statistical distribution from which mutation rates are drawn. Our model will assume that selective neutrality and stationary population size have prevailed long enough for the population to reach a mutation-drift equilibrium at each locus. Table 1. Average Heterozygosity. Columns, left to right: Blood Group (a), Protein (b), Classical (c), RFLP (d), RSP (e), STR-4 (f), STR-2 (g), STR-3 (h). Africa: .164, .179, .163, .297, .322, .769, .807, .850. Asia: .145, .164, .189, .327, .377, .681, .685, .820. Europe: .179, .186, .202, .379, .432, .724, .730, .807. Note: Largest entry in each column is underlined. Columns are in order of increasing European heterozygosity. (a) 32 blood groups (Nei et al. 1993). (b) 80 protein polymorphisms (Nei et al. 1993). (c) 110 classical polymorphisms (Bowcock et al. 1994). (d) 79 RFLPs (Bowcock et al. 1994). (e) 30 RFLPs consisting solely of restriction site polymorphisms (Jorde et al. 1995^). (f) 30 tetranucleotide STRs (Jorde et al. 1995a). (g) 30 dinucleotide STRs. The difference between Africa and Europe is significant (Bowcock et al. 1994). (h) 5 trinucleotide STRs (Watkins et al. 1995). Model We imagine that research proceeds in two stages. In stage I, the ascertainment stage, a small number of subjects are typed at a large number of loci to determine which loci are polymorphic. In stage II, a large number of subjects are typed at the polymorphic loci to estimate heterozygosity. Bias arises if the loci studied in stage II are more heterozygous than randomly chosen loci would have been. We refer to the sample of stage I as the \"ascertainment sample.\" In stage II, we calculate only the expected value of the estimate of heterozygosity. This step makes it unnecessary to deal explicitly with the sample size in stage II. 
In stage I, we assume that loci are ascertained as polymorphic by typing a sample of z statistically independent individuals (or 2z independent genes). If the 2z genes are identical, then the locus is deemed to be monomorphic and is discarded. Otherwise, the locus is ascertained as polymorphic. We denote by A the event that a given locus was ascertained as polymorphic by this method. Our assumption accepts a locus as polymorphic if even a single variant gene is found in the sample. Procedures that require more variants than this will induce a larger bias. Thus, our assumption provides a lower bound on the bias for samples of a given size. In addition to providing a lower bound, our assumption is also a fair description of recent practice. It provides only a crude approximation, however, to the procedures by which older polymorphisms were ascertained. In those cases it provides only a lower bound on the bias. We assume that each locus has K alleles and denote the vector of allele frequencies by x = (x_1, x_2, ..., x_K). We also assume that the mutational process is symmetric, so that each allele is equally likely to mutate to each of the K - 1 other alleles. These assumptions imply that the probability density p of x is symmetric: the density of x is equal to that of every permutation of x. This symmetry applies not only to p, but also to the conditional density p_A of x, given A. Because of this symmetry, the conditional heterozygosity given A can be written as h_A = 1 - K E[x_1^2 | A], (1) where E denotes the expectation operator. The first section of the appendix shows that the expectation in this equation equals E[x_1^2 | A] = (E[x_1^2] - E[x_1^{2z+2}] - (K - 1) E[x_1^2 x_2^{2z}]) / (1 - K E[x_1^{2z}]). (2) To proceed further, it is necessary to specify the probability distribution of x, and we rely for this purpose on the assumption of mutation-drift equilibrium. This assumption implies that x has a Dirichlet distribution with density (Ewens 1979, Eq. [5.108]) p(x) = [Γ(Kα) / Γ(α)^K] Π_{i=1}^{K} x_i^{α-1}, (3) where Γ is the Gamma function (Abramowitz and Stegun 1964), α ≡ θ/(K - 1), and θ ≡ 4Nu. Here, u is the mutation rate and N the effective population size. Conventionally, population geneticists have treated u as a constant. We employ this assumption below in model A and then relax it in developing model B. Model A: Fixed u We assume for the moment that all loci have the same mutation rate, u, and consequently have the same values of θ = 4Nu and of α = θ/(K - 1). This assumption implies that each of the K marginal distributions is a Beta distribution with parameters α and (K - 1)α and with mean 1/K. When α is small, most alleles have frequencies near 0 or 1, and heterozygosity is low. When α is large, most allele frequencies are near 1/K and heterozygosity is high, approaching 1 - 1/K as α → ∞. Substituting equations (11) and (13) (from the appendix) into equations (2) and (1) leads to the conditional heterozygosity, h_A = 1 - K [Γ(α + 2)/Γ(Kα + 2) - Γ(α + 2z + 2)/Γ(Kα + 2z + 2) - (K - 1) Γ(α + 2)Γ(α + 2z)/(Γ(α)Γ(Kα + 2z + 2))] × [Γ(α)/Γ(Kα) - K Γ(α + 2z)/Γ(Kα + 2z)]^{-1}. (4) [Figure 2: Biased-against-unbiased heterozygosity under model A. The left and right panels show the bias when ascertainment samples are z = 6 and z = 500, respectively.] Meanwhile, unconditional heterozygosity is h = 1 - K E[x_1^2] = 1 - (α + 1)/(Kα + 1), (5) as shown in the last section of the appendix (Ewens 1979, eq. [5.118]). In the limit as K → ∞ these become h_A = 1 - [1/(θ + 1) - θβ(2z + 2, θ) - θβ(2z, θ + 2)/(θ + 1)] / [1 - θβ(2z, θ)], (6) h = θ/(θ + 1). (7) Here, β(1%. This procedure, together with the assumptions that N = 10,000 and u = 10^-7 (Mountain 1994, pp. 117-19) led in their simulations to a biased European heterozygosity of .379 ± .015 (Mountain and Cavalli-Sforza 1994, p. 6517). 
Our procedure, on the other hand, looks at a small \"ascertainment sample\" and accepts loci if at least two alleles are found within this sample. To compare these two procedures, we used their simulation parameters (see above) to set θ and then used our model to calculate biased heterozygosity, h_A, under various assumptions about the number of alleles and the size of the ascertainment sample. In no case was our h_A as large as their estimate. The maximal value under our model is obtained with an ascertainment sample of one individual under the infinite-alleles model: h_A = .3349. This value is not far below the lower bound, .3496, of the confidence interval surrounding their estimate. Thus, there is no strong evidence that the two procedures produce different biases. There is a weak indication, however, that their procedure introduces a greater bias, which would have reduced their chances of rejecting the hypothesis of ascertainment bias. Since they did reject this hypothesis, the difference between our results must reflect assumptions 1 and/or 2. We turn finally to the heterozygosity estimates from STR loci (see table 1). These loci differ from all others in suggesting that heterozygosity is highest in Africa rather than Europe. Heterozygosity is extremely high in these data (>70%) and figure 2 shows that this eliminates nearly all ascertainment bias. These results may still be artifacts of sampling error, for the high African value is significant in only one of the three columns, and that one significant result may be spurious. (It treats linked loci as statistically independent [Bowcock et al. 1994].) But if our model is even approximately correct, then the STR loci are probably not affected much by ascertainment bias. It is interesting that STR-3 loci yield estimates so similar to the other STRs, since each of the STR-3 loci can cause genetic disease (Jorde et al. 1995b). 
These loci also imply a pattern of population relationships that is consistent with that implied by other sets of loci (Watkins et al. 1995). Thus, although selection has certainly affected these loci, it has produced no obvious distortion in genetic distances or in average heterozygosity. The high African values at STR loci cast doubt on the suggestion (Mountain and Cavalli-Sforza 1994) that European heterozygosity is elevated in RFLP loci because the European population is admixed. Admixture should elevate heterozygosity at STR loci, too, yet the data show no evidence of this. It seems likely that much of the European excess results from the ascertainment of polymorphisms in European populations. On the other hand, other factors may also be at work: 1) We have not studied the effect of variation in mutation rates on correlations between the bias observed in different groups. When mutation rates vary among loci, the loci that are ascertained as polymorphic will tend to have high mutation rates, inflating heterozygosity estimates in all groups. When h_A is inflated in Europe, it will tend also to be inflated in Africa and Asia. When we account for this effect, it may turn out that ascertainment bias cannot account for the observed group differences. 2) Our analysis is conservative in using the European bias to place an upper bound on the difference between African and European biases. If we could calculate this difference directly, as Mountain and Cavalli-Sforza do (1994), we might reject the hypothesis of ascertainment bias. Conclusions The sample of human genetic loci is biased in favor of polymorphic loci, and estimates of average heterozygosity are therefore biased upward. The apparent asymptote in the graph of average heterozygosity against time (fig. 1) does not imply that classical polymorphisms yield unbiased estimates of average heterozygosity. 
Because the procedure used in ascertaining modern molecular polymorphisms is fairly well described, one can calculate the bias that it introduces into estimates of heterozygosity. When estimated heterozygosity is below ~.3, bias is large. As estimated heterozygosity increases, bias decreases and eventually becomes negligible. The point at which this occurs varies among models. With two alleles and a fixed mutation rate, bias is negligible when estimated heterozygosity exceeds ~.35. Because of their high heterozygosity, STR loci are essentially free of ascertainment bias. These loci are therefore uniquely useful for comparing populations. Race differences in estimated heterozygosity are larger than predicted by the version of our model that assumes all loci to have equal rates of mutation. When varying mutation rates are allowed, however, the magnitude of bias is consistent with observed race differences. Acknowledgments We thank L. Luca Cavalli-Sforza, Henry Harpending, Li Jin, and Joanna Mountain for comments, and Scott Watkins for providing the data in the STR-3 column in table 1. Wen-Hsiung Li pointed out to us that the geographic structure of a sample can affect estimates of heterozygosity. This research was supported in part by National Science Foundation grant DBS-9310105. Appendix example, E[x_i] is obtained by setting s_i = 1 and setting s_j = 0 for all j ≠ i; E[x_i^2 x_j^{2z}] is obtained by setting s_i = 2, s_j = 2z, and setting all the other s_k equal to zero. For the Dirichlet distribution, the general moment is (Wilks 1962, eq. [7.7.6] on p. 179) m(s) = [Γ(Kα) / Γ(Kα + Σ_i s_i)] Π_i [Γ(α + s_i) / Γ(α)]. (10) Derivation of Expression for E[x_1^2 | A] Bayes's rule allows the conditional density of x to be written as p_A(x) = Pr(A|x) p(x) / Pr(A). (9) Given x, the conditional probability of A is Pr(A|x) = 1 - Σ_{i=1}^{K} x_i^{2z}. The unconditional probability of