A very brief intro to probability in formal genetics

Some terminology

Organisms, such as humans, with two sets of chromosomes (and consequently two sets of genes) are diploid. However, the gametes (the reproductive cells) of such organisms are formed by meiosis, a process whereby the double set of chromosomes is reduced to one. In other words, gametes are haploid: they have only one set of chromosomes. At fertilization, two gametes join, and the offspring will be diploid, with a double set of chromosomes, one from each parent.

Locus: the location of a gene in the DNA strand.

Allele: a variant of a single gene at a locus, that is, a sequence of nucleotides (the building blocks of DNA; in humans about 5,000 on average make up one gene) coding for mRNA (messenger RNA). In a population, there may be many alleles at a given locus. They are usually represented by letters, e.g., A, a, or b.

Genotype: the pair of alleles at a locus possessed by a diploid individual.

Homozygote: a genotype with two copies of the same allele at a locus, such as AA or aa.

Heterozygote: a genotype with two different alleles at a locus, such as Aa.

The Mendelian ratios

The Mendelian ratios give the proportions of the genotypes in the offspring of a given combination of parental genotypes. The ratios are obtained by enumerating (for example, with a tree) the equally likely ways in which each parent passes on one of its two alleles. The easiest case is one involving only one locus. For example, suppose that an AA and an Aa mate. Then, 1/2 of the offspring will be AA and 1/2 Aa; by contrast, if two Aa mate, 1/4 of the offspring will be AA, 1/4 will be aa, and 1/2 will be Aa.
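The enumeration just described can be sketched in a few lines of Python. This is only an illustration: the function name cross and the two-letter string encoding of genotypes are our own choices, and the sketch assumes a single locus with each parental allele transmitted with probability 1/2.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1: str, parent2: str) -> dict:
    """Offspring genotype proportions for a one-locus cross.

    Hypothetical helper: each parent is a two-letter genotype such as
    "AA" or "Aa", and each of its two alleles is passed on with
    probability 1/2.
    """
    counts = Counter()
    # Every pairing of one allele from each parent is equally likely.
    for allele1, allele2 in product(parent1, parent2):
        # Sort so that "aA" and "Aa" are counted as the same genotype
        # (uppercase letters sort before lowercase ones).
        genotype = "".join(sorted(allele1 + allele2))
        counts[genotype] += 1
    total = sum(counts.values())
    return {g: Fraction(n, total) for g, n in counts.items()}


print(cross("AA", "Aa"))  # AA and Aa, each with proportion 1/2
print(cross("Aa", "Aa"))  # AA 1/4, Aa 1/2, aa 1/4
```

Exact fractions are used instead of floating-point numbers so that the results can be compared directly with the ratios in the text.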

Frequencies

The frequency of X in a given population is the probability of picking X randomly, that is, the number of X's divided by the number of individuals in the population. For example, consider a population of 8 individuals with two possible alleles, A and a, per locus such that the 8 individuals have the following genotypes:

AA, Aa, aa, AA, Aa, Aa, AA, Aa.

Then, the frequency P of the genotype AA (that is, P = Pr[AA is picked at random]) is P = 3/8; that of aa is R = 1/8; that of Aa is Q = 4/8. Note that P+Q+R=1, as it should be.

Now let p and q be the frequencies of alleles A and a, respectively. In our case, since there are 10 A's among the 16 alleles (2 per individual), p=10/16, and q=6/16. What is the relation between P, Q, R, and p, q? Since p is the number of A's divided by the number of alleles, which is twice the number of individuals, one way to obtain p is to take twice the number of AA's, add the number of Aa's, and divide the whole by twice the number of individuals. In other words,

p = (16P+8Q)/16 = P + Q/2. (1)

Analogously:

q = R + Q/2. (2)

Again, note that p+q=1, as it should be, since an allele must be A or a. So, as we saw, in our example population, p = .625, and q = .375.
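As a check on equations (1) and (2), the frequencies for the eight-individual example can be computed directly. The sketch below is only illustrative; the variable names (population, p_direct, and so on) are ours, and genotypes are encoded as two-letter strings.

```python
from fractions import Fraction

# The eight genotypes from the example above.
population = ["AA", "Aa", "aa", "AA", "Aa", "Aa", "AA", "Aa"]
n = len(population)

# Genotype frequencies: count of each genotype over number of individuals.
P = Fraction(population.count("AA"), n)  # 3/8
Q = Fraction(population.count("Aa"), n)  # 4/8 = 1/2
R = Fraction(population.count("aa"), n)  # 1/8

# Allele frequency by direct count: number of A's over 2n alleles.
p_direct = Fraction(sum(g.count("A") for g in population), 2 * n)

# Allele frequencies from equations (1) and (2).
p = P + Q / 2
q = R + Q / 2

print(P, Q, R)         # 3/8 1/2 1/8
print(p, q, p_direct)  # 5/8 3/8 5/8
```

The direct count and equation (1) agree, as they must.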

The Hardy-Weinberg theorem

Evolution requires the preservation of variations from one generation to the next. This entails that the offspring's features are not a blending (an average) of those of the parents. For example, if the height of the offspring is an average of those of the parents, eventually everybody will have the same height, and evolution with respect to height will stop. So, how can diversity be preserved? Darwin did not have a satisfactory answer.

The solution consists of two parts. One is that at the genetic level inheritance is particulate, as Mendel had discovered already in 1866, unbeknown to Darwin. The second consists of a simple mathematical argument published independently in 1908 by Hardy, a mathematician, and Weinberg, a physician.

Suppose that we have a population in which the frequencies of A and a are p and q, respectively. Let us assume that:

(i) The population is large enough so that if the frequency of an allele (A, for example) is p, then for all intents and purposes the frequency of A remains p even if one of the A's is taken away.

(ii) Mating is random.

(iii) There is no selective force acting on the genotypes.

(iv) No genetic drift (a random process by which allele distribution is changed) is taking place.

Let us use subscripts to denote generations, so that (AA)₂ denotes the genotype AA in the second generation, and the same for the other symbols. Then, the probability that an A₁ will join with another A₁ (i.e., the probability of randomly picking two A's) to produce an (AA)₂ is

Pr(AA)₂ = P₂ = p × p = p². (3)

Analogously,

Pr(aa)₂ = R₂ = q² (4)

and

Pr(Aa)₂ = Q₂ = 2pq. (5)

By (1) and (2) applied to the second generation

p₂ = P₂ + Q₂/2 (6)

and

q₂ = R₂ + Q₂/2. (6’)

Applying (3)-(5) to (6)-(6’), we obtain

p₂ = p² + pq = p(p+q) = p (7)

as p+q =1.

Similarly,

q₂ = q. (8)

In short, allele frequencies remain unchanged from one generation to the next, which means that genetic diversity is preserved. As long as (i)-(iv) are satisfied, any population will satisfy (7)-(8) in any generation. In that case, the population's genotypes (the genotypes responsible for the MN blood group system in humans, for example) are in Hardy-Weinberg equilibrium, with frequencies p², 2pq, and q². By contrast, when experimental evidence suggests that the genotypes do not satisfy (7)-(8), then at least one of the previous four conditions is not met.
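Equations (3)-(8) can be verified numerically. The following is a minimal sketch, assuming conditions (i)-(iv) hold; the function name next_generation is our own, and the starting value p = 5/8 is taken from the eight-individual example.

```python
from fractions import Fraction


def next_generation(p):
    """One round of random mating under the Hardy-Weinberg conditions.

    Hypothetical helper: takes the frequency p of allele A and returns
    the genotype frequencies (P2, Q2, R2) of equations (3)-(5) together
    with the allele frequencies (p2, q2) of equations (6)-(6').
    """
    q = 1 - p
    P2 = p * p      # Pr(AA), equation (3)
    R2 = q * q      # Pr(aa), equation (4)
    Q2 = 2 * p * q  # Pr(Aa), equation (5)
    p2 = P2 + Q2 / 2
    q2 = R2 + Q2 / 2
    return (P2, Q2, R2), (p2, q2)


genotypes, alleles = next_generation(Fraction(5, 8))
print(genotypes)  # frequencies of AA, Aa, aa: 25/64, 15/32, 9/64
print(alleles)    # allele frequencies are unchanged: 5/8 and 3/8
```

Feeding the returned p2 back into the function reproduces the same genotype frequencies, which is the equilibrium described above.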

Introducing selection

Suppose that a population satisfies the Hardy-Weinberg conditions. However, imagine that at some point it fails to meet condition (iii) because one allele at one locus is favored. For example, suppose that while all the AA and Aa grow to adulthood and reproduce, only a fraction of the aa do. This is expressed by a selection coefficient s measuring aa's reduction in fitness with respect to the best genotype. Since all AA and Aa become adults and reproduce,

Pr(AA reproduces) = Pr(Aa reproduces) = 1, (9)

while some aa’s die, so that

Pr(aa reproduces) = 1-s, (10)

where s is a number between 0 and 1. Then, the relative frequency of newborn second-generation AA (of (AA)₂, that is) will remain the same, namely p² × 1 = p²; similarly, that of (Aa)₂ will remain 2pq × 1 = 2pq. By contrast, because not all aa reproduce, the relative frequency of newborn (aa)₂ will change to q²(1-s).

Normalization

However, we are faced with a difficulty: p², 2pq, and q²(1-s) are not the (absolute) frequencies because they do not add up to 1, as they should. (Indeed, because p² + 2pq + q² = 1, they cannot add up to 1 unless s=0, which would mean that there is no selection, or q=0.) Rather, they are relative proportions. The problem is easy to see: the number of reproducing adults is less than the number at birth, and this must be expressed in the frequencies. To go from relative proportions to absolute proportions (that is, to frequencies or probabilities), we need to apply normalization.

Since every genotype is AA or Aa or aa, the sum of their frequencies must be equal to 1. Hence, there must be some number c such that

1 = c[p² + 2pq + q²(1-s)]. (11)

Our goal is to determine c.

Developing, we obtain

1 = c[p² + 2pq + q² - q²s], (12)

and since

p² + 2pq + q² = 1, (13)

we obtain

1 = c[1-q²s], (14)

so that

c = 1/ (1-q²s). (15)

Consequently, the absolute frequencies of (AA)₁, (Aa)₁, and (aa)₁ at reproductive adulthood are

Pr[(AA)₁ reproduces] = Pr[(AA)₂] = p² /(1-q²s), (16)

Pr[(Aa)₁ reproduces] = Pr[(Aa)₂]= (2pq)/(1-q²s), (17)

and

Pr[(aa)₁ reproduces] = Pr[(aa)₂] = [q² (1-s)]/(1-q²s). (18)
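The normalization in (11)-(18) can be sketched as follows. The function name frequencies_after_selection and the sample values p = 5/8 and s = 1/2 are our own illustrative assumptions, not values given in the text.

```python
from fractions import Fraction


def frequencies_after_selection(p, s):
    """Normalized genotype frequencies when only aa is selected against.

    Hypothetical helper: p is the frequency of allele A and s is the
    selection coefficient against aa, with 0 <= s <= 1.
    """
    q = 1 - p
    # Relative proportions before normalization.
    relative = {"AA": p * p, "Aa": 2 * p * q, "aa": q * q * (1 - s)}
    # Normalizing constant c of equation (15): c = 1/(1 - q^2 s).
    c = 1 / sum(relative.values())
    return {genotype: c * value for genotype, value in relative.items()}


freqs = frequencies_after_selection(Fraction(5, 8), Fraction(1, 2))
print(freqs)                # AA 50/119, Aa 60/119, aa 9/119
print(sum(freqs.values()))  # 1
```

Computing c as the reciprocal of the sum of the relative proportions is the same as using 1/(1-q²s), since that sum equals 1-q²s by (12)-(14).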

The generational change in allele frequency

Using equations (1) and (16)-(17) and the fact that p+q=1, we know that the frequency of allele A in the second generation is

Pr(A₁ reproduces) = {p² /(1-q²s)} + {pq/(1-q²s)} = p/(1-q²s). (19)

(Note that since s is a positive number and q > 0, 1-q²s<1, and therefore p/(1-q²s)>p.) Hence, the generational change in the frequency of allele A is

Δp = Pr(A₁ reproduces) - Pr(A) = {p/(1-q²s)} - p = (spq²)/(1-q²s). (20)

Every generation, the frequency of allele A will change by some determinable Δp depending on p and s. (Note that while the value of s is constant, that of p changes at every generation.) So, if we know the number of generations, we can predict the changes in gene frequency. In addition, from (20) we can determine the selection coefficient

s = [Δp/q²] [(1-q²s)/p] = Δp/{q² Pr(A₁ reproduces)}. (21)

Since Pr(A₁ reproduces) = p + Δp, the right-hand side of (21) depends only on the allele frequencies before and after selection.
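To see how the frequency of A moves over several generations, equation (20) can be iterated, and equation (21) can be used to recover s from an observed change. The sketch below is illustrative only: the function names and the sample values p = 5/8 and s = 1/2 are our own assumptions, and s is assumed constant across generations.

```python
from fractions import Fraction


def delta_p(p, s):
    """Change in the frequency of A in one generation, equation (20)."""
    q = 1 - p
    return s * p * q**2 / (1 - s * q**2)


def allele_trajectory(p, s, generations):
    """Frequency of A in each generation, starting from p (hypothetical helper)."""
    history = [p]
    for _ in range(generations):
        p = p + delta_p(p, s)
        history.append(p)
    return history


def selection_coefficient(p, dp):
    """Recover s from p and the observed change dp, equation (21).

    Uses Pr(A reproduces) = p + dp.
    """
    q = 1 - p
    return dp / (q**2 * (p + dp))


p0, s = Fraction(5, 8), Fraction(1, 2)
dp = delta_p(p0, s)
print(dp)                             # 45/952
print(selection_coefficient(p0, dp))  # 1/2
print([float(x) for x in allele_trajectory(p0, s, 5)])
```

The trajectory increases at every step, in agreement with the note that p/(1-q²s) > p.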

The Wahlund Effect

Imagine we have two separate populations of equal size on two separate islands. Suppose that in population 1 the gene frequency is .2 for A and .8 for a, and in population 2 it is .7 for A and .3 for a. Then, if the Hardy-Weinberg conditions are satisfied or nearly so, in one generation the genotype frequencies in population 1 are

Pr(AA)₂ = .04; Pr(Aa)₂ = .32; Pr(aa)₂ = .64,

and those in population 2 are

Pr(AA)₂ = .49; Pr(Aa)₂ = .42; Pr(aa)₂ = .09.

Note that the average frequency of AA in the two populations together is .265, that of Aa is .37, and that of aa is .365.

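Suppose, however, that the two populations fuse (let's say that now the two islands are connected by a bridge and interbreeding occurs). Then, since the populations are of equal size, the allele frequencies in the fused population are the averages of those in the two original populations: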

Pr(A) = .45 and Pr(a) = .55.

Hence, if the Hardy-Weinberg conditions are satisfied, in one generation the (rounded off) values are

Pr(AA)₂ = .2; Pr(Aa)₂ = .5; Pr(aa)₂ = .3.

The frequency of Aa (of heterozygotes) has gone up from .37 to .5; in other words, the frequency of heterozygotes in a fused population is higher than its average in the equivalent subdivided population. This is the Wahlund effect. Typically, rare recessive genetic diseases are associated with homozygotes (AA or aa); consequently, when isolated populations merge, the incidence of such diseases goes down as soon as the two populations interbreed.
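The numbers in this example can be reproduced with a short sketch. It assumes, as in the text, two populations of equal size that each satisfy the Hardy-Weinberg conditions; the function name hardy_weinberg and the variable names are our own.

```python
from fractions import Fraction


def hardy_weinberg(p):
    """Genotype frequencies p^2, 2pq, q^2 for allele frequency p of A."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


p1, p2 = Fraction(2, 10), Fraction(7, 10)

# Subdivided case: each island is in equilibrium separately.
island1, island2 = hardy_weinberg(p1), hardy_weinberg(p2)
average = {g: (island1[g] + island2[g]) / 2 for g in island1}

# Fused case: equal sizes, so the allele frequency is the plain average.
fused = hardy_weinberg((p1 + p2) / 2)

for genotype in ("AA", "Aa", "aa"):
    print(genotype, float(average[genotype]), float(fused[genotype]))
# AA 0.265 0.2025
# Aa 0.37 0.495
# aa 0.365 0.3025
```

The unrounded fused values are .2025, .495, and .3025, which round to the .2, .5, and .3 quoted above.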