M241 Probability
Chapter 2: Repeated Trials and Sampling
Introduction
Model: Repeated independent trials, each of which results in success (event happening)
or failure (event not happening).
Each trial is called a Bernoulli (p) trial, where p is the probability
of success and q = 1 - p is the probability of failure. (See Section 1.3,
page 27.)
Examples are:
-
tossing a fair coin (p = 1/2)
-
rolling a die wanting a six (p = 1/6)
-
rolling a pair of dice wanting two sixes (p = 1/36)
-
giving birth, wanting a girl (p = .487)
Appendix 1 Basic counting Principles
-
Multiplication Rule for counting -- used for a sequence of choices
-
Formula for the number of sequences of length k from n elements is a consequence of the multiplication rule: $n^k$
-
Summation Rule for counting -- used for alternative choices
-
Formula for Number of Orderings of k out of n elements (also called k-permutations): $n(n-1)\cdots(n-k+1) = n!/(n-k)!$
-
Factorial notation -- $n! = n(n-1)\cdots(2)(1)$
-
0! = 1
-
Formula for Number of Combinations (subsets of size k) -- Binomial coefficient: $\binom{n}{k} = n!/(k!(n-k)!)$
-
Formula for Number of Subsets of a Set of n Elements: $2^n$
Applications of counting principles:
1. (Summation Rule) Suppose you must pick a faculty member to
serve as the Division Chair of Natural Sciences -- there are 3 faculty
in Chem, 4 in Physics, 3 in Bio, and 2 in Earth Sciences. How many
different ways can you choose a faculty member?
2. (Multiplication Rule) Suppose you must choose one member from
each department in Natural Sciences to be on an Assessment Committee.
How many different ways can you make up the committee?
3. Suppose you have 8 students and must choose any 3 to work on
a special project.
How many ways can you make up the group?
4. Suppose you have 10 contestants and must choose 5 finalists
(unordered). How many ways can you do this? Same question except ordered -- i.e. pick the winner,
first runner-up, etc.
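The four applications above can be checked numerically. The following is a minimal sketch using Python's standard `math` module; the variable names are illustrative and not part of the notes.

```python
from math import comb, perm, prod

# Department sizes as given in Application 1.
faculty = {"Chem": 3, "Physics": 4, "Bio": 3, "Earth Sciences": 2}

# 1. Summation rule: one chair chosen from alternative departments.
chair_choices = sum(faculty.values())

# 2. Multiplication rule: one member chosen from each department in sequence.
committee_choices = prod(faculty.values())

# 3. Combinations: unordered group of 3 out of 8 students.
project_groups = comb(8, 3)

# 4. Unordered finalists (combinations) versus ranked finalists (orderings).
finalists_unordered = comb(10, 5)
finalists_ordered = perm(10, 5)

print(chair_choices, committee_choices, project_groups,
      finalists_unordered, finalists_ordered)
```

Note that the ordered count in Application 4 is 5! times the unordered count, since each group of 5 finalists can be ranked in 5! ways.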
(2.1) The Binomial Distribution
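What is the probability of getting exactly k successes in n Bernoulli (p)
trials?
$P(k \text{ successes in } n \text{ trials}) = \binom{n}{k} p^k q^{n-k}$, where $\binom{n}{k} = n!/(k!(n-k)!)$,
for k = 0, 1, 2, ..., n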
This distribution is called the Binomial (n,p) distribution.
For n = 4, a tree diagram can be used to determine the answer.
P(0 successes in 4 trials) = $q^4$
P(1 success in 4 trials) = $4pq^3$
P(2 successes in 4 trials) = $6p^2q^2$
P(3 successes in 4 trials) = $4p^3q$
P(4 successes in 4 trials) = $p^4$
This is really a "condensed" tree since paths that lead to the same
number of successes are joined and labeled with the sum of the probabilities
along those paths.
For example, the node in the tree labeled $3pq^2$,
which represents the probability of 1 success and two failures in 3 trials,
has 3 different paths to it: FFS, FSF, SFF.
What would the next row in this tree look like? (The row representing
k successes in 5 trials)
In general we write
-
P(k successes in n trials) = $\frac{n!}{k!(n-k)!} p^k q^{n-k}$
-
The binomial coefficient $n!/(k!(n-k)!)$ represents the
number of ways we can choose k of n places for success.
-
This is one of the basic rules of counting
-- see Appendix 1
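As a rough illustration of the general formula, the sketch below evaluates the binomial probabilities directly and lists the coefficients for the n = 4 and n = 5 rows of the condensed tree. The function name `binomial_pmf` and the choice p = 1/6 are assumptions made for the example.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(k successes in n Bernoulli(p) trials) = C(n, k) * p^k * q^(n-k)."""
    q = 1 - p
    return comb(n, k) * p**k * q**(n - k)

# Coefficients in the n = 4 row of the tree, and in the next row (n = 5).
row4 = [comb(4, k) for k in range(5)]
row5 = [comb(5, k) for k in range(6)]
print(row4)
print(row5)

# The probabilities over k = 0, ..., n add up to 1 for any p (here p = 1/6).
p = 1 / 6
total = sum(binomial_pmf(k, 4, p) for k in range(5))
print(total)
```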
Look at Example 1:
-
Problem 1: Probability of getting four or more heads in six tosses
of a fair coin.
-
Problem 2: Probability that among five families, each with six children,
at least 3 of the families have four or more girls (assuming each child
is equally likely to be a girl or a boy).
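One way to work the two problems is sketched below with exact fractions. Problem 2 reuses the answer to Problem 1 as the success probability for each family, treating the five families as independent Bernoulli trials. The helper name `binomial_tail` is hypothetical.

```python
from fractions import Fraction
from math import comb

def binomial_tail(n, p, k_min):
    """P(at least k_min successes in n Bernoulli(p) trials)."""
    q = 1 - p
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(k_min, n + 1))

# Problem 1: four or more heads in six tosses of a fair coin.
problem1 = binomial_tail(6, Fraction(1, 2), 4)

# Problem 2: each family is a "success" with probability problem1
# (four or more girls among six children); want at least 3 of 5 families.
problem2 = binomial_tail(5, problem1, 3)

print(problem1)         # exact fraction
print(float(problem2))  # decimal value
```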
Facts (we will not show):
-
Most likely number of successes (Mode) is int(np + p)
-
Expected number of Successes (Mean or Average) is np
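These two facts are not proved here, but they can be checked numerically for a particular case. The sketch below uses hypothetical values n = 20 and p = 0.3, chosen so that np + p is not an integer (when it is an integer there are two equally likely modes).

```python
from math import comb, floor

def binomial_pmf(k, n, p):
    """P(k successes in n Bernoulli(p) trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.3  # hypothetical values for illustration only
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Most likely number of successes, found by direct search.
mode = max(range(n + 1), key=lambda k: pmf[k])

# Expected number of successes, computed from the definition of the mean.
mean = sum(k * pmf[k] for k in range(n + 1))

print(mode, floor(n * p + p))
print(mean, n * p)
```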
(2.2) Normal Approximation: Method
-
Histograms of the binomial distribution. (See pages 88-89.)
-
Figure 4 shows the distribution for p = ½ and increasing
values for n between 10 and 100.
As n increases, the distribution shifts to the right, remaining
centered about the mean, np = n/2.
It also becomes more spread out, but is still symmetric.
-
Figure 5 shows how the distribution changes when p is changed while
holding n fixed at 100. The distribution shifts to the right as p
increases, remaining approximately symmetric about the mean np = 100p.
-
In both cases the histograms remain bell-shaped.
-
The area of each bar in the histogram represents the probability of k
successes, where k is the integer along the x axis over which the bar is
centered.
-
This means that we can use the area under the "normal curve" to approximate
probabilities in binomial distributions when n is large enough.
-
In general, the normal curve represents a "continuous histogram" and areas
under sections of the curve represent probabilities.
-
The normal curve has the equation
$y = \frac{1}{s\sqrt{2\pi}} e^{-(x-m)^2/(2s^2)}$
-
Here m is the mean and s is
the standard deviation (often written $\mu$ and $\sigma$; to be explained later in chapter 3).
-
When m is 0 and s is 1, we have the standard normal curve:
$y = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
-
This function is called the standard normal density function and
we denote it by f(x).
-
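We use areas under the curve (i.e. integrals of the curve) to approximate
probabilities. The function
$F(z) = \int_{-\infty}^{z} f(x)\,dx$
represents the area under the curve between $-\infty$ and z.
$F(\infty) = \int_{-\infty}^{\infty} f(x)\,dx$
represents the total area under the curve and IS EQUAL TO 1.
$F(a,b) = F(b) - F(a) = \int_{a}^{b} f(x)\,dx$
is the area under the curve between a and b.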
-
We can look up values for F(z) using the Normal Table (Appendix 5),
or calculate them on a computer or calculator.
The chart only gives values for positive arguments. Since the curve
is symmetric about 0, we can use:
$F(-z) = 1 - F(z)$
-
If we have a normal distribution with mean m and
standard deviation s,
then we must scale -- the area between a and b under the normal curve
with mean m and
standard deviation s is
$F\left(\frac{b-m}{s}\right) - F\left(\frac{a-m}{s}\right)$
This is easily shown by using the following change of variable in the
integration of the normal curve over (a, b):
$z = (x - m)/s$
-
Summary of properties of the standard normal cumulative distribution
function (Normal c.d.f.) F(x):
-
F(a,b) = F(b) - F(a) is the area between a and b under the
standard normal curve and represents the probability of the interval (a,b)
for a standard normal distribution.
-
The limit of F(x) as x -> infinity is 1. (The area under the
entire standard normal density curve is 1.)
-
F(-x) = 1 - F(x)
(The density curve is symmetric about 0.)
-
2F(b) - 1 is the area between -b and b under
the standard normal curve.
-
F((b-m)/s) - F((a-m)/s)
is the area between a and b under the normal curve with mean m
and standard deviation s.
-
F(-1,1) is approximately .68; F(-2,2) is approximately .95; F(-3,3)
is approximately .997
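In place of the Normal Table, the properties above can be checked on a computer. The sketch below writes F in terms of the error function from Python's `math` module, which is a standard identity rather than anything specific to these notes. The function names are illustrative.

```python
from math import erf, sqrt

def std_normal_cdf(z: float) -> float:
    """F(z): area under the standard normal density to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_area(a: float, b: float, m: float = 0.0, s: float = 1.0) -> float:
    """Area between a and b under the normal curve with mean m and
    standard deviation s, obtained by scaling to the standard curve."""
    return std_normal_cdf((b - m) / s) - std_normal_cdf((a - m) / s)

# F(-z, z) = 2F(z) - 1 for z = 1, 2, 3 (the .68 / .95 / .997 values).
for z in (1, 2, 3):
    print(z, round(normal_area(-z, z), 4), round(2 * std_normal_cdf(z) - 1, 4))

# Symmetry: F(-x) = 1 - F(x).
print(std_normal_cdf(-1.3), 1 - std_normal_cdf(1.3))
```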
The Normal Approximation to the Binomial Distribution
-
The binomial(n,p) distribution has mean m = np and standard deviation s =
square root of npq. (Not shown until section 3.3 of the text.)
-
For large enough n and p not too close to 0 or 1, the probability
that the number of successes in n trials will be between a and b (inclusive)
is approximated by the area under the normal curve between a-1/2
and b+1/2. (See figure 4, page 98.)
-
Using scaling, this is given by the formula:
$P(a \text{ to } b \text{ successes}) \approx F\left(\frac{b + 1/2 - m}{s}\right) - F\left(\frac{a - 1/2 - m}{s}\right)$, where m = np and s = $\sqrt{npq}$
-
See Example 1, page 99 for examples of using the normal approximation to
the binomial distribution.
-
We know that
-
P(m - s to m + s successes in n trials) is approximately .68
-
P(m - 2s to m + 2s successes in n trials) is approximately .95
-
P(m - 3s to m + 3s successes in n trials) is approximately .997
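To see how close the approximation is, the sketch below compares the exact binomial probability with the normal approximation using the a - 1/2 and b + 1/2 endpoints. The values n = 100, p = 1/2, a = 45, b = 55 are hypothetical and chosen only for illustration.

```python
from math import comb, erf, sqrt

def std_normal_cdf(z):
    """F(z) for the standard normal curve."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def binomial_exact(a, b, n, p):
    """Exact P(a to b successes in n trials), endpoints included."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, b + 1))

def binomial_normal_approx(a, b, n, p):
    """Normal approximation with mean np, SD sqrt(npq), and the 1/2 correction."""
    m = n * p
    s = sqrt(n * p * (1 - p))
    return std_normal_cdf((b + 0.5 - m) / s) - std_normal_cdf((a - 0.5 - m) / s)

n, p = 100, 0.5   # hypothetical example values
a, b = 45, 55     # one standard deviation (s = 5) either side of the mean
exact = binomial_exact(a, b, n, p)
approx = binomial_normal_approx(a, b, n, p)
print(exact, approx)
```

Note that counting whole numbers of successes from m - s to m + s, with the 1/2 correction, gives a value somewhat above .68 here, because the interval of areas runs from 44.5 to 55.5.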
Application: Working with Confidence Intervals:
Typical Problem:
Conduct a sequence of Bernoulli trials, where you don't know the true
probability p of success.
Estimate the true probability p with p-est = #successes/#trials
Give a confidence interval: a range about your estimate within which
you have a stated level of confidence that the true value of p lies.
The level of confidence is the probability that you are correct
- i.e. that the true value of p lies within the confidence interval you
specified!
Method of finding confidence interval and confidence level:
-
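Choose a value of z so that F(-z,z) = 2F(z) - 1, from the Normal chart,
equals the desired confidence level.
-
Then we can say, with approximately that level of confidence, that the true value of p is within the
range:
$\text{p-est} \pm z\sqrt{pq/n}$, where n is the number of trials.
Since p is unknown, the bound $\sqrt{pq} \le 1/2$ gives the conservative range $\text{p-est} \pm z/(2\sqrt{n})$.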
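The method can be sketched in code as follows. This is an illustration under stated assumptions: it uses the conservative half-width z/(2 sqrt(n)), which follows from sqrt(pq) <= 1/2, and it finds z with `statistics.NormalDist` instead of the Normal chart. The data (540 successes in 1000 trials) and the function name are made up for the example.

```python
from math import sqrt
from statistics import NormalDist

def conservative_interval(successes: int, trials: int, confidence: float):
    """Return (low, high) around p-est = successes/trials."""
    p_est = successes / trials
    # We need 2F(z) - 1 = confidence, i.e. F(z) = (1 + confidence) / 2.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    # Conservative half-width, using sqrt(pq) <= 1/2 for every p.
    half_width = z / (2 * sqrt(trials))
    return p_est - half_width, p_est + half_width

# Hypothetical data: 540 successes in 1000 trials, 95% confidence level.
low, high = conservative_interval(successes=540, trials=1000, confidence=0.95)
print(round(low, 3), round(high, 3))
```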
(2.4) Poisson Approximation
-
Even when n is large, if p is close enough to either 0 or 1 in the Binomial
Distribution, the Normal Approximation can be quite inaccurate. This is
because the standard deviation (which measures the "spread" of the distribution)
is small. When p is close to 0, the number of successes will likely be
very small relative to the number of trials, and its distribution is not symmetric. See the
sample histograms on page 117.
-
A good approximation for the Binomial Distribution, when n is large
and p is either close to 1 or close to 0, is the Poisson Approximation.
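The notes as extracted do not show the formula, so the sketch below assumes the standard form of the Poisson approximation: P(k successes) is approximately e^(-mu) mu^k / k!, with mu = np. When p is close to 1, the same idea is applied to the number of failures, with mu = nq. The values n = 1000 and p = 0.002 are hypothetical.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Exact P(k successes in n Bernoulli(p) trials)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, mu):
    """Poisson(mu) probability of k, used here with mu = n * p."""
    return exp(-mu) * mu**k / factorial(k)

n, p = 1000, 0.002  # hypothetical: many trials, small success probability
mu = n * p

# Compare the exact and approximate probabilities for small k.
for k in range(6):
    print(k, round(binomial_pmf(k, n, p), 5), round(poisson_pmf(k, mu), 5))
```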
(2.5) Random Sampling
Sampling with Replacement:
Model: Population of size N. Choose n elements one at a time at random,
replacing each in the population after it is drawn. Assume each element
is either Good (and the number of good elements is G) or Bad (and the
number of bad elements is B), with B + G = N.
This then is just the model we have already studied (where Good
is success and Bad is failure).
The number of Good elements picked in the sample of size n is represented
by the Binomial Distribution.
Here the probability of success is p = G/N and the probability of failure is
q = B/N -- nothing new here.
Sampling Without Replacement: Population of size N. Choose n elements
one at a time at random, but do not replace them. Assume each element is either
Good (and the number of good elements is G) or Bad (and the number of
bad elements is B), with B + G = N.
The following formulas hold for the probability of g good and b bad elements in a sample of size n = g + b:
-
For Sampling with Replacement (The Binomial Distribution)
$P(g \text{ good and } b \text{ bad}) = \binom{n}{g} \frac{G^g B^b}{N^n}$
-
For Sampling without Replacement (The Hypergeometric Distribution)
Using the rule for probability of equally likely outcomes and counting
principles, we get:
$P(g \text{ good and } b \text{ bad}) = \binom{G}{g}\binom{B}{b} \Big/ \binom{N}{n}$
When N (and consequently B and G) is very large compared with the sample
size n (and consequently b and g), whether or not replacement occurs is
insignificant (i.e. the Binomial Distribution and the Hypergeometric are
very close), so the simpler binomial distribution is used to approximate
the hypergeometric distribution. Also, if appropriate, we may therefore use
the Normal Approximation or the Poisson Approximation as before.
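The effect of population size can be seen in a small numerical comparison. The sketch below holds G/N = 0.4 fixed and compares the two distributions for a small and a large population. All numbers and function names are hypothetical.

```python
from math import comb

def hypergeometric_pmf(g, n, G, N):
    """P(g good in a sample of n drawn without replacement)."""
    B, b = N - G, n - g
    return comb(G, g) * comb(B, b) / comb(N, n)

def binomial_pmf(g, n, G, N):
    """P(g good in a sample of n drawn with replacement), p = G/N."""
    p = G / N
    return comb(n, g) * p**g * (1 - p)**(n - g)

n, g = 5, 2  # hypothetical sample: 2 good among 5 drawn

# Small population: replacement matters.
small = hypergeometric_pmf(g, n, G=8, N=20)
# Large population with the same proportion of good elements.
large = hypergeometric_pmf(g, n, G=8000, N=20000)
with_replacement = binomial_pmf(g, n, G=8, N=20)

print(small, large, with_replacement)
```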
Examples: Exercises from 2.5
Exercise 1. Apply formulas with:
-
N = 50
-
G = 20
-
B = 30
-
g = 4
-
b = 6
-
n = 10
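Plugging the Exercise 1 values into the two sampling formulas can be sketched as follows. The variable names are illustrative; the numbers are those listed above.

```python
from math import comb

N, G, B = 50, 20, 30   # population: good and bad counts
n, g, b = 10, 4, 6     # sample: 4 good and 6 bad

# Without replacement (hypergeometric): C(G,g) C(B,b) / C(N,n).
without_replacement = comb(G, g) * comb(B, b) / comb(N, n)

# With replacement (binomial): C(n,g) G^g B^b / N^n.
with_replacement = comb(n, g) * G**g * B**b / N**n

print(round(without_replacement, 4), round(with_replacement, 4))
```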
Exercise 2. Combines ideas from previous sections.
-
N = 52
-
G = 26
-
B = 26
-
n = 3
a) Use Multiplication Rule for Conditional Probability: (26/52)(26/51)(25/50)
b) Use Hypergeometric formula with g = 1, b = 2
c) P(at least one red) = 1 - P(no reds) = 1 - P(3 blacks) = 1 - (26/52)(25/51)(24/50)
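The three parts can be evaluated exactly with fractions. The exercise statement is not reproduced in these notes, so the sketch below assumes the readings suggested by the products above: (a) one red followed by two blacks in that order, (b) exactly one red among the three cards, (c) at least one red.

```python
from fractions import Fraction
from math import comb

# a) Multiplication rule: red, then black, then black (assumed order).
part_a = Fraction(26, 52) * Fraction(26, 51) * Fraction(25, 50)

# b) Hypergeometric formula with G = B = 26, g = 1, b = 2.
part_b = Fraction(comb(26, 1) * comb(26, 2), comb(52, 3))

# c) Complement rule: 1 - P(three blacks).
part_c = 1 - Fraction(26, 52) * Fraction(25, 51) * Fraction(24, 50)

print(part_a, part_b, part_c)
```

Under these readings, part (b) is three times part (a), since the single red card can fall in any of the three positions.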
Exercise 4. Sampling without replacement
-
N = 100,000 (large compared to n)
-
G = 40,000
-
B = 60,000
-
g = 45
-
b = 55
-
n = 100
Exact expression for probability would be the sum of hypergeometric probabilities
for g varying from 45 to 100.
Approximated well by Binomial Distribution, since N is large -- which is
well approximated by the Normal Distribution (since p = .40 is close to
1/2). Mean is np = 40; Standard Deviation = square root (npq) = square root
(24), approximately 4.9.
We want the "right tail" of the Normal Distribution:
$P(45 \text{ or more good}) \approx 1 - F\left(\frac{44.5 - 40}{\sqrt{24}}\right) \approx 1 - F(.92) \approx .18$
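The three levels of approximation in Exercise 4 can be compared directly, since exact integer arithmetic makes the hypergeometric sum feasible on a computer. This is a sketch using the values listed for the exercise; F is computed from the error function instead of the Normal chart.

```python
from math import comb, erf, sqrt

N, G, n = 100_000, 40_000, 100
B = N - G
p = G / N

def std_normal_cdf(z):
    """F(z) for the standard normal curve."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Exact: sum of hypergeometric probabilities for g = 45, ..., 100.
numerator = sum(comb(G, g) * comb(B, n - g) for g in range(45, n + 1))
exact = numerator / comb(N, n)

# Binomial approximation (sampling treated as if with replacement).
binomial = sum(comb(n, g) * p**g * (1 - p)**(n - g) for g in range(45, n + 1))

# Normal approximation: right tail beyond 44.5, with m = np, s = sqrt(npq).
m = n * p
s = sqrt(n * p * (1 - p))
normal = 1.0 - std_normal_cdf((44.5 - m) / s)

print(exact, binomial, normal)
```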