Distribution of a data set or random variable: The values
and the frequencies (or probabilities) taken on by the data set (or random
variable).
Histograms can be used to represent the distribution:
The height of the bar over each value or range of values represents the relative
frequency of occurrence of that value.
Three interpretations of average:
Mean (Most common meaning of average): mean = (sum of all values) / (number of values).
Median: Middle value (equal numbers of values lie above and below it). If the number of values is even, take the value halfway between the two middle values.
Mode: Most frequently occurring value or group of values.
See Example 1
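The three averages above can be computed with Python's standard `statistics` module. This is a minimal sketch; the quiz scores are made-up values and are not the data from Example 1.

```python
import statistics

# Hypothetical quiz scores (illustrative only; not the data from Example 1).
scores = [1, 6, 7, 7, 8, 9, 9, 9, 10]

mean = statistics.mean(scores)      # (sum of all values) / (number of values)
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequently occurring value

# With an even number of values, the median is halfway between the two middle values.
even_median = statistics.median([1, 2, 3, 4])

print(mean, median, mode, even_median)
```

Note how the single low score of 1 pulls the mean below both the median and the mode.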
Shapes of Distributions: See figures 6.2, 6.3, 6.4, 6.5
Number of Peaks
Symmetric or Skewed:
Symmetric if the left half of the distribution is a mirror image of the right half.
In this case (for a distribution with a single peak) mean = median = mode.
Skewed left: Outliers at lower (left) values -- Mode and median are
greater than mean.
Skewed right: Outliers at upper (right) values -- Mode and median are
less than mean.
Heights of women -- symmetric
Income -- skewed right
Speeds of cars on a road where police use radar -- skewed left
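The effect of skew on the mean and median can be checked on a small right-skewed data set. The incomes below are invented for illustration; they only show that one large outlier pulls the mean above the median.

```python
import statistics

# Hypothetical annual incomes in thousands of dollars (illustrative only).
incomes = [30, 35, 40, 45, 50, 60, 400]

mean_income = statistics.mean(incomes)      # pulled upward by the 400 outlier
median_income = statistics.median(incomes)  # middle value; ignores how large the outlier is

print(mean_income, median_income)
```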
Variation: How widely the values are scattered about the center
Range: Maximum value - minimum value
Can be misleading: See Example 1 -- two different quizzes have similar
ranges, but the range of the first quiz is due to an outlier.
Five Number Summary
Min (Low) -- lowest value
Lower or first quartile (25th percentile): Point where 1/4 of the data values lie at or below.
Middle or second quartile (50th percentile): Median
Upper or third quartile (75th percentile): Point where 3/4 of the data values lie at or below.
Max (High) -- highest value
A box-and-whisker plot can be used to show the five-number summary.
See figure 6.10; Example 2, figure 6.11
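A five-number summary can be sketched with `statistics.quantiles`. Quartile conventions differ slightly between textbooks and software, so the `method="inclusive"` choice here is an assumption and may not match the text's figures exactly. The function name and data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # "inclusive" assumes the data include the population's extreme values;
    # other quartile conventions can give slightly different Q1 and Q3.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

# Hypothetical data set (illustrative only).
summary = five_number_summary([2, 4, 4, 5, 7, 8, 9, 10, 12])
print(summary)
```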
Standard Deviation: A single number used to describe
variation in a distribution.
Formulas for standard deviation
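Population Standard Deviation: Measures variation of an entire population: sqrt((sum of (value - mean)^2) / N), where N is the number of values in the population.
Sample Standard Deviation: Used when only a representative sample of the population is available: sqrt((sum of (value - sample mean)^2) / (n - 1)), where n is the sample size.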
Range Rule of Thumb for Estimating Standard Deviation:
Standard deviation is approximately equal to range/4
Low value is approximately equal to mean - 2*(standard deviation)
High value is approximately equal to mean + 2*(standard deviation)
Works when data values are distributed fairly evenly. Does not work well when the extreme high or low values are outliers.
Example 4: Compares Rule of Thumb calculations to the result of the actual standard deviation formula.
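The comparison between the rule of thumb and the actual formulas can be tried on a small data set. This sketch uses invented values, not the data from Example 4. `statistics.pstdev` divides by the number of values (population formula) and `statistics.stdev` divides by one less than that (sample formula).

```python
import statistics

# Hypothetical data values (illustrative only; not the data from Example 4).
values = [2, 4, 4, 4, 5, 5, 7, 9]

pop_sd = statistics.pstdev(values)    # population standard deviation (divide by N)
sample_sd = statistics.stdev(values)  # sample standard deviation (divide by n - 1)

# Range rule of thumb: standard deviation is roughly range / 4.
rule_of_thumb = (max(values) - min(values)) / 4

print(pop_sd, sample_sd, rule_of_thumb)
```

For these values the rule of thumb gives 1.75, reasonably close to the population standard deviation of 2.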
Normal Distribution: Symmetric, bell-shaped with single peak at
mean (which because of symmetry is also median and mode).
See figure 6.12 -- There is not just one Normal Distribution. They all have
the same bell shape, but may have different means and standard deviations.
General characteristics of a Normal Distribution:
1. Data values are clustered near the mean
2. Data values are evenly spread about the mean (symmetric)
3. Moving away from the mean in either direction, the distribution tapers off to near zero
4. Data that are normally distributed typically result from a combination of many different factors.
Example 2: Scores on a very easy quiz will not be normally
distributed. Shoe sizes of women will be (approximately) normally distributed.
Standard Deviations in Normal Distributions
In a normal distribution
· 68.3% of data values fall within 1 standard deviation of the mean.
· 95.4% of data values fall within 2 standard deviations of the mean.
· 99.7% of data values fall within 3 standard deviations of the mean.
See figure 6.14 (page 383) and Example 3.
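These percentages can be checked numerically with `statistics.NormalDist` from the Python standard library. The helper name `fraction_within` is a hypothetical one chosen for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * fraction_within(k), 1))
```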
Example 4: Detecting counterfeit quarters. A machine rejects quarters more than two standard deviations from the mean weight of legal quarters. Mean = 5.67 grams. Standard deviation = 0.0700 grams. Rejects quarters weighing more than 5.67 + 2*0.0700 = 5.81 grams or less than 5.67 - 2*0.0700 = 5.53 grams. 95.4% of legal quarters will be accepted (4.6% of legal quarters rejected).
Example 5: Auto Prices.
10,000 people. Prices normally distributed with mean of $16,500 and standard deviation of $500.
68% paid within 1 standard deviation of the mean ($16,000 to $17,000). 68% * 10,000 = 6,800 people
32% did not -- of those, 1/2 paid less than $16,000 (by symmetry), so 16% paid less than $16,000, or 1,600 people
99.7% paid within 3 standard deviations of the mean ($15,000 to $18,000), so 0.3% are not within this range. Half of these (0.15%) paid more than $18,000 -- only 15 people.
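The auto price example can also be worked from the normal distribution directly rather than from the rounded 68% and 99.7% figures. A minimal sketch using `statistics.NormalDist`; the exact results differ slightly from the rounded hand calculation (about 1,587 buyers below $16,000 rather than 1,600, and about 13 above $18,000 rather than 15).

```python
from statistics import NormalDist

prices = NormalDist(mu=16500, sigma=500)  # mean and standard deviation from Example 5
buyers = 10_000

within_1_sd = prices.cdf(17000) - prices.cdf(16000)  # fraction paying $16,000 to $17,000
below_16000 = prices.cdf(16000)                      # fraction paying less than $16,000
above_18000 = 1 - prices.cdf(18000)                  # fraction paying more than $18,000

print(round(within_1_sd * buyers), round(below_16000 * buyers), round(above_18000 * buyers))
```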
Standard Scores and Percentiles: Using a Normal table (z-score table).
nth percentile: value for which n% of data values are less than or equal to that value.
Can use the z-table to compute the percentile for a given data value:
1. Calculate z-score: z = (data value - mean)/(standard deviation).
2. Look up percentile in table.
Example 7: Cholesterol Levels. z = (190 - 178)/41, or approximately .29, a z-score corresponding to approximately the 61st percentile.
To determine the data value corresponding to a percentile:
1. Look up percentile in table and get z-score.
2. Set data value = (standard deviation)*(z-score) + mean.
Example 7, cont.: The 90.32nd percentile corresponds to a z-score of 1.3
z-score = (data value - 178)/41 = 1.3
so data value = 1.3*41 + 178, or approximately 231. (Less than 10% have cholesterol higher than 231.)
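Both table lookups can be replaced by `statistics.NormalDist`, whose `cdf` and `inv_cdf` methods play the role of the z-table. The function names below are hypothetical. Because the code does not round the z-score to two decimals first, the percentile for a cholesterol level of 190 comes out near 61.5 rather than exactly the table value.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def percentile_of(value, mean, sd):
    """Percentile (0 to 100) of a data value in a normal distribution."""
    z = (value - mean) / sd
    return 100 * standard_normal.cdf(z)

def value_at(percentile, mean, sd):
    """Data value at a given percentile (0 to 100) of a normal distribution."""
    z = standard_normal.inv_cdf(percentile / 100)
    return sd * z + mean

print(percentile_of(190, 178, 41))  # cholesterol example: a little above 61
print(value_at(90.32, 178, 41))     # close to 231
```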
Goal of most statistical studies is to infer a conclusion about a population
from results for a sample.
Statistical Significance: A set of measurements from a statistical study
is statistically significant if it is unlikely to have occurred by chance.
If the probability of an observed
difference occurring by chance is 0.05 (5% or 1 in 20) or less, the
difference is said to be statistically significant at the 0.05 level.
If the probability of an observed
difference occurring by chance is 0.01 (1% or 1 in 100) or less, the
difference is said to be statistically significant at the 0.01 level.
How do we determine these levels of
statistical significance?
They are based on the Central Limit Theorem. Informally,
this theorem states that the distribution of the means of samples of size n drawn from an
underlying population will be approximately normal (for large n), with mean = the
population mean and standard deviation = (population's standard
deviation)/sqrt(n).
One of the most frequent applications
of this theorem to statistical significance is in opinion polls and
margins of error.
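A small simulation illustrates the theorem. The population here is an assumption chosen for the sketch: an exponential distribution with mean 1 and standard deviation 1, which is strongly skewed. The sample size and number of samples are also arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 50              # size of each sample
num_samples = 2000  # number of samples drawn

# Each entry is the mean of one sample of size n from the skewed population.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

mean_of_means = statistics.mean(sample_means)  # should be near the population mean, 1
sd_of_means = statistics.stdev(sample_means)   # should be near 1 / sqrt(n)

print(mean_of_means, sd_of_means, 1 / math.sqrt(n))
```

Even though the population is skewed, a histogram of `sample_means` would look roughly bell-shaped.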
In opinion polls where respondents have two possible answers (Yes or No, Agree
or Disagree), the underlying population has a binomial
distribution. We can associate 1 with a Yes vote and 0 with a
no vote. Then by computing the sample mean we get the proportion of
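people who said yes. We can show that the 95% confidence
interval for such polls is given by
$$\hat{p} - E < p < \hat{p} + E, \qquad E = 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
where $\hat{p}$ is the sample mean, $p$ is the underlying population mean, $n$ is the sample size, and E is called the
(95%) margin of error. For values of $n$ and $\hat{p}$
such that $n\hat{p}$ and $n(1-\hat{p})$ are both at least 5, we can use the
approximate formula for the margin of error:
$$E \approx \frac{1}{\sqrt{n}}$$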
The underlying population mean has a 95%
probability of being within this confidence interval.
See Example 4 -- Poll Margins:
a) 500 people surveyed. 52% plan to vote for Smith. E =
1.96*sqrt((.52*.48)/500), or approximately 0.044.
Since 500*(.52) = 260 and 500*(.48) = 240 are both
well above 5, we can also use the approximate formula for E:
1/sqrt(500), which is about 0.045.
b) E = 1.96*sqrt((.87*.13)/1500), or approximately .017 (or
1.7%). The rule of thumb estimate 1/sqrt(1500) is approximately
0.026, which is much "rougher".
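The two poll calculations can be reproduced in a few lines. The function names are hypothetical; `p_hat` stands for the sample proportion (the sample mean of the 0/1 answers) and `n` for the number of people surveyed.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """95% margin of error for a sample proportion p_hat from a sample of size n."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def rough_margin(n):
    """Rule-of-thumb margin of error, 1 / sqrt(n)."""
    return 1 / math.sqrt(n)

print(round(margin_of_error(0.52, 500), 3))   # part (a): 0.044
print(round(rough_margin(500), 3))            # part (a), rule of thumb: 0.045
print(round(margin_of_error(0.87, 1500), 3))  # part (b): 0.017
print(round(rough_margin(1500), 3))           # part (b), rule of thumb: 0.026
```

The rule of thumb is close when the proportion is near 50% and overestimates the margin when the proportion is far from 50%, as in part (b).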
We can also use this formula to
determine how large a sample size is needed for a desired margin of error for a
confidence interval of 95%.
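Example: Suppose we are conducting a poll, and would like
a 2% margin of error. We need
$$E = 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
to be less than .02, where $\hat{p}$ is the sample mean and $n$ is the sample size.
It can be shown using simple calculus of
derivatives that
$$\hat{p}(1-\hat{p}) \le 0.25$$
Using this in the formula for E, we need
$$1.96\sqrt{\frac{0.25}{n}} = \frac{0.98}{\sqrt{n}} < .02, \quad\text{so}\quad n > \left(\frac{0.98}{.02}\right)^2 = 2401$$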
We can generalize this to the
following: To get a 95% confidence interval with a margin of error
of +/- E (E as a decimal), sample size n should be at least 0.9604/E^2.
Also, if the sample size is n, the margin of error will never be more
than 0.98/sqrt(n).
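The general rule translates directly into code. This is a sketch with hypothetical function names; it uses the worst-case value 0.25 for (sample proportion)*(1 - sample proportion), which is where the constants 0.9604 = (1.96/2)^2 and 0.98 = 1.96/2 come from.

```python
import math

def sample_size_for(margin, z=1.96):
    """Smallest whole sample size meeting n >= (z/2)^2 / margin^2."""
    required = (z / 2) ** 2 / margin ** 2  # 0.9604 / E^2 when z = 1.96
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(required, 6))

def max_margin(n, z=1.96):
    """Largest possible 95% margin of error for a sample of size n."""
    return (z / 2) / math.sqrt(n)          # 0.98 / sqrt(n) when z = 1.96

print(sample_size_for(0.02))  # the 2% example above
print(sample_size_for(0.04))
print(max_margin(2401))
```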
Example: A USA Today poll after the liberation of Kuwait
reported that 91% of the respondents approved of George Bush's performance as
president, with a margin of error of 4%. What was the sample size
used in the poll?
Using our formula above,
sample size n should be at least 0.9604/E^2 = 0.9604/(.04)^2, or 600.25 people (so at least 601).
We can get a more precise answer by
using E = .04 and $\hat{p}$ = .91 (the sample mean) in our equation:
$$.04 = 1.96\sqrt{\frac{.91(1-.91)}{n}}$$
and solving for n we get n = 196.64 approximately (at least 197).
The reason for a smaller requisite n is
the high value of the sample mean, .91. The general formula uses the largest
possible value of $\hat{p}(1-\hat{p})$, 0.25, while here 0.91*(1-0.91) = .0819 is
much smaller.
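Solving the margin-of-error equation for n gives n = 1.96^2 * (sample mean)*(1 - sample mean) / E^2. A short sketch of that rearrangement, with a hypothetical function name, applied to the poll figures above:

```python
import math

def sample_size_given_p(p_hat, margin, z=1.96):
    """Solve margin = z * sqrt(p_hat * (1 - p_hat) / n) for n."""
    return z ** 2 * p_hat * (1 - p_hat) / margin ** 2

n_exact = sample_size_given_p(0.91, 0.04)  # 91% approval, 4% margin of error
n_needed = math.ceil(n_exact)              # round up to a whole number of people

print(n_exact, n_needed)
```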
Example: Testing the Gender Choice Product.
The product's makers claim that using it significantly increases the chance
of giving birth to a girl.
Testing method: Choose a random sample of 100 babies born to women
who used the Gender Choice product.
Null Hypothesis: The claim that Gender Choice does not work, and
therefore the proportion of baby girls born should be close to the expected
proportion of girls born in the general population (50%).
Alternative Hypothesis: The claim that Gender Choice does work
and the proportion of girls is greater than 50%.
Definition of Null and Alternative Hypotheses:
Null Hypothesis: Claims a specific value for a population
parameter (such as the mean of the underlying population).
Alternative Hypothesis: The claim that is accepted if the null
hypothesis is false. (I.e., its negation).
Form of the null hypothesis and the corresponding alternative hypothesis: one of three forms.

| Null Hypothesis | Alternative Hypothesis |
|---|---|
| 1. Population parameter = claimed value. | 1. Population parameter is not equal to claimed value. |
| 2. Population parameter >= claimed value. | 2. Population parameter < claimed value. |
| 3. Population parameter <= claimed value. | 3. Population parameter > claimed value. |
Steps for Hypothesis Testing: Testing for the Mean with a large sample (n >= 30).
1. For the claim being tested, state the Null Hypothesis H0 and
Alternative Hypothesis Ha.
2. Specify the level of
significance, a (alpha). This is the maximum allowable probability of
making the error of rejecting the null hypothesis when it is actually true.
3. Determine the critical value z0 and rejection regions.
If a left-tailed hypothesis test (Case 2 above), find the z-score in the z-table corresponding to percentage a.
If a right-tailed hypothesis test (Case 3 above), find the z-score corresponding to percentage 1-a.
If a two-tailed hypothesis test (Case 1 above), find the two z-scores corresponding to percentages a/2 and 1-a/2.
4. Find the standardized z-score for your sample mean (and, if desired, the corresponding P-value).
z-score = (sample mean - population mean) / (sample standard deviation/sqrt(n)), where n is the sample size and the population mean is the value claimed in the null hypothesis.
Corresponding P-value:
If a left-tailed hypothesis test (Case 2 above), P-value = area in the left tail (to the left of the z-score), from the table.
If a right-tailed hypothesis test (Case 3 above), P-value = area in the right tail (to the right of the z-score), from the table.
If a two-tailed hypothesis test (Case 1 above), P-value = 2 * (area in the tail beyond the z-score).
5. Determine whether to reject or fail to reject the null hypothesis.
If the standardized z-score is in the rejection region, reject the null hypothesis.
(Or, equivalently, reject if its corresponding P-value is less than or equal to a.)
If the standardized z-score is NOT in the rejection region, fail to reject the
null hypothesis.
6. Interpret the decision in the context of the original claim.
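The steps above can be sketched as a single function for a large-sample test of a mean. The function name, its parameters, and the numbers in the usage lines are all hypothetical (they do not come from the text); `statistics.NormalDist` stands in for the z-table, and the decision uses the P-value form of step 5.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, claimed_mean, sample_sd, n, alpha=0.05, tail="two"):
    """Large-sample z-test for a mean; returns (z, p_value, reject_null)."""
    # Step 4: standardized z-score of the sample mean.
    z = (sample_mean - claimed_mean) / (sample_sd / math.sqrt(n))
    standard_normal = NormalDist()
    if tail == "left":     # Case 2: Ha says parameter < claimed value
        p_value = standard_normal.cdf(z)
    elif tail == "right":  # Case 3: Ha says parameter > claimed value
        p_value = 1 - standard_normal.cdf(z)
    elif tail == "two":    # Case 1: Ha says parameter is not equal to claimed value
        p_value = 2 * (1 - standard_normal.cdf(abs(z)))
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    # Step 5: reject the null hypothesis when the P-value is at most alpha.
    return z, p_value, p_value <= alpha

# Hypothetical study: claimed mean 50, sample of 64 with mean 52.3 and standard deviation 8.
z, p_value, reject_05 = z_test(52.3, 50, 8, 64, alpha=0.05, tail="right")
_, _, reject_01 = z_test(52.3, 50, 8, 64, alpha=0.01, tail="right")
print(z, p_value, reject_05, reject_01)
```

In this made-up case the result is significant at the 0.05 level but not at the 0.01 level, which is why the level of significance is fixed in step 2 before looking at the data.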