Chapter 6 Putting Statistics to Work

6A Fundamentals of Statistics

6B  Measures of Variation

6C  The Normal Distribution

6D  Statistical Inference

6A Fundamentals of Statistics


Distribution of a data set or random variable:  The values and the frequencies (or probabilities) taken on by the data set (or random variable).
Histograms can be used to represent the distribution:  The height of the bar over each value or range of values represents the relative frequency of occurrence of that value.

Three interpretations of average:

Mean (Most common meaning of average):  mean = (sum of all values) / (number of values).
Median:  Middle value (an equal number of values lie above and below it).  If the number of values is even, take the value halfway between the two middle values.
Mode:  Most frequently occurring value or group of values.
See Example 1
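
The three averages can be checked on a small data set.  A minimal sketch using Python's standard statistics module; the data values are hypothetical and are not taken from Example 1.

```python
from statistics import mean, median, multimode

# Hypothetical data set (not from the text), chosen to include a repeated value.
values = [2, 3, 3, 5, 7, 10]

avg = mean(values)         # (sum of all values) / (number of values)
mid = median(values)       # even count: halfway between the two middle values
modes = multimode(values)  # list of the most frequently occurring value(s)

print("mean:", avg)
print("median:", mid)
print("mode(s):", modes)
```

multimode returns a list because a data set can have more than one most frequent value.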

Shapes of Distributions: See figures 6.2, 6.3, 6.4, 6.5
 

Number of Peaks
Symmetric or Skewed:

Symmetric if left half of distribution is mirror image of right half. In this case (with a single peak) mean = median = mode.
Skewed left:  Outliers at lower (left) values -- Mode and median are greater than mean.
Skewed right:  Outliers at upper (right) values -- Mode and median are less than mean.

Heights of women -- symmetric
Income -- skewed right
Speeds of cars on a road where police are using radar -- skewed left
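
The effect of outliers on the mean and the median can be seen numerically.  A small sketch with made-up income figures (in thousands of dollars); the second list is simply the mirror image of the first, so its outlier sits at the low end.

```python
from statistics import mean, median

# Hypothetical incomes in thousands of dollars; one very high value is an outlier.
incomes = [30, 32, 35, 38, 40, 45, 200]

# Mirror image of the same data: the outlier now sits at the low end.
mirrored = [230 - x for x in incomes]

right_mean, right_median = mean(incomes), median(incomes)
left_mean, left_median = mean(mirrored), median(mirrored)

print(right_mean, right_median)  # skewed right: mean is pulled above the median
print(left_mean, left_median)    # skewed left: mean is pulled below the median
```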

Variation:  How widely the values are scattered about the center.

6B  Measures of Variation

Range:  Maximum value - minimum value
Can be misleading:  See example 1 -- Two different quizzes have similar ranges, but in the first quiz the range is due to a single outlier.

Five Number Summary

Min (Low)  -- lowest value
Lower or first quartile (25th percentile):  Value at or below which 1/4 of the data values lie.
Middle or second quartile (50th percentile):  Median
Upper or third quartile (75th percentile):  Value at or below which 3/4 of the data values lie.
Max (High) -- highest value

A box-and-whisker plot can be used to show the five-number summary.
See figure 6.10; Example 2, figure 6.11
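
A sketch of how a five-number summary might be computed.  Several quartile conventions are in use; this one assumes the "median of each half" convention (the lower quartile is the median of the lower half of the sorted data, and the middle value is left out of both halves when the count is odd).  The function name and the data are hypothetical.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    s = sorted(data)
    n = len(s)
    half = n // 2
    lower = s[:half]          # values below the middle position
    upper = s[half + n % 2:]  # values above the middle position
    return (s[0], median(lower), median(s), median(upper), s[-1])

scores = [1, 3, 4, 6, 7, 9, 10, 12, 15]  # hypothetical data
summary = five_number_summary(scores)
print(summary)
```

Other conventions (for example, the interpolation used by statistics.quantiles) can give slightly different quartiles for the same data.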

Standard Deviation:  A single number used to describe variation in a distribution.

Formulas for standard deviation
 

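Population Standard Deviation:  Measures variation of entire population.
standard deviation = sqrt( (sum of (value - mean)^2) / N ), where N is the number of values in the population.

Sample Standard Deviation:  Used when we have only a representative sample of the population.
standard deviation = sqrt( (sum of (value - sample mean)^2) / (n - 1) ), where n is the number of values in the sample.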
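
The two versions differ only in the divisor.  A minimal sketch, on hypothetical data, that computes both by hand and compares them with pstdev and stdev from Python's statistics module.

```python
from math import sqrt
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

m = sum(data) / len(data)                     # mean
squared_devs = sum((x - m) ** 2 for x in data)  # sum of squared deviations from the mean

population_sd = sqrt(squared_devs / len(data))     # divide by N
sample_sd = sqrt(squared_devs / (len(data) - 1))   # divide by n - 1

print(population_sd, pstdev(data))  # the two population values should agree
print(sample_sd, stdev(data))       # the two sample values should agree
```

The sample version is always a little larger, and the difference shrinks as the number of values grows.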

Range Rule of Thumb for Estimating Standard Deviation:

Standard deviation is approximately equal to range/4
Low Value is approximately equal to mean - 2* (standard deviation)
High Value is approximately equal to mean + 2* (standard deviation)

Works when data values are distributed fairly evenly.  Does not work well when extreme high or low values are outliers.

Example 4 :  Compares Rule of Thumb calculations to actual standard deviation formula result.
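
A comparison of this kind can be sketched in a few lines.  The data here are hypothetical (they are not the values from Example 4), and range_rule_estimate is an illustrative helper name.

```python
from statistics import stdev

def range_rule_estimate(data):
    """Range rule of thumb: standard deviation is roughly range / 4."""
    return (max(data) - min(data)) / 4

quiz = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores without extreme outliers

estimate = range_rule_estimate(quiz)
actual = stdev(quiz)  # sample standard deviation from the formula

print("rule of thumb:", estimate)
print("formula:", round(actual, 3))
```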
 
 

6C  The Normal Distribution


Normal Distribution:  Symmetric, bell-shaped with single peak at mean (which because of symmetry is also median and mode).

See figure 6.12 -- There is not just one Normal Distribution.  All have the same bell shape, but they may have different means and standard deviations.

General characteristics for a Normal Distribution:

1.  Clustered near the mean
2.  Evenly spread about mean (symmetric)
3.  Moving away from the mean in either direction, the distribution tapers off to near zero
4.  Data that are normally distributed typically result from a combination of many different factors.


Example 2:  Scores on a very easy quiz will not be normally distributed.  Shoe sizes of women will be approximately normally distributed.

Standard Deviations in Normal Distributions
In a normal distribution

·  68.3% of data values fall within 1 standard deviation of the mean.

·  95.4% of data values fall within 2 standard deviations of the mean.

·  99.7% of data values fall within 3 standard deviations of the mean.
See figure 6.14 and Example 3 (page 383).
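
These percentages can be reproduced from the standard normal distribution.  A short sketch using NormalDist from Python's statistics module (Python 3.8 or later); fraction_within is an illustrative helper name.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(100 * fraction_within(k), 1))
```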

Example 4:  Detecting counterfeit quarters.  Machine rejects quarters more than two standard deviations from the mean weight of legal quarters.  Mean = 5.67 grams.  Standard deviation = .0700 grams.  Rejects quarters > 5.67 + 2*.0700 = 5.81 grams or < 5.67 - 2*.0700 = 5.53 grams.  95.4% of legal quarters will be accepted (4.6% of legal quarters rejected).

Example 5:  Auto Prices.
10,000 people.  Prices normally distributed with mean of $16500 and standard deviation of $500.
68% paid within 1 standard deviation of the mean ($16000 to $17000).  68% * 10000 = 6,800 people
32% did not -- of those, 1/2 paid less than $16000 (by symmetry), so 16% paid less than $16000, or 1,600 people
99.7% paid within 3 standard deviations of the mean ($15000 to $18000), so 0.3% are not within this range.  Half of these (0.15%) paid more than $18000 -- only 15 people.
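
The counts in Example 5 come from the rounded 68% and 99.7% figures.  As a rough check, the same quantities can be computed from the normal distribution directly; the results differ slightly from 6,800, 1,600 and 15 because no rounding of the percentages is involved.  The variable names are illustrative.

```python
from statistics import NormalDist

prices = NormalDist(mu=16500, sigma=500)  # mean and standard deviation from Example 5
buyers = 10_000

within_1sd = prices.cdf(17000) - prices.cdf(16000)  # between $16000 and $17000
below_16000 = prices.cdf(16000)                     # less than $16000
above_18000 = 1 - prices.cdf(18000)                 # more than $18000

count_within = round(buyers * within_1sd)
count_below = round(buyers * below_16000)
count_above = round(buyers * above_18000)

print(count_within, count_below, count_above)
```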


Standard Scores and Percentiles: Using a Normal table (z-score table).
 

nth percentile:  value for which n% of data values are less than or equal to that value.

Can use the z-table to compute percentile for given data value
1.  Calculate z-score:   z = (data value - mean)/(standard deviation).
2.  Look up percentile in table

Example 7:  Cholesterol Levels.  z = (190 - 178)/41, or approximately .29, a z-score corresponding to approximately the 61st percentile.

To determine data value corresponding to percentile:
1.   Look up percentile in table and get z-score.
2.  Set data value = (Standard Deviation)* z-score + mean

Example  7, cont.:   90.32% corresponds to z-score of 1.3
z-score = (data value - 178)/41 = 1.3
so data value = 1.3*41 + 178, or approximately 231.  (Less than 10% have cholesterol higher than 231.)
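
Both directions of the table lookup in Example 7 can be sketched with NormalDist, which plays the role of the z-table here: cdf turns a z-score into a percentile, and inv_cdf turns a percentile back into a z-score.  The variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()
mean_level, sd_level = 178, 41  # cholesterol mean and standard deviation from Example 7

# Data value -> percentile
z = (190 - mean_level) / sd_level
percentile = 100 * std_normal.cdf(z)

# Percentile -> data value
z_for_9032 = std_normal.inv_cdf(0.9032)
data_value = sd_level * z_for_9032 + mean_level

print(round(z, 2), round(percentile, 1))
print(round(z_for_9032, 2), round(data_value))
```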

6D  Statistical Inference


Goal of most statistical studies is to infer a conclusion about a population from results for a sample.

Statistical Significance:  A set of measurements from a statistical study is statistically significant if it is unlikely to have occurred by chance.

If the probability of an observed difference occurring by chance is 0.05  (5% or 1 in 20) or less, the difference is said to be statistically significant at the 0.05 level.

If the probability of an observed difference occurring by chance is 0.01  (1% or 1 in 100) or less, the difference is said to be statistically significant at the 0.01 level.
 

How do we determine these levels of statistical significance?
They are based on the Central Limit Theorem.  Informally, this theorem states that for sufficiently large n, the distribution of the means of samples of size n drawn from an underlying population will be approximately normal, with mean = the population mean and standard deviation = (population's standard deviation)/sqrt(n).
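
A small simulation can illustrate the theorem.  This is only a sketch under chosen assumptions: the population is taken to be uniform on the interval from 0 to 1 (mean 0.5, standard deviation sqrt(1/12)), and the sample size and number of trials are arbitrary.

```python
import random
from math import sqrt
from statistics import pstdev

random.seed(0)  # fixed seed so the run is repeatable

n = 30         # size of each sample
trials = 5000  # number of samples drawn

# Assumed population: uniform on [0, 1), with mean 0.5 and standard deviation sqrt(1/12).
population_mean = 0.5
population_sd = sqrt(1 / 12)

sample_means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]

observed_mean = sum(sample_means) / trials
observed_sd = pstdev(sample_means)
predicted_sd = population_sd / sqrt(n)  # what the theorem predicts

print(observed_mean, population_mean)
print(observed_sd, predicted_sd)
```

The observed mean and standard deviation of the sample means should land close to the predicted values, even though the underlying population is not normal.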

One of the most frequent applications of this theorem to statistical significance is in opinion polls and margins of error.
In opinion polls where respondents have two possible answers (Yes or No, Agree or Disagree), the underlying population has a binomial distribution.  We can associate 1 with a Yes vote and 0 with a No vote.  Then the sample mean is the proportion of people who said Yes.  We can show that the 95% confidence interval for such polls is given by

p - E  <  population mean  <  p + E,   where   E = 1.96*sqrt( p*(1 - p)/n ),

p is the sample mean, n is the sample size, and E is called the (95%) margin of error.  For values of p such that n*p and n*(1 - p) are both at least 5, we can use the approximate formula for the margin of error:

E is approximately 1/sqrt(n)

We can be 95% confident that the underlying population mean lies within this confidence interval:  about 95% of intervals constructed this way contain it.
 

See Example 4 --  Poll Margins:
a) 500 people surveyed.  52% plan to vote for Smith.  E = 1.96*sqrt((.52*.48)/500), or approximately 0.044 (4.4%).
    Since 500*(.52) = 260 and 500*(.48) = 240 are both well above 5, we can also use the approximate formula for E:
1/sqrt(500), which is about 0.045.
b) 1500 people surveyed, with a sample proportion of 87%.  E = 1.96*sqrt((.87*.13)/1500), or approximately .017 (1.7%).  The rule of thumb estimate 1/sqrt(1500) is approximately 0.026, which is much "rougher".
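
The two calculations in this example can be reproduced as follows.  A minimal sketch; margin_of_error and rough_margin are illustrative names, and 1.96 is the z-score for a 95% confidence interval.

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a sample proportion p from a sample of size n."""
    return z * sqrt(p * (1 - p) / n)

def rough_margin(n):
    """Rule of thumb estimate: 1/sqrt(n)."""
    return 1 / sqrt(n)

for p, n in ((0.52, 500), (0.87, 1500)):
    print(n, round(margin_of_error(p, n), 3), round(rough_margin(n), 3))
```

The rule of thumb is closest when p is near 0.5 and overestimates the margin when p is far from 0.5, as in part b.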

We can also use this formula to determine how large a sample size is needed for a desired margin of error for a confidence interval of 95%.
 

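Example:  Suppose we are conducting a poll, and would like a 2% margin of error.   We need
E = 1.96*sqrt( p*(1 - p)/n ) to be at most .02, where p is the sample mean and n is the sample size.

It can be shown using simple calculus of derivatives that p*(1 - p) <= 0.25 for every p between 0 and 1 (the largest value occurs at p = 0.5).

Using this in the formula for E, it is enough to have 1.96*sqrt(0.25/n) = 0.98/sqrt(n) <= .02, that is, n >= (0.98/.02)^2 = 2401.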

We can generalize this to the following:  To get a 95% confidence interval with a margin of error of +/- E (E as a decimal), the sample size n should be at least 0.9604/E^2.  Also, if the sample size is n, the margin of error will never exceed 0.98/sqrt(n).

Example:  A USA Today poll after the liberation of Kuwait reported that 91% of the respondents approved of George Bush's performance as president, with a margin of error of 4%.   What was the sample size used in the poll?

Using our formula above, the sample size n should be at least 0.9604/E^2 = 0.9604/(.04)^2 = 600.25, so at least 601 people.
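
The general sample-size bound can be sketched as a small function.  Exact fractions are used here instead of floating point so that a result that is a whole number (such as 2401) is not pushed up by rounding error before taking the ceiling; the function name is illustrative.

```python
from fractions import Fraction
from math import ceil

Z = Fraction("1.96")  # z-score for a 95% confidence interval

def conservative_sample_size(E):
    """Smallest whole n with 0.9604 / E^2 <= n, for margin of error E given as a decimal."""
    E = Fraction(str(E))
    return ceil((Z / 2) ** 2 / E ** 2)

print(conservative_sample_size(0.02))  # 2% margin of error
print(conservative_sample_size(0.04))  # 4% margin of error
```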

We can get a more precise answer by using E = .04 and sample mean p = .91 in our equation:

.04 = 1.96*sqrt( (.91*.09)/n )

and solving for n we get n = (1.96)^2*(.91*.09)/(.04)^2, or approximately 196.64 (at least 197).

The reason for the smaller requisite n is the high value of the sample mean, .91.  While p*(1 - p) can be as large as 0.25, we see that 0.91*(1 - 0.91) = .0819 is much smaller than this largest possible value.
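
Solving the margin-of-error equation for n when the sample mean is known can be sketched the same way.  Exact fractions avoid floating-point surprises when rounding up; sample_size_for is an illustrative name.

```python
from fractions import Fraction
from math import ceil

Z = Fraction("1.96")  # z-score for a 95% confidence interval

def sample_size_for(p, E):
    """Solve E = 1.96*sqrt(p*(1 - p)/n) for n; return (exact n, n rounded up)."""
    p, E = Fraction(str(p)), Fraction(str(E))
    n_exact = Z ** 2 * p * (1 - p) / E ** 2
    return float(n_exact), ceil(n_exact)

n_exact, n_needed = sample_size_for(0.91, 0.04)
print(n_exact, n_needed)
```

With p = 0.5 the same function gives the conservative bound, since p*(1 - p) is then 0.25.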
 
 

Hypothesis Testing

Example:  Testing the Gender Choice Product.  The product's makers claim that using it significantly increases the chance of giving birth to a girl.
Testing method:  Choose a random sample of 100 babies born to women who used the Gender Choice product.
Null Hypothesis:  The claim that Gender Choice does not work, and therefore the proportion of baby girls born should be close to the expected proportion of girls born in the general population (50%).
Alternative Hypothesis:  The claim that Gender Choice does work and the proportion of girls is greater than 50%.
 

Definition of Null and Alternative Hypotheses:

Null Hypothesis:  Claims a specific value for a population parameter (such as the mean of the underlying population).
Alternative Hypothesis:  The claim that is accepted if the null hypothesis is false.  (I.e., its negation).
Forms of the null hypothesis and the corresponding alternative hypothesis (one of three):

1.  Null Hypothesis:  Population parameter = claimed value.  Alternative Hypothesis:  Population parameter is not equal to claimed value.
2.  Null Hypothesis:  Population parameter >= claimed value.  Alternative Hypothesis:  Population parameter < claimed value.
3.  Null Hypothesis:  Population parameter <= claimed value.  Alternative Hypothesis:  Population parameter > claimed value.

Steps for Hypothesis Testing:  Testing for the Mean with a large sample (n >= 30).
1.  For the claim being tested, state the Null H0 and Alternative Hypothesis Ha.

2.  Specify the level of significance, alpha.   This is the maximum allowable probability of making the error of rejecting the null hypothesis when it is actually true.

3.  Determine the critical value z0 and rejection regions.
            If a left-tailed hypothesis test (Case 2 above), find the z-score in the table corresponding to percentage alpha.
            If a right-tailed hypothesis test (Case 3 above), find the z-score corresponding to percentage 1 - alpha.
            If a two-tailed hypothesis test (Case 1 above), find the two z-scores corresponding to percentages alpha/2 and 1 - alpha/2.

4.  Find the standardized z score for your sample mean  (and if desired,  corresponding P-value)

z-score = (sample mean - population mean) / (sample standard deviation/sqrt(n)), where n is the sample size.

Corresponding P-value:
If a left-tailed hypothesis test (Case 2 above), P-value = percentage corresponding to the area in the left tail (to the left of the z-score).
If a right-tailed hypothesis test (Case 3 above), P-value = percentage corresponding to the area in the right tail (to the right of the z-score).
If a two-tailed hypothesis test (Case 1 above), P-value = percentage corresponding to 2 * (area in the tail beyond the z-score).

5.  Determine whether to reject or fail to reject the null hypothesis.

If the standardized z-score is in the rejection region, reject the null hypothesis.
   (Or, equivalently, reject if its corresponding P-value is less than or equal to alpha.)
If the standardized z-score is NOT in the rejection region, fail to reject the null hypothesis.

6.  Interpret decision in context of original claim.
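
Steps 3 through 5 can be sketched in code, with NormalDist standing in for the z-table.  This is only an illustration under stated assumptions: the function names, the tail labels "left", "right" and "two", and the sample figures are all hypothetical and do not come from the text.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def critical_values(alpha, tail):
    """Step 3: z-score(s) bounding the rejection region."""
    if tail == "left":
        return (std_normal.inv_cdf(alpha),)
    if tail == "right":
        return (std_normal.inv_cdf(1 - alpha),)
    if tail == "two":
        return (std_normal.inv_cdf(alpha / 2), std_normal.inv_cdf(1 - alpha / 2))
    raise ValueError("tail must be 'left', 'right' or 'two'")

def z_test(sample_mean, claimed_mean, sample_sd, n, alpha, tail):
    """Steps 4 and 5: standardized z-score, P-value, and the reject decision."""
    z = (sample_mean - claimed_mean) / (sample_sd / sqrt(n))
    if tail == "left":
        p_value = std_normal.cdf(z)
    elif tail == "right":
        p_value = 1 - std_normal.cdf(z)
    elif tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
    else:
        raise ValueError("tail must be 'left', 'right' or 'two'")
    return z, p_value, p_value <= alpha  # reject when P-value <= alpha

# Hypothetical right-tailed test: null hypothesis "mean <= 100", sample of 36.
z, p_value, reject = z_test(103, 100, 12, 36, alpha=0.05, tail="right")
print(critical_values(0.05, "right"), z, round(p_value, 4), reject)
```

In this made-up case the z-score falls short of the critical value, so the null hypothesis is not rejected at the 0.05 level.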