Tests of Significance


*****

In the following exercises we use the notation:
H0 = Null Hypothesis.
This is the statement about the box that is assumed to be true. It expresses the idea that an observed difference is due only to chance.
H1 = Alternative Hypothesis.
This is also a statement about the box. It expresses the idea that the observed difference is real and not just due to sampling variation.
z = Test statistic for the z-test.
It measures the difference between the data and what is expected under the NULL hypothesis. It has the general form:
z = (observed - expected)/SE
P = The observed significance level.
Also known as P-value. It is the chance of getting a test statistic as extreme as, or more extreme than, the observed one when the NULL hypothesis is assumed to be true. Small P-values are evidence against the NULL but they are NOT the probability that the NULL is true.
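
As an aside (the helper below is our own sketch, not part of the original exercises): instead of reading areas off the normal table, Maple can compute them directly with the built-in erf function, since the area under the normal curve to the right of z is (1 - erf(z/sqrt(2)))/2. We will use this helper below to double-check some of the table lookups.

> # right-tail area (in percent) beyond a given z, via Maple's built-in erf
> tail := z -> (1 - erf(z/sqrt(2)))/2 * 100:
> evalf( tail(1.0), 4);   # about 15.87% of the area lies beyond z = 1

                                   15.87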


Problem1:

According to one investigator's model the data are like 400 draws made at random from a large box. The null hypothesis says that the average of the box equals 50; the alternative says that the average of the box is more than 50. In fact, the data averaged out to 52.7 with an SD of 25. Compute z and P. What do you conclude?

Solution:


> observed_average := 52.7:
> expected_average := 50.:
> SE := 25./sqrt(400.):
> z := (observed_average - expected_average)/SE;

                               z := 2.160000000

The P-value is then the area under the normal curve to the right of 2.16: the chance that the test statistic comes out as extreme as, or more extreme than, the observed 2.16 when the null hypothesis is true. From the normal table we have (the Area column gives the percentage of area between -z and z, so the right-tail area is (100 - Area)/2):

  z    Height  Area     z    Height  Area     z    Height  Area 
___________________    __________________    ___________________
 0.00  39.89   0.00    1.50  12.95  86.64    3.00  0.443  99.730
 0.05  39.84   3.99    1.55  12.00  87.89    3.05  0.381  99.771
	.....
 0.50  35.21  38.29    2.00   5.40  95.45    3.50  0.087  99.953
 0.55  34.29  41.77    2.05   4.88  95.96    3.55  0.073  99.961
 0.60  33.32  45.15    2.10   4.40  96.43    3.60  0.061  99.968
 0.65  32.30  48.43    2.15   3.96  96.84    3.65  0.051  99.974
 0.70  31.23  51.61    2.20   3.55  97.22    3.70  0.042  99.978

> P := (100 - 96.84)/2;

                               P := 1.580000000

i.e., the P-value is about 1.6%, between 1% and 2%. This is a small chance, so we reject the NULL in favor of the alternative that the average of the box is more than 50.
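
As a check with the erf-based helper sketched at the top of these notes: entering at the exact z of 2.16, instead of the 2.15 row of the table, gives nearly the same P-value.

> evalf( tail(2.16), 3);

                                    1.54

i.e. about 1.5%, in good agreement with the table.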


Problem2:

A coin is tossed 10,000 times, and it lands heads 5,167 times. Is the chance of heads equal to 50%? Or are there too many heads for that?
  1. Formulate H0 and H1 in terms of a box model.
  2. Compute z and P.
  3. What do you conclude?

Solution:


The tosses are like draws made at random from a box of tickets with zeroes and ones.

> H0 := `Box contains an equal number of zeroes and ones`:
> H1 := `Box contains more ones than zeroes`:

Another way of saying the same thing: H0 = the proportion of ones in the box is 0.5; H1 = the proportion of ones in the box is more than 0.5.

> observed_proportion := 5167./10000:
> expected_proportion := 5000./10000.:
> SE := sqrt(0.5*0.5)/sqrt(10000.):
> z := (observed_proportion - expected_proportion)/SE;

                               z := 3.340000000

z > 3 is a large value in standard units and we don't even need to look at the table to conclude that P < half of 1%, since we know that more than 99% of the area under the normal curve (in fact about 99.7%) lies between -3 and 3. If we want a more exact P-value we need to look at the table again.

  z    Height  Area     z    Height  Area     z    Height  Area 
___________________    __________________    ___________________
 
          ..............

 0.25  38.67  19.74    1.75   8.63  91.99    3.25  0.203  99.885
 0.30  38.14  23.58    1.80   7.90  92.81    3.30  0.172  99.903
 0.35  37.52  27.37    1.85   7.21  93.57    3.35  0.146  99.919
 0.40  36.83  31.08    1.90   6.56  94.26    3.40  0.123  99.933
 0.45  36.05  34.73    1.95   5.96  94.88    3.45  0.104  99.944

> P := (100 - 99.919)/2;

                               P := .04050000000

The exact P-value is then 0.0405%, or about 4 in 10,000. This is a very small chance, so there is strong evidence against the null: the coin is biased towards heads.
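
Checking once more with the erf-based helper from the top of these notes: the exact z of 3.34 gives nearly the same answer as the 3.35 row of the table.

> evalf( tail(3.34), 3);

                                   .0419

about 0.042%, confirming the table lookup.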

The t-Test

This is just like the z-test but is used when the sample size is small. For very small samples (less than 20 observations or so) the SD of the sample underestimates the SD of the box and needs to be inflated a little. Moreover, for small samples the SD of the sample changes considerably from sample to sample, and this adds a new source of randomness to the z-statistic. Hence, when the tickets in the box follow the normal curve, the SD of the box is not known, and the number of draws is less than 20, we use the t-test instead of the z-test. Mathematically these considerations boil down to
  1. Use SDplus = SD*sqrt(n/(n-1)) instead of SD.
  2. Look for P-values under the t-table with (n-1) degrees of freedom instead of using the normal table.
where n denotes the number of observations in the sample. Here is an example.

Problem3:

Three measurements (in inches) of the length of a desk produce:
50.8, 50.9, 50.7
The store selling the desk claims that its desks are 50 inches long. What do you say? Is this desk longer than 50 inches?

Solution:


If we assume the Gauss model for measurements we can think of the observations as 3 draws made at random from a box of tickets that follow the normal curve. The SD in the box is unknown and the number of draws is much less than 20 (i.e. only 3) so we need to use the t-test. First we compute the observed average and SD.

> average := (50.8+50.9+50.7)/3.;

                            average := 50.80000000
> SD := sqrt( ((50.8-average)^2+(50.9-average)^2+(50.7-average)^2)/3.);
                              SD := .08164965809

What we need is to modify SD by multiplying it by the factor sqrt(n/(n-1)) = sqrt(3/2):

> SDplus := SD*sqrt(3./2.);

                            SDplus := .09999999996
> t := (average - 50)/(SDplus/sqrt(3.));
                               t := 13.85640647

This is an extremely large value for a z-statistic but not so extreme for a t-statistic. We need to look at the t-table with (3-1)=2 degrees of freedom.
         Degrees            (one-tail areas)
	 of freedom	10%	5%	1%
	______________________________________
	
	   1		3.08	6.31	31.82 
	   2		1.89	2.92	6.96
	   3		1.64	2.35	4.54
	   4		1.53	2.13	3.75
	   5		1.48	2.02	3.36
From the table we see that for 2 degrees of freedom the observed value t = 13.8 is way to the right of 6.96, which is the cut-off point for 1%. Hence we conclude that the P-value is much smaller than 1% and we reject the null hypothesis that the desk is 50 inches long. The desk is longer than 50 inches.
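
How much smaller than 1%? As a side computation (not part of the original notes): for 2 degrees of freedom the Student curve happens to have a simple closed-form tail area, P(T > t) = (1 - t/sqrt(t^2+2))/2, so we can pin the P-value down using the value of t computed above.

> # t still holds 13.85640647 from the computation above
> P := evalf( (1 - t/sqrt(t^2 + 2))/2 * 100, 3);

                                  P := .258

i.e. about 0.26%, indeed well below the 1% cut-off.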


Two-sample z-test

We use this test when we want to compare two samples. There are two boxes, possibly with different averages and SDs. We make a number of draws from each box (not necessarily the same number), and we want to test the null hypothesis that the two boxes have the same average against the alternative that one box has a larger average than the other, or simply that they differ. The test statistic is given by
z = (observedDifference -expectedDifference)/SE
that is, we compare the observed and expected differences of the two sample averages. The SE of the difference is built from the SEs of the single averages, each obtained from the square-root law. We have:
(ObsAve1 +/- SE1) - (ObsAve2 +/- SE2)
The SE for the above difference is given by (Pythagoras' theorem):
    SEdifference = sqrt(SE1^2 + SE2^2)
Here is an example.

Problem4:

One hundred draws are made at random from box A, and 250 from box B. Both boxes contain a very large number of tickets. The sample from box A averages out to 220 with an SD of 70, and the sample from box B averages out to 180 with an SD of 50. Are the averages of the two boxes equal? What is the P-value? What is your conclusion?

Solution:


We just need to compute the z-statistic for the two-sample test. For that we need the observed and expected differences and the SE for the difference. We have,

> AVEa := 220 : SDa := 70 :
> AVEb := 180 : SDb := 50 :
> SEa := SDa/sqrt(100);

                                   SEa := 7
> SEb := SDb/sqrt(250);
                                          1/2
                                 SEb := 10
> SEdifference := sqrt(SEa^2 + SEb^2);
                                               1/2
                             SEdifference := 59
> z := ( (AVEa - AVEb) - 0 ) / SEdifference:
> evalf(z,3);
                 5.21

This is an extremely large value in standard units, so we conclude that the P-value is very close to 0 and we reject the null. AVEa is in fact significantly bigger than AVEb.
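
Just how close to 0? With the erf-based helper from the top of these notes we can put a number on it (P_percent is our own name):

> P_percent := evalf( tail(5.21) ):   # about .94e-5 percent, i.e. roughly 1 in 10 million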

Problem5:

The National Household Survey on Drug Abuse was conducted in 1985 and in 1992.
  1. Among persons age 18 to 25, the percentage of current users of marijuana dropped from 21.9% to 11.0%. Is this real or due to a chance variation?
  2. Among persons age 18 to 25, the percentage of current users of cigarettes decreased from 36.0% to 31.9%. Is this real, or a chance variation? You may assume that these results are based on two independent simple random samples, each of size 700.

Solution:


For the marijuana case we have:

> obsDiff := 21.9 - 11.0;

                                obsDiff := 10.9
> SE85 := 100*sqrt(.219*(1.-.219))/sqrt(700.);
                              SE85 := 1.563142439
> SE92 := 100*sqrt(.11*(1.-.11))/sqrt(700.);
                              SE92 := 1.182612121
> SEdiff := sqrt(SE85^2+SE92^2);
                             SEdiff := 1.960098394
> z := obsDiff/SEdiff;
                               z := 5.560945325

VERY LARGE z, so VERY SMALL P-value, so REJECT the null. Conclusion: a real drop in marijuana use from 1985 to 1992. For the cigarette case we have:

> obsDiff := 36.0-31.9;

                                obsDiff := 4.1
> SE85 := 100*sqrt(.36*(1.-.36))/sqrt(700.);
                              SE85 := 1.814229470
> SE92 := 100*sqrt(.319*(1.-.319))/sqrt(700.);
                              SE92 := 1.761651011
> SEdiff := sqrt(SE85^2+SE92^2);
                             SEdiff := 2.528802652
> z := obsDiff/SEdiff;
                               z := 1.621320666

Now this is a very reasonable z-value, so there is no strong evidence against the null in this case. The P-value is:

  z    Height  Area     z    Height  Area     z    Height  Area 
___________________    __________________    ___________________
 0.00  39.89   0.00    1.50  12.95  86.64    3.00  0.443  99.730
 0.05  39.84   3.99    1.55  12.00  87.89    3.05  0.381  99.771
 0.10  39.70   7.97    1.60  11.09  89.04    3.10  0.327  99.806
 0.15  39.45  11.92    1.65  10.23  90.11    3.15  0.279  99.837
 0.20  39.10  15.85    1.70   9.40  91.09    3.20  0.238  99.863

> P := (100-89.04)/2.;

                               P := 5.480000000

The P-value is larger than 5%, so we can't reject the null hypothesis of no difference in cigarette consumption between 1985 and 1992.
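
With the erf-based helper, entering at the exact z of 1.62 rather than the 1.60 row of the table changes little:

> evalf( tail(1.621320666), 3);

                                    5.25

still above 5%, so the conclusion stands.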

Randomized Controlled Experiments

The two-sample z-test used for comparing the observed averages from two independent samples from two different boxes can also be used for comparing treatment and control averages in an experiment. Here is an example.

Problem6:

Is Wheaties a power breakfast? To find out, a study is done in a large elementary statistics class; 499 students agree to participate; after the midterm, 250 are randomized to the treatment group, and 249 to the control group. The treatment group is fed Wheaties for breakfast 7 days a week. The control group gets Sugar Pops. The final scores averaged 66 for the treatment group; the SD was 21. For the control group, the average was 59 and the SD 20. What do you conclude?

Solution:


We proceed as if the treatment and control samples were independent samples drawn with replacement from two separate boxes. Clearly that is not the case, but the math turns out to be about the same, because two mistakes approximately cancel each other out. Assuming that you are drawing WITH replacement when in fact you are drawing WITHOUT replacement inflates the SE, while assuming that the two samples are INDEPENDENT when in fact they are DEPENDENT underestimates the SE. The final result is that the SE comes out about right, or a little inflated, which makes the procedure slightly conservative; that's OK. For the above example we have,

> H0 := `no difference between Wheaties and Sugar Pops`:
> H1 := `Wheaties are better`:
> observed_difference := 66 - 59:
> expected_difference := 0.0:
> SE_aveW := 21/sqrt(250.);

                            SE_aveW := 1.328156617
> SE_aveSP := 20/sqrt(249.);
                            SE_aveSP := 1.267448501
> SE_difference := sqrt(1.33^2 + 1.27^2);   # using the SEs above, rounded to two decimals
                         SE_difference := 1.838967101
> z := (observed_difference - expected_difference)/SE_difference;
                               z := 3.806484627

Since z is almost 4, the P-value will be very small and we reject the null hypothesis that there is no difference between Wheaties and Sugar Pops. Conclusion: Wheaties are better. The P-value is:

  z    Height  Area     z    Height  Area     z    Height  Area 
___________________    __________________    ___________________
 0.00  39.89   0.00    1.50  12.95  86.64    3.00  0.443  99.730
  ....


 0.75  30.11  54.67    2.25   3.17  97.56    3.75  0.035  99.982
 0.80  28.97  57.63    2.30   2.83  97.86    3.80  0.029  99.986
 0.85  27.80  60.47    2.35   2.52  98.12    3.85  0.024  99.988
 0.90  26.61  63.19    2.40   2.24  98.36    3.90  0.020  99.990
 0.95  25.41  65.79    2.45   1.98  98.57    3.95  0.016  99.992

> P := (100 - 99.986)/2.;

                              P := .007000000000

i.e. 0.007 percent, or 7 in 100,000: very, very small indeed. If the numbers were not hypothetical you should all be stuffing yourselves with Wheaties before the next exam!

The Chi-square Test

The basic question that these tests answer is:
How well does the data fit the model?
The null says that the data are like draws made at random with replacement from a box containing different types of tickets a, b, c, ... with pre-specified frequencies fa, fb, fc, ... The alternative hypothesis says that the null is not correct: perhaps the tickets in the box have frequencies different from fa, fb, ... To measure how close the observed frequencies are to those expected under the null, the X2 (read: Chi-square) test statistic is used:

> X2 := sum((obs_freq[i] - expt_freq[i])^2/expt_freq[i],i=1..K);

                           K
                         -----                             2
                          \    (obs_freq[i] - expt_freq[i])
                   X2 :=   )   -----------------------------
                          /            expt_freq[i]
                         -----
                         i = 1

where K denotes the number of types of tickets in the box. The P-value is obtained from the area to the right of the observed X2 under the Chi-square curve with (K-1) degrees of freedom. Here is an example:

Problem7:

As part of a study on the selection of grand juries in Alameda county, the educational level of grand jurors was compared with the county distribution:
Educational                      Number of
   level          County          Jurors
_____________________________________________

Elementary        28.4%              1
Secondary         48.5%             10
Some College      11.9%             16
College degree    11.2%             35
                 _______          ______
Total             100.0%            62
Could a simple random sample of 62 people from the county show a distribution of educational level so different from the county-wide one? Carry out the X2-test and compute the P-value.

Solution:

> d1 := (1 - 62*.284)^2/(62*.284):
> d2 := (10 - 62*.485)^2/(62*.485):
> d3 := (16 - 62*.119)^2/(62*.119):
> d4 := (35 - 62*.112)^2/(62*.112):
> X2 := d1 + d2 + d3 + d4;

                               X2 := 152.4914064

This is way, way out in the tail of the X2 curve with 4-1 = 3 degrees of freedom (the 1% cut-off is about 11.3). So we reject the null with all confidence: the grand jurors do not look like a simple random sample from the county.
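
For the record, the tail area can be pinned down exactly: with 3 degrees of freedom the Chi-square curve has the closed-form tail area P(X2 > x) = erfc(sqrt(x/2)) + sqrt(2x/Pi)*exp(-x/2). This is a side check, not part of the original notes; erfc is Maple's built-in complementary error function.

> # X2 still holds 152.4914064 from above; output suppressed
> P := evalf( erfc(sqrt(X2/2)) + sqrt(2*X2/Pi)*exp(-X2/2) ):

P comes out around 10^(-32): zero for all practical purposes.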


Carlos Rodriguez <carlos@math.albany.edu>
Last modified: Tue Apr 28 16:01:00 EDT 1998