Percents. Confidence Intervals and the Bootstrap

Problem 1:

The Residential Energy Consumption Survey found in 1990 that 14.8% of American households had a computer. A market survey organization repeated this study in a certain town with 25,000 households, using a simple random sample of 500 households: 79 of the sample households had computers.

Estimate the percentage of households in the town with computers and its SE.
Find a 95%-confidence interval for the percentage of all 25,000 households with computers.

Solution:

The percentage of households in the town with computers is estimated by the observed sample percentage, psample, given by

> psample := (79./500)*100;

                            psample := 15.80000000

The standard error for this estimate is given by the square-root law as,

> SE := (SDbox/sqrt(500))*100;

                                             1/2
                              SE := 2 SDbox 5

where SDbox is the standard deviation of the box of 25,000 tickets, each marked 1 (the household has a computer) or 0 (it does not). The exact percentage of ones in the box is not known, but the bootstrap method estimates it with the observed psample above. Using the formula for the standard deviation of a list of 0-1 tickets, sqrt((fraction of ones)*(fraction of zeroes)), we estimate SDbox by SDsample, given by,

> SDsample := sqrt(.158*(1-.158));

                            SDsample := .3647410040

Hence the SE is estimated by SEest,

> SEest := (SDsample/sqrt(500.))*100;

                             SEest := 1.631171358

Thus, the percentage of households in this town with computers is estimated by 15.8% give or take 1.6% or so.
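For readers without Maple, the same estimate can be reproduced in Python. This is only a minimal sketch: the function name percent_and_se is made up for illustration, and it assumes the 0-1 box model described above.

```python
import math

def percent_and_se(count, n):
    """Return the sample percentage and its bootstrap-estimated SE, both in percent."""
    fraction = count / n                               # observed fraction of ones
    sd_sample = math.sqrt(fraction * (1 - fraction))   # SD of a list of 0-1 tickets
    se_percent = sd_sample / math.sqrt(n) * 100        # square-root law, as a percent
    return 100 * fraction, se_percent

psample, se_est = percent_and_se(79, 500)
print(round(psample, 1), round(se_est, 2))  # 15.8 1.63
```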

Let us now find the 95%-confidence interval. This is just an interval centered about the observed sample percentage, built so that repeated samples each have about a 95% chance of producing an interval that covers the true percentage in the town. All we need is to go above and below 15.8% by 2*SEest. The actual interval is,

> [psample-2*SEest, psample+2*SEest];

                          [12.53765728, 19.06234272]

How about a 92.7% Confidence Interval?

All we need is to enter the Normal Table with AREA=92.7% and read out the z,

  z    Height  Area     z    Height  Area     z    Height  Area 
___________________    __________________    ___________________
 0.00  39.89   0.00    1.50  12.95  86.64    3.00  0.443  99.730
 0.05  39.84   3.99    1.55  12.00  87.89    3.05  0.381  99.771
                   ......           .....
 0.25  38.67  19.74    1.75   8.63  91.99    3.25  0.203  99.885
 0.30  38.14  23.58    1.80   7.90  92.81    3.30  0.172  99.903
 0.35  37.52  27.37    1.85   7.21  93.57    3.35  0.146  99.919

The closest entry is z=1.80 (Area=92.81%)

The interval is then,

> [psample-1.8*SEest,psample+1.8*SEest];

                          [12.86389156, 18.73610844]

Do you see how the interval shrinks?

The more the confidence, the wider the interval, until we get that the interval [0%,100%] has total 100% confidence. But of course we knew that before taking the sample, so that extreme case is useless.
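The table lookup can also be done numerically. The sketch below assumes Python 3.8 or later (for statistics.NormalDist), and the helper name z_for_confidence is hypothetical. It finds the z for which the area under the normal curve between -z and z equals the requested confidence level.

```python
from statistics import NormalDist

def z_for_confidence(level_percent):
    """z such that the area between -z and z under the normal curve is level_percent."""
    tail = (1 - level_percent / 100) / 2       # area left over in each tail
    return NormalDist().inv_cdf(1 - tail)

# The multiplier z, and hence the interval width, grows with the confidence level.
for level in (80, 92.7, 95, 99):
    print(level, round(z_for_confidence(level), 2))
```

For 95% this gives a z close to 1.96, which the text above rounds to 2; for 92.7% it lands just below the table entry z=1.80.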

One question that arises in the above computations of SEest is: How large must the sample size be in order for the Bootstrap method (of estimating the SD in the box with the SD in the sample) to work? There is no overall answer. It depends on the actual composition of the box. Here is an example.

Problem 2:

Continuing with Problem 1: suppose now that of the 500 sample households, 498 had refrigerators. Find a 99% confidence interval for the percentage of all 25,000 households with refrigerators.

Answer:

It can't be done! The bootstrap method will give a poor estimate in this case even though the sample size is still 500. The reason is that only a very small percentage of the households in the town don't have refrigerators, so a much larger sample size is needed in order to see enough households with no refrigerators.
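One symptom of the trouble shows up if we go ahead and apply the recipe from Problem 1 anyway. The following is a sketch only: naive_interval is a hypothetical helper, and 2.576 is assumed as the z for 99% of the area under the normal curve.

```python
import math

def naive_interval(count, n, z):
    """Sample percentage plus or minus z bootstrap-estimated SEs (in percent)."""
    fraction = count / n
    se = 100 * math.sqrt(fraction * (1 - fraction)) / math.sqrt(n)
    p = 100 * fraction
    return p - z * se, p + z * se

Z_99 = 2.576  # assumed z for a 99% interval
low, high = naive_interval(498, 500, Z_99)
print(f"sample households without refrigerators: {500 - 498}")
print(f"naive 99% interval: [{low:.2f}%, {high:.2f}%]")
```

The upper end of this interval comes out above 100%, which is impossible for a percentage. The SE estimate rests on only 2 households without refrigerators, and the normal curve is a poor approximation that close to 100%.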

Problem 3:

Continuing with Problem 1: suppose now that among the sample households 121 had no car, 172 had one car, and 207 had two or more cars. Find a 92%-confidence interval for the percentage of households in the town with one or more cars.

Answer:

The observed percentage of sample households with one or more cars, which estimates the percentage in the town, is given by,

> p := 100*(172+207.)/500.;

                               p := 75.80000000

The estimated SE for this percentage is,

> SE := sqrt(p*(100-p))/sqrt(500.);

                               SE := 1.915390299

Notice that I used a simplified version of the formula for the SE of a percent. It is just algebra and it always gives the same answer as the formula we used in Problem 1. Look:

> se := 100*sqrt(.758*(1-.758))/sqrt(500.);

                               se := 1.915390299
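The algebra behind this agreement is short. In the sketch below, p is the percentage (75.8 here) and n is the sample size (500 here); the symbol n is introduced only for this illustration.

```latex
\[
\frac{\sqrt{p\,(100-p)}}{\sqrt{n}}
  = \frac{\sqrt{100^{2}\cdot\frac{p}{100}\left(1-\frac{p}{100}\right)}}{\sqrt{n}}
  = 100\cdot\frac{\sqrt{\frac{p}{100}\left(1-\frac{p}{100}\right)}}{\sqrt{n}}
\]
```

The left side is the simplified formula, and the right side is the Problem 1 formula with the fraction p/100 = .758 in place of the percentage.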

So far we know that the estimated percentage is:

75.8% give or take 1.9% or so

To get the desired 92%-confidence interval, we enter the normal table (see above) with Area=92% and read off the closest entry, z=1.75 (Area=91.99%). The interval is then,

> [p-1.75*SE,p+1.75*SE];

                          [72.44806698, 79.15193302]

What does it mean?

What indeed does it mean that the interval above [72%,79%] is a 92%-confidence interval?

The answer to this question is tricky. In fact

IT DOES NOT MEAN

what it should mean which is:

There is a 92% chance that the true % is between 72% and 79%

The true % is a fixed number: it is either inside [72%,79%] or it is not. What the statement actually means is that if you take lots and lots of samples (just like the ONLY one you have) then each sample will give you a different 92%-confidence interval. About 92% of these intervals will cover the true % in the box.
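This repeated-sampling reading can be checked by simulation. The sketch below is an illustration under assumptions, not part of the original session: it pretends the true percentage in the box is 75.8% (in reality it is unknown), draws many simple random samples of 500, and counts how often the interval built from each sample covers that value. The function name coverage and its defaults are made up for the illustration.

```python
import random

def coverage(true_percent=75.8, box_size=25000, n=500, z=1.75, trials=2000, seed=0):
    """Fraction of simulated samples whose interval covers the assumed true percentage."""
    rng = random.Random(seed)
    ones = round(box_size * true_percent / 100)
    box = [1] * ones + [0] * (box_size - ones)      # the box of 0-1 tickets
    hits = 0
    for _ in range(trials):
        count = sum(rng.sample(box, n))             # simple random sample, no replacement
        p = 100 * count / n                         # sample percentage
        se = (p * (100 - p)) ** 0.5 / n ** 0.5      # bootstrap-estimated SE, in percent
        if p - z * se <= true_percent <= p + z * se:
            hits += 1
    return hits / trials

print(coverage())
```

The printed fraction should come out near 0.92. Each individual interval either covers 75.8% or misses it; the 92% describes the procedure, not any single interval.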

I know, I know.. stats should be able to do better...


Carlos Rodriguez <carlos@math.albany.edu>

Last modified: Tue Apr 14 14:20:55 EDT 1998