Chi-Square Test for Independence

The Chi-square test can also be used to test for independence between rows and columns of a contingency table. Here is an example.

Problem:

In a certain town, there are about one million eligible voters. A simple random sample of 10,000 eligible voters was chosen to study the relationship between sex and participation in the last election. The results are summarized in the following 2x2 (read "two by two") contingency table:

		Men	Women
_____________________________
Voted		2792	3591
Didn't vote	1486	2131

We want to check whether being a man or a woman (columns) is independent of having voted in the last election (rows). In other words, are sex and voting independent?

Solution:

To answer the question we set up a hypothesis test as usual. We have

> Null := `Sex is independent of Voting`:
> Alternative := `Sex and Voting are dependent`:

After specifying the null hypothesis, we need to compute the expected table under the assumption that rows and columns are in fact independent. To compute the expected table we use the product rule for chances:

	chance of (row_i,col_j) = (chance row_i) * (chance col_j)

Estimating the chance of row_i by (Sum row_i)/N and the chance of col_j by (Sum col_j)/N, where N is the sample size, we deduce that the expected count in (row_i,col_j) is given by:

	N*(chance row_i)*(chance col_j) = (Sum row_i)*(Sum col_j) / N

The observed table with totals included is:

OBSERVED TABLE
		Men	Women	| Total
________________________________|______
Voted		2792	3591	| 6383
Didn't vote	1486	2131	| 3617
_______________________________________
Total		4278	5722	| 10000

The associated expected table under the assumption that sex and voting are independent, with counts rounded to the nearest whole number, is given by

EXPECTED TABLE
		Men	Women	| Total
________________________________|______
Voted		2731	3652	| 6383
Didn't vote	1547	2070	| 3617
_______________________________________
Total		4278	5722	| 10000

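The expected counts can be checked with a short script. This is a minimal sketch in Python (not the Maple used elsewhere in this worksheet); the names `observed` and `expected_counts` are illustrative and do not come from the original.

```python
def expected_counts(observed):
    """Expected counts under independence: (row total) * (column total) / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Observed 2x2 table: rows = voted / didn't vote, columns = men / women.
observed = [[2792, 3591],
            [1486, 2131]]

expected = expected_counts(observed)

# Round to whole counts, as the expected table in the text does.
rounded = [[round(e) for e in row] for row in expected]
for row in rounded:
    print(row)
```

The unrounded values (for example 2730.6474 for men who voted) differ only slightly from the whole numbers used in the text.
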
We now have the observed table and the expected table under the null hypothesis of independence. Next we compute the X2 statistic, which measures how far the observed table is from the expected one. The X2 statistic has one term per cell of the observed table (4 in our case), each of the form (observed - expected)^2 / expected:

> c11 := (2792-2731)^2/2731.:
> c12 := (3591-3652)^2/3652.:
> c21 := (1486-1547)^2/1547.:
> c22 := (2131-2070)^2/2070.:

The X2 statistic is the sum of the contributions from each cell:

> X2 := c11+c12+c21+c22;

                               X2 := 6.584283457
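
The same sum can be written as a loop over cells. The following is a hedged Python sketch (the variable names are assumptions made for illustration) that uses the rounded expected counts from the text, so it should agree with the Maple value above.

```python
# Cells in the order (1,1), (1,2), (2,1), (2,2).
observed = [2792, 3591, 1486, 2131]
expected = [2731, 3652, 1547, 2070]  # rounded expected counts from the text

# Each cell contributes (observed - expected)^2 / expected.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
x2 = sum(contributions)

print(contributions)
print(x2)
```

Note that every cell of a 2x2 table has the same absolute deviation (61 here), so the cells with the smallest expected counts contribute the most.
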

The last step is to compute the P-value. This is done by looking up the X2 statistic in the Chi-square table with (rows-1)*(cols-1) degrees of freedom. For a 2x2 table (our case) the number of degrees of freedom is (2-1)*(2-1) = 1*1 = 1. The table gives the value of X2 that cuts off each right-tail area:

Degrees of
 freedom	99%  ...	10%	5%	1%
_____________________________________________________
1		0.00016		2.71	3.84	6.64
2 		0.020		4.60	5.99	9.21

Since the observed X2 = 6.58 lies between 3.84 and 6.64, we conclude that 1% < P-value < 5%, and we reject the null hypothesis at the 5% level. The data support the hypothesis that sex and voting are dependent in this town.
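
The table only brackets the P-value. With 1 degree of freedom it can also be computed directly, because X2 is then distributed as the square of a standard normal variable, so the right-tail area is erfc(sqrt(x/2)). The sketch below is in Python rather than Maple, and the function name `chi2_tail_1df` is a label chosen here for illustration.

```python
import math

def chi2_tail_1df(x):
    """Right-tail area of the Chi-square distribution with 1 degree of freedom.

    Uses P(Z^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for standard normal Z.
    """
    return math.erfc(math.sqrt(x / 2.0))

x2 = 6.584283457  # statistic computed in the text
p_value = chi2_tail_1df(x2)
print(f"P-value = {p_value:.4f}")
```

The result should fall between 1% and 5%, consistent with the bracket read from the table.
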

Problem 2:

Each respondent in the Current Population Survey of March 1993 was classified as employed, unemployed, or outside the labor force. The results for men in California age 35-44 can be cross-tabulated by marital status, as follows:

			 	  Widowed, divorced,	Never
			Married	    or separated	married
			________________________________________
Employed		679		103		114
Unemployed		63		10		20
Not in labor force	42		18		25

Men of different marital status seem to have different distributions of labor force status. Or is this just chance variation? (You may assume the table results from a simple random sample.)

Solution:

We have:

> Obs_table := matrix(3,3,[679,103,114,63,10,20,42,18,25]);

                                    [679    103    114]
                                    [                 ]
                       Obs_table := [ 63     10     20]
                                    [                 ]
                                    [ 42     18     25]

> R1 := 679+103+114:R2:=63+10+20:R3:=42+18+25:
> C1:=679+63+42:C2:=103+10+18:C3:=114+20+25:N:=evalf(R1+R2+R3):
> Exp_table := matrix(3,3,(i,j)-> round(R.i*C.j/N));

                                    [654    109    133]
                                    [                 ]
                       Exp_table := [ 68     11     14]
                                    [                 ]
                                    [ 62     10     13]


                        2                    2
             (679 - 654)            (25 - 13)
     X2 :=  ------------ + ...   + ----------
                 654                   13

> X2 := 30.96:
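
As a cross-check of the expected table and of the nine-term sum, here is a minimal Python sketch (not the worksheet's Maple; all names are illustrative). It rounds the expected counts to whole numbers before summing, which is what the worksheet does, so it should reproduce X2 = 30.96 to two decimals.

```python
# Rows: employed, unemployed, not in labor force.
# Columns: married; widowed, divorced, or separated; never married.
observed = [[679, 103, 114],
            [63, 10, 20],
            [42, 18, 25]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected counts under independence, rounded as in the worksheet.
expected = [[round(r * c / n) for c in col_totals] for r in row_totals]

# Sum (observed - expected)^2 / expected over all nine cells.
x2 = sum((o - e) ** 2 / e
         for obs_row, exp_row in zip(observed, expected)
         for o, e in zip(obs_row, exp_row))

print(expected)
print(round(x2, 2))
```

Using unrounded expected counts would give a slightly different statistic, but not one that changes the conclusion of the test.
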

Looking at the table of the Chi-square distribution with (3-1)*(3-1) = 2*2 = 4 degrees of freedom we get:

Degrees of
 freedom	99%  ...	10%	5%	1%
_____________________________________________________
1		0.00016		2.71	3.84	6.64
2 		0.020		4.60	5.99	9.21
3		0.12		6.25	7.82	11.34
4		0.30		7.78	9.49	13.28
5		0.55		9.24	11.07	15.09

Since 30.96 >> 13.28, we conclude from the table that P << 1%, so we reject the null hypothesis of independence with great confidence. Conclusion: marital status seems to be related to labor force status among California men aged 35-44.
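
To see how small the P-value is, the tail area can be computed in closed form. For 4 degrees of freedom the right-tail area of the Chi-square distribution is exp(-x/2)*(1 + x/2). The Python sketch below applies this formula; the function name `chi2_tail_4df` is an illustrative choice, not something from the original worksheet.

```python
import math

def chi2_tail_4df(x):
    """Right-tail area of the Chi-square distribution with 4 degrees of freedom.

    For an even number of degrees of freedom 2k the tail area is
    exp(-x/2) * sum((x/2)**j / j!, j = 0 .. k-1); here k = 2.
    """
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

p_value = chi2_tail_4df(30.96)
print(f"P-value = {p_value:.2e}")

# Sanity check against the 5% and 1% columns of the table (4 degrees of freedom).
print(chi2_tail_4df(9.49), chi2_tail_4df(13.28))
```

The computed value is several orders of magnitude below 1%, in line with the reading P << 1% from the table.
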

Carlos Rodriguez <carlos@math.albany.edu>

Last modified: Tue Apr 28 13:54:56 EDT 1998