R/S-PLUS example: logistic regression

The Raftery and Hout Irish education data can be obtained here as an S-PLUS/R data dump.
 # These examples were done in R 1.1.0 on an i686 running
linux. I've # checked most of the syntax. Everything should work under
most # versions of S-PLUS although the output is slightly different.
# # I'll start by using the Irish education data from Raftery and Hout
1985.  # I already have a dataframe containing these data so I'll just
attach # it

> attach(irished)

# Now I'll fit a simple logistic regression model in which 
# the leaving certificate indicator is the response variable and
# the student's DVRT test result is the sole explanatory variable

> result1 <- glm(lvcert~DVRT, family=binomial(logit))
> summary(result1)

Call:
glm(formula = lvcert ~ DVRT, family = binomial(logit))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9077  -0.9810  -0.4543   1.0307   2.1552  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.72498    0.77798  -8.644   <2e-16 ***
DVRT         0.06437    0.00759   8.481   <2e-16 ***
---
Signif. codes:  0  `***'  0.001  `**'  0.01  `*'  0.05  `.'  0.1  ` '  1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 686.86  on 499  degrees of freedom
Residual deviance: 593.77  on 498  degrees of freedom
AIC: 597.77

Number of Fisher Scoring iterations: 2

# Here we see a very significant bivariate association between 
# DVRT and the leaving certificate indicator.
# Is this magnitude of this association substantively large?
# Let's plot the fitted values against DVRT.
 
> plot(DVRT, fitted(result1))

# Here we see that the probability of a student at the low end of
# the distribution of DVRT scores has about a 0.10 probability of
# getting a leaving certificate while a student at the hight end of
# the distribution of DVRT scores has about a 0.90 probability of 
# getting a leaving certificate. This is sizable effect.

# Let's do a probit analysis of these same data

> result2 <- glm(lvcert~DVRT, family=binomial(probit))
> summary(result2)

Call:
glm(formula = lvcert ~ DVRT, family = binomial(probit))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9126  -0.9887  -0.4363   1.0346   2.1904  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.051730   0.440535  -9.197   <2e-16 ***
DVRT         0.038799   0.004311   8.999   <2e-16 ***
---
Signif. codes:  0  `***'  0.001  `**'  0.01  `*'  0.05  `.'  0.1  ` '  1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 686.86  on 499  degrees of freedom
Residual deviance: 593.88  on 498  degrees of freedom
AIC: 597.88

Number of Fisher Scoring iterations: 2

# Once again, as we would expect, we see a very significant bivariate
# association between DVRT and the leaving certificate indicator
# Let's plot the probit fitted values against DVRT

> plot(DVRT, fitted(result2))

# As we would expect, this looks pretty similar to the plot of the 
# fitted values from the logit analysis

# Let's plot the the logit fitted values, the probit fitted values,
# and the actual values of the leaving certificate indicator against
# DVRT.

# first the raw data 

> plot(DVRT, jitter(lvcert, .4), ylim=c(-0.2,1.2), ylab="")

# Note that I added some jitter to the leaving certificate indicator
# so that the density of responses could be seen in the scatterplot.
# Also, note that I set the y axis to run from -0.2 to 1.2. It is 
# important to manually set the limits of the plot axes when
# overlaying multiple graphs. 

# now the logit fitted values

> par(new=TRUE)
> plot(DVRT, fitted(result1), ylim=c(-0.2,1.2), ylab="", col=2, pch=3)

# the par(new=TRUE) line tells R/S-PLUS not to overlay the new plot
# on the existing plot. In R col=2 specifies the plotting color to be
# red, and pch=3 specifies the plotting symbol to be a plus sign.

# now let's add the probit fitted values

> par(new=TRUE)
> plot(DVRT, fitted(result2), ylim=c(-0.2,1.2), ylab="", col=4, pch=0)

# In R the probit fitted values are displayed as blue squares. As we
# expect, the probit fitted values are almost identical to the logit
# fitted values. 

# Let's fit a slightly more realistic model in which sex and the
# prestige score of the father's education enter as covariates
# Missing values in the prestige of father's occupation are coded
# as 0s. To turn these into NAs we type:

> irished$fathocc[irished$fathocc==0] <- NA

# then to fit the new logistic regression model we type:


> result3 <- glm(lvcert~DVRT+sex+fathocc, na.omit(irished), family=binomial(logit))
> summary(result3)

Call:
glm(formula = lvcert ~ DVRT + sex + fathocc, family = binomial(logit), 
    data = na.omit(irished))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1557  -0.9151  -0.4497   0.9175   2.2907  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.648069   0.983888  -8.790  < 2e-16 ***
DVRT         0.060585   0.008148   7.435 1.04e-13 ***
sex          0.516546   0.214930   2.403   0.0162 *  
fathocc      0.039139   0.007489   5.227 1.73e-07 ***
---
Signif. codes:  0  `***'  0.001  `**'  0.01  `*'  0.05  `.'  0.1  ` '  1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 651.39  on 473  degrees of freedom
Residual deviance: 530.63  on 470  degrees of freedom
AIC: 538.63

Number of Fisher Scoring iterations: 3

# Here we see that DVRT, sex, and fathocc exert significant effects on
# the probability of obtaining a leaving certificate. 

# If we are interested in testing whether the inclusion of these
# additional covariates improves the fit over the original logistic
# regression model with DVRT as the sole covariate we could perform a
# likelihood ratio test. However, to do this correctly we need to
# refit the original model to the dataset in which the observations 
# with missing values of fathocc have been dropped.


> result4 <- glm(lvcert~DVRT, na.omit(irished), family=binomial(logit))
> summary(result4)

Call:
glm(formula = lvcert ~ DVRT, family = binomial(logit), data = na.omit(irished))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8877  -0.9848  -0.4547   1.0447   2.1543  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.742867   0.805496  -8.371   <2e-16 ***
DVRT         0.064653   0.007869   8.216   <2e-16 ***
---
Signif. codes:  0  `***'  0.001  `**'  0.01  `*'  0.05  `.'  0.1  ` '  1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 651.39  on 473  degrees of freedom
Residual deviance: 564.09  on 472  degrees of freedom
AIC: 568.09

Number of Fisher Scoring iterations: 2

# the LR test statistic is 564.09 - 530.63
# the p-value is


> 1 - pchisq(564.09 - 530.63, 2)
[1] 5.423171e-08

# this is significant at any conventional level.