merge.hist {mining}    R Documentation
Quantize a variable by merging similar histogram bins.
merge.hist(x,b=NULL,n=b)
x    a numerical vector.
b    the starting number of bins, or a vector of starting break locations. If NULL, chosen automatically by hist.
n    the desired number of bins.
The desired number of bins is achieved by successively merging the two most similar histogram bins. The distance between bins of height (f1,f2) and width (w1,w2) is measured according to the chi-square statistic
w1*(f1-f)^2/f + w2*(f2-f)^2/f
where f is the height of the merged bin:
f = (f1*w1 + f2*w2)/(w1 + w2)
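As a rough illustration of the two formulas above, the distance can be written as a small R function. The name bin.distance is hypothetical and is not part of the package, and returning 0 for two empty bins is an assumption made here to avoid dividing by zero.

```r
# Chi-square distance between two adjacent bins, following the formulas above.
# f1, f2: bin heights; w1, w2: bin widths.
# bin.distance is a hypothetical name used only for this sketch.
bin.distance <- function(f1, w1, f2, w2) {
  f <- (f1 * w1 + f2 * w2) / (w1 + w2)   # height of the merged bin
  if (f == 0) return(0)                  # assumption: two empty bins are identical
  w1 * (f1 - f)^2 / f + w2 * (f2 - f)^2 / f
}

d.same <- bin.distance(4, 1, 4, 1)   # equal heights, so the distance is 0
d.diff <- bin.distance(2, 1, 6, 1)   # f = 4, so the distance is 4/4 + 4/4 = 2
```

Bins of equal height are at distance zero regardless of their widths, so they are always the first candidates for merging.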
A vector of bin breaks, suitable for use in hist, bhist, or cut.
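For instance, a vector of breaks can be passed to cut to quantize the data into a factor. The breaks below are a made-up stand-in for a value returned by merge.hist, not actual output.

```r
# Hypothetical break vector standing in for the return value of merge.hist.
breaks <- c(-4, -1, 1, 4)
x <- c(-2.3, -1.7, 0.2, 1.9, 2.4)

# include.lowest = TRUE keeps a value equal to the first break in the first bin.
q <- cut(x, breaks, include.lowest = TRUE)
table(q)   # number of observations per quantized level
```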
Two plots are shown: a bhist using the returned bin breaks, and a
merging trace. The trace shows, for each merge, the chi-square
distance between the two bins that were merged. This is useful for
determining the appropriate number of bins. An interesting number of
bins is one that directly precedes a sudden jump in the chi-square
distance.
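The merging procedure and its trace can be sketched in a few lines of base R. This is only an illustration of the idea described above, not the package's implementation: merge.sketch is a hypothetical helper, heights are assumed to be counts per unit width, only adjacent bins are compared, and no plots are drawn.

```r
# Hypothetical helper: greedily merge the two most similar adjacent bins
# until n bins remain, recording the chi-square distance of each merge.
merge.sketch <- function(x, breaks, n) {
  counts <- hist(x, breaks = breaks, plot = FALSE)$counts
  trace <- numeric(0)
  while (length(counts) > max(n, 1)) {
    w <- diff(breaks)                     # bin widths
    f <- counts / w                       # heights as counts per unit width
    i <- seq_len(length(counts) - 1)      # left bin of each adjacent pair
    fm <- (counts[i] + counts[i + 1]) / (w[i] + w[i + 1])   # merged heights
    d <- w[i] * (f[i] - fm)^2 / fm + w[i + 1] * (f[i + 1] - fm)^2 / fm
    d[fm == 0] <- 0                       # two empty bins count as identical
    j <- which.min(d)                     # most similar pair
    trace <- c(trace, d[j])
    counts[j] <- counts[j] + counts[j + 1]
    counts <- counts[-(j + 1)]
    breaks <- breaks[-(j + 1)]            # drop the break shared by the pair
  }
  list(breaks = breaks, trace = trace)
}

# Two dense bins followed by two sparse bins, merged all the way to one bin.
x <- c(rep(0.5, 10), rep(1.5, 10), 2.5, 3.5)
res <- merge.sketch(x, 0:4, 1)

# The largest jump in the trace occurs at merge m; the number of bins
# just before that merge is the "interesting" one.
k0 <- 4                                   # starting number of bins
m <- which.max(diff(res$trace)) + 1
n.interesting <- k0 - m + 1               # 2: one dense bin and one sparse bin
```

In this toy data the first two merges cost nothing, because they join bins of equal height, while the final merge is expensive. The jump therefore points to two bins.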
Tom Minka
x <- c(rnorm(100,-2,0.5),rnorm(100,2,0.5))
b <- seq(-4,4,by=0.25)
merge.hist(x,b,10)
# according to the merging trace, n=5 and n=11 are most interesting.

x <- runif(1000)
b <- seq(0,1,by=0.05)
merge.hist(x,b,10)
# according to the merging trace, n=6 and n=9 are most interesting.
# because the data is uniform, there should only be one bin,
# but chance deviations in density prevent this.
# a multiple comparisons correction in merge.hist may fix this.