ward {mining} | R Documentation |
Produces a hierarchical clustering of one-dimensional data via Ward's method.
ward(x, n=rep(1,length(x)), s=rep(1,length(x)), sortx=T, same.var=T)
x | a numerical vector, or a list of vectors. |
n | if x is a vector of cluster means, n is the size of each cluster. |
s | if x is a vector of cluster means, s is the sum of squares in each cluster. Only needed if same.var=F. |
sortx | if sortx=F, only clusters which are adjacent in x can be merged. Used by break.ts. |
Repeatedly merges clusters in order to minimize the clustering cost.
By default, it is the same as hclust(method="ward").
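The merging loop can be sketched in base R for the default sum-of-squares cost. This is a minimal illustration, not the package's implementation: `greedy_ward_1d` is a hypothetical helper, and it assumes that only clusters adjacent in sorted order are candidates for merging. Each step merges the pair with the smallest increase in cost, n_a*n_b/(n_a+n_b)*(m_a - m_b)^2.

```r
# Greedy agglomeration of 1-D data under the sum-of-squares cost.
# Hypothetical helper for illustration; not part of the mining package.
greedy_ward_1d <- function(x) {
  m <- sort(x)             # current cluster means (each point starts alone)
  n <- rep(1, length(x))   # current cluster sizes
  heights <- numeric(0)    # incremental cost of each merge, in merge order
  while (length(m) > 1) {
    k <- length(m)
    a <- 1:(k - 1)
    b <- 2:k
    # cost increase for merging each pair of adjacent clusters
    inc <- n[a] * n[b] / (n[a] + n[b]) * (m[a] - m[b])^2
    i <- which.min(inc)
    heights <- c(heights, inc[i])
    # merged cluster: pooled size and size-weighted mean
    new_n <- n[i] + n[i + 1]
    new_m <- (n[i] * m[i] + n[i + 1] * m[i + 1]) / new_n
    m <- append(m[-c(i, i + 1)], new_m, after = i - 1)
    n <- append(n[-c(i, i + 1)], new_n, after = i - 1)
  }
  heights
}

x <- c(0, 0.1, 0.3, 5, 5.2)
h <- greedy_ward_1d(x)
# The merge costs add up to the total sum of squares about the grand mean
all.equal(sum(h), sum((x - mean(x))^2))
```

Because the cost starts at zero (every point in its own cluster) and ends at the total sum of squares (one cluster), the increments must sum to that total, which gives a quick sanity check on the loop.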
If same.var=T, the cost is the sum of squares:
sum_c sum_{i in c} (x_i - m_c)^2
The incremental cost of merging clusters c_a and c_b is
(n_a*n_b)/(n_a+n_b)*(m_a - m_b)^2
It prefers to merge clusters which are small and have similar means.
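The incremental cost can be checked numerically in base R. The sketch below uses two made-up clusters `a` and `b` (illustrative values, not from the package) and compares the directly computed increase in the sum of squares with the closed-form expression.

```r
# Two hypothetical clusters of 1-D data
a <- c(1.0, 1.5, 2.0)
b <- c(4.0, 5.0)

# Within-cluster sum of squares about the cluster mean
ss <- function(v) sum((v - mean(v))^2)

# Increase in total cost caused by merging a and b, computed directly
direct <- ss(c(a, b)) - (ss(a) + ss(b))

# The same increase from the closed-form expression
na <- length(a)
nb <- length(b)
closed_form <- na * nb / (na + nb) * (mean(a) - mean(b))^2

all.equal(direct, closed_form)
```

Both quantities equal 10.8 here. The factor n_a*n_b/(n_a+n_b) grows with cluster size, which is why small clusters are cheaper to merge at a given difference in means.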
If same.var=F, the cost is the sum of log-variances:
sum_c n_c*log(1/n_c*sum_{i in c} (x_i - m_c)^2)
It prefers to merge clusters which are small, have similar means, and have similar variances.
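To see how the log-variance cost penalizes merging clusters of different spread, the cost can be evaluated directly in base R. `logvar_cost` is a hypothetical helper written for this illustration, and the simulated clusters are arbitrary. Note that a cluster with zero spread (for example a single point) gives log(0) under this formula, so the sketch uses clusters with many points.

```r
# Log-variance cost of a partition given as a list of numeric vectors.
# Hypothetical helper for illustration; not part of the mining package.
logvar_cost <- function(clusters) {
  sum(sapply(clusters, function(v) {
    n <- length(v)
    n * log(sum((v - mean(v))^2) / n)
  }))
}

set.seed(1)
tight1 <- rnorm(50, 0, 0.1)   # small variance
tight2 <- rnorm(50, 0, 0.1)   # same mean, similar variance
wide   <- rnorm(50, 0, 2)     # same mean, much larger variance

# Cost increase from merging two clusters
merge_increase <- function(u, v) {
  logvar_cost(list(c(u, v))) - logvar_cost(list(u, v))
}

inc_similar   <- merge_increase(tight1, tight2)
inc_different <- merge_increase(tight1, wide)
inc_similar < inc_different
```

All three clusters have the same mean, so the difference between the two increases comes from the mismatch in variance alone.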
If x is a list of vectors, each vector is assumed to be a cluster.
n and s are computed for each cluster and x is replaced by the cluster means.
Thus you can say ward(split(x,f)) to cluster the data for different factors.
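The per-cluster summaries described above can be computed in base R as follows. This is a sketch of one reading of the text, taking s as the sum of squared deviations about each cluster mean. Whether ward() computes s in exactly this way is not stated here, and the commented-out code in the examples on this page writes tapply(x,f,var)*n instead, which differs by a factor of n/(n-1).

```r
data(OrchardSprays)

# One vector of responses per treatment level
cl <- split(OrchardSprays$decrease, OrchardSprays$treatment)

n <- sapply(cl, length)                             # cluster sizes
m <- sapply(cl, mean)                               # cluster means
s <- sapply(cl, function(v) sum((v - mean(v))^2))   # sums of squares about each mean

# The sum of squares equals the sample variance times (n - 1)
all.equal(s, sapply(cl, var) * (n - 1))
```

The vectors m, n and s have one entry per treatment level and are in the form expected by the x, n and s arguments.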
An object of the same type as that returned by hclust.
Because of the adjacency constraint used in the implementation, the clustering that results from sortx=T and same.var=F may occasionally be suboptimal.
Tom Minka
hclust, plot.hclust.trace, hist.hclust, boxplot.hclust, break.ward, break.ts, merge.factor
x <- c(rnorm(700,-2,1.5),rnorm(300,3,0.5))
hc <- ward(x)
opar <- par(mfrow=c(2,1))
plot.hclust.trace(hc)
hist.hclust(hc,x)
par(opar)

x <- c(rnorm(700,-2,0.5),rnorm(1000,2.5,1.5),rnorm(500,7,0.1))
hc <- ward(x)
opar <- par(mfrow=c(2,1))
plot.hclust.trace(hc)
hist.hclust(hc,x)
par(opar)

data(OrchardSprays)
x <- OrchardSprays$decrease
f <- factor(OrchardSprays$treatment)
# shuffle levels
#lev <- levels(OrchardSprays$treatment)
#f <- factor(OrchardSprays$treatment,levels=sample(lev))
hc <- ward(split(x,f))
# is equivalent to:
#n <- tapply(x,f,length)
#m <- tapply(x,f,mean)
#s <- tapply(x,f,var)*n
#hc <- ward(m,n,s)
boxplot.hclust(hc,split(x,f))