R/centroidFunctions.R
centroids.Rd
Functions to calculate cluster centroids for K-centroids clustering that extend the options available in package flexclust.
centMode calculates centroids based on the mode of each variable.
centMin determines centroids within a specified range which minimize the supplied distance metric.
centOptimNA replicates the behaviour of flexclust::centOptim() but removes missing values.
These functions are designed for use with flexclust::kcca() or functions that are built upon it. Their use is easiest via the wrapper kccaExtendedFamily().
centMode(x)
centMin(x, dist, xrange = NULL)
centOptimNA(x, dist)
x: A numeric matrix or data frame.
dist: The distance measure function used in centMin and centOptimNA.
xrange: The range of the data in x. Currently only used for centMin. Options are:
NULL (default): defaults to "all".
"all": uses the same minimum and maximum value for each column of x by determining the whole range of values in the data object x.
"columnwise": uses different minimum and maximum values for each column of x by determining the columnwise ranges of values in the data object x.
A vector of c(min, max): specifies the same minimum and maximum value for each column of x.
A list of vectors list(c(min1, max1), c(min2, max2),...) with length ncol(x): specifies different minimum and maximum values for each column of x.
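The xrange options listed above can be pictured as a small normalisation step that always ends in one c(min, max) pair per column. The following is a minimal base-R sketch of that idea, not the package's implementation: the helper name resolve_xrange is hypothetical, and ignoring NAs when computing ranges is an assumption made here.

```r
# Hypothetical helper (not part of the package): turn an `xrange`
# specification into a list with one c(min, max) pair per column of `x`.
resolve_xrange <- function(x, xrange = NULL) {
  x <- as.matrix(x)
  p <- ncol(x)
  if (is.null(xrange)) xrange <- "all"        # NULL falls back to "all"
  if (is.list(xrange)) {                      # one pair per column, used as given
    stopifnot(length(xrange) == p)
    return(xrange)
  }
  if (is.character(xrange)) {
    if (identical(xrange, "all")) {           # same overall range for every column
      return(rep(list(range(x, na.rm = TRUE)), p))
    }
    if (identical(xrange, "columnwise")) {    # separate range for each column
      return(lapply(seq_len(p), function(j) range(x[, j], na.rm = TRUE)))
    }
    stop("unknown xrange option: ", xrange)
  }
  stopifnot(is.numeric(xrange), length(xrange) == 2)
  rep(list(xrange), p)                        # c(min, max) recycled to every column
}

dat <- data.frame(A = rep(2:5, 2),
                  B = rep(1:4, 2),
                  C = rep(c(1, 2, 4, 5), 2))
ranges_all <- resolve_xrange(dat)                # NULL behaves like "all"
ranges_col <- resolve_xrange(dat, "columnwise")
ranges_vec <- resolve_xrange(dat, c(0, 10))
```

In this sketch "all" and a c(min, max) vector both yield identical pairs for every column, while "columnwise" and the list form allow the pairs to differ.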
A named numeric vector containing the centroid values for each column of x.
centMode: Column-wise modes are used as centroids, and ties are broken randomly. In combination with Simple Matching Distance (distSimMatch), this results in the kmodes algorithm.
centMin: Column-wise centroids are calculated by minimizing the specified distance measure between the values in x and all possible levels of x.
centOptimNA: Column-wise centroids are calculated by minimizing the specified distance measure via a general purpose optimizer. Unlike in flexclust::centOptim(), NAs are removed from the starting search values and disregarded in the distance calculation.
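The three strategies described above can be illustrated with plain base R. This is only a sketch under stated assumptions, not the package's code: the helper names sketch_mode, sketch_min, sketch_optim_na and manhattan_total are hypothetical, the distance is hard-coded as absolute difference instead of a user-supplied dist function, candidate levels are assumed to be the integers spanning the overall data range, and which.min() keeps the first of several tied levels.

```r
# Hypothetical base-R helpers; none of these are the package's own code.

# 1. Mode per column, ties broken at random (the idea behind centMode).
sketch_mode <- function(x) {
  apply(as.matrix(x), 2, function(col) {
    counts <- table(col)                       # table() drops NAs by default
    top <- names(counts)[as.vector(counts) == max(counts)]
    as.numeric(top[sample.int(length(top), 1)])
  })
}

# 2. Level per column with the smallest summed distance (the idea behind centMin).
#    Assumed here: integer levels over the whole data range, absolute difference.
sketch_min <- function(x) {
  x <- as.matrix(x)
  rng <- range(x, na.rm = TRUE)
  levels <- seq(rng[1], rng[2])
  apply(x, 2, function(col) {
    total <- vapply(levels, function(l) sum(abs(col - l), na.rm = TRUE), numeric(1))
    levels[which.min(total)]                   # first minimiser if several tie
  })
}

# 3. General purpose optimiser with NAs ignored (the idea behind centOptimNA).
#    Summed absolute deviation of all rows from `center`, skipping missing cells.
manhattan_total <- function(center, x) {
  sum(abs(sweep(x, 2, center)), na.rm = TRUE)
}
sketch_optim_na <- function(x) {
  x <- as.matrix(x)
  start <- colMeans(x, na.rm = TRUE)           # starting values computed without NAs
  optim(start, manhattan_total, x = x)$par     # Nelder-Mead by default
}

toy <- data.frame(A = c(1, 2, 2, 5), B = c(4, 4, 5, 1))
cent_mode <- sketch_mode(toy)
cent_min <- sketch_min(toy)

toy_na <- as.matrix(toy)
toy_na[2, 1] <- NA                             # introduce a missing value
cent_opt <- sketch_optim_na(toy_na)
```

The toy data were chosen so that neither the mode nor the minimising level is tied. The optimiser result is approximate, since Nelder-Mead searches a continuous space rather than a fixed set of levels.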
# Example: Mode as centroid
dat <- data.frame(A = rep(2:5, 2),
                  B = rep(1:4, 2),
                  C = rep(c(1, 2, 4, 5), 2))
centMode(dat)
#> A B C
#> 5 4 1
## within kcca
flexclust::kcca(dat, 3, family=kccaExtendedFamily('kModes')) #default centroid
#> kcca object of family ‘kModes’
#>
#> call:
#> flexclust::kcca(x = dat, k = 3, family = kccaExtendedFamily("kModes"))
#>
#> cluster sizes:
#>
#> 1 2 3
#> 4 2 2
#>
# Example: Centroid is level for which distance is minimal
centMin(dat, flexclust::distManhattan, xrange = 'all')
#> A B C
#> 4 2 3
## within kcca
flexclust::kcca(dat, 3,
                family=flexclust::kccaFamily(dist=flexclust::distManhattan,
                                             cent=\(y) centMin(y, flexclust::distManhattan,
                                                               xrange='all')))
#> kcca object of family ‘flexclust::distManhattan’
#>
#> call:
#> flexclust::kcca(x = dat, k = 3, family = flexclust::kccaFamily(dist = flexclust::distManhattan,
#> cent = function(y) centMin(y, flexclust::distManhattan, xrange = "all")))
#>
#> cluster sizes:
#>
#> 2 3
#> 2 6
#>
# Example: Centroid calculated by general purpose optimizer with NA removal
nas <- sample(c(TRUE, FALSE), prod(dim(dat)),
              replace=TRUE, prob=c(0.1,0.9)) |>
  matrix(nrow=nrow(dat))
dat[nas] <- NA
centOptimNA(dat, flexclust::distManhattan)
#>        A        B        C
#> 4.000000 3.000000 4.000002
## within kcca
flexclust::kcca(dat, 3, family=kccaExtendedFamily('kGower')) #default centroid
#> kcca object of family ‘kGower’
#>
#> call:
#> flexclust::kcca(x = dat, k = 3, family = kccaExtendedFamily("kGower"))
#>
#> cluster sizes:
#>
#> 1 2 3
#> 2 2 4
#>