Functions to calculate cluster centroids for K-centroids clustering that extend the options available in package flexclust.

centMode calculates centroids based on the mode of each variable. centMin determines centroids within a specified range that minimize the supplied distance measure. centOptimNA replicates the behaviour of flexclust::centOptim() but ignores missing values.

These functions are designed for use with flexclust::kcca() or functions that are built upon it. Their use is easiest via the wrapper kccaExtendedFamily().

centMode(x)

centMin(x, dist, xrange = NULL)

centOptimNA(x, dist)

Arguments

x

A numeric matrix or data frame.

dist

The distance measure function used in centMin and centOptimNA.

xrange

The range of the data in x. Currently only used for centMin. Options are:

  • NULL (default): equivalent to "all".

  • "all": uses the same minimum and maximum value for each column of x by determining the whole range of values in the data object x.

  • "columnwise": uses different minimum and maximum values for each column of x by determining the columnwise ranges of values in the data object x.

  • A vector of c(min, max): specifies the same minimum and maximum value for each column of x.

  • A list of vectors list(c(min1, max1), c(min2, max2),...) with length ncol(x): specifies different minimum and maximum values for each column of x.
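The options above can be pictured as different ways of arriving at one c(min, max) pair per column. The following is a minimal base-R sketch of that idea, not the package's internal code: the helper name resolve_xrange is hypothetical and exists only for illustration.

```r
# Hypothetical helper (not part of the package): turn an `xrange`
# specification into one c(min, max) pair per column of x.
resolve_xrange <- function(x, xrange = NULL) {
  x <- as.matrix(x)
  p <- ncol(x)
  if (is.null(xrange)) xrange <- "all"        # NULL falls back to "all"
  if (is.character(xrange)) {
    xrange <- match.arg(xrange, c("all", "columnwise"))
    if (xrange == "all") {
      # one global range, repeated for every column
      rep(list(range(x, na.rm = TRUE)), p)
    } else {
      # a separate range for each column
      lapply(seq_len(p), function(j) range(x[, j], na.rm = TRUE))
    }
  } else if (is.list(xrange)) {
    # user-supplied range per column
    stopifnot(length(xrange) == p)
    xrange
  } else {
    # a single c(min, max) applied to every column
    stopifnot(is.numeric(xrange), length(xrange) == 2)
    rep(list(xrange), p)
  }
}

dat <- data.frame(A = rep(2:5, 2),
                  B = rep(1:4, 2),
                  C = rep(c(1, 2, 4, 5), 2))
resolve_xrange(dat)                # every column: c(1, 5)
resolve_xrange(dat, "columnwise")  # A: c(2, 5), B: c(1, 4), C: c(1, 5)
resolve_xrange(dat, c(0, 6))       # every column: c(0, 6)
```

With "all", a column whose observed values stop short of the global extremes (such as B here) is still searched over the full global range, whereas "columnwise" restricts each column to its own observed values.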

Value

A named numeric vector containing the centroid values for each column of x.

Details

  • centMode: Column-wise modes are used as centroids. Ties are broken randomly, so repeated calls on data with tied modes can return different centroids. In combination with Simple Matching Distance (distSimMatch), this results in the k-modes algorithm.

  • centMin: Column-wise centroids are calculated by minimizing the specified distance measure between the values in x and all possible levels of x.

  • centOptimNA: Column-wise centroids are calculated by minimizing the specified distance measure via a general purpose optimizer. Unlike in flexclust::centOptim(), NAs are removed from the starting search values and disregarded in the distance calculation.
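To make the three strategies concrete, here is a rough base-R sketch of each. These are illustrative simplifications rather than the package's implementations: all sketch_* names are hypothetical, the Manhattan distance is hard-coded instead of taking a dist argument, the level search assumes integer levels, and ties in that search go to the first minimum.

```r
# Illustrative sketches only; all `sketch_*` names are hypothetical.

# Column-wise mode; one of the tied values is drawn at random.
sketch_mode <- function(x) {
  sapply(as.data.frame(x), function(col) {
    tab <- table(col)                       # NAs are dropped by table()
    top <- names(tab)[tab == max(tab)]
    as.numeric(top[sample.int(length(top), 1)])
  })
}

# Column-wise search over the integer levels lo..hi for the level with the
# smallest summed absolute (Manhattan) distance to the observed values.
# Assumption: ties go to the first minimum.
sketch_min_level <- function(x, lo, hi) {
  sapply(as.data.frame(x), function(col) {
    levels <- lo:hi
    cost <- sapply(levels, function(l) sum(abs(col - l), na.rm = TRUE))
    levels[which.min(cost)]
  })
}

# General-purpose optimisation of the whole centroid vector, with NAs
# dropped from the starting values and skipped in the distance sum.
sketch_optim_na <- function(x) {
  x <- as.matrix(x)
  cost <- function(p) sum(abs(sweep(x, 2, p)), na.rm = TRUE)
  start <- colMeans(x, na.rm = TRUE)
  optim(start, cost)$par
}

toy <- data.frame(A = c(1, 1, 2, NA), B = c(3, 4, 4, 4))
sketch_mode(toy)             # A = 1, B = 4 (no ties in this toy data)
sketch_min_level(toy, 1, 4)  # A = 1, B = 4
sketch_optim_na(toy)         # numeric approximation, values not shown
```

The first two sketches always return values that can occur as levels of the data, while the optimizer may return any real number, which is why its results are typically not exact integers.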

Examples

# Example: Mode as centroid
dat <- data.frame(A = rep(2:5, 2),
                  B = rep(1:4, 2),
                  C = rep(c(1, 2, 4, 5), 2))
centMode(dat) # every value occurs twice per column, so ties are broken randomly
#> A B C 
#> 5 4 1 
## within kcca
flexclust::kcca(dat, 3, family=kccaExtendedFamily('kModes')) #default centroid
#> kcca object of family ‘kModes’ 
#> 
#> call:
#> flexclust::kcca(x = dat, k = 3, family = kccaExtendedFamily("kModes"))
#> 
#> cluster sizes:
#> 
#> 1 2 3 
#> 4 2 2 
#> 

# Example: Centroid is level for which distance is minimal
centMin(dat, flexclust::distManhattan, xrange = 'all')
#> A B C 
#> 4 2 3 
## within kcca
flexclust::kcca(dat, 3,
                family=flexclust::kccaFamily(dist=flexclust::distManhattan,
                                             cent=\(y) centMin(y, flexclust::distManhattan,
                                                               xrange='all')))
#> kcca object of family ‘flexclust::distManhattan’ 
#> 
#> call:
#> flexclust::kcca(x = dat, k = 3, family = flexclust::kccaFamily(dist = flexclust::distManhattan, 
#>     cent = function(y) centMin(y, flexclust::distManhattan, xrange = "all")))
#> 
#> cluster sizes:
#> 
#> 2 3 
#> 2 6 
#> 
                             
# Example: Centroid calculated by general purpose optimizer with NA removal
## set roughly 10% of the cells to NA at random (no seed, so results vary)
nas <- sample(c(TRUE, FALSE), prod(dim(dat)),
              replace=TRUE, prob=c(0.1,0.9)) |> 
       matrix(nrow=nrow(dat))
dat[nas] <- NA
centOptimNA(dat, flexclust::distManhattan)
#>        A        B        C 
#> 4.000000 3.000000 4.000002 
## within kcca
flexclust::kcca(dat, 3, family=kccaExtendedFamily('kGower')) #default centroid
#> kcca object of family ‘kGower’ 
#> 
#> call:
#> flexclust::kcca(x = dat, k = 3, family = kccaExtendedFamily("kGower"))
#> 
#> cluster sizes:
#> 
#> 1 2 3 
#> 2 2 4 
#>