Distance Functions for K-Centroids Clustering of (Ordinal) Categorical/Mixed Data

Functions to calculate the distance between the rows of a matrix x and the rows of a matrix of centroids (centers), which can be used for K-centroids clustering via flexclust::kcca().

distSimMatch implements Simple Matching Distance (most frequently used for categorical or symmetric binary data) for K-centroids clustering.

distGower implements Gower's Distance after Gower (1971) and Kaufman & Rousseeuw (1990) for mixed-type data with missing values for K-centroids clustering.

distGDM2 implements GDM2 distance for ordinal data introduced by Walesiak (1993) and adapted to K-centroids clustering by Ernst et al. (2025).

These functions are designed for use with flexclust::kcca() or functions that are built upon it. Their use is easiest via the wrapper kccaExtendedFamily(). They can also be used to obtain a distance matrix of x; see Examples.

Usage

distGDM2(x, centers, genDist, xrange = NULL)

distGower(x, centers, genDist)

distSimMatch(x, centers)

Arguments

x

A numeric matrix or data frame.

centers

A numeric matrix with ncol(centers) equal to ncol(x) and nrow(centers) less than or equal to nrow(x).

genDist

Additional information on x required for distance calculation. Filled automatically if used within flexclust::kcca().

For distGower: A character vector of variable specific distances to be used with length equal to ncol(x). The following options are possible:
- distEuclidean: Euclidean distance between the scaled variables.
- distManhattan: absolute distance between the scaled variables.
- distJaccard: distance of 0 if both binary variables are equal to 1, and 1 otherwise.
- distSimMatch: Simple Matching Distance, i.e. whether the two values disagree (0 if they agree, 1 otherwise).
For distGDM2: Function creating a distance function that will be primed on x.
For distSimMatch: not used.

xrange

Range specification for the variables. Currently only used for distGDM2 (as distGower expects x to be already scaled). Possible values are:

- NULL (default): defaults to "all".
- "all": uses the same minimum and maximum value for each column of x by determining the whole range of values in the data object x.
- "columnwise": uses different minimum and maximum values for each column of x by determining the columnwise ranges of values in the data object x.
- A vector c(min, max): specifies the same minimum and maximum value for each column of x.
- A list of vectors list(c(min1, max1), c(min2, max2),...) with length ncol(x): specifies different minimum and maximum values for each column of x.
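
The four forms of xrange described above can be read as different ways of arriving at one c(min, max) pair per column. The following base R sketch illustrates that reading only; resolve_xrange is a hypothetical helper written for this illustration, not a function of the package, and the package's internal handling may differ.

```r
# Hypothetical helper (NOT part of the package): turn an xrange
# specification into a list with one c(min, max) pair per column of x.
resolve_xrange <- function(x, xrange = NULL) {
  x <- as.matrix(x)
  p <- ncol(x)
  if (is.null(xrange)) xrange <- "all"        # NULL defaults to "all"
  if (is.character(xrange)) {
    if (xrange == "all") {
      r <- range(x, na.rm = TRUE)             # one range over the whole object
      return(rep(list(r), p))
    }
    if (xrange == "columnwise") {             # one range per column
      return(lapply(seq_len(p), function(j) range(x[, j], na.rm = TRUE)))
    }
    stop("unknown xrange specification")
  }
  if (is.list(xrange)) {
    stopifnot(length(xrange) == p)            # one c(min, max) per column
    return(xrange)
  }
  stopifnot(is.numeric(xrange), length(xrange) == 2)
  rep(list(xrange), p)                        # same c(min, max) for all columns
}

x <- cbind(a = c(1, 2, 3), b = c(2, 4, 5))
resolve_xrange(x)                 # both columns: c(1, 5)
resolve_xrange(x, "columnwise")   # c(1, 3) for a, c(2, 5) for b
resolve_xrange(x, c(1, 6))        # both columns: c(1, 6)
```

Specifying the range explicitly (the last two forms) may be preferable when the data do not cover all possible levels of an ordinal variable.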

Value

A matrix of dimensions c(nrow(x), nrow(centers)) that contains the distance between each row of x and each row of centers.

Details

distSimMatch: The Simple Matching Distance between two observations is calculated as the proportion of disagreements across all variables; see, e.g., Kaufman & Rousseeuw (1990), p. 24. Used in K-centroids analysis in combination with mode centroids (as implemented in centMode), this results in the kModes algorithm. A wrapper for this algorithm is obtained with kccaExtendedFamily(which='kModes').
distGower: Distances are calculated for each column (Euclidean distance, distEuclidean, is recommended for numeric variables; Manhattan distance, distManhattan, for ordinal variables; Simple Matching Distance, distSimMatch, for categorical variables; and Jaccard distance, distJaccard, for asymmetric binary variables), and they are combined as a weighted average:

$$d(x_i, x_k) = \frac{\sum_{j=1}^p \delta_{ikj} d(x_{ij}, x_{kj})}{\sum_{j=1}^p \delta_{ikj}}$$

where $p$ is the number of variables and the weight $\delta_{ikj}$ is 1 if both values $x_{ij}$ and $x_{kj}$ are not missing and, in the case of asymmetric binary variables, at least one of them is not 0; otherwise it is 0. Note that calculating Gower's distance requires scaling of numeric/ordered variables (as done, for instance, by .ScaleVarSpecific). A wrapper for K-centroids analysis using Gower's distance in combination with a numerically optimized centroid is found in kccaExtendedFamily(which='kGower').
distGDM2: GDM2 distance for ordinal variables conducts only relational operations on the variables, such as $\leq$, $\geq$ and $=$. By translating $x$ to its relative frequencies and empirical cumulative distributions, we are able to extend this principle to compare two arbitrary values, and thus use it within K-centroids clustering. For more details, see Ernst et al. (2025). A wrapper for this algorithm in combination with a numerically optimized centroid is found in kccaExtendedFamily(which='kGDM2').
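
To make the weighting in the Gower formula given earlier in this section concrete, here is a minimal base R sketch for a single pair of observations. It is an illustration, not the package's implementation: gower_pair is a hypothetical name, the inputs are assumed to be numerically coded and already scaled, and only the distManhattan, distSimMatch and distJaccard options are covered (distEuclidean is left out on purpose).

```r
# Illustrative sketch (NOT the package's code): Gower's distance between
# two observations xi and xk, given one distance type per variable.
gower_pair <- function(xi, xk, types) {
  stopifnot(length(xi) == length(xk), length(xi) == length(types))
  d     <- numeric(length(xi))   # per-variable distances
  delta <- numeric(length(xi))   # per-variable weights (0 or 1)
  for (j in seq_along(xi)) {
    # weight 0 if either value is missing
    if (is.na(xi[j]) || is.na(xk[j])) next
    # weight 0 for asymmetric binary variables when both values are 0
    if (types[j] == "distJaccard" && xi[j] == 0 && xk[j] == 0) next
    delta[j] <- 1
    d[j] <- switch(types[j],
      distManhattan = abs(xi[j] - xk[j]),
      distSimMatch  = as.numeric(xi[j] != xk[j]),
      distJaccard   = as.numeric(!(xi[j] == 1 && xk[j] == 1)),
      stop("distance type not covered by this sketch"))
  }
  if (sum(delta) == 0) return(NA_real_)   # no comparable variable
  sum(delta * d) / sum(delta)
}

xi    <- c(0.2, 1, 1, NA)
xk    <- c(0.5, 2, 0, 3)
types <- c("distManhattan", "distSimMatch", "distJaccard", "distSimMatch")
gower_pair(xi, xk, types)   # (0.3 + 1 + 1) / 3; the fourth variable is skipped

# With categorical variables only and no missing values, the result is the
# proportion of disagreements, i.e. the Simple Matching Distance:
gower_pair(c(1, 2, 3, 1), c(1, 2, 4, 2), rep("distSimMatch", 4))   # 2 / 4
```

Because missing values only shrink the denominator, the remaining variables of a pair are upweighted rather than the pair being dropped.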

The distance functions presented here can also be used in clustering algorithms that rely on distance matrices (such as hierarchical clustering and PAM), if applied accordingly, see Examples.
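
As a rough sketch of the distance-matrix route, the following base R code passes the data as both arguments of a distance function, converts the square result with as.dist() and hands it to stats::hclust(). To keep the snippet self-contained, simple_matching is a hypothetical stand-in that computes the proportion of disagreements; it is not the package's distSimMatch.

```r
# Stand-in written for this illustration (NOT the package's distSimMatch):
# proportion of disagreements between each row of x and each row of centers.
simple_matching <- function(x, centers) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  stopifnot(ncol(x) == ncol(centers))
  out <- matrix(NA_real_, nrow(x), nrow(centers))
  for (k in seq_len(nrow(centers))) {
    # repeat center k once per row of x and compare elementwise
    ck <- matrix(centers[k, ], nrow(x), ncol(x), byrow = TRUE)
    out[, k] <- rowMeans(x != ck)
  }
  out
}

set.seed(1)
x <- matrix(sample(1:3, 40, replace = TRUE), nrow = 10)

# Using x for both arguments gives a square, symmetric matrix with a zero
# diagonal, which as.dist() turns into a "dist" object.
d  <- as.dist(simple_matching(x, x))
hc <- hclust(d, method = "average")
cl <- cutree(hc, k = 3)   # cluster membership for the 10 rows
```

The same d could be passed to a PAM implementation that accepts dissimilarities, for example cluster::pam(d, k = 3).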

References

Ernst, D, Ortega Menjivar, L, Scharl, T, Grün, B (2025). Ordinal Clustering with the flex-Scheme. Austrian Journal of Statistics. Submitted manuscript.
Gower, JC (1971). A General Coefficient for Similarity and Some of Its Properties. Biometrics, 27(4), 857-871. doi:10.2307/2528823
Kaufman, L, Rousseeuw, P (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics, New York: John Wiley & Sons. doi:10.1002/9780470316801
Leisch, F (2006). A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 51(2), 526-544. doi:10.1016/j.csda.2005.10.006
Walesiak, M (1993). Statystyczna Analiza Wielowymiarowa w Badaniach Marketingowych. Wydawnictwo Akademii Ekonomicznej, 44-46.
Weihs, C, Ligges, U, Luebke, K, Raabe, N (2005). klaR Analyzing German Business Cycles. In Baier D, Decker, R, Schmidt-Thieme, L (eds.). Data Analysis and Decision Support, 335-343. Berlin: Springer-Verlag. doi:10.1007/3-540-28397-8_36

Examples

library(flexord)

# Example 1: Simple Matching Distance
set.seed(123)
dat <- data.frame(question1 = factor(sample(LETTERS[1:4], 10, replace=TRUE)),
                  question2 = factor(sample(LETTERS[1:6], 10, replace=TRUE)),
                  question3 = factor(sample(LETTERS[1:4], 10, replace=TRUE)),
                  question4 = factor(sample(LETTERS[1:5], 10, replace=TRUE)),
                  state = factor(sample(state.name[1:10], 10, replace=TRUE)),
                  gender = factor(sample(c('M', 'F', 'N'), 10, replace=TRUE,
                                         prob=c(0.45, 0.45, 0.1))))
datmat <- data.matrix(dat)
initcenters <- datmat[sample(1:10, 3),]
distSimMatch(datmat, initcenters)
#>            [,1]      [,2]      [,3]
#>  [1,] 0.8333333 0.6666667 0.8333333
#>  [2,] 0.8333333 0.5000000 0.8333333
#>  [3,] 0.3333333 0.8333333 0.8333333
#>  [4,] 1.0000000 0.0000000 0.8333333
#>  [5,] 0.5000000 1.0000000 0.8333333
#>  [6,] 0.6666667 0.8333333 0.0000000
#>  [7,] 0.6666667 0.8333333 0.6666667
#>  [8,] 0.8333333 0.6666667 0.6666667
#>  [9,] 0.0000000 1.0000000 0.6666667
#> [10,] 0.6666667 0.6666667 0.8333333
## within kcca
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kModes'))
#> kcca object of family ‘kModes’ 
#> 
#> call:
#> flexclust::kcca(x = dat, k = 3, family = kccaExtendedFamily("kModes"))
#> 
#> cluster sizes:
#> 
#> 1 2 3 
#> 2 3 5 
#> 
## as a distance matrix
as.dist(distSimMatch(datmat, datmat))
#>            1         2         3         4         5         6         7
#> 2  0.6666667                                                            
#> 3  0.8333333 0.5000000                                                  
#> 4  0.6666667 0.5000000 0.8333333                                        
#> 5  0.8333333 0.8333333 0.5000000 1.0000000                              
#> 6  0.8333333 0.8333333 0.8333333 0.8333333 0.8333333                    
#> 7  1.0000000 1.0000000 0.8333333 0.8333333 0.8333333 0.6666667          
#> 8  0.8333333 0.8333333 1.0000000 0.6666667 0.6666667 0.6666667 0.8333333
#> 9  0.8333333 0.8333333 0.3333333 1.0000000 0.5000000 0.6666667 0.6666667
#> 10 1.0000000 0.8333333 0.5000000 0.6666667 0.8333333 0.8333333 0.8333333
#>            8         9
#> 2                     
#> 3                     
#> 4                     
#> 5                     
#> 6                     
#> 7                     
#> 8                     
#> 9  0.8333333          
#> 10 1.0000000 0.6666667

# Example 2: GDM2 distance
distGDM2(datmat, initcenters, genDist=flexord:::.projectIntofx)
#>            [,1]      [,2]      [,3]
#>  [1,] 0.4153611 0.3964902 0.3159825
#>  [2,] 0.4113284 0.2831168 0.4036075
#>  [3,] 0.1533025 0.4746327 0.4746327
#>  [4,] 0.3998748 0.0000000 0.3452381
#>  [5,] 0.2466527 0.4156565 0.2108224
#>  [6,] 0.3122652 0.3452381 0.0000000
#>  [7,] 0.2944484 0.3389847 0.3044814
#>  [8,] 0.3777210 0.3604274 0.2324858
#>  [9,] 0.0000000 0.3998748 0.3122652
#> [10,] 0.3763073 0.2058620 0.4294069
## within kcca
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kGDM2'))
#> kcca object of family ‘kGDM2’ 
#> 
#> call:
#> flexclust::kcca(x = dat, k = 3, family = kccaExtendedFamily("kGDM2"))
#> 
#> cluster sizes:
#> 
#> 1 2 3 
#> 3 3 4 
#> 
## as a distance matrix
as.dist(distGDM2(datmat, datmat, genDist=flexord:::.projectIntofx))
#>            1         2         3         4         5         6         7
#> 2  0.3021114                                                            
#> 3  0.5000000 0.2689267                                                  
#> 4  0.3964902 0.2831168 0.4746327                                        
#> 5  0.4301570 0.4268293 0.4358130 0.4156565                              
#> 6  0.3159825 0.4036075 0.4746327 0.3452381 0.2108224                    
#> 7  0.2777778 0.3370329 0.3161963 0.3389847 0.4650785 0.3044814          
#> 8  0.4101067 0.4058237 0.5371761 0.3604274 0.2056990 0.2324858 0.4550533
#> 9  0.4153611 0.4113284 0.1533025 0.3998748 0.2466527 0.3122652 0.2944484
#> 10 0.5681994 0.5357244 0.4498588 0.2058620 0.5357244 0.4294069 0.2840351
#>            8         9
#> 2                     
#> 3                     
#> 4                     
#> 5                     
#> 6                     
#> 7                     
#> 8                     
#> 9  0.3777210          
#> 10 0.5804651 0.3763073

# Example 3: Gower's distance
# Ex. 3.1: single variable type case with no missings:
xcls <- flexord:::.ChooseVarDists(datmat)
##all Euclidean (on dat, it would default to all Simple Matching)
datscld <- flexord:::.ScaleVarSpecific(datmat, xclass=xcls,
                                       xrange=list(c(1,4), c(1,6), c(1,4),
                                                   c(1,5), c(1,10), c(1,3)))
initcentscld <- datscld[sample(1:10, 3),]
distGower(datscld, initcentscld, genDist=xcls)
#>            [,1]       [,2]      [,3]
#>  [1,] 0.1924701 0.13551773 0.0000000
#>  [2,] 0.0000000 0.21943468 0.1924701
#>  [3,] 0.1547294 0.23839491 0.2338472
#>  [4,] 0.1805556 0.20309041 0.2096440
#>  [5,] 0.2004389 0.09676258 0.1731853
#>  [6,] 0.2194347 0.00000000 0.1355177
#>  [7,] 0.1501485 0.16377114 0.1598959
#>  [8,] 0.1739386 0.11462424 0.1630156
#>  [9,] 0.1924256 0.17720055 0.1895233
#> [10,] 0.2510694 0.22870370 0.2557031
## within kcca
flexclust::kcca(datmat, 3, kccaExtendedFamily('kGower'))
#> kcca object of family ‘kGower’ 
#> 
#> call:
#> flexclust::kcca(x = datmat, k = 3, family = kccaExtendedFamily("kGower"))
#> 
#> cluster sizes:
#> 
#> 1 2 3 
#> 3 2 5 
#> 
##turns into kmeans with scaling

# Ex. 3.2: single variable type case with missing values:
nas <- sample(c(TRUE,FALSE), prod(dim(dat)), replace=TRUE, prob=c(0.1,0.9)) |> 
   matrix(nrow=nrow(dat))
dat[nas] <- NA
#repeat the steps from above...or just do:
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower', cent=centMode))
#> kcca object of family ‘kGower’ 
#> 
#> call:
#> flexclust::kcca(x = dat, k = 3, family = kccaExtendedFamily("kGower", 
#>     cent = centMode))
#> 
#> cluster sizes:
#> 
#> 1 2 3 
#> 3 3 4 
#> 
##turns into kModes with upweighting of present values

#Ex. 3.3: mixed variable types (with or without missings): 
dat <- data.frame(cont = sample(1:100, 10, replace=TRUE)/10,
                  bin_sym = as.logical(sample(0:1, 10, replace=TRUE)),
                  bin_asym = as.logical(sample(0:1, 10, replace=TRUE)),                     
                  ord_levmis = factor(sample(1:5, 10, replace=TRUE),
                                      levels=1:6, ordered=TRUE),
                  ord_levfull = factor(sample(1:4, 10, replace=TRUE),
                                       levels=1:4, ordered=TRUE),
                  nom = factor(sample(letters[1:4], 10, replace=TRUE),
                               levels=letters[1:4]))
dat[nas] <- NA
xcls <- flexord:::.ChooseVarDists(dat)
datmat <- flexord:::.ScaleVarSpecific(data.matrix(dat), xclass=xcls,
                                      xrange='columnwise')
initcenters <- datmat[sample(1:10, 3),]
distGower(datmat, initcenters, genDist=xcls)                  
#>            [,1]      [,2]      [,3]
#>  [1,] 0.6137885 0.0000000 0.3360107
#>  [2,] 0.3004016 0.4745649 0.3670683
#>  [3,] 0.0000000 0.6137885 0.4666667
#>  [4,] 0.8078983 0.4170013 0.5856760
#>  [5,] 0.5528782 0.4390897 0.5528782
#>  [6,] 0.5349398 0.4431058 0.5349398
#>  [7,] 0.4444444 0.8888889 0.8888889
#>  [8,] 0.4666667 0.3360107 0.0000000
#>  [9,] 0.5686747 0.3045515 0.2353414
#> [10,] 0.5833333 0.8634538 0.9166667
## within kcca
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower'))
#> kcca object of family ‘kGower’ 
#> 
#> call:
#> flexclust::kcca(x = dat, k = 3, family = kccaExtendedFamily("kGower"))
#> 
#> cluster sizes:
#> 
#> 1 2 3 
#> 4 4 2 
#> 
## as a distance matrix
distGower(datmat, datmat, genDist=xcls) |> as.dist()
#>            1         2         3         4         5         6         7
#> 2  0.4745649                                                            
#> 3  0.6137885 0.3004016                                                  
#> 4  0.4170013 0.8353414 0.8078983                                        
#> 5  0.4390897 0.5803213 0.5528782 0.5783133                              
#> 6  0.4431058 0.5678715 0.5349398 0.6934404 0.4484605                    
#> 7  0.8888889 0.3333333 0.4444444 1.0000000 0.6666667 0.7777778          
#> 8  0.3360107 0.3670683 0.4666667 0.5856760 0.5528782 0.5349398 0.8888889
#> 9  0.3045515 0.4016064 0.5686747 0.7215529 0.5321285 0.3670683 0.7777778
#> 10 0.8634538 0.4578313 0.5833333 0.9548193 0.6698795 0.6646586 0.0000000
#>            8         9
#> 2                     
#> 3                     
#> 4                     
#> 5                     
#> 6                     
#> 7                     
#> 8                     
#> 9  0.2353414          
#> 10 0.9166667 0.6224900