R/distGDM2.R, R/distGower.R, R/distSimMatch.R
distances.Rd
Functions to calculate the distance between a matrix x and a matrix c, which can be used for K-centroids clustering via flexclust::kcca().
distSimMatch implements Simple Matching Distance (most frequently used for categorical or symmetric binary data) for K-centroids clustering.
distGower implements Gower's Distance after Gower (1971) and Kaufman & Rousseeuw (1990) for mixed-type data with missing values for K-centroids clustering.
distGDM2 implements GDM2 distance for ordinal data introduced by Walesiak (1993) and adapted to K-centroids clustering by Ernst et al. (2025).
These functions are designed for use with flexclust::kcca() or functions that are built upon it. Their use is easiest via the wrapper kccaExtendedFamily(). However, they can also easily be used to obtain a distance matrix of x, see Examples.
distGDM2(x, centers, genDist, xrange = NULL)
distGower(x, centers, genDist)
distSimMatch(x, centers)
x: A numeric matrix or data frame.
centers: A numeric matrix with ncol(centers) equal to ncol(x) and nrow(centers) less than or equal to nrow(x).
genDist: Additional information on x required for distance calculation. Filled automatically if used within flexclust::kcca().
For distGower: A character vector of variable-specific distances to be used, with length equal to ncol(x). The following options are possible:
distEuclidean: Euclidean distance between the scaled variables.
distManhattan: absolute distance between the scaled variables.
distJaccard: a distance of 0 if both binary variables are equal to 1, and 1 otherwise.
distSimMatch: Simple Matching Distance, i.e. the number of disagreements between variables.
For distGDM2: Function creating a distance function that will be primed on x.
For distSimMatch: not used.
xrange: Range specification for the variables. Currently only used for distGDM2 (as distGower expects x to be already scaled). Possible values are:
NULL (default): defaults to "all".
"all": uses the same minimum and maximum value for each column of x by determining the whole range of values in the data object x.
"columnwise": uses different minimum and maximum values for each column of x by determining the columnwise ranges of values in the data object x.
A vector of c(min, max): specifies the same minimum and maximum value for each column of x.
A list of vectors list(c(min1, max1), c(min2, max2), ...) with length ncol(x): specifies different minimum and maximum values for each column of x.
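The accepted forms of the range specification can be pictured with a small base-R sketch. The helper resolve_range() below is hypothetical and not part of flexord; it only illustrates how each option could be mapped to one minimum and one maximum per column, and the package's internal handling may differ.

```r
# Hypothetical helper (not a flexord function): turn an xrange specification
# into a 2 x ncol(x) matrix with the minimum in row 1 and the maximum in row 2.
resolve_range <- function(x, xrange = NULL) {
  x <- as.matrix(x)
  if (is.null(xrange)) xrange <- "all"          # NULL defaults to "all"
  if (is.character(xrange)) {
    if (xrange == "all") {
      # one global range, repeated for every column
      return(matrix(range(x, na.rm = TRUE), nrow = 2, ncol = ncol(x)))
    }
    if (xrange == "columnwise") {
      # a separate range for every column
      return(apply(x, 2, range, na.rm = TRUE))
    }
    stop("unknown xrange specification")
  }
  if (is.list(xrange)) {
    # list(c(min1, max1), c(min2, max2), ...): one entry per column
    stopifnot(length(xrange) == ncol(x))
    return(do.call(cbind, xrange))
  }
  # c(min, max): the same user-supplied range for every column
  matrix(xrange, nrow = 2, ncol = ncol(x))
}

x <- cbind(c(1, 2, 3), c(2, 4, 5))
r_all  <- resolve_range(x)                          # same as "all"
r_col  <- resolve_range(x, "columnwise")
r_vec  <- resolve_range(x, c(1, 6))
r_list <- resolve_range(x, list(c(1, 4), c(1, 6)))
```

With "all", both columns share the range 1 to 5, whereas "columnwise" yields 1 to 3 for the first column and 2 to 5 for the second.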
A matrix of dimensions c(nrow(x), nrow(centers)) that contains the distance between each row of x and each row of centers.
distSimMatch: Simple Matching Distance between two observations is calculated as the proportion of disagreements across all variables. Described, e.g., in Kaufman & Rousseeuw (1990), p. 24. If this is used in K-centroids analysis in combination with mode centroids (as implemented in centMode), this results in the kModes algorithm. A wrapper for this algorithm is obtained with kccaExtendedFamily(which='kModes').
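The "proportion of disagreements" can be made concrete with a minimal base-R sketch. The function simple_matching() is a hypothetical stand-in written for illustration, not the flexord implementation; it assumes complete, integer-coded data.

```r
# Hypothetical helper (not a flexord function): for every row of x and every
# row of centers, return the share of columns in which the two rows differ.
simple_matching <- function(x, centers) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  stopifnot(ncol(x) == ncol(centers))
  d <- matrix(NA_real_, nrow(x), nrow(centers))
  for (k in seq_len(nrow(centers))) {
    # repeat center k once per row of x, then compare cell by cell
    center_rep <- matrix(centers[k, ], nrow(x), ncol(x), byrow = TRUE)
    d[, k] <- rowMeans(x != center_rep)
  }
  d
}

x <- rbind(c(1, 2, 3),
           c(1, 1, 3),
           c(2, 2, 1))
centers <- rbind(c(1, 2, 3),
                 c(2, 2, 2))
d <- simple_matching(x, centers)
```

The first row of x equals the first center, so its distance is 0; it differs from the second center in two of three columns, giving 2/3.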
distGower: Distances are calculated for each column (Euclidean distance, distEuclidean, is recommended for numeric, Manhattan distance, distManhattan, for ordinal, Simple Matching Distance, distSimMatch, for categorical, and Jaccard distance, distJaccard, for asymmetric binary variables), and they are combined as a weighted average:
$$d(x_i, x_k) = \frac{\sum_{j=1}^p \delta_{ikj} d(x_{ij}, x_{kj})}{\sum_{j=1}^p \delta_{ikj}}$$
where \(p\) is the number of variables and the weight \(\delta_{ikj}\) is 1 if both values \(x_{ij}\) and \(x_{kj}\) are not missing and, in the case of asymmetric binary variables, at least one of them is not 0; otherwise it is 0. Please note that for calculating Gower's distance, scaling of numeric/ordered variables is required (e.g. by .ScaleVarSpecific). A wrapper for K-centroids analysis using Gower's distance in combination with a numerically optimized centroid is found in kccaExtendedFamily(which='kGower').
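The role of the weights can be traced with a small base-R sketch for a single pair of observations. The function gower_pair() is hypothetical and not the flexord implementation; as a simplifying assumption it uses the absolute difference for all scaled numeric or ordinal variables and a 0/1 mismatch for the two categorical options.

```r
# Hypothetical helper (not a flexord function).
# xi, xk: already scaled values of two observations (NA marks a missing value)
# type:   per-variable distance label, mirroring the genDist options
gower_pair <- function(xi, xk, type) {
  # per-variable distances d(x_ij, x_kj)
  d <- ifelse(type %in% c("distSimMatch", "distJaccard"),
              as.numeric(xi != xk),   # 0 if equal, 1 otherwise
              abs(xi - xk))           # assumption: absolute difference
  # weight is 0 as soon as one of the two values is missing
  delta <- as.numeric(!is.na(xi) & !is.na(xk))
  # asymmetric binary variables: weight is 0 as well if both values are 0
  both_zero <- type == "distJaccard" & !is.na(xi) & !is.na(xk) &
    xi == 0 & xk == 0
  delta[both_zero] <- 0
  # weighted average; NaN if no variable carries weight
  sum(delta * d, na.rm = TRUE) / sum(delta)
}

xi   <- c(0.2, 1, 0, NA)
xk   <- c(0.5, 0, 0, 1)
type <- c("distManhattan", "distSimMatch", "distJaccard", "distManhattan")
g <- gower_pair(xi, xk, type)
```

Here only the first two variables carry weight: the third is an asymmetric binary variable with two zeros and the fourth has a missing value, so the result is (0.3 + 1) / 2 = 0.65.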
distGDM2: GDM2 distance for ordinal variables conducts only relational operations on the variables, such as \(\leq\), \(\geq\) and \(=\). By translating \(x\) to its relative frequencies and empirical cumulative distributions, we are able to extend this principle to compare two arbitrary values, and thus use it within K-centroids clustering. For more details, see Ernst et al. (2025). A wrapper for this algorithm in combination with a numerically optimized centroid is found in kccaExtendedFamily(which='kGDM2').
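The two ingredients named above, relative frequencies and the empirical cumulative distribution, can be computed for one ordinal variable in base R as follows. This sketch shows only these ingredients, not the GDM2 distance itself; the variable names and the assumed level set 1:5 are illustrative.

```r
# One ordinal variable, coded as integers on an assumed scale from 1 to 5
x   <- c(1, 2, 2, 3, 3, 3, 5)
lev <- 1:5

# relative frequency of every level (levels that do not occur get 0)
rel_freq <- as.numeric(table(factor(x, levels = lev))) / length(x)

# empirical cumulative distribution evaluated at every level
ecdf_vals <- cumsum(rel_freq)
```

Level 4 does not occur in x, so its relative frequency is 0 and the cumulative distribution stays at 6/7 between levels 3 and 4.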
The distance functions presented here can also be used in clustering algorithms that rely on distance matrices (such as hierarchical clustering and PAM), if applied accordingly, see Examples.
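As a rough sketch of this workflow, the following base-R code feeds a square distance matrix into stats::hclust(). To stay self-contained, the matrix dm is computed by hand as a stand-in for a call such as distSimMatch(datmat, datmat); the toy data, the average linkage and the choice of three groups are assumptions made for illustration.

```r
set.seed(1)
# toy categorical data, integer-coded: 10 observations, 4 variables
datmat <- matrix(sample(1:3, 40, replace = TRUE), nrow = 10)
n <- nrow(datmat)

# stand-in for distSimMatch(datmat, datmat):
# column k holds the share of disagreeing variables between row k and all rows
dm <- sapply(seq_len(n), function(k) {
  rowMeans(datmat != matrix(datmat[k, ], n, ncol(datmat), byrow = TRUE))
})

# convert the square matrix to a dist object and cluster hierarchically
hc <- stats::hclust(stats::as.dist(dm), method = "average")
groups <- stats::cutree(hc, k = 3)
```

The same dist object could be passed to other algorithms that accept precomputed dissimilarities.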
Ernst, D, Ortega Menjivar, L, Scharl, T, Grün, B (2025). Ordinal Clustering with the flex-Scheme. Austrian Journal of Statistics. Submitted manuscript.
Gower, JC (1971). A General Coefficient for Similarity and Some of Its Properties. Biometrics, 27(4), 857-871. doi:10.2307/2528823
Kaufman, L, Rousseeuw, P (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics, New York: John Wiley & Sons. doi:10.1002/9780470316801
Leisch, F (2006). A Toolbox for K-Centroids Cluster Analysis. Computational Statistics and Data Analysis, 51(2), 526-544. doi:10.1016/j.csda.2005.10.006
Walesiak, M (1993). Statystyczna Analiza Wielowymiarowa w Badaniach Marketingowych. Wydawnictwo Akademii Ekonomicznej, 44-46.
Weihs, C, Ligges, U, Luebke, K, Raabe, N (2005). klaR Analyzing German Business Cycles. In Baier D, Decker, R, Schmidt-Thieme, L (eds.). Data Analysis and Decision Support, 335-343. Berlin: Springer-Verlag. doi:10.1007/3-540-28397-8_36
# Example 1: Simple Matching Distance
set.seed(123)
dat <- data.frame(question1 = factor(sample(LETTERS[1:4], 10, replace=TRUE)),
                  question2 = factor(sample(LETTERS[1:6], 10, replace=TRUE)),
                  question3 = factor(sample(LETTERS[1:4], 10, replace=TRUE)),
                  question4 = factor(sample(LETTERS[1:5], 10, replace=TRUE)),
                  state = factor(sample(state.name[1:10], 10, replace=TRUE)),
                  gender = factor(sample(c('M', 'F', 'N'), 10, replace=TRUE,
                                         prob=c(0.45, 0.45, 0.1))))
datmat <- data.matrix(dat)
initcenters <- datmat[sample(1:10, 3),]
distSimMatch(datmat, initcenters)
#> [,1] [,2] [,3]
#> [1,] 0.8333333 0.6666667 0.8333333
#> [2,] 0.8333333 0.5000000 0.8333333
#> [3,] 0.3333333 0.8333333 0.8333333
#> [4,] 1.0000000 0.0000000 0.8333333
#> [5,] 0.5000000 1.0000000 0.8333333
#> [6,] 0.6666667 0.8333333 0.0000000
#> [7,] 0.6666667 0.8333333 0.6666667
#> [8,] 0.8333333 0.6666667 0.6666667
#> [9,] 0.0000000 1.0000000 0.6666667
#> [10,] 0.6666667 0.6666667 0.8333333
## within kcca
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kModes'))
#> kcca object of family ‘kModes’
#>
#> call:
#> flexclust::kcca(x = dat, k = 3, family = kccaExtendedFamily("kModes"))
#>
#> cluster sizes:
#>
#> 1 2 3
#> 2 3 5
#>
## as a distance matrix
as.dist(distSimMatch(datmat, datmat))
#> 1 2 3 4 5 6 7
#> 2 0.6666667
#> 3 0.8333333 0.5000000
#> 4 0.6666667 0.5000000 0.8333333
#> 5 0.8333333 0.8333333 0.5000000 1.0000000
#> 6 0.8333333 0.8333333 0.8333333 0.8333333 0.8333333
#> 7 1.0000000 1.0000000 0.8333333 0.8333333 0.8333333 0.6666667
#> 8 0.8333333 0.8333333 1.0000000 0.6666667 0.6666667 0.6666667 0.8333333
#> 9 0.8333333 0.8333333 0.3333333 1.0000000 0.5000000 0.6666667 0.6666667
#> 10 1.0000000 0.8333333 0.5000000 0.6666667 0.8333333 0.8333333 0.8333333
#> 8 9
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7
#> 8
#> 9 0.8333333
#> 10 1.0000000 0.6666667
# Example 2: GDM2 distance
distGDM2(datmat, initcenters, genDist=flexord:::.projectIntofx)
#> [,1] [,2] [,3]
#> [1,] 0.4153611 0.3964902 0.3159825
#> [2,] 0.4113284 0.2831168 0.4036075
#> [3,] 0.1533025 0.4746327 0.4746327
#> [4,] 0.3998748 0.0000000 0.3452381
#> [5,] 0.2466527 0.4156565 0.2108224
#> [6,] 0.3122652 0.3452381 0.0000000
#> [7,] 0.2944484 0.3389847 0.3044814
#> [8,] 0.3777210 0.3604274 0.2324858
#> [9,] 0.0000000 0.3998748 0.3122652
#> [10,] 0.3763073 0.2058620 0.4294069
## within kcca
flexclust::kcca(dat, k=3, family=kccaExtendedFamily('kGDM2'))
#> kcca object of family ‘kGDM2’
#>
#> call:
#> flexclust::kcca(x = dat, k = 3, family = kccaExtendedFamily("kGDM2"))
#>
#> cluster sizes:
#>
#> 1 2 3
#> 3 3 4
#>
## as a distance matrix
as.dist(distGDM2(datmat, datmat, genDist=flexord:::.projectIntofx))
#> 1 2 3 4 5 6 7
#> 2 0.3021114
#> 3 0.5000000 0.2689267
#> 4 0.3964902 0.2831168 0.4746327
#> 5 0.4301570 0.4268293 0.4358130 0.4156565
#> 6 0.3159825 0.4036075 0.4746327 0.3452381 0.2108224
#> 7 0.2777778 0.3370329 0.3161963 0.3389847 0.4650785 0.3044814
#> 8 0.4101067 0.4058237 0.5371761 0.3604274 0.2056990 0.2324858 0.4550533
#> 9 0.4153611 0.4113284 0.1533025 0.3998748 0.2466527 0.3122652 0.2944484
#> 10 0.5681994 0.5357244 0.4498588 0.2058620 0.5357244 0.4294069 0.2840351
#> 8 9
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7
#> 8
#> 9 0.3777210
#> 10 0.5804651 0.3763073
# Example 3: Gower's distance
# Ex. 3.1: single variable type case with no missings:
xcls <- flexord:::.ChooseVarDists(datmat)
##all Euclidean (on dat, it would default to all Simple Matching)
datscld <- flexord:::.ScaleVarSpecific(datmat, xclass=xcls,
                                       xrange=list(c(1,4), c(1,6), c(1,4),
                                                   c(1,5), c(1,10), c(1,3)))
initcentscld <- datscld[sample(1:10, 3),]
distGower(datscld, initcentscld, genDist=xcls)
#> [,1] [,2] [,3]
#> [1,] 0.1924701 0.13551773 0.0000000
#> [2,] 0.0000000 0.21943468 0.1924701
#> [3,] 0.1547294 0.23839491 0.2338472
#> [4,] 0.1805556 0.20309041 0.2096440
#> [5,] 0.2004389 0.09676258 0.1731853
#> [6,] 0.2194347 0.00000000 0.1355177
#> [7,] 0.1501485 0.16377114 0.1598959
#> [8,] 0.1739386 0.11462424 0.1630156
#> [9,] 0.1924256 0.17720055 0.1895233
#> [10,] 0.2510694 0.22870370 0.2557031
## within kcca
flexclust::kcca(datmat, 3, kccaExtendedFamily('kGower'))
#> kcca object of family ‘kGower’
#>
#> call:
#> flexclust::kcca(x = datmat, k = 3, family = kccaExtendedFamily("kGower"))
#>
#> cluster sizes:
#>
#> 1 2 3
#> 3 2 5
#>
##turns into kmeans with scaling
# Ex. 3.2: single variable type case with missing values:
nas <- sample(c(TRUE,FALSE), prod(dim(dat)), replace=TRUE, prob=c(0.1,0.9)) |>
  matrix(nrow=nrow(dat))
dat[nas] <- NA
#repeat the steps from above...or just do:
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower', cent=centMode))
#> kcca object of family ‘kGower’
#>
#> call:
#> flexclust::kcca(x = dat, k = 3, family = kccaExtendedFamily("kGower",
#> cent = centMode))
#>
#> cluster sizes:
#>
#> 1 2 3
#> 3 3 4
#>
##turns into kModes with upweighting of present values
#Ex. 3.3: mixed variable types (with or without missings):
dat <- data.frame(cont = sample(1:100, 10, replace=TRUE)/10,
                  bin_sym = as.logical(sample(0:1, 10, replace=TRUE)),
                  bin_asym = as.logical(sample(0:1, 10, replace=TRUE)),
                  ord_levmis = factor(sample(1:5, 10, replace=TRUE),
                                      levels=1:6, ordered=TRUE),
                  ord_levfull = factor(sample(1:4, 10, replace=TRUE),
                                       levels=1:4, ordered=TRUE),
                  nom = factor(sample(letters[1:4], 10, replace=TRUE),
                               levels=letters[1:4]))
dat[nas] <- NA
xcls <- flexord:::.ChooseVarDists(dat)
datmat <- flexord:::.ScaleVarSpecific(data.matrix(dat), xclass=xcls,
                                      xrange='columnwise')
initcenters <- datmat[sample(1:10, 3),]
distGower(datmat, initcenters, genDist=xcls)
#> [,1] [,2] [,3]
#> [1,] 0.6137885 0.0000000 0.3360107
#> [2,] 0.3004016 0.4745649 0.3670683
#> [3,] 0.0000000 0.6137885 0.4666667
#> [4,] 0.8078983 0.4170013 0.5856760
#> [5,] 0.5528782 0.4390897 0.5528782
#> [6,] 0.5349398 0.4431058 0.5349398
#> [7,] 0.4444444 0.8888889 0.8888889
#> [8,] 0.4666667 0.3360107 0.0000000
#> [9,] 0.5686747 0.3045515 0.2353414
#> [10,] 0.5833333 0.8634538 0.9166667
## within kcca
flexclust::kcca(dat, 3, kccaExtendedFamily('kGower'))
#> kcca object of family ‘kGower’
#>
#> call:
#> flexclust::kcca(x = dat, k = 3, family = kccaExtendedFamily("kGower"))
#>
#> cluster sizes:
#>
#> 1 2 3
#> 4 4 2
#>
## as a distance matrix
distGower(datmat, datmat, genDist=xcls) |> as.dist()
#> 1 2 3 4 5 6 7
#> 2 0.4745649
#> 3 0.6137885 0.3004016
#> 4 0.4170013 0.8353414 0.8078983
#> 5 0.4390897 0.5803213 0.5528782 0.5783133
#> 6 0.4431058 0.5678715 0.5349398 0.6934404 0.4484605
#> 7 0.8888889 0.3333333 0.4444444 1.0000000 0.6666667 0.7777778
#> 8 0.3360107 0.3670683 0.4666667 0.5856760 0.5528782 0.5349398 0.8888889
#> 9 0.3045515 0.4016064 0.5686747 0.7215529 0.5321285 0.3670683 0.7777778
#> 10 0.8634538 0.4578313 0.5833333 0.9548193 0.6698795 0.6646586 0.0000000
#> 8 9
#> 2
#> 3
#> 4
#> 5
#> 6
#> 7
#> 8
#> 9 0.2353414
#> 10 0.9166667 0.6224900