FlexMix Driver for Regularized Multinomial Mixtures

This model driver can be used to cluster data using a multinomial distribution.

FLXMCregmultinom(formula = . ~ ., r, alpha = 0)

Arguments

formula: A formula which is interpreted relative to the formula specified in the call to flexmix::flexmix() using stats::update.formula(). Only the left-hand side (response) of the formula is used. Default is to use the original model formula specified in flexmix::flexmix().
r: Number of different categories. Values are assumed to be integers in 1:r.
alpha: A non-negative scalar acting as regularization parameter. Can be regarded as adding alpha observations equal to the population mean to each component.
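
Because values are assumed to be integers in 1:r, data stored as zero-based counts or as factors need recoding before fitting. The following is a minimal base-R sketch of such a recoding; the objects `counts` and `answers` are made up for illustration and are not part of the package.

```r
# Hypothetical zero-based counts (e.g. successes out of 3 trials)
counts <- c(0L, 2L, 3L, 1L, 0L)
x1 <- counts + 1L                 # shift 0:3 to 1:4
r1 <- 4L                          # number of categories for this variable

# Hypothetical factor-coded answers
answers <- factor(c("no", "yes", "maybe", "yes"),
                  levels = c("no", "maybe", "yes"))
x2 <- as.integer(answers)         # level order defines the codes 1, 2, 3
r2 <- nlevels(answers)

# Sanity check: every value must lie in 1:r
stopifnot(all(x1 %in% seq_len(r1)), all(x2 %in% seq_len(r2)))
```

The Examples section uses the same shift by one for binomial counts.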

Value

An object of class "FLXC".

Details

Setting the regularization parameter alpha to a value greater than zero acts like adding alpha observations that conform to the population mean to each component. This can be used to avoid degenerate solutions. It also makes the clusters more similar to each other the larger alpha is chosen; for small values of alpha this effect is mostly negligible.
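
The shrinkage effect can be illustrated with a small base-R sketch. It assumes that the regularized estimate has the pseudo-observation form (n_j + alpha * p_j) / (n + alpha), which matches the description above but is not taken from the package source. `reg_probs` is a hypothetical helper, not a function of flexord.

```r
# Hypothetical helper: regularized category probabilities for one variable
# in one component (assumed pseudo-observation form, not the package code).
#   x      integer values in 1:r belonging to the component
#   r      number of categories
#   alpha  number of pseudo-observations
#   p_pop  population (marginal) category probabilities
reg_probs <- function(x, r, alpha, p_pop) {
    n_j <- tabulate(x, nbins = r)            # observed counts per category
    (n_j + alpha * p_pop) / (length(x) + alpha)
}

p_pop  <- c(0.5, 0.3, 0.2)     # marginal distribution over all observations
x_comp <- c(1L, 1L, 1L, 2L)    # category 3 never occurs in this component

reg_probs(x_comp, r = 3, alpha = 0, p_pop)    # unregularized: 0.75 0.25 0.00
reg_probs(x_comp, r = 3, alpha = 1, p_pop)    # 0.70 0.26 0.04
reg_probs(x_comp, r = 3, alpha = 100, p_pop)  # close to p_pop
```

With alpha = 0 the unobserved category gets probability zero, which is the kind of boundary estimate the regularization is meant to avoid. With a very large alpha the estimate is dominated by the population mean, so all components would look alike.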

For regularization, the MAP estimates of the multinomial parameters are computed using a Dirichlet prior, which is the conjugate prior of the multinomial distribution. The parameters of this prior are selected to correspond to the marginal distribution of the variable across all observations.
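
One way to write this down is sketched here. It assumes a Dirichlet prior with parameters a_j = alpha * p_j + 1, where p_j is the marginal proportion of category j across all observations. The symbols n_j (the posterior-weighted count of category j in the component) and n (the sum of the n_j) are introduced for illustration only, and the exact parametrization used by the package may differ.

```latex
\hat{\theta}_j^{\mathrm{MAP}}
  = \frac{n_j + a_j - 1}{n + \sum_{l=1}^{r} (a_l - 1)}
  = \frac{n_j + \alpha p_j}{n + \alpha},
\qquad a_j = \alpha p_j + 1, \quad \sum_{j=1}^{r} p_j = 1 .
```

Under this assumption, alpha = 0 reduces to the maximum likelihood estimate n_j / n, and the alpha pseudo-observations are spread over the categories in proportion to the population mean.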

References

Galindo Garre, F, Vermunt, JK (2006). Avoiding Boundary Estimates in Latent Class Analysis by Bayesian Posterior Mode Estimation. Behaviormetrika, 33, 43-59.
Ernst, D, Ortega Menjivar, L, Scharl, T, Grün, B (2025). Ordinal Clustering with the flex-Scheme. Austrian Journal of Statistics. Submitted manuscript.

Examples

library("flexmix")
library("flexord")
library("flexclust")


set.seed(0xdeaf)

# Simulation settings
k <- 4      # nr of clusters
nvar <- 10  # nr of variables
size <- sample(1:6, size=nvar, replace=TRUE)  # nr of trials per variable
N <- 100    # obs. per cluster


# random probabilities per component
probs <- lapply(seq_len(k), \(ki) runif(nvar, 0.01, 0.99))

# sample data
dat <- lapply(probs, \(p) {
    mapply(\(p_i, size_i) {
        rbinom(N, size_i, p_i)
    }, p, size, SIMPLIFY=FALSE) |> do.call(cbind, args=_)
}) |> do.call(rbind, args=_)

true_clusters <- rep(seq_len(k), each=N)

# The sample data are drawn from binomial distributions but we fit
# multinomial distributions, meaning the model is mis-specified.
# Note that the multinomial driver expects values to lie inside
# 1:(size+1), hence we add 1 to the data.

# Cluster without regularization
m1 <- stepFlexmix((dat+1L)~1, model=FLXMCregmultinom(r=size+1L, alpha=0), k=k)
#> 4 : * * *

# Cluster with regularization
m2 <- stepFlexmix((dat+1L)~1, model=FLXMCregmultinom(r=size+1L, alpha=1), k=k)
#> 4 : * * *

# Both models reconstruct the true clusters almost perfectly (ARI ~ 0.99);
# it's a very easy clustering problem.
# A small amount of regularization does not seem to affect the ARI much.
randIndex(clusters(m1), true_clusters)
#>       ARI 
#> 0.9933165 
randIndex(clusters(m2), true_clusters)
#>       ARI 
#> 0.9867009