| Title: | Implementation of the DISCO Metric for Internal Clustering Evaluation |
|---|---|
| Description: | Implementation of the DISCO (Density-based Internal Score for Clusterings with nOise) metric, a cluster validity index for evaluating density-based clustering results without ground truth labels. DISCO is the first index to explicitly assess the quality of noise point assignments in addition to cluster quality. It uses density-connectivity distance derived from a minimum spanning tree of the mutual-reachability graph, providing interpretable, bounded scores in [-1, 1]. Higher scores indicate better clustering. Based on Beer et al. (2025) <doi:10.48550/arXiv.2503.00127>. |
| Authors: | Amin Entezari [aut, cre] (ORCID: <https://orcid.org/0009-0005-4733-6032>), Davide Chicco [ctb] (ORCID: <https://orcid.org/0000-0001-9655-7142>, Supervision and guidance), Anna Beer [aut] (Original Python implementation (arXiv:2503.00127)), Lena Krieger [aut] (Original Python implementation (arXiv:2503.00127)), Pascal Weber [aut] (Original Python implementation (arXiv:2503.00127)) |
| Maintainer: | Amin Entezari <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.1 |
| Built: | 2026-06-05 08:58:48 UTC |
| Source: | https://github.com/aminentezari/discocvi |
The main entry point for constructing the DC-distance matrix used throughout the DISCO metric. The procedure is:
Compute mutual-reachability distances
(calculate_reachability_distance).
Build the minimum spanning tree of those distances
(get_mst_edges).
Extract pairwise DC distances as minimax-path weights in the MST
(extract_dc_distances_from_mst).
compute_dc_distances(X, min_points = 5)compute_dc_distances(X, min_points = 5)
X |
A numeric matrix of shape |
min_points |
A positive integer for the core-distance neighbourhood
size. Default is |
A symmetric numeric matrix of DC distances.
set.seed(1) X <- matrix(rnorm(40), ncol = 2) D <- compute_dc_distances(X, min_points = 5) dim(D) # 20 x 20set.seed(1) X <- matrix(rnorm(40), ncol = 2) D <- compute_dc_distances(X, min_points = 5) dim(D) # 20 x 20
Computes a pointwise DISCO score for every observation in X. The
score lies in :
Values close to indicate a well-placed point (either
tightly within its cluster or a noise point far from all clusters).
Values close to indicate borderline placement.
Values close to indicate a misplaced point (or a noise
point that should belong to a cluster).
Four cases are handled:
All noise — every point labelled -1: all scores
are -1.
Single cluster with no noise — all scores are 0.
One real cluster + noise — cluster points are scored via
p_cluster; noise points are scored via
p_noise.
Two or more clusters (optional noise) — non-noise points
use p_cluster on the non-noise sub-matrix; noise points
use p_noise.
disco_samples(X, labels, min_points = 5)disco_samples(X, labels, min_points = 5)
X |
A numeric matrix ( |
labels |
An integer vector of length |
min_points |
A positive integer for the core-distance neighbourhood
size. Default is |
A numeric vector of length with per-point DISCO scores.
disco_score, p_cluster,
p_noise
set.seed(42) X <- matrix(rnorm(100), ncol = 2) labels <- rep(c(0L, 1L), each = 25) s <- disco_samples(X, labels) hist(s, main = "Per-sample DISCO scores")set.seed(42) X <- matrix(rnorm(100), ncol = 2) labels <- rep(c(0L, 1L), each = 25) s <- disco_samples(X, labels) hist(s, main = "Per-sample DISCO scores")
Returns the mean of the per-sample DISCO scores produced by
disco_samples. This is the single-number summary of
clustering quality used for model selection and comparison.
disco_score(X, labels, min_points = 5)disco_score(X, labels, min_points = 5)
X |
A numeric matrix ( |
labels |
An integer vector of cluster labels ( |
min_points |
Positive integer; core-distance neighbourhood size.
Default is |
A single numeric value in .
set.seed(42) X <- matrix(rnorm(100), ncol = 2) labels <- rep(c(0L, 1L), each = 25) disco_score(X, labels)set.seed(42) X <- matrix(rnorm(100), ncol = 2) labels <- rep(c(0L, 1L), each = 25) disco_score(X, labels)
Computes a silhouette coefficient for each non-noise point using the
precomputed (or internally derived) DC-distance matrix rather than raw
Euclidean distances. This matches the formula used by
sklearn.metrics.silhouette_samples with metric="precomputed"
(Python reference: disco.py, line 280):
where is the mean DC distance from to all other points
in the same cluster and is the minimum over other clusters of
the mean DC distance from to that cluster.
p_cluster(X, labels, min_points = 5, precomputed_dc_dists = FALSE)p_cluster(X, labels, min_points = 5, precomputed_dc_dists = FALSE)
X |
Either (a) a numeric matrix of raw data ( |
labels |
An integer vector of cluster labels of length |
min_points |
Positive integer; used only when
|
precomputed_dc_dists |
Logical. If |
A numeric vector of length with per-point silhouette-style
scores in .
disco_samples, compute_dc_distances
set.seed(7) X <- matrix(rnorm(60), ncol = 2) labels <- rep(c(0L, 1L, 2L), each = 10) p_cluster(X, labels)set.seed(7) X <- matrix(rnorm(60), ncol = 2) labels <- rep(c(0L, 1L, 2L), each = 10) p_cluster(X, labels)
Assigns a quality score to each point labelled as noise (-1). Two
complementary sub-scores are computed and the element-wise minimum is taken
as the final score:
Measures whether the noise point is sparser
(more peripheral in density) than the densest part of every real cluster.
For cluster , let be the maximum core distance over
all points in . Then
Measures whether the noise point is far from
every cluster in DC-distance space. For cluster , let
be the minimum DC distance from to any
point in . Then
Both sub-scores lie in : positive means the noise point is
legitimately sparse/far; negative means it is denser/closer than the cluster
boundary and might be a misclassified inlier.
p_noise(X, labels, min_points = 5, dc_dists = NULL)p_noise(X, labels, min_points = 5, dc_dists = NULL)
X |
A numeric matrix ( |
labels |
An integer vector of length |
min_points |
Positive integer |
dc_dists |
Optional precomputed |
A named list with two numeric vectors, each of length equal to the number of noise points:
Sparsity-based noise scores.
DC-distance-based noise scores.
set.seed(3) X <- matrix(rnorm(60), ncol = 2) # 30 rows x 2 cols labels <- c(rep(0L, 10), rep(1L, 10), rep(-1L, 10)) # 30 labels nr <- p_noise(X, labels, min_points = 3) str(nr)set.seed(3) X <- matrix(rnorm(60), ncol = 2) # 30 rows x 2 cols labels <- c(rep(0L, 10), rep(1L, 10), rep(-1L, 10)) # 30 labels nr <- p_noise(X, labels, min_points = 3) str(nr)