clustering_metrics.metrics module¶
-
class
clustering_metrics.metrics.
ClusteringMetrics
(*args, **kwargs)[source]¶ Bases:
clustering_metrics.metrics.ContingencyTable
Provides external clustering evaluation metrics
A subclass of ContingencyTable that builds a pairwise co-association matrix for clustering comparisons.
>>> Y1 = {(1, 2, 3), (4, 5, 6)}
>>> Y2 = {(1, 2), (3, 4, 5), (6,)}
>>> cm = ClusteringMetrics.from_partitions(Y1, Y2)
>>> cm.split_join_similarity(model=None)
0.75
-
adjusted_fowlkes_mallows
()[source]¶ Fowlkes-Mallows index adjusted for chance
Adjustment for chance is done by subtracting the expected (Model 3) pairwise matrix from the actual one. This coefficient appears to be uniformly more powerful than the unadjusted version. Compared to ARI and product-moment correlation coefficients, this index is generally less powerful except in particularly poorly specified cases, e.g. clusters of unequal size sampled with high error rate from a large population.
-
adjusted_rand_index
()[source]¶ Rand score (accuracy) corrected for chance
This is a memory-efficient replacement for a similar Scikit-Learn function.
-
fowlkes_mallows
()[source]¶ Fowlkes-Mallows index for partition comparison
Defined as the Ochiai coefficient on the pairwise matrix
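As a rough sketch of that definition (not the method's actual implementation), the Ochiai form can be written directly in terms of the pairwise TP/FP/FN counts described under the pairwise attribute below:
from math import sqrt

def ochiai_from_pairs(TP, FP, FN):
    # geometric mean of pairwise precision and recall; assumes non-degenerate counts
    return TP / sqrt((TP + FP) * (TP + FN))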
-
mirkin_match_coeff
(normalize=True)[source]¶ Equivalence match (similarity) coefficient
The derivation of the distance variant is described in [R1]. This measure is nearly identical to the pairwise unadjusted Rand index, as can be seen from the definition (the Mirkin match formula uses squared counts while pairwise accuracy uses "n choose 2").
>>> C3 = [{1, 2, 3, 4}, {5, 6, 7, 8, 9, 10}, {11, 12, 13, 14, 15, 16}]
>>> C4 = [{1, 2, 3, 4}, {5, 6, 7, 8, 9, 10, 11, 12}, {13, 14, 15, 16}]
>>> t = ClusteringMetrics.from_partitions(C3, C4)
>>> t.mirkin_match_coeff(normalize=False)
216.0
References
[R1] (1, 2) Mirkin, B (1996). Mathematical Classification and Clustering. Kluwer Academic Press: Boston-Dordrecht.
-
mirkin_mismatch_coeff
(normalize=True)[source]¶ Equivalence mismatch (distance) coefficient
Direct formulation (without the pairwise abstraction):
\[M = \sum_{i=1}^{R} r_{i}^2 + \sum_{j=1}^{C} c_{j}^2 - 2 \sum_{i=1}^{R}\sum_{j=1}^{C} n_{ij}^2,\]
where \(r\) and \(c\) are the row and column margins, respectively, and \(R\) and \(C\) are their cardinalities.
>>> C1 = [{1, 2, 3, 4, 5, 6, 7, 8}, {9, 10, 11, 12, 13, 14, 15, 16}]
>>> C2 = [{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, {11, 12, 13, 14, 15, 16}]
>>> t = ClusteringMetrics.from_partitions(C1, C2)
>>> t.mirkin_mismatch_coeff(normalize=False)
56.0
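The same value can be reproduced directly from the contingency counts with the formula above; the 2x2 table below was derived by hand from C1 and C2 (a sketch independent of the class API):
import numpy as np

n = np.array([[8, 0],
              [2, 6]])     # contingency table of C1 (rows) vs C2 (columns)
r = n.sum(axis=1)          # row margins
c = n.sum(axis=0)          # column margins
M = (r ** 2).sum() + (c ** 2).sum() - 2 * (n ** 2).sum()
print(M)                   # 56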
-
pairwise
¶ Confusion matrix on all pair assignments from two partitions
A partition of N is a set of disjoint clusters such that every point in N belongs to one and only one cluster and every cluster consists of at least one point. Given two partitions A and B and a co-occurrence matrix of point pairs,
TP : count of pairs found in the same cluster in both A and B
FP : count of pairs found in the same cluster in A but not in B
FN : count of pairs found in the same cluster in B but not in A
TN : count of pairs found in different clusters in both A and B
Note that although the resulting confusion matrix has the form of a correlation table for two binary variables, it is not symmetric if the original partitions are not symmetric.
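A minimal sketch of how such pair counts can be derived from a contingency table (an illustration only, assuming rows correspond to A and columns to B; not necessarily how the cached property is computed internally):
import numpy as np
from scipy.special import comb

def pair_counts(table):
    n = np.asarray(table)
    same_both = comb(n, 2).sum()             # pairs co-clustered in both A and B
    same_a = comb(n.sum(axis=1), 2).sum()    # pairs co-clustered in A
    same_b = comb(n.sum(axis=0), 2).sum()    # pairs co-clustered in B
    TP = same_both
    FP = same_a - same_both                  # co-clustered in A only
    FN = same_b - same_both                  # co-clustered in B only
    TN = comb(n.sum(), 2) - TP - FP - FN     # co-clustered in neither
    return TP, FP, FN, TN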
-
-
class
clustering_metrics.metrics.
ConfusionMatrix2
(TP=None, FN=None, FP=None, TN=None, rows=None)[source]¶ Bases:
clustering_metrics.metrics.ContingencyTable
,pymaptools.containers.OrderedCrossTab
A confusion matrix (2x2 contingency table)
For a binary variable (where one is measuring either presence vs absence of a particular feature), a confusion matrix where the ground truth levels are rows looks like this:
>>> cm = ConfusionMatrix2(TP=20, FN=31, FP=14, TN=156)
>>> cm
ConfusionMatrix2(rows=[[20, 31], [14, 156]])
>>> cm.to_array()
array([[ 20,  31],
       [ 14, 156]])
For a nominal variable, the negative class becomes a distinct label, and TP/FP/FN/TN terminology does not apply, although the algorithms should work the same way (with the obvious distinction that different assumptions will be made). For a convenient reference about some of the attributes and methods defined here see [R2].
References
[R2] (1, 2) Wikipedia entry for Confusion Matrix
Attributes
TP : True positive count
FP : False positive count
TN : True negative count
FN : False negative count
-
DOR
()[source]¶ Diagnostics odds ratio
Defined as
\[DOR = \frac{PLL}{NLL}.\]
The odds ratio has a number of interesting and desirable properties; however, one peculiarity that leaves us looking for an alternative measure is that on L-shaped matrices such as
\[\begin{split}\begin{matrix} 77 & 0 \\ 5 & 26 \end{matrix}\end{split}\]
its value will be infinity.
Also known as: crude odds ratio, Mantel-Haenszel estimate.
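In terms of the four cell counts, the ratio reduces to a single expression (a quick sketch; the infinite case corresponds to the L-shaped matrix above):
def diagnostic_odds_ratio(TP, FN, FP, TN):
    # DOR = PLL / NLL = (TP * TN) / (FP * FN)
    denom = FP * FN
    if denom == 0:
        return float('inf')   # e.g. the L-shaped matrix shown above
    return (TP * TN) / denom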
-
FN
¶
-
FP
¶
-
PPV
()[source]¶ Positive Predictive Value (Precision)
Synonyms: precision, frequency of hits, post agreement, success ratio, correct alarm ratio
-
TN
¶
-
TP
¶
-
TPR
()[source]¶ True Positive Rate (Recall, Sensitivity)
Synonyms: recall, sensitivity, hit rate, probability of detection, prefigurance
-
accuracy
()¶ Accuracy (Rand Index)
Synonyms: Simple Matching Coefficient, Rand Index
-
bias_index
()[source]¶ Bias Index
In interrater agreement studies, bias is the extent to which the raters disagree on the positive-negative ratio of the binary variable studied. Example of a confusion matrix with high bias of rater A (represented by rows) towards negative rating:
\[\begin{split}\begin{matrix} 17 & 14 \\ 78 & 81 \end{matrix}\end{split}\]
See also
-
cole_coeff
()[source]¶ Cole coefficient
This is exactly the same coefficient as Lewontin’s D’. It is defined as:
\[D' = \frac{cov}{cov_{max}},\]
where \(cov_{max}\) is the maximum covariance attainable under the given marginal distribution. When \(ad \geq bc\), this coefficient is equivalent to Loevinger’s H.
Synonyms: C7, Lewontin’s D’.
See also
-
dice_coeff
()[source]¶ Dice similarity (Nei-Li coefficient)
This is the same as the F1-score, but calculated slightly differently here. Note that Dice can be zero if the total number of positives is zero, whereas the F-score is undefined in that case (because recall is undefined).
When adjusted for chance, this coefficient becomes identical to kappa [R3].
Since this coefficient is monotonic with respect to the Jaccard and Sokal-Sneath coefficients, its resolving power is identical to that of the other two.
See also
References
[R3] (1, 2) Albatineh, A. N., Niewiadomska-Bugaj, M., & Mihalko, D. (2006). On similarity indices and correction for chance agreement. Journal of Classification, 23(2), 301-313.
-
diseq_coeff
(standardize=False)[source]¶ Linkage disequilibrium
\[D = \frac{a}{n} - \frac{p_1}{n}\frac{p_2}{n} = \frac{cov}{n^2}\]
If standardize=True, this measure is further normalized to the maximum covariance attainable under the given marginal distribution, and the resulting index is called Lewontin’s D’.
See also
-
classmethod
from_ccw
(TP, FP, TN, FN)[source]¶ Instantiate from counter-clockwise form of TP FP TN FN
-
classmethod
from_sets
(set1, set2, universe_size=None)[source]¶ Instantiate from two sets
Accepts an optional universe_size parameter, which allows us to take the TN class into account and use probability-based similarity metrics. Most of the time, however, set comparisons are performed ignoring this parameter and relying instead on non-probabilistic indices such as Jaccard’s or Dice.
-
fscore
(beta=1.0)[source]¶ F-score
As beta tends to infinity, F-score will approach recall. As beta tends to zero, F-score will approach precision. A similarity coefficient that uses a similar definition is called Dice coefficient.
See also
-
informedness
()[source]¶ Informedness (recall corrected for chance)
A complement to markedness. Can be thought of as recall corrected for chance. Alternative formulations:
\[\begin{split}Informedness &= Sensitivity + Specificity - 1.0 \\ &= TPR - FPR\end{split}\]
In the case of ranked predictions, TPR can be plotted on the y-axis with FPR on the x-axis. The resulting plot is known as the Receiver Operating Characteristic (ROC) curve [R4]. The delta between a point on the ROC curve and the diagonal is equal to the value of informedness at the given FPR threshold.
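A minimal sketch of the second formulation in plain Python (cell-count argument names are assumptions, not the class's internals):
def informedness(TP, FN, FP, TN):
    tpr = TP / (TP + FN)    # sensitivity (recall)
    fpr = FP / (FP + TN)    # fall-out
    return tpr - fpr        # == sensitivity + specificity - 1.0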
This measure was first proposed for evaluating medical diagnostics tests in [R5], and was also used in meteorology under the name “True Skill Score” [R6].
Synonyms: Youden’s J, True Skill Score, Hanssen-Kuipers Score, Attributable Risk, DeltaP.
See also
References
[R4] (1, 2) Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
[R5] (1, 2) Youden, W. J. (1950). Index for rating diagnostic tests. Cancer, 3(1), 32-35.
[R6] (1, 2) Doswell III, C. A., Davies-Jones, R., & Keller, D. L. (1990). On summary measures of skill in rare event forecasting based on contingency tables. Weather and Forecasting, 5(4), 576-585.
-
jaccard_coeff
()[source]¶ Jaccard similarity coefficient
The Jaccard coefficient has an interesting property: in L-shaped matrices where either FP or FN is close to zero, its scale becomes equivalent to that of recall or precision, respectively.
Since this coefficient is monotonic with respect to Dice (F-score) and Sokal Sneath coefficients, its resolving power is identical to that of the other two.
The Jaccard index does not belong to the L-family of association indices and thus cannot be adjusted for chance by subtracting its value under the fixed-margin null model. Instead, its expectation must be calculated, for which no analytical solution exists [R7].
Synonyms: critical success index
See also
References
-
kappa
()[source]¶ Cohen’s Kappa (Interrater Agreement)
Kappa coefficient is best known in the psychology field where it was introduced to measure interrater agreement [R8]. It has also been used in replication studies [R9], clustering evaluation [R10], image segmentation [R11], feature selection [R12] [R13], forecasting [R14], and network link prediction [R15]. The first derivation of this measure is in [R16].
Kappa can be derived by correcting either Accuracy (Simple Matching Coefficient, Rand Index) or the F1-score (Dice Coefficient) for chance. Conversely, the Dice coefficient can be derived from Kappa by taking its limit as \(d \rightarrow \infty\). Normalizing Kappa by its maximum value given a fixed-margin table gives Loevinger’s H.
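The chance correction of accuracy for a 2x2 table can be sketched as follows (a plain illustration of the idea, not the class method itself):
def cohen_kappa_2x2(TP, FN, FP, TN):
    N = TP + FN + FP + TN
    p_obs = (TP + TN) / N                                        # accuracy
    # expected agreement under independence with fixed margins
    p_exp = ((TP + FN) * (TP + FP) + (FP + TN) * (FN + TN)) / N ** 2
    return (p_obs - p_exp) / (1.0 - p_exp)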
Synonyms: Adjusted Rand Index, Heidke Skill Score
See also
References
[R8] (1, 2) Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46.
[R9] (1, 2) Arabie, P., Hubert, L. J., & De Soete, G. (1996). Clustering validation: results and implications for applied analyses (p. 341). World Scientific Pub Co Inc.
[R10] (1, 2) Warrens, M. J. (2008). On the equivalence of Cohen’s kappa and the Hubert-Arabie adjusted Rand index. Journal of Classification, 25(2), 177-183.
[R14] (1, 2) Doswell III, C. A., Davies-Jones, R., & Keller, D. L. (1990). On summary measures of skill in rare event forecasting based on contingency tables. Weather and Forecasting, 5(4), 576-585.
[R15] (1, 2) Hoffman, M., Steinley, D., & Brusco, M. J. (2015). A note on using the adjusted Rand index for link prediction in networks. Social Networks, 42, 72-79.
[R16] (1, 2) Heidke, Paul. “Berechnung des Erfolges und der Güte der Windstärkevorhersagen im Sturmwarnungsdienst.” Geografiska Annaler (1926): 301-349.
-
kappas
()[source]¶ Pairwise precision and recall corrected for chance
Kappa decomposes into a pair of components (regression coefficients), \(\kappa_0\) (precision-like) and \(\kappa_1\) (recall-like), of which it is a harmonic mean:
\[\kappa_0 = \frac{cov}{p_2 q_1}, \quad \kappa_1 = \frac{cov}{p_1 q_2}.\]
These coefficients are interesting because they represent precision and recall, respectively, corrected for chance by subtracting the fixed-margin null model. In a clustering context, \(\kappa_0\) corresponds to pairwise homogeneity, while \(\kappa_1\) corresponds to pairwise completeness. The geometric mean of the two components is equal to Matthews’ Correlation Coefficient, while their maximum is equal to Loevinger’s H when \(ad \geq bc\).
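A sketch of the two components in cell-count form, using a=TP, b=FN, c=FP, d=TN with row margins p1=a+b, q1=c+d and column margins p2=a+c, q2=b+d (the naming and the 0/1 assignment here are assumptions for illustration):
def kappa_components(a, b, c, d):
    cov = a * d - b * c                # covariance term of the 2x2 table
    p1, q1 = a + b, c + d              # row margins
    p2, q2 = a + c, b + d              # column margins
    kappa0 = cov / (p2 * q1)           # precision-like (pairwise homogeneity)
    kappa1 = cov / (p1 * q2)           # recall-like (pairwise completeness)
    # the harmonic mean of the two recovers kappa: 2 * cov / (p2 * q1 + p1 * q2)
    return kappa0, kappa1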
See also
-
loevinger_coeff
()[source]¶ Loevinger’s Index of Homogeneity (Loevinger’s H)
Given a clustering (numbers correspond to class labels, inner groups to clusters) with perfect homogeneity but imperfect completeness, Loevinger coefficient returns a perfect score on the corresponding pairwise co-association matrix:
>>> clusters = [[0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
>>> t = ClusteringMetrics.from_clusters(clusters)
>>> t.pairwise.loevinger_coeff()
1.0
At the same time, kappa and Matthews coefficients are 0.63 and 0.68, respectively. Loevinger coefficient will also return a perfect score for the dual situation:
>>> clusters = [[0, 2, 2, 0, 0, 0], [1, 1, 1, 1]]
>>> t = ClusteringMetrics.from_clusters(clusters)
>>> t.pairwise.loevinger_coeff()
1.0
Loevinger’s coefficient has a unique property: all two-way correlation coefficients on a 2x2 table that are in the L-family (including Kappa and Matthews’ correlation coefficient) become Loevinger’s coefficient after normalization by their maximum value [R18]. However, this measure is not symmetric: when \(ad < bc\), it does not have a lower bound. For an equivalent symmetric measure, use the Cole coefficient.
See also
References
[R18] (1, 2) Warrens, M. J. (2008). On association coefficients for 2x2 tables and properties that do not depend on the marginal distributions. Psychometrika, 73(4), 777-789.
-
markedness
()[source]¶ Markedness (precision corrected for chance)
A complement to informedness. Can be thought of as precision corrected for chance. Alternative formulations:
\[\begin{split}Markedness &= PPV + NPV - 1.0 \\ &= PPV - FOR\end{split}\]
In the case of ranked predictions, PPV can be plotted on the y-axis with FOR on the x-axis. The resulting plot is known as the Relative Operating Level (ROL) curve [R19]. The delta between a point on the ROL curve and the diagonal is equal to the value of markedness at the given FOR threshold.
Synonyms: DeltaP′
See also
References
-
matthews_corr
()[source]¶ Matthews Correlation Coefficient (Phi coefficient)
MCC is directly related to the chi-square statistic. Its value is equal to the chi-square value normalized by the maximum value chi-square can achieve with the given margins (for a 2x2 table, that maximum equals the grand total N), transformed to correlation space by taking a square root.
MCC is also the geometric mean of informedness and markedness (the regression coefficients of the problem and its dual). As \(d \rightarrow \infty\), MCC turns into the Ochiai coefficient. Unlike with Kappa, normalizing the corresponding similarity coefficient for chance by subtracting the fixed-margin null model does not produce MCC in return, but gives a different index with discriminating power equivalent to that of MCC. Normalizing MCC by its maximum value under the fixed-margin model gives Loevinger’s H.
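The relationships above can be sketched in cell-count form (a=TP, b=FN, c=FP, d=TN assumed; not the method's implementation):
from math import sqrt

def matthews_corr(a, b, c, d):
    cov = a * d - b * c
    # equals the geometric mean of informedness, cov / ((a + b) * (c + d)),
    # and markedness, cov / ((a + c) * (b + d))
    return cov / sqrt((a + b) * (c + d) * (a + c) * (b + d))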
Empirically, the discriminating power of MCC is slightly better than that of mp_corr and kappa, and is only lower than that of loevinger_coeff under highly biased conditions. While MCC is a commonly used and recently preferred measure of prediction and reproducibility [R20], it is somewhat strange that one can hardly find any literature that uses this index in a clustering comparison context, with some rare exceptions [R21] [R22].
Synonyms: Phi Coefficient, Product-Moment Correlation
See also
References
[R22] (1, 2) Kao, D. (2012). Using Matthews correlation coefficient to cluster annotations. NextGenetics (personal blog).
-
mic_scores
(mean='harmonic')[source]¶ Mutual information-based correlation
The coefficient decomposes into regression coefficients defined according to fixed-margin tables. The mic1 coefficient, for example, is obtained by dividing the G-score by the maximum achievable value on a table with fixed true class counts (which here correspond to row totals). The mic0 coefficient is its dual, defined by dividing the G-score by its maximum achievable value with fixed predicted label counts (here represented as column totals). mic0 roughly corresponds to precision (homogeneity) while mic1 roughly corresponds to recall (completeness).
-
mp_corr
()[source]¶ Maxwell & Pilliner’s association index
Another covariance-based association index corrected for chance. Like MCC, it is based on a mean of informedness and markedness, except that it uses the harmonic mean instead of the geometric one. Like Kappa, it turns into the Dice coefficient (F-score) as \(d\) approaches infinity.
On typical problems, the resolving power of this coefficient is nearly identical to that of Cohen’s Kappa and is only very slightly below that of Matthews’ correlation coefficient.
See also
-
ochiai_coeff
()[source]¶ Ochiai similarity coefficient (Fowlkes-Mallows)
One interpretation of this coefficient is that it is equal to the geometric mean of the conditional probabilities of an element (in the case of pairwise clustering comparison, a pair of elements) belonging to the same cluster given that they belong to the same class [R23].
This coefficient is in the L-family, and thus it can be corrected for chance by subtracting its value under the fixed-margin null model. The resulting adjusted index is very close to, but not the same as, Matthews’ Correlation Coefficient. Empirically, the discriminating power of the adjusted coefficient is equal to that of Matthews’ Correlation Coefficient to within rounding error.
Synonyms: Cosine Similarity, Fowlkes-Mallows Index
See also
References
[R23] (1, 2) Ramirez, E. H., Brena, R., Magatti, D., & Stella, F. (2012). Topic model validation. Neurocomputing, 76(1), 125-133.
-
ochiai_coeff_adj
()[source]¶ Ochiai coefficient adjusted for chance
This index is nearly identical to Matthews’ Correlation Coefficient, which should be used instead.
See also
-
overlap_coeff
()[source]¶ Overlap coefficient (Szymkiewicz-Simpson coefficient)
Can be obtained by standardizing Dice or Ochiai coefficients by their maximum possible value given fixed marginals. Not corrected for chance.
Note that \(min(p_1, p_2)\) is equal to the maximum value of \(a\) given fixed marginals.
When adjusted for chance, this coefficient turns into Loevinger’s H.
See also
-
pairwise_hcv
()[source]¶ Pairwise homogeneity, completeness, and their geometric mean
Each of the two one-sided measures is defined as follows:
\[\hat{M}_{adj} = \frac{M - E[M]}{M_{max} - min(E[M], M)}.\]
It is clear from the definition above that iff \(M < E[M]\) and \(M \leq M_{max}\), the denominator will switch from the standard normalization interval to a larger one, thereby ensuring that \(-1.0 \leq \hat{M}_{adj} \leq 1.0\). The definition for the bottom half of the range can also be expressed in terms of the standard adjusted value:
\[\hat{M}_{adj} = \frac{M_{adj}}{(1 + |M_{adj}|^n)^{1/n}}, \quad M_{adj} < 0, n = 1.\]
The resulting measure is not symmetric over its range (negative values are scaled differently from positive values), but this should not matter for applications where negative correlation carries no special meaning other than being additional evidence for the absence of positive correlation. Such a situation occurs in the pairwise confusion matrices used in cluster analysis. Nevertheless, if more symmetric behavior near zero is desired, the upper part of the negative range can be linearized either by increasing \(n\) in the definition above or by replacing it with the \(\hat{M}_{adj} = \tanh(M_{adj})\) transform.
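The denominator switch described by the first formula can be sketched in one line (symbol names are assumptions):
def adjusted_with_floor(M, E_M, M_max):
    # the standard adjustment divides by (M_max - E_M); when M < E[M], the larger
    # denominator (M_max - M) keeps the result within [-1.0, 1.0]
    return (M - E_M) / (M_max - min(E_M, M))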
For the compound measure, the geometric mean was chosen over the harmonic mean after the results of a Monte Carlo power analysis, due to slightly better discriminating performance. For positive matrices, the geometric mean is equal to matthews_corr, while the harmonic mean would have been equal to kappa. For negative matrices, the harmonic mean would have remained monotonic (though not equal) to Kappa, while the geometric mean is neither monotonic nor equal to MCC, despite the two being closely correlated. The discriminating performance indices of the geometric mean and of MCC are empirically the same (equal to within rounding error).
For matrices with negative covariance, it is possible to switch to markedness and informedness as one-sided components (homogeneity and completeness, respectively). However, the desirable property of measure orthogonality will not be preserved then, since markedness and informedness exhibit strong correlation under the assumed null model.
-
precision
()¶ Positive Predictive Value (Precision)
Synonyms: precision, frequency of hits, post agreement, success ratio, correct alarm ratio
-
prevalence_index
()[source]¶ Prevalence
In interrater agreement studies, prevalence is high when the proportion of agreements on the positive classification differs from that of the negative classification. Example of a confusion matrix with high prevalence of negative response (note that this happens regardless of which rater we look at):
\[\begin{split}\begin{matrix} 3 & 27 \\ 28 & 132 \end{matrix}\end{split}\]
See also
-
recall
()¶ True Positive Rate (Recall, Sensitivity)
Synonyms: recall, sensitivity, hit rate, probability of detection, prefigurance
-
sensitivity
()¶ True Positive Rate (Recall, Sensitivity)
Synonyms: recall, sensitivity, hit rate, probability of detection, prefigurance
-
sokal_sneath_coeff
()[source]¶ Sokal and Sneath similarity index
In a 2x2 matrix
\[\begin{split}\begin{matrix} a & b \\ c & d \end{matrix}\end{split}\]
Dice places more weight on the \(a\) component, Jaccard places equal weight on \(a\) and \(b + c\), while Sokal and Sneath place more weight on \(b + c\).
See also
-
specificity
()¶ True Negative Rate (Specificity)
Synonyms: specificity
-
yule_q
()[source]¶ Yule’s Q (association index)
Yule’s Q relates to the odds ratio (DOR) as follows:
\[Q = \frac{DOR - 1}{DOR + 1}.\]
-
yule_y
()[source]¶ Yule’s Y (colligation coefficient)
The Y coefficient was used as the basis of a new association measure that accounts for entropy in [R24].
References
[R24] (1, 2) Hasenclever, D., & Scholz, M. (2013). Comparing measures of association in 2x2 probability tables. arXiv preprint arXiv:1302.6161.
-
-
class
clustering_metrics.metrics.
ContingencyTable
(*args, **kwargs)[source]¶ Bases:
pymaptools.containers.CrossTab
-
adjust_to_null
(measure, model='m3', with_warnings=False)[source]¶ Adjust a measure to null model
The general formula for chance correction of an association measure \(M\) is:
\[M_{adj} = \frac{M - E(M)}{M_{max} - E(M)},\]
where \(M_{max}\) is the maximum value a measure \(M\) can achieve, and \(E(M)\) is the expected value of \(M\) under statistical independence given fixed table margins. In simple cases, the expected value of a measure is the same as the value of the measure given a null model. This is not always the case, however, and to properly adjust for chance, one sometimes has to average over all possible contingency tables, using, for example, the hypergeometric distribution.
The method returns a tuple for two different measure ceilings: row-diagonal and column-diagonal. For symmetric measures, the two values will be the same.
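The basic correction itself is a one-liner (a sketch of the formula above; the actual method additionally handles the two ceilings and the model parameter):
def chance_adjusted(M, expected_M, max_M):
    # M_adj = (M - E(M)) / (M_max - E(M))
    return (M - expected_M) / (max_M - expected_M)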
-
adjusted_mutual_info
()[source]¶ Adjusted Mutual Information for two partitions
For a mathematical definition, see [R25] and [R26].
References
-
assignment_score
(normalize=True, model='m1', discrete=False, redraw=False)[source]¶ Similarity score by solving the Linear Sum Assignment Problem
This metric is uniformly more powerful than the similarly behaved split_join_similarity, which relies on an approximation to the optimal solution evaluated here. The split-join approximation asymptotically approaches the optimal solution as the clustering quality improves.
On the model parameter: adjusting the assignment cost for chance by relying on the hypergeometric distribution is extremely computationally expensive, but one way to get a better behaved metric is to simply subtract the cost of a null model from the obtained score (in the case of normalization, the null cost also has to be subtracted from the maximum cost). Note that on large tables even finding the null cost is too expensive, since expected tables are much less sparse. Hence the parameter is off by default.
Alternatively, this problem can be recast as that of finding a maximum weighted bipartite match [R28].
This method of partition comparison was first mentioned in [R29], given an approximation in [R30], formally elaborated in [R31] and empirically compared with other measures in [R32].
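A rough sketch of the unnormalized, unadjusted core of such a score using SciPy's LSAP solver (an illustration of the idea only, not this method's actual signature or implementation):
import numpy as np
from scipy.optimize import linear_sum_assignment

def raw_assignment_score(table):
    weights = np.asarray(table)
    # maximize the total matched weight by minimizing the negated cost matrix
    rows, cols = linear_sum_assignment(-weights)
    return int(weights[rows, cols].sum())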
See also
References
[R28] (1, 2) Wikipedia entry on weighted bipartite graph matching
[R29] (1, 2) Almudevar, A., & Field, C. (1999). Estimation of single-generation sibling relationships based on DNA markers. Journal of Agricultural, Biological, and Environmental Statistics, 136-165.
[R30] (1, 2) Ben-Hur, A., & Guyon, I. (2003). Detecting stable clusters using principal component analysis. In Functional Genomics (pp. 159-182). Humana Press.
[R31] (1, 2) Gusfield, D. (2002). Partition-distance: A problem and class of perfect graphs arising in clustering. Information Processing Letters, 82(3), 159-164.
-
bc_metrics
()[source]¶ ‘B-cubed’ precision, recall, and fscore
As described in [R33] and [R34]. Was extended to overlapping clusters in [R35]. These metrics perform very similarly to normalized entropy metrics (homogeneity, completeness, V-measure).
References
[R35] (1, 2) Amigó, E., Gonzalo, J., Artiles, J., & Verdejo, F. (2009). A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information retrieval, 12(4), 461-486.
-
chisq_score
()[source]¶ Pearson’s chi-square statistic
>>> r = {1: {1: 16, 3: 2}, 2: {1: 1, 2: 3}, 3: {1: 4, 2: 5, 3: 5}}
>>> cm = ContingencyTable(rows=r)
>>> round(cm.chisq_score(), 3)
19.256
-
entropy_scores
(mean='harmonic')[source]¶ Gives three entropy-based metrics for a RxC table
The metrics are: Homogeneity, Completeness, and V-measure
The V-measure metric is also known as Normalized Mutual Information (NMI), and is calculated here as the harmonic mean of Homogeneity and Completeness (\(NMI_{sum}\)). There exist other definitions of NMI (see Table 2 in [R36] for a good review).
Homogeneity and Completeness are duals of each other and can be thought of (although this is not technically accurate) as squared regression coefficients of a given clustering vs true labels (homogeneity) and of the dual problem of true labels vs given clustering (completeness). Because of the dual property, in a symmetric matrix, all three scores are the same. Homogeneity has an overall profile similar to that of precision in information retrieval. Completeness roughly corresponds to recall.
This method replaces the homogeneity_completeness_v_measure method in Scikit-Learn. The Scikit-Learn version takes up \(O(n^2)\) space because it stores data in a dense NumPy array, while the given version is sub-quadratic because of its sparse underlying storage.
Note that the entropy variables H in the code below are improperly defined because they ought to be divided by N (the grand total for the contingency table). However, the N variable cancels out during normalization.
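A compact sketch of the three scores from a small dense table (for illustration only; the class itself uses sparse storage, natural logs are used as noted, and a non-degenerate table with no all-zero rows or columns is assumed):
import numpy as np
from scipy.special import xlogy

def entropy_scores_dense(table):
    p = np.asarray(table, dtype=float)
    p /= p.sum()
    pr, pc = p.sum(axis=1), p.sum(axis=0)        # class (row) and cluster (column) marginals
    H_C, H_K = -xlogy(pr, pr).sum(), -xlogy(pc, pc).sum()
    H_C_given_K = -xlogy(p, p / pc).sum()        # conditional entropy of classes given clusters
    H_K_given_C = -xlogy(p, (p.T / pr).T).sum()  # conditional entropy of clusters given classes
    homogeneity = 1.0 - H_C_given_K / H_C
    completeness = 1.0 - H_K_given_C / H_K
    v_measure = 2.0 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v_measure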
References
-
expected
(model='m3', discrete=False, redraw=False)[source]¶ Factory creating expected table given current margins
-
g_score
()[source]¶ G-statistic for RxC contingency table
This method does not perform any corrections to this statistic (e.g. Williams’, Yates’ corrections).
The statistic is equivalent to the negative of Mutual Information times two. Mutual Information on a contingency table is defined as the difference between the information in the table and the information in an independent table with the same margins. For an application of mutual information (in the form of the G-score) to searching for collocated words in NLP, see [R37] and [R38].
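A sketch of the statistic for a dense table, comparing observed counts with the counts expected under independence with the same margins (assumes no zero margins):
import numpy as np
from scipy.special import xlogy

def g_score_dense(table):
    observed = np.asarray(table, dtype=float)
    N = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / N
    # G = 2 * sum(O * ln(O / E)) over all cells, using natural logs
    return 2.0 * xlogy(observed, observed / expected).sum()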
References
[R37] (1, 2) Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61-74.
[R38] (1, 2) Ted Dunning’s personal blog entry and the discussion under it.
-
muc_scores
()[source]¶ MUC similarity indices for coreference scoring
Implemented after the description in [R39]. The compound fscore-like metric has good resolving power on sparse models, similar to fowlkes_mallows (pairwise ochiai_coeff); however, it becomes useless on dense matrices as it relies on category cardinalities (how many types were seen) rather than on observation counts (how many instances of each type were seen).
>>> p1 = [x.split() for x in ["A B C", "D E F G"]]
>>> p2 = [x.split() for x in ["A B", "C", "D", "E", "F G"]]
>>> cm = ClusteringMetrics.from_partitions(p1, p2)
>>> cm.muc_scores()[:2]
(1.0, 0.4)
Elements that are part of neither partition (in this case, E) are excluded from consideration:
>>> p1 = [x.split() for x in ["A B", "C", "D", "F G", "H"]]
>>> p2 = [x.split() for x in ["A B", "C D", "F G H"]]
>>> cm = ClusteringMetrics.from_partitions(p1, p2)
>>> cm.muc_scores()[:2]
(0.5, 1.0)
References
-
mutual_info_score
()[source]¶ Mutual Information Score
Mutual Information (divided by N).
The metric is equal to the Kullback-Leibler divergence of the joint distribution with the product distribution of the marginals.
-
split_join_similarity
(normalize=True, model='m1')[source]¶ Split-join similarity score
Split-join similarity is a two-way assignment-based score first proposed in [R40]. The distance variant of this measure has metric properties. Like the better known purity score (a one-way coefficient), this measure implicitly performs class-cluster assignment, except the assignment is performed twice: based on the corresponding maximum frequency in the contingency table, each class is given a cluster with the assignment weighted according to the frequency, then the procedure is inverted to assign a class to each cluster. The final unnormalized distance score comprises a simple sum of the two one-way assignment scores.
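The two one-way assignments reduce to row and column maxima of the contingency table; a sketch of the unnormalized similarity variant (not the method itself, which also supports null-model subtraction) that reproduces the model=None values in the doctest below:
import numpy as np

def split_join_similarity_sketch(table, normalize=True):
    n = np.asarray(table)
    # best cluster for each class, plus best class for each cluster
    score = n.max(axis=1).sum() + n.max(axis=0).sum()
    return score / (2.0 * n.sum()) if normalize else int(score)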
By default, the m1 null model is subtracted, to make the final score independent of the number of clusters:
>>> t2 = ClusteringMetrics(rows=10 * np.ones((2, 2), dtype=int))
>>> t2.split_join_similarity(model=None)
0.5
>>> t2.split_join_similarity(model='m1')
0.0
>>> t8 = ClusteringMetrics(rows=10 * np.ones((8, 8), dtype=int))
>>> t8.split_join_similarity(model=None)
0.125
>>> t8.split_join_similarity(model='m1')
0.0
See also
References
[R40] (1, 2) Dongen, S. V. (2000). Performance criteria for graph clustering and Markov cluster experiments. Information Systems [INS], (R 0012), 1-36.
-
talburt_wang_index
()[source]¶ Talburt-Wang index of similarity of two partitions
On sparse matrices, the resolving power of this measure asymptotically approaches that of assignment-based scores such as assignment_score and split_join_similarity; however, on dense matrices this measure will not perform well due to its reliance on category cardinalities (how many types were seen) rather than on observation counts (how many instances of each type were seen).
A relatively decent clustering:
>>> a = [ 1,  1,  1,  2,  2,  2,  2,  3,  3,  4]
>>> b = [43, 56, 56,  5, 36, 36, 36, 74, 74, 66]
>>> t = ContingencyTable.from_labels(a, b)
>>> round(t.talburt_wang_index(), 3)
0.816
Less good clustering (example from [R41]):
>>> clusters = [[1, 1], [1, 1, 1, 1], [2, 3], [2, 2, 3, 3],
...             [3, 3, 4], [3, 4, 4, 4, 4, 4, 4, 4, 4, 4]]
>>> t = ContingencyTable.from_clusters(clusters)
>>> round(t.talburt_wang_index(), 2)
0.49
References
-
vi_distance
(normalize=True)[source]¶ Variation of Information distance
Defined in [R42]. This is one of several possible entropy-based distance measures that could be defined on an RxC matrix. The given measure is equivalent to \(2 D_{sum}\) as listed in Table 2 in [R43].
Note that the entropy variables H below are calculated using natural logs, so a base correction may be necessary if you need your result in base 2 for example.
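The unnormalized quantity can be sketched as the sum of the two conditional entropies of the joint distribution (dense version, natural logs, non-degenerate table assumed):
import numpy as np
from scipy.special import xlogy

def vi_distance_dense(table):
    p = np.asarray(table, dtype=float)
    p /= p.sum()
    pr, pc = p.sum(axis=1), p.sum(axis=0)
    H_A_given_B = -xlogy(p, p / pc).sum()
    H_B_given_A = -xlogy(p, (p.T / pr).T).sum()
    # VI = H(A|B) + H(B|A)
    return H_A_given_B + H_B_given_A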
References
[R42] (1, 2) Meila, M. (2007). Comparing clusterings – an information based distance. Journal of multivariate analysis, 98(5), 873-895.
-
-
clustering_metrics.metrics.
adjusted_mutual_info_score
(labels_true, labels_pred)[source]¶ Adjusted Mutual Information for two partitions
This is a memory-efficient replacement for the equivalently named Scikit-Learn function.
Perfect labelings are both homogeneous and complete, hence AMI has the perfect score of one:
>>> adjusted_mutual_info_score([0, 0, 1, 1], [0, 0, 1, 1])
1.0
>>> adjusted_mutual_info_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0
If class members are completely split across different clusters, the assignment is utterly incomplete, hence AMI equals zero:
>>> adjusted_mutual_info_score([0, 0, 0, 0], [0, 1, 2, 3])
0.0
-
clustering_metrics.metrics.
adjusted_rand_score
(labels_true, labels_pred)[source]¶ Rand score (accuracy) corrected for chance
This is a memory-efficient replacement for the equivalently named Scikit-Learn function
In a supplement to [R44], the following example is given:
>>> classes = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
>>> clusters = [1, 2, 1, 2, 2, 3, 3, 3, 3, 3]
>>> round(adjusted_rand_score(classes, clusters), 3)
0.313
References
-
clustering_metrics.metrics.
cohen_kappa
(*args, **kwargs)[source]¶ Return Cohen’s Kappa for a 2x2 contingency table
-
clustering_metrics.metrics.
confmat2_type
¶ alias of
ConfusionMatrix2
-
clustering_metrics.metrics.
geometric_mean
(x, y)[source]¶ Geometric mean of two numbers. Always returns a float
Although the geometric mean is defined for negative numbers, the SciPy function doesn’t allow them; hence this function.
-
clustering_metrics.metrics.
geometric_mean_weighted
(x, y, ratio=1.0)[source]¶ Geometric mean of two numbers with a weight ratio. Returns a float
>>> geometric_mean_weighted(1, 4, ratio=1.0)
2.0
>>> geometric_mean_weighted(1, 4, ratio=0.0)
1.0
>>> geometric_mean_weighted(1, 4, ratio=float('inf'))
4.0
-
clustering_metrics.metrics.
harmonic_mean
(x, y)[source]¶ Harmonic mean of two numbers. Always returns a float
-
clustering_metrics.metrics.
harmonic_mean_weighted
(x, y, ratio=1.0)[source]¶ Harmonic mean of two numbers with a weight ratio. Returns a float
>>> harmonic_mean_weighted(1, 3, ratio=1.0)
1.5
>>> harmonic_mean_weighted(1, 3, ratio=0.0)
1.0
>>> harmonic_mean_weighted(1, 3, ratio=float('inf'))
3.0
-
clustering_metrics.metrics.
homogeneity_completeness_v_measure
(labels_true, labels_pred)[source]¶ Memory-efficient replacement for equivalently named Scikit-Learn function
-
clustering_metrics.metrics.
jaccard_similarity
(iterable1, iterable2)[source]¶ Jaccard similarity between two sets
Parameters:
iterable1 : collections.Iterable
    first bag of items (order irrelevant)
iterable2 : collections.Iterable
    second bag of items (order irrelevant)
Returns:
jaccard_similarity : float
-
clustering_metrics.metrics.
mutual_info_score
(labels_true, labels_pred)[source]¶ Memory-efficient replacement for the equivalently named Scikit-Learn function
-
clustering_metrics.metrics.
product_moment
(*args, **kwargs)[source]¶ Return MCC score for a 2x2 contingency table