clustering_metrics.metrics module

class clustering_metrics.metrics.ClusteringMetrics(*args, **kwargs)[source]

Bases: clustering_metrics.metrics.ContingencyTable

Provides external clustering evaluation metrics

A subclass of ContingencyTable that builds a pairwise co-association matrix for clustering comparisons.

>>> Y1 = {(1, 2, 3), (4, 5, 6)}
>>> Y2 = {(1, 2), (3, 4, 5), (6,)}
>>> cm = ClusteringMetrics.from_partitions(Y1, Y2)
>>> cm.split_join_similarity(model=None)
0.75
adjusted_fowlkes_mallows()[source]

Fowlkes-Mallows index adjusted for chance

Adjustment for chance is done by subtracting the expected (Model 3) pairwise matrix from the actual one. This coefficient appears to be uniformly more powerful than the unadjusted version. Compared to ARI and product-moment correlation coefficients, this index is generally less powerful except in particularly poorly specified cases, e.g. clusters of unequal size sampled with a high error rate from a large population.

adjusted_rand_index()[source]

Rand score (accuracy) corrected for chance

This is a memory-efficient replacement for a similar Scikit-Learn function.

fowlkes_mallows()[source]

Fowlkes-Mallows index for partition comparison

Defined as the Ochiai coefficient on the pairwise matrix

get_score(scoring_method, *args, **kwargs)[source]

Evaluate specified scoring method

mirkin_match_coeff(normalize=True)[source]

Equivalence match (similarity) coefficient

The derivation of the distance variant is described in [R1]. This measure is nearly identical to the pairwise unadjusted Rand index, as can be seen from the definition (the Mirkin match formula uses squares while pairwise accuracy uses 'n choose 2').

>>> C3 = [{1, 2, 3, 4}, {5, 6, 7, 8, 9, 10}, {11, 12, 13, 14, 15, 16}]
>>> C4 = [{1, 2, 3, 4}, {5, 6, 7, 8, 9, 10, 11, 12}, {13, 14, 15, 16}]
>>> t = ClusteringMetrics.from_partitions(C3, C4)
>>> t.mirkin_match_coeff(normalize=False)
216.0

References

[R1] Mirkin, B. (1996). Mathematical Classification and Clustering. Kluwer Academic Press: Boston-Dordrecht.
mirkin_mismatch_coeff(normalize=True)[source]

Equivalence mismatch (distance) coefficient

Direct formulation (without the pairwise abstraction):

\[M = \sum_{i=1}^{R} r_{i}^2 + \sum_{j=1}^{C} c_{j}^2 - 2\sum_{i=1}^{R}\sum_{j=1}^{C} n_{ij}^2,\]

where \(r\) and \(c\) are the row and column margins, respectively, and \(R\) and \(C\) are their cardinalities.

>>> C1 = [{1, 2, 3, 4, 5, 6, 7, 8}, {9, 10, 11, 12, 13, 14, 15, 16}]
>>> C2 = [{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, {11, 12, 13, 14, 15, 16}]
>>> t = ClusteringMetrics.from_partitions(C1, C2)
>>> t.mirkin_mismatch_coeff(normalize=False)
56.0
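
The direct formula can be checked against the example above with a few lines of NumPy. The contingency table below is the one implied by C1 and C2; this is an illustrative sketch, not the library's implementation:

import numpy as np

# contingency table implied by C1 and C2 above (rows: C1 clusters, columns: C2 clusters)
n = np.array([[8, 0],
              [2, 6]])
r, c = n.sum(axis=1), n.sum(axis=0)          # row and column margins
mismatch = (r ** 2).sum() + (c ** 2).sum() - 2 * (n ** 2).sum()
print(mismatch)                              # 56, matching the doctest above
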
pairwise

Confusion matrix on all pair assignments from two partitions

A partition of N is a set of disjoint clusters such that every point in N belongs to one and only one cluster and every cluster consists of at least one point. Given two partitions A and B and a co-occurrence matrix of point pairs,

TP : count of pairs found in the same cluster in both A and B
FP : count of pairs found in the same cluster in A but not in B
FN : count of pairs found in the same cluster in B but not in A
TN : count of pairs found in different clusters in both A and B

Note that although the resulting confusion matrix has the form of a correlation table for two binary variables, it is not symmetric if the original partitions are not symmetric.
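
These counts can be reproduced by brute force with itertools, which makes the definitions concrete. The sketch below is independent of the library; the labels encode the Y1 and Y2 partitions from the class docstring:

from itertools import combinations

def pair_confusion(labels_a, labels_b):
    """Brute-force TP/FP/FN/TN over all point pairs for two label assignments."""
    TP = FP = FN = TN = 0
    for (a1, b1), (a2, b2) in combinations(zip(labels_a, labels_b), 2):
        same_a, same_b = a1 == a2, b1 == b2
        if same_a and same_b:
            TP += 1
        elif same_a:
            FP += 1
        elif same_b:
            FN += 1
        else:
            TN += 1
    return TP, FP, FN, TN

# labels encoding Y1 = {(1, 2, 3), (4, 5, 6)} and Y2 = {(1, 2), (3, 4, 5), (6,)}
print(pair_confusion([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 2]))   # (2, 4, 2, 7)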

rand_index()[source]

Pairwise accuracy (uncorrected for chance)

Don’t use this metric; it is only added here as the “worst reference”

class clustering_metrics.metrics.ConfusionMatrix2(TP=None, FN=None, FP=None, TN=None, rows=None)[source]

Bases: clustering_metrics.metrics.ContingencyTable, pymaptools.containers.OrderedCrossTab

A confusion matrix (2x2 contingency table)

For a binary variable (where one is measuring either presence vs absence of a particular feature), a confusion matrix where the ground truth levels are rows looks like this:

>>> cm = ConfusionMatrix2(TP=20, FN=31, FP=14, TN=156)
>>> cm
ConfusionMatrix2(rows=[[20, 31], [14, 156]])
>>> cm.to_array()
array([[ 20,  31],
       [ 14, 156]])

For a nominal variable, the negative class becomes a distinct label, and TP/FP/FN/TN terminology does not apply, although the algorithms should work the same way (with the obvious distinction that different assumptions will be made). For a convenient reference about some of the attributes and methods defined here see [R2].

References

[R2] Wikipedia entry for Confusion Matrix.

Attributes

TP : True positive count
FP : False positive count
TN : True negative count
FN : False negative count
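
Using the counts from the example above, the basic rates documented below follow directly from their standard definitions (plain arithmetic, shown for illustration rather than via the class methods):

TP, FN, FP, TN = 20.0, 31.0, 14.0, 156.0

TPR = TP / (TP + FN)                   # recall / sensitivity, ~0.392
PPV = TP / (TP + FP)                   # precision, ~0.588
TNR = TN / (TN + FP)                   # specificity, ~0.918
ACC = (TP + TN) / (TP + FN + FP + TN)  # accuracy, ~0.796
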
ACC()[source]

Accuracy (Rand Index)

Synonyms: Simple Matching Coefficient, Rand Index

DOR()[source]

Diagnostic odds ratio

Defined as

\[DOR = \frac{PLL}{NLL}.\]

The odds ratio has a number of interesting/desirable properties; however, one peculiarity that leaves us looking for an alternative measure is that on L-shaped matrices like

\[\begin{split}\begin{matrix} 77 & 0 \\ 5 & 26 \end{matrix}\end{split}\]

its value will be infinity.

Also known as: crude odds ratio, Mantel-Haenszel estimate.
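
To illustrate the peculiarity mentioned above: with the standard likelihood-ratio definitions, DOR simplifies to \(TP \cdot TN / (FP \cdot FN)\), so a zero in either off-diagonal cell sends it to infinity. A sketch independent of the library:

def dor(TP, FN, FP, TN):
    # PLL / NLL simplifies to (TP * TN) / (FP * FN)
    if FP * FN == 0:
        return float('inf')
    return (TP * TN) / float(FP * FN)

print(dor(20, 31, 14, 156))   # ~7.19 on the example table above
print(dor(77, 0, 5, 26))      # inf on the L-shaped matrix above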

FDR()[source]

False discovery rate

Synonyms: false alarm ratio, probability of false alarm

FN
FNR()[source]

False Negative Rate

Synonyms: miss rate, frequency of misses

FOR()[source]

False omission rate

Synonyms: detection failure ratio, miss ratio

FP
FPR()[source]

False Positive Rate

Synonyms: fallout

NLL()[source]

Negative likelihood ratio

NPV()[source]

Negative Predictive Value

Synonyms: frequency of correct null forecasts

PLL()[source]

Positive likelihood ratio

PPV()[source]

Positive Predictive Value (Precision)

Synonyms: precision, frequency of hits, post agreement, success ratio, correct alarm ratio

TN
TNR()[source]

True Negative Rate (Specificity)

Synonyms: specificity

TP
TPR()[source]

True Positive Rate (Recall, Sensitivity)

Synonyms: recall, sensitivity, hit rate, probability of detection, prefigurance

accuracy()

Accuracy (Rand Index)

Synonyms: Simple Matching Coefficient, Rand Index

bias_index()[source]

Bias Index

In interrater agreement studies, bias is the extent to which the raters disagree on the positive-negative ratio of the binary variable studied. Example of a confusion matrix with high bias of rater A (represented by rows) towards negative rating:

\[\begin{split}\begin{matrix} 17 & 14 \\ 78 & 81 \end{matrix}\end{split}\]

See also

prevalence_index

cole_coeff()[source]

Cole coefficient

This is exactly the same coefficient as Lewontin’s D’. It is defined as:

\[D' = \frac{cov}{cov_{max}},\]

where \(cov_{max}\) is the maximum covariance attainable under the given marginal distribution. When \(ad \geq bc\), this coefficient is equivalent to Loevinger’s H.

Synonyms: C7, Lewontin’s D’.

covar()[source]

Covariance (determinant of a 2x2 matrix)

dice_coeff()[source]

Dice similarity (Nei-Li coefficient)

This is the same as F1-score, but calculated slightly differently here. Note that Dice can be zero if total number of positives is zero, but F-score is undefined in that case (because recall is undefined).

When adjusted for chance, this coefficient becomes identical to kappa [R3].

Since this coefficient is monotonic with respect to Jaccard and Sokal Sneath coefficients, its resolving power is identical to that of the other two.

References

[R3] Albatineh, A. N., Niewiadomska-Bugaj, M., & Mihalko, D. (2006). On similarity indices and correction for chance agreement. Journal of Classification, 23(2), 301-313.
diseq_coeff(standardize=False)[source]

Linkage disequilibrium

\[D = \frac{a}{n} - \frac{p_1}{n}\frac{p_2}{n} = \frac{cov}{n^2}\]

If standardize=True, this measure is further normalized to maximum covariance attainable under given marginal distribution, and the resulting index is called Lewontin’s D’.

See also

cole_coeff

frequency_bias()[source]

Frequency bias

How much more often rater B predicts the positive class than rater A does

classmethod from_ccw(TP, FP, TN, FN)[source]

Instantiate from counter-clockwise form of TP FP TN FN

classmethod from_random_counts(low=0, high=100)[source]

Instantiate from random values

classmethod from_sets(set1, set2, universe_size=None)[source]

Instantiate from two sets

Accepts an optional universe_size parameter which allows us to take into account TN class and use probability-based similarity metrics. Most of the time, however, set comparisons are performed ignoring this parameter and relying instead on non-probabilistic indices such as Jaccard’s or Dice.

fscore(beta=1.0)[source]

F-score

As beta tends to infinity, F-score will approach recall. As beta tends to zero, F-score will approach precision. A similarity coefficient that uses a similar definition is called Dice coefficient.

See also

dice_coeff

get_score(scoring_method, *args, **kwargs)[source]

Evaluate specified scoring method

hypergeometric()[source]

Hypergeometric association score

informedness()[source]

Informedness (recall corrected for chance)

A complement to markedness. Can be thought of as recall corrected for chance. Alternative formulations:

\[\begin{split}Informedness &= Sensitivity + Specificity - 1.0 \\ &= TPR - FPR\end{split}\]

In the case of ranked predictions, TPR can be plotted on the y-axis with FPR on the x-axis. The resulting plot is known as Receiver Operating Characteristic (ROC) curve [R4]. The delta between a point on the ROC curve and the diagonal is equal to the value of informedness at the given FPR threshold.

This measure was first proposed for evaluating medical diagnostic tests in [R5], and was also used in meteorology under the name “True Skill Score” [R6].

Synonyms: Youden’s J, True Skill Score, Hanssen-Kuipers Score, Attributable Risk, DeltaP.

See also

markedness

References

[R4] Fawcett, T. (2006). An introduction to ROC analysis. Pattern recognition letters, 27(8), 861-874.
[R5] Youden, W. J. (1950). Index for rating diagnostic tests. Cancer, 3(1), 32-35.
[R6] Doswell III, C. A., Davies-Jones, R., & Keller, D. L. (1990). On summary measures of skill in rare event forecasting based on contingency tables. Weather and Forecasting, 5(4), 576-585.
jaccard_coeff()[source]

Jaccard similarity coefficient

The Jaccard coefficient has an interesting property: in L-shaped matrices where either FP or FN is close to zero, its scale becomes equivalent to the scale of either recall or precision, respectively.

Since this coefficient is monotonic with respect to Dice (F-score) and Sokal Sneath coefficients, its resolving power is identical to that of the other two.

Jaccard index does not belong to the L-family of association indices and thus cannot be adjusted for chance by subtracting its value under a fixed-margin null model. Instead, its expectation must be calculated, for which no analytical solution exists [R7].

Synonyms: critical success index

References

[R7] Albatineh, A. N., & Niewiadomska-Bugaj, M. (2011). Correcting Jaccard and other similarity indices for chance agreement in cluster analysis. Advances in Data Analysis and Classification, 5(3), 179-200.
kappa()[source]

Cohen’s Kappa (Interrater Agreement)

Kappa coefficient is best known in the psychology field where it was introduced to measure interrater agreement [R8]. It has also been used in replication studies [R9], clustering evaluation [R10], image segmentation [R11], feature selection [R12] [R13], forecasting [R14], and network link prediction [R15]. The first derivation of this measure is in [R16].

Kappa can be derived by correcting either Accuracy (Simple Matching Coefficient, Rand Index) or F1-score (Dice Coefficient) for chance. Conversely, Dice coefficient can be derived from Kappa by obtaining its limit as \(d \rightarrow \infty\). Normalizing Kappa by its maximum value given fixed-margin table gives Loevinger’s H.

Synonyms: Adjusted Rand Index, Heidke Skill Score

References

[R8] Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1), 37-46.
[R9] Arabie, P., Hubert, L. J., & De Soete, G. (1996). Clustering validation: results and implications for applied analyses (p. 341). World Scientific Pub Co Inc.
[R10] Warrens, M. J. (2008). On the equivalence of Cohen’s kappa and the Hubert-Arabie adjusted Rand index. Journal of Classification, 25(2), 177-183.
[R11] Briggman, K., Denk, W., Seung, S., Helmstaedter, M. N., & Turaga, S. C. (2009). Maximin affinity learning of image segmentation. In Advances in Neural Information Processing Systems (pp. 1865-1873).
[R12] Santos, J. M., & Embrechts, M. (2009). On the use of the adjusted rand index as a metric for evaluating supervised classification. In Artificial neural networks - ICANN 2009 (pp. 175-184). Springer Berlin Heidelberg.
[R13] Santos, J. M., & Ramos, S. (2010, November). Using a clustering similarity measure for feature selection in high dimensional data sets. In Intelligent Systems Design and Applications (ISDA), 2010 10th International Conference on (pp. 900-905). IEEE.
[R14] Doswell III, C. A., Davies-Jones, R., & Keller, D. L. (1990). On summary measures of skill in rare event forecasting based on contingency tables. Weather and Forecasting, 5(4), 576-585.
[R15] Hoffman, M., Steinley, D., & Brusco, M. J. (2015). A note on using the adjusted Rand index for link prediction in networks. Social Networks, 42, 72-79.
[R16] Heidke, Paul. “Berechnung des Erfolges und der Güte der Windstärkevorhersagen im Sturmwarnungsdienst.” Geografiska Annaler (1926): 301-349.
kappas()[source]

Pairwise precision and recall corrected for chance

Kappa decomposes into a pair of components (regression coefficients), \(\kappa_0\) (precision-like) and \(\kappa_1\) (recall-like), of which it is a harmonic mean:

\[\kappa_0 = \frac{cov}{p_2 q_1}, \quad \kappa_1 = \frac{cov}{p_1 q_2}.\]

These coefficients are interesting because they represent precision and recall, respectively, corrected for chance by subtracting the fixed-margin null model. In clustering context, \(\kappa_0\) corresponds to pairwise homogeneity, while \(\kappa_1\) corresponds to pairwise completeness. The geometric mean of the two components is equal to Matthews’ Correlation Coefficient, while their maximum is equal to Loevinger’s H when \(ad \geq bc\).
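
A small numeric sketch of this decomposition (using the rows=[[TP, FN], [FP, TN]] layout from ConfusionMatrix2, with \(p\) as row margins and \(q\) as column margins); the harmonic mean of the components recovers kappa and their geometric mean recovers MCC, as stated above:

from math import sqrt

a, b, c, d = 20.0, 31.0, 14.0, 156.0   # TP, FN, FP, TN
p1, p2 = a + b, c + d                  # row margins
q1, q2 = a + c, b + d                  # column margins
cov = a * d - b * c                    # determinant ("covariance")

kappa0 = cov / (p2 * q1)               # precision-like (pairwise homogeneity)
kappa1 = cov / (p1 * q2)               # recall-like (pairwise completeness)

harmonic = 2 * kappa0 * kappa1 / (kappa0 + kappa1)
geometric = sqrt(kappa0 * kappa1)
print(round(harmonic, 4))              # 0.3507, Cohen's kappa for this table
print(round(geometric, 4))             # 0.3618, Matthews' correlation coefficient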

lform()[source]

Factory creating L-form version of current table

loevinger_coeff()[source]

Loevinger’s Index of Homogeneity (Loevinger’s H)

Given a clustering (numbers correspond to class labels, inner groups to clusters) with perfect homogeneity but imperfect completeness, Loevinger coefficient returns a perfect score on the corresponding pairwise co-association matrix:

>>> clusters = [[0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
>>> t = ClusteringMetrics.from_clusters(clusters)
>>> t.pairwise.loevinger_coeff()
1.0

At the same time, kappa and Matthews coefficients are 0.63 and 0.68, respectively. Loevinger coefficient will also return a perfect score for the dual situation:

>>> clusters = [[0, 2, 2, 0, 0, 0], [1, 1, 1, 1]]
>>> t = ClusteringMetrics.from_clusters(clusters)
>>> t.pairwise.loevinger_coeff()
1.0

Loevinger’s coefficient has a unique property: all two-way correlation coefficients on a 2x2 table that are in L-family (including Kappa and Matthews’ correlation coefficient) become Loevinger’s coefficient after normalization by maximum value [R18]. However, this measure is not symmetric: when \(ad < bc\), it does not have a lower bound. For an equivalent symmetric measure, use Cole coefficient.

See also

cole_coeff

References

[R18] Warrens, M. J. (2008). On association coefficients for 2x2 tables and properties that do not depend on the marginal distributions. Psychometrika, 73(4), 777-789.
markedness()[source]

Markedness (precision corrected for chance)

A complement to informedness. Can be thought of as precision corrected for chance. Alternative formulations:

\[\begin{split}Markedness &= PPV + NPV - 1.0 \\ &= PPV - FOR\end{split}\]

In the case of ranked predictions, PPV can be plotted on the y-axis with FOR on the x-axis. The resulting plot is known as Relative Operating Level (ROL) curve [R19]. The delta between a point on the ROL curve and the diagonal is equal to the value of markedness at the given FOR threshold.

Synonyms: DeltaP′

See also

informedness

References

[R19] Mason, S. J., & Graham, N. E. (2002). Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation. Quarterly Journal of the Royal Meteorological Society, 128(584), 2145-2166.
matthews_corr()[source]

Matthews Correlation Coefficient (Phi coefficient)

MCC is directly related to the Chi-square statistic. Its value is equal to the Chi-square value normalized by the maximum value the Chi-square can achieve with given margins (for a 2x2 table, the maximum Chi-square score is equal to the grand total N) transformed to correlation space by taking a square root.

MCC is also the geometric mean of informedness and markedness (the regression coefficients of the problem and its dual). As \(d \rightarrow \infty\), MCC turns into the Ochiai coefficient. Unlike with Kappa, normalizing the corresponding similarity coefficient for chance by subtracting the fixed-margin null model does not produce MCC in return, but gives a different index with equivalent discriminating power to that of MCC. Normalizing MCC by its maximum value under the fixed-margin model gives Loevinger’s H.

Empirically, the discriminating power of MCC is slightly better than that of mp_corr and kappa, and is only lower than that of loevinger_coeff under highly biased conditions. While MCC is a commonly used and recently preferred measure of prediction and reproducibility [R20], it is somewhat strange that one can hardly find any literature that uses this index in the clustering comparison context, with some rare exceptions [R21] [R22].

Synonyms: Phi Coefficient, Product-Moment Correlation

References

[R20] MAQC Consortium. (2010). The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nature biotechnology, 28(8), 827-838.
[R21] Xiao, J., Wang, X. F., Yang, Z. F., & Xu, C. W. (2008). Comparison of Supervised Clustering Methods for the Analysis of DNA Microarray Expression Data. Agricultural Sciences in China, 7(2), 129-139.
[R22] Kao, D. (2012). Using Matthews correlation coefficient to cluster annotations. NextGenetics (personal blog).
mic_scores(mean='harmonic')[source]

Mutual information-based correlation

The coefficient decomposes into regression coefficients defined according to fixed-margin tables. The mic1 coefficient, for example, is obtained by dividing the G-score by the maximum achievable value on a table with fixed true class counts (which here correspond to row totals). The mic0 is its dual, defined by dividing the G-score by its maximum achievable value with fixed predicted label counts (here represented as column totals).

mic0 roughly corresponds to precision (homogeneity) while mic1 roughly corresponds to recall (completeness).

mp_corr()[source]

Maxwell & Pilliner’s association index

Another covariance-based association index corrected for chance. Like MCC, based on a mean of informedness and markedness, except uses a harmonic mean instead of geometric. Like Kappa, turns into Dice coefficient (F-score) as ‘d’ approaches infinity.

On typical problems, the resolving power of this coefficient is nearly identical to that of Cohen’s Kappa and is only very slightly below that of Matthews’ correlation coefficient.

See also

kappa, matthews_corr

ochiai_coeff()[source]

Ochiai similarity coefficient (Fowlkes-Mallows)

One interpretation of this coefficient is that it is equal to the geometric mean of the conditional probability of an element (in the case of pairwise clustering comparison, a pair of elements) belonging to the same cluster given that they belong to the same class [R23].

This coefficient is in the L-family, and thus it can be corrected for chance by subtracting its value under fixed-margin null model. The resulting adjusted index is very close to, but not the same as, Matthews Correlation Coefficient. Empirically, the discriminating power of the adjusted coefficient is equal to that of Matthews’ Correlation Coefficient to within rounding error.

Synonyms: Cosine Similarity, Fowlkes-Mallows Index

References

[R23] Ramirez, E. H., Brena, R., Magatti, D., & Stella, F. (2012). Topic model validation. Neurocomputing, 76(1), 125-133.
ochiai_coeff_adj()[source]

Ochiai coefficient adjusted for chance

This index is nearly identical to Matthews’ Correlation Coefficient, which should be used instead.

overlap_coeff()[source]

Overlap coefficient (Szymkiewicz-Simpson coefficient)

Can be obtained by standardizing Dice or Ochiai coefficients by their maximum possible value given fixed marginals. Not corrected for chance.

Note that \(min(p_1, p_2)\) is equal to the maximum value of \(a\) given fixed marginals.

When adjusted for chance, this coefficient turns into Loevinger’s H.

See also

loevinger_coeff

pairwise_hcv()[source]

Pairwise homogeneity, completeness, and their geometric mean

Each of the two one-sided measures is defined as follows:

\[\hat{M}_{adj} = \frac{M - E[M]}{M_{max} - min(E[M], M)}.\]

It is clear from the definition above that when \(M < E[M]\) (and \(M \leq M_{max}\)), the denominator switches from the standard normalization interval to a larger one, thereby ensuring that \(-1.0 \leq \hat{M}_{adj} \leq 1.0\). The definition for the bottom half of the range can also be expressed in terms of the standard adjusted value:

\[\hat{M}_{adj} = \frac{M_{adj}}{(1 + |M_{adj}|^n)^{1/n}}, \quad M_{adj} < 0, n = 1.\]

The resulting measure is not symmetric over its range (negative values are scaled differently from positive values); however, this should not matter for applications where negative correlation does not carry any special meaning other than being additional evidence for the absence of positive correlation. Such a situation occurs in the pairwise confusion matrices used in cluster analysis. Nevertheless, if more symmetric behavior near zero is desired, the upper part of the negative range can be linearized either by increasing \(n\) in the definition above or by replacing it with the \(\hat{M}_{adj} = tanh(M_{adj})\) transform.
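
The equivalence between the piecewise definition and the squashed form for negative values (with \(n = 1\)) can be verified numerically; a minimal sketch, not tied to the library internals:

def adjusted(M, E, M_max):
    # piecewise form: the denominator switches when M < E[M]
    return (M - E) / (M_max - min(E, M))

def adjusted_squashed(M, E, M_max):
    # standard adjustment followed by unit-sigmoid squashing of negative values (n = 1)
    M_adj = (M - E) / (M_max - E)
    return M_adj if M_adj >= 0 else M_adj / (1.0 + abs(M_adj))

print(adjusted(0.2, 0.5, 1.0), adjusted_squashed(0.2, 0.5, 1.0))   # both -0.375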

For the compound measure, the geometric mean was chosen over the harmonic after the results of a Monte Carlo power analysis, due to slightly better discriminating performance. For positive matrices, the geometric mean is equal to matthews_corr, while the harmonic mean would have been equal to kappa. For negative matrices, the harmonic mean would have remained monotonic (though not equal) to Kappa, while the geometric mean is neither monotonic nor equal to MCC, despite the two being closely correlated. The discriminating performance indices of the geometric mean and of MCC are empirically the same (equal to within rounding error).

For matrices with negative covariance, it is possible to switch to markedness and informedness as one-sided components (homogeneity and completeness, respectively). However, the desirable property of measure orthogonality will not be preserved then, since markedness and informedness exhibit strong correlation under the assumed null model.

precision()

Positive Predictive Value (Precision)

Synonyms: precision, frequency of hits, post agreement, success ratio, correct alarm ratio

prevalence_index()[source]

Prevalence

In interrater agreement studies, prevalence is high when the proportion of agreements on the positive classification differs from that of the negative classification. Example of a confusion matrix with high prevalence of negative response (note that this happens regardless of which rater we look at):

\[\begin{split}\begin{matrix} 3 & 27 \\ 28 & 132 \end{matrix}\end{split}\]

See also

bias_index

recall()

True Positive Rate (Recall, Sensitivity)

Synonyms: recall, sensitivity, hit rate, probability of detection, prefigurance

sensitivity()

True Positive Rate (Recall, Sensitivity)

Synonyms: recall, sensitivity, hit rate, probability of detection, prefigurance

sokal_sneath_coeff()[source]

Sokal and Sneath similarity index

In a 2x2 matrix

\[\begin{split}\begin{matrix} a & b \\ c & d \end{matrix}\end{split}\]

Dice places more weight on the \(a\) component, Jaccard places equal weight on \(a\) and \(b + c\), while Sokal and Sneath place more weight on \(b + c\).

specificity()

True Negative Rate (Specificity)

Synonyms: specificity

to_ccw()[source]

Convert to counter-clockwise form of TP FP TN FN

xcoeff()[source]

Alternative to loevinger_coeff but with -1 lower bound

yule_q()[source]

Yule’s Q (association index)

Yule’s Q relates to the odds ratio (DOR) as follows:

\[Q = \frac{DOR - 1}{DOR + 1}.\]
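
In terms of the cell counts \(a, b, c, d\), where \(DOR = ad/bc\), this is the same as \(Q = (ad - bc)/(ad + bc)\); a quick numeric check:

a, b, c, d = 20.0, 31.0, 14.0, 156.0     # TP, FN, FP, TN
DOR = (a * d) / (b * c)
print(round((DOR - 1) / (DOR + 1), 4))              # 0.7558
print(round((a * d - b * c) / (a * d + b * c), 4))  # 0.7558, same value
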
yule_y()[source]

Yule’s Y (colligation coefficient)

The Y coefficient was used as the basis of a new association measure that accounts for entropy in [R24].

References

[R24] Hasenclever, D., & Scholz, M. (2013). Comparing measures of association in 2x2 probability tables. arXiv preprint arXiv:1302.6161.
class clustering_metrics.metrics.ContingencyTable(*args, **kwargs)[source]

Bases: pymaptools.containers.CrossTab

adjust_to_null(measure, model='m3', with_warnings=False)[source]

Adjust a measure to null model

The general formula for chance correction of an association measure \(M\) is:

\[M_{adj} = \frac{M - E(M)}{M_{max} - E(M)},\]

where \(M_{max}\) is the maximum value a measure \(M\) can achieve, and \(E(M)\) is the expected value of \(M\) under statistical independence given fixed table margins. In simple cases, the expected value of a measure is the same as the value of the measure given a null model. This is not always the case, however, and to properly adjust for chance, sometimes one has to average over all possible contingency tables, using the hypergeometric distribution for example.

The method returns a tuple for two different measure ceilings: row-diagonal and column-diagonal. For symmetric measures, the two values will be the same.
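
In the simple case where the expected value equals the measure's value on the null-model table, the correction is a one-liner restating the formula above (the actual method additionally handles the model choice and the two ceilings):

def chance_adjust(value, expected, maximum):
    """Generic chance correction: (M - E(M)) / (M_max - E(M))."""
    return (value - expected) / (maximum - expected)

# e.g. a raw accuracy of 0.80, expected accuracy 0.69 under fixed margins, ceiling 1.0
print(round(chance_adjust(0.80, 0.69, 1.0), 3))    # 0.355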

adjusted_mutual_info()[source]

Adjusted Mutual Information for two partitions

For a mathematical definition, see [R25], [R26], and [R27].

References

[R25] Vinh, N. X., Epps, J., & Bailey, J. (2009, June). Information theoretic measures for clusterings comparison: is a correction for chance necessary?. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1073-1080). ACM.
[R26] Vinh, N. X., & Epps, J. (2009, June). A novel approach for automatic number of clusters detection in microarray data based on consensus clustering. In Bioinformatics and BioEngineering, 2009. BIBE'09. Ninth IEEE International Conference on (pp. 84-91). IEEE.
[R27] Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11, 2837-2854.
assignment_score(normalize=True, model='m1', discrete=False, redraw=False)[source]

Similarity score by solving the Linear Sum Assignment Problem

This metric is uniformly more powerful than the similarly behaved split_join_similarity which relies on an approximation to the optimal solution evaluated here. The split-join approximation asymptotically approaches the optimal solution as the clustering quality improves.

On the model parameter: adjusting assignment cost for chance by relying on the hypergeometric distribution is extremely computationally expensive, but one way to get a better behaved metric is to just subtract the cost of a null model from the obtained score (in case of normalization, the null cost also has to be subtracted from the maximum cost). Note that on large tables even finding the null cost is too expensive, since expected tables have a lot less sparsity. Hence the parameter is off by default.

Alternatively this problem can be recast as that of finding a maximum weighted bipartite match [R28].

This method of partition comparison was first mentioned in [R29], given an approximation in [R30], formally elaborated in [R31] and empirically compared with other measures in [R32].
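
A minimal sketch of the underlying idea using SciPy's LSAP solver: maximize the total count over a one-to-one class-cluster matching and normalize by the grand total. The library's null-model handling and normalization may differ; this only illustrates the optimization step:

import numpy as np
from scipy.optimize import linear_sum_assignment

def assignment_similarity(table):
    """Share of points covered by an optimal one-to-one class-cluster matching."""
    table = np.asarray(table, dtype=float)
    rows, cols = linear_sum_assignment(-table)   # negate to maximize
    return table[rows, cols].sum() / table.sum()

# contingency table for Y1 vs Y2 from the class docstring
print(assignment_similarity([[2, 1, 0],
                             [0, 2, 1]]))        # 4/6 ~ 0.667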

References

[R28] Wikipedia entry on weighted bipartite graph matching.
[R29] Almudevar, A., & Field, C. (1999). Estimation of single-generation sibling relationships based on DNA markers. Journal of agricultural, biological, and environmental statistics, 136-165.
[R30] Ben-Hur, A., & Guyon, I. (2003). Detecting stable clusters using principal component analysis. In Functional Genomics (pp. 159-182). Humana press.
[R31] Gusfield, D. (2002). Partition-distance: A problem and class of perfect graphs arising in clustering. Information Processing Letters, 82(3), 159-164.
[R32] Giurcaneanu, C. D., & Tabus, I. (2004). Cluster structure inference based on clustering stability with applications to microarray data analysis. EURASIP Journal on Applied Signal Processing, 2004, 64-80.
assignment_score_m1(normalize=True, redraw=False)[source]
assignment_score_m2c(normalize=True, redraw=False)[source]
assignment_score_m2r(normalize=True, redraw=False)[source]
assignment_score_m3(normalize=True, redraw=False)[source]
bc_metrics()[source]

‘B-cubed’ precision, recall, and fscore

As described in [R33] and [R34], and extended to overlapping clusters in [R35]. These metrics perform very similarly to the normalized entropy metrics (homogeneity, completeness, V-measure).

References

[R33] Bagga, A., & Baldwin, B. (1998, August). Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1 (pp. 79-85). Association for Computational Linguistics.
[R34] Bagga, A., & Baldwin, B. (1998, May). Algorithms for scoring coreference chains. In The first international conference on language resources and evaluation workshop on linguistics coreference (Vol. 1, pp. 563-566).
[R35] Amigó, E., Gonzalo, J., Artiles, J., & Verdejo, F. (2009). A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information retrieval, 12(4), 461-486.
chisq_score()[source]

Pearson’s chi-square statistic

>>> r = {1: {1: 16, 3: 2}, 2: {1: 1, 2: 3}, 3: {1: 4, 2: 5, 3: 5}}
>>> cm = ContingencyTable(rows=r)
>>> round(cm.chisq_score(), 3)
19.256
col_diag()[source]

Factory creating diagonal table given current column margin

entropy_scores(mean='harmonic')[source]

Gives three entropy-based metrics for an RxC table

The metrics are: Homogeneity, Completeness, and V-measure

The V-measure metric is also known as Normalized Mutual Information (NMI), and is calculated here as the harmonic mean of Homogeneity and Completeness (\(NMI_{sum}\)). There exist other definitions of NMI (see Table 2 in [R36] for a good review).

Homogeneity and Completeness are duals of each other and can be thought of (although this is not technically accurate) as squared regression coefficients of a given clustering vs true labels (homogeneity) and of the dual problem of true labels vs given clustering (completeness). Because of the dual property, in a symmetric matrix, all three scores are the same. Homogeneity has an overall profile similar to that of precision in information retrieval. Completeness roughly corresponds to recall.

This method replaces homogeneity_completeness_v_measure method in Scikit-Learn. The Scikit-Learn version takes up \(O(n^2)\) space because it stores data in a dense NumPy array, while the given version is sub-quadratic because of sparse underlying storage.

Note that the entropy variables H in the code below are improperly defined because they ought to be divided by N (the grand total for the contingency table). However, the N variable cancels out during normalization.
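
A dense NumPy sketch of the three scores computed directly from a contingency table, following the standard definitions (homogeneity = \(1 - H(C|K)/H(C)\), completeness = \(1 - H(K|C)/H(K)\), V-measure = their harmonic mean); the library's sparse implementation differs in detail:

import numpy as np

def hcv_scores(table):
    """Homogeneity, completeness, V-measure; rows are assumed to be true classes."""
    p = np.asarray(table, dtype=float)
    p /= p.sum()
    pr, pc = p.sum(axis=1), p.sum(axis=0)

    def H(x):
        x = x[x > 0]
        return -(x * np.log(x)).sum()

    hom = 1.0 - (H(p.ravel()) - H(pc)) / H(pr)   # 1 - H(C|K)/H(C)
    com = 1.0 - (H(p.ravel()) - H(pr)) / H(pc)   # 1 - H(K|C)/H(K)
    return hom, com, 2 * hom * com / (hom + com)

# homogeneous but incomplete clustering: one class split across two clusters
print([round(x, 3) for x in hcv_scores([[2, 2, 0], [0, 0, 4]])])   # [1.0, 0.667, 0.8]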

References

[R36] Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11, 2837-2854.
expected(model='m3', discrete=False, redraw=False)[source]

Factory creating expected table given current margins

expected_freqs_(model='m3')[source]
g_score()[source]

G-statistic for RxC contingency table

This method does not perform any corrections to this statistic (e.g. Williams’, Yates’ corrections).

The statistic is equivalent to the negative of Mutual Information times two. Mutual Information on a contingency table is defined as the difference between the information in the table and the information in an independent table with the same margins. For an application of mutual information (in the form of the G-score) to search for collocated words in NLP, see [R37] and [R38].

References

[R37] Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational linguistics, 19(1), 61-74.
[R38] Ted Dunning’s personal blog entry and the discussion under it.
muc_scores()[source]

MUC similarity indices for coreference scoring

Implemented after description in [R39]. The compound fscore-like metric has good resolving power on sparse models, similar to fowlkes_mallows (pairwise ochiai_coeff), however it becomes useless on dense matrices as it relies on category cardinalities (how many types were seen) rather than on observation counts (how many instances of each type were seen).

>>> p1 = [x.split() for x in ["A B C", "D E F G"]]
>>> p2 = [x.split() for x in ["A B", "C", "D", "E", "F G"]]
>>> cm = ClusteringMetrics.from_partitions(p1, p2)
>>> cm.muc_scores()[:2]
(1.0, 0.4)

Elements that are part of neither partition (in this case, E) are excluded from consideration:

>>> p1 = [x.split() for x in ["A B", "C", "D", "F G", "H"]]
>>> p2 = [x.split() for x in ["A B", "C D", "F G H"]]
>>> cm = ClusteringMetrics.from_partitions(p1, p2)
>>> cm.muc_scores()[:2]
(0.5, 1.0)

References

[R39] Vilain, M., Burger, J., Aberdeen, J., Connolly, D., & Hirschman, L. (1995, November). A model-theoretic coreference scoring scheme. In Proceedings of the 6th conference on Message understanding (pp. 45-52). Association for Computational Linguistics.
mutual_info_score()[source]

Mutual Information Score

Mutual Information (divided by N).

The metric is equal to the Kullback-Leibler divergence of the joint distribution with the product distribution of the marginals.

row_diag()[source]

Factory creating diagonal table given current row margin

split_join_distance(normalize=True)[source]

Distance metric based on split_join_similarity

split_join_similarity(normalize=True, model='m1')[source]

Split-join similarity score

Split-join similarity is a two-way assignment-based score first proposed in [R40]. The distance variant of this measure has metric properties. Like the better known purity score (a one-way coefficient), this measure implicitly performs class-cluster assignment, except the assignment is performed twice: based on the corresponding maximum frequency in the contingency table, each class is given a cluster, with the assignment weighted according to the frequency; the procedure is then reversed to assign a class to each cluster. The final unnormalized score is a simple sum of the two one-way assignment scores.

By default, the m1 null model is subtracted to make the final score independent of the number of clusters:

>>> t2 = ClusteringMetrics(rows=10 * np.ones((2, 2), dtype=int))
>>> t2.split_join_similarity(model=None)
0.5
>>> t2.split_join_similarity(model='m1')
0.0
>>> t8 = ClusteringMetrics(rows=10 * np.ones((8, 8), dtype=int))
>>> t8.split_join_similarity(model=None)
0.125
>>> t8.split_join_similarity(model='m1')
0.0
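
The two-way assignment is easy to reproduce by hand: sum the row maxima (each class assigned its best cluster), sum the column maxima (the reverse assignment), and normalize by twice the grand total. The sketch below reproduces the unadjusted (model=None) doctest values:

import numpy as np

def split_join(table):
    t = np.asarray(table, dtype=float)
    return (t.max(axis=1).sum() + t.max(axis=0).sum()) / (2 * t.sum())

# contingency table for Y1 = {(1, 2, 3), (4, 5, 6)} vs Y2 = {(1, 2), (3, 4, 5), (6,)}
print(split_join([[2, 1, 0],
                  [0, 2, 1]]))              # 0.75
print(split_join(10 * np.ones((2, 2))))     # 0.5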

See also

assignment_score

References

[R40] Dongen, S. V. (2000). Performance criteria for graph clustering and Markov cluster experiments. Information Systems [INS], (R 0012), 1-36.
split_join_similarity_m1(normalize=True)[source]
split_join_similarity_m2c(normalize=True)[source]
split_join_similarity_m2r(normalize=True)[source]
split_join_similarity_m3(normalize=True)[source]
talburt_wang_index()[source]

Talburt-Wang index of similarity of two partitions

On sparse matrices, the resolving power of this measure asymptotically approaches that of assignment-based scores such as assignment_score and split_join_similarity, however on dense matrices this measure will not perform well due to its reliance on category cardinalities (how many types were seen) rather than on observation counts (how many instances of each type were seen).

A relatively decent clustering:

>>> a = [ 1,  1,  1,  2,  2,  2,  2,  3,  3,  4]
>>> b = [43, 56, 56,  5, 36, 36, 36, 74, 74, 66]
>>> t = ContingencyTable.from_labels(a, b)
>>> round(t.talburt_wang_index(), 3)
0.816

Less good clustering (example from [R41]):

>>> clusters = [[1, 1], [1, 1, 1, 1], [2, 3], [2, 2, 3, 3],
...             [3, 3, 4], [3, 4, 4, 4, 4, 4, 4, 4, 4, 4]]
>>> t = ContingencyTable.from_clusters(clusters)
>>> round(t.talburt_wang_index(), 2)
0.49

References

[R41] Talburt, J., Wang, R., Hess, K., & Kuo, E. (2007). An algebraic approach to data quality metrics for entity resolution over large datasets. Information quality management: Theory and applications, 1-22.
to_array(default=0, cpad=False, rpad=False)[source]

Convert to NumPy array

vi_distance(normalize=True)[source]

Variation of Information distance

Defined in [R42]. This is one of several possible entropy-based distance measures that could be defined on an RxC matrix. The given measure is equivalent to \(2 D_{sum}\) as listed in Table 2 in [R43].

Note that the entropy variables H below are calculated using natural logs, so a base correction may be necessary if you need your result in base 2 for example.

References

[R42] Meila, M. (2007). Comparing clusterings – an information based distance. Journal of multivariate analysis, 98(5), 873-895.
[R43] Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. The Journal of Machine Learning Research, 11, 2837-2854.
vi_similarity(normalize=True, model='m1')[source]

Inverse of vi_distance

The m1 adjustment is monotonic for tables of fixed size. The m3 adjustment turns this measure into Normalized Mutual Information (NMI)

vi_similarity_m1(normalize=True)[source]
vi_similarity_m2c(normalize=True)[source]
vi_similarity_m2r(normalize=True)[source]
vi_similarity_m3(normalize=True)[source]
clustering_metrics.metrics.adjusted_mutual_info_score(labels_true, labels_pred)[source]

Adjusted Mutual Information for two partitions

This is a memory-efficient replacement for the equivalently named Scikit-Learn function.

Perfect labelings are both homogeneous and complete, hence AMI has the perfect score of one:

>>> adjusted_mutual_info_score([0, 0, 1, 1], [0, 0, 1, 1])
1.0
>>> adjusted_mutual_info_score([0, 0, 1, 1], [1, 1, 0, 0])
1.0

If class members are completely split across different clusters, the assignment is utterly incomplete, hence AMI equals zero:

>>> adjusted_mutual_info_score([0, 0, 0, 0], [0, 1, 2, 3])
0.0
clustering_metrics.metrics.adjusted_rand_score(labels_true, labels_pred)[source]

Rand score (accuracy) corrected for chance

This is a memory-efficient replacement for the equivalently named Scikit-Learn function.

In a supplement to [R44], the following example is given:

>>> classes = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
>>> clusters = [1, 2, 1, 2, 2, 3, 3, 3, 3, 3]
>>> round(adjusted_rand_score(classes, clusters), 3)
0.313

References

[R44] Yeung, K. Y., & Ruzzo, W. L. (2001). Details of the adjusted Rand index and clustering algorithms, supplement to the paper “An empirical study on principal component analysis for clustering gene expression data”. Bioinformatics, 17(9), 763-774.
clustering_metrics.metrics.cohen_kappa(*args, **kwargs)[source]

Return Cohen’s Kappa for a 2x2 contingency table

clustering_metrics.metrics.confmat2_type

alias of ConfusionMatrix2

clustering_metrics.metrics.geometric_mean(x, y)[source]

Geometric mean of two numbers. Always returns a float

Although the geometric mean is defined for negative numbers, the SciPy function doesn’t allow them; hence this function.

clustering_metrics.metrics.geometric_mean_weighted(x, y, ratio=1.0)[source]

Geometric mean of two numbers with a weight ratio. Returns a float

>>> geometric_mean_weighted(1, 4, ratio=1.0)
2.0
>>> geometric_mean_weighted(1, 4, ratio=0.0)
1.0
>>> geometric_mean_weighted(1, 4, ratio=float('inf'))
4.0
clustering_metrics.metrics.harmonic_mean(x, y)[source]

Harmonic mean of two numbers. Always returns a float

clustering_metrics.metrics.harmonic_mean_weighted(x, y, ratio=1.0)[source]

Harmonic mean of two numbers with a weight ratio. Returns a float

>>> harmonic_mean_weighted(1, 3, ratio=1.0)
1.5
>>> harmonic_mean_weighted(1, 3, ratio=0.0)
1.0
>>> harmonic_mean_weighted(1, 3, ratio=float('inf'))
3.0
clustering_metrics.metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)[source]

Memory-efficient replacement for equivalently named Scikit-Learn function

clustering_metrics.metrics.jaccard_similarity(iterable1, iterable2)[source]

Jaccard similarity between two sets

Parameters:

iterable1 : collections.Iterable

first bag of items (order irrelevant)

iterable2 : collections.Iterable

second bag of items (order irrelevant)

Returns:

jaccard_similarity : float

clustering_metrics.metrics.mutual_info_score(labels_true, labels_pred)[source]

Memory-efficient replacement for the equivalently named Scikit-Learn function

clustering_metrics.metrics.product_moment(*args, **kwargs)[source]

Return MCC score for a 2x2 contingency table

clustering_metrics.metrics.ratio2weights(ratio)[source]

Numerically accurate conversion of ratio of two weights to weights

clustering_metrics.metrics.unitsq_sigmoid(x, s=0.5)[source]

Unit square sigmoid (for transforming P-like scales)

>>> round(unitsq_sigmoid(0.1), 4)
0.25
>>> round(unitsq_sigmoid(0.9), 4)
0.75