clustering_metrics.ranking module

Motivation

Assume a data set of mostly unique samples in which a hidden binary variable depends on the number of similar samples present in the set (i.e. a sample is called positive if it has many neighbors), and that our goal is to label every sample in the set. Given sparse enough data, a clustering method that relies on the same sample property on which the ground-truth similarity space is defined will naturally separate the samples into two groups: those found in clusters, containing mostly positives, and those found outside clusters, containing mostly negatives. Only one perfect clustering would exist: one with a single, entirely homogeneous cluster C that covers all positives present in the data set. If we were to produce such a clustering, we could correctly label all positive samples in one step with the simple rule "all positive samples belong to cluster C." Under an imperfect clustering, however, the presence of a given sample in a cluster of size two or more only implies that the sample is somewhat more likely to be positive, with the confidence of the positive call increasing monotonically with the size of the cluster. In other words, our expectation of a good clustering is that it will help us minimize the amount of work spent labeling samples.

The idea for this metric originated while mining for positive spam examples in large data sets of short user-generated content. Given large enough data sets, spam content naturally forms clusters, either because creative rewriting of each individual spam message is too expensive for spammers to employ, or because, even if human or algorithmic rewriting is applied, one can still find features that link individual spam messages to their creator or to the product or service being promoted in the spam campaign. This finding is consistent with what is reported in the literature [104].

Algorithm

Given a clustering, we order the clusters from largest to smallest. We then plot a cumulative step function in which the width of the bin under a given “step” is proportional to cluster size, and the height of the bin is proportional to the expected number of positive samples seen so far [103]. If a sample is in a cluster of size one, we assume it is likely to be negative and is therefore checked on an individual basis (the specific cluster size at which this expectation changes is our ‘threshold’ parameter). As a result of this assumption, the expected contribution from unclustered samples is equal to their actual contribution (we assume that individual checking always gives a correct answer). After two-way normalization, a perfect clustering (i.e. one where a single, perfectly homogeneous cluster covers the entire set of positives) will have an AUL score of 1.0. A complete absence of clustering will result in an AUL of 0.5. A perverse clustering, i.e. one where many negative samples fall into clusters whose size is above our threshold, or where many positive samples remain unclustered (fall into clusters of size not exceeding the threshold), will have an AUL somewhere between 0.0 and 0.5.

Special treatment is necessary for cases where clusters are tied by size. If one were to treat tied clusters as a single group, one would obtain an AUL of 1.0 when no clusters are present at all, which goes against our desiderata. On the other hand, if one were to treat tied clusters entirely separately, one would obtain different results depending on the properties of the sorting algorithm, also an undesirable situation. Always placing “heavy” clusters (i.e. those containing more positives) towards the beginning or towards the end of the tied group would result in, respectively, overestimating or underestimating the true AUL. The solution is to average the positive counts among all clusters in a tied group and then walk through them one by one, with the stepwise cumulative function asymptotically approaching a diagonal from the group’s bottom left corner to its top right one. This way, a complete absence of clustering (i.e. all clusters are of size one) will always result in an AUL of 0.5.
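The procedure above, including the tie-averaging step, can be sketched in a few lines of Python. This is a minimal, self-contained illustration, not the module’s implementation: in particular, the expected positive count of an above-threshold cluster is taken here to be p²/s (its actual positive count p scaled by its purity p/s), which is one simple way to penalize non-homogeneous clusters as footnote [103] suggests; the module’s exact definition may differ.

```python
from itertools import groupby

def aul_sketch(clusters, threshold=1):
    """AUL sketch. `clusters` is a list of clusters, each a list of
    binary labels (1 = positive, 0 = negative)."""
    stats = []
    for members in clusters:
        size, positives = len(members), sum(members)
        if size > threshold:
            # hypothetical expectation: positives scaled by cluster purity
            expected = positives * positives / float(size)
        else:
            # unclustered samples are checked individually, so their
            # expected contribution equals their actual contribution
            expected = float(positives)
        stats.append((size, expected))
    stats.sort(key=lambda t: t[0], reverse=True)  # largest clusters first
    total_width = sum(size for size, _ in stats)
    area, height = 0.0, 0.0
    # walk tied-size groups, averaging expected counts within each group
    for size, group in groupby(stats, key=lambda t: t[0]):
        group = list(group)
        avg = sum(expected for _, expected in group) / len(group)
        for _ in group:
            height += avg          # averaged step within the tied group
            area += size * height  # bin width proportional to cluster size
    if not total_width or not height:
        return 0.0
    return area / (total_width * height)  # two-way normalization
```

With these assumptions, a perfect clustering (a single homogeneous cluster holding all positives, plus negative singletons) scores 1.0, a complete absence of clustering approaches 0.5 from above as the sample count grows, and a perverse clustering scores below 0.5.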

The resulting AUL measure bears some similarity to the Gini coefficient of inequality [105], except that we plot the corresponding curve in the opposite direction (from “richest” to “poorest”) and do not subtract 0.5 from the resulting score.

[103] We take the expected number of positives, rather than the actual number seen so far, as the vertical scale in order to penalize non-homogeneous clusters. Otherwise the y = 1.0 ceiling would be reached early in the process even in very bad cases, for example when there is only one giant non-homogeneous cluster.

References

[104] Whissell, J. S., & Clarke, C. L. (2011, September). Clustering for semi-supervised spam filtering. In Proceedings of the 8th Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (pp. 125-134). ACM.
[105] Wikipedia entry for Gini coefficient of inequality
class clustering_metrics.ranking.LiftCurve(score_groups)[source]

Bases: object

Lift Curve for cluster-size correlated classification

aul_score(threshold=1, plot=False)[source]

Calculate AUL score

Parameters:

threshold : int, optional (default=1)

only predicted scores above this number are considered accurate

plot : bool, optional (default=False)

whether to return X and Y data series for plotting

classmethod from_clusters(clusters, is_class_pos=<function num2bool>)[source]

Instantiates class from clusters of class-coded points

Parameters:

clusters : collections.Iterable

List of lists of class labels

is_class_pos: label_true -> Bool

Boolean predicate used to binarize true (class) labels

classmethod from_counts(counts_true, counts_pred)[source]

Instantiates class from arrays of true and predicted counts

Parameters:

counts_true : array, shape = [n_clusters]

Count of positives in cluster

counts_pred : array, shape = [n_clusters]

Predicted number of positives in each cluster

classmethod from_labels(labels_true, labels_pred, is_class_pos=<function num2bool>)[source]

Instantiates class from arrays of classes and cluster sizes

Parameters:

labels_true : array, shape = [n_samples]

Class labels. If binary, ‘is_class_pos’ is optional

labels_pred : array, shape = [n_samples]

Cluster labels to evaluate

is_class_pos: label_true -> Bool

Boolean predicate used to binarize true (class) labels

plot(threshold=1, fill=True, marker=None, save_to=None)[source]

Create a graphical representation of Lift Curve

Requires Matplotlib

Parameters:

threshold : int, optional (default=1)

only predicted scores above this number are considered accurate

marker : str, optional (default=None)

Marker style to draw at each bend; if None, no markers are drawn

save_to : str, optional (default=None)

If specified, save the plot to path instead of displaying

class clustering_metrics.ranking.RocCurve(fprs, tprs, thresholds=None, pos_label=None, sample_weight=None)[source]

Bases: object

Receiver Operating Characteristic (ROC)

>>> c = RocCurve.from_labels([0, 0, 1, 1],
...                          [0.1, 0.4, 0.35, 0.8])
>>> c.auc_score()
0.75
>>> c.max_informedness()
0.5
auc_score()[source]

Replacement for Scikit-Learn’s method

If the number of Y classes is other than two, a warning is triggered but no exception is thrown (the return value is NaN). Also, arrays are not reordered during the ROC calculation, since they are assumed to be already in order.
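As a sketch of what such an AUC computation amounts to, assuming the curve points are already ordered by increasing false positive rate (as noted above), the trapezoidal rule suffices. This is an illustration, not the module’s code:

```python
def auc_sketch(fprs, tprs):
    # Trapezoidal area under a ROC curve whose points are already
    # ordered by increasing FPR (no reordering is performed).
    points = list(zip(fprs, tprs))
    return sum((f1 - f0) * (t0 + t1) / 2.0
               for (f0, t0), (f1, t1) in zip(points, points[1:]))
```

For example, `auc_sketch([0.0, 0.5, 1.0], [0.0, 1.0, 1.0])` returns 0.75.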

classmethod from_clusters(clusters, is_class_pos=<function num2bool>)[source]

Instantiates class from clusters of class-coded points

Parameters:

clusters : collections.Iterable

List of lists of class labels

is_class_pos: label_true -> Bool

Boolean predicate used to binarize true (class) labels

classmethod from_labels(labels_true, y_score, is_class_pos=<function num2bool>)[source]

Instantiate assuming binary labeling of {0, 1}

Parameters:

labels_true : array, shape = [n_samples]

Class labels. If binary, ‘is_class_pos’ is optional

y_score : array, shape = [n_samples]

Predicted scores

is_class_pos: label_true -> Bool

Boolean predicate used to binarize true (class) labels
classmethod from_scores(scores_neg, scores_pos)[source]

Instantiate given scores of two ground truth classes

The score arrays don’t have to be the same length.

max_informedness()[source]

Maximum value of Informedness (TPR minus FPR) on a ROC curve

A diagram of what this measure looks like is shown in [101]. Note the correspondence between the definition of this measure and that of the Kolmogorov-Smirnov supremum statistic.
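Since informedness at a single ROC point is simply TPR minus FPR, taking the maximum over the curve can be sketched as follows (an illustration, not the module’s code):

```python
def max_informedness_sketch(fprs, tprs):
    # Youden's J statistic at each ROC point is TPR - FPR;
    # the maximum over all points is the supremum referred to above.
    return max(tpr - fpr for fpr, tpr in zip(fprs, tprs))
```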

References

[101] Wikipedia entry for Youden’s J statistic
optimal_cutoff(scoring_method)[source]

Optimal cutoff point on ROC curve under scoring method

The scoring method must take two arguments: fpr and tpr.
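A sketch of such a selection, under the assumption that the curve stores a threshold for each (fpr, tpr) point; the function and parameter names here are hypothetical:

```python
def optimal_cutoff_sketch(fprs, tprs, thresholds, scoring_method):
    # Return the threshold at the ROC point that maximizes the score.
    # scoring_method takes (fpr, tpr), e.g. informedness = tpr - fpr.
    best = max(range(len(thresholds)),
               key=lambda i: scoring_method(fprs[i], tprs[i]))
    return thresholds[best]
```

With `scoring_method=lambda fpr, tpr: tpr - fpr`, this picks the point of maximum informedness.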

plot(fill=True, marker=None, save_to=None)[source]

Plot the ROC curve

clustering_metrics.ranking.aul_score_from_clusters(clusters)[source]

Calculate AUL score given clusters of class-coded points

Parameters:

clusters : collections.Iterable

List of clusters where each point is binary-coded according to true class.

Returns:

aul : float

clustering_metrics.ranking.aul_score_from_labels(y_true, labels_pred)[source]

AUL score given array of classes and array of cluster sizes

Parameters:

y_true : array, shape = [n_samples]

True binary labels in range {0, 1}

labels_pred : array, shape = [n_samples]

Cluster labels to evaluate

Returns:

aul : float

clustering_metrics.ranking.dist_auc(scores0, scores1)[source]

AUC score for two distributions, with NaN correction

Note: the arithmetic mean appears to be appropriate here, as other means do not produce a total of 1.0 when the sides are switched.
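The underlying two-distribution AUC is equivalent to the probability that a score drawn from one distribution exceeds a score drawn from the other (a Mann-Whitney-style statistic). A minimal sketch, leaving out the NaN correction this function performs:

```python
def dist_auc_sketch(scores_neg, scores_pos):
    # P(pos > neg) over all cross-distribution pairs,
    # counting ties as one half.
    pairs = [(n, p) for n in scores_neg for p in scores_pos]
    wins = sum((p > n) + 0.5 * (p == n) for n, p in pairs)
    return wins / float(len(pairs))
```

Swapping the two arguments makes the two results sum to 1.0, which is the property the note above refers to.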

clustering_metrics.ranking.num2bool(num)[source]

True if the number is positive, False if it is zero or negative

When binarizing class labels, this lets us be consistent with Scikit-Learn where binary labels can be {0, 1} with 0 being negative or {-1, 1} with -1 being negative.
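Under the convention just described, the predicate amounts to a simple sign check. A sketch consistent with that convention (not necessarily the exact implementation):

```python
def num2bool_sketch(num):
    # 1 is the positive label in both {0, 1} and {-1, 1} encodings;
    # 0 and -1 are the negative labels.
    return num > 0
```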

clustering_metrics.ranking.roc_auc_score(y_true, y_score, sample_weight=None)[source]

AUC score for a ROC curve

Replaces the Scikit-Learn implementation (given binary y_true).