clustering_metrics.ranking module¶
Motivation¶
Assume a data set of mostly unique samples in which a hidden binary variable depends on the number of similar samples present in the set (i.e. a sample is called positive if it has many neighbors), and assume that our goal is to label all samples in this set. Given sparse enough data, a clustering method that relies on the same sample property on which the ground-truth similarity space is defined will naturally separate the samples into two groups – those found in clusters, containing mostly positives, and those found outside clusters, containing mostly negatives. Only one perfect clustering would exist: one with a single, entirely homogeneous cluster C that covers all positives present in the data set. If we were to produce such a clustering, we could correctly label all positive samples in one step with the simple rule, “all positive samples belong to cluster C.” Under an imperfect clustering, however, the presence of a given sample in a cluster of size two or more implies only that the sample is somewhat more likely to be positive, with the confidence of the positive call increasing monotonically with the size of the cluster. In other words, our expectation of a good clustering is that it will help us minimize the amount of work spent labeling samples.
The idea for this metric originated while mining for positive spam examples in large data sets of short user-generated content. Given large enough data sets, spam content naturally forms clusters, either because creative rewriting of every individual spam message is too expensive for spammers to employ, or because, even when human or algorithmic rewriting is applied, one can still find features that link individual spam messages to their creator or to the product or service being promoted in the spam campaign. This finding is consistent with what is reported in the literature [104].
Algorithm¶
Given a clustering, we order the clusters from the largest to the smallest. We then plot a cumulative step function where the width of the bin under a given “step” is proportional to cluster size, and the height of the bin is proportional to the expected number of positive samples seen so far [103]. If a sample is in a cluster of size one, we assume it is likely to be negative and is therefore checked on an individual basis (the cluster size at which this expectation changes is our ‘threshold’ parameter). As a result of this assumption, the expected contribution from unclustered samples is equal to their actual contribution (we assume individual checking always gives a correct answer). After two-way normalization, a perfect clustering (i.e. one where a single perfectly homogeneous cluster covers the entire set of positives) will have an AUL score of 1.0, while a complete failure to cluster will result in an AUL of 0.5. A perverse clustering – i.e. one where many negative samples fall into clusters whose size is above our threshold, or where many positive samples remain unclustered (fall into clusters whose size is at or below the threshold) – will yield an AUL between 0.0 and 0.5.
A special treatment is necessary for cases where clusters are tied by size. If one were to treat tied clusters as a single group, one would obtain AUL of 1.0 when no clusters at all are present, which is against our desiderata. On the other hand, if one were to treat tied clusters entirely separately, one would obtain different results depending on the properties of the sorting algorithm, also an undesirable situation. Always placing “heavy” clusters (i.e. those containing more positives) towards the beginning or towards the end of the tied group will result in, respectively, overestimating or underestimating the true AUL. The solution here is to average the positive counts among all clusters in a tied group, and then walk through them one by one, with the stepwise cumulative function asymptotically approaching a diagonal from the group’s bottom left corner to the top right one. This way, a complete absence of clustering (i.e. all clusters are of size one) will always result in AUL of 0.5.
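To make the construction concrete, here is a minimal Python sketch of the AUL computation as described above. This is an illustrative reading, not the library’s implementation: the choice of “predicted score = cluster size, zeroed at or below the threshold,” the uniform per-sample rise within each bin, and the tie-averaging rule are all assumptions inferred from the description.

```python
from itertools import groupby


def aul_sketch(clusters, threshold=1):
    """Illustrative AUL over clusters of binary labels (1 = positive)."""
    # A cluster's predicted score is its size; clusters at or below the
    # threshold are treated as unclustered (score 0, checked individually).
    scored = [(len(c) if len(c) > threshold else 0, len(c), sum(c))
              for c in clusters]
    total_n = sum(n for _, n, _ in scored)
    total_p = sum(p for _, _, p in scored)
    if not total_n or not total_p:
        return float("nan")
    scored.sort(key=lambda t: -t[0])
    # Average positive counts within each group of score-tied clusters so the
    # result does not depend on sort stability; the steps through a tied group
    # then track the group's bottom-left-to-top-right diagonal.
    ordered = []
    for _, tied in groupby(scored, key=lambda t: t[0]):
        tied = list(tied)
        avg = sum(p for _, _, p in tied) / len(tied)
        ordered.extend((n, avg) for _, n, _ in tied)
    # Walk the bins from the highest-scored cluster down; each bin is n wide
    # and its expected positive count rises uniformly, so it contributes a
    # trapezoid of area n * (height + pos / 2).
    area, height = 0.0, 0.0
    for n, pos in ordered:
        area += n * (height + pos / 2.0)
        height += pos
    # Two-way normalization: x by total samples, y by total positives.
    return area / (total_n * total_p)
```

Under this reading, a complete absence of clustering (all singletons) lands exactly on the diagonal and scores 0.5, a single giant non-homogeneous cluster also scores 0.5, and a single homogeneous cluster covering all positives approaches 1.0 as positives become sparse relative to the data set.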
The resulting AUL measure has some similarity with the Gini coefficient of inequality [105] except we plot the corresponding curve in the opposite direction (from “richest” to “poorest”), and do not subtract 0.5 from the resulting score.
[103] We take the expected number of positives, and not the actual number seen so far, as the vertical scale in order to penalize non-homogeneous clusters. Otherwise the y=1.0 ceiling would be reached early in the process even in very bad cases, for example when there is only one giant non-homogeneous cluster.
References¶
[105] Wikipedia entry for Gini coefficient of inequality
class clustering_metrics.ranking.LiftCurve(score_groups)¶
Bases: object

Lift Curve for cluster-size correlated classification
aul_score(threshold=1, plot=False)¶
Calculate AUL score

Parameters:
threshold : int, optional (default=1)
    Only predicted scores above this number are considered accurate
plot : bool, optional (default=False)
    Whether to return X and Y data series for plotting
classmethod from_clusters(clusters, is_class_pos=<function num2bool>)¶
Instantiates class from clusters of class-coded points

Parameters:
clusters : collections.Iterable
    List of lists of class labels
is_class_pos : label_true -> Bool
    Boolean predicate used to binarize true (class) labels
classmethod from_counts(counts_true, counts_pred)¶
Instantiates class from arrays of true and predicted counts

Parameters:
counts_true : array, shape = [n_clusters]
    Count of positives in each cluster
counts_pred : array, shape = [n_clusters]
    Predicted number of positives in each cluster
classmethod from_labels(labels_true, labels_pred, is_class_pos=<function num2bool>)¶
Instantiates class from arrays of classes and cluster sizes

Parameters:
labels_true : array, shape = [n_samples]
    Class labels. If binary, ‘is_class_pos’ is optional
labels_pred : array, shape = [n_samples]
    Cluster labels to evaluate
is_class_pos : label_true -> Bool
    Boolean predicate used to binarize true (class) labels
plot(threshold=1, fill=True, marker=None, save_to=None)¶
Create a graphical representation of Lift Curve

Requires Matplotlib

Parameters:
threshold : int, optional (default=1)
    Only predicted scores above this number are considered accurate
marker : str, optional (default=None)
    Marker to draw at each bend, if any
save_to : str, optional (default=None)
    If specified, save the plot to this path instead of displaying it
class clustering_metrics.ranking.RocCurve(fprs, tprs, thresholds=None, pos_label=None, sample_weight=None)¶
Bases: object

Receiver Operating Characteristic (ROC)

>>> c = RocCurve.from_labels([0, 0, 1, 1],
...                          [0.1, 0.4, 0.35, 0.8])
>>> c.auc_score()
0.75
>>> c.max_informedness()
0.5
auc_score()¶
Replacement for Scikit-Learn’s method

If the number of Y classes is other than two, a warning will be triggered but no exception thrown (the return value will be NaN). Also, we don’t reorder arrays during ROC calculation since they are assumed to be in order.
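The AUC score is equivalent to the probability that a randomly chosen positive sample outranks a randomly chosen negative one. A self-contained sketch of that equivalence, mirroring the NaN-on-non-binary behavior described above, is shown below; the function name `roc_auc` and the assumption of {0, 1} labels are ours, and this is not the method’s actual implementation, which works from the stored FPR/TPR arrays.

```python
import math


def roc_auc(y_true, y_score):
    """AUC via the Mann-Whitney statistic: the fraction of (positive,
    negative) pairs in which the positive sample receives the higher
    score, with ties counting as half. Returns NaN when labels are
    not binary, instead of raising an exception."""
    if len(set(y_true)) != 2:
        return float("nan")  # mirror the warn-and-return-NaN behavior
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y != 1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For the example in the class docstring, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.75, matching `c.auc_score()`.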
classmethod from_clusters(clusters, is_class_pos=<function num2bool>)¶
Instantiates class from clusters of class-coded points

Parameters:
clusters : collections.Iterable
    List of lists of class labels
is_class_pos : label_true -> Bool
    Boolean predicate used to binarize true (class) labels
classmethod from_labels(labels_true, y_score, is_class_pos=<function num2bool>)¶
Instantiate assuming binary labeling of {0, 1}

Parameters:
labels_true : array, shape = [n_samples]
    Class labels. If binary, ‘is_class_pos’ is optional
y_score : array, shape = [n_samples]
    Predicted scores
is_class_pos : label_true -> Bool
    Boolean predicate used to binarize true (class) labels
classmethod from_scores(scores_neg, scores_pos)¶
Instantiate given scores of two ground truth classes

The score arrays don’t have to be the same length.
max_informedness()¶
Maximum value of Informedness (TPR minus FPR) on a ROC curve

A diagram of what this measure looks like is shown in [101]. Note the correspondence between the definition of this measure and that of the Kolmogorov-Smirnov supremum statistic.
References

[101] Wikipedia entry for Youden’s J statistic
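The definition above can be sketched directly from labels and scores; this is an illustrative stand-alone function (the name, signature, and threshold sweep are ours, not the bound method’s implementation):

```python
def max_informedness(y_true, y_score):
    """Youden's J statistic: the maximum of TPR - FPR over all score
    thresholds. Equivalently, the Kolmogorov-Smirnov distance between
    the score distributions of the two classes."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    best = 0.0
    # Sweep thresholds over the observed scores, highest first.
    for t in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, y_score) if y != 1 and s >= t)
        best = max(best, tp / n_pos - fp / n_neg)
    return best
```

For the example in the class docstring, `max_informedness([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.5, matching `c.max_informedness()`.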
clustering_metrics.ranking.aul_score_from_clusters(clusters)¶
Calculate AUL score given clusters of class-coded points

Parameters:
clusters : collections.Iterable
    List of clusters where each point is binary-coded according to its true class

Returns:
aul : float
clustering_metrics.ranking.aul_score_from_labels(y_true, labels_pred)¶
AUL score given an array of classes and an array of cluster sizes

Parameters:
y_true : array, shape = [n_samples]
    True binary labels in range {0, 1}
labels_pred : array, shape = [n_samples]
    Cluster labels to evaluate

Returns:
aul : float
clustering_metrics.ranking.dist_auc(scores0, scores1)¶
AUC score for two distributions, with NaN correction

Note: the arithmetic mean appears to be appropriate here, as other means do not sum to 1.0 when the sides are switched.