clscurves package¶
Subpackages¶
Submodules¶
clscurves.config module¶
class clscurves.config.MetricsAliases¶
Bases: object

Metrics key aliases.
A class to provide human-readable labels for each key in a metrics_dict that we might want to use when coloring a classification curve plot.
cbar_dict= {'f1': 'F1 Score', 'fn': 'False Negative (FN) Count', 'fn_w': 'Weighted False Negative (FN) Sum', 'fp': 'False Positive (FP) Count', 'fp_w': 'Weighted False Positive (FP) Sum', 'fpr': 'FPR = FP/(FP + TN)', 'fpr_w': 'Weighted FPR = FP/(FP + TN)', 'frac': 'Fraction Flagged', 'frac_w': 'Weighted Fraction Flagged', 'precision': 'Precision = TP/(TP + FP)', 'recall': 'Recall = TP/(TP + FN)', 'thresh': 'Score Threshold Value', 'tn': 'True Negative (TN) Count', 'tn_w': 'Weighted True Negative (TN) Sum', 'tp': 'True Positive (TP) Count', 'tp_w': 'Weighted True Positive (TP) Sum', 'tpr': 'Recall = TP/(TP + FN)', 'tpr_w': 'Weighted Recall = TP/(TP + FN)'}¶
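For example, these aliases can be used to look up a human-readable colorbar label for a metrics key; a minimal sketch, assuming the class attribute is accessed directly:

>>> from clscurves.config import MetricsAliases
>>> MetricsAliases.cbar_dict["fpr"]
'FPR = FP/(FP + TN)'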
clscurves.covariance module¶

class clscurves.covariance.CovarianceEllipseGenerator(data: numpy.ndarray)¶
Bases: object

A class to generate a stylized covariance ellipse.
Given a collection of 2D points that are assumed to be distributed according to a bivariate normal distribution, compute and plot an elliptical confidence region representing the distribution of the points.
- Parameters
- data
(2, M)-dim numpy array.
Examples
>>> data = ...
>>> ax = ...
>>> ceg = CovarianceEllipseGenerator(data)
>>> ceg.create_ellipse_patch(conf=0.95, ax=ax)
>>> ceg.add_ellipse_center(ax)
Methods

add_ellipse_center(ax) - Add the covariance ellipse center to an existing plot.
compute_cov_ellipse([conf]) - Compute covariance ellipse geometry.
create_ellipse_patch([conf, color, alpha, ax]) - Create covariance ellipse Matplotlib patch.
__init__(data: numpy.ndarray)¶ Initialize self. See help(type(self)) for accurate signature.
add_ellipse_center(ax: matplotlib.pyplot.axes)¶ Add the covariance ellipse center to an existing plot.
Given an input Matplotlib axis object, add an opaque white dot at the center of the computed confidence ellipse.
- Parameters
- ax
Matplotlib axis object.
compute_cov_ellipse(conf: float = 0.95) → Dict[str, float]¶ Compute covariance ellipse geometry.
Given a collection of 2D points, compute an elliptical confidence region representing the distribution of the points. Find the eigendecomposition of the covariance matrix of the data. The eigenvectors point in the directions of the ellipse's axes, and the eigenvalues specify the variance of the distribution along each principal direction. The 95% confidence region in 2D spans about 2.45 standard deviations in each direction, so the width of a 95% confidence ellipse along a principal direction is 4.9 * sqrt(variance) in that direction (see the illustrative sketch after the Returns block below).
- Parameters
- conf
Confidence level.
- Returns
- dict
- Dictionary of data describing the resulting confidence ellipse: {
  "x_center": horizontal value of ellipse center,
  "y_center": vertical value of ellipse center,
  "width": diameter of ellipse in first principal direction,
  "height": diameter of ellipse in second principal direction,
  "angle": counterclockwise rotation angle of ellipse from horizontal (in degrees)
  }
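The geometry above can be re-derived directly from the eigendecomposition. The following is a minimal sketch under stated assumptions (it uses scipy for the chi-squared quantile and is not the package's own implementation):

import numpy as np
from scipy import stats

def cov_ellipse_sketch(data: np.ndarray, conf: float = 0.95) -> dict:
    # Illustrative re-derivation of the ellipse geometry (assumed, not the library code).
    # data is a (2, M) array of points.
    x_center, y_center = data.mean(axis=1)
    cov = np.cov(data)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    # Number of standard deviations spanned by the confidence region in 2D;
    # for conf=0.95 this is ~2.45, so the full width is ~4.9 * sqrt(variance).
    n_std = np.sqrt(stats.chi2.ppf(conf, df=2))
    width, height = 2 * n_std * np.sqrt(eigvals[::-1])  # largest eigenvalue first
    # Rotation of the first principal axis from horizontal, in degrees.
    angle = np.degrees(np.arctan2(eigvecs[1, -1], eigvecs[0, -1]))
    return {"x_center": x_center, "y_center": y_center,
            "width": width, "height": height, "angle": angle}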
create_ellipse_patch(conf: float = 0.95, color: str = 'black', alpha: float = 0.2, ax: Optional[matplotlib.pyplot.axes] = None) → matplotlib.patches.Ellipse¶ Create covariance ellipse Matplotlib patch.
Create a Matplotlib ellipse patch for a specified confidence level. Add resulting patch to ax if supplied.
- Parameters
- conf
Confidence level.
- color
Color of ellipse fill.
- alpha
Opacity of ellipse fill.
- ax
Matplotlib axis object.
- Returns
- patches.Ellipse
Matplotlib ellipse patch.
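Putting the pieces together, a typical call sequence might look like the following; the random data and figure setup are illustrative assumptions:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from clscurves.covariance import CovarianceEllipseGenerator
>>> rng = np.random.default_rng(0)
>>> data = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=500).T
>>> fig, ax = plt.subplots()
>>> ax.scatter(data[0], data[1], s=5, alpha=0.4)
>>> ceg = CovarianceEllipseGenerator(data)
>>> ceg.create_ellipse_patch(conf=0.95, color="black", alpha=0.2, ax=ax)
>>> ceg.add_ellipse_center(ax)
>>> plt.show()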
clscurves.generator module¶
class clscurves.generator.MetricsGenerator(predictions_df: Optional[pandas.core.frame.DataFrame] = None, max_num_examples: int = 100000, label_column: str = 'label', score_column: str = 'probability', weight_column: Optional[str] = None, score_is_probability: bool = True, reverse_thresh: bool = False, num_bootstrap_samples: int = 0, imbalance_multiplier: float = 1, null_prob_column: Optional[str] = None, null_fill_method: Optional[Literal['0', '1', 'imb', 'prob']] = None, seed: Optional[int] = None)¶
Bases: clscurves.plotter.roc.ROCPlotter, clscurves.plotter.pr.PRPlotter, clscurves.plotter.prg.PRGPlotter, clscurves.plotter.rf.RFPlotter, clscurves.plotter.cost.CostPlotter, clscurves.plotter.dist.DistPlotter, clscurves.config.MetricsAliases

A class to generate classification curve metrics.
A class for computing Precision/Recall/Fraction metrics across a binary classification algorithm's full range of discrimination thresholds, and plotting those metrics as ROC (Receiver Operating Characteristic), PR (Precision & Recall), or RF (Recall & Fraction) plots. The input data format for this class is a pandas DataFrame with at least a column of labels and a column of scores (plus an optional column of label weights).
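To make the threshold sweep concrete, the core computation amounts to something like the simplified sketch below. The function name and column handling are assumptions, and the actual class additionally supports weights, bootstrapping, class-imbalance adjustment, and reversed thresholds:

import numpy as np
import pandas as pd

def sweep_thresholds_sketch(labels: np.ndarray, scores: np.ndarray, n_thresh: int = 100) -> pd.DataFrame:
    # Compute precision, recall, and fraction flagged over a grid of score thresholds.
    thresholds = np.linspace(scores.min(), scores.max(), n_thresh)
    rows = []
    for t in thresholds:
        flagged = scores >= t  # standard (non-reversed) threshold behavior
        tp = np.sum(flagged & (labels == 1))
        fp = np.sum(flagged & (labels == 0))
        fn = np.sum(~flagged & (labels == 1))
        rows.append({
            "thresh": t,
            "recall": tp / max(tp + fn, 1),
            "precision": tp / max(tp + fp, 1),
            "frac": flagged.mean(),
        })
    return pd.DataFrame(rows)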
Methods
compute_all_metrics(predictions_df[, …]) - Compute all metrics.
compute_metrics(predictions_df[, …]) - Compute metrics for a single bootstrap sample.
plot_cost([title, cmap, log_scale, x_col, …]) - Plot the “Misclassification Cost” curve.
plot_dist([weighted, label, kind, …]) - Plot the data distribution.
plot_pr([weighted, title, cmap, color_by, …]) - Plot the PR (Precision & Recall) curve.
plot_prg([title, cmap, color_by, cbar_rng, …]) - Plot the PRG (Precision-Recall-Gain) curve.
plot_rf([weighted, scale, title, cmap, …]) - Plot the RF (Recall & Fraction Flagged) curve.
plot_roc([weighted, title, cmap, color_by, …]) - Plot the ROC (Receiver Operating Characteristic) curve.
compute_cost
plot_cdf
plot_pdf
__init__(predictions_df: Optional[pandas.core.frame.DataFrame] = None, max_num_examples: int = 100000, label_column: str = 'label', score_column: str = 'probability', weight_column: Optional[str] = None, score_is_probability: bool = True, reverse_thresh: bool = False, num_bootstrap_samples: int = 0, imbalance_multiplier: float = 1, null_prob_column: Optional[str] = None, null_fill_method: Optional[Literal['0', '1', 'imb', 'prob']] = None, seed: Optional[int] = None) → None¶ Instantiating this class computes all the metrics.
- Parameters
- predictions_df : pd.DataFrame
Input DataFrame which must contain a column of labels (integer column of 1s and 0s) and a column of scores (either a dense vector column with two elements [prob_0, prob_1] or a real-valued column of scores).
- max_num_examples : int
Max number of rows to sample to prevent numpy memory limits from being exceeded.
- label_column : str
Name of the column containing the example labels, which must all be either 0 or 1.
- score_column : str
Name of the column containing the example scores, which rank model predictions. Even though binary classification models typically output probabilities, these scores need not be bounded by 0 and 1; they can be any real value. This column can be either a real-valued numeric type or a 2-element vector of probabilities, with the first element being the probability that the example is of class 0 and the second the probability that it is of class 1.
- weight_column : Optional[str]
Name of the column containing label weights associated with each example. These weights are useful when the cost of misclassifying an example varies from example to example; in fraud detection, for instance, getting high-dollar-value cases wrong is more costly than getting low-dollar-value cases wrong, so a good measure of recall is "How much money did we catch?", not "How many cases did we catch?". If no column name is specified, all weights will be set to 1.
- score_is_probability : bool
Specifies whether the values in the score column are bounded by 0 and 1. This controls how the threshold range is determined. If true, the threshold range will sweep from 0 to 1. If false, it will sweep from the minimum to maximum score value.
- num_bootstrap_samples : int
Number of bootstrap samples to generate from the original data when computing performance metrics.
- reverse_thresh : bool
Boolean indicating whether the score threshold should be treated as a lower bound on “positive” predictions (as is standard) or instead as an upper bound. If True, the threshold behavior will be reversed from standard so that any prediction falling BELOW a score threshold will be marked as positive, with all those falling above the threshold marked as negative.
- imbalance_multiplier : float
Positive value to artificially increase the positive class example count by a multiplicative weighting factor. Use this if you're generating metrics for a data distribution with a class imbalance that doesn't represent the true distribution in the wild. For example, if you trained on a 1:1 artificially balanced data set, but you have a 10:1 class imbalance in the wild (i.e., 10 negative examples for every 1 positive example), set the imbalance_multiplier value to 10.
- null_prob_column : Optional[str]
Column containing calibrated label probabilities to use as the sampling distribution for imputing null label values. This argument is provided so that you can evaluate a possibly uncalibrated model score (specified by the score_column argument) against a separately provided calibrated label distribution. If this argument is None, then the score_column will be used as the estimated label distribution when necessary.
- null_fill_method : Optional[NullFillMethod]
- Methods to use when filling in null label values. Possible values:
  "0" - fill with 0
  "1" - fill with 1
  "imb" - fill randomly according to the class imbalance of labeled examples
  "prob" - fill randomly according to the score_column probability distribution, or the null_prob_column probability distribution if provided
If a method is provided, then after the default metrics DF is computed (without imputing any null labels), a new metrics DF will be computed for each method and stored in a curves_imputed object; a rough sketch of these fill strategies follows this parameter list. If no method is provided, only the default metrics DF will be computed.
- seed : Optional[int]
Random seed for bootstrapping.
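The null-label fill strategies referenced by null_fill_method can be sketched roughly as follows; the function and default column names are assumptions for illustration, not the library's implementation:

from typing import Optional
import numpy as np
import pandas as pd

def fill_null_labels_sketch(
    df: pd.DataFrame,
    method: str,
    label_col: str = "label",
    prob_col: str = "probability",
    rng: Optional[np.random.Generator] = None,
) -> pd.Series:
    # Illustrative null-label imputation (assumed logic).
    rng = rng if rng is not None else np.random.default_rng()
    labels = df[label_col].copy()
    null_mask = labels.isna()
    n_null = int(null_mask.sum())
    if method == "0":
        labels[null_mask] = 0
    elif method == "1":
        labels[null_mask] = 1
    elif method == "imb":
        # Sample 1s at the positive rate observed among labeled examples.
        pos_rate = labels[~null_mask].mean()
        labels[null_mask] = (rng.random(n_null) < pos_rate).astype(int)
    elif method == "prob":
        # Sample each missing label from its (calibrated) probability.
        p = df.loc[null_mask, prob_col].to_numpy()
        labels[null_mask] = (rng.random(n_null) < p).astype(int)
    return labels.astype(int)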
Examples
>>> mg = MetricsGenerator(
...     predictions_df,
...     label_column="label",
...     score_column="score",
...     weight_column="weight",
...     score_is_probability=False,
...     reverse_thresh=False,
...     num_bootstrap_samples=20,
...     seed=123,
... )
>>> mg.plot_pr(bootstrapped=True)
>>> mg.plot_roc()
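For completeness, a minimal predictions_df matching the example above could be constructed like this; the data itself is synthetic and purely illustrative:

>>> import numpy as np
>>> import pandas as pd
>>> rng = np.random.default_rng(123)
>>> n = 1000
>>> labels = rng.integers(0, 2, size=n)             # binary labels (0/1)
>>> scores = labels * 1.5 + rng.normal(size=n)      # unbounded scores that rank positives higher on average
>>> weights = rng.exponential(scale=100.0, size=n)  # e.g. dollar amounts
>>> predictions_df = pd.DataFrame({"label": labels, "score": scores, "weight": weights})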
compute_all_metrics(predictions_df: pandas.core.frame.DataFrame, return_results: bool = False) → Optional[clscurves.utils.MetricsResult]¶ Compute all metrics.
compute_metrics(predictions_df: pandas.core.frame.DataFrame, bootstrap_sample: Optional[int] = None, null_fill_method: Optional[Literal['0', '1', 'imb', 'prob']] = None, rng: numpy.random._generator.Generator = Generator(PCG64)) → clscurves.utils.MetricsResult¶ Compute metrics for a single bootstrap sample.