clscurves package

Submodules

clscurves.config module

class clscurves.config.MetricsAliases

Bases: object

Metrics key aliases.

A class to provide human-readable labels for each key in a metrics_dict that we might want to use when coloring a classification curve plot.

cbar_dict = {'f1': 'F1 Score', 'fn': 'False Negative (FN) Count', 'fn_w': 'Weighted False Negative (FN) Sum', 'fp': 'False Positive (FP) Count', 'fp_w': 'Weighted False Positive (FP) Sum', 'fpr': 'FPR = FP/(FP + TN)', 'fpr_w': 'Weighted FPR = FP/(FP + TN)', 'frac': 'Fraction Flagged', 'frac_w': 'Weighted Fraction Flagged', 'precision': 'Precision = TP/(TP + FP)', 'recall': 'Recall = TP/(TP + FN)', 'thresh': 'Score Threshold Value', 'tn': 'True Negative (TN) Count', 'tn_w': 'Weighted True Negative (TN) Sum', 'tp': 'True Positive (TP) Count', 'tp_w': 'Weighted True Positive (TP) Sum', 'tpr': 'Recall = TP/(TP + FN)', 'tpr_w': 'Weighted Recall = TP/(TP + FN)'}
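
For example, a plotting routine might look up the human-readable colorbar label for a metrics key as follows (a minimal sketch; only the documented cbar_dict attribute is used):

from clscurves.config import MetricsAliases

# Map a metrics_dict key to its human-readable label, e.g. for a colorbar title.
label = MetricsAliases.cbar_dict["tpr"]
print(label)  # "Recall = TP/(TP + FN)"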

clscurves.covariance module

class clscurves.covariance.CovarianceEllipseGenerator(data: numpy.ndarray)

Bases: object

A class to generate a stylized covariance ellipse.

Given a collection of 2D points that are assumed to be distributed according to a bivariate normal distribution, compute and plot an elliptical confidence region representing the distribution of the points.

Parameters
data

(2, M)-dim numpy array.

Examples

>>> data = ...
>>> ax = ...
>>> ceg = CovarianceEllipseGenerator(data)
>>> ceg.create_ellipse_patch(conf=0.95, ax=ax)
>>> ceg.add_ellipse_center(ax)
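
A fuller, runnable version of this example, with synthetic bivariate normal data standing in for the elided data and ax (only the constructor and the two methods documented here are used):

import numpy as np
import matplotlib.pyplot as plt
from clscurves.covariance import CovarianceEllipseGenerator

# Synthetic correlated 2D data with shape (2, M), as the constructor expects.
rng = np.random.default_rng(0)
data = rng.multivariate_normal(
    mean=[0.0, 0.0],
    cov=[[1.0, 0.6], [0.6, 2.0]],
    size=500,
).T

fig, ax = plt.subplots()
ax.scatter(data[0], data[1], s=5, alpha=0.3)

ceg = CovarianceEllipseGenerator(data)
ceg.create_ellipse_patch(conf=0.95, ax=ax)  # draw the 95% confidence ellipse
ceg.add_ellipse_center(ax)                  # mark the ellipse center
plt.show()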

Methods

add_ellipse_center(ax)

Add covariance ellipse center to existing plot.

compute_cov_ellipse([conf])

Compute covariance ellipse geometry.

create_ellipse_patch([conf, color, alpha, ax])

Create covariance ellipse Matplotlib patch.

__init__(data: numpy.ndarray)

Initialize self. See help(type(self)) for accurate signature.

add_ellipse_center(ax: matplotlib.pyplot.axes)

Add covariance ellipse center to existing plot.

Given an input Matplotlib axis object, add an opaque white dot at the center of the computed confidence ellipse.

Parameters
ax

Matplotlib axis object.

compute_cov_ellipse(conf: float = 0.95) → Dict[str, float]

Compute covariance ellipse geometry.

Given a collection of 2D points, compute an elliptical confidence region representing the distribution of the points. Find the eigendecomposition of the covariance matrix of the data. The eigenvectors point in the directions of the ellipse's axes, and the eigenvalues specify the variance of the distribution in each of those principal directions. A 95% confidence region in 2D spans about 2.45 standard deviations in each direction (2.45 is the square root of the chi-squared 95% quantile with 2 degrees of freedom), so the width of a 95% confidence ellipse along a principal direction is 2 * 2.45 * sqrt(variance) = 4.9 * sqrt(variance) in that direction.

Parameters
conf

Confidence level.

Returns
dict
Dictionary describing the resulting confidence ellipse:

  • “x_center”: horizontal coordinate of the ellipse center

  • “y_center”: vertical coordinate of the ellipse center

  • “width”: diameter of the ellipse in the first principal direction

  • “height”: diameter of the ellipse in the second principal direction

  • “angle”: counterclockwise rotation angle of the ellipse from horizontal, in degrees
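
The construction above can be sketched in a few lines of NumPy. This is a standalone illustration of the same eigendecomposition-based geometry, not the package's internal code, and cov_ellipse_geometry is a hypothetical helper (note that -2 * ln(1 - conf) is exactly the chi-squared quantile with 2 degrees of freedom, which gives the 2.45 factor for conf = 0.95):

import numpy as np

def cov_ellipse_geometry(data: np.ndarray, conf: float = 0.95) -> dict:
    """Confidence-ellipse geometry for (2, M) data (illustrative sketch)."""
    cov = np.cov(data)                      # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # variances and principal directions (ascending)
    n_std = np.sqrt(-2.0 * np.log(1.0 - conf))  # ~2.45 for conf = 0.95
    return {
        "x_center": data[0].mean(),
        "y_center": data[1].mean(),
        "width": 2 * n_std * np.sqrt(eigvals[1]),   # diameter along 1st principal axis
        "height": 2 * n_std * np.sqrt(eigvals[0]),  # diameter along 2nd principal axis
        "angle": np.degrees(np.arctan2(eigvecs[1, 1], eigvecs[0, 1])),
    }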

create_ellipse_patch(conf: float = 0.95, color: str = 'black', alpha: float = 0.2, ax: Optional[matplotlib.pyplot.axes] = None) → matplotlib.patches.Ellipse

Create covariance ellipse Matplotlib patch.

Create a Matplotlib ellipse patch for a specified confidence level. Add resulting patch to ax if supplied.

Parameters
conf

Confidence level.

color

Color of ellipse fill.

alpha

Opacity of ellipse fill.

ax

Matplotlib axis object.

Returns
patches.Ellipse

Matplotlib ellipse patch.
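
For reference, the returned patch corresponds to a standard Matplotlib Ellipse built from the geometry dictionary returned by compute_cov_ellipse. The following sketch shows that mapping; geometry_to_patch is a hypothetical helper, not part of the package:

from matplotlib.patches import Ellipse

def geometry_to_patch(geom: dict, color: str = "black", alpha: float = 0.2) -> Ellipse:
    """Build an Ellipse patch from an {x_center, y_center, width, height, angle} dict."""
    return Ellipse(
        xy=(geom["x_center"], geom["y_center"]),
        width=geom["width"],
        height=geom["height"],
        angle=geom["angle"],
        facecolor=color,
        alpha=alpha,
    )

# Usage: ax.add_patch(geometry_to_patch(geom)) on an existing Matplotlib axis.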

clscurves.generator module

class clscurves.generator.MetricsGenerator(predictions_df: Optional[pandas.core.frame.DataFrame] = None, max_num_examples: int = 100000, label_column: str = 'label', score_column: str = 'probability', weight_column: Optional[str] = None, score_is_probability: bool = True, reverse_thresh: bool = False, num_bootstrap_samples: int = 0, imbalance_multiplier: float = 1, null_prob_column: Optional[str] = None, null_fill_method: Optional[Literal['0', '1', 'imb', 'prob']] = None, seed: Optional[int] = None)

Bases: clscurves.plotter.roc.ROCPlotter, clscurves.plotter.pr.PRPlotter, clscurves.plotter.prg.PRGPlotter, clscurves.plotter.rf.RFPlotter, clscurves.plotter.cost.CostPlotter, clscurves.plotter.dist.DistPlotter, clscurves.config.MetricsAliases

A class to generate classification curve metrics.

A class for computing Precision/Recall/Fraction metrics across a binary classification algorithm's full range of discrimination thresholds, and plotting those metrics as ROC (Receiver Operating Characteristic), PR (Precision & Recall), or RF (Recall & Fraction) plots. The input data format for this class is a pandas DataFrame with at least a column of labels and a column of scores (plus an optional column of label weights).

Methods

compute_all_metrics(predictions_df[, …])

Compute all metrics.

compute_metrics(predictions_df[, …])

Compute metrics for a single bootstrap sample.

plot_cost([title, cmap, log_scale, x_col, …])

Plot the “Misclassification Cost” curve.

plot_dist([weighted, label, kind, …])

Plot the data distribution.

plot_pr([weighted, title, cmap, color_by, …])

Plot the PR (Precision & Recall) curve.

plot_prg([title, cmap, color_by, cbar_rng, …])

Plot the PRG (Precision-Recall-Gain) curve.

plot_rf([weighted, scale, title, cmap, …])

Plot the RF (Recall & Fraction Flagged) curve.

plot_roc([weighted, title, cmap, color_by, …])

Plot the ROC (Receiver Operating Characteristic) curve.

compute_cost

plot_cdf

plot_pdf

__init__(predictions_df: Optional[pandas.core.frame.DataFrame] = None, max_num_examples: int = 100000, label_column: str = 'label', score_column: str = 'probability', weight_column: Optional[str] = None, score_is_probability: bool = True, reverse_thresh: bool = False, num_bootstrap_samples: int = 0, imbalance_multiplier: float = 1, null_prob_column: Optional[str] = None, null_fill_method: Optional[Literal['0', '1', 'imb', 'prob']] = None, seed: Optional[int] = None) → None

Instantiating this class computes all the metrics.

Parameters
predictions_df : pd.DataFrame

Input DataFrame which must contain a column of labels (integer column of 1s and 0s) and a column of scores (either a dense vector column with two elements [prob_0, prob_1] or a real-valued column of scores).

max_num_examples : int

Max number of rows to sample to prevent numpy memory limits from being exceeded.

label_column : str

Name of the column containing the example labels, which must all be either 0 or 1.

score_column : str

Name of the column containing the example scores which rank model predictions. Even though binary classification models typically output probabilities, these scores need not be bounded by 0 and 1; they can be any real value. This column can be either a real value numeric type or a 2-element vector of probabilities, with the first element being the probability that the example is of class 0 and the second element that the example is of class 1.

weight_column : Optional[str]

Name of the column containing label weights associated with each example. These weights are useful when the cost of classifying an example incorrectly varies from example to example; in fraud detection, for instance, getting high dollar value cases wrong is more costly than getting low dollar value cases wrong, so a good measure of recall is “How much money did we catch?”, not “How many cases did we catch?”. If no column name is specified, all weights will be set to 1.

score_is_probability : bool

Specifies whether the values in the score column are bounded by 0 and 1. This controls how the threshold range is determined. If true, the threshold range will sweep from 0 to 1. If false, it will sweep from the minimum to maximum score value.

num_bootstrap_samples : int

Number of bootstrap samples to generate from the original data when computing performance metrics.

reverse_thresh : bool

Boolean indicating whether the score threshold should be treated as a lower bound on “positive” predictions (as is standard) or instead as an upper bound. If True, the threshold behavior will be reversed from standard so that any prediction falling BELOW a score threshold will be marked as positive, with all those falling above the threshold marked as negative.

imbalance_multiplier : float

Positive value to artificially increase the positive class example count by a multiplicative weighting factor. Use this if you’re generating metrics for a data distribution with a class imbalance that doesn’t represent the true distribution in the wild. For example, if you trained on a 1:1 artificially balanced data set, but you have a 10:1 class imbalance in the wild (i.e. 10 negative examples for every 1 positive example), set the imbalance_multiplier value to 10.

null_prob_column : Optional[str]

Column containing calibrated label probabilities to use as the sampling distribution for imputing null label values. We provide this argument so that you can evaluate a possibly-uncalibrated model score (specified by the score_column argument) on a different provided calibrated label distribution. If this argument is None, then the score_column will be used as the estimated label distribution when necessary.

null_fill_method : Optional[NullFillMethod]

Method to use when filling in null label values (see the illustrative sketch after this parameter list). Possible values:

  • “0” - fill with 0

  • “1” - fill with 1

  • “imb” - fill randomly according to the class imbalance of labeled examples

  • “prob” - fill randomly according to the score_column probability distribution, or the null_prob_column probability distribution if provided

If a method is provided, then once the default metrics DF is computed without imputing any null labels, a new metrics DF will be computed for each method and stored in a curves_imputed object. If not, only the default metrics DF will be computed.

seed : Optional[int]

Random seed for bootstrapping.
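
Purely as an illustration of what each null_fill_method value means (this is not the package's implementation; fill_null_labels is a hypothetical helper, and the column names follow the defaults documented above):

import numpy as np
import pandas as pd

def fill_null_labels(
    df: pd.DataFrame,
    method: str,
    rng: np.random.Generator,
    prob_col: str = "probability",
) -> pd.Series:
    """Illustrative sketch of the documented null-label fill methods."""
    labels = df["label"].copy()
    null_mask = labels.isna()
    if method == "0":
        labels[null_mask] = 0
    elif method == "1":
        labels[null_mask] = 1
    elif method == "imb":
        # Fill randomly according to the class imbalance of the labeled examples.
        pos_rate = labels[~null_mask].mean()
        labels[null_mask] = (rng.random(null_mask.sum()) < pos_rate).astype(int)
    elif method == "prob":
        # Fill each null label by sampling from its calibrated probability.
        p = df.loc[null_mask, prob_col].to_numpy()
        labels[null_mask] = (rng.random(null_mask.sum()) < p).astype(int)
    return labels.astype(int)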

Examples

>>> mg = MetricsGenerator(
...     predictions_df,
...     label_column="label",
...     score_column="score",
...     weight_column="weight",
...     score_is_probability=False,
...     reverse_thresh=False,
...     num_bootstrap_samples=20,
...     seed=123,
... )
>>> mg.plot_pr(bootstrapped=True)
>>> mg.plot_roc()
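
As a complement, a hedged sketch of what the input predictions_df in the example above might look like (the data are synthetic; the column names match the constructor arguments used in the example):

import numpy as np
import pandas as pd

# Synthetic predictions: binary labels, real-valued scores, and per-example weights.
rng = np.random.default_rng(123)
n = 1000
labels = rng.integers(0, 2, size=n)
scores = labels * 2.0 + rng.normal(size=n)       # positives score higher on average
weights = rng.exponential(scale=100.0, size=n)   # e.g. dollar values per case

predictions_df = pd.DataFrame({"label": labels, "score": scores, "weight": weights})
# predictions_df can now be passed to MetricsGenerator as in the example above.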
compute_all_metrics(predictions_df: pandas.core.frame.DataFrame, return_results: bool = False) → Optional[clscurves.utils.MetricsResult]

Compute all metrics.

compute_metrics(predictions_df: pandas.core.frame.DataFrame, bootstrap_sample: Optional[int] = None, null_fill_method: Optional[Literal['0', '1', 'imb', 'prob']] = None, rng: numpy.random._generator.Generator = Generator(PCG64)) → clscurves.utils.MetricsResult

Compute metrics for a single bootstrap sample.

Module contents