clscurves package

Submodules

clscurves.config module

class clscurves.config.MetricsAliases

Bases: object

Metrics key aliases.

A class to provide human-readable labels for each key in a metrics_dict that we might want to use when coloring a classification curve plot.

cbar_dict = {'f1': 'F1 Score', 'fn': 'False Negative (FN) Count', 'fn_w': 'Weighted False Negative (FN) Sum', 'fp': 'False Positive (FP) Count', 'fp_w': 'Weighted False Positive (FP) Sum', 'fpr': 'FPR = FP/(FP + TN)', 'fpr_w': 'Weighted FPR = FP/(FP + TN)', 'frac': 'Fraction Flagged', 'frac_w': 'Weighted Fraction Flagged', 'precision': 'Precision = TP/(TP + FP)', 'recall': 'Recall = TP/(TP + FN)', 'thresh': 'Score Threshold Value', 'tn': 'True Negative (TN) Count', 'tn_w': 'Weighted True Negative (TN) Sum', 'tp': 'True Positive (TP) Count', 'tp_w': 'Weighted True Positive (TP) Sum', 'tpr': 'Recall = TP/(TP + FN)', 'tpr_w': 'Weighted Recall = TP/(TP + FN)'}
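
For example, a plotting routine might look up the human-readable colorbar label for a metrics key as follows (a minimal sketch; only the documented cbar_dict attribute is used):

from clscurves.config import MetricsAliases

# Map a metrics_dict key to its human-readable label, e.g. for a colorbar title.
label = MetricsAliases.cbar_dict["tpr"]
print(label)  # "Recall = TP/(TP + FN)"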

clscurves.covariance module

class clscurves.covariance.CovarianceEllipseGenerator(data: numpy.ndarray)

Bases: object

A class to generate a stylized covariance ellipse.

Given a collection of 2D points that are assumed to be distributed according to a bivariate normal distribution, compute and plot an elliptical confidence region representing the distribution of the points.

Parameters
data

(2, M)-dim numpy array.

Examples

>>> data = ...
>>> ax = ...
>>> ceg = CovarianceEllipseGenerator(data)
>>> ceg.create_ellipse_patch(conf=0.95, ax=ax)
>>> ceg.add_ellipse_center(ax)
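
A fuller, runnable version of this example, with synthetic bivariate normal data standing in for the elided data and ax (only the constructor and the two methods documented here are used):

import numpy as np
import matplotlib.pyplot as plt
from clscurves.covariance import CovarianceEllipseGenerator

# Synthetic correlated 2D data with shape (2, M), as the constructor expects.
rng = np.random.default_rng(0)
data = rng.multivariate_normal(
    mean=[0.0, 0.0],
    cov=[[1.0, 0.6], [0.6, 2.0]],
    size=500,
).T

fig, ax = plt.subplots()
ax.scatter(data[0], data[1], s=5, alpha=0.3)

ceg = CovarianceEllipseGenerator(data)
ceg.create_ellipse_patch(conf=0.95, ax=ax)  # draw the 95% confidence ellipse
ceg.add_ellipse_center(ax)                  # mark the ellipse center
plt.show()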

Methods

add_ellipse_center(ax)

Add covariance ellipse center to existing plot.

compute_cov_ellipse([conf])

Compute covariance ellipse geometry.

create_ellipse_patch([conf, color, alpha, ax])

Create covariance ellipse Matplotlib patch.

__init__(data: numpy.ndarray)

Initialize self. See help(type(self)) for accurate signature.

add_ellipse_center(ax: matplotlib.pyplot.axes)

Add covariance ellipse center to existing plot.

Given an input Matplotlib axis object, add an opaque white dot at the center of the computed confidence ellipse.

Parameters
ax

Matplotlib axis object.

compute_cov_ellipse(conf: float = 0.95) → Dict[str, float]

Compute covariance ellipse geometry.

Given a collection of 2D points, compute an elliptical confidence region representing the distribution of the points. Find the eigendecomposition of the covariance matrix of the data. The eigenvectors point in the directions of the ellipse's axes, and the eigenvalues specify the variance of the distribution in each of those principal directions. A 95% confidence region in 2D spans about 2.45 standard deviations in each direction (2.45 is the square root of the chi-squared 95% quantile with 2 degrees of freedom), so the width of a 95% confidence ellipse along a principal direction is 2 * 2.45 * sqrt(variance) = 4.9 * sqrt(variance) in that direction.

Parameters
conf

Confidence level.

Returns
dict
Dictionary describing the resulting confidence ellipse:

  • “x_center”: horizontal coordinate of the ellipse center

  • “y_center”: vertical coordinate of the ellipse center

  • “width”: diameter of the ellipse in the first principal direction

  • “height”: diameter of the ellipse in the second principal direction

  • “angle”: counterclockwise rotation angle of the ellipse from horizontal, in degrees
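
The construction above can be sketched in a few lines of NumPy. This is a standalone illustration of the same eigendecomposition-based geometry, not the package's internal code, and cov_ellipse_geometry is a hypothetical helper (note that -2 * ln(1 - conf) is exactly the chi-squared quantile with 2 degrees of freedom, which gives the 2.45 factor for conf = 0.95):

import numpy as np

def cov_ellipse_geometry(data: np.ndarray, conf: float = 0.95) -> dict:
    """Confidence-ellipse geometry for (2, M) data (illustrative sketch)."""
    cov = np.cov(data)                      # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # variances and principal directions (ascending)
    n_std = np.sqrt(-2.0 * np.log(1.0 - conf))  # ~2.45 for conf = 0.95
    return {
        "x_center": data[0].mean(),
        "y_center": data[1].mean(),
        "width": 2 * n_std * np.sqrt(eigvals[1]),   # diameter along 1st principal axis
        "height": 2 * n_std * np.sqrt(eigvals[0]),  # diameter along 2nd principal axis
        "angle": np.degrees(np.arctan2(eigvecs[1, 1], eigvecs[0, 1])),
    }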

create_ellipse_patch(conf: float = 0.95, color: str = 'black', alpha: float = 0.2, ax: Optional[matplotlib.pyplot.axes] = None) → matplotlib.patches.Ellipse

Create covariance ellipse Matplotlib patch.

Create a Matplotlib ellipse patch for a specified confidence level. Add resulting patch to ax if supplied.

Parameters
conf

Confidence level.

color

Color of ellipse fill.

alpha

Opacity of ellipse fill.

ax

Matplotlib axis object.

Returns
patches.Ellipse

Matplotlib ellipse patch.
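
For reference, the returned patch corresponds to a standard Matplotlib Ellipse built from the geometry dictionary returned by compute_cov_ellipse. The following sketch shows that mapping; geometry_to_patch is a hypothetical helper, not part of the package:

from matplotlib.patches import Ellipse

def geometry_to_patch(geom: dict, color: str = "black", alpha: float = 0.2) -> Ellipse:
    """Build an Ellipse patch from an {x_center, y_center, width, height, angle} dict."""
    return Ellipse(
        xy=(geom["x_center"], geom["y_center"]),
        width=geom["width"],
        height=geom["height"],
        angle=geom["angle"],
        facecolor=color,
        alpha=alpha,
    )

# Usage: ax.add_patch(geometry_to_patch(geom)) on an existing Matplotlib axis.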

clscurves.generator module

class clscurves.generator.MetricsGenerator(predictions_df: Optional[pandas.core.frame.DataFrame] = None, max_num_examples: int = 100000, label_column: str = 'label', score_column: str = 'probability', weight_column: Optional[str] = None, score_is_probability: bool = True, reverse_thresh: bool = False, num_bootstrap_samples: int = 0, imbalance_multiplier: float = 1, null_prob_column: Optional[str] = None, null_fill_method: Optional[Literal['0', '1', 'imb', 'prob']] = None, seed: Optional[int] = None)

Bases: clscurves.plotter.roc.ROCPlotter, clscurves.plotter.pr.PRPlotter, clscurves.plotter.prg.PRGPlotter, clscurves.plotter.rf.RFPlotter, clscurves.plotter.cost.CostPlotter, clscurves.plotter.dist.DistPlotter, clscurves.config.MetricsAliases

A class to generate classification curve metrics.

A class for computing Precision/Recall/Fraction metrics across a binary classification algorithm's full range of discrimination thresholds, and plotting those metrics as ROC (Receiver Operating Characteristic), PR (Precision & Recall), or RF (Recall & Fraction) plots. The input data format for this class is a pandas DataFrame with at least a column of labels and a column of scores (plus an optional column of label weights).

Methods

compute_all_metrics(predictions_df[, …])

Compute all metrics.

compute_metrics(predictions_df[, …])

Compute metrics for a single bootstrap sample.

plot_cost([title, cmap, log_scale, x_col, …])

Plot the “Misclassification Cost” curve.

plot_dist([weighted, label, kind, …])

Plot the data distribution.

plot_pr([weighted, title, cmap, color_by, …])

Plot the PR (Precision & Recall) curve.

plot_prg([title, cmap, color_by, cbar_rng, …])

Plot the PRG (Precision-Recall-Gain) curve.

plot_rf([weighted, scale, title, cmap, …])

Plot the RF (Recall & Fraction Flagged) curve.

plot_roc([weighted, title, cmap, color_by, …])

Plot the ROC (Receiver Operating Characteristic) curve.

compute_cost

plot_cdf

plot_pdf

__init__(predictions_df: Optional[pandas.core.frame.DataFrame] = None, max_num_examples: int = 100000, label_column: str = 'label', score_column: str = 'probability', weight_column: Optional[str] = None, score_is_probability: bool = True, reverse_thresh: bool = False, num_bootstrap_samples: int = 0, imbalance_multiplier: float = 1, null_prob_column: Optional[str] = None, null_fill_method: Optional[Literal['0', '1', 'imb', 'prob']] = None, seed: Optional[int] = None) → None

Instantiating this class computes all the metrics.

Parameters
predictions_df : pd.DataFrame

Input DataFrame which must contain a column of labels (integer column of 1s and 0s) and a column of scores (either a dense vector column with two elements [prob_0, prob_1] or a real-valued column of scores).

max_num_examples : int

Max number of rows to sample to prevent numpy memory limits from being exceeded.

label_column : str

Name of the column containing the example labels, which must all be either 0 or 1.

score_column : str

Name of the column containing the example scores which rank model predictions. Even though binary classification models typically output probabilities, these scores need not be bounded by 0 and 1; they can be any real value. This column can be either a real value numeric type or a 2-element vector of probabilities, with the first element being the probability that the example is of class 0 and the second element that the example is of class 1.

weight_column : Optional[str]

Name of the column containing label weights associated with each example. These weights are useful when the cost of classifying an example incorrectly varies from example to example; in fraud detection, for instance, getting high dollar value cases wrong is more costly than getting low dollar value cases wrong, so a good measure of recall is “How much money did we catch?”, not “How many cases did we catch?”. If no column name is specified, all weights will be set to 1.

score_is_probability : bool

Specifies whether the values in the score column are bounded by 0 and 1. This controls how the threshold range is determined. If true, the threshold range will sweep from 0 to 1. If false, it will sweep from the minimum to maximum score value.

num_bootstrap_samples : int

Number of bootstrap samples to generate from the original data when computing performance metrics.

reverse_thresh : bool

Boolean indicating whether the score threshold should be treated as a lower bound on “positive” predictions (as is standard) or instead as an upper bound. If True, the threshold behavior will be reversed from standard so that any prediction falling BELOW a score threshold will be marked as positive, with all those falling above the threshold marked as negative.

imbalance_multiplier : float

Positive value to artificially increase the positive class example count by a multiplicative weighting factor. Use this if you’re generating metrics for a data distribution with a class imbalance that doesn’t represent the true distribution in the wild. For example, if you trained on a 1:1 artificially balanced data set, but you have a 10:1 class imbalance in the wild (i.e. 10 negative examples for every 1 positive example), set the imbalance_multiplier value to 10.

null_prob_column : Optional[str]

Column containing calibrated label probabilities to use as the sampling distribution for imputing null label values. We provide this argument so that you can evaluate a possibly-uncalibrated model score (specified by the score_column argument) on a different provided calibrated label distribution. If this argument is None, then the score_column will be used as the estimated label distribution when necessary.

null_fill_method : Optional[NullFillMethod]

Method to use when filling in null label values (see the illustrative sketch after this parameter list). Possible values:

  • “0” - fill with 0

  • “1” - fill with 1

  • “imb” - fill randomly according to the class imbalance of labeled examples

  • “prob” - fill randomly according to the score_column probability distribution, or the null_prob_column probability distribution if provided

If a method is provided, then once the default metrics DF is computed without imputing any null labels, a new metrics DF will be computed for each method and stored in a curves_imputed object. If not, only the default metrics DF will be computed.

seed : Optional[int]

Random seed for bootstrapping.
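
Purely as an illustration of what each null_fill_method value means (this is not the package's implementation; fill_null_labels is a hypothetical helper, and the column names follow the defaults documented above):

import numpy as np
import pandas as pd

def fill_null_labels(
    df: pd.DataFrame,
    method: str,
    rng: np.random.Generator,
    prob_col: str = "probability",
) -> pd.Series:
    """Illustrative sketch of the documented null-label fill methods."""
    labels = df["label"].copy()
    null_mask = labels.isna()
    if method == "0":
        labels[null_mask] = 0
    elif method == "1":
        labels[null_mask] = 1
    elif method == "imb":
        # Fill randomly according to the class imbalance of the labeled examples.
        pos_rate = labels[~null_mask].mean()
        labels[null_mask] = (rng.random(null_mask.sum()) < pos_rate).astype(int)
    elif method == "prob":
        # Fill each null label by sampling from its calibrated probability.
        p = df.loc[null_mask, prob_col].to_numpy()
        labels[null_mask] = (rng.random(null_mask.sum()) < p).astype(int)
    return labels.astype(int)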

Examples

>>> mg = MetricsGenerator(
...     predictions_df,
...     label_column="label",
...     score_column="score",
...     weight_column="weight",
...     score_is_probability=False,
...     reverse_thresh=False,
...     num_bootstrap_samples=20,
...     seed=123,
... )
>>> mg.plot_pr(bootstrapped=True)
>>> mg.plot_roc()
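
As a complement, a hedged sketch of what the input predictions_df in the example above might look like (the data are synthetic; the column names match the constructor arguments used in the example):

import numpy as np
import pandas as pd

# Synthetic predictions: binary labels, real-valued scores, and per-example weights.
rng = np.random.default_rng(123)
n = 1000
labels = rng.integers(0, 2, size=n)
scores = labels * 2.0 + rng.normal(size=n)       # positives score higher on average
weights = rng.exponential(scale=100.0, size=n)   # e.g. dollar values per case

predictions_df = pd.DataFrame({"label": labels, "score": scores, "weight": weights})
# predictions_df can now be passed to MetricsGenerator as in the example above.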
compute_all_metrics(predictions_df: pandas.core.frame.DataFrame, return_results: bool = False) → Optional[clscurves.utils.MetricsResult]

Compute all metrics.

compute_metrics(predictions_df: pandas.core.frame.DataFrame, bootstrap_sample: Optional[int] = None, null_fill_method: Optional[Literal['0', '1', 'imb', 'prob']] = None, rng: numpy.random._generator.Generator = Generator(PCG64)) → clscurves.utils.MetricsResult

Compute metrics for a single bootstrap sample.

Module contents