corradjust.metrics module

class corradjust.metrics.CorrScorer(df_feature_ann, ref_feature_colls, shuffle_feature_names=False, metric='enrichment-based', min_pairs_to_score=1000, random_seed=17, verbose=True)

Bases: object

Class that holds the data structures for features and reference feature sets and computes scores from feature-feature correlations. The constructor parameters are the same as those of the CorrAdjust class constructor; see its documentation for a detailed description.

Parameters:

df_feature_ann (pandas.DataFrame with index and 2 columns): A data frame providing annotation of features.
ref_feature_colls (dict): A dictionary with information about reference feature set collections.
shuffle_feature_names (bool or list, optional, default=False): Whether (and how) to permute feature names.
metric ({"enrichment-based", "BP@K"}, optional, default="enrichment-based"): Metric for evaluating feature correlations.
min_pairs_to_score (int, optional, default=1000): Minimum number of total pairs containing a feature to include the feature in the enrichment score computations.
random_seed (int, optional, default=17): Random seed used by the shuffling procedure.
verbose (bool, optional, default=True): Whether to print progress messages to stderr.

Attributes:

data : dict

The main data structure holding information about features and feature sets. Populated by calling the compute_index method. Keys of data are names of reference feature set collections, and values are dicts with the following structure:

{

"sign" : {"positive", "negative", "absolute"}

Taken directly from the input data.

"mask" : numpy.array

Array that says whether each feature pair belongs to at least one shared reference set. Feature pairs are packed into a flat array with n_features * (n_features - 1) // 2 elements (see numpy.triu_indices). There are 3 possible values:

mask == 1: features belong to the same reference set.
mask == 0: features do not belong to the same reference set, though they form an allowed pair type and both are present in some reference sets.
mask == -1: otherwise.

"train_val_mask" : numpy.array

Array that says whether each feature pair belongs to training or validation set. Same order of pairs as in mask. There are 3 possible values:

train_val_mask == 0: pair belongs to training set.
train_val_mask == 1: pair belongs to validation set.
train_val_mask == -1: all other pairs.

"features_total_pos" : numpy.ndarray

2D array of shape (3, n_features). The first index refers to feature pair type: 0 is training, 1 is validation, and 2 is all. The second index stands for feature index, and the array value says how many reference pairs the feature has. These are n_j in the paper’s terminology.

"features_total_neg" : numpy.ndarray

Same as "features_total_pos" but for non-reference pairs. These are N_j - n_j in the paper’s terminology.

"scoring_features_mask" : numpy.array

Binary array with n_features elements that says whether each feature should participate in the computation of enrichment scores. The value is determined by the min_pairs_to_score parameter.

"high_corr_frac" : float

Taken directly from the input data.

"high_corr_pairs_all" : int

Absolute number of highly ranked feature pairs (all pairs).

"high_corr_pairs_training" : int

Absolute number of highly ranked feature pairs (training pairs).

"high_corr_pairs_validation" : int

Absolute number of highly ranked feature pairs (validation pairs).

}
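The flat pair layout shared by "mask" and "train_val_mask" can be illustrated with a minimal sketch. It assumes that pairs are ordered as returned by numpy.triu_indices with k=1 (the strict upper triangle), which matches the stated length n_features * (n_features - 1) // 2; the mask values below are hypothetical and only demonstrate the three documented codes.

```python
import numpy as np

n_features = 4
# Row/column indices of the strict upper triangle (k=1 excludes the diagonal).
idx1, idx2 = np.triu_indices(n_features, k=1)

n_pairs = n_features * (n_features - 1) // 2
assert len(idx1) == n_pairs  # 6 pairs for 4 features

# Hypothetical mask for the 6 pairs, using the three documented values.
mask = np.array([1, 0, -1, 1, -1, 0])

# Pairs that share at least one reference set (mask == 1).
ref_pairs = list(zip(idx1[mask == 1].tolist(), idx2[mask == 1].tolist()))
# Pairs that take part in scoring at all (mask != -1).
n_eligible = int((mask != -1).sum())
```

Position p in any flat array therefore refers to the feature pair (idx1[p], idx2[p]).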

feature_ids, feature_names, feature_types (numpy.array): Arrays of feature IDs, names, and types. Populated by the compute_index method.
metric ({"enrichment-based", "BP@K"}): Metric for evaluating feature correlations. Populated by the constructor.
verbose (bool): Whether to print progress messages. Populated by the constructor.

compute_index(ref_feature_colls, min_pairs_to_score)

Populate the data attribute. The input arguments are the same as in the constructor.

Parameters:

ref_feature_colls (dict): A dictionary with information about reference feature set collections.
min_pairs_to_score (int): Minimum number of total pairs containing a feature to include the feature in the enrichment score computations.

compute_corr_scores(corrs, full_output=False)

Evaluate feature-feature correlations with respect to the reference feature sets.

Parameters:

corrs (numpy.array): Array of feature-feature correlations (flat format).
full_output (bool, optional, default=False): Whether to return large arrays with detailed scores (not only final scores).

Returns:

res: dict

If full_output is False, dict has the following structure:

{

ref_feature_coll : {

"score {subset}" : float

}

}

Here, subset is one of "training", "validation", or "all" (the feature pair subsets). The score is defined according to the metric attribute. If full_output is True, the following keys are added to the dict corresponding to each ref_feature_coll:

{

"BPs_at_K {subset}" : numpy.array

Balanced precisions for each feature.

"enrichments {subset}" : numpy.array

Enrichments for each feature.

"pvalues {subset}" : numpy.array

p-values for each feature.

"pvalues_adj {subset}" : numpy.array

Adjusted p-values for each feature.

"TPs_at_K {subset}" : numpy.array

True positives for each feature (k_j in the notation of the paper).

"num_pairs {subset}" : numpy.array

True positives + false positives for each feature (K_j in the notation of the paper).

"scoring_features_mask" : numpy.array

Binary array of whether each feature participates in score computation.

"corrs" : numpy.array

Feature-feature correlations.

"mask" : numpy.array

Whether each feature pair belongs to at least one shared reference set.

"train_val_mask" : numpy.array

Whether each feature pair belongs to the training or validation set.

}

The last four arrays are sorted by correlation value in a direction specified by the "sign" value of the data attribute.
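As an illustration of the return structure with full_output=False, the sketch below flattens a result dict into rows. The collection name "GO" and the score values are placeholders invented for this example, and summarize_scores is a hypothetical helper, not part of the module.

```python
# Placeholder result in the documented shape (full_output=False).
res = {
    "GO": {
        "score training": 0.12,
        "score validation": 0.10,
        "score all": 0.11,
    },
}

def summarize_scores(res):
    """Flatten the nested result into (collection, subset, score) rows."""
    rows = []
    for coll_name, coll_res in res.items():
        for subset in ("training", "validation", "all"):
            rows.append((coll_name, subset, coll_res[f"score {subset}"]))
    return rows

rows = summarize_scores(res)
```

With full_output=True the same per-collection dicts additionally carry the array-valued keys listed above, so the "score {subset}" lookups shown here should remain valid.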

static _compute_ref_feature_coll_mask(feature_idxs_in_ref_sets, ref_feature_sets_boundaries, unique_feature_idxs_in_db, allowed_feature_pair_types, feature_idx_to_type_id, n_features)

Efficiently create a flat array that answers whether each feature pair belongs to at least one shared reference feature set.

Parameters:

feature_idxs_in_ref_sets (numpy.array): Array with indices of features that belong to each reference set. Each reference set is represented by a consecutive block of feature indices.
ref_feature_sets_boundaries (numpy.array): Array with indices into feature_idxs_in_ref_sets that tell where each reference set starts and ends.
unique_feature_idxs_in_db (numpy.array): Array of feature indices that belong to at least one reference set.
allowed_feature_pair_types (numpy.array): Array of feature pair types that participate in the score computations. Since Numba cannot work with strings and tuples normally, these should be integers computed using Cantor's pairing function; see the compute_index code.
feature_idx_to_type_id (numpy.array): Array mapping feature indices to feature types. Indices of the array stand for feature indices, and values stand for feature type indices.
n_features (int): Total number of features.

Returns:

mask (numpy.array): See data attribute description.
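The integer encoding of feature pair types mentioned for allowed_feature_pair_types can be sketched with Cantor's pairing function, which maps two non-negative integers to a single unique integer. Sorting the two type ids before pairing, so that the code does not depend on their order, is an assumption of this sketch; encode_pair_type is a hypothetical helper, and the actual convention is in the compute_index code.

```python
def cantor_pair(a: int, b: int) -> int:
    """Cantor's pairing function: a unique integer for each pair of non-negative ints."""
    return (a + b) * (a + b + 1) // 2 + b

def encode_pair_type(type_id1: int, type_id2: int) -> int:
    """Order-independent code for a pair of feature type ids (assumed convention)."""
    lo, hi = sorted((type_id1, type_id2))
    return cantor_pair(lo, hi)

# Hypothetical setup: two feature types with ids 0 and 1,
# where only mixed-type pairs are allowed.
allowed = {encode_pair_type(0, 1)}

is_mixed_allowed = encode_pair_type(1, 0) in allowed
is_same_type_allowed = encode_pair_type(1, 1) in allowed
```

Plain integers like these can be stored in a numeric array and compared inside compiled code without any string or tuple handling.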

static _compute_train_val_mask(mask)

Efficiently create a flat array that answers whether each feature pair belongs to the training or validation set. The function is deterministic, as the training and validation sets are derived by taking every second element.

Parameters:

mask (numpy.array): Array returned by the _compute_ref_feature_coll_mask method.

Returns:

train_val_mask (numpy.array): See data attribute description.
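A deterministic split of this kind can be sketched as follows. The sketch assumes that pairs with mask != -1 are assigned alternately to training and validation in flat-array order; the actual method may alternate differently (for example, separately within reference and non-reference pairs). alternate_train_val is a hypothetical name.

```python
import numpy as np

def alternate_train_val(mask):
    """Assign eligible pairs (mask != -1) alternately to training (0) and validation (1)."""
    train_val = np.full(mask.shape, -1, dtype=np.int8)
    eligible = np.flatnonzero(mask != -1)
    train_val[eligible[0::2]] = 0  # every second eligible pair, starting at the first
    train_val[eligible[1::2]] = 1  # the remaining eligible pairs
    return train_val

mask = np.array([1, 0, -1, 1, -1, 0])
train_val_mask = alternate_train_val(mask)
```

Because no random numbers are involved, repeated calls on the same mask give the same split.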

static _compute_feature_totals(mask, train_val_mask, feature_idxs1, feature_idxs2, n_features, min_pairs_to_score)

Efficiently compute statistics on the number of feature pairs that involve each individual feature.

Parameters:

mask (numpy.array): Array returned by the _compute_ref_feature_coll_mask method.
train_val_mask (numpy.array): Array returned by the _compute_train_val_mask method.
feature_idxs1 (numpy.array): Indices of the first features in the flat array of pairs.
feature_idxs2 (numpy.array): Indices of the second features in the flat array of pairs.
n_features (int): Number of features.
min_pairs_to_score (int): Same as in the constructor.

Returns:

features_total_pos (numpy.ndarray): See data attribute description.
features_total_neg (numpy.ndarray): See data attribute description.
scoring_features_mask (numpy.array): See data attribute description.
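The per-feature totals can be sketched in plain NumPy with numpy.bincount. This is an illustration of the documented output shapes, not the module's implementation. In particular, deciding scoring_features_mask from the total pair count over all pairs is an assumption of this sketch, and feature_totals is a hypothetical name.

```python
import numpy as np

def feature_totals(mask, train_val_mask, idxs1, idxs2, n_features, min_pairs_to_score):
    """Count reference and non-reference pairs per feature for each pair subset."""
    pos = np.zeros((3, n_features), dtype=np.int64)
    neg = np.zeros((3, n_features), dtype=np.int64)
    # First index: 0 = training, 1 = validation, 2 = all eligible pairs.
    subsets = (train_val_mask == 0, train_val_mask == 1, mask != -1)
    for s, in_subset in enumerate(subsets):
        for totals, value in ((pos, 1), (neg, 0)):
            sel = in_subset & (mask == value)
            # Each selected pair contributes to both of its features.
            totals[s] = (np.bincount(idxs1[sel], minlength=n_features)
                         + np.bincount(idxs2[sel], minlength=n_features))
    # Assumption: a feature is scored if it has enough pairs in total.
    scoring = (pos[2] + neg[2]) >= min_pairs_to_score
    return pos, neg, scoring

idxs1, idxs2 = np.triu_indices(4, k=1)
mask = np.array([1, 0, -1, 1, -1, 0])            # hypothetical values
train_val_mask = np.array([0, 1, -1, 0, -1, 1])  # hypothetical values
pos, neg, scoring = feature_totals(mask, train_val_mask, idxs1, idxs2, 4, 2)
```

In this toy input, pos[2] holds the number of reference pairs per feature (n_j) and pos[2] + neg[2] the total number of eligible pairs per feature (N_j).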

static _compute_scores_for_ref_feature_coll(corrs, corrs_sorted_args, mask, train_val_mask, high_corr_pairs_all, high_corr_pairs_train, high_corr_pairs_val, features_total_pos, features_total_neg, scoring_features_mask, feature_idxs1, feature_idxs2, num_regularization_pairs=5)

Evaluate feature-feature correlations with respect to one reference feature set collection.

Parameters:

corrs (numpy.array): Array of feature-feature correlations (flat format).
corrs_sorted_args (numpy.array): Array of indices that sort the corrs array in the required order.
mask (numpy.array): Array returned by the _compute_ref_feature_coll_mask method.
train_val_mask (numpy.array): Array returned by the _compute_train_val_mask method.
high_corr_pairs_all (int): See data attribute description.
high_corr_pairs_train (int): See data attribute description.
high_corr_pairs_val (int): See data attribute description.
features_total_pos (numpy.ndarray): Array returned by the _compute_feature_totals method.
features_total_neg (numpy.ndarray): Array returned by the _compute_feature_totals method.
scoring_features_mask (numpy.array): Array returned by the _compute_feature_totals method.
feature_idxs1 (numpy.array): Indices of the first features in the flat array of pairs.
feature_idxs2 (numpy.array): Indices of the second features in the flat array of pairs.
num_regularization_pairs (int, optional, default=5): Number of feature pairs to add to each K_j for Bayesian regularization purposes.

Returns:

BPs_at_K (numpy.array): Balanced precisions for each feature.
enrichments (numpy.array): Enrichments for each feature.
pvalues (numpy.array): p-values for each feature.
TPs_at_K (numpy.array): True positives for each feature.
FPs_at_K (numpy.array): False positives for each feature.
corrs_filtered (numpy.array): Sorted feature-feature correlations with mask != -1.
mask_filtered (numpy.array): Mask sorted in the same way as corrs_filtered.
train_val_mask_filtered (numpy.array): Training/validation mask sorted in the same way as corrs_filtered.
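How per-feature true and false positive counts can be derived from the sorted, filtered arrays is sketched below. The sketch assumes a "positive" sign (pairs ranked by decreasing correlation) and a single global cutoff of k top-ranked eligible pairs. It omits the train/validation subsets, the regularization pairs, and the balanced precision, enrichment, and p-value computations, which are defined in the paper. tps_fps_at_k is a hypothetical name and the input values are made up.

```python
import numpy as np

def tps_fps_at_k(corrs, mask, idxs1, idxs2, n_features, k):
    """Per-feature true/false positive counts among the k top-ranked eligible pairs."""
    eligible = np.flatnonzero(mask != -1)
    # Assumption: "positive" sign, so rank eligible pairs by decreasing correlation.
    order = eligible[np.argsort(-corrs[eligible], kind="stable")]
    top = order[:k]
    is_ref = mask[top] == 1
    # A top-ranked pair counts once for each of its two features.
    tps = (np.bincount(idxs1[top][is_ref], minlength=n_features)
           + np.bincount(idxs2[top][is_ref], minlength=n_features))
    fps = (np.bincount(idxs1[top][~is_ref], minlength=n_features)
           + np.bincount(idxs2[top][~is_ref], minlength=n_features))
    return tps, fps

idxs1, idxs2 = np.triu_indices(4, k=1)
corrs = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.6])  # hypothetical correlations
mask = np.array([1, 0, -1, 1, -1, 0])             # hypothetical mask
tps, fps = tps_fps_at_k(corrs, mask, idxs1, idxs2, n_features=4, k=2)
num_pairs = tps + fps  # corresponds to K_j for each feature
```

Here tps plays the role of k_j and num_pairs the role of K_j; pairs with mask == -1 never enter the ranking, even when their correlation is high.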