corradjust.metrics module

class corradjust.metrics.CorrScorer(df_feature_ann, ref_feature_colls, shuffle_feature_names=False, metric='enrichment-based', min_pairs_to_score=1000, random_seed=17, verbose=True)

Bases: object

Class that holds data structures for features and reference feature sets, and computes different scores from feature-feature correlations. See the documentation of the CorrAdjust class constructor for a detailed description of the constructor parameters (they are the same).

Parameters:
df_feature_ann : pandas.DataFrame with index and 2 columns

A data frame providing annotation of features.

ref_feature_colls : dict

A dictionary with information about reference feature set collections.

shuffle_feature_names : bool or list, optional, default=False

Whether (and how) to permute feature names.

metric : {"enrichment-based", "BP@K"}, optional, default="enrichment-based"

Metric for evaluating feature correlations.

min_pairs_to_score : int, optional, default=1000

Minimum number of total pairs containing a feature to include the feature in the enrichment score computations.

random_seed : int, optional, default=17

Random seed used by the shuffling procedure.

verbose : bool, optional, default=True

Whether to print progress messages to stderr.

Attributes:
data : dict

The main data structure holding information about features and feature sets. Populated by the compute_index method. Keys of data are names of reference feature set collections, and values are dicts with the following structure:

{
    "sign" : {"positive", "negative", "absolute"}
        Taken directly from the input data.
    "mask" : numpy.array
        Array that says whether each feature pair belongs to at least one shared reference set. Feature pairs are packed into a flat array with n_features * (n_features - 1) // 2 elements (see numpy.triu_indices). There are 3 possible values:
            mask == 1: features belong to the same reference set.
            mask == 0: features do not belong to the same reference set, though they form an allowed pair type and both are present in some reference sets.
            mask == -1: otherwise.
    "train_val_mask" : numpy.array
        Array that says whether each feature pair belongs to the training or validation set. Same order of pairs as in mask. There are 3 possible values:
            train_val_mask == 0: pair belongs to the training set.
            train_val_mask == 1: pair belongs to the validation set.
            train_val_mask == -1: all other pairs.
    "features_total_pos" : numpy.ndarray
        2D array of shape (3, n_features). The first index refers to feature pair type: 0 is training, 1 is validation, and 2 is all. The second index stands for feature index, and the array value says how many reference pairs the feature has. These are n_j in the paper’s terminology.
    "features_total_neg" : numpy.ndarray
        Same as "features_total_pos" but for non-reference pairs. These are N_j - n_j in the paper’s terminology.
    "scoring_features_mask" : numpy.array
        Binary array with n_features elements that says whether each feature should participate in the computation of enrichment scores. The value is determined based on the min_pairs_to_score parameter.
    "high_corr_frac" : float
        Taken directly from the input data.
    "high_corr_pairs_all" : int
        Absolute number of highly ranked feature pairs (all pairs).
    "high_corr_pairs_training" : int
        Absolute number of highly ranked feature pairs (training pairs).
    "high_corr_pairs_validation" : int
        Absolute number of highly ranked feature pairs (validation pairs).
}
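
The flat pair layout used by mask and train_val_mask can be illustrated with numpy.triu_indices. The snippet below is a minimal sketch, assuming the pairs follow the order returned by np.triu_indices(n_features, k=1); the 4-feature example, the random data, and the variable names are illustrative and not taken from CorrAdjust.

```python
import numpy as np

n_features = 4  # illustrative size, not a CorrAdjust default

# Row/column indices of the strict upper triangle (diagonal excluded).
feature_idxs1, feature_idxs2 = np.triu_indices(n_features, k=1)

# The flat array has one element per unordered feature pair.
n_pairs = n_features * (n_features - 1) // 2
assert len(feature_idxs1) == n_pairs

# Pack a symmetric correlation matrix into the flat format.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, n_features))        # 10 samples x 4 features
corr_matrix = np.corrcoef(x, rowvar=False)   # shape (4, 4)
corrs_flat = corr_matrix[feature_idxs1, feature_idxs2]

# Element i of any flat array refers to the pair (feature_idxs1[i], feature_idxs2[i]).
pairs = list(zip(feature_idxs1.tolist(), feature_idxs2.tolist()))
```

Under this assumption, element i of mask, train_val_mask, and the flat correlation array all describe the same feature pair.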
feature_ids, feature_names, feature_types : np.array

Arrays of feature IDs, names, and types. Populated by the compute_index method.

metric : {"enrichment-based", "BP@K"}

Metric for evaluating feature correlations. Populated by the constructor.

verbose : bool

Whether to print progress messages. Populated by the constructor.

compute_index(ref_feature_colls, min_pairs_to_score)

Populate the data attribute. Input arguments are the same as in the constructor.

Parameters:
ref_feature_colls : dict

A dictionary with information about reference feature set collections.

min_pairs_to_score : int

Minimum number of total pairs containing a feature to include the feature in the enrichment score computations.

compute_corr_scores(corrs, full_output=False)

Evaluate feature-feature correlations with respect to the reference feature sets.

Parameters:
corrs : numpy.array

Array of feature-feature correlations (flat format).

full_output : bool, optional, default=False

Whether to return large arrays with detailed scores (not only final scores).

Returns:
res : dict

If full_output is False, dict has the following structure:

{
    ref_feature_coll : {
        "score {subset}" : float
    }
}

Here, subset is one of "training", "validation", or "all" (feature pair subsets). The score is defined according to the metric attribute. If full_output is True, the following keys are added to the dict corresponding to each ref_feature_coll:

{
    "BPs_at_K {subset}" : numpy.array
        Balanced precisions for each feature.
    "enrichments {subset}" : numpy.array
        Enrichments for each feature.
    "pvalues {subset}" : numpy.array
        p-values for each feature.
    "pvalues_adj {subset}" : numpy.array
        Adjusted p-values for each feature.
    "TPs_at_K {subset}" : numpy.array
        True positives for each feature (k_j in the notation of the paper).
    "num_pairs {subset}" : numpy.array
        True positives + false positives for each feature (K_j in the notation of the paper).
    "scoring_features_mask" : numpy.array
        Binary array of whether each feature participates in score computation.
    "corrs" : numpy.array
        Feature-feature correlations.
    "mask" : numpy.array
        Whether each feature pair belongs to at least one shared reference set.
    "train_val_mask" : numpy.array
        Whether each feature pair belongs to the training or validation set.
}

The last four arrays are sorted by correlation value in a direction specified by the "sign" value of the data attribute.
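
The per-feature counts k_j, K_j, n_j, and N_j map naturally onto an enrichment test. This page does not state which test produces the p-values, so the following is only a sketch under the assumption of a one-sided hypergeometric test; hypergeom_pvalue is a hypothetical helper, not a CorrAdjust function, and it ignores any regularization the library may apply.

```python
from math import comb

def hypergeom_pvalue(k_j, K_j, n_j, N_j):
    """P(X >= k_j) when K_j pairs are drawn without replacement from
    N_j pairs, of which n_j are reference pairs (assumed test, see lead-in)."""
    upper = min(K_j, n_j)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible configurations contribute nothing to the sum.
    favorable = sum(
        comb(n_j, i) * comb(N_j - n_j, K_j - i)
        for i in range(k_j, upper + 1)
    )
    return favorable / comb(N_j, K_j)

# Hypothetical counts: 10 pairs contain the feature, 4 are reference pairs,
# 3 are highly ranked, and 2 of those 3 are reference pairs.
p = hypergeom_pvalue(k_j=2, K_j=3, n_j=4, N_j=10)
```

A small p-value here would mean the feature's highly ranked pairs contain more reference pairs than expected by chance.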

static _compute_ref_feature_coll_mask(feature_idxs_in_ref_sets, ref_feature_sets_boundaries, unique_feature_idxs_in_db, allowed_feature_pair_types, feature_idx_to_type_id, n_features)

Efficiently create a flat array that indicates whether each feature pair belongs to at least one shared reference feature set.

Parameters:
feature_idxs_in_ref_sets : numpy.array

Array with indices of features that belong to each reference set. Each reference set is represented by a consecutive block of feature indices.

ref_feature_sets_boundaries : numpy.array

Array with indices into feature_idxs_in_ref_sets that mark where each reference set starts and ends.

unique_feature_idxs_in_db : numpy.array

Array of feature indices that belong to at least one reference set.

allowed_feature_pair_types : numpy.array

Array of feature pair types that participate in the score computations. Since Numba cannot handle strings and tuples well, these should be integers computed using Cantor’s pairing function; see the compute_index code.

feature_idx_to_type_id : numpy.array

Array mapping feature indices to feature types. Indices of the array stand for feature indices, and values stand for feature type indices.

n_features : int

Total number of features.

Returns:
mask : numpy.array

See data attribute description.
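
The integer encoding of feature pair types mentioned for allowed_feature_pair_types can be sketched with Cantor's pairing function, which maps two non-negative integers to one unique integer. The helper names below are hypothetical, and sorting the two type ids to make the code order-independent is an assumption; the actual encoding lives in the compute_index code.

```python
def cantor_pair(a, b):
    """Cantor's pairing function for non-negative integers a and b."""
    return (a + b) * (a + b + 1) // 2 + b

def encode_pair_type(type_id1, type_id2):
    """Encode an unordered pair of feature type ids as a single integer.

    Assumption: ids are sorted first so that (0, 1) and (1, 0) share a code.
    """
    lo, hi = sorted((type_id1, type_id2))
    return cantor_pair(lo, hi)

# Hypothetical type ids: 0 and 1 could stand for two feature types.
code_same_type = encode_pair_type(0, 0)
code_mixed_type = encode_pair_type(1, 0)
```

Plain integers like these can be stored in a NumPy array and compared inside Numba-compiled code without string or tuple handling.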

static _compute_train_val_mask(mask)

Efficiently create a flat array that indicates whether each feature pair belongs to the training or validation set. The function is deterministic, as the training and validation sets are derived by taking every second element.

Parameters:
mask : numpy.array

Array returned by the _compute_ref_feature_coll_mask method.

Returns:
train_val_mask : numpy.array

See data attribute description.
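
A deterministic split of this kind can be sketched as follows. This is not the library's implementation: it assumes the alternation runs over all pairs with mask != -1 in their flat order, which the description above does not spell out, and alternate_train_val is a hypothetical name.

```python
import numpy as np

def alternate_train_val(mask):
    """Assign eligible pairs (mask != -1) alternately to training (0)
    and validation (1); all other pairs get -1."""
    train_val_mask = np.full(mask.shape, -1, dtype=np.int8)
    eligible = np.flatnonzero(mask != -1)
    train_val_mask[eligible[0::2]] = 0  # every second eligible pair, starting with the first
    train_val_mask[eligible[1::2]] = 1  # the remaining eligible pairs
    return train_val_mask

# Hypothetical mask for 7 feature pairs.
example_mask = np.array([1, -1, 0, 0, 1, -1, 0])
example_split = alternate_train_val(example_mask)
```

Because no random numbers are involved, the same mask always yields the same split.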

static _compute_feature_totals(mask, train_val_mask, feature_idxs1, feature_idxs2, n_features, min_pairs_to_score)

Efficiently compute statistics on number of feature pairs that involve each individual feature.

Parameters:
mask : numpy.array

Array returned by the _compute_ref_feature_coll_mask method.

train_val_mask : numpy.array

Array returned by the _compute_train_val_mask method.

feature_idxs1 : numpy.array

Indices of the first features in the flat array of pairs.

feature_idxs2 : numpy.array

Indices of the second features in the flat array of pairs.

n_features : int

Number of features.

min_pairs_to_score : int

Same as in the constructor.

Returns:
features_total_pos : numpy.ndarray

See data attribute description.

features_total_neg : numpy.ndarray

See data attribute description.

scoring_features_mask : numpy.array

See data attribute description.
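
The per-feature counts can be illustrated with a plain NumPy sketch. It follows the (3, n_features) layout described for the data attribute (row 0 is training, row 1 is validation, row 2 is all pairs). The function name is hypothetical, and the rule used for scoring_features_mask (total pairs in the "all" row at or above min_pairs_to_score) is an assumption, not something this page confirms.

```python
import numpy as np

def feature_totals_sketch(mask, train_val_mask, feature_idxs1, feature_idxs2,
                          n_features, min_pairs_to_score):
    # Pair subsets in row order: training, validation, all eligible pairs.
    subsets = [train_val_mask == 0, train_val_mask == 1, mask != -1]
    total_pos = np.zeros((3, n_features), dtype=np.int64)
    total_neg = np.zeros((3, n_features), dtype=np.int64)
    for row, subset in enumerate(subsets):
        for totals, label in ((total_pos, 1), (total_neg, 0)):
            sel = subset & (mask == label)
            # Each selected pair counts once for each of its two features.
            totals[row] = (
                np.bincount(feature_idxs1[sel], minlength=n_features)
                + np.bincount(feature_idxs2[sel], minlength=n_features)
            )
    # Assumption: a feature is scored if it has enough pairs overall.
    scoring = (total_pos[2] + total_neg[2]) >= min_pairs_to_score
    return total_pos, total_neg, scoring

# Hypothetical 3-feature example with pairs (0, 1), (0, 2), (1, 2).
idx1, idx2 = np.triu_indices(3, k=1)
pos, neg, scoring = feature_totals_sketch(
    np.array([1, 0, -1]), np.array([0, 1, -1]), idx1, idx2,
    n_features=3, min_pairs_to_score=2,
)
```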

static _compute_scores_for_ref_feature_coll(corrs, corrs_sorted_args, mask, train_val_mask, high_corr_pairs_all, high_corr_pairs_train, high_corr_pairs_val, features_total_pos, features_total_neg, scoring_features_mask, feature_idxs1, feature_idxs2, num_regularization_pairs=5)

Evaluate feature-feature correlations with respect to one reference feature set collection.

Parameters:
corrs : numpy.array

Array of feature-feature correlations (flat format).

corrs_sorted_args : numpy.array

Array of indices that sort the corrs array in the required order.

mask : numpy.array

Array returned by the _compute_ref_feature_coll_mask method.

train_val_mask : numpy.array

Array returned by the _compute_train_val_mask method.

high_corr_pairs_all : int

See data attribute description.

high_corr_pairs_train : int

See data attribute description.

high_corr_pairs_val : int

See data attribute description.

features_total_pos : numpy.ndarray

Array returned by the _compute_feature_totals method.

features_total_neg : numpy.ndarray

Array returned by the _compute_feature_totals method.

scoring_features_mask : numpy.array

Array returned by the _compute_feature_totals method.

feature_idxs1 : numpy.array

Indices of the first features in the flat array of pairs.

feature_idxs2 : numpy.array

Indices of the second features in the flat array of pairs.

num_regularization_pairs : int, optional, default=5

Number of feature pairs to add to each K_j for Bayesian regularization purposes.

Returns:
BPs_at_K : numpy.array

Balanced precisions for each feature.

enrichments : numpy.array

Enrichments for each feature.

pvalues : numpy.array

p-values for each feature.

TPs_at_K : numpy.array

True positives for each feature.

FPs_at_K : numpy.array

False positives for each feature.

corrs_filtered : numpy.array

Sorted feature-feature correlations with mask != -1.

mask_filtered : numpy.array

Mask sorted in the same way as corrs_filtered.

train_val_mask_filtered : numpy.array

Training/validation mask sorted in the same way as corrs_filtered.
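
The relationship between corrs_sorted_args, the "sign" value, and the three filtered outputs can be sketched as below. This is an illustration under stated assumptions rather than the library's code: "positive" is taken to rank the largest correlations first, "negative" the smallest first, and "absolute" the largest magnitudes first; sort_and_filter is a hypothetical helper.

```python
import numpy as np

def sort_and_filter(corrs, mask, train_val_mask, sign):
    """Sort pairs by correlation in the direction given by sign and
    drop pairs with mask == -1."""
    if sign == "positive":
        key = -corrs            # largest correlations first (assumed)
    elif sign == "negative":
        key = corrs             # most negative correlations first (assumed)
    elif sign == "absolute":
        key = -np.abs(corrs)    # largest magnitudes first (assumed)
    else:
        raise ValueError(f"unknown sign: {sign!r}")
    corrs_sorted_args = np.argsort(key, kind="stable")
    # Keep only pairs that take part in scoring.
    keep = corrs_sorted_args[mask[corrs_sorted_args] != -1]
    return corrs[keep], mask[keep], train_val_mask[keep]

# Hypothetical flat arrays for 4 feature pairs.
corrs = np.array([0.1, -0.9, 0.5, 0.7])
mask = np.array([1, 0, -1, 1])
train_val_mask = np.array([0, 1, -1, 1])
corrs_f, mask_f, tv_f = sort_and_filter(corrs, mask, train_val_mask, "positive")
```

With the same inputs and sign="absolute", the pair with correlation -0.9 would move to the front of the filtered arrays.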