corradjust.metrics module
- class corradjust.metrics.CorrScorer(df_feature_ann, ref_feature_colls, shuffle_feature_names=False, metric='enrichment-based', min_pairs_to_score=1000, random_seed=17, verbose=True)
Bases: object
Class that contains data structures for features and reference feature sets, and computes different scores given feature-feature correlations. See the documentation of the CorrAdjust class constructor for a detailed description of the constructor parameters (they are the same).
- Parameters:
- df_feature_ann : pandas.DataFrame with index and 2 columns
A data frame providing annotation of features.
- ref_feature_colls : dict
A dictionary with information about reference feature set collections.
- shuffle_feature_names : bool or list, optional, default=False
Whether (and how) to permute feature names.
- metric : {"enrichment-based", "BP@K"}, optional, default="enrichment-based"
Metric for evaluating feature correlations.
- min_pairs_to_score : int, optional, default=1000
Minimum number of total pairs containing a feature to include the feature in the enrichment score computations.
- random_seed : int, optional, default=17
Random seed used by the shuffling procedure.
- verbose : bool, optional, default=True
Whether to print progress messages to stderr.
- Attributes:
- data : dict
The main data structure holding information about features and feature sets. Populated by calling the compute_index method. Keys of data are names of reference feature set collections, and values are dicts with the following structure:
{
"sign" : {"positive", "negative", "absolute"}
Taken directly from the input data.
"mask" : numpy.array
Array that says whether each feature pair belongs to at least one shared reference set. Feature pairs are packed into a flat array with n_features * (n_features - 1) // 2 elements (see numpy.triu_indices). There are 3 possible values:
  mask == 1 : features belong to the same reference set.
  mask == 0 : features do not belong to the same reference set, though they form an allowed pair type and both are present in some reference sets.
  mask == -1 : otherwise.
"train_val_mask" : numpy.array
Array that says whether each feature pair belongs to the training or validation set. Same order of pairs as in mask. There are 3 possible values:
  train_val_mask == 0 : pair belongs to the training set.
  train_val_mask == 1 : pair belongs to the validation set.
  train_val_mask == -1 : all other pairs.
"features_total_pos" : numpy.ndarray
2D array of shape (3, n_features). The first index refers to the feature pair type: 0 is training, 1 is validation, and 2 is all. The second index stands for the feature index, and the array value says how many reference pairs the feature has. These are n_j in the paper’s terminology.
"features_total_neg" : numpy.ndarray
Same as "features_total_pos" but for non-reference pairs. These are N_j - n_j in the paper’s terminology.
"scoring_features_mask" : numpy.array
Binary array with n_features elements that says whether each feature should participate in the computation of enrichment scores. The value is determined based on the min_pairs_to_score parameter.
"high_corr_frac" : float
Taken directly from the input data.
"high_corr_pairs_all" : int
Absolute number of highly ranked feature pairs (all pairs).
"high_corr_pairs_training" : int
Absolute number of highly ranked feature pairs (training pairs).
"high_corr_pairs_validation" : int
Absolute number of highly ranked feature pairs (validation pairs).
}
- feature_ids, feature_names, feature_types : np.array
Arrays of feature IDs, names, and types. Populated by the compute_index method.
- metric : {"enrichment-based", "BP@K"}
Metric for evaluating feature correlations. Populated by the constructor.
- verbose : bool
Whether to print progress messages. Populated by the constructor.
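The flat pair layout described for "mask" can be illustrated with a small sketch. It assumes the pairs follow the order of numpy.triu_indices with k=1 (diagonal excluded), which matches the stated length of n_features * (n_features - 1) // 2. The feature names and mask values below are hypothetical.

```python
import numpy as np

n_features = 4
feature_names = np.array(["A", "B", "C", "D"])  # hypothetical feature names

# Row/column indices of the strict upper triangle (assumption: k=1 ordering)
feature_idxs1, feature_idxs2 = np.triu_indices(n_features, k=1)
n_pairs = n_features * (n_features - 1) // 2

# Human-readable label for every position of the flat pair array
pairs = [f"{feature_names[i]}-{feature_names[j]}"
         for i, j in zip(feature_idxs1, feature_idxs2)]

# A hypothetical mask stored in the same pair order
mask = np.array([1, 0, -1, 1, -1, -1])
reference_pairs = [p for p, m in zip(pairs, mask) if m == 1]
print(pairs)
print(reference_pairs)
```

Any array that is described as having "the same order of pairs" can be decoded back to feature pairs with the same two index arrays.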
- compute_index(ref_feature_colls, min_pairs_to_score)
Populate the data attribute. Input arguments are the same as in the constructor.
- Parameters:
- ref_feature_colls : dict
A dictionary with information about reference feature set collections.
- min_pairs_to_score : int
Minimum number of total pairs containing a feature to include the feature in the enrichment score computations.
- compute_corr_scores(corrs, full_output=False)
Evaluate feature-feature correlations with respect to the reference feature sets.
- Parameters:
- corrs : numpy.array
Array of feature-feature correlations (flat format).
- full_output : bool, optional, default=False
Whether to return large arrays with detailed scores (not only final scores).
- Returns:
- res : dict
If full_output is False, the dict has the following structure:
{ref_feature_coll : {"score {subset}" : float}}
Here, subset is one of "training", "validation", and "all" (feature pair subsets). The score is defined according to the metric attribute. If full_output is True, the following keys are added to the dict corresponding to each ref_feature_coll:
{
"BPs_at_K {subset}" : numpy.array
Balanced precisions for each feature.
"enrichments {subset}" : numpy.array
Enrichments for each feature.
"pvalues {subset}" : numpy.array
p-values for each feature.
"pvalues_adj {subset}" : numpy.array
Adjusted p-values for each feature.
"TPs_at_K {subset}" : numpy.array
True positives for each feature (k_j in the notation of the paper).
"num_pairs {subset}" : numpy.array
True positives + false positives for each feature (K_j in the notation of the paper).
"scoring_features_mask" : numpy.array
Binary array of whether each feature participates in the score computation.
"corrs" : numpy.array
Feature-feature correlations.
"mask" : numpy.array
Whether each feature pair belongs to at least one shared reference set.
"train_val_mask" : numpy.array
Whether each feature pair belongs to the training or validation set.
}
The last four arrays are sorted by correlation value in a direction specified by the "sign" value of the data attribute.
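As a rough illustration of the flat correlation format and the sign-dependent sorting mentioned above, the sketch below flattens a correlation matrix and builds a sort order for each "sign" value. The helper sort_order_for_sign is hypothetical and not part of the package, and the mapping of signs to directions (largest first for "positive", smallest first for "negative", largest magnitude first for "absolute") is an assumption rather than documented behavior.

```python
import numpy as np

def sort_order_for_sign(corrs, sign):
    """Hypothetical helper: indices that rank pairs from most to least relevant."""
    if sign == "positive":
        key = -corrs            # assumption: largest correlations first
    elif sign == "negative":
        key = corrs             # assumption: most negative correlations first
    elif sign == "absolute":
        key = -np.abs(corrs)    # assumption: largest magnitudes first
    else:
        raise ValueError(f"Unknown sign: {sign!r}")
    return np.argsort(key, kind="stable")

# Flatten a feature-feature correlation matrix into the pair order of triu_indices
rng = np.random.default_rng(17)
X = rng.normal(size=(50, 4))               # 50 samples, 4 features (toy data)
corr_matrix = np.corrcoef(X, rowvar=False)
corrs = corr_matrix[np.triu_indices(4, k=1)]

order = sort_order_for_sign(corrs, "absolute")
print(corrs[order])
```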
- static _compute_ref_feature_coll_mask(feature_idxs_in_ref_sets, ref_feature_sets_boundaries, unique_feature_idxs_in_db, allowed_feature_pair_types, feature_idx_to_type_id, n_features)
Efficiently create a flat array that answers whether each feature pair belongs to at least one shared reference feature set.
- Parameters:
- feature_idxs_in_ref_sets : numpy.array
Array with indices of features that belong to each reference set. Each reference set is represented by a consecutive block of feature indices.
- ref_feature_sets_boundaries : numpy.array
Array with indices of feature_idxs_in_ref_sets, which tell where each reference set starts and ends.
- unique_feature_idxs_in_db : numpy.array
Array of feature indices that belong to at least one reference set.
- allowed_feature_pair_types : numpy.array
Array of feature pair types that participate in the score computations. Since Numba does not handle strings and tuples well, these should be integers computed using Cantor’s pairing function (see the compute_index code).
- feature_idx_to_type_id : numpy.array
Array mapping feature indices to feature types. Indices of the array stand for feature indices, and values stand for feature type indices.
- n_features : int
Total number of features.
- Returns:
- mask : numpy.array
See data attribute description.
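The inputs above can be made concrete with a slow, pure-Python reference sketch. It is not the package's Numba implementation. It assumes that reference set s occupies the slice boundaries[s]:boundaries[s + 1], that pair types are encoded as the Cantor pairing of the two type indices in sorted order, and that pairs follow the numpy.triu_indices(n_features, k=1) order. All function names are hypothetical.

```python
import numpy as np

def cantor_pair(a, b):
    """Cantor's pairing function: encode two non-negative ints as one int."""
    return (a + b) * (a + b + 1) // 2 + b

def flat_pair_index(i, j, n_features):
    """Position of pair (i, j), i < j, in the row-major strict upper triangle."""
    return i * n_features - i * (i + 1) // 2 + (j - i - 1)

def naive_ref_mask(feature_idxs_in_ref_sets, ref_feature_sets_boundaries,
                   allowed_feature_pair_types, feature_idx_to_type_id, n_features):
    mask = np.full(n_features * (n_features - 1) // 2, -1, dtype=np.int8)
    in_db = np.unique(feature_idxs_in_ref_sets)      # sorted, annotated features
    allowed = {int(t) for t in allowed_feature_pair_types}

    # Step 1: pairs of annotated features with an allowed pair type get 0
    for a in range(len(in_db)):
        for b in range(a + 1, len(in_db)):
            i, j = int(in_db[a]), int(in_db[b])
            t1, t2 = sorted((int(feature_idx_to_type_id[i]),
                             int(feature_idx_to_type_id[j])))
            if cantor_pair(t1, t2) in allowed:       # assumption: sorted type order
                mask[flat_pair_index(i, j, n_features)] = 0

    # Step 2: allowed pairs that share a reference set are upgraded to 1
    for s in range(len(ref_feature_sets_boundaries) - 1):
        start, end = ref_feature_sets_boundaries[s], ref_feature_sets_boundaries[s + 1]
        block = np.sort(feature_idxs_in_ref_sets[start:end])
        for a in range(len(block)):
            for b in range(a + 1, len(block)):
                idx = flat_pair_index(int(block[a]), int(block[b]), n_features)
                if mask[idx] == 0:
                    mask[idx] = 1
    return mask

# Toy input: 4 features of one type, reference sets {0, 1} and {1, 2}
toy_mask = naive_ref_mask(
    feature_idxs_in_ref_sets=np.array([0, 1, 1, 2]),
    ref_feature_sets_boundaries=np.array([0, 2, 4]),
    allowed_feature_pair_types=np.array([cantor_pair(0, 0)]),
    feature_idx_to_type_id=np.array([0, 0, 0, 0]),
    n_features=4,
)
print(toy_mask)
```

Feature 3 is absent from every reference set, so all of its pairs stay at -1 in this sketch.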
- static _compute_train_val_mask(mask)
Efficiently create a flat array that answers whether each feature pair belongs to the training or validation set. The function is deterministic, as the training and validation sets are derived by taking every second element.
- Parameters:
- mask : numpy.array
Array returned by the _compute_ref_feature_coll_mask method.
- Returns:
- train_val_mask : numpy.array
See data attribute description.
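One way to picture a deterministic split by "every second element" is sketched below. The description does not say over which pairs the alternation runs, so this sketch assumes it alternates over the pairs with mask != -1 in their stored order. The function name alternate_train_val is hypothetical.

```python
import numpy as np

def alternate_train_val(mask):
    """Hypothetical sketch: alternate usable pairs between training (0) and validation (1)."""
    train_val_mask = np.full(mask.shape, -1, dtype=np.int8)
    usable = np.flatnonzero(mask != -1)   # assumption: only these pairs are split
    train_val_mask[usable[0::2]] = 0      # 1st, 3rd, 5th, ... usable pair
    train_val_mask[usable[1::2]] = 1      # 2nd, 4th, 6th, ... usable pair
    return train_val_mask

toy_mask = np.array([1, -1, 0, 0, 1, -1, 0])
toy_train_val = alternate_train_val(toy_mask)
print(toy_train_val)
```

Because no random numbers are involved, repeated calls on the same mask always give the same split.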
- static _compute_feature_totals(mask, train_val_mask, feature_idxs1, feature_idxs2, n_features, min_pairs_to_score)
Efficiently compute statistics on the number of feature pairs that involve each individual feature.
- Parameters:
- mask : numpy.array
Array returned by the _compute_ref_feature_coll_mask method.
- train_val_mask : numpy.array
Array returned by the _compute_train_val_mask method.
- feature_idxs1 : numpy.array
Indices of the first features in the flat array of pairs.
- feature_idxs2 : numpy.array
Indices of the second features in the flat array of pairs.
- n_features : int
Number of features.
- min_pairs_to_score : int
Same as in the constructor.
- Returns:
- features_total_pos : numpy.ndarray
See data attribute description.
- features_total_neg : numpy.ndarray
See data attribute description.
- scoring_features_mask : numpy.array
See data attribute description.
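The per-feature totals can be sketched with numpy.bincount. This is an illustrative NumPy version, not the package's implementation. It assumes that a feature is scored when its total number of pairs in the "all" subset (reference plus non-reference) reaches min_pairs_to_score. The function name feature_totals is hypothetical.

```python
import numpy as np

def feature_totals(mask, train_val_mask, feature_idxs1, feature_idxs2,
                   n_features, min_pairs_to_score):
    """Hypothetical sketch of per-feature pair counts, shape (3, n_features)."""
    pos = np.zeros((3, n_features), dtype=np.int64)
    neg = np.zeros((3, n_features), dtype=np.int64)
    # Row 0: training pairs, row 1: validation pairs, row 2: all usable pairs
    subsets = (train_val_mask == 0, train_val_mask == 1, train_val_mask != -1)
    for row, subset in enumerate(subsets):
        for totals, label in ((pos, 1), (neg, 0)):
            sel = subset & (mask == label)
            # Each selected pair counts once for each of its two features
            totals[row] = (np.bincount(feature_idxs1[sel], minlength=n_features)
                           + np.bincount(feature_idxs2[sel], minlength=n_features))
    # Assumption: the threshold is applied to the "all" row, N_j = n_j + (N_j - n_j)
    scoring_features_mask = (pos[2] + neg[2]) >= min_pairs_to_score
    return pos, neg, scoring_features_mask

idxs1, idxs2 = np.triu_indices(3, k=1)            # pairs (0,1), (0,2), (1,2)
toy_pos, toy_neg, toy_scoring = feature_totals(
    mask=np.array([1, 0, 1]),
    train_val_mask=np.array([0, 1, 0]),
    feature_idxs1=idxs1, feature_idxs2=idxs2,
    n_features=3, min_pairs_to_score=2,
)
print(toy_pos, toy_neg, toy_scoring, sep="\n")
```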
- static _compute_scores_for_ref_feature_coll(corrs, corrs_sorted_args, mask, train_val_mask, high_corr_pairs_all, high_corr_pairs_train, high_corr_pairs_val, features_total_pos, features_total_neg, scoring_features_mask, feature_idxs1, feature_idxs2, num_regularization_pairs=5)
Evaluate feature-feature correlations with respect to one reference feature set collection.
- Parameters:
- corrs : numpy.array
Array of feature-feature correlations (flat format).
- corrs_sorted_args : numpy.array
Array of indices that sort the corrs array in the required order.
- mask : numpy.array
Array returned by the _compute_ref_feature_coll_mask method.
- train_val_mask : numpy.array
Array returned by the _compute_train_val_mask method.
- high_corr_pairs_all : int
See data attribute description.
- high_corr_pairs_train : int
See data attribute description.
- high_corr_pairs_val : int
See data attribute description.
- features_total_pos : numpy.ndarray
Array returned by the _compute_feature_totals method.
- features_total_neg : numpy.ndarray
Array returned by the _compute_feature_totals method.
- scoring_features_mask : numpy.array
Array returned by the _compute_feature_totals method.
- feature_idxs1 : numpy.array
Indices of the first features in the flat array of pairs.
- feature_idxs2 : numpy.array
Indices of the second features in the flat array of pairs.
- num_regularization_pairs : int, optional, default=5
Number of feature pairs to add to each K_j for Bayesian regularization purposes.
- Returns:
- BPs_at_K : numpy.array
Balanced precisions for each feature.
- enrichments : numpy.array
Enrichments for each feature.
- pvalues : numpy.array
p-values for each feature.
- TPs_at_K : numpy.array
True positives for each feature.
- FPs_at_K : numpy.array
False positives for each feature.
- corrs_filtered : numpy.array
Sorted feature-feature correlations with mask != -1.
- mask_filtered : numpy.array
Mask sorted in the same way as corrs_filtered.
- train_val_mask_filtered : numpy.array
Training/validation mask sorted in the same way as corrs_filtered.
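To show how per-feature true and false positives might be counted among the highly ranked pairs, here is a minimal sketch. It is not the package's implementation. It ignores the training/validation split and the regularization pairs, assumes a "positive" sign (largest correlations first), and treats the top n_top usable pairs as the highly ranked ones. The function name top_pair_counts is hypothetical.

```python
import numpy as np

def top_pair_counts(corrs, mask, feature_idxs1, feature_idxs2, n_features, n_top):
    """Hypothetical sketch: per-feature TP and FP counts among the top-ranked pairs."""
    usable = np.flatnonzero(mask != -1)                   # drop pairs with mask == -1
    order = usable[np.argsort(-corrs[usable], kind="stable")]
    top = order[:n_top]                                   # highly ranked pairs

    def per_feature(pair_idxs):
        # Each pair contributes one count to each of its two features
        return (np.bincount(feature_idxs1[pair_idxs], minlength=n_features)
                + np.bincount(feature_idxs2[pair_idxs], minlength=n_features))

    TPs_at_K = per_feature(top[mask[top] == 1])           # reference pairs in the top
    FPs_at_K = per_feature(top[mask[top] == 0])           # non-reference pairs in the top
    return TPs_at_K, FPs_at_K, corrs[order], mask[order]

idxs1, idxs2 = np.triu_indices(4, k=1)
toy_corrs = np.array([0.9, 0.1, 0.8, 0.7, 0.2, 0.3])
toy_mask = np.array([1, 0, -1, 0, 1, 0])
tps, fps, corrs_filtered, mask_filtered = top_pair_counts(
    toy_corrs, toy_mask, idxs1, idxs2, n_features=4, n_top=3
)
print(tps, fps, corrs_filtered, mask_filtered, sep="\n")
```

In this sketch, TPs_at_K + FPs_at_K gives the number of top-ranked pairs per feature, which corresponds to K_j before any regularization pairs are added.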