corradjust.utils module
- class corradjust.utils.MeanCenterizer
Bases: object
Center data by subtracting the mean value of each feature. This class has two important features:
It is train/test friendly.
It can center different groups of samples separately, thus residualizing the group effect from the input data.
- Attributes:
- all_features : list
List of all features of the training set.
- means : dict
A dict whose keys are group names and whose values are numpy arrays with feature-wise training set means. Populated by calling the fit method. If the df_samp_ann passed to fit is None, means will have a single key: "all_samples".
- fit(df_data, df_samp_ann=None)
Compute and save mean values for centering in the means attribute.
- Parameters:
- df_data : pandas.DataFrame
Input data. Index = samples, columns = feature ids.
- df_samp_ann : pandas.DataFrame or None, optional, default=None
A data frame providing group annotation of samples. If None, means across all samples will be computed.
Index: sample names (should match the index of df_data).
Column group: discrete set of group names (e.g., "normal" and "tumor").
- transform(df_data, df_samp_ann=None)
Center the data using the mean values computed during the fit call.
- Parameters:
- df_data : pandas.DataFrame
Input data. Index = samples, columns = feature ids.
- df_samp_ann : pandas.DataFrame or None, optional, default=None
A data frame providing group annotation of samples. See fit for the format.
- Returns:
- pandas.DataFrame
Mean-centered version of df_data.
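The group-wise, train/test friendly centering described above can be illustrated with plain pandas. This is a minimal sketch of the idea, not the class's implementation: the helper names fit_group_means and center_by_group and the toy data are hypothetical, and only the "group" column convention comes from the documentation.

```python
import pandas as pd

def fit_group_means(df_data, df_samp_ann):
    """Feature-wise means per group, computed on the training set only."""
    groups = df_samp_ann.loc[df_data.index, "group"]
    return df_data.groupby(groups).mean()

def center_by_group(df_data, df_samp_ann, group_means):
    """Subtract from each sample the training-set mean of its group."""
    groups = df_samp_ann.loc[df_data.index, "group"]
    # One row of means per sample, in the same order as df_data.
    means_per_sample = group_means.loc[groups.to_numpy()].to_numpy()
    return df_data - means_per_sample

# Hypothetical toy data: 4 training samples, 1 test sample, 2 features.
train = pd.DataFrame(
    {"g1": [1.0, 3.0, 10.0, 14.0], "g2": [0.0, 2.0, 5.0, 7.0]},
    index=["s1", "s2", "s3", "s4"],
)
train_ann = pd.DataFrame(
    {"group": ["normal", "normal", "tumor", "tumor"]}, index=train.index
)
test = pd.DataFrame({"g1": [4.0], "g2": [1.0]}, index=["t1"])
test_ann = pd.DataFrame({"group": ["normal"]}, index=test.index)

means = fit_group_means(train, train_ann)
train_centered = center_by_group(train, train_ann, means)
# The test sample is centered with the *training* means of its group.
test_centered = center_by_group(test, test_ann, means)
```

After centering, each group in the training set has zero mean for every feature, while test samples are shifted by the training means rather than by their own.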
- class corradjust.utils.Winsorizer(alpha=0.05)
Bases: object
Winsorize each feature. This class is train/test friendly.
- Parameters:
- alpha : float, optional, default=0.05
For each feature, the alpha fraction of the lowest values and the alpha fraction of the highest values will be changed to the corresponding quantile values.
- Attributes:
- alpha : float
Populated by the constructor.
- all_features : list
List of all features of the training set.
- lower_thresholds, upper_thresholds : numpy.array
Arrays with feature-wise training set alpha and 1 - alpha quantiles. Populated by calling the fit method.
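As an illustration of train/test friendly winsorization, the sketch below computes the alpha and 1 - alpha quantiles on a training matrix and clips any other matrix to them. It is written with NumPy only; the function names fit_thresholds and winsorize are hypothetical, and the class itself may differ in details such as the quantile interpolation method.

```python
import numpy as np

def fit_thresholds(X_train, alpha=0.05):
    """Feature-wise alpha and 1 - alpha quantiles of the training set."""
    lower = np.quantile(X_train, alpha, axis=0)
    upper = np.quantile(X_train, 1 - alpha, axis=0)
    return lower, upper

def winsorize(X, lower, upper):
    """Clip each feature (column) to its training-set thresholds."""
    return np.clip(X, lower, upper)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))      # samples x features
X_test = rng.normal(size=(50, 3)) * 5.0  # wider spread than training data

lower, upper = fit_thresholds(X_train, alpha=0.05)
X_test_w = winsorize(X_test, lower, upper)
```

Because the thresholds come from the training set only, test values never influence where the clipping happens.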
- class corradjust.utils.MedianOfRatios
Bases: object
Median of ratios normalization for RNA-seq data. This class is train/test friendly.
- Attributes:
- all_features : list
List of all features of the training set.
- geo_means : numpy.array
Array with gene-wise training set geometric means. Populated by calling the fit method.
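For readers unfamiliar with the method, here is a minimal NumPy sketch of median of ratios normalization in its common form: gene-wise geometric means are taken from the training set, and each sample is divided by the median of its count-to-geometric-mean ratios. The function names, the handling of genes with zero counts, and the absence of any log transform are assumptions of this sketch and may not match the class.

```python
import numpy as np

def fit_geo_means(counts_train):
    """Gene-wise geometric means over training samples (samples x genes)."""
    with np.errstate(divide="ignore"):
        # A gene with any zero count gets a geometric mean of 0.
        return np.exp(np.mean(np.log(counts_train), axis=0))

def size_factors(counts, geo_means):
    """Per-sample median of count / geometric-mean ratios."""
    usable = geo_means > 0  # assumption: skip genes with zero geometric mean
    ratios = counts[:, usable] / geo_means[usable]
    return np.median(ratios, axis=1)

def normalize(counts, geo_means):
    """Divide each sample by its size factor."""
    return counts / size_factors(counts, geo_means)[:, None]

# Hypothetical counts: the second sample is sequenced twice as deeply.
counts_train = np.array([[10.0, 20.0, 30.0],
                         [20.0, 40.0, 60.0]])
counts_test = np.array([[5.0, 10.0, 15.0]])

geo_means = fit_geo_means(counts_train)
train_norm = normalize(counts_train, geo_means)
test_norm = normalize(counts_test, geo_means)
```

In this toy case the samples differ only in sequencing depth, so all of them normalize to the same profile, including the test sample that was not used to compute the geometric means.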
- corradjust.utils.make_dummy_samp_ann(df_data)
Make a data frame with sample annotation, where all samples belong to the same group named "all_samp".
- Parameters:
- df_data : pandas.DataFrame
Data frame with Index = samples.
- Returns:
- pandas.DataFrame
Data frame with the same Index and a single column "group" filled with the constant string "all_samp".
- corradjust.utils.compute_pearson_columns(X)
Compute Pearson’s correlation for all-vs-all columns of the input 2D array.
- Parameters:
- X : numpy.ndarray
2D array with variables of interest over columns.
- Returns:
- numpy.array
1D array of correlations (diagonal 1s are not included). For speed-up purposes, the data type is float32.
Notes
The numpy.triu_indices function with parameter k=1 can be used to establish the correspondence between an index of the resulting flat array and the standard 2D correlation matrix.
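The correspondence mentioned in the note can be sketched as follows. Here numpy.corrcoef stands in for the library function, purely to have a 2D correlation matrix to index; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))  # 30 samples, 4 variables in columns

# Standard 2D correlation matrix (stand-in for the library computation).
corrs_2d = np.corrcoef(X, rowvar=False)
n_features = corrs_2d.shape[0]

# Upper triangle without the diagonal, in row-major order.
rows, cols = np.triu_indices(n_features, k=1)
corrs_flat = corrs_2d[rows, cols].astype(np.float32)

# Element i of the flat array is the correlation between
# columns rows[i] and cols[i] of X.
pairs = list(zip(rows.tolist(), cols.tolist()))
```

For 4 variables the flat array has 4 * 3 // 2 = 6 elements, ordered as (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3).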
- corradjust.utils.flatten_corrs(corrs_2d)
Convert a 2D correlation matrix into a 1D array. This is a parallel version whose result is equivalent to corrs_2d[numpy.triu_indices(n_features, k=1)].
- Parameters:
- corrs_2d : numpy.ndarray
Square (n, n) symmetric matrix with correlations.
- Returns:
- numpy.array
1D array with correlations (without diagonal 1s). The size of the array is n * (n - 1) // 2.
- corradjust.utils.compute_corr_pvalues(corrs, n)
Compute p-values for the null hypothesis that each input Pearson’s correlation is equal to zero.
- Parameters:
- corrs : numpy.array
1D array with correlations.
- n : int
Sample size used when computing correlations.
- Returns:
- numpy.array
1D array of p-values corresponding to corrs.
Notes
This function produces the same p-values as the scipy.stats.pearsonr function with default parameters.
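One standard way to obtain such p-values from correlations alone is the two-sided t-test with n - 2 degrees of freedom, which agrees with the default scipy.stats.pearsonr p-value. The sketch below assumes SciPy is available and that no correlation equals exactly 1 or -1; the function name corr_pvalues_sketch is hypothetical, and the library may compute the same quantity differently.

```python
import numpy as np
from scipy import stats

def corr_pvalues_sketch(corrs, n):
    """Two-sided p-values for H0: correlation = 0, given sample size n."""
    corrs = np.asarray(corrs, dtype=np.float64)
    df = n - 2
    # t statistic of a Pearson correlation; undefined for |r| == 1.
    t = corrs * np.sqrt(df / (1.0 - corrs ** 2))
    return 2.0 * stats.t.sf(np.abs(t), df)

# Compare against scipy.stats.pearsonr on hypothetical data.
rng = np.random.default_rng(1)
x = rng.normal(size=25)
y = 0.5 * x + rng.normal(size=25)

r, p_ref = stats.pearsonr(x, y)
p_sketch = corr_pvalues_sketch([r], n=25)[0]
```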
- corradjust.utils.regress_out_columns(X1, X2)
Regress out all columns of X2 from each column of X1.
- Parameters:
- X1 : numpy.ndarray
2D array with variables of interest over columns.
- X2 : numpy.ndarray
2D array with variables to residualize over columns. Should have the same number of rows as X1.
- Returns:
- residuals : numpy.ndarray
Residuals after regressing out columns of X2 from each column of X1. Has the same shape as X1.
- rsquareds : numpy.array
1D array of R^2 values corresponding to the regressions fit for each column of X1.
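Regressing out covariates amounts to one least-squares fit per column of X1, which NumPy can solve for all columns at once. The sketch below is an illustration under the assumption that an intercept is included in the model; whether the library adds one is not stated above, and the function name regress_out_sketch is hypothetical.

```python
import numpy as np

def regress_out_sketch(X1, X2):
    """Residuals and R^2 of regressing each column of X1 on columns of X2."""
    # Assumption: the model includes an intercept term.
    design = np.column_stack([np.ones(X2.shape[0]), X2])
    # Solves all regressions at once: one coefficient column per X1 column.
    coefs, *_ = np.linalg.lstsq(design, X1, rcond=None)
    residuals = X1 - design @ coefs
    ss_res = np.sum(residuals ** 2, axis=0)
    ss_tot = np.sum((X1 - X1.mean(axis=0)) ** 2, axis=0)
    rsquareds = 1.0 - ss_res / ss_tot
    return residuals, rsquareds

rng = np.random.default_rng(0)
X2 = rng.normal(size=(50, 2))
X1 = np.column_stack([
    2.0 * X2[:, 0] - X2[:, 1] + 1.0,  # fully explained by X2
    rng.normal(size=50),              # unrelated noise
])

residuals, rsquareds = regress_out_sketch(X1, X2)
```

The first column is an exact linear function of X2, so its R^2 is 1 up to rounding error, while the noise column keeps most of its variance.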
- corradjust.utils.find_peak(scores, patience=2, min_rel_improvement=0.05, max_rel_diff_from_max=0.5)
Find the peak using early stopping. The stopping criteria are:
- patience rounds with no min_rel_improvement compared to the previous maximum value (plateau reached).
- The relative difference between the global maximum and the current maximum is at most max_rel_diff_from_max (not terminating too early).
- Parameters:
- scores : numpy.array
1D array with scores used to find the peak.
- patience : int, optional, default=2
See description above.
- min_rel_improvement : float, optional, default=0.05
See description above.
- max_rel_diff_from_max : float, optional, default=0.5
See description above.
- Returns:
- int
Index of peak element in the scores array.
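The sketch below shows one possible reading of the two criteria, assuming positive scores and assuming that the search stops only when both criteria hold at once. It is not the library's implementation, and the exact rules used there (for example, how relative improvement is measured) may differ; find_peak_sketch is a hypothetical name.

```python
def find_peak_sketch(scores, patience=2, min_rel_improvement=0.05,
                     max_rel_diff_from_max=0.5):
    """Index of the peak under one reading of the early stopping rules."""
    global_max = max(scores)
    best_idx = 0  # index of the current maximum
    stale = 0     # rounds without sufficient improvement
    for i in range(1, len(scores)):
        if scores[i] > scores[best_idx] * (1 + min_rel_improvement):
            best_idx = i
            stale = 0
        else:
            stale += 1
        # Guard against stopping far below the global maximum.
        close_enough = (
            global_max - scores[best_idx]
            <= max_rel_diff_from_max * global_max
        )
        if stale >= patience and close_enough:
            break
    return best_idx

# Plateau around 3.0 that is close enough to the global maximum of 5.0.
peak_plateau = find_peak_sketch([1.0, 2.0, 3.0, 3.05, 3.1, 5.0])
# Early plateau far below the global maximum: the search continues.
peak_late = find_peak_sketch([1.0, 1.01, 1.02, 10.0])
```

In the first call the search stops on the plateau and returns index 2; in the second, the plateau near 1.0 is too far below the global maximum, so the search continues and returns index 3.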
- corradjust.utils.hypergeom_pvalues(M_arr, n_arr, N_arr, k_arr)
Parallel computation of hypergeometric test p-values over arrays of values M, n, N, k. Semantics for argument names are the same as in scipy.stats.hypergeom.
- Parameters:
- M_arr : numpy.array
Total number of objects in each test.
- n_arr : numpy.array
Number of Type I objects in each test.
- N_arr : numpy.array
Number of drawn objects in each test.
- k_arr : numpy.array
Number of Type I drawn objects in each test.
- Returns:
- numpy.array
Array of p-values corresponding to each test. The minimum p-value is defined by numpy.finfo(numpy.float64).smallest_normal.
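For a single test, a hypergeometric p-value can be computed in log space from binomial coefficients, which avoids overflow for large M. The sketch below uses only the standard library and assumes a one-sided over-representation test, P(X >= k); the tail actually used by the library is not stated above. The function names are hypothetical, and sys.float_info.min is used as the standard-library counterpart of the smallest normal float64.

```python
import math
import sys

def log_binom(n, k):
    """Logarithm of the binomial coefficient 'n choose k' via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def hypergeom_pvalue_sketch(M, n, N, k):
    """P(X >= k) for X ~ Hypergeometric(M, n, N); assumed upper tail."""
    log_denom = log_binom(M, N)
    # Support of X is max(0, N - (M - n)) .. min(n, N).
    start = max(k, 0, N - (M - n))
    total = 0.0
    for i in range(start, min(n, N) + 1):
        total += math.exp(
            log_binom(n, i) + log_binom(M - n, N - i) - log_denom
        )
    # Keep the result within [smallest normal float, 1].
    return min(1.0, max(total, sys.float_info.min))

# 10 objects, 5 of Type I, 5 drawn, all 5 drawn are Type I.
p_extreme = hypergeom_pvalue_sketch(M=10, n=5, N=5, k=5)
# k = 0 covers the whole support, so the p-value is 1.
p_trivial = hypergeom_pvalue_sketch(M=10, n=5, N=5, k=0)
```

The first case has a single favorable outcome out of "10 choose 5" = 252 possible draws, so the p-value is 1/252.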
- corradjust.utils.log_binom_coef(n, k)
Logarithm of the binomial coefficient C(n, k) ("n choose k").
- Parameters:
- n, k : int
Input numbers.
- Returns:
- float
Logarithm of the binomial coefficient.
- corradjust.utils.compute_aggregated_scores(df)
Compute aggregated statistics from the data frame of feature-wise scores.
- Parameters:
- df : pandas.DataFrame
Data frame produced by CorrAdjust.compute_feature_scores.
- Returns:
- agg_BP_at_K : float
Aggregated balanced precision at K.
- agg_enrichment : float
Aggregated enrichment.
- agg_pvalue : float
Aggregated p-value.