corradjust.utils module

class corradjust.utils.MeanCenterizer

Bases: object

Center data by subtracting the mean value from each feature. This class has two important properties:

It is train/test friendly.
It can center different groups of samples separately, thus removing (residualizing) the group effect from the input data.

Attributes:

all_features (list): List of all features of the training set.
means (dict): A dict whose keys are group names and whose values are numpy arrays with feature-wise training set means. Populated by calling the fit method. If df_samp_ann passed to fit is None, means will have a single key: "all_samples".

fit(df_data, df_samp_ann=None)

Compute and save mean values for centering in the means attribute.

Parameters:

df_data (pandas.DataFrame)

Input data. Index = samples, columns = feature ids.

df_samp_ann (pandas.DataFrame or None, optional, default=None)

A data frame providing group annotation of samples. If None, means across all samples will be computed.

Index: sample names (should match index of df_data).
Column group: discrete set of group names (e.g., "normal" and "tumor").

transform(df_data, df_samp_ann=None)

Center the data using mean values computed during fit call.

Parameters:

df_data (pandas.DataFrame): Input data. Index = samples, columns = feature ids.
df_samp_ann (pandas.DataFrame or None, optional, default=None): A data frame providing group annotation of samples. See fit for the format.

Returns:

pandas.DataFrame: Mean-centered version of df_data.
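
The idea of train/test-friendly, group-wise centering can be sketched in plain Python. This is an illustration only: the helper names `fit_group_means` and `center` are hypothetical, and the class itself works on pandas data frames rather than lists.

```python
from collections import defaultdict

def fit_group_means(rows, groups):
    """Return {group: [mean of each feature]} computed from training rows."""
    sums = {}
    counts = defaultdict(int)
    for row, group in zip(rows, groups):
        if group not in sums:
            sums[group] = [0.0] * len(row)
        for j, value in enumerate(row):
            sums[group][j] += value
        counts[group] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}

def center(rows, groups, means):
    """Subtract the stored training means of each row's group."""
    return [[v - m for v, m in zip(row, means[g])] for row, g in zip(rows, groups)]

# Rows are samples, columns are features (hypothetical toy data).
train = [[1.0, 10.0], [3.0, 14.0], [10.0, 0.0], [20.0, 4.0]]
train_groups = ["normal", "normal", "tumor", "tumor"]
means = fit_group_means(train, train_groups)

# Test samples are centered with the training means, not with their own means.
test = [[2.0, 12.0], [16.0, 2.0]]
test_centered = center(test, ["normal", "tumor"], means)
```

Reusing the training means at transform time is what keeps information from the test set out of the fitted parameters.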

class corradjust.utils.Winsorizer(alpha=0.05)

Bases: object

Winsorize each feature. This class is train/test friendly.

Parameters:

alpha (float, optional, default=0.05): For each feature, alpha fraction of the lowest values and alpha fraction of the highest values will be changed to the corresponding quantile values.

Attributes:

alpha (float): Populated by the constructor.
all_features (list): List of all features of the training set.
lower_thresholds, upper_thresholds (numpy.array): Arrays with feature-wise training set alpha and 1 - alpha quantiles. Populated by calling the fit method.

fit(df_data)

Compute and save winsorization thresholds.

Parameters:

df_data (pandas.DataFrame): Input data. Index = samples, columns = feature ids.

transform(df_data)

Winsorize the data using threshold values computed during fit call.

Parameters:

df_data (pandas.DataFrame): Input data. Index = samples, columns = feature ids.

Returns:

pandas.DataFrame: Winsorized version of df_data.
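
A minimal sketch of winsorization for a single feature, assuming linearly interpolated quantiles (the interpolation rule used by the class is not stated here, so treat it as an assumption). The helpers `quantile`, `fit_thresholds`, and `winsorize` are hypothetical names for illustration.

```python
def quantile(sorted_values, q):
    """Linearly interpolated quantile of an ascending list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def fit_thresholds(train_column, alpha=0.05):
    """Compute the alpha and 1 - alpha quantiles from training values."""
    ordered = sorted(train_column)
    return quantile(ordered, alpha), quantile(ordered, 1 - alpha)

def winsorize(column, lower, upper):
    """Clip every value into the [lower, upper] interval."""
    return [min(max(v, lower), upper) for v in column]

train_column = [float(v) for v in range(101)]  # 0, 1, ..., 100
lower, upper = fit_thresholds(train_column, alpha=0.05)

# New (test) values are clipped with the thresholds learned on training data.
test_clipped = winsorize([-50.0, 42.0, 500.0], lower, upper)
```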

class corradjust.utils.MedianOfRatios

Bases: object

Median of ratios normalization for RNA-seq data. This class is train/test friendly.

Attributes:

all_features (list): List of all features of the training set.
geo_means (numpy.array): Array with gene-wise training set geometric means. Populated by calling the fit method.

fit(df_data)

Compute and save gene-wise geometric means.

Parameters:

df_data (pandas.DataFrame): Gene expression table. Index = samples, columns = gene ids.

transform(df_data)

Normalize the data using geometric mean values computed during fit call.

Parameters:

df_data (pandas.DataFrame): Gene expression table. Index = samples, columns = gene ids.

Returns:

pandas.DataFrame: Normalized version of df_data.
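
The general median-of-ratios idea can be sketched as follows. This is a simplified illustration that assumes strictly positive counts and divides each sample by its size factor; how the class handles zeros or any log transformation is not covered here. The helper names are hypothetical.

```python
from statistics import geometric_mean, median

def fit_geo_means(train_rows):
    """Gene-wise geometric means over training samples (rows = samples)."""
    n_genes = len(train_rows[0])
    return [geometric_mean([row[j] for row in train_rows]) for j in range(n_genes)]

def normalize(rows, geo_means):
    """Divide each sample by the median of its gene-wise ratios to geo_means."""
    normalized = []
    for row in rows:
        size_factor = median(v / g for v, g in zip(row, geo_means))
        normalized.append([v / size_factor for v in row])
    return normalized

# Two training samples; the second was sequenced four times deeper.
train = [[10.0, 100.0, 1000.0], [40.0, 400.0, 4000.0]]
geo_means = fit_geo_means(train)

# A new sample is scaled against the training geometric means.
normalized = normalize([[40.0, 400.0, 4000.0]], geo_means)
```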

corradjust.utils.make_dummy_samp_ann(df_data)

Make a data frame with sample annotation, where all samples belong to the same group named "all_samp".

Parameters:

df_data (pandas.DataFrame): Data frame with Index = samples.

Returns:

pandas.DataFrame: Data frame with the same Index and a single column "group" filled with a constant string "all_samp".

corradjust.utils.compute_pearson_columns(X)

Compute Pearson’s correlation between all pairs of columns of the input 2D array.

Parameters:

X (numpy.ndarray): 2D array with variables of interest over columns.

Returns:

numpy.array: 1D array of correlations (diagonal 1s are not included). For speed-up purposes, data type is float32.

Notes

The numpy.triu_indices function with parameter k=1 can be used to establish correspondence between index of the resulting flat array and standard 2D correlation matrix.
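
The correspondence can be illustrated without numpy: the row-major order of the strict upper triangle, which is the order numpy.triu_indices(n, k=1) returns, is the same order in which itertools.combinations enumerates index pairs. The helper name below is hypothetical.

```python
from itertools import combinations

def flat_index_to_pairs(n_columns):
    """List (i, j) column pairs in row-major upper-triangle order (i < j)."""
    return list(combinations(range(n_columns), 2))

# For 4 columns, pairs[k] is the (row, column) position in the 4 x 4
# correlation matrix that corresponds to element k of the flat array.
pairs = flat_index_to_pairs(4)
```

With 4 columns the flat array has 4 * 3 // 2 = 6 elements, and, for example, flat index 3 corresponds to the correlation between columns 1 and 2.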

corradjust.utils.flatten_corrs(corrs_2d)

Convert a 2D correlation matrix into a 1D array. This is a parallel version whose result is equivalent to corrs_2d[numpy.triu_indices(n_features, k=1)].

Parameters:

corrs_2d (numpy.ndarray): Square (n, n) symmetric matrix with correlations.

Returns:

numpy.array: 1D array with correlations (without diagonal 1s). Size of array is n * (n - 1) // 2.

corradjust.utils.compute_corr_pvalues(corrs, n)

Compute p-values for the null hypothesis that each input Pearson’s correlation is equal to zero.

Parameters:

corrs (numpy.array): 1D array with correlations.
n (int): Sample size used when computing correlations.

Returns:

numpy.array: 1D array of p-values corresponding to corrs.

Notes

This function produces the same p-values as the scipy.stats.pearsonr function with default parameters.
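
For orientation, the textbook two-sided test of zero correlation for a Pearson coefficient r computed from n samples (under the usual bivariate normality assumption) can be written as below. This is the standard formulation, given as background; it is not a statement about how the function is implemented internally.

```latex
t = r \sqrt{\frac{n - 2}{1 - r^{2}}}, \qquad
p = 2 \, \Pr\left( T_{n-2} \ge |t| \right)
```

Here T_{n-2} denotes a Student's t random variable with n - 2 degrees of freedom.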

corradjust.utils.regress_out_columns(X1, X2)

Regress out all columns of X2 from each column of X1.

Parameters:

X1 (numpy.ndarray): 2D array with variables of interest over columns.
X2 (numpy.ndarray): 2D array with variables to residualize over columns. Should have the same number of rows as X1.

Returns:

residuals (numpy.ndarray): Residuals after regressing out columns of X2 from each column of X1. Has the same shape as X1.
rsquareds (numpy.array): 1D array of R^2 values corresponding to regressions fit for each column of X1.
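
To make "regressing out" concrete, here is a minimal single-covariate sketch in plain Python. It assumes an ordinary least-squares fit with an intercept, which may differ from what the function does (for example, it may expect pre-centered inputs and it handles many columns at once). The name `regress_out_one` is hypothetical.

```python
def regress_out_one(y, x):
    """Residuals and R^2 of the least-squares fit y ~ a + b * x."""
    n = len(y)
    mean_y, mean_x = sum(y) / n, sum(x) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals are what remains of y after removing the fitted effect of x.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return residuals, 1.0 - ss_res / ss_tot

x = [0.0, 1.0, 2.0, 3.0]
y_linear = [1.0, 3.0, 5.0, 7.0]  # exactly 2 * x + 1
residuals, r2 = regress_out_one(y_linear, x)
```

Because `y_linear` is an exact linear function of `x`, the residuals vanish (up to rounding) and R^2 is 1.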

corradjust.utils.find_peak(scores, patience=2, min_rel_improvement=0.05, max_rel_diff_from_max=0.5)

Find the peak using the following early stopping criteria.

There have been patience rounds with no min_rel_improvement compared to the previous maximum value (plateau reached).
The relative difference between the global maximum and the current maximum is at most max_rel_diff_from_max (not terminating too early).

Parameters:

scores (numpy.array): 1D array with scores used to find the peak.
patience (int, optional, default=2): See description above.
min_rel_improvement (float, optional, default=0.05): See description above.
max_rel_diff_from_max (float, optional, default=0.5): See description above.

Returns:

int: Index of peak element in the scores array.
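
The sketch below shows one possible reading of the two criteria, assuming positive scores, that both criteria must hold to stop, and that "improvement" means exceeding the current maximum by a factor of 1 + min_rel_improvement. These are assumptions made for illustration; `find_peak_sketch` is a hypothetical name and not the library's implementation.

```python
def find_peak_sketch(scores, patience=2, min_rel_improvement=0.05,
                     max_rel_diff_from_max=0.5):
    """Illustrative early-stopping peak search; assumes positive scores."""
    global_max = max(scores)
    best, best_idx, stale_rounds = scores[0], 0, 0
    for i in range(1, len(scores)):
        if scores[i] > best * (1 + min_rel_improvement):
            # Meaningful improvement: new current maximum, reset the counter.
            best, best_idx, stale_rounds = scores[i], i, 0
        else:
            stale_rounds += 1
        plateau = stale_rounds >= patience
        close_to_global = (global_max - best) / global_max <= max_rel_diff_from_max
        if plateau and close_to_global:
            break
    return best_idx

# Plateau right after index 2: later gains are below 5%.
peak_a = find_peak_sketch([1.0, 2.0, 3.0, 3.01, 3.02, 2.9])

# Early plateau near 1.0 is ignored because it is far below the global
# maximum; the search continues and stops at the later plateau.
peak_b = find_peak_sketch([1.0, 1.01, 1.02, 5.0, 5.1, 5.05])
```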

corradjust.utils.hypergeom_pvalues(M_arr, n_arr, N_arr, k_arr)

Parallel computation of hypergeometric test p-values over arrays of values M, n, N, k. Semantics for argument names are the same as in scipy.stats.hypergeom.

Parameters:

M_arr (numpy.array): Total number of objects in each test.
n_arr (numpy.array): Number of Type I objects in each test.
N_arr (numpy.array): Number of drawn objects in each test.
k_arr (numpy.array): Number of Type I drawn objects in each test.

Returns:

numpy.array: Array of p-values corresponding to each test. P-values are bounded below by numpy.finfo(numpy.float64).smallest_normal.
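
For a single test, a hypergeometric enrichment p-value can be sketched with exact integer arithmetic. The example assumes the one-sided upper tail P(X >= k), which is the usual choice for enrichment testing but is an assumption here; `hypergeom_upper_tail` is a hypothetical name.

```python
from math import comb

def hypergeom_upper_tail(M, n, N, k):
    """P(X >= k) for X ~ Hypergeometric(M, n, N), scipy.stats.hypergeom naming."""
    total = comb(M, N)
    upper = min(n, N)
    # Count draws of size N that contain at least k Type I objects.
    favorable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favorable / total

# 10 objects, 5 of Type I, 5 drawn, all 5 drawn objects are Type I.
p_extreme = hypergeom_upper_tail(10, 5, 5, 5)

# Observing "at least 0" Type I objects is certain.
p_trivial = hypergeom_upper_tail(10, 5, 5, 0)
```

Exact binomial coefficients are fine for small inputs like these; for large arguments, working with logarithms of the coefficients avoids huge intermediate integers.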

corradjust.utils.log_binom_coef(n, k)

Logarithm of the binomial coefficient "n choose k".

Parameters:

n, k (int): Input numbers.

Returns:

float: Logarithm of the binomial coefficient.
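
A common way to compute the logarithm of "n choose k" without overflow is through the log-gamma function, using n! = Gamma(n + 1). The sketch below uses the standard library and is an illustration under that assumption, not necessarily the function's actual implementation; `log_binom_sketch` is a hypothetical name.

```python
from math import lgamma, log

def log_binom_sketch(n, k):
    """Natural log of the binomial coefficient via log-gamma."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

# "5 choose 2" is 10, so the result should match log(10).
value = log_binom_sketch(5, 2)
expected = log(10)

# Large arguments stay finite even though the coefficient itself is enormous.
large_value = log_binom_sketch(1_000_000, 500_000)
```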

corradjust.utils.compute_aggregated_scores(df)

Compute aggregated statistics from the data frame of feature-wise scores.

Parameters:

df (pandas.DataFrame): Data frame produced by CorrAdjust.compute_feature_scores.

Returns:

agg_BP_at_K (float): Aggregated balanced precision at K.
agg_enrichment (float): Aggregated enrichment.
agg_pvalue (float): Aggregated p-value.