corradjust.utils module

class corradjust.utils.MeanCenterizer

Bases: object

Center data by subtracting the mean value of each feature. This class has two important properties.

  • It is train/test friendly: the means computed from the training set by fit are reused by transform on any data set.

  • It can center different groups of samples separately, thereby removing the group effect from the input data.

Attributes:
all_features: list

List of all features of the training set.

means: dict

A dict whose keys are group names and whose values are numpy arrays of feature-wise training set means. Populated by calling the fit method. If df_samp_ann passed to fit is None, means will have a single key: "all_samples".

fit(df_data, df_samp_ann=None)

Compute and save mean values for centering in the means attribute.

Parameters:
df_data: pandas.DataFrame

Input data. Index = samples, columns = feature ids.

df_samp_ann: pandas.DataFrame or None, optional, default=None

A data frame providing group annotation of samples. If None, means across all samples will be computed.

  • Index: sample names (should match index of df_data).

  • Column group: discrete set of group names (e.g., "normal" and "tumor").

transform(df_data, df_samp_ann=None)

Center the data using mean values computed during fit call.

Parameters:
df_data: pandas.DataFrame

Input data. Index = samples, columns = feature ids.

df_samp_ann: pandas.DataFrame or None, optional, default=None

A data frame providing group annotation of samples. See fit for the format.

Returns:
pandas.DataFrame

Mean-centered version of df_data.
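
The group-wise, train/test-friendly centering described above can be sketched with plain pandas and numpy. This is an illustrative sketch under stated assumptions, not the library's implementation: the helper names fit_group_means and center_by_group and the toy data are hypothetical, and the means are stored here as pandas Series for brevity.

```python
import numpy as np
import pandas as pd

def fit_group_means(df_data, df_samp_ann):
    """Return {group name: feature-wise means} computed on the training data."""
    # Align the annotation to the data index, then average within each group
    groups = df_samp_ann.loc[df_data.index, "group"]
    return {g: df_data[groups == g].mean(axis=0) for g in groups.unique()}

def center_by_group(df_data, df_samp_ann, means):
    """Subtract the stored training means of each sample's own group."""
    groups = df_samp_ann.loc[df_data.index, "group"]
    # One row of training means per sample, chosen by that sample's group
    mu = np.vstack([means[g][df_data.columns].to_numpy() for g in groups])
    return df_data - mu

# Hypothetical toy data: two features, two groups
df_train = pd.DataFrame(
    {"f1": [1.0, 3.0, 10.0, 14.0], "f2": [0.0, 2.0, 5.0, 7.0]},
    index=["s1", "s2", "s3", "s4"],
)
ann_train = pd.DataFrame(
    {"group": ["normal", "normal", "tumor", "tumor"]}, index=df_train.index
)
means = fit_group_means(df_train, ann_train)
df_train_centered = center_by_group(df_train, ann_train, means)

# Test samples are centered with the training means, not their own
df_test = pd.DataFrame({"f1": [4.0], "f2": [1.0]}, index=["t1"])
ann_test = pd.DataFrame({"group": ["normal"]}, index=["t1"])
df_test_centered = center_by_group(df_test, ann_test, means)
```

After centering, each group of training samples has zero mean for every feature, while the test sample is shifted by the "normal" training means only.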

class corradjust.utils.Winsorizer(alpha=0.05)

Bases: object

Winsorize each feature. This class is train/test friendly.

Parameters:
alpha: float, optional, default=0.05

For each feature, the lowest alpha fraction of values and the highest alpha fraction of values are replaced with the corresponding quantile values (the alpha and 1 - alpha quantiles, respectively).

Attributes:
alpha: float

Populated by the constructor.

all_features: list

List of all features of the training set.

lower_thresholds, upper_thresholds: numpy.array

Arrays with the feature-wise alpha and 1 - alpha quantiles of the training set. Populated by calling the fit method.

fit(df_data)

Compute and save winsorization thresholds.

Parameters:
df_data: pandas.DataFrame

Input data. Index = samples, columns = feature ids.

transform(df_data)

Winsorize the data using threshold values computed during fit call.

Parameters:
df_data: pandas.DataFrame

Input data. Index = samples, columns = feature ids.

Returns:
pandas.DataFrame

Winsorized version of df_data.
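
The fit/transform split for winsorization can be illustrated with numpy alone. This is a minimal sketch, assuming numpy's default quantile interpolation (the interpolation used by the class is not stated here); the variable names and random data are hypothetical.

```python
import numpy as np

alpha = 0.05
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))  # rows = samples, columns = features
X_test = rng.normal(size=(50, 3))

# "fit": feature-wise alpha and 1 - alpha quantiles of the training set
lower = np.quantile(X_train, alpha, axis=0)
upper = np.quantile(X_train, 1 - alpha, axis=0)

# "transform": clip any data set to the stored training thresholds
X_train_w = np.clip(X_train, lower, upper)
X_test_w = np.clip(X_test, lower, upper)
```

Because the thresholds come from the training set only, test values are clipped to the same range without any information from the test set entering the thresholds.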

class corradjust.utils.MedianOfRatios

Bases: object

Median of ratios normalization for RNA-seq data. This class is train/test friendly.

Attributes:
all_features: list

List of all features of the training set.

geo_means: numpy.array

Array with gene-wise training set geometric means. Populated by calling the fit method.

fit(df_data)

Compute and save gene-wise geometric means.

Parameters:
df_data: pandas.DataFrame

Gene expression table. Index = samples, columns = gene ids.

transform(df_data)

Normalize the data using geometric mean values computed during fit call.

Parameters:
df_data: pandas.DataFrame

Gene expression table. Index = samples, columns = gene ids.

Returns:
pandas.DataFrame

Normalized version of df_data.
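
The general idea of median of ratios normalization can be sketched as follows. This is a simplified illustration of the standard technique, not the class's actual code: it assumes all counts are strictly positive (genes with zero counts need special handling that is omitted here), and the function name normalize and the toy counts are hypothetical.

```python
import numpy as np

# Rows = samples, columns = genes; each sample is a scaled copy of the first
counts_train = np.array([[10.0, 20.0, 30.0],
                         [20.0, 40.0, 60.0],
                         [40.0, 80.0, 120.0]])
counts_test = np.array([[5.0, 10.0, 15.0]])

# "fit": gene-wise geometric means over the training samples
geo_means = np.exp(np.log(counts_train).mean(axis=0))

def normalize(counts, geo_means):
    # Size factor of a sample = median over genes of count / geometric mean
    size_factors = np.median(counts / geo_means, axis=1)
    return counts / size_factors[:, None]

norm_train = normalize(counts_train, geo_means)
norm_test = normalize(counts_test, geo_means)
```

In this toy case the samples differ only by sequencing depth, so every normalized row collapses to the gene-wise geometric means.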

corradjust.utils.make_dummy_samp_ann(df_data)

Make a sample annotation data frame in which all samples belong to the same group named "all_samp".

Parameters:
df_data: pandas.DataFrame

Data frame with Index = samples.

Returns:
pandas.DataFrame

Data frame with the same Index and a single column "group" filled with a constant string "all_samp".

corradjust.utils.compute_pearson_columns(X)

Compute Pearson’s correlation between all pairs of columns of the input 2D array.

Parameters:
X: numpy.ndarray

2D array with variables of interest over columns.

Returns:
numpy.array

1D array of correlations (the diagonal 1s are not included). The data type is float32 for speed.

Notes

The numpy.triu_indices function with parameter k=1 can be used to map each index of the resulting flat array to the corresponding entry of the standard 2D correlation matrix.
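
The correspondence between the flat array and the 2D matrix can be checked directly. In this sketch numpy.corrcoef stands in for the library function (an assumption made so the example is self-contained), and the random data is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))  # 30 samples, 4 variables in columns

corrs_2d = np.corrcoef(X, rowvar=False)        # (4, 4) correlation matrix
rows, cols = np.triu_indices(X.shape[1], k=1)  # upper triangle, no diagonal
corrs_flat = corrs_2d[rows, cols].astype(np.float32)

# Element i of the flat array is the correlation of columns rows[i] and cols[i]
pairs = list(zip(rows.tolist(), cols.tolist()))
```

For four variables the pairs are ordered row by row: (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3).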

corradjust.utils.flatten_corrs(corrs_2d)

Convert a 2D correlation matrix into a 1D array. This is a parallel version whose result is equivalent to corrs_2d[numpy.triu_indices(n, k=1)], where n is the number of features.

Parameters:
corrs_2d: numpy.ndarray

Square (n, n) symmetric matrix with correlations.

Returns:
numpy.array

1D array with correlations (without diagonal 1s). Size of array is n * (n - 1) // 2.

corradjust.utils.compute_corr_pvalues(corrs, n)

Compute p-values for the null hypothesis that each input Pearson’s correlation is equal to zero.

Parameters:
corrs: numpy.array

1D array with correlations.

n: int

Sample size used when computing correlations.

Returns:
numpy.array

1D array of p-values corresponding to corrs.

Notes

This function produces the same p-values as the scipy.stats.pearsonr function with default parameters.
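
For reference, the default p-value of scipy.stats.pearsonr is two-sided and is equivalent to the classical t-test for a correlation coefficient. The formula below is standard statistics rather than a description of this function's internals; r denotes one entry of corrs, n the sample size, and T_{n-2} a Student's t variable with n - 2 degrees of freedom (symbols introduced here for illustration).

```latex
t = r \sqrt{\frac{n - 2}{1 - r^{2}}}, \qquad
p = 2 \, P\left( T_{n-2} \ge |t| \right)
```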

corradjust.utils.regress_out_columns(X1, X2)

Regress out all columns of X2 from each column of X1.

Parameters:
X1: numpy.ndarray

2D array with variables of interest over columns.

X2: numpy.ndarray

2D array with variables to regress out over columns. Should have the same number of rows as X1.

Returns:
residuals: numpy.ndarray

Residuals after regressing out columns of X2 from each column of X1. Has the same shape as X1.

rsquareds: numpy.array

1D array of R^2 values corresponding to regressions fit for each column of X1.
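
Regressing out columns amounts to one least-squares fit per column of X1. The sketch below uses numpy.linalg.lstsq and adds an intercept column; whether the library function includes an intercept is not stated here, so treat that as an assumption. The name regress_out_sketch and the simulated data are hypothetical.

```python
import numpy as np

def regress_out_sketch(X1, X2):
    # Design matrix: an intercept column (an assumption) plus the columns of X2
    A = np.column_stack([np.ones(X2.shape[0]), X2])
    # One least-squares fit per column of X1, solved jointly
    coef, *_ = np.linalg.lstsq(A, X1, rcond=None)
    residuals = X1 - A @ coef
    # R^2 = 1 - residual sum of squares / total sum of squares, per column
    ss_res = (residuals ** 2).sum(axis=0)
    ss_tot = ((X1 - X1.mean(axis=0)) ** 2).sum(axis=0)
    return residuals, 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
X2 = rng.normal(size=(100, 2))
noise = rng.normal(size=(100, 3))
# Columns 0 and 2 of X1 depend on X2; column 1 is pure noise
X1 = X2 @ np.array([[1.0, 0.0, 2.0], [0.5, 0.0, -1.0]]) + noise
residuals, rsquareds = regress_out_sketch(X1, X2)
```

The residuals are orthogonal to every column of X2, which is what "regressed out" means in practice.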

corradjust.utils.find_peak(scores, patience=2, min_rel_improvement=0.05, max_rel_diff_from_max=0.5)

Find the peak of the scores using the following early stopping criteria.

  • There have been patience rounds without a relative improvement of at least min_rel_improvement over the previous maximum value (plateau reached).

  • The relative difference between the global maximum and the current maximum is at most max_rel_diff_from_max (not terminating too early).

Parameters:
scores: numpy.array

1D array with scores used to find the peak.

patience: int, optional, default=2

See description above.

min_rel_improvement: float, optional, default=0.05

See description above.

max_rel_diff_from_max: float, optional, default=0.5

See description above.

Returns:
int

Index of peak element in the scores array.
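
One possible reading of the two criteria is sketched below in plain Python. It is not the library's code: it assumes positive scores, that both criteria must hold before stopping, and that the relative difference is measured against the global maximum. The name find_peak_sketch and the example scores are hypothetical.

```python
def find_peak_sketch(scores, patience=2, min_rel_improvement=0.05,
                     max_rel_diff_from_max=0.5):
    global_max = max(scores)
    best_idx, rounds_without_gain = 0, 0
    for i in range(1, len(scores)):
        # A round counts as a gain only if it beats the running maximum
        # by at least min_rel_improvement (relative to that maximum)
        if scores[i] >= scores[best_idx] * (1 + min_rel_improvement):
            best_idx, rounds_without_gain = i, 0
        else:
            rounds_without_gain += 1
        plateau = rounds_without_gain >= patience
        gap = global_max - scores[best_idx]
        close_to_top = gap <= max_rel_diff_from_max * global_max
        # Assumed rule: stop only when both criteria are satisfied
        if plateau and close_to_top:
            break
    return best_idx

scores = [0.10, 0.20, 0.30, 0.31, 0.305, 0.50]
peak_default = find_peak_sketch(scores)
peak_strict = find_peak_sketch(scores, max_rel_diff_from_max=0.1)
```

With the defaults, the search stops at the plateau around index 2. With a stricter max_rel_diff_from_max, the plateau is judged too far below the global maximum, so the search continues to index 5.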

corradjust.utils.hypergeom_pvalues(M_arr, n_arr, N_arr, k_arr)

Parallel computation of hypergeometric test p-values over arrays of values M, n, N, k. Semantics for argument names are the same as in scipy.stats.hypergeom.

Parameters:
M_arr: numpy.array

Total number of objects in each test.

n_arr: numpy.array

Number of Type I objects in each test.

N_arr: numpy.array

Number of drawn objects in each test.

k_arr: numpy.array

Number of Type I drawn objects in each test.

Returns:
numpy.array

Array of p-values corresponding to each test. P-values are bounded from below by numpy.finfo(numpy.float64).smallest_normal.
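
A single hypergeometric test can be sketched with the standard library, using the same argument semantics as scipy.stats.hypergeom. The sketch assumes an upper-tail (enrichment) test, P(X >= k); the tail used by the library function is not stated here. The names hypergeom_upper_tail and log_binom are hypothetical, and sys.float_info.min is the standard-library counterpart of the float64 smallest normal number.

```python
import math
import sys

def log_binom(n, k):
    # Logarithm of the binomial coefficient via log-gamma
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def hypergeom_upper_tail(M, n, N, k):
    """P(X >= k) when drawing N of M objects, n of which are Type I."""
    log_denom = log_binom(M, N)
    # Support of X: max(0, N - (M - n)) <= X <= min(n, N)
    i_min = max(k, 0, N - (M - n))
    i_max = min(n, N)
    p = sum(
        math.exp(log_binom(n, i) + log_binom(M - n, N - i) - log_denom)
        for i in range(i_min, i_max + 1)
    )
    # Keep the result in (0, 1], floored at the smallest normal float64
    return max(min(p, 1.0), sys.float_info.min)

# 20 objects, 7 of Type I, 12 drawn, at least 6 of the drawn are Type I
p_value = hypergeom_upper_tail(M=20, n=7, N=12, k=6)
```

Working in log space keeps the intermediate binomial coefficients from overflowing when M is large.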

corradjust.utils.log_binom_coef(n, k)

Logarithm of the binomial coefficient C(n, k) ("n choose k").

Parameters:
n, k: int

Input numbers.

Returns:
float

Logarithm of the binomial coefficient.
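
A common way to compute this quantity is through the log-gamma function, which avoids forming the coefficient itself. The sketch below uses math.lgamma and the natural logarithm; both choices, and the name log_binom_coef_sketch, are assumptions for illustration rather than the library's implementation.

```python
import math

def log_binom_coef_sketch(n, k):
    # log C(n, k) = log n! - log k! - log (n - k)!, with log x! = lgamma(x + 1)
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

# Agrees with the exact value where the coefficient is still representable
exact = math.log(math.comb(50, 20))
approx = log_binom_coef_sketch(50, 20)

# Still finite where C(n, k) itself would overflow float64
huge = log_binom_coef_sketch(10**6, 5 * 10**5)
```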

corradjust.utils.compute_aggregated_scores(df)

Compute aggregated statistics from the data frame of feature-wise scores.

Parameters:
df: pandas.DataFrame

Data frame produced by CorrAdjust.compute_feature_scores.

Returns:
agg_BP_at_K: float

Aggregated balanced precision at K.

agg_enrichment: float

Aggregated enrichment.

agg_pvalue: float

Aggregated p-value.