scProject API

Regression

scProject.rg.NNLR_ElasticNet(dataset_filtered, patterns_filtered, projectionName, alpha, L1, layer=False, iterations=10000, positive=True)

This method performs an elastic net regression using scikit-learn. It currently accepts only dense matrices.

Parameters:

dataset_filtered – AnnData object genes x samples
patterns_filtered – AnnData object genes x features
projectionName (String) – index of the projection in dataset_filtered.obsm
alpha (double) – regularization parameter
L1 (double) – regularization parameter
layer (String) – layer of the dataset to regress on
iterations – number of iterations while performing the regression
positive – whether to restrict the coefficients to be non-negative

Returns:

void, the dataset_filtered is mutated and the projection is stored in dataset_filtered.obsm[projectionName]
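
As a rough illustration of the kind of fit this method wraps, the sketch below calls scikit-learn's ElasticNet directly on synthetic NumPy arrays instead of AnnData objects. The array names, the shapes, and the mapping of alpha, L1 and iterations onto scikit-learn's alpha, l1_ratio and max_iter are assumptions made for the example; this reference does not confirm them.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 50 genes, 4 pattern features, 6 samples.
patterns = rng.random((50, 4))        # genes x features
true_weights = rng.random((4, 6))     # features x samples
dataset = patterns @ true_weights     # genes x samples

# Assumed mapping: alpha -> alpha, L1 -> l1_ratio, iterations -> max_iter.
model = ElasticNet(alpha=0.001, l1_ratio=0.5, max_iter=10000, positive=True)

# Fit one sample at a time; each fit yields one weight per pattern feature.
projection = np.vstack([
    model.fit(patterns, dataset[:, j]).coef_.copy()
    for j in range(dataset.shape[1])
])                                    # samples x features

print(projection.shape)
```

With positive=True every entry of the resulting projection is constrained to be zero or greater, which is what makes the weights interpretable as pattern usage.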

scProject.rg.NNLR_LeastSquares(dataset_filtered, patterns_filtered, projectionName)

This method performs a non-negative least squares regression using SciPy.

Parameters:

dataset_filtered – AnnData object genes x samples
patterns_filtered – AnnData object genes x features
projectionName – index of the projection in dataset_filtered.obsm

Returns:

void, the dataset_filtered is mutated and the projection is stored in dataset_filtered.obsm[projectionName]
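
A minimal sketch of non-negative least squares on plain arrays may help show what the projection contains. It uses scipy.optimize.nnls on synthetic, noiseless data; the variable names and the per-sample loop are assumptions for illustration and are not taken from the scProject source.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)

patterns = rng.random((20, 3))                 # genes x features
true_weights = np.array([[0.5, 0.0],
                         [0.0, 2.0],
                         [1.5, 0.3]])          # features x samples
dataset = patterns @ true_weights              # genes x samples, noiseless

# Solve min ||patterns @ w - sample||_2 subject to w >= 0 for each sample.
projection = np.array([
    nnls(patterns, dataset[:, j])[0]
    for j in range(dataset.shape[1])
])                                             # samples x features
```

Because the synthetic data are an exact non-negative combination of the patterns, the recovered weights should match true_weights up to numerical precision.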

scProject.rg.NNLR_positive_Lasso(dataset_filtered, patterns_filtered, projectionName, alpha, layer=False, iterations=10000)

This method performs a positive lasso regression using scikit-learn.

Parameters:

dataset_filtered – AnnData object genes x samples
patterns_filtered – AnnData object genes x features
projectionName – index of the projection in dataset_filtered.obsm
alpha – regularization parameter
layer – Layer with which to perform the regression
iterations – number of iterations while performing the regression

Returns:

void, the dataset_filtered is mutated and the projection is stored in dataset_filtered.obsm[projectionName]

Visualization

scProject.viz.UMAP_Projection(dataset_filtered, cellTypeColumnName, projectionName, UMAPName, n_neighbors, metric='euclidean', plot=True, colorScheme='Paired', pointSize=0.5, subset=None, path=None, display=True, dpi=300)

This method projects the pattern matrix down into two-dimensional space. Make sure the chosen colorScheme has enough colors for every cell type.

Parameters:

dataset_filtered – AnnData object cells x genes
cellTypeColumnName – index for cell type in dataset_filtered.obs
projectionName – index for the projection in dataset_filtered.obsm
UMAPName – index for the created UMAP coordinates in dataset_filtered.obsm
n_neighbors – number of neighbors for the UMAP
metric – the distance metric used in the UMAP, defaults to euclidean
plot – If True a plot is displayed, defaults to True
colorScheme – seaborn color scheme to use, defaults to Paired
pointSize – size of the points, defaults to .5
subset – subset of types in cell type column name to plot
path – path to save figure
display – Whether to display the figure in the jupyter notebook
dpi – Quality of the plot to be saved

Returns:

void, mutates dataset_filtered and adds the UMAP coordinates to obsm

scProject.viz.UMAP_Viz(dataset_filtered, UMAPName, cellTypeColumnName, colorScheme='Paired', pointSize=0.5, subset=None, path=None, display=True, dpi=300)

Plots the UMAP of the pattern matrix. Make sure colorScheme has at least as many colors as cell types in your dataset.

Parameters:

cellTypeColumnName – index for cell type in dataset_filtered.obs; can be any column in .obs
dataset_filtered – AnnData object cells x genes
UMAPName – index for the UMAP in dataset_filtered.obsm
colorScheme – seaborn color scheme, defaults to Paired
pointSize – size of the points, defaults to .5
subset – subset of types in cell type column name to plot
path – path to save figure
display – Whether to display the figure in the jupyter notebook
dpi – Quality of the plot to be saved

Returns:

void

scProject.viz.featurePlots(dataset_filtered, num_patterns, projectionName, UMAPName, vmin=1e-11, clip=99.5, zeroColor='dimgrey', obsColumn=None, cmap='viridis', pointSize=0.1, subset=None, path=None, display=True, dpi=300)

Creates plots which show the weight of each feature in each cell.

Parameters:

clip – Stops colorbar at the percentile specified [0,100]
vmin – Min of the colorplot i.e. what to define as zero
zeroColor – What color the cells below vmin should be colored
dataset_filtered – AnnData object cells x genes
num_patterns – the number of the patterns to display starting from feature 1. It can also take a list of ints.
projectionName – index of the projection in dataset_filtered.obsm
UMAPName – index of the UMAP in dataset_filtered.obsm
obsColumn – Column in dataset_filtered to use for subsetting
cmap – colormap to use when creating the feature plots
pointSize – Size of the points on the plots
subset – subset of types in cell type column name to plot
path – path to save the figure, without a file type suffix such as .pdf or .png
display – Whether to display the figure in the jupyter notebook
dpi – Quality of the plot to be saved

Returns:

void; saved figures are written as .png files
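
The interaction between vmin, clip and zeroColor can be sketched on a plain array of weights. This is only an interpretation of the parameter descriptions above: whether the percentile is taken over all cells or only over cells above vmin is an assumption, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
weights = rng.exponential(scale=1.0, size=1000)   # hypothetical pattern weights
weights[:100] = 0.0                               # cells that do not use the pattern

vmin, clip = 1e-11, 99.5

# Upper end of the color scale: the clip-th percentile of the weights.
vmax = np.percentile(weights, clip)

# Cells below vmin are treated as zero and would be drawn in zeroColor.
is_zero = weights < vmin

# Remaining cells are colored on [vmin, vmax]; outliers saturate at vmax.
colored = np.clip(weights[~is_zero], vmin, vmax)
```

Clipping at a high percentile keeps a handful of extreme cells from compressing the rest of the color scale.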

scProject.viz.patternWeightDistribution(dataset_filtered, projectionName, patterns, obsColumn, subset, numBins=100)

Parameters:

dataset_filtered – AnnData object cells x genes
projectionName – index of the projection in dataset_filtered.obsm
patterns – Which patterns to visualize (one indexed)
obsColumn – Column in dataset_filtered to use for subsetting
subset – What subset of cells in the obsColumn to visualize
numBins – How many bins in the histogram

Returns:

void; displays a histogram of the pattern weights above 0

scProject.viz.pearsonMatrix(dataset_filtered, patterns_filtered, cellTypeColumnName, num_cell_types, projectionName, plotName, plot, row_cluster=True, col_cluster=True, path=None, display=True, dpi=300, xtickSize=8, ytickSize=8)

This method finds the Pearson correlation coefficient between every pattern and every cell type.

Parameters:

dataset_filtered – AnnData object cells x genes
patterns_filtered – AnnData object features x genes
cellTypeColumnName (String) – index where the cell types are stored in dataset_filtered.obs
num_cell_types (int) – the number of cell types in the dataset (this parameter could be removed)
projectionName (String) – the name of the projection created using one of the regression methods
plotName – the index for the Pearson matrix in dataset_filtered.uns[plotName]
plot (boolean) – If True a plot is generated either saved or displayed
row_cluster – Bool whether to cluster rows or not
col_cluster – Bool whether to cluster columns or not
dpi – Quality of image to be saved
display – Bool whether to display the plot or not
xtickSize – Size of labels on the x-axis
ytickSize – Size of labels on the y-axis

Returns:

void
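
To make the pattern-by-cell-type correlation concrete, the sketch below correlates each pattern's weights with a 0/1 indicator of cell type membership. Encoding cell types as indicator vectors is an assumption of this example, as are the toy arrays; the reference above does not state how scProject encodes them.

```python
import numpy as np

# Hypothetical projection: 6 cells x 2 patterns.
projection = np.array([[1.0, 0.1],
                       [0.9, 0.0],
                       [1.1, 0.2],
                       [0.0, 1.0],
                       [0.1, 0.8],
                       [0.0, 1.2]])
cell_types = np.array(["A", "A", "A", "B", "B", "B"])

type_names = np.unique(cell_types)
pearson = np.zeros((len(type_names), projection.shape[1]))

for i, name in enumerate(type_names):
    indicator = (cell_types == name).astype(float)   # 1 for cells of this type
    for k in range(projection.shape[1]):
        # Off-diagonal entry of the 2 x 2 correlation matrix.
        pearson[i, k] = np.corrcoef(indicator, projection[:, k])[0, 1]
```

Here the first pattern correlates strongly with type A and the second with type B, which is the kind of structure the clustered heatmap is meant to reveal.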

scProject.viz.pearsonViz(dataset_filtered, plotName, cellTypeColumnName, row_cluster=True, col_cluster=True, path=None, display=True, dpi=300, xtickSize=8, ytickSize=8)

Visualize or save a Pearson Matrix.

Parameters:

path – path to save the figure
dataset_filtered – AnnData object cells x genes
plotName – index of the Pearson matrix to visualize
cellTypeColumnName – index for cell type in dataset_filtered.obs
row_cluster – Bool whether to cluster rows or not
col_cluster – Bool whether to cluster columns or not
dpi – Quality of image to be saved
display – Bool whether to display the plot or not
xtickSize – Size of labels on the x-axis
ytickSize – Size of labels on the y-axis

Returns:

void

scProject.viz.rankedByWeightedCIViz(projectionDriverOutput, pointLabel, weightTitle, pathForWeight, bonTitle, pathForBon, numGenesToPlot=50)

Parameters:

projectionDriverOutput – Output from the projectionDriver function
pointLabel – label for the CI point
weightTitle – Title for the Weighted CI plot
pathForWeight – Path for the Weighted CI plot
bonTitle – Title for the Bon CI plot
pathForBon – Path for the Bon CI plot
numGenesToPlot – The number of genes to plot on both plots

Returns:

void

Statistics

scProject.stats.BonferroniCorrectedDifferenceMeans(cluster1, cluster2, alpha, varName, verbose=0, display=True)

Constructs Bonferroni-corrected confidence intervals for the difference of the means of cluster1 and cluster2.

Parameters:

display – Whether to display CI visualization
verbose – Whether to print the genes and CIs. 0 prints them; any other value suppresses printing.
cluster1 – AnnData containing cluster1 cells
cluster2 – AnnData containing cluster2 cells
alpha – Confidence value
varName – What column in var to use for gene names

Returns:

A dataframe of genes as index with their respective confidence intervals in columns low and high.
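
The construction can be sketched with NumPy and SciPy on plain cells x genes arrays. This version treats alpha as the familywise significance level, uses a pooled per-gene variance, and splits alpha evenly across genes; those choices and the function name are assumptions of the sketch, not a statement of how scProject computes its intervals.

```python
import numpy as np
from scipy import stats

def bonferroni_difference_cis(x1, x2, alpha):
    """Simultaneous CIs for mean(x1) - mean(x2), one per column (gene)."""
    n1, n2 = x1.shape[0], x2.shape[0]
    p = x1.shape[1]
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    # Pooled per-gene variance (an assumption of this sketch).
    pooled = ((n1 - 1) * x1.var(axis=0, ddof=1)
              + (n2 - 1) * x2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    # Bonferroni: split alpha across p genes, two-sided.
    t_crit = stats.t.ppf(1.0 - alpha / (2.0 * p), df=n1 + n2 - 2)
    return diff - t_crit * se, diff + t_crit * se

rng = np.random.default_rng(3)
cluster1 = rng.normal(loc=[5.0, 1.0, 1.0], scale=1.0, size=(40, 3))
cluster2 = rng.normal(loc=[1.0, 1.0, 1.0], scale=1.0, size=(40, 3))
low, high = bonferroni_difference_cis(cluster1, cluster2, alpha=0.05)
```

A gene whose interval excludes zero, like the first gene here, differs between the two clusters at the corrected level.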

scProject.stats.HotellingT2(cluster1, cluster2)

Calculates the Hotelling T2 statistic to evaluate the significance of the difference of means, using the pooled covariance and its pseudo-inverse.

Parameters:

cluster1 – AnnData with cluster 1
cluster2 – AnnData with cluster 2

Returns:

Tuple of F value, TSquared, p value
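
For readers who want to see the statistic spelled out, here is a sketch of the textbook two-sample Hotelling T2 test with a pooled covariance and a pseudo-inverse, written for plain cells x features arrays. The function name and the inputs are hypothetical, and the details may differ from the scProject implementation.

```python
import numpy as np
from scipy import stats

def hotelling_t2(x1, x2):
    n1, p = x1.shape
    n2 = x2.shape[0]
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    # Pooled covariance of the two groups.
    pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
              + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    # The pseudo-inverse tolerates a singular pooled covariance.
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.pinv(pooled) @ diff
    # Convert T2 to an F statistic with (p, n1 + n2 - p - 1) degrees of freedom.
    f_value = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    p_value = stats.f.sf(f_value, p, n1 + n2 - p - 1)
    return f_value, t2, p_value

rng = np.random.default_rng(4)
group_a = rng.normal(0.0, 1.0, size=(30, 3))
group_b = rng.normal(3.0, 1.0, size=(30, 3))
f_value, t2, p_value = hotelling_t2(group_a, group_b)
```

The return order mirrors the tuple described above: F value, T2, then p value.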

scProject.stats.featureExpressionSig(cluster1, projectionName, featureNumber, alpha, mu=0)

Measure the significance of the mean expression of a feature for a group of cells.

Parameters:

cluster1 – AnnData with group of cells in question
projectionName – Regression to use
featureNumber – feature to use (one-indexed)
alpha – Level of significance

Returns:

A tuple of the T statistic and the t value at alpha with degrees of freedom equal to the number of cells minus 1
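
A one-sample t-test of this shape can be sketched as follows. Whether scProject uses a one-sided or two-sided critical value is not stated above, so the one-sided value used here is an assumption, as are the function name and the toy weights.

```python
import numpy as np
from scipy import stats

def feature_expression_sig(weights, alpha, mu=0.0):
    """T statistic for mean(weights) vs mu, and the critical t at alpha."""
    n = weights.shape[0]
    t_stat = (weights.mean() - mu) / (weights.std(ddof=1) / np.sqrt(n))
    t_crit = stats.t.ppf(1.0 - alpha, df=n - 1)   # one-sided (assumed)
    return t_stat, t_crit

weights = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # hypothetical feature weights
t_stat, t_crit = feature_expression_sig(weights, alpha=0.05)
```

A T statistic larger than the critical value suggests the mean weight of the feature exceeds mu in this group of cells.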

scProject.stats.featureImportance(dataset_filtered, num_patterns, projectionName)

Shows a bar graph of feature importance/usage as measured by average coefficient.

Parameters:

dataset_filtered – AnnData object cells x genes
num_patterns – the number of the patterns to display starting from feature 1. It can also take a list of ints.
projectionName – index of the projection in dataset_filtered.obsm

Returns:

void

scProject.stats.geneDriver(dataset_filtered, patterns_filtered, geneName, cellTypeColumnName, cellType, projectionName)

Parameters:

dataset_filtered – AnnData object cells x genes
patterns_filtered – AnnData object features x genes
geneName – Name of gene in question must be in .var
cellTypeColumnName – index for cell type in dataset_filtered.obsm
cellType – str celltype in question
projectionName – str projection from which to use the pattern weights

Returns:

void

scProject.stats.geneSelectivity(patterns_filtered, geneName, num_pattern, plot=True)

Computes the percentage of a gene's expression in a feature out of its total expression over all features.

Parameters:

patterns_filtered – AnnData object features x genes
geneName – geneName must be in the format that is in .var
num_pattern – The number of the feature/pattern of interest
plot – boolean, if true plots expression of the gene across all of the patterns.

Returns:

void
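
The percentage itself is simple arithmetic, shown here on a plain list of one gene's weights across patterns. Treating num_pattern as one-indexed follows the convention used elsewhere in this reference but is an assumption for this function; the helper name is hypothetical.

```python
def gene_selectivity(gene_weights, num_pattern):
    """Percent of a gene's total weight that falls in one pattern (one-indexed)."""
    total = sum(gene_weights)
    if total == 0:
        return 0.0
    return 100.0 * gene_weights[num_pattern - 1] / total

# One gene's weight in each of four patterns.
weights_across_patterns = [1.0, 3.0, 0.0, 1.0]
selectivity = gene_selectivity(weights_across_patterns, 2)
print(selectivity)  # 60.0
```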

scProject.stats.importantGenes(patterns_filtered, featureNumber, threshold)

Returns the list of genes that are expressed greater than threshold in the feature.

Parameters:

patterns_filtered – AnnData object features x genes
featureNumber – Which pattern you want to examine
threshold – Show genes that are greater than this threshold

Returns:

A list of genes that are expressed above threshold in the pattern

scProject.stats.projectionDriver(patterns_filtered, cluster1, cluster2, alpha, varName, featureNumber, display=True, num_annotated=25, path=None, titleFontsize=25, axisFontsize=25, pointSize=3, annotationSize=25)

Assumes patterns_filtered and dataset_filtered have the same index.

Parameters:

patterns_filtered – AnnData object features x genes
cluster1 – AnnData containing cluster1 cells
cluster2 – AnnData containing cluster2 cells
alpha – Confidence value for the Bonferroni confidence intervals
varName – What column in var to use for gene names
featureNumber – Which pattern you want to examine
display – Whether to visualize or not
num_annotated – number of genes to annotate
path – Path to save the visualization
titleFontsize – fontsize of the title
axisFontsize – fontsize of the axis labels
pointSize – size of the points
annotationSize – Size of the annotations for the top ranked genes

Returns:

Three dataframes. The first contains the significant gene drivers. The second contains the Bonferroni CIs from the weighted mean vector. The third contains the standard Bonferroni CIs, equivalent to calling BonferroniCorrectedDifferenceMeans.

Utilities

scProject.matcher.filterAnnDatas(dataset, patterns, geneColumnName, normalizePatterns=True, normalizeData=False)

This method filters the patterns and the dataset to include only the genes they have in common.

Parameters:

normalizeData – Whether to normalize the dataset with the L1 norm after filtering
normalizePatterns – Whether to normalize the patterns with the L1 norm after filtering
dataset (AnnData object) – AnnData object cells x genes
patterns – AnnData object features x genes
geneColumnName – index for where the gene names are kept in .var

Returns:

A tuple of two filtered AnnData objects
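
The filtering step can be pictured on plain arrays with explicit gene lists. In this sketch the gene names, the matrices, and the choice to L1-normalize each row are all assumptions for illustration; the reference above does not say along which axis scProject normalizes.

```python
import numpy as np

dataset_genes = ["Gapdh", "Actb", "Sox2", "Pax6"]
pattern_genes = ["Sox2", "Gapdh", "Nes"]
dataset = np.arange(8, dtype=float).reshape(2, 4)     # cells x genes
patterns = np.array([[1.0, 3.0, 5.0],
                     [2.0, 2.0, 0.0]])                # features x genes

# Genes present in both, kept in the dataset's order.
pattern_set = set(pattern_genes)
overlap = [g for g in dataset_genes if g in pattern_set]

# Subset and reorder columns so both matrices share one gene order.
dataset_filtered = dataset[:, [dataset_genes.index(g) for g in overlap]]
patterns_filtered = patterns[:, [pattern_genes.index(g) for g in overlap]]

# Optional L1 normalization: each row sums to 1 (axis choice is assumed).
row_sums = np.abs(patterns_filtered).sum(axis=1, keepdims=True)
patterns_filtered = patterns_filtered / row_sums
```

After this step both matrices have the same genes in the same column order, which is what the regression methods rely on.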

scProject.matcher.filterPatterns(patterns, overlap)

Convenience function for using an inputted set of genes

Parameters:

patterns – AnnData object features x genes
overlap – list-like of genes

Returns:

Filtered patterns (AnnData)

scProject.matcher.filterSource(dataset, overlap)

Convenience function for using an inputted set of genes

Parameters:

dataset – AnnData object cells x genes
overlap – list-like of genes

Returns:

Filtered dataset (AnnData)

scProject.matcher.getOverlap(dataset, patterns)

Convenience function for overlap of genes

Parameters:

dataset – AnnData object cells x genes
patterns – AnnData object features x genes

Returns:

Overlap of genes

scProject.matcher.logTransform(dataset_filtered)

Adds a layer called log to the dataset which is the log transform.

Parameters:

dataset_filtered – AnnData object cells x genes

Returns:

Log transform of dataset and patterns.

scProject.matcher.mapCellNamesToInts(adata, cellTypeColumnName)

Maps each cell type to an integer. This is used as a helper for coloring plots.

Parameters:

adata – AnnData object
cellTypeColumnName – index of where cell type is stored in adata.obs

Returns:

void

scProject.matcher.orthologMapper(dataset, biomartFilePath, originalGeneColumn, transformGeneColumn, varName)

Convenience function for mapping genes to their orthologs. Then, use filterAnnDatas.

Parameters:

dataset – dataset to find orthologs
biomartFilePath – file path of csv from biomart to perform the mapping
originalGeneColumn – column name of original gene in biomart file
transformGeneColumn – column name of gene in biomart file
varName – What set of data in .var to transform

Returns:

void, mutates dataset
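
The mapping step can be sketched with the standard library alone. The CSV contents and column names below are hypothetical stand-ins for a BioMart export, and keeping the original name for genes without an ortholog is an assumption of this sketch, not documented scProject behavior.

```python
import csv
import io

# Hypothetical BioMart export with one row per ortholog pair.
biomart_csv = io.StringIO(
    "Mouse gene name,Human gene name\n"
    "Sox2,SOX2\n"
    "Pax6,PAX6\n"
)
original_col, transform_col = "Mouse gene name", "Human gene name"

# Build a lookup from the original gene name to its ortholog.
mapping = {
    row[original_col]: row[transform_col]
    for row in csv.DictReader(biomart_csv)
}

gene_names = ["Sox2", "Pax6", "Gm12345"]
# Genes without an ortholog keep their original name in this sketch.
mapped = [mapping.get(g, g) for g in gene_names]
```

Once the names are mapped, the overlap with the pattern genes can be computed as usual.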

scProject.matcher.sourceIsValid(adata)

Checks whether adata is an AnnData object.

Parameters:

adata – AnnData object

Returns:

SourceTypeError if adata is not an instance of AnnData