scProject API

Regression

scProject.rg.NNLR_ElasticNet(dataset_filtered, patterns_filtered, projectionName, alpha, L1, layer=False, iterations=10000, positive=True)

This method performs an elastic net regression using scikit-learn. It currently accepts only dense matrices.

Parameters:
  • dataset_filtered – AnnData object cells x genes

  • patterns_filtered – AnnData object features x genes

  • projectionName (String) – index of the projection in dataset_filtered.obsm

  • alpha (double) – regularization parameter

  • L1 (double) – regularization parameter

  • layer (String) – layer of the dataset to regress on

  • iterations – number of iterations while performing the regression

  • positive – Whether to restrict the coefficients to be non-negative

Returns:

void, the dataset_filtered is mutated and the projection is stored in dataset_filtered.obsm[projectionName]
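
As a rough illustration of what this regression computes, the sketch below fits scikit-learn's ElasticNet once per cell on small synthetic dense arrays. It assumes that alpha and L1 correspond to ElasticNet's alpha and l1_ratio arguments and that iterations corresponds to max_iter. That mapping, the toy arrays, the helper name elastic_net_projection, and the choice of fit_intercept=False are assumptions made for illustration, not a description of scProject's internals.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# Hypothetical toy inputs standing in for the dense .X matrices.
patterns = rng.random((3, 30))        # features x genes
true_weights = rng.random((4, 3))     # cells x features
data = true_weights @ patterns        # cells x genes


def elastic_net_projection(data, patterns, alpha, l1_ratio,
                           iterations=10000, positive=True):
    """Regress each cell onto the patterns; returns a cells x features array."""
    weights = np.zeros((data.shape[0], patterns.shape[0]))
    for i, cell in enumerate(data):
        # Genes act as the regression samples, patterns as the predictors.
        # fit_intercept=False is a choice made for this sketch only.
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=iterations,
                           positive=positive, fit_intercept=False)
        model.fit(patterns.T, cell)
        weights[i] = model.coef_
    return weights


projection = elastic_net_projection(data, patterns, alpha=0.001, l1_ratio=0.5)
```

The resulting array has one row per cell and one column per pattern, which is the shape a projection stored in .obsm would need.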

scProject.rg.NNLR_LeastSquares(dataset_filtered, patterns_filtered, projectionName)

This method performs a non-negative least squares regression using SciPy.

Parameters:
  • dataset_filtered – AnnData object cells x genes

  • patterns_filtered – AnnData object features x genes

  • projectionName – index of the projection in dataset_filtered.obsm

Returns:

void, the dataset_filtered is mutated and the projection is stored in dataset_filtered.obsm[projectionName]
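
To make the idea concrete, here is a minimal sketch of a non-negative least squares projection built on scipy.optimize.nnls, solving one problem per cell. The toy arrays and variable names are hypothetical, and the per-cell loop is an assumption about how such a projection could be assembled rather than the library's own code.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)

# Hypothetical toy inputs: noiseless data generated from known weights.
patterns = rng.random((3, 20))                      # features x genes
true_weights = np.array([[0.5, 0.0, 2.0],
                         [1.0, 0.3, 0.0]])          # cells x features
data = true_weights @ patterns                      # cells x genes

projection = np.zeros((data.shape[0], patterns.shape[0]))
for i, cell in enumerate(data):
    # Solve min ||patterns.T @ w - cell||_2 subject to w >= 0.
    projection[i], _residual = nnls(patterns.T, cell)
```

Because the toy data is noiseless and the true weights are already non-negative, the recovered projection matches true_weights up to numerical tolerance.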

scProject.rg.NNLR_positive_Lasso(dataset_filtered, patterns_filtered, projectionName, alpha, layer=False, iterations=10000)

This method performs a positive lasso regression using scikit-learn.

Parameters:
  • dataset_filtered – AnnData object cells x genes

  • patterns_filtered – AnnData object features x genes

  • projectionName – index of the projection in dataset_filtered.obsm

  • alpha – regularization parameter

  • layer – Layer with which to perform the regression

  • iterations – number of iterations while performing the regression

Returns:

void, the dataset_filtered is mutated and the projection is stored in dataset_filtered.obsm[projectionName]

Visualization

scProject.viz.UMAP_Projection(dataset_filtered, cellTypeColumnName, projectionName, UMAPName, n_neighbors, metric='euclidean', plot=True, colorScheme='Paired', pointSize=0.5, subset=None, path=None, display=True, dpi=300)

This method projects the pattern matrix down into two-dimensional space. Make sure the chosen colorScheme has enough colors for every cell type.

Parameters:
  • dataset_filtered – Anndata object cells x genes

  • cellTypeColumnName – index for cell type in dataset_filtered.obs

  • projectionName – index for the projection in dataset_filtered.obsm

  • UMAPName – index for the created UMAP coordinates in dataset_filtered.obsm

  • n_neighbors – number of neighbors for the UMAP

  • metric – the distance metric used in the UMAP, defaults to euclidean

  • plot – If True a plot is displayed, defaults to True

  • colorScheme – seaborn color scheme to use, defaults to Paired

  • pointSize – size of the points, defaults to .5

  • subset – subset of types in cell type column name to plot

  • path – path to save figure

  • display – Whether to display the figure in the jupyter notebook

  • dpi – Quality of the plot to be saved

Returns:

void, mutates dataset_filtered and adds the UMAP to obsm

scProject.viz.UMAP_Viz(dataset_filtered, UMAPName, cellTypeColumnName, colorScheme='Paired', pointSize=0.5, subset=None, path=None, display=True, dpi=300)

Plots the UMAP of the pattern matrix. Make sure colorScheme has at least as many colors as cell types in your dataset.

Parameters:
  • cellTypeColumnName – index for cell type in dataset_filtered.obs; can be any column in .obs

  • dataset_filtered – Anndata object cells x genes

  • UMAPName – index for the UMAP in dataset_filtered.obsm

  • colorScheme – seaborn color scheme, defaults to Paired

  • pointSize – size of the points, defaults to .5

  • subset – subset of types in cell type column name to plot

  • path – path to save figure

  • display – Whether to display the figure in the jupyter notebook

  • dpi – Quality of the plot to be saved

Returns:

void

scProject.viz.featurePlots(dataset_filtered, num_patterns, projectionName, UMAPName, vmin=1e-11, clip=99.5, zeroColor='dimgrey', obsColumn=None, cmap='viridis', pointSize=0.1, subset=None, path=None, display=True, dpi=300)

Creates plots which show the weight of each feature in each cell.

Parameters:
  • clip – Stops colorbar at the percentile specified [0,100]

  • vmin – Minimum of the color scale, i.e. the value below which a weight is treated as zero

  • zeroColor – Color used for cells whose weight is below vmin

  • dataset_filtered – Anndata object cells x genes

  • num_patterns – the number of the patterns to display starting from feature 1. It can also take a list of ints.

  • projectionName – index of the projection in dataset_filtered.obsm

  • UMAPName – index of the UMAP in dataset_filtered.obsm

  • obsColumn – Column in dataset_filtered to use for subsetting

  • cmap – colormap to use when creating the feature plots

  • pointSize – Size of the points on the plots

  • subset – subset of types in cell type column name to plot

  • path – path to save the figure, without a file type suffix such as .pdf or .png

  • display – Whether to display the figure in the jupyter notebook

  • dpi – Quality of the plot to be saved

Returns:

void, files will be .png
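
The interaction between vmin, clip and zeroColor can be sketched with plain NumPy. The helper color_limits below is hypothetical, and it assumes the percentile is taken over all weights of a feature. The source does not say whether zero weights are excluded first, so treat this as one plausible reading.

```python
import numpy as np


def color_limits(weights, vmin=1e-11, clip=99.5):
    """Return the 'zero' mask, the colorbar maximum, and the clipped weights."""
    weights = np.asarray(weights, dtype=float)
    is_zero = weights < vmin                 # cells that would get zeroColor
    vmax = np.percentile(weights, clip)      # colorbar stops at this percentile
    clipped = np.clip(weights, vmin, vmax)   # values mapped through the colormap
    return is_zero, vmax, clipped


weights = np.array([0.0, 0.0, 0.1, 0.2, 0.3, 5.0])
is_zero, vmax, clipped = color_limits(weights, clip=80)
```

In this toy case the outlier weight of 5.0 is clipped down to the 80th percentile, so it no longer stretches the color scale.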

scProject.viz.patternWeightDistribution(dataset_filtered, projectionName, patterns, obsColumn, subset, numBins=100)

Parameters:
  • dataset_filtered – Anndata object cells x genes

  • projectionName – index of the projection in dataset_filtered.obsm

  • patterns – Which patterns to visualize (one indexed)

  • obsColumn – Column in dataset_filtered to use for subsetting

  • subset – What subset of cells in the obsColumn to visualize

  • numBins – How many bins in the histogram

Returns:

void, displays a histogram of the pattern weights above 0

scProject.viz.pearsonMatrix(dataset_filtered, patterns_filtered, cellTypeColumnName, num_cell_types, projectionName, plotName, plot, row_cluster=True, col_cluster=True, path=None, display=True, dpi=300, xtickSize=8, ytickSize=8)

This method finds the Pearson correlation coefficient between every pattern and every cell type.

Parameters:
  • dataset_filtered – Anndata object cells x genes

  • patterns_filtered – Anndata object features x genes

  • cellTypeColumnName (String) – index where the cell types are stored in dataset_filtered.obs

  • num_cell_types (int) – The number of cell types in the dataset (this parameter could be removed)

  • projectionName (String) – The name of the projection created using one of the regression methods

  • plotName – The index for the pearson matrix in dataset_filtered.uns[plotName]

  • plot (boolean) – If True a plot is generated either saved or displayed

  • row_cluster – Bool whether to cluster rows or not

  • col_cluster – Bool whether to cluster columns or not

  • dpi – Quality of image to be saved

  • display – Bool whether to display the plot or not

  • xtickSize – Size of labels on the x-axis

  • ytickSize – Size of labels on the y-axis

Returns:

void
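
One way to picture a pattern-by-cell-type Pearson matrix is to correlate each pattern's weights with a 0/1 indicator vector for each cell type. The sketch below does that with NumPy. The helper pearson_matrix, the indicator encoding, and the toy data are assumptions for illustration, since the source does not spell out how the correlation is computed.

```python
import numpy as np


def pearson_matrix(projection, cell_types):
    """Correlate each pattern's weights with an indicator for each cell type."""
    projection = np.asarray(projection, dtype=float)      # cells x patterns
    labels = sorted(set(cell_types))
    # One column per cell type: 1.0 where the cell has that type, else 0.0.
    indicators = np.array([[1.0 if c == lab else 0.0 for lab in labels]
                           for c in cell_types])
    n_types = len(labels)
    # np.corrcoef treats rows as variables, so stack the columns and transpose.
    corr = np.corrcoef(np.hstack([indicators, projection]).T)
    return labels, corr[:n_types, n_types:]               # cell types x patterns


cell_types = ["A", "A", "B", "B"]
projection = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.7], [0.0, 0.9]]
labels, matrix = pearson_matrix(projection, cell_types)
```

Here pattern 1 correlates strongly and positively with type A, while pattern 2 correlates strongly and negatively with it.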

scProject.viz.pearsonViz(dataset_filtered, plotName, cellTypeColumnName, row_cluster=True, col_cluster=True, path=None, display=True, dpi=300, xtickSize=8, ytickSize=8)

Visualize or save a Pearson Matrix.

Parameters:
  • path – path to save figure

  • dataset_filtered – Anndata object cells x genes

  • plotName – Index of pearson matrix to visualize

  • cellTypeColumnName – index for cell type in dataset_filtered.obs

  • row_cluster – Bool whether to cluster rows or not

  • col_cluster – Bool whether to cluster columns or not

  • dpi – Quality of image to be saved

  • display – Bool whether to display the plot or not

  • xtickSize – Size of labels on the x-axis

  • ytickSize – Size of labels on the y-axis

Returns:

void

scProject.viz.rankedByWeightedCIViz(projectionDriverOutput, pointLabel, weightTitle, pathForWeight, bonTitle, pathForBon, numGenesToPlot=50)

Parameters:
  • projectionDriverOutput – Output from the projectionDriver function

  • pointLabel – label for the CI point

  • weightTitle – Title for the Weighted CI plot

  • pathForWeight – Path for the Weighted CI plot

  • bonTitle – Title for the Bon CI plot

  • pathForBon – Path for the Bon CI plot

  • numGenesToPlot – The number of genes to plot on both plots

Returns:

void

Statistics

scProject.stats.BonferroniCorrectedDifferenceMeans(cluster1, cluster2, alpha, varName, verbose=0, display=True)

Constructs Bonferroni-corrected confidence intervals for the difference of the means of cluster1 and cluster2.

Parameters:
  • display – Whether to display CI visualization

  • verbose – Whether to print out genes and CIs. 0 prints them; any other value suppresses printing.

  • cluster1 – Anndata containing cluster1 cells

  • cluster2 – Anndata containing cluster2 cells

  • alpha – Confidence value

  • varName – What column in var to use for gene names

Returns:

A dataframe of genes as index with their respective confidence intervals in columns low and high.
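
For readers unfamiliar with the construction, the sketch below computes textbook Bonferroni-corrected simultaneous confidence intervals for the per-gene difference of two cluster means, using a pooled variance and a t quantile. It assumes alpha is the family-wise significance level (for example 0.05). The helper name, the pooled-variance choice, and the synthetic data are assumptions, and the function returns plain arrays rather than the dataframe described above.

```python
import numpy as np
from scipy import stats


def bonferroni_difference_cis(x1, x2, alpha):
    """Per-gene CIs for mean(x1) - mean(x2); inputs are cells x genes arrays."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2, p = x1.shape[0], x2.shape[0], x1.shape[1]
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    pooled_var = ((n1 - 1) * x1.var(axis=0, ddof=1)
                  + (n2 - 1) * x2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    # Bonferroni: split alpha across the p genes (two-sided intervals).
    t_crit = stats.t.ppf(1 - alpha / (2 * p), df=n1 + n2 - 2)
    half_width = t_crit * np.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return diff - half_width, diff + half_width


rng = np.random.default_rng(2)
cluster1 = rng.normal(size=(50, 3)) + np.array([5.0, 0.0, 0.0])  # gene 0 shifted
cluster2 = rng.normal(size=(50, 3))
low, high = bonferroni_difference_cis(cluster1, cluster2, alpha=0.05)
```

A gene whose interval excludes zero, such as the shifted gene 0 in this toy data, differs between the clusters at the chosen family-wise level.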

scProject.stats.HotellingT2(cluster1, cluster2)

Calculates the Hotelling T2 statistic to evaluate the significance of the difference of means, using the pooled covariance and its pseudo-inverse.

Parameters:
  • cluster1 – Anndata with cluster 1

  • cluster2 – Anndata with cluster 2

Returns:

Tuple of F value, TSquared, p value
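
The standard two-sample Hotelling T2 test with a pooled covariance and a pseudo-inverse can be sketched as follows. The helper hotelling_t2 and the synthetic arrays are hypothetical; the formulas are the usual textbook ones and are assumed, not confirmed, to match the library's computation.

```python
import numpy as np
from scipy import stats


def hotelling_t2(x1, x2):
    """Two-sample Hotelling T2; inputs are cells x features arrays."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2, p = x1.shape[0], x2.shape[0], x1.shape[1]
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
              + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    # The pseudo-inverse keeps the statistic defined if pooled is singular.
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.pinv(pooled) @ diff
    f_value = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    p_value = stats.f.sf(f_value, p, n1 + n2 - p - 1)
    return f_value, t2, p_value


rng = np.random.default_rng(3)
group1 = rng.normal(size=(30, 3))
group2 = rng.normal(size=(30, 3)) + 2.0      # every feature shifted by 2
f_value, t2, p_value = hotelling_t2(group1, group2)
f_same, t2_same, p_same = hotelling_t2(group1, group1)
```

Comparing a group with itself gives a zero statistic and a p value of 1, which is a quick sanity check on the formulas.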

scProject.stats.featureExpressionSig(cluster1, projectionName, featureNumber, alpha, mu=0)

Measure the significance of the mean expression of a feature for a group of cells.

Parameters:
  • cluster1 – AnnData with group of cells in question

  • projectionName – Regression to use

  • featureNumber – feature to use (one-indexed)

  • alpha – Level of significance

Returns:

Tuple of T and the t value at alpha with degrees of freedom equal to the number of cells minus 1
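
The returned pair reads like a one-sample t test on a feature's weights. A minimal sketch under that assumption is shown below. The helper name is hypothetical, and the critical value uses a one-sided upper tail, which is a guess because the source does not say whether the test is one-sided or two-sided.

```python
import numpy as np
from scipy import stats


def feature_expression_t(weights, alpha, mu=0.0):
    """Return (T statistic, critical t value) for H0: mean weight == mu."""
    weights = np.asarray(weights, dtype=float)
    n = weights.size
    t_stat = (weights.mean() - mu) / (weights.std(ddof=1) / np.sqrt(n))
    # One-sided critical value with n - 1 degrees of freedom (an assumption).
    t_crit = stats.t.ppf(1 - alpha, df=n - 1)
    return t_stat, t_crit


weights = np.array([0.8, 1.1, 0.9, 1.2, 1.0])   # one feature's weight per cell
t_stat, t_crit = feature_expression_t(weights, alpha=0.05)
```

If the T statistic exceeds the critical value, as it does for these toy weights, the mean weight is significantly above mu at the chosen level.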

scProject.stats.featureImportance(dataset_filtered, num_patterns, projectionName)

Shows a bar graph of feature importance/usage as measured by average coefficient.

Parameters:
  • dataset_filtered – Anndata object cells x genes

  • num_patterns – the number of the patterns to display starting from feature 1. It can also take a list of ints.

  • projectionName – index of the projection in dataset_filtered.obsm

Returns:

void

scProject.stats.geneDriver(dataset_filtered, patterns_filtered, geneName, cellTypeColumnName, cellType, projectionName)

Parameters:
  • dataset_filtered – Anndata object cells x genes

  • patterns_filtered – AnnData object features x genes

  • geneName – Name of gene in question must be in .var

  • cellTypeColumnName – index for cell type in dataset_filtered.obs

  • cellType – str celltype in question

  • projectionName – str projection from which to use the pattern weights

Returns:

void

scProject.stats.geneSelectivity(patterns_filtered, geneName, num_pattern, plot=True)

Computes the percentage of a gene's expression in a feature out of the total gene expression over all features.

Parameters:
  • patterns_filtered – AnnData object features x genes

  • geneName – geneName must be in the format that is in .var

  • num_pattern – The number of the feature/pattern of interest

  • plot – boolean, if true plots expression of the gene across all of the patterns.

Returns:

void
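
The percentage described above is a simple ratio, sketched here on a plain NumPy array. The helper gene_selectivity and the toy pattern matrix are hypothetical, and treating num_pattern as one-indexed is an assumption carried over from how other functions in this reference number features.

```python
import numpy as np


def gene_selectivity(pattern_matrix, gene_names, gene_name, num_pattern):
    """Percent of a gene's total expression that falls in one pattern."""
    gene_idx = list(gene_names).index(gene_name)
    column = np.asarray(pattern_matrix, dtype=float)[:, gene_idx]
    # num_pattern is treated as one-indexed here (an assumption).
    return 100.0 * column[num_pattern - 1] / column.sum()


pattern_matrix = [[1.0, 0.0],    # pattern 1: expression of g1, g2
                  [3.0, 2.0]]    # pattern 2
selectivity = gene_selectivity(pattern_matrix, ["g1", "g2"], "g1", num_pattern=2)
```

In this toy matrix, gene g1 has 3 of its 4 total units of expression in pattern 2, so its selectivity for that pattern is 75 percent.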

scProject.stats.importantGenes(patterns_filtered, featureNumber, threshold)

Returns the list of genes that are expressed greater than threshold in the feature.

Parameters:
  • patterns_filtered – AnnData object features x genes

  • featureNumber – Which pattern you want to examine

  • threshold – Show genes that are greater than this threshold

Returns:

A list of genes that are expressed above threshold in the pattern

scProject.stats.projectionDriver(patterns_filtered, cluster1, cluster2, alpha, varName, featureNumber, display=True, num_annotated=25, path=None, titleFontsize=25, axisFontsize=25, pointSize=3, annotationSize=25)

Assumes patterns_filtered and dataset_filtered have the same index.

Parameters:
  • patterns_filtered – AnnData object features x genes

  • cluster1 – Anndata containing cluster1 cells

  • cluster2 – Anndata containing cluster2 cells

  • alpha – Confidence value for the Bonferroni confidence intervals

  • varName – What column in var to use for gene names

  • featureNumber – Which pattern you want to examine

  • display – Whether to visualize or not

  • num_annotated – number of genes to annotate

  • path – Path to save the visualization

  • titleFontsize – fontsize of the title

  • axisFontsize – fontsize of the axis labels

  • pointSize – size of the points

  • annotationSize – Size of the annotations for the top ranked genes

Returns:

Three dataframes. The first contains the significant gene drivers. The second contains the Bonferroni CIs from the weighted mean vector. The third contains the standard Bonferroni CIs, equivalent to calling BonferroniCorrectedDifferenceMeans.

Utilities

scProject.matcher.filterAnnDatas(dataset, patterns, geneColumnName, normalizePatterns=True, normalizeData=False)

This method filters the patterns and the dataset to include only the genes they have in common.

Parameters:
  • normalizeData – Whether to normalize dataset postfilter with L1 norm

  • normalizePatterns – Whether to normalize patterns postfilter with L1 norm

  • dataset (AnnData object) – Anndata object cells x genes

  • patterns – Anndata object features x genes

  • geneColumnName – index for where the gene names are kept in .var

Returns:

A tuple of two filtered AnnData objects
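
The filtering step can be pictured on plain arrays with gene-name lists: keep the shared genes in both matrices, in a common order, then optionally L1-normalize. The helper names below are hypothetical, and normalizing each row (each pattern or cell) to unit L1 norm is an assumption, since the source does not state the axis.

```python
import numpy as np


def l1_normalize_rows(matrix):
    """Scale each row so its absolute values sum to 1 (all-zero rows untouched)."""
    norms = np.abs(matrix).sum(axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return matrix / norms


def filter_to_overlap(data, data_genes, patterns, pattern_genes,
                      normalize_patterns=True, normalize_data=False):
    """Subset both matrices to shared genes, ordered as in data_genes."""
    pattern_set = set(pattern_genes)
    shared = [g for g in data_genes if g in pattern_set]
    data_pos = {g: i for i, g in enumerate(data_genes)}
    pattern_pos = {g: i for i, g in enumerate(pattern_genes)}
    data_f = np.asarray(data, dtype=float)[:, [data_pos[g] for g in shared]]
    patterns_f = np.asarray(patterns, dtype=float)[:, [pattern_pos[g] for g in shared]]
    if normalize_patterns:
        patterns_f = l1_normalize_rows(patterns_f)
    if normalize_data:
        data_f = l1_normalize_rows(data_f)
    return data_f, patterns_f, shared


data = [[1, 2, 3], [4, 5, 6]]            # cells x genes (a, b, c)
patterns = [[1, 3, 9], [2, 2, 9]]        # features x genes (c, a, d)
data_f, patterns_f, shared = filter_to_overlap(
    data, ["a", "b", "c"], patterns, ["c", "a", "d"])
```

After filtering, both matrices have the same gene columns in the same order, which is what the regression functions need.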

scProject.matcher.filterPatterns(patterns, overlap)

Convenience function for using an inputted set of genes

Parameters:
  • patterns – Anndata object features x genes

  • overlap – list-like of genes

Returns:

Filtered patterns (AnnData)

scProject.matcher.filterSource(dataset, overlap)

Convenience function for using an inputted set of genes

Parameters:
  • dataset – Anndata object cells x genes

  • overlap – list-like of genes

Returns:

Filtered dataset (AnnData)

scProject.matcher.getOverlap(dataset, patterns)

Convenience function for overlap of genes

Parameters:
  • dataset – Anndata object cells x genes

  • patterns – Anndata object features x genes

Returns:

Overlap of genes

scProject.matcher.logTransform(dataset_filtered)

Adds a layer called log to the dataset which is the log transform.

Parameters:
  • dataset_filtered – Anndata object cells x genes

Returns:

Log transform of dataset and patterns.

scProject.matcher.mapCellNamesToInts(adata, cellTypeColumnName)

Maps each cell type to an integer. This is used as a helper for coloring plots.

Parameters:
  • adata – AnnData object

  • cellTypeColumnName – index of where cell type is stored in adata.obs

Returns:

void
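
The mapping itself is straightforward. The sketch below assigns integers in order of first appearance, starting from 0. Both of those choices and the helper name are assumptions, because the source only says that each cell type is mapped to an integer.

```python
def map_names_to_ints(cell_types):
    """Return (name -> int mapping, per-cell integer codes)."""
    mapping = {}
    codes = []
    for name in cell_types:
        if name not in mapping:
            # New cell type: give it the next unused integer.
            mapping[name] = len(mapping)
        codes.append(mapping[name])
    return mapping, codes


mapping, codes = map_names_to_ints(["T cell", "B cell", "T cell"])
```

The integer codes can then be passed to a plotting call as color indices, one per cell.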

scProject.matcher.orthologMapper(dataset, biomartFilePath, originalGeneColumn, transformGeneColumn, varName)

Convenience function for mapping genes to their orthologs. Then, use filterAnnDatas.

Parameters:
  • dataset – dataset to find orthologs

  • biomartFilePath – file path of csv from biomart to perform the mapping

  • originalGeneColumn – column name of original gene in biomart file

  • transformGeneColumn – column name of gene in biomart file

  • varName – What set of data in .var to transform

Returns:

void, mutates dataset

scProject.matcher.sourceIsValid(adata)

Checks whether adata is an AnnData object

Parameters:
  • adata – AnnData object

Returns:

SourceTypeError if adata is not an instance of AnnData