Subcellular Localization Analysis
Contents
Subcellular Localization Analysis#
Author: Clarence Mah | Last Updated: 6/15/2022
Here we will analyze cultured 3T3 mouse embryonic stem cells in which 10k genes are spatially profiled with seqFISH+ in Eng. et al 2019. The dataset here consists of 211 cells, each with cell/nuclear segmentation masks and 2D transcript coordinates. Here we showcase how Bento enables subcellular analysis of spatial transcriptomics data.
Load Libraries#
We will be using bento
for subcellular spatial analysis and visualization, along with scanpy
for handling single-cell gene expression data.
Note
Similar to other tools in the python single-cell omics ecosystem, bento
adopts an API organized under a set of modules including:
bento.io
: reading and writing data stored in theAnnData
format, customized to hold molecular coordinates and segmentation masksbento.pp
: preprocessing databento.tl
: subcellular spatial analysesbento.pl
: visualizing subcellular-resolution spatial data, localization pattern statistics, and more
[1]:
import bento
import scanpy as sc
Load Data#
Let’s grab the seqFISH+
dataset. This will download the dataset into data_home
, which by default is set to ~/bento-data
.
The loaded object is an AnnData
object, structured similarly to single-cell omics anlayses, where observations are cells, features are genes, and the main matrix is an expression count matrix. To store subcellular information, bento
stores: - Molecular coordinates: formatted as DataFrames
in uns['points']
- Segmentation masks: formatted as geopandas.GeoSeries
in obs
denoted as {}_shape
. In this case, we have cell and nuclear segmentations stored in cell_shape
and
nucleus_shape
respectively.
[2]:
adata = bento.datasets.load_dataset("seqfish_raw")
adata
[2]:
AnnData object with n_obs × n_vars = 211 × 9506
obs: 'cell_shape', 'nucleus_shape', 'batch'
uns: 'points'
layers: 'spliced', 'unspliced'
What does our data look like? For starters, we can visualize molecules as their spatial density in a single field of view. You may notice not all some cells are missing nuclear masks for one reason or another. We can handle this with Quality Control metrics in the next section.
[3]:
bento.pl.cellplot(adata)
Quality Control#
Calculate and plot QC metrics including those used for single-cell RNA-seq analysis. We can use cell area and perimeter to identify outlier cells that are extremely large or small. Then we can filter those out from our data. For the purpose of this analysis, these outlier have already been filtered. We will instead identify cells without annotated nuclei and remove those.
[4]:
sc.pp.calculate_qc_metrics(adata, inplace=True, percent_top=None)
bento.tl.cell_area(adata)
bento.tl.cell_density(adata)
AnnData object modified:
obs:
+ cell_area
AnnData object modified:
obs:
+ cell_density
[5]:
bento.pl.qc_metrics(adata)
Filter out cells without a nucleus.
[6]:
adata = adata[adata.obs["nucleus_shape"] != None]
bento.pp.set_points(adata)
Trying to set attribute `.uns` of view, copying.
We will also filter genes and only include genes for which at least 10 molecules are detected in at least one cell. This helps reduce data sparsity for our downstream analysis, resulting in 3726 genes.
[7]:
valid_genes = adata.var_names[(adata.to_df() >= 10).sum() >= 1]
adata = adata[:,valid_genes]
bento.pp.set_points(adata)
Trying to set attribute `.uns` of view, copying.
[8]:
adata
[8]:
AnnData object with n_obs × n_vars = 179 × 3726
obs: 'cell_shape', 'nucleus_shape', 'batch', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'cell_area', 'cell_density'
var: 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
uns: 'points'
layers: 'spliced', 'unspliced'
[9]:
bento.pp.get_points(adata).shape
[9]:
(4531290, 6)
Subcellular Localization Patterns#
Now that we have a sense of our data, we will apply a pattern classifier to predict and annotate subcellular localization patterns for our dataset. A single “sample” refers to the set of points corresponding for a given gene in a single cell.
Calculate spatial features#
These features are used by the classifier to predict patterns for each sample. The entire dataset takes ~8 hours to process, therefore we have provided precomputed features to proceed with the analysis.
Note
Why does it take so long? Unlike single-cell analysis which uses expression counts, we are computing spatial relationships considering every molecule! In this dataset there are >4 million molecules and multiple high-resolution polygons for every cell.
For the curious, check out the “How it Works” page.
[10]:
# Load dataset with precomputed features
adata = bento.datasets.load_dataset('seqfish')
Predict Subcellular Patterns#
Predict the subcellular pattern for every sample (by default, only samples with count >= 5). The five subcellular patterns we can predict are:
cell edge: near the cell membrane
cytoplasmic: mostly outside the nucleus in the cytoplasm
nuclear: most in the nucleus
nuclear edge: near the nuclear membrane, either
none: none of the above patterns, more or less randomly distributed
[11]:
bento.tl.lp(adata)
We can view the observed pattern frequencies to get a rough idea of how transcripts are localizing.
[12]:
bento.pl.lp_dist(adata, percentage=True)
We can also visualize the localization of each gene where the point position denotes the balance between subcellular localization pattern frequencies. The color denotes the gene’s most frequent pattern. Interestingly, we see a wide range of variability in localization. A large number of genes are pulled towards none while nuclear enriched genes show strong bias and a high fraction of cells.
[13]:
bento.pl.lp_genes(adata, alpha=0.4)
No handles with labels found to put in legend.
Localization signatures: quantifying cell to cell variability#
Just as gene expression varies from cell to cell, gene localization is dynamic even with a single cell type. Here we will use a method called tensor decomposition to explore this variation. Tensor decomposition breaks down a tensor, or multi-dimensional matrix, into a set of factors. This is similar to how performing PCA on gene expression produces a set of principal components. In this case, we will apply tensor decomposition to a 3-dimensional tensor; the dimensions correspond to patterns, cells, and genes.
Here is the shape of our dataset tensor.
[14]:
adata.uns['tensor'].shape
[14]:
(5, 179, 3726)
First we will determine the number of factors used to represent our dataset. To do so, we will perform tensor decomposition for a range of values (default [1-5], three times each) and calculate the reconstruction loss with the original tensor at each value. Reconstruction accuracy across range of decomposition ranks (1 is perfect, 0 is noise). The best rank is highlighted in red as determined by the elbow method.
Note
Estimating rank can be difficult; here we are using a heuristic and therefore may not generalize perfectly to a different dataset. Try increasing the upper_rank
if the error does not flatten, or increasing runs
if the confidence interval is too wide.
[15]:
bento.tl.select_tensor_rank(adata, bento.PATTERN_NAMES)
Device: cpu
The rank at the elbow is: 3
Perform tensor decomposition at a chosen rank (e.g. best rank above).
[16]:
bento.tl.lp_signatures(adata, 3)
Device: cpu
Each factor is described by 3 vectors, the pattern vector (it’s mislabeled as layers TODO fix), cell vector, and gene vector. Genes and cells are ordered by clustering (hierarchical clustering) for visual clarity. Higher loading values, which indicate stronger association with the factor, are denoted by darker colors.
Pattern color key:
[17]:
bento.pl.lp_signatures(adata)
Let’s plot the top 5 cells and top 5 genes associated with each signature. As expected, the signatures generally agree with the pattern loadings above. The first signature’s pattern loading is dominated by “none” and “cytoplasmic”, while signature 2 is a combination of “nuclear” and “nuclear edge” and signature 3 is primarily “cell edge”.
[18]:
bento.pl.sig_samples(adata)
We find these signatures recapitulate patterns found in the original seqFISH+ study, in which three major clusters of spatially co-occurring genes were observed and manually annotated as protrusion, nuclear/perinuclear, and cytoplasmic. This demonstrates the ability of tensor decomposition to extract meaningful biological structure from localization patterns in a data-driven manner.
Note
We recommend characterizing the gene and cell loadings of each signature to aid in interpretation. Refer to our paper for more details regarding these localization signatures! If you are analyzing your own dataset, we encourage reaching out the authors directly if you have any questions about your analysis or feature requests for bento
!