Main Guide#

Author: Clarence Mah | Last Updated: May 15, 2023

Here we will analyze a subset of the U2-OS cell dataset from the Bento paper, in which 130 genes and 5 non-targeting controls are spatially profiled with MERFISH. The full dataset includes 1153 cells, each with cell/nuclear segmentation masks and 2D transcript coordinates. Here we demonstrate Bento’s key functionality on a representative subset.

Setup#

Load libraries and configure paths.

import bento as bt
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

Load Data#

Bento includes access to datasets with the package. Datasets are downloaded and stored in data_home, which by default is set to ~/bento-data.

The loaded object is an AnnData object, structured similarly to single-cell omics analysis, where observations are cells, features are genes, and the main matrix is an expression count matrix. Bento additionally stores molecular coordinates in uns['points'] and polygons as columns in obs.

Note

See the docs on Bento data structures for more details.

adata = bt.ds.sample_data()

You can use built-in plotting functions to visualize data. Here we plot RNA distributions and cell/nuclear segmentation masks.

See also

See the data visualization tutorial for a more comprehensive guide on plotting.

bt.pl.density(adata)
../_images/118bc7c92395225f0e21963d1815f9b79edfda72bd1efe61f92534628e737923.png

Or visualize a specific gene of interest with bt.pl.points().

Tip

Plotting multiple genes can get slow quickly due to plotting the legend; if all you need is a visual, you can set legend=False and the plot should render more quickly.

bt.pl.points(adata, hue="gene", legend=False, s=1, palette="tab20")
../_images/c4f59bc918d87c4f40c87749326a314ae677ba54d9171b3f76101708b268dd45.png

Let’s filter out cells without a nucleus. There are several reasons why this may occur including missing segmentation, overlapping nuclei, or disagreement with cell segmentation. Accurate cell segmentation is a difficult task, especially in samples with high cell density (cells can overlap) and tissue sections.

adata = adata[adata.obs["nucleus_shape"] != None]
bt.sync(adata)
adata
Trying to set attribute `._uns` of view, copying.
AnnData object with n_obs × n_vars = 15 × 135
    obs: 'cell_shape', 'nucleus_shape', 'batch'
    uns: 'point_sets', 'points'
    layers: 'spliced', 'unspliced'

Keep genes where at least 10 molecules are detected in at least one cell.

gene_filter = (adata.X >= 10).sum(axis=0) > 0
adata = adata[:, gene_filter]
bt.sync(adata)
adata
Trying to set attribute `._uns` of view, copying.
AnnData object with n_obs × n_vars = 15 × 133
    obs: 'cell_shape', 'nucleus_shape', 'batch'
    uns: 'point_sets', 'points'
    layers: 'spliced', 'unspliced'

Spatial summary statistics#

You can get a quick summary of cell and nuclear properties, including area, shape i.e. aspect ratio, and RNA density.

See also

See the spatial features tutorial for additional features and how to implement custom ones.

bt.tl.obs_stats(adata)
AnnData object modified:
    obs:
        + cell_area, cell_aspect_ratio, cell_density
AnnData object modified:
    obs:
        + nucleus_density, nucleus_aspect_ratio, nucleus_area
bt.pl.obs_stats(adata)
../_images/0d5c42e8dca84ae83764acb4a540ec5e3af19bcdd5d0a434f8b57f5c337834c3.png

Predict subcellular domains#

RNAflux embedding#

RNAflux quantifies spatial composition gradients to capture subcellular changes in expression, represented as a [pixel x gene] embedding.

See also

Learn more about the algorithm here.

res = 0.1
bt.tl.flux(adata, radius=50, res=res)
AnnData object modified:
    obs:
        + cell_raster
    uns:
        + cell_raster
AnnData object modified:
    obs:
        + cell_raster
    uns:
        + cell_raster, flux_embed, flux, flux_variance_ratio, flux_genes, flux_color

bt.tl.flux() automatically performs PCA and saves the first 10 principal components (PCs) in uns['flux_embed']. We can visualize the first 3 PCs by mapping them to RGB values (red = PC1, green = PC2, and blue = PC3) for each pixel.

Tip

The embedding is calculated at 5% unit resolution by default for speed. Higher resolution trades off speed for smoother embeddings. Note that computation time scales quadratically in relation to resolution \(r\) or \(O(r^{2})\) e.g. 10% resolution takes 4x longer to compute than 5%.

bt.pl.flux(adata, res=res)
../_images/cd4406171fb7a67f3e49c0150b946cf5f3b8a2c9c7fc6a3b1b9eab1a776b550d.png

Fluxmaps (RNAflux domains)#

To identify distinct subcellular domains in a data-driven manner, we can cluster pixels by their RNAflux embeddings. The bt.tl.fluxmap() function fits a self-organizing map (SOM) to the reduced PCA space for a range of cluster numbers. We use the elbow method heuristic to recommend the optimal number of clusters. By default, a line plot will be rendered showing the model fit error for each cluster number and draw a vertical dotted line indicating the recommended number.

Note

Determining the number of clusters is not trivial and can be highly subjective. Occasionally, no number is suggested. You can either try a wider range of cluster numbers or manually pick one. We generally recommend settling on a smaller number of clusters i.e. less than 10 for interpretability.

bt.tl.fluxmap(adata, res=res)
../_images/218bc4011901f5f0965c10639d1281c68195d4e2a0c7c8c513043eba72cef2f0.png
AnnData object modified:
    obs:
        + fluxmap2_shape, fluxmap3_shape, fluxmap4_shape, fluxmap1_shape

Now let’s visualize the predicted subcellular domains. Notice how every cell contains instances of each domain (denoted by the different colors).

bt.pl.fluxmap(adata)
../_images/31e1ea41ef700927470a7339bd879843cc0ad736033d27c7562cad626f20b81f.png

Functional enrichment of fluxmaps#

We can utilize RNAflux embeddings to compute enrichment scores across the entire area of each cell. Given the appropriate genesets, they can help us identify functionally relevant domains such as organelles and subcellular compartments e.g. the nucleus and cytoplasm. Here we employ published APEX-seq data measuring the relative expression (log2 fold change) of genes in various compartments. We can compare geneset enrichment scores to the fluxmaps.

bt.tl.fe_fazal2019(adata)
4 samples of mat are empty, they will be removed.
Running wsum on mat with 19323 samples and 133 targets for 8 sources.
Infering activities on 2 batches.
100%|██████████| 2/2 [00:04<00:00,  2.14s/it]
AnnData object modified:
    uns:
        + flux_Lamina, flux_Nucleus, flux_ER Lumen, flux_Cytosol, flux_OMM, flux_ERM, flux_Nuclear Pore, fe_stats, fe_ngenes, flux_Nucleolus

You can visualize functional enrichment scores with bt.pl.fe() as well as specific shapes to overlay. In this case, we showcase the striking correspondence of fluxmap2 with “OMM” expression (outer mitochondrial membrane).

bt.pl.fe(
    adata,
    "flux_OMM",
    res=res,
    shapes=["cell", "fluxmap2"],
)
Trying to set attribute `._uns` of view, copying.
../_images/59cc8a92baf43c626be510862619a500b61c2660ff1ae9ceea0fa69eb6b8ec89.png

Here, fluxmap1 strongly associates with “Nucleus” expression.

bt.pl.fe(adata, "flux_Nucleus", res=res, shapes=["cell", "fluxmap1"])
Trying to set attribute `._uns` of view, copying.
../_images/4045f8733a953e4b04b1ed7634c3bc9f811575e0ed53afec13d4c923a2068484.png

Predict RNA Localization Patterns#

We will use the RNAforest model, to predict and annotate subcellular localization patterns. A single “sample” refers to the set of points corresponding to a given gene in a single cell. In the case that every cell expresses every gene, the number of samples is at most \(n * m\) for \(n\) cells and \(m\) genes.

RNA Localization Pattern Annotation Workflow

The five subcellular patterns we can predict are:

  • cell edge: near the cell membrane

  • cytoplasmic: mostly outside the nucleus in the cytoplasm

  • nuclear: most in the nucleus

  • nuclear edge: near the nuclear membrane, either

  • none: none of the above patterns, more or less randomly distributed

See also

See more details about the spatial statistics used as input features for classification.

bt.tl.lp(adata)
Calculating cell features...
AnnData object modified:
    obs:
        + cell_radius, cell_maxx, cell_minx, cell_maxy, cell_miny, cell_span
Processing point features...
Saving results...
Done.
AnnData object modified:
    obs:
        + cell_radius, cell_maxx, cell_minx, cell_maxy, cell_miny, cell_span
    uns:
        + cell_gene_features
Calculating cell features...
Processing point features...
Saving results...
Done.
AnnData object modified:
    obs:
        + cell_radius, cell_maxx, cell_minx, cell_maxy, cell_miny, cell_span
    uns:
        + lp, cell_gene_features, lpp

We can view the observed pattern frequencies to get a rough idea of how transcripts are localizing.

bt.pl.lp_dist(adata, percentage=True)
../_images/37c41c44fb59743668d8900dcbcc1af6864c1a95401d5fe36f67c87490539d87.png

We can also visualize the localization of each gene where the point position denotes the balance between subcellular localization pattern frequencies. The color denotes the gene’s most frequent pattern. Interestingly, we see a wide range of variability in localization. A large number of genes are pulled towards none while nuclear enriched genes show strong bias and a high fraction of cells.

bt.tl.lp_stats(adata)
AnnData object modified:
    uns:
        + lp_stats
bt.pl.lp_genes(adata, sizes=(10, 85), size_norm=(90, 100))
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../_images/9ffee26f27aa54236ef76b04388e1aa140e797cf73edce87843bea03b437bbef.png

Colocalization analysis#

Here we use the Colocation Quotient or CLQ (Leslie & Kronenfeld, 2011) to measure pairwise colocalization between genes. Given two sets of points, A and B, the CLQ is the ratio of observed to expected proprtion of B among A’s neighbors.

At the same time, we quantify colocalization in a compartment-specific manner i.e. transcripts in the nucleus organize differently than they do in the cytoplasm.

colocalization analysis workflow

First lets create shapes for the cytoplasm.

# Cytoplasm = cell - nucleus
adata.obs["cytoplasm_shape"] = bt.geo.get_shape(adata, "cell_shape") - bt.geo.get_shape(
    adata, "nucleus_shape"
)

# Create point index
adata.uns["points"]["cytoplasm"] = (
    adata.uns["points"]["nucleus"].astype(int) < 0
).astype(int)

Now we can calculate CLQ values for every gene pair – one for the cytoplasm and once more for the nucleus.

bt.tl.coloc_quotient(adata, shapes=["cytoplasm_shape", "nucleus_shape"])
AnnData object modified:
    uns:
        + clq

We can represent the data as a 3-dimensional tensor: compartments, cells, and gene pairs and apply tensor decomposition - in this case, non-negative PARAFAC (Shashua & Hazan 2005) - to discover substructure considering cell and compartment-specific patterns.

The data tensor is broken down into \(k\) factors. When added together, the factors reconstruct the original data tensor with some degree of error. By plotting the error for each \(k\) rank decomposition, we can use the elbow method heuristic to recommend the optimal number of factors.

bt.tl.colocation(adata, ranks=range(1, 6))
Preparing tensor...
(2, 15, 17317)
:running: Decomposing tensor...
:heavy_check_mark: Done.
AnnData object modified:
    uns:
        + factors, tensor_labels, tensor, tensor_names, factors_error
../_images/90ce8d08ac16f7cc03afd57a2ed32fe9700181406917b902d73e7779579af899.png

Let’s plot the factor loadings for the suggested \(k = 3\). From left to right, the three heatmaps show the loadings of each factor for each dimension – compartments, cells, and gene pairs. We can limit the heatmap to show the top 5 associated gene pairs for each factor.

bt.pl.colocation(adata, rank=2)
../_images/52469655243efc242d6c4eada3ff0b1e3d3c3046d0389e2e6d519a6b651814ec.png