Main Guide

Author: Clarence Mah | Last Updated: Apr 26, 2024

Here we will analyze a subset of the U2-OS cell dataset from the Bento paper, in which 130 genes and 5 non-targeting controls are spatially profiled with MERFISH. The full dataset includes 1153 cells, each with cell/nuclear segmentation masks and 2D transcript coordinates. Here we demonstrate Bento’s key functionality on a representative subset.

Setup

Load libraries used in this analysis.

%load_ext autoreload
%autoreload 2
import bento as bt
import spatialdata as sd
import matplotlib.pyplot as plt
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Load Data

Download the MERFISH dataset from the Bento paper at this link and unzip it to data.zarr.

The loaded object is a SpatialData object, a container for spatial omics data. Under the hood, the SpatialData framework reads/writes data from/to disk and provides a unified interface in Python.

Note

See Data Format for more details.

sdata = sd.read_zarr("/mnt/d/mah2024_merfish/data.zarr/")
sdata
SpatialData object with:
├── Points
│     └── 'transcripts': DataFrame with shape: (16907948, 5) (2D points)
├── Shapes
│     ├── 'cell_boundaries': GeoDataFrame shape: (1153, 1) (2D shapes)
│     └── 'nucleus_boundaries': GeoDataFrame shape: (1153, 1) (2D shapes)
└── Table
      └── AnnData object with n_obs × n_vars = 1153 × 135
    obs: 'cell_boundaries', 'region'
    uns: 'spatialdata_attrs': AnnData (1153, 135)
with coordinate systems:
▸ 'global', with elements:
        transcripts (Points), cell_boundaries (Shapes), nucleus_boundaries (Shapes)
sdata = sd.bounding_box_query(
    sdata,
    axes=["x", "y"],
    min_coordinate=[0, 0],
    max_coordinate=[6000, 6000],
    target_coordinate_system="global",
)
sdata = bt.io.prep(sdata)
Done 🍱: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:13<00:00,  4.66s/it]

Visualize Data

Let’s visualize the data to gain a better intuition for the dataset. Spatial plotting functions in bento-tools automatically render cell and nuclear boundaries by default. We then overlay the transcript density on top with bt.pl.density().

Tip

See the spatial plotting guide for more details.

plt.figure(figsize=(10,10))
bt.pl.density(sdata)
../_images/c6e390c696e18bca8210c22e0f7c61d9d0f9207a9df3973a9100b4bd1f4cc78d.png

Keep genes where at least 10 molecules are detected in at least one cell.

sdata = bt.ut.filter_by_gene(sdata, min_count=10)

Spatial summary statistics

You can get a quick summary of cell and nuclear properties, including area, shape i.e. aspect ratio, and RNA density. This is an easy way to assess data quality and manage outliers e.g. cells with an extreme number of transcripts, very small area (possibly segmentation artifacts), no nuclei, etc.

See also

See the spatial features tutorial for additional metrics.

bt.tl.shape_stats(sdata)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 12.79it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 12.24it/s]
bt.pl.shape_stats(sdata)
../_images/0fdefea8094ffdc29ab4a154fd4078670b9f8cb372fb6d1911c6403ccf278d17.png

RNAflux: semantic segmentation of subcellular domains

RNAflux first quantifies subcellular expression gradients at pixel resolution before identifying consistent subcellular domains via unsupervised clustering. The result is a semantic segmentation of the cell, where each pixel is assigned to a subcellular domain. For the purposes of this tutorial, we will lower the resolution res parameter to speed up computation.

Tip

The embedding is calculated at 10% unit resolution for speed. Higher resolution trades off speed for smoother embeddings. Note that computation time scales quadratically \(O(res^{2})\) in relation to resolution \(res\) e.g. 10% is 100x faster than 100% resolution.

Learn more about the algorithm here.

res = 0.1
bt.tl.flux(sdata, method="radius", res=res, recompute=True)
Embedding:   0%|                                                                                                               | 0/3 [00:00<?, ?it/s]
  0%|                                                                                                                        | 0/154 [00:00<?, ?it/s]
  1%|█▍                                                                                                              | 2/154 [00:00<00:07, 19.65it/s]
  3%|███▋                                                                                                            | 5/154 [00:00<00:07, 20.88it/s]
  5%|█████▊                                                                                                          | 8/154 [00:00<00:06, 22.94it/s]
  8%|████████▋                                                                                                      | 12/154 [00:00<00:05, 26.89it/s]
 10%|██████████▊                                                                                                    | 15/154 [00:00<00:05, 26.54it/s]
 12%|████████████▉                                                                                                  | 18/154 [00:00<00:05, 26.32it/s]
 14%|███████████████▏                                                                                               | 21/154 [00:00<00:05, 25.53it/s]
 16%|█████████████████▎                                                                                             | 24/154 [00:00<00:05, 24.23it/s]
 18%|███████████████████▍                                                                                           | 27/154 [00:01<00:04, 25.68it/s]
 19%|█████████████████████▌                                                                                         | 30/154 [00:01<00:04, 25.83it/s]
 21%|███████████████████████▊                                                                                       | 33/154 [00:01<00:04, 25.09it/s]
 24%|██████████████████████████▋                                                                                    | 37/154 [00:01<00:04, 26.20it/s]
 26%|████████████████████████████▊                                                                                  | 40/154 [00:01<00:04, 24.92it/s]
 28%|██████████████████████████████▉                                                                                | 43/154 [00:01<00:04, 24.61it/s]
 30%|█████████████████████████████████▏                                                                             | 46/154 [00:01<00:04, 23.09it/s]
 32%|███████████████████████████████████▎                                                                           | 49/154 [00:01<00:04, 23.62it/s]
 34%|█████████████████████████████████████▍                                                                         | 52/154 [00:02<00:04, 21.80it/s]
 36%|███████████████████████████████████████▋                                                                       | 55/154 [00:02<00:04, 22.15it/s]
 38%|█████████████████████████████████████████▊                                                                     | 58/154 [00:02<00:03, 24.02it/s]
 40%|███████████████████████████████████████████▉                                                                   | 61/154 [00:02<00:04, 21.89it/s]
 42%|██████████████████████████████████████████████▏                                                                | 64/154 [00:02<00:03, 22.58it/s]
 44%|████████████████████████████████████████████████▎                                                              | 67/154 [00:02<00:03, 23.60it/s]
 46%|███████████████████████████████████████████████████▏                                                           | 71/154 [00:02<00:03, 25.49it/s]
 48%|█████████████████████████████████████████████████████▎                                                         | 74/154 [00:03<00:03, 25.43it/s]
 50%|███████████████████████████████████████████████████████▌                                                       | 77/154 [00:03<00:02, 25.77it/s]
 52%|█████████████████████████████████████████████████████████▋                                                     | 80/154 [00:03<00:02, 26.25it/s]
 54%|███████████████████████████████████████████████████████████▊                                                   | 83/154 [00:03<00:02, 26.74it/s]
 56%|█████████████████████████████████████████████████████████████▉                                                 | 86/154 [00:03<00:02, 25.33it/s]
 58%|████████████████████████████████████████████████████████████████▏                                              | 89/154 [00:03<00:02, 24.41it/s]
 60%|██████████████████████████████████████████████████████████████████▎                                            | 92/154 [00:03<00:02, 24.51it/s]
 62%|████████████████████████████████████████████████████████████████████▍                                          | 95/154 [00:03<00:02, 24.39it/s]
 64%|██████████████████████████████████████████████████████████████████████▋                                        | 98/154 [00:03<00:02, 25.23it/s]
 66%|████████████████████████████████████████████████████████████████████████▏                                     | 101/154 [00:04<00:02, 23.56it/s]
 68%|███████████████████████████████████████████████████████████████████████████                                   | 105/154 [00:04<00:01, 24.64it/s]
 71%|█████████████████████████████████████████████████████████████████████████████▊                                | 109/154 [00:04<00:01, 26.25it/s]
 73%|████████████████████████████████████████████████████████████████████████████████                              | 112/154 [00:04<00:01, 23.89it/s]
 75%|██████████████████████████████████████████████████████████████████████████████████▏                           | 115/154 [00:04<00:01, 23.44it/s]
 77%|████████████████████████████████████████████████████████████████████████████████████▎                         | 118/154 [00:04<00:01, 24.42it/s]
 79%|██████████████████████████████████████████████████████████████████████████████████████▍                       | 121/154 [00:04<00:01, 24.06it/s]
 81%|████████████████████████████████████████████████████████████████████████████████████████▌                     | 124/154 [00:05<00:01, 22.92it/s]
 82%|██████████████████████████████████████████████████████████████████████████████████████████▋                   | 127/154 [00:05<00:01, 23.87it/s]
 84%|████████████████████████████████████████████████████████████████████████████████████████████▊                 | 130/154 [00:05<00:00, 24.10it/s]
 86%|███████████████████████████████████████████████████████████████████████████████████████████████               | 133/154 [00:05<00:00, 23.69it/s]
 88%|█████████████████████████████████████████████████████████████████████████████████████████████████▏            | 136/154 [00:05<00:00, 22.90it/s]
 90%|███████████████████████████████████████████████████████████████████████████████████████████████████▎          | 139/154 [00:05<00:00, 23.18it/s]
 92%|█████████████████████████████████████████████████████████████████████████████████████████████████████▍        | 142/154 [00:05<00:00, 24.08it/s]
 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████▌      | 145/154 [00:05<00:00, 25.40it/s]
 97%|██████████████████████████████████████████████████████████████████████████████████████████████████████████▍   | 149/154 [00:06<00:00, 26.24it/s]
 99%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 152/154 [00:06<00:00, 25.17it/s]
Done. 🍱: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:16<00:00,  5.39s/it]

bt.tl.flux() also performs dimensional reduction using PCA and saves the first 10 principal components (PCs). We can visualize each pixel as the first 3 PCs mapped to RGB values (red = PC1, green = PC2, and blue = PC3) and scale alpha by the RNA density.

fig, ax = plt.subplots(figsize=(12,12))
bt.pl.flux(sdata, res=res, ax=ax)
../_images/0ea2878adfcc7f5e1c766eb6538de207f5d4a3c61c225eb6dc0acebc58b540f3.png

Identify subcellular domains

To identify distinct subcellular domains in a data-driven manner, we can cluster pixels by their RNAflux embeddings. The bt.tl.fluxmap() function fits a self-organizing map (SOM) to the reduced PCA space for a range of cluster numbers. We use the elbow method heuristic to recommend the optimal number of clusters. By default, a line plot will be rendered showing the model fit error for each cluster number and draw a vertical dotted line indicating the recommended number.

Tip

Determining the number of clusters is not trivial and can be highly subjective. Occasionally, no number is suggested. You can either try a wider range of cluster numbers or manually pick one. We generally recommend settling on a smaller number of clusters i.e. less than 10 for interpretability.

bt.tl.fluxmap(sdata, res=res, min_count=100)
Optimizing # of clusters:   0%|                                                                                                | 0/4 [00:00<?, ?it/s]
  0%|                                                                                                                          | 0/7 [00:00<?, ?it/s]
 14%|████████████████▎                                                                                                 | 1/7 [00:00<00:00,  8.76it/s]
 29%|████████████████████████████████▌                                                                                 | 2/7 [00:00<00:00,  8.81it/s]
 43%|████████████████████████████████████████████████▊                                                                 | 3/7 [00:00<00:00,  8.70it/s]
 57%|█████████████████████████████████████████████████████████████████▏                                                | 4/7 [00:00<00:00,  8.43it/s]
 71%|█████████████████████████████████████████████████████████████████████████████████▍                                | 5/7 [00:00<00:00,  8.00it/s]
 86%|█████████████████████████████████████████████████████████████████████████████████████████████████▋                | 6/7 [00:00<00:00,  8.04it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00,  8.16it/s]
Vectorizing domains:  50%|██████████████████████████████████████████████▌                                              | 2/4 [00:02<00:02,  1.38s/it]
  0%|                                                                                                                        | 0/154 [00:00<?, ?it/s]
  9%|██████████                                                                                                    | 14/154 [00:00<00:01, 132.46it/s]
 18%|████████████████████                                                                                          | 28/154 [00:00<00:00, 131.37it/s]
 27%|█████████████████████████████▉                                                                                | 42/154 [00:00<00:00, 130.82it/s]
 36%|████████████████████████████████████████                                                                      | 56/154 [00:00<00:00, 130.38it/s]
 45%|██████████████████████████████████████████████████                                                            | 70/154 [00:00<00:00, 127.74it/s]
 54%|███████████████████████████████████████████████████████████▎                                                  | 83/154 [00:00<00:00, 127.66it/s]
 62%|████████████████████████████████████████████████████████████████████▌                                         | 96/154 [00:00<00:00, 127.19it/s]
 71%|█████████████████████████████████████████████████████████████████████████████▏                               | 109/154 [00:00<00:00, 123.67it/s]
 79%|██████████████████████████████████████████████████████████████████████████████████████▎                      | 122/154 [00:00<00:00, 124.15it/s]
 88%|███████████████████████████████████████████████████████████████████████████████████████████████▌             | 135/154 [00:01<00:00, 124.84it/s]
 96%|████████████████████████████████████████████████████████████████████████████████████████████████████████▊    | 148/154 [00:01<00:00, 114.36it/s]
Done: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:16<00:00,  4.04s/it]

Now let’s visualize the predicted subcellular domains. Notice how every cell contains instances of each domain (denoted by the different colors).

fig, ax = plt.subplots(figsize=(12, 12))
bt.pl.fluxmap(sdata, palette=bt.colors.bento6, ax=ax)
../_images/990d5c9ab14d8dcc402566d544cbd762272073dad80b707a7d9af8ffc102750c.png
fluxmap_names = [s for s in sdata.shapes.keys() if s.startswith('fluxmap')]
bt.tl.comp(
    sdata,
    points_key="transcripts",
    shape_names= fluxmap_names,
)
bt.pl.comp(sdata, annotate=5)
WARNING:matplotlib.legend:No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
Adjusting text positions...
../_images/f731b90a241138933dd100dcb6b4f0a33b2190e970813e2446274c98dc3808f6.png

Functional enrichment of fluxmaps

We can utilize RNAflux embeddings to compute enrichment scores across the entire area of each cell. Given the appropriate genesets, they can help us identify functionally relevant domains such as organelles and subcellular compartments e.g. the nucleus and cytoplasm. Here we employ published APEX-seq data measuring the relative expression (log2 fold change) of genes in various compartments. We can compare geneset enrichment scores to the fluxmaps.

bt.tl.fe_fazal2019(sdata, batch_size=50000)
105 samples of mat are empty, they will be removed.
Running wsum on mat with 180780 samples and 135 targets for 8 sources.
Infering activities on 4 batches.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:41<00:00, 10.34s/it]

You can visualize functional enrichment scores with bt.pl.fe() as well as specific shapes to overlay. In this case, we showcase the striking correspondence of the “Nucleus” enrichment with one of the fluxmaps.

plt.figure(figsize=(12,12))
bt.pl.fe(
    sdata,
    "flux_Nucleus",
    cmap="Reds",
    res=res,
    vmin=0,
)
../_images/a640de62b18d9c9bcd7434d0c2eb0863b0698361bf245b2cca88cb0cce260ac8.png

Predict RNA Localization Patterns

We will use the RNAforest model, to predict and annotate subcellular localization patterns. A single “sample” refers to the set of points corresponding to a given gene in a single cell. In the case that every cell expresses every gene, the number of samples is at most \(n * m\) for \(n\) cells and \(m\) genes.

RNA Localization Pattern Annotation Workflow

The five subcellular patterns we can predict are:

  1. cell edge: near the cell membrane

  2. cytoplasmic: mostly outside the nucleus in the cytoplasm

  3. nuclear: most in the nucleus

  4. nuclear edge: near the nuclear membrane, either

  5. none: none of the above patterns, more or less randomly distributed

See also

See more details about the spatial statistics used as input features for classification.

bt.tl.lp(sdata)
Crunching shape features...
Crunching point features...
Saving results...
Done.
Crunching shape features...
Crunching point features...
Saving results...
Done.

We can view the observed pattern frequencies to get a rough idea of how transcripts are localizing.

bt.pl.lp_dist(sdata)
../_images/31d09424303f82c7407eb1e33f8b64bbe4f90586aa0b656b3756a5f486be7d44.png

We can also visualize the localization of each gene where the point position denotes the balance between subcellular localization pattern frequencies. The color denotes the gene’s most frequent pattern. Interestingly, we see a wide range of variability in localization. A large number of genes are pulled towards none while nuclear enriched genes show strong bias and a high fraction of cells.

bt.tl.lp_stats(sdata)
bt.pl.lp_genes(sdata, sizes=(10, 85), size_norm=(90, 100))
WARNING:matplotlib.legend:No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../_images/73e79aa4640fb93e72973102de4112362089371697ea208c8a1d420c62d2f287.png

Colocalization analysis

Here we use the Colocation Quotient or CLQ (Leslie & Kronenfeld, 2011) to measure pairwise colocalization between genes. Given two sets of points, A and B, the CLQ is the ratio of observed to expected proprtion of B among A’s neighbors.

At the same time, we quantify colocalization in a compartment-specific manner i.e. transcripts in the nucleus organize differently than they do in the cytoplasm.

colocalization analysis workflow

First lets create shapes for the cytoplasm.

bt.geo.overlay(
    sdata,
    s1="cell_boundaries",
    s2="nucleus_boundaries",
    name="cytoplasm",
    how="difference",
)
Done 🍱: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:12<00:00,  4.12s/it]

Now we can calculate CLQ values for every gene pair – one for the cytoplasm and once more for the nucleus.

bt.tl.coloc_quotient(sdata, shapes=["cytoplasm", "nucleus_boundaries"])
cytoplasm:  83%|██████████████████████████████████████████████████████████████████████████████████▎                | 128/154 [00:17<00:03,  7.26it/s]

We can represent the data as a 3-dimensional tensor: compartments, cells, and gene pairs and apply tensor decomposition - in this case, non-negative PARAFAC (Shashua & Hazan 2005) - to discover substructure considering cell and compartment-specific patterns.

The data tensor is broken down into \(k\) factors. When added together, the factors reconstruct the original data tensor with some degree of error. By plotting the error for each \(k\) rank decomposition, we can use the elbow method heuristic to recommend the optimal number of factors.

bt.tl.colocation(sdata, ranks=range(1, 6))

Let’s plot the factor loadings for the suggested \(k = 3\). From left to right, the three heatmaps show the loadings of each factor for each dimension – compartments, cells, and gene pairs. We can limit the heatmap to show the top 5 associated gene pairs for each factor.

bt.pl.colocation(sdata, rank=2)