Data Prep Guide

To take advantage of Bento’s analysis capabilities, we need to provide three layers of information acquired from molecule-resolved spatial transcriptomics experiments. At a high level these include:

  1. Molecular coordinates: The x and y coordinates for individual transcripts, as well as gene identifiers.

  2. Cell boundaries: This allows us to assign transcripts to cells. These are usually derived from cell staining i.e. WGA, cell surface markers, polyA stains, nuclear expansion etc.

  3. Nucleus boundaries: Nuclei are essential for providing subcellular context in our spatial analysis. These are usually derived from DAPI staining or other nuclear markers.

This table shows how the layers of data are used across Bento’s set of analysis tools.

Shape Features

Point Features

RNAflux

RNAforest

RNAcoloc

Molecular coordinates

No

Yes

Yes

Yes

Yes

Cell boundaries

Yes

Yes

Yes

Yes

Yes

Nucleus boundaries

Optional

Optional

No

Yes

Optional

Using SpatialData

Under the hood, we use the SpatialData framework to manage SpatialData objects in Python, allowing us to store and manipulate spatial data in a standardized format. Briefly, SpatialData objects are stored on-disk in the Zarr storage format. We aim to be fully compatible with SpatialData, so you can use the same objects in both Bento and SpatialData.

The SpatialData Python object is a flexible container for a variety of elements. For an in-depth look data elements, see this SpatialData tutorial on SpatialElements. Briefly, they are:

  • Images: raw images, segmented images

  • Labels: cell masks, nucleus masks

  • Points: transcript coordinates, cell coordinates, landmarks

  • Shapes: boundaries, circles, polygons

  • Tables: annotations, count matrices

For the purposes of Bento, we will represent data as Points (molecules) and Shapes (cell & nuclear boundaries). Gene expression is automatically calculated from spatially aggregating transcript counts by shapes and saved as a Table.

Platform-specific data

Steps to ingest your data will depend on how it was generated. Data from some platforms should work out of the box while others will require some more custom formatting.

  1. CosMx, Xenium: use the corresponding spatialdata-io reader functions.

    1. This will have all three layers of info (molecules, cell boundaries, nuclear boundaries).

  2. Merscope: use the corresponding spatialdata-io reader functions.

    1. This will only have molecules and cell boundary information.

  3. Other platforms: Molecular Cartography, seqFISH, STARmap, home-built MERFISH, smFISH etc.

    1. We will cover how to do this in the next section. For a comprehensive tutorial, see their detailed guide on constructing your own SpatialData.

Loading custom data

Here we will walk through formatting staining images and transcript coordinates into a SpatialData object.

Note

This guide is a simple demonstration and is not meant to be comprehensive. Steps will vary depending on output format and platform.

import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import spatialdata as sd
from cellpose import models
from shapely.geometry import Polygon
from skimage.io import imread
from skimage.measure import find_contours
from spatialdata.models import PointsModel, ShapesModel

Cell segmentation data

Using a sample image from the Cellpose website, we will demonstrate transforming an image into a set of polygons.

img = imread(
    "https://github.com/ckmah/bento-tools/raw/master/bento/datasets/tutorial/cell_stain.tif"
)
plt.imshow(img, "binary_r")
<matplotlib.image.AxesImage at 0x7f136e3e6050>
../_images/bb7eeccf7e6efa9f169bf5e0567a4809259fa9df80ccbf9ab753514de7d0f48d.png

Load the Cellpose model and segment the image. Each cell is encoded as a separate integer in the mask.

cell_model = models.Cellpose(model_type="cyto3")
cell_mask, _, _, _ = cell_model.eval(img, channels=[[0, 0]])
plt.imshow(cell_mask)
<matplotlib.image.AxesImage at 0x7f136c2091e0>
../_images/2c268ad81e88ae07a937e5f54b0e71af2876a05e54a95fe8c2397eb0734f3588.png

Convert the mask to a set of polygons using skimage.measure.find_contours. Each polygon represents a cell boundary.

Note

As we do below, note that SpatialData requires the polygons to be stored in a GeoPandas GeoDataFrame.

contours = []
for i in range(1, cell_mask.max()):
    contours.append(find_contours(cell_mask == i, 0.5)[0])

cell_polygons = np.array([Polygon(p[:, [1, 0]]) for p in contours])
cell_polygons = gpd.GeoDataFrame(geometry=cell_polygons)
cell_polygons.head()
geometry
0 POLYGON ((64.000 14.500, 63.000 14.500, 62.000...
1 POLYGON ((86.000 7.500, 85.000 7.500, 84.500 7...
2 POLYGON ((137.000 41.500, 136.000 41.500, 135....
3 POLYGON ((244.000 58.500, 243.000 58.500, 242....
4 POLYGON ((142.000 39.500, 141.000 39.500, 140....
cell_polygons.plot(edgecolor="black")
plt.gca().invert_yaxis()
../_images/6a82efca1329c493e15e68f698eaee8b829cd23521499557bfadc4a3750ceb47.png

Nuclear segmentation data

Nuclear segmentation data can be obtained in a similar way to cell segmentation data. We repeat the same steps as above to obtain a set of polygons representing nuclear boundaries.

img2 = imread("https://github.com/ckmah/bento-tools/raw/master/bento/datasets/tutorial/nuclear_stain.tif")
plt.imshow(img2, "binary_r")
<matplotlib.image.AxesImage at 0x7fb454a28d30>
../_images/6f1b1c517755e26c944b0d040090163ba9306bb1818935495195397c17e5f7a0.png
nuclei_model = models.Cellpose(model_type="nuclei")
nuclei_mask, _, _, _ = nuclei_model.eval(img2, channels=[[0, 0]])
plt.imshow(nuclei_mask)
<matplotlib.image.AxesImage at 0x7fb52c2b7b50>
../_images/7bd75c810b6f48d9f38af6438ac83f8c020bc770c4b41adab9970ed5ae94d1ca.png
n_contours = []
for i in range(1, nuclei_mask.max()):
    n_contours.append(find_contours(nuclei_mask == i, 0.5)[0])

nuclei_polygons = np.array([Polygon(p[:, [1, 0]]) for p in n_contours])
nuclei_polygons = gpd.GeoDataFrame(geometry=nuclei_polygons)
nuclei_polygons.head()
geometry
0 POLYGON ((86.500 0.000, 87.000 0.500, 87.500 1...
1 POLYGON ((210.000 12.500, 209.000 12.500, 208....
2 POLYGON ((58.000 9.500, 57.000 9.500, 56.000 9...
3 POLYGON ((249.000 21.500, 248.000 21.500, 247....
4 POLYGON ((268.000 21.500, 267.000 21.500, 266....
nuclei_polygons.plot(facecolor="grey", edgecolor="black")
plt.gca().invert_yaxis()
../_images/6915e114032e0093d18e3bb8157133ab571bc6f43a75b69523a552d2c2821ae1.png

Molecule data

For now, we will generate random transcript coordinates to demonstrate how to format them into a SpatialData object.

Note

In practice, your transcript coordinates need columns for x, y, and gene. We will name them x, y, and feature_name respectively.

p_x = np.random.randint(0, 300, size=10000)
p_y = np.random.randint(0, 200, size=10000)
molecules = pd.DataFrame(dict(x=p_x, y=p_y))
molecules["feature_name"] = np.random.choice(["MALAT1", "ACTB", "BRCA1"], size=10000)
molecules.head()
x y feature_name
0 209 154 MALAT1
1 33 178 MALAT1
2 264 11 BRCA1
3 299 31 ACTB
4 101 82 ACTB

Plot the cell and nuclear boundaries, and the transcript coordinates to ensure they are correctly formatted.

fig, ax = plt.subplots()
sns.scatterplot(
    data=molecules, x="x", y="y", hue="feature_name", ax=ax, s=2, legend=False
)
cell_polygons.plot(ax=ax, facecolor="none", edgecolor="black", lw=0.5)
nuclei_polygons.reset_index().plot(
    ax=ax, column="index", edgecolor="black", cmap="Reds", lw=0.5
)
plt.gca().invert_yaxis()
../_images/f3346dc99c00ad9a522bca6e3cff1c0dbf9b47e163a21301b62f0691a427b73f.png

Save SpatialData object

Finally, we will save the data as a SpatialData object. We use each element type’s parse() method to guarantee that we have formatted the data correctly.

Note

The SpatialData object is saved in the Zarr format. This is a compressed, high-performance storage format that is compatible with SpatialData and Bento. When using Bento, you must run bt.io.prep() after loading the SpatialData object to ensure compatibility.

shapes = dict(
    cell_boundaries=ShapesModel.parse(cell_polygons),
    nucleus_boundaries=ShapesModel.parse(nuclei_polygons),
)
points = dict(
    transcripts=PointsModel.parse(
        molecules, coordinates={"x": "x", "y": "y"}, feature_key="feature_name"
    )
)
sdata = sd.SpatialData(
    shapes=shapes,
    points=points,
)

sdata
SpatialData object with:
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 3) (2D points)
└── Shapes
      ├── 'cell_boundaries': GeoDataFrame shape: (54, 1) (2D shapes)
      └── 'nucleus_boundaries': GeoDataFrame shape: (47, 1) (2D shapes)
with coordinate systems:
▸ 'global', with elements:
        transcripts (Points), cell_boundaries (Shapes), nucleus_boundaries (Shapes)

Save to disk.

sdata.write("tutorial.zarr")

prep() for Bento compatibility

Always run bt.io.prep() on SpatialData objects for compatibility with Bento.

%load_ext autoreload
%autoreload 2
import bento as bt

bt.io.prep(sdata)
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
SpatialData object with:
├── Points
│     └── 'transcripts': DataFrame with shape: (<Delayed>, 5) (2D points)
├── Shapes
│     ├── 'cell_boundaries': GeoDataFrame shape: (54, 3) (2D shapes)
│     └── 'nucleus_boundaries': GeoDataFrame shape: (47, 2) (2D shapes)
└── Table
      └── AnnData object with n_obs × n_vars = 54 × 3
    obs: 'cell_boundaries', 'region'
    uns: 'spatialdata_attrs': AnnData (54, 3)
with coordinate systems:
▸ 'global', with elements:
        transcripts (Points), cell_boundaries (Shapes), nucleus_boundaries (Shapes)