Back

Aug 23, 2023

TileDB 101: Single Cell

Genomics
Single Cell
21 min read
Aaron Wolen

Aaron Wolen

Principal Software Engineer

In this article I will cover the basics of TileDB’s support for single-cell data. At its core, TileDB is a database system architected around multi-dimensional arrays. This powerful and flexible data structure allows us to easily build any special-purpose database system to capture demanding use cases with complex data. Single-cell sequencing is one of these use cases.

Our single-cell solution is the product of an amazing collaboration with the Chan Zuckerberg Initiative (CZI). Together the TileDB and CZI teams have jointly developed:

  • SOMA (Stack Of Matrices, Annotated), an open, language-agnostic data model for representing stored collections of annotated matrices and derived results, coupled with a language-agnostic API specification for interacting with the data model.
  • TileDB-SOMA, an open-source (MIT License) implementation of the SOMA API based on the TileDB multi-dimensional array storage engine. TileDB-SOMA includes both Python and R packages that provide interoperability with popular single-cell toolkits like ScanPy, Seurat, and Bioconductor.

From a high level TileDB-SOMA is:

  • cloud native and highly optimized for cost-effective data storage and retrieval
  • highly scalable, proven to handle tens of millions of cells
  • language-agnostic, eliminating framework-specific silos and enabling seamless interoperability and sharing
  • multi-modal, with out-of-the-box support for storing multiple measurements per experiment

This article covers the basic Python and R functionality of TileDB-SOMA. You can run all examples locally on your machine, or launch the following notebooks directly on TileDB Cloud for convenience (sign up and we’ll give you free credits to get started!):

At the end of the article, I briefly cover how you can scale your analysis and securely share your single-cell datasets and work on TileDB Cloud. Enjoy!

Dataset

The dataset we’ll be using contains RNA data for 2,700 peripheral blood mononuclear cells (PBMC) from a healthy donor. The raw data was generated by 10X Genomics and is available here. The version of the dataset we’ll be using was processed with this scanpy notebook.

Note

We’re using a small and familiar dataset for the purpose of this tutorial. However, TileDB-SOMA has proven performant with massive datasets like the CZ CELLxGENE Discover Census, which comprises over 33 million cells and 60 thousand genes as of this writing.

TileDB-SOMA in Python

Setup

The tiledbsoma Python package is available from both PyPI and Conda, and can be installed via pip or conda/mamba. Here, we’ll create a new Conda environment and install packages we’ll need using mamba:

mamba create -n tiledbsoma-101 -c tiledb -c conda-forge tiledbsoma-py
mamba activate tiledbsoma-101

Let’s import the tiledbsoma Python package as well as a few other packages we’ll use in this tutorial:

import tiledbsoma
import tiledbsoma.io
import tiledb
import anndata as ad
import scanpy as sc

tiledbsoma.show_package_versions()
tiledbsoma.__version__        1.4.0
TileDB-Py tiledb.version()    (0, 22, 2)
TileDB core version           2.16.2
libtiledbsoma version()       libtiledb=2.16.2
python version                3.10.12.final.0
OS version                    Darwin 22.5.0

The standard tiledb Python package provides access to TileDB’s virtual filesystem (VFS), which allows you to interact with data on local disk, S3, and TileDB Cloud using the same API. We’ll use it below to read data from an H5AD file stored in S3.

cfg = tiledb.Config({"vfs.s3.no_sign_request": True})
vfs = tiledb.VFS(config=cfg)

Ingest

Using TileDB’s VFS we can read the H5AD directly from S3 and load it into memory using the AnnData package:

H5AD_URI = "s3://tiledb-inc-demo-data/singlecell/h5ad/pbmc3k_processed.h5ad"

with vfs.open(H5AD_URI) as h5ad:
    adata = ad.read_h5ad(h5ad)

adata
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_draw_graph_fr', 'X_pca', 'X_tsne', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

Inspecting the adata object, we can see that in addition to the expression data, cell-level annotations in obs, and feature-level annotations in var, it also contains analysis results in obsm, varm, obsp, and uns. All of these components can be ingested into a SOMA experiment by passing the anndata object to the tiledbsoma.io.from_anndata.

The experiment_uri argument is a URI that points to the location where the SOMA experiment will be created. In this case, we’re using a local directory, but it could just as easily be an S3 bucket or a TileDB Cloud URI.

tiledbsoma.io.from_anndata(
    experiment_uri="soma-exp-pbmc3k",
    measurement_name="RNA",
    anndata=adata
)
'soma-exp-pbmc3k'

The file path "soma-exp-pbmc3k" now points to a local directory containing the SOMA experiment, which itself is a collection of TileDB groups and arrays, each of which contains a different component of the original dataset, organized according to the SOMA data model.

Accessing SOMA components

We can open the new SOMA experiment in read mode to view its structure:

experiment = tiledbsoma.Experiment.open("soma-exp-pbmc3k")
experiment
<Experiment 'soma-exp-pbmc3k' (open for 'r') (2 items)
    'ms': 'soma-exp-pbmc3k/ms' (unopened)
    'obs': 'soma-exp-pbmc3k/obs' (unopened)>

Note that opening a SOMAExperiment (or any SOMA object) only returns a pointer to the object on disk. No data is actually loaded into memory until it’s requested.

The top-level of the experiment contains 2 elements: obs, a SOMA DataFrame containing the cell annotations, and ms, a SOMA Collection of the measurements (e.g., RNA) in the experiment.

We can access the obs array directly with:

experiment.obs
<DataFrame 'soma-exp-pbmc3k/obs' (open for 'r')>

Other elements are nested within the experiment according to the SOMA data model but can be accessed in a similar way. For example, feature-level annotations are stored in the var array, which is located at the top-level of each SOMA Measurement. For this dataset, we have a single measurement, RNA, whose var we can access with:

experiment.ms["RNA"].var
<DataFrame 'soma-exp-pbmc3k/ms/RNA/var' (open for 'r')>

Reading and filtering data

All SOMA objects provide a read method for loading the data into memory. Designed with large datasets in mind, these methods always return an iterator, allowing data to be loaded in chunks intelligently sized by TileDB to accommodate the available memory, and efficiently materialize the results as Arrow Tables, leveraging zero-copy memory sharing where possible.

In this case, since we’re looking at a smaller dataset, we’re using the concat method to automatically concatenate the results into a single Arrow Table before converting to a Pandas DataFrame:

experiment.obs.read().concat().to_pandas()
      soma_joinid            obs_id  ...  n_counts          louvain
0               0  AAACATACAACCAC-1  ...    2419.0      CD4 T cells
1               1  AAACATTGAGCTAC-1  ...    4903.0          B cells
2               2  AAACATTGATCAGC-1  ...    3147.0      CD4 T cells
3               3  AAACCGTGCTTCCG-1  ...    2639.0  CD14+ Monocytes
4               4  AAACCGTGTATGCG-1  ...     980.0         NK cells
...           ...               ...  ...       ...              ...
2633         2633  TTTCGAACTCTCAT-1  ...    3459.0  CD14+ Monocytes
2634         2634  TTTCTACTGAGGCA-1  ...    3443.0          B cells
2635         2635  TTTCTACTTCCTCG-1  ...    1684.0          B cells
2636         2636  TTTGCATGAGAGGC-1  ...    1022.0          B cells
2637         2637  TTTGCATGCCTCAC-1  ...    1984.0      CD4 T cells

[2638 rows x 6 columns]

One of the most useful features of SOMA is the ability to efficiently filter and select only the data necessary for your analysis without loading the entire dataset into memory first. The read methods offer several arguments to access a specific subset of data:

  • coords to slice by coordinates from the underlying TileDB array’s indexed dimension(s)
  • value_filter to filter by values using a query string
  • column_names to select the specific columns that should be returned

In this example we’re loading in the first 100 records from obs with at least 2,000 detected reads, and only accessing two columns of interest.

experiment.obs.read(
    coords=[slice(0, 99)],
    value_filter="n_counts > 2000",
    column_names=["obs_id", "n_counts"],
).concat().to_pandas()
              obs_id  n_counts
0   AAACATACAACCAC-1    2419.0
1   AAACATTGAGCTAC-1    4903.0
2   AAACATTGATCAGC-1    3147.0
3   AAACCGTGCTTCCG-1    2639.0
4   AAACGCACTGGTAC-1    2163.0
..               ...       ...
68  AAGATGGAGATAAG-1    4019.0
69  AAGATTACAACCTG-1    2117.0
70  AAGATTACAGATCC-1    4058.0
71  AAGATTACCCGTTC-1    2762.0
72  AAGATTACCGCCTT-1    3062.0

[73 rows x 2 columns]

Experiment-level queries

The real power of the SOMA API comes from the ability to slice and filter measurement data based on the cell- and feature-level annotations stored in the experiment. For datasets containing millions of cells, this means you can easily access expression values for cells within a specific cluster, or that meet a certain quality threshold, etc.

In this example we’re interested in cells assigned to a particular Louvain cluster where mitochondrial DNA makes up less than 5% of the total reads. We’re also only interested in genes detected in at least 100 cells.

query = experiment.axis_query(
    measurement_name="RNA",
    obs_query=tiledbsoma.AxisQuery(
        value_filter="louvain == 'CD4 T cells' and percent_mito < 0.02",
    ),
    var_query=tiledbsoma.AxisQuery(
        value_filter="n_cells > 100",
    ),
)

The returned query object allows us to inspect the query results and selectively access the data we’re interested in. For example, we can see how many cells and genes were returned by the query:

{"cells": query.n_obs, "genes": query.n_vars}
{'cells': 695, 'genes': 1035}

We can also load the expression data into memory for the selected cells and genes as an Arrow Sparse Tensor.

query.X(layer_name="data").coos().concat()
<pyarrow.SparseCOOTensor>
type: float
shape: (2147483646, 2147483646)

Note

Note the shape of the returned tensor corresponds to the capacity of the underlying TileDB array. By default, SOMA creates arrays with plenty of room for adding new data down the road.

Importantly, we can also materialize all of the relevant query results as a new AnnData object:

pbmc3k = query.to_anndata(X_name="data")
pbmc3k
AnnData object with n_obs × n_vars = 695 × 1035
    obs: 'soma_joinid', 'obs_id', 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'soma_joinid', 'var_id', 'n_cells'

And with this, we can now leverage the full suite of analysis and visualization methods provided by Scanpy.

sc.pl.violin(pbmc3k, ['n_counts', 'n_genes', 'percent_mito'], jitter=0.4, multi_panel=True)

single-cell-101_fig001-t.png

As mentioned above, the SOMA API is designed to be language-agnostic, which means that the same SOMA objects can be accessed using the Python or R tiledbsoma packages. Now we’ll demonstrate how to access the same SOMA experiment we created above using the R package.

TileDB-SOMA in R

The tiledbsoma R package will be available on CRAN shortly. In the meantime, it’s currently available from both R-universe and Conda.

For simplicity’s sake we’ll continue using Conda for this tutorial. With the tiledbsoma-101 environment activated, we can install the required packages with the following command:

mamba install \
  -c tiledb -c bioconda -c conda-forge \
  r-tiledbsoma r-seurat bioconductor-singlecellexperiment

Let’s load the R packages we’ll be using.

library(tiledb)
library(tiledbsoma)
suppressPackageStartupMessages(library(Seurat))

show_package_versions()
tiledbsoma:    1.4.0
tiledb-r:      0.20.3
tiledb core:   2.16.2
libtiledbsoma: libtiledb=2.16.2
R:             R version 4.2.3 (2023-03-15)
OS:            macOS Big Sur ... 10.16

Accessing SOMA components

We can open the SOMA experiment in read mode to view its structure:

experiment <- SOMAExperimentOpen("soma-exp-pbmc3k")
experiment
<SOMAExperiment>
  uri: soma-exp-pbmc3k 
  arrays: obs 
  groups: ms 

We can access the obs array directly with:

experiment$obs
<SOMADataFrame>
  uri: soma-exp-pbmc3k/obs 
  dimensions: soma_joinid 
  attributes: obs_id, n_genes, percent_mito, n_counts, louvain 

Let’s access the var array within the experiment’s single RNA measurement:

experiment$ms$get("RNA")$var
<SOMADataFrame>
  uri: soma-exp-pbmc3k/ms/RNA/var 
  dimensions: soma_joinid 
  attributes: var_id, n_cells 

Reading and filtering data

Just like in Python, all R SOMA objects provide a read method that returns an iterator for loading the data into memory in chunks. Here too we can call the concat() method to automatically concatenate the results into a single Arrow Table before converting to a data frame.

experiment$obs$read()$concat()$to_data_frame()
# A tibble: 2,638 × 6
  soma_joinid obs_id           n_genes percent_mito n_counts louvain          
                                                 
 1           0 AAACATACAACCAC-1     781      0.0302      2419 CD4 T cells      
 2           1 AAACATTGAGCTAC-1    1352      0.0379      4903 B cells          
 3           2 AAACATTGATCAGC-1    1131      0.00890     3147 CD4 T cells      
 4           3 AAACCGTGCTTCCG-1     960      0.0174      2639 CD14+ Monocytes  
 5           4 AAACCGTGTATGCG-1     522      0.0122       980 NK cells         
 6           5 AAACGCACTGGTAC-1     782      0.0166      2163 CD8 T cells      
 7           6 AAACGCTGACCAGT-1     783      0.0382      2175 CD8 T cells      
 8           7 AAACGCTGGTTCTT-1     790      0.0310      2260 CD8 T cells      
 9           8 AAACGCTGTAGCCA-1     533      0.0118      1275 CD4 T cells      
10           9 AAACGCTGTTTCTG-1     550      0.0290      1103 FCGR3A+ Monocytes
# ℹ 2,628 more rows

We also have access to the same filtering capabilities in R:

experiment$obs$read(
  coords = 0:99L,
  value_filter = "n_counts > 2000",
  column_names = c("obs_id", "n_counts")
)$concat()$to_data_frame()
# A tibble: 73 × 2
   obs_id           n_counts
                  
 1 AAACATACAACCAC-1     2419
 2 AAACATTGAGCTAC-1     4903
 3 AAACATTGATCAGC-1     3147
 4 AAACCGTGCTTCCG-1     2639
 5 AAACGCACTGGTAC-1     2163
 6 AAACGCTGACCAGT-1     2175
 7 AAACGCTGGTTCTT-1     2260
 8 AAACTTGAAAAACG-1     3914
 9 AAACTTGATCCAGA-1     2388
10 AAAGAGACGAGATA-1     2410
# ℹ 63 more rows

Now we’ll construct our experiment-level query to filter the cells and genes according to the criteria detailed above:

query <- experiment$axis_query(
    measurement_name = "RNA",
    obs_query = SOMAAxisQuery$new(
        value_filter = "louvain == 'CD4 T cells' & percent_mito < 0.02"
    ),
    var_query = SOMAAxisQuery$new(
        value_filter = "n_cells > 100"
    )
)

And the returned R6 query object provides access to the same suite of methods for accessing data or inspecting the results:

c("cells" = query$n_obs, "genes" = query$n_vars)
cells genes 
  695  1035 

From here you could use the to_sparse_matrix() method to easily load query results for any matrix-like data as a sparse dgTMatrix (from the Matrix package). At a minimum, you need to pass a collection (e.g., X, or obsm) and layer (e.g., data). You can also populate the matrix dimension names by specifying which obs column contains the values to use for row names and which var column contains the values to use for column names.

mat <- query$to_sparse_matrix(
  collection = "X",
  layer_name = "data",
  obs_index = "obs_id",
  var_index = "var_id"
)
mat[1:10, 1:5]
10 x 5 sparse Matrix of class "dgTMatrix"
                    TNFRSF4     CPSF3L    C1orf86       RER1   TNFRSF25
AAACATTGATCAGC-1 -0.3768875 -0.2950843 -0.5209721  1.3326473 -0.3093624
AAACGCTGTAGCCA-1  4.8617630 -0.2305487  2.8116059 -0.3710031 -0.2219476
AAACTTGATCCAGA-1 -0.3280211 -0.2693628 -0.4601668  1.7225345 -0.2724454
AAAGAGACGAGATA-1 -0.3297788 -0.2700860 -0.4619023 -0.5168150 -0.2735897
AAAGCCTGTATGCG-1  8.5038023 -0.2917450 -0.5083421 -0.5924059 -0.2851624
AAAGTTTGTAGAGA-1 -0.2356792 -0.2264081 -0.3578959 -0.3534963 -0.2078178
AAATCAACACCAGT-1 -0.2129427 -0.2397138 -0.3861018 -0.3995942 -0.2136028
AAATCAACGGAAGC-1 -0.3199450 -0.3112433 -0.5532420  1.1781110 -0.3082560
AAATCAACTCGCAA-1 -0.2838825 -0.2883788 -0.4996897 -0.5791501  5.4258475
AAATGTTGCCACAA-1  4.1355896 -0.2447071 -0.4057947 -0.4264639 -0.2530961

Using tiledbsoma’s Seurat support, we can materialize the query results as a Seurat object:

pbmc3k <- query$to_seurat(
  X_layers = c(data = "data"),
  obs_index = "obs_id",
  var_index = "var_id"
)

pbmc3k
An object of class Seurat 
1035 features across 695 samples within 1 assay 
Active assay: RNA (1035 features, 0 variable features)
 4 dimensional reductions calculated: umap, tsne, pca, draw_graph_fr

And with this, we can now leverage the full suite of analysis and visualization methods provided by Seurat.

Seurat::VlnPlot(pbmc3k, features = c("n_counts", "n_genes", "percent_mito"))

single-cell-101_fig002_t.png

You can even create a Bioconductor SingleCellExperiment object from the same query:

query$to_single_cell_experiment(
  X_layers = c(logcounts = "data"),
  obs_index = "obs_id",
  var_index = "var_id"
)
class: SingleCellExperiment 
dim: 1035 695 
metadata(0):
assays(1): logcounts
rownames(1035): TNFRSF4 CPSF3L ... SUMO3 PRMT2
rowData names(1): n_cells
colnames(695): AAACATTGATCAGC-1 AAACGCTGTAGCCA-1 ... TTTCCAGAGGTGAG-1
  TTTGCATGCCTCAC-1
colData names(4): n_genes percent_mito n_counts louvain
reducedDimNames(4): UMAP TSNE PCA DRAW_GRAPH_FR
mainExpName: RNA
altExpNames(0):

Cross-language interoperability

Everything we’ve shown in this tutorial has used the SOMA APIs. However, TileDB-SOMA’s interoperability goes beyond just R and Python because, as I mentioned earlier, a SOMA experiment is just a collection of TileDB groups and arrays. This means we can access the underlying data using any of TileDB’s APIs (albeit without the convenience methods provided by SOMA). The list of supported APIs is growing, but currently includes: C++, C, Python, R, Java, Go, C#, Javascript, and even SQL. This is a real boon for developers who want to build applications that leverage SOMA data.

Scaling up with TileDB Cloud

By virtue of building on top of TileDB embedded, TileDB-SOMA automatically inherits the ability to work transparently with cloud object stores. What that means for you is SOMA experiments can be stored in S3 buckets and accessed in the same way as local files, without changing anything in your code other than the URI. Taken together, this provides a clear path forward for simpler data management: centralize your datasets in the cloud and access them from any client, allowing R and Python researchers to collaborate on the same shared datasets.

Of course, properly managing cloud infrastructure brings its own set of challenges, especially when it comes to data governance, security, and compute. To address this, we built TileDB Cloud, a SaaS product that offers critical data management and scalable compute features.

With TileDB Cloud, single-cell datasets can be ingested directly to an S3 bucket you own and automatically registered in TileDB Cloud’s data catalog, which provides powerful search and discovery capabilities. You can easily find your SOMA experiments by visiting the Life Sciences tab and then opening the SOMA view:

SOMA view within TileDB Cloud’s data catalog
SOMA view within TileDB Cloud’s data catalog

From there, you can narrow your search by filtering on the dataset name, description, metadata, or tags. But TileDB Cloud’s capabilities go far beyond data cataloging. It also provides:

  • hosted Jupyter notebooks that can be launched on various instance types with a single click
  • comprehensive data governance and security features, including fine-grained access controls and full audit logging
  • simple (point-and-click) and secure sharing of any asset (data, code, notebooks, ML models, etc) with specific individuals or organizations—or simply make them public
  • a serverless compute platform that can be used to build scalable data pipelines and run distributed analyses

Let’s take a look at some of these features in action. One of the single-cell datasets available in TileDB Cloud is the Tabula Sapiens atlas, which contains nearly half a million cells from 24 different tissues. SOMA experiments for each of the tissue-specific datasets as well as the four functional compartments are publicly available on TileDB Cloud. Below we’re showing the TileDB Cloud overview page for a SOMA experiment that contains the Tabula Sapiens immune compartment:

TileDB Cloud overview page for a SOMA experiment containing the Tabula Sapiens immune component.
TileDB Cloud overview page for a SOMA experiment containing the Tabula Sapiens immune component.

Every asset in TileDB Cloud has a similar overview page, which provides a host of useful information about the asset. In this case we can see the UUID-based tiledb:// URI that can be used to securely access the dataset, any tags that have been added for organization purposes, as well as a README-like description field that provides full Markdown support for documentation. From here we can also explore the contents of this dataset by clicking on the Contents button, which reveals that the SOMA experiment contains two root-level assets: a group called ms (the collection of SOMA measurements) and an array called obs (the data frame containing cell-level annotations). Clicking on obs brings us to the overview page for this TileDB array. Here I’ve clicked on the Schema button to inspect the schema used to store the data:

Inspecting the schema for the obs array within a SOMA experiment on TileDB Cloud.
Inspecting the schema for the obs array within a SOMA experiment on TileDB Cloud.

You can also click Activity to see a detailed history of who accessed this array and when, which is useful for auditing purposes.

If you’d like to learn more about TileDB Cloud please see our previous post: Getting Started with TileDB Cloud in \< 4 Minutes.

Conclusion

The incredible progress of single-cell sequencing technologies has led to an explosion of data. This has created a growing need for scalable, interoperable, and accessible data management solutions. TileDB-SOMA is our proposed solution to these challenges, and we’re excited to share it with the community.

I hope you enjoyed this article covering the basics of the open-source Python and R TileDB-SOMA packages, as well as the benefits of upgrading your work with TileDB Cloud, including secure governance and sharing, easy discoverability, and a powerful serverless compute platform.

We’d love to hear what you think! Feel free to contact us, join our Slack community, or let us know on Twitter and LinkedIn.

Want to see TileDB Cloud in action?
Aaron Wolen

Aaron Wolen

Principal Software Engineer