Tech Talk

TileDB Carrara x Snowflake: From fragmented to unified data | December 11th Learn more

News

TileDB x Databricks Partner to Power Multimodal Data for Agentic AI in Healthcare + Life Sciences. Read the news

2 min read

Announcement
Data Management
Life Sciences

TileDB Announces Integration with Snowflake to Launch TileDB Carrara to Bridge Scientific and Multimodal Data in AI Data Cloud

Originally published: Dec 17, 2025

Table Of Contents:

The Multimodal Data Challenge in Precision Medicine

Why TileDB and Databricks Together

Use Case 1: Distributed Single-Cell Analysis with NVIDIA RAPIDS

Results and Visualization

Broader Implications

Use Case 2: AI-Powered Tertiary Analysis of Cancer Genomics Data

Looking Forward

The Multimodal Data Challenge in Precision Medicine

Precision medicine represents a paradigm shift in healthcare, moving away from one-size-fits-all treatments toward personalized care tailored to individual patients. At the heart of this transformation lies a fundamental challenge: the need to integrate and analyze diverse types of biomedical data at unprecedented scale.

However, managing multimodal data presents significant technical obstacles. Traditional database systems struggle with the complexity and scale of genomic arrays, multi-gigabyte medical images, and high-dimensional single-cell datasets. Data often remains siloed across different platforms, making it difficult for researchers and clinicians to draw connections across modalities. The infrastructure required to store, query, and analyze these diverse datasets typically becomes prohibitively expensive and slow, limiting the pace of discovery and clinical implementation.

Why TileDB and Databricks Together

The partnership between TileDB and Databricks addresses these challenges by combining complementary strengths into a unified platform for multimodal precision medicine.

TileDB brings specialized capabilities for managing complex, high-dimensional scientific data. Its omnimodal intelligence platform excels at structuring and governing diverse data types—from genomic variants and single-cell RNA sequencing to medical imaging and clinical records. TileDB's native format enables high-performance querying of multi-dimensional arrays, allowing researchers to slice and dice massive datasets efficiently. The platform's integrated catalog provides unified governance across all modalities, with comprehensive access controls and audit logging.

Databricks contributes enterprise-grade data processing power and AI capabilities. As the creator of Apache Spark, Delta Lake, and MLflow, Databricks offers the industry's most performant compute engine for large-scale data analytics. The Databricks Lakehouse architecture unifies data warehousing and data lake capabilities, while Unity Catalog provides enterprise governance. Perhaps most importantly, Databricks' AI capabilities—including support for machine learning workflows, model serving, and agentic systems like Genie—enable advanced analytics on top of the integrated data.

The integration between these platforms creates a powerful synergy. TileDB and Databricks share catalog metadata through Unity Catalog and Delta Share, providing a unified view of all datasets regardless of where they physically reside. Compute workloads can be distributed across both platforms, with TileDB task graphs orchestrating workflows that leverage Databricks Photon clusters for intensive processing. Researchers can access genomic data stored in TileDB format directly from Databricks notebooks, or query Databricks tables from within TileDB's interface. This flexibility allows teams to use the right tool for each job while maintaining consistent governance and reproducibility.

Use Case 1: Distributed Single-Cell Analysis with NVIDIA RAPIDS

To demonstrate the power of this integration, we'll walk through a real-world example: performing submanifold analysis on single-cell RNA sequencing data using NVIDIA Rapids acceleration on Databricks Photon clusters, all orchestrated through TileDB.

The Scientific Context

Single-cell RNA sequencing generates massive datasets, with modern experiments often profiling millions of individual cells. Analyzing these datasets typically involves dimensionality reduction techniques like PCA and UMAP to identify cell populations and states. These computations are resource-intensive, and researchers often want to process multiple tissue types or conditions in parallel. NVIDIA RAPIDS provides GPU-accelerated versions of common single-cell analysis tools, dramatically speeding up these workflows.

The Technical Implementation

The demo leveraged the Chan Zuckerberg Initiative's Cell Census, a public dataset containing over 100 million cells. This dataset was registered in TileDB Carrera's catalog, providing governed access with full audit logging.

Data Access and Preparation

The workflow began by connecting to both TileDB and a Databricks cluster from a Jupyter notebook. Using TileDB's SOMA (Stack of Matrices, Annotated) API, the analysis queried the Cell Census for a specific subset: human macrophages across six tissue types—blood, bone marrow, liver, lung, small intestine, and spleen.

One of TileDB's key advantages became apparent immediately. Rather than downloading the entire massive dataset, the query retrieved only the relevant cells and genes needed for the analysis. TileDB's array format enables selective reads at a granular level, dramatically reducing data transfer and improving performance.

Flexible Compute Execution

The analysis function was designed to work in multiple execution contexts. When run locally in the notebook without GPU access, it automatically used standard CPU-based tools like Scanpy. However, when dispatched to the Databricks Photon cluster with NVIDIA Rapids support, the same code automatically leveraged GPU acceleration—a testament to the flexibility of the integration.

The core analysis performed standard single-cell preprocessing: quality control metric calculation, filtering, normalization, scaling, principal component analysis, nearest neighbor computation, UMAP projection, and clustering. With Rapids, these operations ran significantly faster than CPU-only implementations.

Distributed Processing

The real power emerged when scaling the analysis across all six tissue types simultaneously. Using Spark's mapInPandas function, the workflow distributed tissue-specific analyses across the Databricks cluster. Each worker node:

  1. 1

    Retrieved the relevant subset of cells from TileDB for its assigned tissue

  2. 2

    Converted the data to AnnData format (the standard for single-cell analysis)

  3. 3

    Executed the full analysis pipeline using NVIDIA Rapids

  4. 4

    Returned results for aggregation and visualization

This parallelization was seamless—the same analysis code ran across all tissues without modification. The Databricks environment handled job scheduling, resource allocation, and fault tolerance automatically.

Alternative Orchestration with Task Graphs

The demo also showed an alternative approach using TileDB's native task graph system for orchestration. From within TileDB, the workflow submitted jobs directly to the Databricks cluster for each tissue type. This demonstrated the bidirectional nature of the integration—compute can be initiated from either platform and leverage resources from both.

Unified Governance and Auditing

Throughout the workflow, both platforms maintained comprehensive audit logs. Every data access, query execution, and compute job was tracked with user attribution and timestamps. This audit trail is essential for regulated environments like clinical research and drug development, where reproducibility and compliance requirements are stringent.

Results and Visualization

The distributed analysis completed efficiently, returning UMAP embeddings and cluster assignments for macrophages from all six tissues. These results could be visualized directly in the notebook, revealing tissue-specific macrophage populations and states. The workflow demonstrated how complex, computationally intensive analyses can be made accessible to researchers without requiring deep expertise in distributed computing infrastructure.

Broader Implications

This single-cell analysis example illustrates principles that apply across precision medicine use cases:

Scalability: The architecture handles datasets from single samples to population-scale studies with millions of participants

Flexibility: Researchers can use familiar tools and APIs while transparently leveraging powerful distributed compute resources

Integration: Multimodal data from different sources and formats can be brought together for joint analysis

Governance: Consistent access controls, audit logging, and data lineage tracking across all data types and compute environments

Performance: Specialized optimizations for scientific data formats combined with enterprise-grade compute infrastructure deliver the speed needed for interactive analysis and large-scale batch processing

Use Case 2: AI-Powered Tertiary Analysis of Cancer Genomics Data

Mark's demonstration showcased another powerful application of the TileDB-Databricks integration: using AI agents to perform tertiary analysis on cancer genomics data from The Cancer Genome Atlas (TCGA).

Seamless Data Integration via Delta Share

The workflow began with connecting TileDB-managed genomic data to Databricks through Delta Share, an open protocol for secure data sharing. From the Databricks Unity Catalog interface, Mark imported a share file generated by TileDB containing oncology datasets, including SOMA analysis of genomic variants.

This integration required no data copying or movement—Databricks accessed the TileDB data through a zero-copy mechanism. The datasets appeared directly in the Databricks catalog alongside native Delta tables, with full metadata and schema information visible. This unified view meant analysts could work with genomic data from TileDB and clinical data in Databricks tables without switching contexts or tools.

Natural Language Queries with Databricks Genie

The most striking aspect of the demo was the use of Databricks Genie, an AI-powered business intelligence tool that enables natural language querying of complex datasets. Mark configured a Genie Room—a curated data environment—that included case diagnoses, patient demographics, exposure data from TCGA, and the TileDB genomic datasets.

With this setup, researchers could ask sophisticated analytical questions in plain English:

  • "What are these tables and how are they connected?"

  • "How does the average age at diagnosis vary across different tumor grades?"

  • "What is the average age at diagnosis for each primary diagnosis?"

Behind the scenes, Genie performed text-to-SQL translation, automatically generating optimized queries that joined data across multiple sources, including the TileDB genomic data. The system then executed these queries and automatically generated visualizations—charts and graphs that revealed patterns in the data.

Democratizing Genomic Analysis

This capability represents a fundamental shift in who can perform genomic analysis. Traditionally, querying variant data or correlating genomic features with clinical outcomes required expertise in bioinformatics tools, query languages, and data structures. With the TileDB-Databricks integration and AI agents:

  • Clinical researchers without programming expertise can explore genomic associations

  • Data analysts can incorporate genomic features into their analyses without specialized training

  • Cross-functional teams can iterate on hypotheses much faster

  • The barrier between "bioinformatics experts" and "domain experts" begins to dissolve

Enterprise-Grade Governance

Importantly, this accessibility doesn't compromise security or compliance. Unity Catalog enforced access controls consistently across all datasets. Users could only query data they had permission to access, and every interaction was logged for audit purposes. This governance model is essential for working with sensitive patient data under regulations like HIPAA.

Tertiary Analysis at Scale

The demo focused on exploratory analysis, but the same infrastructure supports production tertiary analysis workflows. Researchers can:

  • Identify genomic biomarkers associated with drug response

  • Stratify patient populations based on molecular profiles

  • Discover novel gene-phenotype associations

  • Generate evidence for clinical decision support systems

By combining TileDB's efficient storage and querying of genomic arrays with Databricks' AI capabilities and compute power, these analyses can operate at population scale while remaining interactive enough for iterative exploration.

Looking Forward

The TileDB and Databricks partnership opens new possibilities for precision medicine research and clinical implementation. As datasets grow larger and more diverse, as AI models become more sophisticated, and as the need for real-time clinical decision support increases, the integrated platform provides a foundation that can evolve with these demands.

From rare disease diagnosis to drug target discovery, from clinical trial optimization to population health management, the ability to seamlessly work with multimodal data at scale is becoming a key differentiator. Organizations leveraging this integrated approach can move faster from data to insights, from research to clinical impact, ultimately delivering on the promise of truly personalized medicine.

Meet the authors