Breaking Down Data Silos: How TileDB and Snowflake Are Transforming Multimodal Research

Table Of Contents:

The Multimodal Data Problem

From Multimodal to Omnimodal: A Vision for Tomorrow

Snowflake's AI Data Cloud for Healthcare

Breaking Down R&D Silos

Better Together: The Connected App Architecture

How It Works: TileDB's Perspective

How It Works: Snowflake’s Perspective

Three Key Use Cases for the Partnership

The Agentic Future

Key Capabilities Explained

The Bottom Line

The life sciences industry faces a critical challenge: scientific breakthroughs increasingly depend on analyzing diverse data types together, yet the infrastructure to do so remains fragmented, expensive, and complex. TileDB and Snowflake have partnered to solve the multimodal data puzzle that's been holding back precision medicine and drug discovery.

The Multimodal Data Problem

Multimodal data has become the foundation for competitive advantage in life sciences. From genomics and imaging to clinical records and phenotypic data, organizations need to connect these diverse data types to drive decision-making from bench to bedside.

But there's a catch: this data is hard to work with. The web of connections between different data modalities creates expensive infrastructure requirements, fractured tooling, and shadow IT organizations within companies: teams building bespoke solutions that others can't access or understand.

The reality is clear: while most organizations use complex multimodal data types, their data remains spread across multiple environments. This fragmentation makes AI adoption particularly difficult, because agents need unified access to tools and data that are currently scattered across those environments.

From Multimodal to Omnimodal: A Vision for Tomorrow

TileDB introduces the concept of the "omnimodal intelligence platform"—moving beyond today's multimodal approaches to a future where all data types are truly interconnected.

TileDB has built integrations across the AI ecosystem, from TensorFlow and PyTorch to NVIDIA tools, with native vector database functionality built on its multidimensional array foundation. The platform serves as an end-to-end workbench for multi-omics, supporting everything from secondary analysis with Nextflow pipelines to tertiary analysis, where data is combined with information from partner systems like Snowflake.

Snowflake's AI Data Cloud for Healthcare

Snowflake has evolved alongside the data landscape—from structured data into multimodal and unstructured sources, from data engineering into analytics, and now into AI and ML applications.

With over 12,000 customers globally (including 1,400 in life sciences), Snowflake has seen remarkable AI adoption: 60% of customers (nearly 7,000 organizations) have already adopted its AI capabilities. This momentum in life sciences is driven by security, governance, and regulatory compliance (HIPAA, GxP, FedRAMP), plus the ecosystem advantage of many health tech and life tech companies building on or powered by Snowflake.

True scientific innovation comes from bringing together multiple data sources. The key isn't just storing data, but creating optimized ways to integrate it into analytics, machine learning, and AI workflows while adhering to FAIR principles for self-service discovery.

Breaking Down R&D Silos

R&D organizations remain deeply siloed—sometimes by design for regulatory compliance, but often to the detriment of innovation. Even within R&D, teams work with vastly different niche tools and specialized storage systems.

Modern data platforms unlock collaboration across these silos. For example, multi-omic data from clinical trials can feed back into early-stage virtual screening for drug discovery, speeding up the entire process of drug design and development.

Snowflake's AI stack includes enterprise-grade features fully integrated with role-based access control, serverless hosted models from major providers, and native agentic functions like Cortex AI. The platform also democratizes access to multimodal data through native SQL functions, allowing analysts—not just computational scientists—to work with complex imaging and genomic metadata.

Better Together: The Connected App Architecture

The TileDB-Snowflake Connected App enables:

Best-in-class performance by keeping each data type in its optimized system
Catalog synchronization between TileDB's Multimodal Catalog and Snowflake's Horizon Catalog via Iceberg integration
Zero-copy sharing in both directions—no data duplication, no versioning headaches
Flexible deployment supporting many-to-many relationships between TileDB and Snowflake instances

The architecture supports data ingestion from wet labs into both systems in parallel, with users accessing everything through either platform while maintaining complete governance.

How It Works: TileDB's Perspective

Flexible Configuration for Compliance

The connected app offers deployment flexibility crucial for meeting compliance requirements. IT administrators can configure Snowflake integrations at the workspace level (making them available to everyone) or at granular team space levels for specific projects or geographic regions. This allows organizations to have different teams access different Snowflake roles or accounts, maintaining proper governance boundaries while enabling collaboration where appropriate.

Organizations can integrate with single sign-on systems like Okta and Microsoft Entra ID, use SCIM protocol for automatic configuration, or manually configure user accounts and permissions. This flexibility extends to the connected app itself—one team space might access a specific role inside Snowflake with restricted permissions, while a European-based team space accesses a different Snowflake role to maintain geographic data boundaries.
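
The workspace-versus-team-space scoping described above can be pictured as a small lookup: a team space uses its own Snowflake binding if one is configured, and otherwise falls back to the workspace-wide one. The sketch below is only an illustration of that idea. The class names, account names, and role names are hypothetical and are not taken from the connected app's actual configuration format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class SnowflakeBinding:
    """One Snowflake integration: the account and role a scope may use."""
    account: str
    role: str


@dataclass
class WorkspaceConfig:
    # Workspace-level binding, available to everyone unless overridden.
    default: Optional[SnowflakeBinding] = None
    # Team-space overrides, keyed by team space name (hypothetical names).
    team_spaces: Dict[str, SnowflakeBinding] = field(default_factory=dict)

    def resolve(self, team_space: str) -> Optional[SnowflakeBinding]:
        """Return the team-space binding if present, else the workspace default."""
        return self.team_spaces.get(team_space, self.default)


# Illustrative values only: a US default plus a restricted European team space.
config = WorkspaceConfig(
    default=SnowflakeBinding(account="example-us", role="RESEARCH_READ"),
    team_spaces={
        "eu-oncology": SnowflakeBinding(account="example-eu", role="EU_RESTRICTED"),
    },
)

eu_binding = config.resolve("eu-oncology")
fallback_binding = config.resolve("some-other-team")
```

Keeping the override map separate from the default makes the governance boundary explicit: a European team space never silently inherits a role from another region once it has its own entry.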

Unified Catalog Experience

In the TileDB catalog, Snowflake tables appear alongside TileDB arrays, genomic datasets, and notebooks. Users can open a dataset stored in Snowflake and run SQL queries directly from TileDB's interface to preview the data. With one click, they can jump directly to Snowflake's UI to see the same details, tables, and data previews.

The system handles diverse data types seamlessly. A biomedical image from the Open Cancer Image Archive project stored in TileDB format on S3 can be previewed interactively—users can pan and zoom the image live in their browser. All of this data is also available to view and access through the connected app in Snowflake.

The integration works both ways. From Snowflake's Horizon Catalog, users can see the TileDB Iceberg catalog structure exposed through TileDB's catalog API, with all team spaces, genomics, variants, imaging, and single-cell data visible and accessible. TileDB can store arbitrary files, data in the TileDB format, and data in Parquet and other file formats—all exposed through the Iceberg catalog API to systems like Snowflake that support Iceberg as an external data type.

Programmatic Access Through SQL

The Connected App includes SQL functions that enable direct querying of TileDB data from Snowflake. Users can query variant data for specific genes using helper functions, with results returned directly to Snowflake, ready to be passed to other functions.

The same capability extends to imaging data. While viewing image pixels as a table output—seeing X and Y positions and color codes—isn't particularly useful on its own, the power lies in what happens next: the output from a TileDB image query can flow directly into Snowflake's native image analysis functions, seamlessly combining TileDB's specialized storage with Snowflake's ML capabilities.

Building GWAS Workflows

From TileDB Jupyter notebooks, researchers can build comprehensive genome-wide association study (GWAS) workflows that span both platforms. Consider a workflow that accesses the gVCF 1000 Genomes dataset (an 11 TiB dataset from the Open Data Project) alongside Snowflake tables containing phenotypes and Ensembl data.

TileDB offers domain-specific APIs for different modalities of data. For genomic data, the TileDB-VCF package allows researchers to open datasets and select specific fields. This matters because genomic datasets can have hundreds of fields, and TileDB allows subselection of only what's needed. Queries can specify regions (in the typical BED file format) and particular samples, with results returned as pandas DataFrames.
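
As a rough sketch of what such a query might look like, the snippet below converts BED intervals into region strings and passes them, along with a sample list and a small set of fields, to the open-source TileDB-VCF Python package. The dataset URI, sample names, and chosen attributes are placeholders, and the coordinate conversion assumes that region strings are 1-based and inclusive. The `tiledbvcf` import is deferred so the helper can be read and run on its own.

```python
from typing import Iterable, List


def bed_to_regions(bed_lines: Iterable[str]) -> List[str]:
    """Convert BED lines (0-based, half-open) into 'contig:start-end' strings.

    Assumption: the region strings are interpreted as 1-based and inclusive,
    so only the start coordinate is shifted by one.
    """
    regions = []
    for line in bed_lines:
        line = line.strip()
        # Skip blanks, comments, and BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split()[:3]
        regions.append(f"{contig}:{int(start) + 1}-{int(end)}")
    return regions


def read_variants(uri: str, regions: List[str], samples: List[str], attrs: List[str]):
    """Read only the requested fields for the given regions and samples.

    Returns a pandas DataFrame. Requires the third-party tiledbvcf package.
    """
    import tiledbvcf  # imported lazily; not needed for bed_to_regions

    ds = tiledbvcf.Dataset(uri, mode="r")
    return ds.read(attrs=attrs, regions=regions, samples=samples)


example_regions = bed_to_regions(["chr1\t999\t2000", "# a comment", "chr2 0 10 exon1"])
```

A call would then look like `read_variants(uri, example_regions, ["sampleA"], ["sample_name", "contig", "pos_start", "alleles"])`, where `uri` points at a dataset the user is authorized to read.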

From the same notebook, researchers can query Snowflake tables directly. The connected app handles authentication behind the scenes with service accounts, roles, and proper authentication setup, so users can access any authorized data and bring it directly into their analysis workflow.
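
For illustration, a phenotype lookup from the notebook could be sketched as below. This assumes an already-open Snowflake Connector for Python connection (how the connected app supplies it is not shown here), and the table and column names are hypothetical. Sample IDs are passed as bound parameters rather than being formatted into the SQL string.

```python
from typing import List, Sequence, Tuple


def build_phenotype_query(sample_ids: Sequence[str]) -> Tuple[str, List[str]]:
    """Build a parameterized query for the phenotypes of the given samples."""
    if not sample_ids:
        raise ValueError("at least one sample id is required")
    # One %s placeholder per sample id (the connector's default pyformat style).
    placeholders = ", ".join(["%s"] * len(sample_ids))
    sql = (
        "SELECT sample_id, phenotype "
        "FROM phenotypes "  # hypothetical table name
        f"WHERE sample_id IN ({placeholders})"
    )
    return sql, list(sample_ids)


def fetch_phenotypes(conn, sample_ids: Sequence[str]):
    """Run the query on an open Snowflake connection and return a DataFrame."""
    sql, params = build_phenotype_query(sample_ids)
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        return cur.fetch_pandas_all()
    finally:
        cur.close()


example_sql, example_params = build_phenotype_query(["sampleA", "sampleB"])
```

The resulting DataFrame can be joined with the variant DataFrame on the sample identifier, which is the step that brings the two platforms together in a single analysis.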

Distributed Compute with TaskGraphs

The most sophisticated capability involves building large-scale distributed workflows. TileDB's helper functions enable distributed genomic workloads at scale. The innovation lies in "delayed SQL objects"—part of TileDB's TaskGraph API for lazy, large-scale execution. This allows researchers to daisy-chain entire workflows and mix and match data sources and integrations.

Queries to Snowflake can be "delayed" and integrated into larger computational workflows without immediately executing them. This enables workflows that access Snowflake data as part of distributed parallel processing, with visualizations showing the workflow running in parallel and collecting results.
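
The idea of a delayed query is easier to see in miniature. The sketch below is not TileDB's TaskGraph API. It is a standard-library illustration of the same pattern, with stand-in functions in place of real Snowflake and TileDB calls: nodes record what to run, nothing executes until `compute` is called, and independent nodes run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


class Delayed:
    """A deferred call: stores a function and its arguments without running it."""

    def __init__(self, func: Callable[..., Any], *args: Any) -> None:
        self.func = func
        self.args = args


def compute(root: Delayed, max_workers: int = 4) -> Any:
    """Execute the graph under root, running independent nodes in parallel."""
    levels: Dict[int, int] = {}
    nodes: Dict[int, Delayed] = {}

    def visit(node: Delayed) -> int:
        # Level 0 for leaves, otherwise 1 + the deepest input.
        key = id(node)
        if key not in levels:
            deps = [a for a in node.args if isinstance(a, Delayed)]
            levels[key] = 1 + max((visit(d) for d in deps), default=-1)
            nodes[key] = node
        return levels[key]

    visit(root)
    results: Dict[int, Any] = {}

    def run(node: Delayed) -> Any:
        # Replace Delayed arguments with their already-computed results.
        args = [results[id(a)] if isinstance(a, Delayed) else a for a in node.args]
        return node.func(*args)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Nodes on the same level never depend on each other.
        for level in range(levels[id(root)] + 1):
            batch = [n for k, n in nodes.items() if levels[k] == level]
            for node, value in zip(batch, pool.map(run, batch)):
                results[id(node)] = value
    return results[id(root)]


executed = []  # records which stand-in queries actually ran


def run_sql(query: str):
    """Stand-in for a Snowflake query; returns made-up rows."""
    executed.append(query)
    return [{"sample": "S1", "phenotype": 1}, {"sample": "S2", "phenotype": 0}]


def read_genotypes(region: str):
    """Stand-in for a TileDB read; returns made-up genotype counts."""
    return {"S1": 2, "S2": 0}


def join(phenos, genos):
    return [(p["sample"], p["phenotype"], genos[p["sample"]]) for p in phenos]


phenos = Delayed(run_sql, "SELECT sample, phenotype FROM phenotypes")
genos = Delayed(read_genotypes, "chr1:1000-2000")
joined = Delayed(join, phenos, genos)

executed_before_compute = list(executed)  # still empty: nothing has run yet
result = compute(joined)
```

Here the stand-in SQL query and the stand-in genotype read sit on the same level of the graph, so they run concurrently, and the join runs only after both have finished.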

The output from entire GWAS analyses can be stored directly back into Snowflake—either in native tables or Iceberg formats—creating a seamless flow from raw genomic data through analysis to results ready for AI-powered insights.

How It Works: Snowflake’s Perspective

Building Agentic Intelligence

Snowflake's capabilities focus on building conversational AI on top of the integrated data foundation. The architecture includes multiple layers:

The Data Layer combines:

GWAS results from TileDB analyses, stored in Snowflake
Cortex Search services for PubMed and ClinicalTrials.gov, enabling agents to access external scientific research and clinical trial data
Cortex Analyst with semantic models built around GWAS datasets to enable text-to-SQL understanding of genes, variants, diseases, and associations
Custom TileDB tools for querying patient-level data directly from TileDB using VCF APIs

Semantic Models: Making Data AI-Ready

Snowflake's semantic models provide context for AI consumption. These models can be published following FAIR principles, making data accessible for AI-native consumption into agents.

Configuring Agents

Agents are built with specific identities through descriptions, instructions on how to respond, and access to key tools:

1. GWAS results in tabular form in native Snowflake tables (which can also come through Iceberg connections or other TileDB integration methods)
2. Search services tied to the agent for accessing external research
3. Custom TileDB tools using TileDB Carrara APIs to query individual patient-level data slices

This architecture recognizes that when working with clinical data, researchers need to query TileDB directly to get information about patients with specific pathogenic mutations within particular genes or genomic locations. Orchestration instructions guide how agents process and respond to queries.
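
To make the shape of such an agent concrete, here is a minimal sketch of an agent definition with a description, instructions, and three registered tools. It is a generic illustration rather than Snowflake's agent configuration format. All names are hypothetical, and the handlers are stubs that return placeholder records instead of querying Snowflake, a search service, or TileDB. In a real deployment the language model chooses which tool to call, and code like this only dispatches that choice.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Tool:
    name: str
    description: str
    handler: Callable[..., Any]


@dataclass
class Agent:
    description: str   # the agent's identity
    instructions: str  # how it should respond and orchestrate
    tools: Dict[str, Tool] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        if tool.name in self.tools:
            raise ValueError(f"duplicate tool: {tool.name}")
        self.tools[tool.name] = tool

    def call(self, tool_name: str, **kwargs: Any) -> Any:
        """Dispatch a tool call selected by the reasoning layer."""
        if tool_name not in self.tools:
            raise KeyError(f"unknown tool: {tool_name}")
        return self.tools[tool_name].handler(**kwargs)


# Stub handlers: placeholders for the three kinds of tools described above.
def gwas_lookup(gene: str):
    return [{"gene": gene, "source": "gwas_results"}]


def literature_search(query: str):
    return [{"query": query, "source": "literature"}]


def patient_slices(gene: str):
    return [{"gene": gene, "source": "tiledb"}]


agent = Agent(
    description="Genomics research assistant (illustrative)",
    instructions="Prefer GWAS tables for summaries; query TileDB only for patient-level detail.",
)
agent.register(Tool("gwas_lookup", "Structured GWAS results", gwas_lookup))
agent.register(Tool("literature_search", "External research search", literature_search))
agent.register(Tool("patient_slices", "Patient-level slices from TileDB", patient_slices))

patient_rows = agent.call("patient_slices", gene="GENE_X")
```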

Agents in Practice

Through Snowflake Intelligence, users can ask questions in natural language. The agent reasons about how to answer, then retrieves results from appropriate sources. For example, asking about a specific gene triggers the agent to query GWAS study results and present findings conversationally.

More complex workflows demonstrate the true power: querying different genes associated with neurological phenotypes, summarizing pathogenic variants or mutations, then asking the agent to corroborate findings with public research. The agent can determine how specific genes or mutations are involved in investigational or existing drug pathways by orchestrating across multiple tools—structured SQL queries to GWAS results, vector search through scientific literature, and direct API calls to TileDB for patient-level data—all while maintaining governance and security.

Data Stays at Rest

A critical principle guides the joint architecture: genomics data is large, and organizations don't want to copy it or store it in multiple places. Data should stay at rest in its optimized location.

The custom tool architecture allows agents to query TileDB on demand for specific patient-level slices rather than copying entire genomic datasets into Snowflake. This approach maintains performance while respecting the principle of keeping data where it belongs, whether in native Snowflake tables, external stages, or TileDB.

Three Key Use Cases for the Partnership

The connected app enables three primary use cases:

1. Preclinical and target discovery: combining multi-omics data with clinical signals to accelerate drug discovery and development
2. Trusted research environments: extending governance across platforms while respecting geographic boundaries through region-aware deployments
3. Genomic precision medicine: enabling rapid whole genome analysis for bedside diagnoses of rare diseases

These represent just a fraction of potential applications. The flexibility of the architecture supports hundreds of use cases across the drug discovery and precision medicine landscape.

The Agentic Future

The connected app represents the beginning of a broader evolution. The shift from generative AI to agentic AI brings a crucial element for life sciences: determinism. LLMs now serve as reasoning and orchestration layers that speak to sets of tools, combining highly structured tools, unstructured tools, and custom connectors in purpose-built applications for scientific research.

The roadmap extends beyond data and catalog synchronization to support agent-to-agent protocols. Through technologies like MCP (Model Context Protocol), the connected app will evolve to enable cross-access between systems at the agent level, building a true agentic ecosystem for life sciences research.

Key Capabilities Explained

On Data Strategy

The fundamental principle: no data duplication. Even small tables, when duplicated, create versioning problems, lineage issues, and reproducibility challenges. The connected app enables single sources of truth with cross-platform access.

On External Data Access

On Linking Multimodal Data

The connected app architecture directly addresses how to link images to sample information. Biomedical images stored in TileDB can be accessed alongside phenotypes and patient details in Snowflake through cross-database queries. Researchers can perform cohort selection from tables in Snowflake, then access the corresponding images in TileDB for the identified patients, bringing results directly into ML or foundation models running in Snowflake.

This eliminates the need to export data from one system and copy it into another. The workflow remains seamless while maintaining governance and optimized storage for each data type.
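
The two-step pattern (select a cohort from tabular data, then look up the matching images) can be sketched with plain Python data structures. The rows, patient IDs, and image URIs below are made-up placeholders standing in for a Snowflake query result and a TileDB catalog listing. No real connector calls are shown.

```python
from typing import Dict, Iterable, List


def select_cohort(patients: Iterable[dict], phenotype: str) -> List[str]:
    """Return the IDs of patients whose phenotype matches (the Snowflake side)."""
    return [p["patient_id"] for p in patients if p["phenotype"] == phenotype]


def images_for_cohort(cohort: Iterable[str], image_index: Dict[str, str]) -> Dict[str, str]:
    """Map each cohort member to an image URI, skipping patients without one."""
    return {pid: image_index[pid] for pid in cohort if pid in image_index}


# Placeholder rows, as if returned by a cohort query.
patient_rows = [
    {"patient_id": "P001", "phenotype": "case"},
    {"patient_id": "P002", "phenotype": "control"},
    {"patient_id": "P003", "phenotype": "case"},
]

# Placeholder index, as if listed from an image catalog (hypothetical URIs).
image_index = {
    "P001": "tiledb://example/images/P001",
    "P002": "tiledb://example/images/P002",
}

cohort = select_cohort(patient_rows, "case")
cohort_images = images_for_cohort(cohort, image_index)
```

Only references move between the two systems in this sketch: the cohort is a list of IDs and the result is a set of URIs, so the image data itself stays where it is stored.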

The Bottom Line

The TileDB-Snowflake partnership addresses a fundamental challenge in life sciences: how to analyze diverse data types together without creating copies, losing governance, or building expensive bespoke infrastructure.

This isn't theoretical. Researchers can now run GWAS analyses that pull from multi-terabyte genomic datasets and Snowflake tables in the same workflow, store results back to either system, and then surface those findings through conversational AI agents that can query patient-level data on demand—all while maintaining governance and keeping data at rest.

As organizations navigate the complexity of multimodal data and the promise of AI-driven discovery, this partnership demonstrates that the future isn't about choosing between specialized systems or general-purpose platforms. It's about making them work together seamlessly, so scientists can focus on discovery rather than data plumbing.

The connected app transforms what was once a choice between data systems into a collaborative ecosystem where each platform contributes its strengths, governance remains unified, and AI agents can access the full breadth of an organization's research data to accelerate the path from research to clinical impact.

The TileDB-Snowflake connected app is available in private preview. Organizations interested in learning more can reach out to either their TileDB or Snowflake account teams to explore use cases and deployment options.
