A conversation with Stephen Pirpinias: Telling a better data story with TileDB’s array architecture

Table Of Contents:

Using bioinformatics to ask better life sciences questions

Engineering better multimodal data analysis with TileDB arrays

Challenges and predictions for agentic AI in 2026

The complexity inherent to multimodal data demands a platform that can organize and analyze different modalities at scale. In this interview, TileDB’s bioinformatics engineer Stephen Pirpinias unpacks how a data architecture built on multi-dimensional arrays elegantly addresses many of the data management problems that life sciences organizations face. As agentic AI prepares to transform the story we tell about all our data, Stephen describes how TileDB is providing the flexible data layer that AI and other cutting edge technologies can use to drive discovery.

Using bioinformatics to ask better life sciences questions

How did your career in bioinformatics lead you to TileDB?

Pirpinias: I've always loved learning things, and programming is a forever course in this discipline. Every day is a different story, which is exciting. I started as a bench scientist in Eric Nestler's laboratory at the Icahn School of Medicine in NYC, where we studied depression and addiction. I quickly realized all the post-docs were dependent on one particular person who analyzed their data, and this really stuck out to me. Eventually, the post-docs spelled out that the future was computational biology and told me that making the pivot was the right move. So I made that move to ask bigger questions.

I found TileDB through a mutual colleague, and the technology stuck out as a good solution to a difficult problem. TileDB-SOMA and TileDB-VCF were designed to address serious problems in the field of computational biology. These were data-hungry applications working across multiple formats without a proper orchestrator in the middle to make sense of it all. Add the sheer size of these files and the specific analyses to be run on them, and the difficulty quickly adds up while demanding a diverse skill set. TileDB allowed us to reshape the data in ways that improved how we worked with it and with each other, which had been difficult to manage efficiently. With TileDB, instead of a bucket of files, you get an array that you can talk to easily, and it supports all kinds of data and possibilities when hooked up to web applications at various institutes.

The second reason is more personal: Half of the TileDB team is Greek, like me. There aren’t many of us in the world working for software companies, much less one with a technology that could drive real change. It’s exciting for me to join with my fellow countrymen to deliver a transformative technology to last a lifetime.

Engineering better multimodal data analysis with TileDB arrays

How are TileDB’s solutions addressing the challenges of multimodal data in a way that other platforms are not?

Pirpinias: The lack of organization in the life science domain is a really tricky problem. You've got a lot of formats and many things are not tables. You can't cram all your multimodal data into a tabular solution expecting it to perform. TileDB's infrastructure allows you to bend that data into multiple shapes and sizes while keeping your priorities in focus. This gives the field a unifying tool to pull all of these data types into one ecosystem so they play nicely with one another.

As I show clients how TileDB’s arrays can solve this problem, I’m finding that education is a critical part of my work. Teaching clients how to use arrays is essentially telling them a story: understanding the context of how the person I’m talking to can use our platform and the problems they want to solve. Sometimes these problems easily fit TileDB’s packaged solutions. For instance, for my post-doc friends who have lots of single-cell data and don’t know what to do with it, TileDB-SOMA is the solution because it’s built to easily manage single-cell data. Personally, I’m more excited when clients have lots of multimodal data they need to work with, from VCF files to CSVs to TSVs, and I get to engineer them a custom solution that fits their infrastructure needs and is performant at the same time.

Tell me about a TileDB project where you engineered a custom solution like this.

Pirpinias: The work we did powering a data visualization platform for Takeda is a great example. I worked heavily with Takeda on developing a library on top of TileDB for a large swath of modalities extracted from public repositories. The goal was to visualize amino acid changes for individual patients in 3D, and they needed a performant solution that could capture multiple measurements of a modality in a single ecosystem that empowered researchers every day.

We constructed their library similarly to a physical library with sections of books by topic. In this case it was by organ: you had skin, lungs, heart, etc. Within each category was a collection of "books" that were arrays. These arrays were individual experiments: mostly SOMAs with a few ordinary arrays for niche cases that felt applicable. This enables researchers to leverage APIs to act on these arrays either sequentially or concurrently. While it depended on your use case, the framework was in place to do a whole lot with this implementation and design.

The result is that Takeda transforms this multimodal dump of data into a coherent structure that can be used for all kinds of purposes and leveraged by the whole organization, which is a big key to its success. And while the system was populating, I was really proud of what the platform was capable of handling at scale. The use case was a perfect fit for TileDB.
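The library layout described in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Takeda implementation and not TileDB's API: the `LIBRARY` dictionary, the experiment names, and the `summarize` helper are all hypothetical, and plain lists of numbers stand in for real arrays.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical catalog: organ "sections" that each hold named "books".
# Each book is a stand-in for one array (one experiment). Names are invented.
LIBRARY = {
    "skin": {"experiment_a": [1.0, 2.0, 3.0], "experiment_b": [4.0, 5.0]},
    "lungs": {"experiment_c": [6.0]},
    "heart": {"experiment_d": [7.0, 8.0, 9.0, 10.0]},
}


def all_books():
    """List every (section, book) pair in catalog order."""
    return [(section, book) for section, books in LIBRARY.items() for book in books]


def summarize(section, book):
    """Hypothetical per-array task: count the values and compute their mean."""
    values = LIBRARY[section][book]
    return (section, book, len(values), sum(values) / len(values))


def run_sequential():
    """Act on every array one after another."""
    return [summarize(section, book) for section, book in all_books()]


def run_concurrent(max_workers=4):
    """Act on every array concurrently; map() keeps results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: summarize(*pair), all_books()))
```

Because each "book" is addressed independently by section and name, the same task can be run one array at a time or fanned out across workers without changing the catalog itself.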

Challenges and predictions for agentic AI in 2026

Agentic AI is shaping up to be the pivotal technology of 2026. What challenges do you see in implementing agentic AI?

Pirpinias: At the end of the day, any agentic AI service is a tool, one based on certain assumptions it was trained to learn. If those assumptions are off or false, do not expect the AI to magically produce useful results. Too many people approach agentic AI from a simplistic point of view that only asks “Is AI good or bad?” To me, this question is less useful than asking “How is AI useful now, and how is it going to be useful in the future?” For example, one performant healthcare AI application today is in radiology. Artificial intelligence trained on pixel data from mammograms and MRIs to predict cancer types is an amazing, real application that is used in practice today to assist radiologists in identifying potential malignancies. And we can’t ignore that the preparation and training of this AI application is what makes these exciting results possible.

As for the future of agentic AI, any service that autonomously synthesizes creates a need to establish ground truths and checks on the system. Otherwise, the model could drift or not perform as expected, but nobody would notice because we blindly trust its "intelligence." Working with Anthropic’s Claude gives you that impression quickly. It's best to carefully observe what the model is producing to protect against hallucinations and inefficient implementations. Far too often, people push code without understanding what the code is actually doing. This can lead to catastrophe for larger code bases. If that’s a serious risk with today’s generative AI, we need to be extra vigilant about transparency with AI agents if we’re going to give them independent control of our infrastructure and software.
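One way to picture the "ground truths and checks" idea is a small gate that runs model-written code against cases with known answers before anyone trusts it. The sketch below is a minimal, hypothetical example: `check_against_ground_truth`, `GROUND_TRUTH`, and the two mean functions are invented for illustration and do not describe any particular product.

```python
def check_against_ground_truth(candidate, cases, tolerance=1e-9):
    """Run candidate on each known case and return the failures.

    An empty list means every case matched its expected value.
    """
    failures = []
    for inputs, expected in cases:
        actual = candidate(*inputs)
        if abs(actual - expected) > tolerance:
            failures.append((inputs, expected, actual))
    return failures


# Hypothetical known-good cases: (positional arguments, expected result).
GROUND_TRUTH = [
    (([1.0, 2.0, 3.0],), 2.0),
    (([10.0],), 10.0),
]


def generated_mean(values):
    """Stand-in for model-written code that is correct."""
    return sum(values) / len(values)


def drifted_mean(values):
    """Stand-in for model-written code with a subtle off-by-one bug."""
    return sum(values) / (len(values) + 1)
```

Here `check_against_ground_truth(generated_mean, GROUND_TRUTH)` returns an empty list, while the drifted version fails both cases, which is the kind of signal that would otherwise go unnoticed.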

As more organizations explore using AI agents, how do you see TileDB helping agentic AI work more effectively?

Pirpinias: One of the most performant applications of agentic AI is enabling chatbot interfaces that can interact with data. This uses agents to empower bench scientists and others who aren’t programmers to ask analytical questions and run complex queries using multimodal data, all without the user having to code.

TileDB’s architecture fits this agentic AI use case perfectly. It provides a catalog of efficient data structures holding your data that need only be queried by an interface to deliver results. This frees up the programmer in the middle to work on more important details than executing simple queries. You enable the agent to empower the scientists while programmers continuously develop. It's an exercise in efficiency. This will be crucial for organizations exploring not only how they can deploy AI agents onto code bases but also how they can work with multimodal data platforms at much larger scales.
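As a rough illustration of a catalog that "need only be queried by an interface," the sketch below shows a dispatcher that lets an agent call a fixed set of named queries. Everything here is an assumption made for the example: the `CATALOG` contents, the tool names, and the request format are hypothetical, no language model is involved, and none of it is TileDB's API.

```python
# Hypothetical catalog: dataset name -> modality plus a few in-memory rows.
CATALOG = {
    "single_cell_counts": {
        "modality": "single-cell",
        "rows": [{"gene": "GENE_A", "count": 12}, {"gene": "GENE_B", "count": 3}],
    },
    "variant_calls": {
        "modality": "genomics",
        "rows": [{"gene": "GENE_A", "count": 1}],
    },
}


def list_datasets(modality=None):
    """Return dataset names, optionally restricted to one modality."""
    return [
        name
        for name, entry in CATALOG.items()
        if modality is None or entry["modality"] == modality
    ]


def filter_rows(dataset, field, minimum):
    """Return rows of a dataset whose field is at least minimum."""
    return [row for row in CATALOG[dataset]["rows"] if row[field] >= minimum]


# Only these named queries are exposed to the agent.
TOOLS = {"list_datasets": list_datasets, "filter_rows": filter_rows}


def handle_request(request):
    """Dispatch an agent request of the form {"tool": name, "args": {...}}."""
    name = request.get("tool")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    return TOOLS[name](**request.get("args", {}))
```

A chat front end would translate a scientist's question into a request such as `{"tool": "filter_rows", "args": {"dataset": "single_cell_counts", "field": "count", "minimum": 10}}`, and the dispatcher rejects anything outside the allowed set of tools.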

Are there any other trends in life sciences research you’re following closely?

Pirpinias: I’m excited to see what TileDB is going to make possible in genomics as well as proteomics and metabolomics, since that ties into my old research field studying the gut microbiome at Mount Sinai. TileDB offers an infrastructure that can layer proteomics and metabolomics data with genomic sequencing data to gain better insights and drive discoveries in gut health. I’m predicting that proteomics and metabolomics are going to undergo an evolution toward low-cost analysis comparable to what made genomic sequencing cheaper. And when researchers need data infrastructure to store and analyze this data at scale, TileDB is going to be that platform.

To learn more about how TileDB helps empower organizations to develop AI agents, contact us.