A conversation with Aneesa Valentine: Bringing harmony to multimodal data

Table of Contents:

Discovering a passion for using computation to harmonize data

Implementing data infrastructure built for diverse modalities

How data harmonization is enabling the future of life sciences

Before you can identify the right targets for discovery or build sophisticated machine-learning models, you need all your data in a usable format. But while life sciences data has always been diverse, the rise of multimodal data from sources like genomics, transcriptomics and single-cell data has made the process of harmonizing data more complex than ever. To learn more about this challenge and how TileDB’s multidimensional arrays are addressing it, we interviewed Aneesa Valentine, a Solutions Architect on our field engineering team.

Discovering a passion for using computation to harmonize data

How did your career journey lead you to your current role as a Solutions Architect at TileDB?

Valentine: During my Ph.D. program in biomedical research, I discovered I had a passion for computational methods. This led me to specialize in computational genomics and pivot to commercial data science and AI, which gave me an opportunity to develop new methods to solve problems that I hadn’t even known existed. I’m a scientist at heart, so I have a drive to ask the best possible questions.

I also work as an instructor at a nonprofit called Genspace in New York, which is a community biolab that functions as an academic institution. I instruct professionals who want to polish their skills or gain new ones. That's been very fulfilling because I'm super passionate about education, science communication and enabling the next generation of scientists and engineers. What's surprising for me is just how sharp they are. They're grasping methods and concepts that I couldn't even wrap my head around until grad school, and some of them started coding in Python in high school. So I think the next generation will probably be unstoppable.

What appealed to me about joining TileDB was that it was a truly different approach from the tabular databases used all over life sciences. TileDB offered me an opportunity to use my domain expertise in bioinformatics and computational genomics while communicating with diverse audiences like scientists and non-technical teams among our customers, which I enjoy. I sit on the field engineering team, serving as a bridge between the customer and our core engineering team, so 70% of my work is just interfacing with customers to build the best solutions for them.

So when you work with customers, what are their most common bioinformatics and data science struggles? What do they need help with?

Valentine: Most of our customers are large enterprise pharma, and we also have some smaller biotech firms. While these different types of customers face different challenges because of resources and size, a common thread is being able to effectively harmonize data. Whether their ultimate goal is to train models or construct huge single-cell atlases, the bottleneck they all face is harmonizing their data wherever it lives.

What does it mean to harmonize data?

Valentine: It’s a good question. Harmonizing data can sound like a pretty marketing term, but it’s actually very practical and important. It means being able to effectively merge datasets from different modalities. So if you have gene expression data that's associated with a particular set of samples, you might want to merge that with protein data that's associated with that same set of samples. As in a chorus, you want the basses and the altos singing in tune with each other.

This is tougher than it sounds, because life sciences has a lot of modalities and formats.
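The merge described above can be sketched in a few lines of plain Python. This is only an illustration, not TileDB's API: the sample IDs, gene and protein names, values and the `harmonize` helper are all hypothetical, and the sketch assumes both modalities are already keyed by the same sample identifiers.

```python
# Hypothetical per-sample measurements from two modalities.
gene_expression = {
    "sample_1": {"GENE_A": 12.0, "GENE_B": 3.5},
    "sample_2": {"GENE_A": 8.0, "GENE_B": 0.0},
    "sample_3": {"GENE_A": 5.5, "GENE_B": 7.25},
}
protein_abundance = {
    "sample_1": {"PROT_X": 0.75},
    "sample_2": {"PROT_X": 1.5},
    "sample_4": {"PROT_X": 2.0},
}


def harmonize(expression, protein):
    """Inner-join two modalities on the sample IDs they share."""
    shared = sorted(expression.keys() & protein.keys())
    return {
        sample_id: {
            "expression": expression[sample_id],
            "protein": protein[sample_id],
        }
        for sample_id in shared
    }


merged = harmonize(gene_expression, protein_abundance)
for sample_id, record in merged.items():
    print(sample_id, record)
```

Only samples present in both modalities survive the join. In practice, much of the difficulty lies in getting to that point: reconciling identifiers, units and file formats before any merge is possible.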

Implementing data infrastructure built for diverse modalities

So let’s say you were advising a life sciences startup on how to avoid this problem. How would you suggest they architect their data infrastructure from the beginning?

Valentine: I love this question. You should start by defining your research goal. Too many startups want to optimize their tech stack first. That's not a terrible thing to do, but if you run into bottlenecks down the line and only then realize that you haven't defined your research question or the scope of what it will take to answer it, you wind up with a lot of tech debt. Now you have bottlenecks and a lot of money already spent on solutions that you didn't necessarily need for your actual research.

But if you define your biological question and what you're trying to accomplish first, then you can optimize your data infrastructure and plan for the data types you are actually going to need to satisfy your question. Now you’re ready to optimize your tech stack to fit the model of whatever is truly important to your goal.

So you should plan your data infrastructure for the data modalities that are actually relevant to your research. How do TileDB’s multi-dimensional arrays help with managing complex modalities like genomics and single-cell data?

Valentine: First and foremost, performing operations on arrays is more efficient than performing them on a data frame or another traditional structure. For instance, if you have single-cell data and you're trying to get the mean gene expression per cell type across cell populations, it's much more advantageous to do that when your data is formatted as an array, in terms of both speed and memory footprint.

Besides the performance advantage of having your data formatted as an array, there's also a flexibility advantage because any data type can fit into an array. Even if you're doing machine learning applications and you've got embeddings from some sort of data, those embeddings are just one-dimensional vectors. So you could theoretically do whatever you wanted with whatever data type that you had simply by converting it to an array format.
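As a rough illustration of the per-cell-type mean mentioned above, the sketch below treats expression data as a cells-by-genes matrix and averages the rows that share a cell-type label. Plain Python lists stand in for a real array library so the example has no dependencies, and the gene names, labels, values and the `mean_expression_by_cell_type` helper are all hypothetical. It shows the shape of the computation only and makes no claim about performance.

```python
from collections import defaultdict

# Hypothetical cells-by-genes matrix: one row per cell, one column per gene.
genes = ["GENE_A", "GENE_B"]
expression = [
    [2.0, 0.0],  # cell 0
    [4.0, 1.0],  # cell 1
    [1.0, 5.0],  # cell 2
    [3.0, 7.0],  # cell 3
]
# One cell-type label per row of the matrix.
cell_types = ["T cell", "T cell", "B cell", "B cell"]


def mean_expression_by_cell_type(matrix, labels):
    """Return {cell_type: [mean per gene]} by averaging rows with the same label."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(matrix, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        for j, value in enumerate(row):
            sums[label][j] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in totals]
        for label, totals in sums.items()
    }


means = mean_expression_by_cell_type(expression, cell_types)
for label, per_gene in means.items():
    print(label, dict(zip(genes, per_gene)))
```

An array library or array database would express the same reduction as an operation along the cell axis instead of an explicit loop.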

Let’s talk more about the flexibility aspect. Walk me through how multi-dimensional arrays bring structure to data types that are often considered unstructured.

Valentine: I’ll give you a practical example. Say I'm a scientist doing single-cell research and my wet lab folks have done experiments and just sent me the data. Now I have single-cell RNA-seq data and some spatial data in the form of spatial coordinates and images. And if all this research was conducted on a particular tissue type, I also have histopathology images of that tissue. Multi-dimensional arrays help structure this data because you can turn every data type that I just mentioned into an array and effectively query across all those different data types.

So say I have “sample one” and I want to see what the gene expression was like for sample one, what the spatial coordinates were for sample one and where sample one falls on my histopathology slide. Now I can get a really holistic picture of whatever sample one was by converting all of my data modalities into a single data structure: the array. This goes back to your earlier question about what it means to harmonize data. This is it: to be able to get insights across multiple data modalities for a cell type or population of interest and use that analysis to make recommendations. This is the kind of data harmonization that’s going to really drive future discovery.
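The "sample one" lookup can be sketched as follows, assuming each modality is stored as an array whose first dimension is the sample and that a shared index maps sample IDs to positions. Everything here is hypothetical (the index, the values, the tile coordinates and the `query_sample` helper), and nested lists stand in for real arrays. It is not TileDB's query API.

```python
# Shared index: sample ID -> position along the first dimension of every array.
sample_index = {"sample_1": 0, "sample_2": 1}

genes = ["GENE_A", "GENE_B"]
expression = [[12.0, 3.5], [8.0, 0.0]]      # samples x genes
spatial_xy = [[104.5, 88.0], [40.0, 12.5]]  # samples x (x, y)
slide_tile = [[0, 512], [256, 0]]           # samples x (row, col) of a slide tile


def query_sample(sample_id):
    """Slice every modality at the same sample position."""
    i = sample_index[sample_id]
    return {
        "expression": dict(zip(genes, expression[i])),
        "spatial_xy": tuple(spatial_xy[i]),
        "slide_tile": tuple(slide_tile[i]),
    }


print(query_sample("sample_1"))
```

Because every modality shares the sample dimension, one position is enough to pull a consistent slice from all of them.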

How data harmonization is enabling the future of life sciences

Talk more about that. How do you see data harmonization being essential to the future of life sciences?

Valentine: Data harmonization is what makes multiomics integration possible. That’s one of the life sciences research goals I’m most excited about. It will be extremely powerful to be able to easily gather insights across a particular cell type or cell population using all kinds of different data modalities. Historically, science has always moved towards precision. Three or four years ago there was this boom in precision health and sequencing the whole genome as genetic testing got really cheap. The acquisition of these very different and sophisticated data types in multiomics will move us further towards precision, enabling really holistic pictures of health.

How do you see TileDB Carrara contributing to this multiomics-driven future?

Valentine: Again, the bottleneck at so many life sciences firms is data harmonization, and Carrara is built for that. A lot of people are interested in secondary and tertiary analysis, which is the good stuff: analyzing the actual data, generating insights and training models. But the common bottleneck is upstream of that. Nothing happens if you can’t get all your data together into a usable format. This is precisely where TileDB comes in, taking all your different data modalities and putting them into an array that’s performant and flexible. So I look forward to what we will make possible in industry.

Explore how TileDB Carrara harmonizes multimodal data to drive life science discovery.
