A conversation with Aaron Wolen: Fighting the friction of data management in life sciences

Table Of Contents:

Data science designed for the software of life

How the complexity of multimodal data demands a new approach

Streamlining the future of life sciences with TileDB Carrara

Life sciences data management has many forms of friction, from spending days waiting on a query request from the bioinformatics team to struggling to analyze multimodal data using tabular databases. Aaron Wolen faced such data management friction throughout his career as an academic researcher and professor. In his role as Single Cell Product Manager at TileDB, Wolen is always looking for ways to ease friction and simplify collaboration in life sciences data management. In this interview, Wolen describes how his background as a single-cell researcher and bioinformatics engineer inspires his work at TileDB.

Data science designed for the software of life

Why did you become a bioinformatician?

Wolen: Since my electrical engineer dad introduced me to computers at a young age, I’ve been a computer nerd interested in writing simple scripts and automation tools. Toward the end of college, I took a genetics course that really clicked for me. I remember thinking, “Wow, this is like the software of life.” That led to me getting a job in a genetics lab that used early high-throughput sequencers to compare the genomes of different species of Drosophila, tiny fruit flies, which inspired me to pursue a Ph.D. in genetics.

I had every intention of becoming a molecular biologist working in a wet lab. But the lab I joined was early in the adoption of gene expression microarrays, which were one of the first technologies that made it possible to measure the expression of thousands of genes at once. This was huge at the time. They were largely using Excel to analyze the data and it was just so cumbersome. They literally handed me a 400MB Excel spreadsheet at some point. That inspired me to learn R, which was becoming the de facto choice for bioinformatics at the time. And I just fell in love with it. I went really deep into software engineering and what we now think of as data science, specializing in computational biology and bioinformatics. I barely ever touched a pipet for the rest of my graduate career.

So is it fair to say you had an early traumatic experience with tabular databases?

Wolen: Hah, that is a fair summary. But I wasn’t the only one. I was faculty at Virginia Commonwealth University for years, and as technological advances made it easier and easier for small labs to generate large amounts of data, it created a lot of demand for computational expertise. I worked on a lot of interesting projects with different labs, but I was really struck by how much friction there was in the collaboration process. The scientists would just email me new Excel files every week, so the data would be scattered all over the place. There would be lots of little tables in one tab, and important experimental metadata would be denoted only by different highlighting colors. These sorts of things made it very difficult to work with data programmatically, so the whole process was much slower and more error-prone than it needed to be. As data was getting larger and more complex, there was a real lack of data management, which is a critical piece of the puzzle.

After VCU, I joined the University of Tennessee Health Science Center as the director of bioinformatics for their new transplant research institute. We were just starting to dip our toes into single-cell RNA-seq, and sharing the data with collaborators was just so painful. We really wanted to provide a dashboard for the collaborators we were working with that would let them query specific genes or transcripts and generate a few simple plots. But getting approval from IT or the info group made it really difficult to do that, so I was looking for alternatives.

I stumbled across TileDB’s website, which checked a lot of my boxes. I didn't want to have to maintain my own database server, and TileDB was this beautiful embedded data engine that natively handled sparse matrices, which is really important for single-cell data. It worked out of the box with S3, so we could store this data really cheaply on AWS and share it with collaborators. TileDB also let them query it directly using R or Python without having to construct SQL queries or download the whole data set and load it into memory. And this all happened when I was getting a little worn down by the endless grant writing cycle and administrative tasks in academia. I wanted to focus on the software and data science-oriented work I loved, and TileDB was hiring. I was their first life sciences hire, and it was a great opportunity to figure out the different life sciences applications for this exciting technology TileDB had built.
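Editor's illustration: to make the point about sparse matrices concrete, here is a minimal plain-Python sketch of coordinate-style sparse storage. It assumes a made-up 4-cell by 5-gene count matrix; the function names `to_sparse` and `gene_slice` are hypothetical, and this is not TileDB's API or on-disk layout. It only shows why keeping the nonzero entries, and reading one gene without loading the rest, suits single-cell counts.

```python
# Illustrative sketch only: a toy coordinate-style sparse matrix in plain Python.
# This is NOT TileDB's API or storage layout; all names and data are hypothetical.

def to_sparse(dense):
    """Keep only nonzero counts, keyed by (cell_index, gene_index)."""
    return {
        (i, j): value
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != 0
    }


def gene_slice(sparse, gene_index):
    """Return {cell_index: count} for one gene, visiting only stored entries."""
    return {i: v for (i, j), v in sparse.items() if j == gene_index}


# Made-up counts: 4 cells (rows) x 5 genes (columns), mostly zeros,
# which is typical of single-cell expression matrices.
counts = [
    [0, 0, 3, 0, 0],
    [0, 0, 0, 0, 1],
    [2, 0, 0, 0, 0],
    [0, 0, 5, 0, 0],
]

sparse_counts = to_sparse(counts)
total_entries = sum(len(row) for row in counts)

print(f"stored {len(sparse_counts)} of {total_entries} entries")
print("gene 2:", gene_slice(sparse_counts, 2))
```

In this toy case only 4 of the 20 entries need to be stored, and a query for one gene returns just the cells that express it.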

How the complexity of multimodal data demands a new approach

Speaking from your background in single-cell data and bioinformatics, how well do you think tabular database solutions are handling the data needs of single-cell research?

Wolen: Tabular databases are still in widespread use, and for certain use cases they are really awesome. But multimodal life sciences data like single-cell are not always rectangular. If you’re working with a sparse single-cell matrix, a multi-channel tissue image or genomic variants, these are fundamentally multi-dimensional structures. We see a lot of folks who usually start off trying to shoehorn this data into flat 2D tables because that’s the tool they use. But what you end up with is a pretty suboptimal solution that's not only difficult to use, but also difficult to maintain and to scale.

Alternatively, you can skip the database entirely and stick with the more bespoke file formats that have been developed for each new data type. If you're in bioinformatics, you know people in that field love to create new formats. However, it can be very difficult to jump from one to the other, and a lot of those formats are just flat files so you lose the ability to query and slice the data efficiently. You end up with a lot of custom code to manage these different file formats. That ends up creating a lot of technical debt and ultimately slowing down the actual important work, which is the research.

Considering these shortcomings, why do you think tabular databases are still in widespread use in life sciences?

Wolen: Familiarity is one reason. Most biologists and even bioinformaticians are initially trained on tables and Excel. It’s the hammer everyone knows, which can make every data problem look like a nail. But to be fair, for a long time the scale of life science data didn’t really require anything more complex. If you only have a handful of samples or a few CSV files, an Excel spreadsheet or even a simple SQLite database can work just fine. I don't think the pain becomes acute until you start working with new modalities that aren't naturally tabular or until your data size explodes as technology advances. It’s the slow boil effect where tables work fine at first, but slowly you realize that, as your data is getting larger and more complex, you’re spending a lot of time doing extract, transform and load work to get data in and out of your tools.

This is why the multi-dimensional arrays TileDB uses are so important. These arrays are a much more natural representation for the multimodal range of life sciences data, plus they can also represent tables perfectly well when needed. This makes arrays one beautiful, very flexible format that is a much better fit for all kinds of use cases in life sciences.
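Editor's illustration: the contrast between a multi-dimensional array and a flat 2D table can be sketched in a few lines of plain Python. The example assumes a tiny made-up image with 2 channels, 3 rows and 4 columns; the helper names (`to_long_table`, `crop_array`, `crop_table`) are hypothetical and do not correspond to any TileDB function. Cropping a region from the array is a direct slice by coordinates, while the same crop from the flattened table means scanning every row and rebuilding the 2D shape.

```python
# Illustrative sketch only: the same toy image stored as a 3-D array
# and as a flat "long" table. Names and data are hypothetical.

# Toy image: 2 channels x 3 rows (y) x 4 columns (x).
# Each intensity encodes its own coordinates: channel*100 + y*10 + x.
image = [
    [[c * 100 + y * 10 + x for x in range(4)] for y in range(3)]
    for c in range(2)
]


def to_long_table(img):
    """Flatten the array into (channel, y, x, intensity) rows."""
    return [
        (c, y, x, value)
        for c, plane in enumerate(img)
        for y, row in enumerate(plane)
        for x, value in enumerate(row)
    ]


def crop_array(img, channel, y_range, x_range):
    """Crop a region by slicing directly on the array's coordinates."""
    y0, y1 = y_range
    x0, x1 = x_range
    return [row[x0:x1] for row in img[channel][y0:y1]]


def crop_table(rows, channel, y_range, x_range):
    """Same crop from the flat table: scan all rows, sort, rebuild 2-D shape."""
    y0, y1 = y_range
    x0, x1 = x_range
    hits = [
        r for r in rows
        if r[0] == channel and y0 <= r[1] < y1 and x0 <= r[2] < x1
    ]
    hits.sort(key=lambda r: (r[1], r[2]))
    values = [r[3] for r in hits]
    width = x1 - x0
    return [values[i:i + width] for i in range(0, len(values), width)]


table = to_long_table(image)
from_array = crop_array(image, 1, (1, 3), (0, 2))
from_table = crop_table(table, 1, (1, 3), (0, 2))

print(len(table), "table rows for a 2 x 3 x 4 image")
print(from_array)
print(from_table == from_array)
```

Both routes return the same values, but the table version needs filtering, sorting and reshaping code that the array version gets for free from its coordinates.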

How is TileDB’s technology directly addressing some of the collaboration challenges you encountered in your previous research work?

Wolen: One nice thing about TileDB itself that I really appreciate is that, because it's written in C++, we can write APIs for the storage engine in a lot of different languages. This is great for interoperability. Now, in my field of single-cell data, the language or the single-cell analysis toolkit you use typically dictates the format used to save your data. A lot of the orgs we work with include both R and Python programmers, and traditionally they’ve had to duplicate their data and convert it to a different format whenever two people wanted to collaborate on the same experiment. This is painful and risks data loss during the conversion.

So for single-cell on TileDB, we have built what we call TileDB-SOMA, and in it we have an R package and a Python package. This means we can store single-cell data once and build upon TileDB's core APIs to provide an idiomatic interface for Python users and an idiomatic interface for R users and they can both work with the same data.

TileDB Carrara takes this further by providing a data catalog. This creates a single source of truth to access your data. You don't have to worry about people making duplicate copies on S3. Just go to your catalog and you can search for whatever your particular terms are for your analysis to find the right data sets. You get the URI and then you can just load it in a notebook with R or Python and it reduces a lot of the friction that would otherwise slow down that work.

Streamlining the future of life sciences with TileDB Carrara

That’s not the first time you mentioned “friction” in data management. What does data management friction look like? How would an org know it’s time to find a new data management platform?

Wolen: A key signal would be when you find yourself spending more time retrieving data from the database and restructuring it into the format your analysis tools require than actually analyzing the data. For example, if you're spending more time writing custom SQL queries to slice and dice your data than actually running the analyses, that's a red flag.

Also you need to keep an eye on performance, the time it takes to retrieve the data. That has brought numerous customers to us. They often say, “A solution worked great when our genomic variant database contained a thousand samples, but now we have 5,000 samples and our comp bios are waiting forever for every query to complete.” Those are both signs of friction and that it's time to look for a new solution.

Speaking of new solutions, how do you see TileDB Carrara contributing to the future of life sciences?

Wolen: I’m really excited and fascinated by the rise of foundational models for biology like scGPT and Geneformer, which are so fundamentally different from how I was trained to analyze biological phenomena. These models can learn the language of cells from these huge data sets, making it possible to predict things like cell states, cell communication and other complex phenomena that were almost black boxes before. So I'm really excited to see where that goes and it hopefully helps tackle very difficult problems like Alzheimer's disease and conditions like that.

Here’s how I believe TileDB Carrara will help contribute to these breakthroughs. The less time you have to spend worrying about infrastructure, data management and collaboration problems, the more of that cognitive energy you can focus on the actual research. That's exactly what we're trying to do at TileDB with the storage engine itself, with our APIs and of course with Carrara itself. We want to abstract away all of those data management issues so our customers can just focus on the science.

Explore how TileDB Carrara removes the friction of multimodal data management to drive life science discovery and collaboration.
