Discover why metadata is the glue of reproducible research in life sciences. Senior Product Manager Jeremy Leipzig shares insights on metadata, federated queries, and how TileDB Carrara empowers collaboration without compromising patient privacy.

Table Of Contents:

The underappreciated importance of reproducible research

Making the most of metadata in life sciences data management

How TileDB is simplifying how we share multimodal data

Reproducible research is essential to effective collaboration in life sciences, making it possible for scientists and bioinformaticians to build on existing datasets to drive discovery. This idea is integral to the vision of TileDB, guiding how we design trusted research environments and enable federated queries to share data across organizations without risking patient privacy. To learn more about these ideas, we interviewed Jeremy Leipzig, our Senior Product Manager for Genomics and longtime evangelist for the value of reproducible data.

The underappreciated importance of reproducible research

Let’s start with you. Why did you study and make your career in bioinformatics?

Leipzig: I got out of undergrad with a biology degree, and I was working in a wet lab doing whole animal research. So my days were filled with rats and mice for the first few years after college, and I got tired of that pretty quickly. But in the course of those experiments, I had written several computer programs to help us manage data from these real-time animal psychopharmacology experiments, and I discovered I enjoyed writing those programs a lot more than my day job.

So I went back to school for computer science with a bioinformatics angle. I got my first job at Penn with the Bushman lab doing virology research. We were one of the first to use next-generation sequencing to study how viruses, especially retroviruses, are able to integrate into various genomes, replicate, and become infectious. After that, I did a stint in industry and then worked at The Children’s Hospital of Philadelphia for several years. During those years I did a Ph.D. oriented toward reproducible research, which I viewed as an underappreciated challenge that became a recurring theme in my work.

Yes, you’ve become a bit of an evangelist on that topic. Why is reproducibility in research so important?

Leipzig: Reproducibility is a recurring problem: scientists encounter papers and either try to reproduce them or essentially replicate them. They take their own data and apply the same algorithms and analysis to that data, and then find out they do not arrive at the same results. I saw a lot of the mechanics of how that problem arose because I worked in a bioinformatics core handling analysis for a university, and we didn’t have a lot of reproducible research best practices instituted at that time.

So I thought a lot about the pieces of the puzzle that were necessary to reproduce a paper. And I started organizing and looking at case studies that have been done about reproducibility, collecting them in my GitHub repo called “Awesome reproducible research.” Some of the more interesting ones involve taking a data set and having many different scientists analyze it on their own, seeing what types of results or p-values come out of those analyses. Others are simply trying to reproduce or replicate a paper, or taking a certain scientific hypothesis and applying different tool sets to make sure the results can rise above the noise of using different tool sets and algorithms against the data set. If the underlying scientific hypothesis is true, you should still arrive at essentially the same result. It’s essential to the scientific method, and it was the impetus for my dissertation and later publications on metadata and reproducible computational research.

Making the most of metadata in life sciences data management

What’s the role of metadata in life sciences data management? Why is it so important?

Leipzig: Metadata is the who, what, where, when and why of data. It can be technical metadata that describes the instrumentation that was used, or procedural metadata that describes what transformations were applied to a certain data set. Metadata is essentially the glue of reproducibility. It’s what allows you to communicate the steps that are necessary to arrive at a result that another scientist can use.

For example, when we get a VCF file, it’s not just straight off the sequencer. There have been alignment and variant calling steps done by some software. Typically those transformations are run using a pipeline framework like Nextflow, and then there's often some analysis that was done that we've inherited. So we need to know what code generated this analysis. Maybe it lives in a notebook, maybe there are off-the-shelf statistical algorithms that were applied, but we need to know what they're called and which versions of those tools were used. These are all key questions, which an experiment’s metadata answers if it’s properly implemented. I also wrote a paper that breaks down the five major components of metadata: input metadata, tools, workflows, statistics and then the papers themselves.
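The kind of provenance record described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, field names, file name, workflow name, and version strings below are all hypothetical and do not come from TileDB, Nextflow, or any particular tool. The sketch records which workflow and which tool versions produced a file, then flattens that record into plain key-value pairs, the general shape that tag-style metadata stores accept.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToolStep:
    """One transformation applied to the data (hypothetical structure)."""
    name: str
    version: str
    parameters: dict = field(default_factory=dict)


@dataclass
class Provenance:
    """Who/what/how for a single data asset (hypothetical structure)."""
    asset: str
    workflow: str
    workflow_version: str
    steps: list = field(default_factory=list)

    def to_key_values(self) -> dict:
        """Flatten the record into string key-value pairs."""
        kv = {
            "asset": self.asset,
            "workflow": self.workflow,
            "workflow_version": self.workflow_version,
        }
        for i, step in enumerate(self.steps):
            kv[f"step.{i}.name"] = step.name
            kv[f"step.{i}.version"] = step.version
            # Parameters are serialized deterministically so records compare cleanly.
            kv[f"step.{i}.parameters"] = json.dumps(step.parameters, sort_keys=True)
        return kv


# All values below are made up for illustration.
record = Provenance(
    asset="cohort.vcf.gz",
    workflow="example-variant-pipeline",
    workflow_version="1.0.0",
    steps=[
        ToolStep("aligner", "0.7.17", {"threads": 8}),
        ToolStep("variant_caller", "4.5.0"),
    ],
)
metadata = record.to_key_values()
```

With a record like this attached to the file, a colleague who inherits the VCF can see which tools and versions produced it without tracking down the original analyst.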

Every time we get some new data, inevitably someone wants to know something about it beyond what the raw data conveys. Metadata is where that information can come from, making it essential for sharing research data in life sciences. TileDB supports metadata a lot better than the file system that you have on your computer, helping you define key value stores that are tagged for any data asset. And once we have Carrara released, these capabilities are only going to get better.

Go into more detail about that. How does TileDB Carrara better support metadata and reproducible research?

Leipzig: With Carrara, our next projects will offer first-class support for ontologies. Like metadata, ontologies describe how different data entities relate to each other in a structured language, so they’re like the nouns and verbs of the data universe. What we want to do is enable people to go from something like a LIMS, a laboratory information management system that would come from a wet lab, to a solution that makes all that data accessible no matter how far it got processed in TileDB. So if a physician like Dr. Kingsmore from Rady Children’s Hospital was looking at a report and he needed to know some fairly arcane detail about how something was processed or where it came from, he can find that without leaving TileDB. That’s the ultimate goal. This is vital for reproducibility and the secure collaboration across organizations that is needed for true breakthroughs to happen.

Another key capability around reproducibility and collaboration is how TileDB Carrara supports federated queries, which are key to our ongoing work with Rady’s and the BegiNGS consortium. That consortium really wants to share data where they're looking at aggregate results between parties at different hospitals without allowing identifiable information to be transmitted between those institutions for the sake of the privacy of the patients involved. Federated queries let them share the aggregate results in a way that’s flexible enough that someone at another institution can write an analysis that processes someone else's information in a new way without breaking patient privacy. Our partnership with Rady’s is what led us to build a solid federated framework and reinforce what we’re doing with our trusted research environment, delivering the safeguards and auditing capabilities that people demand in these highly scrutinized environments.
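The general idea of a federated aggregate query can be sketched as follows. This is not how TileDB Carrara implements it; it is a minimal single-process simulation in Python, and every name in it (`local_aggregate`, `federated_count`, `make_site`, the hospital labels, the variant string) is hypothetical. The small-count suppression threshold is also an assumption added for illustration, a common safeguard rather than something stated in the interview. The point is that each site computes counts locally, and only those counts, never patient rows, reach the party running the query.

```python
MIN_CELL_COUNT = 5  # assumption: sites withhold counts below this threshold


def local_aggregate(records, variant):
    """Runs inside one institution; returns only counts, never patient rows."""
    total = len(records)
    carriers = sum(1 for r in records if variant in r["variants"])
    if carriers < MIN_CELL_COUNT:
        carriers = None  # suppressed: too few patients to report safely
    return {"total": total, "carriers": carriers}


def federated_count(sites, variant):
    """Combine per-site aggregates; sites with suppressed counts are skipped."""
    results = [local_aggregate(records, variant) for records in sites.values()]
    reported = [r for r in results if r["carriers"] is not None]
    return {
        "sites_reporting": len(reported),
        "total": sum(r["total"] for r in reported),
        "carriers": sum(r["carriers"] for r in reported),
    }


def make_site(prefix, n_patients, n_carriers):
    """Build synthetic patient-level records for the simulation."""
    return [
        {
            "patient_id": f"{prefix}-{i}",
            "variants": {"chr1:12345A>G"} if i < n_carriers else set(),
        }
        for i in range(n_patients)
    ]


# In a real deployment each list would live at a different hospital;
# here all three are simulated in one process with synthetic data.
sites = {
    "hospital_a": make_site("A", 40, 8),
    "hospital_b": make_site("B", 25, 6),
    "hospital_c": make_site("C", 10, 2),
}
summary = federated_count(sites, "chr1:12345A>G")
```

In this toy run the third site has too few carriers to report, so the combined summary covers two sites and contains no patient identifiers.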

How TileDB is simplifying how we share multimodal data

Speaking of partnerships, how do you see the new partnership between Databricks and TileDB helping to empower life sciences research using multimodal data?

Leipzig: Databricks comes from a long lineage of using Apache Spark to its fullest capabilities. Spark has always been an interesting processing solution that has been very effective across data science, across business analytics and in the life sciences as well. But what was always lacking in the Spark world was a database that could keep up with Spark. For the longest time they've essentially required something called Parquet, which is a columnar file format that can be queried from S3, but that data layer has never really been as mature as it could have been.

This is where TileDB comes in. As more and more firms prioritize multimodal data in scientific analyses, TileDB offers a lot of synergy in allowing Databricks to fully spread its wings and process these other types of data modalities. There’s a real “better together” story that I’m excited to see unfold in the coming months.

Let’s close by talking more about the future. What kinds of life sciences research projects are you most excited about, and how could TileDB Carrara support them?

Leipzig: As a population genomics person, my number one thing is the biobanks. We want to get as many biobanks as we can into TileDB to accelerate the ability to extract information and uncover new insights from them effectively. We've seen a lot of large projects where someone has taken UK Biobank data and discovered something brand new about some disease process that was under the radar. So the more data we can get from these biobanks into TileDB, the more these discoveries are possible.

And I’m not just talking about the UK Biobank. We're also interested in biobanks under national auspices in developing countries as well as biobanks from domestic hospitals where there's some specific disease focus. So we want as many of those biobanks to be in TileDB as possible, under very secure data sharing auspices, with federated queries able to run on these data sets. We're at the tip of the iceberg with the total number of people who have been sequenced, but we can get to a point where we have essentially sampled enough individuals to get a representative sampling of the entire world's population. I’ll be excited to see where that goes.

TileDB Carrara essentially rounds out our product, allowing TileDB to develop from a more specialized scientific application database to something that handles all your data end to end. So what we get from Nextflow workflows, which are supported in Carrara, is the ability to go from the sequencer to a TileDB dataset end to end. So we don't have to do things across different platforms. It really is a much more holistic solution that's going to allow people like Dr. Kingsmore from Rady’s to see reports and see static files and use TileDB not just in one niche, but for managing basically everything. TileDB Carrara is going to be their system of record for doing all their work so they don't have to be divided between different solutions, and it adds first-class tabular support that will manage an organization’s tabular databases effectively as well.

Learn more about TileDB Carrara and how it facilitates collaboration to drive life science discovery.

A conversation with Jeremy Leipzig: Why metadata is the glue of reproducible research
