Last week we had a coming out party for our audacious vision at TileDB. We hosted a webinar where I shared what we have been working on for the past several years to make this vision a reality. We argue that the market is saturated with thousands of purpose-built data(base) systems. These systems create a lot of noise for analysts and scientists dealing with vital data problems across numerous important application domains, making their lives very difficult and slowing down Science for all of us. We explain that there is huge Engineering overlap across all those data systems and that it is therefore possible to build a single system that can manage all data types in a unified way, for all applications, in a foundational and future-proof manner. We call such a system the universal database. And we built one, so we thought we'd share how we did it!
Here is the webinar video recording. I am also providing the gist below in text, in case you folks are too busy or bored to listen to some random dude talking about the exact opposite of what cloud vendors are telling you and how the market works. I look forward to hearing your thoughts and initiating a necessary dialogue in our industry.
While we have built enormous sophistication over the past five decades in relational databases, which manage tabular data superbly, most of the data being generated out there (e.g., by the Sciences) is not tabular. This data is typically massive and seemingly diverse, and cannot be managed effectively with “traditional” databases.
In the meantime, the cloud has changed data management radically. Organizations with large quantities of data prefer storing it in cheap cloud stores in the form of files, effectively separating storage from compute. This gave rise to “lake houses”, which pretty much boil down to the following: (i) dump your data into cloud buckets as files, (ii) adopt a “hammer” for scalable compute, (iii) treat data management as an afterthought by applying “hacks”.
To add insult to injury, Machine Learning is undergoing massive hype. As everyone jumps on the bandwagon and builds numerous pieces of software around ML, an important mistake is made: everyone thinks that ML is a compute problem, whereas it is in fact a data management problem. Because those ML models are trained on data, they serve predictions over data, and they themselves constitute… well, data.
Therefore, a data management mess ensues. Thousands of “data systems” — databases, warehouses, lake houses, metadata stores, governance systems, catalogs, ML model/feature stores — flood the market. VCs spend inordinate amounts of money on new startups around those systems, with the recipe being: (i) GitHub stars or Hacker News top story, coupled with (ii) some top university pedigree or top tech company previous affiliation for the founders. Sometimes I think they are just doing it to troll us, as they can even invest in companies developing the exact same software branded with a different name, or in systems with the exact same features but with marginal performance differences.
And you may say: “So what? The market is well capitalized and more talent gets hired”. I am absolutely ok with that. But here is another idea. We are currently working with some organizations who are actually trying to solve some very important problems. Problems important for Humanity, those kinds of problems. And these folks, being scientists and despite the fact that they are more than capable of handling data engineering tasks, are lost in this data system noise. They either end up using way too many systems for their problem, or they build them in-house because no system fits their needs. So they lose a lot of time and money for their organization, their work is slowed down and, therefore, Science is slowed down. And I am absolutely not ok with that!
So how do we solve this problem? A while ago we hypothesized that there exists a single system which can efficiently support all data types (tables, time series, point clouds, genomics, imaging, ML models, key-values, documents, even flat files) for all applications.
Also, it can handle authentication, security, access control and logging (like good old databases). In addition, it can be “infinitely” scalable (bounded by $ or machine availability), and it can enable global sharing, collaboration and monetization. Is it even possible to build such a system?
If it is indeed possible, why hasn't anyone built a universal database? Well, first of all, it sounds utterly unrealistic or, to say the least, it seems to require a LOT of work. Moreover, we as a database community overlooked the most powerful data structure that could have given us a chance to build such a system: the multi-dimensional array. The reason is that we always treated arrays as dense, i.e., having integral dimensions and values in every cell. That resulted in using arrays only for niche applications and of course never for tabular data.
Here is the thing though. Consider a multi-dimensional array that can be dense, but also sparse. And by sparse we mean that: (i) cells are allowed to be empty, with only the non-empty cells materialized in storage; and (ii) the dimensions do not need to be integral (they can be real-valued or even strings).
Such an extended array model, coupled with arbitrary key-value metadata, is depicted below, constituting the basis for a universal way to model any kind of data. From tables, to time series, to LiDAR, to genomic variants, to imaging, to video, to key-values, to documents, to anything you can think of — even flat files! All you need to do is remove the domain-specific jargon from your data and map it to a dense or sparse array of any dimensionality.
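To make the dense-versus-sparse distinction concrete, here is a toy sketch in plain Python. It deliberately uses no TileDB API (the dict-of-coordinates representation and the sample stock values are just illustrative assumptions): a dense array stores a value in every integer-indexed cell, while a sparse array materializes only its non-empty cells and permits non-integer dimensions, which is how a table's key columns can become array dimensions.

```python
# Toy illustration (NOT the TileDB API): a sparse array modeled as a
# mapping from coordinate tuples to cell values.

# A dense 2-D array: every (row, col) cell holds a value, and the
# dimensions are integers.
dense = [[0.0 for col in range(3)] for row in range(2)]
dense[1][2] = 7.5

# A sparse 2-D array: only non-empty cells are materialized, and the
# dimensions can be strings (e.g., symbol x date for a stock table;
# the numbers below are made-up example data).
sparse = {
    ("AAPL", "2021-06-01"): 124.28,
    ("MSFT", "2021-06-01"): 247.40,
}

def slice_sparse(arr, symbol):
    """Return all materialized cells with the given first coordinate."""
    return {coords: v for coords, v in arr.items() if coords[0] == symbol}

print(slice_sparse(sparse, "AAPL"))
```

The point of the exercise: once a table's keys are dimensions, a query by key is just a slice of the (sparse) coordinate space — the same operation used to slice an image or a genomic region.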
The array model captures universal storage. Then, algorithms can be modeled as a set of tasks (defining any kind of computation, even user-defined) with dependencies among them, enabling either distributed computing or out-of-core processing (i.e., processing data that does not fit in RAM). Such generic task graphs help realize universal compute. Finally, extensibility via exposing numerous APIs and integrating with any popular tool out there leads to universal interoperability. And those three components must of course be welded together by authentication, security, access control, and logging, all built once on the unified data and compute model.
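The task-graph idea can be sketched with the standard library's topological sorter. This is a local, single-process toy, not TileDB's serverless task-graph API, and the `task` registration helper is hypothetical; it only shows the core mechanism: each task is a callable, edges encode dependencies, and a task runs only after the results it depends on exist.

```python
# Toy task graph: callables plus dependency edges, executed in topological
# order. (A local sketch only; TileDB's task graphs run serverless and
# distributed. The `task` helper below is hypothetical.)
from graphlib import TopologicalSorter

graph, fns, results = {}, {}, {}

def task(name, fn, deps=()):
    """Register a task; fn will receive the results of its dependencies."""
    graph.setdefault(name, set()).update(deps)
    fns[name] = (fn, deps)

# A tiny pipeline: ingest two chunks independently, then reduce over both.
task("load_a", lambda: [1, 2, 3])
task("load_b", lambda: [4, 5, 6])
task("total", lambda a, b: sum(a) + sum(b), deps=("load_a", "load_b"))

# static_order() yields each task only after all of its dependencies.
for name in TopologicalSorter(graph).static_order():
    fn, deps = fns[name]
    results[name] = fn(*(results[d] for d in deps))

print(results["total"])  # 21
```

Out-of-core processing falls out of the same structure: each task touches only a chunk that fits in RAM, and the graph stitches the partial results together; a distributed runtime would simply ship independent tasks to different workers.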
A recipe for the universal database has therefore formed: (i) model all data as dense or sparse multi-dimensional arrays; (ii) model all computation as task graphs over those arrays; (iii) expose numerous APIs and integrate with popular tools; and (iv) weld everything together with authentication, security, access control, and logging.
We took a stab and followed the above recipe, and the outcome was the TileDB universal database, which looks something like this:
It consists of two high-level components: the storage engine and the cloud platform.
The former is an open-source storage engine, written in C++, that efficiently stores and accesses dense and sparse multi-dimensional arrays (while also exposing APIs and integrating with other tools). The latter builds upon the storage engine and offers database features like authentication, access control, and logging, as well as serverless SQL, user-defined functions (UDFs), and task graphs, plus a lot of other goodness: hosted Jupyter notebooks and dashboards, global sharing of data and code, a full-fledged marketplace for data and code, and more.
Both those components merit separate tech deep-dive workshops, which are coming up soon — stay tuned!
Of course, we would not be having this coming out party if we were not confident that what we built works. TileDB is currently used in numerous, often vastly diverse applications. These include tabular data, anything ML, point clouds (such as LiDAR, SONAR, and AIS), genomics, satellite/biomedical imaging, weather, and more. We will be hosting workshops and publishing in-depth tutorials for each one, but you can take a look at our recent LiDAR workshop to get a taste.
Here are a few thoughts about the future of data management, at least the way I see it.
First, I predict that data warehouses and lake houses will be subsumed by universal databases. Based on what I covered in this webinar, there is nothing that data warehouses and lake houses do that a universal database cannot. But a universal database handles more than one data type, in an efficient, unified, foundational manner. So I expect data warehouses and lake houses to start following a similar universal approach to TileDB.
Second, in my future, we will stop building a new data system (and startup) for every single “twist” around data types, indexing, performance, and hardware. We will instead start innovating modularly without having to reinvent the stuff that already works well in the database. Moreover, we will stop using the term “universal”, as it will be unnecessary: all databases will be universal by default.
Third, once we prove that the most serious data management problems are reconciled and solved, a vast network of analysts and scientists will emerge who easily share data and runnable code without fuss, accelerating Science and analytics via easy reproducibility and focusing on Science rather than on unnecessary Engineering.
Last, in my future, the spotlight will shine on the user. The database should be a mere facilitator of the creativity and innovation of the users that adopt it to solve important problems in their various disciplines. The database should be promoting the user’s brilliance. In my view, the future of data management is the user. You!
Here are the slides I used in my presentation. Big thanks to Aggeliki (our graphics design wizard) for the awesome art!
Last but not least, a huge thanks to the entire TileDB team and everyone who took a huge bet and has supported us throughout the years (investors, mentors, customers, partners)! I am honored to work with them; nothing would have existed without them.