These are exciting times for anyone working on data problems, as the data industry is as hot and as hyped as ever. Numerous databases, data warehouses, data lakes, lakehouses, feature stores, metadata stores, file managers, etc. have been hitting the market in the past few years. I have spent my entire career in data management, but going against the market trend, I have always been absolutely baffled that we keep on creating a new data system for every different data type and application. During my time at Intel Labs and MIT between 2014–2017 where I worked as a database researcher, I asked myself a simple question: instead of building a new data system every time our data needs change, can we build a single database that can store, govern, and process all data — tables, images, video, genomics, LiDAR, features, metadata, flat files and any other data type that may pop up in the future.
This question was born from the simple observation that all database systems (and variations) share significant similarities, including laying data out on the storage medium of choice, and fetching it for processing based on certain query workloads. Therefore, to answer the above question, I had to ask a slightly different one: is there a data model that can efficiently capture all data from all applications? Because if such a universal data model exists, it can serve as the foundation for building a universal database with all the subsystems common to all databases (query planner, executor, authenticator, transaction manager, APIs, etc.). In this blog post I argue that such a model does exist, and it is based on multi-dimensional arrays.
Before elaborating on why arrays are universal, I need to answer yet another question: why should you care about a universal data model and a universal database? Here are a few important reasons:
Even if I have already managed to convince you of the importance of a universal database, I need to make one last vital remark. A universal database is unusable if it does not offer excellent performance for all the data types it is serving. In other words, a universal database must be performing as efficiently as the purpose-built ones, otherwise there will be a lot of skepticism on adopting it. And this is where the difficulty of universal databases lies and why no one is building such a system today (well, almost no one).
In this blog post I show that multi-dimensional arrays are the right bet not only for their universality, but also for their performance. In full disclosure, I am the founder and CEO of TileDB, a company that is building a universal database with multi-dimensional arrays as first class citizens. I will include a lot of the critical decisions we took at TileDB when designing the array data model, an efficient on-disk format and a powerful storage engine to support it. My goal is to emphasize the generic array principles, hoping that this article can serve as a small step towards shifting from purpose-built data systems to developing more universal databases.
Take a deep breath, we are about to take a technical dive!
The rest of the blog post is for you if:
The basic array model we follow at TileDB is depicted below. We make an important distinction between a dense and a sparse array. A dense array can have any number of dimensions. Each dimension must have a domain with integer values, and all dimensions have the same data type. An array element is defined by a unique set of dimension coordinates and it is called a cell. In a dense array, all cells must store a value. A logical cell can store more than one value of potentially different types (which can be integers, floats, strings, etc). An attribute defines a collection of values that have the same type across all cells. A dense array may be associated with a set of arbitrary key-value pairs, called array metadata.
A sparse array is very similar to a dense array, but it has three important differences:
The decision on whether to model your data with a dense or sparse array depends on the application and it can greatly affect performance. Also extra care should be taken when choosing to model a data field as a dimension or an attribute. These decisions are covered in detail in the performance section below, but for now, you should know this: array systems are optimized for rapidly performing range conditions on dimension coordinates. Arrays can also support efficient conditions on attributes, but by design the most optimized selection performance will come from querying on dimensions, and the reason will become clear soon.
Range conditions on dimensions are often called “slicing” and the results constitute a “slice” or “subarray”. Some examples are shown in the figure below. In numpy notation,
A[0:2, 1:3] is a slice that consists of the values of cells with coordinates 0 and 1 on the first dimension, and coordinates 1 , and 2 on the second dimension (assuming a single attribute). Alternatively, this can be written in SQL as
SELECT attr FROM A WHERE d1>=0 AND d1<=1 AND d2>=1 AND d2<=2 , where
attr is an attribute and
d2 the two dimensions of array
A. Note also that slicing may contain more than one range per dimension (a multi-range slice/subarray).
The above model can be extended to include “dimension labels”. This extension can be applied to both dense and sparse arrays, but labels are particularly useful in dense arrays. Briefly stated, a dimension can accept a label vector which maps values of arbitrary data types to integer dimension coordinates. An example is demonstrated below. This is very powerful in applications where the data is quite dense (i.e., there are not too many empty cells), but the dimension fields are not integers or do not appear contiguously in the integer domain. In such cases, multi-dimensional slicing is performed by first efficiently looking up the integer coordinates in the label vectors, and then applying the slicing as explained above. The reason why this may be preferable to simply modeling the data with a sparse array will become clear in the performance section below.
Multi-dimensional arrays have been around for a long time. However, there have been two misconceptions about arrays:
The sky is the limit in terms of applicability for a system that supports both dense and sparse arrays. An image is a 2D dense array, where each cell is a pixel that can store the RGBA color values. Similarly a video is a 3D dense array, two dimensions for the frame images and a third one for the time. LiDAR is a 3D sparse array with float coordinates. Genomic variants can be modeled by a 3D array where the dimensions are the sample name (string), the chromosome (string) and the position (integer). Time series tick data can be modeled by a 2D array, with time and tick symbol as labeled dimensions (this can of course be extended arbitrarily to a ND dense or sparse array). Similarly, weather data can be modeled with a 2D dense array with float labels (the lat/lon real coordinates). Graphs can be modeled as (sparse 2D) adjacency matrices. Finally, a flat file can be stored as a simple 1D dense array where each cell stores a byte.
But what about tabular data? Arrays have a lot of flexibility here. In the most contrived scenario, we can store a table as a set of 1D arrays, one per column (similar to Parquet for those familiar with it). This is useful if we want to slice a range of rows at a time. Alternatively, we can store a table as a ND sparse array, using a subset of columns as the dimensions. That would allow rapid slicing on the dimension columns. Finally, we can use labeled dense arrays as explained above for the time series tick data.
You may wonder how we can make all these decisions about dimensions vs. attributes and dense vs. sparse for each application. To answer that, we need to understand how dense and sparse arrays lay data out on the storage medium, and what factors affect performance when slicing, which is the focus of the next section.
The first step for performance in a system built on the above array model is figuring out the on-disk format. That will dictate the IO cost which is often dominant for a lot of query workloads. I will first explain some of the architectural decisions we took in TileDB Embedded to build an efficient array format. I will then explain how such a format leads to highly efficient slicing operations and summarize all performance factors.
Let’s start with how we store the dimension coordinate and attribute values.In order to efficiently support cloud object stores and versioned writes, we have chosen a multi-file format, where the values along each dimension or attribute are stored in a separate file. A dense array does not materialize the dimension coordinates, whereas a sparse array must. The figure below shows a dense and a sparse array example. Observe that each cell value across all dimensions and attributes appears in the same absolute position across the corresponding files. Moreover, in variable-length attributes or dimensions, a separate file is needed to store the byte offset of where each value starts in the value file.
This layout where values of the same type are grouped together is ideal for compression, vectorization and subsetting on attributes/dimensions during queries (similar to other columnar databases). Since typical queries slice subsets of these values, we do not compress the whole file as a single blob. Instead, we tile (i.e., chunk) it into smaller blocks. Therefore, we need a mechanism to create groups of cells. The cells in each tile will appear as contiguous in the corresponding files as shown in the figure above. A tile is the atomic unit of IO and compression, (as well as other potential filters, such as encryption, checksum, etc.).
A question arises: how do we sort the values from a multi-dimensional space, inside the files that are single-dimensional. This order will dictate the position of each cell within its corresponding tile, and the position of the tiles in the file. We call this order the global order of the cells in the array. I will explain the global order and tiling for dense and sparse arrays separately.
In a dense array, the global order and tiling is determined by 3 parameters:
The figure below depicts three different orders as we vary the above parameters. Observe that the shape and size of each tile is dictated solely by the space tile extents.
The sparse array case is slightly different, since tiling only based on space tiles may lead to highly imbalanced numbers of non-empty cells in each tile, which can further impact compressibility and slicing performance. In sparse arrays, the global order could be determined as follows:
Once the global order is determined, the tiling is specified by an extra parameter, called the tile capacity, i.e., the number of non-empty cells in each tile. The figure below depicts different global orders for a different choice of all the above mentioned parameters for sparse arrays (a non-empty cell is depicted in dark blue).
Why is the global order and tiling such a big deal? The global order should retain as much as possible the co-locality of your query results, for the majority of your typical slicing shapes. Remember, the array is multi-dimensional, whereas the file storing the array data is single-dimensional. You have a single chance (unless you want to pay for redundancy) to sort your data in a single 1D order. And that order absolutely dictates the performance of your multi-dimensional queries. The reason is that the closer your results appear in the file, the faster the IO operations to retrieve them. Also the size of the tile can affect performance, since integral tiles will be fetched from storage to memory. The examples below demonstrate some good and bad global orders and tilings for a given slice, focusing on a dense array (similar arguments can be made for the sparse case).
Being perceptive you will say, “I need to know the shape of my slices in order to choose the global order”. In a lot of the applications we have been working on, the typical query patterns are indeed known. If not, arrays still give you the opportunity to define an order that leads to acceptable performance for most of your query patterns. The most important take away here is that arrays provide you with easy and flexible means to manipulate the on-disk representation of your data in order to achieve the desirable performance, for all types of data and applications. And that’s the reason why arrays are universal!
Now that we figured out the on-disk format, how do we efficiently slice an array and what indices shall we optionally build to facilitate the query? First we focus on a dense array and use the example in the figure below. In addition to the slicing query, we know the following: the number of dimensions, the global order, the tiling, and the fact that there are no empty cells in the array. This information is collectively called the array schema. Using solely the array schema and with simple arithmetic, we can calculate the number, size and location of cell slabs (i.e., sets of contiguous cells on the disk) that comprise the query result, without any additional index. Therefore, we can design an efficient multi-threaded algorithm that can fetch the relevant tiles from disk, decompress, and copy the cell slabs into the result buffers, all in parallel. If the algorithm is developed appropriately and the global order and tiling is selected wisely, slicing dense arrays is truly rapid!
Slicing in sparse arrays is more difficult because we do not know the location of empty cells until the array is written. Therefore, unlike dense arrays, we need to explicitly store the coordinates of the non-empty cells, and build an index on top of them. The index must be small in size so that it can be quickly loaded in main memory when the query is submitted. In TileDB we use an R-tree as an index, but other multi-dimensional indexes exist. The R-tree groups the coordinates of the non-empty cells into minimum bounding rectangles (MBRs), one per tile, and then recursively groups those MBRs into a tree structure. The figure below shows an example. The slicing algorithm then traverses the R-tree to find which tile MBRs overlap the query, fetches in parallel those tiles from storage, and decompresses them. Then for every partially overlapping tile, the algorithm needs to further check the coordinates one by one to determine whether they fall inside the query slice. This simple algorithm, if developed efficiently (using multi-threading and vectorization), can lead to extremely efficient multi-dimensional slicing.
I already mentioned how some aspects of the array format can affect performance, but below I summarize them all:
As a final remark, it should be clear that arrays can efficiently model tabular data, but not vice versa, which would further justify why relational databases (or tabular formats like Parquet) never found success in the scientific domain. Take for example a simple 2x2 dense array
[[1, 2], [3, 4]]. Recall that dense arrays do not materialize the coordinates of the cells, and instead infer where the results are located from the array schema (i.e., dimensionality, global order and tiling). In an array engine, the array values would be serialized as
1, 2, 3, 4 in the file (assuming no tile extents and row-major cell order), and an array would be able to locate value in cell
(1, 0), i.e., second row, first column, easily as 3. However, there is no notion of dimensionality in tabular formats. If a tabular format stored the array values in a column as
1, 2, 3, 4, it would not possess the necessary information that the first two values correspond to the first row, and the second two to the second row for this example. Therefore, there would be no way to locate the value of cell
(1, 0), unless we created two extra columns explicitly storing the coordinates of every value and scan the table, which would incur significant unnecessary costs. If you find this example trivial, try to imagine how complex it gets in the presence of multiple dimensions, tiling and different tile and cell orders.
So far we have designed a great data model and an associated on-disk format that could lead to excellent performance, but where do we go from here? Before answering that, I will point out a few pitfalls that we observed with other array systems:
I argue that the data model and a format spec are not sufficient. We need to focus on building a powerful storage engine to support and evolve the array data model and format, with the following design principles (which we followed in TileDB Embedded):
The above are only the “basic” functionalities a storage engine should have. In order to build the foundation for a universal database, some more advanced features are needed:
The main takeaway is that multi-dimensional arrays can be powerful if supported by the right storage engine. And building a carefully engineered array engine is quite the undertaking. Fortunately, today there exists such an engine, called TileDB Embedded, and it is open-source under the MIT license. It is my hope that its design principles will motivate new array engines to emerge.
In this blog post I made the case that multi-dimensional arrays are a powerful data model that can serve as the foundation of a universal database that can capture all data types and applications. I explained the benefits of universality and presented the array data model. Then I described an efficient on-disk array format and elaborated on why it can lead to excellent performance via a technical deep dive. Finally, I argued that the array model and on-disk format are not sufficient without building a powerful storage engine to support and evolve them.
I am always looking forward to community feedback. You can find me and my team in the TileDB Slack workspace and forum, or you can contact us directly. For further reading, I suggest the TileDB docs and the following blogs / webinars:
A huge thanks to the entire TileDB team for making this possible!