I have been an academic and a computer scientist most of my life. Before founding TileDB as a company in 2017, I started TileDB as a research project while at Intel Labs and MIT to help data scientists solve problems around large matrix storage and computations. My goal was to eventually build a database that does more than SQL, as I always thought that the data and use cases out there that deal with data beyond tables are much bigger and more challenging. Throughout my journey, I always found it peculiar that there are so many SQL databases with numerous variations, which are trying to do pretty much the same thing: consume tabular data a little more efficiently than before. It was also puzzling to me that the non-tabular applications were re-inventing the wheel around data management (and in a rather suboptimal way in terms of performance and/or maintenance), since there was no off-the-shelf “database” for that kind of data.
And then I realized a much more serious problem. Although the market for data solutions is enormous and the VC money poured into this space is unprecedented, the entire data industry is not looking at the data management and analytics problem holistically: from data production, to distribution, to the consumption that leads to insights. In other words, I argue that today’s Data Economics is flawed.
In this blog post, I take a step back from the frenzy over which database offers the fastest and cheapest SQL queries, and I delve into the broader and more important data management problems that plague truly all applications, from Business, to Science, to Technology. Once I define the problem, I make an important observation: we can design a generic, universal solution for all domains and types of data, which can seriously disrupt the data management space and shape its future. Finally, I describe how we are implementing this solution at TileDB, with my main goal being to prove that such a solution can indeed exist, and that this is not science fiction any more.
This blog post is an extended version of a recent talk I gave at PyData Global 2021, for which a video recording is available.
Similar to the way Economics deals with the production, distribution and consumption of goods, Data Economics deals with the production, distribution and consumption of data. Data production refers to the format the data is generated in, where it originates from and where it is stored. Data distribution focuses on who has access to the data, how the data can be monetized, and how the access takes place. Data consumption deals with the way the data gets processed, analyzed and visualized to derive further insights.
Take a look at most of the data companies out there, from databases, to data warehouses, to data lakes, to lakehouses, to all the numerous and diverse Machine Learning and domain-specific solutions around data. Most of them exclusively focus on data consumption, and predominantly around how to make SQL queries run faster and cheaper, how to train and service ML models, or how to perform domain-specific computations and visualizations. A few companies have made some rudimentary attempts to enable distribution of the data beyond a single organization but with lots of limitations. And absolutely no company has addressed the production aspect. The entire industry took for granted that data will come in a specific format and that time spent on ETL and data wrangling is inevitable. It should not be. In fact, I believe that the no-ETL trend that favors keeping data in raw inefficient formats is moving our industry backward - not forward.
This fragmentation and isolation of the three aspects of Data Economics leads to an immense waste of human hours and resources, and is thus very costly for organizations. But most importantly, it is a total distraction; corporations and scientific institutions redirect resources and invest time on something that is not their core product or scientific work. Consequently, business is hindered, and technological and scientific advancement is unnecessarily slowed down.
In this blog post I elaborate in detail on the problems and consequences in all aspects of Data Economics, from production, to distribution, to consumption. Then I outline a foundational approach that can serve as a “solution template” to be followed by anyone who envisions building data systems and shaping the future of Data Economics. Finally, I present the solution we built with TileDB following the proposed approach, as proof that it is possible to completely transform the data space and dramatically accelerate Business, Science and Technology.
I start with a bold, yet obvious, claim: data in all applications is generated in the wrong format. Here are some examples:
And I can go on and on. The obvious problem is of course performance. This data is not stored in an analysis-ready manner that can be efficiently used by the tools that will consume it. Therefore, the consumers resort to very expensive wrangling and ETL processes. Moreover, every consumer has to build their own infrastructure to analyze the data at scale, which very often overlaps with the infrastructure other organizations are building for the same purpose. The end result is enormous reinvention of the wheel, suboptimal implementations, and a total waste of human hours and money.
This problem starts of course at the sources of this data, which do not communicate at all with the organizations that will be distributing and analyzing this data. We have reached a point where data consumers consider wrangling and ETL a necessary evil and, therefore, they do not push the data producers enough to change their methods. On the other hand, data consumers have not proposed a compelling alternative for the data producers and, therefore, the latter have very little appetite for changing the status quo. This is a stalemate.
Traditional databases and domain-specific solutions have long been ignoring the possibility to broaden the distribution of the data they manage beyond an organization. That was a remnant of the monolithic approaches of an older era. More recently organizations have started to realize that there is value in sharing their data with external parties, at a global scale. But the current approaches fall short in multiple respects. In this section I touch upon three such approaches.
The first is rather naive, yet the most frequently followed one: dump the data in flat files in some cloud storage bucket and grant access to the files either by managing file policies on the cloud provider or delegating the management to some kind of third party “marketplace” solution. Either way, the consumers are granted file access. It falls completely on the consumer’s shoulders to download the data, host a copy after wrangling it in some better data format, and build a colossal infrastructure to manage and analyze the data at scale. That process is followed by all the consumers of the same data. 1000 consumers of the same data? 1000 downloads, replicas and infrastructure variations to practically do the same thing: manage, analyze, and visualize the data in some sane way.
The second is a bit more generous on the data producer’s side: build infrastructure and serve the data in a better way to the consumers. In this case, the cost of building and maintaining the infrastructure falls on the producer. However, most of the data producers we have been talking to do not wish to spend the resources to build and maintain such an infrastructure. That’s completely outside their business scope.
The third solution is to utilize off-the-shelf database solutions and use their data sharing functionality (e.g., see Snowflake and Redshift). In that case, the cost of accessing the data is shifted to the consumer, which is the right direction. However, there are some important limitations. For example, write access is not supported, logging and auditing are tricky, and accessing producers' data that resides in different cloud regions or different cloud providers can become very complicated. For cross-region access, the consumer may be forced to spin up database clusters in multiple regions at extra hassle and cost, or be subject to extra egress costs. Cross-cloud data sharing, on the other hand, is currently out of the question for existing solutions. But most importantly, the elephant in the room: all these database solutions deal only with tabular data. What about the rest of the plethora of data types (imaging, video, genomics, LiDAR, etc.), as well as simple flat files (e.g., PDFs)? You need to be very creative to get all this data to fit in a relational database, or just resort to the other two approaches mentioned above.
One may argue that this is the aspect where the current market is actually doing well. There are numerous data solutions, with great performance and a variety of useful features. And if your application requires a single data solution (e.g., a transactional database or a data warehouse), then you are probably all set.
Nevertheless, in the majority of the use cases we are working with, this is far from the norm. In most scenarios, an organization has a wide range of data types and flat files in addition to tables. Furthermore, different individuals and groups within the organization run different workloads written in different languages and using different tools, which span way beyond SQL.
What typically happens is mind-boggling. Say for example that some large tabular data is stored in a powerful warehouse, but one group needs to run large-scale ML or custom operations that the warehouse does not support. That group creates a huge SQL query to export the data from the warehouse and bring it into a tool like Spark or Dask. Authentication, access control and logging happen properly inside the warehouse, but once the data is out, it’s the wild west. No control, no logging, no accountability whatsoever. And I can provide numerous other examples that involve multiple different data types, a variety of languages and tools, and a ton of wrangling.
The source of the consumption problem actually lies in the production and distribution problems, which are being neglected or treated as an afterthought. The data is produced or wrangled in an inefficient, non-interoperable way. Data distribution has not been properly solved to account for all types of data and for access by any party within or outside the organization that produces the data.
Summarizing so far, Data Economics is flawed because no one has ever approached data problems holistically, as everyone seems to be stuck in an echo chamber around efficient and effective data consumption. We took a more panoramic view of the problem, and here is what we saw as a viable solution, covering all aspects of Data Economics.
There is a need for a universal format, which can be interoperable with any language and tool, as well as any storage backend (naturally making it also “cloud-optimized”). History has shown that inflexible format specifications have failed repeatedly across all sectors. Moreover, domain-specific formats can be typically parsed and read only by domain-specific tools, which significantly limits the consumption by the rapidly growing ecosystem of data science and analytics tools.
The format should be open-spec, and there should be a fast open-source storage library that can read and evolve it. That library should be built in a fast and interoperable language (such as C/C++), enabling numerous language API wrappers and tool integrations to be built on top. The focus should not be so much on the format, but rather on the storage library and its APIs. The format should not be inflexible or undergo massive cycles of “committee approvals”; instead it should be able to evolve rapidly, while striving for stability of the APIs and backwards compatibility.
Universality and interoperability will open the door to sane, normalized, governed data distribution, and to easy consumption by an ever growing set of tools.
There are four aspects to the distribution problem:
Storage: No matter who is accessing the data and how, the data needs to be stored somewhere. And since most of the time we are talking about large quantities of data, storage must be as inexpensive as possible. Moreover, the data is a very valuable asset, and therefore the producer should have the option to own the data, which suggests a bring-your-own-storage kind of capability. Finally, storage must be separate from compute to keep the total cost of operation within reasonable ranges, especially in cases where the computational needs fluctuate and are asymmetric to the storage volume. All these considerations point towards storing the data in some kind of a cloud object store solution that ticks all the above boxes.
Access: This ties back to the production aspect. If the data format is universal and interoperable, then any tool should be able to access it directly from storage, without being constrained to SQL or some domain-specific library, and without any cumbersome and expensive downloads. Also, if the data is normalized under a common format, then a single authentication, access control and logging mechanism can be built to securely manage the access of the various parties. Universality and interoperability are your friends.
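To make the "single authentication, access control and logging mechanism" idea a bit more concrete, here is a minimal sketch in Python. It assumes that every asset, whatever its data type, is reached through one entry point. The `AccessGate` class, its methods, and the asset names are all hypothetical and are not part of any real product API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AccessGate:
    """Hypothetical single entry point for every access, whatever the data type."""

    # Maps (user, asset) -> set of allowed actions, e.g. {"read", "write"}.
    policies: dict = field(default_factory=dict)
    # One audit trail shared by tables, images, files, and so on.
    audit_log: list = field(default_factory=list)

    def grant(self, user, asset, action):
        """Allow `user` to perform `action` on `asset`."""
        self.policies.setdefault((user, asset), set()).add(action)

    def access(self, user, asset, action):
        """Check the policy and log the attempt, including denied ones."""
        allowed = action in self.policies.get((user, asset), set())
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "asset": asset,
            "action": action,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} may not {action} {asset}")
        return True


gate = AccessGate()
gate.grant("alice", "genomics/variants", "read")  # illustrative names only
gate.access("alice", "genomics/variants", "read")
```

Because the policy check and the log entry live in one place, a denied attempt by another user on the same asset is recorded in exactly the same way as a successful one.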
Compute: Once the basic storage and access considerations are worked out, the most crucial aspect becomes compute or, more precisely, how and where the access takes place. This is crucial, because it will dictate who will be picking up the bill. Now here is how things get messy.
Data producers should own the data and have the option to store it in any cloud object store, even across different regions and cloud providers. Moreover, the producers should not be paying a single dollar for any access by the consumers. Finally, the consumers should not be forced to set up infrastructure in the producers’ cloud provider or region of choice, as that will never scale with the number of producers and there will always be reluctance on the consumer’s part to maintain multiple clusters.
So how do we solve this problem? Serverless is your friend here. There should be a third party managing the infrastructure for both producers and consumers. The producer should be storing their data in the cloud object store (on any region and cloud provider) and be charged only for storage. The consumer should maintain zero infrastructure. The third party should be responsible for managing compute clusters in every cloud provider and every region. The consumer should just issue queries from any tooling, and the third party should be automatically sending the compute to where the data resides to boost performance and eliminate unnecessary egress costs. The consumer should pay only for the resources they use, and the third party should be responsible for monitoring those costs. It’s a win-win-win situation.
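As a rough illustration of "sending the compute to where the data resides", the following Python sketch routes a query to a compute pool in the same cloud and region as the dataset. Everything in it is an assumption made for illustration: the `Dataset` record, the `CLUSTERS` table, the pool names, and the `dispatch` function are hypothetical and do not describe how any particular service is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """Where a producer chose to store their data (hypothetical record)."""
    name: str
    cloud: str
    region: str


# Hypothetical compute pools that the third party operates, keyed by
# (cloud provider, region). Producers and consumers never manage these.
CLUSTERS = {
    ("aws", "us-east-1"): "pool-aws-use1",
    ("aws", "eu-west-1"): "pool-aws-euw1",
    ("gcp", "us-central1"): "pool-gcp-usc1",
}


def dispatch(dataset, query):
    """Pick the pool co-located with the data, so the data is not moved."""
    key = (dataset.cloud, dataset.region)
    if key not in CLUSTERS:
        raise LookupError(f"no compute pool in {dataset.cloud}/{dataset.region}")
    return {"cluster": CLUSTERS[key], "dataset": dataset.name, "query": query}


scans = Dataset(name="lidar-scans", cloud="gcp", region="us-central1")
job = dispatch(scans, "SELECT COUNT(*) FROM scans")
```

The consumer only supplies the query. The choice of cluster is derived from where the producer stored the data, which is what keeps egress out of the picture in this model.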
Monetization: This refers to the ability of profiting from sharing data (or even code). Based on the above discussion, the third party that manages the infrastructure possesses all the means to facilitate the distribution of the data (or even code) from the producer to the consumers, including all the metrics around usage (as a pay-as-you-go model is followed). It follows that marketplace functionality can be easily built to support monetization, eliminating the need for the producers and consumers to sign up to two different vendors (one for the marketplace, a separate one for the analytics). Moreover (and perhaps most importantly), this eliminates the need for data movement, which is cumbersome and expensive. All the third party really needs is integration with a service like Stripe. And then everyone gets the ultimate solution for distributing, analyzing and monetizing data (or even code).
In contrast to the rest of the market, which focuses mostly on data consumption, I can spend the least amount of time here on the consumption solution, because a robust solution to the data production and distribution problems does most of the work. Once you get the data in a universal and interoperable format (with a storage engine to do all the heavy lifting on performance around storing and accessing the data efficiently, as well as integrating with all the tools), you can practically use any tool you are already using with very little modification of your day-to-day practices. For example, you will be able to use SQL, or Python pandas, or R data.table, or anything else, on the same data, without downloads and wrangling, and in a serverless manner within the infrastructure I described in the distribution section. Moreover, you will be able to securely and easily collaborate with anyone in any application domain (even across application domains), since the distribution infrastructure supports features like data sharing out of the box. In other words, once you carefully address the data production and distribution aspects, the consumption aspect just follows naturally.
But most importantly, a single universal format and management platform allows you to store and manage all diverse data types and files in a single solution (and thus with a unified authentication, access control and logging mechanism), instead of having to juggle numerous solutions, data movements and conversions, and waste a lot of time on unnecessary data engineering.
One question remains: Is the above generic approach feasible? We have proof that it is. Here is how we addressed this problem at TileDB.
We have invented a universal data format and a powerful open-source storage engine to support it, called TileDB Embedded. The format is based on multi-dimensional arrays, which are generic enough to store any data type we have encountered, from tables, to genomics, to imaging, to ML models, to even flat files. We explain its internal mechanics in detail in a recent webinar.
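To give some intuition for why multi-dimensional arrays can cover both image-like and table-like data, here is a small plain-Python sketch. It is not the TileDB API or on-disk format. The helper functions and the sample values are invented for illustration. A dense array holds a value in every cell, while a sparse array holds cells only at the coordinates where data exists, which is one way to model a table whose columns are split into dimensions and attributes.

```python
# Dense 2D array: every (row, col) cell has a value, as in an image.
image = [
    [0, 10, 20],
    [30, 40, 50],
]


def slice_dense(array, rows, cols):
    """Return the sub-array for half-open ranges rows=(r0, r1), cols=(c0, c1)."""
    (r0, r1), (c0, c1) = rows, cols
    return [row[c0:c1] for row in array[r0:r1]]


# Sparse 2D array: a table with two columns chosen as dimensions
# (date, symbol) and the remaining columns stored as attributes.
trades = {
    (20210104, "AAA"): {"price": 10.0, "volume": 100},
    (20210104, "BBB"): {"price": 20.5, "volume": 250},
    (20210105, "AAA"): {"price": 10.2, "volume": 80},
}


def slice_sparse(cells, dates, symbols):
    """Return the cells whose coordinates fall inside the requested ranges."""
    d0, d1 = dates  # inclusive date range
    return {
        coord: attrs
        for coord, attrs in cells.items()
        if d0 <= coord[0] <= d1 and coord[1] in symbols
    }


crop = slice_dense(image, rows=(0, 2), cols=(1, 3))
day = slice_sparse(trades, dates=(20210104, 20210104), symbols={"AAA", "BBB"})
```

In both cases the query has the same shape, a range per dimension, which is the property that lets one storage engine serve very different kinds of data.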
Over the past several years we have built the ultimate data management platform for sharing, monetizing and analyzing data at global scale, called TileDB Cloud. The platform builds upon the universal TileDB format and, therefore, you can store, share, monetize and log everything: tables, flat files, dashboards, Jupyter notebooks, user-defined functions, and ML models. TileDB Cloud is completely serverless. You can slice, issue a SQL query, build and submit any complicated task graph, all without setting up infrastructure and in a pay-as-you-go fashion. This architecture allows you to access data that is shared with any provider in any cloud region, without you even having to think about it. TileDB Cloud automatically understands where the data you are accessing resides, and dispatches your query to clusters we maintain within that region. Finally, TileDB Cloud integrates with Stripe, making it super easy to monetize not only your data, but also code (e.g., UDFs, notebooks, ML models and dashboards). Data distribution problem, solved!
With the TileDB format and storage engine being universal and interoperable, you can access the data from any language and tool, while inheriting all the access control and logging capabilities of TileDB Cloud. Every single action is accounted for: from slicing, to SQL, to a Jupyter notebook, to UDFs, to task graphs, to ML. All in a single platform. And of course, you get to interact with your data either via Jupyter notebooks, or dashboards, or fast RESTful API access.
Data Economics requires drastic reshaping. The data management and analytics space is as hot as ever, and yet organizations are still struggling with foundational data management. In this blog post I made the case that we can do much better than that as a data community. I explained that there exists a new framework that considers the data problem holistically, with a panoramic view of Data Economics, from data production, to distribution, to consumption. And I argued that there is proof that such a framework is feasible. The proof today is what we built at TileDB, but I am hoping that we will inspire others to build similar solutions. There is way more at stake than just the data hype and VC firehose in the market. Serious organizations and scientific institutions are relying on us for accelerating Business, Science and Technology, for everyone’s sake!
Here are the slides I used in the talk.
A few final remarks:
Last but not least, a huge thank you to the entire team for all the amazing work. I am just a mere representative and am the exclusive recipient of complaints. All the credit always goes to our awesome team!