Spatial data is dominant in numerous applications, such as earth observation, location-based services, defense, biomedical and many more. Such data is very diverse and comes in various forms, including two important classes: (1) 2D or 3D points (e.g., LiDAR and SONAR) and (2) arbitrary 2D or 3D polygons (e.g., buildings, cell and nucleus boundaries and areas of interest). Querying such data entails submitting a rectangular or other arbitrary multi-polygon regions with a spatial operator such as contains
or intersects
with the points or polygons. These geospatial data and queries are commonly referred to as “geometries” that represent geographic data objects in the Geospatial world. Providing an efficient mechanism for handling querying and storing geometries at grand scale (hundreds of millions of billions of points and polygons) is a challenging task.
The Geospatial domain has well known query methods and storage encodings for geometries so we are focusing on geospatial applications in this blog. However, look out for a blog post on our innovative work with spatially resolved omics data where we combine the domain knowledge from the biomedical and geospatial experts within TileDB Inc.
Today we are announcing full geometries support in TileDB! TileDB is a multi-dimensional array database management system and in this blog I explain why it is a natural, super efficient choice for handling geometries. Databases such as PostgreSQL (and specifically its extension, PostGIS) and Oracle have had spatial capabilities for many years, whereas recently new open formats for storing geometries have come up, such as GeoParquet. TileDB brings a lot more to the table as compared to these other approaches. Here are a few takeaways:
We worked with Even Rouault, the primary maintainer of GDAL, to implement support for geometries in the GDAL OGR driver which we use to ingest geospatial data into TileDB arrays.
In the remainder of the article, I will first cover the very basics on Geometries and TileDB. I will then explain how TileDB supports geometries and why it is a very powerful solution. Next, I will outline the competitive advantage of TileDB versus other approaches. Finally, I will conclude with our upcoming work on geometries.
For those who wish to jump directly on some action, read our TileDB 101: Geometries blog post.
Enjoy!
Geospatial geometries are defined in ISO-19125 parts 1 and 2 and are primarily points, polygons, curves, and surfaces. The figure below shows some examples.
(b) ST_Intersects query over multiple building geometriesMap data by © OpenStreetMap, under ODbL. Buildings from Overture Maps Foundation, under ODbL. |
Geometries can be applied as queries on 2D or 3D point cloud data, or they can be stored as the underlying data and queried by other geometries. In every case, storage compression and computational performance are very important.
Software that supports geometries includes GDAL, Fiona, R-Spatial and the broader open-source ecosystem that integrates with it (e.g., such as GeoPandas with pyogrio in Python), as well as PostGIS where you can use SQL statements to ingest and query your data. GeoParquet is another alternative, which can integrate with other database systems like DuckDB.
TileDB is a database system architected around dense and sparse multi-dimensional arrays. It consists of (1) an open-source C++ library called TileDB Embedded and a broad open-source ecosystem of APIs and integrations, and (2) a commercial product called TileDB Cloud. TileDB Embedded is the core array engine responsible for the open-spec array format, whereas TileDB Cloud is a full-fledged, secure database management system, which comes with a powerful serverless distributed computing platform, a holistic catalog, governance capabilities and interactive analysis tools.
The TileDB array model serves as the basis for many vertical applications, such as in Life Sciences, Geospatial and Machine Learning. We were fortunate to attract and build geospatial expertise within our team and dedicated resources to build vertical solutions for the Geospatial sector. Geometries is one of our geospatial vertical products.
The next section explains how we implemented geometries at TileDB, and why it is a great fit for such a use case.
TileDB has native support for sparse arrays, which can effectively store any multi-dimensional point in space. But how would TileDB store arbitrary shapes? The trick is to calculate and store the center of the envelope of the 2D or 3D geometry. In addition, for each geometry, we calculate the distances in X, Y and Z from the center of the envelope of that geometry to the envelope boundary. We finally store the maximum of all those distances as part of the array metadata. This value will be used during the query phase. The geometry itself is stored as a TileDB attribute in Well Known Binary format, and the geospatial metadata is maintained in the TileDB array or group.
During any arbitrary spatial query, we first calculate the minimum bounding box of the query geometry, and then expand it by the maximum distance we calculated during ingestion. Any point in the array included in this expanded bounding box is either a result (i.e., an actual data geometry intersected or included in the query geometry) or not included in the actual query geometry. We then filter the returned results from the expanded bounding box query in a fast post-processing step.
The cool thing about this implementation is that the centers of the geometry bounds, represented as sparse TileDB array cells, are natively indexed in TileDB with R-trees. That allows for super efficient pruning of unnecessary points that are guaranteed not to contribute the result, leading to superb query performance.
You can use the TileDB geometries in a variety of ways:
Here are some code examples with GDAL / GeoPandas and MariaDB:
import fiona
fiona.drvsupport.supported_drivers["TileDB"] = "rw"
fiona.vfs.SCHEMES["tiledb"] = "tiledb"
import geopandas as gpd
import matplotlib.pyplot as plt
import shapely
x = 172.63
y = -43.53
circle = shapely.buffer(shapely.Point(x, y), 0.0025)
gdf = gpd.read_file(building_array, engine="pyogrio", use_arrow=True, bbox=circle.bounds)
for r in range(gdf.shape[0]):
if gdf.geometry[r].intersects(circle):
for g in gdf.geometry[r].geoms:
plt.plot(*g.exterior.xy)
plt.plot(*circle.exterior.xy)
plt.show()
The same query with MariaDB and SQL:
tiledb.cloud.sql.exec(
query="""
SELECT names, ST_GeometryFromWkb(geometry) as geom
FROM `tiledb://norman/overture_buildings`
WHERE ST_Intersects(
ST_GeometryFromWkb(geometry),
ST_Buffer(ST_GeometryFromText('POINT(172.63 -43.53)'), 0.0025)
);
"""
)
Other benefits with open-source TileDB:
Extra benefits in TileDB Cloud:
As a nice demonstration of the power of TileDB Cloud’s computational infrastructure, we worked with the buildings dataset from the Overture Maps Foundation that contains buildings from the entire world. Currently this dataset consists of 120 Parquet files that contain 785,524,851 buildings.
It was possible to ingest the whole dataset in parallel and completely serverless with the task graph shown in the figure below. The ingestion took 20 minutes, and the cost was just a few dollars. If you are interested in learning more about task graphs, I recommend reading the blog post TileDB Cloud: Task Graphs.
After the ingestion, all buildings reside as geometries stored in a singleTileDB array on AWS S3. We use the spatial filter from this article to measure performance of the spatial queries against TileDB and others.
This array can be read through any one of our APIs or integrations. Querying the array through GDAL or our Python API for all buildings WITHIN the polygon boundary below takes about 12 seconds and returns 9,037 buildings (whereas querying using a specialized alternative like Apache Sedona takes about 48 seconds).
Yet one more benefit for adopting TileDB:
TileDB’s competitive advantage for geometries can be summarized as follows:
In addition to the above, here is a brief qualitative and quantitative comparison versus other solutions you might know. In our experiments we used the building dataset from Overture Maps Foundation and tested with storing on S3 as well as local disk on a 8-core machine and 32 GB of RAM (m5a.2xlarge on AWS EC2) with the polygon boundary query above as the actual spatial filter. Note that the system caches were cleared before each test run and with the exception of PostGIS all datasets used ZSTD compression.
We include the preliminary benchmarks below and summarize the results in turn for each solution immediately after that.
Notes:
bench_ogr_c_api
iterates over features with calls to GetNextFeature() in the GDAL APIbench-ogr_batch
can be used to measure the raw performance of the Arrow API within GDAL OGR.bench_ogr_batch
could not be used with a partitioned GeoParquet dataset.GDAL_NUM_THREADS
was set to be ALL_CPUS
. (minx, miny, maxx, maxy)
of (-122.235128, 47.574044, -122.181229, 47.657588)
.Notes:
Currently, the primary method of easily accessing geometries that are stored in TileDB arrays is through GDAL and libraries built on top of it such as Fiona/Shapely or R-Spatial. The spatial queries can also be performed using many of our integrations in C++, Java and Python but require decoding the Well Known Binary (WKB) and interpreting the geospatial metadata. In the near future, we plan to do the following:
Supporting geometries in TileDB is exciting and enables users to support simpler workflows for processing their geospatial data. Stay tuned for more updates from the TileDB team!
In this article I described the support for geometries in TileDB and how you can now store geographic data objects as TileDB arrays. You can get kickstarted with TileDB geometries by reading the blog TileDB 101: Geometries. You can find a plethora of more information about TileDB on our official website.
We’d love to connect with you and hear what you think! Feel free to contact us, join our Slack community, or let us know on Twitter and LinkedIn.