Aug 01, 2023

TileDB 101: Vector Search

Data Management
Data Science
Vector Database
Generative AI
11 min read
Isaiah Norton

VP of Engineering

In this article we provide a quickstart tutorial on the vector search capabilities of TileDB. I strongly recommend you read the blog "Why TileDB for Vector Search" before digging into this tutorial, especially if you are not familiar with the TileDB array database and how it naturally morphs into a vector database (vectors are 1D arrays after all).

TileDB’s core array technology lies in the open-source (MIT License) library TileDB-Embedded, but we developed the vector-search-specific components in a separate library, TileDB-Vector-Search, which is also open source under the MIT License. Like the core library, TileDB-Vector-Search is built in C++, but it also offers a Python API.

In the majority of this article, we will use the Python API of TileDB-Vector-Search, and the examples are reproducible on your local machine. However, TileDB also develops the commercial TileDB Cloud product, which provides additional governance and scalability features that we cover briefly in one section. All the code in this article is summarized in a TileDB Cloud notebook, which you can either download and run locally or launch on a Jupyter server directly in TileDB Cloud. Sign up to do so (no credit card required), and we’ll give you free credits so that you can evaluate without a hassle.

Setup

To install TileDB-Vector-Search, run:

# Using pip
pip install tiledb-vector-search

# Or, using conda (requires tiledb-cloud and scikit-learn)
pip install tiledb-cloud
conda install -c conda-forge -c tiledb tiledb-vector-search scikit-learn

We’ll first work with the small (10k) SIFT dataset from the Datasets for approximate nearest neighbor search site. Download the dataset from here (mirrored, from original source here) and run:

tar xf siftsmall.tgz

Later we’ll play with the 1B SIFT dataset on TileDB Cloud, which we have pre-ingested for you, so nothing to do here.

A Simple Vector Search Example

In this example I will show you how to ingest the small dataset you just downloaded, and run your first similarity search query. For more information, see the TileDB-Vector-Search API Reference.

We start by importing the necessary libraries:

import numpy as np
import tiledb
import tiledb.vector_search as vs
from tiledb.vector_search.utils import *
import sklearn

Ingestion

You can ingest vectors with a single command as follows.

flat_index = vs.ingest(
    index_type = "FLAT",
    array_uri = "sift10k_flat",
    source_uri = "siftsmall_base.fvecs",
    source_type = "FVEC",
    partitions=100
)

That will create a “vector search asset” which is equivalent to a “TileDB group” called sift10k_flat in your working directory. If we list its contents, we’ll find a single 2D dense array called shuffled_vectors, which is storing all the ingested vectors from file siftsmall_base.fvecs.

%%bash
ls -al sift10k_flat # List the group contents
total 0
drwxr-xr-x   6 stavrospapadopoulos  staff  192 Jul 31 23:29 .
drwxr-xr-x  11 stavrospapadopoulos  staff  352 Jul 31 23:30 ..
drwxr-xr-x   3 stavrospapadopoulos  staff   96 Jul 31 23:29 __group
drwxr-xr-x   3 stavrospapadopoulos  staff   96 Jul 31 23:29 __meta
-rw-r--r--   1 stavrospapadopoulos  staff    0 Jul 31 23:29 __tiledb_group.tdb
drwxr-xr-x   8 stavrospapadopoulos  staff  256 Jul 31 23:29 shuffled_vectors
%%bash
ls -l sift10k_flat/shuffled_vectors # List the array contents
total 0
drwxr-xr-x  3 stavrospapadopoulos  staff  96 Jul 31 23:29 __commits
drwxr-xr-x  2 stavrospapadopoulos  staff  64 Jul 31 23:29 __fragment_meta
drwxr-xr-x  3 stavrospapadopoulos  staff  96 Jul 31 23:29 __fragments
drwxr-xr-x  2 stavrospapadopoulos  staff  64 Jul 31 23:29 __labels
drwxr-xr-x  2 stavrospapadopoulos  staff  64 Jul 31 23:29 __meta
drwxr-xr-x  3 stavrospapadopoulos  staff  96 Jul 31 23:29 __schema
# Open the array
A = tiledb.open("sift10k_flat/shuffled_vectors")

# Print the schema - 2D dense array
print(A.schema)
ArraySchema(
  domain=Domain(*[
    Dim(name='rows', domain=(0, 127), tile=128, dtype='int32'),
    Dim(name='cols', domain=(0, 9999), tile=100, dtype='int32'),
  ]),
  attrs=[
    Attr(name='values', dtype='float32', var=False, nullable=False),
  ],
  cell_order='col-major',
  tile_order='col-major',
  capacity=12800,
  sparse=False,
)
# Print the first vector
print(A[:,0]["values"])
[  0.  16.  35.  …   8.  19.  25.  23.   1.]

The partitions parameter dictates the tiling of this array, but you can ignore it for now (we’ll discuss it at length in a separate blog). This vector search asset has no indexing (index_type = FLAT). Therefore, ingestion is fast, but similarity search is brute-force and hence slow for large datasets; in exchange, the recall (i.e., accuracy) is always 100%.
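To make "brute-force" concrete, here is a minimal sketch in plain Python (standard library only, so it runs without TileDB). It assumes Euclidean distance and uses a made-up toy dataset; the function name flat_search is hypothetical and this is not how TileDB-Vector-Search is implemented internally.

```python
import heapq
import math

def flat_search(vectors, query, k):
    """Return the ids of the k vectors closest to `query` (Euclidean distance)."""
    # Compare the query against every stored vector: exact, but O(n * d) per query.
    scored = ((math.dist(vec, query), vec_id) for vec_id, vec in enumerate(vectors))
    return [vec_id for _, vec_id in heapq.nsmallest(k, scored)]

# Toy data: four 2D vectors (hypothetical, not the SIFT dataset).
vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
print(flat_search(vectors, query=(1.0, 1.0), k=2))  # [1, 3]
```

Because every vector is inspected, the result is exact by construction, which is why a FLAT index always reaches 100% recall.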

To ingest the dataset and build an IVF_FLAT index, all you need to do is specify index_type = "IVF_FLAT":

ivf_flat_index = vs.ingest(
    index_type="IVF_FLAT",
    source_uri="siftsmall_base.fvecs",
    array_uri="sift10k_ivf_flat",
    source_type = "FVEC",
    partitions = 100
)

Ingestion now takes longer, since it also builds the index. In return, similarity search is much faster even for very large datasets, although the recall may be lower than 100%.

Looking into the contents of the created group sift10k_ivf_flat, we can now see that there are more arrays created (partition_centroids, shuffled_vector_ids and partition_indexes), which collectively comprise the IVF_FLAT index.

%%bash
ls -l sift10k_ivf_flat
total 0
drwxr-xr-x  3 stavrospapadopoulos  staff   96 Jul 31 23:28 __group
drwxr-xr-x  3 stavrospapadopoulos  staff   96 Jul 31 23:28 __meta
-rw-r--r--  1 stavrospapadopoulos  staff    0 Jul 31 23:28 __tiledb_group.tdb
drwxr-xr-x  8 stavrospapadopoulos  staff  256 Jul 31 23:28 partition_centroids
drwxr-xr-x  8 stavrospapadopoulos  staff  256 Jul 31 23:28 shuffled_vector_ids
drwxr-xr-x  8 stavrospapadopoulos  staff  256 Jul 31 23:28 partition_indexes
drwxr-xr-x  8 stavrospapadopoulos  staff  256 Jul 31 23:28 shuffled_vectors

You can explore the schema of those arrays and read their contents as you’d do with any other TileDB array. We will cover the TileDB-Vector-Search internals in detail in future blogs. Our blog post TileDB 101: Arrays can familiarize you with TileDB arrays.
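Until those posts are out, the following plain-Python sketch may help build intuition for what an inverted-file (IVF) index does. It is only an illustration under stated assumptions: a basic Lloyd's k-means seeded with the first vectors, Euclidean distance, and toy 2D data. The names kmeans, build_ivf and ivf_query are hypothetical, and the three returned structures only loosely mirror the roles of partition_centroids, shuffled_vector_ids and partition_indexes.

```python
import math

def nearest(centroids, vec):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: math.dist(centroids[c], vec))

def kmeans(vectors, n_partitions, n_iter=10):
    """Plain Lloyd's k-means, seeded with the first n_partitions vectors."""
    centroids = [tuple(v) for v in vectors[:n_partitions]]
    for _ in range(n_iter):
        members = [[] for _ in centroids]
        for vec in vectors:
            members[nearest(centroids, vec)].append(vec)
        for c, group in enumerate(members):
            if group:  # keep the old centroid if a partition ends up empty
                centroids[c] = tuple(sum(col) / len(group) for col in zip(*group))
    return centroids

def build_ivf(vectors, n_partitions):
    """Return (centroids, shuffled_ids, offsets)."""
    centroids = kmeans(vectors, n_partitions)
    assignment = [nearest(centroids, v) for v in vectors]
    # Vector ids reordered so that each partition is one contiguous run.
    shuffled_ids = sorted(range(len(vectors)), key=lambda i: assignment[i])
    # offsets[p]:offsets[p + 1] delimits partition p inside shuffled_ids.
    offsets = [0]
    for p in range(n_partitions):
        offsets.append(offsets[-1] + assignment.count(p))
    return centroids, shuffled_ids, offsets

def ivf_query(vectors, index, query, k, nprobe):
    centroids, shuffled_ids, offsets = index
    # Probe only the nprobe partitions whose centroids are closest to the query.
    by_distance = sorted(range(len(centroids)), key=lambda c: math.dist(centroids[c], query))
    candidates = [i for p in by_distance[:nprobe]
                  for i in shuffled_ids[offsets[p]:offsets[p + 1]]]
    return sorted(candidates, key=lambda i: math.dist(vectors[i], query))[:k]

# Toy data: two well-separated clusters (hypothetical).
vectors = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (2.0, 0.0), (10.0, 11.0), (12.0, 10.0)]
index = build_ivf(vectors, n_partitions=2)
print(ivf_query(vectors, index, (5.6, 5.6), k=2, nprobe=1))  # [1, 4]
print(ivf_query(vectors, index, (5.6, 5.6), k=2, nprobe=2))  # [1, 3]
```

The two printed results differ: with nprobe=1 the search skips the partition that holds vector 3, the true second-nearest neighbor, so recall drops to 50%. Probing more partitions trades speed for recall.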

Similarity Search

To run similarity search on the ingested vectors, we’ll load the queries and ground truth vectors from the siftsmall dataset we downloaded (noting though you can use any vector to query this dataset). We provide load_fvecs and load_ivecs as auxiliary functions in the tiledb.vector_search.utils module.

# Get query vectors with ground truth
query_vectors = load_fvecs("siftsmall_query.fvecs")
ground_truth = load_ivecs("siftsmall_groundtruth.ivecs")
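If you are curious about what is inside these files, the sketch below reads the .fvecs layout with the standard library only. It assumes the usual layout of this dataset family: each record is a little-endian int32 dimension d followed by d float32 values (.ivecs is the same with int32 values). The helpers read_fvecs and write_fvecs are hypothetical names for illustration and are not the load_fvecs implementation.

```python
import os
import struct
import tempfile

def read_fvecs(path):
    """Read a .fvecs file: each record is an int32 dimension d, then d float32 values."""
    vectors = []
    with open(path, "rb") as f:
        while header := f.read(4):
            (dim,) = struct.unpack("<i", header)
            vectors.append(struct.unpack(f"<{dim}f", f.read(4 * dim)))
    return vectors

def write_fvecs(path, vectors):
    """Write vectors in the same layout (handy for small experiments)."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(struct.pack("<i", len(vec)))
            f.write(struct.pack(f"<{len(vec)}f", *vec))

# Round-trip two toy vectors through a temporary file.
path = os.path.join(tempfile.mkdtemp(), "toy.fvecs")
write_fvecs(path, [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)])
print(read_fvecs(path))  # [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]
```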

To return the most similar vectors to a query vector, simply run:

# Select a query vector
query_id = 77
qv = np.array([query_vectors[query_id]])

# Return the 100 most similar vectors to the query vector with FLAT
result = flat_index.query(qv, k=100)

# Return the 100 most similar vectors to the query vector with IVF_FLAT
# (you can set the nprobe parameter)
#result = ivf_flat_index.query(qv, nprobe = 10, k=100)

To check the result against the ground truth, run:

# For FLAT, the following will always be true
np.all(result == ground_truth[query_id])
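For IVF_FLAT the exact-match check above is too strict, since a few neighbors may be missed. A common alternative is to measure recall, that is, the fraction of true neighbors that were returned. Here is a minimal sketch; recall_at_k is a hypothetical helper, not part of the TileDB-Vector-Search API, and the ids below are made up.

```python
def recall_at_k(result_ids, truth_ids):
    """Fraction of the true k nearest neighbors that appear in the result."""
    truth = set(truth_ids)
    return len(truth.intersection(result_ids)) / len(truth)

# Three of the four true neighbors were found, in any order.
print(recall_at_k([3, 7, 9, 4], [7, 3, 4, 1]))  # 0.75
```

Assuming the query returns a 2D array of ids with one row per query vector, you could call it as recall_at_k(result[0], ground_truth[query_id]).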

You can even run batches of searches, which are very efficiently implemented in TileDB-Vector-Search:

# Simply provide more than one query vector
result = ivf_flat_index.query(np.array([query_vectors[5], query_vectors[6]]), nprobe=2, k=100)
result
array([[1097, 1239, 3227,  804, …,  849, 9262],
       [3013, 1682, 8581, 2774, …, 9694, 9704]], dtype=uint64)

To query a vector search asset in a later session, you simply need to run the following command to initiate the index for queries:

# URI of the IVF_FLAT asset created during ingestion
uri = "sift10k_ivf_flat"
index = vs.IVFFlatIndex(uri)
query_id = 77
result = index.query(np.array([query_vectors[query_id]]), k=10)

Cloud Object Store Support

TileDB natively supports several of the most widely used cloud object stores with no additional dependencies. For example, if you have configured your AWS S3 account with default credentials in $HOME/.aws, then TileDB can read from and write to S3 directly, with no additional configuration, as follows:

import os

data_dir = "<where your source vector file resides>"
output_uri = "s3://tiledb-isaiah2/vector_search/sift10k_flat"
index = vs.ingest(
    index_type = "IVF_FLAT",
    array_uri = output_uri,
    source_uri = os.path.join(data_dir, "siftsmall_base.fvecs"),
    source_type = "FVEC",
    partitions = 100
)

For more information on TileDB’s cloud object storage support, see the blog TileDB 101: Cloud Object Storage. For additional configuration and authentication options for AWS S3, Azure, and GCS, see the TileDB documentation.
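If you prefer to pass credentials explicitly rather than rely on the default credentials file, one option is to build a configuration from environment variables. The sketch below only assembles a dictionary of TileDB's vfs.s3.* settings; the helper s3_config_from_env and the fallback region are hypothetical, and passing a plain dictionary as the config argument of ingest is an assumption you should check against the API reference.

```python
import os

def s3_config_from_env(region="us-east-1"):
    """Build TileDB VFS settings for S3 from the standard AWS environment variables."""
    return {
        "vfs.s3.region": os.environ.get("AWS_DEFAULT_REGION", region),
        "vfs.s3.aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID", ""),
        "vfs.s3.aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
    }

config = s3_config_from_env()
print(sorted(config))

# Hypothetical usage, assuming ingest accepts a plain mapping for `config`:
# index = vs.ingest(index_type="IVF_FLAT", array_uri=output_uri,
#                   source_uri=source_uri, source_type="FVEC",
#                   partitions=100, config=config)
```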

Power Up with TileDB Cloud

If you wish to boost the vector search performance and enjoy some important data management features, you can use TileDB-Vector-Search on TileDB Cloud (sign up and we will give you free credits for your trial). In this section I will describe how to perform serverless, distributed ingestion and vector search using TileDB Cloud’s task graphs. I will also cover the exciting data governance functionality of TileDB Cloud, which allows you to securely share your assets with other users, discover other users’ public work and log every action for auditing purposes.

If you are interested in delving into the general functionality of TileDB Cloud, you can read our other TileDB Cloud blog posts.

Scaling with serverless, distributed task graphs

Distributed ingestion can greatly speed up and horizontally scale out the ingestion of vectors. In this example, we've ingested the SIFT 1 Billion vector dataset using the IVF_FLAT index (which involved computing K-means, a computationally intensive operation) in 46 minutes for a total cost of $11.385 in TileDB Cloud.
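To put those figures in perspective, here is a quick back-of-the-envelope calculation. It uses only the numbers quoted above (1 billion vectors, 46 minutes, $11.385); the derived rates are simple averages, not measured benchmarks.

```python
n_vectors = 1_000_000_000
minutes = 46
cost_usd = 11.385

# Average ingestion rate and cost, derived from the totals above.
vectors_per_second = n_vectors / (minutes * 60)
cost_per_million = cost_usd / (n_vectors / 1_000_000)

print(f"{vectors_per_second:,.0f} vectors/s")           # 362,319 vectors/s
print(f"${cost_per_million:.4f} per million vectors")   # $0.0114 per million vectors
```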

To ingest in a distributed, parallel fashion, simply set the mode to "BATCH" for batch ingestion, and pass a tiledb.cloud.Config() parameter with your TileDB credentials:

import tiledb
import tiledb.cloud
import tiledb.vector_search as vs

output_uri = "tiledb://TileDB-Inc/s3://tiledb-exaxmple/vector-search/ann_sift1b"
source_uri = "tiledb://TileDB-Inc/6a9a8e97-d99c-4ddb-829a-8455c794906e"

vs.ingest(
    index_type = "IVF_FLAT",
    array_uri = output_uri,
    source_uri = source_uri,
    source_type = "TILEDB_ARRAY",
    partitions = 10_000,
    config = tiledb.cloud.Config(),
    mode = vs.Mode.BATCH
)

Note that ingest automatically calculates the number of workers that will work in parallel to perform the ingestion, and there is no need to spin up any cluster - everything is serverless!

TileDB Cloud records all the details about the task graph, and lets you monitor it in real time.

image01-task-graph monitor.png

Distributed queries let you run queries at higher throughput and lower latency, yielding a much higher QPS. In the example below, we submit a batch of 1000 query vectors to the 1 billion vector dataset in 23 seconds, for a cost of $0.10.

Performing a distributed query is similar to the ingestion described above, but now you can set mode to REALTIME, which is faster:

import numpy as np

# If running locally, log in to TileDB Cloud
#tiledb.cloud.login(token='your-api-key')

uri = "tiledb://TileDB-Inc/ann_sift1b"
query_vector_uri = "tiledb://TileDB-Inc/bigann_1b_ground_truth_query"
n_vectors = 10000

index = vs.IVFFlatIndex(uri, config=tiledb.cloud.Config(), memory_budget=10)

# Load the query vectors, transpose them to one vector per row, and keep the first n_vectors
query_vectors = vs.load_as_array(query_vector_uri, config=tiledb.cloud.Config())
query_vectors = np.transpose(query_vectors.astype(np.float32))[0:n_vectors]

results = index.query(query_vectors, k=10, nprobe=1, mode=tiledb.cloud.dag.Mode.REALTIME, num_partitions=30)

Vector search asset tiledb://TileDB-Inc/ann_sift1b is public (see “Governance” section below) and, thus, the above code will “just work” if you try to use it with your credentials.

Here is the task graph output of the above query.

image02-distributed-query-vector-search.png
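As a rough illustration of why such a query parallelizes well (and explicitly not a description of the TileDB Cloud implementation), imagine each worker searching its own subset of partitions and returning a sorted list of (distance, vector id) pairs. The partial lists can then be merged into a global top k. The function merge_top_k and the sample values are hypothetical.

```python
import heapq
from itertools import islice

def merge_top_k(partial_results, k):
    """Merge per-worker lists of (distance, vector_id), each sorted by distance."""
    merged = heapq.merge(*partial_results)  # lazily merges already-sorted inputs
    return [vec_id for _, vec_id in islice(merged, k)]

# Made-up partial results from two workers.
worker_a = [(0.5, 12), (1.7, 40), (2.2, 3)]
worker_b = [(0.9, 77), (1.1, 5), (3.0, 8)]
print(merge_top_k([worker_a, worker_b], k=4))  # [12, 77, 5, 40]
```

Each worker only needs to return its own k best candidates, so the merge step stays cheap no matter how large the dataset is.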

Governance

TileDB Cloud allows you to quickly explore any vector search datasets right in the UI. We have several public datasets available, including the original ANN SIFT 1Billion Raw Vectors. This array lets you query the original vectors and get the data back as numpy arrays without having to parse the ivecs file. Simply slice directly by vector id!

uri = "tiledb://TileDB-Inc/ann_sift1b_raw_vectors"
with tiledb.open(uri, 'r', config=tiledb.cloud.Config()) as A:
  vector_id = 1
  print(A[vector_id])

Additional datasets include the pre-ingested BigANN dataset (with FLAT and IVF_FLAT indexes), as well as the tensorflow flowers and open drone map datasets.

image03-public-vector-datasets.png

Let's take a look at the BigANN SIFT 1 Billion dataset. When you open up the dataset you can see the components, as well as the overview, sharing, and settings.

You can quickly share the dataset with any other user or organization on the sharing page.

image04-share-annsift1b.png

Any dataset can also easily be made public right on the setting page.

image05-makepublic-annsift1b.png

Logging access and providing a full audit trail is built directly into TileDB Cloud. When you are sharing data with third parties or making data public, it is important to be able to capture access and understand what people are doing with the data. TileDB provides logs for all access including specific details about the data access and code used to perform that access.

image06-logging-access-annsift1b.png

This gives you just a small taste of the power and flexibility of TileDB Cloud. Start your own exploration today and do not hesitate to send us your feedback!

Stay Tuned

This article covered the very basics of TileDB’s vector search capabilities. Our team is working hard on multiple exciting new algorithms and features, so look out for the upcoming blogs on the TileDB-Vector-Search internals, benchmarks, integrations with LLMs, and more. You can also keep up with all other TileDB news by reading our blog.

We'd love to hear what you think of this article. Feel free to contact us, join our Slack community, or let us know on Twitter and LinkedIn.
