Jun 09, 2023

TileDB Cloud: User-Defined Functions

Data Science
Seth Shelnutt

CTO, TileDB

TileDB Cloud supports numerous capabilities that set it apart from traditional databases and other database-like systems. In this blog post we highlight support for arbitrary User-Defined Functions (UDFs), which are the foundation for running entire pipelines and sophisticated workflows that go well beyond simple ETL and SQL queries. We call these workflows Task Graphs and will cover them in detail in a future blog post. Here I will focus only on UDFs, since they bring a lot of value even when used standalone.

This blog will use the TileDB-Cloud-Py, TileDB-Cloud-R and TileDB-Cloud-JS packages to showcase the heterogeneous environments that we support. You can reproduce everything covered here by signing up on TileDB Cloud (we will give you free credits to get you started) and running all the code examples included in the article.

What is a User-Defined Function?

A user-defined function (UDF) is literally any piece of code written in some programming language that can be executed inside the TileDB Cloud compute environment.

We have designed and built the UDF support from the ground up to work in a completely serverless manner. When you execute a UDF, that function is serialized and sent to the TileDB Cloud service, which runs it in a sandboxed and isolated environment. Resources are automatically handled by the TileDB Cloud service based on the user’s selection. The user does not need to worry about manually spinning up EC2 instances, or any other compute infrastructure. Instead, you just call the function to be executed, and it's automatically dispatched for you in a lambda-like fashion.
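To make the serialize-and-ship step concrete, here is a minimal local sketch that uses only the Python standard library. It is not TileDB Cloud's actual wire format or client code: `serialize_function` and `run_serialized` are hypothetical helpers, and the "server" is simply another function call in the same process.

```python
import base64
import builtins
import marshal
import types

def serialize_function(func):
    # Client side (hypothetical): pack the function's bytecode into text.
    return base64.b64encode(marshal.dumps(func.__code__)).decode("ascii")

def run_serialized(payload, *args, **kwargs):
    # "Server" side (hypothetical): rebuild the function from the payload
    # with a fresh global namespace, then call it.
    code = marshal.loads(base64.b64decode(payload))
    func = types.FunctionType(code, {"__builtins__": builtins})
    return func(*args, **kwargs)

def add(x, y):
    return x + y

payload = serialize_function(add)
result = run_serialized(payload, 2, 3)
print(result)
```

Only the bytecode travels in this sketch, so anything imported at module level on the client is not available on the other side. The UDF examples in this post import their dependencies inside the function body, which keeps them working under this kind of scheme.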

Today TileDB Cloud supports UDFs in Python and R, but soon we are adding support for more languages. In addition, TileDB Cloud has three UDF types:

  • Single Array UDFs: UDFs that get applied to any slice and attribute subset of a single array (e.g., it can be a sum of the cell values in a subarray).
  • Multi Array UDFs: UDFs that get applied to any slice and attribute subset of any number of arrays (e.g., it can be a join or a matrix product).
  • Generic UDFs: Arbitrary UDFs with any inputs and outputs (they can implement literally anything).

An important benefit of array UDFs versus generic UDFs is that you don’t get charged for data egress for the data sliced outside the specified arrays. On the other hand, a key benefit of generic UDFs is that there is no restriction on what code you can execute in them and, thus, they provide maximum flexibility.

Single Array UDFs

You can use this type of UDF to apply a function directly to the data of a particular array. The UDF is executed using the apply (in Python) / execute_array_udf (in R) function of TileDB arrays, which takes as input a subarray slice and a list of attributes. The result of this query (i.e., slicing the array and subselecting on attributes) is passed to a Python UDF as an OrderedDict of NumPy arrays, and to an R UDF as a data.frame.

import tiledb
import tiledb.cloud

# Define the UDF
# The input is the array slice results as an OrderedDict
def median(data):
    import numpy
    return numpy.median(data["a"])

# The "apply" function takes as input the function, an array slice
# and any attribute subset, and passes to the function the result of
# that TileDB query, i.e., A.query(attrs=["a"])[1:2, 1:2]
with tiledb.open("tiledb://TileDB-Inc/quickstart_dense") as A:
    results = A.apply(median, [(1,2), (1,2)], attrs = ["a"])
    print(results)
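
Because a single-array UDF is an ordinary function of an OrderedDict, you can exercise it locally before sending it to TileDB Cloud. The sketch below hand-builds such a dictionary. The values are made up for illustration (they are not read from `quickstart_dense`), nested lists stand in for the 2-D NumPy array so that nothing beyond the standard library is needed, and `median_of_a` is a hypothetical variant of `median`.

```python
from collections import OrderedDict
from statistics import median

def median_of_a(data):
    # Flatten the 2-D slice of attribute "a" row by row, then take the median.
    # Iterating rows this way works for nested lists and for 2-D NumPy arrays.
    return median(value for row in data["a"] for value in row)

# Hand-built stand-in for the result of slicing two rows and two columns.
fake_slice = OrderedDict(a=[[1, 2], [5, 6]])

local_result = median_of_a(fake_slice)
print(local_result)
```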

Multi-Array UDFs

Multi-array UDFs let you apply a function to a list of arrays. Each entry in the list specifies the array name, the subarray to apply the UDF on, and a subset of array attributes. This can be useful for performing join operations, data fusion and much more.

array_1 = "tiledb://TileDB-Inc/quickstart_sparse"
array_2 = "tiledb://TileDB-Inc/quickstart_dense"

def median(data):
    import numpy
    # When you have multiple arrays, the parameter
    # we pass in is a list of ordered dictionaries.
    # The list is in the order of the arrays you asked for.
    return (
        # perform a median over both arrays
        numpy.median(numpy.concatenate((data[0]["a"].flatten(), data[1]["a"].flatten())))
    )

# The following will create the list of arrays to take part
# in the multi-array UDF. Each has as input the array name,
# a multi-index for slicing and a list of attributes to subselect on.
array_list = tiledb.cloud.array.ArrayList()    
array_list.add(array_1, [(1, 4), (1, 4)], ["a"])
array_list.add(array_2, [(1, 2), (1, 4)], ["a"])

# This will execute `median` using as input the result of the
# slicing and attribute subselection for each of the arrays 
# in `array_list`
result = tiledb.cloud.array.exec_multi_array_udf(median, array_list)

print(result)
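
Joins are one of the use cases named for multi-array UDFs. As a rough local sketch, the function below inner-joins two sparse-style results on their coordinates. It assumes each dictionary carries the dimension coordinates under the keys `rows` and `cols` alongside attribute `a`. Those key names, the sample values, and `join_on_coords` are assumptions for illustration, and plain lists stand in for NumPy arrays.

```python
from collections import OrderedDict

def join_on_coords(data):
    # data is a list of dictionaries, one per array, in the order requested.
    left, right = data[0], data[1]
    # Index the right-hand cells by their (row, col) coordinate.
    right_cells = {
        (r, c): v for r, c, v in zip(right["rows"], right["cols"], right["a"])
    }
    # Keep only the coordinates present in both inputs.
    return [
        (r, c, v, right_cells[(r, c)])
        for r, c, v in zip(left["rows"], left["cols"], left["a"])
        if (r, c) in right_cells
    ]

# Made-up inputs shaped like two sparse query results.
left = OrderedDict(rows=[1, 2, 2], cols=[1, 3, 4], a=[1, 3, 2])
right = OrderedDict(rows=[1, 2, 4], cols=[1, 4, 4], a=[10, 20, 30])

joined = join_on_coords([left, right])
print(joined)
```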

Generic UDFs

In addition to running functions on specific arrays, TileDB Cloud can run completely generic functions that are not tied to any array. These functions give you the power and flexibility to build out any kind of computational workload.

Let's examine a quick example to showcase how simple and powerful this can be.

def hello(str="world"):
   return f"hello {str}"

tiledb.cloud.udf.exec(hello)

We can even run machine learning functions directly through TileDB Cloud's UDFs. Let’s take the quickstart example from TensorFlow and run it as a function.

# An arbitrary Python function - this could be truly anything
def ml():
    import tensorflow as tf

    mnist = tf.keras.datasets.mnist

    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = tf.keras.models.Sequential([
      tf.keras.layers.Flatten(input_shape=(28, 28)),
      tf.keras.layers.Dense(128, activation='relu'),
      tf.keras.layers.Dropout(0.2),
      tf.keras.layers.Dense(10)
    ])

    predictions = model(x_train[:1]).numpy()
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)


    model.compile(optimizer='adam',
                  loss=loss_fn,
                  metrics=['accuracy'])

    model.fit(x_train, y_train, epochs=5)
    model.evaluate(x_test,  y_test, verbose=2)

    probability_model = tf.keras.Sequential([
      model,
      tf.keras.layers.Softmax()
    ])

    return probability_model(x_test[:5])

# This is how we call a generic UDF
tiledb.cloud.udf.exec(ml)

Registering UDFs

Another powerful feature of TileDB Cloud UDFs is the ability to register a function. Instead of writing or copying/pasting numerous lines of UDF code, you can write it once and register it on TileDB Cloud. After you register a UDF, it can be invoked directly by name, passing any input parameters. This even lets you call functions across languages, for example calling a Python function right from R. Finally, you can securely share the UDF so that collaborators can call it by name without having to rewrite it!

# Register a generic UDF
tiledb.cloud.udf.register_generic_udf(hello, name="hello_py", namespace=tiledb.cloud.user_profile().username)

# Execute registered generic UDF
tiledb.cloud.udf.exec("TileDB-Inc/hello_py", str="mars")

# Register a single-array UDF
tiledb.cloud.udf.register_single_array_udf(median, name="median_single_array_py", namespace=tiledb.cloud.user_profile().username)

# Apply registered single-array UDF
tiledb.cloud.array.apply("tiledb://TileDB-Inc/quickstart_dense", "TileDB-Inc/median_single_array_py", [(1,2), (1,2)], attrs = ["a"])

# Register a multi-array UDF
tiledb.cloud.udf.register_multi_array_udf(median, name="median_multi_array_py", namespace=tiledb.cloud.user_profile().username)

# Execute registered multi-array UDF
array_1 = "tiledb://TileDB-Inc/quickstart_sparse"
array_2 = "tiledb://TileDB-Inc/quickstart_dense"
array_list = tiledb.cloud.array.ArrayList()    
array_list.add(array_1, [(1, 4), (1, 4)], ["a"])
array_list.add(array_2, [(1, 2), (1, 4)], ["a"])
tiledb.cloud.array.exec_multi_array_udf("TileDB-Inc/median_multi_array_py", array_list)
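
Conceptually, registration maps a `namespace/name` string to stored function code, so callers need only the name. The toy in-memory registry below illustrates that idea. It is not how TileDB Cloud stores or resolves UDFs, and `register`, `exec_by_name` and the `demo-user` namespace are hypothetical.

```python
_registry = {}

def register(func, name, namespace):
    # Store the function under "namespace/name", rejecting duplicates.
    key = f"{namespace}/{name}"
    if key in _registry:
        raise ValueError(f"{key} is already registered")
    _registry[key] = func
    return key

def exec_by_name(key, *args, **kwargs):
    # Look the function up by name and call it with the given arguments.
    try:
        func = _registry[key]
    except KeyError:
        raise LookupError(f"no UDF registered as {key}") from None
    return func(*args, **kwargs)

def hello(str="world"):
    return f"hello {str}"

key = register(hello, name="hello_py", namespace="demo-user")
greeting = exec_by_name(key, str="mars")
print(greeting)
```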

You can call the Python function from R as follows (it also works vice versa):

# Execute generic UDF
tiledbcloud::execute_generic_udf(registered_udf_name='TileDB-Inc/hello_py', args=list(str="from R"), result_format="JSON")

# Execute single-array UDF
results <- tiledbcloud::execute_array_udf(
  array="TileDB-Inc/quickstart_dense",
  registered_udf_name="TileDB-Inc/median_single_array_py",
  selectedRanges=list(cbind(1,2), cbind(1,2)),
  attrs=c("a"), result_format="JSON"
)
print(results)

# Execute multi-array UDF
# The following will create the list of arrays to take part in
# the multi-array UDF. Each has as input the array name, a
# multi-index for slicing and a list of attributes to subselect on.
details1 <- tiledbcloud::UDFArrayDetails$new(
  uri="tiledb://TileDB-Inc/quickstart_dense",
  ranges=QueryRanges$new(
    layout=Layout$new('row-major'),
    ranges=list(cbind(1,4),cbind(1,4))
  ),
  buffers=list("a")
)

details2 <- tiledbcloud::UDFArrayDetails$new(
  uri="tiledb://TileDB-Inc/quickstart_sparse",
  ranges=QueryRanges$new(
    layout=Layout$new('row-major'),
    ranges=list(cbind(1,2),cbind(1,4))
  ),
  buffers=list("a")
)

# This will execute the registered multi-array UDF using as input the
# result of the slicing and attribute subselection for each of the
# arrays in `array_list`
result <- tiledbcloud::execute_multi_array_udf(
  array_list=list(details1, details2),
  registered_udf_name="TileDB-Inc/median_multi_array_py"
)
print(result)

You can even call registered functions right from the browser! Here is an example from JavaScript:

const result = await client.udf.exec("TileDB-Inc/hello_py", ["from javascript"]);

Environment

UDFs can be run in two compute environments:

  • Standard (default): 2GB of RAM and 2 CPUs
  • Large: 8GB of RAM and 8 CPUs

UDFs can choose from multiple Docker images for the execution environment. Currently there are four Python images and one R image:

Python:

  • Default image ("3.9" and "3.7")
  • Geospatial image ("3.9-geo", "3.7-geo")
  • Genomics image ("3.9-genomics", "3.7-genomics")
  • Imaging ("3.9-imaging", "3.7-imaging")

R:

  • Default image ("4.0", "4.1")

tiledb.cloud.udf.exec(hello, resource_class="standard")
tiledb.cloud.udf.exec(hello, resource_class="large", image_name="3.9-geo")
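
If you can estimate a UDF's peak memory ahead of time, choosing a resource class can be automated. The helper below is a hypothetical sketch built only from the limits listed above (2GB for `standard`, 8GB for `large`). `pick_resource_class` is not part of the TileDB Cloud client.

```python
# Memory limits in GB, taken from the environment list in this post.
RESOURCE_CLASSES = {"standard": 2, "large": 8}

def pick_resource_class(estimated_gb):
    # Return the smallest class whose memory limit covers the estimate.
    for name, limit_gb in sorted(RESOURCE_CLASSES.items(), key=lambda kv: kv[1]):
        if estimated_gb <= limit_gb:
            return name
    raise ValueError(f"{estimated_gb} GB exceeds the largest resource class")

print(pick_resource_class(1.5))
print(pick_resource_class(6))
```

The returned string could then be passed as the `resource_class` argument in the `tiledb.cloud.udf.exec` calls shown above.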

Sharing

TileDB Cloud allows you to easily and securely collaborate with others. You can share any registered UDFs with a single click or a single function call. Choose a UDF, click on Sharing, and add any TileDB Cloud username (if the user exists) or email (if the user does not exist).

Users that do not exist in TileDB Cloud will receive an email notification to join and gain access to your UDF. Since UDFs are registered programmatically, you can also manage UDF permissions programmatically:

# Share a UDF with both read and write permissions with a user
tiledb.cloud.share_array(uri="tiledb://TileDB-Inc/hello_py",
                         namespace="user1", # The user to share the UDF with
                         permissions=["read", "write"])

Unsharing

You can easily revoke access to any user with a single click.

# Revoke access to a UDF for a particular user                         
tiledb.cloud.unshare_array(uri="tiledb://TileDB-Inc/hello_py", namespace="user1")

View sharing information

The UDF's Sharing tab lists all users that the UDF is currently shared with and their associated permissions.

# Get sharing information about a UDF
tiledb.cloud.list_shared_with("tiledb://TileDB-Inc/hello_py")

Catalog and Exploration

All UDFs are indexed and cataloged, making it fast and efficient to find the function you are looking for. This includes your own functions, functions others have shared with you, and public functions available on TileDB Cloud. The catalog supports search based on metadata, which gives you a rich and robust method for maintaining and organizing your catalog.

tiledb.cloud.list_arrays(search="hello_r")

Details and Logs

TileDB Cloud provides quick and easy access to any task details and logs, including UDF calls. This can be done programmatically, as well as via the TileDB Cloud UI. The UDF details include information such as task duration, the CPU and memory resources that were available, costs, and any array activities associated with a UDF.

tiledb.cloud.last_udf_task()
tiledb.cloud.last_udf_task().logs
tiledb.cloud.task(<uuid>)

From the TileDB Cloud UI console, you can view the UDF call details in the Activity tab.

Further Fun

The use cases for running UDFs on TileDB Cloud are practically endless! In this blog we explained how to write single-array, multi-array and generic UDFs, how to call them from a variety of languages, and how to manage your UDFs on TileDB Cloud. We've got some additional cool examples, including building an entire React app and calling Python functions directly from the browser.

Stay tuned for the next blog in the series which will walk through how you can build sophisticated pipelines and workflows using TileDB Cloud Task Graphs, in order to scale your compute from a single function to a massively parallel setting, all 100% serverless.
