User-defined functions (UDFs)
Data cube operations such as apply_pixel
, or reduce_time
expect functions or expressions as arguments. For example, to compute summary statistics over time, reduce_time
supports computing minimum, maximum, mean, median, and a few other built-in statistics. These built-in functions are purely written in C++ and hence relatively efficient. However, in many cases, we would like to apply more complex functions from external R packages on data cubes.
Table 1 lists data cube operations that additionally support calling user-defined R functions (e.g., on time series). For example, reduce_time
can not only be used for simple statistics but also for applying complex time series analysis like classification or change detection.
Operation | UDF input \((n_b, n_t, n_y, n_x)\) |
UDF output \((n_b, n_t, n_y, n_x)\) |
---|---|---|
apply_pixel |
\((n_b, 1, 1, 1)\) | \((m_b, 1, 1, 1)\) |
apply_time |
\((n_b, n_t, 1, 1)\) | \((m_b, n_t, 1, 1)\) |
reduce_space |
\((n_b, 1, n_y, n_x)\) | \((m_b, 1, 1, 1)\) |
reduce_time |
\((n_b, n_t, 1, 1)\) | \((m_b, 1, 1, 1)\) |
Examples
Fit a linear model to all time series and returns the coefficients:
Derive wind speed from north/south and east/west wind components:
Calculate cumulative sums of NDVI times series after removing the (per time series) mean NDVI:
Notes on implementing more complex UDFs
Writing UDFs can sometimes become complicated and more difficult to debug (see Debugging). However, you can reduce frustration if you consider the following rules:
The function must always return an array of the corresponding dimensions (see Table 1). Dimensions with size 1 can be dropped in the result array.
The number of bands that a UDF returns can be independent from the number of input bands but must be identical for all function calls. For example, if a UDF outputs model parameters for time series model, it should always return the same number of parameters.
Make sure that the function does not crash due to NAs. It is quite common that there are for example time series without any valid observations at the corners of a data cube. Ideally, use
tryCatch
to handle errors and return an NA array of appropriate shape by default. If errors are not handled, computations will stop even if only a single call fails.Do not try to use variables, packages, or any other objects from the environment outside the function. Since the UDF runs in a new R process, make sure to load needed packages, data, etc. within the function.
Debugging
Because UDFs run in separate processes, debugging can be challenging. When developing UDFs, it is recommended to start with a minimal example, e.g. by working on lower resolution and/or spatiotemporal extent. Error messages are generally forwarded to the main R process. To get more detailed error messages, try setting gdalcubes_options(debug = TRUE)
.
Technical details
Execution of UDFs
The following steps are performed when executing a UDF on a data cube.
If needed, chunks of data cubes are rearranged. For example, to apply a function on time series, this step makes sure that a single chunk contains complete time series of a small spatial block of pixels.
A chunk of the data cube is written to a temporary file on disk using a custom binary format (see binary serialization format).
A new R process is started, reads the chunk from disk memory, applies the UDF, and finally writes the result to another temporary binary file.
gdalcubes now loads the result file into memory and continues.
Notice that it is possible to change the directory where intermediate files are written, e.g. to use a shared memory device with gdalcubes_options(streaming_dir = "/dev/shm")
.
Binary serialization format
To stream data to and from the external process, a custom binary serialization format is used.
The format includes data cube values of one chunk and includes dimension values, band names, and the spatial reference system.
The binary serialization format will be replaced with CF compliant netCDF-4 / ZARR files in future versions.
The format contains
the size of the chunk as 4 * 4 bytes in total (nbands:int32, nt:int32, ny:int32, nx:int32),
band names, where each band starts with the number of characters as int32 followed by actual characters,
dimension values in the order time, y, x as doubles, summing to (nt + ny + nx) * 8 bytes in total,
the spatial reference system as a string (number of characters as int32, followed by actual characters), and
actual values of the chunk as doubles (summing to nbands * nt * ny * nx * 8 bytes).