Python API reference

All described objects can be imported from the zappend.api module.

Function zappend()

zappend.api.zappend(slices, config=None, **kwargs)

Robustly create or update a Zarr dataset from dataset slices.

The zappend function concatenates the dataset slices from the given slices along a given append dimension, e.g., "time" (the default) for geospatial satellite observations. Each append step is atomic, that is, it is a transaction that can be rolled back if the append operation fails. This ensures the integrity of the target data cube target_dir given in config or kwargs.

Each slice item in slices provides a slice dataset to be appended. The interpretation of a given slice item depends on whether a slice source is configured or not (setting slice_source).

If no slice source is configured, a slice item must be an object of type str, FileObj, xarray.Dataset, or SliceSource. A str or FileObj is interpreted as a local dataset path or a dataset URI. If a URI is used, protocol-specific parameters apply, given by the configuration parameter slice_storage_options.

If a slice source is configured, a slice item represents the argument(s) passed to that slice source. Multiple positional arguments can be passed as a list, multiple keyword arguments as a dict, and both together as a tuple of list and dict.

Parameters:

Name Type Description Default
slices Iterable[Any]

An iterable that yields slice items.

required
config ConfigLike

Processor configuration. Can be a file path or URI, a dict, None, or a sequence of the aforementioned. If a sequence is used, subsequent configurations are incremental to the previous ones.

None
kwargs Any

Additional configuration parameters. Can be used to pass or override configuration values in config.

{}

Returns:

Type Description
int

The number of slices processed. The value can be useful if the number of items in slices is unknown.

Class SliceSource

Bases: ABC

Slice source interface definition.

A slice source is a closable source for a slice dataset.

A slice source is intended to be implemented by users. An implementation must provide the methods get_dataset() and close().

If your slice source class requires the processing context, your class constructor may define a ctx: Context as its first positional argument or as a keyword argument.

close()

Close this slice source. This should include cleaning up any temporary resources.

This method is not intended to be called directly and is called exactly once for each instance of this class.

dispose()

Deprecated since version 0.6.0; override close() instead.

get_dataset() abstractmethod

Open this slice source, do any processing, and return a dataset of type xarray.Dataset as the result.

This method is not intended to be called directly and is called exactly once for each instance of this class.

It should return a dataset that is compatible with the target dataset:

  • the slice must have the same fixed dimensions;
  • the append dimension must exist in the slice.

Returns:

Type Description
Dataset

A slice dataset.

Class Context

Provides access to configuration values and values derived from them.

Parameters:

Name Type Description Default
config Dict[str, Any] | Config

A validated configuration dictionary or a Config instance.

required

Raises:

Type Description
ValueError

If target_dir is missing in the configuration.

config: Config property

The processor configuration.

last_append_label: Any | None property

The last label found in the coordinate variable that corresponds to the append dimension. Its value is None if no such variable exists or the variable is empty or if config.append_step is None.

target_metadata: DatasetMetadata | None property writable

The metadata for the target dataset. May be None while the target dataset hasn't been created yet. It will be set once the target dataset has been created from the first slice dataset.

get_dataset_metadata(dataset)

Get the dataset metadata from configuration and the given dataset.

Parameters:

Name Type Description Default
dataset Dataset

The dataset

required

Returns:

Type Description
DatasetMetadata

The dataset metadata

Class Config

Provides access to configuration values and values derived from them.

Parameters:

Name Type Description Default
config_dict Dict[str, Any]

A validated configuration dictionary.

required

Raises:

Type Description
ValueError

If target_dir is missing in the configuration.

append_dim: str property

The name of the append dimension along which slice datasets will be concatenated. Defaults to "time".

append_step: int | float | str | None property

The enforced step size in the append dimension between two slices. Defaults to None.

attrs: dict[str, Any] property

Global dataset attributes. May include dynamically computed placeholders of the form {{ expression }}.

attrs_update_mode: Literal['keep'] | Literal['replace'] | Literal['update'] property

The mode used to deal with global slice dataset attributes. One of "keep", "replace", "update".

disable_rollback: bool property

Whether to disable transaction rollbacks.

dry_run: bool property

Whether to run in dry mode.

excluded_variables: list[str] property

Names of excluded variables.

extra: dict[str, Any] property

Extra settings. Intended use is by a slice_source that expects an argument named ctx to access the extra settings and other configuration.

force_new: bool property

If set, an existing target dataset will be deleted.

included_variables: list[str] property

Names of included variables.

logging: dict[str, Any] | str | bool | None property

Logging configuration.

permit_eval: bool property

Whether dynamically computed values in the dataset attributes attrs, using the syntax {{ expression }}, are permitted. Executing arbitrary Python expressions is a security risk; therefore, this must be explicitly enabled.

persist_mem_slices: bool property

Whether to persist in-memory slice datasets.

profiling: dict[str, Any] | str | bool | None property

Profiling configuration.

slice_engine: str | None property

The configured slice engine to be used if a slice path or URI does not point to a dataset in Zarr format. If defined, it will be passed to the xarray.open_dataset() function.

slice_polling: tuple[float, float] | tuple[None, None] property

The configured slice dataset polling. If slice polling is enabled, returns the tuple (interval, timeout) in seconds; otherwise, returns (None, None).

slice_source: Callable[[...], Any] | None property

A class or function that receives a slice item as argument(s) and provides the slice dataset.

  • If a class is given, it must be derived from zappend.api.SliceSource.
  • If the function is a context manager, it must yield an xarray.Dataset.
  • If a plain function is given, it must return any valid slice item type.

Refer to the user guide for more information.
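For instance, the second bullet could be covered by a context-manager function (a hypothetical sketch; the variable name chl is illustrative):

```python
from contextlib import contextmanager

import numpy as np
import xarray as xr

@contextmanager
def open_slice(day: int):
    # Yields an xarray.Dataset; cleanup runs after zappend has consumed it.
    ds = xr.Dataset(
        {"chl": (("time", "y", "x"), np.zeros((1, 2, 2)))},
        coords={"time": [np.datetime64("2024-01-01")
                         + np.timedelta64(day, "D")]},
    )
    try:
        yield ds
    finally:
        ds.close()
```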

slice_source_kwargs: dict[str, Any] | None property

Extra keyword-arguments passed to a specified slice_source together with each slice item.

slice_storage_options: dict[str, Any] | None property

The configured slice storage options to be used if a slice item is a URI.

target_dir: FileObj property

The configured directory that represents the target data cube in Zarr format.

temp_dir: FileObj property

The configured directory used for temporary files such as rollback data.

variables: dict[str, Any] property

Variable definitions.

zarr_version: int property

The configured Zarr version for the target dataset.

Class FileObj

An object that represents a file or directory in some filesystem.

Parameters:

Name Type Description Default
uri str

The file or directory URI

required
storage_options dict[str, Any] | None

Optional storage options specific to the protocol of the URI

None
fs AbstractFileSystem | None

Optional fsspec filesystem instance. Use with care; the filesystem must be consistent with uri and storage_options. For internal use only.

None
path str | None

The path into the filesystem fs. Use with care; the path must be consistent with uri. For internal use only.

None

filename: str property

The filename part of the URI.

fs: fsspec.AbstractFileSystem property

The filesystem.

parent: FileObj property

The parent file object.

path: str property

The path of the file or directory within the filesystem.

storage_options: dict[str, Any] | None property

Storage options for creating the filesystem object.

uri: str property

The URI.

__truediv__(rel_path)

Overridden to call for_path(rel_path).

Parameters:

Name Type Description Default
rel_path str

Relative path to append.

required

close()

Close the filesystem used by this file object.

delete(recursive=False)

Delete the file or directory represented by this file object.

Parameters:

Name Type Description Default
recursive bool

Set to True to delete a non-empty directory.

False

exists()

Check if the file or directory represented by this file object exists.

for_path(rel_path)

Gets a new file object for the given relative path.

Parameters:

Name Type Description Default
rel_path str

Relative path to append.

required

Returns:

Type Description
FileObj

A new file object

mkdir()

Create the directory represented by this file object.

read(mode='rb')

Read the contents of the file represented by this file object.

Parameters:

Name Type Description Default
mode Literal['rb'] | Literal['r']

Read mode, must be "rb" or "r"

'rb'

Returns:

Type Description
bytes | str

The contents of the file, either as bytes if mode is "rb" or as str if mode is "r".

write(data, mode=None)

Write the contents of the file represented by this file object.

Parameters:

Name Type Description Default
data str | bytes

The data to write.

required
mode Literal['wb'] | Literal['w'] | Literal['ab'] | Literal['a'] | None

Write mode, must be "wb", "w", "ab", or "a".

None

Returns:

Type Description
int

The number of bytes written.

Types

zappend.api.SliceItem = str | FileObj | xr.Dataset | ContextManager[xr.Dataset] | SliceSource module-attribute

The possible types that can represent a slice dataset.

zappend.api.SliceCallable = Type[SliceSource] | Callable[[...], SliceItem] module-attribute

This type is either a class derived from SliceSource or a function that returns a SliceItem. Both can be invoked with any number of positional or keyword arguments. The processing context, if used, must be named ctx and must be either the 1st positional argument or a keyword argument. Its type is Context.
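A plain function can serve as a SliceCallable; this hypothetical example takes a single slice item day and returns an xarray.Dataset (a valid SliceItem):

```python
import numpy as np
import xarray as xr

def make_daily_slice(day: int) -> xr.Dataset:
    # The slice item (day) arrives as a positional argument;
    # no ctx parameter is declared, so no Context is injected.
    return xr.Dataset(
        {"sst": (("time", "y", "x"), np.zeros((1, 2, 2)))},
        coords={"time": [np.datetime64("2024-01-01")
                         + np.timedelta64(day, "D")]},
    )
```

With slice_source=make_daily_slice, zappend([0, 1, 2], ...) would invoke make_daily_slice(0), make_daily_slice(1), and make_daily_slice(2).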

zappend.api.ConfigItem = FileObj | str | dict[str, Any] module-attribute

The possible types used to represent zappend configuration.

zappend.api.ConfigList = list[ConfigItem] | tuple[ConfigItem] module-attribute

A sequence of possible zappend configuration types.

zappend.api.ConfigLike = ConfigItem | ConfigList | None module-attribute

Type for a zappend configuration-like object.

Contributions

This module contributes to zappend's core functionality.

The function signatures in this module are less stable, and their implementations are considered experimental. They may also rely on external packages. For more information, please refer to the individual function documentation. Due to these reasons, this module is excluded from the project's automatic coverage analysis.

Function write_levels()

Write a dataset given by source_ds or source_path to target_path using the multi-level dataset format as specified by xcube.

It resembles the store.write_data(dataset, "<name>.levels", ...) method provided by the xcube filesystem data stores ("file", "s3", "memory", etc.). The zappend version may be used for potentially very large datasets in terms of dimension sizes or for datasets with a very large number of chunks. It is considerably slower than the xcube version (which basically uses xarray.to_zarr() for each resolution level) but should run robustly with stable memory consumption.

The function opens the source dataset and subdivides it into dataset slices along the append dimension given by append_dim, which defaults to "time". The slice size in the append dimension is one. Each slice is downsampled to the number of levels, and each slice level dataset is created in or appended to the target dataset's corresponding level dataset.

The target dataset's chunk size in the spatial x- and y-dimensions will be the same as the specified (or derived) tile size, and its chunk size in the append dimension will be one. The chunking will be reflected as the variables configuration parameter passed to each zappend() call. If the configuration parameter variables is also given as part of zappend_config, it will be merged with the chunk definitions.

Important notes:

  • This function depends on xcube.core.gridmapping.GridMapping and xcube.core.subsampling.subsample_dataset() of the xcube package.
  • write_levels() is not as robust as zappend itself. For example, there may be inconsistent dataset levels if the processing is interrupted while a level is appended.
  • There is a remaining issue with (coordinate) variables that have a dimension that is not a dimension of any variable with one of the spatial dimensions, e.g., time_bnds with dimensions time and bnds. Please exclude such variables using the parameter excluded_variables.

Parameters:

Name Type Description Default
source_ds Dataset | None

The source dataset. Must be given in case source_path is not given.

None
source_path str | None

The source dataset path. If source_ds is provided and link_level_zero is true, then source_path must also be provided in order to determine the path of the level zero source.

None
source_storage_options dict[str, Any] | None

Storage options for the source dataset's filesystem.

None
source_append_offset int | None

Optional offset in the append dimension. Only slices with indexes greater than or equal to the offset are appended.

None
target_path str | None

The target multi-level dataset path. Filename extension should be .levels, by convention. If not given, target_dir should be passed as part of the zappend_config. (The name target_path is used here for consistency with source_path.)

None
num_levels int | None

Optional number of levels. If not given, a reasonable number of levels is computed from tile_size.

None
tile_size tuple[int, int] | None

Optional tile size in the x- and y-dimension in pixels. If not given, the tile size is computed from the source dataset's chunk sizes in the x- and y-dimensions.

None
xy_dim_names tuple[str, str] | None

Optional dimension names that identify the x- and y-dimensions. If not given, derived from the source dataset's grid mapping, if any.

None
agg_methods str | dict[str, Any] | None

An aggregation method for all data variables or a mapping that provides the aggregation method for a variable name. Possible aggregation methods are "first", "min", "max", "mean", "median".

None
use_saved_levels bool

Whether a given, already written resolution level serves as input to aggregation for the next level. If False, the default, each resolution level other than zero is computed from the source dataset. If True, the function may perform significantly faster, but be aware that the aggregation methods "first" and "median" will produce inaccurate results.

False
link_level_zero bool

Whether to link, rather than write, the level zero dataset of the target multi-level dataset. If True, a link file {target_path}/0.link will be written. If False, the default, a level dataset {target_path}/0.zarr will be written instead.

False
zappend_config

Configuration passed to zappend as zappend(slice, **zappend_config) for each slice in the append dimension. The zappend config parameter is not supported.

{}