# Python API reference

All described objects can be imported from the `zappend.api` module.
## Function `zappend()`

`zappend.api.zappend(slices, config=None, **kwargs)`

Robustly create or update a Zarr dataset from dataset slices.

The `zappend` function concatenates the dataset slices from the given
`slices` along a given append dimension, e.g., `"time"` (the default)
for geospatial satellite observations.
Each append step is atomic: the append operation is a transaction that can
be rolled back if it fails.
This ensures the integrity of the target data cube `target_dir` given
in `config` or `kwargs`.
Each slice item in `slices` provides a slice dataset to be appended.
The interpretation of a given slice item depends on whether or not a slice
source is configured (setting `slice_source`).

If no slice source is configured, a slice item must be an object of type
`str`, `FileObj`, `xarray.Dataset`, or `SliceSource`.
If `str` or `FileObj` is used, it is interpreted as a local dataset path or
dataset URI. If a URI is used, protocol-specific parameters apply, given by
the configuration parameter `slice_storage_options`.

If a slice source is configured, a slice item represents the argument(s)
passed to that slice source. Multiple positional arguments can be passed as
a `list`, multiple keyword arguments as a `dict`, and both as a `tuple`
of `list` and `dict`.
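As an illustration, the mapping from slice items to slice-source calls can be sketched as follows. The function `load_slice` and the file names are made up for the example; a real slice source would return a dataset rather than a string:

```python
# Hypothetical slice source; name, parameters, and paths are made up.
def load_slice(path, scale=1.0):
    # In practice this would open and return an xarray.Dataset.
    return f"dataset({path}, scale={scale})"

# Possible slice items (left) and the calls they translate to (right):
items_and_calls = [
    ("slice-1.nc",                         load_slice("slice-1.nc")),
    (["slice-2.nc", 2.0],                  load_slice("slice-2.nc", 2.0)),
    ({"path": "slice-3.nc", "scale": 3.0}, load_slice(path="slice-3.nc", scale=3.0)),
    ((["slice-4.nc"], {"scale": 4.0}),     load_slice("slice-4.nc", scale=4.0)),
]
```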
Parameters:

Name | Type | Description | Default |
---|---|---|---|
slices | `Iterable[Any]` | An iterable that yields slice items. | required |
config | `ConfigLike` | Processor configuration. Can be a file path or URI, a dictionary, or a sequence of these; see `ConfigLike`. | `None` |
kwargs | `Any` | Additional configuration parameters. Can be used to pass or override configuration values in `config`. | `{}` |
Returns:

Type | Description |
---|---|
`int` | The number of slices processed. The value can be useful if the number of items in `slices` is not known in advance. |
## Class `SliceSource`

Bases: `ABC`

Slice source interface definition.

A slice source is a closable source for a slice dataset.

A slice source is intended to be implemented by users. An implementation
must provide the methods `get_dataset()` and `close()`.

If your slice source class requires the processing context, your class
constructor may define a `ctx: Context` parameter as the first positional
argument or as a keyword argument.
### `close()`

Close this slice source. This should include cleaning up any temporary
resources.

This method is not intended to be called directly and is called exactly
once for each instance of this class.

### `dispose()`

Deprecated since version 0.6.0; override `close()` instead.

### `get_dataset()` *(abstract method)*

Open this slice source, do some processing, and return a dataset of type
`xarray.Dataset` as the result.

This method is not intended to be called directly and is called exactly
once for each instance of this class.

It should return a dataset that is compatible with the target dataset:

- the slice must have the same fixed dimensions;
- the append dimension must exist in the slice.

Returns:

Type | Description |
---|---|
`Dataset` | A slice dataset. |
## Class `Context`

Provides access to configuration values and values derived from them.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
config | `Dict[str, Any] \| Config` | A validated configuration dictionary or a `Config` object. | required |

Raises:

Type | Description |
---|---|
`ValueError` | If the configuration is invalid. |
### `config: Config` *(property)*

The processor configuration.

### `last_append_label: Any | None` *(property)*

The last label found in the coordinate variable that corresponds to the
append dimension. Its value is `None` if no such variable exists, if the
variable is empty, or if `config.append_step` is `None`.

### `target_metadata: DatasetMetadata | None` *(writable property)*

The metadata for the target dataset. May be `None` while the target
dataset hasn't been created yet. Will be set once the target dataset has
been created from the first slice dataset.
### `get_dataset_metadata(dataset)`

Get the dataset metadata from the configuration and the given dataset.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
dataset | `Dataset` | The dataset. | required |

Returns:

Type | Description |
---|---|
`DatasetMetadata` | The dataset metadata. |
## Class `Config`

Provides access to configuration values and values derived from them.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
config_dict | `Dict[str, Any]` | A validated configuration dictionary. | required |

Raises:

Type | Description |
---|---|
`ValueError` | If the configuration is invalid. |
### `append_dim: str` *(property)*

The name of the append dimension along which slice datasets will be
concatenated. Defaults to `"time"`.

### `append_step: int | float | str | None` *(property)*

The enforced step size in the append dimension between two slices.
Defaults to `None`.

### `attrs: dict[str, Any]` *(property)*

Global dataset attributes. May include dynamically computed placeholders
of the form `{{ expression }}`.

### `attrs_update_mode: Literal['keep'] | Literal['replace'] | Literal['update']` *(property)*

The mode used to deal with global slice dataset attributes. One of
`"keep"`, `"replace"`, `"update"`.
### `disable_rollback: bool` *(property)*

Whether to disable transaction rollbacks.

### `dry_run: bool` *(property)*

Whether to run in dry mode.

### `excluded_variables: list[str]` *(property)*

Names of excluded variables.

### `extra: dict[str, Any]` *(property)*

Extra settings. The intended use is by a `slice_source` that expects an
argument named `ctx`, through which it can access the extra settings and
other configuration.

### `force_new: bool` *(property)*

If set, an existing target dataset will be deleted.

### `included_variables: list[str]` *(property)*

Names of included variables.

### `logging: dict[str, Any] | str | bool | None` *(property)*

Logging configuration.
### `permit_eval: bool` *(property)*

Whether dynamically computed values in dataset attributes `attrs` using
the syntax `{{ expression }}` are permitted. Executing arbitrary Python
expressions is a security risk, therefore this must be explicitly enabled.

### `persist_mem_slices: bool` *(property)*

Whether to persist in-memory slice datasets.

### `profiling: dict[str, Any] | str | bool | None` *(property)*

Profiling configuration.

### `slice_engine: str | None` *(property)*

The configured slice engine to be used if a slice path or URI does not
point to a dataset in Zarr format. If defined, it will be passed to the
`xarray.open_dataset()` function.

### `slice_polling: tuple[float, float] | tuple[None, None]` *(property)*

The configured slice dataset polling. If slice polling is enabled, returns
the tuple `(interval, timeout)` in seconds; otherwise, returns
`(None, None)`.
### `slice_source: Callable[[...], Any] | None` *(property)*

A class or function that receives a slice item as argument(s) and provides
the slice dataset.

- If a class is given, it must be derived from `zappend.api.SliceSource`.
- If the function is a context manager, it must yield an `xarray.Dataset`.
- If a plain function is given, it must return any valid slice item type.

Refer to the user guide for more information.
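For the context-manager case, a minimal sketch; the dataset contents are made up, and `xarray` and `numpy` are assumed:

```python
from contextlib import contextmanager

import numpy as np
import xarray as xr

@contextmanager
def open_slice(day: int):
    # Yields an xarray.Dataset; the cleanup in "finally" runs after
    # zappend has appended the slice.
    ds = xr.Dataset(
        {"sst": (("time", "y", "x"), np.zeros((1, 2, 2)))},
        coords={"time": [np.datetime64(f"2024-01-0{day}", "ns")]},
    )
    try:
        yield ds
    finally:
        ds.close()

# Configured as slice_source=open_slice, a slice item such as 1 or [1]
# provides the argument(s).
```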
### `slice_source_kwargs: dict[str, Any] | None` *(property)*

Extra keyword arguments passed to a specified `slice_source` together
with each slice item.
### `slice_storage_options: dict[str, Any] | None` *(property)*

The configured slice storage options to be used if a slice item is a URI.

### `target_dir: FileObj` *(property)*

The configured directory that represents the target data cube in Zarr
format.

### `temp_dir: FileObj` *(property)*

The configured directory used for temporary files such as rollback data.

### `variables: dict[str, Any]` *(property)*

Variable definitions.

### `zarr_version: int` *(property)*

The configured Zarr version for the target dataset.
## Class `FileObj`

An object that represents a file or directory in some filesystem.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
uri | `str` | The file or directory URI. | required |
storage_options | `dict[str, Any] \| None` | Optional storage options specific to the protocol of the URI. | `None` |
fs | `AbstractFileSystem \| None` | Optional fsspec filesystem instance. Use with care; the filesystem must be consistent with `uri` and `storage_options`. For internal use only. | `None` |
path | `str \| None` | The path into the filesystem `fs`. Use with care; the path must be consistent with `uri`. For internal use only. | `None` |
### `filename: str` *(property)*

The filename part of the URI.

### `fs: fsspec.AbstractFileSystem` *(property)*

The filesystem.

### `parent: FileObj` *(property)*

The parent file object.

### `path: str` *(property)*

The path of the file or directory within the filesystem.

### `storage_options: dict[str, Any] | None` *(property)*

Storage options used for creating the filesystem object.

### `uri: str` *(property)*

The URI.
### `__truediv__(rel_path)`

Overridden to call `for_path(rel_path)`.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
rel_path | `str` | Relative path to append. | required |
### `close()`

Close the filesystem used by this file object.

### `delete(recursive=False)`

Delete the file or directory represented by this file object.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
recursive | `bool` | Set to `True` to delete a directory and its contents recursively. | `False` |
### `exists()`

Check if the file or directory represented by this file object exists.

### `for_path(rel_path)`

Get a new file object for the given relative path.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
rel_path | `str` | Relative path to append. | required |

Returns:

Type | Description |
---|---|
`FileObj` | A new file object. |
### `mkdir()`

Create the directory represented by this file object.

### `read(mode='rb')`

Read the contents of the file represented by this file object.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
mode | `Literal['rb'] \| Literal['r']` | Read mode, must be `"rb"` or `"r"`. | `'rb'` |

Returns:

Type | Description |
---|---|
`bytes \| str` | The contents of the file, as `bytes` if mode is `"rb"` or as `str` if mode is `"r"`. |
### `write(data, mode=None)`

Write the given data to the file represented by this file object.

Parameters:

Name | Type | Description | Default |
---|---|---|---|
data | `str \| bytes` | The data to write. | required |
mode | `Literal['wb'] \| Literal['w'] \| Literal['ab'] \| Literal['a'] \| None` | Write mode, must be `"wb"`, `"w"`, `"ab"`, or `"a"`. | `None` |

Returns:

Type | Description |
---|---|
`int` | The number of bytes written. |
## Types

`zappend.api.SliceItem = str | FileObj | xr.Dataset | ContextManager[xr.Dataset] | SliceSource` *(module attribute)*

The possible types that can represent a slice dataset.

`zappend.api.SliceCallable = Type[SliceSource] | Callable[[...], SliceItem]` *(module attribute)*

This type is either a class derived from `SliceSource` or a function that
returns a `SliceItem`. Both can be invoked with any number of positional
or keyword arguments. The processing context, if used, must be named `ctx`
and must be either the first positional argument or a keyword argument.
Its type is `Context`.

`zappend.api.ConfigItem = FileObj | str | dict[str, Any]` *(module attribute)*

The possible types used to represent a zappend configuration.

`zappend.api.ConfigList = list[ConfigItem] | tuple[ConfigItem]` *(module attribute)*

A sequence of possible zappend configuration types.

`zappend.api.ConfigLike = ConfigItem | ConfigList | None` *(module attribute)*

Type for a zappend configuration-like object.
## Contributions

This module contributes to zappend's core functionality.

The function signatures in this module are less stable, and their
implementations are considered experimental. They may also rely on
external packages. For more information, please refer to the individual
function documentation. For these reasons, this module is excluded from
the project's automatic coverage analysis.
## Function `write_levels()`

Write a dataset given by `source_ds` or `source_path` to `target_path`
using the multi-level dataset format as specified by xcube.

It resembles the `store.write_data(dataset, "<name>.levels", ...)` method
provided by the xcube filesystem data stores ("file", "s3", "memory",
etc.). The zappend version may be used for datasets that are potentially
very large in terms of dimension sizes or number of chunks. It is
considerably slower than the xcube version (which basically uses
`xarray.to_zarr()` for each resolution level), but should run robustly
with stable memory consumption.

The function opens the source dataset and subdivides it into dataset
slices along the append dimension given by `append_dim`, which defaults
to `"time"`. The slice size in the append dimension is one.
Each slice is downsampled to the number of levels, and each slice level
dataset is created or appended to the target dataset's individual level
datasets.

The target dataset's chunk size in the spatial x- and y-dimensions will be
the same as the specified (or derived) tile size. Its chunk size in the
append dimension will be one. The chunking will be reflected in the
`variables` configuration parameter passed to each `zappend()` call.
If the configuration parameter `variables` is also given as part of
`zappend_config`, it will be merged with the chunk definitions.
Important notes:

- This function depends on `xcube.core.gridmapping.GridMapping` and
  `xcube.core.subsampling.subsample_dataset()` of the `xcube` package.
- `write_levels()` is not as robust as zappend itself. For example, there
  may be inconsistent dataset levels if the processing is interrupted
  while a level is appended.
- There is a remaining issue with (coordinate) variables that have a
  dimension that is not a dimension of any variable that has one of the
  spatial dimensions, e.g., `time_bnds` with dimensions `time` and
  `bnds`. Please exclude such variables using the parameter
  `excluded_variables`.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
source_ds | `Dataset \| None` | The source dataset. Must be given if `source_path` is not provided. | `None` |
source_path | `str \| None` | The source dataset path. Must be given if `source_ds` is not provided. | `None` |
source_storage_options | `dict[str, Any] \| None` | Storage options for the source dataset's filesystem. | `None` |
source_append_offset | `int \| None` | Optional offset in the append dimension. Only slices with indexes greater than or equal to the offset are appended. | `None` |
target_path | `str \| None` | The target multi-level dataset path. The filename extension should be `.levels`. | `None` |
num_levels | `int \| None` | Optional number of levels. If not given, a reasonable number of levels is computed from the source dataset. | `None` |
tile_size | `tuple[int, int] \| None` | Optional tile size in the x- and y-dimensions in pixels. If not given, the tile size is computed from the source dataset's chunk sizes in the x- and y-dimensions. | `None` |
xy_dim_names | `tuple[str, str] \| None` | Optional dimension names that identify the x- and y-dimensions. If not given, derived from the source dataset's grid mapping, if any. | `None` |
agg_methods | `str \| dict[str, Any] \| None` | An aggregation method for all data variables or a mapping that provides the aggregation method per variable name. | `None` |
use_saved_levels | `bool` | Whether a given, already written resolution level serves as input to the aggregation for the next level. | `False` |
link_level_zero | `bool` | Whether to link the level zero of the target multi-level dataset instead of writing it. In this case, a link file is written in its place. | `False` |
zappend_config | | Configuration passed to zappend as additional keyword arguments. | `{}` |