Fine Tuning
This chapter is dedicated to advanced settings and features provided by DEKER™.
Client
In addition to the URI parameter, Client accepts several options that you may want or need to tune. All of them shall be passed explicitly as keyword parameters; none of them is positional.
executor
DEKER™ creates its own ThreadPoolExecutor instance for working with VArray. By default, this parameter is None. You may want to use your own ThreadPoolExecutor instance (or some custom executor based on ThreadPoolExecutor); in this case DEKER™ will use the one you pass:
from deker import Client
client = Client(uri, executor=<your_executor_instance>)
Note
No executor is initialized or used if you work with a Collection of Array objects. The executor passed by you will be ignored.
Attention
When the Client is closed, your executor will not be shut down; you shall do it manually.
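For instance, a minimal sketch of supplying and later shutting down your own executor (uri stands for your storage URI, as in the examples throughout this chapter):
from concurrent.futures import ThreadPoolExecutor
from deker import Client

executor = ThreadPoolExecutor(max_workers=8)
with Client(uri, executor=executor) as client:
    ...  # work with your VArray collections here
executor.shutdown()  # the Client does not shut your executor down, do it yourself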
workers
This is a parameter for the native DEKER™ executor mentioned above.
By default, it is None, and in this case DEKER™ uses the maximum number of threads given by the formula from the Python 3.9 documentation: cpu_count() + 4.
You may increase or reduce it if you need to:
from deker import Client
client = Client(uri, workers=8)
write_lock_timeout
DEKER™ uses its own file locking mechanisms for different operations, one of which is for writing.
With write_lock_timeout you can set the number of seconds during which a parallel writing process waits for the release of the locked file:
from deker import Client
client = Client(uri, write_lock_timeout=120)
The default is 60 seconds. The unit cannot be changed, and only an int value is accepted.
write_lock_check_interval
While the parallel writing process waits for the lock release, it sleeps for some time and then checks the state of the lock. You can adjust its sleeping time in seconds:
from deker import Client
client = Client(uri, write_lock_check_interval=5)
The default is 1 second. The unit cannot be changed, and only an int value is accepted.
loglevel
All the DEKER™ objects (including private ones) have their own loggers. They are bound by a common logging level, which defaults to "ERROR". If you need to, you may change it at Client init:
from deker import Client
client = Client(uri, loglevel="INFO")
If you need to change it on the fly, you may use the following function:
from deker.log import set_logging_level
set_logging_level("INFO") # now DEKER™ logs starting from "INFO" level
memory_limit
This parameter is used for an early runtime break in case of potential memory overflow.
DEKER™ operates on large amounts of data, and you may not realize that your machine is about to run out of memory. For example, NumPy will raise _ArrayMemoryError if you do something like this:
>>> import numpy as np
>>> np.random.random((100000, 100000))
# numpy.core._exceptions._ArrayMemoryError: Unable to allocate 74.5 GiB
# for an array with shape (100000, 100000) and data type float64
As DEKER™ is lazy, you are warned about such problems beforehand. For that purpose, DEKER™ checks the memory limits when it creates:
a Collection
a Subset or a VSubset
an xarray.DataArray from a Subset or a VSubset
By default, DEKER™ is limited to your total virtual memory size (i.e. the total amount of RAM plus the swap size). For example, if you have 16 GB of RAM and 2 GB of swap, DEKER™ is limited to 18 GB of memory. But usually the machine is already using some part of this memory for other processes, so the currently available free memory is always lower than the total one.
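If you want to check these numbers on your own machine, a minimal sketch using the third-party psutil package (an assumption about tooling, not something DEKER™ requires or uses in this exact way) may look like this:
import psutil  # third-party package, used here only for illustration

total = psutil.virtual_memory().total + psutil.swap_memory().total
available = psutil.virtual_memory().available + psutil.swap_memory().free
print(f"total: {total / 2**30:.1f} GiB, available: {available / 2**30:.1f} GiB")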
DEKER™ compares its limit with your currently available free memory (RAM + swap) and chooses the minimum of the two. Then it compares the result with the requested shape size. If your request requires too much memory, or you are trying to create a Collection with a schema which may cause a memory overflow in the future, DekerMemoryError is raised immediately.
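For illustration, a minimal sketch of reacting to this error; the deker.errors import path and the pre-existing array object are assumptions here:
from deker.errors import DekerMemoryError  # assumed import path

# `array` is an Array obtained from an existing collection
try:
    subset = array[:]  # the memory check runs here, before any data is read
except DekerMemoryError:
    # the requested shape does not fit into the limit:
    # request a smaller slice or raise memory_limit
    raise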
You can lower the default value by passing a certain number of bytes or by passing a human-readable representation of kilobytes, megabytes, gigabytes or terabytes, for example: "1024K", "512M", "8G", "1T":
from deker import Client
client = Client(uri, memory_limit="4G") # 4 gigabytes
client = Client(uri, memory_limit=4096) # 4096 bytes
Only integers are accepted, both for bytes and for the human-readable representation. The capitalization of the unit suffix is ignored: "1024k", "512m", "8g", "1t" will work.
Note
You may definitely want to use it in Docker.
If you set a memory limit for your container, you'd better limit DEKER™ to the same value; otherwise your container may be killed because of memory overflow.
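For instance, a hedged sketch of aligning the two limits (the image name is hypothetical; --memory and --memory-swap are standard Docker flags):
# docker run --memory=4g --memory-swap=4g my-deker-app  # container limited to 4 GB in total
from deker import Client

client = Client(uri, memory_limit="4G")  # keep DEKER™ within the same 4 GB budget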
skip_collection_create_memory_check
Currently DEKER™ performs the memory check described in memory_limit in three places:
on collection creation via client.create_collection()
on getting a subset, e.g. array[:]
on reading an array as xarray.DataArray, e.g. array[:].read_xarray()
While the last two prevent memory overflow and are required, sometimes you may need to skip the first one.
You can do so by providing skip_collection_create_memory_check=True as an argument to the Client constructor.
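For example (uri stands for your storage URI, as in the examples above):
from deker import Client

client = Client(uri, skip_collection_create_memory_check=True)
# collections may now be created even if their schema exceeds the memory limit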
HDF5 Options
Attention
If you are new to HDF5, please refer to the official HDF5 documentation.
Very briefly, HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of data types, and is designed for flexible and efficient I/O and for high-volume and complex data. This format offers a large number of special tuning options. We will talk about chunks and data compression.
The DEKER™ deker-local-adapters plugin has its default implementation of working with this format. It depends on two packages, h5py and hdf5plugin, which provide a Python interface to the HDF5 binaries and a pack of compression filters.
DEKER™ applies chunking and compression options to all of the files within one collection. As long as you do not interact directly with the files and low-level interfaces, DEKER™ provides special types for using these options. Your settings are stored in the collection metadata. When you invoke a Collection, they are recovered and ready to be applied to your data. But they have to make their way from the collection metadata to the final data; that's why we need the HDF5Options and HDF5CompressionOpts objects.
Note
Chunking and compression options are applied to your dataset within an HDF5 file when the data is inserted or updated. When reading, the HDF5 file already knows how to manage its chunked and/or compressed contents properly.
First of all, let’s prepare a collection schema once again:
from datetime import datetime, timedelta
from deker import (
    TimeDimensionSchema,
    DimensionSchema,
    Scale,
    AttributeSchema,
    ArraySchema,
    Client,
    Collection
)

dimensions = [
    TimeDimensionSchema(
        name="day_hours",
        size=24,
        start_value="$dt",
        step=timedelta(hours=1)
    ),
    DimensionSchema(
        name="y",
        size=181,
        scale=Scale(start_value=90.0, step=-1.0, name="lat")
    ),
    DimensionSchema(
        name="x",
        size=360,
        scale=Scale(start_value=-180.0, step=1.0, name="lon")
    ),
    DimensionSchema(
        name="weather",
        size=4,
        labels=["temperature", "humidity", "pressure", "wind_speed"]
    ),
]

attributes = [
    AttributeSchema(name="dt", dtype=datetime, primary=True),
    AttributeSchema(name="tm", dtype=int, primary=False),
]

array_schema = ArraySchema(
    dimensions=dimensions,
    attributes=attributes,
    dtype=float,  # will be converted and saved as numpy.float64
    # fill_value is not passed - will be numpy.nan
)
Chunks
Correct data chunking may increase your performance. It splits your data into smaller equal pieces. When you read data from a chunk, the HDF5 file opens and caches it. The next reading of the same pattern will be much faster, as the data is fetched not from the storage, but from the cache.
An HDF5 file may have no chunking options or be chunked either manually or automatically.
Hint
Study the HDF5 chunking manual to understand chunks better.
DEKER™ allows you to use all three options.
The chunks option is set to None by default.
from deker import Client
with Client("file:///tmp/deker") as client:
client.create_collection("weather", array_schema)
When you create an Array, its file is one big chunk.
If you set chunks to True, the HDF5 file will automatically determine a chunk size with its own algorithm, based on the shape of your Array:
from deker import Client, HDF5Options
with Client("file:///tmp/deker") as client:
    client.create_collection(
        "weather_chunked_automatically",
        array_schema,
        HDF5Options(chunks=True)
    )
You will never know the final chunk size, but you can be sure that your data is now chunked.
If you need to adjust it, you may set it manually. It shall be a tuple of integers, with as many values as there are dimensions in your Array shape. Its values shall divide the corresponding dimension sizes without remainders:
from deker import Client, HDF5Options
chunks = (1, 181, 36, 4)
# schema shape is (24, 181, 360, 4)
# (24, 181, 360, 4) / (1, 181, 36, 4) = (24.0, 1.0, 10.0, 1.0) - no remainders
with Client("file:///tmp/deker") as client:
    client.create_collection(
        "weather_chunked_manually",
        array_schema,
        HDF5Options(chunks=chunks)
    )
Here we chunked our data into pieces, each of which will contain 1 hour, 181 y points (181 is a prime number, divisible only by itself and 1), 36 x points and the full scope of weather layers. If you read some data which is kept in one or several chunks, the file will not touch the other chunks, but it will open and cache the corresponding ones.
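If you want to verify a candidate chunk shape beforehand, a plain illustration (not part of the DEKER™ API) could be:
shape = (24, 181, 360, 4)   # the schema shape from above
chunks = (1, 181, 36, 4)    # the candidate chunk shape
assert all(s % c == 0 for s, c in zip(shape, chunks)), "chunks must divide the shape evenly"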
Hint
The best way to decide on a chunk size is to base it on your most frequently used reading pattern.
Compression
To save disk space, you can compress your data with different filters supported by HDF5 and provided by the h5py and hdf5plugin packages.
There are several default filters set in h5py and a pack of the most popular filters brought by hdf5plugin.
Default filters:
GZip
Lzf
SZip
Custom filters, brought by hdf5plugin:
Bitshuffle
Blosc
BZip2
FciDecomp
LZ4
SZ
SZ3
Zfp
Zstd
Attention
The data is compressed chunk by chunk. If you use compression without indicating a chunk size, the chunks option is automatically set to True and the size is calculated by the inner HDF5 algorithm.
The default filters shall be used as follows:
from deker import Client, HDF5Options, HDF5CompressionOpts
with Client("file:///tmp/deker") as client:
    compression = HDF5CompressionOpts(compression="gzip", compression_opts=9)
    options = HDF5Options(compression_opts=compression)
    client.create_collection(
        "weather_chunked_automatically_gzip",
        array_schema,
        collection_options=options
    )
The custom filters shall be instantiated and passed to HDF5CompressionOpts as a mapping:
import hdf5plugin
from deker import Client, HDF5Options, HDF5CompressionOpts

with Client("file:///tmp/deker") as client:
    compression = HDF5CompressionOpts(**hdf5plugin.Zstd(6))
    options = HDF5Options(chunks=(1, 181, 36, 4), compression_opts=compression)
    client.create_collection(
        "weather_chunked_manually_zstd",
        array_schema,
        collection_options=options
    )
Hint
Dive into compression options at h5py filter pipeline, hdf5plugin docs and HDF5 compression manual.