Data Access
Collections
Retrieving Collections
Retrieving collections, as well as creating them, is the Client's responsibility. In the previous
chapter we created a Collection named "weather". Now let's get it:
from deker import Client, Collection
with Client("file:///tmp/deker") as client:
   collection: Collection = client.get_collection("weather")
   print(collection)  # weather
If you have several collections in the same storage, you can iterate over them with the Client:
with Client("file:///tmp/deker") as client:
   for collection in client:
      print(collection)
The Collection object has several useful properties and methods for managing itself:
with Client("file:///tmp/deker") as client:
   collection: Collection = client.get_collection("weather")
   print(collection.name)
   print(collection.array_schema)   # returns schema of Array
   print(collection.varray_schema)  # returns schema of VArray if applicable, else None
   print(collection.path)           # returns physical storage path of the Collection
   print(collection.as_dict)        # serializes main information about Collection into
                                    # dictionary
   collection.clear()               # removes all the Array and/or VArray objects from the
                                    # storage, but retains the Collection metadata
   collection.delete()              # removes all the Array and/or VArray and the Collection
                                    # metadata from the storage
Managers
The Collection object has three kinds of managers for working with its contents:
default (or DataManager) is the Collection itself
Collection.arrays (or ArraysManager) is a manager responsible for Array
Collection.varrays (or VArraysManager) is a manager responsible for VArray (unavailable in Array collections).
These managers are mixed with the FilteredManager object and are responsible for creating and
filtering the corresponding contents. All of them have the same interface. The default manager
is the preferred one. Using the information from the Collection main schema, the default manager
decides what to create or filter: if you have a VArray collection, it will create or filter
VArray objects; if your collection is made of Array, it will create or filter Array. The
other two are meant for direct filtering of Array or VArray.
Normally, you need only the default one, and although the other two are public, we will not describe them in this documentation.
Array Creation
Let's create our first Array:
from datetime import datetime
from deker import Array, Client, Collection
with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    array: Array = collection.create({"dt": datetime(2023, 1, 1, 0)})
    print(array)
Note
Let's assume that hereinafter all datetime objects, including timestamps and ISO 8601
strings, are in the UTC timezone.
As you remember, our schema contains a TimeDimensionSchema and a primary attribute schema.
TimeDimensionSchema.start_value was set as a reference to the AttributeSchema.name,
which allows you to set an individual time start point for each Array. That's why we passed
{"dt": datetime(2023, 1, 1, 0)} to the creation method; it works this way regardless of whether
the attribute is defined as primary or custom. Now our Array knows the day and the hour when its
data time series starts.
If any other primary attributes were defined, their values would also have to be included in this dictionary.
If no attributes are defined in the schema, the method is called without parameters:
collection.create().
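Since the creation dictionary is used the same way for primary and custom attributes, you may also set a value for our custom attribute tm right at creation time. Here is a minimal sketch of that (assuming, as the paragraph above implies, that custom attribute values are accepted in the same dictionary; the Array is deleted at the end so our tutorial dataset stays untouched):
from datetime import datetime
from deker import Array, Client, Collection

with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")

    # "dt" is the primary attribute, "tm" is the custom attribute from our schema
    start = datetime(2023, 2, 1, 0)
    array: Array = collection.create({"dt": start, "tm": int(start.timestamp())})
    print(array.primary_attributes)  # OrderedDict containing "dt"
    print(array.custom_attributes)   # {'tm': ...}

    array.delete()  # clean up, keeping only the data described in this chapter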
When an Array or a VArray is created, it gets a unique id, which is a UUID string.
Array and VArray IDs are generated automatically by different algorithms, so the
probability of getting two identical IDs tends to zero.
Fine, we have our first Array in the Collection. Has anything changed in our storage?
Yes, it has. If you list it with:
ls -lh /tmp/deker/collections/weather
You will find out that there are two directories named array_data and array_symlinks and a
file with the Collection metadata weather.json.
Listing these inner directories will tell you that you have an .hdf5 file with the Array
UUID in its name. At the moment this file is almost empty. It contains just the Array metadata,
as we have not yet inserted any data in it. But it is created and ready to be used.
Thus, we can create all the Array objects in advance without filling them with any data and
retrieve them whenever we need them. Let's prepare our database for January 2023:
from datetime import datetime, timedelta
from deker import Array, Client, Collection
with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    for day in range(30):
        start_point = datetime(2023, 1, 2, 0) + timedelta(days=day)
        collection.create({"dt": start_point})
The Collection is an iterator, so we can get all its contents item by item:
with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    for array in collection:
       print(array)
Note
Everything mentioned above in this section applies to VArray as well, except that
a VArray collection path will contain two more directories: varray_data and
varray_symlinks.
Arrays Filtering
If we need to get a certain Array from the collection, we have to filter it out. As previously
stated, primary attributes allow you to find a certain Array or VArray in the
Collection. If no primary attribute is defined, you either need to know its id or have to
iterate over the Collection until you find the right Array or VArray.
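For instance, a minimal sketch of such a brute-force lookup by id (the id value here is just an example):
from deker import Array, Client, Collection

wanted_id = "9d7b32ee-d51e-5a0b-b2d9-9a654cb1991d"  # an id we saved earlier

with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")

    # walk through the Collection until we meet the Array with the wanted id
    found = None
    for array in collection:
        if array.id == wanted_id:
            found = array
            break
    print(found)  # the Array, or None if no such id exists in the Collection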
Attention
It is highly recommended to define at least one primary attribute in every schema.
So you have two options for filtering an Array or VArray in a Collection:
by id
by primary attributes
For example, suppose we saved the id of some Array to a variable; let's create a filter:
from deker import Array, Client, Collection
from deker.managers import FilteredManager
id = "9d7b32ee-d51e-5a0b-b2d9-9a654cb1991d"
with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    filter: FilteredManager = collection.filter({"id": id})
This filter is an instance of the FilteredManager object, which is also lazy: it keeps the
filtering parameters, but no work has been done yet.
Attention
There is no query language or conditional matching for now; only strict matching is available.
The FilteredManager provides final methods for retrieving the filtered objects:
first()
last()
Since only strict matching is available, both of them return the same result. They are placeholders for future query language development.
Now let’s filter some Array by the primary attribute:
with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    filter_1: FilteredManager = collection.filter({"dt": datetime(2023, 1, 3, 0)})
    filter_2: FilteredManager = collection.filter({"dt": datetime(2023, 1, 15, 0).isoformat()})
    array_1: Array = filter_1.first()
    array_2: Array = filter_2.last()
    print(array_1)
    print(array_2)
    assert array_1.id != array_2.id
As you can see, attributes of the datetime.datetime type can be filtered both by a datetime.datetime
object and by its ISO 8601 string representation.
Attention
If your collection schema has several primary attributes, you must pass filtering values for all of them!
Note
Everything mentioned above in this section applies to VArray as well.
Array and VArray
As previously stated, both Array and VArray objects have the same interface; a short usage
sketch follows the lists below.
Their common properties are:
id: returns the Array or VArray ID
dtype: returns the type of the Array or VArray data
shape: returns the Array or VArray shape as a tuple of dimension sizes
named_shape: returns the Array or VArray shape as a tuple of dimension names bound to their sizes
dimensions: returns a tuple of the Array or VArray dimensions as objects
schema: returns the Array or VArray low-level schema
collection: returns the name of the Collection to which the Array is bound
as_dict: serializes main information about the array into a dictionary, prepared for JSON
primary_attributes: returns an OrderedDict of the Array or VArray primary attributes
custom_attributes: returns a dict of the Array or VArray custom attributes
VArray has two extra properties:
arrays_shape: returns the common shape of all the Arrays bound to the VArray
vgrid: returns the virtual grid (a tuple of integers) by which the VArray is split into Arrays
Their common methods are:
read_meta(): reads the Array or VArray metadata from storage
update_custom_attributes(): updates Array or VArray custom attribute values
delete(): deletes the Array or VArray from the storage with all its data and metadata
__getitem__(): creates a Subset from an Array or a VSubset from a VArray
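For instance, a short sketch inspecting one of our Arrays through some of these properties:
from datetime import datetime
from deker import Array, Client, Collection

with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

    print(array.id)                  # the UUID string
    print(array.dtype)               # the data type defined by the schema
    print(array.shape)               # a tuple of dimension sizes
    print(array.named_shape)         # dimension names bound to their sizes
    print(array.collection)          # "weather"
    print(array.primary_attributes)  # OrderedDict with our "dt" attribute
    print(array.custom_attributes)   # {'tm': None} so far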
Updating Custom Attributes
Updating custom attributes is quite simple. As you remember, our schema contains one custom
attribute named tm (timestamp) with the int data type, and we have never defined its value. This
means it is set to None in each Array. Let's check that and update it everywhere:
from deker import Array, Client, Collection
from deker.managers import FilteredManager
with Client("file:///tmp/deker") as client:
   collection: Collection = client.get_collection("weather")
   for array in collection:
      print(array.custom_attributes)  # {'tm': None}
      # type shall be `int`
      custom_attribute_value = int(array.primary_attributes["dt"].timestamp())
      array.update_custom_attributes({'tm': custom_attribute_value})
      print(array.custom_attributes)
If there are many custom attributes and you want to update just one or several of them, no problem: just pass a dictionary with values for the attributes you need to update. All the others will keep their values.
Fancy Slicing
It is our privilege and pleasure to introduce the fancy slicing of your data!
We consider the __getitem__() method to be one of our pearls.
Usually, you use integers for native Python and NumPy indexing, along with the start, stop and
step slicing parameters:
import numpy as np
python_seq = range(10)
np_seq = np.random.random((3, 4, 5))
print(python_seq[1], python_seq[3:], python_seq[3:9:2])
print(np_seq[2, 3, 4], np_seq[1:, :, 2], np_seq[:2, :, 1:4:2])
Attention
If you are new to NumPy indexing, please refer to the official NumPy documentation.
DEKER™ allows you to index and slice its Array and VArray not only with integers, but also with
the types by which the dimensions are described.
But let's start with a constraint.
Step
Since a VArray is split into separate files, and each file can contain an Array with more
than one dimension, calculating their inner bounds is a non-trivial problem.
That's why the step parameter is limited to 1 for both Array and VArray
dimensions. This constraint is introduced to keep the behavior consistent, although there is no
such problem for Array itself.
The workaround for a VArray is to read your data and then slice it again with steps if you need
to, since the result is a numpy.ndarray.
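For instance, a minimal sketch of that workaround (it uses the read() method described in the Subset section below; the same approach works for a VSubset of a VArray):
import numpy as np
from datetime import datetime
from deker import Array, Client, Collection

with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

    # slice the first (time) dimension with step 1 and read it into a numpy.ndarray
    data: np.ndarray = array[0:12].read()

    # apply the desired step afterwards, in plain NumPy
    every_third_hour = data[::3]
    print(every_third_hour.shape)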
Start and Stop
As mentioned earlier, if your Dimensions have an additional description with scale or
labels, you can avoid index calculations and pass your scale or label values
as the start and stop parameters.
If you have a TimeDimension, you can slice it with datetime.datetime objects, their ISO 8601
formatted strings or float timestamps.
Attention
Remember that you must convert your local time to UTC for proper TimeDimension slicing.
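For example, a local wall-clock time can be converted to UTC with the standard zoneinfo module before being used in a slice (the Europe/Helsinki timezone here is just an illustration):
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# a local wall-clock time in the Europe/Helsinki timezone (UTC+2 in winter)
local_dt = datetime(2023, 1, 3, 7, tzinfo=ZoneInfo("Europe/Helsinki"))

# convert it to UTC before slicing a TimeDimension
utc_dt = local_dt.astimezone(timezone.utc)
print(utc_dt)  # 2023-01-03 05:00:00+00:00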
Let’s have a closer look:
from datetime import datetime
from deker import Array, Client, Collection
with Client("file:///tmp/deker") as client:
   collection: Collection = client.get_collection("weather")
   array: Array = collection.filter({"dt": datetime(2023, 1, 3, 0)}).first()
   start_dt = datetime(2023, 1, 3, 5)
   end_dt = datetime(2023, 1, 3, 10)
   fancy_subset = array[
      start_dt:end_dt,  # step is timedelta(hours=1)
      -44.0:-45.0,      # y-scale start point is 90.0 and step is -1.0 (90.0 ... -90.0)
      -1.0:1.0,         # x-scale start point is -180.0 and step is 1.0 (-180.0 ... 179.0)
      :"pressure"       # captures just "temperature" and "humidity"
   ]
   # which is equivalent of:
   subset = array[
      5:10,
      134:135,
      179:181,
      :2
   ]
   assert fancy_subset.shape == subset.shape
   assert fancy_subset.bounds == subset.bounds
It is great if you can keep all the indexes and their mappings in mind, but this feature is awesome, isn't it?! Yes, it is!
The values passed to each dimension's index or slice are converted to integers and then set into
native Python slice objects. A tuple of such slices is the final
representation of the bounds that will be applied to your data.
Attention
Fancy index values must exactly match your dimension's time, scale or label values;
otherwise, you will get an IndexError.
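A short sketch of what happens when a value misses the scale (here -44.5 does not exist in the y-scale, whose step is 1.0):
from datetime import datetime
from deker import Array, Client, Collection

with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    array: Array = collection.filter({"dt": datetime(2023, 1, 3, 0)}).first()

    try:
        # -44.5 is not an exact value of the y-scale (90.0, 89.0, ..., -90.0)
        array[0, -44.5]
    except IndexError as e:
        print(e)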
You have not yet touched your data, but you are getting closer and closer.
Now you have a new object: a Subset.
Subset and VSubset
Subset and VSubset are the final lazy objects providing access to your data.
Once created, they contain no data and do not access the storage until you manually invoke one of their corresponding methods.
Note
If you need to read or write all the data from an Array or VArray, you should create a
subset with [:] or [...].
Both of them also have the same interface. As for the properties, illustrated by the short sketch below, they are:
shape: returns the shape of the Subset or VSubset
bounds: returns the bounds that were applied to the Array or VArray
dtype: returns the type of the queried data
fill_value: returns the value used for empty cells
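A small sketch combining the note above with these properties:
from datetime import datetime
from deker import Array, Client, Collection

with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

    full_subset = array[:]       # the whole Array; array[...] works the same way
    part_subset = array[0, 0, 0]

    print(full_subset.shape)     # equals array.shape
    print(part_subset.shape)     # shape of the sliced part only
    print(part_subset.bounds)    # the bounds applied to the Array
    print(part_subset.dtype)     # type of the queried data
    print(part_subset.fill_value)  # the value used for empty cells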
Let’s dive deeper into the methods.
Note
The explanations below are based on the logic, implemented for the HDF5 format.
Read
The read() method gets data from the storage and returns a numpy.ndarray of the corresponding
shape and dtype. As for reading VArray data, the VSubset will capture the data from
the Arrays affected by the passed bounds, arrange it into a single numpy.ndarray of the
proper shape and dtype, and return it to you.
If your Array or VArray is empty, a numpy.ndarray filled with fill_value will
be returned for any Subset or VSubset:
import numpy as np
from datetime import datetime
from deker import Array, Client, Collection
with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    array: Array = collection.filter({"dt": datetime(2023, 1, 15, 0)}).first()
    subset = array[0, 0, 0]  # get first hour and grid zero-point
    print(subset.read())  # [nan, nan, nan, nan]
Update
The update() method is an upsert method, responsible for inserting new values and
updating old ones.
The shape of the data that you pass into this method must match the shape of the Subset or
VSubset. It is impossible to insert 10 values into 9 cells. It is also impossible to insert
them into 11 cells, as there are no instructions on how to arrange them properly.
import numpy as np
from datetime import datetime
from deker import Array, Client, Collection
with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()
    subset = array[:]  # captures full array shape
    data = np.random.random(subset.shape)
    subset.update(data)
The dtype of the provided data must match the dtype of the Array or VArray set by the schema, or
it must have the corresponding Python type so it can be converted into that dtype:
with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()
    subset = array[:]  # captures full array shape
    data = np.random.random(subset.shape).tolist()  # converts the data into a Python list of Python floats
    subset.update(data)  # data will be converted to array.dtype
If your Array or VArray is completely empty, the Subset or VSubset will create a
numpy.ndarray of the Array shape filled with the fill_value from the Collection
schema and then, using the indicated bounds, insert the data provided by you into this array.
Afterwards it will be dumped to the storage. For a VArray it works in the same
manner, except that only the affected inner Arrays will be created.
If there is already some data in your Array or VArray and you provide new values with this
method, the old values within the affected bounds will be replaced with the new ones:
with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()
    data = np.random.random(array.shape)
    array[:].update(data)
    subset = array[0, 0, 0]  # get first hour and grid zero-point
    print(subset.read())  # a list of 4 random values
    new_values = [0.1, 0.2, 0.3, 0.4]
    subset.update(new_values)  # data will be converted to array.dtype
    print(subset.read())  # [0.1, 0.2, 0.3, 0.4]
Clear
The clear() method inserts the fill_value into the affected bounds. If all your Array or
VArray values are fill_value, it is considered empty and the dataset will be deleted
from the file. The file itself still exists and retains the Array or VArray metadata:
import numpy as np
from datetime import datetime
from deker import Array, Client, Collection
with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()
    data = np.random.random(array.shape)
    array[:].update(data)
    subset = array[0, 0, 0]  # get first hour and grid zero-point
    print(subset.read())  # a list of 4 random values
    new_values = [0.1, 0.2, 0.3, 0.4]
    subset.update(new_values)  # data will be converted to array.dtype
    print(subset.read())  # [0.1, 0.2, 0.3, 0.4]
    subset.clear()
    print(subset.read())  # [nan, nan, nan, nan]
    array[:].clear()
    print(array[:].read()) # a numpy.ndarray full of `nans`
Describe
You may want to check what part of the data you are going to manage.
With describe() you can get an OrderedDict describing the parts of the dimensions
affected by the Subset or VSubset. If you provided scale and/or labels for your
dimensions, you will get a human-readable description; otherwise you will get indexes.
So it is highly recommended to describe your dimensions:
from datetime import datetime
from deker import Array, Client, Collection
from pprint import pprint
with Client("file:///tmp/deker") as client:
   collection: Collection = client.get_collection("weather")
   array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()
   pprint(array[0, 0, 0].describe())
   # OrderedDict([('day_hours',
   #             [datetime.datetime(2023, 1, 1, 0, 0, tzinfo=datetime.timezone.utc)]),
   #             ('y', [90.0]),
   #             ('x', [-180.0]),
   #             ('weather', ['temperature', 'humidity', 'pressure', 'wind_speed'])])
   subset = array[datetime(2023, 1, 1, 5):datetime(2023, 1, 1, 10),
                  -44.0:-45.0,
                  -1.0:1.0,
                  :"pressure"]
   pprint(subset.describe())
   #  OrderedDict([('day_hours',
   #               [datetime.datetime(2023, 1, 1, 5, 0, tzinfo=datetime.timezone.utc),
   #                datetime.datetime(2023, 1, 1, 6, 0, tzinfo=datetime.timezone.utc),
   #                datetime.datetime(2023, 1, 1, 7, 0, tzinfo=datetime.timezone.utc),
   #                datetime.datetime(2023, 1, 1, 8, 0, tzinfo=datetime.timezone.utc),
   #                datetime.datetime(2023, 1, 1, 9, 0, tzinfo=datetime.timezone.utc)]),
   #              ('y', [-44.0]),
   #              ('x', [-1.0, 0.0]),
   #              ('weather', ['temperature', 'humidity'])])
Attention
The description is an OrderedDict object whose values contain the full ranges of descriptive data
for the Subset or VSubset. If you keep this description in memory, your available memory will be
reduced by its size.
Read Xarray
Warning
The xarray package is not in the list of DEKER™ default dependencies. Please refer to the
Installation chapter for more details.
Xarray is a wonderful project that provides special objects for working with multidimensional data. Its main principle is that data shall be described. We absolutely agree with that.
The read_xarray() method describes a Subset or VSubset, reads its contents and converts it
into an xarray.DataArray object.
If you need to convert your data to pandas objects, to netCDF or to Zarr, use this
method and then use the methods provided by xarray.DataArray:
import numpy as np
import xarray
from datetime import datetime
from deker import Array, Client, Collection
with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()
    data = np.random.random(array.shape)
    array[:].update(data)
    subset = array[0, 0, 0]  # get first hour and grid zero-point
    x_subset: xarray.DataArray = subset.read_xarray()
    print(dir(x_subset))
    print(type(x_subset.to_dataframe()))
    print(type(x_subset.to_netcdf()))
    print(type(x_subset.to_zarr()))
It provides even more opportunities; refer to the xarray.DataArray API for details.
Locks
DEKER™ is thread- and process-safe. It uses its own locks for the majority of operations. DEKER™ locks can be divided into two groups: read locks and write locks.
Read locks can be shared between threads and processes with no risk of data corruption.
Write locks are exclusive and are taken on the files with the corresponding data content. Only the process or thread that has acquired a write lock may make any changes to the data.
This means that if one process is already writing some data into an HDF5 file (or into
an Array) and other processes want to read from it or write some other data
into the same file, they will receive a DekerLockError.
Note
Reading data from an Array that is locked for writing is impossible.
As for VArray, this means that several processes are able to update data
in several non-intersecting VSubsets. If an updating VSubset intersects
with another one, the update operation will be rejected for the VSubset that
encountered the write lock.
Please note that updating custom attributes also locks the Array or VArray files for writing.
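For instance, a minimal sketch of handling such a rejection with a simple retry (assuming DekerLockError is importable from deker.errors; the retry policy itself is only an illustration):
import time
from datetime import datetime

import numpy as np
from deker import Array, Client, Collection
from deker.errors import DekerLockError  # assumed import path

with Client("file:///tmp/deker") as client:
    collection: Collection = client.get_collection("weather")
    array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()
    data = np.random.random(array.shape)

    # if another process holds the write lock, retry a few times before giving up
    for attempt in range(5):
        try:
            array[:].update(data)
            break
        except DekerLockError:
            time.sleep(1)
    else:
        print("could not acquire the write lock, giving up")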