***********
Data Access
***********

Collections
===========

Retrieving Collections
----------------------

Retrieving collections, as well as creating them, is the responsibility of the ``Client``.
In the previous chapter we created a ``Collection`` named ``"weather"``. Now we are going
to get it::

    from deker import Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        print(collection)  # weather

If you have several collections on the same storage, you can iterate over them with the
``Client``::

    with Client("file:///tmp/deker") as client:
        for collection in client:
            print(collection)

The ``Collection`` object has several useful properties and methods for self-managing::

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")

        print(collection.name)
        print(collection.array_schema)   # returns schema of Array
        print(collection.varray_schema)  # returns schema of VArray if applicable, else None
        print(collection.path)           # returns physical storage path of the Collection
        print(collection.as_dict)        # serializes main information about Collection
                                         # into a dictionary

        collection.clear()   # removes all the Array and/or VArray objects from the
                             # storage, but retains the Collection metadata
        collection.delete()  # removes all the Array and/or VArray objects and the
                             # Collection metadata from the storage

Managers
--------

The ``Collection`` object has 3 kinds of managers to work with its contents:

1. ``default`` (or ``DataManager``) is the ``Collection`` itself
2. ``Collection.arrays`` (or ``ArraysManager``) is a manager responsible for ``Array``
3. ``Collection.varrays`` (or ``VArraysManager``) is a manager responsible for ``VArray``
   (unavailable in ``Array`` collections)

These managers are mixed with the ``FilteredManager`` object and are responsible for the
creation and filtering of the corresponding contents. All of them have the same interface.

The default manager is the preferred one. Having information about the ``Collection`` main
schema, the default manager decides what to create or to filter. If you have a ``VArray``
collection, it will create or filter ``VArray`` objects; if your collection is made of
``Array``, it will create or filter ``Array``. The other two are made for direct filtering
of ``Array`` or ``VArray``. Normally, you need the default one, and although the other two
are public, we will not describe them in this documentation.

Array Creation
--------------

Let's create our first ``Array``::

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.create({"dt": datetime(2023, 1, 1, 0)})
        print(array)

.. note:: Let's assume that hereinafter all the ``datetime`` objects, including
   timestamps and ISO 8601 strings, are in the **UTC timezone**.

As you remember, our schema contains a ``TimeDimensionSchema`` and a **primary** attribute
schema. ``TimeDimensionSchema.start_value`` was indicated as a reference to the
``AttributeSchema.name``, which allowed you to set an individual time start point for each
``Array``. That's why we passed ``{"dt": datetime(2023, 1, 1, 0)}`` to the creation
method, regardless of whether the attribute was defined as ``primary`` or ``custom``. Now
our ``Array`` knows the day and the hour when its data time series starts.

If some other primary attributes were defined, values for them should have been included
in this dictionary.

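For illustration only, here is a minimal sketch of such a call, assuming a *hypothetical*
schema with a second primary attribute named ``region`` (the schema we actually created
has only ``dt``)::

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        # hypothetical: "region" is NOT defined in our actual schema
        array: Array = collection.create(
            {"dt": datetime(2023, 1, 1, 0), "region": "europe"}
        )
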
If no attributes are defined in the schema, the method shall be called without parameters:
``collection.create()``.

When an ``Array`` or a ``VArray`` is created, it gets a unique ``id``, which is a UUID
string. ``Array`` and ``VArray`` IDs are generated automatically by different algorithms,
so the probability of getting two identical IDs tends to zero.

Fine, we have our first ``Array`` in the ``Collection``. Do we have any changes in our
storage? Yes, we do. If you list it with::

    ls -lh /tmp/deker/collections/weather

you will find out that there are two directories named ``array_data`` and
``array_symlinks`` and a file with the ``Collection`` metadata, ``weather.json``. Listing
these inner directories will tell you that you have an ``.hdf5`` file with the ``Array``
UUID in its name.

At the moment this file is almost empty. It contains just the ``Array`` metadata, as we
have not yet inserted any data into it. But it is created and ready to be used. Thus, we
can create all the ``Array`` objects in advance, without filling them with any data, and
retrieve them whenever we need them. Let's prepare our database for January 2023::

    from datetime import datetime, timedelta

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        for day in range(30):
            start_point = datetime(2023, 1, 2, 0) + timedelta(days=day)
            collection.create({"dt": start_point})

``Collection`` is an iterator, so we can get all its contents item by item::

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        for array in collection:
            print(array)

.. note:: Everything mentioned above in this section is applicable to ``VArray`` as well,
   except that a ``VArray`` collection path will contain two more directories:
   ``varray_data`` and ``varray_symlinks``.

Arrays Filtering
----------------

If we need to get a certain ``Array`` from the collection, we shall filter it out. As
previously stated, **primary** attributes allow you to find a certain ``Array`` or
``VArray`` in the ``Collection``. If no primary attribute is defined, you need either to
know its ``id`` or to iterate over the ``Collection`` until you find the right ``Array``
or ``VArray``.

.. attention:: It is highly recommended to define at least one **primary** attribute in
   every schema.

So you have two options how to filter an ``Array`` or ``VArray`` in a ``Collection``:

1. By ``id``
2. By primary attributes

For example, we saved the ``id`` of some ``Array`` to a variable; let's create a filter::

    from deker import Array, Client, Collection
    from deker.managers import FilteredManager

    id = "9d7b32ee-d51e-5a0b-b2d9-9a654cb1991d"

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        filter: FilteredManager = collection.filter({"id": id})

This ``filter`` is an instance of the ``FilteredManager`` object, which is also lazy. It
keeps the parameters for filtering, but no job has been done yet.

.. attention:: There is no query language or conditional matching for now; only strict
   matching is available.

The ``FilteredManager`` provides final methods for invocation of the filtered objects:

* ``first()``
* ``last()``

Since only strict matching is available, both of them will return the same object. They
are stubs for future query-language development.

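Continuing the snippet above, here is a short sketch of actually executing the query (it
assumes the ``id`` variable refers to an existing ``Array``)::

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        filter: FilteredManager = collection.filter({"id": id})

        array: Array = filter.first()  # the storage is accessed only now
        print(array.id == id)  # True
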
Now let's filter some ``Array`` by the primary attribute::

    from datetime import datetime

    from deker import Array, Client, Collection
    from deker.managers import FilteredManager

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")

        filter_1: FilteredManager = collection.filter({"dt": datetime(2023, 1, 3, 0)})
        filter_2: FilteredManager = collection.filter(
            {"dt": datetime(2023, 1, 15, 0).isoformat()}
        )

        array_1: Array = filter_1.first()
        array_2: Array = filter_2.last()

        print(array_1)
        print(array_2)
        assert array_1.id != array_2.id

As you see, attributes of ``datetime.datetime`` type can be filtered both by a
``datetime.datetime`` object and by its ISO 8601 string representation.

.. attention:: If your collection schema has **several** primary attributes, you must
   pass filtering values for **all** of them!

.. note:: Everything mentioned above in this section is applicable to ``VArray`` as well.

Array and VArray
================

As previously stated, both ``Array`` and ``VArray`` objects have the same interface.

Their common **properties** are:

* ``id``: returns the ``Array`` or ``VArray`` ID
* ``dtype``: returns the type of the ``Array`` or ``VArray`` data
* ``shape``: returns the ``Array`` or ``VArray`` shape as a tuple of dimension sizes
* ``named_shape``: returns the ``Array`` or ``VArray`` shape as a tuple of dimension
  names bound to their sizes
* ``dimensions``: returns a tuple of ``Array`` or ``VArray`` dimensions as objects
* ``schema``: returns the ``Array`` or ``VArray`` low-level schema
* ``collection``: returns the name of the ``Collection`` to which the ``Array`` is bound
* ``as_dict``: serializes main information about the array into a dictionary, prepared
  for JSON
* ``primary_attributes``: returns an ``OrderedDict`` of ``Array`` or ``VArray``
  **primary** attributes
* ``custom_attributes``: returns a ``dict`` of ``Array`` or ``VArray`` **custom**
  attributes

``VArray`` has two extra properties:

* ``arrays_shape``: returns the common shape of all the ``Array`` objects bound to the
  ``VArray``
* ``vgrid``: returns the virtual grid (a tuple of integers) by which the ``VArray`` is
  split into ``Array`` objects

Their common **methods** are:

* ``read_meta()``: reads the ``Array`` or ``VArray`` metadata from the storage
* ``update_custom_attributes()``: updates the ``Array`` or ``VArray`` custom attributes
  values
* ``delete()``: deletes the ``Array`` or ``VArray`` from the storage with all its data
  and metadata
* ``__getitem__()``: creates a ``Subset`` from an ``Array`` or a ``VSubset`` from a
  ``VArray``

Updating Custom Attributes
--------------------------

Updating custom attributes is quite simple. As you remember, our schema contains one
named ``tm`` (timestamp) with the ``int`` data type, and we have never defined its value.
It means that it is set to ``None`` in each ``Array``. Let's check it and update them
everywhere::

    from deker import Array, Client, Collection
    from deker.managers import FilteredManager

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        for array in collection:
            print(array.custom_attributes)  # {'tm': None}

            # type shall be `int`
            custom_attribute_value = int(array.primary_attributes["dt"].timestamp())
            array.update_custom_attributes({"tm": custom_attribute_value})
            print(array.custom_attributes)

If there are many custom attributes and you want to update just one or several of them -
no problem. Just pass a dictionary with values for the attributes you need to update. All
the others will not be harmed and will keep their values.

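For illustration only, here is a minimal sketch, assuming a *hypothetical* schema with
two custom attributes, ``tm`` and ``source`` (the schema we actually built has only
``tm``)::

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

        # hypothetical: only "tm" is updated, "source" keeps its current value
        array.update_custom_attributes({"tm": 1672531200})
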
Fancy Slicing
-------------

It is our privilege and pleasure to introduce the **fancy slicing** of your data!

We consider the ``__getitem__()`` method to be one of our pearls. Usually, you use
integers for native Python and NumPy indexing and the ``start``, ``stop`` and ``step``
slicing parameters::

    import numpy as np

    python_seq = range(10)
    np_seq = np.random.random((3, 4, 5))

    print(python_seq[1], python_seq[3:], python_seq[3:9:2])
    print(np_seq[2, 3, 4], np_seq[1:, :, 2], np_seq[:2, :, 1:4:2])

.. attention:: If you are new to NumPy indexing, please refer to the
   `official documentation`_

.. _`official documentation`: https://numpy.org/doc/stable/user/basics.indexing.html

DEKER™ allows you to index and slice its ``Array`` and ``VArray`` not only with integers,
but with the ``types`` by which the dimensions are described. But let's start with a
**constraint**.

Step
~~~~

Since a ``VArray`` is split into separate files, and each file can contain an ``Array``
with more than one dimension, the calculation of their inner bounds is a non-trivial
problem. That's why the ``step`` parameter **is limited** to ``1`` for both ``Array`` and
``VArray`` dimensions. This constraint is introduced to keep the behavior consistent,
although there is no such problem for ``Array``.

The workaround for a ``VArray`` is to read your data and slice it again with steps, if
you need them, as it will be a ``numpy.ndarray``.

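A minimal sketch of that workaround (it works for any subset, since the stepping is
applied to the in-memory copy)::

    from datetime import datetime

    from deker import Client

    with Client("file:///tmp/deker") as client:
        collection = client.get_collection("weather")
        array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

        data = array[:].read()  # the whole array as a numpy.ndarray
        strided = data[::2]     # apply the step to the in-memory copy
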
Start and Stop
~~~~~~~~~~~~~~

As mentioned earlier, if your dimensions have an additional description with ``scale`` or
``labels``, you can get rid of index calculations and pass your ``scale`` or ``labels``
values to the ``start`` and ``stop`` parameters. If you have a ``TimeDimension``, you can
slice it with ``datetime.datetime`` objects, their ISO 8601 formatted strings or
``float`` timestamps.

.. attention:: Remember that you shall convert your local timezone to UTC for proper
   ``TimeDimension`` slicing.

Let's have a closer look::

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 3, 0)}).first()

        start_dt = datetime(2023, 1, 3, 5)
        end_dt = datetime(2023, 1, 3, 10)

        fancy_subset = array[
            start_dt:end_dt,  # step is timedelta(hours=1)
            -44.0:-45.0,      # y-scale start point is 90.0 and step is -1.0 (90.0 ... -90.0)
            -1.0:1.0,         # x-scale start point is -180.0 and step is 1.0 (-180.0 ... 179.0)
            :"pressure"       # captures just "temperature" and "humidity"
        ]

        # which is the equivalent of:
        subset = array[5:10, 134:135, 179:181, :2]

        assert fancy_subset.shape == subset.shape
        assert fancy_subset.bounds == subset.bounds

It is great if you can keep all the indexes and their mappings in mind, but this feature
is awesome, isn't it? Yes, it is!

The values passed to each dimension's index or slice are converted to integers, and after
that they are set into native Python ``slice`` objects. A ``tuple`` of such ``slices`` is
the final representation of the bounds which will be applied to your data.

.. attention:: Fancy index values must **exactly** match your dimension's time points,
   ``scale`` or ``labels`` values; otherwise you will get an ``IndexError``.

You have not yet approached your data, but you are getting closer and closer. Now you
have a new object: a ``Subset``.

Subset and VSubset
==================

``Subset`` and ``VSubset`` are the final lazy objects for access to your data. Once
created, they contain no data and do not access the storage until you manually invoke one
of their corresponding methods.

.. note:: If you need to read or write all the data from an ``Array`` or ``VArray``, you
   should create a subset with ``[:]`` or ``[...]``.

Both of them also have the same interface. As for the properties, they are:

* ``shape``: returns the shape of the ``Subset`` or ``VSubset``
* ``bounds``: returns the bounds that were applied to the ``Array`` or ``VArray``
* ``dtype``: returns the type of the queried data
* ``fill_value``: returns the value for empty cells

Let's dive deeper into the methods.

.. note:: The explanations below are based on the logic implemented for the ``HDF5``
   format.

Read
----

The ``read()`` method gets data from the storage and returns a ``numpy.ndarray`` of the
corresponding ``shape`` and ``dtype``.

As for ``VArray`` data reading, a ``VSubset`` will capture the data from the ``Array``
objects affected by the passed bounds, arrange it in a single ``numpy.ndarray`` of the
proper ``shape`` and ``dtype``, and return it to you.

If your ``Array`` or ``VArray`` is **empty**, a ``numpy.ndarray`` filled with
``fill_value`` will be returned for any ``Subset`` or ``VSubset``::

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 15, 0)}).first()
        subset = array[0, 0, 0]  # get first hour and grid zero-point
        print(subset.read())  # [nan, nan, nan, nan]

Update
------

The ``update()`` method is an **upsert** method, responsible both for **inserting** new
values and for **updating** old ones.

The shape of the data that you pass into this method shall match the shape of the
``Subset`` or ``VSubset``. It is impossible to insert 10 values into 9 cells. It is also
impossible to insert them into 11 cells, as there are no instructions on how to arrange
them properly. ::

    import numpy as np

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()
        subset = array[:]  # captures full array shape
        data = np.random.random(subset.shape)
        subset.update(data)

The provided data ``dtype`` shall match the dtype of the ``Array`` or ``VArray`` set by
the schema, or shall have the corresponding Python type convertible into such ``dtype``::

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()
        subset = array[:]  # captures full array shape

        # converts data into a Python list of Python floats
        data = np.random.random(subset.shape).tolist()
        subset.update(data)  # data will be converted to array.dtype

If your ``Array`` or ``VArray`` is utterly empty, the ``Subset`` or ``VSubset`` will
create a ``numpy.ndarray`` of the ``Array`` shape, filled with the ``fill_value`` from
the ``Collection`` schema, and then, using the indicated bounds, it will insert the data
provided by you into this array. Afterwards it will be dumped to the storage. For a
``VArray`` it will work in the same manner, except that only the corresponding affected
inner ``Array`` objects will be created.

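A short sketch of this behavior: after a partial update of an empty ``Array``, the cells
outside the updated bounds still hold ``fill_value``::

    import numpy as np

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        # assuming the Array for January 20th is still empty
        array: Array = collection.filter({"dt": datetime(2023, 1, 20, 0)}).first()

        subset = array[0]  # the first hour only
        subset.update(np.ones(subset.shape))

        print(array[1].read())  # a numpy.ndarray full of `nan`
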
If there is some data in your ``Array`` or ``VArray`` and you provide some new values
with this method, the old values in the affected bounds will be substituted with the new
ones::

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

        data = np.random.random(array.shape)
        array[:].update(data)

        subset = array[0, 0, 0]  # get first hour and grid zero-point
        print(subset.read())  # a list of 4 random values

        new_values = [0.1, 0.2, 0.3, 0.4]
        subset.update(new_values)  # data will be converted to array.dtype
        print(subset.read())  # [0.1, 0.2, 0.3, 0.4]

Clear
-----

The ``clear()`` method inserts the ``fill_value`` into the affected bounds. If all your
``Array`` or ``VArray`` values are ``fill_value``, it will be considered empty, and the
data set will be deleted from the file. But the file still exists and retains the
``Array`` or ``VArray`` metadata::

    import numpy as np

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

        data = np.random.random(array.shape)
        array[:].update(data)

        subset = array[0, 0, 0]  # get first hour and grid zero-point
        print(subset.read())  # a list of 4 random values

        new_values = [0.1, 0.2, 0.3, 0.4]
        subset.update(new_values)  # data will be converted to array.dtype
        print(subset.read())  # [0.1, 0.2, 0.3, 0.4]

        subset.clear()
        print(subset.read())  # [nan, nan, nan, nan]

        array[:].clear()
        print(array[:].read())  # a numpy.ndarray full of `nans`

Describe
--------

You may want to check what part of the data you are going to manage. With ``describe()``
you can get an ``OrderedDict`` with a description of the dimension parts affected by the
``Subset`` or ``VSubset``. If you provided ``scale`` and/or ``labels`` for your
dimensions, you will get a human-readable description; otherwise you will get indexes. So
it is highly recommended to describe your dimensions::

    from datetime import datetime
    from pprint import pprint

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

        pprint(array[0, 0, 0].describe())
        # OrderedDict([('day_hours',
        #               [datetime.datetime(2023, 1, 1, 0, 0, tzinfo=datetime.timezone.utc)]),
        #              ('y', [90.0]),
        #              ('x', [-180.0]),
        #              ('weather', ['temperature', 'humidity', 'pressure', 'wind_speed'])])

        subset = array[
            datetime(2023, 1, 1, 5):datetime(2023, 1, 1, 10),
            -44.0:-45.0,
            -1.0:1.0,
            :"pressure"
        ]
        pprint(subset.describe())
        # OrderedDict([('day_hours',
        #               [datetime.datetime(2023, 1, 1, 5, 0, tzinfo=datetime.timezone.utc),
        #                datetime.datetime(2023, 1, 1, 6, 0, tzinfo=datetime.timezone.utc),
        #                datetime.datetime(2023, 1, 1, 7, 0, tzinfo=datetime.timezone.utc),
        #                datetime.datetime(2023, 1, 1, 8, 0, tzinfo=datetime.timezone.utc),
        #                datetime.datetime(2023, 1, 1, 9, 0, tzinfo=datetime.timezone.utc)]),
        #              ('y', [-44.0]),
        #              ('x', [-1.0, 0.0]),
        #              ('weather', ['temperature', 'humidity'])])

.. attention:: A description is an ``OrderedDict`` whose values hold the full ranges of
   descriptive data for the ``Subset`` or ``VSubset``. Keeping such a description in
   memory consumes memory proportional to its size.

Read Xarray
-----------

.. warning:: The ``xarray`` package is not in the list of the DEKER™ default
   dependencies.
   Please refer to the Installation_ chapter for more details.

Xarray_ is a wonderful project which provides special objects for working with
multidimensional data. Its main principle is *the data shall be described*. We absolutely
agree with that.

The ``read_xarray()`` method describes a ``Subset`` or ``VSubset``, reads its contents
and converts them to an ``xarray.DataArray`` object. If you need to convert your data to
``pandas`` objects, to ``netCDF`` or to ``ZARR``, use this method, and after it use the
methods provided by ``xarray.DataArray``::

    import numpy as np
    import xarray

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

        data = np.random.random(array.shape)
        array[:].update(data)

        subset = array[0, 0, 0]  # get first hour and grid zero-point
        x_subset: xarray.DataArray = subset.read_xarray()

        print(dir(x_subset))
        print(type(x_subset.to_dataframe()))
        print(type(x_subset.to_netcdf()))
        print(type(x_subset.to_zarr()))

It provides even more opportunities. Refer to the ``xarray.DataArray`` API_ for details.

.. _Installation: installation.html#extra-dependencies
.. _Xarray: https://docs.xarray.dev/en/stable/
.. _API: https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html

Locks
=====

DEKER™ is thread- and process-safe. It uses its own locks for the majority of operations.
DEKER™ locks can be divided into two groups: **read** and **write** locks.

**Read locks** can be shared between threads and processes with no risk of data
corruption.

**Write locks** are exclusive and are taken for the files with the corresponding data
contents. Only the process or thread which has already acquired a write lock may produce
any changes in the data. It means that if one process is already writing some data into
an ``HDF5`` file (or into an ``Array``), and some other processes want to read from it or
to write some other data into the same file, they will receive a ``DekerLockError``.

.. note:: Reading data from an ``Array`` which is locked for writing is impossible.

For a ``VArray`` this means that several processes are able to update data in several
non-intersecting ``VSubsets``. If an updating ``VSubset`` intersects with another one,
the update operation will be rejected for the ``VSubset`` that encounters the write lock.

Please note that updating custom attributes also locks the ``Array`` or ``VArray`` files
for writing.

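A minimal sketch of handling such a rejection; it assumes ``DekerLockError`` is
importable from ``deker.errors``::

    import numpy as np

    from datetime import datetime

    from deker import Client
    from deker.errors import DekerLockError  # assumed import path

    with Client("file:///tmp/deker") as client:
        collection = client.get_collection("weather")
        array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()
        subset = array[:]
        try:
            subset.update(np.random.random(subset.shape))
        except DekerLockError:
            # another process or thread holds the write lock on this file;
            # back off and retry later
            pass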