***********
Data Access
***********

Collections
===========

Retrieving Collections
----------------------

Retrieving collections, as well as creating them, is the responsibility of the ``Client``.
In the previous chapter we created a ``Collection`` named ``"weather"``. Now we are going
to get it::

    from deker import Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        print(collection)  # weather

If you have several collections on the same storage, you can iterate over them with the
``Client``::

    with Client("file:///tmp/deker") as client:
        for collection in client:
            print(collection)

The ``Collection`` object has several useful properties and methods for self-managing::

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")

        print(collection.name)
        print(collection.array_schema)   # returns schema of Array
        print(collection.varray_schema)  # returns schema of VArray if applicable, else None
        print(collection.path)           # returns physical storage path of the Collection
        print(collection.as_dict)        # serializes main information about Collection
                                         # into a dictionary

        collection.clear()   # removes all the Array and/or VArray objects from the
                             # storage, but retains the Collection metadata
        collection.delete()  # removes all the Array and/or VArray objects and the
                             # Collection metadata from the storage

Managers
--------

The ``Collection`` object has 3 kinds of managers to work with its contents:

1. ``default`` (or ``DataManager``) is the ``Collection`` itself
2. ``Collection.arrays`` (or ``ArraysManager``) is a manager responsible for ``Array``
3. ``Collection.varrays`` (or ``VArraysManager``) is a manager responsible for ``VArray``
   (unavailable in ``Array`` collections)

These managers are mixed with the ``FilteredManager`` object and are responsible for the
creation and filtering of the corresponding contents. All of them have the same interface.

The default manager is the preferred one. Having information about the ``Collection`` main
schema, the default manager decides what to create or to filter. If you have a ``VArray``
collection, it will create or filter ``VArray`` objects; if your collection is made of
``Array``, it will create or filter ``Array``. The other two are made for direct filtering
of ``Array`` or ``VArray``. Normally, you need the default one, and although the other two
are public, we will not describe them in this documentation.

Array Creation
--------------

Let's create our first ``Array``::

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.create({"dt": datetime(2023, 1, 1, 0)})
        print(array)

.. note:: Let's assume that hereinafter all the ``datetime`` objects, including
   timestamps and ISO 8601 strings, are in the **UTC timezone**.

As you remember, our schema contains a ``TimeDimensionSchema`` and a **primary** attribute
schema. ``TimeDimensionSchema.start_value`` was indicated as a reference to the
``AttributeSchema.name``, which allowed you to set an individual time start point for each
``Array``. That's why we passed ``{"dt": datetime(2023, 1, 1, 0)}`` to the creation
method, regardless of whether the attribute was defined as ``primary`` or ``custom``. Now
our ``Array`` knows the day and the hour when its data time series starts.

If some other primary attributes were defined, values for them should have been included
in this dictionary.

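For illustration only, here is a minimal sketch of such a call, assuming a *hypothetical*
schema with a second primary attribute named ``region`` (the schema we actually created
has only ``dt``)::

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        # hypothetical: "region" is NOT defined in our actual schema
        array: Array = collection.create(
            {"dt": datetime(2023, 1, 1, 0), "region": "europe"}
        )
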
If no attributes are defined in the schema, the method shall be called without parameters:
``collection.create()``.

When an ``Array`` or a ``VArray`` is created, it gets a unique ``id``, which is a UUID
string. ``Array`` and ``VArray`` IDs are generated automatically by different algorithms,
so the probability of getting two identical IDs tends to zero.

Fine, we have our first ``Array`` in the ``Collection``. Do we have any changes in our
storage? Yes, we do. If you list it with::

    ls -lh /tmp/deker/collections/weather

you will find out that there are two directories named ``array_data`` and
``array_symlinks`` and a file with the ``Collection`` metadata, ``weather.json``. Listing
these inner directories will tell you that you have an ``.hdf5`` file with the ``Array``
UUID in its name.

At the moment this file is almost empty. It contains just the ``Array`` metadata, as we
have not yet inserted any data into it. But it is created and ready to be used. Thus, we
can create all the ``Array`` objects in advance, without filling them with any data, and
retrieve them whenever we need them. Let's prepare our database for January 2023::

    from datetime import datetime, timedelta

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        for day in range(30):
            start_point = datetime(2023, 1, 2, 0) + timedelta(days=day)
            collection.create({"dt": start_point})

``Collection`` is an iterator, so we can get all its contents item by item::

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        for array in collection:
            print(array)

.. note:: Everything mentioned above in this section is applicable to ``VArray`` as well,
   except that a ``VArray`` collection path will contain two more directories:
   ``varray_data`` and ``varray_symlinks``.

Arrays Filtering
----------------

If we need to get a certain ``Array`` from the collection, we shall filter it out. As
previously stated, **primary** attributes allow you to find a certain ``Array`` or
``VArray`` in the ``Collection``. If no primary attribute is defined, you need either to
know its ``id`` or to iterate over the ``Collection`` until you find the right ``Array``
or ``VArray``.

.. attention:: It is highly recommended to define at least one **primary** attribute in
   every schema.

So you have two options how to filter an ``Array`` or ``VArray`` in a ``Collection``:

1. By ``id``
2. By primary attributes

For example, we saved the ``id`` of some ``Array`` to a variable; let's create a filter::

    from deker import Array, Client, Collection
    from deker.managers import FilteredManager

    id = "9d7b32ee-d51e-5a0b-b2d9-9a654cb1991d"

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        filter: FilteredManager = collection.filter({"id": id})

This ``filter`` is an instance of the ``FilteredManager`` object, which is also lazy. It
keeps the parameters for filtering, but no job has been done yet.

.. attention:: There is no query language or conditional matching for now; only strict
   matching is available.

The ``FilteredManager`` provides final methods for invocation of the filtered objects:

* ``first()``
* ``last()``

Since only strict matching is available, both of them will return the same object. They
are stubs for future query-language development.

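Continuing the snippet above, here is a short sketch of actually executing the query (it
assumes the ``id`` variable refers to an existing ``Array``)::

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        filter: FilteredManager = collection.filter({"id": id})

        array: Array = filter.first()  # the storage is accessed only now
        print(array.id == id)  # True
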
Now let's filter some ``Array`` by the primary attribute::

    from datetime import datetime

    from deker import Array, Client, Collection
    from deker.managers import FilteredManager

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")

        filter_1: FilteredManager = collection.filter({"dt": datetime(2023, 1, 3, 0)})
        filter_2: FilteredManager = collection.filter(
            {"dt": datetime(2023, 1, 15, 0).isoformat()}
        )

        array_1: Array = filter_1.first()
        array_2: Array = filter_2.last()

        print(array_1)
        print(array_2)
        assert array_1.id != array_2.id

As you see, attributes of ``datetime.datetime`` type can be filtered both by a
``datetime.datetime`` object and by its ISO 8601 string representation.

.. attention:: If your collection schema has **several** primary attributes, you must
   pass filtering values for **all** of them!

.. note:: Everything mentioned above in this section is applicable to ``VArray`` as well.

Array and VArray
================

As previously stated, both ``Array`` and ``VArray`` objects have the same interface.

Their common **properties** are:

* ``id``: returns the ``Array`` or ``VArray`` ID
* ``dtype``: returns the type of the ``Array`` or ``VArray`` data
* ``shape``: returns the ``Array`` or ``VArray`` shape as a tuple of dimension sizes
* ``named_shape``: returns the ``Array`` or ``VArray`` shape as a tuple of dimension
  names bound to their sizes
* ``dimensions``: returns a tuple of ``Array`` or ``VArray`` dimensions as objects
* ``schema``: returns the ``Array`` or ``VArray`` low-level schema
* ``collection``: returns the name of the ``Collection`` to which the ``Array`` is bound
* ``as_dict``: serializes main information about the array into a dictionary, prepared
  for JSON
* ``primary_attributes``: returns an ``OrderedDict`` of ``Array`` or ``VArray``
  **primary** attributes
* ``custom_attributes``: returns a ``dict`` of ``Array`` or ``VArray`` **custom**
  attributes

``VArray`` has two extra properties:

* ``arrays_shape``: returns the common shape of all the ``Array`` objects bound to the
  ``VArray``
* ``vgrid``: returns the virtual grid (a tuple of integers) by which the ``VArray`` is
  split into ``Array`` objects

Their common **methods** are:

* ``read_meta()``: reads the ``Array`` or ``VArray`` metadata from the storage
* ``update_custom_attributes()``: updates the ``Array`` or ``VArray`` custom attributes
  values
* ``delete()``: deletes the ``Array`` or ``VArray`` from the storage with all its data
  and metadata
* ``__getitem__()``: creates a ``Subset`` from an ``Array`` or a ``VSubset`` from a
  ``VArray``

Updating Custom Attributes
--------------------------

Updating custom attributes is quite simple. As you remember, our schema contains one
named ``tm`` (timestamp) with the ``int`` data type, and we have never defined its value.
It means that it is set to ``None`` in each ``Array``. Let's check it and update them
everywhere::

    from deker import Array, Client, Collection
    from deker.managers import FilteredManager

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        for array in collection:
            print(array.custom_attributes)  # {'tm': None}

            # type shall be `int`
            custom_attribute_value = int(array.primary_attributes["dt"].timestamp())
            array.update_custom_attributes({"tm": custom_attribute_value})
            print(array.custom_attributes)

If there are many custom attributes and you want to update just one or several of them -
no problem. Just pass a dictionary with values for the attributes you need to update. All
the others will not be harmed and will keep their values.

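For illustration only, here is a minimal sketch, assuming a *hypothetical* schema with
two custom attributes, ``tm`` and ``source`` (the schema we actually built has only
``tm``)::

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

        # hypothetical: only "tm" is updated, "source" keeps its current value
        array.update_custom_attributes({"tm": 1672531200})
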
Fancy Slicing
-------------

It is our privilege and pleasure to introduce the **fancy slicing** of your data!

We consider the ``__getitem__()`` method to be one of our pearls. Usually, you use
integers for native Python and NumPy indexing and the ``start``, ``stop`` and ``step``
slicing parameters::

    import numpy as np

    python_seq = range(10)
    np_seq = np.random.random((3, 4, 5))

    print(python_seq[1], python_seq[3:], python_seq[3:9:2])
    print(np_seq[2, 3, 4], np_seq[1:, :, 2], np_seq[:2, :, 1:4:2])

.. attention:: If you are new to NumPy indexing, please refer to the
   `official documentation`_

.. _`official documentation`: https://numpy.org/doc/stable/user/basics.indexing.html

DEKER™ allows you to index and slice its ``Array`` and ``VArray`` not only with integers,
but with the ``types`` by which the dimensions are described. But let's start with a
**constraint**.

Step
~~~~

Since a ``VArray`` is split into separate files, and each file can contain an ``Array``
with more than one dimension, the calculation of their inner bounds is a non-trivial
problem. That's why the ``step`` parameter **is limited** to ``1`` for both ``Array`` and
``VArray`` dimensions. This constraint is introduced to keep the behavior consistent,
although there is no such problem for ``Array``.

The workaround for a ``VArray`` is to read your data and slice it again with steps, if
you need them, as it will be a ``numpy.ndarray``.

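A minimal sketch of that workaround (it works for any subset, since the stepping is
applied to the in-memory copy)::

    from datetime import datetime

    from deker import Client

    with Client("file:///tmp/deker") as client:
        collection = client.get_collection("weather")
        array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

        data = array[:].read()  # the whole array as a numpy.ndarray
        strided = data[::2]     # apply the step to the in-memory copy
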
Start and Stop
~~~~~~~~~~~~~~

As mentioned earlier, if your dimensions have an additional description with ``scale`` or
``labels``, you can get rid of index calculations and pass your ``scale`` or ``labels``
values to the ``start`` and ``stop`` parameters. If you have a ``TimeDimension``, you can
slice it with ``datetime.datetime`` objects, their ISO 8601 formatted strings or
``float`` timestamps.

.. attention:: Remember that you shall convert your local timezone to UTC for proper
   ``TimeDimension`` slicing.

Let's have a closer look::

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 3, 0)}).first()

        start_dt = datetime(2023, 1, 3, 5)
        end_dt = datetime(2023, 1, 3, 10)

        fancy_subset = array[
            start_dt:end_dt,  # step is timedelta(hours=1)
            -44.0:-45.0,      # y-scale start point is 90.0 and step is -1.0 (90.0 ... -90.0)
            -1.0:1.0,         # x-scale start point is -180.0 and step is 1.0 (-180.0 ... 179.0)
            :"pressure"       # captures just "temperature" and "humidity"
        ]

        # which is the equivalent of:
        subset = array[5:10, 134:135, 179:181, :2]

        assert fancy_subset.shape == subset.shape
        assert fancy_subset.bounds == subset.bounds

It is great if you can keep all the indexes and their mappings in mind, but this feature
is awesome, isn't it? Yes, it is!

The values passed to each dimension's index or slice are converted to integers, and after
that they are set into native Python ``slice`` objects. A ``tuple`` of such ``slices`` is
the final representation of the bounds which will be applied to your data.

.. attention:: Fancy index values must **exactly** match your dimension's time points,
   ``scale`` or ``labels`` values; otherwise you will get an ``IndexError``.

You have not yet approached your data, but you are getting closer and closer. Now you
have a new object: a ``Subset``.

Subset and VSubset
==================

``Subset`` and ``VSubset`` are the final lazy objects for access to your data. Once
created, they contain no data and do not access the storage until you manually invoke one
of their corresponding methods.

.. note:: If you need to read or write all the data from an ``Array`` or ``VArray``, you
   should create a subset with ``[:]`` or ``[...]``.

Both of them also have the same interface. As for the properties, they are:

* ``shape``: returns the shape of the ``Subset`` or ``VSubset``
* ``bounds``: returns the bounds that were applied to the ``Array`` or ``VArray``
* ``dtype``: returns the type of the queried data
* ``fill_value``: returns the value for empty cells

Let's dive deeper into the methods.

.. note:: The explanations below are based on the logic implemented for the ``HDF5``
   format.

Read
----

The ``read()`` method gets data from the storage and returns a ``numpy.ndarray`` of the
corresponding ``shape`` and ``dtype``.

As for ``VArray`` data reading, a ``VSubset`` will capture the data from the ``Array``
objects affected by the passed bounds, arrange it in a single ``numpy.ndarray`` of the
proper ``shape`` and ``dtype``, and return it to you.

If your ``Array`` or ``VArray`` is **empty**, a ``numpy.ndarray`` filled with
``fill_value`` will be returned for any ``Subset`` or ``VSubset``::

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 15, 0)}).first()
        subset = array[0, 0, 0]  # get first hour and grid zero-point
        print(subset.read())  # [nan, nan, nan, nan]

Update
------

The ``update()`` method is an **upsert** method, responsible both for **inserting** new
values and for **updating** old ones.

The shape of the data that you pass into this method shall match the shape of the
``Subset`` or ``VSubset``. It is impossible to insert 10 values into 9 cells. It is also
impossible to insert them into 11 cells, as there are no instructions on how to arrange
them properly. ::

    import numpy as np

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()
        subset = array[:]  # captures full array shape
        data = np.random.random(subset.shape)
        subset.update(data)

The provided data ``dtype`` shall match the dtype of the ``Array`` or ``VArray`` set by
the schema, or shall have the corresponding Python type convertible into such ``dtype``::

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()
        subset = array[:]  # captures full array shape

        # converts data into a Python list of Python floats
        data = np.random.random(subset.shape).tolist()
        subset.update(data)  # data will be converted to array.dtype

If your ``Array`` or ``VArray`` is utterly empty, the ``Subset`` or ``VSubset`` will
create a ``numpy.ndarray`` of the ``Array`` shape, filled with the ``fill_value`` from
the ``Collection`` schema, and then, using the indicated bounds, it will insert the data
provided by you into this array. Afterwards it will be dumped to the storage. For a
``VArray`` it will work in the same manner, except that only the corresponding affected
inner ``Array`` objects will be created.

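A short sketch of this behavior: after a partial update of an empty ``Array``, the cells
outside the updated bounds still hold ``fill_value``::

    import numpy as np

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        # assuming the Array for January 20th is still empty
        array: Array = collection.filter({"dt": datetime(2023, 1, 20, 0)}).first()

        subset = array[0]  # the first hour only
        subset.update(np.ones(subset.shape))

        print(array[1].read())  # a numpy.ndarray full of `nan`
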
If there is some data in your ``Array`` or ``VArray`` and you provide some new values
with this method, the old values in the affected bounds will be substituted with the new
ones::

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

        data = np.random.random(array.shape)
        array[:].update(data)

        subset = array[0, 0, 0]  # get first hour and grid zero-point
        print(subset.read())  # a list of 4 random values

        new_values = [0.1, 0.2, 0.3, 0.4]
        subset.update(new_values)  # data will be converted to array.dtype
        print(subset.read())  # [0.1, 0.2, 0.3, 0.4]

Clear
-----

The ``clear()`` method inserts the ``fill_value`` into the affected bounds. If all your
``Array`` or ``VArray`` values are ``fill_value``, it will be considered empty, and the
data set will be deleted from the file. But the file still exists and retains the
``Array`` or ``VArray`` metadata::

    import numpy as np

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

        data = np.random.random(array.shape)
        array[:].update(data)

        subset = array[0, 0, 0]  # get first hour and grid zero-point
        print(subset.read())  # a list of 4 random values

        new_values = [0.1, 0.2, 0.3, 0.4]
        subset.update(new_values)  # data will be converted to array.dtype
        print(subset.read())  # [0.1, 0.2, 0.3, 0.4]

        subset.clear()
        print(subset.read())  # [nan, nan, nan, nan]

        array[:].clear()
        print(array[:].read())  # a numpy.ndarray full of `nans`

Describe
--------

You may want to check what part of the data you are going to manage. With ``describe()``
you can get an ``OrderedDict`` with a description of the dimension parts affected by the
``Subset`` or ``VSubset``. If you provided ``scale`` and/or ``labels`` for your
dimensions, you will get a human-readable description; otherwise you will get indexes. So
it is highly recommended to describe your dimensions::

    from datetime import datetime
    from pprint import pprint

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

        pprint(array[0, 0, 0].describe())
        # OrderedDict([('day_hours',
        #               [datetime.datetime(2023, 1, 1, 0, 0, tzinfo=datetime.timezone.utc)]),
        #              ('y', [90.0]),
        #              ('x', [-180.0]),
        #              ('weather', ['temperature', 'humidity', 'pressure', 'wind_speed'])])

        subset = array[
            datetime(2023, 1, 1, 5):datetime(2023, 1, 1, 10),
            -44.0:-45.0,
            -1.0:1.0,
            :"pressure"
        ]
        pprint(subset.describe())
        # OrderedDict([('day_hours',
        #               [datetime.datetime(2023, 1, 1, 5, 0, tzinfo=datetime.timezone.utc),
        #                datetime.datetime(2023, 1, 1, 6, 0, tzinfo=datetime.timezone.utc),
        #                datetime.datetime(2023, 1, 1, 7, 0, tzinfo=datetime.timezone.utc),
        #                datetime.datetime(2023, 1, 1, 8, 0, tzinfo=datetime.timezone.utc),
        #                datetime.datetime(2023, 1, 1, 9, 0, tzinfo=datetime.timezone.utc)]),
        #              ('y', [-44.0]),
        #              ('x', [-1.0, 0.0]),
        #              ('weather', ['temperature', 'humidity'])])

.. attention:: A description is an ``OrderedDict`` whose values hold the full ranges of
   descriptive data for the ``Subset`` or ``VSubset``. Keeping such a description in
   memory consumes memory proportional to its size.

Read Xarray
-----------

.. warning:: The ``xarray`` package is not in the list of the DEKER™ default
   dependencies.
   Please refer to the Installation_ chapter for more details.

Xarray_ is a wonderful project which provides special objects for working with
multidimensional data. Its main principle is *the data shall be described*. We absolutely
agree with that.

The ``read_xarray()`` method describes a ``Subset`` or ``VSubset``, reads its contents
and converts them to an ``xarray.DataArray`` object. If you need to convert your data to
``pandas`` objects, to ``netCDF`` or to ``ZARR``, use this method, and after it use the
methods provided by ``xarray.DataArray``::

    import numpy as np
    import xarray

    from datetime import datetime

    from deker import Array, Client, Collection

    with Client("file:///tmp/deker") as client:
        collection: Collection = client.get_collection("weather")
        array: Array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()

        data = np.random.random(array.shape)
        array[:].update(data)

        subset = array[0, 0, 0]  # get first hour and grid zero-point
        x_subset: xarray.DataArray = subset.read_xarray()

        print(dir(x_subset))
        print(type(x_subset.to_dataframe()))
        print(type(x_subset.to_netcdf()))
        print(type(x_subset.to_zarr()))

It provides even more opportunities. Refer to the ``xarray.DataArray`` API_ for details.

.. _Installation: installation.html#extra-dependencies
.. _Xarray: https://docs.xarray.dev/en/stable/
.. _API: https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html

Locks
=====

DEKER™ is thread- and process-safe. It uses its own locks for the majority of operations.
DEKER™ locks can be divided into two groups: **read** and **write** locks.

**Read locks** can be shared between threads and processes with no risk of data
corruption.

**Write locks** are exclusive and are taken for the files with the corresponding data
contents. Only the process or thread which has already acquired a write lock may produce
any changes in the data. It means that if one process is already writing some data into
an ``HDF5`` file (or into an ``Array``), and some other processes want to read from it or
to write some other data into the same file, they will receive a ``DekerLockError``.

.. note:: Reading data from an ``Array`` which is locked for writing is impossible.

For a ``VArray`` this means that several processes are able to update data in several
non-intersecting ``VSubsets``. If an updating ``VSubset`` intersects with another one,
the update operation will be rejected for the ``VSubset`` that encounters the write lock.

Please note that updating custom attributes also locks the ``Array`` or ``VArray`` files
for writing.

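A minimal sketch of handling such a rejection; it assumes ``DekerLockError`` is
importable from ``deker.errors``::

    import numpy as np

    from datetime import datetime

    from deker import Client
    from deker.errors import DekerLockError  # assumed import path

    with Client("file:///tmp/deker") as client:
        collection = client.get_collection("weather")
        array = collection.filter({"dt": datetime(2023, 1, 1, 0)}).first()
        subset = array[:]
        try:
            subset.update(np.random.random(subset.shape))
        except DekerLockError:
            # another process or thread holds the write lock on this file;
            # back off and retry later
            pass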