CLEPS architecture
==================

What is a cluster?
------------------

A computing cluster is a group of interconnected computers. The typical
architecture is composed of:

* a login node
* several computing nodes
* storage nodes
* a network

.. image:: ../img/ArchitectureSimpleCluster.png

In the case of CLEPS, users connect to the login node and schedule their jobs
from there. The `scheduler` (also called the resource manager) is responsible
for finding available computing resources and starting jobs on them.

.. _comp_nodes:

CLEPS Compute nodes
-------------------

All the nodes run CentOS 7.9 with Linux kernel 3.10.0.

.. cssclass:: table-striped

.. list-table::
   :header-rows: 1

   * - id
     - number of nodes
     - RAM
     - ``/local`` disk
     - processor type
     - hyperthreading
     - total number of cores
     - average memory per core
     - Features
     - InfiniBand network
     - GPU/node
     - CPUs/GPU [*]_
   * - node0[01-20]
     - 20
     - 192 GB, 2667 MHz
     - 220 GB, 6 GB/s, SSD
     - 2x Cascade Lake Intel Xeon 5218, 16 cores, 2.4 GHz
     - yes
     - 640
     - 6 GB
     - hyperthreading,192go,cascadelake
     - 100 Gb/s
     - none
     - n/a
   * - node0[21-24]
     - 4
     - 176 GB, 2400 MHz
     - 600 GB, 6 GB/s, HDD
     - 2x Intel Xeon E5-2650 v4, 12 cores
     - yes
     - 96
     - 7.3 GB
     - hyperthreading,176go,broadwell
     - 56 Gb/s
     - none
     - n/a
   * - node0[25-28]
     - 4
     - 128 GB, 2667 MHz
     - 800 GB, 6 GB/s, HDD
     - 2x Skylake Intel Xeon 5118, 12 cores
     - yes
     - 96
     - 5.3 GB
     - hyperthreading,128go,skylake
     - 100 Gb/s
     - none
     - n/a
   * - node0[29-40]
     - 8
     - 192 GB, 2400 MHz
     - 800 GB, 6 GB/s, HDD
     - 2x Broadwell Xeon E5-2650 v4, 12 cores
     - yes
     - 288
     - 8 GB
     - hyperthreading,192go,broadwell
     - 56 Gb/s
     - none
     - n/a
   * - node0[41-44]
     - 4
     - 256 GB, 3200 MHz
     - 370 GB, 6 GB/s, SSD
     - 2x AMD EPYC 7352, 24 cores, 2.3 GHz
     - yes
     - 192
     - 5.3 GB
     - hyperthreading,amd,256go
     - 100 Gb/s
     - none
     - n/a
   * - node0[45-48]
     - 4
     - 128 GB, 2133 MHz
     - 800 GB, 6 GB/s, HDD
     - 2x Broadwell Xeon E5-2695 v3, 14 cores
     - yes
     - 112
     - 4.6 GB
     - hyperthreading,128go,broadwell
     - 56 Gb/s
     - none
     - n/a
   * - node0[49-56]
     - 8
     - 128 GB, 2400 MHz
     - 800 GB, 6 GB/s, HDD
     - 2x Broadwell Xeon E5-2695 v4, 18 cores
     - no
     - 288
     - 3.6 GB
     - nohyperthreading,128go,broadwell
     - 56 Gb/s
     - none
     - n/a
   * - mem001
     - 1
     - 3 TB, 1333 MHz
     - 200 GB, 6 GB/s, HDD
     - 4x Intel Xeon E7-4860 v2, 12 cores, 2.6-3.2 GHz
     - no
     - 48
     - 62.5 GB
     - nohyperthreading,3to
     - 56 Gb/s
     - none
     - n/a
   * - gpu001
     - 1
     - 192 GB, 2667 MHz
     - 3.8 TB, 12 GB/s, SSD
     - 2x Cascade Lake Intel Xeon 5217, 8 cores, 3-3.7 GHz
     - no
     - 16
     - 12 GB
     - nohyperthreading,192go,v100
     - 100 Gb/s
     - 2x Nvidia V100 32GB
     - 8
   * - gpu00[2-3]
     - 2
     - 192 GB, 3200 MHz
     - 1.5 TB, 12 GB/s, NVMe
     - 2x AMD EPYC 7302, 16 cores, 3-3.3 GHz
     - yes
     - 64
     - 6 GB
     - hyperthreading,192go,rtx6000
     - 100 Gb/s
     - 3x Nvidia RTX6000 24GB
     - 16
   * - gpu00[4-5]
     - 2
     - 96 GB, 2400 MHz
     - 200 GB, 6 GB/s, HDD
     - 2x Skylake Intel Xeon 5118, 12 cores, 2.3-3.2 GHz
     - yes
     - 48
     - 4 GB
     - hyperthreading,96go,gtx1080ti
     - 56 Gb/s
     - 4x Nvidia GTX 1080Ti
     - 12
   * - gpu00[6-9]
     - 4
     - 192 GB, 3200 MHz
     - 1.5 TB, 12 GB/s, NVMe
     - 2x AMD EPYC 7302, 16 cores, 3-3.3 GHz
     - yes
     - 128
     - 6 GB
     - hyperthreading,192go,rtx8000
     - 100 Gb/s
     - 3x Nvidia RTX8000 48GB
     - 16
   * - gpu011
     - 1
     - 128 GB, 2400 MHz
     - 200 GB, 6 GB/s, HDD
     - 2x Intel Xeon E5-2650L v4, 14 cores, 1.7-2.5 GHz
     - yes
     - 28
     - 4 GB
     - hyperthreading,128go,rtx2080ti
     - 56 Gb/s
     - 4x Nvidia RTX2080Ti 12GB
     - 6
   * - gpu01[2-3]
     - 2
     - 256 GB, 3200 MHz
     - 3.6 GB, 12 GB/s, HDD
     - AMD EPYC 7543P, 32 cores, 2.8 GHz
     - yes
     - 56
     - 4 GB
     - hyperthreading,256go,a100
     - 100 Gb/s
     - 4x Nvidia A100 80GB
     - 14

.. [*] Maximum number of CPUs you can allocate with ``--cpus-per-task`` per
   allocated GPU card.

You will notice that some nodes have hyperthreading activated. On these nodes
you can allocate twice as many logical cores (threads) as there are physical
cores. For example, on node001 to node020 you can allocate a maximum of 64
logical cores.
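For instance, on one of the node0[01-20] nodes (which form the ``cpu_homogen``
partition described below) you can hand all 64 logical cores to a single
multithreaded task. The request below is only a sketch: ``my_threaded_app`` is
a placeholder for your own executable.

.. code-block:: console

   # Request one node0[01-20] node and all 64 of its logical cores for one task
   srun -p cpu_homogen --nodes=1 --ntasks=1 --cpus-per-task=64 ./my_threaded_app

Conversely, if you prefer to bind one task per physical core and ignore the
hyperthreads, you can add Slurm's ``--hint=nomultithread`` option to the
request.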
CLEPS Partitions
----------------

When you submit a job with ``srun`` or ``sbatch``, you submit it to a
partition (similar to a queue). Nodes in a partition share a common purpose or
configuration. To specify a partition when submitting a job, add the ``-p`` or
``--partition`` option, followed by the name of the partition.

.. code-block:: console

   # To submit your job into the cpu_homogen partition
   srun -N 2 -p cpu_homogen

.. code-block:: console

   $ cat .batch
   #!/bin/bash
   #SBATCH --partition=cpu_homogen
   ...

If none is specified, your job will run in the default partition ``cpu_devel``.

.. cssclass:: table-striped

+----------------+-------------+--------------+--------------------------+
| partition name | nodes       | max job      | purpose/configuration    |
|                |             | duration     |                          |
+================+=============+==============+==========================+
| cpu_devel      | node021-056 | 1 week       | Tests, compilations      |
|                |             |              | and small jobs           |
+----------------+-------------+--------------+--------------------------+
| cpu_homogen    | node001-020 | 1 week       | Homogeneous set of       |
|                |             |              | nodes. Suited for        |
|                |             |              | scaling studies of       |
|                |             |              | MPI jobs                 |
+----------------+-------------+--------------+--------------------------+
| gpu            | gpu001-009, | 2 days       | Nodes equipped with      |
|                | gpu01[1-3]  |              | GPUs                     |
+----------------+-------------+--------------+--------------------------+
| mem            | mem001      | 2 days       | Large memory node        |
+----------------+-------------+--------------+--------------------------+
| \*almanach     | gpu009      | 2 days       | GPU node, ALMANACH       |
|                |             |              | priority                 |
+----------------+-------------+--------------+--------------------------+
| \*willow       | gpu01[2-3]  | 2 days       | GPU nodes, WILLOW        |
|                |             |              | priority                 |
+----------------+-------------+--------------+--------------------------+
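You can also query this partition layout directly from the login node with
Slurm's standard commands, for example:

.. code-block:: console

   # One line per partition: name, time limit, node count and node list
   sinfo -o "%P %l %D %N"

   # Detailed configuration of a single partition
   scontrol show partition gpu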
.. warning::

   Projects can buy computing resources and include them in the CLEPS
   infrastructure. These resources benefit from the whole infrastructure and
   its mechanisms, such as scheduling. The project also gets priority access
   to its own resources, while leaving them accessible to users from other
   projects when they are not in use. This Slurm mechanism is known as
   **job preemption**.

   Such resources are therefore present in two different **partitions**: a
   generic one that makes them available to everyone, and a higher priority
   one, only available to the members of the project that funded the
   resources. Such higher priority partitions are marked with a `*` in the
   table above. The ``gpu`` partition is currently the only one concerned. Be
   aware that submitting to this partition could start your job on a
   `proprietary` resource, also included in either the `almanach` or `willow`
   partition. If you do not want to take the risk of being preempted by a
   higher priority job, you can explicitly exclude `proprietary` nodes from
   your allocation request with the ``--exclude`` option. Example:

   .. code-block:: console

      srun -p gpu --exclude=gpu009,gpu01[2-3] ...

   will exclude nodes gpu009, gpu012 and gpu013 from your allocation.

If you belong to a team that benefits from priority access to some hardware,
you have to specify both your **partition** AND your **account**, e.g. for the
members of the ALMANACH team:

.. code-block:: console

   srun -p almanach -A almanach [options]

Node features
~~~~~~~~~~~~~

In the table :ref:`comp_nodes`, you will notice a ``Features`` column. These
features make it possible to target nodes with certain characteristics within
a partition.

Example: you want to target nodes with AMD processors in the ``cpu_devel``
partition (the default partition):

.. code-block:: console

   srun --constraint=amd ...

See the `Slurm documentation `_ for more information.
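Features combine naturally with the partition and GPU options seen above. The
request below is only an illustrative sketch: the resource numbers follow the
rtx8000 row of the compute node table, and ``my_script.py`` is a placeholder.

.. code-block:: console

   # One RTX8000 node in the gpu partition: 1 GPU and at most 16 CPUs per GPU
   srun -p gpu --constraint=rtx8000 --gres=gpu:1 --cpus-per-task=16 python my_script.py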
CLEPS Storage
-------------

The :code:`/home` path
~~~~~~~~~~~~~~~~~~~~~~~

Your :code:`/home` path is the preferred place to compile your code and do
small development tasks. It is **backed up**, so it is also a good place to
store important data. It is accessible through the `$HOME` environment
variable.

* Capacity and quotas

This partition is a 9 TB XFS filesystem and your disk space quota is set to
100 GB. To check your disk usage:

.. code-block:: console

   cd
   du -sh .

The :code:`/scratch` path
~~~~~~~~~~~~~~~~~~~~~~~~~~

The scratch partition is a Lustre parallel filesystem and is therefore
designed to support **large-file** parallel IO. It is **not backed up**, so it
is not the right place to leave important data. It is accessible through the
`$SCRATCH` environment variable.

* Capacity and quotas

There are currently 500 TB available, and a project quota of 20 TB is applied
to each GID (= team/EPI/service). If you need more space, you can contact the
support directly via the `helpdesk `_.

To check your project quota status:

.. code-block:: console

   # First, get your project ID (your numeric group ID is given by `id -g`)
   grep <your_gid> /etc/lustre-projectid-gid | cut -c1

   # Then query the quota for that project ID
   lfs quota -h -p <project_id> /scratch

Lustre offers many tuning parameters to increase performance, even for small
files. Check the :ref:`file_striping` page to learn how to tune your
:code:`/scratch` tree.

Two special folders are available on the `/scratch` partition:

* `/scratch/_projets_/`, a folder shared by all the members of a project,
  accessible through the `$PROJECT` environment variable.
* `/scratch/_public_`, a read-only folder accessible to every user. It can be
  used to store large data shared by several projects. An explicit request
  must be made for the admins to write the data there. It is accessible
  through the `$PUBLIC` variable.

The :code:`/local` path
~~~~~~~~~~~~~~~~~~~~~~~

As the name suggests, this storage is local to each node. It can only be
accessed while a job is running, through the `$TMP_DIR` variable. You can see
how much space is available on each node in the :ref:`comp_nodes` section.

.. warning::

   This folder (:code:`/local`) is a **temporary** storage solution, available
   at the scale of a running job. Once your job is over, all your data there
   are erased.
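A common way to use this local space is to stage data in and out of
:code:`$TMP_DIR` inside a batch script. The script below is only a sketch: the
dataset path and the ``my_app`` executable are placeholders.

.. code-block:: bash

   #!/bin/bash
   #SBATCH --partition=cpu_devel
   #SBATCH --job-name=local-staging

   # Copy the input data from /scratch to the node-local disk
   cp -r "$SCRATCH/my_dataset" "$TMP_DIR/"

   # Run the computation on the local copy (my_app is a placeholder executable)
   cd "$TMP_DIR"
   "$HOME/my_app" my_dataset

   # Save the results back to a persistent filesystem before the job ends
   cp -r results "$SCRATCH/"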