Computing environment

Preinstalled software: Using Modules

CLEPS makes software available through an environment variable manager called module (provided by Lmod). By modifying environment variables, this tool lets the nodes access specific versions of preinstalled libraries.

Example:

$ module load valgrind
$ env | grep -i valgrind
MANPATH=/opt/ohpc/pub/utils/valgrind/3.20.0/share/man
VALGRIND_DIR=/opt/ohpc/pub/utils/valgrind/3.20.0
LD_LIBRARY_PATH=/opt/ohpc/pub/utils/valgrind/3.20.0/lib/pkgconfig
VALGRIND_LIB=/opt/ohpc/pub/utils/valgrind/3.20.0/lib/valgrind
...

Module commands

Most useful commands are:

Command                          Description
module avail                     List currently available modules (see Compiler/MPI families)
module spider                    List all possible modules
module spider name1              Print information about a module
module load name1 name2 ...      Load a set of modules
module list                      List loaded modules
module unload name1 name2 ...    Unload a set of modules
module purge                     Unload all loaded modules
module help                      Print module help

Check the Lmod user documentation for more commands.
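For instance, a typical session might chain these commands (gnu13 and openmpi5 are examples taken from the listings in the next section):

$ module spider openmpi5       # see which versions exist and what they require
$ module load gnu13 openmpi5   # load a compiler and a matching MPI library
$ module list                  # check what is currently loaded
$ module purge                 # start again from a clean environment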

Compiler/MPI families

The module avail command doesn’t print all the available modules; instead, it prints the modules whose dependencies are already loaded. When you connect to CLEPS, if no module is loaded by default, you should see something like this:

$ module avail

-------------------------------------- /opt/ohpc/pub/modulefiles --------------------------------------
autotools       hwloc/2.9.3         matlab/2023b        pmix/4.2.9
cmake/3.24.2    julia/1.10.4        matlab/2024a (D)    prun/2.2
gnu12/12.2.0    libfabric/1.18.0    os                  ucx/1.15.0
gnu13/13.2.0    matlab/2023a        papi/6.0.0          valgrind/3.20.0

...

Among these modules, you’ll notice compilers (gnu12, gnu13). Compilers define what is called a family of modules, which means only one can be loaded at a time. If you load one (gnu13 for instance) and retype the module avail command:

$ module load gnu13
$ module avail

----------------------------------- /opt/ohpc/pub/moduledeps/gnu13 ------------------------------------
   R/4.2.1      likwid/5.3.0    openblas/0.3.21    pdtoolkit/3.25.1    scotch/6.0.6
   gsl/2.7.1    metis/5.1.0     openmpi5/5.0.3     plasma/21.8.29      superlu/5.2.1

-------------------------------------- /opt/ohpc/pub/modulefiles --------------------------------------
   autotools           hwloc/2.9.3         matlab/2023b        pmix/4.2.9
   cmake/3.24.2        julia/1.10.4        matlab/2024a (D)    prun/2.2
   gnu12/12.2.0        libfabric/1.18.0    os                  ucx/1.15.0
   gnu13/13.2.0 (L)    matlab/2023a        papi/6.0.0          valgrind/3.20.0


...

Modules compiled with the gnu13 compilers are now available.

MPI libraries also form a family of modules. If you load openmpi5 and check for available modules again:

 $ module load openmpi5
 $ module avail

 ------------------------------- /opt/ohpc/pub/moduledeps/gnu13-openmpi5 -------------------------------
    boost/1.81.0     extrae/3.8.3    imb/2021.3    phdf5/1.14.0       scalasca/2.5    sionlib/1.7.7
    dimemas/5.4.2    fftw/3.3.10     omb/7.3       scalapack/2.2.0    scorep/7.1      tau/2.31.1

 ----------------------------------- /opt/ohpc/pub/moduledeps/gnu13 ------------------------------------
    R/4.2.1      likwid/5.3.0    openblas/0.3.21        pdtoolkit/3.25.1    scotch/6.0.6
    gsl/2.7.1    metis/5.1.0     openmpi5/5.0.3  (L)    plasma/21.8.29      superlu/5.2.1

 -------------------------------------- /opt/ohpc/pub/modulefiles --------------------------------------
    autotools           hwloc/2.9.3      (L)    matlab/2023b        pmix/4.2.9
    cmake/3.24.2        julia/1.10.4            matlab/2024a (D)    prun/2.2
    gnu12/12.2.0        libfabric/1.18.0 (L)    os                  ucx/1.15.0      (L)
    gnu13/13.2.0 (L)    matlab/2023a            papi/6.0.0          valgrind/3.20.0

...

Modules compiled with the gnu13/openmpi5 pair are now available.

You can play with the different combinations to see which modules are available. You’ll notice that most of them are compiled with the gnu13 compiler suite. This is because OpenHPC, the framework used to configure CLEPS, offers many prepackaged modules built with gnu13. For other compiler/MPI combinations, the admins have to compile the libraries manually.
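For example, once a compiler/MPI pair is loaded, the corresponding wrappers can be used to build an MPI program (hello_mpi.c is a hypothetical source file):

$ module load gnu13 openmpi5
$ mpicc -O2 -o hello_mpi hello_mpi.c   # mpicc wraps the gnu13 C compiler
$ mpirun -n 4 ./hello_mpi              # small test run, preferably on a compute node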

If the installed modules don’t satisfy your requirements, you can submit a ticket to the helpdesk or open an issue on the GitLab repository of this documentation.

Debugging and profiling tools

The ScoreP/Scalasca profiling tools are available via module. Their usage is not described in this documentation, but you can find information here.

Installing with Conda

Conda is a package manager that allows users to install software (without necessarily having admin rights). It was primarily developed to make Python users’ lives easier by providing many data science packages and an environment mechanism to install different versions of a package. Nowadays, it also hosts packages written in other programming languages.

Conda installation

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

Environment setup

Suppose you need an environment with Python 3.8:

$ conda create -n py38 python=3.8

You now have an environment called py38 available. To activate it, type:

$ conda activate py38
$ python --version
Python 3.8.5

Once the environment is activated, every conda install/remove <package> will only affect this environment.
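For example (numpy is only an illustration):

(py38) $ conda install numpy   # installed into py38 only
(py38) $ conda list numpy      # shows the version present in py38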

Tip

  • Conda may take a long time to resolve dependencies in large environments. Try as much as possible to keep your environments small.

  • Periodically run conda clean --all to remove unnecessary packages and tarballs and free up space in /scratch.

When submitting a job with Slurm (see Launching jobs), don’t forget to activate your conda environment. To do so, include these lines into your batch script:

source /home/$USER/.bashrc
conda activate <myenv>
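A minimal batch script could look like this (the resources and the script name main.py are placeholders to adapt):

#!/bin/bash
#SBATCH --job-name=py38_job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=01:00:00

# Make conda available in the batch shell, then activate the environment:
source /home/$USER/.bashrc
conda activate py38

python main.py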

Mpi4Py users

Warning

The conda installation of mpi4py comes with mpich as a dependency. This behavior can create conflicts with another version of MPI you have loaded.

You can compile mpi4py with an MPI compiler provided by a module on CLEPS. To do so, first load the MPI module you want to use, then install mpi4py with pip. The pip installer will automatically find the available MPI compiler and use it to compile mpi4py. Example with the openmpi5 library:

module load openmpi5
pip install mpi4py

The drawback of this method is that you have to load the MPI module whenever you want to use mpi4py.
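As a quick sanity check (a sketch; run it inside a Slurm allocation with at least two tasks):

module load openmpi5
srun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"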

GPU users

Since CUDA toolkit/GPU library versions are numerous, we decided not to provide cudatoolkit modules, but rather to let users install the versions they require by their own means (usually conda).

Tensorflow environment

Currently (August 2024), the recommended method to install TensorFlow is:

$ conda create -n tf_env
$ conda activate tf_env
$ conda install pip
$ pip install tensorflow[and-cuda]
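To verify that the GPU is detected, you can run the following on a GPU node (a simple check, not an official test):

$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"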

Note

Warnings appear when importing the module in Python, but TensorFlow still seems able to use the allocated GPU cards correctly.

PyTorch environment

PyTorch can be installed by following similar instructions.

$ conda create -n pt_env
$ conda activate pt_env
$ conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia

If needed, check the official documentation.
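Similarly, a quick check that CUDA is usable from this environment (run it on a GPU node):

$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"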

Apptainer (formerly Singularity)

Working with Docker/Apptainer containers

This part has been partially contributed by Jean Feydy.

Note

Apptainer was formerly known as Singularity. Most commands are the same, and environment variables prefixed with SINGULARITY_ have been replaced with APPTAINER_. You can find more info about the changes here.
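For instance, a cache directory that used to be configured with SINGULARITY_CACHEDIR is now set as follows (the path is only a suggestion):

export APPTAINER_CACHEDIR=/scratch/$USER/apptainer_cache   # formerly SINGULARITY_CACHEDIR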

Researchers often have very specific software requirements. Admins on academic clusters provide a wide collection of standard software packages through the “module” command (Cleps and Jean Zay documentations), but have a hard time keeping up with the latest CUDA/PyTorch version or supporting your favorite computational biology software.

Fortunately, all clusters support a convenient system to let you work autonomously: containers. Containers are “modern” virtual machines that come with negligible performance overheads. They are:

  • Defined through a series of command line instructions known as a Dockerfile. These just describe what you would run on a brand new computer to install your software.

  • Compiled as immutable system images (.sif files), that you can copy-paste between different computers. They behave like the .iso files that we use to install a new Linux distribution via a CD or USB stick.

  • Used through the apptainer run/exec commands: the container sits on your system kernel and replaces your user environment with the one defined in the .sif file. Of course, you can still access your local files transparently, install pip/conda/CRAN packages and run interactive applications via Jupyter or Matplotlib.

This solution lets you:

  • Pick freely the system configuration that you like, typically by downloading a standard image from DockerHub. You can also create your own system image by executing a set of command line instructions on your local machine, where you have admin rights, and then simply uploading your system image .sif file on the cluster.

  • Work autonomously from sysadmins. Your system gets updated when you decide - not as a consequence of a cluster-wide decision that breaks all of your dependencies 3 days before an important deadline. This is also the simplest way of running e.g. a classic baseline that requires a deprecated version of CUDA.

  • Use the exact same workflow on all computers. The method below is supported on Cleps and Jean Zay and is easy to set up on commercial clouds like AWS. Apptainer also works fine on your local machine: “packages” are available for most GNU/Linux distributions, and installation instructions are available for Windows and MacOS.

Note

Docker is the industry standard software to manage containers. However, for security reasons, academic clusters worldwide prefer to deploy Apptainer. Fortunately, Apptainer is 100% compatible with Dockerfiles and DockerHub: for users, this just means that we must “compile” DockerHub images to .sif Singularity Image Files once, and use the apptainer run command instead of docker run.

0 – Find an image that you like

You have access to multiple images, such as Python3, PyTorch and many others, on the Cleps-images project.

# --docker-login will ask for your gitlab username and password
apptainer pull --docker-login docker://registry.gitlab.inria.fr/paris-cluster-2019/cleps/cleps-images/<image>

apptainer run <image>.sif

If you don’t find what you are looking for there, there are other sources for containers, such as:

Source              URL format
Gitlab              docker://registry/image
Docker Hub          docker://user/image:tag
Singularity Hub     shub://user/image:tag
Library             library://user/collection/container[:tag]
OCI                 oras://registry/namespace/image:tag
Http, Https         http(s)://url/image:tag

1 – Tutorials

In this first tutorial, we will pick the python3-ocr image from the Container Registry of the CLEPS-images project and use it with a text recognition program.

Step 1 – Connect to a compute node

Use SSH to connect to the frontal node of your cluster (Cleps, Jean Zay, etc.). As usual:

  • You can use a custom .ssh/config file and ssh-keygen + ssh-copy-id to set up a password-free connection (see the example below).

  • Using tmux to create a shell that will “survive” connection issues is also strongly advised.

  • If you are outside of the Inria Paris center, you may need to use a VPN to access Cleps. To this end, simply run sudo openconnect -u <login> vpn.inria.fr in a new terminal and keep it open while you work on the cluster.
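For example, a minimal ~/.ssh/config entry on your local machine (the alias cleps is arbitrary):

# ~/.ssh/config
Host cleps
    HostName cleps.inria.fr
    User <login>

After running ssh-copy-id cleps once, ssh cleps then connects without a password.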

Then, you may request an interactive session with e.g.:

# Request a node with (any) 1 GPU, a few cores and memory:
salloc -c 8 --gres=gpu:1 -p gpu --mem=15g
# Or request a specific GPU with:
salloc -c 8 --gres=gpu:v100:1 -p gpu --mem=15g

Once resources are available, you will be logged in to a GPU-enabled computer named e.g. gpu001.
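Once on the node, you can check that the allocated GPU is indeed visible with the standard NVIDIA tool:

nvidia-smi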

Step 2 – Pull the test project and the container

Place yourself in your /scratch/<login> directory and clone the example project:

git clone https://gitlab.inria.fr/paris-cluster-2019/cleps/cleps-user-examples/ocr-pytorch-example.git

cd ocr-pytorch-example/ocr.pytorch

# --docker-login will ask for your gitlab username and password
apptainer pull --docker-login docker://registry.gitlab.inria.fr/paris-cluster-2019/cleps/cleps-images/python3-ocr:1.0

Step 3 – Connect to the container

apptainer run --nv python3-ocr_1.0.sif

Step 4 – Run the demo

python demo.py

Step 5 – Submit jobs to the queue

To work asynchronously, use the sbatch command with two descriptive files:

  • my-job.sbatch is executed by the host system. It contains your hardware requirements, and the command that launches the Apptainer container with your custom --bind options.

  • my-job.sh is executed inside the Singularity container and contains the actual logic of your script. (In this example, demo.py plays that role and is called directly from my-job.sbatch.)

First of all, close your Singularity container (Ctrl+D), and get back to the frontal node of the cluster (Cleps or Jean Zay, the workflow is identical since both of them use the “Slurm” queuing system). Then create a file my-job.sbatch with the following content:

#!/bin/bash

#SBATCH --job-name=it_works            # create a short name for your job
#SBATCH --mail-type=ALL                # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=your.name@inria.fr # Where to send mail
#SBATCH --nodes=1                      # node count
#SBATCH --ntasks=1                     # total number of tasks across all nodes
#SBATCH --cpus-per-gpu=8               # threads per task (>1 if multi-threaded tasks)
#SBATCH --partition=gpu                # partition name
#SBATCH --gres=gpu:rtx6000:1           # Number and type of GPU cards
#SBATCH --mem=40G                      # Total memory allocated
#SBATCH --time=03:00:00                # total run time limit (HH:MM:SS)
#SBATCH --output=logs/my_job_%j.out    # output file name
#SBATCH --error=logs/my_job_%j.err     # error file name


# Write interesting info in the log:
echo "### Running $SLURM_JOB_NAME ###"

# And make sure that commands are printed too:
set -x

# Go in the folder (presumably, /home/<login>/)
cd ${SLURM_SUBMIT_DIR}

# Run demo.py:
apptainer exec \
-H ~/scratch/ocr-pytorch-example/ocr.pytorch/ \
--bind ~/scratch/ocr-pytorch-example/ocr.pytorch/ \
--nv \
~/scratch/ocr-pytorch-example/ocr.pytorch/python3-ocr_1.0.sif \
python ~/scratch/ocr-pytorch-example/ocr.pytorch/demo.py

To submit your job, just run sbatch my-job.sbatch. You may have to wait a little bit: use squeue -u $USER to know if your job is running or still in the queue. That’s it :-)

In this second tutorial, we will pick the reference Docker image for the KeOps library:

  • It is based on Ubuntu and contains a full working installation of miniconda, R, CUDA, PyTorch, KeOps and GeomLoss.

  • It is available here on DockerHub and weighs around 10 GB.

  • The KeOps developers generated the image and pushed it on DockerHub by executing this script, which selects up-to-date versions for this Dockerfile. (So you don’t have to.)

Step 1 – Build an Apptainer Image File

Doing this on your local machine may be the simplest option, since you have admin rights and may work around possible quotas. Otherwise, you may simply try to run the build command on your cluster.

In any case, to download an image from DockerHub and turn it into a portable .sif image file, run:

mkdir cache
mkdir tmp
mkdir tmp2
APPTAINER_TMPDIR=`pwd`/tmp APPTAINER_CACHEDIR=`pwd`/cache \
apptainer build --tmpdir `pwd`/tmp2 keops-full.sif \
docker://getkeops/keops-full:latest

Warning

This image is pretty heavy (~10 GB), so it is safer to create cache folders on the hard drive instead of relying on the RAM-only tmpfs.

Note

Depending on your connection speed, this step may take 10 to 60 minutes. You may prefer to execute it on your local computer and then copy the resulting file keops-full.sif to the cluster. Alternatively, on the Jean Zay cluster, you may use the prepost partition to have access to both a large RAM and an internet connection.

If you built the image locally, upload it to the cluster with:

rsync -a -P keops-full.sif <login>@cleps.inria.fr:/home/<login>/scratch/

Warning

This command creates a large image file keops-full.sif, whose size may exceed the disk quota for your /home folder. Storing the image in the /scratch folder of your cluster is strongly advised. Please also note that for security reasons, Jean Zay asks users to follow a specific workflow.

Step 2 – Connect to a compute node

Use SSH to connect to the frontal node of your cluster (Cleps, Jean Zay, etc.). As usual:

  • You can use a custom .ssh/config file and ssh-keygen + ssh-copy-id to set up a password-free connection.

  • Using tmux to create a shell that will “survive” connection issues is also strongly advised.

  • If you are outside of the Inria Paris center, you may need to use a VPN to access Cleps. To this end, simply run sudo openconnect -u <login> vpn.inria.fr in a new terminal and keep it open while you work on the cluster.

Then, you may request an interactive session with e.g.:

# Request a node with (any) 1 GPU, a few cores and memory:
salloc -c 8 --gres=gpu:1 -p gpu --mem=15g
# Or request a specific GPU with:
salloc -c 8 --gres=gpu:v100:1 -p gpu --mem=15g

Once resources are available, you will be logged in to a GPU-enabled computer named e.g. gpu001.

Step 3 – Connect to the container

Once you are on a compute node, you may connect to your custom environment with:

# Run a bash terminal in your container.
# The --nv option (for NVidia) mounts the GPU;
# don't use it if you are on a CPU-only node.
apptainer exec --nv ~/scratch/keops-full.sif /bin/bash

That’s it! In this new terminal, you should have access to your custom software.

In practice, you probably want your container to access some files and folders in your host system. This can be achieved with the -H and --bind options:

mkdir ~/my-project-home-folder

# Use a custom folder as your "/home" within Apptainer,
# and forward your ".Xauthority" to store credentials for GUI applications:
apptainer exec \
-H ~/my-project-home-folder/:/home \
--bind ~/.Xauthority:/home/.Xauthority \
--nv \
~/scratch/keops-full.sif /bin/bash

Step 4 – Run Jupyter notebooks interactively

On Cleps, you may easily access your container via a Web browser. To this end, in your remote Apptainer container, run:

# Install a JupyterLab server, within your container:
pip install jupyterlab

# Run the server, letting it know that the current machine is not
# called "localhost" (= the frontal node) but something like
# "gpu001" instead (= the compute node).
jupyter-lab --no-browser --ip $(hostname)

This should output a lot of information, with a connection link that reads like:

http://gpu001:8888/lab?token=...

Here, gpu001 is the name of your machine in the Cleps cluster, and 8888 is the port that Jupyter intends to use for communications. Let’s connect to this server!

On your local machine, in a new terminal, open an SSH tunnel with:

# Use the correct port and node name, something like:
ssh -N -L 8888:gpu001:8888 <login>@cleps.paris.inria.fr

This will redirect all of your local connections on port 8888 to the gpu001 node of Cleps, on port 8888.

Finally, open a web browser and copy-paste the connection link that was created by JupyterLab. Just replace gpu001 or equivalent with localhost to get an address that reads:

http://localhost:8888/lab?token=...

and you should be good to go!

Warning

Unfortunately, due to security concerns, Jean Zay requires a specific workflow to run Jupyter. I don’t know if it is compatible with containers.

Step 5 – Submit jobs to the queue

Jupyter notebooks are great for interactive development, but are not really suited to heavy workloads. To work asynchronously, use the sbatch command with two descriptive files:

  • my-job.sbatch is executed by the host system. It contains your hardware requirements, and the command that launches the Singularity container with your custom --bind options.

  • my-job.sh is executed inside the Singularity container. It contains the actual logic of your script.

First of all, close your Singularity container (Ctrl+D), and get back to the frontal node of the cluster (Cleps or Jean Zay, the workflow is identical since both of them use the “Slurm” queuing system).

Then, in your /home folder, create a file my-job.sbatch with the following content (adapt the resource allocation to your needs):

#!/bin/bash

#SBATCH --job-name=it_works            # create a short name for your job
#SBATCH --mail-type=ALL                # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=your.name@inria.fr # Where to send mail
#SBATCH --nodes=1                      # node count
#SBATCH --ntasks=1                     # total number of tasks across all nodes
#SBATCH --cpus-per-gpu=8               # threads per task (>1 if multi-threaded tasks)
#SBATCH --partition=gpu                # partition name
#SBATCH --gres=gpu:rtx6000:1           # Number and type of GPU cards
#SBATCH --mem=40G                      # Total memory allocated
#SBATCH --time=03:00:00                # total run time limit (HH:MM:SS)
#SBATCH --output=logs/my_job_%j.out    # output file name
#SBATCH --error=logs/my_job_%j.err     # error file name

# Write interesting info in the log:
echo "### Running $SLURM_JOB_NAME ###"

# And make sure that commands are printed too:
set -x

# Go in the folder (presumably, /home/<login>/)
cd ${SLURM_SUBMIT_DIR}

# Mount your custom home folder and my-job.sh in the container,
# don't forget the --nv option if you use a GPU,
# and run my-job.sh:
apptainer exec \
-H ~/my-project-home-folder/:/home \
--bind ~/my-job.sh:/home/my-job.sh \
--nv \
~/scratch/keops-full.sif \
/home/my-job.sh

Likewise, create your script my-job.sh:

#!/bin/bash

echo "It works!"

To submit your job, just run sbatch my-job.sbatch. You may have to wait a little bit: use squeue -u $USER to know if your job is running or still in the queue. That’s it :-)

Warning

The partition names and hardware requirements depend on your cluster. Read the documentation of the cluster or contact your project manager to know what options are best for you. As an example, here is a typical set of scripts for Jean Zay.

Data management

Transfer files from your machine

If your files are located on your personal computer, you can transfer them onto CLEPS. On your machine, type the following command:

scp -r code_directory <login>@cleps.inria.fr:<destination/path>

In the same way, you can copy your data from the cluster to your local machine. On your machine, type:

scp -r <login>@cleps.inria.fr:<path_to_your_data> data_directory
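For larger transfers, rsync (already used above for the container image) can resume interrupted copies and display progress; for example:

rsync -a -P code_directory <login>@cleps.inria.fr:<destination/path>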