Computing environment
Preinstalled software: Using Modules
CLEPS makes software available through an environment variable manager called module. By modifying environment variables, this tool gives the nodes access to specific versions of preinstalled libraries.
Example:
$ module load valgrind
$ env | grep -i valgrind
MANPATH=/opt/ohpc/pub/utils/valgrind/3.20.0/share/man
VALGRIND_DIR=/opt/ohpc/pub/utils/valgrind/3.20.0
LD_LIBRARY_PATH=/opt/ohpc/pub/utils/valgrind/3.20.0/lib/pkgconfig
VALGRIND_LIB=/opt/ohpc/pub/utils/valgrind/3.20.0/lib/valgrind
...
Module commands
Most useful commands are:
Command | Description
---|---
module avail | List currently available modules (see Compiler/MPI families)
module spider | List all possible modules
module whatis <module> | Print information about a module
module load <modules> | Load a set of modules
module list | List loaded modules
module unload <modules> | Unload a set of modules
module purge | Unload all loaded modules
module help | Print module help
Check Lmod user documentation for more commands.
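For example, a short session combining the commands above (reusing the valgrind module from the earlier example) could look like this:
$ module load valgrind      # load a module
$ module list               # show what is currently loaded
$ module unload valgrind    # unload a single module
$ module purge              # unload everything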
Compiler/MPI families
The module avail command doesn't print all the available modules; instead, it prints the modules whose dependencies are already loaded. When you connect to CLEPS, if you have no modules loaded by default, you should see something like this:
$ module avail
-------------------------------------- /opt/ohpc/pub/modulefiles --------------------------------------
autotools hwloc/2.9.3 matlab/2023b pmix/4.2.9
cmake/3.24.2 julia/1.10.4 matlab/2024a (D) prun/2.2
gnu12/12.2.0 libfabric/1.18.0 os ucx/1.15.0
gnu13/13.2.0 matlab/2023a papi/6.0.0 valgrind/3.20.0
...
Among these modules, you’ll notice compilers (gnu12, gnu13).
Compilers define what is called a family of modules, which means only one can
be loaded at a time. If you load one (gnu13 for instance) and retype the module avail command:
$ module load gnu13
$ module avail
----------------------------------- /opt/ohpc/pub/moduledeps/gnu13 ------------------------------------
R/4.2.1 likwid/5.3.0 openblas/0.3.21 pdtoolkit/3.25.1 scotch/6.0.6
gsl/2.7.1 metis/5.1.0 openmpi5/5.0.3 plasma/21.8.29 superlu/5.2.1
-------------------------------------- /opt/ohpc/pub/modulefiles --------------------------------------
autotools hwloc/2.9.3 matlab/2023b pmix/4.2.9
cmake/3.24.2 julia/1.10.4 matlab/2024a (D) prun/2.2
gnu12/12.2.0 libfabric/1.18.0 os ucx/1.15.0
gnu13/13.2.0 (L) matlab/2023a papi/6.0.0 valgrind/3.20.0
...
Modules compiled with the gnu13 compilers are now available.
MPI libraries also form a family of modules. If you load openmpi5 and check for available modules again:
$ module load openmpi5
$ module avail
------------------------------- /opt/ohpc/pub/moduledeps/gnu13-openmpi5 -------------------------------
boost/1.81.0 extrae/3.8.3 imb/2021.3 phdf5/1.14.0 scalasca/2.5 sionlib/1.7.7
dimemas/5.4.2 fftw/3.3.10 omb/7.3 scalapack/2.2.0 scorep/7.1 tau/2.31.1
----------------------------------- /opt/ohpc/pub/moduledeps/gnu13 ------------------------------------
R/4.2.1 likwid/5.3.0 openblas/0.3.21 pdtoolkit/3.25.1 scotch/6.0.6
gsl/2.7.1 metis/5.1.0 openmpi5/5.0.3 (L) plasma/21.8.29 superlu/5.2.1
-------------------------------------- /opt/ohpc/pub/modulefiles --------------------------------------
autotools hwloc/2.9.3 (L) matlab/2023b pmix/4.2.9
cmake/3.24.2 julia/1.10.4 matlab/2024a (D) prun/2.2
gnu12/12.2.0 libfabric/1.18.0 (L) os ucx/1.15.0 (L)
gnu13/13.2.0 (L) matlab/2023a papi/6.0.0 valgrind/3.20.0
...
Then modules compiled with the gnu13/openmpi5 combination are available.
You can play with the different combinations to see what modules are available. You'll notice that most of them are compiled with the gnu13 compiler suite. This is because OpenHPC, the framework used to configure CLEPS, offers many prepackaged modules for gnu13. For other compiler/MPI combinations, the admins have to compile the libraries manually.
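For example, you can swap one compiler family for another and list what becomes available (a sketch; which gnu12-based modules exist depends on what the admins have compiled):
$ module swap gnu13 gnu12   # replace the loaded gnu13 family with gnu12
$ module avail              # list the modules that depend on gnu12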
If the installed modules don't satisfy your requirements, you can submit a ticket to the helpdesk or open an issue on the GitLab repository of this documentation.
Debugging and profiling tools
The ScoreP/Scalasca profiling tools are available via module. Their usage can't be described in this documentation, but you can find information here.
Installing with Conda
Conda is a package manager that allows users to install software without necessarily having admin rights. It was primarily developed to make Python users' lives easier by providing many data science packages and an environment mechanism to install different versions of a package. Nowadays, it also hosts packages written in other programming languages.
Conda installation
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
Environment setup
Suppose you need an environment with Python 3.8:
$ conda create -n py38 python=3.8
You now have an environment called py38 available. To activate it, type:
$ conda activate py38
$ python --version
Python 3.8.5
Once the environment is activated, every conda install/remove <package> will only affect this environment.
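For instance, with py38 active, installing or removing a package only touches that environment (numpy here is just an illustrative package):
$ conda install numpy    # installed into py38 only
$ conda remove numpy     # removed from py38 only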
Tip
Conda may take a long time to resolve dependencies in large environments. Try as much as possible to keep your environments small. Periodically run conda clean --all to remove unnecessary packages and tarballs and free up space in /scratch.
When submitting a job with Slurm (see Launching jobs), don't forget to activate your conda environment. To do so, include these lines in your batch script:
source /home/$USER/.bashrc
conda activate <myenv>
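A minimal batch script could therefore look like the sketch below; the job name, resource requests, environment name and script name are placeholders to adapt:
#!/bin/bash
#SBATCH --job-name=conda_job       # placeholder job name
#SBATCH --ntasks=1                 # single task
#SBATCH --time=00:10:00            # short time limit for the example

# Make conda available in the batch shell, then activate your environment:
source /home/$USER/.bashrc
conda activate py38                # replace py38 with your environment name

python my_script.py                # placeholder for your actual program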
Mpi4Py users
Warning
The conda installation of mpi4py comes with mpich as a dependency. This behavior can create conflicts with another version of MPI you have loaded.
You can compile mpi4py with an MPI compiler provided by a module on CLEPS. To do so, first load the MPI module you want to use, then install mpi4py with pip. The pip installer will automatically find the available MPI compiler and use it to compile mpi4py. Example with the openmpi5 library:
module load openmpi5
pip install mpi4py
The drawback of this method is that you have to load the MPI module whenever you want to use mpi4py.
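As a quick sanity check (a sketch; the number of tasks is arbitrary and must fit your Slurm allocation), you can print the rank of each MPI process:
# Load the same MPI module that was used to build mpi4py:
module load openmpi5
# Run two MPI tasks, each printing its rank:
srun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"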
GPU users
Since there are many versions of cudatoolkit and GPU libraries, we decided not to provide cudatoolkit modules, but rather to let users install the versions they need by their own means (usually conda).
Tensorflow environment
Currently (August 2024), the recommended method to install TensorFlow is:
$ conda create -n tf_env
$ conda activate tf_env
$ conda install pip
$ pip install tensorflow[and-cuda]
Note
Warnings appear when importing the module in Python, but it still seems able to use the allocated GPU cards correctly.
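To confirm that TensorFlow actually sees the allocated GPU, you can run this quick check from a GPU node with the tf_env environment active (not part of the official instructions, just a sanity check):
$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"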
PyTorch environment
PyTorch can be installed by following similar instructions.
$ conda create -n pt_env
$ conda activate pt_env
$ conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia
If needed, check the official documentation.
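Likewise, a quick check that PyTorch detects the allocated GPU (run from a GPU node with pt_env active):
$ python -c "import torch; print(torch.cuda.is_available())"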
Apptainer (formerly Singularity)
Working with Docker/Apptainer containers
This part has been partially contributed by Jean Feydy.
Note
Apptainer was formerly known as Singularity. Most commands are the same, and environment variables prefixed with SINGULARITY_ have been replaced with APPTAINER_. You can find more info about the changes here.
Researchers often have very specific software requirements. Admins on academic clusters provide a wide collection of standard software packages through the "module" command (see the Cleps and Jean Zay documentation), but have a hard time keeping up with the latest CUDA/PyTorch version or supporting your favorite computational biology software.
Fortunately, all clusters support a convenient system to let you work autonomously: containers. Containers are “modern” virtual machines that come with negligible performance overheads. They are:
- Defined through a series of command line instructions known as a Dockerfile. These just describe what you would run on a brand new computer to install your software.
- Compiled as immutable system images (.sif files) that you can copy-paste between different computers. They behave like the .iso files that we use to install a new Linux distribution via a CD or USB stick.
- Used through the ``apptainer run/exec`` command, which sits on your system kernel and replaces your user environment with the one defined in the .sif file. Of course, you can still access your local files transparently, install pip/conda/CRAN packages and run interactive applications via Jupyter or Matplotlib.
This solution lets you:
- Pick freely the system configuration that you like, typically by downloading a standard image from DockerHub. You can also create your own system image by executing a set of command line instructions on your local machine, where you have admin rights, and then simply uploading your system image (.sif file) to the cluster.
- Work autonomously from sysadmins. Your system gets updated when you decide - not as a consequence of a cluster-wide decision that breaks all of your dependencies 3 days before an important deadline. This is also the simplest way of running e.g. a classic baseline that requires a deprecated version of CUDA.
- Use the exact same workflow on all computers. The method below is supported on Cleps and Jean Zay and is easy to set up on commercial clouds like AWS. Apptainer also works fine on your local machine: packages are available for most GNU/Linux distributions, and installation instructions are available for Windows and macOS.
Note
Docker is the industry standard software to manage containers. However, for security reasons, academic clusters worldwide prefer to deploy Apptainer. Fortunately, Apptainer is 100% compatible with Dockerfiles and DockerHub: for users, this just means that we must "compile" DockerHub images to .sif Singularity Image Files once, and use the apptainer run command instead of docker run.
0 – Find an image that you like
You have access to multiple images, such as Python3, PyTorch and many others, on the Cleps-images project.
# --docker-login will ask for your gitlab username and password
apptainer pull --docker-login docker://registry.gitlab.inria.fr/paris-cluster-2019/cleps/cleps-images/<image>
apptainer run <image>.sif
If you don't find what you are looking for there, other sources for containers include:
Sources | URLs
---|---
GitLab | docker://registry/image
Docker Hub | docker://user/image:tag
Singularity Hub | shub://user/image:tag
Library | library://user/collection/container[:tag]
OCI | oras://registry/namespace/image:tag
HTTP, HTTPS | http(s)://url/image:tag
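For instance, pulling a public image from Docker Hub follows the docker://user/image:tag form shown in the table (the python image and the output file name here are only an example):
# Download the image and convert it to a local .sif file:
apptainer pull python.sif docker://python:3.11-slim
# Run the image's default entry point:
apptainer run python.sif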
1 – Tutorials
In this tutorial, we will pick the python3-ocr image from the Container Registry of the CLEPS-images project and use it with a text recognition program.
Step 1 – Connect to a compute node
Use SSH to connect to the frontal node of your cluster (Cleps, Jean Zay, etc.). As usual:
- You can use a custom .ssh/config file and ssh-keygen + ssh-copy-id to set up a password-free connection.
- Using tmux to create a shell that will "survive" connection issues is also strongly advised.
- If you are outside of the Inria Paris center, you may need to use a VPN to access Cleps. To this end, simply run sudo openconnect -u <login> vpn.inria.fr in a new terminal and keep it open while you work on the cluster.
Then, you may request an interactive session with e.g.:
# Request a node with (any) 1 GPU, a few cores and memory:
salloc -c 8 --gres=gpu:1 -p gpu --mem=15g
# Or request a specific GPU with:
salloc -c 8 --gres=gpu:v100:1 -p gpu --mem=15g
Once resources are available, you will be logged in to a GPU-enabled computer named e.g. gpu001.
Step 2 – Pull the test project and the container
Place yourself in your /scratch/<login> directory and clone the example project:
git clone https://gitlab.inria.fr/paris-cluster-2019/cleps/cleps-user-examples/ocr-pytorch-example.git
cd ocr-pytorch-example/ocr.pytorch
# --docker-login will ask for your gitlab username and password
apptainer pull --docker-login docker://registry.gitlab.inria.fr/paris-cluster-2019/cleps/cleps-images/python3-ocr:1.0
Step 3 – Connect to the container
apptainer run --nv python3-ocr_1.0.sif
Step 4 – Run the demo
python demo.py
Step 5 – Submit jobs to the queue
- ``my-job.sbatch`` is executed by the host system. It contains your hardware requirements, and the command that launches the Apptainer container with your custom --bind options.
- ``my-job.sh`` is executed inside the Apptainer container. It contains the actual logic of your script.
First of all, close your Apptainer container (Ctrl+D), and get back to the frontal node of the cluster (Cleps or Jean Zay, the workflow is identical since both of them use the "Slurm" queuing system). Then, create a file ``my-job.sbatch`` with the following content (adapt the resource allocation to your needs):
#!/bin/bash
#SBATCH --job-name=it_works # create a short name for your job
#SBATCH --mail-type=ALL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=your.name@inria.fr # Where to send mail
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --cpus-per-gpu=8 # CPU cores per GPU (>1 if multi-threaded tasks)
#SBATCH --partition=gpu # partition name
#SBATCH --gres=gpu:rtx6000:1 # Number and type of GPU cards
#SBATCH --mem=40G # Total memory allocated
#SBATCH --time=03:00:00 # total run time limit (HH:MM:SS)
#SBATCH --output=logs/my_job_%j.out # output file name
#SBATCH --error=logs/my_job_%j.err # error file name
# Write interesting info in the log:
echo "### Running $SLURM_JOB_NAME ###"
# And make sure that commands are printed too:
set -x
# Go in the folder (presumably, /home/<login>/)
cd ${SLURM_SUBMIT_DIR}
# Run demo.py:
apptainer exec \
    -H ~/scratch/ocr-pytorch-example/ocr.pytorch/ \
    --bind ~/scratch/ocr-pytorch-example/ocr.pytorch/ \
    --nv \
    ~/scratch/ocr-pytorch-example/ocr.pytorch/python3-ocr_1.0.sif \
    python ~/scratch/ocr-pytorch-example/ocr.pytorch/demo.py
Likewise, create your script ``my-job.sh``:
#!/bin/bash
echo "It works!"
To submit your job, just run ``sbatch my-job.sbatch``. You may have to wait a little bit: use squeue -u $USER to know if your job is running or still in the queue. That's it :-)
In this second tutorial, we will pick the reference Docker image for the KeOps library:
- It is based on Ubuntu and contains a full working installation of miniconda, R, CUDA, PyTorch, KeOps and GeomLoss.
- It is available here on DockerHub and weighs around 10 Gb.
- The KeOps developers generated the image and pushed it on DockerHub by executing this script, which selects up-to-date versions for this Dockerfile. (So you don't have to.)
Step 1 – Build an Apptainer Image File
Doing this on your local machine may be the simplest option, since you have admin rights and may work around possible quotas. Otherwise, you may simply try to run the build command on your cluster.
In any case, to download an image from DockerHub and turn it into a portable .sif image file, run:
mkdir cache
mkdir tmp
mkdir tmp2
APPTAINER_TMPDIR=`pwd`/tmp APPTAINER_CACHEDIR=`pwd`/cache \
apptainer build --tmpdir `pwd`/tmp2 keops-full.sif \
docker://getkeops/keops-full:latest
Warning
This image is pretty heavy (~10 Gb), so it is safer to create cache folders on the hard drive instead of relying on the RAM-only tmpfs.
Note
Depending on your connection speed, this step may take 10 to 60 minutes. You may prefer to execute it on your local computer and then copy the resulting file keops-full.sif to the cluster. Alternatively, on the Jean Zay cluster, you may use the prepost partition to have access to both a large RAM and an internet connection.
If you built the image locally, upload it to the cluster with:
rsync -a -P keops-full.sif <login>@cleps.inria.fr:/home/<login>/scratch/
Warning
This command creates a large image file keops-full.sif, whose size may exceed the disk quota for your /home folder. Storing the image in the ``/scratch`` folder of your cluster is strongly advised. Please also note that for security reasons, Jean Zay asks users to follow a specific workflow.
Step 2 – Connect to a compute node
Use SSH to connect to the frontal node of your cluster (Cleps, Jean Zay, etc.). As usual:
- You can use a custom .ssh/config file and ssh-keygen + ssh-copy-id to set up a password-free connection.
- Using tmux to create a shell that will "survive" connection issues is also strongly advised.
- If you are outside of the Inria Paris center, you may need to use a VPN to access Cleps. To this end, simply run sudo openconnect -u <login> vpn.inria.fr in a new terminal and keep it open while you work on the cluster.
Then, you may request an interactive session with e.g.:
# Request a node with (any) 1 GPU, a few cores and memory:
salloc -c 8 --gres=gpu:1 -p gpu --mem=15g
# Or request a specific GPU with:
salloc -c 8 --gres=gpu:v100:1 -p gpu --mem=15g
Once resources are available, you will be logged in to a GPU-enabled computer named e.g. gpu001.
Step 3 – Connect to the container
Once you are on a compute node, you may connect to your custom environment with:
# Run a bash terminal in your container.
# The --nv option (for NVidia) mounts the GPU ;
# don't use it if you are on a CPU-only node.
apptainer exec --nv ~/scratch/keops-full.sif /bin/bash
That’s it! In this new terminal, you should have access to your custom software.
In practice, you probably want your container to access some files and folders in your host system. This can be achieved with the -H and --bind options:
mkdir ~/my-project-home-folder
# Use a custom folder as your "/home" within Apptainer,
# and forward your ".Xauthority" to store credentials for GUI applications:
apptainer exec \
    -H ~/my-project-home-folder/:/home \
    --bind ~/.Xauthority:/home/.Xauthority \
    --nv \
    ~/scratch/keops-full.sif /bin/bash
Step 4 – Run Jupyter notebooks interactively
On Cleps, you may easily access your container via a Web browser. To this end, in your remote Apptainer container, run:
# Install a JupyterLab server, within your container:
pip install jupyterlab
# Run the server, letting it know that the current machine is not
# called "localhost" (= the frontal node) but something like
# "gpu001" instead (= the compute node).
jupyter-lab --no-browser --ip $(hostname)
This should output a lot of information, with a connection link that reads like:
http://gpu001:8888/lab?token=...
Here, gpu001 is the name of your machine in the Cleps cluster, and 8888 is the id of the port that Jupyter intends to use for communications. Let's connect to this server!
On your local machine, in a new terminal, open an SSH tunnel with:
# Use the correct port and node name, something like:
ssh -N -L 8888:gpu001:8888 <login>@cleps.paris.inria.fr
This will redirect all of your local connections on port 8888 to the gpu001 node of Cleps, on port 8888.
Finally, open a web browser and copy-paste the connection link that was created by JupyterLab. Just replace gpu001 or equivalent with localhost to get an address that reads:
http://localhost:8888/lab?token=...
and you should be good to go!
Warning
Unfortunately, due to security concerns, Jean Zay requires a specific workflow to run Jupyter. I don’t know if it is compatible with containers.
Step 5 – Submit jobs to the queue
Jupyter notebooks are great for interactive development, but are not really suited to heavy workloads. To work asynchronously, use the sbatch command with two descriptive files:
- ``my-job.sbatch`` is executed by the host system. It contains your hardware requirements, and the command that launches the Singularity container with your custom --bind options.
- ``my-job.sh`` is executed inside the Singularity container. It contains the actual logic of your script.
First of all, close your Singularity container (Ctrl+D), and get back to the frontal node of the cluster (Cleps or Jean Zay, the workflow is identical since both of them use the “Slurm” queuing system).
Then, in your /home folder, create a file ``my-job.sbatch`` with the following content (adapt the resource allocation to your needs):
#!/bin/bash
#SBATCH --job-name=it_works # create a short name for your job
#SBATCH --mail-type=ALL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=your.name@inria.fr # Where to send mail
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --cpus-per-gpu=8 # CPU cores per GPU (>1 if multi-threaded tasks)
#SBATCH --partition=gpu # partition name
#SBATCH --gres=gpu:rtx6000:1 # Number and type of GPU cards
#SBATCH --mem=40G # Total memory allocated
#SBATCH --time=03:00:00 # total run time limit (HH:MM:SS)
#SBATCH --output=logs/my_job_%j.out # output file name
#SBATCH --error=logs/my_job_%j.err # error file name
# Write interesting info in the log:
echo "### Running $SLURM_JOB_NAME ###"
# And make sure that commands are printed too:
set -x
# Go in the folder (presumably, /home/<login>/)
cd ${SLURM_SUBMIT_DIR}
# Mount your custom home folder and my-job.sh in the container,
# don't forget the --nv option if you use a GPU,
# and run my-job.sh:
apptainer exec \
    -H ~/my-project-home-folder/:/home \
    --bind ~/my-job.sh:/home/my-job.sh \
    --nv \
    ~/scratch/keops-full.sif \
    /home/my-job.sh
Likewise, create your script ``my-job.sh``:
#!/bin/bash
echo "It works!"
To submit your job, just run ``sbatch my-job.sbatch``. You may have to wait a little bit: use squeue -u $USER to know if your job is running or still in the queue. That's it :-)
Warning
The partition names and hardware requirements depend on your cluster. Read the documentation of the cluster or contact your project manager to know what options are best for you. As an example, here is a typical set of scripts for Jean Zay.
Data management
Transfer files from your machine
If your files are located on your personal computer, you can transfer them onto CLEPS. On your machine, type the following command:
scp -r code_directory <login>@cleps.inria.fr:<destination/path>
In the same way, you can copy your data from the cluster to your local machine. On your machine, type:
scp -r <login>@cleps.inria.fr:<path_to_your_data> data_directory
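For larger transfers, rsync (already used above to upload the Apptainer image) shows progress and can resume interrupted copies; the placeholders are the same as in the scp examples:
rsync -a -P code_directory <login>@cleps.inria.fr:<destination/path>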