LiHPC | Laboratory for High Performance Computing and Simulation

Adrien Roussel has been a Research Scientist in Computer Science (HPC) at CEA since 2019. He obtained his PhD in Computer Science in 2018, on the parallelization of iterative linear solvers with a task-based programming model for many-core architectures. His research then turned to dynamic scheduling techniques in distributed applications during a one-year postdoc at Fraunhofer-ITWM in Germany.

He leads research and development on the OpenMP standard and its interaction with MPI. His research aims to anticipate how to program and exploit current and future supercomputers efficiently, covering asynchrony, heterogeneity, etc. Since February 2023, Adrien Roussel has been in charge of the MPC team.

Adrien Roussel has supervised 2 PhD theses (+3 in progress) and several interns. He is co-author of several research papers in international conferences.

Main research topics in HPC

Task-based programming with dependencies

The task-based parallel programming model with dependencies is an interesting approach for harnessing massively parallel and heterogeneous supercomputers but is complex to implement in simulation codes.

Research has been conducted and is still ongoing on this topic, including the publication of several research articles and the completion of a thesis (+2 in progress).
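
As a rough illustration of the model described above, here is a minimal Python sketch of how a runtime can derive a dependency graph from the data each task reads and writes, then run tasks as they become ready. The task tuples and the helper names (`build_graph`, `execute`) are hypothetical and chosen for this example only; real runtimes such as OpenMP implementations do this concurrently and with far more machinery.

```python
from collections import deque

def build_graph(tasks):
    """Derive predecessor sets from declared accesses, in submission order.

    Each task is (name, ins, outs), where ins/outs list the data it reads/writes.
    This is an illustrative model, not any runtime's actual data structure.
    """
    last_writer = {}   # data name -> index of the last task that wrote it
    readers = {}       # data name -> tasks that read it since the last write
    preds = [set() for _ in tasks]
    for i, (_name, ins, outs) in enumerate(tasks):
        for d in ins:                                # read-after-write
            if d in last_writer:
                preds[i].add(last_writer[d])
        for d in outs:
            if d in last_writer:                     # write-after-write
                preds[i].add(last_writer[d])
            preds[i].update(readers.get(d, ()))      # write-after-read
        for d in ins:
            readers.setdefault(d, []).append(i)
        for d in outs:
            last_writer[d] = i
            readers[d] = []
        preds[i].discard(i)                          # a task never waits on itself
    return preds

def execute(tasks):
    """Return one valid execution order: a task runs once all predecessors ran."""
    preds = build_graph(tasks)
    succs = [[] for _ in tasks]
    pending = [len(p) for p in preds]
    for i, p in enumerate(preds):
        for j in p:
            succs[j].append(i)
    ready = deque(i for i, n in enumerate(pending) if n == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(tasks[i][0])        # "run" the task
        for s in succs[i]:               # release successors
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)
    return order

tasks = [
    ("init_a", [], ["a"]),
    ("init_b", [], ["b"]),
    ("sum", ["a", "b"], ["c"]),
    ("scale_a", ["a"], ["a"]),
    ("print_c", ["c"], []),
]
print(execute(tasks))
```

In this sketch `init_a` and `init_b` have no predecessors and could run in parallel, while `scale_a` must wait for `sum` because it overwrites data that `sum` reads.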

Programming compute accelerators

Heterogeneous programming with compute accelerators such as GPUs is complex on multiple levels: data locality, overlapping through asynchronous exploitation, interoperability with other programming models, coupling with the task-based programming model, etc. Furthermore, these machines continue to evolve, and thus the way they are programmed must also evolve accordingly.

Several projects are underway on this topic, including 3 theses.

Memory heterogeneity

To meet the memory requirements of applications, various types of memory can be present within a compute node: high-bandwidth memory (HBM), persistent memory (PMEM), etc.

Research on this topic was carried out within the European project DEEP-SEA.

To Share or Not to Share: a case for MPI in Shared-Memory
Julien Adam Jean-Baptiste Besnard Adrien Roussel Julien Jaeger Romain Pereira Patrick Carribault Marc Pérache
European MPI Users' Group Meeting, 2024

Abstract

The evolution of parallel computing architectures presents new challenges for developing efficient parallelized codes. The emergence of heterogeneous systems has given rise to multiple programming models, each requiring careful adaptation to maximize performance. In this context, we propose reevaluating memory layout designs for computational tasks within larger nodes by comparing various architectures. To gain insight into the performance discrepancies between shared memory and shared-address space settings, we systematically measure the bandwidth between cores and sockets using different methodologies. Our findings reveal significant differences in performance, suggesting that MPI running inside UNIX processes may not fully utilize its intranode bandwidth potential. In light of our work in the MPC thread-based MPI runtime, which can leverage shared memory to achieve higher performance due to its optimized layout, we advocate for enabling the use of shared memory within the MPI standard.

Measuring and Interpreting Dependent Task-based Applications Performances
Romain Pereira Thierry Gautier Adrien Roussel Patrick Carribault
15th International Conference on Parallel Processing & Applied Mathematics, 2024

Abstract

Breaking down the parallel time into work, idleness, and overheads is crucial for assessing the performance of HPC applications, but difficult to measure in asynchronous dependent tasking runtime systems. No existing tools allow its measurement portably and accurately. This paper introduces POT: a tool-suite for dependent task-based applications performance measurement. We focus on its low-disturbance methodology consisting of task modeling, discrete-event tracing, and post-mortem simulation-based analysis. It supports the OMPT standard OpenMP specifications. The paper evaluates the precision of POT's parallel time breakdown analysis on LLVM and MPC implementations and shows that measurement bias may be neglected above 16µs workload per task, portably across two architectures and OpenMP runtime systems.

An Overview on Mixing MPI and OpenMP Dependent Tasking on A64FX
Romain Pereira Adrien Roussel Miwako Tsuji Patrick Carribault Sato Mitsuhisa Hitoshi Murai Thierry Gautier
HPCAsia 2024 Workshops Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops, 2024

Abstract

The adoption of ARM processor architectures is on the rise in the HPC ecosystem. The Fugaku supercomputer is a homogeneous ARM-based machine, and is one of the most powerful machines in the world. In the programming world, dependent task-based programming models are gaining traction due to their many advantages, such as dynamic load balancing, implicit expression of communication/computation overlap, and early-bird communication posting. MPI and OpenMP are two widespread programming standards that make task-based programming possible at a distributed memory level. Despite its many advantages, mixed use of the standard programming models using dependent tasks is still under-evaluated on large-scale machines. In this paper, we provide an overview on mixing the OpenMP dependent tasking model with MPI with the state-of-the-art software stack (GCC-13, Clang17, MPC-OMP). We provide the level of performance to expect by porting applications to such mixed use of the standards on the Fugaku supercomputer, using two benchmarks (Cholesky, HPCCG) and a proxy-application (LULESH). We show that software stack, resource binding and communication progression mechanisms are factors that have a significant impact on performance. On distributed applications, performance reaches up to 80% efficiency for task-based applications like HPCCG. We also point out a few areas of improvement in OpenMP runtimes.

Enhancing productivity on heterogeneous supercomputers with task-based programming model
Adrien Roussel Mickael Boichot Romain Pereira Manuel Ferat
SIAM CSE 2023 - SIAM Conference on Computational Science and Engineering, 2023

Abstract

Heterogeneous supercomputers with GPUs are one of the best candidates to build Exascale machines. However, porting scientific applications with millions of lines of code is challenging. Data transfers/locality and exposing enough parallelism determine the maximum achievable performance on such systems. Porting efforts thus require developers to rewrite parts of the application, which is tedious and time-consuming and does not guarantee performance in all cases. Being able to detect which parts can be expected to deliver performance gains on GPUs is therefore a major asset for developers. Moreover, the task parallel programming model is a promising alternative to expose enough parallelism while allowing asynchronous execution between CPU and GPU. OpenMP 4.5 introduces the "target" directive to offload computation to the GPU in a portable way. Target constructs are considered explicit OpenMP tasks, in the same way as on the CPU, but executed on the GPU. In this work, we propose a methodology to detect the most profitable loops of an application that can be ported to the GPU. While we have applied the detection part to several mini applications (LULESH, miniFE, XSBench and Quicksilver), we evaluated the full methodology on LULESH through the MPI+OpenMP task programming model with target directives. It relies on runtime modifications to enable overlapping of data transfers and kernel execution through tasks. This work has been integrated into the MPC framework, and has been validated on a distributed heterogeneous system.

Investigating Dependency Graph Discovery Impact on Task-based MPI+OpenMP Applications Performances
Romain Pereira Adrien Roussel Patrick Carribault Thierry Gautier
52nd International Conference on Parallel Processing (ICPP 2023), 2023

Abstract

The architecture of supercomputers is evolving to expose massive parallelism. MPI and OpenMP are widely used in application codes on the largest supercomputers in the world. The community primarily focused on composing MPI with OpenMP before its version 3.0 introduced task-based programming. Recent advances in OpenMP task model and its interoperability with MPI enabled fine model composition and seamless support for asynchrony. Yet, OpenMP tasking overheads limit the gain of task-based applications over their historical loop parallelization (parallel for construct). This paper identifies the OpenMP task dependency graph discovery speed as a limiting factor in the performance of task-based applications. We study its impact on intra and inter-node performances over two benchmarks (Cholesky, HPCG) and a proxy-application (LULESH). We evaluate the performance impacts of several discovery optimizations, and introduce a persistent task dependency graph reducing overheads by a factor up to 15 at run-time. We measure 2x speedup over parallel for versions weak scaled to 16K cores, due to improved cache memory use and communication overlap, enabled by task refinement and depth-first scheduling.

Suspending OpenMP Tasks on Asynchronous Events: Extending the Taskwait Construct
Romain Pereira Maël Martin Adrien Roussel Thierry Gautier Patrick Carribault
IWOMP 23 - International Workshop on OpenMP, 2023

Abstract

Many-core and heterogeneous architectures now require programmers to compose multiple asynchronous programming models to fully exploit hardware capabilities. As a shared-memory parallel programming model, OpenMP has the responsibility of orchestrating the suspension and progression of asynchronous operations occurring on a compute node, such as MPI communications or CUDA/HIP streams. Yet, the specifications only come with the task detach(event) API to suspend tasks until an asynchronous operation is completed, which presents a few drawbacks. In this paper, we introduce the design and implementation of an extension of the taskwait construct to suspend a task until an asynchronous event completes. It aims to reduce runtime costs induced by the current solution, and to provide a standard API to automate portable task suspension solutions. The results show half the overhead of the existing task detach clause.

Generating and Scaling a Multi-Language Test-Suite for MPI
Julien Adam J.B. Besnard Paul Canat Sameer Shende Hugo Taboada Adrien Roussel Marc Pérache Julien Jaeger
EuroMPI'23, 2023

Abstract

High-Performance Computing (HPC) is currently facing significant challenges. The hardware pressure has become increasingly difficult to manage due to the lack of parallel abstractions in applications. As a result, parallel programs must undergo drastic evolution to effectively exploit underlying hardware parallelism. Failure to do so results in inefficient code. In this pressing environment, parallel runtimes play a critical role, and their testing becomes crucial. This paper focuses on the MPI interface and leverages the MPI binding tools to develop a multi-language test-suite for MPI. By doing so and building on previous work from the Forum’s document editors, we implement a systematic testing of MPI symbols in the context of the Parallel Computing Validation System (PCVS), which is an HPC validation platform dedicated to running and managing test-suites at scale. We first describe PCVS, then outline the process of generating the MPI API test suite, and finally, run these tests at scale. All data sets, code generators, and implementations are made available in open-source to the community. We also set up a dedicated website showcasing the results, which self-updates thanks to the Spack package manager.

Performance Improvements of Parallel Applications thanks to MPI-4.0 Hints
Maxim Moraru Adrien Roussel Hugo Taboada Christophe Jaillet Michael Krajecki Marc Pérache
Proceedings of SBAC-PAD 2022, IEEE, 2022

Abstract

HPC systems have experienced significant growth over the past years, with modern machines having hundreds of thousands of nodes. Message Passing Interface (MPI) is the de facto standard for distributed computing on these architectures. On the MPI critical path, the message-matching process is one of the most time-consuming operations. In this process, searching for a specific request in a message queue represents a significant part of the communication latency. So far, no miracle algorithm performs well in all cases. This paper explores potential matching specializations thanks to hints introduced in the latest MPI 4.0 standard. We propose a hash-table-based algorithm that performs constant-time message matching for requests without wildcards. This approach is suitable for intensive point-to-point communication phases in many applications (more than 50% of CORAL benchmarks). We demonstrate that our approach can improve the overall execution time of real HPC applications by up to 25%. We also analyze the limitations of our method and propose a strategy for identifying the most suitable algorithm for a given application, applying machine learning techniques to classify applications according to their message pattern characteristics.
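
To make the idea of hash-table-based matching concrete, here is a minimal Python sketch. It is not the algorithm from the paper: the class `MatchingTable` and its method names are hypothetical, and the sketch assumes the application never uses wildcard receives (for example, because it set the MPI 4.0 info assertions `mpi_assert_no_any_source` and `mpi_assert_no_any_tag`). Under that assumption, every receive and every incoming message carries a full (communicator, source, tag) key, so matching becomes a dictionary lookup instead of a linear queue scan.

```python
from collections import defaultdict, deque

class MatchingTable:
    """Illustrative matching structure keyed by (comm, source, tag).

    Assumes no wildcard receives, so every key is fully specified.
    One FIFO per key keeps messages with the same key in arrival order.
    """

    def __init__(self):
        self.posted = defaultdict(deque)      # receives waiting for a message
        self.unexpected = defaultdict(deque)  # messages that arrived early

    def post_recv(self, comm, source, tag, request):
        """Post a receive; return the matching early message, or None."""
        key = (comm, source, tag)
        queue = self.unexpected.get(key)
        if queue:
            return queue.popleft()            # match the oldest early message
        self.posted[key].append(request)
        return None

    def on_arrival(self, comm, source, tag, message):
        """Handle an incoming message; return the matching receive, or None."""
        key = (comm, source, tag)
        queue = self.posted.get(key)
        if queue:
            return queue.popleft()            # match the oldest posted receive
        self.unexpected[key].append(message)
        return None

table = MatchingTable()
table.post_recv(comm=0, source=1, tag=7, request="recv-1")
print(table.on_arrival(comm=0, source=1, tag=7, message="msg-1"))
```

Both operations cost one hash lookup plus one deque operation on average. With wildcards allowed, a receive could match several keys at once, which is why this layout only applies when wildcards are ruled out.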

Enhancing MPI+OpenMP Task based Applications for Heterogeneous Architectures with GPU support
Manuel Ferat Romain Pereira Adrien Roussel Patrick Carribault Luiz Angelo Steffenel Thierry Gautier
IWOMP 2022 - 18th International Workshop on OpenMP, p. 1-14, 2022

Abstract

Heterogeneous supercomputers are widespread among HPC systems, and programming efficient applications on these architectures is a challenge. Task-based programming models are a promising way to tackle this challenge. Since OpenMP 4.0 and 4.5, the target directives make it possible to offload pieces of code to GPUs and to express them as tasks with dependencies. Therefore, heterogeneous machines can be programmed using MPI+OpenMP(task+target) to exhibit a very high level of concurrent asynchronous operations for which data transfers, kernel executions, communications and CPU computations can be overlapped. Hence, it is possible to suspend tasks performing these asynchronous operations on the CPUs and to overlap their completion with another task execution. Suspended tasks can resume once the associated asynchronous event is completed in an opportunistic way at every scheduling point. We have integrated this feature into the MPC framework, validated it on an AXPY microbenchmark, and evaluated it on an MPI+OpenMP(tasks) implementation of the LULESH proxy application. The results show that we are able to improve asynchronism and the overall HPC performance, allowing applications to benefit from asynchronous execution on heterogeneous machines.

Benefits of MPI Sessions for GPU MPI applications
Maxim Moraru Adrien Roussel Hugo Taboada Christophe Jaillet Michael Krajecki Marc Pérache
Proceedings of EuroMPI 2021, 2021

Abstract

Heterogeneous supercomputers are now considered the most valuable solution to reach the Exascale. Nowadays, we can frequently observe that compute nodes are composed of more than one GPU accelerator. Programming such architectures efficiently is challenging. MPI is the de facto standard for distributed computing. CUDA-aware libraries were introduced to ease GPU inter-node communications. However, they induce some overhead that can degrade overall performance. The MPI 4.0 Specification draft introduces the MPI Sessions model, which offers the ability to initialize specific resources for a specific component of the application. In this paper, we present a way to reduce the overhead induced by CUDA-aware libraries with a solution inspired by MPI Sessions. In this way, we minimize the overhead induced by GPUs in an MPI context and improve the efficiency of CPU + GPU programs. We evaluate our approach on various micro-benchmarks and some proxy applications like Lulesh, MiniFE, Quicksilver, and Cloverleaf. We demonstrate how this approach can provide up to a 7x speedup compared to the standard MPI model.

Communication-Aware Task Scheduling Strategy in Hybrid MPI+OpenMP Applications
Romain Pereira Adrien Roussel Patrick Carribault Thierry Gautier
IWOMP 2021 - 17th International Workshop on OpenMP, p. 1-15, 2021-09

Preliminary Experience with OpenMP Memory Management Implementation
Adrien Roussel Patrick Carribault Julien Jaeger
OpenMP: Portable Multi-Level Parallelism on Modern Systems - 16th International Workshop on OpenMP, IWOMP 2020, Austin, TX, USA, September 22-24, 2020, Proceedings, Springer, p. 313-327, 2020

Detecting Non-sibling Dependencies in OpenMP Task-Based Applications
Ricardo Bispo Vieira Antoine Capra Patrick Carribault Julien Jaeger Marc Pérache Adrien Roussel
OpenMP: Conquering the Full Hardware Spectrum - 15th International Workshop on OpenMP, IWOMP 2019, Auckland, New Zealand, September 11-13, 2019, Proceedings, Springer, p. 231-245, 2019

Abstract

The advent of the multicore era led to the duplication of functional units through an increasing number of cores. To exploit those processors, a shared-memory parallel programming model is one possible direction. Thus, OpenMP is a good candidate to enable different paradigms: data parallelism (including loop-based directives) and control parallelism, through the notion of tasks with dependencies. But it is the programmer's responsibility to ensure that data dependencies are complete so that no data races can happen. It might be complex to guarantee that no issue will occur and that all dependencies have been correctly expressed in the context of nested tasks. This paper proposes an algorithm to detect the data dependencies that might be missing on the OpenMP task clauses between tasks that have been generated by different parents. This approach is implemented inside a tool relying on the OMPT interface.

Parallelization of iterative methods to solve sparse linear systems using task based runtime systems on multi and many-core architectures: application to Multi-Level Domain Decomposition methods
Adrien Roussel
Université Grenoble Alpes, 2018-02

Comparaison de moteurs exécutifs pour la parallélisation de solveurs linéaires itératifs
Adrien Roussel
Conférence d'informatique en Parallélisme, Architecture et Système (Compas'2016), 2016

Description, Implementation and Evaluation of an Affinity Clause for Task Directives
Philippe Virouleau Adrien Roussel François Broquedis Thierry Gautier Fabrice Rastello Jean-Marc Gratien
IWOMP 2016, 2016

Using Runtime Systems Tools to Implement Efficient Preconditioners for Heterogeneous Architectures
Adrien Roussel Jean-Marc Gratien Thierry Gautier
Oil & Gas Science and Technology - Revue d'IFP Energies nouvelles, Institut Français du Pétrole, p. 65:1-13, 2016

