Patrick CARRIBAULT

Patrick Carribault is a project lead for high-performance computing and quantum computing, a CEA Fellow expert, and holds an HDR (habilitation to supervise research) in Computer Science. His research focuses on the software stack and on co-design between parallel applications and high-performance computing architectures. Through academic and industrial collaborations, he studies programming models, compilation, and parallel performance optimization on current and future supercomputers.

Patrick Carribault has supervised and directed more than 10 PhD theses and has published more than 40 articles in international conferences and journals.

An Overview on Mixing MPI and OpenMP Dependent Tasking on A64FX
Romain Pereira   Adrien Roussel   Miwako Tsuji   Patrick Carribault   Mitsuhisa Sato   Hitoshi Murai   Thierry Gautier  
HPCAsia 2024 Workshops: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops, 2024

Abstract

The adoption of ARM processor architectures is on the rise in the HPC ecosystem. The Fugaku supercomputer is a homogeneous ARM-based machine and one of the most powerful machines in the world. In the programming world, dependent task-based programming models are gaining traction due to their many advantages, such as dynamic load balancing, implicit expression of communication/computation overlap, and early-bird communication posting. MPI and OpenMP are two widespread programming standards that make task-based programming possible at a distributed-memory level. Despite its many advantages, mixed use of these standard programming models with dependent tasks is still under-evaluated on large-scale machines. In this paper, we provide an overview of mixing the OpenMP dependent tasking model with MPI using a state-of-the-art software stack (GCC-13, Clang-17, MPC-OMP). We report the level of performance to expect when porting applications to such mixed use of the standards on the Fugaku supercomputer, using two benchmarks (Cholesky, HPCCG) and a proxy application (LULESH). We show that the software stack, resource binding, and communication progression mechanisms are factors that have a significant impact on performance. On distributed applications, performance reaches up to 80% efficiency for task-based applications like HPCCG. We also point out a few areas of improvement in OpenMP runtimes.

Experimenting with Hybrid Quantum Optimization in HPC Software Stack for CPU Register Allocation
Brice Chichereau   Stéphane Vialle   Patrick Carribault  
IEEE International Conference on Quantum Computing and Engineering, 2023

Abstract

Quantum computers exploit the particular behavior of quantum physical systems to solve some problems in a different way than classical computers. We are now approaching the point where quantum computing could provide real advantages over classical methods. The computational capabilities of quantum systems will soon be available in future supercomputer architectures as hardware accelerators called Quantum Processing Units (QPU). From optimizing compilers to task scheduling, the High-Performance Computing (HPC) software stack could benefit from the advantages of quantum computing. We look here at the problem of register allocation, a crucial part of modern optimizing compilers. We propose a simple proof-of-concept hybrid quantum algorithm based on QAOA to solve this problem. We implement the algorithm and integrate it directly into GCC, a well-known modern compiler. The performance of the algorithm is evaluated against the simple Chaitin-Briggs heuristic as well as GCC's register allocator. While our proposed algorithm lags behind GCC's modern heuristics, it is a good first step in the design of useful quantum algorithms for the classical HPC software stack.
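
For readers unfamiliar with the classical baseline named in this abstract, register allocation can be viewed as coloring an interference graph with k colors, one per register. The sketch below is a simplified, Chaitin-Briggs-style optimistic coloring in Python. It is neither the paper's QAOA-based algorithm nor GCC's allocator; the function name `color_registers` and the input format are assumptions made for illustration.

```python
def color_registers(interference, k):
    """Simplified Chaitin-Briggs-style optimistic graph coloring.

    interference: dict mapping each variable name to the set of variables
                  that are live at the same time (symmetric graph).
    k: number of available registers.
    Returns (assignment, spilled): a dict variable -> register index,
    and the list of variables that could not be given a register.
    """
    # Work on a copy so the caller's graph is left untouched.
    graph = {v: set(ns) for v, ns in interference.items()}
    stack = []

    # Simplify phase: repeatedly remove the lowest-degree node.
    # Nodes of degree >= k are pushed optimistically instead of
    # being spilled right away (the Briggs refinement).
    while graph:
        node = min(graph, key=lambda v: (len(graph[v]), v))
        stack.append(node)
        for neighbor in graph.pop(node):
            graph[neighbor].discard(node)

    # Select phase: pop nodes and give each the first register
    # not already used by a colored neighbor.
    assignment, spilled = {}, []
    while stack:
        node = stack.pop()
        used = {assignment[n] for n in interference[node] if n in assignment}
        free = [r for r in range(k) if r not in used]
        if free:
            assignment[node] = free[0]
        else:
            spilled.append(node)  # no register left: spill to memory
    return assignment, spilled
```

A hybrid quantum approach such as the one described above would replace the heuristic search for a valid coloring with an optimization step, while the interference-graph formulation stays the same.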

Investigating Dependency Graph Discovery Impact on Task-based MPI+OpenMP Applications Performances
Romain Pereira   Adrien Roussel   Patrick Carribault   Thierry Gautier  
52nd International Conference on Parallel Processing (ICPP 2023), 2023

Abstract

The architecture of supercomputers is evolving to expose massive parallelism. MPI and OpenMP are widely used in application codes on the largest supercomputers in the world. The community primarily focused on composing MPI with OpenMP before its version 3.0 introduced task-based programming. Recent advances in the OpenMP task model and its interoperability with MPI enabled fine model composition and seamless support for asynchrony. Yet, OpenMP tasking overheads limit the gain of task-based applications over their historical loop parallelization (parallel for construct). This paper identifies the OpenMP task dependency graph discovery speed as a limiting factor in the performance of task-based applications. We study its impact on intra- and inter-node performance on two benchmarks (Cholesky, HPCG) and a proxy application (LULESH). We evaluate the performance impact of several discovery optimizations, and introduce a persistent task dependency graph that reduces overheads by a factor of up to 15 at run time. We measure a 2x speedup over parallel for versions weak-scaled to 16K cores, due to improved cache memory use and communication overlap, enabled by task refinement and depth-first scheduling.
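
The idea of paying the discovery cost once and then replaying the graph can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the paper's runtime implementation: `discover_graph` and `replay` are hypothetical names, tasks are described as (name, reads, writes) tuples in submission order, and the LIFO ready list merely hints at depth-first scheduling.

```python
from collections import defaultdict

def discover_graph(tasks):
    """Build predecessor sets from (name, reads, writes) tuples.

    Mimics depend(in/out) semantics: a task depends on the last writer
    of every datum it touches, and a writer also depends on all readers
    since the previous write.
    """
    last_writer = {}
    readers = defaultdict(list)
    preds = [set() for _ in tasks]
    for i, (_, reads, writes) in enumerate(tasks):
        for d in reads:
            if d in last_writer:
                preds[i].add(last_writer[d])      # read-after-write
        for d in writes:
            if d in last_writer:
                preds[i].add(last_writer[d])      # write-after-write
            preds[i].update(readers[d])           # write-after-read
        for d in reads:
            readers[d].append(i)
        for d in writes:
            last_writer[d] = i
            readers[d] = []
        preds[i].discard(i)
    return preds

def replay(preds, run):
    """Execute tasks from an already discovered graph (no rediscovery)."""
    remaining = [len(p) for p in preds]
    succs = [[] for _ in preds]
    for i, p in enumerate(preds):
        for j in p:
            succs[j].append(i)
    ready = [i for i, r in enumerate(remaining) if r == 0]
    order = []
    while ready:
        i = ready.pop()            # LIFO: newest ready task first
        run(i)
        order.append(i)
        for s in succs[i]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)
    return order
```

In an iterative application, `discover_graph` would be called once before the time loop and `replay` once per iteration, which is where the saving over rediscovering the same dependencies at every iteration would come from.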

Suspending OpenMP Tasks on Asynchronous Events: Extending the Taskwait Construct
Romain Pereira   Maël Martin   Adrien Roussel   Thierry Gautier   Patrick Carribault  
IWOMP 23 - International Workshop on OpenMP, 2023

Abstract

Many-core and heterogeneous architectures now require programmers to compose multiple asynchronous programming models to fully exploit hardware capabilities. As a shared-memory parallel programming model, OpenMP has the responsibility of orchestrating the suspension and progression of asynchronous operations occurring on a compute node, such as MPI communications or CUDA/HIP streams. Yet, the specification only provides the task detach(event) API to suspend tasks until an asynchronous operation is completed, which presents a few drawbacks. In this paper, we introduce the design and implementation of an extension of the taskwait construct to suspend a task until an asynchronous event completes. It aims to reduce the runtime costs induced by the current solution, and to provide a standard API to automate portable task suspension solutions. The results show half the overhead of the existing task detach clause.

Relative Performance Projection on Arm Architectures
Clément Gavoille   Hugo Taboada   Patrick Carribault   Fabrice Dupros   Brice Goglin   Emmanuel Jeannot  
Euro-Par 2022: Parallel Processing - 28th International Conference on Parallel and Distributed Computing, Glasgow, UK, August 22-26, 2022, Proceedings, Springer, p. 85-99, 2022

MPI detach - Towards automatic asynchronous local completion
Joachim Protze   Marc-André Hermanns   Matthias S. Müller   Van Man Nguyen   Julien Jaeger   Emmanuelle Saillard   Patrick Carribault   Denis Barthou  
Parallel Comput., p. 102859, 2022

Enhancing MPI+OpenMP Task based Applications for Heterogeneous Architectures with GPU support
Manuel Ferat   Romain Pereira   Adrien Roussel   Patrick Carribault   Luiz Angelo Steffenel   Thierry Gautier  
IWOMP 2022 - 18th International Workshop on OpenMP, p. 1-14, 2022-09

Abstract

Heterogeneous supercomputers are widespread among HPC systems, and programming efficient applications on these architectures is a challenge. Task-based programming models are a promising way to tackle this challenge. Since OpenMP 4.0 and 4.5, the target directives make it possible to offload pieces of code to GPUs and to express them as tasks with dependencies. Therefore, heterogeneous machines can be programmed using MPI+OpenMP(task+target) to exhibit a very high level of concurrent asynchronous operations, in which data transfers, kernel executions, communications, and CPU computations can be overlapped. Hence, it is possible to suspend tasks performing these asynchronous operations on the CPUs and to overlap their completion with the execution of other tasks. Suspended tasks can resume, opportunistically at every scheduling point, once the associated asynchronous event has completed. We have integrated this feature into the MPC framework, validated it on an AXPY microbenchmark, and evaluated it on an MPI+OpenMP(tasks) implementation of the LULESH proxy application. The results show that we are able to improve asynchronism and overall HPC performance, allowing applications to benefit from asynchronous execution on heterogeneous machines.

Exploring Space-Time Trade-Off in Backtraces
Jean-Baptiste Besnard   Julien Adam   Allen D. Malony   Sameer Shende   Julien Jaeger   Patrick Carribault   Marc Pérache  
Tools for High Performance Computing 2018 / 2019, Springer International Publishing, p. 151-168, 2021

Abstract

The backtrace is one of the most common operations performed by profiling and debugging tools. It consists in determining the nesting of functions leading to the current execution state. Frameworks and standard libraries provide facilities for this operation; however, it generally incurs both computational and memory costs. Indeed, walking up the stack and then possibly resolving function pointers (to function names) before storing them can lead to non-negligible costs. In this paper, we explore a means of extracting optimized backtraces with an O(1) storage size by defining the notion of stack tags. We define a new data structure, which we call a hashed-trie, used to encode stack traces at runtime through chained hashing. Our process, called stack-tagging, is implemented in a GCC plugin, enabling its use in C and C++ applications. A library enabling the decoding of stack locators through both static and brute-force analysis is also presented. This work introduces a new manner of capturing execution state that greatly simplifies both extraction and storage, which are important issues in parallel profiling.
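
The chained-hashing idea can be modeled in a few lines of Python. This is an illustrative sketch only: the paper's implementation is a GCC plugin for C and C++, whereas the class `HashedTrie`, its methods, the choice of BLAKE2 as hash function, and the table-based decoding below are all assumptions made for the example.

```python
import hashlib

class HashedTrie:
    """Toy model of stack tags: one fixed-size integer per call stack."""

    ROOT = 0  # tag of the empty stack

    def __init__(self):
        # tag -> (parent_tag, call_site); the root has no parent.
        self.nodes = {self.ROOT: None}

    def push(self, parent_tag, call_site):
        """Return the tag of the stack obtained by calling `call_site`
        from the stack identified by `parent_tag` (chained hashing)."""
        data = f"{parent_tag}:{call_site}".encode()
        digest = hashlib.blake2b(data, digest_size=8).digest()
        tag = int.from_bytes(digest, "big")
        self.nodes.setdefault(tag, (parent_tag, call_site))
        return tag

    def decode(self, tag):
        """Rebuild the list of call sites, outermost first, from a tag."""
        frames = []
        while self.nodes[tag] is not None:
            tag, site = self.nodes[tag]
            frames.append(site)
        return frames[::-1]


trie = HashedTrie()
t_main = trie.push(HashedTrie.ROOT, "main")
t_solve = trie.push(t_main, "solve")
t_kernel = trie.push(t_solve, "kernel")
print(trie.decode(t_kernel))
```

A profiler built this way would only need to store the current tag with each event, since the full call path can be recovered later from the trie.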

Enhancing Load-Balancing of MPI Applications with Workshare
Thomas Dionisi   Stéphane Bouhrour   Julien Jaeger   Patrick Carribault   Marc Pérache  
Proceedings of EuroPar 2021, 2021

Communication-Aware Task Scheduling Strategy in Hybrid MPI+OpenMP Applications
Romain Pereira   Adrien Roussel   Patrick Carribault   Thierry Gautier  
IWOMP 2021 - 17th International Workshop on OpenMP, p. 1-15, 2021-09

Preliminary Experience with OpenMP Memory Management Implementation
Adrien Roussel   Patrick Carribault   Julien Jaeger  
OpenMP: Portable Multi-Level Parallelism on Modern Systems - 16th International Workshop on OpenMP, IWOMP 2020, Austin, TX, USA, September 22-24, 2020, Proceedings, Springer, p. 313-327, 2020

PARCOACH Extension for Static MPI Nonblocking and Persistent Communication Validation
Van Man Nguyen   Emmanuelle Saillard   Julien Jaeger   Denis Barthou   Patrick Carribault  
4th IEEE/ACM International Workshop on Software Correctness for HPC Applications, Correctness@SC 2020, Atlanta, GA, USA, November 11, 2020, IEEE, p. 31-39, 2020

Automatic Code Motion to Extend MPI Nonblocking Overlap Window
Van Man Nguyen   Emmanuelle Saillard   Julien Jaeger   Denis Barthou   Patrick Carribault  
High Performance Computing - ISC High Performance 2020 International Workshops, Frankfurt, Germany, June 21-25, 2020, Revised Selected Papers, Springer, p. 43-54, 2020

Unifying the Analysis of Performance Event Streams at the Consumer Interface Level
Jean-Baptiste Besnard   Allen D. Malony   Sameer Shende   Marc Pérache   Patrick Carribault   Julien Jaeger  
Tools for High Performance Computing 2017, Springer International Publishing, p. 57-71, 2019

Abstract

Several instrumentation interfaces have been developed for parallel programs to make observable the actions that take place during execution and to make accessible information about the program’s behavior and performance. Following in the footsteps of the successful profiling interface for MPI (PMPI), new rich interfaces to expose the internal operation of MPI (MPI-T) and OpenMP (OMPT) runtimes are now in the standards. Taking advantage of these interfaces requires tools to selectively collect events from multiple interfaces by various techniques: function interposition (PMPI), value read (MPI-T), and callbacks (OMPT). In this paper, we present the unified instrumentation pipeline proposed by the MALP infrastructure that can be used to forward a variety of fine-grained events from multiple interfaces online to multi-threaded analysis processes implemented orthogonally with plugins. In essence, our contribution complements “front-end” instrumentation mechanisms by a generic “back-end” event consumption interface that allows “consumer” callbacks to generate performance measurements in various formats for analysis and transport. With such support, online and post-mortem cases become similar from an analysis point of view, making it possible to build more unified and consistent analysis frameworks. The paper describes the approach and demonstrates its benefits with several use cases.

Checkpoint/restart approaches for a thread-based MPI runtime
Julien Adam   Maxime Kermarquer   Jean-Baptiste Besnard   Leonardo Bautista-Gomez   Marc Pérache   Patrick Carribault   Julien Jaeger   Allen D. Malony   Sameer Shende  
Parallel Comput., p. 204-219, 2019

Detecting Non-sibling Dependencies in OpenMP Task-Based Applications
Ricardo Bispo Vieira   Antoine Capra   Patrick Carribault   Julien Jaeger   Marc Pérache   Adrien Roussel  
OpenMP: Conquering the Full Hardware Spectrum - 15th International Workshop on OpenMP, IWOMP 2019, Auckland, New Zealand, September 11-13, 2019, Proceedings, Springer, p. 231-245, 2019

Abstract

The advent of the multicore era led to the duplication of functional units through an increasing number of cores. To exploit those processors, a shared-memory parallel programming model is one possible direction. OpenMP is thus a good candidate, as it enables different paradigms: data parallelism (including loop-based directives) and control parallelism, through the notion of tasks with dependencies. However, it is the programmer's responsibility to ensure that data dependencies are complete, so that no data races can occur. It can be complex to guarantee that no issue will occur and that all dependencies have been correctly expressed in the context of nested tasks. This paper proposes an algorithm to detect data dependencies that might be missing from the OpenMP task clauses between tasks generated by different parents. This approach is implemented inside a tool relying on the OMPT interface.

Mixing ranks, tasks, progress and nonblocking collectives
Jean-Baptiste Besnard   Julien Jaeger   Allen D. Malony   Sameer Shende   Hugo Taboada   Marc Pérache   Patrick Carribault  
Proceedings of the 26th European MPI Users’ Group Meeting, EuroMPI 2019, Zürich, Switzerland, September 11-13, 2019, ACM, p. 10:1-10:10, 2019

Checkpoint/restart approaches for a thread-based MPI runtime
Julien Adam   Maxime Kermarquer   Jean-Baptiste Besnard   Leonardo Bautista-Gomez   Marc Pérache   Patrick Carribault   Julien Jaeger   Allen D. Malony   Sameer Shende  
CoRR, 2019

Efficient Communication/Computation Overlap with MPI+OpenMP Runtimes Collaboration
Marc Sergent   Mario Dagrada   Patrick Carribault   Julien Jaeger   Marc Pérache   Guillaume Papauré  
Euro-Par 2018: Parallel Processing - 24th International Conference on Parallel and Distributed Computing, Turin, Italy, August 27-31, 2018, Proceedings, Springer, p. 560-572, 2018

Transparent High-Speed Network Checkpoint/Restart in MPI
Julien Adam   Jean-Baptiste Besnard   Allen D. Malony   Sameer Shende   Marc Pérache   Patrick Carribault   Julien Jaeger  
Proceedings of the 25th European MPI Users’ Group Meeting, Barcelona, Spain, September 23-26, 2018, ACM, p. 12:1-12:11, 2018

Profile-guided scope-based data allocation method
Hugo Brunie   Julien Jaeger   Patrick Carribault   Denis Barthou  
Proceedings of the International Symposium on Memory Systems, MEMSYS 2018, Old Town Alexandria, VA, USA, October 01-04, 2018, ACM, p. 169-182, 2018

Contemporary High Performance Computing
Mickaël Amiet   Patrick Carribault   Elisabeth Charon   Guillaume Colin Verdière   Philippe Deniel   Gilles Grospellier   Guénolé Harel   François Jollet   Jacques-Charles Lafoucrière   Jacques-Bernard Lekien   Stéphane Mathieu   Marc Pérache   Jean-Christophe Weill   Gilles Wiber  
Chapman & Hall/CRC, p. 45-74, 2017

Towards a Better Expressiveness of the Speedup Metric in MPI Context
Jean-Baptiste Besnard   Allen D. Malony   Sameer Shende   Marc Pérache   Patrick Carribault   Julien Jaeger  
46th International Conference on Parallel Processing Workshops, ICPP Workshops 2017, Bristol, United Kingdom, August 14-17, 2017, IEEE Computer Society, p. 251-260, 2017

User Co-scheduling for MPI+OpenMP Applications Using OpenMP Semantics
Antoine Capra   Patrick Carribault   Jean-Baptiste Besnard   Allen D. Malony   Marc Pérache   Julien Jaeger  
Scaling OpenMP for Exascale Performance and Portability - 13th International Workshop on OpenMP, IWOMP 2017, Stony Brook, NY, USA, September 20-22, 2017, Proceedings, Springer, p. 203-216, 2017

Resource-Management Study in HPC Runtime-Stacking Context
Arthur Loussert   Benoit Welterlen   Patrick Carribault   Julien Jaeger   Marc Pérache   Raymond Namyst  
29th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2017, Campinas, Brazil, October 17-20, 2017, IEEE Computer Society, p. 177-184, 2017

Introducing Task-Containers as an Alternative to Runtime-Stacking
Jean-Baptiste Besnard   Julien Adam   Sameer Shende   Marc Pérache   Patrick Carribault   Julien Jaeger  
Proceedings of the 23rd European MPI Users’ Group Meeting, EuroMPI 2016, Edinburgh, United Kingdom, September 25-28, 2016, ACM, p. 51-63, 2016

Fine-grain data management directory for OpenMP 4.0 and OpenACC
Julien Jaeger   Patrick Carribault   Marc Pérache  
Concurr. Comput. Pract. Exp., p. 1528-1539, 2015

An MPI Halo-Cell Implementation for Zero-Copy Abstraction
Jean-Baptiste Besnard   Allen D. Malony   Sameer Shende   Marc Pérache   Patrick Carribault   Julien Jaeger  
Proceedings of the 22nd European MPI Users’ Group Meeting, EuroMPI 2015, Bordeaux, France, September 21-23, 2015, ACM, p. 3:1-3:9, 2015

Correctness Analysis of MPI-3 Non-Blocking Communications in PARCOACH
Julien Jaeger   Emmanuelle Saillard   Patrick Carribault   Denis Barthou  
Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI 2015, Bordeaux, France, September 21-23, 2015, ACM, p. 16:1-16:2, 2015

Improving MPI communication overlap with collaborative polling
Sylvain Didelot   Patrick Carribault   Marc Pérache   William Jalby  
Computing, p. 263-278, 2014

Evaluation of OpenMP Task Scheduling Algorithms for Large NUMA Architectures
Jérôme Clet-Ortega   Patrick Carribault   Marc Pérache  
Euro-Par 2014 Parallel Processing - 20th International Conference, Porto, Portugal, August 25-29, 2014. Proceedings, Springer, p. 596-607, 2014

Optimizing Collective Operations in Hybrid Applications
Aurèle Mahéo   Patrick Carribault   Marc Pérache   William Jalby  
21st European MPI Users’ Group Meeting, EuroMPI/ASIA ’14, Kyoto, Japan - September 09 - 12, 2014, ACM, p. 121, 2014

Data-Management Directory for OpenMP 4.0 and OpenACC
Julien Jaeger   Patrick Carribault   Marc Pérache  
Euro-Par 2013: Parallel Processing Workshops - BigDataCloud, DIHC, FedICI, HeteroPar, HiBB, LSDVE, MHPC, OMHI, PADABS, PROPER, Resilience, ROME, and UCHPC 2013, Aachen, Germany, August 26-27, 2013. Revised Selected Papers, Springer, p. 168-177, 2013

Hierarchical Local Storage: Exploiting Flexible User-Data Sharing Between MPI Tasks
Marc Tchiboukdjian   Patrick Carribault   Marc Pérache  
26th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2012, Shanghai, China, May 21-25, 2012, IEEE Computer Society, p. 366-377, 2012

Adaptive OpenMP for Large NUMA Nodes
Aurèle Mahéo   Souad Koliai   Patrick Carribault   Marc Pérache   William Jalby  
OpenMP in a Heterogeneous World - 8th International Workshop on OpenMP, IWOMP 2012, Rome, Italy, June 11-13, 2012. Proceedings, Springer, p. 254-257, 2012

Improving MPI Communication Overlap with Collaborative Polling
Sylvain Didelot   Patrick Carribault   Marc Pérache   William Jalby  
Recent Advances in the Message Passing Interface - 19th European MPI Users’ Group Meeting, EuroMPI 2012, Vienna, Austria, September 23-26, 2012. Proceedings, Springer, p. 37-46, 2012

Thread-Local Storage Extension to Support Thread-Based MPI/OpenMP Applications
Patrick Carribault   Marc Pérache   Hervé Jourdren  
OpenMP in the Petascale Era - 7th International Workshop on OpenMP, IWOMP 2011, Chicago, IL, USA, June 13-15, 2011. Proceedings, Springer, p. 80-93, 2011

User level DB: a debugging API for user-level thread libraries
Kevin Pouget   Marc Pérache   Patrick Carribault   Hervé Jourdren  
24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April 2010 - Workshop Proceedings, IEEE, p. 1-7, 2010

Enabling Low-Overhead Hybrid MPI/OpenMP Parallelism with MPC
Patrick Carribault   Marc Pérache   Hervé Jourdren  
Beyond Loop Level Parallelism in OpenMP: Accelerators, Tasking and More, 6th International Workshop on OpenMP, IWOMP 2010, Tsukuba, Japan, June 14-16, 2010, Proceedings, Springer, p. 1-14, 2010

MPC-MPI: An MPI Implementation Reducing the Overall Memory Consumption
Marc Pérache   Patrick Carribault   Hervé Jourdren  
Recent Advances in Parallel Virtual Machine and Message Passing Interface, 16th European PVM/MPI Users’ Group Meeting, Espoo, Finland, September 7-10, 2009. Proceedings, Springer, p. 94-103, 2009