Patrick Carribault is a program manager for high-performance computing and quantum computing, a CEA expert Fellow, and holds an HDR (habilitation) in Computer Science. His research focuses on the software stack and on the co-design of parallel applications and high-performance computing architectures. Through academic and industrial collaborations, he studies programming models, compilation, and parallel performance optimization on current and future supercomputers.
Patrick Carribault has supervised and directed more than 10 PhD theses and has published more than 40 articles in international conferences and journals.
European MPI Users' Group Meeting, 2024

Abstract
The evolution of parallel computing architectures presents new challenges for developing efficient parallelized codes. The emergence of heterogeneous systems has given rise to multiple programming models, each requiring careful adaptation to maximize performance. In this context, we propose reevaluating memory layout designs for computational tasks within larger nodes by comparing various architectures. To gain insight into the performance discrepancies between shared memory and shared-address space settings, we systematically measure the bandwidth between cores and sockets using different methodologies. Our findings reveal significant differences in performance, suggesting that MPI running inside UNIX processes may not fully utilize its intranode bandwidth potential. In light of our work in the MPC thread-based MPI runtime, which can leverage shared memory to achieve higher performance due to its optimized layout, we advocate for enabling the use of shared memory within the MPI standard.
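As background, the shared-memory capability already available in MPI today is the MPI-3 shared-memory window API. The minimal sketch below (our own illustration, not taken from the paper, and assuming at least two ranks per node) shows how ranks on the same node can exchange data through plain loads and stores:

```c
/* Minimal sketch (not from the paper): node-local data sharing through the
 * existing MPI-3 shared-memory window API. Assumes >= 2 ranks per node. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the ranks that can physically share memory (same node). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Each rank contributes one double to a node-local shared window. */
    double *local;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                            node_comm, &local, &win);
    *local = (double)node_rank;
    MPI_Win_fence(0, win); /* make every contribution visible */

    /* Rank 0 reads rank 1's value directly through a plain load. */
    if (node_rank == 0 && node_size > 1) {
        MPI_Aint size;
        int disp;
        double *remote;
        MPI_Win_shared_query(win, 1, &size, &disp, &remote);
        printf("value written by node rank 1: %f\n", *remote);
    }

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```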
15th International Conference on Parallel Processing & Applied Mathematics, 2024

Abstract
Breaking down the parallel time into work, idleness, and overheads is crucial for assessing the performance of HPC applications, but difficult to measure in asynchronous dependent-tasking runtime systems. No existing tool measures it both portably and accurately. This paper introduces POT, a tool suite for measuring the performance of dependent task-based applications. We focus on its low-disturbance methodology consisting of task modeling, discrete-event tracing, and post-mortem simulation-based analysis. It builds on the standard OMPT interface of the OpenMP specification. The paper evaluates the precision of POT's parallel time breakdown analysis on the LLVM and MPC implementations and shows that measurement bias may be neglected above 16µs of workload per task, portably across two architectures and OpenMP runtime systems.
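For readers unfamiliar with OMPT, the sketch below (a generic illustration, not POT's actual implementation) shows the kind of tool registration and task-event callbacks a discrete-event tracer of this sort builds on:

```c
/* Hedged sketch (not POT itself): a minimal OMPT tool that timestamps task
 * creation and scheduling events, the raw material of a discrete-event trace. */
#include <omp-tools.h>
#include <omp.h>
#include <stdio.h>

static ompt_set_callback_t set_callback;

static void on_task_create(ompt_data_t *encountering_task, const ompt_frame_t *frame,
                           ompt_data_t *new_task, int flags, int has_deps,
                           const void *codeptr_ra)
{
    printf("task created at %f (deps=%d)\n", omp_get_wtime(), has_deps);
}

static void on_task_schedule(ompt_data_t *prior, ompt_task_status_t status,
                             ompt_data_t *next)
{
    printf("task scheduled at %f (status=%d)\n", omp_get_wtime(), (int)status);
}

static int tool_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                           ompt_data_t *tool_data)
{
    set_callback = (ompt_set_callback_t)lookup("ompt_set_callback");
    set_callback(ompt_callback_task_create, (ompt_callback_t)on_task_create);
    set_callback(ompt_callback_task_schedule, (ompt_callback_t)on_task_schedule);
    return 1; /* non-zero keeps the tool active */
}

static void tool_finalize(ompt_data_t *tool_data) {}

/* The OpenMP runtime looks this symbol up at startup to activate the tool. */
ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version)
{
    static ompt_start_tool_result_t result = { tool_initialize, tool_finalize, {0} };
    return &result;
}
```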
HPCAsia 2024 Workshops: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops, 2024


Abstract
The adoption of ARM processor architectures is on the rise in the HPC ecosystem. The Fugaku supercomputer is a homogeneous ARM-based machine and is among the most powerful machines in the world. On the programming side, dependent task-based programming models are gaining traction thanks to their many advantages, such as dynamic load balancing, implicit expression of communication/computation overlap, and early-bird communication posting. MPI and OpenMP are two widespread programming standards that enable task-based programming at the distributed-memory level. Despite its many advantages, the mixed use of these standard programming models with dependent tasks is still under-evaluated on large-scale machines. In this paper, we provide an overview of mixing the OpenMP dependent tasking model with MPI using the state-of-the-art software stack (GCC 13, Clang 17, MPC-OMP). We report the level of performance to expect when porting applications to such a mixed use of the standards on the Fugaku supercomputer, using two benchmarks (Cholesky, HPCCG) and a proxy-application (LULESH). We show that the software stack, resource binding, and communication progression mechanisms are factors that have a significant impact on performance. On distributed applications, performance reaches up to 80% efficiency for task-based applications like HPCCG. We also point out a few areas of improvement in OpenMP runtimes.
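The hybrid pattern evaluated in such studies looks roughly like the sketch below (an illustration only: compute_border, compute_interior and apply_halo are placeholder kernels, and MPI is assumed to be initialized with MPI_THREAD_MULTIPLE):

```c
/* Illustrative sketch of MPI + OpenMP dependent tasks (not one of the paper's
 * benchmarks). Assumes MPI was initialized with MPI_THREAD_MULTIPLE. */
#include <mpi.h>

#define N 1024

static void compute_border(double *f)              { f[0] += 1.0; } /* placeholder */
static void compute_interior(double *f)            { f[1] += 1.0; } /* placeholder */
static void apply_halo(double *f, const double *h) { f[0] += h[0]; } /* placeholder */

void step(double *field, double *halo, int rank, int size)
{
    #pragma omp parallel
    #pragma omp single
    {
        /* Producer: compute the border cells first so they can be sent early. */
        #pragma omp task depend(out: field[0])
        compute_border(field);

        /* Early-bird communication task, released as soon as the border is ready. */
        #pragma omp task depend(in: field[0]) depend(out: halo[0])
        {
            int right = (rank + 1) % size, left = (rank - 1 + size) % size;
            MPI_Sendrecv(field, N, MPI_DOUBLE, right, 0,
                         halo, N, MPI_DOUBLE, left, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        /* Interior computation overlaps with the communication task. */
        #pragma omp task depend(inout: field[1])
        compute_interior(field);

        /* Consumer waits for both the received halo and the interior update. */
        #pragma omp task depend(in: halo[0]) depend(in: field[1])
        apply_halo(field, halo);
    }
}
```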
IEEE International Conference on Quantum Computing and Engineering, 2023

Abstract
Quantum computers exploit the particular behavior of quantum physical systems to solve some problems in a different way than classical computers. We are now approaching the point where quantum computing could provide real advantages over classical methods. The computational capabilities of quantum systems will soon be available in future supercomputer architectures as hardware accelerators called Quantum Processing Units (QPU). From optimizing compilers to task scheduling, the High-Performance Computing (HPC) software stack could benefit from the advantages of quantum computing. We look here at the problem of register allocation, a crucial part of modern optimizing compilers. We propose a simple proof-of-concept hybrid quantum algorithm based on QAOA to solve this problem. We implement the algorithm and integrate it directly into GCC, a well-known modern compiler. The performance of the algorithm is evaluated against the simple Chaitin-Briggs heuristic as well as GCC's register allocator. While our proposed algorithm lags behind GCC's modern heuristics, it is a good first step in the design of useful quantum algorithms for the classical HPC software stack.
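For reference, the classical baseline mentioned above boils down to coloring the interference graph. The toy sketch below (our own illustration, not the paper's QAOA formulation) shows the simplify/select phases of a Chaitin-Briggs-style allocator on a hard-coded graph:

```c
/* Hedged sketch: the classical Chaitin-Briggs simplify/select heuristic used as a
 * baseline in the paper, applied to a toy interference graph. */
#include <stdio.h>
#include <string.h>

#define V 5   /* virtual registers */
#define K 3   /* physical registers (colors) */

int main(void)
{
    /* Toy interference graph: adj[i][j] = 1 if registers i and j are live together. */
    int adj[V][V] = {
        {0,1,1,0,0},
        {1,0,1,1,0},
        {1,1,0,1,1},
        {0,1,1,0,1},
        {0,0,1,1,0},
    };
    int removed[V] = {0}, stack[V], top = 0;

    /* Simplify: repeatedly remove a node with fewer than K live neighbors. */
    for (int pass = 0; pass < V; pass++) {
        int picked = -1;
        for (int i = 0; i < V && picked < 0; i++) {
            if (removed[i]) continue;
            int deg = 0;
            for (int j = 0; j < V; j++)
                if (!removed[j] && adj[i][j]) deg++;
            if (deg < K) picked = i;
        }
        if (picked < 0) { printf("potential spill\n"); return 1; }
        removed[picked] = 1;
        stack[top++] = picked;
    }

    /* Select: pop nodes and give each the lowest color unused by its neighbors. */
    int color[V];
    memset(color, -1, sizeof(color));
    while (top > 0) {
        int i = stack[--top];
        int used[K] = {0};
        for (int j = 0; j < V; j++)
            if (adj[i][j] && color[j] >= 0) used[color[j]] = 1;
        for (int c = 0; c < K; c++)
            if (!used[c]) { color[i] = c; break; }
        printf("vreg %d -> phys reg %d\n", i, color[i]);
    }
    return 0;
}
```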
52nd International Conference on Parallel Processing (ICPP 2023), 2023


Abstract
The architecture of supercomputers is evolving to expose massive parallelism. MPI and OpenMP are widely used in application codes on the largest supercomputers in the world. The community primarily focused on composing MPI with OpenMP before its version 3.0 introduced task-based programming. Recent advances in the OpenMP task model and its interoperability with MPI have enabled fine model composition and seamless support for asynchrony. Yet, OpenMP tasking overheads limit the gain of task-based applications over their historical loop parallelization (parallel for construct). This paper identifies the OpenMP task dependency graph discovery speed as a limiting factor in the performance of task-based applications. We study its impact on intra- and inter-node performance over two benchmarks (Cholesky, HPCG) and a proxy-application (LULESH). We evaluate the performance impact of several discovery optimizations and introduce a persistent task dependency graph reducing run-time overheads by a factor of up to 15. We measure a 2x speedup over parallel for versions weak-scaled to 16K cores, due to improved cache memory use and communication overlap, enabled by task refinement and depth-first scheduling.
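To give an idea of the discovery cost being targeted, the sketch below (an illustration, with update_block as a placeholder kernel) shows a dependent-task region whose task graph the runtime rediscovers at every time step; the paper's persistent graph caches this discovery and is not shown here:

```c
/* Illustrative sketch: a dependent-task region whose dependency graph is rebuilt
 * by the runtime at every time step in the standard OpenMP model. */
#define NB 8                               /* number of blocks */

extern void update_block(double *blk);     /* placeholder block kernel */

void iterate(double *blocks[NB], int nsteps)
{
    #pragma omp parallel
    #pragma omp single
    for (int s = 0; s < nsteps; s++) {
        /* One task per block, chained through dependencies: the runtime has to
         * rediscover this NB-task chain (and its links to the previous step)
         * at every iteration. */
        #pragma omp task depend(inout: blocks[0][0])
        update_block(blocks[0]);
        for (int b = 1; b < NB; b++) {
            #pragma omp task depend(in: blocks[b - 1][0]) depend(inout: blocks[b][0])
            update_block(blocks[b]);
        }
    }
    /* The implicit barrier of the parallel region waits for all tasks. */
}
```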
IWOMP 23 - International Workshop on OpenMP, 2023


Abstract
Many-core and heterogeneous architectures now require programmers to compose multiple asynchronous programming models to fully exploit hardware capabilities. As a shared-memory parallel programming model, OpenMP has the responsibility of orchestrating the suspension and progression of asynchronous operations occurring on a compute node, such as MPI communications or CUDA/HIP streams. Yet, the specification only provides the task detach(event) API to suspend tasks until an asynchronous operation is completed, which presents a few drawbacks. In this paper, we introduce the design and implementation of an extension of the taskwait construct to suspend a task until an asynchronous event completes. It aims to reduce the runtime costs induced by the current solution and to provide a standard API to automate portable task-suspension solutions. The results show half the overhead of the existing task detach clause.
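For context, the sketch below illustrates the existing detach(event) mechanism that the proposed taskwait extension targets (our own example: compute and use_result are placeholder routines, and the extension itself, being a proposal, is not shown):

```c
/* Sketch of the current OpenMP 5.0 detach(event) pattern (the paper's proposed
 * taskwait extension is not shown). compute() and use_result() are placeholders. */
#include <mpi.h>
#include <omp.h>

static void compute(double *a, int m)          { for (int i = 0; i < m; i++) a[i] *= 2.0; } /* placeholder */
static void use_result(const double *b, int n) { (void)b; (void)n; }                        /* placeholder */

void reduce_async(double *buf, int n, double *other, int m)
{
    #pragma omp parallel
    #pragma omp single
    {
        omp_event_handle_t evt;
        MPI_Request req;

        /* Post the non-blocking reduction, then represent it as a detached task:
         * the task completes only once omp_fulfill_event(evt) is called. */
        MPI_Iallreduce(MPI_IN_PLACE, buf, n, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);
        #pragma omp task detach(evt) depend(out: buf[0])
        { /* empty body: this task stands for the in-flight reduction */ }

        /* Independent work overlaps with the communication. */
        #pragma omp task
        compute(other, m);

        /* Released only once the detached task (i.e. the reduction) completes. */
        #pragma omp task depend(in: buf[0])
        use_result(buf, n);

        /* The generating thread waits for the operation and fulfills the event. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        omp_fulfill_event(evt);
    }
}
```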
IWOMP 2022 - 18th International Workshop on OpenMP, p. 1-14, 2022


Abstract
Heterogeneous supercomputers are widespread across HPC systems, and programming efficient applications on these architectures is a challenge. Task-based programming models are a promising way to tackle this challenge. Since OpenMP 4.0 and 4.5, the target directives make it possible to offload pieces of code to GPUs and to express them as tasks with dependencies. Therefore, heterogeneous machines can be programmed using MPI+OpenMP(task+target) to exhibit a very high level of concurrent asynchronous operations, for which data transfers, kernel executions, communications, and CPU computations can be overlapped. Hence, it is possible to suspend tasks performing these asynchronous operations on the CPUs and to overlap their completion with the execution of other tasks. Suspended tasks can resume opportunistically at every scheduling point once the associated asynchronous event has completed. We have integrated this feature into the MPC framework, validated it on an AXPY microbenchmark, and evaluated it on an MPI+OpenMP(tasks) implementation of the LULESH proxy application. The results show that we are able to improve asynchrony and overall performance, allowing applications to benefit from asynchronous execution on heterogeneous machines.
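The sketch below illustrates the general pattern (not the MPC implementation itself, and with combine() as a placeholder): a target nowait region becomes a deferred dependent task whose completion is overlapped with CPU tasks.

```c
/* Illustrative sketch: a GPU offload expressed as a dependent task with
 * target nowait, overlapped with a CPU task, in the spirit of the AXPY
 * microbenchmark mentioned above. */
static void combine(double *y, double *z, int n)   /* placeholder consumer */
{ for (int i = 0; i < n; i++) y[i] += z[i]; }

void axpy_overlap(double a, double *x, double *y, int n, double *z, int m)
{
    #pragma omp parallel
    #pragma omp single
    {
        /* Deferred offload task: transfers and kernel run asynchronously. */
        #pragma omp target teams distribute parallel for nowait \
                map(to: x[0:n]) map(tofrom: y[0:n]) depend(out: y[0])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];

        /* CPU task overlapping with the offload. */
        #pragma omp task depend(inout: z[0])
        for (int i = 0; i < m; i++)
            z[i] *= 2.0;

        /* Consumer released only when both the offload and the CPU task are done. */
        #pragma omp task depend(in: y[0]) depend(in: z[0])
        combine(y, z, n < m ? n : m);
    }
}
```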
Zenodo, 2022


Abstract
This document feeds research and development priorities developed by the European HPC ecosystem into EuroHPC’s Research and Innovation Advisory Group with an aim to define the HPC Technology research Work Programme and the calls for proposals included in it and to be launched from 2023 to 2026. This SRA also describes the major trends in the deployment of HPC and HPDA methods and systems, driven by economic and societal needs in Europe, taking into account the changes expected in the technologies and architectures of the expanding underlying IT infrastructure. The goal is to draw a complete picture of the state of the art and the challenges for the next three to four years rather than to focus on specific technologies, implementations or solutions.
Tools for High Performance Computing 2018 / 2019, Springer International Publishing, p. 151-168, 2021

Abstract
The backtrace is one of the most common operations performed by profiling and debugging tools. It consists of determining the nesting of functions leading to the current execution state. Frameworks and standard libraries provide facilities enabling this operation; however, it generally incurs both computational and memory costs. Indeed, walking the stack up and then possibly resolving function pointers (to function names) before storing them can lead to non-negligible costs. In this paper, we propose to explore a means of extracting optimized backtraces with an O(1) storage size by defining the notion of stack tags. We define a new data structure that we call a hashed-trie, used to encode stack traces at runtime through chained hashing. Our process, called stack-tagging, is implemented in a GCC plugin, enabling its use on C and C++ applications. A library enabling the decoding of stack locators through both static and brute-force analysis is also presented. This work introduces a new manner of capturing execution state which greatly simplifies both extraction and storage, two important issues in parallel profiling.
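To convey the idea, here is a simplified sketch of a chained stack hash (the actual paper maintains the tag incrementally through a GCC plugin rather than walking the stack; glibc's backtrace() is used here purely for illustration):

```c
/* Hedged sketch of the stack-tag idea: fold the current call path into one
 * fixed-size value by chaining a hash over frame addresses. */
#include <execinfo.h>
#include <stdint.h>
#include <stdio.h>

/* Chained FNV-1a style hash: tag of a path = hash(parent_tag, current frame). */
static uint64_t chain_hash(uint64_t parent, uint64_t frame)
{
    uint64_t h = parent ^ 0xcbf29ce484222325ULL;
    return (h ^ frame) * 0x100000001b3ULL;
}

uint64_t stack_tag(void)
{
    void *frames[64];
    int depth = backtrace(frames, 64);      /* walk the stack (illustration only) */

    uint64_t tag = 0;
    for (int i = depth - 1; i >= 0; i--)    /* fold from outermost to innermost */
        tag = chain_hash(tag, (uint64_t)(uintptr_t)frames[i]);
    return tag;                             /* O(1) storage per call path */
}

void leaf(void)  { printf("tag at leaf: %016llx\n", (unsigned long long)stack_tag()); }
void inner(void) { leaf(); }

int main(void) { inner(); leaf(); return 0; } /* two call paths, two different tags */
```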
Tools for High Performance Computing 2017, Springer International Publishing, p. 57-71, 2019

Abstract
Several instrumentation interfaces have been developed for parallel programs to make observable the actions that take place during execution and to make accessible information about the program’s behavior and performance. Following in the footsteps of the successful profiling interface for MPI (PMPI), new rich interfaces exposing the internal operation of MPI (MPI-T) and OpenMP (OMPT) runtimes are now in the standards. Taking advantage of these interfaces requires tools to selectively collect events from multiple interfaces by various techniques: function interposition (PMPI), value read (MPI-T), and callbacks (OMPT). In this paper, we present the unified instrumentation pipeline proposed by the MALP infrastructure, which can be used to forward a variety of fine-grained events from multiple interfaces online to multi-threaded analysis processes implemented orthogonally with plugins. In essence, our contribution complements “front-end” instrumentation mechanisms with a generic “back-end” event consumption interface that allows “consumer” callbacks to generate performance measurements in various formats for analysis and transport. With such support, online and post-mortem cases become similar from an analysis point of view, making it possible to build more unified and consistent analysis frameworks. The paper describes the approach and demonstrates its benefits with several use cases.
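As an example of the front-end side being unified, the sketch below shows a PMPI interposition wrapper forwarding an event to a generic consumer callback (the callback type and print_consumer are hypothetical, for illustration only; MALP's actual interface may differ):

```c
/* Illustrative sketch (not MALP itself): PMPI interposition feeding a generic
 * back-end consumer callback. */
#include <mpi.h>
#include <stdio.h>

/* Generic back-end consumer interface (hypothetical, for illustration). */
typedef void (*event_consumer_t)(const char *name, int peer, int bytes, double t);

static void print_consumer(const char *name, int peer, int bytes, double t)
{
    printf("[%f] %s peer=%d bytes=%d\n", t, name, peer, bytes);
}

static event_consumer_t consumer = print_consumer;

/* PMPI interposition: the tool's MPI_Send is linked in place of the library's,
 * records the event, then calls the real implementation through PMPI_Send. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm)
{
    int size;
    MPI_Type_size(datatype, &size);
    consumer("MPI_Send", dest, count * size, MPI_Wtime());
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```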
OpenMP: Conquering the Full Hardware Spectrum - 15th International Workshop on OpenMP, IWOMP 2019, Auckland, New Zealand, September 11-13, 2019, Proceedings, Springer, p. 231-245, 2019

Abstract
The advent of the multicore era led to the duplication of functional units through an increasing number of cores. To exploit those processors, a shared-memory parallel programming model is one possible direction. Thus, OpenMP is a good candidate to enable different paradigms: data parallelism (including loop-based directives) and control parallelism, through the notion of tasks with dependencies. But it is the programmer's responsibility to ensure that data dependencies are complete so that no data races may happen. It might be complex to guarantee that no issue will occur and that all dependencies have been correctly expressed in the context of nested tasks. This paper proposes an algorithm to detect the data dependencies that might be missing from OpenMP task clauses between tasks that have been generated by different parents. This approach is implemented inside a tool relying on the OMPT interface.
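The sketch below shows the kind of missing dependency such a tool targets: the depend clauses are attached to child tasks of different parents, so they do not order the tasks (dependencies only constrain sibling tasks), and the two children may race on x.

```c
/* Illustrative sketch of a missing dependency between tasks generated by
 * different parents: the depend clauses below look sufficient but only order
 * sibling tasks, so the write and the read of x are unsynchronized. */
#include <stdio.h>

int main(void)
{
    int x = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task                      /* parent A */
        {
            #pragma omp task depend(out: x)   /* child of A: writes x */
            x = 42;
        }

        #pragma omp task                      /* parent B */
        {
            #pragma omp task depend(in: x)    /* child of B: may read x too early */
            printf("x = %d\n", x);
        }
    }
    return 0;
}
```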
Euro-Par 2013: Parallel Processing Workshops - BigDataCloud, DIHC, FedICI, HeteroPar, HiBB, LSDVE, MHPC, OMHI, PADABS, PROPER, Resilience, ROME, and UCHPC 2013, Aachen, Germany, August 26-27, 2013. Revised Selected Papers, Springer, p. 168-177, 2013

Chapman & Hall/CRC, 2013

Abstract
Contemporary High Performance Computing: From Petascale toward Exascale focuses on the ecosystems surrounding the world’s leading centers for high performance computing (HPC). It covers many of the important factors involved in each ecosystem: computer architectures, software, applications, facilities, and sponsors. The first part of the book examines significant trends in HPC systems, including computer architectures, applications, performance, and software. It discusses the growth from terascale to petascale computing and the influence of the TOP500 and Green500 lists. The second part of the book provides a comprehensive overview of 18 HPC ecosystems from around the world. Each chapter in this section describes programmatic motivation for HPC and their important applications; a flagship HPC system overview covering computer architecture, system software, programming systems, storage, visualization, and analytics support; and an overview of their data center/facility. The last part of the book addresses the role of clouds and grids in HPC, including chapters on the Magellan, FutureGrid, and LLGrid projects. With contributions from top researchers directly involved in designing, deploying, and using these supercomputing systems, this book captures a global picture of the state of the art in HPC.