LiHPC | Laboratory for High Performance Computing and Simulation

Dr. Julien Jaeger is a research scientist at LIHPC. He joined the MPC (Multi-Processor Computing) team at CEA in 2012, after defending his PhD thesis in computer science from the University of Versailles Saint-Quentin-En-Yvelines the same year. Since 2019, he has been leading the MPC effort focusing on parallel programming models such as MPI and OpenMP, their scheduling and their interactions on HPC supercomputers. He also actively participates to the MPI Forum, helping to design the next MPI standard.

To Share or Not to Share: a case for MPI in Shared-Memory
Julien Adam Jean-Baptiste Besnard Adrien Roussel Julien Jaeger Romain Pereira Patrick Carribault Marc Pérache
European MPI Users' Group Meeting, 2024

abstract

Abstract

The evolution of parallel computing architectures presents new challenges for developing efficient parallelized codes. The emergence of heterogeneous systems has given rise to multiple programming models, each requiring careful adaptation to maximize performance. In this context, we propose reevaluating memory layout designs for computational tasks within larger nodes by comparing various architectures. To gain insight into the performance discrepancies between shared memory and shared-address space settings, we systematically measure the bandwidth between cores and sockets using different methodologies. Our findings reveal significant differences in performance, suggesting that MPI running inside UNIX processes may not fully utilize its intranode bandwidth potential. In light of our work in the MPC thread-based MPI runtime, which can leverage shared memory to achieve higher performance due to its optimized layout, we advocate for enabling the use of shared memory within the MPI standard.

IO-SEA: Storage I/O and Data Management for Exascale Architectures
Daniel Medeiros Eric B. Gregory Philippe Couvee James Hawkes Sebastien Gougeaud Maike Gilliot Olivier Bressand Yoann Valeri Julien Jaeger Damien Chapon Frederic Bournaud Loı̈c Strafella Daniel Caviedes-Voullième Ghazal Tashakor Jolanta Zjupa Max Holicki Tom Ridley Yanik Müller Filipe Souza Mendes Guimarães Wolfgang Frings Jan-Oliver Mirus Ilya Zhukov Eric Rodrigues Borba Nafiseh Moti Reza Salkhordeh Nadia Derbey Salim Mimouni Simon Derr Buket Benek Gursoy James Grogan Radek Furmánek Martin Golasowski Kateřina Slaninová Jan Martinovič Jan Faltýnek Jenny Wong Metin Cakircali Tiago Quintino Simon Smart Olivier Iffrig Sai Narasimhamurthy Sonja Happ Michael Rauh Stephan Krempel Mark Wiggins Jiřı́ Nováček André Brinkmann Stefano Markidis Philippe Deniel
Proceedings of the 21st ACM International Conference on Computing Frontiers: Workshops and Special Sessions, Association for Computing Machinery, p. 94-100, 2024

abstract

Abstract

The new emerging scientific workloads to be executed in the upcoming exascale supercomputers face major challenges in terms of storage, given their extreme volume of data. In particular, intelligent data placement, instrumentation, and workflow handling are central to application performance. The IO-SEA project developed multiple solutions to aid the scientific community in adressing these challenges: a Workflow Manager, a hierarchical storage management system, and a semantic API for storage. All of these major products incorporate additional minor products that support their mission. In this paper, we discuss both the roles of all these products and how they can assist the scientific community in achieving exascale performance.

Generating and Scaling a Multi-Language Test-Suite for MPI
Julien Adam J.B. Besnard Paul Canat Sameer Shende Hugo Taboada Adrien Roussel Marc Pérache Julien Jaeger
EuroMPI'23, 2023

abstract

Abstract

High-Performance Computing (HPC) is currently facing significant challenges. The hardware pressure has become increasingly difficult to manage due to the lack of parallel abstractions in applications. As a result, parallel programs must undergo drastic evolution to effectively exploit underlying hardware parallelism. Failure to do so results in inefficient code. In this pressing environment, parallel runtimes play a critical role, and their esting becomes crucial. This paper focuses on the MPI interface and leverages the MPI binding tools to develop a multi-language test-suite for MPI. By doing so and building on previous work from the Forum’s document editors, we implement a systematic testing of MPI symbols in the context of the Parallel Computing Validation System (PCVS), which is an HPC validation platform dedicated to running and managing test-suites at scale. We first describe PCVS, then outline the process of generating the MPI API test suite, and finally, run these tests at scale. All data sets, code generators, and implementations are made available in open-source to the community. We also set up a dedicated website showcasing the results, which self-updates thanks to the Spack package manager.

Towards Achieving Transparent Malleability Thanks to MPI Process Virtualization
Hugo Taboada Romain Pereira Julien Jaeger J.B. Besnard
ISC High Performance 2023: High Performance Computing pp 28–41, 2023

abstract

Abstract

The field of High-Performance Computing is rapidly evolving, driven by the race for computing power and the emergence of new architectures. Despite these changes, the process of launching programs has remained largely unchanged, even with the rise of hybridization and accelerators. However, there is a need to express more complex deployments for parallel applications to enable more efficient use of these machines. In this paper, we propose a transparent way to express malleability within MPI applications. This process relies on MPI process virtualization, facilitated by a dedicated privatizing compiler and a user-level scheduler. With this framework, using the MPC thread-based MPI context, we demonstrate how code can mold its resources without any software changes, opening the door to transparent MPI malleability. After detailing the implementation and associated interface, we present performance results on representative applications.

A methodology for assessing computation/communication overlap of MPI nonblocking collectives
Alexandre Denis Julien Jaeger Emmanuel Jeannot Florian Reynier
Concurr. Comput. Pract. Exp., 2022

abstract

Abstract

By allowing computation/communication overlap, MPI nonblocking collectives (NBC) are supposed to improve application scalability and performance. However, it is known that to actually get overlap, the MPI library has to implement progression mechanisms in software or rely on the network hardware. These mechanisms may be present or not, adequate or perfectible, they may have an impact on communication performance or may interfere with computation by stealing CPU cycles. From a user point of view, assessing and understanding the behavior of an MPI library concerning computation/communication overlap is difficult. In this article, we propose a methodology to assess the computation/communication overlap of NBC. We propose new metrics to measure how much communication and computation do overlap, and to evaluate how they interfere with each other. We integrate these metrics into a complete methodology. We compare our methodology with state of the art metrics and benchmarks, and show that ours provides more meaningful informations. We perform experiments on a large panel of MPI implementations and network hardware and show when and why overlap is efficient, nonexistent or even degrades performance.

Towards leveraging collective performance with the support of MPI 4.0 features in MPC
Stéphane Bouhrour Thibaut Pepin Julien Jaeger
Parallel Comput., p. 102860, 2022

MPI detach - Towards automatic asynchronous local completion
Joachim Protze Marc-André Hermanns Matthias S. Müller Van Man Nguyen Julien Jaeger Emmanuelle Saillard Patrick Carribault Denis Barthou
Parallel Comput., p. 102859, 2022

One core dedicated to MPI nonblocking communication progression? A model to assess whether it is worth it
Alexandre Denis Julien Jaeger Emmanuel Jeannot Florian Reynier
22nd IEEE International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2022, Taormina, Italy, May 16-19, 2022, IEEE, p. 736-746, 2022

abstract

Abstract

Overlapping communications with computation is an efficient way to amortize the cost of communications of an HPC application. To do so, it is possible to utilize MPI nonblocking primitives so that communications run in back-ground alongside computation. However, these mechanisms rely on communications actually making progress in the background, which may not be true for all MPI libraries. Some MPI libraries leverage a core dedicated to communications to ensure communication progression. However, taking a core away from the application for such purpose may have a negative impact on the overall execution time. It may be difficult to know when such dedicated core is actually helpful. In this paper, we propose a model for the performance of applications using MPI nonblocking primitives running on top of an MPI library with a dedicated core for communications. This model is used to understand the compromise between computation slowdown due to the communication core not being available for computation, and the communication speed-up thanks to the dedicated core; evaluate whether nonblocking communication is actually obtaining the expected performance in the context of the given application; predict the performance of a given application if ran with a dedicated core. We describe the performance model and evaluate it on different applications. We compare the predictions of the model with actual executions.

Enabling Global MPI Process Addressing in MPI Applications
Jean-Baptiste Besnard Sameer Shende Allen D. Malony Julien Jaeger Marc Pérache
EuroMPI/USA'22: 29th European MPI Users' Group Meeting, Chattanooga, TN, USA, September 26 - 28, 2022, ACM, p. 27-36, 2022

Exploring Space-Time Trade-Off in Backtraces
Jean-Baptiste Besnard Julien Adam Allen D. Malony Sameer Shende Julien Jaeger Patrick Carribault Marc Pérache
Tools for High Performance Computing 2018 / 2019, Springer International Publishing, p. 151-168, 2021

abstract

Abstract

The backtrace is one of the most common operations done by profiling and debugging tools. It consists in determining the nesting of functions leading to the current execution state. Frameworks and standard libraries provide facilities enabling this operation, however, it generally incurs both computational and memory costs. Indeed, walking the stack up and then possibly resolving functions pointers (to function names) before storing them can lead to non-negligible costs. In this paper, we propose to explore a means of extracting optimized backtraces with an O(1) storage size by defining the notion of stack tags. We define a new data-structure that we called a hashed-trie used to encode stack traces at runtime through chained hashing. Our process called stack-tagging is implemented in a GCC plugin, enabling its use of C and C++ application. A library enabling the decoding of stack locators though both static and brute-force analysis is also presented. This work introduces a new manner of capturing execution state which greatly simplifies both extraction and storage which are important issues in parallel profiling.

Enhancing Load-Balancing of MPI Applications with Workshare
Thomas Dionisi Stéphane Bouhrour Julien Jaeger Patrick Carribault Marc Pérache
Proceedings of EuroPar 2021, 2021

Partitioned Collective Communication
Daniel J. Holmes Anthony Skjellum Julien Jaeger Ryan E. Grant Purushotham V. Bangalore Matthew G. F. Dosanjh Amanda Bienz Derek Schafer
Workshop on Exascale MPI, ExaMPI\@SC 2021, St. Louis, MO, USA, November 14, 2021, IEEE, p. 9-17, 2021

Preliminary Experience with OpenMP Memory Management Implementation
Adrien Roussel Patrick Carribault Julien Jaeger
OpenMP: Portable Multi-Level Parallelism on Modern Systems - 16th International Workshop on OpenMP, IWOMP 2020, Austin, TX, USA, September 22-24, 2020, Proceedings, Springer, p. 313-327, 2020

Implementation and performance evaluation of MPI persistent collectives in MPC: a case study
Stéphane Bouhrour Julien Jaeger
EuroMPI/USA '20: 27th European MPI Users' Group Meeting, Virtual Meeting, Austin, TX, USA, September 21-24, 2020, ACM, p. 51-60, 2020

Application-Driven Requirements for Node Resource Management in Next-Generation Systems
Edgar A. León Balazs Gerofi Julien Jaeger Guillaume Mercier Rolf Riesen Masamichi Takagi Brice Goglin
2020 IEEE/ACM International Workshop on Runtime and Operating Systems for Supercomputers, ROSS\@SC 2020, Atlanta, GA, USA, November 13, 2020, IEEE, p. 1-11, 2020

PARCOACH Extension for Static MPI Nonblocking and Persistent Communication Validation
Van Man Nguyen Emmanuelle Saillard Julien Jaeger Denis Barthou Patrick Carribault
4th IEEE/ACM International Workshop on Software Correctness for HPC Applications, Correctness\@SC 2020, Atlanta, GA, USA, November 11, 2020, IEEE, p. 31-39, 2020

Automatic Code Motion to Extend MPI Nonblocking Overlap Window
Van Man Nguyen Emmanuelle Saillard Julien Jaeger Denis Barthou Patrick Carribault
High Performance Computing - ISC High Performance 2020 International Workshops, Frankfurt, Germany, June 21-25, 2020, Revised Selected Papers, Springer, p. 43-54, 2020

Unifying the Analysis of Performance Event Streams at the Consumer Interface Level
Jean-Baptiste Besnard Allen D. Malony Sameer Shende Marc Pérache Patrick Carribault Julien Jaeger
Tools for High Performance Computing 2017, Springer International Publishing, p. 57-71, 2019

abstract

Abstract

Several instrumentation interfaces have been developed for parallel programs to make observable actions that take place during execution and to make accessible information about the program’s behavior and performance. Following in the footsteps of the successful profiling interface for MPI (PMPI), new rich interfaces to expose internal operation of MPI (MPI-T) and OpenMP (OMPT) runtimes are now in the standards. Taking advantage of these interfaces requires tools to selectively collect events from multiples interfaces by various techniques: function interposition (PMPI), value read (MPI-T), and callbacks (OMPT). In this paper, we present the unified instrumentation pipeline proposed by the MALP infrastructure that can be used to forward a variety of fine-grained events from multiple interfaces online to multi-threaded analysis processes implemented orthogonally with plugins. In essence, our contribution complements “front-end” instrumentation mechanisms by a generic “back-end” event consumption interface that allows “consumer” callbacks to generate performance measurements in various formats for analysis and transport. With such support, online and post-mortem cases become similar from an analysis point of view, making it possible to build more unified and consistent analysis frameworks. The paper describes the approach and demonstrates its benefits with several use cases.

Study on progress threads placement and dedicated cores for overlapping MPI nonblocking collectives on manycore processor
Alexandre Denis Julien Jaeger Emmanuel Jeannot Marc Pérache Hugo Taboada
Int. J. High Perform. Comput. Appl., 2019

Checkpoint/restart approaches for a thread-based MPI runtime
Julien Adam Maxime Kermarquer Jean-Baptiste Besnard Leonardo Bautista-Gomez Marc Pérache Patrick Carribault Julien Jaeger Allen D. Malony Sameer Shende
Parallel Comput., p. 204-219, 2019

Detecting Non-sibling Dependencies in OpenMP Task-Based Applications
Ricardo Bispo Vieira Antoine Capra Patrick Carribault Julien Jaeger Marc Pérache Adrien Roussel
OpenMP: Conquering the Full Hardware Spectrum - 15th International Workshop on OpenMP, IWOMP 2019, Auckland, New Zealand, September 11-13, 2019, Proceedings, Springer, p. 231-245, 2019

abstract

Abstract

The advent of the multicore era led to the duplication of functional units through an increasing number of cores. To exploit those processors, a shared-memory parallel programming model is one possible direction. Thus, OpenMP is a good candidate to enable different paradigms: data parallelism (including loop-based directives) and control parallelism, through the notion of tasks with dependencies. But this is the programmer responsibility to ensure that data dependencies are complete such as no data races may happen. It might be complex to guarantee that no issue will occur and that all dependencies have been correctly expressed in the context of nested tasks. This paper proposes an algorithm to detect the data dependencies that might be missing on the OpenMP task clauses between tasks that have been generated by different parents. This approach is implemented inside a tool relying on the OMPT interface.

Mixing ranks, tasks, progress and nonblocking collectives
Jean-Baptiste Besnard Julien Jaeger Allen D. Malony Sameer Shende Hugo Taboada Marc Pérache Patrick Carribault
Proceedings of the 26th European MPI Users’ Group Meeting, EuroMPI 2019, Zürich, Switzerland, September 11-13, 2019, ACM, p. 10:1-10:10, 2019

Exposition, clarification, and expansion of MPI semantic terms and conventions: is a nonblocking MPI function permitted to block?
Purushotham V. Bangalore Rolf Rabenseifner Daniel J. Holmes Julien Jaeger Guillaume Mercier Claudia Blaas-Schenner Anthony Skjellum
Proceedings of the 26th European MPI Users' Group Meeting, EuroMPI 2019, Zürich, Switzerland, September 11-13, 2019, ACM, p. 2:1-2:10, 2019

Dynamic Placement of Progress Thread for Overlapping MPI Non-blocking Collectives on Manycore Processor
Alexandre Denis Julien Jaeger Emmanuel Jeannot Marc Pérache Hugo Taboada
Euro-Par 2018: Parallel Processing - 24th International Conference on Parallel and Distributed Computing, Turin, Italy, August 27-31, 2018, Proceedings, Springer, p. 616-627, 2018

Efficient Communication/Computation Overlap with MPI+OpenMP Runtimes Collaboration
Marc Sergent Mario Dagrada Patrick Carribault Julien Jaeger Marc Pérache Guillaume Papauré
Euro-Par 2018: Parallel Processing - 24th International Conference on Parallel and Distributed Computing, Turin, Italy, August 27-31, 2018, Proceedings, Springer, p. 560-572, 2018

Transparent High-Speed Network Checkpoint/Restart in MPI
Julien Adam Jean-Baptiste Besnard Allen D. Malony Sameer Shende Marc Pérache Patrick Carribault Julien Jaeger
Proceedings of the 25th European MPI Users’ Group Meeting, Barcelona, Spain, September 23-26, 2018, ACM, p. 12:1-12:11, 2018

Progress Thread Placement for Overlapping MPI Non-blocking Collectives Using Simultaneous Multi-threading
Alexandre Denis Julien Jaeger Hugo Taboada
Euro-Par 2018: Parallel Processing Workshops - Euro-Par 2018 International Workshops, Turin, Italy, August 27-28, 2018, Revised Selected Papers, Springer, p. 123-133, 2018

Profile-guided scope-based data allocation method
Hugo Brunie Julien Jaeger Patrick Carribault Denis Barthou
Proceedings of the International Symposium on Memory Systems, MEMSYS 2018, Old Town Alexandria, VA, USA, October 01-04, 2018, ACM, p. 169-182, 2018

Towards a Better Expressiveness of the Speedup Metric in MPI Context
Jean-Baptiste Besnard Allen D. Malony Sameer Shende Marc Pérache Patrick Carribault Julien Jaeger
46th International Conference on Parallel Processing Workshops, ICPP Workshops 2017, Bristol, United Kingdom, August 14-17, 2017, IEEE Computer Society, p. 251-260, 2017

User Co-scheduling for MPI+OpenMP Applications Using OpenMP Semantics
Antoine Capra Patrick Carribault Jean-Baptiste Besnard Allen D. Malony Marc Pérache Julien Jaeger
Scaling OpenMP for Exascale Performance and Portability - 13th International Workshop on OpenMP, IWOMP 2017, Stony Brook, NY, USA, September 20-22, 2017, Proceedings, Springer, p. 203-216, 2017

Resource-Management Study in HPC Runtime-Stacking Context
Arthur Loussert Benoit Welterlen Patrick Carribault Julien Jaeger Marc Pérache Raymond Namyst
29th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2017, Campinas, Brazil, October 17-20, 2017, IEEE Computer Society, p. 177-184, 2017

Introducing Task-Containers as an Alternative to Runtime-Stacking
Jean-Baptiste Besnard Julien Adam Sameer Shende Marc Pérache Patrick Carribault Julien Jaeger
Proceedings of the 23rd European MPI Users’ Group Meeting, EuroMPI 2016, Edinburgh, United Kingdom, September 25-28, 2016, ACM, p. 51-63, 2016

Fine-grain data management directory for OpenMP 4.0 and OpenACC
Julien Jaeger Patrick Carribault Marc Pérache
Concurr. Comput. Pract. Exp., p. 1528-1539, 2015

An MPI Halo-Cell Implementation for Zero-Copy Abstraction
Jean-Baptiste Besnard Allen D. Malony Sameer Shende Marc Pérache Patrick Carribault Julien Jaeger
Proceedings of the 22nd European MPI Users’ Group Meeting, EuroMPI 2015, Bordeaux, France, September 21-23, 2015, ACM, p. 3:1-3:9, 2015

Correctness Analysis of MPI-3 Non-Blocking Communications in PARCOACH
Julien Jaeger Emmanuelle Saillard Patrick Carribault Denis Barthou
Proceedings of the 22nd European MPI Users' Group Meeting, EuroMPI 2015, Bordeaux, France, September 21-23, 2015, ACM, p. 16:1-16:2, 2015

Data-Management Directory for OpenMP 4.0 and OpenACC
Julien Jaeger Patrick Carribault Marc Pérache
Euro-Par 2013: Parallel Processing Workshops - BigDataCloud, DIHC, FedICI, HeteroPar, HiBB, LSDVE, MHPC, OMHI, PADABS, PROPER, Resilience, ROME, and UCHPC 2013, Aachen, Germany, August 26-27, 2013. Revised Selected Papers, Springer, p. 168-177, 2013

Binary Instrumentation for Scalable Performance Measurement of OpenMP Applications
Julien Jaeger Peter Philippen Eric Petit Andres Charif Rubial Christian Rössel William Jalby Bernd Mohr
Parallel Computing: Accelerating Computational Science and Engineering (CSE), Proceedings of the International Conference on Parallel Computing, ParCo 2013, 10-13 September 2013, Garching (near Munich), Germany, IOS Press, p. 783-792, 2013

Transformations source-à-source pour l'optimisation de codes irréguliers et multithreads. (Source-to-source transformations for irregular and multithreaded code optimization)
Julien Jaeger
Versailles Saint-Quentin-en-Yvelines University, France, 2012

Automatic efficient data layout for multithreaded stencil codes on CPU sand GPUs
Julien Jaeger Denis Barthou
19th International Conference on High Performance Computing, HiPC 2012, Pune, India, December 18-22, 2012, IEEE Computer Society, p. 1-10, 2012

Julien JAEGER

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract

Abstract