
Hugo Taboada is a research scientist at CEA. He received his Ph.D. in computer science from the University of Bordeaux in 2018, and he joined CEA the same year. He is now working to help applications benefit from architecture and runtime specificities. His topics of interest are scheduling, thread placement, communication overlap, performance projection, and Domain Specific Language interaction with MPI runtimes. He also participates in the MPI Forum, helping to design the next MPI standard. He is a member of the Technical Manager Board of the European Project RED-SEA.

Hugo Taboada is currently supervising two Ph.D. theses and one engineer working on the RED-SEA European project. He has already supervised several interns. He is a co-author of several research papers presented at international conferences.

Generating and Scaling a Multi-Language Test-Suite for MPI
Julien Adam   J.B. Besnard   Paul Canat   Sameer Shende   Hugo Taboada   Adrien Roussel   Marc Pérache   Julien Jaeger  
EuroMPI'23, 2023


High-Performance Computing (HPC) is currently facing significant challenges. The hardware pressure has become increasingly difficult to manage due to the lack of parallel abstractions in applications. As a result, parallel programs must undergo drastic evolution to effectively exploit underlying hardware parallelism. Failure to do so results in inefficient code. In this pressing environment, parallel runtimes play a critical role, and their testing becomes crucial. This paper focuses on the MPI interface and leverages the MPI binding tools to develop a multi-language test-suite for MPI. By doing so and building on previous work from the Forum’s document editors, we implement systematic testing of MPI symbols in the context of the Parallel Computing Validation System (PCVS), which is an HPC validation platform dedicated to running and managing test-suites at scale. We first describe PCVS, then outline the process of generating the MPI API test suite, and finally, run these tests at scale. All data sets, code generators, and implementations are made available as open source to the community. We also set up a dedicated website showcasing the results, which self-updates thanks to the Spack package manager.

Towards Achieving Transparent Malleability Thanks to MPI Process Virtualization
Hugo Taboada   Romain Pereira   Julien Jaeger   J.B. Besnard  
ISC High Performance 2023: High Performance Computing pp 28–41, 2023


The field of High-Performance Computing is rapidly evolving, driven by the race for computing power and the emergence of new architectures. Despite these changes, the process of launching programs has remained largely unchanged, even with the rise of hybridization and accelerators. However, there is a need to express more complex deployments for parallel applications to enable more efficient use of these machines. In this paper, we propose a transparent way to express malleability within MPI applications. This process relies on MPI process virtualization, facilitated by a dedicated privatizing compiler and a user-level scheduler. With this framework, using the MPC thread-based MPI context, we demonstrate how code can mold its resources without any software changes, opening the door to transparent MPI malleability. After detailing the implementation and associated interface, we present performance results on representative applications.

Relative Performance Projection on Arm Architectures
Clément Gavoille   Hugo Taboada   Patrick Carribault   Fabrice Dupros   Brice Goglin   Emmanuel Jeannot  
Euro-Par 2022: Parallel Processing - 28th International Conference on Parallel and Distributed Computing, Glasgow, UK, August 22-26, 2022, Proceedings, Springer, p. 85-99, 2022

Performance Improvements of Parallel Applications thanks to MPI-4.0 Hints
Maxim Moraru   Adrien Roussel   Hugo Taboada   Christophe Jaillet   Michael Krajecki   Marc Pérache  
Proceedings of SBAC-PAD 2022, IEEE, 2022


HPC systems have experienced significant growth over the past years, with modern machines having hundreds of thousands of nodes. Message Passing Interface (MPI) is the de facto standard for distributed computing on these architectures. On the MPI critical path, the message-matching process is one of the most time-consuming operations. In this process, searching for a specific request in a message queue represents a significant part of the communication latency. So far, no single algorithm performs well in all cases. This paper explores potential matching specializations thanks to hints introduced in the latest MPI 4.0 standard. We propose a hash-table-based algorithm that performs constant-time message matching for requests that use no wildcards. This approach is suitable for intensive point-to-point communication phases in many applications (more than 50% of CORAL benchmarks). We demonstrate that our approach can improve the overall execution time of real HPC applications by up to 25%. We also analyze the limitations of our method and propose a strategy for identifying the most suitable algorithm for a given application: we apply machine learning techniques to classify applications according to their message pattern characteristics.

Benefits of MPI Sessions for GPU MPI applications
Maxim Moraru   Adrien Roussel   Hugo Taboada   Christophe Jaillet   Michael Krajecki   Marc Pérache  
Proceedings of EuroMPI 2021, 2021


Heterogeneous supercomputers are now considered the most valuable solution to reach the Exascale. Nowadays, we can frequently observe that compute nodes are composed of more than one GPU accelerator. Programming such architectures efficiently is challenging. MPI is the de facto standard for distributed computing. CUDA-aware libraries were introduced to ease inter-node GPU communications. However, they induce some overhead that can degrade overall performance. The MPI 4.0 Specification draft introduces the MPI Sessions model, which offers the ability to initialize specific resources for a specific component of the application. In this paper, we present a way to reduce the overhead induced by CUDA-aware libraries with a solution inspired by MPI Sessions. In this way, we minimize the overhead induced by GPUs in an MPI context and improve the efficiency of CPU + GPU programs. We evaluate our approach on various micro-benchmarks and some proxy applications like Lulesh, MiniFE, Quicksilver, and Cloverleaf. We demonstrate how this approach can provide up to a 7x speedup compared to the standard MPI model.

Study on progress threads placement and dedicated cores for overlapping MPI nonblocking collectives on manycore processor
Alexandre Denis   Julien Jaeger   Emmanuel Jeannot   Marc Pérache   Hugo Taboada  
Int. J. High Perform. Comput. Appl., 2019

Mixing ranks, tasks, progress and nonblocking collectives
Jean-Baptiste Besnard   Julien Jaeger   Allen D. Malony   Sameer Shende   Hugo Taboada   Marc Pérache   Patrick Carribault  
Proceedings of the 26th European MPI Users’ Group Meeting, EuroMPI 2019, Zürich, Switzerland, September 11-13, 2019, ACM, p. 10:1-10:10, 2019

Dynamic Placement of Progress Thread for Overlapping MPI Non-blocking Collectives on Manycore Processor
Alexandre Denis   Julien Jaeger   Emmanuel Jeannot   Marc Pérache   Hugo Taboada  
Euro-Par 2018: Parallel Processing - 24th International Conference on Parallel and Distributed Computing, Turin, Italy, August 27-31, 2018, Proceedings, Springer, p. 616-627, 2018

Progress Thread Placement for Overlapping MPI Non-blocking Collectives Using Simultaneous Multi-threading
Alexandre Denis   Julien Jaeger   Hugo Taboada  
Euro-Par 2018: Parallel Processing Workshops - Euro-Par 2018 International Workshops, Turin, Italy, August 27-28, 2018, Revised Selected Papers, Springer, p. 123-133, 2018