Hugo TABOADA

Hugo Taboada is a research engineer at CEA. He received his PhD in computer science from the University of Bordeaux in 2018 and joined CEA the same year. His work consists in helping applications benefit from the specific features of hardware architectures and runtime systems. His research topics include scheduling, thread placement, communication overlap, performance projection, and the interaction of domain-specific languages with runtimes. He is a member of the Technical Manager Board of the European project RED-SEA. He also participates in the MPI Forum, helping to shape the next version of the standard.

Hugo Taboada currently supervises two PhD students, an engineer on a fixed-term contract within the European RED-SEA project, and several interns. He has co-authored several publications in international conferences.

Generating and Scaling a Multi-Language Test-Suite for MPI
Julien Adam, J.B. Besnard, Paul Canat, Sameer Shende, Hugo Taboada, Adrien Roussel, Marc Pérache, Julien Jaeger
EuroMPI'23, 2023

Abstract

High-Performance Computing (HPC) is currently facing significant challenges. The hardware pressure has become increasingly difficult to manage due to the lack of parallel abstractions in applications. As a result, parallel programs must undergo drastic evolution to effectively exploit underlying hardware parallelism. Failure to do so results in inefficient code. In this pressing environment, parallel runtimes play a critical role, and their testing becomes crucial. This paper focuses on the MPI interface and leverages the MPI binding tools to develop a multi-language test-suite for MPI. By doing so and building on previous work from the Forum’s document editors, we implement systematic testing of MPI symbols in the context of the Parallel Computing Validation System (PCVS), which is an HPC validation platform dedicated to running and managing test-suites at scale. We first describe PCVS, then outline the process of generating the MPI API test suite, and finally run these tests at scale. All data sets, code generators, and implementations are made available to the community as open source. We also set up a dedicated website showcasing the results, which self-updates thanks to the Spack package manager.
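To give an idea of what a generated per-symbol test might look like, here is a minimal sketch in C exercising a single MPI symbol (MPI_Ibcast is chosen purely for illustration; the actual PCVS generators and generated tests are available in the project's open-source repositories and may be structured differently):

```c
/* Hypothetical per-symbol test: checks that MPI_Ibcast can be called and
 * completed, and reports success via the exit code. Illustrative only;
 * not taken from the PCVS-generated test-suite. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        value = 42;

    /* Symbol under test: MPI_Ibcast */
    MPI_Ibcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    int ok = (value == 42);
    MPI_Finalize();
    return ok ? EXIT_SUCCESS : EXIT_FAILURE;
}
```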

Towards Achieving Transparent Malleability Thanks to MPI Process Virtualization
Hugo Taboada, Romain Pereira, Julien Jaeger, J.B. Besnard
ISC High Performance 2023: High Performance Computing pp 28–41, 2023

Abstract

The field of High-Performance Computing is rapidly evolving, driven by the race for computing power and the emergence of new architectures. Despite these changes, the process of launching programs has remained largely unchanged, even with the rise of hybridization and accelerators. However, there is a need to express more complex deployments for parallel applications to enable more efficient use of these machines. In this paper, we propose a transparent way to express malleability within MPI applications. This process relies on MPI process virtualization, facilitated by a dedicated privatizing compiler and a user-level scheduler. With this framework, using the MPC thread-based MPI context, we demonstrate how code can mold its resources without any software changes, opening the door to transparent MPI malleability. After detailing the implementation and associated interface, we present performance results on representative applications.

Relative Performance Projection on Arm Architectures
Clément Gavoille, Hugo Taboada, Patrick Carribault, Fabrice Dupros, Brice Goglin, Emmanuel Jeannot
Euro-Par 2022: Parallel Processing - 28th International Conference on Parallel and Distributed Computing, Glasgow, UK, August 22-26, 2022, Proceedings, Springer, p. 85-99, 2022

Performance Improvements of Parallel Applications thanks to MPI-4.0 Hints
Maxim Moraru, Adrien Roussel, Hugo Taboada, Christophe Jaillet, Michael Krajecki, Marc Pérache
Proceedings of SBAC-PAD 2022, IEEE, 2022

Abstract

HPC systems have experienced significant growth over the past years, with modern machines having hundreds of thousands of nodes. Message Passing Interface (MPI) is the de facto standard for distributed computing on these architectures. On the MPI critical path, the message-matching process is one of the most time-consuming operations. In this process, searching for a specific request in a message queue represents a significant part of the communication latency. So far, no single algorithm performs well in all cases. This paper explores potential matching specializations thanks to hints introduced in the latest MPI 4.0 standard. We propose a hash-table-based algorithm that performs constant-time message matching for requests without wildcards. This approach is suitable for intensive point-to-point communication phases in many applications (more than 50% of CORAL benchmarks). We demonstrate that our approach can improve the overall execution time of real HPC applications by up to 25%. We also analyze the limitations of our method and propose a strategy for identifying the most suitable algorithm for a given application, applying machine learning techniques to classify applications depending on their message-pattern characteristics.
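For context, MPI 4.0 lets an application declare such constraints through predefined communicator info keys. The following sketch shows how an application might attach the no-wildcard assertions to a communicator; the key names come from the MPI 4.0 standard, while the surrounding code is illustrative and is not the paper's implementation (the specialized matching itself lives inside the MPI runtime):

```c
/* Sketch: telling the MPI library that no wildcard receives will be posted
 * on this communicator, which is the condition enabling specialized
 * (e.g. hash-table-based) message matching. Illustrative only. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm comm;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* Predefined MPI 4.0 assertions: no MPI_ANY_SOURCE / MPI_ANY_TAG receives. */
    MPI_Info_set(info, "mpi_assert_no_any_source", "true");
    MPI_Info_set(info, "mpi_assert_no_any_tag", "true");

    /* Attach the hints to a dedicated communicator used for the
     * point-to-point-intensive phase of the application. */
    MPI_Comm_dup_with_info(MPI_COMM_WORLD, info, &comm);
    MPI_Info_free(&info);

    /* ... point-to-point communication on 'comm' without wildcards ... */

    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}
```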

Benefits of MPI Sessions for GPU MPI applications
Maxim Moraru, Adrien Roussel, Hugo Taboada, Christophe Jaillet, Michael Krajecki, Marc Pérache
Proceedings of EuroMPI 2021, 2021

Abstract

Heterogeneous supercomputers are now considered the most valuable solution to reach Exascale. Nowadays, compute nodes are frequently composed of more than one GPU accelerator. Programming such architectures efficiently is challenging. MPI is the de facto standard for distributed computing. CUDA-aware libraries were introduced to ease GPU inter-node communications. However, they induce some overhead that can degrade overall performance. The MPI 4.0 specification draft introduces the MPI Sessions model, which offers the ability to initialize specific resources for a specific component of the application. In this paper, we present a way to reduce the overhead induced by CUDA-aware libraries with a solution inspired by MPI Sessions. In this way, we minimize the overhead induced by GPUs in an MPI context and improve the efficiency of CPU + GPU programs. We evaluate our approach on various micro-benchmarks and some proxy applications like Lulesh, MiniFE, Quicksilver, and Cloverleaf. We demonstrate how this approach can provide up to a 7x speedup compared to the standard MPI model.
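For reference, the MPI Sessions model standardized in MPI 4.0 lets each component initialize only the resources it needs instead of relying on a global MPI_Init. A minimal sketch of the standard Sessions API is shown below; the string tag is a made-up example, and this is not the paper's GPU-specific, Sessions-inspired solution:

```c
/* Sketch of the MPI 4.0 Sessions API: a component creates its own session
 * and derives a communicator from the "mpi://WORLD" process set rather than
 * using MPI_Init/MPI_COMM_WORLD. Illustrative only. */
#include <mpi.h>

int main(void)
{
    MPI_Session session;
    MPI_Group group;
    MPI_Comm comm;

    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
    MPI_Comm_create_from_group(group, "org.example.gpu-component",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm);
    MPI_Group_free(&group);

    /* ... component-local communication on 'comm' ... */

    MPI_Comm_free(&comm);
    MPI_Session_finalize(&session);
    return 0;
}
```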

Study on progress threads placement and dedicated cores for overlapping MPI nonblocking collectives on manycore processor
Alexandre Denis, Julien Jaeger, Emmanuel Jeannot, Marc Pérache, Hugo Taboada
Int. J. High Perform. Comput. Appl., 2019

Mixing ranks, tasks, progress and nonblocking collectives
Jean-Baptiste Besnard, Julien Jaeger, Allen D. Malony, Sameer Shende, Hugo Taboada, Marc Pérache, Patrick Carribault
Proceedings of the 26th European MPI Users’ Group Meeting, EuroMPI 2019, Zürich, Switzerland, September 11-13, 2019, ACM, p. 10:1-10:10, 2019

Dynamic Placement of Progress Thread for Overlapping MPI Non-blocking Collectives on Manycore Processor
Alexandre Denis, Julien Jaeger, Emmanuel Jeannot, Marc Pérache, Hugo Taboada
Euro-Par 2018: Parallel Processing - 24th International Conference on Parallel and Distributed Computing, Turin, Italy, August 27-31, 2018, Proceedings, Springer, p. 616-627, 2018

Progress Thread Placement for Overlapping MPI Non-blocking Collectives Using Simultaneous Multi-threading
Alexandre Denis, Julien Jaeger, Hugo Taboada
Euro-Par 2018: Parallel Processing Workshops - Euro-Par 2018 International Workshops, Turin, Italy, August 27-28, 2018, Revised Selected Papers, Springer, p. 123-133, 2018