
Hugo TABOADA


Hugo Taboada is a research scientist at CEA. He received his Ph.D. in computer science from the University of Bordeaux in 2018 and joined CEA the same year. He now works on helping applications benefit from the specific features of architectures and runtimes. His topics of interest are scheduling, thread placement, communication overlap, performance projection, and the interaction of Domain-Specific Languages with MPI runtimes. He also participates in the MPI Forum, helping to design the next version of the MPI standard, and is a member of the Technical Management Board of the European project RED-SEA.

Hugo Taboada is currently supervising two Ph.D. theses and one engineer working on the RED-SEA European project. He has also supervised several interns and has co-authored several research papers published at international conferences.

Predicting GPU Kernel's Performance on Upcoming Architectures
Lucas Van Lanker   Hugo Taboada   Elisabeth Brunet   François Trahay  
Euro-Par 2024: Parallel Processing, Springer Nature Switzerland, p. 77-90, 2024

Abstract

With the advent of heterogeneous systems that combine CPUs and GPUs, designing a supercomputer becomes more and more complex. The hardware characteristics of GPUs significantly impact performance. Choosing the GPU that will maximize performance for a limited budget is tedious because it requires predicting performance on hardware that does not yet exist.

RED-SEA Project: Towards a new-generation European interconnect
Maria Engracia Gomez   Julio Sahuquillo   Andrea Biagioni   Nikos Chrysos   Damien Berton   Ottorino Frezza   Francesca Lo Cicero   Alessandro Lonardo   Michele Martinelli   Pier Stanislao Paolucci   Elena Pastorelli   Francesco Simula   Matteo Turisini   Piero Vicini   Roberto Ammendola   Carlotta Chiarini   Chiara De Luca   Fabrizio Capuani   Adrián Castelló   Jose Duro   Eugenio Stabile   Enrique Quintana   Pascale Bernier-Bruna   Claire Chen   Pierre-Axel Lagadec   Gregoire Pichon   Etienne Walter   Manolis Katevenis   Sokratis Bartzis   Orestis Mousouros   Pantelis Xirouchakis   Vangelis Mageiropoulos   Michalis Gianioudis   Harisis Loukas   Aggelos Ioannou   Nikos Kallimanis   Miguel Sanchez de la Rosa   Gabriel Gomez-Lopez   Francisco Alfaro-Cortés   Jesus Escudero Sahuquillo   Pedro Javier Garcia   Francisco J. Quiles   Jose L. Sanchez   Gaetan De Gassowski   Matthieu Hautreaux   Stephane Mathieu   Gilles Moreau   Marc Pérache   Hugo Taboada   Torsten Hoefler   Timo Schneider   Matteo Barnaba   Giuseppe Piero Brandino   Francesco De Giorgi   Matteo Poggi   Iakovos Mavroidis   Yannis Papaefstathiou   Nikolaos Tampouratzis   Benjamin Kalisch   Ulrich Krackhardt   Mondrian Nuessle   Wolfgang Frings   Dominik Gottwald   Filipe Guimaraes   Max Holicki   Volker Marx   Yannik Muller   Carsten Clauss   Hugo Falter   Xu Huang   Jennifer Lopez Barillao   Thomas Moschny   Simon Pickartz
Microprocessors and Microsystems, Volume 110, October 2024, 105102, 2024

Abstract

RED-SEA is an H2020 EuroHPC project whose main objective is to prepare a new-generation European interconnect, capable of powering the EU Exascale systems to come, through an economically viable and technologically efficient interconnect, leveraging European interconnect technology (BXI) combined with standard and mature technology (Ethernet), previous EU-funded initiatives, as well as open standards and compatible APIs. To achieve this objective, the RED-SEA project is organized around four key pillars: (i) co-design of the network architecture with workload requirements, aiming at optimizing the fit with the other EuroHPC projects and with the EPI processors; (ii) development of a high-performance, low-latency, seamless bridge with Ethernet; (iii) efficient network resource management, including congestion control and Quality-of-Service; and (iv) end-to-end functions implemented at the network edges. This paper presents key achievements and results at the midterm of the project for each key pillar on the way to reaching the final project objective. In this regard, we can highlight: (i) the definition of the network requirements and architecture, as well as a list of benchmarks and applications; (ii) in addition to progress on the initially planned IPs, the evolution of the BXI3 architecture to natively support Ethernet at a low level, resulting in reduced complexity, with advantages in terms of cost and power consumption; (iii) the congestion characterization of target applications and proposals to reduce this congestion through the optimization of collective communication primitives, injection throttling, and adaptive routing; and (iv) the low-latency, high-message-rate endpoint functions and their connection with new open technologies.

Generating and Scaling a Multi-Language Test-Suite for MPI
Julien Adam   J.B. Besnard   Paul Canat   Sameer Shende   Hugo Taboada   Adrien Roussel   Marc Pérache   Julien Jaeger  
EuroMPI'23, 2023

Abstract

High-Performance Computing (HPC) is currently facing significant challenges. The hardware pressure has become increasingly difficult to manage due to the lack of parallel abstractions in applications. As a result, parallel programs must undergo drastic evolution to effectively exploit underlying hardware parallelism. Failure to do so results in inefficient code. In this pressing environment, parallel runtimes play a critical role, and their testing becomes crucial. This paper focuses on the MPI interface and leverages the MPI binding tools to develop a multi-language test-suite for MPI. By doing so and building on previous work from the Forum's document editors, we implement systematic testing of MPI symbols in the context of the Parallel Computing Validation System (PCVS), which is an HPC validation platform dedicated to running and managing test-suites at scale. We first describe PCVS, then outline the process of generating the MPI API test suite, and finally run these tests at scale. All data sets, code generators, and implementations are made available as open source to the community. We also set up a dedicated website showcasing the results, which self-updates thanks to the Spack package manager.
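
To illustrate the kind of per-symbol check such a generator might emit, here is a hypothetical, minimal C test of MPI_Allreduce. It is not taken from the actual PCVS test-suite; it is only a sketch of the pattern: initialize, exercise one symbol, validate the result so the harness can detect failures.

```c
/* Hypothetical per-symbol MPI test, in the spirit of the generated suite
 * described above; not taken from the actual PCVS output. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Exercise the symbol under test: MPI_Allreduce. */
    int local = rank, sum = -1;
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Validate the result so the harness can flag failures. */
    int expected = size * (size - 1) / 2;
    if (sum != expected) {
        fprintf(stderr, "rank %d: got %d, expected %d\n", rank, sum, expected);
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }

    MPI_Finalize();
    return EXIT_SUCCESS;
}
```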

Towards Achieving Transparent Malleability Thanks to MPI Process Virtualization
Hugo Taboada   Romain Pereira   Julien Jaeger   J.B. Besnard  
ISC High Performance 2023: High Performance Computing, p. 28-41, 2023

Abstract

The field of High-Performance Computing is rapidly evolving, driven by the race for computing power and the emergence of new architectures. Despite these changes, the process of launching programs has remained largely unchanged, even with the rise of hybridization and accelerators. However, there is a need to express more complex deployments for parallel applications to enable more efficient use of these machines. In this paper, we propose a transparent way to express malleability within MPI applications. This process relies on MPI process virtualization, facilitated by a dedicated privatizing compiler and a user-level scheduler. With this framework, using the MPC thread-based MPI context, we demonstrate how code can mold its resources without any software changes, opening the door to transparent MPI malleability. After detailing the implementation and associated interface, we present performance results on representative applications.

Relative Performance Projection on Arm Architectures
Clément Gavoille   Hugo Taboada   Patrick Carribault   Fabrice Dupros   Brice Goglin   Emmanuel Jeannot  
Euro-Par 2022: Parallel Processing - 28th International Conference on Parallel and Distributed Computing, Glasgow, UK, August 22-26, 2022, Proceedings, Springer, p. 85-99, 2022

Performance Improvements of Parallel Applications thanks to MPI-4.0 Hints
Maxim Moraru   Adrien Roussel   Hugo Taboada   Christophe Jaillet   Michael Krajecki   Marc Pérache  
Proceedings of SBAC-PAD 2022, IEEE, 2022

Abstract

HPC systems have experienced significant growth over the past years, with modern machines having hundreds of thousands of nodes. Message Passing Interface (MPI) is the de facto standard for distributed computing on these architectures. On the MPI critical path, the message-matching process is one of the most time-consuming operations. In this process, searching for a specific request in a message queue represents a significant part of the communication latency. So far, no single algorithm performs well in all cases. This paper explores potential matching specializations thanks to hints introduced in the latest MPI 4.0 standard. We propose a hash-table-based algorithm that performs constant-time message matching for non-wildcard requests. This approach is suitable for intensive point-to-point communication phases in many applications (more than 50% of CORAL benchmarks). We demonstrate that our approach can improve the overall execution time of real HPC applications by up to 25%. We also analyze the limitations of our method and propose a strategy for identifying the most suitable algorithm for a given application. To this end, we apply machine learning techniques to classify applications according to their message-pattern characteristics.
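
For reference, the MPI 4.0 hints mentioned in the abstract are passed as predefined info keys on a communicator. Below is a minimal usage sketch, assuming an application that never posts wildcard receives; it is illustrative only and not the matching implementation proposed in the paper.

```c
/* Sketch: advertising the absence of wildcard receives through MPI 4.0
 * communicator info assertions, so the library may specialize its
 * message matching. Illustrative only, not the paper's implementation. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    /* MPI 4.0 predefined assertions: no MPI_ANY_SOURCE / MPI_ANY_TAG
     * will be used on the resulting communicator. */
    MPI_Info_set(info, "mpi_assert_no_any_source", "true");
    MPI_Info_set(info, "mpi_assert_no_any_tag", "true");

    MPI_Comm comm;
    MPI_Comm_dup_with_info(MPI_COMM_WORLD, info, &comm);
    MPI_Info_free(&info);

    /* Point-to-point traffic on 'comm' can now be matched with a
     * specialized (e.g. hash-table-based) algorithm. */
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int token = rank;
    if (size >= 2) {
        if (rank == 0)
            MPI_Send(&token, 1, MPI_INT, 1, 0, comm);
        else if (rank == 1)
            MPI_Recv(&token, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
    }

    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}
```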

Benefits of MPI Sessions for GPU MPI applications
Maxim Moraru   Adrien Roussel   Hugo Taboada   Christophe Jaillet   Michael Krajecki   Marc Pérache  
Proceedings of EuroMPI 2021, 2021

Abstract

Heterogeneous supercomputers are now considered the most valuable solution to reach Exascale. Compute nodes are now frequently composed of more than one GPU accelerator. Programming such architectures efficiently is challenging. MPI is the de facto standard for distributed computing. CUDA-aware libraries were introduced to ease inter-node GPU communications. However, they induce some overhead that can degrade overall performance. The MPI 4.0 specification draft introduces the MPI Sessions model, which offers the ability to initialize specific resources for a specific component of the application. In this paper, we present a way to reduce the overhead induced by CUDA-aware libraries with a solution inspired by MPI Sessions. In this way, we minimize the overhead induced by GPUs in an MPI context and improve the efficiency of CPU + GPU programs. We evaluate our approach on various micro-benchmarks and some proxy applications like Lulesh, MiniFE, Quicksilver, and Cloverleaf. We demonstrate how this approach can provide up to a 7x speedup compared to the standard MPI model.
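
For context, the Sessions model that inspires this work lets each component of an application acquire its own MPI resources independently of MPI_Init. The sketch below shows only the standard MPI 4.0 Sessions calls; the GPU-specific resource selection proposed in the paper is not part of this snippet, and the string tag is an arbitrary example.

```c
/* Sketch of the MPI 4.0 Sessions model this work builds on: a component
 * creates its own communicator from a process set, without MPI_Init.
 * Illustrative only; the GPU-specific resource handling from the paper
 * is not shown here. */
#include <mpi.h>

int main(void)
{
    MPI_Session session;
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);

    /* Build a group from the predefined "world" process set. */
    MPI_Group group;
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);

    /* Derive a communicator dedicated to this component; the string tag
     * ("org.example.component") is an arbitrary, illustrative identifier. */
    MPI_Comm comm;
    MPI_Comm_create_from_group(group, "org.example.component",
                               MPI_INFO_NULL, MPI_ERRORS_RETURN, &comm);
    MPI_Group_free(&group);

    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    MPI_Comm_free(&comm);
    MPI_Session_finalize(&session);
    return 0;
}
```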

Study on progress threads placement and dedicated cores for overlapping MPI nonblocking collectives on manycore processor
Alexandre Denis   Julien Jaeger   Emmanuel Jeannot   Marc Pérache   Hugo Taboada  
Int. J. High Perform. Comput. Appl., 2019

Mixing ranks, tasks, progress and nonblocking collectives
Jean-Baptiste Besnard   Julien Jaeger   Allen D. Malony   Sameer Shende   Hugo Taboada   Marc Pérache   Patrick Carribault  
Proceedings of the 26th European MPI Users’ Group Meeting, EuroMPI 2019, Zürich, Switzerland, September 11-13, 2019, ACM, p. 10:1-10:10, 2019

Dynamic Placement of Progress Thread for Overlapping MPI Non-blocking Collectives on Manycore Processor
Alexandre Denis   Julien Jaeger   Emmanuel Jeannot   Marc Pérache   Hugo Taboada  
Euro-Par 2018: Parallel Processing - 24th International Conference on Parallel and Distributed Computing, Turin, Italy, August 27-31, 2018, Proceedings, Springer, p. 616-627, 2018

Progress Thread Placement for Overlapping MPI Non-blocking Collectives Using Simultaneous Multi-threading
Alexandre Denis   Julien Jaeger   Hugo Taboada  
Euro-Par 2018: Parallel Processing Workshops - Euro-Par 2018 International Workshops, Turin, Italy, August 27-28, 2018, Revised Selected Papers, Springer, p. 123-133, 2018