LiHPC | Laboratory for High Performance Computing and Simulation

Florian Reynier joined the CEA in 2022 after defending a PhD thesis in computer science at the University of Bordeaux, in partnership with CEA and Inria Bordeaux sud-ouest.
He is now research engineer at the LIHPC, his research focuses on HPC, and more specifically on the optimization of parallel programming models such as MPI and GPU programming.

A methodology for assessing computation/communication overlap of MPI nonblocking collectives
Alexandre Denis Julien Jaeger Emmanuel Jeannot Florian Reynier
Concurr. Comput. Pract. Exp., 2022

abstract

Abstract

By allowing computation/communication overlap, MPI nonblocking collectives (NBC) are supposed to improve application scalability and performance. However, it is known that to actually get overlap, the MPI library has to implement progression mechanisms in software or rely on the network hardware. These mechanisms may be present or not, adequate or perfectible, they may have an impact on communication performance or may interfere with computation by stealing CPU cycles. From a user point of view, assessing and understanding the behavior of an MPI library concerning computation/communication overlap is difficult. In this article, we propose a methodology to assess the computation/communication overlap of NBC. We propose new metrics to measure how much communication and computation do overlap, and to evaluate how they interfere with each other. We integrate these metrics into a complete methodology. We compare our methodology with state of the art metrics and benchmarks, and show that ours provides more meaningful informations. We perform experiments on a large panel of MPI implementations and network hardware and show when and why overlap is efficient, nonexistent or even degrades performance.

One core dedicated to MPI nonblocking communication progression? A model to assess whether it is worth it
Alexandre Denis Julien Jaeger Emmanuel Jeannot Florian Reynier
22nd IEEE International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2022, Taormina, Italy, May 16-19, 2022, IEEE, p. 736-746, 2022

abstract

Abstract

Overlapping communications with computation is an efficient way to amortize the cost of communications of an HPC application. To do so, it is possible to utilize MPI nonblocking primitives so that communications run in back-ground alongside computation. However, these mechanisms rely on communications actually making progress in the background, which may not be true for all MPI libraries. Some MPI libraries leverage a core dedicated to communications to ensure communication progression. However, taking a core away from the application for such purpose may have a negative impact on the overall execution time. It may be difficult to know when such dedicated core is actually helpful. In this paper, we propose a model for the performance of applications using MPI nonblocking primitives running on top of an MPI library with a dedicated core for communications. This model is used to understand the compromise between computation slowdown due to the communication core not being available for computation, and the communication speed-up thanks to the dedicated core; evaluate whether nonblocking communication is actually obtaining the expected performance in the context of the given application; predict the performance of a given application if ran with a dedicated core. We describe the performance model and evaluate it on different applications. We compare the predictions of the model with actual executions.

Étude sur la progression des communications MPI à base de ressources dédiées
Florian Reynier
Thèse de Doctorat de l'Université de Bordeaux, 2022

abstract

Abstract

De nos jours, MPI est de facto le standard pour la programmation à mémoire distribuée pour les supercalculateurs. Les communications non bloquantes sont un des modèles proposés par le standard MPI. Ces opérations peuvent être utilisées pour recouvrir les communications avec du calcul (ou d’autres communications) afin d’amortir leurs coûts. Cependant, pour être utilisées efficacement, ces opérations nécessitent une progression asynchrone pouvant régulièrement utiliser un montant non négligeable de ressources de calcul (particulièrement les collectives non bloquantes). De plus, partager les ressources de calcul avec l’application peut provoquer un ralentissement global. Les mécanismes utilisés pour cette progression asynchrone parviennent difficilement à concilier un bon recouvrement en gardant un impact minimal sur l’application, ce qui raréfie leur utilisation. Afin de résoudre ces différents problèmes, nous avons suivi plusieurs étapes. Premièrement, nous proposons une étude approfondie de la progression asynchrone dans les implémentations MPI, en utilisant de nouvelles métriques se concentrant sur l’évaluation des mécanismes de progression et de leur impact sur le système global. Après avoir exposé les faiblesses de ces implémentations MPI, nous proposons une nouvelle solution pour la progression des collectives non bloquantes en utilisant des coeurs dédiés combinés à des algorithmes de collectives basés sur des évènements. Nous avons mesuré l’efficacité de cette solution en utilisant nos métriques, pour nous comparer avec les implémentations MPI étudiées dans la première étape. Enfin, nous avons développé un modèle permettant de prédire le gain potentiel et le surcout induit par l’utilisation d’opérations non bloquantes avec des coeurs dédiés. Ce modèle peut être utilisé pour évaluer l’utilité de transformer une application basée sur des opérations bloquantes en opérations non bloquantes pour bénéficier du recouvrement. Nous évaluons ce modèle sur plusieurs benchmarks.

Florian REYNIER

Abstract

Abstract

Abstract