Philippe Deniel is a research engineer and CEA Expert Fellow. He graduated from Ecole Centrale Paris in 1996 and holds a PhD in Computer Science. From 2015 to 2023 he led the teams in charge of storage systems. His research interests include massive storage for HPC, the system integration of Quantum Computing into HPC, and HPC/QC hybridization.
Proceedings of the 21st ACM International Conference on Computing Frontiers: Workshops and Special Sessions, Association for Computing Machinery, p. 94-100, 2024
Abstract
The new scientific workloads to be executed on the upcoming exascale supercomputers face major storage challenges, given their extreme volumes of data. In particular, intelligent data placement, instrumentation, and workflow handling are central to application performance. The IO-SEA project developed multiple solutions to help the scientific community in addressing these challenges: a Workflow Manager, a hierarchical storage management system, and a semantic API for storage. All of these major products incorporate additional minor products that support their mission. In this paper, we discuss both the roles of all these products and how they can assist the scientific community in achieving exascale performance.
PhD Thesis, Université Paris-Saclay, 2023
Abstract
This thesis presents NFS-Ganesha, a user-space NFS server for HPC, and its evolution from its creation in the early 2000s to the current Exascale era. Originally created for operational needs related to the administration of large storage systems, NFS-Ganesha was designed to be generic and parallelized. The joint emergence of parallel file systems, which gave rise to "data-centric" computing-center architectures, and of the NFSv4 protocol drove the evolution of NFS-Ganesha into a generic NFS server able to interface with many backends. The evolution of NFSv4, in the form of NFSv4.1 and the pNFS protocol, turned NFS-Ganesha into a standard adopted by a strong open-source community involving both researchers and industry. NFS-Ganesha was used to implement the IO-Proxy feature and to create new related parallel protocols. Through its involvement in European R&D projects, NFS-Ganesha also served to implement the ephemeral-server feature in order to meet Exascale requirements.
SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, 2023
Abstract
HPC application developers and administrators need to understand the complex interplay between compute clusters and storage systems to make effective optimization decisions. Ad hoc investigations of this interplay based on isolated case studies can lead to conclusions that are incorrect or difficult to generalize. The I/O Trace Initiative aims to improve the scientific community’s understanding of I/O operations by building a searchable collaborative archive of I/O traces from a wide range of applications and machines, with a focus on high-performance computing and scalable AI/ML. This initiative advances the accessibility of I/O trace data by enabling users to locate and compare traces based on user-specified criteria. It also provides a visual analytics platform for in-depth analysis, paving the way for the development of advanced performance optimization techniques. By acting as a hub for trace data, the initiative fosters collaborative research by encouraging data sharing and collective learning.
Zenodo, 2022
Abstract
This document feeds research and development priorities developed by the European HPC ecosystem into EuroHPC’s Research and Innovation Advisory Group, with an aim to define the HPC Technology research Work Programme and the calls for proposals included in it and to be launched from 2023 to 2026. This SRA also describes the major trends in the deployment of HPC and HPDA methods and systems, driven by economic and societal needs in Europe, taking into account the changes expected in the technologies and architectures of the expanding underlying IT infrastructure. The goal is to draw a complete picture of the state of the art and the challenges for the next three to four years rather than to focus on specific technologies, implementations or solutions.
CHEOPS '21: Proceedings of the Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems, p. 1-9, 2021
Abstract
The emergence of Exascale machines in HPC will have the foreseen consequence of putting more pressure on the storage systems in place, not only in terms of capacity but also bandwidth and latency. With a limited budget we cannot imagine using only storage class memory, which leads to the use of a heterogeneous tiered storage hierarchy. In order to make the most efficient use of the high performance tier in this storage hierarchy, we need to be able to place user data on the right tier and at the right time. In this paper, we assume a 2-tier storage hierarchy with a high performance tier and a high capacity archival tier. Files are placed on the high performance tier at creation time and moved to the capacity tier once their lifetime expires, that is, once they are no longer accessed. The main contribution of this paper lies in the design of a file lifetime prediction model based solely on the file path, using a Convolutional Neural Network. Results show that our solution strikes a good trade-off between accuracy and under-estimation. Compared to previous work, our model reaches a comparable accuracy (around 98.60% compared to 98.84%) while reducing underestimations by almost 10x, down to 2.21% (compared to 21.86%). The reduction in underestimations is crucial as it avoids misplacing files in the capacity tier while they are still in use.
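As a rough illustration of the approach (not the model from the paper; the byte vocabulary, filter sizes and lifetime buckets below are assumptions), a character-level CNN over the file path could look like this:

```python
# Illustrative sketch only: a character-level CNN mapping a file path to a
# lifetime class, in the spirit of the paper. Architecture details are assumed.
import torch
import torch.nn as nn

MAX_LEN = 128          # paths truncated/padded to a fixed length (assumption)
VOCAB = 256            # raw byte vocabulary
N_CLASSES = 5          # lifetime buckets, e.g. <1h, <1d, <1w, <1m, longer (assumption)

class PathLifetimeCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.conv = nn.Conv1d(32, 64, kernel_size=5, padding=2)
        self.head = nn.Linear(64, N_CLASSES)

    def forward(self, x):                  # x: (batch, MAX_LEN) of byte ids
        h = self.embed(x).transpose(1, 2)  # (batch, 32, MAX_LEN)
        h = torch.relu(self.conv(h))       # (batch, 64, MAX_LEN)
        h = h.max(dim=2).values            # global max pooling over positions
        return self.head(h)                # unnormalized class scores

def encode_path(path: str) -> torch.Tensor:
    ids = list(path.encode("utf-8"))[:MAX_LEN]
    ids += [0] * (MAX_LEN - len(ids))      # zero-pad to MAX_LEN
    return torch.tensor(ids, dtype=torch.long)

model = PathLifetimeCNN()
scores = model(encode_path("/scratch/project/run42/output.h5").unsqueeze(0))
predicted_bucket = scores.argmax(dim=1)    # index of the predicted lifetime class
```

In practice such a model would be trained on paths labeled with observed lifetimes; only the path string is needed at prediction time, which is what makes placement at file creation possible.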
Simulation Tools and Techniques, Springer International Publishing, p. 796-810, 2021
Abstract
In the High Performance Computing (HPC) field, the metadata server cluster is a critical aspect of storage system performance, and with the growth of object storage, systems must now be able to distribute metadata across servers by means of distributed metadata servers. Storage systems achieve better performance if the workload remains balanced over time. Indeed, an unbalanced distribution can lead to frequent requests to a subset of servers while other servers are completely idle. To avoid this issue, different metadata distribution methods exist, each with its best use cases. Moreover, each system has different usages and different workloads, which means that one distribution method may fit a specific kind of storage system but not another. To this end, we propose a tool to evaluate metadata distribution methods with different workloads. In this paper, we describe this tool and use it to compare state-of-the-art methods and one method we developed. We also show how the outputs generated by our tool enable us to identify distribution weaknesses and choose the most suitable method.
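A toy sketch of what such a comparison measures (the two placement methods and the imbalance metric below are simple stand-ins, not the methods evaluated in the paper):

```python
# Minimal sketch (not the paper's tool): distribute metadata entries across
# servers with two simple methods and compare their load balance.
import hashlib
from collections import Counter

SERVERS = 8

def hash_mod(path: str) -> int:
    """Plain hash-modulo placement."""
    h = int(hashlib.md5(path.encode()).hexdigest(), 16)
    return h % SERVERS

def subtree(path: str) -> int:
    """Subtree-style placement: all entries under the same top-level directory
    land on the same server (keeps locality, may unbalance)."""
    top = path.split("/")[1] if "/" in path else path
    return hash_mod(top)

# Hypothetical workload: a few projects, each with many runs and files.
workload = [f"/proj{p}/run{r}/file{i}"
            for p in range(4) for r in range(10) for i in range(50)]

for method in (hash_mod, subtree):
    load = Counter(method(p) for p in workload)
    counts = [load.get(s, 0) for s in range(SERVERS)]
    print(method.__name__, "max/avg imbalance:",
          round(max(counts) / (sum(counts) / SERVERS), 2))
```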
2018 International Conference on High Performance Computing and Simulation (HPCS), p. 1059-1060, 2018
Abstract
The use of object storage in the HPC world is becoming common, as it overcomes some POSIX limitations in scalability and performance. Indeed, object stores use a flat namespace, avoiding hierarchy in access requests and the cost of maintaining dependencies between multiple entries. Object stores also separate the data flow from the metadata flow, providing better concurrency and throughput. They can store trillions of objects, and each object has its own customized metadata attributes, so this metadata can be richer than POSIX attributes.
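The flat-namespace and rich-metadata model can be pictured with a toy sketch (this is not any specific object store's API, just an illustration of the data model):

```python
# Toy illustration of a flat namespace with user-defined metadata per object,
# as opposed to the fixed POSIX attribute set. Not a real object-store API.
from dataclasses import dataclass, field

@dataclass
class Object:
    data: bytes
    metadata: dict = field(default_factory=dict)   # rich, user-defined attributes

class FlatStore:
    def __init__(self):
        self._objects = {}                         # flat namespace: key -> object

    def put(self, key: str, data: bytes, **metadata):
        self._objects[key] = Object(data, metadata)

    def get(self, key: str) -> Object:
        return self._objects[key]                  # no directory traversal needed

store = FlatStore()
store.put("sim/run42/out.h5", b"...", owner="pdeniel", experiment="exa", step=42)
print(store.get("sim/run42/out.h5").metadata)
```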
Proceedings of the 4th ACM International Conference of Computing for Engineering and Sciences, Association for Computing Machinery, 2018
Abstract
Simulation is the most appropriate technique to evaluate the performance of current data storage systems and to predict that of future ones as part of data centers or cloud infrastructures. It assesses the potential of a system to meet users' requirements in terms of storage capacity, device heterogeneity, delivered performance and robustness. We developed a simulation tool called OGSSim to address these criteria efficiently within a reduced execution time. However, the number of threads on the test machine puts an upper bound on the size of the simulated systems. To push back this limitation and improve the simulation time, we define in this paper a parallel version of OGSSim. We explain how the parallelization process generates both design and implementation challenges, due to the multi-node environment and the related communications, and how the MPI and ZeroMQ libraries respectively help us address those challenges.
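A hedged sketch of the multi-node idea (this is not OGSSim's implementation; the device split and placeholder latencies are assumptions, and the intra-node ZeroMQ messaging is omitted):

```python
# Sketch: mpi4py splits the simulated devices across ranks, each rank simulates
# its share, and partial results are gathered on rank 0.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N_DEVICES = 500                                  # simulated devices (assumption)
my_devices = range(rank, N_DEVICES, size)        # round-robin split across ranks

# Each rank "simulates" its devices; here a placeholder per-device latency.
local_latencies = {d: 0.004 + 0.00001 * d for d in my_devices}

all_latencies = comm.gather(local_latencies, root=0)   # collect partial results
if rank == 0:
    merged = {k: v for part in all_latencies for k, v in part.items()}
    print("simulated devices:", len(merged))
```

Run with `mpirun -n 4 python simulate.py` to spread the work over four processes (possibly on several nodes).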
2017 International Conference on High Performance Computing and Simulation (HPCS), p. 236-243, 2017
Abstract
The storage capacity provided by data centers does not cease to increase, currently reaching the exabyte scale using thousands of disks. In this context, the resiliency of such systems becomes critical, to avoid data loss and reduce the impact of the reconstruction process on data access time. We propose SD2S, a method to create a placement scheme for declustered RAID organizations based on a shifting placement. It consists in the calculation of degree matrices, which represent the distance between the source sets of each pair of physical disks, and thus the number of data blocks that would have to be reconstructed in case of a double failure. The scheme is created by computing a score function for all possible shifting offsets and selecting the one ensuring the reconstruction of the highest percentage of data. Results show the distribution of data reconstruction against the number of double-failure events. The overhead generated by the calculation of the shifting offsets is also compared to greedy SD2S and to CRUSH without replicas for systems reaching hundreds of disks. These results confirm that selecting the best offset can lead to a complete data reconstruction with a small overhead, especially for large systems.
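A minimal sketch of the general idea of scoring shifting offsets (the placement rule and the score below are simplified stand-ins, not the SD2S degree matrices):

```python
# Hedged sketch, not the exact SD2S algorithm: blocks of a stripe are placed
# with a per-stripe shift, and each candidate offset is scored by how evenly a
# double failure would spread reconstruction work across disk pairs.
from itertools import combinations

N_DISKS = 10     # physical disks (assumption)
STRIPE_W = 4     # blocks per stripe (assumption)
N_STRIPES = 200

def placement(offset):
    """Disk hosting block b of stripe s under a given shifting offset."""
    return lambda s, b: (s * offset + b) % N_DISKS

def score(offset):
    place = placement(offset)
    shared = {pair: 0 for pair in combinations(range(N_DISKS), 2)}
    for s in range(N_STRIPES):
        disks = {place(s, b) for b in range(STRIPE_W)}
        for pair in combinations(sorted(disks), 2):
            shared[pair] += 1          # blocks both disks would need for rebuild
    return max(shared.values())        # heaviest disk pair = worst double failure

best = min(range(1, N_DISKS), key=score)   # offset with the lowest worst case
print("best shifting offset:", best, "worst pair load:", score(best))
```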
2017 International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS), p. 1-8, 2017
Abstract
Using simulation to study the behavior of large-scale data storage systems is essential to predict their performance and reliability at a lower cost. This helps to make the right decisions before the system is developed and deployed. OGSSim is a simulation tool for large and heterogeneous storage systems that uses parallelism to provide information about the behavior of such systems in a reduced time. It uses the ZeroMQ communication library to implement not only the data communication but also the synchronization functions between the generated threads. These synchronization points occur during the parallel execution of requests and need to be handled efficiently to ensure data coherency for the fast and accurate computation of performance metrics. In this work, different issues due to the parallel execution of our simulation tool OGSSim are presented and the solutions adopted using ZeroMQ are discussed. The impact of these solutions in terms of simulation time overhead is measured for various system configurations. The obtained results show that ZeroMQ has almost no impact on the simulation time, even for complex and large configurations.
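A small pyzmq sketch of the kind of synchronization point described above (not OGSSim code; socket names and message contents are illustrative):

```python
# Worker threads signal completion over an inproc PUSH socket and the
# coordinator waits for all of them before computing metrics.
import threading
import zmq

N_WORKERS = 4
ctx = zmq.Context.instance()

def worker(wid):
    push = ctx.socket(zmq.PUSH)
    push.connect("inproc://sync")
    # ... process this worker's share of requests here ...
    push.send_json({"worker": wid, "requests_done": 100})
    push.close()

pull = ctx.socket(zmq.PULL)
pull.bind("inproc://sync")            # bind before workers connect (inproc requirement)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads:
    t.start()

done = [pull.recv_json() for _ in range(N_WORKERS)]   # synchronization point
for t in threads:
    t.join()
pull.close()
ctx.term()
print("all workers reported:", sum(d["requests_done"] for d in done), "requests")
```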
2016 International Conference on High Performance Computing and Simulation (HPCS), p. 342-349, 2016
Abstract
Modern disks (SSDs, HDDs) are very large and their capacities will certainly keep increasing in the future. Storage systems use a large number of such devices to compose storage pools and fulfil the storage capacity demands. The result is a higher probability of failure and a longer reconstruction duration. Consequently, the whole system is penalized, as the response time is higher and a second failure would cause data loss. In this paper, we propose a new method based on a block shifting layout which increases the efficiency of a declustered RAID storage system and improves its robustness in both normal and failure modes. We define four mapping rules to reach these objectives. The tests conducted reveal that exploiting the coprimality between the number of devices and the block shifting factor leads to an optimal layout. It significantly reduces the redirection time, proportionally to the number of disks, reaching 50% for 1000 disks, with a negligible memory cost since we avoid the use of a redirection table. It also allows the recovery of additional data in case of a second failure during degraded mode, which makes the proposed method highly attractive for large storage systems compared to other existing methods.
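The coprime property can be illustrated with a short sketch (the mapping rule below is a simplified stand-in for the paper's four rules):

```python
# When the shifting factor is coprime with the number of devices, the
# block -> device mapping is a pure computation, needs no redirection table,
# and visits every device evenly.
from math import gcd
from collections import Counter

N_DEVICES = 10
SHIFT = 3                      # chosen so that gcd(SHIFT, N_DEVICES) == 1
assert gcd(SHIFT, N_DEVICES) == 1, "shifting factor must be coprime with device count"

def device_of(stripe: int, block: int) -> int:
    """Compute the device holding a block directly, without any lookup table."""
    return (block + stripe * SHIFT) % N_DEVICES

# Over N_DEVICES consecutive stripes, a coprime shift rotates the layout so
# that every device hosts each block position exactly once.
load = Counter(device_of(s, b) for s in range(N_DEVICES) for b in range(4))
print(load)   # every device appears the same number of times
```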
2016 International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS), p. 1-8, 2016
Abstract
Current storage systems are very large, with complex and distributed architectural configurations, composed of devices of various technologies. However, the simulation, analysis and evaluation tools in the literature do not handle this complex design and these heterogeneous components. This paper presents OGSSim (Open and Generic Storage systems Simulation tool), a new simulation tool for such systems. Being generic with respect to device technologies and open to diverse management strategies and architecture layouts, it fulfills the representativeness needs of storage systems. It has also been validated against real systems; its accuracy makes it a useful tool for the design of future storage systems, the choice of hardware components, and the analysis of the adequacy between application needs and the management strategies combined with the configuration layout. This validation showed at most a 15% difference between real and simulated execution times. Moreover, OGSSim runs in a competitive time, just 3.5 s for common workloads on a large system of 500 disks, making it a compelling simulation and evaluation tool. It is thus an appropriate and accurate tool for the design, evaluation and maintenance of modern storage systems.
EAI Endorsed Transactions on Scalable Information Systems, ACM, 2015
Abstract
In this paper, an open and generic storage simulator is proposed. It simulates with accuracy multi-tiered storage systems based on heterogeneous devices, including HDDs, SSDs and the connecting buses. The target simulated system is constructed from the hardware configuration input, then sent to the simulator modules along with the trace file, and the appropriate simulator functions are selected and executed. Each module of the simulator is executed by a thread and communicates with the others via ZeroMQ, a message transmission API using sockets for the information transfer. The result is an accurate behavior of the simulated system submitted to a specific workload, represented by performance and reliability metrics. No restriction is put on the input hardware configuration, which can handle different levels of detail and makes this simulator generic. The diversity of the supported devices, regardless of their nature (disks, buses, etc.) and organisation (JBOD, RAID, etc.), makes the simulator open to many technologies. The modularity of its design and the independence of its execution functions make it open to handling any additional mapping, access, maintenance or reconstruction strategies. The conducted tests using OLTP and scientific workloads show accurate results, obtained in a competitive runtime.
IBISC, University of Evry / Paris-Saclay, 2014
Abstract
Distributed storage systems are nowadays ubiquitous, often under the form of multiple caches forming a hierarchy. A large amount of work has been dedicated to design, implement and optimise such systems. However, there exists to the best of our knowledge no attempt to use formal modelling and analysis in this field. This paper proposes a formal modelling framework to design distributed storage systems while separating the various concerns they involve like data-model, operations, placement, consistency, topology, etc. A system modelled in such a way can be analysed through model-checking to prove correctness properties, or through simulation to measure timed performance. In this paper, we define the modelling framework and then focus on timing analysis. We illustrate these two aspects on a simple example showing that our proposal has the potential to be used to make design decisions before the real system is implemented.
Chapman & Hall/CRC, 2013
Abstract
Contemporary High Performance Computing: From Petascale toward Exascale focuses on the ecosystems surrounding the world’s leading centers for high performance computing (HPC). It covers many of the important factors involved in each ecosystem: computer architectures, software, applications, facilities, and sponsors. The first part of the book examines significant trends in HPC systems, including computer architectures, applications, performance, and software. It discusses the growth from terascale to petascale computing and the influence of the TOP500 and Green500 lists. The second part of the book provides a comprehensive overview of 18 HPC ecosystems from around the world. Each chapter in this section describes programmatic motivation for HPC and their important applications; a flagship HPC system overview covering computer architecture, system software, programming systems, storage, visualization, and analytics support; and an overview of their data center/facility. The last part of the book addresses the role of clouds and grids in HPC, including chapters on the Magellan, FutureGrid, and LLGrid projects. With contributions from top researchers directly involved in designing, deploying, and using these supercomputing systems, this book captures a global picture of the state of the art in HPC.
USENIX Association, 2009
M.I.S.C, 2005
Linux Magazine, Diamond Editions, 2003