Performance Analysis of Distributed and Scalable Deep Learning
Varrette, Sébastien; Plugaru, Valentin et al.
in 20th IEEE/ACM Intl. Symp. on Cluster, Cloud and Internet Computing (CCGrid'20) (2020, May)

With renewed global interest in Artificial Intelligence (AI) methods, the past decade has seen a myriad of new programming models and tools that enable better and faster Machine Learning (ML). More recently, a subset of ML known as Deep Learning (DL) has attracted increasing interest due to its inherent ability to tackle novel cognitive computing applications efficiently. DL allows computational models composed of multiple processing layers to learn, in an automated way, representations of data with multiple levels of abstraction, and can deliver higher predictive accuracy when trained on larger data sets. Based on Artificial Neural Networks (ANN), DL is now at the core of state-of-the-art voice recognition systems (which enable easy control over, e.g., Internet-of-Things (IoT) smart home appliances), self-driving car engines, and online recommendation systems. The ecosystem of DL frameworks is fast-evolving, as are the DL architectures that are shown to perform well on specialized tasks and to exploit GPU accelerators. For this reason, frequent performance evaluation of the DL ecosystem is required, especially since the advent of novel distributed training frameworks such as Horovod, which allow scalable training across multiple computing resources.
In this paper, the scalability evaluation of the reference DL frameworks (TensorFlow, Keras, MXNet, and PyTorch) is performed over up-to-date High Performance Computing (HPC) resources to compare the efficiency of different implementations across several hardware architectures (CPU and GPU). Experimental results demonstrate that the DistributedDataParallel features of the PyTorch library seem to be the most efficient framework for distributing the training process across many devices, reaching a throughput speedup of 10.11 when using 12 NVidia Tesla V100 GPUs to train ResNet44 on the CIFAR10 dataset.

Monitoring & Profiling II (Advanced performance engineering). Plugaru, Valentin; Besseron, Xavier. Presentation (2019, June 20).

Amazon Elastic Compute Cloud (EC2) versus In-House HPC Platform: A Cost Analysis
Varrette, Sébastien; Plugaru, Valentin et al.
in IEEE Transactions on Cloud Computing (2019), 7(2), 456-468

While High Performance Computing (HPC) centers continuously evolve to provide more computing power to their users, we observe a wish for convergence between Cloud Computing (CC) and HPC platforms, with the commercial hope that CC infrastructures will eventually replace in-house facilities.
If we exclude the performance point of view, where many previous studies highlight a non-negligible overhead induced by the virtualization layer at the heart of every Cloud middleware when running an HPC workload, the question of real cost-effectiveness is often left aside, with the intuition that the instances offered by Cloud providers are most probably competitive from a cost point of view. In this article, we want to confirm (or refute) this intuition by analyzing what composes the Total Cost of Ownership (TCO) of an in-house HPC facility operated internally since 2007. This TCO model is then used for comparison with the cost that would have been incurred to run the same platform (and the same workload) on a competitive Cloud IaaS offer. Our approach to this price comparison is three-fold. First, we propose a theoretical price-performance model based on a study of the actual Cloud instances proposed by one of the major Cloud IaaS actors: Amazon Elastic Compute Cloud (EC2). Then, based on the TCO analysis of the HPC facility, we propose an hourly price comparison between our in-house cluster and the equivalent EC2 instances. Finally, based on experimental benchmarking on the local cluster and on the Cloud instances, we refine the theoretical price model to reflect the real system performance. The results obtained generally advocate for the acquisition of an in-house HPC facility, which counters the common intuition in favor of Cloud Computing platforms, even when they are provided by the leading Cloud provider worldwide.
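The hourly comparison described in this cost analysis can be sketched with a simple amortization model. This is a minimal illustration only: the amortization formula is a common simplification, and every numeric value below is a hypothetical placeholder, not a figure taken from the paper (whose model additionally folds measured benchmark performance into the comparison).

```python
# Illustrative sketch of an in-house vs. EC2 hourly price comparison.
# The amortization model is a common simplification; every number below
# is a hypothetical placeholder, not a figure taken from the paper.

def hourly_cost_in_house(tco_eur: float, lifetime_years: float,
                         utilization: float) -> float:
    """Amortized hourly cost of an in-house system over its service life."""
    usable_hours = lifetime_years * 365 * 24 * utilization
    return tco_eur / usable_hours

def cost_ratio(tco_eur: float, lifetime_years: float, utilization: float,
               ec2_hourly_eur: float) -> float:
    """In-house / EC2 hourly cost ratio; a value below 1 favors in-house."""
    return hourly_cost_in_house(tco_eur, lifetime_years, utilization) / ec2_hourly_eur

if __name__ == "__main__":
    # Placeholder scenario: a 50 kEUR node amortized over 5 years at 80%
    # utilization, against a comparable on-demand instance at 2 EUR/h.
    ratio = cost_ratio(50_000, 5, 0.80, 2.00)
    print(f"in-house/EC2 hourly cost ratio: {ratio:.2f}")
```

Note how sensitive such a comparison is to the utilization factor: at low utilization the amortized in-house hourly cost rises quickly, which is precisely why a per-workload TCO analysis is needed rather than a single list-price comparison.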
UL HPC Tutorial: Machine and Deep Learning on the UL HPC Platform. Varrette, Sébastien; Plugaru, Valentin; Diehl, Sarah et al. Presentation (2018, June).
UL HPC Tutorial: Performance engineering - HPC debugging and profiling. Plugaru, Valentin; Besseron, Xavier; Varrette, Sébastien et al. Presentation (2018, June).
Overview and Challenges of the UL HPC Facility at the EuroHPC Horizon. Varrette, Sébastien; Plugaru, Valentin; Diehl, Sarah et al. Presentation (2018, June).
UL HPC Tutorial: Multi-Physics workflows: test cases on CFD / MD / Chemistry applications. Plugaru, Valentin; Varrette, Sébastien; Diehl, Sarah et al. Presentation (2018, June).
UL HPC Tutorial: Building [custom] software with EasyBuild on the UL HPC platform. Diehl, Sarah; Varrette, Sébastien; Plugaru, Valentin et al. Presentation (2018, June).
UL HPC Tutorial: HPC Containers with Singularity. Plugaru, Valentin; Varrette, Sébastien; Diehl, Sarah et al. Presentation (2018, June).
UL HPC Tutorial: Getting Started on the Uni.lu HPC platform. Parisot, Clément; Cartiaux, Hyacinthe; Varrette, Sébastien et al. Presentation (2018, June).
UL HPC Tutorial: HPC workflow with sequential jobs. Cartiaux, Hyacinthe; Varrette, Sébastien; Plugaru, Valentin et al. Presentation (2018, June).
UL HPC Tutorial: Advanced Job scheduling with SLURM and OAR. Plugaru, Valentin; Varrette, Sébastien; Diehl, Sarah et al. Presentation (2018, June).
UL HPC Tutorial: Big Data Applications (batch, stream, hybrid) with Hadoop and Spark. Varrette, Sébastien; Plugaru, Valentin; Diehl, Sarah et al. Presentation (2018, June).
UL HPC Tutorial: Basic and Advanced scientific computing using MATLAB. Plugaru, Valentin; Varrette, Sébastien; Diehl, Sarah et al. Presentation (2018, June).
UL HPC Tutorial: Parallel computations with OpenMP/MPI. Varrette, Sébastien; Plugaru, Valentin; Diehl, Sarah et al. Presentation (2018, June).
UL HPC Tutorial: Bio-informatics workflows and applications. Plugaru, Valentin; Diehl, Sarah; Varrette, Sébastien et al. Presentation (2018, June).
UL HPC Tutorial: Statistical Computing with R. Ginolhac, Aurélien; Varrette, Sébastien et al. Presentation (2018, June).
UL HPC Tutorial: (Advanced) Prototyping with Python. Parisot, Clément; Diehl, Sarah; Varrette, Sébastien et al. Presentation (2018, June).
Large-scale research data management: Road to GDPR compliance. Bouvry, Pascal; Varrette, Sébastien; Plugaru, Valentin et al. Presentation (2018, April).
Overview and Challenges of the UL HPC Facility at the EuroHPC Horizon. Varrette, Sébastien; Bouvry, Pascal; Plugaru, Valentin et al. Presentation (2017, November).