Nvidia nview 149.21

9/19/2023

MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems

Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications driven by the MLCommons Association. We present the results from the first submission round, including a diverse set of some of the world's largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization, and communication scheduling, enabling overall $>10\times$ (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system's memory hierarchy, and training convergence that underlines the importance of near-compute storage.
