Heterogeneous Parallel Algorithm Design and Performance Optimization for WENO on the Sunway TaihuLight Supercomputer
Author affiliations: State Key Laboratory of Plateau Ecology and Agriculture, Department of Computer Technology and Applications, Qinghai University, Xining 810016, China; Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Publication: Tsinghua Science and Technology (Journal of Tsinghua University, Natural Science Edition, English Edition)
Year, volume, issue: 2020, Vol. 25, No. 1
Pages: 56-67
Subject classification: 07 [Science]; 070102 [Science: Computational Mathematics]; 0701 [Science: Mathematics]
Funding: Supported by the National High-Tech Research and Development (863) Program of China (No. 2015AA015306), the Science and Technology Plan of Beijing Municipality (No. Z161100000216147), the National Natural Science Foundation of China (No. 61762074), the Youth Foundation Program of Qinghai University (No. 2016-QGY-5), and the National Natural Science Foundation of Qinghai Province (No. 2019-ZJ7034).
Keywords: parallel algorithms; Weighted Essentially Non-Oscillatory (WENO) scheme; optimization; many-core; Sunway TaihuLight
Abstract: The Weighted Essentially Non-Oscillatory (WENO) scheme is a numerical method for solving hyperbolic conservation laws, suitable for simulating high-density fluid interface instability with strong intermittency. These problems have large and complex flow structures. To fully utilize the computing power of High Performance Computing (HPC) systems, it is necessary to develop specific methodologies that optimize application performance for the particular system's architecture. The Sunway TaihuLight supercomputer is currently ranked as the fastest supercomputer in the world. This article presents a heterogeneous parallel algorithm design and performance optimization of a high-order WENO scheme on Sunway TaihuLight. We analyzed the characteristics of the kernel functions and proposed an appropriate heterogeneous parallel model. We also determined the best division strategy for the computing tasks and implemented the parallel algorithm on Sunway TaihuLight. By using access optimization, data dependency elimination, and vectorization optimization, our parallel algorithm achieves up to 172× speedup on a single node, and an additional 58× speedup on 64 nodes, with nearly linear scalability.
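As a rough reading of the "nearly linear scalability" claim, the multi-node figure can be turned into a parallel efficiency. This is only a sketch: it assumes the 58× speedup is measured relative to one optimized node, a baseline the abstract does not state explicitly, and the symbols S (speedup), N (node count), and E (efficiency) are introduced here for illustration.

```latex
E = \frac{S}{N} = \frac{58}{64} \approx 0.906
```

Under that assumption, the 64-node run would retain about 91% of ideal linear speedup.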