咨询与建议

看过本文的还看了

相关文献

该作者的其他文献

文献详情 >Improving performance portabil... 收藏

Improving performance portability for GPU-specific Open CL kernels on multi-core/many-core CPUs by analysis-based transformations

使用“基于分析的代码转换方法”来提升GPU特定的OpenCL kernel在多核/众核CPU上的性能移植性(英文)

作     者:Mei WEN Da-fei HUANG Chang-qing XUN Dong CHEN 

作者机构:School of Computer National University of Defense Technology National Key Laboratory of Parallel and Distributed Processing 

出 版 物:《Frontiers of Information Technology & Electronic Engineering》 (信息与电子工程前沿(英文版))

年 卷 期:2015年第16卷第11期

页      面:899-916页

核心收录:

学科分类:0810[工学-信息与通信工程] 0808[工学-电气工程] 0809[工学-电子科学与技术(可授工学、理学学位)] 0839[工学-网络空间安全] 08[工学] 080203[工学-机械设计及理论] 0835[工学-软件工程] 0802[工学-机械工程] 0812[工学-计算机科学与技术(可授工学、理学学位)] 081202[工学-计算机软件与理论] 

基  金:Project supported by the National Natural Science Foundation of China(No.61272145) the National High-Tech R&D Program(863)of China(No.2012AA012706) 

主  题:OpenCL Performance portability Multi-core/many-core CPU Analysis-based transformation 

摘      要:OpenCL is an open heterogeneous programming framework. Although OpenCL programs are func- tionally portable, they do not provide performance portability, so code transformation often plays an irreplaceable role. When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and thus has been extensively used. However, locality concerns exposed in GPU-specific OpenCL code are usually inherited without analysis, which may give side-effects on the CPU performance. Typi- cally, the use of OpenCL's local memory on multi-core/many-core CPUs may lead to an opposite performance effect, because local-memory arrays no longer match well with the hardware and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels, which can thus be adapted for CPUs by (1) removing all the unwanted local-memory arrays together with the obsolete barrier statements and (2) optimizing the coalesced kernel code with vectorization and locality re-exploitation. Moreover, we have developed an automated tool chain that makes this transformation of GPU-specific OpenCL kernels into a CPU-friendly form, which is accompanied with a scheduler that forms a new OpenCL runtime. Experiments show that the automated transformation can improve OpenCL kernel performance on a multi-core CPU by an average factor of 3.24. Satisfactory performance improvements axe also achieved on Intel's many-integrated-core coprocessor. The resultant performance on both architectures is better than or comparable with the corresponding OpenMP performance.

读者评论 与其他读者分享你的观点

用户名:未登录
我的评分