Hardware-oriented algorithms for softmax and layer normalization of large language models
Author affiliations: School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; Department of Micro/Nano Electronics, Shanghai Jiao Tong University; MoE Key Laboratory of Artificial Intelligence, Shanghai Jiao Tong University
Publication: Science China (Information Sciences)
Year/Volume/Issue: 2024, Vol. 67, No. 10
Pages: 85-99
Subject classification: 12 [Management] 1201 [Management Science and Engineering (degrees awardable in Management or Engineering)] 081104 [Pattern Recognition and Intelligent Systems] 081203 [Computer Application Technology] 08 [Engineering] 0835 [Software Engineering] 0811 [Control Science and Engineering] 0812 [Computer Science and Technology (degrees awardable in Engineering or Science)]
Funding: Supported by the National Natural Science Foundation of China (Grant No. 62074097)
Keywords: large language model; softmax; layer normalization; hardware architecture; Transformer
Abstract: While large language models (LLMs) have sparked a new revolution in the field of natural language processing (NLP), their hardware accelerators have garnered tremendous attention. However, softmax and layer normalization, which are the most common non-linear operations in LLMs, are frequently […]. This paper presents hardware-oriented algorithms for both softmax and layer normalization of LLMs. We propose an approximate approach to implementing division in softmax and extend it to simultaneously compute the square root and perform division in layer normalization, replacing the original computation with multiplication and shifting. For softmax, we further approximate the exponential function by truncating its exponent and then reuse the involved subtraction. For layer normalization, we additionally simplify the computation of the denominator by directly removing the term involving the square of the mean. Furthermore, hardware architectures are developed for the proposed algorithms of softmax and layer normalization. They can work as plug-and-play units for LLM accelerators, requiring no fine-tuning and introducing negligible performance loss. Compared with the state-of-the-art designs, the proposed softmax architecture can save up to 23.45% area cost and 17.39% power consumption, while the proposed layer normalization architecture can save up to 32.70% area cost and 14.29% power consumption.
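To make the ideas in the abstract concrete, the sketch below gives a minimal software emulation of the general techniques it names: a shift-and-multiply style reciprocal in place of division, a base-2 exponential with a truncated (integer) exponent, and a layer-normalization denominator that drops the squared-mean term. This is purely illustrative; the function names, the specific first-order reciprocal approximation, and the floating-point emulation are assumptions for exposition, not the paper's actual fixed-point algorithms or hardware architecture.

```python
# Illustrative sketch only: emulates in floating point the flavor of the
# hardware-oriented approximations described in the abstract. Exact bit widths,
# approximation formulas, and datapaths are assumptions, not the authors' design.
import math


def approx_reciprocal(d: float) -> float:
    """Approximate 1/d (d > 0) with a subtraction, a multiply, and a shift.

    Write d = m * 2**k with m in [1, 2); a crude first-order approximation is
    1/d ~= (2 - m) * 2**(-k), which avoids a true divider in hardware.
    """
    k = math.floor(math.log2(d))
    m = d / (2.0 ** k)              # mantissa in [1, 2)
    return (2.0 - m) * 2.0 ** (-k)


def approx_softmax(x: list[float]) -> list[float]:
    """Softmax where exp() uses a base-2 power with a truncated integer exponent
    (a shift in hardware) and the division uses approx_reciprocal()."""
    x_max = max(x)
    # exp(x - max) = 2**((x - max) * log2(e)); truncating the exponent to an
    # integer turns each power into a shift (emulated here in floating point).
    pows = [2.0 ** math.floor((xi - x_max) * math.log2(math.e)) for xi in x]
    inv_denom = approx_reciprocal(sum(pows))
    return [p * inv_denom for p in pows]


def approx_layernorm(x: list[float], eps: float = 1e-5) -> list[float]:
    """Layer normalization whose denominator omits the squared-mean term,
    i.e. uses E[x^2] instead of E[x^2] - (E[x])^2, and whose reciprocal is
    built from the same shift/multiply-style approximation."""
    n = len(x)
    mean = sum(x) / n
    second_moment = sum(v * v for v in x) / n   # squared-mean term dropped
    inv_std = math.sqrt(approx_reciprocal(second_moment + eps))
    return [(v - mean) * inv_std for v in x]


if __name__ == "__main__":
    scores = [1.2, -0.7, 3.1, 0.4]
    print("approx softmax:   ", approx_softmax(scores))
    print("approx layernorm: ", approx_layernorm(scores))
```

The design intent this sketch tries to convey is that every divide, square root, and exponential is reduced to subtractions, multiplications, and power-of-two scalings, which map to cheap shifters and multipliers in an accelerator datapath; the paper's reported area and power savings come from its own, more carefully tuned versions of these substitutions.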