16th International Conference on Information and Knowledge Technology
Design of low-latency Floating-Point units for Softmax Computation in Transformer-based Large Language Models
Authors:
Hoda Ghabeli (1), Amir Sabbagh Molahosseini (2)
1- Azad University of Kerman
2- Azad University of Kerman
Keywords:
LLM, transformer, softmax, speculative, floating-point
Abstract:
Over the last decade, Large Language Models (LLMs) have emerged as some of the most sought-after and widely used interactive digital tools in the world. Softmax is one of the key steps in LLMs: its output is a vector of probabilities, one for each token in the model's vocabulary. Softmax is time-consuming because the large vocabulary size significantly increases the number of exponential and normalization computations, which reduces the overall speed of the model. Given the importance of accuracy and speed, some of the main softmax operations are performed on floating-point units. Speculative arithmetic applies when the result of a computation can be estimated through a path shorter than the critical path, which improves speed. In this paper, speculative 32-bit floating-point computation is proposed for softmax by merging two formats, 32-bit and 16-bit. Both the floating-point adder and the floating-point multiplier use this strategy. Based on the input data of the softmax function, the proposed design speculates that the 32-bit floating-point result can be obtained by concatenating the 16-bit-format result with a part of the 32-bit-format result, which gives correct results most of the time with less delay. If speculation fails, the longer path through the conventional 32-bit floating-point unit is activated, at the cost of a slightly longer critical path. Experimental results show that the speculative floating-point units reduce delay with only marginal overhead in area and power consumption.
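The abstract does not spell out the softmax formula, so the following is only a minimal software sketch, assuming the standard max-subtracted definition. It is meant to show where the per-token exponentials and the normalization mentioned above come from; it says nothing about the authors' hardware design. The function name `softmax` and the sample logits are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats (illustrative sketch)."""
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow.
    m = max(logits)
    # One exponential per vocabulary entry: this count grows with vocabulary size.
    exps = [math.exp(x - m) for x in logits]
    # Normalization term: a floating-point accumulation over the whole vector.
    total = sum(exps)
    # One division per entry turns the exponentials into probabilities.
    return [e / total for e in exps]

# Hypothetical three-token vocabulary; a real LLM vocabulary is far larger.
probs = softmax([2.0, 1.0, 0.1])
```

For a vocabulary of V tokens this sketch performs V exponentials, V - 1 additions, and V divisions per output position. The additions and the scaling step are the kind of work a floating-point adder and multiplier would take on in hardware, assuming the division is implemented as multiplication by a reciprocal.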
List of papers
List of archived papers
Benchmarking Embedding Models for Persian-Language Semantic Information Retrieval
Mahmood Kalantari - Mehdi Feghhi - Nasser Mozayani
IoT-Based Model in Smart Urban Traffic Control: Graph theory and Genetic Algorithm
Saeed Doostali - Seyed Morteza Babamir - Mohammad Shiralizadeh Dezfoli - Behzad Soleimani Neysiani
Unmanned Aerial Vehicle Path Optimization to Reduce Data Collection Time from Sensors in an Internet of Things Network Based on a Deep Reinforcement Learning Algorithm
محمد ناظمی جنابی - هادی اشعریون - مهدی پورقلی
Architectural Insights: Comparing Weight Stationary and Output Stationary Systolic Arrays for Efficient Computation
Mahdi Kalbasi
Calibrated Recommendations Based on Sentiments Extracted from Item-Related Texts
شیوا پارساراد - سامان هراتی زاده
LuckyAgent2022: A Stop-Learning Multi-Armed Bandit Automated Negotiating Agent
Arash Ebrahimnezhad - Faria Nassiri-Mofakham
Design of low-latency Floating-Point units for Softmax Computation in Transformer-based Large Language Models
Hoda Ghabeli - Amir Sabbagh Molahosseini
An integrated approach for estimating software cost estimation using Adaptive Neuro-Fuzzy Inference System and the Grey Wolf Optimization algorithm
Maryam Karimi - Taghi Javdani Gandomani - Mahdi Mosleh
Scattering Wavelet-Based Image Quality Assessment Metric for Medical Images
Sina Omidvar - Jamshid Shanbehzadeh
Mamba-SAM: A Hybrid Architecture for Efficient Cardiac MRI Medical Image Segmentation
Mohammadreza Gholipour Shahraki - Mehdi Rezaeian - Mohammad Ghasemzadeh