Candidate: Marat Bekmyrza
Date: April 8, 2025
Time: 1:00pm
Location: online
Supervisor: Drs. Nachiket Kapre and Hiren Patel
All are welcome!
Abstract:
Large Language Models (LLMs) dominate modern Artificial Intelligence (AI) applications, but their deployment on edge devices remains limited by computational complexity and power consumption. This thesis addresses that challenge by investigating integer-only acceleration of transformer models on FPGAs, focusing on the BERT architecture. We demonstrate that by removing floating-point operations from the inference pipeline, particularly from non-linear functions such as GELU, Softmax, and Layer Normalization, we can improve performance without sacrificing accuracy. Our pipelined, batched architecture processes multiple sequences in parallel and makes efficient use of FPGA resources. We achieve a 2.6x per-sequence latency improvement over single-sequence inference and at least a 10x speedup over CPU offloading. Experiments show that our INT8-quantized implementation achieves accuracy comparable to floating-point models on GLUE benchmark tasks. These findings demonstrate that integer-only transformer inference on FPGAs is a feasible way to deploy complex language models on resource-constrained edge devices, enabling new privacy-conscious, low-latency AI applications.
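To give a flavour of the integer-only idea the abstract describes, here is a minimal sketch of symmetric per-tensor INT8 quantization applied to a dot product. The function names, the per-tensor scheme, and the single end-of-accumulation rescale are illustrative assumptions for exposition, not the thesis's actual implementation:

```python
# Illustrative sketch: symmetric INT8 quantization of a dot product.
# Not the thesis's code; names and scheme are hypothetical.

def quantize(xs, num_bits=8):
    """Map a list of floats to signed integers with one shared scale."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8
    scale = max(abs(x) for x in xs) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(x / scale))) for x in xs]
    return q, scale

def int_dot(qa, sa, qb, sb):
    """Integer-only multiply-accumulate; on an FPGA the accumulator
    would be a wide integer (e.g. INT32), with the two float scales
    folded into a single rescaling step at the very end."""
    acc = sum(a * b for a, b in zip(qa, qb))  # pure integer arithmetic
    return acc * sa * sb                       # one dequantization step

a = [0.5, -1.2, 3.3, 0.0]
b = [1.0, 2.0, -0.5, 4.0]
qa, sa = quantize(a)
qb, sb = quantize(b)
exact = sum(x * y for x, y in zip(a, b))       # floating-point reference
approx = int_dot(qa, sa, qb, sb)               # integer-only result
```

The point of the sketch is that the inner loop touches only integers; floating point survives only as a per-tensor scale applied once per output, which is what makes the datapath cheap in FPGA fabric.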