MASc Seminar Notice: Inference of Encoder Transformers on eFPGAs via Structured Resource Management

Thursday, August 7, 2025 11:00 am - 12:00 pm EDT (GMT -04:00)

Candidate: Omar Elayat

Date: August 7, 2025

Time: 11:00am

Location: online

Supervisor: Dr. Vincent Gaudet

All are welcome!

Abstract:

The widespread adoption of Large Language Models (LLMs) in various applications has pushed the demand for efficient hardware acceleration beyond the capabilities of traditional platforms. Field-Programmable Gate Arrays (FPGAs) are widely used to accelerate LLMs because of their high parallelism and low latency. However, trained models are still too large to fit within an FPGA fabric. While existing FPGA-based solutions have demonstrated promising throughput and energy efficiency, they often rely on abundant fabric resources, assume high-bandwidth environments, or employ highly customized acceleration architectures, making them unsuitable for fast, scalable deployment at the edge.

This thesis addresses these challenges by proposing a novel on-chip resource manager architecture for encoder-based transformer inference, with a focus on Bidirectional Encoder Representations from Transformers (BERT) models. We target resource-constrained embedded FPGAs (eFPGAs) with limited memory bandwidth. We show that, through structured operation scheduling and resource-sharing algorithms, significant performance improvements can be achieved alongside peak memory bandwidth utilization. The resource-sharing infrastructure of the proposed accelerator is also designed to be modular and extensible, allowing newly introduced computation blocks to be easily integrated into the overlay without requiring major modifications or incurring additional off-chip data movement.

Demonstrated on a fully quantized, integer-only variant of the BERT model as a representative workload, the proposed system achieves a 2.32x latency improvement over the baseline custom accelerator, 1.17x over the Jetson Orin Nano GPU, and at least 23.63x over CPU offloading approaches. The design supports scalable scheduling across varying transformer configurations and is validated on two FPGA platforms: the PYNQ-Z1 as a low-end proof-of-concept and the KV260 as a mid-range deployment target.