Managing HBM’s bandwidth in Multi-Die FPGAs using Overlay NoCs

Thursday, December 9, 2021 1:00 pm - 1:00 pm EST (GMT -05:00)

Candidate: Srinirdheeshwar Kuttuva Prakash
Title: Managing HBM’s bandwidth in Multi-Die FPGAs using Overlay NoCs
Date: December 9, 2021
Time: 13:00
Place: online
Supervisor(s): Kapre, Nachiket - Patel, Hiren

Abstract:
We can improve HBM bandwidth distribution and utilization on a multi-die FPGA such as the Xilinx Alveo U280 by using overlay
Networks-on-Chip (NoCs). HBM in the Xilinx Alveo U280 offers 8 GB of memory capacity with a theoretical maximum bandwidth of 460
GB/s, but all thirty-two HBM ports in the Xilinx Alveo U280 are exposed to the FPGA fabric in only one die. As a result,
processing elements assigned to other dies must use the scarce and difficult-to-use Super Long Lines (SLLs) to
access the HBM’s bandwidth. Furthermore, HBM is fractured internally into thirty-two smaller memories called pseudo channels.
They are connected by a hardened but flawed cross-bar, which enables global accesses from any of the HBM ports but
introduces several throughput bottlenecks, degrading the achievable throughput when the entire memory space is used. An Overlay
Hybrid NoC combining the features of Hoplite and Butterfly Fat Tree (BFT) NoCs offers a high-frequency solution for distributing
HBM’s bandwidth across all three dies, as well as overcoming the throughput bottleneck introduced by the internal cross-bar. The
Hybrid NoC combines multiple high-frequency Ring NoCs for inter-die communication and Butterfly Fat Tree NoCs for intra-die
communication. In addition, the routing capability of the NoC can be modified to supplant the HBM’s internal cross-bar for global
accesses. We demonstrate this on the Xilinx Alveo U280 using synthetic benchmarks and two application-based benchmarks: dense
matrix-matrix multiplication (DMM) and sparse matrix-vector multiplication (SpMV). Our experiments show that NoCs can improve
throughput utilization by as much as 8.6 times for single-flit global accesses, 1.7 times for multi-flit global accesses with
burst length 16, and as much as 1.4 times for the SpMV benchmark.
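
To make the scale of the problem concrete, the capacity and bandwidth figures quoted above can be turned into per-pseudo-channel numbers. The sketch below is illustrative only and is not taken from the thesis. It assumes that "8 GB" means 8 GiB, that capacity and bandwidth are split evenly across the thirty-two pseudo channels, that pseudo channels occupy contiguous address ranges, and that port i is local to pseudo channel i. The function names are hypothetical.

```python
# Back-of-the-envelope figures derived from the numbers quoted in the abstract.
TOTAL_CAPACITY_BYTES = 8 * 2**30   # "8 GB" read as 8 GiB (assumption)
PEAK_BANDWIDTH_GBPS = 460.0        # theoretical maximum, in GB/s
NUM_PSEUDO_CHANNELS = 32           # one per HBM port

# Assumption: capacity and bandwidth are split evenly across pseudo channels.
PC_CAPACITY_BYTES = TOTAL_CAPACITY_BYTES // NUM_PSEUDO_CHANNELS
PC_BANDWIDTH_GBPS = PEAK_BANDWIDTH_GBPS / NUM_PSEUDO_CHANNELS


def pseudo_channel_of(address: int) -> int:
    """Return the pseudo channel holding a global byte address.

    Assumes pseudo channels occupy contiguous, equally sized address ranges.
    """
    if not 0 <= address < TOTAL_CAPACITY_BYTES:
        raise ValueError("address outside the HBM address space")
    return address // PC_CAPACITY_BYTES


def is_global_access(port: int, address: int) -> bool:
    """True when a port targets a pseudo channel other than its own.

    Assumes port i is local to pseudo channel i; any other target must cross
    the internal cross-bar (or an overlay NoC that replaces it).
    """
    return pseudo_channel_of(address) != port


if __name__ == "__main__":
    print(PC_CAPACITY_BYTES // 2**20, "MiB per pseudo channel")
    print(PC_BANDWIDTH_GBPS, "GB/s per pseudo channel")
    print(is_global_access(port=0, address=300 * 2**20))
```

Under these assumptions a single port can reach only a 1/32 share of the peak bandwidth on its own, which is why the way global accesses are routed across pseudo channels matters for overall throughput.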