MASc Seminar: NengoFPGA: an FPGA Backend for the Nengo Neural Simulator

Friday, August 9, 2019 2:00 pm - 2:00 pm EDT (GMT -04:00)

iCal

Candidate: Benjamin Morcos

Title: NengoFPGA: an FPGA Backend for the Nengo Neural Simulator

Date: August 9, 2019

Time: 2:00pm

Place: EIT 3151

Supervisor(s): Kapre, Nachiket G.

Abstract:

Low-power, high-speed neural networks are critical for providing deployable embedded AI applications at the edge. We describe an FPGA implementation of Neural Engineering Framework (NEF) networks with online learning that outperforms mobile GPU imple- mentations by an order of magnitude or more. Specifically, we provide an embedded Python-capable PYNQ FPGA implementation supported with a High-Level Synthesis

(HLS) workflow that allows sub-millisecond implementation of adaptive neural networks with low-latency, direct I/O access to the physical world. The outcome of this work is NengoFPGA, a seamless and user-friendly extension to the neural compiler Python package Nengo. To reduce memory requirements and improve performance we tune the precision of the different intermediate variables in the code to achieve competitive absolute accuracy against slower and larger floating-point reference designs. The online learning component of the neural network exploits immediate feedback to adjust the network weights to best support a given arithmetic precision. As the space of possible design configurations of such quantized networks is vast and is subject to a target accuracy constraint, we use the Hyperopt hyper-parameter tuning tool instead of manual search to find Pareto optimal designs. Specifically, we are able to generate the optimized designs in under 500 short iterations of Vivado HLS C synthesis before running the complete Vivado place-and-route phase on that subset, a much longer process not conducive to rapid exploration. For neural network populations of 64–4096 neurons and 1–8 representational dimensions our optimized FPGA implementation generated by Hyperopt has a speedup of 10–484× over a competing cuBLAS implementation on the Jetson TX1 GPU while using 2.4–9.5× less power. Our speedups are a result of HLS-specific reformulation (15× improvement), precision adaptation (3× improvement), and low-latency direct I/O access (1000× improvement)

Support Waterloo Engineering