FlexGO: A Unified Overlay for General Graph Neural Network Acceleration

Pramath Balisavira, Rishov Sarkar, Cong Hao
Georgia Institute of Technology


Abstract

Field Programmable Gate Arrays (FPGAs) provide abundant parallel computing resources suitable for energy-efficient, low-latency inference of Graph Neural Networks (GNNs). To cope with the increasing demand for multi-family GNN inference, prior research has addressed the inflexibility of specialized FPGA-based GNN accelerators by developing separate accelerators for individual GNN classes. However, the reconfiguration overhead incurred when switching models and parameters across such accelerators can be fatal for latency-critical applications. Overlay processors significantly reduce FPGA reconfiguration time by using compiler frameworks to map software-compiled GNN executables onto a fixed, GNN-agnostic FPGA microarchitecture, but state-of-the-art (SOTA) overlay processors suffer from performance degradation and depend on complex compiler frameworks. To retain generality while avoiding this performance loss, we propose FlexGO, a compiler-free, workload-agnostic unified overlay with both GNN-specific and GNN-agnostic hardware units for real-time inference. First, we propose a simple hardware-decoded instruction set for run-time selection of the GNN to be inferred. Second, we implement a scalable dataflow architecture that requires no graph preprocessing and effectively parallelizes and optimizes the node-embedding, edge-embedding, and message-passing stages of inference. Third, we achieve 33.74%, 58.51%, 61.55%, and 63.58% reductions in DSPs, LUTs, FFs, and BRAMs, respectively, compared with the cumulative resource utilization of specialized implementations for all supported GNNs. Fourth, we implement our architecture on a Xilinx Alveo U50 FPGA board to evaluate on-board end-to-end performance. FlexGO achieves speedups of 16.23–272.35× over a CPU (Intel 6226R) and 1.31–514.80× over a GPU (Nvidia A6000) across batch sizes 1 through 1024. We also outperform a SOTA software-programmable NPU accelerator by an average of 1.55× for two models against both their latency- and throughput-optimized configurations across two datasets. We open-source our implementation code and bitstreams for all design configurations on GitHub.