Leveraging hardware-level approximate computing in Convolutional Neural Networks (CNNs) enables faster computation, improved power efficiency, and a reduced design footprint. This paper proposes a unified framework for designing approximate floating-point multipliers optimized for error-resilient applications. By adopting a right shift-based multiplication algorithm for combinational logic and integrating multiple compressor configurations with distinct error profiles, the design space is significantly expanded across the FP32, TF32, and BF16 formats. A multi-objective optimization using NSGA-II is employed to explore this space efficiently, targeting truncation depth, approximation range, and compressor arrangement. Evaluations across image processing, CNNs, and JPEG compression demonstrate approximately 15% reduction in critical path delay and over 50% savings in power consumption and silicon footprint, with minimal loss in output quality. In some CNN scenarios, approximate multipliers even improve inference accuracy, revealing their potential not only for efficiency but also for enhanced generalization. Furthermore, the framework and hardware design files are made available to the design and research community for further use.
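To illustrate one of the design knobs mentioned above (truncation depth), the following minimal Python sketch emulates an approximate FP32 multiplication by zeroing the low mantissa bits of each operand before multiplying. This is only a software model of operand truncation under assumed names (`truncate_mantissa`, `approx_fp_mul`, `drop_bits`); it does not reproduce the paper's right shift-based combinational multiplier, compressor configurations, or NSGA-II search.

```python
import struct


def truncate_mantissa(x: float, drop_bits: int) -> float:
    """Zero out the low `drop_bits` bits of an FP32 mantissa (23 bits total).

    Hypothetical helper: models the truncation-depth knob only.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    mask = ~((1 << drop_bits) - 1) & 0xFFFFFFFF
    return struct.unpack(">f", struct.pack(">I", bits & mask))[0]


def approx_fp_mul(a: float, b: float, drop_bits: int = 12) -> float:
    """Approximate product of two FP32 values with truncated mantissas."""
    return truncate_mantissa(a, drop_bits) * truncate_mantissa(b, drop_bits)


if __name__ == "__main__":
    a, b = 3.14159, 2.71828
    exact = a * b
    approx = approx_fp_mul(a, b, drop_bits=12)
    print(f"exact={exact:.6f}  approx={approx:.6f}  "
          f"rel_err={(approx - exact) / exact:.2e}")
```

Sweeping `drop_bits` in such a model gives a rough sense of the accuracy/precision trade-off that the hardware-level truncation and compressor choices expose; the actual error and area/power figures reported in the paper come from the synthesized designs, not from this emulation.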