A 7.3 M Output Non-Zeros/J, 11.7 M Output Non-Zeros/GB Reconfigurable Sparse Matrix-Matrix Multiplication Accelerator

Abstract

A sparse matrix-matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40 nm CMOS. The compute fabric consists of dedicated floating-point multiplication units and general-purpose Arm Cortex-M0 and Cortex-M4 cores. The on-chip memory reconfigures as scratchpad or cache depending on the phase of the algorithm. The memory and compute units are interconnected with synthesizable coalescing crossbars for efficient memory access. The 2.0 mm$\times$2.6 mm chip exhibits 12.6$\times$ (8.4$\times$) energy efficiency gain, 11.7$\times$ (77.6$\times$) off-chip bandwidth efficiency gain and 17.1$\times$ (36.9$\times$) compute density gain against a high-end CPU (GPU) across a diverse set of synthetic and real-world power-law graph-based sparse matrices.
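For readers unfamiliar with SpMM, the following is a minimal software sketch of a sparse-times-sparse product split into two phases. It is an illustration only, not the accelerator's implementation: the abstract does not name the algorithm's phases, so the multiply/merge split, the coordinate-dictionary storage, and the function name `spmm_two_phase` are all assumptions made for this example.

```python
from collections import defaultdict


def spmm_two_phase(a, b):
    """Multiply two sparse matrices stored as {(row, col): value} dicts.

    Hypothetical helper for illustration; the two-phase structure is an
    assumption, not a description of the fabricated chip.
    """
    # Index the non-zeros of B by row so each non-zero of A can find
    # its matching operands directly.
    b_rows = defaultdict(list)
    for (k, j), b_val in b.items():
        b_rows[k].append((j, b_val))

    # Phase 1 (multiply): emit one partial product per matching pair
    # A[i, k] * B[k, j]. Only non-zero operands are ever touched.
    partials = []
    for (i, k), a_val in a.items():
        for j, b_val in b_rows.get(k, ()):
            partials.append((i, j, a_val * b_val))

    # Phase 2 (merge): accumulate partial products that share the same
    # output coordinate, then drop entries that cancelled to zero.
    c = defaultdict(float)
    for i, j, product in partials:
        c[(i, j)] += product
    return {coord: val for coord, val in c.items() if val != 0.0}


if __name__ == "__main__":
    a = {(0, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}
    b = {(0, 0): 4.0, (1, 0): 5.0, (1, 1): 6.0}
    print(spmm_two_phase(a, b))
```

In a sketch like this, the first phase streams through the inputs once while the second phase accesses output coordinates irregularly, which is one reason different phases of a sparse kernel can favor different on-chip memory behavior.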

Publication
Journal of Solid-State Circuits

Related