A 7.3 M Output Non-Zeros/J Sparse Matrix-Matrix Multiplication Accelerator using Memory Reconfiguration in 40 nm

Abstract

A sparse matrix-matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40 nm CMOS. On-chip memories are reconfigured as scratchpad or cache and interconnected with synthesizable coalescing crossbars for efficient memory access in each phase of the algorithm. The 2.0 mm $\times$ 2.6 mm chip exhibits a 12.6$\times$ (8.4$\times$) energy efficiency gain, an 11.7$\times$ (77.6$\times$) off-chip bandwidth efficiency gain, and a 17.1$\times$ (36.9$\times$) compute density gain against a high-end CPU (GPU) across a diverse set of synthetic and real-world power-law graph-based sparse matrices.
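The abstract refers to phases of the SpMM algorithm without naming them. As a rough software illustration only, the sketch below assumes a common two-phase formulation (a multiply phase that generates partial products, followed by a merge phase that accumulates them) and reads the title metric as the number of non-zeros in the output matrix divided by the energy consumed. The function names, the dictionary-based storage, and the energy figure in the usage lines are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def spmm_two_phase(a_cols, b_rows):
    """Multiply sparse A by sparse B (assumed two-phase formulation).

    a_cols[k] holds (row index i, value) pairs for column k of A.
    b_rows[k] holds (column index j, value) pairs for row k of B.
    Returns a dict mapping (i, j) to the non-zero entries of C = A * B.
    """
    # Phase 1 (multiply): column k of A times row k of B yields
    # partial products that all belong to the output matrix C.
    partials = []
    for k in a_cols.keys() & b_rows.keys():
        for i, a in a_cols[k]:
            for j, b in b_rows[k]:
                partials.append((i, j, a * b))

    # Phase 2 (merge): accumulate partial products with the same (i, j).
    c = defaultdict(float)
    for i, j, v in partials:
        c[(i, j)] += v

    # Drop entries that cancelled to exactly zero.
    return {ij: v for ij, v in c.items() if v != 0.0}


def output_nonzeros_per_joule(c, energy_joules):
    """Assumed reading of the title metric: nnz(C) / energy."""
    return len(c) / energy_joules


# A = [[1, 0], [2, 3]] stored by column, B = [[0, 4], [5, 0]] stored by row.
a_cols = {0: [(0, 1.0), (1, 2.0)], 1: [(1, 3.0)]}
b_rows = {0: [(1, 4.0)], 1: [(0, 5.0)]}
c = spmm_two_phase(a_cols, b_rows)

# Hypothetical energy value, chosen only to exercise the function.
efficiency = output_nonzeros_per_joule(c, energy_joules=1.5)
```

In this formulation the multiply phase streams through the inputs once, while the merge phase revisits partial products that share an output coordinate, so the two phases have different memory access patterns.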

Publication
In Symposium on VLSI Technology and Circuits
