R2D3: A Reliability Engine for 3D Parallel Systems


This paper proposes R2D3, a holistic reliability management engine for parallel 3D systems based on post-Moore technologies that have low yield and high failure rates. The proposed engine, comprising a controller, reconfigurable crossbars, and detection circuitry, provides concurrent single-replay detection and diagnosis, fault-mitigating repair, and aging-aware lifetime management at runtime. We show that R2D3 achieves 96% coverage of defects, repairs faulty cores, and reduces $V_{th}$ degradation by 53%. This leads to a 78% performance improvement over 8 years and a 2.16$\times$ longer mean time to failure compared with a baseline 8-core 3D processor with no reliability management.

In Design Automation Conference