Centip3De: A 64-Core, 3D Stacked, Near-Threshold System

Ronald G. Dreslinski

David Fick, Bharan Giridhar, Gyouho Kim, Sangwon Seo, Matthew Fojtik, Sudhir Satpathy, Yoonmyung Lee, Daeyeon Kim, Nurrachman Liu, Michael Wieckowski, Gregory Chen, Trevor Mudge, Dennis Sylvester, David Blaauw

University of Michigan
The Problem of Power

Power does not decrease at the same rate that transistor count increases, resulting in increased energy density.

Circuit supply voltages are no longer scaling...

Dynamic dominates

\[ U \approx \frac{CV_{dd}^2}{A} + \frac{I_{\text{leak}}V_{dd}}{Af} \]

A = gate area \(\rightarrow\) scaling \(1/s^2\)
C = capacitance \(\rightarrow\) scaling < \(1/s\)
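
To make the scaling argument concrete, the sketch below evaluates the energy-density expression above before and after one process shrink. The scaling factor `s`, the component values, and the function names are illustrative assumptions, not figures from the talk. Capacitance is assumed to scale as exactly 1/s (the slide gives only a bound), and the supply voltage is held constant, as the slide says it no longer scales.

```python
def energy_density(c, vdd, area, i_leak, f):
    """U ~ C*Vdd^2/A + I_leak*Vdd/(A*f), returned as (dynamic, leakage) terms."""
    dynamic = c * vdd**2 / area
    leakage = i_leak * vdd / (area * f)
    return dynamic, leakage


def scale(c, area, s):
    """Apply one shrink by factor s: A -> A/s^2, C -> C/s (assumed)."""
    return c / s, area / s**2


# Illustrative, made-up values in arbitrary units.
c0, vdd, a0, i_leak, f = 1.0, 1.0, 1.0, 0.01, 1.0
s = 1.4

dyn_before, _ = energy_density(c0, vdd, a0, i_leak, f)
c1, a1 = scale(c0, a0, s)
dyn_after, _ = energy_density(c1, vdd, a1, i_leak, f)

# With Vdd fixed, the dynamic term grows by a factor of s per shrink.
dynamic_growth = dyn_after / dyn_before
print(f"dynamic energy density grows by {dynamic_growth:.2f}x")
```

Under these assumptions the dynamic term rises by the same factor `s` with every generation, which is the growing energy density the slide describes.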

The emerging dilemma:
More and more gates can fit on a die, but cooling constraints are restricting their use.
Today: Super-$V_{th}$, High Performance, Power Constrained

Large gate overdrive favors performance with unsustainable power density

**Must design within fixed TDP**

Goal: maintain performance, improve Energy/Operation
Operating in sub-threshold yields large power gains at the expense of performance.

Applications: sensors, medical
Near-Threshold Computing (NTC):

- >60X power reduction
- 6-8X energy reduction
- Enables 3D integration
- Caches have higher Vopt and operating frequency
- Smaller activity rate when compared to core logic
- Leakage larger proportion of total power in caches
- New Architectures Possible
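
The power and energy figures above together imply a performance cost, since energy per operation is power multiplied by time per operation. The short calculation below is a rough sketch of that relationship. It treats the ">60X" power figure as a nominal 60X, which is an assumption, so the result is only indicative.

```python
def slowdown(power_reduction, energy_reduction):
    """Delay increase per operation implied by energy = power * time."""
    return power_reduction / energy_reduction


# Nominal 60X power reduction with the 6-8X energy reduction range.
low = slowdown(60, 8)
high = slowdown(60, 6)
print(f"implied slowdown: {low:.1f}x to {high:.1f}x")
```

In other words, a single near-threshold core runs several times slower than its full-voltage counterpart, which is why the design leans on many cores and on boosting.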
SRAM is run at a higher $V_{\text{DD}}$
- Caches operate faster than core

Can introduce clustered architecture
- Multiple cores share L1
- Cores see private L1
- L1 still provides single-cycle latency

Advantages:
- Less coherence/snoop traffic
- Larger cache for processes that need it

Drawbacks:
- Core conflicts evicting L1 data
  - Not dominant in simulation
- Longer interconnect
  - 3D addressable
Proposed Boosting Approach

Measured results for 130nm LP design
10MHz becomes ~110MHz in 32nm simulation
140 FO4 delay core

Baseline
- Cache runs 4x core frequency
- Pipelined cache

Better Single Thread Performance
- Turn some cores off, speed up the rest
- Cache de-pipelined
- Faster response time, same throughput
- Core sees larger cache
  - Faster cores need larger caches
Cache Timing

NTC Mode (3/4 Cores)
Low power
Tag arrays read first
0-1 data arrays accessed

Boost Mode (1/2 Cores)
Low latency
Data and tags read in parallel
4 data arrays accessed

Diagram:
- Tag Array
- Data Array
- Equations
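
The two access modes can be sketched as a lookup in one cache set. This is a minimal model, not the actual cache controller: the 4-way associativity is assumed from the "4 data arrays" figure, and the names `lookup`, `Access`, and `steps` are hypothetical.

```python
from dataclasses import dataclass

WAYS = 4  # assumed associativity, matching the "4 data arrays" on the slide


@dataclass
class Access:
    hit: bool
    data_reads: int  # number of data arrays activated
    steps: int       # sequential array-access steps (1 = tags and data together)


def lookup(tags, wanted_tag, mode):
    """Look up one set; tags holds the WAYS stored tags for that set."""
    hit = wanted_tag in tags
    if mode == "ntc":
        # Tags first, then read only the matching way's data array, if any.
        return Access(hit, data_reads=1 if hit else 0, steps=2 if hit else 1)
    if mode == "boost":
        # Tags and all data arrays in parallel; the way is selected afterwards.
        return Access(hit, data_reads=WAYS, steps=1)
    raise ValueError(f"unknown mode: {mode}")


set_tags = [0x1A, 0x2B, 0x3C, 0x4D]
ntc_hit = lookup(set_tags, 0x2B, "ntc")
ntc_miss = lookup(set_tags, 0x99, "ntc")
boost_hit = lookup(set_tags, 0x2B, "boost")
print(ntc_hit, ntc_miss, boost_hit)
```

The model shows the trade: NTC mode spends an extra sequential step to avoid reading data arrays it does not need, while boost mode pays for four data reads to finish in one step.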

Centip3De System Overview


- 7-Layer NTC system
- 2-Layer system completed fabrication with measured results
- Full 7-layer system expected end of 2012
Centip3De System Overview

- Cluster architecture
  - 4 Cores/cluster
  - 1kB I$, 8kB D$
  - Local clock controller operates cores 90° out of phase
  - 1591 F2F connections per cluster

- Organized into layer pairs (cache↔core)
  - Minimizes routing
  - Up to two pairs
  - 16 clusters per pair
  - Cores have only vertical interconnections
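
One way to picture how four cores can share a cache yet each see single-cycle latency: if the cache clock runs at 4x the core clock (the baseline stated earlier) and the core clocks are offset by 90°, each cache cycle lines up with one core's clock edge. The sketch below illustrates that slot assignment. The round-robin mapping and the function names are assumptions made for illustration.

```python
CORES_PER_CLUSTER = 4
CACHE_TO_CORE_RATIO = 4  # assumed: cache clock runs 4x the core clock


def core_phase_degrees(core_id):
    """Clock phase offset of each core; 90 degrees apart for four cores."""
    return core_id * 360 // CORES_PER_CLUSTER


def served_core(cache_cycle):
    """Core whose clock edge coincides with this cache cycle (round-robin)."""
    return cache_cycle % CORES_PER_CLUSTER


phases = [core_phase_degrees(c) for c in range(CORES_PER_CLUSTER)]
# One core cycle spans CACHE_TO_CORE_RATIO cache cycles.
slots = [served_core(t) for t in range(CACHE_TO_CORE_RATIO)]
print(phases, slots)
```

Every core gets exactly one cache slot per core cycle, so from the core's point of view the shared cache behaves like a private one.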
Centip3De System Overview

- Bus interconnect architecture
  - Up to 500 MHz
  - 9-11 cycle latency
  - 1-3 core cycles
- 8 lanes, each 128b
  - One per DRAM interface
  - Each cluster connects to all eight
  - 1024b total
- Vertically connected through all four layers
  - Flipping interface enables 128-core system
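
The bus width and core-count figures follow from simple arithmetic on the numbers in these slides. The sketch below reproduces it. The peak-bandwidth line assumes one 128b transfer per lane per bus cycle at the "up to" frequency and ignores protocol overhead, which is an assumption rather than a measured figure.

```python
LANES = 8
LANE_WIDTH_BITS = 128
BUS_MHZ = 500  # "up to" figure from the slide

CORES_PER_CLUSTER = 4
CLUSTERS_PER_PAIR = 16
MAX_LAYER_PAIRS = 2

total_width_bits = LANES * LANE_WIDTH_BITS
# Peak rate assuming one transfer per lane per bus cycle (an assumption).
peak_gbytes_per_s = total_width_bits * BUS_MHZ * 1e6 / 8 / 1e9

cores_one_pair = CORES_PER_CLUSTER * CLUSTERS_PER_PAIR
cores_two_pairs = cores_one_pair * MAX_LAYER_PAIRS

print(total_width_bits, peak_gbytes_per_s, cores_one_pair, cores_two_pairs)
```

The totals match the 1024b bus, the 64 cores of a single cache/core layer pair, and the 128 cores of the two-pair system.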
Centip3De System Overview

- 3D-Stacked DRAM
  - Tezzaron Octopus

- 1 control layer
  - 130nm CMOS

- 1 Gb bitcell layers
  - Up to two layers
  - DRAM process

- 8x 128b DDR2 interfaces
  - Operated at bus frequency (up to 500 MHz)
Centip3De System Overview

130nm process
12.66x5mm per layer
28.4M device core layer
18.0M device cache layer
2-Layer Stacking Process Evaluated

For the measured 2-layer system, aluminum wirebond pads were used instead of the TSVs planned for the 7-layer system.
Cache 3D Connections

[Figure: cache layer floorplan showing the SRAM arrays, sea-of-gates logic, and bus interface.]
Cluster 3D Connections

1591 F2F Connections
Each saved ~600-1000um in routing
Prevented wiring congestion around SRAMs
Silicon Results

[Figure: layer stack diagram with the labels Top Core Layer (Cortex M3 cores), Top Cache Layer (I$/D$), Bottom Core Layer (Cortex M3 cores), and Tezzaron Octopus DRAM (DRAM Control Layer and two DRAM Bitcell Layers). Some elements are marked "Disabled Due To Redundancy".]
Die Shot

Looking through back of core-layer

DRAM Interface/Bus Hub

4-Core Cluster

Aluminum wirebond pads

System Configurations

Cache Bus Hub
160 MHz
1.15 Volts

4 Core Mode

I$/D$
Div 4x
40 MHz
0.80 Volts

0 Core Boosted
0 Cores Gated

2 Core Mode

Cache Bus Hub
160 MHz
1.15 Volts

Div 2x
80 MHz
1.15 Volts

Div 2x
40 MHz
0.85 Volts

2 Core Boosted
2 Cores Gated

1 Core Mode

Cache Bus Hub
320 MHz
1.6 Volts

I$/D$
Div 2x
160 MHz
1.65 Volts

Div 2x
160 MHz
1.15 Volts

1 Core Boosted
3 Cores Gated

3 Core Mode

Cache Bus Hub
160 MHz
1.15 Volts

Div 2x
80 MHz
1.15 Volts

3 Cores Boosted
1 Core Gated

Div 4x
20 MHz
0.75 Volts

1 Core Boosted
3 Cores Gated
Measured Results

Boosting a single cluster to 1-core mode requires disabling or down-boosting other clusters.

One cluster in 1-core mode draws roughly the same power as:
- 15 clusters in 4-core mode
- 6 clusters in 3-core mode
- 4.5 clusters in 2-core mode
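
The trade-off above can be sketched as a power-budget check. This reads the list as power equivalences (one 1-core cluster costs as much as 15 4-core clusters, and so on), which is an interpretation of the slide. The relative power table, the 16-cluster budget, and the function names are all derived from that assumption and are illustrative only.

```python
# Power of each cluster mode, in units of one 4-core-mode cluster.
# Derived from the slide's ratios under the assumption stated above.
ONE_CORE = 15.0
RELATIVE_POWER = {
    "4-core": 1.0,
    "3-core": ONE_CORE / 6,    # 2.5
    "2-core": ONE_CORE / 4.5,  # about 3.33
    "1-core": ONE_CORE,
    "off": 0.0,
}


def total_power(modes):
    """Sum the relative power of a list of per-cluster modes."""
    return sum(RELATIVE_POWER[m] for m in modes)


def fits_budget(modes, budget):
    """True if the configuration stays within the power budget."""
    return total_power(modes) <= budget + 1e-9


# Budget: all 16 clusters of one layer pair in 4-core mode.
baseline = ["4-core"] * 16
budget = total_power(baseline)

boosted = ["1-core"] + ["4-core"] + ["off"] * 14
too_much = ["1-core"] + ["4-core"] * 15

print(fits_budget(boosted, budget), fits_budget(too_much, budget))
```

Under this reading, boosting one cluster to 1-core mode within a 16-cluster budget leaves room for only one more cluster in 4-core mode, with the rest disabled.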

Baseline configuration depends on TDP and processing needs
Measured Results

[Plot: single-threaded performance (DMIPS) versus system configuration.]

[Plot: power (mW) versus system configuration, broken down into core power, cache power, and memory system power.]
Measured Results

Centip3De – 3,930 DMIPS/W (130nm)

ARM A9 – 8,000 DMIPS/W (40nm) [1]

Centip3De – 18,500 DMIPS/W (45nm, projected)

Conclusion

- Near threshold computing (NTC)
  - Need low power solutions to maintain TDP
  - Achieves 10x energy efficiency => 10x more computation for a given TDP
  - Offers optimum balance between performance and energy
  - Allows boosting for single threaded performance (Amdahl's law)

- Large scale 3D CMP demonstrated
  - 64 cores currently
  - 128 cores + DRAM in the future
  - 3D design shown to be feasible

- This work was funded and organized with the help of DARPA, Tezzaron, ARM, and the National Science Foundation