# 27.6 An 821MHz 7.9Gb/s 7.3pJ/b/iteration Charge-Recovery LDPC Decoder

Tai-Chuan Ou, Zhengya Zhang, Marios C. Papaefthymiou

## University of Michigan, Ann Arbor, MI

This paper presents a 576b LDPC decoder test-chip designed using a charge-recovery logic family. The chip has been fabricated in a 65nm CMOS process and relies on 16 integrated inductors to achieve energy-efficient operation by recovering charge from gate fanouts. When self-oscillating at 821MHz, the chip recovers 51.4% of the energy supplied to it. In terms of device count, this chip is more than an order of magnitude larger than the largest previously-reported chips with charge-recovery logic [3-4]. When operating at 821MHz, it achieves a 7.9Gb/s throughput at 7.3pJ/b/iteration, improving on results in [1-2,5] by at least $1.7\times$ in energy efficiency and $2.3\times$ in area efficiency.

Figure 27.6.1 shows the schematic of a two-input four-bit charge-recovery logic comparator used in this chip. Compared to other similar charge-recovery logic families [3], this so-called subthreshold boost logic (SBL) [4] can achieve higher clock frequencies. The operation of an SBL gate is divided into two stages. The logic stage is similar to static CMOS logic operation, except that NMOS devices are used in both the pull-up and the pull-down network (PUN and PDN) to achieve gate-overdrive and perform functional evaluation with a subthreshold supply $(V_{CC})$ to develop an initial voltage difference between the dual-rail outputs. The boost stage, composed of a pair of cross-coupled inverters, then amplifies the voltage difference to full-rail during the rising transition of a power-clock waveform at pin PC. During the falling transition of the power-clock, charge is recovered from the output, and the output voltage returns to about V<sub>t</sub>. The dual-phase power-clock is generated using a so-called blip circuit that uses an inductive element and negative transconductance devices to resonate the parasitic capacitance of the network that distributes the power-clock to the charge-recovery gates. To achieve frequency scaling, an on-chip ring oscillator (RO), a pulse generator (PG), and frequency tuning circuits are included in the power-clock generator design. To operate the design off-resonance, a reference clock generated by the RO is supplied to the PG. The PG then outputs a pair of 180-degree out-of-phase pulses with tunable duty cycle, enabling frequency tuning and forcing the power-clock to run at the same frequency as the reference clock.

Figure 27.6.2 shows the charge-recovery LDPC decoder architecture for the 576b, rate-5/6 code specified by the IEEE 802.16e standard. Two columns in the code matrix are swapped to obtain a regular structure that facilitates partitioning into four blocks. This partitioning results in regular relay interconnects between neighboring blocks, replacing complex and long global interconnects. The min-sum decoding begins with the check node operation on the first row of Block 1, the results of which are relayed in order to Blocks 2, 3, and 4. Variable node operations on the first row follow the check node operation, while Block 1 begins the check node operation on the second row in parallel. One decoding iteration takes 24 cycles (48 phases) for one complete check node and variable node operation in all four blocks. The check node operation in one block, shown as an example in Fig. 27.6.2, takes 2.5 cycles (5 phases). The deeply-pipelined relay architecture accommodates the processing of four streams in parallel without any pipeline stalls.
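The check node operation of min-sum decoding can be illustrated with a minimal software sketch. The code below is the textbook min-sum check node update, not the chip's pipelined, block-relayed datapath: the function name, the integer message format, and the absence of any normalization or offset correction are assumptions made for illustration only.

```python
from typing import List


def min_sum_check_node(msgs: List[int]) -> List[int]:
    """Textbook min-sum check node update (illustrative sketch only).

    For each incoming variable-to-check message, the outgoing message has
    the sign given by the product of the signs of all OTHER inputs and the
    magnitude given by the minimum magnitude among all OTHER inputs.
    """
    if len(msgs) < 2:
        raise ValueError("a check node needs at least two inputs")

    mags = [abs(m) for m in msgs]

    # Only the smallest and second-smallest magnitudes are ever needed.
    order = sorted(range(len(mags)), key=mags.__getitem__)
    min_idx = order[0]
    min1 = mags[order[0]]
    min2 = mags[order[1]]

    # Product of all input signs (zero is treated as positive here).
    sign_product = 1
    for m in msgs:
        if m < 0:
            sign_product = -sign_product

    out = []
    for i, m in enumerate(msgs):
        # Exclude this edge's own magnitude from the minimum.
        mag = min2 if i == min_idx else min1
        # Exclude this edge's own sign from the sign product.
        sign = sign_product if m >= 0 else -sign_product
        out.append(sign * mag)
    return out


example = min_sum_check_node([3, -1, 4, -2])  # -> [1, -2, 1, -1]
```

Because only the two smallest magnitudes and one sign product are needed, partial results can be carried from one block to the next, which fits the relay structure described above. How the chip actually encodes and merges these partial results is not specified here.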

With over 57,000 SBL gates, the device count of this chip is more than an order of magnitude larger than previously-reported charge-recovery test-chips [3-4]. To accommodate this increased complexity, an automated standard-cell-like design flow has been developed that incorporates custom-designed dynamic cells and a two-phase power-clock. An SBL gate library has been created with 52 SBL gates of different drive strengths. Library gates have been characterized and used with commercial EDA tools for floorplanning, synthesis, placement, and routing.

Figure 27.6.3 shows the power-clock distribution network. To minimize clock skew, a clock mesh is employed for each of the two clock phases using top level metal (metal 9 for horizontal strips, and metal 8 for vertical strips). For each standard-cell row, two metal-3 horizontal strips are reserved for the power-clock waveform and its complement. These strips are tied to metal 8 of the clock mesh, so that the PC pin of each SBL gate can be connected to the mesh in a predictable manner using an automated place-and-route tool, while avoiding any possible large clock skew. Sixteen integrated 0.96nH (at 1GHz) inductors, along with 144 distributed negative transconductance devices, are used to generate the two-phase power-clock in the chip by resonating the parasitic capacitance of the clock distribution network. A tree structure with supply and ground shielding distributes a pair of 180-degree out-of-phase pulses to frequency-scaling circuitry at each inductor that can be used to operate the power-clock off-resonance at a desired frequency.
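As a rough sanity check on the resonance described above, the sketch below applies the ideal lumped LC relation $f = 1/(2\pi\sqrt{LC})$ to estimate the capacitance that one 0.96nH inductor would resonate at 821MHz. This is a back-of-the-envelope model only. It assumes an ideal, lossless tank with one inductor and one lumped capacitor, and it ignores the two-phase topology, the interaction between the 16 inductors, and the fact that the inductance is quoted at 1GHz. The function names are hypothetical.

```python
import math


def resonant_frequency_hz(l_henry: float, c_farad: float) -> float:
    """Ideal LC resonant frequency, f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_farad))


def tank_capacitance_f(f_res_hz: float, l_henry: float) -> float:
    """Capacitance that resonates with l_henry at f_res_hz (ideal tank)."""
    return 1.0 / ((2.0 * math.pi * f_res_hz) ** 2 * l_henry)


L_INDUCTOR_H = 0.96e-9  # per-inductor value from the text (quoted at 1GHz)
F_RES_HZ = 821e6        # measured resonant frequency from the text

# Equivalent lumped capacitance per inductor under the ideal-tank assumption.
c_equiv_f = tank_capacitance_f(F_RES_HZ, L_INDUCTOR_H)
print(f"equivalent capacitance per inductor: {c_equiv_f * 1e12:.1f} pF")
```

Under these assumptions the equivalent load comes out at a few tens of picofarads per inductor. The figure is only an order-of-magnitude indication, not a measured or reported value.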

The design of the power-clock network plays a key role in the efficiency of the charge-recovery LDPC decoder chip. Fig. 27.6.4 shows the energy consumption of the power-clock as obtained from one of the 16 inductors through simulations with the verified inductor models from the foundry and an extracted post-layout netlist that includes parasitic resistance, capacitance, and coupling capacitance. Simulation results show that the chip recovers 51.4% of the energy supplied from the specific inductor to the power-clock every cycle.

The chip has been fabricated in a 65nm CMOS process. The charge-recovery LDPC decoder logic occupies 1.54mm<sup>2</sup>. Fig. 27.6.5 shows measured energy per cycle at each operating frequency. Minimum energy consumption is 702.9pJ per cycle when the power-clock is operating at its resonant frequency of 821MHz, with supply voltage $V_{\text{DC}}$ = 0.64V and $V_{\text{CC}}$ = 0.36V, yielding 576.8mW of power dissipation at room temperature. Correct functionality has been validated for clock frequencies ranging from 640MHz to 1.05GHz. Fig. 27.6.5 indicates that voltage supply $V_{CC}$ consumes only 0.69-to-1.34% of total energy. In contrast, in the 5-to-187MHz FIR presented in [4], $V_{CC}$ consumes about 3-to-10% of total energy. This difference can be explained by the different operating frequencies of the two chips. At lower operating frequencies, the boost stage recovers charge more efficiently, whereas the crowbar current in the logic stage increases, resulting in relatively higher energy consumption through $V_{CC}$. However, higher operating frequencies lead to higher energy consumption in the boost stage and lower crowbar current in the logic stage, thus decreasing the relative levels of energy consumption through $V_{CC}$.
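The headline throughput and energy-efficiency figures can be cross-checked from numbers given in the text. The sketch below assumes that four 576b streams complete one iteration every 24 cycles and that throughput is normalized to 10 iterations, as noted in the chip summary. This accounting is an interpretation that happens to reproduce the reported 7.9Gb/s and 7.3pJ/b/iteration, not a formula stated by the authors.

```python
# Values taken from the text.
CODE_LENGTH_BITS = 576          # code length of the IEEE 802.16e code
PARALLEL_STREAMS = 4            # streams processed in parallel
CYCLES_PER_ITERATION = 24       # cycles per decoding iteration
F_CLK_HZ = 821e6                # resonant operating frequency
ENERGY_PER_CYCLE_J = 702.9e-12  # measured minimum energy per cycle

# Assumption: normalization to 10 iterations, per the chip-summary note.
ITERATIONS = 10

# Bits that complete one iteration every CYCLES_PER_ITERATION cycles.
bits_per_pass = CODE_LENGTH_BITS * PARALLEL_STREAMS

# Data throughput (all code bits counted) at the normalized iteration count.
throughput_bps = bits_per_pass * F_CLK_HZ / (CYCLES_PER_ITERATION * ITERATIONS)

# Energy spent per decoded bit for a single iteration.
energy_per_bit_iter_j = ENERGY_PER_CYCLE_J * CYCLES_PER_ITERATION / bits_per_pass

# Average power implied by energy per cycle times clock frequency.
power_w = ENERGY_PER_CYCLE_J * F_CLK_HZ

print(f"throughput: {throughput_bps / 1e9:.2f} Gb/s")
print(f"energy:     {energy_per_bit_iter_j * 1e12:.2f} pJ/b/iteration")
print(f"power:      {power_w * 1e3:.1f} mW")
```

The computed power, about 577mW, agrees with the reported 576.8mW to within rounding of the quoted frequency and energy values.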

Figure 27.6.6 gives the performance characteristics of the chip and compares it with the latest LDPC decoders. The charge-recovery LDPC decoder chip outperforms state-of-the-art designs with comparable code length and complexity [1-2,5], achieving at least $1.7\times$ better energy efficiency and $2.3\times$ better area efficiency. Compared to the decoder in [6] that uses an $18\times$ smaller code length, this chip achieves higher area efficiency. It does not match the energy efficiency of [6], however, as the $18\times$ smaller code length avoids the significant overheads associated with scaling up to longer code lengths. Even when the area overhead of the inductors is included, the chip in this work is more area efficient than the other LDPC decoder designs.

A die microphotograph is shown in Fig. 27.6.7. A built-in-self-test (BIST) circuit that generates and processes the input and output of the decoder, together with the RO, PG, and frequency-tuning circuits, is implemented with static CMOS logic and distributed around the decoder core. To decrease eddy currents, the sixteen 214µm×237µm, 0.96nH inductors are placed outside the staggered I/O pads, occupying 0.81mm<sup>2</sup>. This charge-recovery LDPC decoder chip demonstrates the potential of charge-recovery logic for energy- and area-efficient high-performance design, as well as an accompanying design methodology that leverages automated EDA tools and is applicable to large-scale DSP applications.

#### Acknowledgments:

This work was supported in part by NSF under grant No. CCF-0916714. The authors thank Chia-Hsiang Chen, Jerry Kao, Jinwoo Kim, Suhwan Kim, and Wei-Hsiang Ma for their valuable contributions to this work.

## References:

[1] B. Xiang, *et al.*, "An 847–955 Mb/s 342–397mW Dual-Path Fully-Overlapped QC-LDPC Decoder for WiMAX System in 0.13µm CMOS," *IEEE J. Solid-State Circuits*, vol. 46, no. 6, pp. 1416-1432, 2011.

[2] X. Peng, *et al.*, "A 115mW 1Gbps QC-LDPC Decoder ASIC for WiMAX in 65nm CMOS," *Asian Solid-State Circuits Conf.*, pp. 317-320, 2011.

[3] Y. Zhang, *et al.*, "A 1pJ/cycle Processing Engine in LDPC Application with Charge Recovery Logic," *Asian Solid-State Circuits Conf.*, pp. 213-216, 2011.

[4] W.-H. Ma, *et al.*, "187 MHz Subthreshold-Supply Charge-Recovery FIR," *IEEE J. Solid-State Circuits*, vol. 45, no. 4, pp. 793-803, 2010.

[5] S.-W. Yen, *et al.*, "A 5.79-Gb/s Energy-Efficient Multirate LDPC Codec Chip for IEEE 802.15.3c Applications," *IEEE J. Solid-State Circuits*, vol. 47, no. 9, pp. 2246-2257, 2012.

[6] D. Miyashita, *et al.*, "A 10.4pJ/b (32, 8) LDPC Decoder with Time-Domain Analog and Digital Mixed-Signal Processing," *ISSCC Dig. Tech. Papers*, pp. 420-421, 2013.

[Figure residue. Recoverable labels: swapped code-matrix columns, block, Buf, Sum, and check node operation (Fig. 27.6.2); "51.4% Recovery" and "31.5pJ" annotations (apparently from the power-clock energy plot, Fig. 27.6.4).]

Technology scaling from 0.13µm, $V_{DD}$ = 1.2V to 65nm, $V_{DD}$ = 1.0V: $S = \frac{0.13\mu m}{65nm}$, $U = \frac{1.2V}{1.0V}$, Delay $\sim \frac{1}{S}$, Area $\sim \frac{1}{S^2}$, Power $\sim \frac{1}{U^2}$. Information throughput is scaled to data throughput.

Normalized to 10 iterations.

Figure 27.6.6: Chip summary and comparison with state-of-the-art.

# **ISSCC 2014 PAPER CONTINUATIONS**





