Design Time Body Bias Selection for Parametric Yield Improvement

Cheng Zhuo, Yung-Hsu Chang, Dennis Sylvester, David Blaauw
EECS Department, University of Michigan, Ann Arbor, MI 48109
{czhuo, yunghsuc, dennis, blaauw}@eecs.umich.edu

Abstract—Circuits designed in aggressively scaled technologies face both stringent power constraints and increased process variability. Achieving high parametric yield is a key design objective, but is complicated by the correlation between power and performance. This paper proposes a novel design time body bias selection framework for parametric yield optimization while reducing testing costs. The framework considers both inter- and intra-die variations as well as power-performance correlations. This approach uses a feature extraction technique to explore the underlying similarity between the gates for effective clustering. Once the gates are clustered, a Gaussian quadrature based model is applied for fast yield analysis and optimization. This work also introduces an incremental method for statistical power computation to further reduce the optimization complexity. The proposed framework improves parametric yield from 39% to 80% on average for 11 benchmark circuits while runtime is linear with circuit size and on the order of minutes for designs with up to 15K gates.

I. INTRODUCTION

Semiconductor technologies are characterized by trends of ever shrinking feature dimensions. As a result, process variability has become more prominent in sub-nanometer regime designs and poses a major challenge to improving circuit performance and reducing leakage [1]. Given the large contribution of leakage power to total power in recent technology nodes, delay and power are now negatively correlated across process corners [2]. In such a scenario, high speed parts are also very high leakage, imposing a two-sided constraint on the feasible region of delay and leakage for parametric yield optimization. This ultimately causes a significant yield loss of manufactured dies in modern integrated circuits [2].

To address this issue, numerous pre- and post-silicon statistical optimization methods have been proposed to mitigate yield loss due to process variability [2]–[14]. However, several of these approaches neglect the power-performance correlation by treating the impact of delay and power separately [8], [9]. Works in [10], [11] investigated nonlinear optimization by assuming that gate sizes are continuous. Thus, they could either apply a simplified power yield model or transform yield maximization to slack minimization. These techniques only mitigate the variation indirectly, rather than performing true yield maximization due to the approximated formulation or neglect of power-performance correlation. Moreover, growing circuit size further restricts the efficacy of traditional pre-silicon techniques (gate sizing or dual-threshold voltage assignment) in guaranteeing reliable circuit operation with desired parametric yield [2], [10], [11].

Beyond these pre-silicon approaches, several post-silicon techniques have been proposed for design optimization [12]–[14]. Among these, adaptive body biasing (ABB) is a promising post-silicon technique due to its flexibility [15]. Traditionally ABB is used to tune each chip individually after chip fabrication and testing, as shown in Figure 1. Thus, conventional post-silicon ABB is limited by (1) routing/control overhead to adjust devices/gates at a very fine grained level [12], [15] and (2) increased post-fabrication testing costs to determine the optimal body voltage. To reduce overhead to a feasible level, [12] presented a heuristic clustering method, in which gates are grouped at design time into a small set of clusters and controlled by one body bias within the cluster. Mani, et.al., suggested the coordination of pre-silicon (gate sizing) and post-silicon (ABB) techniques, and formulated this as a robust programming problem. Similar to [10], [11], both methods [12], [13] separate the correlation between power and delay and do not evaluate the true parametric yield (joint yield of power and delay). Above all, tuning in [12], [13] is carried out entirely as a post-silicon step. Clearly, such a strategy incurs large post-silicon testing costs.

In order to reduce testing overhead, this paper presents a low-cost pre-silicon body bias selection framework to maximize parametric yield. This framework considers both inter- and intra-die variation as well as power-performance correlation. The major difference between the proposed framework and traditional ABB is that our work does not require individual tuning of each chip during post-silicon testing to select the body bias to be applied. Instead, as shown in Figure 2, our framework optimizes and fixes body bias during design time to improve the yield of manufactured dies. Once the bias levels are chosen, simple and compact circuits can be readily designed to provide the chosen reference voltages [16], [17]. Unlike post-silicon ABB, in which bias voltages for each chip are chosen in a deterministic way (since measurement results for a particular manufactured die are known and deterministic), pre-silicon body bias selection framework must statistically incorporate the variability to tune the ensemble of all chips simultaneously. This provides the following advantages:

- **Low testing time and cost** Pre-silicon body bias selection statistically optimizes the body bias for the manufactured dies, eliminating the testing cost increases associated with the post-silicon approaches.
• High flexibility: Design-time body bias selection can be easily implemented using on-chip reference voltages [16], [17] and hence has continuous-domain design variables. This appealing feature is in contrast to most traditional pre-silicon techniques such as gate sizing or dual $V_{th}$, which have discrete domain design variables.

• Good scalability: Our pre-silicon body bias selection uses a small number of gate clusters (where each cluster is assigned to a different bias voltage), enabling a theoretically rigorous formulation of parametric yield as well as scalability to large circuits.

The proposed framework consists of two phases. We first determine the body bias profiles for each gate, which reflects the preferred body biases across an expected representative set of dies based on process variability models. Then a feature extraction technique is applied to those profiles to efficiently cluster the gates. The general concept behind gate clustering is to group gates with statistically similar behavior. The heuristic approach in [12] uses an affine function of mean, standard deviation, and correlation coefficients to determine similarity. However, this method requires large runtime and memory consumption for the greedy search and correlation matrix construction. Also, the chosen weights of the affine function may not be globally applicable across all circuit topologies. In addition, this approach discards most information from the original body bias profiles and therefore is not robust with respect to outliers. As a result, in our framework we propose a general and scalable clustering method based on feature extraction, without any dependence on empirical parameters. The feature extraction technique projects the original body bias profiles of the gates to a reduced set of features (feature vector) [18]. The feature vectors contain the general characteristics of the profiles and can be computed efficiently for body bias profile similarity comparison. In particular, the comparison is made by computing the distance of two feature vectors and grouping together the gates with closer distance.

After clustering the gates, the second phase formulates the body bias selection problem as a small-sized unconstrained nonlinear programming (NLP) problem. The NLP is solved by a large-scale optimizer (Lancelot [19]) with a fast yield evaluation scheme called Gaussian quadrature to compute the objective yield. An incremental method is also introduced to quickly compute the probability density function (PDF) of leakage power. Experiments show that the proposed framework can optimize a circuit with 14592 gates within 20 minutes to achieve 52 point yield improvement. For eleven circuits of different sizes, yield is improved from 39% to 80% on average. The key contributions of this paper are listed below:

1. We present a low-cost pre-silicon framework to select body bias at design time for direct parametric yield optimization. We show that the proposed pre-silicon approach retains the majority of the yield benefits of more complex die-specific post-silicon ABB approach as well as a higher flexibility than traditional pre-silicon methods like dual-$V_{th}$ and gate sizing. The framework considers both process variation and the correlation between performance and power.

2. To effectively cluster the gates, a feature extraction-based technique is employed. We apply a Haar wavelet transform to extract the features from the statistical body bias profile of each gate. Then a k-median-like algorithm is presented to optimally cluster the gates with similar features.

3. In the optimization framework, the yield objective is repeatedly computed. We present a fast and accurate method using Gaussian quadrature to compute the yield in the form of a bi-variate normal integral. An incremental technique for statistical power computation is also introduced to further reduce gradient computation complexity.

II. FEATURE EXTRACTION-BASED GATE CLUSTERING

Gate clustering is a critical step in practical ABB approaches. Once the clustering is performed, the body voltage of the cluster is determined so that its most timing critical gates meet the overall circuit delay constraint, indicating that most gates in a cluster will end up requiring a larger (more forward) body bias than necessary. It is therefore vital to cluster gates with similar body bias characteristics to minimize optimality loss. This section discusses a new feature extraction-based technique for gate clustering.

A. Leakage Power and Delay Modeling

Body biasing uses the body effect to modulate the threshold voltage of a MOSFET. Since the analytical expressions that govern the impact of body bias on delay and leakage at the gate level are fairly complex, we adopt the quadratic leakage model and linear delay model from [12]. SPICE simulation validates that a quadratic model for leakage and linear model for delay achieve an average error of 5.9% and 1.5% respectively across a 90nm standard cell library. Change of leakage and delay with body bias can then be expressed by [12]:

\[
\Delta L_i(v_{b,i}) = L_{0,i}(p_{0,i} + p_{1,i}v_{b,i} + p_{2,i}v_{b,i}^2) 
\]
\[
\Delta D_i(v_{b,i}) = D_{0,i}(d_{0,i} + d_{1,i}v_{b,i}) 
\]

where $\Delta L_i(v_{b,i})$ and $\Delta D_i(v_{b,i})$ are the leakage and delay change, $L_{0,i}$ and $D_{0,i}$ are the nominal leakage and delay value with zero body bias, $v_{b,i}$ is the body voltage for gate $i$, $p_{j,i}$ and $d_{j,i}$ (j=0,1,2) are the fitting parameters for the quadratic leakage model, and $d_{j,i}$ (j=0,1) are the fitting parameters for the linear delay model.

B. Design Space Exploration

We assume that each circuit constitutes its own unique design space subject to certain parameter variations. Our variation formulation incorporates both inter- and intra-die variations [3], [4]. To identify the difference in gates, we first generate multiple “die samples” following certain variations for the given circuit in a Monte Carlo fashion. For each sample circuit we assume each gate can be tuned individually and construct the deterministic quadratic programming (QP) to find the optimal body bias of each gate for leakage minimization [12]:

\[
\text{Minimize} \quad \sum_j \Delta L_j(v_{b,j}) 
\]

Subject to

\[
AT_{PI} = 0, \quad AT_{PO} < \text{Target} \tag{4}
\]
\[
AT_{i,j} + D_{j,ABB}^{ABB} < AT_{o,j} \quad \text{for } \forall j \tag{5}
\]
\[
D_{j}^{ABB} = D_{0,j} - \Delta D_j(v_{b,j}) \quad \text{for } \forall j \tag{6}
\]

where $AT$ is the arrival time of the signal on a wire, subscripts “i” and “o” denote input and output, and $D_{j}^{ABB}$ denotes the delay of a biased gate. The first constraint limits the arrival times at primary input (PI) to be zero and the arrival times at primary output (PO) to be less than the design target. The second and third constraints indicate that the delay at the output of each gate should be at least equal to the arrival
time at each of its inputs plus the delay of the gate $D_{ABB}$. This QP can be efficiently solved by CPLEX [20] to obtain the optimal body voltage for a particular sample (die) in the design space. The histogram of the optimal body bias for a gate sheds insight on the statistical behavior of the gate in this design space. We can then distill information from this histogram in determining which gates should share a common body potential. This is the critical clustering step (described below). The design space exploration is executed only once for each unique design before the optimization stage.

C. Feature Extraction

A straightforward approach to clustering is to group together the gates with similar body bias profiles. However, it is difficult to define "similarity" quantitatively. It is impractical and inefficient to simply use the complete profiles to cluster the gates because of their large sizes and the resulting noise sensitivity in the distributions. As previously stated, [12] suggested a weighted affine function of mean, deviation and correlation to judge the similarity between the gates. However, construction of the correlation matrix between the gates leads to a memory complexity of $O(N_g^2)$ for $N_g$ gates and limits its applicability. Beyond these runtime concerns, the greedy search is heavily dependent on the carefully chosen weights and the order of the gates to be visited. This makes the method sensitive to outliers and allows gates to be misgrouped. Furthermore, since the affinity of the non-grouped gate to the cluster is computed by taking the average of the weights, highly deviated data may have a disproportionate impact on the average and lead to poor selections. We therefore present a faster and more robust clustering strategy in our framework.

We employ a pattern recognition technique called feature extraction to obtain the main features of the profile while filtering out noise and redundant information. The concept behind feature extraction is to extract the general characteristics of the profiles, maintaining the most common information and discarding outliers. The body bias profile for each gate is then uniquely identified by a feature vector, $v_i = [x_{i1}, x_{i2}, ..., x_{in}]^T$ with $n$ features, which are used to measure similarity. To apply the technique across all gates and preserve important information, some pre-processing is performed to build a unified and suitable system. The pre-processing includes two stages:

1. Offset removal. This stage simply aligns the histograms to the same body voltage intervals, so that voltage ranges are unified for all the gates.
2. Envelope construction. The original histogram is based on a coarse grid and cannot be directly used. In this stage, we apply linear interpolation to map the histogram data to a finer grid and construct the basic shape of the profile envelope. Figure 3(a)-(c) shows the pre-processing procedure for a randomly selected gate in circuit c6288.

Fig. 3. Pre-processing procedure: (a) original histogram (b) offset removal for body voltage (x-scale) (c) envelope construction

Once the body bias profiles are available in the form of unified envelopes, we apply the feature extraction technique to determine the underlying characteristics of each gate. The proposed feature extraction is achieved by Haar wavelet transform. With a one-level Haar transform, the original body bias waveform with $n$-sample points can be transformed to two $n/2$-entry vectors (approximation coefficients $x_i$ and detail coefficients $y_i$):

$$x_i[n] = (x_{i-1}[2n] + x_{i-1}[2n+1]) / \sqrt{2} \quad (7)$$

$$y_i[n] = (x_{i-1}[2n] - x_{i-1}[2n+1]) / \sqrt{2} \quad (8)$$

The detail coefficients representing the local characteristics are easily disturbed by outliers and hence discarded. The approximation coefficients preserving the general characteristics are then decomposed repeatedly until a feature vector with a required number of features ($n/4, n/8$, etc.) is obtained. In our work, an eight-entry feature vector is extracted from the body bias profile for each gate. Figure 4 shows a simple example of a two-level Haar transform architecture, where $g[n]$ and $f[n]$ represent (7) and (8), respectively.

Since the approximation coefficients indicate the accumulated activities, the feature vectors naturally embody the mean and variance information of the profiles. Moreover, as two highly correlated gates should exhibit similar body bias profiles and hence similar feature vectors, the correlation between gates is well modeled by the proposed method. Thus, the feature vector preserves more information than the method in [12].

D. Gate Clustering

Gate clustering groups gates with similar behaviors. We here propose the following definition to quantify the similarity of feature vectors.

**Definition:** The similarity of two feature vectors $v_1, v_2$ is the cosine of the angle between them:

$$S_{v_1,v_2} = \cos(a) = \frac{|v_1^T v_2|}{\|v_1\| \|v_2\|} \quad (9)$$

where $\| \cdot \|$ denotes the Euclidean norm. The use of the angle between vectors provides two main advantages: (1) It correctly measures the distance between two vectors. Since any entry in a feature vector is always non-negative, a larger Euclidean distance is equivalent to a larger angle and hence a smaller $S_{v_1,v_2}$ (2) The value is normalized and does not depend on any amplitude gains or empirically chosen weights.

Fig. 4. An example of a two-level Haar wavelet transform. A 128-entry input $x_1[n]$ is reduced to a 32-entry vector.

Here we present the simplest example of two clusters. $N$-cluster decomposition will be an extension of the two-cluster case and is discussed in section II-E. In this example we need to classify the gates into two clusters based on their feature
The initial seed gates for each cluster may be easily assigned, namely the most forward-biased and most reverse-biased gates, which are determined by sorting the mean of the body bias profiles. These two gates should clearly be in separate clusters. The seeds become the initial centroids of the clusters. A centroid is defined as a vector that maximizes the sum of similarities of all other points within the cluster to itself. After initial seeds are selected, gates are visited in sequence and their similarities to the centroid of each cluster computed by (9). Each gate is then placed in the cluster with the highest similarity after which the centroid of the corresponding cluster is updated. This procedure is described in Figure 5 and carried out repeatedly until all gates are classified.

Since computing the centroid using the arithmetic mean is not robust to outliers or noise, we therefore propose a low-cost k-median-like algorithm in this paper to compute the centroid. This strategy circumvents the potential problem of highly deviated data skewing the arithmetic average in [12].

For example, if \( n \) gates are contained in the cluster, the centroid is:

\[
\mathbf{u} = \arg \max_{\mathbf{v}_i \in \text{cluster}} S_{\mathbf{u}, \mathbf{v}_i}, \quad \mathbf{u} \in \{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_m\} \tag{10}
\]

This is a nonlinear discrete optimization problem that is difficult to solve. We therefore employ a two-phase relaxation scheme to tackle this problem. The first phase relaxes the problem to an unconstrained continuous optimization and finds the optimal condition, which is:

\[
\max \sum_{\mathbf{v}_i \in \text{cluster}} \frac{\mathbf{v}_i^T \mathbf{u}}{\|\mathbf{v}_i\| \|\mathbf{u}\|}, \quad \mathbf{u} \in \mathbb{R}^{n \times 1} \tag{11}
\]

This can be further simplified to:

\[
\max \mathbf{w}^T \mathbf{x} \tag{12}
\]

where \( \mathbf{w} = \sum \mathbf{v}_i/\|\mathbf{v}_i\| \) and \( \mathbf{x} \) is a normalized vector \( \|\mathbf{u}\| \). The inner product of vector \( \mathbf{a} \) and a normalized vector \( \mathbf{b} \) is the length of projection of \( \mathbf{a} \) on \( \mathbf{b} \). Thus, the maximum of (12) is reached when vectors \( \mathbf{x} \) and \( \mathbf{w} \) lie in the same direction, i.e., \( \mathbf{x} = \mathbf{w}/\|\mathbf{w}\| \). This is denoted as the optimal condition for centroid selection.

The second phase consists of a local search among gates in the cluster to find the closest match to the optimal centroid found above. This is achieved by computing the similarity between each normalized feature vector and the optimal centroid (using (9)). The vector with the largest similarity is then chosen to be the centroid of the cluster. The algorithm for centroid update is shown in Figure 6. Since a correlation matrix is not required in our algorithm, the memory complexity is reduced from \( O(N_g^2) \) to \( O(N_g) \).

E. Extension to \( N \) Clusters

The 2-cluster gate clustering algorithm in Figure 5 can be extended to an efficient successive clustering algorithm for \( N \)-clusters. The cluster is recursively bi-partitioned until the number of clusters reaches or exceeds \( N \). The algorithm is outlined in Figure 7. We use a binary-tree data structure to model the successive clustering of the gates. The root node of the tree contains all the gates in the circuit whereas a leaf node represents the resulting cluster without any children nodes.

There are two possible scenarios for creating a leaf node: (1) Normal termination. When the total number of leaf nodes and non-leaf nodes reaches the required \( N \), all the nodes at the lowest level become leaf nodes; (2) Fast termination. For a node with no more than 10% of the total gates, we consider this to be a leaf node without further decomposition. A typical example is shown in Figure 8(a) in which \( N=3 \). On the second level, the node on the right contains fewer than 10% of the total gates and is immediately considered to be a leaf without further decomposition (fast termination). The node on the left with 153 gates is further decomposed to two non-leaf nodes on the third level. Since the total number of non-leaf nodes and leaf nodes has reached the required number \( N (=3) \), the two non-leaf nodes are then considered as leaves and terminated (normal termination).

If the number of the nodes (including leafs and non-leafs on the bottom level) exceeds \( N \), which commonly occurs, a recombination stage is employed. In this case the node with the fewest gates (node A in Figure 8(b)) is recombined to either node A's sibling node or the node whose parent is the sibling of node A's parent. The candidate with fewer gates will be chosen (node B in the figure). Since the number of clusters is limited in practice, the algorithm in general is terminated within 3-4 iterations.

\footnote{In practice the number of the clusters will not exceed 10}
III. DESIGN-TIME BODY BIAS SELECTION

A. Statistical Delay and Leakage Models for Biased Gates

This section describes the statistical gate-level models for the parametric yield optimization framework. Following the examples of [2]–[4], each process parameter can be transformed to a linear combination of \( m \) independent gaussian random variables \((z_i)\) and the random residual \( R \) from principal component analysis (PCA). Both delay and log of leakage can then be canonically expressed by two gaussian random variables:

\[
D = D_0 + \sum_{i=1}^{m} a_i z_i + a_{m+1} R
\]

\[
\ln(L) = \ln(V_0) + \sum_{i=1}^{m} b_i z_i + b_{m+1} R
\]

where \( a_i \) and \( b_i \) are the corresponding coefficients obtained from PCA [3], [4]. Assuming the gate is biased at a particular voltage \( V_b \), with the models in (1)-(2), gate delay and log of leakage are:

\[
D_{ABB} = \Delta k \times D, \quad \ln(L_{ABB}) = \ln(L) + \Delta V
\]

where \( \Delta k = 1 - d_0 - d_1 V_b \) and \( \Delta V = \ln(1+p_0+P_1 V_b+p_2 V_b^2) \).

For certain body bias \( V_b \), the framework performs timing analysis by propagating the delay from gate to gate as in [3], [4] using (14). Meanwhile we can maintain the node delay in a canonical form with different coefficients. Leakage power analysis is achieved by summing lognormal random variables: \( \ln(P) \) and \( \ln(L) \), with the models in (14). Assuming the body voltage for a cluster \( \max \{a_i\} \), \( b_i \), the correlation between \( D_{ABB} \) and \( \ln(L_{ABB}) \) can be easily evaluated as:

\[
\text{Cov}(D_{ABB}, \ln(L_{ABB})) = \sum_{i=1}^{m+1} \Delta k a_i b_i
\]

B. Yield Analysis and Optimization

Based on the biased gate models, we can perform statistical timing and power analysis and compute the correlation between delay and leakage power. Parametric yield of the circuit is defined as in [2]:

\[
Y = \Pr(D < D_{con}, \ln(P_L) < \ln(P_{con} - P_D))
\]

where \( D_{con} \) and \( P_{con} \) are constraints for delay and power, respectively, \( P_L \) is the leakage power and \( P_D \) is the dynamic power of the circuit. Both circuit delay and log of leakage are two gaussian random variables. The underlying problem in (16) is then the integral of a bi-variate normal distribution over a rectangular region. The five parameters \( \mu_D, \sigma_D, \mu_L, \sigma_L \) and \( \rho \) (the mean and standard deviation of circuit delay and log of leakage power, and their correlation coefficient) are used to define the bi-variate normal distribution.

For simplicity, (16) can be normalized as:

\[
Y = \Pr(x < a, y < b) = \frac{1}{2\pi\sqrt{1-\rho^2}} \int_a^b \int_{-\infty}^{\infty} e^{-\frac{x^2+2\rho xy+y^2}{2(1-\rho^2)}} \, dx \, dy
\]

where \( x = \frac{D-D_{con}}{\sigma_D} \) and \( y = \frac{\ln(P_L)-\mu_L}{\sigma_L} \) are normalized random variables, \( a = \frac{\ln(P_{con} - P_D)}{\mu_D} \) and \( b = \frac{\ln(P_{con} - P_D) - \mu_D}{\sigma_D} \) are the normalized constraints on delay and log of leakage power, and \( \rho \) is the correlation coefficient between the circuit delay and log of leakage. To evaluate this integral, [2] transformed the original rectangular region to a triangular region. The new region is then partitioned into several sub-domains and computed in sequence. However, this method may suffer from a high complexity of transformation and partitioning. To avoid these problems we propose the use of the Gaussian quadrature technique [21]–[23]. Gaussian quadrature is an efficient approach to compute integrals by a weighted sum of function values at specified abscissae within the domain of integration, and can reach analytical accuracy by a suitable choice of abscissae and weights. Reference [21] suggests a Gaussian quadrature model to compute the integral:

\[
\int_0^\infty \exp(-x^2) f(x) dx \approx \sum_{i=1}^{15} w_i f(x_i)
\]

where \( x_i \) and \( w_i \) are abscissae and weights that are fixed for the integral of the form above without any dependence on \( f(x) \).

With the substitutions of \( u = \frac{a-x}{\sqrt{2(1-\rho^2)}} \), \( v = \frac{(a-\mu_P)}{\sqrt{2(1-\rho^2)}} \), \( a_1 = \frac{a}{\sqrt{2(1-\rho^2)}} \) and \( b_1 = \frac{\sqrt{1-\rho^2}}{\sqrt{2(1-\rho^2)}} \), (17) can be simplified to:

\[
Y = \frac{1}{\pi} \int_0^\infty \int_0^\infty \exp(-u^2 - v^2) Y(u, v) \, du \, dv
\]

By applying the model in (18) to \( u \) and \( v \) separately, we obtain:

\[
Y = f(a_1, b_1, \rho) = \sum_{i=1}^{15} a_i b_j Y(x_i, y_j)
\]

Since \( x_i \) and \( w_i \) in (21) are fixed for any arbitrary function \( Y(u, v) \) [21], the computation time of (21) is independent of the problem size.

Based on the proposed yield analysis\(^2\), our yield optimization problem can be formulated as an unconstrained optimization problem where the objective function is (16) and the design variables are the body voltage of each cluster, as shown below:

\[
\max \Pr(D < D_{con}, \ln(P_L) < \ln(P_{con} - P_D))
\]

This problem is then solved by the optimizer Lancelot [19]. Lancelot numerically evaluates the objective function and gradient of the yield. Thus, the optimization formulation in this section can use high-order models or even table-look-up models for computing the intrinsic gate delay and leakage to guarantee the accuracy in optimization.

C. Gradient Computation and Complexity

Lancelot [19] requires the computation of the gradient of the yield with respect to the body voltage of each cluster. This can be estimated by increasing or decreasing the body voltage of a cluster by a small amount and then computing the yield difference due to the body voltage change. To improve the efficiency of this step, we suggest a power perturbation scheme instead of a full-circuit statistical power analysis.

Assuming the body voltage for a cluster \( k \) is changed by a small amount \( \Delta v \), the change in leakage power can then be written as:

\(^2\)The model in (18) requires that when \( \rho < 0 \), which is the typical case for log of leakage and delay, the constraints should be \( a \geq 0 \) and \( b \leq 0 \). The other constraint cases, \( \{a \geq 0, b \geq 0\}, \{a \leq 0, b \geq 0\} \) and \( \{a \geq 0, b \leq 0\} \), can be easily transformed to \( \{a \leq 0, b \leq 0\} \) by exploiting the underlying characteristics of bivariate normal PDF [21]–[23].
\[ \Delta P_L = \sum_{i \in k} \{P_{i,0} [1 + p_{i,0} + p_{1,i} (v_{b,k} + \Delta v)] + p_{2,i} \times \left( v_{b,k} + \Delta v \right)^2 \} - P_{i,0} \left( 1 + p_{i,0} + p_{1,i} v_{b,k} + p_{2,i} v_{b,k}^2 \right) \]

where \( P_{i,0} \) is the leakage with zero-body bias for gate \( i \), and \( p_{1,i} \) and \( p_{2,i} \) are the coefficients for the leakage model in (1). This can be further simplified to:

\[
\Delta P_L = \sum_{i \in k} P_{i,0} (p_{1,i} \Delta v + p_{2,i} \Delta v^2) + v_{b,k} \sum_{i \in k} 2P_{i,0} p_{2,i} \Delta v
\]

Since the body voltage increment \( \Delta v \) can be fixed in gradient analysis, \( v_{b,k} \) is the only variable in (24), which indicates that the coefficients \( \sum_{i \in k} P_{i,0} (p_{1,i} \Delta v + p_{2,i} \Delta v^2) \) and \( \sum_{i \in k} 2P_{i,0} p_{2,i} \Delta v \) can be computed in advance and used throughout the whole optimization process. We just need to perform \( N \) summations to compute the change in the leakage PDF for \( N \) clusters in gradient computation. The complexity is reduced from \( O(N N_g) \) for \( N \) full statistical power analysis to \( O(N) \), where \( N_g \) is the number of gates.

Timing perturbation is performed by a full statistical static timing analysis (SSTA). Once we obtain the delay and leakage-power PDFs of the perturbed circuit (the body voltage of the \( k_{th} \) cluster is changed from \( v_{b,k} \) to \( v_{b,k} + \Delta v \)), the yield of the perturbed circuit can be calculated by (21), and the change in yield is used to define the particular component of the yield gradient. Since yield analysis has a constant complexity, the overall algorithm complexity of this optimization framework is dominated by SSTA, the complexity of which is \( O(N (N_g + E)) \), where \( E \) is the number of edges of the timing graph. The number of the clusters is limited in real designs and is negligible compared to \( N_g \) and thus the framework maintains a linear complexity. Experimental results in section IV also validate that yield optimization takes only seconds even for a circuit with tens of thousands of gates.

### IV. Experimental Results

The proposed framework discussed in sections II and III were implemented in C and tested on ISCAS85 benchmark circuits and a Viterbi Decoder circuit (Vit1) that vary in size from 166 to 14539 gates. The circuits were synthesized using an industrial 1.2V 90nm triple-well dual-Vth technology. The two \( Vth \) values are 0.32V (0.33V) and 0.22V (0.24V) for NMOS (PMOS). Body voltage is varied between -0.5 and 0.5V. All standard cells in the library were characterized (using SPICE) at both the high- and low-\( Vth \) values. Only channel length variation is considered for simplicity. However, the overall approach can be extended to consider other sources of variability. We considered inter-die, spatially correlated intra-die, and random components of variation. Total 3\( \sigma/\mu \) for channel length variability is set to 15% and then split evenly among the three variation components.

#### A. Efficacy of Feature Extraction-Based Clustering

Reference [12] proposed a clustering algorithm based on an empirical affine weighting function. Table I compares the proposed feature extraction-based clustering algorithm (Feat.) with the work of [12] (Empir.) in terms of both resulting leakage and runtime. Column 2 lists the number of gates for each circuit, varying from 166 to 14539 gates. Columns 3-6 compare the mean/standard deviation and 95\textsuperscript{th} percentile leakage of the proposed method and the method from [12], respectively. The proposed approach improves upon the prior work in all measures and achieves a 10% and 17% reduction in the mean and standard deviation of the leakage, respectively. The last two columns of Table I compare the gate clustering runtime for the two methods. The runtime for the proposed method shows linear dependence on circuit size with a small slope, whereas the runtime for the empirical function-based method [12] increases exponentially. On average, the proposed method is 5.1 \( \times \) faster than the method in [12]. For the largest circuit Vit1, the proposed method achieves 18 \( \times \) speed-up.

#### B. Monte Carlo Convergence

The design space exploration step described in section II-B is executed only once in the framework but still involves solving a large number of QP problems to determine the body bias profile across process variability. To speed up this step, we employ the smart sampling approach in [24], which captures the importance of the samples to reduce the number of samples. Figure 9 shows the dependence of yield optimization results on the number of Monte Carlo samples for the six largest circuits in our set of benchmarks. The quality of the yield optimization results with 100 samples is similar to the results with 1000 samples. We therefore use 100 samples in the exploration step, as design space exploration is only required to outline general features instead of local details. Moreover, since the QP optimization for a

<table>
<thead>
<tr>
<th>CTK.</th>
<th>#gates</th>
<th>Leakage comparison (( \mu )W)</th>
<th>Time (sec.) for clustering</th>
</tr>
</thead>
<tbody>
<tr>
<td>Vit1</td>
<td>14539</td>
<td>246/110</td>
<td>396</td>
</tr>
<tr>
<td></td>
<td></td>
<td>5.9/3.1</td>
<td>7.2/3.5</td>
</tr>
<tr>
<td></td>
<td></td>
<td>3.5/1.2</td>
<td>4.2/1.6</td>
</tr>
<tr>
<td></td>
<td></td>
<td>12.1</td>
<td>26.0</td>
</tr>
<tr>
<td></td>
<td></td>
<td>3.9/1.5</td>
<td>7.8/2.3</td>
</tr>
<tr>
<td></td>
<td></td>
<td>17.8/6.3</td>
<td>28.3</td>
</tr>
<tr>
<td></td>
<td></td>
<td>4.2/1.6</td>
<td>5.3</td>
</tr>
<tr>
<td></td>
<td></td>
<td>14.2/5.7</td>
<td>6.4</td>
</tr>
<tr>
<td></td>
<td></td>
<td>52</td>
<td>6.6</td>
</tr>
<tr>
<td></td>
<td></td>
<td>95%</td>
<td>13.4</td>
</tr>
</tbody>
</table>
given circuit sample is independent of the OP optimizations for other samples of the same circuit, this step can be easily parallelized to achieve further speedup.

C. Pre-Silicon Body Bias Selection Framework vs. Post-Silicon ABB

The proposed pre-silicon body bias selection framework chooses statistically optimal body voltages for the full ensemble of chips, while post-silicon ABB uses measurement results for a particular manufactured chip and deterministically selects the bias voltage for each cluster of that chip. It is clear that post-silicon ABB should provide higher yields at the cost of higher testing times and costs. This section quantifies the yield loss when using the proposed pre-silicon approach compared to a post-silicon ABB with the same clustering method described in section II.

Given a clustering, the yield of post-silicon ABB is computed by first generating 1000 chip samples which are then individually tuned to minimize leakage subject to a delay constraint. The number of chips that fail to simultaneously meet the leakage and delay targets is then calculated. The yield optimization results of our pre-silicon approach and post-silicon ABB are summarized in Table II. Column 2 lists the initial pre-optimized yield of each circuit for the target constraint \{\text{Delay}<\mu_D+\sigma_D, \text{Leakage}<\mu_L\}. Columns 3-10 display the yield optimization results and yield point improvement using our pre-silicon framework and post-silicon ABB for two- and three-cluster scenarios. Although post-silicon ABB achieves slightly higher yield than the proposed pre-silicon body bias selection framework, the difference degrades for larger number of clusters and larger circuits.

D. Pre-Silicon Body Bias Selection Framework vs. Traditional Pre-Silicon Approaches

We further evaluate the efficacy of our pre-silicon framework in Table III when compared to traditional pre-silicon methods (a statistical dual-\(V_{th}\) assignment approach [8] and a yield maximization approach using gate sizing [2]). Columns 3-10 list the yield optimization results and yield point improvement for the constraint \{\text{Delay}<\mu_D+\sigma_D, \text{Leakage}<\mu_L\} using our pre-silicon framework and two traditional pre-silicon statistical optimization methods (dual-\(V_{th}\) and gate sizing [2], [8]). The proposed approach with either two or three clusters potentially doubles the original yield of 39% (the optimized yield is 70% for two clusters and 81% for three clusters on average). Meanwhile the yield improvement is limited to 3.0 point on average for the statistical dual-\(V_{th}\) approach [8] and 7.7 point on average for gate sizing [2]. This further validates the statement in section I that the proposed pre-silicon body bias selection has continuous domain design variables and hence higher flexibility than the traditional pre-silicon approaches like gate sizing or dual-\(V_{th}\).

E. Yield Analysis and Optimization

This section discusses the accuracy and optimality of the proposed pre-silicon framework, as shown in Table IV when compared to Monte Carlo simulation. Columns 2-5 list the mean and standard deviation of the delay and leakage for the initial designs. The target constraint is set to \{\text{Delay}<\mu_D+\sigma_D, \text{Leakage}<\mu_L\}. Given this constraint we compute the original yield and compare it with a Monte Carlo model using 2000 samples, shown in columns 6-7. The absolute errors of the proposed yield analysis in section III-B vary from 0.9% to 5.7%, which is due to the computation approximation in SSTA, e.g., statistical maximization operation.

The optimized yield results and the yield point improvements are shown in Columns 8-11 for 11 designs. The optimized yield almost doubles the original yield with 69% for two clusters and 80% for three clusters on average. The improvements are consistent among all the benchmarks studied. We also perform a Monte Carlo sweep (MC-sweep) to determine whether the optimized yield obtained by the proposed framework is globally optimal. MC-sweep performs Monte Carlo simulations on all possible combinations of body voltages for a three-cluster configuration. The sweep increment is set to 0.1V for the sweep space \(V_{th,1} \times V_{th,2} \times V_{th,3}\). Columns 12-13 of Table IV show the maximum yield found by MC-sweep and the relative deviation of the proposed approach with respect to MC-sweep. The maximum deviation is 6.7%, which is due to the relatively coarse grid used for sweep. Columns 14-17 summarize the runtime for the critical stages of the proposed framework (including design space exploration (explo.), clustering (clust.) and yield optimization (optim.)) as well as the total runtime. The last column lists the ratio of the total runtime to the circuit size, indicating a linear relationship. Runtime is dominated for larger circuits by the exploration stage, which can be parallelized across machines as mentioned above.

F. Implications for Physical Design

Adaptive body bias incurring physical design overheads, including generation/distribution of the body voltages and extra well spacing. There are limited numbers of clusters (2-3 in our experiments) and as a result, these overheads can be reasonably bounded. The major impact of gate clustering on placement is then the extra well spacing between adjacent cells having different biases, which is imposed by triple-well-layout rules. As stated in section II, the proposed clustering method naturally captures the spatial correlation in the feature.
TABLE IV

<table>
<thead>
<tr>
<th>CKT</th>
<th>Initial design</th>
<th>Init. yield (%)</th>
<th>Optimum yield (%)/Impro.</th>
<th>MC-sweep(%)</th>
<th>Time for critical stages(sec.)</th>
<th>Total (sec.)</th>
<th>Ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td>μP</td>
<td>σD</td>
<td>μP</td>
<td>σD</td>
<td>Prop.</td>
<td>MC</td>
<td>2 cluster</td>
<td>2 cluster</td>
</tr>
<tr>
<td>c432</td>
<td>0.71</td>
<td>0.05</td>
<td>0.97</td>
<td>0.26</td>
<td>38.4</td>
<td>43.6</td>
<td>70.5</td>
</tr>
<tr>
<td>c499</td>
<td>0.08</td>
<td>0.04</td>
<td>3.80</td>
<td>1.02</td>
<td>29.2</td>
<td>28.6</td>
<td>65.8</td>
</tr>
<tr>
<td>c880</td>
<td>0.77</td>
<td>0.05</td>
<td>1.17</td>
<td>0.32</td>
<td>38.6</td>
<td>42.6</td>
<td>68.4</td>
</tr>
<tr>
<td>c1555</td>
<td>0.89</td>
<td>0.05</td>
<td>3.57</td>
<td>0.96</td>
<td>39.3</td>
<td>43.8</td>
<td>68.4</td>
</tr>
<tr>
<td>c1908</td>
<td>1.15</td>
<td>0.07</td>
<td>2.16</td>
<td>0.58</td>
<td>39.0</td>
<td>42.1</td>
<td>68.1</td>
</tr>
<tr>
<td>c2004</td>
<td>0.72</td>
<td>0.05</td>
<td>2.06</td>
<td>0.56</td>
<td>38.7</td>
<td>43.4</td>
<td>73.3</td>
</tr>
<tr>
<td>c3540</td>
<td>1.22</td>
<td>0.07</td>
<td>3.15</td>
<td>0.84</td>
<td>38.7</td>
<td>44.5</td>
<td>65.7</td>
</tr>
<tr>
<td>c3515</td>
<td>1.12</td>
<td>0.07</td>
<td>4.14</td>
<td>1.11</td>
<td>39.2</td>
<td>42.2</td>
<td>69.4</td>
</tr>
<tr>
<td>c6288</td>
<td>3.52</td>
<td>0.21</td>
<td>14.77</td>
<td>3.91</td>
<td>38.4</td>
<td>41.9</td>
<td>63.4</td>
</tr>
<tr>
<td>c7552</td>
<td>1.28</td>
<td>0.08</td>
<td>4.14</td>
<td>1.10</td>
<td>38.8</td>
<td>39.7</td>
<td>69.8</td>
</tr>
<tr>
<td>Vit1</td>
<td>2.41</td>
<td>0.14</td>
<td>149.44</td>
<td>39.54</td>
<td>39.1</td>
<td>41.8</td>
<td>78.6</td>
</tr>
</tbody>
</table>

Average yield point improvement

30 | 41

Fig. 10. Vit1 placement with physically contiguous cluster regions by CAPO [25] (different clusters are shown with different colors): (a) two clusters; (b) three clusters

V. CONCLUSION

In this paper we presented a gate-level parametric yield optimization framework using design time body bias selection. The approach considers the power and performance constraints as well as their correlation. A feature extraction-based clustering approach is proposed that achieves speedups of 5.1× on average and up to 18× for 11 benchmark circuits compared to a recently reported clustering strategy, with leakage savings of more than 10%. In addition the framework employs a fast yield analysis calculation method and an efficient power perturbation technique for optimization and achieves 41% yield improvement on average across 11 benchmark circuits.

VI. ACKNOWLEDGEMENT

The authors gratefully acknowledge the Semiconductor Research Corporation (SRC), Semiconductor Technology Academic Research Center of Japan (STARC) and National Science Foundation (NSF) for supporting this work.

REFERENCES