Low-power designs, especially microprocessors, have received a great deal of attention recently as portable and wireless applications gain market share. Even in the highest-performance designs, power has become an issue, since the extremely high frequencies being attained (near 1 GHz) can easily lead to power dissipation in the many tens of watts. Dissipating this much power requires heat sinks, resulting in higher costs and potential reliability problems. In this section, we begin by discussing the reasons power has become a significant issue. We then outline a hierarchical approach to power modeling that was first presented in [1]. This approach isolates the different components of power consumption on a chip and then attempts to model them individually.
In high-performance ASICs and microprocessors, there are several key reasons why power dissipation is rising. First, the larger number of devices and wires integrated on a larger chip increases the total capacitance of a design. Second, the drive for higher performance leads to increasing clock frequencies, and dynamic power is directly proportional to the rate at which capacitances are charged (in other words, the clock frequency). A third reason that may lead to higher power consumption is the more efficient use of gates. Dynamic power is not consumed when a gate does not switch. However, interest has risen in the circuit design area in making better use of the available gates by increasing the fraction of clock cycles in which a gate actually switches. This increased circuit activity would also lead to rising power consumption.
Dynamic power is the largest component of the total chip power consumption
(the other components are short-circuit power and leakage power).
It occurs as a result of charging capacitive loads at the output of gates.
These capacitive loads are in the form of wiring capacitance, junction
capacitance, and the input (gate) capacitance of the fan-out gates.
The expression for dynamic power is:
We examine five key components of dynamic power consumption [1]:
Standard Cell Logic and Local Wiring
The calculation of this component of power differs from that presented
in [1]. We assume a hierarchical layout style like that discussed
in [4], where the designer uses blocks of 50,000 to 100,000 gates to limit
the impact of interconnect while maintaining reasonable design times.
A typical design will look something like that shown in the figure below.
A collection of these blocks will make up the logic portion of the design,
with additional macro blocks for on-chip memory, etc. In our power
analysis, we will look carefully at one of these blocks, or modules, that
constitute a building block of the overall design.
The first issue is to determine the size of a block. The user specifies the size in gates; the default value is 50,000 gates in any technology. Values larger than 100,000 gates lead to power penalties, while small blocks (fewer than 25,000 to 50,000 gates) yield infeasible design times [4]. In addition, blocks of about 50,000 to 100,000 gates offer significant design flexibility in that they can contain many important building blocks of larger designs (e.g. DSP or microprocessor cores). Once the number of gates per module is determined, the process packing density is used to find the area of each block. The blocks are assumed to be square. Due to the nature of standard cell logic, the size of a gate is determined by the lower-level contacted metal pitches [5,6,7]. As discussed elsewhere in the BACPAC documentation, a 2-input NAND cell is typically 4 metal pitches wide by 16 metal pitches high [6]. The bottom two metal levels typically have similar contacted pitches for maximum density, so this area becomes 64 * MP^{2}, where MP is the contacted metal pitch of the bottom layers. Using this 2-input NAND cell (4 transistors) as the basis for standard cell logic design, we find that the ideal packing density is 1 / 16MP^{2}, where packing density is expressed in transistors per unit area. Place-and-route tools cannot achieve this ideal packing density, so a further factor is needed to account for white space in a design [8]. This silicon efficiency factor, s_{e}, has a default value of 0.5 but could be higher in full-custom designs. The process packing density therefore becomes s_{e} / 16MP^{2}; in a 0.25 μm process with a contacted metal pitch of 0.7 μm, this is equal to 6.4 x 10^{6} transistors / cm^{2}. The 1997 National Technology Roadmap for Semiconductors cites an expected packing density of 8 x 10^{6} transistors / cm^{2} at this technology node. This value corresponds to a silicon efficiency factor of 0.625, rather than 0.5.
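The packing density arithmetic above can be checked with a short sketch. The function and constant names are illustrative and not part of BACPAC; the only inputs are the 4 x 16 pitch NAND footprint, the 0.7 μm contacted pitch, and the silicon efficiency factor quoted in the text.

```python
# Sketch: packing density from a 2-input NAND footprint of 4 x 16 metal pitches.
# Names are hypothetical; numeric inputs are taken from the paragraph above.
TRANSISTORS_PER_NAND2 = 4
NAND2_AREA_IN_PITCHES = 4 * 16  # 64 MP^2


def packing_density_per_cm2(metal_pitch_um, silicon_efficiency):
    """Return transistors per cm^2, i.e. s_e / (16 * MP^2)."""
    pitch_cm = metal_pitch_um * 1e-4  # 1 um = 1e-4 cm
    cell_area_cm2 = NAND2_AREA_IN_PITCHES * pitch_cm ** 2
    return silicon_efficiency * TRANSISTORS_PER_NAND2 / cell_area_cm2


def implied_silicon_efficiency(density_per_cm2, metal_pitch_um):
    """Return the s_e that would produce the given density at this pitch."""
    return density_per_cm2 / packing_density_per_cm2(metal_pitch_um, 1.0)


density = packing_density_per_cm2(0.7, 0.5)
ntrs_efficiency = implied_silicon_efficiency(8e6, 0.7)
print(f"density: {density:.3g} transistors/cm^2")
print(f"s_e implied by 8e6 transistors/cm^2: {ntrs_efficiency:.3f}")
```

Under these assumptions the density comes out close to the quoted 6.4 x 10^{6} transistors / cm^{2}, and the roadmap figure corresponds to an efficiency of roughly 0.63.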
So, the area of a single block is:
Interconnect Capacitance
To calculate the interconnect capacitance, we determine the wiring requirements
within the module and use the resultant total length with the capacitance
per unit length. The wiring requirements of a block are found by
extrapolating the critical path model to the entire module. Average
wirelengths that were determined for local routing are used in conjunction
with the expected number of nets in the block to determine total wirelength.
The entire process is detailed below.
Total switching device capacitance consists of gate oxide capacitance, overlap capacitance, and junction capacitance. In addition, we consider the impact of internal nodes of a complex logic gate. For example, the junction capacitance of the series-connected NMOS transistors in a NAND gate contributes to the total switching capacitance although it does not appear at the output node. For each gate in the module, we will determine the total device-related capacitance. Then, the module device capacitance simply becomes N_{gates} * C_{gate}.
From the delay analysis section, BACPAC has determined the optimal device sizing in local routing. Calculations are made both with and without noise, so that comparisons can be made in terms of power vs. delay tradeoffs. All logic gates are assumed to be the optimal size -- this may overestimate the sizing in some non-critical cases but will also underestimate some device sizing when longer wires are present in certain paths. Overall, we estimate that this assumption will be slightly pessimistic, leading to somewhat higher power numbers.
We look at each gate individually and calculate the input capacitance,
as well as the junction capacitance. Since we are already calculating
the input capacitance of each gate, we do not need to look at the fan-out
capacitance, as this would be redundant. Capacitances are calculated
as before:
Sample calculations
For a 0.25 μm microprocessor design, with a design hierarchy consisting of fifteen 50,000-gate modules, we wish to calculate the power consumption due to random logic and local / intermediate interconnections. We estimate an average W/L value of 15, and the contacted metal pitch in layers 1 and 2 is 0.7 μm.
We have previously found the area of a 50,000 gate block in this technology to be 2.94 mm^{2} with a silicon efficiency of 50%. We now calculate the total wiring capacitance using estimated values for average wirelength and fan-out distribution.
Fan-out distribution:
60% fan-out = 1, L_{avg} = 75 μm
20% fan-out = 2, L_{avg} = 105 μm
13% fan-out = 3, L_{avg} = 135 μm
7% fan-out = 4, L_{avg} = 165 μm
FO_{avg} ~ 1.7
We find N_{nets} = 55,556 and calculate the total wiring requirements according to (3). The total wirelength required by the above fan-out distribution and wirelengths is 5.29 meters. Adding 10% overhead for metal 3 routing, we get a total length of 5.82 m. As a check, we determine the estimated available signal wiring within the module area on metals 1 and 2. Estimating that 30% of metal 1 and 20% of metal 2 are unavailable for signal routing due to V_{dd} and via blockage, and assuming a wiring efficiency of 50%, we find that metals 1 and 2 offer 5.36 meters of wiring resources. This is sufficient for the amount of wiring calculated (5.29 m). Assuming that metals 1, 2, and 3 have roughly similar capacitances per unit length (a good assumption at small linewidths), we can easily find the total wiring capacitance. For a nominal capacitance per unit length of 2 pF / cm, we obtain C_{wire} = 1.164 nF.
Device capacitances are calculated directly from (4) through (6), then multiplied by 50,000 to get a value of C_{device} = 1.338 nF. Total capacitance within the module is 2.5 nF. At 500 MHz and a switching activity factor of 0.15, the power consumption for a single 50,000-gate block is then 1.17 W. For the entire design, random logic and local and intermediate interconnections contribute 17.55 W of power. This can be roughly compared to the Alpha 21264, which is a 0.35 μm design of similar size -- the logic portions of this chip consume about 35 W, or 50% of the total chip power [10].
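The sample numbers above can be approximately reproduced with a short sketch. It makes two assumptions that the text does not state explicitly: the total wirelength is taken as N_{nets} times the fan-out-weighted average length (one plausible reading of (3)), and dynamic power is taken as activity * C * V_{dd}^{2} * f with V_{dd} = 2.5 V (the supply implied later by the "500 mV, 20% of the supply voltage" remark). Variable names are illustrative only.

```python
# Sketch of the local logic/wiring power example. The wirelength formula and
# V_dd = 2.5 V are assumptions; all other numbers come from the text.
FANOUT_DISTRIBUTION = [  # (fraction of nets, average length in um)
    (0.60, 75.0),
    (0.20, 105.0),
    (0.13, 135.0),
    (0.07, 165.0),
]
N_NETS = 55_556
METAL3_OVERHEAD = 0.10
CAP_PER_CM_PF = 2.0
C_DEVICE_NF = 1.338
ACTIVITY = 0.15
V_DD = 2.5           # assumed supply voltage, volts
F_CLK = 500e6        # hertz
N_MODULES = 15

avg_len_um = sum(frac * length for frac, length in FANOUT_DISTRIBUTION)
local_len_m = N_NETS * avg_len_um * 1e-6
total_len_m = local_len_m * (1.0 + METAL3_OVERHEAD)

# metres -> centimetres, then pF -> nF
c_wire_nf = total_len_m * 100.0 * CAP_PER_CM_PF / 1000.0
c_total_f = (c_wire_nf + C_DEVICE_NF) * 1e-9

p_block = ACTIVITY * c_total_f * V_DD ** 2 * F_CLK
p_logic = N_MODULES * p_block

print(f"wirelength: {local_len_m:.2f} m, with metal 3: {total_len_m:.2f} m")
print(f"C_wire: {c_wire_nf:.3f} nF, block power: {p_block:.2f} W")
print(f"logic + local wiring power: {p_logic:.2f} W")
```

The small differences from the quoted 5.29 m, 1.164 nF, and 17.55 W come from rounding at intermediate steps in the text.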
Global Interconnect
Global interconnects are defined as those wires which run between modules. Although the number of global wires is considerably smaller than local wires, their total length can be quite large. Therefore, we consider their impact on the total chip power consumption. Previously in BACPAC, we have calculated the length of an average global wire by applying Rent's rule to the global hierarchy. A brief review of this process follows. Given the design module size and the total expected number of logic transistors, the total number of modules is found. Each module is regarded as a gate in the nomenclature of Rent's rule. The global Rent's exponent is assumed to be comparable to that of the local Rent's exponent. At this point, Donath's model is applied to the global system where each block is a gate and the gate pitch is set by the relationship between logic area and number of blocks. A point-to-point average global wirelength is found, and this is scaled using an empirical formulation first suggested by IBM [8] to calculate the average global wirelength for the entire design, L_{g_avg}. It should be noted that only the logic area is used in these calculations; the memory area is not considered to impact the global wirelength.
Once a typical global interconnect is found, we seek to determine the
number of global wires in a design. Once this has been calculated,
the total global wirelength can be determined and power dissipation calculated.
Again, we apply Rent's rule to the global level of the design hierarchy.
We are aiming to calculate the number of pins or external connections for
each module. Rent's rule was initially formulated to calculate the
same thing, except at the chip-level. If we assume each 50,000 gate
module (or similar size block) can be viewed as a chip, then by determining
the number of pins on each block, we can determine the number of inter-modular
wires or nets in the design. The figure below will help to visualize
this step. According to Rent's rule, we have:
We are interested in the number of global nets rather than pins.
Converting from pins to nets requires us to divide by the average number
of pins in each net. This is equal to the average fan-out plus 1,
where the 1 accounts for the output of the originating gate. Thus,
the total number of global nets can be defined as:
Now that the number of global nets has been determined, we need to take
into account the impact of repeater capacitance. Since repeaters
are typically much larger than the average logic gate, their intrinsic
device capacitance can be quite large. We approximate the total number
of drivers for global wires by dividing the total global wirelength by
the maximum distance a wire can be run before buffering becomes advantageous.
The device capacitance (input + junction) is calculated for the optimal
driver size, which is normally determined for the top metal layers according
to [13]:
The total global interconnect power consumption is calculated as:
Sample calculations
We will re-examine the same 0.25 μm design
as above. We will assume a value for L_{g_avg} of 7 mm and
an average fan-out of 2.5 for global nets. In this case, we calculate
the number of external pins for each 50,000 gate module in the logic portion
of the processor. For a typical microprocessor Rent's exponent of
0.5, we find that each block has 783 external connections. Converting
this value to the total number of global nets results in a value of 3354
nets. Now, the total global wirelength is 7 mm * 3354 nets = 23.5
meters. We find the number of buffers required as 23.5 m / 5.65 mm
= 4106 buffers. The buffer size for an upper metal level with a pitch
of 4 μm, a thickness of 2.5 μm,
and copper wiring is W_{n} = 93.5 μm.
With an activity factor of 0.15, we obtain:
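The pin and net counts in this example can be reproduced with a minimal sketch, assuming Rent's rule in the form pins = K * N^{p} with K equal to the average fan-out plus 1 (the "average number of pins per gate" used elsewhere in this section). The function name is hypothetical, and the sketch covers only the net count and total wirelength, not the repeater sizing.

```python
# Sketch: global net count from Rent's rule applied per module.
# Assumes pins = K * N**p with K = average fan-out + 1.
def rent_pins(pins_per_gate, gates, rent_exponent):
    """Return the expected number of external pins for a block of gates."""
    return pins_per_gate * gates ** rent_exponent


AVG_GLOBAL_FANOUT = 2.5
GATES_PER_MODULE = 50_000
N_MODULES = 15
RENT_EXPONENT = 0.5
L_G_AVG_M = 7e-3  # average global wirelength, 7 mm

pins_per_module = rent_pins(AVG_GLOBAL_FANOUT + 1, GATES_PER_MODULE, RENT_EXPONENT)

# Each net accounts for (fan-out + 1) pins, so divide the pin total by that.
global_nets = N_MODULES * pins_per_module / (AVG_GLOBAL_FANOUT + 1)
total_global_len_m = round(global_nets) * L_G_AVG_M

print(f"pins per module: {pins_per_module:.0f}")
print(f"global nets: {global_nets:.0f}")
print(f"total global wirelength: {total_global_len_m:.1f} m")
```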
Clock Distribution
The clock network in a high-performance design typically consumes a large amount of power. This is especially true in modern microprocessors, where the clock can consume 20 to 40% of the total power. ASICs normally have less stringent requirements on issues such as clock skew, leading to smaller amounts of clock power. Nonetheless, the clock distribution network should be considered in all discussions of power modeling. In [1] the authors consider one type of clock distribution network: the H-tree. There are several drawbacks to this form of clock distribution, including the large buffering system required to drive the network at the root. With such a large buffering system, the likelihood of severe inductive ringing increases, as the output resistance of the driver network will be significantly less than the characteristic impedance of the H-tree. In this manner, the simple H-tree can be seen to be a non-scalable system for future designs [14]. Instead of the H-tree, in BACPAC we base our clock distribution models on the buffered H-tree, which has a key advantage: since the wiring network is periodically buffered, much smaller linewidths can be used without sacrificing performance. This allows for less congestion on global metal layers, as well as smaller capacitances. By properly sizing the buffers and the wires, very low skew can be achieved. However, the impact of process variation (specifically in the buffer's effective channel length) is important, as the clock skew is directly proportional to the degree of process control achieved. We now examine a generic buffered H-tree design and look at its contribution to power consumption.
A buffered H-tree with 4 clusters is shown in the figure below.
The cluster size is determined by the maximum amount of skew allowable
in the design. Typically, this is about 10% of the clock cycle.
In very high-performance microprocessors, controlling clock skew is vital
and designers have been able to develop networks that result in clock skew
that is less than 5% of the clock period. However, in ASICs the
clock network is not as significant and clock skew may even exceed 10%
of the clock cycle time. In BACPAC, the designer can specify a target
for skew in terms of clock period. Once this is done, the largest
possible cluster size is determined by finding the delay from a latch in
the center of the cluster to one located in the corner. The total
distance that must be traversed in this instance is L, where L is the side-length
of a cluster. We assume that the clock network is implemented in
the top-level metal and, following common practice to limit process variation,
an underlying shield layer is used to minimize dielectric thickness variation
across the chip. In terms of wireability then, we concentrate on
the top level metal and then duplicate that on the previous layer.
In order to find the maximum L that meets the T_{skew} requirement,
we use Sakurai's delay expression [15] to find the delay between a point
just at the output of the buffer and the point in the corner of the cluster.
In this case, we expect the dominant source of delay to be the charging
of latch capacitances along the wire resistance. Since these latches
will not be lumped at the end of the line, but distributed along the length
L of the wire, we model this RC component as distributed, which differs
from Sakurai's initial formulation. In addition, the input-slope
dependent delay term is eliminated since it is not line-length dependent.
Finally, we institute a rudimentary form of process variation; we set worst-case
values for wiring resistance and capacitance as well as device resistance.
These worst-case values are expected to be 10% over nominal. We do
not assume that all destination capacitances have worst-case values since
this capacitance is distributed over a large number of sinks, as opposed
to a single wire or driver. The likelihood of all sinks having worst-case
input capacitances is negligible. Our expression for T_{skew}
then becomes:
The total load capacitance is equal to the number of latches multiplied
by the input capacitance of each latch. A typical latch has 4 clocked
elements -- 2 NMOS and 2 PMOS devices. In this way, a minimum-sized
latch has an input capacitance equal to twice that of a minimum-sized inverter.
Furthermore, we assume that latches are sized equivalently to the logic
gates in the design to allow for sufficient speed. So, C_{latches}
in (14) becomes:
The above figure shows the evolution of the clock signal as it propagates
from the driver to the destination capacitances (latches). Line length
is 6.8 mm and total latch capacitance is 3.82 pF. Clock skew is ~
100 ps.
Once the maximum cluster size is determined, the chip itself must be
divided into a fixed number of such clusters. A symmetric H-tree
requires 2^{n} clusters where n is an even integer. Most
ASIC designs, with less stringent clocking requirements, may require only
4 clusters while microprocessors are typically divided into sixteen.
Designs requiring more than 16 clusters should be re-considered as the
amount of wiring and buffer capacitance will be quite large. The
number of clusters is then determined by dividing the total chip area by
the cluster area (L^{2}) and then rounding up to the next feasible
value (e.g. 4 or 16). For the remainder of this discussion we will
refer to an example design, where there are 16 clusters. Given that
the phase-locked loop (PLL) which generates the clock signal is typically
at the periphery of the chip, the clock network will require n + 1 levels
of hierarchy. Also, the number of buffers required will be 2^{n+1},
or simply 2 * N_{clusters}. In our example case, we have
now determined that we will need 32 buffers and each will have the optimal
driver size for the top level metallization. Additionally, we wish
to find the amount of wiring needed for this network. Since the routing
structure is highly regular, it is straightforward to determine the total
wiring in the n + 1 levels. However, we should also account for the
wiring after the last global driver, since this will be the majority of
the clock wiring. The total wiring in the network prior to the "within-cluster"
routing can be calculated by:
Therefore, the total "within-cluster" routing length is approximated by:
The total output load consists of more than latches -- the memory section of the design (if there is one) will contain a large number of transistors that need to be clocked, including pre-charge gates for the bit/bit_bar lines and bit line control circuitry. According to [1], we approximate the capacitance of clocked transistors in the memory arrays by:
Now, the total clock network capacitance can be calculated. It
is given by:
For our generic 0.25 μm high-performance microprocessor, we will estimate the power consumption of a buffered H-tree. For a chip area of 250 mm^{2}, logic depth of 14, and a total of 7 million transistors in the memory we estimate:
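A minimal sketch of the cluster and buffer counting rule described earlier in this section, assuming the number of clusters is rounded up to the next value of 2^{n} with n even and that the buffer count is 2 * N_{clusters}. The function names and the 4.5 mm cluster side-length are hypothetical and chosen only for illustration; the actual side-length follows from the skew constraint.

```python
# Sketch: number of H-tree clusters and buffers for a given maximum cluster size.
# The cluster side-length used below is a hypothetical illustration value.
def feasible_cluster_count(chip_area_mm2, cluster_side_mm):
    """Round chip_area / L^2 up to the next 2**n with n even (4, 16, 64, ...)."""
    needed = chip_area_mm2 / cluster_side_mm ** 2
    n = 2
    while 2 ** n < needed:
        n += 2
    return 2 ** n


def buffer_count(clusters):
    """Return the number of clock buffers, 2 * N_clusters."""
    return 2 * clusters


clusters = feasible_cluster_count(250.0, 4.5)
buffers = buffer_count(clusters)
print(f"clusters: {clusters}, buffers: {buffers}")
```

With a 250 mm^{2} chip and the assumed 4.5 mm side-length, about 12.3 clusters are needed, which rounds up to 16 clusters and 32 buffers, matching the example design in the text.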
I/O Drivers
The power modeling of I/O drivers is relatively straightforward. It consists of first determining the number of signal pads which are in a design. Each of these pads has a large capacitance associated with it that corresponds to the off-chip capacitance connected to the pad. This value is relatively constant for a given packaging technology. Once this value is determined, the pad can be optimally driven by a cascaded chain of repeaters where each stage gets larger by a factor of f_{0}. This is the approach discussed first in [1] and subsequently in [11]. We make several important enhancements to these previous approaches. The first is to develop a more accurate model for the total driver chain capacitance which accounts for the intrinsic output capacitance of the buffers. In addition, we demonstrate that a pre-factor of 0.5 is not required as reported in [11] to account for the inverting nature of a chain of repeaters.
From data collected from numerous semiconductor vendors, the typical off-chip pad capacitance that needs to be driven is approximately 15 pF. This value can vary according to the user's process specifications. In our discussion, this off-chip capacitance will be referred to as the load capacitance, C_{L}. According to [17], the number of repeaters in a chain can be determined from two ratios. The first is the ratio of C_{L} to the input capacitance of a minimum-sized inverter in the given process technology; this ratio is called Y. The second is the ratio of intrinsic device output capacitance to input capacitance, C_{j} / C_{in}, denoted by g. Once we know C_{L}, we can determine both these ratios. At this point, the optimal tapering factor should be determined recursively by using (24):
So, the total power due to I/O drivers is calculated as:
Sample calculations
For the same 0.25 μm microprocessor design
as we have considered throughout this section, we calculate the number
of signal pads using a Rent's exponent of 0.35 and an average fan-out of
2.3 (this gives the value of K, the average number of pins per gate) as
376. If we assume an off-chip capacitance of 15 pF, g
of 0.43 (from HSPICE), and V_{dd_I/O} of 3.3 V, we obtain the total
I/O power consumption of 7.4 W. This value assumes a switching activity
factor of 0.15. The intrinsic buffer capacitances contribute 2.79
W to this value. In some instances, designers use this value as the
I/O component of the on-chip power dissipation as the larger pad capacitance
is technically off-chip.
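The I/O numbers above can be cross-checked with a short sketch. It assumes Rent's rule in the form pads = K * N^{p} with K = fan-out + 1, a total of 15 x 50,000 gates, the 500 MHz clock from the earlier sample calculation, and pad power of activity * C_{L} * V^{2} * f per pad with no 0.5 pre-factor. The intrinsic buffer contribution of 2.79 W is taken directly from the text rather than derived. Names are illustrative only.

```python
# Sketch: signal pad count and I/O power for the sample design.
# Gate count, clock frequency, and the power formula are assumptions noted above.
def rent_pins(pins_per_gate, gates, rent_exponent):
    """Return the expected number of external pins for a block of gates."""
    return pins_per_gate * gates ** rent_exponent


AVG_FANOUT = 2.3
TOTAL_GATES = 15 * 50_000
RENT_EXPONENT = 0.35
C_LOAD_F = 15e-12          # off-chip pad capacitance
V_DD_IO = 3.3              # volts
F_CLK = 500e6              # hertz, assumed from the earlier example
ACTIVITY = 0.15
INTRINSIC_BUFFER_W = 2.79  # quoted in the text, not derived here

signal_pads = round(rent_pins(AVG_FANOUT + 1, TOTAL_GATES, RENT_EXPONENT))
pad_power_w = ACTIVITY * signal_pads * C_LOAD_F * V_DD_IO ** 2 * F_CLK
total_io_power_w = pad_power_w + INTRINSIC_BUFFER_W

print(f"signal pads: {signal_pads}")
print(f"pad load power: {pad_power_w:.2f} W")
print(f"total I/O power: {total_io_power_w:.2f} W")
```

Under these assumptions the pad loads account for about 4.6 W, which together with the 2.79 W of intrinsic buffer capacitance gives the quoted 7.4 W.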
On-chip Memory Power
On-chip memory is typically arranged as an array of 6T SRAM cells with 2^{k} columns and 2^{n} rows. Therefore, the chip has a total of 2^{n+k} bits of memory or 2^{n+k} / 8 bytes. Modern microprocessors may contain a 32 kB instruction cache as well as a 32 kB data cache. ASIC designs may have somewhat smaller amounts of memory, such as 4 kB blocks of SRAM. In addition, microprocessors may contain larger level-2 caches that are an additional level of memory between the chip and the main memory. These blocks are usually very large, e.g. currently 128 kB to 1.5 MB. In addition to the memory array structure, there is supporting circuitry that allows the memory to function correctly. This circuitry contains address buffers, row decoders, column selectors, sense amplifiers, and other control logic. Of the total number of transistors associated with on-chip memory, usually about 80% of these are located in the cache itself (6T SRAM cells) and the other 20% make up this control logic. We now proceed with determining the power requirements of an on-chip memory array.
Read operations
During a read operation, several things occur.
The total capacitances for read and write operations are given as:
As discussed in [1], the dominant part of the on-chip memory consumption is due to the charging of the capacitances described above. Power consumption in the supporting circuitry is typically smaller than the cell-based power. Therefore, we use (29) and (30) to find the memory power consumption. We take the average of the read and write capacitances and then assume a switching activity of 1, as the caches are accessed nearly every cycle. While microprocessor speed is not set by memory constraints, the efficiency of the chip is usually limited by the rate at which memory can be accessed. This implies that the level 1 and 2 caches will be busy accessing cells nearly all the time. In addition, it has been reported in the literature that the instruction cache in particular runs every cycle and the data cache is not far behind [16]. Therefore, a switching activity of 1 is a reasonable assumption.
Then, the on-chip memory power consumption is given by:
We assume our 0.25 μm microprocessor contains two level 1 caches, each of size 64 kB. We calculate L_{cell} to be 3 μm. From our previous empirical estimation of k (equation (20)), we approximate a 128 kB array as having 2^{11} columns and 2^{9} rows. The input capacitance of a minimum-sized NMOS device is 0.31 fF and drain capacitance is 0.5 fF. For a voltage swing of 500 mV (20% of the supply voltage for noise immunity), we obtain a read capacitance of 235.2 pF and a write capacitance of 238.1 pF. The average is used in (32):
This value is small, which is expected, as memory cache arrays do
not consume a large amount of power despite their high transistor count.
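As a small consistency check on the array dimensions and the averaged capacitance used in this example, the following sketch applies the 2^{n+k} / 8 bytes relation from the start of this section. The function name is hypothetical, and the sketch does not attempt to evaluate (32) itself.

```python
# Sketch: memory array capacity from its column/row exponents, plus the
# read/write capacitance average quoted in the text.
def array_capacity_bytes(column_bits, row_bits):
    """Return capacity in bytes for 2**column_bits columns x 2**row_bits rows."""
    return 2 ** (column_bits + row_bits) // 8


capacity_kb = array_capacity_bytes(11, 9) // 1024

C_READ_PF = 235.2
C_WRITE_PF = 238.1
c_avg_pf = (C_READ_PF + C_WRITE_PF) / 2

print(f"array capacity: {capacity_kb} kB")
print(f"average switched capacitance: {c_avg_pf:.2f} pF")
```

The 2^{11} x 2^{9} array holds 2^{20} bits, which is the 128 kB total of the two 64 kB caches.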
Appendix: Area-optimal Repeater Sizing
By multiplying a weighted area function (W^{1/3}) by the delay,
we have developed a new objective function to minimize. Since optimization
of delay alone usually results in overly large buffers with high power
requirements, we have used the weighted area function to determine an area-optimal
buffer size. The product of T_{d} and W^{1/3} is
then differentiated and the minimum value is found at:
References
[1] D. Liu and C. Svensson, "Power consumption estimation in CMOS VLSI
chips," IEEE Journal of Solid-State Circuits, vol. 29, pp. 663-670,
June 1994.
[2] A.P. Chandrakasan and R.W. Brodersen, "Minimizing power consumption
in digital CMOS circuits," Proc. of the IEEE, vol. 83, pp. 498-523,
April 1995.
[3] G. Gerosa, et al., "250 MHz 5-W PowerPC microprocessor," IEEE
Journal of Solid-State Circuits, vol. 32, pp. 1635-1649, Nov. 1997.
[4] D. Sylvester and K. Keutzer, "Getting to the bottom of deep submicron,"
Proc. of International Conference on Computer-Aided Design, in press, 1998.
[5] T.R. Bednar, R.A. Piro, D.W. Stout, L. Wissel, and P.S. Zuchowski,
"Technology-migratable ASIC library design," IBM Journal of Research
and Development, vol. 40, pp. 377-385, July 1996.
[6] IBM SA-12 ASIC family databook, revision 5, June 1998.
[7] R. Payne, "Metal pitch effects in deep submicron IC design," Electronic
Engineering, pp. 45-47, July 1996.
[8] G.A. Sai-Halasz, "Performance trends in high-end processors," Proc.
of the IEEE, vol. 83, pp. 20-36, Jan. 1995.
[9] W.E. Donath, "Placement and average interconnection lengths of
computer logic," IEEE Transactions on Circuits and Systems, vol.
26, pp. 272-277, April 1979.
[10] M.K. Gowan, L.L. Biro, D.B. Jackson, "Power considerations in
the design of the Alpha 21264 microprocessor," Proc. of Design Automation
Conference, pp. 726-731, 1998.
[11] B.M. Geuskens, "Modeling the influence of multilevel interconnect
on chip performance," Ph.D. thesis, Rensselaer Polytechnic Institute, 1997.
[12] N. Vasseghi, K. Yeager, E. Sarto, and M. Seddighnezhad, "200-MHz
Superscalar RISC Microprocessor," IEEE Journal of Solid-State Circuits,
vol. 31, pp. 1675-1685, Nov. 1996.
[13] H.B. Bakoglu and J.D. Meindl, "Optimal interconnection circuits
for VLSI," IEEE Transactions on Electron Devices, vol. 32, pp. 903-909,
May 1985.
[14] A. Vittal and M. Marek-Sadowska, "Low-power buffered clock tree
design," IEEE Transactions on Computer-Aided Design, vol. 16, pp.
965-975, Sept. 1997.
[15] T. Sakurai, "Closed-form expressions for interconnection delay,
coupling, and crosstalk in VLSI's," IEEE Transactions on Electron Devices,
vol. 40, pp. 118-124, Jan. 1993.
[16] J. Montanaro, et al., "A 160-MHz, 32-b, 0.5W CMOS RISC microprocessor,"
Digital Technical Journal, vol. 9, pp. 49-60, Jan. 1997.
[17] N. Hedenstierna and K.O. Jeppson, "CMOS circuit speed and buffer
optimization," IEEE Transactions on Computer-Aided Design, vol.
6, pp. 270-281, March 1987.