BACPAC - Berkeley Advanced Chip Performance Calculator

Dynamic Power Dissipation


Low-power designs, especially microprocessors, have received a great deal of attention recently as portable and wireless applications gain market share.  Even in the highest-performance designs, power has become an issue, since the extremely high frequencies being attained (near 1 GHz) can easily lead to power dissipation in the many tens of watts.  Dissipating this much power requires heat sinks, resulting in higher costs and potential reliability problems.  In this section, we begin by discussing the reasons power has become a significant issue.  We then outline a hierarchical approach to power modeling that was first presented in [1].  This approach isolates the different components of power consumption on a chip and then attempts to model them individually.

In high-performance ASIC's and microprocessors, there are several key reasons why power dissipation is rising.  First, the presence of larger numbers of devices and wires integrated on a larger chip results in an overall increase in the total capacitance found on a design.  Second, the drive for higher performance leads to increasing clock frequencies and dynamic power is directly proportional to the rate of charging capacitances (in other words, the clock frequency).  A third reason that may lead to higher power consumption is the more efficient use of gates.  Dynamic power is not consumed when a gate does not switch.  However, interest has risen in the circuit design area to make better use of the available gates by increasing the ratio of clock cycles that a gate actually switches.  This increased circuit activity would also lead to rising power consumption.

Dynamic power is the largest component of the total chip power consumption (the other components are short-circuit power and leakage power).  It occurs as a result of charging capacitive loads at the output of gates.  These capacitive loads are in the form of wiring capacitance, junction capacitance, and the input (gate) capacitance of the fan-out gates.  The expression for dynamic power is:

P = a * C * Vdd^2 * f     (1)
In (1), C denotes the capacitance being charged/discharged, Vdd is the supply voltage, f is the frequency of operation, and a is the switching activity factor.  This expression assumes that the output load experiences a full voltage swing of Vdd.  If this is not the case, and there are circuits that take advantage of this fact, (1) becomes proportional to Vdd * Vswing.  A brief discussion of the switching factor a is in order at this point.  The switching factor is defined in this model as the probability of a gate experiencing an output low-to-high transition in an arbitrary clock cycle.  For instance, a clock buffer sees both a low-to-high and a high-to-low transition in each clock cycle.  Therefore, a for a clock signal is 1, as there is unity probability that the buffer will have an energy-consuming transition in a given cycle.  Fortunately, most circuits have activity factors much smaller than 1.  Some typical values for logic might be about 0.5 for datapath logic and 0.03 to 0.05 for control logic.  In most instances we will use a default value of 0.15 for a, which is in keeping with values reported in the literature for static CMOS designs [1,2,3].  Notable exceptions to this assumption will be in cache memories, where read/write operations take place nearly every cycle, and clock-related circuits.
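As a rough illustration of how the activity factor and a reduced voltage swing scale the result, here is a minimal Python sketch.  It assumes the standard form P = a * C * Vdd * Vswing * f (with Vswing = Vdd for a full swing), which is consistent with the description above.  The function name and the 1 nF / 2.5 V / 500 MHz inputs are hypothetical illustration values, not BACPAC defaults.

```python
def dynamic_power(c_farads, vdd, freq_hz, activity, v_swing=None):
    """Dynamic power a * C * Vdd * Vswing * f, in watts.

    v_swing defaults to vdd, i.e. a full rail-to-rail output swing.
    """
    if v_swing is None:
        v_swing = vdd
    return activity * c_farads * vdd * v_swing * freq_hz


# Hypothetical load: 1 nF switched at 2.5 V and 500 MHz.
logic_w = dynamic_power(1e-9, 2.5, 500e6, activity=0.15)      # default logic activity
clock_w = dynamic_power(1e-9, 2.5, 500e6, activity=1.0)       # clock-like node
half_swing_w = dynamic_power(1e-9, 2.5, 500e6, activity=0.15, v_swing=1.25)

print(f"logic: {logic_w:.3f} W, clock: {clock_w:.3f} W, half swing: {half_swing_w:.3f} W")
```

The same capacitance costs roughly 6.7 times more power on a clock node (a = 1) than on average logic (a = 0.15), which is why clock-related circuits are treated separately later.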

We examine five key components of dynamic power consumption [1]:

• Standard cell logic and local wiring
• Global interconnect (mainly busses, inter-modular routing, and other control)
• Global clock distribution (drivers + interconnect + latches)
• I/O’s (drivers + off-chip capacitive loads)
• Memory (on-chip caches)
We now look at each component in detail.

Standard Cell Logic and Local Wiring

The calculation of this component of power differs from that presented in [1].  We assume a hierarchical layout style like that discussed in [4], where the designer uses blocks of 50,000 to 100,000 gates to limit the impact of interconnect while maintaining reasonable design times.  A typical design will look something like that shown in the figure below.  A collection of these blocks will make up the logic portion of the design, with additional macro blocks for on-chip memory, etc.  In our power analysis, we will look carefully at one of these blocks, or modules, that constitute a building block of the overall design.

The first issue is to determine the size of a block.  The user will specify the size in terms of gates; the default value is 50,000 gates in any technology.  Values larger than 100,000 lead to power penalties while small blocks (< 25,000 to 50,000) will yield infeasible design times [4].  In addition, blocks of about 50,000 to 100,000 gates yield significant design flexibility in that they can contain many important building blocks of larger designs (e.g. DSP or microprocessor cores).  Once the number of gates per module is determined, the process packing density is used to find the area of each block.  The blocks are assumed to be square.  Due to the nature of standard cell logic, the size of a gate is determined by lower level contacted metal pitches [5,6,7].  As discussed elsewhere in the BACPAC documentation, the area of a 2-input NAND cell is typically 4 metal pitches in width by 16 metal pitches in height [6].  The bottom two metal levels typically have similar contacted pitches for maximum density, so this area becomes 64 * MP^2, where MP is the contacted metal pitch of the bottom layers.  Using this 2-input NAND cell (4 transistors) as a basis for standard cell logic design, we find that the ideal packing density is 1 / (16 * MP^2), where packing density is in terms of transistors per unit area.  However, place-and-route tools cannot achieve this ideal packing density, so an additional factor must be included to account for white-space in a design [8].  This silicon efficiency factor, se, has a default value of 0.5 but could be higher in full-custom designs.  The process packing density then becomes se / (16 * MP^2); in a 0.25 µm process with a contacted metal pitch of 0.7 µm, this is equal to 6.4 x 10^6 transistors / cm^2.  The 1997 National Technology Roadmap for Semiconductors cites an expected packing density of 8 x 10^6 transistors / cm^2 at this technology node.  This value corresponds to a silicon efficiency factor of 0.625, rather than 0.5.  So, the area of a single block is:

(2)
For example, for the same 0.25 µm process as above, a 50,000 gate module has an area of 2.94 mm^2, or a side-length of 1.71 mm.  After calculating the physical size of each block, we then proceed to estimate the capacitance within each module.  This consists of both device and interconnect capacitances and we consider each individually.  Interconnections that leave a module to connect to other modules are termed global interconnect and are considered later.
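The packing-density figures quoted before the block-area example can be checked with a short Python sketch.  It assumes only what the text states: a 4-transistor NAND2 cell of 4 by 16 contacted metal pitches, derated by the silicon efficiency factor.  The function name is hypothetical, and the sketch covers the density only, not the block area of expression (2).

```python
UM2_PER_CM2 = 1e8  # (1e4 um per cm) squared


def packing_density_per_cm2(metal_pitch_um, silicon_efficiency=0.5):
    """Transistors per cm^2 for NAND2-based standard cells: se / (16 * MP^2)."""
    cell_area_um2 = (4 * metal_pitch_um) * (16 * metal_pitch_um)  # 64 * MP^2
    ideal_per_um2 = 4 / cell_area_um2  # 4 transistors per NAND2 cell
    return silicon_efficiency * ideal_per_um2 * UM2_PER_CM2


density = packing_density_per_cm2(0.7)  # 0.25 um process, MP = 0.7 um
# Silicon efficiency implied by the roadmap figure of 8e6 transistors / cm^2.
implied_se = 8e6 / packing_density_per_cm2(0.7, silicon_efficiency=1.0)

print(f"density = {density:.3g} per cm^2, implied se = {implied_se:.3f}")
```

With MP = 0.7 µm this gives about 6.4 x 10^6 transistors / cm^2 at se = 0.5, and an implied efficiency close to the 0.625 quoted for the roadmap value.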

Interconnect Capacitance

To calculate the interconnect capacitance, we determine the wiring requirements within the module and use the resultant total length with the capacitance per unit length.  The wiring requirements of a block are found by extrapolating the critical path model to the entire module.  Average wirelengths that were determined for local routing are used in conjunction with the expected number of nets in the block to determine total wirelength.  The entire process is detailed below.

• Given the module size in number of gates and the average fan-out (determined from fan-out distribution given by the user), the total number of nets in the module is determined by: Nnets = Npins / (foavg + 1), where Npins = Ngates * (fi + 1).  For example, if we assume 2-input NAND's as the logic gate building block, we have a fan-in (fi) of 2 and a total of 3 pins per gate.  In a 50,000 gate block with an average fan-out of 1.7, there are 55,556 nets.
• From Nnets and fan-out distribution, we determine the number of nets that have each possible fan-out.  Typically, about 60% of nets have fan-out of 1.  Therefore, in the example above, there are 33,334 nets with fan-out of 1.
• Average wirelengths within these modules were calculated earlier, based on Donath's model [9].  Using these values, we calculate the internal wiring requirements of the block according to:
Ltotal = sum over all fan-outs fo of [ Nnets(fo) * Lavg(fo) ]     (3)
• With Ltotal routed on metals 1 and 2, we also account for some amount of routing on intermediate wiring levels (e.g. metal 3) due to noise or delay considerations, etc.  Thus, we allot an additional 10% of Ltotal to metal 3 for these purposes.
• Given the interconnect capacitance per unit length for each level, we can calculate the total wiring capacitance within the module.
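The net-count steps at the start of this list can be sketched in a few lines of Python.  The function name is hypothetical; the formula and the example numbers (2-input NAND gates, 50,000 gates, average fan-out 1.7, 60% of nets with fan-out 1) are taken from the bullets above.

```python
def net_count(n_gates, fan_in, avg_fanout):
    """Nnets = Npins / (foavg + 1), with Npins = Ngates * (fi + 1)."""
    n_pins = n_gates * (fan_in + 1)
    return n_pins / (avg_fanout + 1)


n_nets = round(net_count(50_000, fan_in=2, avg_fanout=1.7))
nets_fanout_1 = round(0.60 * n_nets)  # about 60% of nets drive a single gate

print(n_nets, nets_fanout_1)
```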

Device Capacitance

Total switching device capacitance consists of gate oxide capacitance, overlap capacitance, and junction capacitance.  In addition, we consider the impact of internal nodes of a complex logic gate.  For example, the junction capacitance of the series-connected NMOS transistors in a NAND gate contributes to the total switching capacitance although it does not appear at the output node.  For each gate in the module, we will determine the total device-related capacitance.  Then, the module device capacitance simply becomes Ngates * Cgate.

From the delay analysis section, BACPAC has determined the optimal device sizing in local routing.  Calculations are made for both with and without noise scenarios, so that comparisons can be made in terms of power vs. delay tradeoffs.  All logic gates are assumed to be the optimal size -- this may result in an overestimate of sizing in some non-critical cases but will also underestimate some device sizing when longer wires are present in certain paths.  Overall, we estimate that this assumption will be slightly pessimistic, leading to somewhat higher power numbers.

We look at each gate individually and calculate the input capacitance, as well as the junction capacitance.  Since we are already calculating the input capacitance of each gate, we do not need to look at the fan-out capacitance, as this would be redundant.  Capacitances are calculated as before:

(4)
(5)
(6)
Equation (4) is for a general case.  In the instance of a 2-input NAND, the total input capacitance of the gate will be 4*Cox + 3*Cov(N) + 2*Cov(P).  Recall that for a 2-input NAND, all devices are sized equally for symmetric rise/fall times.  The output capacitance is calculated with W as the effective PMOS width instead of 2*W, since the PMOS devices are laid out such that they share a common drain.  This drops the parasitic junction capacitance to that of a single device with width of W.  The internal capacitance here corresponds to the single junction capacitance between the series-connected NMOS devices.  Since the internal capacitance sees an energy-consuming transition only when the top NMOS device is on and the bottom is off, we need to multiply Cinternal by a factor of 0.5 (NAND2) to account for its effective switching frequency.  After finding the total device capacitance in an average gate, we multiply by the number of gates in a module to find the total device capacitance.

Sample calculations

For a 0.25 µm microprocessor design, with a design hierarchy consisting of fifteen 50,000-gate modules, we wish to calculate the power consumption due to random logic and local / intermediate interconnections.  We estimate an average W/L value of 15, and the contacted metal pitch in layers 1 and 2 is 0.7 µm.

We have previously found the area of a 50,000 gate block in this technology to be 2.94 mm^2 with a silicon efficiency of 50%.  We now calculate the total wiring capacitance using estimated values for average wirelength and fan-out distribution.

Fan-out distribution:
60%   fan-out = 1, Lavg = 75 µm
20%   fan-out = 2, Lavg = 105 µm
13%   fan-out = 3, Lavg = 135 µm
7%    fan-out = 4, Lavg = 165 µm

FOavg ~ 1.7

We find Nnets = 55,556 and calculate the total wiring requirements according to (3).  The total wirelength required by the above fan-out distribution and wirelengths is 5.29 meters.  Adding 10% overhead for metal 3 routing, we get a total length of 5.82 m.  As a check, we determine the estimated available signal wiring within the module area on metals 1 and 2.  Estimating that 30% of metal 1 and 20% of metal 2 is unavailable for signal routing due to Vdd and via blockage and a wiring efficiency of 50%, we find that metals 1 and 2 offer 5.36 meters of wiring resources.  This is sufficient for the amount of wiring calculated (5.29 m).  Assuming that metals 1, 2, and 3 have roughly similar capacitances per unit length (a good assumption at small linewidths), we can easily find the total wiring capacitance.  For a nominal capacitance per unit length of 2 pF / cm, we obtain Cwire = 1.164 nF.
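The wiring arithmetic above can be reproduced with a small Python sketch.  It assumes that (3) sums, over each fan-out value, the number of nets with that fan-out times their average wirelength; the names are hypothetical and the inputs are the sample values from this section.  The results differ from the quoted figures only in the last digit, presumably from rounding.

```python
# (fraction of nets, average wirelength in um) for fan-outs 1 through 4
FANOUT_BINS = [(0.60, 75.0), (0.20, 105.0), (0.13, 135.0), (0.07, 165.0)]


def total_wirelength_m(n_nets, bins):
    """Sum of (nets per fan-out bin) * (average length of that bin), in meters."""
    return sum(n_nets * frac * length_um for frac, length_um in bins) * 1e-6


l_local_m = total_wirelength_m(55_556, FANOUT_BINS)  # metals 1 and 2
l_total_m = 1.10 * l_local_m                         # plus 10% routed on metal 3
c_per_cm_pf = 2.0                                    # nominal 2 pF / cm
c_wire_nf = l_total_m * 100 * c_per_cm_pf * 1e-3     # m -> cm, pF -> nF

print(f"{l_local_m:.2f} m, {l_total_m:.2f} m, {c_wire_nf:.3f} nF")
```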

Device capacitances are calculated directly from (4) through (6), then multiplied by 50,000 to get a value of Cdevice = 1.338 nF.  Total capacitance within the module is 2.5 nF.  At 500 MHz and a switching activity factor of 0.15, the power consumption for a single 50,000 gate block is then 1.17 W.  For the entire design, random logic and local and intermediate interconnections contribute 17.55 W of power.  This can be roughly compared to the Alpha 21264, which is a 0.35 µm design of similar size; the logic portions of this chip consume about 35 W, or 50% of the total chip power [10].
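These totals can be checked with the sketch below.  It assumes the dynamic power expression has the standard form a * C * Vdd^2 * f and a supply voltage of 2.5 V.  The supply voltage is not stated in this section; it is an assumption chosen because it reproduces the quoted 1.17 W.

```python
VDD = 2.5          # volts; assumed, not given in the text
FREQ_HZ = 500e6
ACTIVITY = 0.15
N_MODULES = 15

c_wire_nf = 1.164
c_device_nf = 1.338
c_module_f = (c_wire_nf + c_device_nf) * 1e-9   # about 2.5 nF

p_module_w = ACTIVITY * c_module_f * VDD ** 2 * FREQ_HZ
p_logic_w = N_MODULES * p_module_w

print(f"module: {p_module_w:.2f} W, all modules: {p_logic_w:.1f} W")
```

The 17.55 W quoted above corresponds to 15 times the rounded per-module value of 1.17 W; carrying the unrounded value through gives a slightly higher total.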

Global Interconnect

Global interconnects are defined as those wires which run between modules.  Although the number of global wires is considerably smaller than local wires, their total length can be quite large.  Therefore, we consider their impact on the total chip power consumption.  Previously in BACPAC, we have calculated the length of an average global wire by applying Rent's rule to the global hierarchy.  A brief review of this process follows.  Given the design module size and the total expected number of logic transistors, the total number of modules is found.  Each module is regarded as a gate in the nomenclature of Rent's rule.  The global Rent's exponent is assumed to be comparable to that of the local Rent's exponent.  At this point, Donath's model is applied to the global system where each block is a gate and the gate pitch is set by the relationship between logic area and number of blocks.  A point-to-point average global wirelength is found, and this is scaled using an empirical formulation first suggested by IBM [8] to calculate the average global wirelength for the entire design, Lg_avg.  It should be noted that only the logic area is used in these calculations; the memory area is not considered to impact the global wirelength.

Once a typical global interconnect is found, we seek to determine the number of global wires in a design.  Once this has been calculated, the total global wirelength can be determined and power dissipation calculated.  Again, we apply Rent's rule to the global level of the design hierarchy.  We are aiming to calculate the number of pins or external connections for each module.  Rent's rule was initially formulated to calculate the same thing, except at the chip-level.  If we assume each 50,000 gate module (or similar size block) can be viewed as a chip, then by determining the number of pins on each block, we can determine the number of inter-modular wires or nets in the design.  The figure below will help to visualize this step.  According to Rent's rule, we have:

T = K * Ng^p     (7)
Here T is the number of terminals at the periphery of the module, K is a factor related to the number of terminals per gate, Ng is the number of gates in the block, and p is Rent's exponent.  After determining the number of external connections per module, it is straightforward to find the number of global pins in the design.

Nglobal_pins = Nmodules * T = Nmodules * K * Ng^p     (8)

We are interested in the number of global nets rather than pins.  Converting from pins to nets requires us to divide by the average number of pins in each net.  This is equal to the average fan-out plus 1, where the 1 accounts for the output of the originating gate.  Thus, the total number of global nets can be defined as:

Nglobal_nets = Nglobal_pins / (foavg + 1) = Nmodules * K * Ng^p / (foavg + 1)     (9)
The following figure demonstrates this relationship between nets and pins.  In that example, there are a total of 6*4 = 24 global pins.  The average fan-out is 2, leading to a total of 24 / (2 + 1) = 8 global nets.  One of these nets is highlighted in the figure.  This approach to finding the number of global, or inter-modular, wires is similar to that taken in [11].  However, the physical nature of this approach is highlighted by our hierarchical design philosophy.  By designing with larger block sizes, we see that Ng in (9) increases while Nmodules drops.  Since p < 1 (typically ~ 0.5), the overall result is a decrease in the number of global wires.  These wires have effectively been absorbed by the larger module size.  An analogous argument can be made when examining smaller block sizes  -- the result is a rise in the number of inter-modular wires.
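The pin-to-net conversion and the block-size argument can be sketched in Python as follows.  Rent's rule is used in its usual form T = K * Ng^p.  The value K = 3.5 is an assumption, back-solved from the 783 terminals per 50,000-gate block quoted in the sample calculation later in this section; it is not stated in the text.  The 100,000-gate case is a hypothetical re-partitioning of the same 750,000 gates.

```python
def module_terminals(k, n_gates, p):
    """Rent's rule: external terminals of one module, T = K * Ng^p."""
    return k * n_gates ** p


def global_nets(n_modules, k, n_gates, p, avg_fanout):
    """Global pins summed over all modules, divided by pins per net (fan-out + 1)."""
    pins = n_modules * module_terminals(k, n_gates, p)
    return pins / (avg_fanout + 1)


# Figure example: 6 modules with 4 pins each, average fan-out of 2.
figure_nets = 6 * 4 / (2 + 1)

K = 3.5  # assumed; reproduces roughly 783 terminals per 50,000-gate block
terminals = module_terminals(K, 50_000, 0.5)
nets_50k = global_nets(15, K, 50_000, 0.5, avg_fanout=2.5)
nets_100k = global_nets(7.5, K, 100_000, 0.5, avg_fanout=2.5)  # same total gate count

print(round(terminals), round(nets_50k), round(nets_100k))
```

Doubling the block size while halving the module count reduces the global net count by a factor of about 2^(1 - p), which is the absorption effect described above.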

Now that the number of global nets has been determined, we need to take into account the impact of repeater capacitance.  Since repeaters are typically much larger than the average logic gate, their intrinsic device capacitance can be quite large.  We approximate the total number of drivers for global wires by dividing the total global wirelength by the maximum distance a wire can be run before buffering becomes advantageous.  The device capacitance (input + junction) is calculated for the optimal driver size, which is normally determined for the top metal layers according to [13]:

(10)
where Rdev and Cdev are the device resistance and input capacitance of a minimum-sized inverter.  Wiring characteristics are normalized to a unit length.  This derivation, while correct, yields an area-inefficient buffering system.  A large reduction in driver size results in only a small delay penalty.  Considering that the number of global wires can be expected to rise while the optimal buffering distance (Lcrit) shrinks, we anticipate an increase in the total buffer capacitance in future designs.  In the interests of keeping the corresponding power dissipation small, we have developed a new optimal sizing criterion which concentrates on a product of delay and buffer width.  Instead of simply optimizing delay, we optimize the product of delay and the cube root of device width -- this formulation has been shown to give predictably good results.  Since we are normally concerned with the maximum wirelength before buffering, we can see from the figures below that this new optimal buffering criterion yields 50% smaller area (and power) while resulting in only a 12% increase in delay.  While the new formulation is not complex, it can be approximated to yield S / 2 at Lcrit.  At shorter line lengths, it will give an even smaller buffer size.  However, we are most interested in buffer sizing at Lcrit; therefore, we approximate the area-optimal buffer size by S / 2.  The exact expression to determine the area-optimal buffering size is included at the end of this page.  From this sizing criterion, the buffer capacitances are calculated and included in the global interconnect power consumption.  Lcrit can be calculated by [13]:

(11)

The total global interconnect power consumption is calculated as:

(12)
In (11), Cw is the wiring capacitance per unit length of the top metal levels.  Although global wires may be routed on intermediate wiring levels, this approximation greatly simplifies the calculations while incurring minimal error.  Nbuffer in (12) is equal to (Lg_avg * Nglobal_nets / Lcrit) and a is typically in the range of 0.1 to 0.2.

Sample calculations

We will re-examine the same 0.25 µm design as above.  We will assume a value for Lg_avg of 7 mm and an average fan-out of 2.5 for global nets.  In this case, we calculate the number of external pins for each 50,000 gate module in the logic portion of the processor.  For a typical microprocessor Rent's exponent of 0.5, we find that each block has 783 external connections.  Converting this value to the total number of global nets results in a value of 3354 nets.  Now, the total global wirelength is 7 mm * 3354 nets = 23.5 meters.  We find the number of buffers required as 23.5 m / 5.65 mm = 4106 buffers.  The buffer size for an upper metal level with pitch of 4 µm, thickness of 2.5 µm and copper wiring is Wn = 93.5 µm.  With an activity factor of 0.15, we obtain:

(13)
In this scenario, the drivers and the wires have roughly the same contribution to the total capacitance.  This component of capacitance can easily contribute 10% or more of the total chip power consumption.  Modeling of global interconnect in [1] underestimates the impact of device capacitances as well as the total number of global nets.  They consider only the system bus in global capacitance calculations, yielding a very small number of inter-modular wires.

Clock Distribution

The clock network in a high-performance design typically consumes a large amount of power.  This is especially true in modern microprocessors where the clock can consume 20 to 40% of the total power.  ASIC's normally have less stringent requirements on issues such as clock skew, leading to smaller amounts of clock power.  Nonetheless, the clock distribution network should be considered in all discussions of power modeling.  In [1] the authors consider one type of clock distribution network: the H-tree.  There are several drawbacks to this form of clock distribution, including the large buffering system required to drive the network at the root.  With such a large buffering system, the likelihood of severe inductive ringing increases, as the output resistance of the driver network will be significantly less than the characteristic impedance of the H-tree.  In this manner, the simple H-tree can be seen to be a non-scalable system for future designs [14].  Instead of the H-tree, in BACPAC we base our clock distribution models on the buffered H-tree, which has a key advantage.  Since the wiring network is periodically buffered, much smaller linewidths can be used without sacrificing performance.  This allows for less congestion on global metal layers, as well as smaller capacitances.  By properly sizing the buffers and the wires, very low skew can be achieved.  However, the impact of process variation (specifically in the buffer's effective channel length) is important as the clock skew is directly proportional to the degree of process control achieved.  We now examine a generic buffered H-tree design and look at its contribution to power consumption.

A buffered H-tree with 4 clusters is shown in the figure below.  The cluster size is determined by the maximum amount of skew allowable in the design.  Typically, this is about 10% of the clock cycle.  In very high-performance microprocessors, controlling clock skew is vital and designers have been able to develop networks that result in clock skew that is less than 5% of the clock period.  However, in ASIC's the clock network is not as significant and clock skew may even exceed 10% of the clock cycle time.  In BACPAC, the designer can specify a target for skew in terms of clock period.  Once this is done, the largest possible cluster size is determined by finding the delay from a latch in the center of the cluster to one located in the corner.  The total distance that must be traversed in this instance is L, where L is the side-length of a cluster.  We assume that the clock network is implemented in the top-level metal and, following common practice to limit process variation, an underlying shield layer is used to minimize dielectric thickness variation across the chip.  In terms of wireability then, we concentrate on the top level metal and then duplicate that on the previous layer.

In order to find the maximum L that meets the Tskew requirement, we use Sakurai's delay expression [15] to find the delay between a point just at the output of the buffer and the point in the corner of the cluster.  In this case, we expect the dominant source of delay to be the charging of latch capacitances along the wire resistance.  Since these latches will not be lumped at the end of the line, but distributed along the length L of the wire, we model this RC component as distributed, which differs from Sakurai's initial formulation.  In addition, the input-slope dependent delay term is eliminated since it is not line-length dependent.  Finally, we institute a rudimentary form of process variation; we set worst-case values for wiring resistance and capacitance as well as device resistance.  These worst-case values are expected to be 10% over nominal.  We do not assume that all destination capacitances have worst-case values since this capacitance is distributed over a large number of sinks, as opposed to a single wire or driver.  The likelihood of all sinks having worst-case input capacitances is negligible.  Our expression for Tskew then becomes:

(14)
The non-trivial variable in (14) is Clatches.  To determine the output capacitance due to the loading of latches, we must estimate the number of latches (sinks) for a wire spanning the length of a cluster.  To do so, we make the assumption that nearby latches are arranged in a standard-cell format.  Specifically, as shown in the figure below, we approximate that one in every ld cells is a latch, where ld is the average logic depth of the system.  Since the wire that leads to the farthest sink must move in both the horizontal and vertical directions, we expect to contact latches in both dimensions.  Given the typical dimensions of a standard cell gate (discussed elsewhere in BACPAC), we then approximate the number of latches encountered by the wire of length L as:

(15)
In (15), se is the silicon efficiency factor which is a measure of how much white-space a typical design has.  Its default value is approximately 0.5, although it may be higher for full-custom designs.  The logic depth can range depending on the application and design style; microprocessors may be in the range of 10-15, whereas ASIC's may exhibit logic depths in a very broad range (15 to 30 or more for lower performance designs).  MP is the contacted metal pitch of lower-level metals, which determines the cell size in modern standard cell libraries.  The factor of 3 at the beginning accounts for neighboring cells; we estimate that a wire running in upper level metals may provide the clock signal for 3 groups of latched logic (each group is a path).  This is illustrated in the following figure.

The total load capacitance is equal to the number of latches multiplied by the input capacitance of each latch.  A typical latch has 4 clocked elements -- 2 NMOS and 2 PMOS devices.  In this way, a minimum-sized latch has an input capacitance equal to twice that of a minimum-sized inverter.  Furthermore, we assume that latches are sized equivalently to the logic gates in the design to allow for sufficient speed.  So, Clatches in (14) becomes:

(16)
As an example, we set Tskew to be 100 ps for a high-performance 0.25 µm microprocessor.  With a target clock period of 2 ns, this represents 5% of the clock period.  With top metal level electrical characteristics of 44 Ω / cm and 1.62 pF / cm, we find that the optimal device sizing for the repeaters is Wn = 187 µm.  With a logic depth of 14 and a kopt value of 15, we calculate a cluster side length L of 5.8 mm.  The total latch load in this instance is 3.82 pF.  Without process variation, we obtain a maximum length of 6.2 mm.  This compares well with the HSPICE-derived value of 6.8 mm, which also neglects process variation.  In the HSPICE simulations, we assume that the latches are distributed along the length of the wire.  The simulation results are shown in the figure below with a clock frequency of 500 MHz.  Several points along the length of the wire are shown for reference -- the end of the line demonstrates skew of about 100 ps from the beginning of the line, with a reasonable rise time ( ~ 15% of the cycle time, or rise + fall times consume about 30% of the clock period).

The above figure shows the evolution of the clock signal as it propagates from the driver to the destination capacitances (latches).  Line length is 6.8 mm and total latch capacitance is 3.82 pF.  Clock skew is ~ 100 ps.

Once the maximum cluster size is determined, the chip itself must be divided into a fixed number of such clusters.  A symmetric H-tree requires 2^n clusters where n is an even integer.  Most ASIC designs, with less stringent clocking requirements, may require only 4 clusters while microprocessors are typically divided into sixteen.  Designs requiring more than 16 clusters should be re-considered as the amount of wiring and buffer capacitance will be quite large.  The number of clusters is then determined by dividing the total chip area by the cluster area (L^2) and then rounding up to the next feasible value (e.g. 4 or 16).  For the remainder of this discussion we will refer to an example design, where there are 16 clusters.  Given that the phase-locked loop (PLL) which generates the clock signal is typically at the periphery of the chip, the clock network will require n + 1 levels of hierarchy.  Also, the number of buffers required will be 2^(n+1), or simply 2 * Nclusters.  In our example case, we have now determined that we will need 32 buffers and each will have the optimal driver size for the top level metallization.  Additionally, we wish to find the amount of wiring needed for this network.  Since the routing structure is highly regular, it is straightforward to determine the total wiring in the n + 1 levels.  However, we should also account for the wiring after the last global driver, since this will be the majority of the clock wiring.  The total wiring in the network prior to the "within-cluster" routing can be calculated by:

(17)
In (17), the trunc( ) function performs an integer truncation; when i is even, it simply yields 0.5 * i.  When i is odd, it rounds down.  For a design with 5 levels, this yields a total wirelength prior to the "within-cluster" routing of 5 * Dc, where Dc is the chip-side length.  For a design with 4 clusters, such as the one in a previous figure, Cwiring is equal to 2 * Dc.  We assume that all clock routing can be done in minimum-pitch top-level metals as the buffering of the signals at regular intervals will maintain good signal propagation characteristics.  As mentioned, the majority of clock wiring in a buffered H-tree will occur within the clusters themselves.  We now estimate this top-level routing (lower-level jogs to the actual devices should be short and are not considered herein) by allocating wires from the central driver to the perimeter of the cluster in all directions.  The figure below demonstrates this approximation.

Therefore, the total "within-cluster" routing length is approximated by:

(18)
This wirelength can be multiplied by the capacitance per unit length and summed with Cwiring from (17) to yield the total wiring capacitance load on the clock network.
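The cluster-count rule and the H-tree trunk wiring described earlier can be sketched in Python.  The rounding rule, level count and buffer count follow the text.  The trunk summation is an assumption: it is one form that uses trunc(i / 2) and reproduces the two totals quoted for (17), 5 * Dc for 5 levels and 2 * Dc for 3 levels, and it may differ in detail from the original expression.  All names are hypothetical.

```python
def plan_clusters(chip_area_mm2, cluster_side_mm):
    """Round the cluster count up to 2^n with n even (4, 16, 64, ...)."""
    needed = chip_area_mm2 / cluster_side_mm ** 2
    n = 2  # assume at least 4 clusters
    while 2 ** n < needed:
        n += 2
    clusters = 2 ** n
    return {"n": n, "clusters": clusters, "levels": n + 1, "buffers": 2 * clusters}


def trunk_wirelength(n_levels, chip_side):
    """Assumed H-tree trunk length: level i has 2^(i-1) segments of
    length Dc / 2^(trunc(i/2) + 1); level 1 runs from the chip edge to the center."""
    total = 0.0
    for i in range(1, n_levels + 1):
        segments = 2 ** (i - 1)
        segment_length = chip_side / 2 ** (i // 2 + 1)  # i // 2 is trunc(i / 2)
        total += segments * segment_length
    return total


# Sample design: 250 mm^2 chip with a 5.8 mm cluster side length.
plan = plan_clusters(250.0, 5.8)
trunk_16 = trunk_wirelength(plan["levels"], chip_side=1.0)  # in units of Dc
trunk_4 = trunk_wirelength(3, chip_side=1.0)

print(plan, trunk_16, trunk_4)
```

For the sample design, 250 / 5.8^2 is about 7.4, which rounds up to 16 clusters and therefore 5 levels and 32 buffers, matching the example in the text.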

The total output load consists of more than latches -- the memory section of the design (if there is one) will contain a large number of transistors that need to be clocked, including pre-charge gates for the bit/bit_bar lines and bit line control circuitry.  According to [1], we approximate the capacitance of clocked transistors in the memory arrays by:

(19)
Cin_nmos is the input capacitance of a minimum-sized (W/L = 1) NMOS transistor, while 2^k is the number of columns in the memory array.  The additional 4 in the exponent corresponds to the sizing of the PMOS pre-charge transistors (2X NMOS), the presence of a pre-charge gate for both bit and bit_bar lines, and the clocked transistors in the bit line controller.  We estimate k in this manner:

(20)
The main goal of this empirical formulation is to ensure that there are at least as many columns as rows and typically about twice as many.  The factor of 0.8 accounts for non-cache transistors in the memory arrays, while dividing this value by 6 yields the number of 6T cells, or bits.  As an example, for the Alpha 21164 microprocessor, there are approximately 7 million transistors in the memory arrays.  Equation (20) yields a k value of 10 for this system, meaning there are 2^10 = 1024 columns in the level 1 and 2 caches.  In this case, there are also 2^10 rows, resulting in a square array.  In the case of the Alpha 21064, (20) gives a total of 2^9 columns and 2^8 rows.
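The column and row estimate can be illustrated with the Python sketch below.  The exact form of (20) is not reproduced here, so the rounding rule is an assumption: take the bit count as the nearest power of two and give the columns the larger half of the exponent.  It reproduces both examples above.  The second input is a hypothetical transistor count chosen to give 2^17 bits; it is not a figure from the text.

```python
import math


def memory_array_shape(n_mem_transistors, cache_fraction=0.8, transistors_per_bit=6):
    """Return (k, r) such that the array has 2^k columns and 2^r rows.

    Assumed rule: bits ~ 2^(k + r) with k = ceil((k + r) / 2), so that
    columns >= rows and at most twice as many.
    """
    bits = cache_fraction * n_mem_transistors / transistors_per_bit
    total_exponent = round(math.log2(bits))
    k = math.ceil(total_exponent / 2)
    return k, total_exponent - k


square = memory_array_shape(7_000_000)   # about 7 million memory transistors
wide = memory_array_shape(983_040)       # hypothetical count giving 2^17 bits

print(square, wide)
```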

Now, the total clock network capacitance can be calculated.  It is given by:

(21)
Since the clock signal has a rising and a falling transition during each clock cycle, the switching activity factor for clock buffers, pre-charge gates, all wiring, and latch input gates is one.  In other words, there will be an energy-consuming transition in these circuits during every clock cycle.  This high activity factor is one of the main reasons that clock power consumption is a large part of total power.  The total clock power dissipation is then:

(22)
Sample calculations

For our generic 0.25 µm high-performance microprocessor, we will estimate the power consumption of a buffered H-tree.  For a chip area of 250 mm^2, a logic depth of 14, and a total of 7 million transistors in the memory, we estimate:

(23)
This clock load, with an activity factor of 1, leads to a total clock power consumption of 5.36 W at 500 MHz.  The majority of the clock capacitance is in the loading by latches.  This implies that highly pipelined designs with shallow logic depths will have very large capacitive loads since they contain more latches.  While the clock power consumption in this example is not extremely large, it is a significant portion of the total power: 21% of dynamic power when memory power (typically small) and I/O power are not considered.  It should also be remembered that some of the very high power clock networks reported in the literature are not buffered H-trees, but clock grids.  Clock grids are very power-hungry, while buffered H-trees strike a good balance between skew limitations and power efficiency.  For example, in a DEC StrongARM processor, optimized for low power, the buffered H-tree clock network consumed only 10% of the total chip power [16].  For this reason, buffered H-trees are the clock distribution network of choice in both high-performance microprocessors and ASIC's.
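
As a rough cross-check of these figures, the sketch below inverts the usual dynamic power relation P = a * C * Vdd^2 * f to find the clock capacitance implied by 5.36 W at 500 MHz.  The 2.5 V core supply is an assumption, taken from the memory example later in this section, where 500 mV is described as 20% of the supply.  The function name is illustrative.

```python
def switched_capacitance(power_w, vdd_v, freq_hz, activity=1.0):
    """Capacitance implied by P = activity * C * Vdd^2 * f."""
    return power_w / (activity * vdd_v ** 2 * freq_hz)

clock_power_w = 5.36   # reported clock power
vdd_core_v = 2.5       # ASSUMPTION: core supply for the 0.25 um design
f_clk_hz = 500e6       # clock frequency from the example

c_clock = switched_capacitance(clock_power_w, vdd_core_v, f_clk_hz)
# The clock is stated to be 21% of dynamic power (memory and I/O excluded).
implied_dynamic_w = clock_power_w / 0.21

print(f"implied clock capacitance : {c_clock * 1e9:.2f} nF")
print(f"implied dynamic power     : {implied_dynamic_w:.1f} W")
```

Under these assumptions the clock network switches on the order of 1.7 nF every cycle.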

I/O Drivers

The power modeling of I/O drivers is relatively straightforward.  It consists of first determining the number of signal pads in a design.  Each of these pads has a large capacitance associated with it that corresponds to the off-chip capacitance connected to the pad.  This value is relatively constant for a given packaging technology.  Once this value is determined, the pad can be optimally driven by a cascaded chain of repeaters in which each stage is larger than the previous one by a factor of f0.  This is the approach discussed first in [1] and subsequently in [11].  We make several important enhancements to these previous approaches.  The first is to develop a more accurate model for the total driver chain capacitance, one that accounts for the intrinsic output capacitance of the buffers.  In addition, we demonstrate that the pre-factor of 0.5 reported in [11] to account for the inverting nature of a chain of repeaters is not required.

From data collected from numerous semiconductor vendors, the typical off-chip pad capacitance that needs to be driven is approximately 15 pF.  This value can vary according to the user's process specifications.  In our discussion, this off-chip capacitance will be referred to as the load capacitance, CL.  According to [17], the number of repeaters in a chain can be determined by examining two things:  First, the ratio of CL to the input capacitance of a minimum-sized inverter in the given process technology.  This ratio is called Y.  Second, the ratio of intrinsic device output capacitance to input capacitance, or Cj / Cin.  This ratio is denoted by g.  Once we know CL, we can determine both these ratios.  At this point, the optimal tapering factor should be determined recursively by using (24):

(24)
Subsequently, f0 can be used to find the optimal number of repeaters in the cascaded chain, N0.

(25)
To save area at a minimal cost in performance, N0 is rounded down to the nearest even integer so that the chain remains non-inverting.  Since the actual number of stages, N, will not be exactly equal to N0, we re-evaluate the scaling factor and call it simply f:

(26)
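
The sizing procedure can be sketched in a few lines.  Equations (24) to (26) are not reproduced on this page, so the forms used here are assumptions: the commonly quoted fixed-point relation f0 = exp(1 + g / f0) for (24), N0 = ln(Y) / ln(f0) for (25), and f = Y^(1/N) for (26).  The minimum inverter input capacitance of 0.93 fF is also an assumption (a 0.31 fF NMOS gate plus a 2X PMOS gate), and all function names are illustrative.

```python
import math

def optimal_taper(gamma, tol=1e-9, max_iter=100):
    """Solve f0 = exp(1 + gamma / f0) by fixed-point iteration (assumed form of (24))."""
    f = math.e
    for _ in range(max_iter):
        f_next = math.exp(1.0 + gamma / f)
        if abs(f_next - f) < tol:
            return f_next
        f = f_next
    raise RuntimeError("taper iteration did not converge")

def size_chain(c_load, c_in, gamma):
    """Return (f0, N0, N, f) for a non-inverting driver chain."""
    y = c_load / c_in                    # ratio Y of load to minimum input capacitance
    f0 = optimal_taper(gamma)
    n0 = math.log(y) / math.log(f0)      # assumed form of (25)
    n = max(2, 2 * int(n0 // 2))         # round down to the nearest even integer
    f = y ** (1.0 / n)                   # assumed form of (26)
    return f0, n0, n, f

# ASSUMPTION: c_in = 0.93 fF for a minimum-sized inverter; 15 pF load, g = 0.43.
f0, n0, n, f = size_chain(c_load=15e-12, c_in=0.93e-15, gamma=0.43)
print(f"f0 = {f0:.2f}, N0 = {n0:.2f}, N = {n}, f = {f:.2f}")
```

With these assumed inputs the sketch gives an eight-stage chain with a tapering factor of roughly 3.4.
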
At this point, we are ready to calculate the total capacitance for one pad, including the load capacitance and the internal buffer capacitances.  We must include all the intrinsic device capacitances including the junction capacitance in this analysis.  This point has been missing in previous approximations to this problem [1,11,17].  Knowing that we are scaling the input capacitance of each buffer from its minimum value to the value of the load capacitance, we can sum all capacitances as:

(27)
Equation (27) is shown graphically in the figure below with f = 3, N = 4, and g = 0.5.  In that instance, the total capacitance is 1.74CL.  The factor of g raises the total capacitance from 1.49CL to 1.74CL.  This new model has been verified using HSPICE for an 8-stage driver with tapering factor of 3.49 and load capacitance of 30 pF.  Both HSPICE and (27) predict a total switching capacitance of 1.57CL.
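
The numbers quoted above can be reproduced with a simple geometric sum.  Equation (27) is not reproduced on this page, so the form below is inferred from the description rather than copied from it: each buffer presents an input capacitance CL / f^i, plus an intrinsic output capacitance equal to g times that input capacitance.  For the HSPICE case, g = 0.43 is borrowed from the sample calculation later in this section, which is an assumption.

```python
def chain_capacitance_ratio(f, n, gamma):
    """Total switched capacitance of load plus driver chain, in units of CL.

    Assumed form: 1 + (1 + gamma) * sum_{i=1..n} f**(-i).
    """
    gate_sum = sum(f ** -i for i in range(1, n + 1))
    return 1.0 + (1.0 + gamma) * gate_sum

print(f"f=3, N=4, g=0       : {chain_capacitance_ratio(3, 4, 0.0):.2f} CL")
print(f"f=3, N=4, g=0.5     : {chain_capacitance_ratio(3, 4, 0.5):.2f} CL")
print(f"f=3.49, N=8, g=0.43 : {chain_capacitance_ratio(3.49, 8, 0.43):.2f} CL")
```

Under this assumed form, the three cases agree with the 1.49 CL, 1.74 CL, and 1.57 CL values given in the text.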

So, the total power due to I/O drivers is calculated as:

(28)
Here, the K(Ng)^p term calculates the number of signal pads for the design.  The I/O pins of a design typically have a higher voltage than the core of the design in order to support compatibility with package-level and board-level circuitry.  Normally, this voltage is one generation behind (e.g. in 0.25 µm designs, Vdd_I/O = 3.3 V).  The switching activity factor accounts for the probability of 0 to 1 transitions at the pad itself.  A previous study has suggested that only half of the buffer capacitance needs to be considered as active during any one cycle, since a 0 to 1 transition at the pad implies a 1 to 0, or non-energy-consuming, transition at half the internal buffers [11].  However, one should recall that a 1 to 0 transition at the pad will then yield a 0 to 1 (energy-consuming) transition at internal stages, which will be missed by the activity factor, a.  If we assume that rising and falling transitions at the pad are equally probable, then equation (28) holds.

Sample calculations

For the same 0.25 µm microprocessor design that we have considered throughout this section, we calculate the number of signal pads to be 376, using a Rent's exponent of 0.35 and an average fan-out of 2.3 (the fan-out gives the value of K, the average number of pins per gate).  If we assume an off-chip capacitance of 15 pF, g of 0.43 (from HSPICE), and Vdd_I/O of 3.3 V, we obtain a total I/O power consumption of 7.4 W.  This value assumes a switching activity factor of 0.15.  The intrinsic buffer capacitances contribute 2.79 W to this value.  In some instances, designers use this smaller value as the I/O component of the on-chip power dissipation, since the larger pad capacitance is technically off-chip.
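
The split between pad and buffer power can be sanity-checked with the sketch below.  It assumes that equation (28) has the form (number of pads) * a * C * Vdd_I/O^2 * f and that the 500 MHz clock of the earlier clock example applies; both are assumptions, and the function name is illustrative.

```python
def pad_power(n_pads, activity, cap_f, vdd_io_v, freq_hz):
    """Dynamic power of n_pads identical pads, each switching cap_f farads."""
    return n_pads * activity * cap_f * vdd_io_v ** 2 * freq_hz

# Power spent charging only the 15 pF off-chip loads.
load_only_w = pad_power(n_pads=376, activity=0.15, cap_f=15e-12,
                        vdd_io_v=3.3, freq_hz=500e6)  # ASSUMPTION: 500 MHz

reported_total_w = 7.4                               # total quoted in the text
buffer_share_w = reported_total_w - load_only_w      # left for the driver chains
implied_ratio = reported_total_w / load_only_w       # total capacitance / CL

print(f"off-chip load power : {load_only_w:.2f} W")
print(f"buffer contribution : {buffer_share_w:.2f} W")
print(f"implied Ctotal / CL : {implied_ratio:.2f}")
```

The remainder comes out at about 2.79 W, which matches the buffer contribution stated above.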

On-chip Memory Power

On-chip memory is typically arranged as an array of 6T SRAM cells with 2^k columns and 2^n rows.  Therefore, the chip has a total of 2^(n+k) bits of memory, or 2^(n+k) / 8 bytes.  Modern microprocessors may contain a 32 kB instruction cache as well as a 32 kB data cache.  ASIC designs may have somewhat smaller amounts of memory, such as 4 kB blocks of SRAM.  In addition, microprocessors may contain larger level-2 caches that are an additional level of memory between the chip and the main memory.  These blocks are usually very large, e.g. currently 128 kB to 1.5 MB.  In addition to the memory array structure, there is supporting circuitry that allows the memory to function correctly.  This circuitry contains address buffers, row decoders, column selectors, sense amplifiers, and other control logic.  Of the total number of transistors associated with on-chip memory, usually about 80% are located in the cache itself (6T SRAM cells) and the other 20% make up this control logic.  We now proceed with determining the power requirements of an on-chip memory array.

During a read operation, several things occur.

• The word line of interest is charged up to Vdd.  This involves the charging of the wire capacitance as well as 2 access NMOS transistors for each cell in the row.
• Each column now passes its low value to either the bit or bit_bar line.  Since current is not being drawn from Vdd, this is not an energy-consuming transition.  However, it should be noted that, due to the presence of the sense amplifiers, the voltage swing on the bit or bit_bar lines is not a full Vdd.  State-of-the-art SRAM blocks see a swing of approximately 100 mV, which greatly reduces the power consumption in the next step.  However, in high-performance VLSI designs, it is not always possible to use such a low swing: noise considerations on large designs may make it difficult, or the static power consumed by the sense amplifiers may become very large.  The voltage swing on the bit line is a user-input variable in BACPAC.
• When the clock goes low, the bit and bit_bar lines are both pre-charged high again.  Only one of these lines has been switched at all, so only one line per column will result in the PMOS pre-charge gate supplying current and consuming power.  The total capacitance that is charged is equal to the bit line wiring capacitance plus one drain capacitance for each row in the array.

The write operation is slightly different.  The capacitances being charged are mostly the same, except that one of the bit lines must see a full Vdd swing to write that value into the specified SRAM cell.  In reality, the swing must simply be larger than the switching voltage (~ Vdd / 2).  This larger voltage swing on one line does not change the effective switching capacitance greatly since there may be thousands of bit lines in the array.

The total capacitances for the read and write operations are given by:

(29)
(30)
Equations (29) and (30) account for the smaller voltage swing on certain capacitors by reducing their effective capacitance by Vswing / Vdd.  Lcell is defined as the side-length of a 6T SRAM cell in the process.  Cin is the gate capacitance of a minimum-sized NMOS transistor while Cdrain is the junction capacitance for the same size device.

As discussed in [1], the dominant part of the on-chip memory power consumption is due to the charging of the capacitances described above.  Power consumption in the supporting circuitry is typically smaller than the cell-based power.  Therefore, we use (29) and (30) to find the memory power consumption.  We average the read and write capacitances and then assume a switching activity of 1, as the caches are accessed nearly every cycle.  While microprocessor speed is not set by memory constraints, the efficiency of the chip is usually limited by the rate at which memory can be accessed.  This implies that the level 1 and 2 caches will be busy accessing cells nearly all the time.  In addition, it has been mentioned in the literature that the instruction cache in particular runs every cycle and the data cache is not far behind [16].  Therefore, a switching activity factor of 1 is considered a reasonable approximation.

Then, the on-chip memory power consumption is given by:

(31)
Sample calculations

We assume our 0.25 µm microprocessor contains two level 1 caches, each of size 64 kB.  We calculate Lcell to be 3 µm.  From our previous empirical estimation of k (equation (20)), we approximate a 128 kB array as having 2^11 columns and 2^9 rows.  The input capacitance of a minimum-sized NMOS device is 0.31 fF and the drain capacitance is 0.5 fF.  For a voltage swing of 500 mV (20% of the supply voltage for noise immunity), we obtain a read capacitance of 235.2 pF and a write capacitance of 238.1 pF.  The average is used in (32):

(32)

This value is small, which is expected, as memory cache arrays do not consume a large amount of power despite their high transistor count.
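
The arithmetic behind this estimate can be sketched as follows.  The result of equation (32) is not reproduced on this page, so the figure computed here is only what the stated inputs give under three assumptions: that (31) has the form a * C_avg * Vdd^2 * f, that the core supply is 2.5 V (since 500 mV is described as 20% of the supply), and that the clock runs at 500 MHz as in the earlier examples.

```python
c_read_f = 235.2e-12    # read capacitance from the text
c_write_f = 238.1e-12   # write capacitance from the text
c_avg_f = 0.5 * (c_read_f + c_write_f)

activity = 1.0          # caches assumed busy every cycle, as argued above
vdd_core_v = 2.5        # ASSUMPTION: 500 mV swing is 20% of the supply
f_clk_hz = 500e6        # ASSUMPTION: same clock as the earlier examples

memory_power_w = activity * c_avg_f * vdd_core_v ** 2 * f_clk_hz

print(f"average capacitance : {c_avg_f * 1e12:.2f} pF")
print(f"memory power        : {memory_power_w:.2f} W")
```

Under these assumptions the memory arrays dissipate well under 1 W, which is consistent with the remark that cache power is small compared with the clock and I/O components.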

Appendix: Area-optimal Repeater Sizing

By multiplying a weighted area function (W^(1/3)) by the delay, we have developed a new objective function to minimize.  Since optimization of delay alone usually results in overly large buffers with high power requirements, we have used the weighted area function to determine an area-optimal buffer size.  The product of Td and W^(1/3) is then differentiated and the minimum value is found at:

(A1)
This expression gives a smaller value of W than the original formulation in [13]; the delay is consequently higher, but the area and power savings are far more significant.  At line lengths that approach Lcrit, this formula gives areas that are typically 50 to 65% smaller than [13].  The delay penalty remains under 20% until the line length is about 1/4 of Lcrit.  At a line length of Lcrit / 4, the delay is not substantial.  As a simpler approximation to (A1), we have found that at Lcrit, Wopt_area is 50.8% of Wopt as calculated in [13].  Therefore, we approximate the area-optimal buffer size in BACPAC as that given in [13] multiplied by 0.5.  For these reasons, we feel (A1) gives a better picture of an optimal buffer size in DSM VLSI designs.

References

[1] D. Liu and C. Svensson, "Power consumption estimation in CMOS VLSI chips," IEEE Journal of Solid-State Circuits, vol. 29, pp. 663-670, June 1994.
[2] A.P. Chandrakasan and R.W. Brodersen, "Minimizing power consumption in digital CMOS circuits," Proc. of the IEEE, vol. 83, pp. 498-523, April 1995.
[3] G. Gerosa, et al., "250 MHz 5-W PowerPC microprocessor," IEEE Journal of Solid-State Circuits, vol. 32, pp. 1635-1649, Nov. 1997.
[4] D. Sylvester and K. Keutzer, "Getting to the bottom of deep submicron," Proc. of International Conference on Computer-Aided Design, in press, 1998.
[5] T.R. Bednar, R.A. Piro, D.W. Stout, L. Wissel, and P.S. Zuchowski, "Technology-migratable ASIC library design," IBM Journal of Research and Development, vol. 40, pp. 377-385, July 1996.
[6] IBM SA-12 ASIC family databook, revision 5, June 1998.
[7] R. Payne, "Metal pitch effects in deep submicron IC design," Electronic Engineering, pp. 45-47, July 1996.
[8] G.A. Sai-Halasz, "Performance trends in high-end processors," Proc. of the IEEE, vol. 83, pp. 20-36, Jan. 1995.
[9] W.E. Donath, "Placement and average interconnection lengths of computer logic," IEEE Transactions on Circuits and Systems, vol. 26, pp. 272-277, April 1979.
[10] M.K. Gowan, L.L. Biro, D.B. Jackson, "Power considerations in the design of the Alpha 21264 microprocessor," Proc. of Design Automation Conference, pp. 726-731, 1998.
[11] B.M. Geuskens, "Modeling the influence of multilevel interconnect on chip performance," Ph.D. thesis, Rensselaer Polytechnic Institute, 1997.
[12] N. Vasseghi, K. Yeager, E. Sarto, and M. Seddighnezhad, "200-MHz Superscalar RISC Microprocessor," IEEE Journal of Solid-State Circuits, vol. 31, pp. 1675-1685, Nov. 1996.
[13] H.B. Bakoglu and J.D. Meindl, "Optimal interconnection circuits for VLSI," IEEE Transactions on Electron Devices, vol. 32, pp. 903-909, May 1985.
[14] A. Vittal and M. Marek-Sadowska, "Low-power buffered clock tree design," IEEE Transactions on Computer-Aided Design, vol. 16, pp. 965-975, Sept. 1997.
[15] T. Sakurai, "Closed-form expressions for interconnection delay, coupling, and crosstalk in VLSI's," IEEE Transactions on Electron Devices, vol. 40, pp. 118-124, Jan. 1993.
[16] J. Montanaro, et al., "A 160-MHz, 32-b, 0.5W CMOS RISC microprocessor," Digital Technical Journal, vol. 9, pp. 49-60, Jan. 1997.
[17] N. Hedenstierna and K.O. Jeppson, "CMOS circuit speed and buffer optimization," IEEE Transactions on Computer-Aided Design, vol. 6, pp. 270-281, March 1987.
