# Clock Buffer and Flip-flop Co-optimization for Reducing Peak Current Noise

Joohan Kim and Taewhan Kim

Department of Electrical and Computer Engineering, Seoul National University

*Abstract*—For high-speed digital circuits, the activation of all flip-flops that are used to store data should be strictly synchronized by clock signals delivered through clock networks. However, due to the high frequency of simultaneous switching of clock pins in flip-flops, a high peak power/ground noise (i.e., voltage drop) is induced at the clock boundary. To mitigate the current noise, we employ four different types of hardware component that can implement a set of flip-flops and their driving buffer as a single unit, which was previously used for reducing clock power consumption. (The idea for the generation of the four types of clock boundary component was that one of the two inverters in a driving buffer and one of the two inverters in each of its driven flip-flops can be nullified without altering the circuit functionality.) Consequently, we have a flexibility of selecting (i.e., allocating) clock boundary components in a way to reduce peak current under timing constraint. We formulate the component allocation problem of minimizing peak current into a multi-objective shortest path problem and solve it efficiently using an approximation algorithm. We have implemented our proposed approach and tested it with ISCAS benchmark circuits. The experimental results confirm that our approach is able to reduce the peak current by 27.7%∼30.9% on average.

## I. INTRODUCTION

In a synchronous digital circuits, a clock signal is delivered through clock distribution network to sequential elements. By the clock signal, all sequential elements (e.g., flip-flops) switch simultaneously at clock edges. This simultaneous switching causes a high peak current on the power/ground line, resulting in voltage fluctuation on the line. This is called as simultaneous switching noise (SSN) or power/ground bounce. The high peak current weakens the circuit performance and undermines the reliability of system [1].

Since clock buffers consume the current at the clock edges, a large amount of current is generated around the clock edges, which lets the clock buffers be one of the major sources of power/ground noise. For this reason, many researches have made efforts to divert peak current by exploiting clock skew scheduling (e.g., [2]–[6]), in which they tried to disperse peak current by manipulating delay under clock skew constraint. However, for designs with bounded clock skew constraint, the applicability is strictly limited. Clock buffer polarity assignment (e.g., [7]–[10]) is another technique used for reducing power/ground noise, which exploits the fact that a current peak of buffer (i.e., positive polarity) and inverter (i.e., negative polarity) appears at different clock triggering edge (rising and falling). An illustrative example is shown in Figs. 1(a) and (b), which are an initial clock tree of circuit s5378 with four sets of flip-flops and a clock tree obtained by simply replacing the



Fig. 1. Peak current profile for a buffered clock tree of circuit S5378. (a) An initial clock tree. (b) A clock tree produced by replacing two buffers in (a) with inverters. (c) The current (charging) flows for (a) and (b) caused by sink buffers/inverters and flip-flops.

two sink buffers in Fig. 1(a) with inverters. Mixing buffers and inverters at the boundary of clock tree is intended to disperse the power/ground noise from/to  $I_{DD}/I_{SS}$  at rising/falling edge of clock signal. (It also requires to replace the flip-flops driven by inverters with negative-edge triggered flip-flops.) To see how much the current is charged over time around clock tree boundary, we have conducted an HSPICE simulation for the clock trees in Figs. 1(a) and (b). The yellow and green dotted curves in Fig. 1(c) show the changes of the charging current i.e., power noise over time at the flip-flops and sink buffers in Fig. 1(a), respectively. It shows a very high current at the flipflops. The blue and black solid curves in Fig. 1(c) show the changes of the charging current over time at the flip-flops and sink buffers/inverters in Fig. 1(b), respectively. In comparison with the dotted curves, it is shown that the peak current of the solid curves is considerably reduced.

Note that even though the technique can be applied to clock skew bounded designs as well as designs with clock skew



Fig. 2. Four types of flip-flops [11] that can be exploited in our algorithm of clock tree boundary optimization. The four types exhibit different clock latencies and power consumptions.

scheduling, it is not able to control the current caused by the flip-flops on the boundary of clock tree. This work overcomes the limitation by extending the concept of polarity assignment on the clock tree boundary to further reduce the peak current at the clock tree boundary including flip-flops.



(a) Tapping points for measuring current (b) A simple clock tree

Fig. 3. Current measure points in flip-flops (used for Fig. 4) and a clock tree for analyzing the distribution of peak current (used for Fig. 5).



Fig. 4. Current profiles on (a) internal clock inverters (CI), (b) master latch (ML), (c) slave latch (SL), and (d) driving clock buffer/inverter. The two fluctuations in each profile correspond to the clock rising and falling edges.

## II. PRELIMINARY

For our extended polarity assignment, we employ four types of flip-flops devised by the work in [11], which are labelled as  $FF_{2inv}^+$ ,  $FF_{2inv}^-$ ,  $FF_{1inv}^+$ , and  $FF_{1inv}^-$  in Fig. 2, produced by nullifying inverters and/or exchanging inverters with opposite

polarity.  $FF_{2inv}^+$  and  $FF_{1inv}^+$  are flip-flops with positive polarity, and  $FF_{2inv}^-$  and  $FF_{1inv}^-$  are flip-flops with negative polarity. Note that since the four types exhibit different clock latencies and current profiles, a proper allocation of flip-flops would lead to a further reduction of power/ground noise. (Note that the conventional polarity assignment exploits  $FF_{2im}^+$  and  $FF_{2im}^$ only.)

To see how much current flows at various locations on every flip-flop type, we perform an HSPICE simulation for the flip-flop circuit in Fig. 3(a) where the clock inverter (CI), master latch (ML), and slave latch (SL) are the current tapping location. Figs. 4(a)∼(d) show, for each of four types of flip-flops, the current profile measured on CI, ML, SL, and leaf clock buffer(B)/inverter(I), respectively. The unmatched current curves in Fig. 4 clearly indicate that the peak current on the clock tree boundary can be further reduced if the four types of flip-flop are selectively assigned.

## III. MOTIVATIONAL EXAMPLE

We illustrate how the peak current changes according to comapping of four leaf clock buffers *e*1, *e*2, *e*3, and *e*<sup>4</sup> and their driven sets of flip-flops  $R_1$ ,  $R_2$ ,  $R_3$ , and  $R_4$  in the clock tree in Fig. 3(b) where each set contains a single flip-flop.

TABLE I THE PEAK VALUES OF *IDD* AND *ISS* CURRENTS FOR EVERY POSSIBLE CO-MAPPING OF CLOCK BUFFERS AND FLIP-FLOPS IN FIG. 3(B).

|                 |                             |                 |                 |          | Peak $(uA)$ | max      |          |
|-----------------|-----------------------------|-----------------|-----------------|----------|-------------|----------|----------|
| $R_1$           | R <sub>2</sub>              | $R_3$           | $R_4$           | $I_{DD}$ | <i>Iss</i>  | (uA)     | $\%$     |
| $FF_{2inv}^+$   | $FF_{2inv}^+$               | $FF_{2inv}^+$   | $FF_{2inv}^+$   | 322.4    | 359.7       | 359.7    |          |
| $FF_{2inv}^+$   | $FF_{2inv}^+$               | $FF_{2inv}^+$   | $FF^{-}_{2inv}$ | 282.4    | 333.3       | 333.3    | $-7.34$  |
| $FF_{2inv}^+$   | $FF_{2inv}^+$               | $FF^{-}_{2inv}$ | $FF^{-}_{2inv}$ | 307.2    | 347.7       | 347.7    | $-3.34$  |
| $FF_{2inv}^+$   | $FF^{-}_{2inv}$             | $FF^{-}_{2inv}$ | $FF^{-}_{2inv}$ | 279      | 346.1       | 346.1    | $-3.78$  |
| $FF^{-}_{2inv}$ | $FF^{-}_{2inv}$             | $FF^{-}_{2inv}$ | $FF^{-}_{2inv}$ | 319      | 358.0       | 358.0    | $-0.47$  |
| $FF^{-}_{2inv}$ | $FF^{-}_{1inv}$             | $FF^{-}_{1inv}$ | $FF^{-}_{1inv}$ | 287.7    | 291.2       | 291.2    | $-19.04$ |
| $FF^{-}_{2inv}$ | $FF_{1inv}^+$               | $FF^{-}_{1inv}$ | $FF^{-}_{1inv}$ | 232.2    | 291.6       | 291.6    | $-18.93$ |
| $FF_{1inv}^+$   | $FF^-_{\mathrm 1inv}$       | $FF^{-}_{1inv}$ | $FF^{-}_{1inv}$ | 256.3    | 294.3       | 294.3    | $-18.18$ |
|                 |                             | .               | .               | .        | $\cdots$    | $\cdots$ | .        |
| $FF_{2inv}^+$   | $FF_{\text{lim}}^+$         | $FF_{1inv}^+$   | $FF_{1inv}^+$   | 235.3    | 343.6       | 343.6    | $-4.48$  |
| $FF^{-}_{1inv}$ | $FF^{-}_{1\underline{inv}}$ | $FF^{-}_{1inv}$ | $FF^{-}_{1inv}$ | 292.4    | 357.2       | 357.2    | $-0.70$  |

Table I lists the peak values of *IDD* and *ISS* currents for all possible co-mappings of  $(e_1, R_1)$ ,  $(e_2, R_2)$ ,  $(e_3, R_3)$ , and  $(e_4, R_4)$ . The distribution of those peak current values can be represented as shown in Fig. 5, from which we can see that



Fig. 5. The distribution of peak current values for every possible co-mapping of leaf clock buffers and flip-flops in Fig. 3(b), partially or fully exploring the mapping types in Fig. 2.

the lowest peak current is obtained when the four types of flip-flops in Fig. 2 are fully explored, reducing the peak value by 19.04%, while exploring  $FF_{2inv}^+$  and  $FF_{2inv}^-$  only reduces the peak by 7.34%.

## IV. PROBLEM FORMULATION

Problem 1 (BOUNDARYNOISEMIN) : *Given a buffered clock tree T with a set E of (already allocated) leaf buffering elements, a library B of buffers, a library I of inverters, a set R of (already allocated) flip-flops driven by the cells in E, clock skew bound constraint* δ*, clock slew rate constraint* κ*, and circuit grid points P, time sampling slots S, replace the cells in E and*  $\mathcal{R}$  *by finding mapping functions*  $\phi_1 : E \mapsto B \cup I$  $and \phi_2 : \mathcal{R} \mapsto \{FF_{2inv}^+, FF_{1inv}^+, FF_{2inv}^-, FF_{1inv}^-\}$  *that minimize the quantity of*

$$
\max_{p \in P, s \in S} \left\{ \sum_{e_i \in E, f_j \in R} current(\phi_1(e_i), \phi_2(f_j), p, s) \right\} \qquad (1)
$$
\n
$$
s. \quad t. \quad \max_{f_i \in \mathcal{R}} \{t_i\} - \min_{f_i \in \mathcal{R}} \{t_i\} < \delta,
$$
\n
$$
\max_{e_j \in E} \{s_j\} < \kappa
$$

*where*  $t_i$  *and*  $s_i$  *are the clock arrival time at flip-flop*  $\phi_2(f_i)$  *and the output slew rate of the driving buffer or inverter*  $\phi_1(e_i)$ *, respectively. The term current* $(\phi_1(e_i), \phi_2(f_i), p, s)$  *is the value of peak current at a time sampling slot s on a grid point p caused by the switching of e<sup>i</sup> and f<sup>j</sup> when they are assigned*  $W$ *ith*  $\phi_1(e_i) \in \{B \cup I\}$ *, and*  $\phi_2(f_j) \in \{FF^+_{2inv}, FF^+_{1inv}, FF^-_{2inv}$  $FF^{-}_{1inv}$  }*.* 

#### V. THE PROPOSED ALGORITHM

## *A. An Overview*

Fig. 6 shows the flow of our proposed algorithm of BOUNDARYNOISEMIN. The inputs to our framework are a synthesized buffered clock tree, and clock skew and slew constraints  $\delta$  and  $\kappa$ , from which the preprocessing of gathering all the clock arrival times to each flip-flop, which are extracted<sup>1</sup>



Fig. 6. Flow of our proposed BOUNDARYNOISEMIN algorithm.

by mapping every leaf buffer and flip-flop to different cell types, is performed, followed by generating current profile and sampling through HSPICE simulation.

We first sorted the grid points on a circuit according to the initial peak current value, and apply our assignment algorithm in a greedy manner, from the grid point with the highest current peak to the point with the lowest peak.

We transform the flip-flop type assignment problem of a circuit into an instance of min-max problem, which is then transformed to an instance of multi-objective shortest path (MOSP) problem, for which we use a polynomial approximation algorithm devised by Warburton in [12] whose time and is bounded by  $O(rn^3(n/\epsilon)^{2r})$  and space by  $O(rn(n/\epsilon)^r)$ where  $r$  is the arc weight dimension and  $n$  is the number of vertices in MOSP graph. Our formulation to the MOSP problem is described in Sec. V-C. When solving the MOSP problem, we adopt a concept of superposition of current flows, described in Sec. V-B. In addition, the derivation procedure of an instance of MOSP problem is illustrated in Sec. V-C, and the heuristic of selecting a target grid points for reducing peak current is described in Sec. V-D. Finally, integrating clock power minimization into our framework is described in Sec. V-E.

# *B. Superposition of Current Flows*

**Definition 1**  $(R(i), R(i, j), I(i, j))$ :  $R(i)$  is defined to be the set of flip-flops that are directly driven by clock buffer  $b_i$ . (We call  $R(i)$  the set of sink flip-flops of buffer  $b_i$ .) Let  $p_j$  be a junction point in power mesh *P*. Then,  $R(i, j)$  is defined to be a maximal subset of  $R(i)$  such that every flop-flops in the subset pulls current through  $p_j$ . (We call  $R(i, j)$  the set of sink flipflops of buffer  $b_i$  on power grid  $p_j$ .) Finally,  $I(k, j)$  is defined to be the current profile pulled by a flip-flop  $\overline{f_k}$  through power grid point  $p_j$ <sup>2</sup>

Note that every flip-flop is directly driven by exactly one clock buffer. That is,  $R(i_1) \cap R(i_2) = \phi$  if  $i_1 \neq i_2$ , but it may pull current through more than one junction point in power mesh.

<sup>&</sup>lt;sup>1</sup>We used Synopsys IC Compiler for our experiment.

<sup>&</sup>lt;sup>2</sup>We assume that the currents are pulled only through power grid points.

Superposition of current flow on a junction  $p_i$  in power mesh *P* states that the total current flow flowing at  $p_i$  equals to the sum of the amount of currents that are pulled through  $p_i$  by the cells connected to *p<sup>i</sup>* .



Fig. 7. An illustration of superposition of currents. The current shown at a junction equals to the sum of currents pulled by the cells connected to the junction

By following the current superposition theorem, the total amount of current flow pulled by flip-flops on a power grid *p<sup>j</sup>* can be expressed as:

$$
\sum_{f_k \in \mathcal{F}} I(k,j) = \sum_{b_i \in \mathcal{B}} \sum_{f_k \in R(i,j)} I(k,j) \tag{2}
$$

where  $\mathcal F$  and  $\mathcal B$  are the set of flip-flops and the set of sink clock buffers, respectively.

Based on *Eq*.(2), we extract a current profile of each  $R(i, j)$  and perform the superposition of current profiles. Fig. 7 illustrates our procedure of deriving the current profiles on power junctions.

# *C. Formulation to Instance of MOSP Problem*

For a target power grid point  $p_i$  and the sets of  $R(i)$  in which their flip-flop types in *R* have not been determined, we want to find a mapping of  $R(i)$  to flip-flop types that minimizes the peak current on  $p_i$ . For example, suppose we have three flip-flop groups  $R(1)$ ,  $R(2)$ , and  $R(3)$  on  $p_i$  and their flip-flop types are not determined yet. Further, we suppose that the mapping candidates without violating the clock skew bound constraint have been extracted, as shown in Table II. We call such mapping candidates *feasible mappings*. For example, mapping candidate 1 for  $R(1)$  indicates that the initial driving buffer is converted to an inverter with size of X8 and its driven flip-flops to  $FF^{-}_{1inv}$ . Note that a driving buffer may be converted to an inverter according to the assignment of flip-flop type to its driven set of flip-flops.

TABLE II AN ILLUSTRATION OF FEASIBLE MAPPINGS FOR THREE SETS OF FLIP-FLOPS, EACH OF WHICH IS DRIVEN BY BUFFERS  $b_1$ ,  $b_2$ , AND  $b_3$  IN THE INITIAL CLOCK TREE.

|      | Driving BUF/INV sizing + Driven FF type |                         |                        |  |  |  |  |  |  |  |  |  |
|------|-----------------------------------------|-------------------------|------------------------|--|--|--|--|--|--|--|--|--|
|      | mapping candidate 1                     | mapping candidate $2$   | mapping candidate $3$  |  |  |  |  |  |  |  |  |  |
| R(1) | $INV_$ <sub>X8</sub> + $FF_{1inv}^-$    | $INV\_X4 + FF_{2inv}^-$ | BUF_X4 + $FF_{2inv}^+$ |  |  |  |  |  |  |  |  |  |
| R(2) | $INV\_X8 + FF_{1inv}^-$                 | $BUF\_X4 + FF1inv+$     |                        |  |  |  |  |  |  |  |  |  |
| R(3) | BUF_X4 + $FF_{1inv}^+$                  | BUF_X4 + $FF_{2inv}^-$  |                        |  |  |  |  |  |  |  |  |  |

Then, we construct an *L*-layered network *G*(*V*,*A*) for the sets, say  $R(1), \dots, R(L)$ , of flip-flops that are directly driven by *L* clock buffers. Each vertex in *V* indicates a distinct mapping candidate and all vertices corresponding to set  $R(i)$ ,  $i = 1, 2, \dots, L$  are arranged in the *i*-th layer in *G*. Then, for every pair of vertices between two consecutive layers in *G*,



Fig. 8. Conversion to a network graph  $G(V, A)$  for the mapping candidates in Table II. The peak current minimization problem is then translated to find a solution for an instance of multi-objective shortest path problem.



Fig. 9. An illustration of sampling on current waves in (a) power line and (b) ground line. The maximum value of each range will be chosen. The sampling step size determines the number of sampling slots, i.e., the value of *s*.

we create an arc  $(v_i, v_j) \in A$ . Finally, we add two dummy vertices called *src* and *dest*, placing at the top and the other at the bottom of the *L*-layered network *G*. Fig. 8 show the graph  $G(V, A)$  corresponding to the mapping candidates in Table II. The weight of arc  $(v_i, v_j)$  is a vector of size *s* whose elements indicate the current values at the *s* sampling slots on the current profile caused by the mapping corresponding to  $v_j$ . An illustrative example of sampling on current waves is shown Fig. 9. The number of sampling slots is determined by designers. Using more sampling slots would measure more value of peak current at the expense of computation time. Then, we want to find, among all possible paths from *src* to *dest* in *G*, a path that minimizes the largest value among the elements in the element-wise vector sum on the path. This problem is referred to as the multi-objective shortest path (MOSP) problem. (If  $s = 1$ , it becomes the ordinary shortest path problem.) The lines with blue color indicate the minimum peak current path. Its vector sum, for example  $\langle 30, 12 \rangle$ , means that peak current of 30*uA* is the least height that can be achieved by the feasible mappings. The resulting mapping are INV\_X8 +  $FF^{-}_{1inv}$ , BUF\_X4 +  $FF^{+}_{1inv}$ , and BUF\_X4 +  $FF^+_{\text{1inv}}$  for  $R(1)$ ,  $R(2)$ , and  $R(3)$ , respectively. In practice, for large values of *s* and *L*, finding an exact or bounded solution considering all power grid points is a time consuming process.

#### TABLE III

CURRENT PEAK DATA OF ALL THE GRID POINTS ON CIRCUIT S1423 WITH 3X3 POWER/GROUND LINES WHEN VDD=0.95V IS APPLIED. THE LAST COLUMN INDICATES THE SETS OF FLIP-FLOPS WHOSE CURRENT SOURCE COME FROM THE CORRESPONDING GRID POINTS.

| grid point | power (uA) | ground $(uA)$ | $I_{peak}(-,-)$ (uA) | $R(\cdot)$ |
|------------|------------|---------------|----------------------|------------|
| (0, 0)     | 156.8      | 214.4         | 214.4                | R(1)       |
| (0, 1)     | 450.9      | 497.1         | 497.1                | R(1)       |
| (0, 2)     | 140.5      | 213.7         | 213.7                | R(5)       |
| (0, 3)     | 212.2      | 321.3         | 321.3                | R(0), R(5) |
| (1, 0)     | 513.2      | 575.0         | 575                  | R(4)       |
| (1, 1)     | 187.0      | 305.8         | 305.8                | R(1), R(4) |
| (1, 2)     | 53.5       | 71.4          | 71.4                 | R(5)       |
| (1, 3)     | 293.0      | 273.0         | 293                  | R(0), R(5) |
| (2, 0)     | 465.4      | 709.9         | 709.9                | R(4), R(6) |
| (2, 1)     | 48.1       | 70.9          | 70.9                 | R(6)       |
| (2, 2)     | 463.8      | 656.4         | 656.4                | R(2)       |
| (2, 3)     | 242.8      | 391.6         | 391.6                | R(2), R(3) |
| (3, 0)     | 144.2      | 213.1         | 213.1                | R(6)       |
| (3, 1)     | 592.1      | 667.9         | 667.9                | R(6)       |
| (3, 2)     | 105.6      | 165.4         | 165.4                | R(2), R(3) |
| (3, 3)     | 547.5      | 757.4         | 757.4                | R(3)       |

## *D. Selecting Target Power Grid Points*

To speedup the mapping process of flip-flops, we employ a greedy approach for the selection of a power grid points on which we want to minimize the peak current.

**Definition 2** ( $I_{peak}(i)$ ,  $N_{tot}(i)$ ,  $N_{self}(i)$ ) : Let  $p_i$  be a power grid point.  $I_{peak}(p_i)$  is defined to be the value of peak (i.e., max.) current on  $p_i$  extracted on an initial clock tree with flipflops,  $N_{tot}(p_i)$  is defined to be the total number of buffers such that *some* of their driven flip-flops pull current through *p<sup>i</sup>* , and  $N_{self}(p_i)$  (<  $N_{tot}(p_i)$ ) be the number of buffers such that *all* the driven flip-flops pull current *entirely* through *p<sup>i</sup>* .

For example, Table III shows the values of  $I_{peak}(\cdot)$  extracted by HSPICE simulation for circuit s1423 in ISCAS'89. In addition, from the last column in Table III the values of  $N_{tot}(p_i)$  and  $N_{self}(p_i)$  for each grid point  $p_i$  are computed.

Given the values of  $I_{peak}(\cdot)$ ,  $N_{tot}(\cdot)$ , and  $N_{self}(\cdot)$  for all power grid points, we select the grid point  $(p_i)$  that minimizes the quantity of:

$$
C(p_i) = w \cdot I_{peak}(p_i) + (1 - w) \cdot \frac{N_{self}(pi)}{N_{tot}(p_i)}
$$
(3)

where  $w$  ( $0 \le w \le 1$ ) is a weighting factor.

The first term in *Eq*.(3) prefers, as next target, the power grid point that is very likely to expose highest peak current while the second term in  $Eq.(3)$  ensures that our mapping assignment is effective in lowering down the peak current.

Once the mapping assignment for a target grid point is done, we update the current profiles on all the grid points that have not been processed. Then, the selection process repeats until there is no grid point with unmapped flip-flops.

## *E. Consideration of Reducing Power Consumption*

Our clock boundary optimization methodology can be extended to support minimizing power consumption under current noise constraint, described in Fig. 10. Initially we start from the result of peak noise minimal-mapping produced



Fig. 10. Flow of post power minimization.

by our BOUNDARYNOISEMIN. Then, at each iteration, we attempt to remap a set of flip-flops as long as it results in reducing power consumption while still meeting timing and current noise constraints. Note that depending on the relative importance of noise, power, and timing, the trade-off between them can be explored by restructuring the flow diagram.

## VI. EXPERIMENTAL RESULTS

Our proposed algorithm called BOUNDARYNOISEMIN was implemented in C++ and Python language on a Linux machine with i5-4670 CPU and 8GB RAM. ISCAS'89 benchmark circuits were synthesized with Synopsys Design compiler and Synopsys IC compiler using Nangate 45nm Open Cell Library [13]. Initially, clock trees were synthesized by IC compiler with the slew and skew constraints of 100*ps*. After extracting RC information of routed clock tree, HSPICE simulation is conducted with a power/ground mesh.

We used power delivery network (PDN) models modelled as an RL network to clearly observe power/ground noise by simultaneous switching. Each RL segment of the on-chip power/ground mesh has parameters of  $R = 0.21 \Omega /$ *um* and  $L = 0.5 fH/m$ . Each cell on a clock tree is connected to the closest grid point of power/ground mesh.

TABLE IV OFF-CHIP PDN PARAMETERS FOR HSPICE SIMULATION

| $R_{s,pcb}$ | $0.094$ $m\Omega$ | $R_{s, pcb}$  | $0.166$ $m\Omega$       |
|-------------|-------------------|---------------|-------------------------|
| $L_{s,pcb}$ | 21pH              | $L_{s,pcb}$   | $0 \text{ }\mathrm{pH}$ |
| $R_{s,pkg}$ | 1 $m\Omega$       | $C_{s,pcb}$   | 240 $m\Omega$           |
| $L_{s,pkg}$ | 120pH             | $R_{p,pkg}$   | $0.54 \; m\Omega$       |
| $R_{bump}$  | $20 \; m\Omega$   | $R_{bump}$    | 5.61 $pH$               |
| $L_{bump}$  | 30pH              | $C_{\it pkg}$ | $26 \mu F$              |

The off-chip PDN is also modeled with the model and parameters in [14], for which the parameters are summarized in Table IV. The off-chip supply voltage is transferred to the on-chip power/ground mesh through an RLC circuit. Each of four power/ground bumps of off-chip PDN is connected to a corner on the on-chip power/ground mesh.

The input clock signal has 30*ps* of slew and frequency of 500MHz. Clock skew and slew constraints are both 100*ps*. To measure the current peak in the PDN, HSPICE simulation

TABLE V

COMPARISON OF PEAK CURRENT AND POWER/GROUND NOISE FOR INITIAL CIRCUITS AND ONES PRODUCED BY [10] AND OUR BOUNDARYNOISEMIN.

|                    | Base                      |                  |           | 10] (Polarity assignment only) |              |            |           | Ours (Buffer and FF co-optimization) |              |            |          |                | Improv. over Base (%) |              |            |            | $[10]$ $(\%)$<br>improv. over |              |            |          |            |
|--------------------|---------------------------|------------------|-----------|--------------------------------|--------------|------------|-----------|--------------------------------------|--------------|------------|----------|----------------|-----------------------|--------------|------------|------------|-------------------------------|--------------|------------|----------|------------|
| Circuits           | Noise $(mV)$<br>Peak (uA) |                  | Peak (uA) |                                | Noise $(mV)$ |            | Peak (uA) |                                      | Noise $(mV)$ |            | Time     | Peak $(\%$     |                       | Noise $(\%)$ |            | Peak $(\%$ |                               | Noise $(\%)$ |            |          |            |
|                    | vdd                       | <b>VSS</b>       | vdd       | <b>VSS</b>                     | vda          | <b>VSS</b> | vdd       | <b>VSS</b>                           |              | <b>VSS</b> | vdd      | <b>VSS</b>     | (sec                  | vdd          | <b>VSS</b> | vda        | <b>VSS</b>                    | vda          | <b>VSS</b> | vdd      | <b>VSS</b> |
| s1423              | 672                       | 790              | 30.28     | 31.21                          | 56.          | 717        | 12.4      | 12.99                                |              | 701        | 7.49     | 9.35           | 13.3                  | $+33.78$     |            | $+75.25$   | $+70.03$                      | $+20.68$     | $+2.23$    | $+39.61$ | $+27.99$   |
| s15850             | 899                       | 1168             | 40.7      | 42.53                          | 820          | 942        | 31.82     | 34.82                                | 820          | 942        | 31.82    | 34.82          | 20.0                  | $+8.79$      | $+19.35$   | $+21.84$   | $+18.13$                      | 0.00         | 0.00       | 0.00     | 0.00       |
| s5378              | 238                       | 1907             | 27.10     | 137.30                         | 1229         | 1827       | 138.00    | 149.70                               | 1212         | 1685       | 95.25    | 101.60         | 47.5                  | $+2.10$      | $-11.64$   | $-25.06$   | $+26.00$                      | $+1.38$      | $+7.77$    | $+30.98$ | $+32.13$   |
| s13207             | 2043                      | 2837             | 77.54     | 81.45                          | 1749         | 2610       | 42.74     | 45.48                                | 433          | 2405       | 31.16    | $44.0^{\circ}$ | 189.0                 | $+29.86$     | $+15.23$   | $+59.81$   | $+45.89$                      | $+18.07$     | $+7.85$    | $-27.09$ | $+3.10$    |
| s38584             | 3902                      | 6545             | 64.04     | 113.20                         | 3619         | 6406       | 69.13     | 17.20                                | 3633         | 6395       | 60.32    | 107.00         | 2597                  | $+6.89$      | $+2.29$    | $+5.81$    | $+5.48$                       | $-0.39$      | $+0.17$    | $+12.74$ | $+8.70$    |
| s3841 <sup>-</sup> | 5456                      | 752 <sub>1</sub> | 90.12     | 167.50                         | 4744         | 7597       | 83.16     | 157.40                               | 3863         | 742.       | 72.11    | 139.70         | 487                   | $+29.20$     | $+1.33$    | $+19.98$   | $+16.60$                      | $+18.57$     | $+2.32$    | 13.29    | $+11.25$   |
| s35932             | 5288                      | 7236             | 84.35     | 151.60                         | 4719         | 6927       | 76.28     | 141.70                               | 3904         | 6919       | 76.93    | 133.90         | 3798                  | $+26.17$     | $+4.38$    | $+8.80$    | $+11.68$                      | $+17.27$     | $+0.12$    | $-0.85$  | $+5.50$    |
| Average            |                           |                  |           |                                |              |            |           |                                      | $+19.54$     | $+9.36$    | $+30.94$ | $+27.69$       | $+10.80$              | $+2.92$      | $+17.55$   | $+12.67$   |                               |              |            |          |            |

was executed on the ISCAS'89 benchmark circuits. After we get the new cell type mapping by BOUNDARYNOISEMIN so that the current peak is minimized, we replace the buffers and flip-flops in the initial circuit to the new cell types. Finally RC extraction and HSPICE simulation are performed again on the new circuit.

The simulation results are summarized in Table V. We attached 3x3 power and ground mesh to each benchmark circuit, where every cell is connected to its nearest junction of power/ground mesh. We measure peak current and power/ground noise on every junction in the mesh, and take maximum value of them. Each column of *Base*, *polarity assignment only* [10], *Ours* represents the initial clock tree, the results produced by applying the polarity assignment in [10], and the results by our BOUNDARYNOISEMIN, respectively. The *Peak* and *Noise* columns indicate the maximum peak current and maximum peak-to-peak voltage fluctuation appearing at the junctions in power/ground mesh, respectively.

The highest current peak occurs at the ground line in every case, hence voltage noise on *V<sub>DD</sub>* is larger than that on *VSS*. As shown in Table V, our proposed algorithm BOUNDARYNOISEMIN outperforms the polarity assignment method in [10] where BOUNDARYNOISEMIN reduces peak current and power/ground noise by 9.36%∼19.54% and 27.69%∼30.94%, respectively, over the initial clock tree, and by 2.92%∼10.80% and 12.67%∼17.55%, respectively over the work in [10]. Overall, our proposed mapping solution consistently reduces the peak current and power/ground noise under the clock skew and slew constraints over the design optimized by [10] as well as the initial designs. Fig. 11 shows the current maps before and after the application of BOUNDARYNOISEMIN to circuit s1423.

Acknowledgment: This work was supported by NRF grant funded by MSIP (2015R1A2A2A01004178), Korea.

## VII. CONCLUSION

This work proposed an algorithm for clock tree boundary optimization with the objective of minimizing peak current. The key enabler was exploiting the four types of flip-flop mapping at the clock tree boundary. We formulated the mapping problem of minimizing peak current into a multiobjective shortest path problem and solved it efficiently using an approximation algorithm. Through testing benchmark circuits, it was shown that our algorithm was able to reduce the peak current by 27.7%∼30.9%. In addition we suggested



Fig. 11. Peak current maps (*uA*) of circuit s1423, before and after the application BOUNDARYNOISEMIN.

an extended design flow of integrating the peak current noise minimization with the clock power minimization.

#### **REFERENCES**

- [1] L. H. Chen, M. Marek-Sadowska, and F. Brewer, "Buffer delay change in the presence of power and ground noise," *IEEE TVLSI*, no. 3, 2003.
- [2] P. Vuillod, L. Benini, A. Bogliolo, and G. De Micheli, "Clock skew optimization for peak current reduction," in *ISLPED*. IEEE Press, 1996.
- [3] A. Vittal, H. Ha, F. Brewer, and M. Marek-Sadowska, "Clock skew optimization for ground bounce control," in *ICCAD*, 1996.
- [4] W.-C. Lam, C.-K. Koh, and C.-W. Tsao, "Power supply noise suppression via clock skew scheduling," in *ISQED*, 2002.
- [5] S.-H. Huang, C.-M. Chang, and Y.-T. Nieh, "Fast multi-domain clock skew scheduling for peak current reduction," in *ASP-DAC*, 2006.
- [6] A. Vijayakumar, V. C. Patil, and S. Kundu, "An efficient method for clock skew scheduling to reduce peak current," in *VLSID*, 2016.
- [7] Y.-T. Nieh, S.-H. Huang, and S.-Y. Hsu, "Minimizing peak current via opposite-phase clock tree," in *DAC*, 2005.
- [8] R. Samanta, G. Venkataraman, and J. Hu, "Clock buffer polarity assignment for power noise reduction," *TVLSI*, no. 6, 2009.
- [9] H. Jang, D. Joo, and T. Kim, "Buffer sizing and polarity assignment in clock tree synthesis for power/ground noise minimization," *TCAD*, no. 1, 2011.
- [10] D. Joo and T. Kim, "A fine-grained clock buffer polarity assignment for high-speed and low-power digital systems," *TCAD*, no. 3, 2014.
- [11] J. Kim and T. Kim, "Boundary optimization of buffered clock trees for low power," *Integration, the VLSI Journal*, 2017.
- [12] A. Warburton, "Approximation of pareto optima in multiple-objective, shortest-path problems," *Oper. Res.*, 1987.
- [13] NANGATE. (2011) Nangate 45nm Open Cell Library. http://www.nangate.com/.
- [14] M. S. Gupta, J. L. Oatley, R. Joseph, G. Y. Wei, and D. M. Brooks, "Understanding voltage variations in chip multiprocessors using a distributed power-delivery network," in *DATE*, 2007.