# **Ultra-Low Swing CMOS Transceiver for 2.5-D Integrated Systems**

Przemyslaw Mroszczyk and Vasilis F. Pavlidis Advanced Processor Technologies Group School of Computer Science, The University of Manchester, UK E-mail: przemyslaw.mroszczyk@manchester.ac.uk, vasileios.pavlidis@manchester.ac.uk

# Abstract

This paper presents the design of a low swing transceiver for chip-to-chip communication in 2.5-D integrated systems using a passive interposer. High speed and low power operation is achieved through a new dynamic low swing tunable transmitter (DLST-TX) and inverter-based tunable receiver (INVT-RX) circuits. The novelty of the proposed solution lies in the digital trimming for PVT corners and random parameter variability allowing significant reduction of the voltage swing down to 120 mV with single ended signaling. The compensation method has negligible impact on the circuit performance and silicon area, not typically achievable by device geometry scaling. The proof-of-concept transceiver is implemented in a 65 nm CMOS technology and exhibits up to 4× higher energy efficiency at 1 Gb/s speed for 2.5 mm long chip-to-chip interconnect, as compared to state-of-the-art full swing communication schemes operating under the same conditions. The transceiver is suitable for parallel interfaces in 2.5-D integrated systems.

# Keywords

Low swing, mismatch cancellation, I/O design, passive interposer, 2.5-D integration, digital trimming

## 1. Introduction

An ongoing pursuit for even higher levels of integration in modern CMOS technologies combined with the idea of vertical stacking allows more complex systems with a smaller silicon footprint. Although the power of the logic blocks tends to decrease with the technology feature size, intra- and inter-chip communication becomes increasingly more energy demanding. This adverse impact of scaling on the communication links stems from a continuous decrease in width and the distance between the metal wires while their length does not always reduce since the physical size of a system typically does not decrease [1]. As a result, the higher resistance and capacitance of the metal wires contribute towards increased signal attenuation, crosstalk, and latency in long interconnects requiring stronger, larger in size, and more power consuming drivers. This requirement becomes even more critical in 2.5- and 3-D integrated systems with bump bonding or Through Silicon Via (TSV) interfaces contributing large capacitive loads from additional electrostatic discharge (ESD) protection circuits and due to increased parasitic coupling to the substrate. Such interfaces typically do not scale with the technology feature size since they have to provide sufficient reliability for electrical, mechanical, and thermal stresses during the manufacturing and packaging processes [2]. As a result, these interfaces can limit system performance hindering full exploitation of the small geometry potential and vertical integration. Therefore, in such integration schemes, energy efficient inter-chip communication often becomes a primary objective towards low power system design.

One of the most effective techniques with significant power savings in wireline communication is to reduce the voltage swing and, hence, the energy required for digital signal transmission over a capacitive (long) interconnect. The design of such interfaces is not straightforward and requires circuits capable of coping with voltage levels in between the strict digital "zero" and "one" [3].

The vast majority of the solutions in the literature implement the required level conversion using an additional power supply and multi-threshold voltage devices [4]. Several design approaches for single supply low swing transmitters are reported in [5]-[7], inherently relying on the threshold voltage of a MOS transistor as a reference in low swing generation. These solutions, however, are efficient mainly in older technology nodes, where the threshold voltage of a MOS transistor, and hence the generated low signal swing, is almost an order of magnitude lower than the core supply voltage, enabling significant energy savings [8].

The last group of transmitters employs dynamic, self-timed circuits generating a low voltage swing as a result of charge injection to the capacitive load [9]-[11]. These circuits exhibit low complexity and high power efficiency. They operate with a single supply voltage and do not require an external voltage reference. Despite these advantages, such circuits are very sensitive to process and environmental parameter variations and may suffer from output DC level drift if no charge leakage prevention is applied [12].

The majority of the receivers are based on decision circuits with cross-coupled pairs capable of restoring the low swing signal to the nominal logic levels; however, they usually require differential signaling or a reference signal [12]. Another group of receivers employs a CMOS inverter as an amplifier with the switching threshold adjusted to the middle of the signal swing. Such adjustment is usually done either by geometry or supply voltage scaling [10]. Inverter-based receivers exhibit simplicity and very high speed of operation but require the DC component of the input signal to be very near their switching threshold [13].

This paper presents a low swing transceiver for chip-to-chip communication in 2.5-D integrated systems. High speed and low power operation is achieved by employing the proposed dynamic low swing tunable transmitter (DLST-TX) and inverter-based tunable receiver (INVT-RX) circuits. The novelty of the proposed solution lies in the implementation of digital trimming to compensate for process parameter variability, allowing significant reduction of the voltage swing and, consequently, the energy per bit with single ended signaling.

The remainder of the paper is organized as follows. The interconnect model and test circuit are described in Section 2. The transmitter and receiver designs are presented in Section 3 and Section 4, respectively. The trimming procedure is discussed in Section 5. The simulation results are provided in Section 6 and conclusions are drawn in Section 7.

## 2. Interconnect model and test architecture

The cross section of a typical inter-chip link based on 2.5-D integration is illustrated in Fig. 1. Two or more bare dies are bump bonded on top of an interposer that electrically connects these dies and also provides mechanical support. The DLST-TX module on chip A drives the interconnect with the signal swing reduced to about 120 mV. This low swing signal is detected by the INVT-RX module on chip B and restored back to the nominal voltage levels.

The schematic diagram of the transceiver test circuit, consisting of the signal generator, transmitter, interconnect model, and receiver, is shown in Fig. 2a. The additional buffers on the input of the transmitter and the output of the receiver are added to model the driving strength and load of the core logic circuits. The model of the interconnect including the ESD protection circuits, microbumps, and the wire is depicted in Fig. 2b. The parameters of the lumped $\pi$ model of the passive wire are evaluated individually for each wire length *L* based on the post-layout extracted *RC* models. Note that a distributed wire model only marginally improves the accuracy of the results given the short interconnect length and baseband operation mode.

One- and two-stage ESD protection circuits for the micro-bump bonding process are utilized for the transmitter and receiver, respectively. For the minimum 20 $\mu$m pad pitch, the size of the corresponding cell in an I/O array is limited to 20 $\mu$m × 20 $\mu$m, and about half of this area is occupied by the ESD protection circuit. The area available for the TX/RX module is therefore limited to about 10 $\mu$m × 20 $\mu$m. Both interconnect and transceiver are implemented and characterized in a 65 nm CMOS technology. The operation of the system is verified for the standard process corners (TT, FF, SS, SNFP, and FNSP) at the nominal conditions (V<sub>DD</sub> = 1.2 V, 25°C), and for the TT corner assuming +/-10% V<sub>DD</sub> and 25°C–75°C temperature variation.

## 3. Transmitter

This section presents the design of the proposed dynamic low swing transmitter and explains its operation. The schematic diagram of the transmitter core circuit and the corresponding signal waveforms are shown in Fig. 3a and Fig. 3b, respectively. In the steady state, the output of the NAND gate (PG) remains high, the output of the NOR gate (NG) remains low, and the output transistors $M_{PB}$ and $M_{NB}$ are turned off. When a low-to-high or high-to-low transition occurs on the input TXIN, the corresponding rising or falling slope propagates through the delay line and changes the state of the internal signal TXIN accordingly after the propagation time TD. As a result, for the rising edge of TXIN, the NAND gate generates a negative pulse (i.e. 1-0-1 transition) switching M<sub>PB</sub> on for the period of TD. Similarly, for the falling edge of TXIN, the NOR gate generates a positive pulse (i.e. 0-1-0 transition) switching M<sub>NB</sub> on for the period of TD. In practice, the delay time TD is very short (~0.1 ns) while the load capacitance of the interconnect is usually high (~0.3 pF for a 500 µm long wire with ESD and micro bumps). Therefore, the output buffer only pre-charges or discharges the capacitance of the channel within a small voltage range +/-$\Delta$V around a certain constant level V<sub>DC</sub>. Note that for V<sub>DC</sub> $\approx$ V<sub>DD</sub>/2, both transistors exhibit the highest driving strength and the DC level in the channel is near the switching threshold of the inverter-based receiver.

Figure 1: Cross section of the considered 2.5-D integrated system.

Figure 2a: Transceiver test circuit.

Figure 2b: Interconnect model.
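The pre-charge step described above can be approximated with the ideal-capacitor relation ΔV = I·TD/C. The following is only a first-order sketch: it assumes the output transistor acts as a constant current source during the pulse, the function names are hypothetical, and it combines the ~0.1 ns delay and ~0.3 pF load quoted above with the 120 mV swing stated in the abstract purely for illustration.

```python
def voltage_step(i_avg_a: float, t_pulse_s: float, c_load_f: float) -> float:
    """Voltage change of an ideal capacitor driven by a constant current."""
    return i_avg_a * t_pulse_s / c_load_f


def required_current(v_step_v: float, t_pulse_s: float, c_load_f: float) -> float:
    """Average current needed to move the load by v_step_v within one pulse."""
    return v_step_v * c_load_f / t_pulse_s


TD = 0.1e-9        # pulse width in seconds (~0.1 ns, from the text)
C_LOAD = 0.3e-12   # channel capacitance in farads (~0.3 pF, 500 um wire)
V_STEP = 0.120     # assumed full transition in volts (120 mV swing)

i_avg = required_current(V_STEP, TD, C_LOAD)
print(f"average drive current ~ {i_avg * 1e3:.2f} mA")
```

Under these assumptions the buffer needs an average current of a few hundred microamperes per transition, which indicates why a short pulse on a large capacitive load yields only a small voltage excursion.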

The DC level depends on the symmetry of the PG and NG pulses, and the large signal transconductance and leakage current of the  $M_{\rm NB}$  and  $M_{\rm PB}$  transistors. Since these parameters depend on the process and environmental factors, they cannot be precisely evaluated at the design stage. In practice, the DC level can vary significantly as a result of process corners, mismatch, and voltage and temperature variations.

In order to adjust the  $V_{DC}$  voltage, a trimming technique is implemented adding a set of small size transistors  $M_{NB1}$ - $M_{NB6}$  and  $M_{PB1}$ - $M_{PB6}$  in parallel with  $M_{NB}$  and  $M_{PB}$ , respectively, as shown in Fig. 4. These transistors can be selectively activated through switches  $M_{NSW1}$ - $M_{NSW6}$  and  $M_{PSW1}$ - $M_{PSW6}$  to equalize the strength of the output buffer pair and, hence, to adjust the  $V_{DC}$  voltage. The number of the trimming stages and the size of the additional transistors are chosen to ensure the trimming range of the  $V_{DC}$  voltage equals +/-200 mV. This range covers the fluctuations of  $V_{DC}$ voltages in PVT corners (+/-120 mV) and +/-3 $\sigma$  variability caused by mismatch in nominal conditions ( $\sigma_{VDC} \approx 68$  mV).



Figure 3: a) Transmitter core circuit, b) signal waveforms.

The +/-3 $\sigma$  coverage is chosen rather arbitrarily ensuring 99.7% of trimming success in statistical terms. The variability of  $\Delta V$  is much lower both for the mismatch ( $\sigma_{\Delta V} \approx 3.5 \text{ mV}$ ) and for the PVT corners (+/-3 mV). This smaller variability is due to the self-compensation effect in the core transmitter circuit. For example, for the "symmetric" SS corner, the output buffer drives the interconnect capacitance with a smaller current but for a longer time since the generated delay TD is longer. For "asymmetric" corners, e.g. FNSP, the reduced driving strength of M<sub>PB</sub> is compensated by a slightly longer negative PG pulse since the pull-down network in the NAND gate is stronger while the pull-up network is weaker.
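The 99.7% figure follows from the Gaussian two-sided coverage at three standard deviations. A minimal check, assuming the mismatch-induced deviation of $V_{DC}$ is zero-mean and normally distributed and ignoring any simultaneous PVT shift (both are assumptions of this sketch, and the function name is hypothetical):

```python
import math


def two_sided_coverage(range_mv: float, sigma_mv: float) -> float:
    """Probability that a zero-mean Gaussian deviation stays within +/-range_mv."""
    return math.erf(range_mv / (sigma_mv * math.sqrt(2.0)))


SIGMA_VDC_MV = 68.0    # mismatch sigma of V_DC quoted in the text
TRIM_RANGE_MV = 200.0  # trimming range quoted in the text

print(f"+/-3 sigma span: {3 * SIGMA_VDC_MV:.0f} mV")
print(f"coverage at 3 sigma: {two_sided_coverage(3 * SIGMA_VDC_MV, SIGMA_VDC_MV):.4f}")
print(f"coverage of +/-200 mV: {two_sided_coverage(TRIM_RANGE_MV, SIGMA_VDC_MV):.4f}")
```

With σ ≈ 68 mV, the ±3σ span is about 204 mV, so the ±200 mV trimming range sits just inside the three-sigma point.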

An additional weak keeper circuit, shown in Fig. 5, is proposed to compensate for the charge leakage and, hence, to prevent data corruption when both  $M_{NB}$  and  $M_{PB}$  are in the off state (e.g. when a long sequence of zeros or ones is transmitted, or when a low speed transfer is required for power savings). During normal operation (ENREF = 0), for TXIN = 0, the output TXKEEP connects to  $V_{DC} - \Delta V$  tap of the resistor string through  $M_{NT}$  while for TXIN = 1, it connects to  $V_{DC} + \Delta V$  tap through the  $M_{PT}$  transistor. The  $\Delta V$  voltage is adapted through tuning of the TD delay time whereas the  $V_{DC}$  voltage can be adjusted through  $M_{NB}$  and  $M_{PB}$  scaling. The transmitter circuit design (Fig. 4) ensures  $V_{DC} \approx 600$  mV and  $\Delta V \approx 60$  mV without the weak keeper buffer at nominal conditions with 1 mm long interconnect.

The resistors RP1–RP4 are determined assuming that the midpoint of the string is at  $V_{DD}/2$  and the voltage drop on RP2 and RP3 is roughly 60 mV, corresponding to the +/-60 mV low swing signal generated by the core transmitter circuit. The DC current of the resistor string is about 12  $\mu$ A. Note that the weak keeper buffer with fixed resistors is sufficient since the variability of  $\Delta V$  is negligible while the level of  $V_{DC}$  is adjusted by trimming.
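The resistor values implied by these figures can be estimated with Ohm's law. The sketch below assumes a symmetric string in which RP2 and RP3 sit on either side of the midpoint and RP1 and RP4 connect to the supply rails; that ordering, and the helper name, are assumptions rather than details given in the text.

```python
V_DD = 1.2          # supply voltage in volts
I_STRING = 12e-6    # DC current of the resistor string in amperes
V_TAP_STEP = 0.060  # voltage drop across RP2 and across RP3 in volts


def resistor_string(v_dd: float, i_string: float, v_step: float) -> dict:
    """Resistances of a symmetric four-resistor string with taps at V_DD/2 +/- v_step."""
    r_inner = v_step / i_string                # RP2, RP3: carry the tap offset
    r_outer = (v_dd / 2 - v_step) / i_string   # RP1, RP4: rest of each half
    return {"RP1": r_outer, "RP2": r_inner, "RP3": r_inner, "RP4": r_outer}


values = resistor_string(V_DD, I_STRING, V_TAP_STEP)
for name, ohms in values.items():
    print(f"{name} ~ {ohms / 1e3:.0f} kOhm")
print(f"static power ~ {V_DD * I_STRING * 1e6:.1f} uW")
```

Under these assumptions the string totals about 100 kΩ, with roughly 5 kΩ for each inner resistor and 45 kΩ for each outer one.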

As a result, one resistor string can be shared among several transmitter circuits reducing the area and power of a parallel interface. For ENREF = 1, the output of the transmitter is connected to the mid tap point of the resistor string  $V_{DD}/2$  used as reference for the receiver during the trimming procedure (see Section 5). The DLST-TX circuit (excluding the resistor string) occupies 48  $\mu$ m<sup>2</sup> (95  $\mu$ m<sup>2</sup> with the resistor string), which is only ~12% (24%) of the total I/O cell area.

## 4. Receiver

This section describes the proposed inverter-based receiver circuit INVT-RX. The schematic diagram of the receiver front-end amplifier is shown in Fig. 6a. The receiver consists of an inverter ($M_{NR}$ and $M_{PR}$) and a set of additional transistors in the pull-up and pull-down network connected in parallel through individual switches. These additional transistors are used to equalize the strength of the pull-up and pull-down network to trim the switching threshold. The switching threshold $V_{ST}$ is defined here as the crossover point between the input and output voltage on a DC transfer characteristic.

Figure 5: Weak keeper buffer circuit.

In order to trim the switching threshold $V_{ST}$ to a given voltage (e.g. $V_{DD}/2$), a corresponding reference voltage VREF = $V_{DD}/2$ has to be applied to the input of the inverter. The trimming starts with all the n-MOS switches (SWN1-SWN6) on and all the p-MOS switches (SWP1-SWP6) off, corresponding to the switch code 0 (see Fig. 6b). The n-MOS switches are then switched off one by one, gradually decreasing the strength of the pull-down network to its minimum for the code 6. Next, the p-MOS switches are switched on one by one, gradually increasing the strength of the pull-up network to its maximum for the code 12.

Such a tuning approach allows an intrinsically monotonic sweep of the switching threshold voltage. During this sweep, the output state transitions from 0 to 1, denoting that the circuit crosses the equilibrium point where the switching threshold is equal to VREF. In the example shown in Fig. 6b, the output state transitions from 0 to 1 for the input reference which equals  $V_{DD}/2$  for the code 6 (i.e. in the middle of the tuning range). Note that VREF has to be within the tuning range of the circuit for the trimming process to succeed.
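The code-to-switch mapping and the sweep can be sketched behaviourally as follows. The mapping follows the description above; `toy_receiver` is a hypothetical stand-in for the analog front end, with an invented linear threshold model and invented millivolt values chosen only so that the example trips at code 6.

```python
from typing import Callable, Optional, Tuple

N_STAGES = 6  # trimming stages per network (SWN1-SWN6, SWP1-SWP6)


def switch_counts(code: int) -> Tuple[int, int]:
    """Return (n-MOS switches on, p-MOS switches on) for a trim code 0..12."""
    if not 0 <= code <= 2 * N_STAGES:
        raise ValueError("code out of range")
    n_on = max(N_STAGES - code, 0)   # codes 0..6 turn the n-MOS switches off
    p_on = max(code - N_STAGES, 0)   # codes 6..12 turn the p-MOS switches on
    return n_on, p_on


def find_trip_code(output_is_high: Callable[[int], bool]) -> Optional[int]:
    """Sweep the codes upward and return the first one whose output reads 1."""
    for code in range(2 * N_STAGES + 1):
        if output_is_high(code):
            return code
    return None  # VREF lies outside the tuning range


def toy_receiver(code: int, vref_mv: float = 600.0,
                 vst_code0_mv: float = 545.0, step_mv: float = 10.0) -> bool:
    """Hypothetical model: V_ST rises linearly with the code (values invented)."""
    v_st = vst_code0_mv + step_mv * code
    return v_st > vref_mv  # inverter output is high once V_ST exceeds VREF


print(switch_counts(0), switch_counts(6), switch_counts(12))
print("trip code:", find_trip_code(toy_receiver))
```

Because each step either weakens the pull-down or strengthens the pull-up network, the threshold moves in one direction only, so the first code that reads 1 marks the crossing.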

The schematic diagram of the complete INVT-RX module is presented in Fig. 7. The receiver is composed of the front-end amplifier with six additional trimming stages and an output buffer logically inverting the received signal and restoring its swing to the nominal rail-to-rail range. The number of the trimming stages and the size of the additional transistors in the front-end amplifier are chosen to ensure the tuning range of the amplifier is wide enough to compensate for the PVT corners and mismatch variation of the $V_{ST}$. In order to ensure the correct operation of the front-end amplifier, the variability of the switching threshold should be reduced significantly below the magnitude of the input signal swing $\Delta V$ while the $V_{ST}$ voltage should be set close to $V_{DC}$. With trimming, the random variability of $V_{ST}$ can be reduced by a factor of 4, from $\sigma_{VST} \approx 15$ mV down to 3.6 mV. Based on the simulation results, the proposed trimming technique allows correct operation of the front-end amplifier at the nominal speed of 1 Gb/s with the input low swing signals within the range of +/-60 mV.

Figure 6: a) Front-end receiver amplifier, b) switching threshold trimming process.

Figure 7: INVT-RX module.

The drawback of the inverter-based receiver is the DC current of the front-end stage resulting from the input bias $V_{DC}$. In the proposed INVT-RX module, the DC supply current varies between 5 $\mu$A and 25 $\mu$A depending on the process corner, making it the major contributor to the static power dissipation of the transceiver (the second biggest contributor is the resistor string in the transmitter, drawing ~12 $\mu$A). Solutions aiming to further reduce this current can be found in the literature [5], [13] and are therefore not considered in this paper. The INVT-RX module occupies 26 $\mu$m<sup>2</sup>, which is about 7% of the total I/O cell area.
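The area figures quoted so far can be cross-checked against the 20 µm × 20 µm I/O cell from Section 2. This is simple bookkeeping with the numbers stated in the text; the variable and function names are hypothetical.

```python
CELL_SIDE_UM = 20.0
CELL_AREA_UM2 = CELL_SIDE_UM ** 2   # 400 um^2 I/O cell
AVAILABLE_UM2 = CELL_AREA_UM2 / 2   # about half is taken by the ESD circuit

BLOCKS_UM2 = {
    "DLST-TX core": 48.0,
    "DLST-TX with resistor string": 95.0,
    "INVT-RX": 26.0,
}


def cell_share(area_um2: float) -> float:
    """Area as a percentage of the whole I/O cell."""
    return 100.0 * area_um2 / CELL_AREA_UM2


for name, area in BLOCKS_UM2.items():
    print(f"{name}: {area:.0f} um^2 = {cell_share(area):.1f}% of the cell")

transceiver_um2 = BLOCKS_UM2["DLST-TX with resistor string"] + BLOCKS_UM2["INVT-RX"]
print(f"TX + RX: {transceiver_um2:.0f} um^2 = {cell_share(transceiver_um2):.1f}% "
      f"(available: {AVAILABLE_UM2:.0f} um^2)")
```

The transmitter with its resistor string and the receiver together stay well within the roughly 200 µm² left after the ESD protection.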

## 5. Transceiver trimming

This section describes the transceiver trimming algorithm and its implementation in digital hardware. The diagram representing the flow of the transceiver trimming is shown in Fig. 8. The variables SWRX and SWTX are 12-bit vectors directly controlling the switches of the RX and TX modules, respectively, such that the lower (least significant) half of the bit vector controls the n-MOS switches and the upper (most significant) half controls the p-MOS switches. For the monotonic sweep, SWRX and SWTX operate as shift registers with all the bits initially set to one. Zeros are shifted in from the right hand side at each iteration (i.e. first the n-MOS switches are deactivated, and then the p-MOS switches are activated by turning the gate voltage to zero).

First, the switching threshold $V_{ST}$ of the RX module is trimmed for VREF = $V_{DD}/2$ provided to the input of the receiver from the DLST-TX when ENREF = 1. As RXOUT transitions high, the trimming terminates and the state of the SWRX register is stored. The trimming of the transmitter follows a similar procedure with a monotonic sweep of V<sub>DC</sub> through SWTX. To verify the operation of the transmitter, each configuration of SWTX is checked based on the result of a bit error rate (BER) test performed on a pseudo-random bit sequence sent through the link. The comparison is done on a bit-to-bit basis assuming that a bit can be sent and received within one clock cycle. The bit corresponding to the current iteration number in the 13-bit long ERROR register is set (error detected) or reset (error-free), depending on the BER test result. Note that several bits in the ERROR register may be set to 0, meaning that there are several error-free SWTX configurations allowing operation according to the applied BER test.
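The shift-register encoding of SWRX and SWTX can be illustrated in a few lines. The sketch assumes that the right hand side is the least significant bit and that each bit drives one switch gate directly (n-MOS on for a 1, p-MOS on for a 0); the helper names are hypothetical and the code is not the VHDL controller described in the paper.

```python
WIDTH = 12
HALF = WIDTH // 2
MASK = (1 << WIDTH) - 1


def shift_in_zero(reg: int) -> int:
    """Shift a zero in from the right hand side (LSB), keeping 12 bits."""
    return (reg << 1) & MASK


def decode(reg: int) -> tuple:
    """Return (n-MOS switches on, p-MOS switches on) for a register value."""
    low = reg & ((1 << HALF) - 1)          # lower half drives the n-MOS gates
    high = reg >> HALF                     # upper half drives the p-MOS gates
    n_on = bin(low).count("1")             # n-MOS conducts with its gate at 1
    p_on = HALF - bin(high).count("1")     # p-MOS conducts with its gate at 0
    return n_on, p_on


reg = MASK  # all ones: every n-MOS switch on, every p-MOS switch off
states = []
for iteration in range(WIDTH + 1):
    states.append((iteration, format(reg, "012b"), decode(reg)))
    reg = shift_in_zero(reg)

for iteration, bits, (n_on, p_on) in states:
    print(f"{iteration:2d}  {bits}  n-MOS on: {n_on}  p-MOS on: {p_on}")
```

The 13 register states (iterations 0 to 12) correspond one-to-one to the 13 bits of the ERROR register.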

Since the sweep of SWTX is intrinsically monotonic, the zero bits denoting an error-free transmission should always be clustered together in the ERROR register. In the example case shown in Fig. 8, the ERROR register has a cluster of three zeros corresponding to the three possible SWTX configurations resulting in error-free transmission. In the evaluation of SWTX, the configuration from the 7<sup>th</sup> iteration is used since it is the farthest from other error prone configurations.
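Selecting the configuration farthest from the failing ones amounts to taking the centre of the run of zeros in the ERROR register. A minimal sketch, assuming the register is available as a list of 0/1 flags indexed by iteration, that the lower of two middle positions is taken for an even-length run, and that the longest run is used should more than one appear (all three are assumptions of this example, as is the function name):

```python
from typing import Optional, Sequence


def pick_configuration(error_bits: Sequence[int]) -> Optional[int]:
    """Return the index at the centre of the longest run of zeros, or None."""
    best_start, best_len = None, 0
    start = None
    # A trailing sentinel 1 closes a run that reaches the end of the register.
    for i, bit in enumerate(list(error_bits) + [1]):
        if bit == 0 and start is None:
            start = i
        elif bit != 0 and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    if best_start is None:
        return None  # no error-free configuration was found
    return best_start + (best_len - 1) // 2


error_register = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1]  # three error-free entries
print("selected index:", pick_configuration(error_register))
```

For a cluster of three zeros the middle entry is selected, which mirrors the choice made in the example of Fig. 8; whether iterations are counted from 0 or 1 in the figure is not stated, so the index convention here is illustrative.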

A dedicated digital controller realizing the trimming algorithm is implemented in VHDL and simulated in the mixed-signal environment with the transceiver circuit. In the simulations, the controller operates with a 1 GHz clock, allowing trimming at the nominal transmission speed of 1 Gb/s.



Figure 8: Flow diagram of the transceiver trimming process.

## 6. Simulation results

The simulation results showing the performance of the link and the trimming mechanism in the selected PVT corners for 1 mm long interconnect are listed in Table 1. The energy per bit is estimated based on 1,000 bit long sequences transmitted at 1 Gb/s speed after the trimming. The delay is measured between the 50% slope of the TX input signal and 50% slope of the RX output signal, individually for the rising and falling edge, and the average of the two measurements is reported. In the SS corner, there is only one successful configuration for the TX switches due to the degraded speed of the RX front-end amplifier in this corner. The delay of the link varies between 0.53 ns (FF corner) and 0.93 ns (SS corner). Across the process corners, the energy per bit varies between 55 fJ/bit (SS corner) and 81 fJ/bit (FF corner); over the +/-10% V<sub>DD</sub> range, it varies between 49 fJ/bit (V<sub>DD</sub> = 1.08 V) and 89 fJ/bit (V<sub>DD</sub> = 1.32 V).

Table 1: Simulation results for the PVT corners.

| Corner | Error-free TX configurations | Delay [ps] | Energy [fJ/bit] |
|--------|------------------------------|------------|-----------------|
| TT | 4/13 | 670 | 66 |
| FF | 7/13 | 530 | 81 |
| FNSP | 5/13 | 620 | 70 |
| SNFP | 5/13 | 650 | 70 |
| SS | 1/13 | 930 | 55 |
| TEMP 50°C<sup>1)</sup> | 4/13 | 675 | 67 |
| TEMP 75°C<sup>1)</sup> | 4/13 | 690 | 70 |
| V<sub>DD</sub> = 1.08 V<sup>1)</sup> | 2/13 | 900 | 49 |
| V<sub>DD</sub> = 1.32 V<sup>1)</sup> | 6/13 | 550 | 89 |

The simulation results showing the energy efficiency and the energy delay product (EDP) for the proposed low swing and the reference full swing transceiver in nominal conditions are presented in Fig. 9 and Fig. 10, respectively. The interconnect length ranges from 500  $\mu$ m to 2.5 mm. In the simulations, two modes of link operation are considered: CLOCK and DATA. In the CLOCK mode, the transceiver carries a 500 MHz clock signal while in the DATA mode the transceiver carries a pseudo-random bit sequence at 1 Gb/s speed. Although, in the literature, the CLOCK mode is typically preferred for circuit benchmarking [6], [7], [12], the DATA mode more accurately demonstrates the behavior of a parallel link since only one clock lane typically accompanies a wide parallel data bus.

Based on the most common approach in the literature, the performance gain is measured as the ratio of the energy per bit or EDP of the reference full swing (FS) and the low swing (LS) solution for the same interconnect. The energy and EDP ratios versus interconnect length are illustrated in Fig. 11. Note that the EDP ratio drops below 1 for interconnects shorter than ~300  $\mu$ m (DATA EDP trace extrapolation). This result means that the full swing link becomes more efficient than the low swing for short interconnects. Such a behavior is expected due to the energy overhead of the low swing circuit dominating below a certain "critical" channel load [10].

A summary of the implementation and performance figures of the full and low swing transceivers with 1 mm long interconnect operating at 1 Gb/s speed in DATA mode is reported in Table 2. The standard deviation figures refer to the random variability caused by the fabrication mismatch obtained from 100 Monte Carlo simulation runs. In such conditions, the low swing transceiver exhibits almost $2.5\times$ lower power and 23% smaller EDP as compared to the full swing solution. The performance figures grow up to $4\times$ and 60%, respectively, for 2.5 mm interconnect and 500 MHz CLOCK mode. The power overhead refers to the case where the TX module drives the RX module directly with no interconnect. In this case, the load of the transmitter is below the "critical" load and the power overheads dominate [10]. The major contributors to the idle current in the low swing solution are the resistor bias (~12 $\mu$A) and the front-end RX amplifier (~8 $\mu$A). The remaining 2-3 $\mu$A is the leakage current. Although the area of the low swing module is over 4× larger than the area of the full swing transceiver, it comprises only 30% of the total I/O cell area, not impeding practical realizations of high density parallel interfaces.

Figure 9: Energy vs. interconnect length.

Figure 10: EDP vs. interconnect length.

Figure 11: Performance gain vs. interconnect length.

Table 2: Performance comparison in DATA mode.

| Parameter | Full Swing | Low Swing |
|-----------|------------|-----------|
| Technology | 65 nm bulk CMOS | 65 nm bulk CMOS |
| Supply | 1.2 V | 1.2 V |
| Interconnect | 1 mm chip-to-chip over passive interposer | 1 mm chip-to-chip over passive interposer |
| Nominal speed | 1 Gb/s (DATA mode) | 1 Gb/s (DATA mode) |
| Voltage swing | 1.2 V | 126 mV ($\sigma$ = 7 mV) |
| Energy per bit | 148 fJ/bit | 60 fJ/bit |
| EDP | 53 fJ·ns | 41 fJ·ns |
| Delay | 0.36 ns ($\sigma$ = 5 ps) | 0.68 ns ($\sigma$ = 21 ps) |
| Power overhead (no interconnect) | 25 $\mu$W | 47 $\mu$W |
| Idle current (no transmission) | ~3.3 $\mu$A | ~22 $\mu$A |
| Area | 30 $\mu$m<sup>2</sup> | 121 $\mu$m<sup>2</sup> |
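The derived figures quoted in the text (the almost 2.5× energy ratio and the 23% EDP reduction) can be reproduced from the energy and delay entries of Table 2. The dictionary layout and function name are hypothetical; the numbers are those of the table.

```python
full_swing = {"energy_fj": 148.0, "delay_ns": 0.36}
low_swing = {"energy_fj": 60.0, "delay_ns": 0.68}


def edp(link: dict) -> float:
    """Energy-delay product in fJ*ns."""
    return link["energy_fj"] * link["delay_ns"]


energy_gain = full_swing["energy_fj"] / low_swing["energy_fj"]
edp_saving = 1.0 - edp(low_swing) / edp(full_swing)

print(f"EDP full swing: {edp(full_swing):.1f} fJ*ns")
print(f"EDP low swing:  {edp(low_swing):.1f} fJ*ns")
print(f"energy ratio FS/LS: {energy_gain:.2f}x")
print(f"EDP reduction: {100 * edp_saving:.0f}%")
```

The low swing link is slower but spends far less energy per bit, so its EDP still comes out lower at this interconnect length.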

A comparison with other works in the literature is presented in Table 3. Note that the state-of-the-art solutions report performance figures in the CLOCK mode. Most of these circuits are implemented in mature technologies leading to efficiency gain due to the higher supply voltages. Realizations in advanced nodes typically do not exhibit significant performance gains, mainly due to considerably smaller difference between the generated low swing and the power supply voltage [10].

Further performance gain for the proposed solution can be achieved assuming that the resistor bias can be shared between several transceivers in a parallel interface. The energy efficiency gain can be increased in this way from 4× to 4.6×. A further increase up to 5× is expected if the DC current of the front-end amplifier is reduced. Note that the maximum theoretical efficiency gain of a low swing solution is limited by the ratio of the supply voltage V<sub>DD</sub> and swing voltage V<sub>LS</sub>. For V<sub>DD</sub> = 1.2 V and V<sub>LS</sub> = 120 mV, the maximum theoretical gain cannot exceed 10× [13].
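The 10× bound can be checked numerically. The sketch assumes the usual first-order model for a single-supply driver, in which the charge C·V<sub>LS</sub> is drawn from V<sub>DD</sub>, so the energy per transition is C·V<sub>DD</sub>·V<sub>LS</sub> against C·V<sub>DD</sub><sup>2</sup> for full swing; this model and the function name are assumptions of the example, not a derivation taken from the paper.

```python
def theoretical_gain(v_dd: float, v_ls: float) -> float:
    """Upper bound on the energy gain: (C*V_DD^2) / (C*V_DD*V_LS) = V_DD / V_LS."""
    return v_dd / v_ls


V_DD = 1.2   # supply voltage in volts
V_LS = 0.12  # low swing voltage in volts

limit = theoretical_gain(V_DD, V_LS)
cases = [
    ("reported (2.5 mm interconnect)", 4.0),
    ("shared resistor bias", 4.6),
    ("reduced RX DC current (expected)", 5.0),
]
for label, gain in cases:
    print(f"{label}: {gain:.1f}x, {100 * gain / limit:.0f}% of the {limit:.0f}x bound")
```

Even the projected 5× gain reaches only about half of the bound, the gap being the static and switching overhead of the low swing circuits.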

| Reference              | [6]  | [7]  | [10]  | [11] | [12] | This work |
|------------------------|------|------|-------|------|------|-----------|
| Tech [nm]              | 180  | 130  | 45    | 130  | 250  | 65        |
| Supply [V]             | 1.8  | 1.0  | 1.0   | 1.2  | 2.5  | 1.2       |
| Swing [V]              | 1.03 | 0.49 | 0.5   | 0.5  | 0.4  | 0.12      |
| Energy [pJ/bit]        | 4.0  | 3.52 | 0.152 | N/A  | 4.17 | 0.06      |
| Efficiency gain        | 1.4  | 1.85 | 1.7   | 1.3  | 4    | 4         |
| Speed [Gb/s]           | N/A  | 1    | 2     | 1    | 1    | 1         |

Table 3: Comparison to state-of-the-art solutions.

## 7. Conclusions

In this paper a low swing transceiver for chip-to-chip communication in a 65 nm CMOS technology is presented. The proposed solution exhibits up to  $4\times$  higher energy efficiency at 1 Gb/s speed for 2.5 mm long chip-to-chip interconnect, as compared to state-of-the-art full swing signaling schemes. The transceiver can be used in parallel I/O interfaces in 2.5- and 3-D integrated systems.

## 8. Acknowledgments

We thankfully acknowledge the support of the European Commission under the Horizon 2020 Framework Programme for Research and Innovation through the ExaNoDe project (grant agreement 671578).

## 9. References

- [1]. J. S. Clarke *et al.*, "Process Technology Scaling in an Increasingly Interconnect Dominated World," *Proc. of the IEEE Symposium on VLSI Technology*, June 2014.
- [2]. V. F. Pavlidis, I. Savidis, and E. G. Friedman, *Three-Dimensional Integrated Circuit Design, 2<sup>nd</sup> Edition*, Morgan Kaufmann Publishers, 2017.
- [3]. H. Zhang, V. George, and J. M. Rabaey, "Low-Swing On-Chip Signaling Techniques: Effectiveness and Robustness," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, Vol. 8, No. 3, pp. 264–272, June 2000.
- [4]. S. H. Kulkarni and D. Sylvester, "High Performance Level Conversion for Dual V<sub>DD</sub> Design," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, Vol. 12, No. 9, pp. 926–936, September 2004.
- [5]. J. C. Garcia Montesdeoca, J. A. Montiel-Nelson, and S. Nooshabadi, "CMOS Driver-Receiver Pair for Low-Swing Signaling for Low Energy On-Chip Interconnects," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, Vol. 17, No. 2, pp. 311–316, February 2009.
- [6]. M. Ferretti and P. A. Beerel, "Low Swing Signaling Using a Dynamic Diode-Connected Driver," *Proc. of the European Solid-State Circuits Conference*, pp. 369–372, September 2001.
- [7]. J. C. Garcia, J. A. Montiel-Nelson, and S. Nooshabadi, "Adaptive Low/High Voltage Swing CMOS Driver for On-Chip Interconnects," *Proc. of the IEEE International Symposium on Circuits and Systems*, pp. 881–884, May 2007.
- [8]. R. Golshan and B. Haroun, "A Novel Reduced Swing CMOS Bus Interface Circuit for High Speed Low Power VLSI Systems," *Proc. of the IEEE International Symposium on Circuits and Systems*, pp. 351–354, June 1994.
- [9]. C. K. Kwon, K. M. Rho, and K. Lee, "High Speed and Low Swing Interface Circuits Using Dynamic Over-Driving and Adaptive Sensing Scheme," *Proc. of the International Conference on VLSI and CAD*, pp. 388–391, October 1999.
- [10]. S. Fang and E. Salman, "Low Swing TSV Signaling Using Novel Level Shifters with Single Supply Voltage," *Proc. of the IEEE International Symposium on Circuits and Systems*, pp. 1965–1968, May 2015.
- [11]. F. H. A. Asgari and M. Sachdev, "A Low-Power Reduced Swing Global Clocking Methodology," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, Vol. 12, No. 5, pp. 538–545, May 2004.
- [12]. B. D. Yang and L. S. Kim, "High-Speed and Low-Swing On-Chip Bus Interface Using Threshold Voltage Swing Driver and Dual Sense Amplifier Receiver," *Proc. of the European Solid-State Circuits Conference*, pp. 105–108, September 2000.
- [13]. C. Svensson, "Optimum Voltage Swing on On-Chip and Off-Chip Interconnect," *IEEE Journal of Solid-State Circuits*, Vol. 36, No. 7, pp. 1108–1112, July 2001.