### Double Pass-Transistor Logic for High Performance, Low Latency Wave Pipeline Circuit

### **Santimoy Mandal**

Dept. of Electronics and Communication Engineering, RVS College of Engineering and Technology, Jamshedpur, India

### **Shyam Sundar Prasad**

Dept. of Electronics and Communication, National Institute of Technology, Jamshedpur, India

#### Abstract—

High-throughput, low-latency designs are required in modern high-performance systems, especially for signal processing applications. Existing logic families cannot provide both simultaneously. We propose Double Pass Transistor Logic (DPL), which can be used as a universal logic to provide finest-grain pipelining without affecting overall latency or increasing the area. It does not require any special process steps and hence can be realized in a normal process technology, unlike the CPL proposed by Yano et al. [2], which uses threshold voltage adjustment of selected devices. The design procedure is described for (a) low latency, (b) high throughput and (c) low area requirements. In addition to these advantages, it is envisioned that DPL designs can also be used to build ultra-high-speed pipelined systems without pipelining latches, viz., wave pipelined systems, where the achievable throughput is beyond that permitted by the delay of a pipeline stage.

## I. INTRODUCTION

High speed adders and multipliers are required to meet the demands of signal processing and multimedia applications. Wave pipelining, or "maximal rate pipelining" [1], is a design method that can increase the throughput of a combinational circuit. In conventional pipelining, the combinational circuit is broken into smaller blocks or pipeline stages, and synchronizing elements such as D flip-flops are used as storage elements. The maximum speed is limited by the number of pipe stages, the size of the pipe stages and the complexity of the clock distribution network. In the wave pipelining approach, flip-flops are not used as storage elements between pipeline stages. Instead, the internal capacitances of the gates store the intermediate values [1] [3] [4]. Eliminating the storage elements gives a considerable reduction in area and power. It also eliminates clock distribution and clock skew problems within the combinational block, since no clock signal is required there. New inputs can be applied to the circuit before the outputs are available, effectively allowing multiple waves of data to propagate coherently through the circuit.

Wave pipelining requires all paths from the inputs to the outputs to be balanced. This is achieved by inserting active delay buffers in the paths that have fewer gates than the longest path from the input to the output. The rough tuning method [6] ensures that the gate count along all the paths is the same. However, a rough-tuned circuit is still not balanced, as there are bound to be different delays due to different fan-outs. The absence of synchronizing elements in the wave pipelined circuit could lead to collisions between adjacent waves of data. The clock period should be chosen so that the waves do not collide with each other, giving the gates enough time to complete their task. The pipe stages in a wave pipelined circuit are composed of single gates, and the load capacitances of the gates are used for storage. The load capacitance may vary for different gates in the same stage depending on the fan-outs. Different load capacitances result in different rise and fall times for the driver gates. This delay variation is reduced by fine tuning [5] [6]. Fine tuning involves sizing the transistors in the output inverters of the driver gate to balance the delay. Once fine tuned, the circuit can be clocked at its maximum speed, limited only by the residual difference in path delays (Section II).
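The rough tuning step described above can be sketched in a few lines. This is a minimal illustration and not the procedure of [6]: it assumes the circuit is given as a hypothetical dictionary `fanin` that maps each gate to the nodes driving it, and the function name `rough_tune` is ours. Each gate is assigned a level equal to the gate count of its longest input path, and every connection that skips levels is padded with delay buffers.

```python
def rough_tune(fanin, outputs):
    """Return {(src, dst): n_buffers} that equalizes the gate count of all paths.

    fanin   -- dict mapping each gate name to the list of nodes that drive it;
               primary inputs never appear as keys (hypothetical netlist format)
    outputs -- list of primary output node names
    A dst of None marks padding added after a primary output.
    """
    level = {}

    def depth(node):
        # Gate count on the longest path from any primary input to this node.
        if node not in fanin:
            return 0  # primary input
        if node not in level:
            level[node] = 1 + max(depth(src) for src in fanin[node])
        return level[node]

    buffers = {}
    for gate, sources in fanin.items():
        for src in sources:
            # Levels skipped by this connection must be filled with delay buffers.
            slack = depth(gate) - 1 - depth(src)
            if slack > 0:
                buffers[(src, gate)] = slack

    # Pad shallow outputs so that all outputs sit at the same level.
    deepest = max(depth(out) for out in outputs)
    for out in outputs:
        slack = deepest - depth(out)
        if slack > 0:
            buffers[(out, None)] = slack
    return buffers


# Illustrative three-gate chain in which inputs "c" and "a" bypass earlier gates.
example_fanin = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g2", "a"]}
print(rough_tune(example_fanin, ["g3"]))
```

After this padding every input-to-output path has the same gate count, but the delays still differ with fan-out, which is why fine tuning is needed afterwards.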

Section II discusses the timing constraints of wave pipelining and the features that basic gates designed for wave pipelining must have. Section III gives an overview of the existing logic styles for wave pipelining, together with the limitations of these logic styles and of the tuning methods. Section IV presents the performance of basic gates that are highly suitable for wave pipelining. The power analysis of an 8-bit multiplier is presented in Section V. Section VI presents conclusions and further research directions.

## II. TIMING CONSTRAINTS IN WAVEPIPELINING

Wave pipelined circuits can be clocked at a much higher frequency than conventionally pipelined circuits because their maximum rate is limited only by the path delay difference instead of the maximum path delay. The minimum clock period for a wave pipelined circuit [7] can be represented by

 $T_{cp} \ge \max\left[\Delta T_p + 2\Delta C + T_{sh} + T_{rf},\ \Delta T_x + \Delta C + T_{ms} + T_{rf}\right]$ 

where $T_{cp}$ is the clock period of the circuit, $\Delta T_p$ is the difference between the longest and shortest path delays in the circuit, $\Delta C$ is the worst case clock skew, $T_{sh}$ is the setup plus hold time for the registers, $T_{rf}$ is the worst case rise/fall time at the last logic stage, $\Delta T_x$ is the propagation delay of the longest path from the input to signal X at any intermediate node, and $T_{ms}$ is the minimum time that X must be stable for the next stage of logic to operate correctly. The operating speed is limited by the delay difference between the shortest and the longest paths, not by the total delay of the circuit as in conventional pipelining. The goal of the design process is therefore to reduce $\Delta T_p$ and $\Delta T_x$ as much as possible, while the other parameters can be reduced by known methods.
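The bound above can be evaluated directly. The sketch below is only an illustration: the function name, argument names and all numeric values are hypothetical and are not measurements from this work. All quantities must share one time unit.

```python
def min_clock_period(d_tp, d_tx, d_c, t_sh, t_ms, t_rf):
    """Minimum clock period of a wave pipelined circuit (same unit as the inputs).

    d_tp -- longest minus shortest path delay (Delta T_p)
    d_tx -- Delta T_x at the worst intermediate node X
    d_c  -- worst case clock skew (Delta C)
    t_sh -- register setup plus hold time
    t_ms -- minimum time X must be stable
    t_rf -- worst case rise/fall time at the last logic stage
    """
    output_bound = d_tp + 2 * d_c + t_sh + t_rf    # constraint at the output registers
    internal_bound = d_tx + d_c + t_ms + t_rf      # constraint at internal node X
    return max(output_bound, internal_bound)


# Purely illustrative values in picoseconds.
t_cp = min_clock_period(d_tp=20, d_tx=30, d_c=5, t_sh=40, t_ms=25, t_rf=45)
print(f"T_cp >= {t_cp} ps")
```

With these made-up values the output-register term (115 ps) dominates the internal-node term (105 ps), so the register overhead and clock skew set the limit once the path delay difference has been tuned down.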

## III. EXISTING LOGIC STYLES FOR WAVEPIPELINING

For a balanced wave pipelined circuit, the gates should not have input-dependent or fan-out-dependent delay, and all the gates in a particular logic family should have the same delay. Conventional static CMOS is the most preferred logic among designers because of its high reliability. A 2-input NAND gate is shown in Fig. 1. The architecture of the basic static CMOS gates results in input-dependent and functionality-dependent delays. Several design styles that satisfy the timing constraints of wave pipelining have been proposed by researchers.



Fig. 1 Different CMOS NAND logic styles

### A. Dual rail logic styles

Normal Process Complementary Pass Transistor Logic (NPCPL) [9], Wave pipeline Transmission Gate Logic (WTGL) [3] and Dual Rail Static CMOS (DRSCMOS) [4] are the dual rail logic styles used for wave pipelining. NPCPL and WTGL are based on pass transistors and DRSCMOS is based on static CMOS. In NPCPL, a basic building block is used to develop all basic gates by properly choosing the input signals Ai, Aj and B (for an AND/NAND gate ($XY/\overline{XY}$), Ai=X, Aj=Y and B=Y). The poor conduction of logic 1 by NMOS transistors in NPCPL results in voltage degradation and poor noise margin. WTGL gates use transmission gates to obtain full logic swing and better noise margin, but static power dissipation is present because of the NMOS transistors used. WTGL and NPCPL are fast because of their high logic functionality and low input capacitance. DRSCMOS provides separate circuit paths for each possible input combination, thus eliminating pass transistors. Dual rail logic styles are multi-functional in nature and all the basic gates have the same delay. Systems designed with dual rail styles can be rough tuned because of the similarity in the basic architecture and the availability of "DELAY" gates. All the gates have output inverters for fine tuning purposes. WTGL and NPCPL have unbalanced input capacitances, resulting in complex fine tuning.

### B. Double Pass Transistor Logic (DPL)

Suzuki et al. [8] proposed the double pass transistor logic [9] that overcomes the problems of CPL, namely voltage degradation and poor noise margin. DPL gates give improved circuit performance at reduced supply voltage because they use both NMOS and PMOS transistors. DPL gates are symmetrical, in that the load in any DPL gate is distributed equally among the inputs. The DPL XOR/XNOR gate is perfectly symmetrical. The symmetrical arrangement and the double transmission property suggest that DPL gates will perform very efficiently in wave pipelined circuits. The PMOS and NMOS transistors are arranged so that a dual current path is set up for each input combination, giving DPL gates the smallest equivalent resistance compared to the other logic styles. In WTGL there are also two paths, but the same input is passed along both of them. In DPL the inputs on the two paths are different, thereby distributing the load among the inputs. DPL was claimed to be the most energy efficient of the discussed logic styles by Uming Ko et al. [10]. The symmetrical input loading, double transmission property and energy efficiency of DPL gates make the DPL logic family the best suited logic style for wave pipelining.

## IV. PERFORMANCE OF BASIC GATES

The power\*delay product is a good measure for comparing the logic styles that are to be used in low power, high speed digital systems. The basic gates of all the logic styles were designed using TANNER EDA V.13 with TSMC 0.18 µm CMOS technology and a 2 V rail-to-rail power supply. Table I gives the power\*delay product of the various styles used in wave pipelining. Power measurement was done using the non-invasive power measurement technique suggested by Kang [12]. The results show that DPL has the lowest power\*delay product among the dual rail logic styles. The single rail logic styles have low power because of the lower number of transistors and less switching activity.

Table I

| Logic      | Rise time Tr (ps) | Fall time Tp (ps) | Tr-Tp (ps) | τphl  | τplh  | Power dissipation (mW) | PDP     |
|------------|-------------------|-------------------|------------|-------|-------|------------------------|---------|
| DPL NAND   | 44.46             | 44.19             | 0.27       | 32.65 | 29.50 | 0.344                  | 10.68   |
| NPCPL NAND | 34.78             | 56.34             | 21.56      | 42.4  | 61.32 | 0.118                  | 6.12    |
| WTGL NAND  | 29.56             | 27.53             | 2.03       | 29.22 | 28.31 | 0.34                   | 9.741   |
| DVL NAND   | 69.77             | 70.53             | 0.56       | 51.62 | 69.02 | 6.40                   | 386.048 |
| CMOS NAND  | 40.73             | 41.92             | 1.19       | 33.16 | 27.43 | 2.03                   | 55.68   |

Although the power\*delay products of the WTGL and NPCPL gates are low, NPCPL needs a threshold level restorer and has a low noise margin, and WTGL has constant static power dissipation due to PMOS.
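The PDP column of Table I can be cross-checked with a short script. The paper does not state the exact definition or units used, so this sketch assumes that PDP is the power dissipation multiplied by the mean of τphl and τplh, with delays in ps and power in mW (which yields femtojoules). The function name is ours.

```python
def power_delay_product(power_mw, t_phl_ps, t_plh_ps):
    """Power*delay product in fJ, assuming the mean propagation delay is used."""
    t_avg_ps = (t_phl_ps + t_plh_ps) / 2.0
    return power_mw * t_avg_ps  # mW * ps = 1e-3 W * 1e-12 s = 1e-15 J = fJ


# (power in mW, tau_phl, tau_plh) copied from Table I; delays assumed to be in ps.
table_i = {
    "DPL NAND":   (0.344, 32.65, 29.50),
    "NPCPL NAND": (0.118, 42.4, 61.32),
    "WTGL NAND":  (0.34, 29.22, 28.31),
    "DVL NAND":   (6.40, 51.62, 69.02),
    "CMOS NAND":  (2.03, 33.16, 27.43),
}

for name, (power, t_phl, t_plh) in table_i.items():
    print(f"{name:<11} PDP = {power_delay_product(power, t_phl, t_plh):8.3f} fJ")
```

Under this assumption the NPCPL and DVL rows match the tabulated values, while the other rows differ slightly, so a different delay figure may have been used for them.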

### A. Modification to the DPL gates

The design goal for easier fine tuning is to have balanced input capacitance, that is, the inputs of the gate should be perfectly symmetrical. The DPL AND/NAND gates and the DPL OR/NOR gates are not perfectly symmetrical. All the inputs in these gates are connected to the gates of one NMOS and one PMOS transistor, but the source connections are to either PMOS or NMOS transistors. The drain capacitances of the NMOS and PMOS transistors are not the same because of the difference in transistor sizes and process parameters. Hence the gates are modified so that the GND and supply connections are replaced by primary inputs. A delay gate is necessary to develop a complete library of basic gates. Unlike the other gates, the delay gate has just one input, so fewer transistors are enough to design it. To achieve a dual current path for a DELAY/$\overline{\mathrm{DELAY}}$ gate, transmission gates should be used. Dual current paths require that the transistors are on all the time. Hence the transistors are driven by the supplies and are not controlled by the inputs. The MUX/DMUX gate is the only gate where perfect symmetry could not be achieved. This is because the multiplexer is a three-input gate: the select input drives only the gates of the transistors, while the other two inputs have the same capacitance.

### B. Performance of DPL basic gates

The power\*delay product is a good measure for comparing the logic styles that are to be used in low power, high speed digital systems. The basic DPL gates listed in Table II were designed using the TANNER layout editor in 0.18 µm technology, and the simulations were run in TSpice with a 2 V supply.

Table II

| Logic     | Rise time Tr (ps) | Fall time Tp (ps) | Tr-Tp (ps) | τphl   | τplh   | Power dissipation (mW) | PDP   |
|-----------|-------------------|-------------------|------------|--------|--------|------------------------|-------|
| DPL NAND  | 44.46             | 44.19             | 0.27       | 32.65  | 29.50  | 0.344                  | 10.68 |
| DPL AND   | 59.68             | 72.56             | 12.88      | 27.25  | 42.06  | 0.118                  | 8.17  |
| DPL OR    | 50.14             | 24.32             | 25.82      | 90.96  | 52.25  | 0.114                  | 8.133 |
| DPL XNOR  | 42.69             | 43.93             | 1.24       | 82.58  | 107.07 | 0.147                  | 13.93 |
| DPL XOR   | 47.01             | 64.35             | 17.34      | 26.90  | 23.60  | 0.113                  | 2.85  |
| DPL NOR   | 52.47             | 36.00             | 16.47      | 117.35 | 75.50  | 0.361                  | 34.80 |
| DPL MUX   | 44.49             | 40.95             | 3.54       | 89.54  | 86.85  | 0.112                  | 10.75 |
| DPL DEMUX | 25.27             | 43.05             | 17.78      | 15.57  | 26.56  | 0.336                  | 7.07  |
| DPL DELAY | 81.27             | 78.74             | 2.98       | 200    | 203.4  | 0.227                  | 45.78 |

### C. Wallace tree multiplier

Several popular and well-known schemes have been developed in the past with the objective of improving the speed of the parallel multiplier. In 1964, C.S. Wallace observed that it is possible to find a structure that performs the addition operations in parallel, resulting in less delay. A Wallace tree [13][14] is an implementation of an adder tree designed for minimum propagation delay. Rather than completely adding the partial products in pairs as the ripple adder tree does, the Wallace tree sums all the bits of the same weight in a merged tree. A Wallace tree is an efficient hardware implementation of a digital circuit that multiplies two integers.
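The idea of summing bits of equal weight in a merged tree can be illustrated with a small bit-level model. This is a simplified column-compression sketch in the spirit of the Wallace tree, not the wiring of the multiplier in this paper: it uses only 3:2 counters (full adders), passes leftover bits through unchanged, and models the final carry-propagate adder with an ordinary integer addition. The function names are ours.

```python
def full_adder(a, b, c):
    """3:2 counter: three bits of weight w become a sum (w) and a carry (w+1)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def wallace_multiply(x, y, n=8):
    """Multiply two unsigned n-bit integers by column compression."""
    width = 2 * n
    # columns[w] collects every partial-product bit of weight 2**w.
    columns = [[] for _ in range(width)]
    for i in range(n):
        for j in range(n):
            columns[i + j].append((x >> i) & (y >> j) & 1)

    # Compress layer by layer until no column holds more than two bits.
    while any(len(col) > 2 for col in columns):
        reduced = [[] for _ in range(width)]
        for w, col in enumerate(columns):
            triples = len(col) // 3
            for t in range(triples):
                s, c = full_adder(*col[3 * t:3 * t + 3])
                reduced[w].append(s)
                # A carry out of the top column is always 0 because x*y < 2**width.
                if w + 1 < width:
                    reduced[w + 1].append(c)
            reduced[w].extend(col[3 * triples:])  # leftover bits pass through
        columns = reduced

    # Two rows remain; the final adder is modeled by integer addition.
    row_a = sum(col[0] << w for w, col in enumerate(columns) if len(col) > 0)
    row_b = sum(col[1] << w for w, col in enumerate(columns) if len(col) > 1)
    return row_a + row_b


print(wallace_multiply(200, 37), 200 * 37)
```

Each compression layer has the delay of one full adder regardless of operand width, which is why the tree is faster than adding the partial products row by row.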

### D. Fast Parallel Multipliers

- Different counter and compressor families were compared. The best approach is to build a compressor of the maximal size (i.e. the entire size of the multiplier), as shown in Fig. 2.
- The essence of the optimal tree is optimal wiring and NOT the use of a particular counter/compressor family.
- Tuning the final adder to the signal arrival profile is more important than the speed of the final adder.



Fig. 2 Fast parallel multiplier (8 bit × 8 bit)

## V. POWER ANALYSIS FOR 8-BIT MULTIPLIER

With DPL, the power dissipation of the fast parallel multiplier is 0.903 mW at an operating frequency of 10 GHz.
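As a rough worked figure, and assuming the reported 0.903 mW is the average power at a 10 GHz clock, the average energy drawn per clock cycle (a derived quantity, not a result reported by the authors) would be

$$E_{cycle} = \frac{P}{f} = \frac{0.903 \times 10^{-3}\ \mathrm{W}}{10 \times 10^{9}\ \mathrm{Hz}} \approx 9.03 \times 10^{-14}\ \mathrm{J} = 90.3\ \mathrm{fJ}$$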

## VI. CONCLUSIONS AND FURTHER RESEARCH DIRECTIONS

In this paper we have presented Double Pass Transistor Logic (DPL) for low latency and high throughput applications. We have shown that DPL offers the best speed of operation, comparable to that of CPL. It permits pipelining to the finest grain with negligible overhead in area and latency, as opposed to other logic families, where an increase in pipelining throughput comes with a heavy area and latency penalty. With DPL, it is possible to exploit both latency and throughput simultaneously to the maximum realisable extent. Because of its modularity and higher logic functionality, DPL bears the potential for a sea-of-gates realization. As a further research direction, we are working towards exploiting the DPL design for fabrication.

#### REFERENCES

- [1] L. W. Cotten, "Maximum rate pipeline system," in Proceedings of the 1969 AFIPS Spring Joint Computer Conference, pp. 581-586.
- [2] K. Yano et al., "A 3.8 ns 16 x 16-b Multiplier Using Complementary Pass-Transistor Logic," IEEE Journal of Solid-State Circuits, SC-25, April 1990, pp. 388-395.
- [3] X. Zhang and R. Sridhar, "CMOS Wave Pipelining Using Transmission Gate Logic," in Proc. of the IEEE International ASIC Conference and Exhibit, Rochester, NY, pp. 92-95, 1995.
- [4] G. E. Fernandez, "Dual Rail Static CMOS Architecture for Wave Pipelining," Masters Thesis, Department of Electrical and Computer Engineering, SUNY Buffalo, 1995.
- [5] Wong et al., "Inserting Active Delay Elements to Achieve Wave Pipelining," Proc. Int. Conf. CAD 89, pp. 270-73, 1989.
- [6] D. Talukdar and R. Sridhar, "An analytical approach to fine tuning in CMOS Wavepipelining," ASIC, September 1996.
- [7] C. Gray, W. Liu, and R. Cavin III, "Timing Constraints for Wavepipelining System," IEEE Transactions on Computer-Aided Design of IC and Systems, 13, 1994, pp. 987-1004.
- [8] M. Suzuki et al., "A 1.5 ns 32-b CMOS ALU in Double Pass-Transistor Logic," IEEE Journal of Solid-State Circuits, Vol. 28, No. 11, Nov 1993.
- [9] D. Ghosh and S. Nandy, "NPCPL: Normal Process Complementary Pass Transistor Logic for Low Latency, High Throughput Application," Proc. VLSI Design 93, pp. 341-46, Jan 1993.
- [10] U. Ko, P. T. Balsara, and W. Lee, "Low-Power Design Techniques for High-Performance CMOS Adders," IEEE Transactions on VLSI Systems, Vol. 3, No. 2, June 1995.
- [11] W. Liu et al., "A 250 MHz Wave Pipelined Adder in 2-µm CMOS," IEEE Journal of Solid-State Circuits, September 1994.
- [12] S. M. Kang, "Accurate Simulation of Power Dissipation in VLSI Circuits," IEEE Journal of Solid-State Circuits, Vol. SC-21, No. 5, Oct 1986.
- [13] V. G. Oklobdzija and D. Villeger, "Improving Multiplier Design By Using Improved Column Compression Tree And Optimized Final Adder In CMOS Technology," IEEE Transactions on VLSI Systems, Vol. 3, No. 2, June 1995, 25 pages.
- [14] Z. Shun, O. A. Pfander, H.-J. Pfleiderer, and A. Bermak, "A VLSI architecture for a run-time multi-precision reconfigurable Booth multiplier," in Proc. 14th IEEE Int. Conf. Electron., Circuits, Syst., Dec. 2007, pp. 975-978.