Design for Arithmetic Based Dynamic Binary-To-RNS Conversion Using Multi modulo Operations

Ganji Sarikayadav¹, Dr.V.Thrimurthulu², S.Ali Asgar³

¹II M.Tech VLSI SD Student, CR EngineeringCollege, Tirupathi, Chittoor(Dist) A.P, India,
²Professor, Head of ECE Dept., CR Engineering College, Tirupathi, Chittoor (Dist) A.P, India,
³Assistant Professor of ECE Dept., CREEngineering College, Tirupathi,Chittoor(Dist) A.P, India,

Abstract
To design a new ROMLESS (Read-Only-Memory less) structure for binary-to-RNS (Residue Number System) conversion using modulo \(2^n \pm k\) is proposed. The proposed read-only-memoryless structure is only based on constant multipliers and adders. The development structure is efficient for larger values of \(n\) and existing system is inefficient for larger values of \(n\). Experimental results obtained for an proposed conversion structures significantly improve the forward conversion efficiency with at metric improvement above 100% regarding the related state of art. Delay improvements 2.17 times with less area can be achieved if a proper selection of the \(2^n \pm k\) moduli is performed.

Keywords: Arithmetic, binary-to-RNS,forward conversion, residue number system.

1. INTRODUCTION
In this paper Residue Number System is unconventional and non-Weighted Number System in which the additions, subtractions and multiplication are inherently carry free. This increase in calculation speed and a reduction in its power consumption.RNS uses remainders to represent numbers. Its main characteristics offer the potential for high-speed and parallel processing based on carry-free arithmetic [1]. This defined by the moduli set that supports each particular RNS. The RNS moduli set is set up by defining the moduli,where mi represents positive relatively prime integers. A number \(X\) is represented in RNS by its residues \(x_i = (X \mod mi)\), where \(x_i\) is theremainder of the division of \(X\) by \(mi\). Conversion from weighted number system to Residue Number System (forward conversion or binary -to-RNS),and vice versa (reverse conversion or RNS-to-binary), is required inorther to implement a complete RNS-based processing system. RNS is basically used on computational intensive applications such as filtering, convolution, digital signal processing correlation, fast Fourier transform computation and cryptography [2–4].

First research was done on Residue Number System was mainly focused on the three modulus set \(\{2^n-1, 2^n, 2^n+1\}\) [5].Recently, different RNS moduli sets have been proposed in order to increase the dynamic range (DR) and/or reduce the width of the RNS channels, such as \(\{2^n-3, 2^n-1, 2^n+1, 2^n+3\}\) in [6] and [7], \(\{2^n-1, 2^n+1\}\) in [8], \(\{2^n-1, 2^n, 2^n+1, 2^n+1\}\) in [9], and \(\{2^n-1, 2^n, 2^n+1, 2^n+1\}\) in [10]. More recently, the moduli set with a DR up to \((8n + 1)\) bits has been proposed.

The moduli \(\{2^n \pm k\}\) with unrestricted \(k\) values, is useful to obtain larger Residue Number System moduli sets [4], resulting in circuits with improved metrics. Larger moduli sets for the same DR, the operators are smaller, and more compact arithmetic units are achieved, with circuit area requirements and reduced delay. Most of the forward converters presented in the state of the art are limited in terms of the number of bits per channel are unable to scale for larger DR, given their exponential growth with the number of bits per channel, mostly having structures based on LUT (lookup tables), such as ROMs (read only memories).

The proposed structures are based on weighted reduction, considering a serial–parallel, serial or fully parallel approaches. Considering the periodicity property was proposed.The periodicity property can be used only to improve the conversion when shorter values are found for modulo \(\{2^n \pm k\}\). More recently, proposed a novel conversion structure in that overcomes this dependency using the distributive property instead of periodicity, allowing for unrestricted modulo values. In proposed multimoduli architectures, using a weight selection algorithm, with binary additions and ROMs. These architectures allow the same arithmetic operations for different moduli within the same structure, but suffer from the same lack of scalability for larger DR. Furthermore, since the brief herein presented is focused on simple single-modulo conversion structures, these are not herein considered. In the analysis and implementation
of a computer-aided design (CAD) tool is presented, capable of generating a structural description of binary-to-RNS converters, for a DR up to 21 bits. The proposed method actually allows achieving forward converters for any DR, as herein shown.

A novel ROM-less generic forward conversion structure for a DR of m = jn-bit, using \( \{2^n \pm k\} \) moduli, is proposed, considering \( n \geq 2 \). The proposed approach splits the \( jn \) input bits into \( j \) input sets, and computes the respective residue value using modular additions and constant multiplications. The use of constant multipliers in the proposed scheme does not impose the exponential area increase as the ROM-based topologies proposed in the related state of the art. To evaluate the performance of the proposed structure, the experimental results were obtained. These results suggest that the proposed forward conversion topology allows improvements of 85\% in area and 15\% in delay when compared with the ROM-based modulo \( \{2^n \pm k\} \) binary-to-RNS converters proposed. Moreover, improvements up to 50\% on delay can be achieved when comparing with the ROM-less-based converters proposed with a minimal increase in circuit area resources, 10\% higher on average.

This brief discussion as follows. Section II introduces the formulation adopted to design the modulo \( \{2^n \pm k\} \) binary-to-RNS conversion structure described in Section III. Section IV presents the experimental results, and compares the proposed topology with the related state of the art. The conclusion is presented in Section V.

II. RELATED WORK

Considering a integer X with m-bit inputs, represented as \( \{X_{m-1}, \ldots, X_1, X_0\} \) with

\[
X = \sum_{i=0}^{m-1} 2^i X_i
\]

a forward converter for modulo \( \{2^n \pm k\} \) transforms X into a residue value r with \( w_{\text{mod}} \) bits, \( r_{\text{remod}+1}, \ldots, r_1, r_0 \), with \( w_{\text{mod}} = \lfloor \log_2(2^n \pm k) \rfloor \) \( w_{\text{mod}} = n \) for modulo \( \{2^n \pm k\} \), and \( w_{\text{mod}} = n + 1 \) for modulo \( \{2^n \pm k\} \), when 0 < k \( \leq 2^n \). The residue modulo \( \{2^n \pm k\} \) of the input value X can be achieved by computing the integer division of X by \( \{2^n \pm k\} \), but it is a costly operation to obtain the remainder. The approach here considered to derive \( X \) is based solely on simple modular arithmetic operations. Given the property

\[
(2^n \pm k)X_0 \equiv (2^n \pm k \oplus k)X_0 \equiv (2^n \pm k)X_0 \mod 2^n
\]

and with \( X_0 = \lfloor X \rfloor \mod 2^n \), and considering X as the binary representation of an integer to be converted, with a DR of jnbits, where \( X_0 = \lfloor \text{msb} \oplus \text{lsb} \rfloor \), represents the msb to lsb bits of integer X. The residue modulo \( \{2^n \pm k\} \) of the value X with jnbits can be computed as \( \langle X \rangle \equiv \langle X \rangle \mod 2^n \).
and carry vectors. The final reduction is accomplished using a modified modulo adder and a lookup table. For larger moduli and DPs, the usage of the lookup table makes it once more not scalable. In another weight selection-based conversion approach is proposed. However, in this case, the resulting structure is based solely on full adders, allowing it to scale to larger moduli channels and DPs. The authors also made available a CAD tool to facilitate the implementation of the proposed conversion structures. However, the developed tool only considers DR up to 21 bits. In order to obtain conversion structures for the considered DPs, an augmented version of this CAD tool was developed, considering the proposed method. The version implemented by us now allows for the unrestricted DR, and was designed to use the same optimized arithmetic structures used in the herein proposed conversion structures. The conversions structures generated by the new tool are herein referred to as Soudris.

Given the inadequacy of most of the related state-of-the-art structures, for larger values of \( n \), two new generic binary-to-RNS conversion structures are herein proposed for modulo \( [2^n \pm k] \), only requiring arithmetic operations. These conversion units only require constant multiplication and addition units: 1) Proposed I structure targets a more compact structure with a serialized modular reduction approach, which is an extension of the previous work presented but allowing now to compute X modulo \( [2^n \pm k] \) with DR of \( j n \) bits instead of the previous 4n bits; 2) Proposed II structure targets faster conversion considering a parallel approach. In these proposed structures, a restriction to the width of \( (o_k)_{ij} \) is considered, allowing to simplify the modular reduction step, namely \( o_k = [\log_2 (k)] \leq n/2 \). Even with this restriction of \( o_k \), the \( k \)-constant multiplexer value in (2) and (3) can be greater than \( 2^n \pm k \) for \( i \geq 2 \), since for \( k < 2^{n/2} \) results in \( k < 2^{n/2} \). In these cases the resulting modulo constant \( (k^i)_{2^{\text{max}}} \) can be precomputed and reduced modulo \( [2^n \pm k] \), providing a constant value with \( o_k \) bits, where \( o_k = [\log_2 ((k^i)_{2^{\text{max}}})] \). The maximum width of \( k \)-values is represented by \( o_{\text{max}} = \max(o_k) \), \( 0 \leq i \leq j - 1 \).

Proposed II structure for modulo \( [2^n \pm k] \) forward conversion is based on a parallel approach. In this approach, each partial operation \( (k^i \cdot X_i) \) is first reduced modulo \( [2^n \pm k] \), and only then added to obtain the final residue value. Thus, computation is performed in two stages: the first stage computes the constant multiplication vectors and the second stage performs the addition of all those vectors. This structure performs the modular reduction of each calculation, using the modulo \( [2^n \pm k] \) Carry-Save Adder instead of adding all terms and reducing them iteratively at the end, as in Proposed I.

The considered modular \( k \)-constant multiplication block computes

\[
\langle (k^i)_{2^{\text{max}}} X_i \rangle_{2^{\text{max}}} = (k^i_1 \cdot \omega_{k-1} + \cdots + \omega_{k-n+1} \cdot X_i)_{2^{\text{max}}} = (P^1_{\omega_1 \cdot \omega_{k-1}} + \cdots + P^1_{\omega_{k-n+1} \cdot \omega_{k-n}})_{2^{\text{max}}}
\]

Formulas (6) and (7) for the binary RNS converter module \( [2^n \pm k] \).

However, when \( \omega_k \cdot \omega_k > n \), the value \( P^2 \) has more than \( n \) bits, requiring an additional reduction step, computed as

\[
\langle (k^i)_{2^{\text{max}}} X_i \rangle_{2^{\text{max}}} = (\tau_k \cdot \Pi^1_{\omega_1 \cdot \omega_{k-1}} + \cdots + \Pi^1_{\omega_{k-n+1} \cdot \omega_{k-n}})_{2^{\text{max}}}
\]

Fig. 1. Proposed binary-to-RNS converter modulo \( [2^n \pm k] \).

Thereafter, the \( k \)-constant modular multiplication can be implemented with two constant multipliers when \( (0_k + o_k) \leq n \), or with an additional constant multiplier and a 3.2 modulo compressor, \( \omega_k \cdot \omega_k > n \). The resulting modular \( k \)-constant multiplication block is shown in Fig. 1. The additional resources for the more complex implementation case are represented with dashed lines. After the computation of \( \langle (k^i)_{2^{\text{max}}} X_i \rangle_{2^{\text{max}}} \), the resulting carry-and-save vectors are fed into a modular adder tree to obtain the final result \( \langle X \rangle_{2^{\text{max}}} \). In the particular case of the binary-to-RNS modulo \( [2^n \pm k] \), an additional input is required in the modular adder tree to compute the addition of the correction factor \( \text{cst} \), from (6).

IV. EXPERIMENTAL RESULTS

In order to evaluate fully the proposed binary-to-RNS converters and the related state of the art, all structures were described in Very High Speed Integrates Circuits Hardware Description Language.
and mapped to an application-specified integrated circuit technology, in particular for the United Microelectronics Corporation (UMC) 0.13-μm CMOS technology from UMC. The ROM results have been obtained by synthesizing the available ROM sizes in the synchronous via-1 ROM compiler for the UMC 0.13-μm high-speed logic process technology, and the others were estimated based on the real obtained values. Both synthesis and mapping results were performed using Design Vision E-2010.12-SP4 from Synopsys.

In order to take into account the impact of different values for n and k, the presented experimental results were obtained for a variation of n ∈ [6, 32]. For the value k, the best case is obtained for k = 3, and for the worst case, the values k = 2^n - 1 and (k^i)|2^a_k = 2^n - 1 are considered. The worst case value for k correspond to the worst case scenario for Proposed I and Proposed II structures implying the most complex and costly multipliers. Since for the Soudris structure the worst case scenario cannot be easily determined, and to simplify the analysis, the same value of k, k = 2^n - 1, is used as the worst case value. For the Premkumar structure, the presented values are for k = 3, since no significant variations occur for other values of k.

In order to properly evaluate the cost of the several conversion structures, Residue Number System with four and eight moduli channels are considered, resulting in a DR of 4n and 8n, respectively. The obtained experimental results for circuit area and delay are shown in Fig. 2(a), (b),(d), and (e).

As expected from the theoretical analysis the ROM-based conversion structure proposed imposes significantly worse area metrics even for small DR. As the DR increases, and with it the size of the ROM, the area of ROM-based conversion structures significantly increases, due to the exponential increase of the ROM area. The Premkumar structure has worst area metrics, as the Piestrak converter, for small modulo and DR. However, the Premkumar structure has less area than the Piestrak for larger modulo, for modulo channels with more than 10 bits. Nevertheless, the Premkumar structure requires up to 6.85 times more area resources than Proposed II.

The Soudris conversion structure always requires more area resource for k = 3 when compared with Proposed II. However, when considering the worst case of k, the Soudris converters require less circuit area than the proposed converters. Note that the worst case scenario for the herein proposed conversion structures is not realistic since it considers all the constant multipliers in the most complexform. Still, less area demanding conversion units can be obtained with the proposed conversion structures, when compared with the related state of the art, in particular for larger values of n. Given the parallel approach of Proposed II structure, when compared with Proposed I, slightly higher area requirements are observed. From the obtained area results, it can also be concluded that the area increases with the number of moduli channels (j), since more constant multipliers and adders are needed, in particular for the more parallel Proposed II structure. The obtained experimental results suggest that the ROM-based related art (Piestrak) can in fact be faster, but only for values of n up to 18 bits. The modular adder-based conversion structure from Premkumar has worst delay metric than Piestrak for n up to 18 bits, but achieves better delay metrics for n greater than 20 bits. The Soudris conversion structure has the worst delay for the considered structures, nevertheless for n greater than 22 bits better delay metrics than the Piestrak conversion structure can be achieved. Regarding the proposed conversion structures, Proposed I conversion structure is able to achieve lower delay metrics than the related state of the art for n as low as 8 bits. Proposed II structures always faster for n greater than 18 bits, even for the worst case of k, for acceptable area costs. The delay improvement inversion point depends on the number of modulo channels (j) and the considered k values. For example, the Premkumar converter allows to achieve better delay metric regarding ours in the worst cases. However, requiring on average 5.42 times more area resources. On average, the obtained results, for k = 3 and for the worst value of k, suggest that Proposed I structure requires on average 91%–87% less area, and is on average 10% faster to 16% slower than the Piestrak and Premkumar conversion structures, respectively. Regarding the Soudris structure, it allows for a 12% faster conversion at a cost of 44% more area resources. The experimental results suggest that Proposed II structure is on average 85%–80% smaller, and 30%–5% faster than the Piestrak and the Premkumar structures, respectively. When compared with the Soudris structure, the experimental results suggest that Proposed II structure is on average 40% faster at a cost of 56% more circuit area. Note that this average is obtained from k = 3 and k = worst case, which does not allow to fully interpret the resulting metrics. To better understand the variation of these metrics with the variation of the k value, the Premkumar, the Soudris, and Proposed II conversion structures were synthesized for all possible k values for n = 20, considering a DR of 4n bits and modulo (2^n - k), as shown in Fig. 2(c) and (f). From these results, it can be seen that the Premkumar structure has a stable value for the delay metric, meaning that the obtained results are less dependent.
of the values of \( k \).

(a) DR 4n modulo \( \{2^n \pm k\} \), (b) DR 8n - modulo \( \{2^n \pm k\} \), and (c) \( j = 4, n = 20 \), varying \( k \) and delays, (d) DR 4n-modulo \( \{2^n \pm k\} \), (e) DR 8n-modulo \( \{2^n \pm k\} \), and (f) \( j=4, n=20 \), varying \( k \).

Fig. 2. Experimental results for binary-to RNS converter with areas.

However, this invariance is achieved at significantly higher area costs, up to 18 times more area resources than Proposed II, being outside of the considered scale in Fig. 2(c). The obtained results also suggest that Proposed II structure is always faster than Soudris, in particular for smaller values of \( k \).

![Image](image.png)

Fig. 3. Ratio of the energy consumption for \( j=4, n = 20 \), varying \( k \).

For the first 100 \( k \) values, the structure herein proposed achieves an average delay improvement of 48% when compared with the Soudris structure, at a cost of 36% more area resources. If the best 16 \( k \) values of each structure are considered, the delay of Proposed II structure is on average 2.17 times faster with just 5% additional area resources, resulting in an AT performance metric improvement of 108%. When considering the energy consumption per conversion, Proposed I structure suggests the largest variation, according to the selected \( k \) value, as shown in Fig. 3.

Nevertheless, the obtained results suggest that with an adequate selection of the moduli set, (i.e., the \( k \) values) significantly less energy is required for the conversion. For example, for a moduli set with four channels, the Soudris and the Premkumar structures require 128% and 594% more energy, respectively, than Proposed II structure, with a modulo conversion requiring an average of 110 nJ. The results also suggest that, if the moduli set is extended to 16 channels, the Soudris and the Premkumar structures will require 50% and 264% more energy than Proposed II structure, respectively. From this, it can be concluded that Proposed II conversion structure allows for significantly more efficient conversion metrics, in particular the \( k \) values are adequately selected.

V. CONCLUSION

In this brief, two generic and scalable modulo \( \{2^n \pm k\} \) binary-to-RNS conversion structures are proposed for \( jn \)-bit DRs. The proposed approach splits the \( jn \) input bits into \( j \) input sets, and computes the respective residue value using modular additions and constant multiplications, implementing ROM-less structures. To assess the gains achieved by the proposed structures, the experimental results were obtained for 4\( n \)- and 8\( n \)-bit DRs. The obtained experimental results suggest that the proposed conversion approach allows for average delay improvements between 10% and 40%, with a worst average area cost between 44% and 56%, regarding the best existing state of the art. Nevertheless, if a proper selection of the used \( k \) values is performed, 2.17 times faster conversion operations with only 5% extra area resources can be achieved, with AT performance metric improvements above 100%.

ACKNOWLEDGEMENT

I express my sincere thanks to my guide Mr. S. ALI ASGAR, M. Tech, Assistant Professor of ECE DEPARTMENT, and to my head of department Dr. V. THRIMURTHULU M.E., Ph.D., M. IETE, M. ISTE, Professor & Head of ECE Department, CREC, TIRUPATHI, for their valuable guidance and useful suggestions, which helped me in the project work.

REFERENCES


AUTHORS:

G. Sarika Yadav received her B. Tech degree in Electronics & Communication Engineering from SV Engineering College for Women, Tirupati (A.P), India, in the year 2013. Currently pursuing her M. Tech degree in VLSI System Design at Chadalawada Ramanamma Engineering College, Tirupati (A.P), India. Her area of research includes low power VLSI design.

Dr. V. Thrimurthulu M.E., Ph.D., MIETE, MISTE Professor & Head of ECE Dept. He received his Graduation in Electronics & Communication Engineering AMIETE in 1994 from Institute of Electronics & Telecommunication Engineering, New Delhi, Post Graduation in Engineering M.E specialization in Microwaves and Radar Engineering in the year Feb, 2003, from University College of Engineering, Osmania University, Hyderabad, and his Doctorate in Philosophy Ph.D. from Central University, in the year 2012. He has done his research work on Ad-Hoc Networks.

S. Ali Asgar was born in Tirupati, Andhra Pradesh, India. He received B. Tech degree in Electronics & Instrumentation engineering from Sree Vidyanikethan engineering college, Rangampeta in the year 2007. He did his M. Tech. in Digital Signals in 2012 from SJCET, Yenniganur (A.P), India. He is currently working as Associate Professor in Chadalawada Ramanamma Engineering College, Tirupati (A.P), and India.