A Charge-Accumulation Based High-Performance CMOS Circuit

There is no doubt that complementary metal-oxide semiconductor (CMOS) circuits with wide fan-in suffer from degraded performance. In this paper, a circuit that depends on charge accumulation is proposed as an alternative to conventional CMOS design. The proposed scheme is investigated quantitatively and verified by simulation using the predictive technology model (PTM) of the 45 nm CMOS technology with a power-supply voltage, VDD, equal to 1 V. Although the proposed scheme is more sensitive to process variations than static CMOS, the comparative analysis and simulation results confirm the superiority of the proposed scheme from the points of view of speed, area, power consumption, and unity-noise gain. It is verified that the proposed scheme has a smaller area, power consumption, and delay compared to the conventional CMOS design when the number of inputs, n, exceeds four, two, and three, respectively. The impacts of process variations, component mismatches, and technology scaling are also investigated. The speed advantage gained from the proposed scheme is expected to be more pronounced when operating in the subthreshold region. A figure of merit including the unity-noise gain, area, power consumption, and time delay is defined, and the proposed scheme shows superior performance compared to the conventional CMOS logic when n exceeds four. Finally, the proposed scheme is compared with various previous schemes.


1.INTRODUCTION
Static CMOS circuits have been dominant during the last three decades due to their low static-power consumption and high noise immunity [1]. However, increasing the fan-in of conventional CMOS circuits degrades their performance. Specifically, increasing the number of inputs increases the number of serially connected PMOS transistors in the pull-up network (PUN) or the number of serially connected NMOS transistors in the pull-down network (PDN), thus lengthening the low-to-high or the high-to-low propagation delay, respectively. The need to charge the parasitic capacitances associated with the NMOS and PMOS devices also increases the dynamic-switching power consumption. The matter becomes worse when adopting pseudo-NMOS or dynamic CMOS logic families due to the dc current and the keeper contention current, respectively [2]. This is typically the case with wide fan-in NAND and NOR gates. Comparators, multiplexers, and microprocessor circuits are examples of applications that require wide fan-in [3]. In this paper, a quick survey of the previous solutions to this problem is given and an alternative design that is based on charge accumulation is proposed.
The remainder of this paper is organized as follows: Section 2 provides a quick survey of the previous solutions to the problem of the degraded performance of wide fan-in CMOS circuits with their pros and cons. The proposed solution is presented qualitatively in Section 3, and the comparative analysis is presented in Section 4. The impact of process variations and component mismatches on the proposed scheme is discussed in Section 5. Section 6 includes the simulation results, discussions, and comparisons with other schemes. Finally, the paper is concluded in Section 7.

2.LITERATURE REVIEW
In this section, some of the previously proposed schemes used to speed-up the response of wide fan-in CMOS circuits are discussed with their pros and cons illustrated. Before delving into these schemes, refer first to Figs. 1 and 2 for the circuit schematic of the standard domino logic and the timing diagram of the clock signal, CLK, respectively [4]. During the precharge phase, the CLK signal is at logic "0," thus activating the PMOS header transistor, QP, deactivating the NMOS footer transistor, QN, and charging the dynamic-node capacitance. If the inputs are allowed to change during this phase, there will be no change in the status of the output.
During the evaluation phase, the CLK signal is at logic "1," thus activating QN. Depending on the status of the inputs, the dynamic-node capacitance discharges or remains charged. If there is no discharging path through the PDN, the keeper keeps the dynamic node charged at VDD in spite of the leakage current in the PDN. The PMOS keeper, however, slows down the discharging process due to its contention current.
In [5], the strength of the PMOS keeper in a typical domino logic was changed by changing its threshold voltage by double capacitive body biasing so that both the leakage power can be reduced and the speed can be enhanced [6 -8].
In [9, 10], a footer voltage feedforward domino was proposed in which the footer transistor is dispensed with; however, two parallel paths can be activated during the evaluation phase using charge sharing. According to this technique, the speed is enhanced by a feedforward path and the noise immunity is enhanced using self-reverse biasing [11]. The techniques used to resolve the trade-off between noise immunity and speed in wide fan-in domino logic can be classified into two categories: using a conditional keeper or raising the voltage of the source terminals of the PDN transistors [9]. The first category depends on controlling the strength of the PMOS keeper, while the second category utilizes the stacking effect [12] or modifies the PDN [13]. Keeper types include the conventional feedback keeper [14], the XOR-based keeper [15], and the conditional keeper [16]. The first one is a single keeper with contention current; although simple in design, it has the slowest speed due to the contention current. The second type is a single keeper without contention, thus resulting in the highest speed, however, at the expense of a wrong output in case of charge sharing. The third type is a dual keeper, which has the most complicated design [17].
In [18], a conditional isolation keeper was utilized that reduces the dynamic-switching power consumption by reducing the voltage swing associated with the dynamic node and separating the dynamic node from the PDN by an NMOS device, thus reducing the parasitic capacitance. M. Nasserian et al. have proposed controlling the voltage swing of the dynamic node [19], thus decreasing the power consumption of wide fan-in gates without degrading the other metrics such as speed, area, or noise immunity [20].
In [21], a technique was proposed that depends on comparing the mirrored current of the PUN with its worst-case leakage current. According to this technique, the parasitic capacitance of the dynamic node was isolated from the PDN, thus resulting in a smaller keeper contention current, power consumption, and delay. Other keeper designs can be found in [6, 22-29]. In [30, 31], the dynamic-switching power consumption was reduced by using charge sharing between a small dummy capacitor that was charged to a low-swing output voltage and a predischarged capacitance. The small swing at the output can then be detected and restored to a rail-to-rail voltage swing using a proper sense amplifier.
In [32], a diode-connected transistor is serially connected to the transistors in the PDN, thus limiting the subthreshold leakage by the stacking effect [33]. Also, a current mirror was added to speed-up the evaluation. In [34], two modifications were made. The first one is using a buffer to delay the operation of the PMOS keeper, thus inhibiting its contention current. The second modification is adopting a variation-coupled keeper in order to compensate for the variation of the leakage current with process variations [35]. In [36], a diode-connected NMOS transistor was connected in series with the PMOS keeper in order to reduce the contention current during the evaluation phase. In [37], a current mirror was adopted in order to enlarge the discharging current of the PDN and speed-up the operation.
In [38], a negative capacitance was adopted in conjunction with the node with the highest parasitic capacitance, thus improving the timing yield. S. Narang has proposed varying the supply voltage in order to compensate for the negative bias temperature instability [39]. P. K. Pal et al. have proposed using a voltage comparison circuit in order to achieve a smaller power dissipation and higher speed by reducing the current of stacked transistors and the number of switching nodes [40]. Also, dual-threshold voltage can be adopted [41]. Feedthrough logic was used in [42] in order to improve the performance by partially evaluating the voltage in the computational block before developing the final steady-state output. In [43], a voltage-comparison based domino logic was proposed to compare between the voltages of upper and lower nodes of the PDN by a proper voltage comparator resulting in lower power consumption and higher noise immunity without significant delay increment for wide fan-in gates.
An improved keeper that is based on a graphical representation of the trade-off between the performance and noise margin was suggested in [35]. One of the most important applications that have wide fan-in is the comparator. In [44], a parallel prefix structure was used to implement a large bitwidth comparator without unnecessary transitions. According to this design, the comparator was constructed by locally interconnecting a limited number of CMOS gates that does not exceed five and four for the fan-in and the fan-out, respectively. Also, the circuit shown in Fig.  3 was proposed in [45]. It depends on initially charging CL, then deciding to keep it charged or discharge it depending on the voltage divider consisting of the always activated PMOS transistor and the NMOS branches.
Other realizations of wide fan-in CMOS gates can be found in [46 -48]. In [47], the inputs are applied to NMOS transistors resulting in a current that is proportional to the number of the activated inputs, then entering a current race. This scheme, although faster, consumes larger power. In [47], charge sharing is performed between two capacitors, the size of one of them depends on the number of the activated inputs. In [48], a pulse with an adjusted width is adopted in order to result in a proper output in a high fan-in circuit. In [49 and 50], the delay variability was reduced by adopting a dual keeper with clock control; this was achieved by reducing the loop gain of the feedback circuitry. In [51], the contention current was reduced using a clock-delayed dual-keeper technique. In addition, by virtue of the stack effect utilized in the keeper circuitry, the size of the keeper could be increased to enhance the robustness of the circuit without sacrificing the speed. In [52], the keeper was controlled using a controlling network; 31.42% and 31.91% reductions in the power consumption and power-delay product, respectively, were reported for 32-inputs OR gate. In [53], the contention current was eliminated at the beginning of the evaluation phase by modifying the keeper; specifically, an NMOS device was added in series to the keeper. In [54], a multiplexer was used for gating the clock signal, thus reducing the power consumption. Garg et al. have proposed using the stack effect to reduce the noise effect and leakage currents, however at the cost of a delay penalty due to the addition of an inverter between the dynamic node and the output node [55]. In [56], a delay network containing an odd number of inverters was used to control the keeper.
Finally, in [57], a floating-gate MOS transistor whose equivalent resistance depends on the number of the activated inputs was adopted. In a nutshell, domino CMOS logic suffers from degraded performance with wide fan-in in addition to the trade-off between noise immunity and speed. In the next section, the proposed scheme is presented. As will be observed, the proposed scheme has a speed advantage that becomes clearer with wider fan-in. This is due to the fact that the proposed scheme depends on parallel operation in the input paths instead of the series operation of the conventional CMOS logic and domino logic.

3.THE PROPOSED SCHEME
Refer to Fig. 4 for illustrating the proposed scheme. According to this scheme, there are two phases; the predischarge and evaluation phases. The circuit operates as follows: During the predischarge phase, the CLK signal is activated, thus turning on the associated NMOS transistors and discharging any remnant charge on any of the capacitors with value C and discharging the parasitic capacitance at the inverter input, CL. The reason for this will be clear shortly.
The floating capacitors can be implemented as interpoly (metal parallel plate) capacitors or fringe capacitors. If any one of the inputs is activated during the predischarge phase, there will be no response at the VCL node because the series NMOS transistors in the input paths, which are controlled by the complement of the CLK signal, are deactivated. During the evaluation phase, the CLK signal is at logic "0" and consequently, if any one of the inputs is activated, the capacitor in the corresponding branch will charge CL by a specific amount. The larger the number of activated inputs, the larger the voltage developed across CL. In order for the scheme to operate properly, the threshold voltage of the inverter, Vthinv, must be adjusted such that if at least one of the inputs is not activated, the generated voltage across CL will be smaller than Vthinv and thus the output voltage will be at logic "1." On the other hand, if all the inputs are activated, the generated voltage across CL must be larger than Vthinv with the result that Vout will be at logic "0" as it must be. If one of the inputs is activated in a certain cycle and then deactivated in the next cycle, the charge remaining on the corresponding capacitor with value C will affect the voltage across CL and the output voltage may be erroneous. So, it is necessary to initially discharge all these capacitors. This is the reason why the CLK signal is used along with the corresponding NMOS transistors to discharge any remnant charge. For the same reason, CL must be initially discharged. Were it not for the series NMOS transistors in the input paths, activation of any one of the inputs would create a contention path with the discharging of CL, thus slowing down the discharging process or preventing it altogether. Thus, there is a need to cut the input paths during the predischarge phase.
Alternatively, the NMOS transistor discharging CL can be sized properly and the n series NMOS transistors in the input paths can be dispensed with. In this case, the sizing of the discharging transistor must be done adopting the worst case; that is, assuming all the n inputs are activated during the predischarge phase. Finally, the reason behind connecting the inputs to both the gate and drain terminals of the pass devices is as follows: If one of the inputs is at logic "0," then the corresponding pass device will be deactivated, thus not affecting the charge accumulated across CL. The proposed circuit can thus operate as an alternative to the NMOS stacks. Any even number of inverters can be used as a buffer in order to obtain a rail-to-rail voltage swing at the output node.
Note also that all the n-input paths are identical. This is due to adopting the assumption that all the n inputs have the same probability of occurrence and thus there is no preference for an input over the others. It is obvious that the voltage difference at CL between the cases of all-activated-inputs and all-except-one-activated inputs decreases with increasing n until we arrive at the state that this difference is smaller than that caused by process variations; an obvious degradation in the performance of the proposed scheme for large values of n. A solution to this dilemma can be found in Fig. 5. In this figure, the inputs are decomposed between two identical circuits, each one of them is similar to that of Fig. 4. Then, the outputs of these two circuits are ORed together so that the final output will be at logic "0" only when all the inputs are activated. In this case, the robustness of the circuit is enhanced, however, at the cost of the additional delay of the OR gate.
The version of the proposed scheme which acts as an alternative to the PMOS stack is the same as that in Fig. 4 with only one modification. The threshold voltage of the inverter must be smaller than VCL in case only one input is activated. This requires lowering Vthinv by a large amount which requires using a huge NMOS transistor in the inverter in order for the circuit to operate properly for all the input combinations. As an alternative, the circuit shown in Fig. 6 can be used to replace the PMOS stack. Finally, the proposed scheme can also be extended for large values of n using the circuit shown in Fig. 7 where a NOR gate is added.

Operation in the Subthreshold Regime
The drain current of the N-channel MOSFET transistor in the subthreshold region is given by

$$I_{DN} = I_{0N} \exp\left(\frac{V_{GS} - V_{thn}}{n_n V_{th}}\right)\left[1 - \exp\left(-\frac{V_{DS}}{V_{th}}\right)\right] \quad (1)$$

where VGS, VDS, and Vthn are the gate-to-source voltage, the drain-to-source voltage, and the threshold voltage, respectively. I0N is given by

$$I_{0N} = \mu_{0n} C_{ox} \frac{W}{L} (n_n - 1) V_{th}^{2} \quad (2)$$

where µ0n is the electron mobility, Cox is the gate-oxide capacitance per unit area, W is the channel width, L is the channel length, nn is the subthreshold-swing coefficient, and Vth is the thermal voltage.
It is apparent that the drain current in the subthreshold region depends exponentially on both VDS and VGS. Thus, for stacks operating in the subthreshold region, the drain current decreases significantly which causes the charging/discharging of the parasitic capacitance at the output node or the internal nodes to be much slower. This is in contrast to the charge-accumulation process of the proposed scheme which is performed in a parallel fashion. So, the speed advantage gained from the proposed scheme is expected to be more obvious when operating in the subthreshold regime.
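The exponential dependence described above can be illustrated numerically. The following is a minimal Python sketch, assuming the commonly used subthreshold expression I_D = I0 exp((VGS - Vthn)/(nn Vth)) (1 - exp(-VDS/Vth)). The function name and the values chosen for I0 and nn are illustrative assumptions, not parameters extracted from the adopted 45 nm model; only Vthn = 0.34 V is taken from the text. The "stacked" bias point is a crude stand-in for a device inside a stack whose VGS and VDS are both reduced by the voltage of an internal node.

```python
import math

THERMAL_VOLTAGE = 0.02585  # kT/q near 300 K, in volts


def subthreshold_current(vgs, vds, vthn=0.34, n_n=1.5, i0=1e-7):
    """Subthreshold drain current in amperes.

    vthn is the NMOS threshold voltage quoted in the paper; n_n
    (subthreshold-swing coefficient) and i0 (current prefactor) are
    assumed, illustrative values.
    """
    gate_term = math.exp((vgs - vthn) / (n_n * THERMAL_VOLTAGE))
    drain_term = 1.0 - math.exp(-vds / THERMAL_VOLTAGE)
    return i0 * gate_term * drain_term


if __name__ == "__main__":
    # A single device that sees the full (subthreshold) supply.
    single = subthreshold_current(vgs=0.30, vds=0.30)
    # A device inside a stack: an internal node at 50 mV lowers VGS,
    # and only a small share of the supply appears as VDS (assumed values).
    stacked = subthreshold_current(vgs=0.25, vds=0.05)
    print(f"single device : {single:.3e} A")
    print(f"stacked device: {stacked:.3e} A")
    print(f"current ratio : {single / stacked:.2f}")
```

Under these assumptions a modest reduction in VGS already cuts the current several-fold, which is the mechanism behind the slow charging and discharging of stacks in the subthreshold regime.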

4.COMPARATIVE ANALYSIS
In this section, the proposed scheme of the circuit shown in Fig. 4 is investigated quantitatively from six aspects. The first one is the determination of the valid range of Vthinv in order for the proposed scheme to operate properly. The second aspect is the comparison between the high-to-low propagation delay of the proposed and conventional schemes. The third aspect is the area comparison between the conventional and proposed schemes. The fourth one is the power-consumption comparison between the conventional and proposed schemes. The fifth aspect is the noise immunity. The sixth and last aspect is the determination of the minimum frequency of operation. We will in the following analysis assume that the parasitic capacitances at the output according to the conventional and proposed schemes are represented by Coutc and Coutp, respectively.

The Proper Range of Vthinv
As stated in the previous section, for the circuit of Fig. 4 to operate properly, Vthinv must be larger than the value of VCL that is generated when at least one of the inputs is not activated and smaller than the value of VCL that is generated when all the inputs are activated. The tightest case, in which exactly one input is not activated, corresponds to the minimum range for Vthinv and thus represents the worst-case scenario. When all the inputs are activated, a charge is deposited on CL due to the currents of the n activated paths.
The series combination of the two NMOS transistors in each input branch can be represented by a proper equivalent resistance, R [59]. In steady state, the capacitors act as open circuits and consequently there will be a zero-voltage drop across each resistance. By KVL, we obtain

$$V_{DD} = V_{C} + V_{CLn} \quad (3)$$

where VCLn is the voltage developed across CL in case of n activated inputs and VC is the voltage across the parallel combination of the capacitors each with capacitance, C. The voltage developed across CL can be found from the charge accumulated across it, Q, by using

$$V_{CLn} = \frac{Q}{C_L} \quad (4)$$

This charge is nothing but the charge extracted from the capacitors in the activated paths. So, for n activated inputs, Q can be written as

$$Q = nC\,V_{C} = nC\,(V_{DD} - V_{CLn}) \quad (5)$$

where nC is the parallel combination of the capacitors each with capacitance C. So, from Eqs. (4) and (5), we obtain

$$V_{CLn} = \frac{nC\,(V_{DD} - V_{CLn})}{C_L} \quad (6)$$

$$V_{CLn} = \frac{nC}{nC + C_L}\,V_{DD} \quad (7)$$

Had we assumed that there are n - 1 activated inputs, then the voltage developed across CL would have been

$$V_{CL(n-1)} = \frac{(n-1)C}{(n-1)C + C_L}\,V_{DD} \quad (8)$$

The difference between these two voltages is thus

$$\Delta V_{CL} = V_{CLn} - V_{CL(n-1)} = \frac{C\,C_L}{(nC + C_L)\,[(n-1)C + C_L]}\,V_{DD} \quad (9)$$

One important comment about Eq. (9) is in order here. The difference in voltage across CL does not depend on the absolute values of the capacitances; rather, it depends on the ratio between them. Taking into consideration that integrated circuits suffer from poor device tolerances but have the advantage of good component matching, this is an important merit [4]. CL depends on the gate capacitances of the transistors of the inverter and on the wiring and interconnections to the input branches and to the inverter input [4]. We will neglect the latter component for simplicity and adopt the convention that the capacitance associated with each terminal of the MOS transistor is proportional to its aspect ratio [60]. Also, the PMOS devices are assumed to have twice the size of the NMOS ones to compensate for the mobility difference [60]. Let the capacitance associated with each terminal of the transistor be C1 for minimum-sized devices (i.e. with aspect ratio equal to 1 and channel length equal to the minimum-feature size). So, if the parasitic capacitance due to connecting one floating capacitor is equal to that associated with the MOS-transistor terminal and the transistors of the inverter are minimum-sized except the PMOS device of the inverter, which has an aspect ratio of 2, then the parasitic capacitance, CL, will be as given in Eq. (10). C1 for the adopted 45 nm CMOS technology is 0.045 fF [60], so CL evaluates numerically as in Eq. (11).

It is, of course, preferable to maximize the voltage difference, ∆VCL, so as to make the scheme as robust as possible in spite of the process variations and component mismatches. Refer to Figs. 8, 9, and 10 for the plots of ∆VCL versus n, C, and CL, respectively, for VDD = 1 V. It is obvious from Fig. 8 that ∆VCL decreases monotonically with the increase in n. This makes sense because the voltage generated on CL cannot exceed VDD irrespective of the number of inputs. So, increasing n causes the effect of each activated input on CL to decrease. On the other hand, from Figs. 9 and 10, it is clear that there are optimum values for C and CL at which ∆VCL is maximum. As is always the case with any curve featuring an optimum behaviour, there must be two contradicting effects associated with varying the parameter at hand. In this respect, increasing CL causes the voltage developed across it to decrease due to the Q = CV relationship. However, increasing CL causes its capacitive reactance to decrease, thus a larger part of the input voltage will appear across C. As a result, a larger charge will be extracted from C to CL, which increases ∆VCL. Similar statements can be made about increasing C.
Of course, the optimum values of C and CL can be found by differentiating ∆VCL from Eq. (9) with respect to C and with respect to CL, then equating the derivative to zero. Alternatively, the optimum values of C and CL can be found from the plots of Fig. 9 and 10 to be 0.07 fF and 0.75 fF, respectively, with the maximum value of ∆VCL equal to 33.4 mV for the two cases.
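The differentiation step can be checked numerically. The sketch below is a minimal Python illustration that assumes the capacitive-divider form of the voltage on CL, kC·VDD/(kC + CL) for k activated inputs, so that ∆VCL = C·CL·VDD/((nC + CL)((n - 1)C + CL)). Setting the derivative with respect to CL (or C) to zero gives CL = C·sqrt(n(n - 1)) and a maximum of VDD/(2·sqrt(n(n - 1)) + 2n - 1). The function names are hypothetical, and the values n = 8 and C = 0.1 fF are assumptions chosen because they reproduce the quoted optimum CL of about 0.75 fF and the 33.4 mV maximum; the text does not state them explicitly.

```python
import math


def v_cl(k, c, c_l, vdd=1.0):
    """Steady-state voltage on CL with k activated inputs (capacitive divider)."""
    return k * c * vdd / (k * c + c_l)


def delta_v_cl(n, c, c_l, vdd=1.0):
    """Voltage difference on CL between n and n - 1 activated inputs."""
    return v_cl(n, c, c_l, vdd) - v_cl(n - 1, c, c_l, vdd)


def optimum_c_l(n, c):
    """CL that maximizes delta_v_cl for a fixed C (from d(dV)/dCL = 0)."""
    return c * math.sqrt(n * (n - 1))


def max_delta_v_cl(n, vdd=1.0):
    """Maximum of delta_v_cl; it depends only on n and VDD."""
    s = math.sqrt(n * (n - 1))
    return vdd / (2 * s + 2 * n - 1)


if __name__ == "__main__":
    n = 8          # assumed number of inputs
    c = 0.1e-15    # assumed floating capacitance, in farads
    c_l_opt = optimum_c_l(n, c)
    print(f"optimum CL   : {c_l_opt * 1e15:.3f} fF")
    print(f"max delta VCL: {delta_v_cl(n, c, c_l_opt) * 1e3:.2f} mV")
```

Because the maximum depends only on n and VDD, the same peak value is obtained whether C or CL is swept, which is consistent with the single maximum reported for the two plots.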
For the proposed scheme to operate properly, Vthinv must satisfy the following two conditions:

$$\frac{(n-1)C}{(n-1)C + C_L}\,V_{DD} < V_{thinv} < \frac{nC}{nC + C_L}\,V_{DD} \quad (14)$$

The optimum value of Vthinv is certainly the average of these two limits.
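As an illustration of these two conditions, the following Python sketch computes the admissible window for Vthinv, places Vthinv at the midpoint of that window, and checks exhaustively that the resulting gate behaves as an n-input NAND. It is a behavioural model only, not the authors' simulation setup: it assumes the steady-state capacitive-divider voltage kC·VDD/(kC + CL) for k activated inputs and an ideal inverter with a sharp threshold. The function names and the values n = 4, C = 0.1 fF, and CL = 0.75 fF are illustrative assumptions.

```python
from itertools import product


def v_cl(k, c, c_l, vdd=1.0):
    """Steady-state voltage on CL with k activated inputs (capacitive divider)."""
    return k * c * vdd / (k * c + c_l)


def vthinv_window(n, c, c_l, vdd=1.0):
    """Lower and upper limits that the inverter threshold must lie between."""
    return v_cl(n - 1, c, c_l, vdd), v_cl(n, c, c_l, vdd)


def gate_output(inputs, c, c_l, vthinv, vdd=1.0):
    """Logic output of the idealized gate: 0 only if VCL exceeds Vthinv."""
    k = sum(inputs)  # number of activated inputs
    return 0 if v_cl(k, c, c_l, vdd) > vthinv else 1


if __name__ == "__main__":
    n, c, c_l = 4, 0.1e-15, 0.75e-15  # assumed values
    low, high = vthinv_window(n, c, c_l)
    vthinv = 0.5 * (low + high)  # midpoint of the window
    print(f"window: {low * 1e3:.1f} mV to {high * 1e3:.1f} mV")
    # Exhaustive check against the NAND truth table.
    for bits in product((0, 1), repeat=n):
        expected = 0 if all(bits) else 1
        assert gate_output(bits, c, c_l, vthinv) == expected
    print("gate matches the n-input NAND truth table")
```

Choosing the midpoint leaves equal margins of ∆VCL/2 on both sides, which is why the width of this window directly limits the tolerable process variation.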

Time-Delay Comparison
In this section, the time delays of the proposed and conventional schemes are compared and plotted versus the number of inputs, n, for the high-to-low transition at the output. The high-to-low propagation delay according to the proposed scheme, tdp, contains five subcomponents (assuming a buffer containing two inverters is added at the output to obtain a rail-to-rail voltage swing). The first one, tdp1, is the time required to discharge CL. For typical values of VCL and VDD, the discharging transistor operates in the deep-triode region and thus can be replaced by an equivalent resistance, Rdis = 1/(kn'(W/L)n(VDD - Vthn)), where kn' and (W/L)n are the process-transconductance parameter and the aspect ratio of the NMOS devices. So, tdp1 is equal to 2.3RdisCL, in which the discharging-time delay is computed at the instant at which VCL(t) is equal to 0.1VCLn. The second subcomponent, tdp2, is the time required for CL to charge to Vthinv and the third subcomponent, tdp3, is the high-to-low propagation delay of the first inverter in case all the inputs are activated. The fourth and fifth subcomponents are associated with the two inverters of the added buffer.

To find tdp2, the instantaneous voltage, VCL(t), must first be found; then VCL(t) is substituted by Vthinv and t by tdp2. Refer to Fig. 11 for illustration, where R is the sum of the equivalent resistances of the two NMOS transistors in each input path. Toward that end, each of the access transistors is substituted by its equivalent resistance, R1, where R1 is given by Eq. (15) [59], in which α is a parameter that accounts for the short-channel effects and is equal to 1.3 for short-channel devices [61]. One simplification is to represent the two serially connected NMOS transistors in each input path by one equivalent transistor with half the aspect ratio [62]. The two voltages, VGS and VDS, of this transistor can be substituted by their average values, in which the source and drain of the equivalent transistor are at CL and the input terminal, respectively. The initial values of VGS and VDS are both VDD as the source terminals of the NMOS transistors controlled by the CLK signal are initially at 0 V, while their final values are VDD - VCLn when CL is assumed to be charged to VCLn. So, the average values of VGS and VDS are 0.5(2VDD - VCLn). Another evaluation of the transistor's equivalent resistance is simply to divide its average VDS by its average drain current. After simple circuit analysis, we obtain VCL(t) as given in Eq. (16). To find tdp2, assuming that t = 0 corresponds to the instant of time at which CL begins charging, we equate VCL(t) to Vthinv, which results in Eq. (17). The high-to-low propagation delay of the first inverter, tdp3, is given by Eq. (18), where Cout1 is the parasitic capacitance at the first-inverter output. The fourth and fifth subcomponents can be evaluated in a similar manner.

In contrast, the time delay of the conventional NMOS stack, tdc, can be approximated by Eq. (19), where Coutc is the parasitic load capacitance at the output of the conventional stack and Rc is the equivalent resistance of each of the NMOS transistors in the stack. The delay is estimated to the 50% point. Eq. (19) was based on the assumption that each transistor in the stack was replaced by its equivalent resistance and that these resistances are equal. Rc is approximated by Eq. (20), where each of these transistors is assumed to operate in the deep-triode region [45]. The effect of the internal capacitances was neglected here with respect to that of Coutc.
When adopting the previously described convention for evaluating the parasitic capacitances, we get Coutc = (3n) fF. The parameters of the 45 nm CMOS technology extracted from [63 and 64] are adopted. tdp and tdc are plotted versus the number of inputs, n, in Fig. 12 assuming that Vthinv is at its optimum value and a fan-out capacitance of 1 fF. Three important notes are in order here. Firstly, as obvious, the proposed scheme has a smaller time delay when the number of inputs exceeds three. Secondly, the percentage reduction in the time delay is more obvious with increasing n. Thirdly, and the most important, the time delay of the proposed scheme is approximately constant and does not increase with the number of the inputs. This is due to the parallel operation inherent in the proposed scheme represented by the simultaneous flow of the currents in the input branches. To be more accurate, increasing the number of the inputs causes a proportional increase in the value of CL; a slight effect that can be safely neglected. This must be compared with the series operation of the conventional stack which is inherently slow and slows further with increasing n.
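The contrast between the nearly constant delay of the parallel structure and the growing delay of the stack can be reproduced with a first-order model. The Python sketch below is an illustration under stated assumptions, not the paper's delay equations: (i) the n identical input branches are lumped into a resistance R/n in series with a capacitance nC charging CL, so that VCL(t) = VCLn(1 - exp(-t/τ)) with τ = R·C·CL/(nC + CL); (ii) the stack delay is taken as 0.69·n·Rc·Coutc, a lumped-RC estimate to the 50% point with internal capacitances neglected; and (iii) all numerical values (R, Rc, Rdis, C, CL, Coutc) are hypothetical. Only the expression tdp1 = 2.3·Rdis·CL is taken directly from the text.

```python
import math


def v_cl_final(n, c, c_l, vdd=1.0):
    """Final voltage on CL with n activated inputs (capacitive divider)."""
    return n * c * vdd / (n * c + c_l)


def optimum_vthinv(n, c, c_l, vdd=1.0):
    """Midpoint between the CL voltages for n - 1 and n activated inputs."""
    return 0.5 * (v_cl_final(n - 1, c, c_l, vdd) + v_cl_final(n, c, c_l, vdd))


def t_dp1(r_dis, c_l):
    """Time to discharge CL to 10% of its initial voltage (from the text)."""
    return 2.3 * r_dis * c_l


def t_dp2(n, r, c, c_l, vthinv, vdd=1.0):
    """Time for CL to charge to vthinv in the assumed first-order model."""
    v_final = v_cl_final(n, c, c_l, vdd)
    if not 0.0 < vthinv < v_final:
        raise ValueError("vthinv must lie between 0 and the final CL voltage")
    tau = r * c * c_l / (n * c + c_l)
    return tau * math.log(v_final / (v_final - vthinv))


def t_dc(n, r_c, c_outc):
    """Assumed lumped-RC delay of an n-transistor stack to the 50% point."""
    return 0.69 * n * r_c * c_outc


if __name__ == "__main__":
    # Hypothetical element values, chosen only to show the trend.
    r, r_c, r_dis = 20e3, 10e3, 5e3          # ohms
    c, c_l, c_outc = 0.1e-15, 0.75e-15, 1e-15  # farads
    print(f"tdp1 = {t_dp1(r_dis, c_l) * 1e12:.2f} ps")
    for n in (4, 8, 16):
        vth = optimum_vthinv(n, c, c_l)
        print(f"n={n:2d}  tdp2={t_dp2(n, r, c, c_l, vth) * 1e12:.2f} ps"
              f"  tdc={t_dc(n, r_c, c_outc) * 1e12:.2f} ps")
```

In this model the charging time constant shrinks as n grows while the logarithmic factor grows, so tdp2 stays within a narrow band, whereas tdc scales linearly with n. The sketch holds CL and Coutc fixed for simplicity, whereas both grow with n in the analysis above.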

Figure 12: The high-to-low propagation delays of the conventional and proposed schemes [ns] versus n.

Area Comparison
As a rough estimation of the area, we adopt the approximation that the area of a certain transistor is proportional to the sum of its channel, drain diffusion, and source diffusion areas. Assume that the area of each of the source and drain diffusions is equal to that of the gate. Adopting the convention that the PMOS transistor has twice the area of the NMOS one to compensate for the mobility difference and adopting the conventional sizing strategy of increasing the aspect ratio of the transistors in the stack with n transistors by n in order to compensate for the delay increase [4], the areas of the conventional and proposed schemes, Ac and Ap, can be approximated by Eqs. (21) and (22), respectively. In Eq. (22), the area of the capacitor was taken equal to that of the minimum-sized transistor. Refer to Fig. 13 for the plots of Ac and Ap versus n. It can be concluded from this rough estimation of the area that the proposed scheme has an area advantage when n exceeds four.

Power-Consumption Comparison
In this section, the dynamic-switching powers of the conventional and proposed schemes, Pc and Pp, are compared. The short-circuit and leakage components are neglected here for simplicity. The power consumption of the conventional stack is given by Eq. (23), where αsw is the switching activity and Cfan is the fan-out capacitance. It includes the switching power required to charge the parasitic capacitances at the gates of the NMOS and PMOS devices to VDD and also the parasitic capacitances at the internal nodes. Now, for the proposed scheme, the short-circuit power consumption of the three inverters must be taken into account if VDD > Vthn + |Vthp|, where Vthp is the threshold voltage of PMOS devices. The adopted values of VDD, Vthn, and Vthp are 1 V, 0.34 V, and -0.23 V, respectively. Although VDD is larger than Vthn + |Vthp|, the rise time of the adopted pulses is short enough to neglect the short-circuit power consumption [65]. Adopting the previously described strategies for computing the parasitic capacitances at the nodes, we arrive at Eq. (24) for Pp. The first term of Eq. (24) represents the switching power required to charge CL (assuming the worst case and thus charging occurs to VCLn). The second and third terms represent the switching power required to charge the parasitic capacitances at the outputs of the three inverters and those associated with the CLK signal and its complement to activate the corresponding transistors, respectively. The fourth and fifth terms are associated with the parasitic capacitances at the gates of the NMOS devices in the input paths and the fan-out capacitance, respectively. Refer to Fig. 14 for the plots of Pc and Pp versus n for f = 1 GHz and Cfan = 1 fF. It is obvious that the proposed scheme has a lower power consumption when n exceeds two.
Although the conventional static CMOS stack has negligible leakage power consumption due to the stack effect [33], the main reason for the smaller power consumption of the proposed scheme is the reduction of the parasitic capacitances which are associated with the smaller area. Finally, refer to Fig. 15 and 16 for the plots of the power-delay products and the energy-delay products of the conventional and proposed schemes versus n.

Figure 14: The power consumption according to the conventional and proposed schemes versus n.

Figure 16: The energy-delay products according to the conventional and proposed schemes [attoJoule.second] versus n.
In a nutshell, the range of n over which the proposed scheme has an advantage compared to the conventional stack is determined by the area and robustness. The lower limit of this range is dictated by the area and the upper limit is dictated by the effect of the process variations.

The Noise Immunity
There are several metrics for estimating the noise immunity including the noise margins for low and high inputs, the unity-noise gain, the unity-noise average, the average noise threshold energy (ANTE), and the energy normalized ANTE [66]. In this paper, the unity-noise gain (UNG) is used in comparing the noise immunity of the conventional and proposed schemes. It is defined as the amplitude of the input noise that causes a noise pulse with the same amplitude at the output node [67]. The noise level can be varied by changing the amplitude or the width of the noise pulse. However, in this estimation, the pulse width is assumed to be constant with the amplitude of the noise pulse varied. Toward estimating the UNG, we will assume that all the inputs are connected to the same noise source which represents the worst-case scenario from the point of view of robustness.
For estimating the UNG of the proposed scheme, UNGp, the circuit of Fig. 17 (a) is adopted. Since the target is to find UNGp, the steady-state equivalent circuit of Fig. 17 (b) is used. It can be easily shown that the voltage, VCL, is given by Eq. (25). Substituting CL from Eq. (10) and assuming that C is equal to C1 yields Eq. (26). To find the relationship between Vout and the circuit input, Vin, the relationship between the input of the inverter, VCL, and the inverter output, Vout, must first be found. From the definition of the UNG, the inverter is most likely to operate in the transition region. To simplify the analysis, two assumptions are adopted. First, the voltage-transfer characteristics of the inverter are represented in a piecewise linear manner as shown in Fig. 18. Second, the input-low voltage and the input-high voltage are assumed to be Vthn and VDD - |Vthp|, respectively. So, the relationship between VCL and Vout in the transition region can be represented by Eq. (27). Substituting VCL from Eq. (25) into Eq. (27) gives Eq. (28), and putting both Vin and Vout equal to the unity-noise gain of the proposed scheme, UNGp, in Eq. (28) gives Eq. (29). Now, to find the unity-noise gain of the conventional CMOS stack, UNGc, both the PUN and the PDN are replaced by their equivalent resistances. The n serially connected NMOS devices in the PDN are represented by a single device with an aspect ratio equal to (1/n)(W/L)n. Similarly, the n parallel connected PMOS devices in the PUN are represented by a single device with an aspect ratio equal to (n)(W/L)p, where (W/L)p is the aspect ratio of a single PMOS device. The equivalent resistances of the PDN and the PUN are thus represented by RN and RP, which are given by Eqs. (30) and (31), respectively.
The output voltage can thus be found by applying a simple voltage division, which gives Eq. (32). For a fair comparison between the conventional and proposed schemes, the NMOS devices are assumed to have the minimum size while the PMOS devices are assumed to have double the size of the NMOS ones to compensate for the mobility difference. Substituting RN and RP from Eqs. (30) and (31) into Eq. (32) and putting both Vin and Vout equal to the unity-noise gain, UNGc, results after simple mathematical manipulations in Eq. (33), where a, b, and c are given by Eqs. (34), (35), and (36), respectively. The other solution is rejected as it does not have a valid physical interpretation. The plots of the unity-noise gains according to the conventional and proposed schemes versus the number of inputs are shown in Fig. 19. As expected, both of these metrics degrade as the number of inputs increases. The conventional scheme has a better unity-noise gain for all values of n. This is not unexpected, as is evident from the previous discussion.

Figure 19: The plots of the unity-noise gains of the conventional and proposed schemes versus n.
Finally, as a combination of the previous performance metrics, a figure of merit is defined in Eq. (37). Fig. 20 shows the plots of the figures of merit of the conventional and proposed schemes versus n. As is evident, the proposed scheme is superior to the conventional one when n exceeds four.

The Minimum Frequency of Operation
The operation of the proposed scheme depends on the charge accumulated across CL, so the inevitable leakage sets a lower limit on the frequency of operation. CL leaks through the gate-oxide tunneling currents of the attached transistors and through the subthreshold leakage of the discharging transistor.
Assuming that the leakage current is Ileak, the minimum frequency of operation can be estimated from Eq. (38), which is based on the fact that the proposed scheme operates properly if VCL does not discharge below VCL(n -1). For n = 8, CL = 1 fF, and Ileak = 10 pA, fmin is approximately 300 kHz, which is much smaller than the adopted frequencies (which are in the MHz or the GHz range). Certainly, increasing n causes Ileak to increase, CL to increase, and the voltage difference, VCLn -VCL(n -1), to decrease. However, the dominant effect is that of increasing Ileak, so fmin increases as the number of inputs increases.
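
The order of magnitude of this estimate can be checked with a short Python sketch. It assumes that the bound takes the form fmin ≈ Ileak / (CL·ΔV), where ΔV = VCLn - VCL(n - 1) is the droop that can be tolerated over one clock period. This form is inferred from the condition that VCL must not discharge below VCL(n - 1) and may differ from Eq. (38) by a constant factor. The value ΔV = 33 mV is likewise an assumption, chosen only because it is consistent with the quoted result of approximately 300 kHz.

```python
def min_frequency(i_leak, c_load, delta_v):
    """Lowest clock frequency (Hz) at which leakage droops C_L by less than delta_v.

    Assumed form: the charge lost in one period, I_leak / f, must stay below
    C_L * delta_v, which gives f_min = I_leak / (C_L * delta_v).
    """
    if c_load <= 0 or delta_v <= 0:
        raise ValueError("c_load and delta_v must be positive")
    return i_leak / (c_load * delta_v)


I_LEAK = 10e-12    # 10 pA, value quoted in the text
C_L = 1e-15        # 1 fF, value quoted in the text
DELTA_V = 33e-3    # assumed tolerable droop VCLn - VCL(n-1), in volts

f_min = min_frequency(I_LEAK, C_L, DELTA_V)
print(f"estimated f_min: {f_min / 1e3:.0f} kHz")
```

Under this assumed form, fmin is proportional to Ileak and inversely proportional to both CL and ΔV. A larger n therefore raises fmin through a larger Ileak and a smaller ΔV, while the larger CL partly offsets the increase.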

5.IMPACT OF PROCESS VARIATIONS AND COMPONENT MISMATCHES
Process variations and component mismatches are inevitable in any integrated circuit. Process variations include variations in aspect ratio, carrier mobility, threshold voltage, and oxide thickness. The component mismatch that is investigated here is the mismatch in the value of C. This is because, as is obvious from the two inequalities, (12) and (13), the only component mismatch that affects ∆VCL is that of C. It is assumed for simplicity that the mismatches along the input paths are the same. The target is to ensure the reliable operation of the proposed scheme in the presence of these variations. The most critical voltages for the sound operation of the proposed scheme are the voltage difference, ∆VCL, and Vthinv, because the circuit will operate erroneously if Vthinv lies outside the valid range. So, in this section, the effect of the mismatch of C in the input paths on ∆VCL is investigated. Also, the effects of the variations in Vthn, Vthp, kn = kn'(W/L)n, and kp = kp'(W/L)p of the access transistors on Vthinv are considered.
Let C have a mismatch of ∆C, and replace each C by C + ∆C in the two inequalities (12) and (13) to obtain the lower and upper bounds for Vthinv and thus the range of Vthinv. After simple mathematical manipulations, the change in ∆VCL due to ∆C, ∆(∆VCL), can be obtained, where the terms containing (∆C)² are neglected. The plot of ∆(∆VCL) versus ∆C for n = 16, VDD = 1 V, CL = 0.045(4 + n) fF = 0.9 fF, and C = 0.1 fF is shown in Fig. 21. For a 20% change in C, there is only approximately an 8 mV change in ∆VCL. To find the variations in Vthinv due to the variations in Vthn, Vthp, kn, and kp, first-order approximations are used, ending with Eq. (42). Also, the terms containing (∆Vthn)², (∆Vthp)², (∆kp)², and (∆kn)² can safely be neglected. Starting from the expression for the threshold voltage of the inverter [4], the sensitivities of Vthinv to these parameters are obtained. Assuming that the variations in these parameters are uncorrelated, the total variation of Vthinv can be expressed as the sum of the products of the sensitivities of Vthinv and the change in every parameter [68], which gives Eq. (48). Assuming that the percentage changes in all these parameters are equal, refer to Fig. 22 for the plot of the absolute change in Vthinv versus this percentage change. The variation in Vthinv is approximately 30 mV for a percentage variation of 10% in these parameters. From Fig. 8, the corresponding maximum number of inputs is eight. So, when n exceeds eight, one must resort to the version of Fig. 5.
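
The sensitivity-based estimate can be illustrated with a minimal Python sketch. It assumes the textbook long-channel expression for the inverter switching threshold, Vthinv = (r(VDD - |Vthp|) + Vthn) / (1 + r) with r = sqrt(kp/kn), which may differ from the expression taken from [4]. The values kn = kp = 100 µA/V² are hypothetical, since the transconductance parameters are not quoted here, so the result is not expected to reproduce Fig. 22. The sensitivities are obtained by finite differences, and summing the magnitudes of the individual contributions is a worst-case choice made for this sketch.

```python
import math


def inverter_threshold(vdd, vthn, vthp, kn, kp):
    """Textbook switching threshold of a CMOS inverter (assumed form)."""
    r = math.sqrt(kp / kn)
    return (r * (vdd - abs(vthp)) + vthn) / (1.0 + r)


def first_order_variation(params, rel_change, step=1e-6):
    """Sum of |sensitivity * parameter change| over Vthn, Vthp, kn, and kp.

    Each sensitivity is a forward finite difference around the nominal point;
    every parameter is changed by the same fraction rel_change.
    """
    nominal = inverter_threshold(**params)
    total = 0.0
    for name in ("vthn", "vthp", "kn", "kp"):
        h = abs(params[name]) * step
        bumped = dict(params)
        bumped[name] = params[name] + h
        sensitivity = (inverter_threshold(**bumped) - nominal) / h
        total += abs(sensitivity * rel_change * params[name])
    return total


nominal_params = {
    "vdd": 1.0,     # V, from the text
    "vthn": 0.34,   # V, from the text
    "vthp": -0.23,  # V, from the text
    "kn": 100e-6,   # A/V^2, hypothetical
    "kp": 100e-6,   # A/V^2, hypothetical
}

vth_inv = inverter_threshold(**nominal_params)
delta_vth_inv = first_order_variation(nominal_params, rel_change=0.10)
print(f"nominal Vthinv: {vth_inv:.3f} V")
print(f"first-order variation for 10% changes: {delta_vth_inv * 1e3:.1f} mV")
```

With matched kn and kp, the threshold-voltage terms dominate the total, and the contributions of kn and kp are several times smaller.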

Simulation Setup
In this section, the proposed scheme is simulated and compared with previous schemes including the conventional stack. The predictive technology model (PTM) of the 45 nm CMOS technology is adopted with VDD equal to 1 V [64]. The 50% criterion is adopted for estimating the time delays. Unless otherwise specified, the load capacitance is set equal to 1 fF, n = 8, and the frequency of operation is 1 GHz. C is set equal to 0.1 fF. The room temperature of 27 °C is adopted.
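
The 50% criterion can be applied to sampled waveforms as in the following Python sketch. It is a generic post-processing routine and not the measurement script used for the reported results. The function names and the synthetic ramp waveforms are illustrative assumptions, and the gate is assumed to be inverting, so the output is searched for a transition opposite to that of the input.

```python
def crossing_time(times, values, level, rising):
    """Linearly interpolated time of the first crossing of `level`."""
    for i in range(1, len(times)):
        v0, v1 = values[i - 1], values[i]
        crossed = (v0 < level <= v1) if rising else (v0 > level >= v1)
        if crossed:
            frac = (level - v0) / (v1 - v0)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    raise ValueError("waveform never crosses the requested level")


def propagation_delay_50(times, v_in, v_out, vdd, input_rising):
    """Delay between the 50% points of the input and of an inverting output."""
    half = 0.5 * vdd
    t_in = crossing_time(times, v_in, half, rising=input_rising)
    t_out = crossing_time(times, v_out, half, rising=not input_rising)
    return t_out - t_in


# Synthetic example: time in picoseconds, voltages in volts.
t_ps = [0.0, 10.0, 20.0, 30.0, 40.0]
v_input = [0.0, 1.0, 1.0, 1.0, 1.0]    # rises between 0 and 10 ps
v_output = [1.0, 1.0, 1.0, 0.0, 0.0]   # falls between 20 and 30 ps

t_phl = propagation_delay_50(t_ps, v_input, v_output, vdd=1.0, input_rising=True)
print(f"high-to-low propagation delay: {t_phl:.1f} ps")
```

For waveforms that contain several transitions, the output search would have to start after the input crossing. The sketch omits that step for brevity.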

Results
The high-to-low propagation delays of the proposed scheme according to the analysis and the simulation are shown versus the load capacitance in Fig. 23. The high-to-low propagation delays of the conventional CMOS and the proposed scheme are shown versus the load capacitance according to the simulation in Fig. 24.

Figure 23: The high-to-low propagation delays versus the load capacitance of the proposed scheme according to the analysis and the simulation.

As a further comparison of the proposed scheme with previous work, the scheme in [40] depends on utilizing a sense amplifier to decide on the output status. However, due to the cascaded nature of this scheme and the need to stack some of the transistors, it is expected that the area and time delay of this scheme are larger than those of the proposed scheme. The scheme in [70] depends on partitioning the dynamic-node capacitance through using a splitter transistor, thus reducing the power consumption compared to the conventional domino logic. The power consumed according to this scheme will, however, be larger than the proposed one due to the need to charge and discharge several internal node capacitances. The scheme proposed in [71] depends on utilizing multi-threshold devices, thus minimizing the leakage and the associated power consumption. However, the associated cost of fabrication is expected to be relatively large.

7.CONCLUSIONS
In this paper, a scheme that depends on charge accumulation was presented as an alternative to the conventional wide fan-in CMOS circuits to enhance the performance. It was concluded that the range of the inputs above which the proposed scheme has an advantage compared to the conventional CMOS logic is dictated by the area, power consumption, and speed. It was found that the proposed scheme has smaller area, power consumption, and delay when the number of inputs exceeds four, two, and three, respectively. However, there is no contender to static CMOS logic from the point of view of robustness for any number of inputs. The proposed scheme was compared with various previous schemes and showed better power-delay and energy-delay products.
It was evident that the percentage reduction in the average time delay can increase with increasing the number of inputs. The speed advantage of the proposed scheme is attributed to the reduction of the parasitic capacitances due to the use of smaller sized transistors and the parallel operation in the input paths instead of that in the series connection of the conventional CMOS logic or domino logic. Also, it can be concluded that the propagation delay of the proposed scheme increases with increasing the number of inputs at a very slow rate compared to the conventional CMOS logic.

Declaration of Competing Interest
The author declares that he has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Declaration of Funding
This work was not funded by any institute or organization.