## Features

This article covers the following topics:

- A comparison of the parallel and serial methods for implementing arithmetic functions
- Building blocks for basic serial multiplication and addition
- An integer polynomial implementation using serial logic
- Variable polynomial order and data width
- Example test bench

## Introduction

In a previous article, I discussed the benefits of using Horner’s Rule and fixed-point logic to calculate polynomials in fewer clock cycles. In this article, we’ll look at a very different approach that uses serial logic instead of parallel logic: it requires many more clock cycles, but significantly less area and potentially a higher clock rate. In most modern designs, speed and power are bigger concerns than area, but serial logic may still be worth considering in special situations. Also note that this design uses full-precision integer arithmetic instead of the fixed-point arithmetic from the first article.

## Background

#### Horner’s Rule

As in the previous article, Horner’s Rule is used here to reduce the number of required multiplications. Assume that we have an *nth* order polynomial of the following form:

*f(x) = a_{n}x^{n} + a_{n-1}x^{n-1} + … + a_{1}x + a_{0}.*

Calculating directly from this form requires a total of *n(n+1)/2* multiplications and *n* additions, making the computational complexity *O(n^{2})*.

Horner’s Rule states that we can factor this and rewrite the polynomial as:

*f(x) = (((a_{n}x + a_{n-1})x + a_{n-2})x + a_{n-3})x + … )x + a_{0}.*

Using this form reduces the required number of operations to *n* multiplications and *n* additions. The computational complexity is now reduced to *O(n)*, making higher order polynomials much more manageable.
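The two evaluation strategies can be compared with a short behavioral model. This is an illustrative sketch (the function names `naive_eval` and `horner` are mine, not from the design); the hardware described later implements the Horner form.

```python
def naive_eval(coeffs, x):
    """Direct evaluation of the polynomial.
    coeffs is ordered [a_n, a_{n-1}, ..., a_1, a_0].
    Costs n(n+1)/2 multiplications and n additions."""
    n = len(coeffs) - 1
    return sum(a * x ** (n - i) for i, a in enumerate(coeffs))

def horner(coeffs, x):
    """Horner's Rule evaluation: n multiplications and n additions.
    Each loop iteration corresponds to one multiply-add stage."""
    acc = 0
    for a in coeffs:
        acc = acc * x + a
    return acc

# f(x) = 2x^3 - 3x^2 + 5 evaluated at x = 4:
# ((2*4 - 3)*4 + 0)*4 + 5 = 85
print(horner([2, -3, 0, 5], 4))  # -> 85
```

Both functions give the same result; Horner’s form simply reorders the work so each coefficient is consumed by exactly one multiply-add.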

#### Serial Arithmetic

Since a polynomial is a combination of multiplications and additions, the two main building blocks in this design are serial versions of those functions. The first and simplest block is the adder.

**Figure 1**. Serial adder diagram.

The serial adder is simply a full adder with a feedback flip-flop for the carry and a flip-flop on the output. Therefore, if A is *M* bits and B is *N* bits, a complete addition operation takes *max(M,N)* clock cycles.
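The adder’s behavior can be modeled in a few lines of Python. This is a hypothetical sketch of the structure in Figure 1, not the VHDL itself: one full adder, with the carry fed back each cycle, consuming one bit of each operand LSB-first per clock.

```python
def serial_add(a, b, width):
    """Bit-serial addition: a single full adder plus a carry flip-flop.
    Operands are consumed LSB-first, one bit per clock cycle;
    'width' plays the role of max(M, N) cycles."""
    carry = 0
    result = 0
    for i in range(width):              # one loop iteration = one clock
        bit_a = (a >> i) & 1
        bit_b = (b >> i) & 1
        result |= (bit_a ^ bit_b ^ carry) << i        # output flip-flop
        carry = (bit_a & bit_b) | (carry & (bit_a | bit_b))  # carry feedback
    return result

print(serial_add(13, 9, 5))  # -> 22
```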

**Figure 2**. Example of a 4-bit serial multiplier.

The diagram shows a 4-bit example of the serial/parallel multiplier. One of the inputs is parallel and stored in a register, while the other is a single bit that is provided serially. The AND gates form the partial products that feed into the adders along with the output of the previous stage. The carry out of each adder is fed back into its carry input through a flip flop. During operation, starting with the 0th position and moving left, a column of partial products and carries is summed and the resulting bit is latched into the output flip flop. For a multiplicand of width *M* and a multiplier of width *N*, the entire multiplication requires *M+N* clock cycles.
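The column-by-column operation described above can be sketched behaviorally. This model (names are mine) sums one column of AND-gate partial products plus the fed-back carry per clock cycle, emitting one product bit LSB-first; it handles unsigned values only, since the design handles negative inputs with separate logic as noted below.

```python
def serial_multiply(a, x, m, n):
    """Behavioral model of the serial/parallel multiplier (unsigned).
    a: M-bit parallel multiplicand held in a register.
    x: N-bit multiplier, consumed serially LSB-first.
    Each clock sums one column of partial products plus the carry,
    so the full product takes m + n clock cycles."""
    carry = 0
    product = 0
    for k in range(m + n):                 # one product bit per clock
        col = carry
        for i in range(m):                 # AND-gate partial products
            j = k - i                      # multiplier bit in this column
            if 0 <= j < n:
                col += ((a >> i) & 1) & ((x >> j) & 1)
        product |= (col & 1) << k          # latch column sum into output
        carry = col >> 1                   # carries feed back to next column
    return product

print(serial_multiply(13, 11, 4, 4))  # -> 143
```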

In this design, a bit of extra logic was added to handle the case of negative numbers. If the multiplicand is ever negative, it is made positive and then the final result is negated. This ensures that the result is always sign extended properly.

The multiplier and adder are combined into a higher-level multiply-add block which implements the function *Ax + B*. An extra flip-flop is required to delay the B input so that it arrives at the adder at the correct time.

## Operation

At the top level, the serial polynomial block consists of a number of multiply-add blocks equal to the polynomial order, two shift registers for input and output, and a state machine that controls the flow of data. Figure 3 shows a simplified view of the data path for a 3rd order polynomial without some of the additional control signals. Note that bold lines represent a bus and regular lines represent single bits.

**Figure 3**. Top level view of the serial polynomial block.

#### Inputs and Outputs

The polynomial block has two generic parameters for the width of the input data and the order of the polynomial it implements.

Two options are available for loading the input shift register, selected with the s_p signal. If s_p is high, the X register is loaded serially through the shift_X port. When s_p is low, X is loaded in parallel on the next rising clock edge.

The *coeffs* vector contains one bit for each coefficient, where *coeffs(0)* is the serial line for a_{n}, *coeffs(1)* is the serial line for a_{n-1}, etc. For each coefficient starting with a_{n-2}, two flip-flops are added between the input and the adder, with two more added for each subsequent coefficient. This delays the signals so that they arrive at their multiply-add block at the correct time. These flip-flops are clocked by an inverted version of the main clock signal to maximize the setup and hold times on the later stages.

On the output side, the result can be taken from either the shift_out port or the parallel result port. Each port has a corresponding flag signal to alert an external host that the calculation is complete. Valid is asserted once shift_out holds the LSB of the result and remains high until the calculation is done. Done is asserted once the complete result has been loaded into the output shift register and is de-asserted at the beginning of the next calculation.

#### FSM Control

**Figure 4**. State machine diagram.

The state machine shown above is used to control the flow of the calculation. In the INIT state, the data on the X input bus is latched into a register. If using serial mode, the data on shift_X will be latched in instead. All other control signals are set to their initial conditions.

Once start goes high, the state machine can move to either the SERIAL_LOAD state or the CALC state. If serial mode is selected, the state moves to SERIAL_LOAD where the math section is disabled and the remaining bits of X are shifted into the input shift register. When that is finished or if parallel mode was selected during the INIT state, the state becomes CALC.

CALC is where the math operations begin. The first multiply-add block in the chain is enabled and the input shift register is disabled. In this state, clock cycles are counted and used to enable the multiply-add blocks and output register. Since the multiply-add operation takes two clock cycles, the next block in the chain is enabled two clock cycles after the previous one until they are all enabled. After 2*poly_order clock cycles, the first bit (LSB) of the result is calculated. At this point, the output shift register is enabled so that the final result can start shifting in.

The timing diagram below shows when the result of each stage is available for a fourth order polynomial with four bit data width. Notice that after the final bit of a stage is calculated (stage 1, cycles 10-11), all subsequent bits are the sign extension of the result.

**Figure 5**. Timing diagram of calculation process.

After an additional (poly_order+1)*data_width clock cycles, the MSB of the final result is calculated and shifted into the output register. The output register is then disabled and the state changes to COMPLETE.

In COMPLETE, the done bit is asserted and the state machine returns to the INIT state where a new calculation can begin.
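Putting the cycle counts from the CALC description together gives a simple latency formula. This is my own summary of the figures stated above (LSB after 2·poly_order cycles, MSB after an additional (poly_order+1)·data_width cycles), not a function from the design:

```python
def serial_poly_latency(poly_order, data_width):
    """Total CALC-state clock cycles implied by the text:
    the output LSB appears after 2*poly_order cycles, and the MSB
    after an additional (poly_order + 1)*data_width cycles."""
    return 2 * poly_order + (poly_order + 1) * data_width

# Fourth order polynomial with 4-bit data, as in Figure 5:
print(serial_poly_latency(4, 4))  # -> 28
```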

#### Advantages & Disadvantages

At this point, it should be clear that this method requires many more clock cycles than the more conventional parallel method. A valid question then might be: If this method is so much slower, why would anyone use it? The answer lies in the resource usage. The table below shows a comparison of the Quartus synthesis results of the multiply-add block in a serial vs parallel topology. The parallel version was made using IP from Altera’s catalog with default optimizations for a Cyclone V SoC.

**Table 1**. Resource usage in ALMs.

While the parallel version uses roughly the same resources as the serial version at low data widths, its cost increases quadratically with bit width. However, since most FPGAs have hard multipliers on the die, making multipliers out of logic elements isn’t usually necessary.

A serial implementation’s resource usage increases only linearly with data width, but it is practical only in situations where latency isn’t an issue. A parallel multiply-add has a minimum latency of one clock cycle, whereas this serial method has a minimum latency of the bit width of the product plus one. For example, a 32-bit serial multiply-add requires 65 clock cycles.
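The latency comparison above is easy to sanity-check in code. This trivial helper (my naming, not the design’s) encodes the stated rule that the serial multiply-add needs the product bit width plus one cycle:

```python
def serial_mla_latency(data_width):
    """Minimum serial multiply-add latency in clock cycles:
    product width (2 * data_width) plus one."""
    return 2 * data_width + 1

print(serial_mla_latency(32))  # -> 65, vs. 1 cycle for parallel
```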

The serial method is an interesting, but niche solution. It can be useful in situations where data is being received through a serial protocol from an external host. In this case, the data can be processed as it comes in rather than waiting for the whole transaction to complete. Designs where area is a primary concern and latency is not might also benefit from a serial approach.

## Simulation and Testing

This design was simulated in ModelSim. The example test bench consists of four sets of randomly chosen coefficient values. The first three sets use the parallel load mode and the final set uses the serial load mode.

**Figure 6**. Simulation screen capture.

## Conclusion

Serial arithmetic is an option that can be useful when resource reduction is critical and latency isn’t an issue. So while it is very situational, it is an alternative to be aware of. Since polynomials are ubiquitous in the engineering world, it is a good starting example for creating a higher level function from the serial building blocks.

Possible extensions of this concept could be applications to floating or fixed point numbers. The overall polynomial design could also potentially be revised to be fully pipelined to handle continuous calculations more efficiently.

## Downloads

The .zip file includes all VHDL source files for the project, including the top-level test bench and a .do file for simulation.

Serial Polynomial_v1.1.zip (13.0 KB)