This article covers the following topics:
- Implementing polynomials in FPGAs using Horner’s Rule
- Fractional arithmetic using fixed point format
- Interfacing to an external processor
- Scalable polynomial order and data width
- Test bench and verification with MATLAB
Evaluating a high order polynomial on a processor can be very time consuming. For example, without any tricks a 10th order polynomial requires 65 multiplications and 10 additions. That’s a minimum of 75 cycles that your processor is tied up assuming you can do each operation in a single cycle, which isn’t always the case. This design uses two methods to reduce that computation time, requiring a number of clocks equal to the order of the polynomial plus two, or one cycle for each input required. First, Horner’s Rule allows a polynomial calculation with fewer multiplications. Second, the use of fixed point as opposed to floating point is also useful since it reduces the time required for a multiplication operation.
At the core of this design is Horner’s Rule which is a simple and efficient algorithm for calculating polynomials. Assume that we have an nth order polynomial of the following form
f(x) =anxn + an-1xn-1 + … +a1x + a0
Calculating with this form directly requires a total of n(n +1)/2 multiplications and n additions making this computation complexity O(n2).
Horner’s Rule states that we can factor this and rewrite the polynomial as:
f(x) =(((anx + an-1)x + an-2)x + an-3)x + … )x +a0.
Using this form reduces the required number of operations to n multiplications and n additions. The computation complexity is now reduced to O(n) , making higher order polynomials much more manageable.
For this application, the polynomial input and coefficients are represented in a fixed point format. A fixed point number is simply a normal binary integer that, in this case, is interpreted as a fractional number. Qm.n notation will be used where m is the number of integer bits (excluding the sign bit) and n is the number of fractional bits. The resulting data width is then m+n+1. Note that some places will include the sign bit in m making the overall width m+n . Either convention works as long as consistency is maintained. This article will follow the first convention. Also note that these parameters are available as generics in the polynomial VHDL code. For example, a Q3.4 number is 8 bits wide, consisting of 1 sign + 3 bits for the signed integer portion plus 4 bits for the fraction. However in many cases, m is 0 and all but the sign bit are allocated to the fractional portion.
Unlike floating point, the radix point of a fixed point number is implied and it is up to the designer to keep track of the radix point during calculations. The dynamic range of a fixed point number, [-(2m), 2m - 2-n], is also much smaller than that of a floating point number. The upside is that it requires no special hardware since the numbers are regular integers which leads to faster computation time. The number’s resolution, 2-n, stays constant for all numbers within the dynamic range.
Since the fixed point numbers are integers, arithmetic is essentially the same as with any 2’s complement number. One caveat is that when adding or subtracting Q numbers, the user must keep the radix points aligned. Alignment can be accomplished by shifting by the difference in fractional bits or simply specifying the correct bits to use in the HDL. Also when adding or subtracting, care must be taken in regards to overflow. In some cases, overflow can be allowed if an overall system gain will keep the values within their allowed range. It can also be allowed if intermediate calculations will cause the value to wrap-around to the correct result.
Multiplication of a Qm1.n1 number with a Qm2.n2 number results in a Q(m1 + m2).(n1 + n2) number. Note that the two most significant bits will always be identical as they are the sign bit and a one bit extension of it. Overflow is possible with multiplication as well, but since one or both operands are typically normalized to a magnitude between 0 and 1, this isn’t usually a problem.
Horner’s Rule has demonstrated that a polynomial can be calculated in sections of 1st order polynomials of the form a1x + a0. Therefore the most basic building blocks of this design are multipliers and adders. The VHDL for these components can be found in adder.vhd and mult.vhd.
The multiplier simply uses the built in VHDL signed multiplication function with a registered output. This way, it should utilize an FPGAs built-in multiplier blocks if it has them. The adder uses a Carry Look Ahead architecture without a registered output.
These two building blocks are combined into a higher level block that performs a single multiply-add operation in mult_add.vhd. This unit implements the function a1x + a0, where ai and x are both Qm.n numbers, used for each stage of Horner’s Rule. Therefore an nth order polynomial requires n instances of this block.
It is assumed that either the coefficients, input x, or both are normalized to be between 0 and 1 in magnitude so that the multiplication result can be stored back into the same data size since the integer bits cannot overflow. In this application, the multiplication result is simply truncated, taking the sign bit, the closest m bits left of the radix point, and the first n bits after the radix point.
This is the first point where the integer and fractional lengths come into play since the result needs to be truncated in the right place to preserve the correct format. The individual multipliers and adders only see integers and don’t care about the radix point.
Figure 1. Polynomial RTL view.
At the polynomial level, the multiply-add blocks are cascaded so that the result of one feeds into the input of the next. An RTL view of the polynomial block is shown above in Figure 1 for a 3rd order polynomial. This level also adds a register for x and each coefficient. This way a single bus can be used for all the inputs. An internal signal is shifted at each rising clock edge to load the inputs into their respective registers. The data must be in the order x, an, an-1, …, a0. The final result is then valid after n+2 clock cycles plus the propagation delay of the adder.
That concludes the inner workings of the actual polynomial block. However, the original idea behind this design was to use it like a co-processor for a host microcontroller. This design really excels when used in a system-on-a-chip (SoC) where the data can be shared between a processor and FPGA fabric through memory mapped registers. But this is intended to be a generic guide that should work with whatever hardware you have available.
An external MCU could use a parallel interface to directly interface with the polynomial block, but that would take up a lot of pins. A serial interface may be more practical as nearly all MCUs have one built in. This example uses a SPI slave interface based on the one found here. The test bench for the top level also incorporates the SPI Master found here. While SPI is the fastest of the common serial interfaces, any serial interface will create a significant bottleneck especially with higher order polynomials and large data widths. This presents a problem if you need the result very quickly. Users are encouraged to develop their own interface, like the memory mapping mentioned above, to suit their needs and combine it with the polynomial block.
Figure 2. Top level RTL view.
Figure 2 shows the top level design including the SPI interface. With this SPI interface, the slave select signal is used as the clock to the polynomial unit, latching the data when a transaction is complete so that the data is processed as fast as it comes in. A counter keeps track of the number of transactions and will signal the host after the final calculation is finished by asserting the “done” output. The host then pulses the “request” line at the start of the next SPI transaction during which the result will be latched and sent back to the host. The slave will then be expecting the next x value to start a new calculation in the same transaction. Note that this application uses only SPI mode 3.
This design was simulated in ModelSim Altera Starter Edition v10.4b. Test data was created using Matlab’s fi function. The Matlab script runs through the same algorithm and gives an expected result to compare to. The first simulation just tests the polynomial unit. The test data is a 7th order polynomial with a Q7.24 data format, so our values can range from -128 to 128-2-24. However, in order to avoid overflow, the chosen values range from -64 to 64-2-24.
Figure 3. Polynomial block simulation in ModelSim.
The second simulation incorporates the SPI interface and calculates a 3rd order polynomial using a more typical Q0.15 data format.
Figure 4. Top level design simulation in ModelSim
The Matlab script and test bench files are included in the downloads. Note that depending on the rounding method used in Matlab, the simulation result might differ slightly. The direct truncation used in this application is equivalent to Matlab’s “Floor” rounding method.
This design offers an implementation of Horner’s Rule for calculating polynomials in fixed point arithmetic with variable order, data width, and integer and fractional lengths. Users may easily replace the provided multiplier and adder components with IP from their preferred silicon vendor if desired. The example uses a SPI interface to communicate with a host microcontroller, but this isn’t ideal and users are encouraged to design their own communication interface to suit their application.
Polynomial.zip (17.2 KB)
This zip file includes all VHDL files, 2 test benches, 2 DO files for simulation, and the Matlab simulation script.