Implementing a Robust Microcontroller to FPGA SPI Interface: Part 1 - FPGA Challenges

In this multipart post we will explore a microcontroller (uC) to Field Programmable Gate Array (FPGA) Serial Peripheral Interface (SPI). The primary objective is to make the FPGA easier to control. We will explore a Verilog implementation that was developed on the Digilent Basys 3 (Xilinx Artix-7) development board, with corresponding microcontroller code developed on an Arduino Nano Every. Together, they provide a packetized data transfer with a 16-bit Cyclic Redundancy Check (CRC) that operates with a tested SPI serial clock of 16 MHz. The Verilog implementation is intended to avoid any Xilinx-specific dependencies, making it portable to other platforms with minimal modification. However, this cross-platform goal has yet to be tested.

Figure 1: Bench test of the uC to FPGA SPI interface featuring a Digilent Basys 3 and an Arduino Nano Every. The uC is multiplexing the seven-segment display (SSD) and the 16 LEDs on the Basys 3 board.

Many people are introduced to the FPGA as part of a digital logic class, using schematic entry tools to develop simple combinational circuits from logic primitives. Many continue their studies with a class dedicated to the FPGA that introduces Verilog or VHDL. Some continue with advanced logic classes or use the FPGA in a capstone project. Unfortunately, few people move beyond independent modules to integrate multiple modules into a larger system. There is a reason for this.

While the FPGA is an amazing piece of technology, it has a reputation of being hard to control. By comparison to a microcontroller (uC), the FPGA complexity is an order of magnitude higher. The hardware structures must be built from scratch or instantiated from examples found in textbooks or on the Internet.

This series of articles is written to help you move toward system design within the FPGA. It suggests using the best attributes of the FPGA for time-critical, parallel, and deterministic circuitry where speed is required. We then leverage the microcontroller’s strengths of flexibility, relative ease of programming, and communications stack including wireless and cloud capability. One way to view this situation is to think of the FPGA as a powerful microcontroller peripheral connected via a moderately high-speed data bus.

The target audience is a person or team who can program an FPGA in Verilog as well as program a microcontroller. This may be a good fit for capstone students, as a team approach allows simultaneous work on the FPGA and uC.

Since our focus is on the FPGA, it is desirable to minimize the microcontroller content. The Arduino Nano Every was chosen as it is a common and well-known microcontroller. Most readers of this article will be intimately familiar with the microcontroller and C programming.

Rather than simply present the Verilog and uC code, we will explore the reasoning and challenges associated with the design process. The result is a rather didactic series of articles, but I trust they will be informative.

System Definition by Example

A system is a collection of components that work together. For an FPGA application, this may include modules for data acquisition, filtering, control, and a method to present the information to a user or integration into a larger automated system.

To better define the term, let’s consider a challenging FPGA plus uC based example. Suppose we wish to construct a system to measure real and reactive power for a three-phase 400 Hz waveform. The design requirements are to provide RMS voltage, RMS current, as well as an accurate measure of the phase difference between the signals. Let’s further assume line harmonics up to 20 kHz must be measured.

To ensure the best performance, let’s assume that the voltage and current measurements are performed simultaneously, implying the need for 6 independent Analog to Digital Converters (ADCs). The Nyquist criterion demands that we sample each channel at a rate of at least 40,000 samples per second, twice the 20 kHz harmonic limit. Across the 6 channels, this requires 240,000 samples per second. On top of this, the system must perform the RMS and phase angle calculations. The RMS calculation mechanism should include filtering to present a short- and long-term integration, allowing fast response to transients while maintaining long-term stability. The cherry on top is the Fast Fourier Transform (FFT) to determine the harmonics.

There are many ways to design such a system. A high-end uC or several coordinated uCs could perform the task. But that is not the focus of this article. Instead, we will recognize that the data acquisition and filter aspects are well within the capabilities of an entry-level FPGA.

An FPGA-based system with parallel structures could easily perform the tasks. Designing the individual modules isn’t too difficult. In fact, many of you have already integrated a single ADC into an FPGA. The real challenge is to glue all the modules together to move data to where it is needed when it is needed.


Personally, one of the most difficult FPGA learning challenges I encountered was a stubborn unlearning of the uC programming techniques. Verilog and VHDL are hardware description languages, whereas a language such as C is a procedural language featuring abstraction to eliminate hardware dependencies. It took longer than it should have for my mind to grasp the parallelism (everything, all at once) inherent in the FPGA hardware descriptions.

For our opening example, we need to visualize the six ADCs. We need to visualize the various control and data lines used to connect them together. This is followed by registers to hold the intermediate results, multipliers, and adders to perform the squaring and summing operation of the RMS calculation, and a host of other state machine hardware to coordinate the activities.
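To make that picture a little more concrete, here is a minimal sketch of the squaring and summing hardware for a single channel. It is an illustration only and not part of the SPI design: the module name, the 12-bit signed sample width, the block length, and the one-clock sample_strobe input are all assumptions. The divide and square root that complete an RMS calculation are left out.

```verilog
// Hypothetical sum-of-squares accumulator for one ADC channel.
// All names and widths are assumptions made for illustration.
module sum_of_squares #(
    parameter N_LOG2 = 8                        // accumulate 2**N_LOG2 samples per block
)(
    input  wire                 clk,
    input  wire                 sample_strobe,  // high for one clock per new sample
    input  wire signed [11:0]   sample,         // assumed 12-bit signed ADC value
    output reg  [23+N_LOG2:0]   sum_sq,         // completed sum of squares
    output reg                  sum_strobe      // high for one clock when sum_sq updates
);
    reg [N_LOG2-1:0]  n   = 0;                  // sample counter within the block
    reg [23+N_LOG2:0] acc = 0;                  // running sum of squares

    // Signed 12 x 12 multiply; the square is never negative and fits in 24 bits.
    wire [23:0] sq = sample * sample;

    always @(posedge clk) begin
        sum_strobe <= 1'b0;                     // default
        if (sample_strobe) begin
            n <= n + 1'b1;                      // wraps naturally at 2**N_LOG2
            if (n == {N_LOG2{1'b1}}) begin      // last sample of the block
                sum_sq     <= acc + sq;         // register transfer of the result
                sum_strobe <= 1'b1;
                acc        <= 0;
            end else begin
                acc <= acc + sq;
            end
        end
    end
endmodule
```

Six copies of a block like this would run side by side, one per ADC, which is exactly the kind of parallelism the FPGA provides at no cost in throughput.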

The glue for this FPGA based hardware is the Register Transfer Level (RTL) design methodology. The term Register Transfer (RT) implies a control mechanism used to transfer data from one register to another often with a cloud of combinational logic inserted between registers. This is related to the pipeline process used in a microprocessor. For example, on the 1st clock cycle, data are presented to an adder. On the 2nd clock cycle the adder performs the operation. On the 3rd clock cycle the data are transferred to memory. The controller in this example is a state machine responsible for initiating the RT process. For our purposes, we will assume all the registers fall within a single clock domain.
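The three-cycle adder example can be sketched in Verilog as follows. This is a minimal illustration under stated assumptions rather than code from the SPI design: the module name, port names, 8-bit operand width, and the start and done signals are all hypothetical.

```verilog
// Hypothetical three-step register transfer controlled by a state machine.
module rt_add_example (
    input  wire       clk,
    input  wire       start,       // one-clock request to begin a transfer
    input  wire [7:0] a,
    input  wire [7:0] b,
    output reg  [8:0] result,      // stands in for the "memory" destination
    output reg        done         // high for one clock when result is valid
);
    localparam IDLE  = 2'd0,
               ADD   = 2'd1,
               STORE = 2'd2;

    reg [1:0] state = IDLE;
    reg [7:0] op_a, op_b;          // operand registers feeding the adder
    reg [8:0] sum;                 // register on the adder output

    always @(posedge clk) begin
        done <= 1'b0;              // default
        case (state)
            IDLE: if (start) begin // 1st clock: present the data to the adder
                op_a  <= a;
                op_b  <= b;
                state <= ADD;
            end
            ADD: begin             // 2nd clock: capture the adder result
                sum   <= op_a + op_b;
                state <= STORE;
            end
            STORE: begin           // 3rd clock: transfer the result to its destination
                result <= sum;
                done   <= 1'b1;
                state  <= IDLE;
            end
            default: state <= IDLE;
        endcase
    end
endmodule
```

The adder itself is the cloud of combinational logic between the operand registers and the sum register. The state machine only decides when each register is allowed to update.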

This idea is worth repeating.

In our RTL designs, a state machine or collection of state machines will control and coordinate the transfer of data from register to register. All operations are assumed to be within the same clock domain. This may be a good time to explore a related post regarding the use of synchronizers to cross clock domains.

Recall that each register is memory. This term applies to anything from a single D-type flip-flop to an instantiation of one of the FPGA’s large block memories. Let’s explore the concept using a simple 8-bit register.

The example shown below contains all of the RTL machinery with our timing stipulations. The Q output is a register. The register update is synchronous with the positive edge of the clock, as evidenced by the @(posedge clk) statement and the use of Verilog’s nonblocking assignment operator (<=).

module reg_8bit(
    input wire clk,
    input wire load,
    input wire [7:0] D,
    output reg [7:0] Q
);
    // Q changes only on a rising clock edge, and only while load is high
    always @(posedge clk) begin
        if (load)
            Q <= D;
    end
endmodule

Consider the nature of the load signal. It must be stable prior to the rising edge of the clock. Provided all signals are in the same clock domain, and assuming the load signal itself is driven by a registered output of a state machine, the synthesis and implementation tools will do their very best to ensure this critical timing requirement is met.

Note that the “load” signal is asserted on one rising edge of the clock. The data present on D will become Q on the next rising edge of the clock. This is a trap for new programmers that leads to unexpected single-clock delays. It’s important to think in terms of state and state next to keep track of such RTL actions. Finally, note the width (period) of the load signal.
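One way to keep the state and state-next view in front of you is to write it out explicitly. The sketch below is an assumption-laden illustration, not the author's code: reg_8bit_sn is a hypothetical rewrite of the 8-bit register with a separate Q_next signal, and the small testbench drives load and D the way a registered state machine output would, so the single-clock delay can be observed in simulation.

```verilog
`timescale 1ns / 1ps

// Hypothetical state / state-next form of the 8-bit register.
module reg_8bit_sn (
    input  wire       clk,
    input  wire       load,
    input  wire [7:0] D,
    output reg  [7:0] Q
);
    reg [7:0] Q_next;

    // Combinational logic: decide what Q will become on the next clock edge
    always @* begin
        Q_next = Q;                // default: hold the present state
        if (load)
            Q_next = D;
    end

    // Register: the present state takes the next-state value on the rising edge
    always @(posedge clk)
        Q <= Q_next;
endmodule

// Minimal testbench (names are assumptions); prints every signal change.
module reg_8bit_sn_tb;
    reg        clk  = 1'b0;
    reg        load = 1'b0;
    reg  [7:0] D    = 8'h00;
    wire [7:0] Q;

    reg_8bit_sn dut (.clk(clk), .load(load), .D(D), .Q(Q));

    always #5 clk = ~clk;          // 10 ns period, i.e. a 100 MHz clock

    initial begin
        $monitor("t=%0t load=%b D=%h Q=%h", $time, load, D, Q);
        @(posedge clk);            // load and D change just after this edge,
        load <= 1'b1;              // as they would from a registered source
        D    <= 8'hA5;
        @(posedge clk);            // Q captures D on this edge, one clock later
        load <= 1'b0;
        D    <= 8'h3C;             // ignored because load is now low
        @(posedge clk);
        @(posedge clk);
        $finish;
    end
endmodule
```

In the printed trace, Q should take the value of D one clock period after load and D were driven, and then hold that value once load returns low. Q is undefined before the first load because the sketch has no reset.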

In a synchronous RTL system, the signal’s on-time can be no less than the period of the clock (rising edge to rising edge). This follows from the fact that the associated state machine updates its outputs only on the rising edge of the clock.

Programming Tip: The RTL design constraints mentioned in this article place limitations on the FPGA’s performance and can result in unnecessary use of FPGA fabric. However, registering all signals with the @(posedge clk) stipulation will generally improve system stability. It’s a good starting point that you can later modify to suit your needs.

Strobed Register Transfers

In the previous example we noted that the “load” signal’s on-time (width) could vary. Since this is a synchronous system, the width is always a whole number of clock periods, and the minimum on-time is one period of the system clock. A signal of this minimum width goes by several names, including strobe, tick, and pulse. For this series of articles, we will use the term strobe.

Earlier, we defined RTL as a design methodology that features a series of registers. Data are transferred between registers all of which are assumed to be in the same clock domain. Combinational logic is placed between registers to modify the data with the understanding that all logic operations must settle within the period of the domain’s clock. For example, with a 100 MHz clock, all FPGA signals must settle in 10 ns so that they are ready for the next clock event.

The RTL process requires a controller or collection of coordinated controllers to control the registers. One way to control the registers is for each controller to generate a strobe signal to advance the register of interest.

One simple strobe-based controller is shown below. This RTL will generate a strobe at a 20 kHz rate given a 100 MHz clock; it is a mod-5000 counter. It is a controller in the sense that the strobe could be used to initiate a process that repeats 20,000 times a second, such as an ADC conversion.

module pulse_20k (                          // mod 5000 for a 100 MHz clock
    input wire clk,
    output reg zero_strobe,
    output reg [12:0] count                 // 13 bits to hold numbers from 0 to 4999
);
    initial count = 13'd0;                  // known starting value (needed for simulation)

    always @(posedge clk) begin
        zero_strobe <= 1'b0;                // default
        count <= count + 1'b1;
        if (count >= 4999) begin            // Count starts at 0, so 4999 is the last state
            count <= 13'd0;
            zero_strobe <= 1'b1;
        end
    end
endmodule

Observe that the strobe and zero count occur in the same clock cycle. This may seem counterintuitive unless we view the state machine through a state / state-next lens. The (count >= 4999) condition will be true when the counter is in state 4999. In this example, a count of 4999 is the maximum count for a mod-5000 counter. On the next rising edge of the clock, the count changes back to 0 and zero_strobe is asserted at the same time. This is like a clock where the minutes count up to 59 and then roll over to 0.

At this point we could begin to design controllers with more complexity. We certainly will in later installments of this article. For now, the simple time-based controller has served its purpose. We know that it will generate a strobe that is synchronous within the clock domain. That strobe may then be used as part of a larger RTL system.
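As a hint of how such a strobe fits into a larger RTL system, the sketch below uses it as the load enable for a sample register and then passes a second strobe downstream. This is an illustration under assumptions: the module name, the 12-bit adc_data input, and the sample_strobe output are hypothetical, and the time base is repeated inside the module only so that the example stands alone.

```verilog
// Hypothetical use of a time-base strobe to control a register transfer.
module strobe_sample_example #(
    parameter DIVIDE = 5000              // 100 MHz / 5000 = 20 kHz (assumed clock rate)
)(
    input  wire        clk,
    input  wire [11:0] adc_data,         // assumed parallel ADC result
    output reg  [11:0] sample,           // most recently captured value
    output reg         sample_strobe     // high for one clock when sample is new
);
    reg [12:0] count  = 13'd0;           // 13 bits limits DIVIDE to 8192
    reg        strobe = 1'b0;

    // Time base: one strobe every DIVIDE clocks
    always @(posedge clk) begin
        strobe <= 1'b0;                  // default
        count  <= count + 1'b1;
        if (count >= DIVIDE - 1) begin
            count  <= 13'd0;
            strobe <= 1'b1;
        end
    end

    // Consumer: the strobe acts as the load enable for the sample register
    always @(posedge clk) begin
        sample_strobe <= strobe;         // one clock later, aligned with the new sample
        if (strobe)
            sample <= adc_data;
    end
endmodule
```

Because sample_strobe is registered on the same edge that updates sample, a downstream module that acts on sample_strobe will always see the new value. Chaining strobes this way is one simple method to move data through a series of registers in step.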

Double Buffer

At this point we have briefly explored RTL operations with a caution to retain synchronous register transfers within a single clock domain. We will now explore RTL operations when the registers have different widths. This is a common occurrence, especially when using communications protocols such as SPI. In this case, SPI typically handles data by operating on consecutive bytes while the associated FPGA hardware may be 2 to 4 bytes wide.

As an example, consider a 10-bit Pulse Width Modulator (PWM). Given reg_B1 and reg_B0, we may perform the following operation:

assign reg_PWM = {reg_B1[1:0], reg_B0};     // 10 bits: lower 2 bits of reg_B1, all 8 bits of reg_B0

That is a reasonable concatenation. But there is great potential for things to go wrong. The problem is the update time of reg_B1 and reg_B0. If there is any delay between their updates, reg_PWM may end up with an erroneous value.

Suppose reg_B1 and reg_B0 were derived from a SPI interface. In this case the registers would be updated at different times. As a worst-case scenario, suppose the PWM command is ramping up from 255 to 256. At one moment in time the PWM is running with a duty cycle of about 25 % (command of 255 to a 10-bit PWM). Now suppose reg_B1 is updated by the SPI while reg_B0 still holds its old value. The duty cycle will now jump to about 50 % (command of 511 to a 10-bit PWM). It will stay at this erroneous value until reg_B0 is updated by the SPI. This jump in the PWM command, even for a brief period, could have undesirable effects on system stability. Consider the instability this would cause if the PWM was controlling a motor in a closed loop system.

The solution is to add an intermediate register known as a double buffer, as shown in Figure 2. This allows reg_B1 and reg_B0 to update naturally. Later, when both registers are known to be updated, the controller can transfer the contents to a 2-byte register known as the double buffer. This ensures that devices such as the PWM are updated with known complete registers as opposed to intermediate values.

Figure 2: Block diagram representation of the double buffer RTL.
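A minimal Verilog sketch of the double buffer idea follows. It is not the SPI module from later installments: the module name, the rx_byte input, and the three strobe inputs are assumptions made for illustration, and the byte registers are assumed to be 8 bits wide.

```verilog
// Hypothetical double buffer for a 10-bit PWM command received as two bytes.
module pwm_double_buffer (
    input  wire       clk,
    input  wire [7:0] rx_byte,        // assumed: byte delivered by the SPI receiver
    input  wire       b1_strobe,      // assumed: one-clock strobe, rx_byte is the upper byte
    input  wire       b0_strobe,      // assumed: one-clock strobe, rx_byte is the lower byte
    input  wire       update_strobe,  // assumed: one-clock strobe, both bytes are valid
    output reg  [9:0] pwm_cmd         // double-buffered command seen by the PWM
);
    reg [7:0] reg_B1 = 8'd0;
    reg [7:0] reg_B0 = 8'd0;

    initial pwm_cmd = 10'd0;

    // First stage: the byte registers update whenever their byte arrives
    always @(posedge clk) begin
        if (b1_strobe) reg_B1 <= rx_byte;
        if (b0_strobe) reg_B0 <= rx_byte;
    end

    // Second stage: both bytes transfer together on a single clock edge
    always @(posedge clk) begin
        if (update_strobe)
            pwm_cmd <= {reg_B1[1:0], reg_B0};
    end
endmodule
```

In this sketch the controller must issue update_strobe at least one clock after the last byte strobe, otherwise the transfer would pick up the previous byte value. With that ordering, the PWM command moves from 255 directly to 256 and never passes through 511.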

Closing for Part 1

In this installment we explored a few system-level considerations for FPGA design. While this is certainly not a complete list, the essential RTL methodology, synchronous design within a single clock domain, and use of strobes should be clear in your mind. This information will help us understand the SPI module presented in the next installments. This text provides clues on how the SPI module’s output strobes may be used to control the flow of data from the uC Master into the FPGA.

Part 2 has been posted.

Your comments and suggestions are welcome. Further discussion about high-level RTL system design methodology is especially welcome.

Best Wishes,