Debugging Common DMA Issues [Part 3]

If you frequent the PYNQ forum, you will know that one of the most common questions we get is why DMA transfers do not work. In part 3 of this series, I will reproduce these issues and use both the register_map and the ILA to identify and address them.

DMA channel not started

Often, we see users reporting something like this.

---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [4], in <cell line: 2>()
1 dma_recv.transfer(output_buffer)
----> 2 dma_recv.wait()

File /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/lib/dma.py:169, in _SDMAChannel.wait(self)
167 """Wait for the transfer to complete"""
168 if not self.running:
--> 169 raise RuntimeError("DMA channel not started")
170 while True:
171 error = self._mmio.read(self._offset + 4)

RuntimeError: DMA channel not started

Although the error message does not provide all the information, a closer look at dma.register_map helps to debug the problem.

To reproduce this error, we are going to modify our overlay so that TLAST is only asserted on the S2MM channel every 20 (valid) stream beats. For this, we will use the AXI4-Stream Subset Converter.

Starting from the initial overlay, open the block design in Vivado IP Integrator and run these commands from the TCL console.

disconnect_bd_intf_net [get_bd_intf_nets axis_data_fifo_M_AXIS] [get_bd_intf_pins axis_data_fifo/M_AXIS]
create_bd_cell -type ip -vlnv xilinx.com:ip:axis_subset_converter axis_subset_converter
connect_bd_net [get_bd_pins axis_subset_converter/aclk] [get_bd_pins axis_data_fifo/s_axis_aclk]
connect_bd_net [get_bd_pins axis_subset_converter/aresetn] [get_bd_pins axis_data_fifo/s_axis_aresetn]
connect_bd_intf_net [get_bd_intf_pins axis_subset_converter/S_AXIS] [get_bd_intf_pins axis_data_fifo/M_AXIS]
connect_bd_intf_net [get_bd_intf_pins axis_subset_converter/M_AXIS] [get_bd_intf_pins axi_dma/S_AXIS_S2MM]
validate_bd_design

Double click on the AXI4-Stream Subset Converter to set it up. We are not going to use the input TLAST, so we can control when the output TLAST is generated. Manually disable TLAST on the Slave interface and enable TLAST on the Master interface. Set Generate TLAST to 20. This forces the IP to assert TLAST on the M_AXIS AXI4-Stream interface every 20 (valid) stream beats.
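To make the effect of Generate TLAST = 20 concrete, here is a minimal behavioral sketch in Python. It is not the IP's implementation. The function name tlast_beats is hypothetical, and the sketch assumes that the beat counter keeps running between transfers rather than being reset.

```python
def tlast_beats(num_beats, period=20, start_count=0):
    """Model a counter that asserts TLAST every `period` valid beats.

    Returns the 0-based indices of the beats that carry TLAST and the
    counter value left over for the next transfer (assumed not to reset).
    """
    last = []
    count = start_count
    for beat in range(num_beats):
        count += 1
        if count == period:
            last.append(beat)
            count = 0
    return last, count


print(tlast_beats(16))  # 16 beats: TLAST is never asserted
print(tlast_beats(40))  # 40 beats: TLAST on beats 19 and 39
```

With this model, a 16-element transfer never sees TLAST, while a 40-element transfer is split into two packets of 20 beats each.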

Your block design should look like this:

Re-generate the bitstream. Once the bitstream is generated, move the new .bit and .hwh files to the board, using the basename dma_subset.

from pynq import Overlay, allocate
import numpy as np

ol = Overlay('dma_subset.bit')
dma = ol.axi_dma
dma_send = ol.axi_dma.sendchannel
dma_recv = ol.axi_dma.recvchannel

data_size = 16

input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)

input_buffer[:] = np.arange(data_size, dtype=np.uint32)

In the Hardware Manager, add the S2MM TVALID signal to the Trigger Setup (rising edge), as shown in part 2, then click the capture trigger button to arm the ILA and wait for the trigger.

Launch the DMA transfers from JupyterLab. Note that we start the receive transfer first, so that the ILA captures the stream beats as back-to-back valid transactions. Otherwise, the DMA pre-emptively accepts some S2MM transactions before the receive transfer has been started.

dma_recv.transfer(output_buffer)
dma_send.transfer(input_buffer)
dma_recv.wait()

This throws RuntimeError: DMA channel not started. Check dma.register_map for more details:

dma.register_map

The S2MM_DMASR register has useful information for debugging (see the AXI DMA documentation):

  • Halted=1: DMA channel halted

  • DMAIntErr=1: this error occurs when the Status AXI4-Stream packet RxLength field does not match the S2MM packet being received by the S_AXIS_S2MM interface

  • Err_Irq: indicates an interrupt event was generated on error

With just these registers, we can decipher that something with TLAST is not handled properly.
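When register_map is not convenient, for example when logging a raw status word, the same flags can be decoded by hand. The sketch below is only illustrative: decode_dmasr is a hypothetical helper, and the bit positions are the ones I recall from the AXI DMA product guide (PG021), so check them against the guide for your IP version.

```python
# Bit positions of the status flags in S2MM_DMASR. These are an
# assumption taken from the AXI DMA product guide (PG021).
S2MM_DMASR_BITS = {
    "Halted": 0,
    "Idle": 1,
    "SGIncld": 3,
    "DMAIntErr": 4,
    "DMASlvErr": 5,
    "DMADecErr": 6,
    "IOC_Irq": 12,
    "Dly_Irq": 13,
    "Err_Irq": 14,
}


def decode_dmasr(value):
    """Return the names of the flags that are set in a raw DMASR value."""
    return [name for name, bit in S2MM_DMASR_BITS.items() if (value >> bit) & 1]


# Hypothetical raw value with Halted, DMAIntErr and Err_Irq set.
print(decode_dmasr(0x4011))
```

On the board, the raw value could be obtained with something like dma.read(0x34), assuming the S2MM_DMASR offset given in the product guide.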

To confirm our analysis, check the waveform in the Hardware Manager. Note the 16 valid transactions on the S2MM channel. Although the number of transactions matches the size of output_buffer, TLAST is not asserted on the last beat, and this causes the RuntimeError: DMA channel not started.

Using both the dma.register_map and the ILA we confirmed that the issue is with TLAST not asserting.

To prove that the overlay works as intended, run this code from a different notebook.

from pynq import Overlay, allocate
import numpy as np

ol = Overlay('dma_subset.bit')
dma = ol.axi_dma
dma_send = ol.axi_dma.sendchannel
dma_recv = ol.axi_dma.recvchannel

data_size = 40

input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer_0 = allocate(shape=(data_size//2,), dtype=np.uint32)
output_buffer_1 = allocate(shape=(data_size//2,), dtype=np.uint32)

input_buffer[:] = np.arange(data_size, dtype=np.uint32)

Connect the Hardware Manager to the board. So far, we have been using the default capture mode. For this experiment, we are going to use two windows of 32 samples each and set the trigger position to 0.

In addition, add the S2MM channel (slot_1) TREADY signal to the Trigger Setup (Value R) and set the trigger condition to Global OR.

Once the ILA is configured, click the capture trigger button, then in JupyterLab run:

dma_recv.transfer(output_buffer_0)
dma_send.transfer(input_buffer)
dma_send.wait()
dma_recv.wait()
dma_recv.transfer(output_buffer_1)
dma_recv.wait()

print(f'{output_buffer_0=}')
print(f'{output_buffer_1=}')

I will leave the analysis of the waveform to the reader, but I would like to highlight that there are two TLAST assertions on the S2MM channel. This matches the configuration of our subset converter, as the input_buffer has 40 elements and TLAST is generated every 20 beats. Not all the valid transactions of the MM2S channel are captured because of the window depth.

Receive DMA .wait() never completes

This issue is trickier to debug, as there are at least two probable causes.

Data Transfer is Shorter than Expected

To see this, let’s continue using the modified overlay. The output buffer will be twice the size of the input buffer.

from pynq import Overlay, allocate
import numpy as np

ol = Overlay('dma_subset.bit')
dma = ol.axi_dma
dma_send = ol.axi_dma.sendchannel
dma_recv = ol.axi_dma.recvchannel

data_size = 10

input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(20,), dtype=np.uint32)

input_buffer[:] = np.arange(data_size, dtype=np.uint32)

Open the Hardware Manager and configure the ILA settings by entering the following in the TCL console:

set_property CONTROL.WINDOW_COUNT 1 [get_hw_ilas -of_objects [get_hw_devices xc7z020_1] -filter {CELL_NAME=~"dma_i/system_ila_0/inst/ila_lib"}]
set_property CONTROL.DATA_DEPTH 1024 [get_hw_ilas -of_objects [get_hw_devices xc7z020_1] -filter {CELL_NAME=~"dma_i/system_ila_0/inst/ila_lib"}]
set_property CONTROL.TRIGGER_POSITION 0 [get_hw_ilas -of_objects [get_hw_devices xc7z020_1] -filter {CELL_NAME=~"dma_i/system_ila_0/inst/ila_lib"}]

Now, click the capture trigger button, then in JupyterLab run:

dma_recv.transfer(output_buffer)
dma_send.transfer(input_buffer)
dma_recv.wait()

dma_recv.wait() never completes. Click the Interrupt the kernel button a few times to kill the waiting process. If you explore dma.register_map, you will find that the S2MM channel is neither Idle nor Halted, which means that something was not correct in the AXI4-Stream transactions.

Let’s turn our attention to the ILA waveform. Notice that S2MM is only valid for 10 transactions. However, in the PYNQ code we specified that we want to receive an array of 20 elements, so the DMA keeps waiting for the remaining data. PYNQ does not have a DMA timeout capability.
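Since wait() can block forever, one workaround during debugging is to poll the channel with a deadline instead. The sketch below is a hypothetical helper, not part of PYNQ. It assumes the channel object exposes an idle property, and FakeChannel is only a stand-in so the example runs without a board. Unlike wait(), it does not check the error bits of the status register.

```python
import time


def wait_with_timeout(channel, timeout_s=2.0, poll_s=0.01):
    """Poll `channel.idle` until it is True or `timeout_s` elapses.

    Returns True if the channel went idle, False on timeout.
    Hypothetical helper, not part of PYNQ.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if channel.idle:
            return True
        time.sleep(poll_s)
    return bool(channel.idle)


class FakeChannel:
    """Stand-in for a DMA channel so the sketch runs without hardware."""

    def __init__(self, idle_after_polls):
        self._polls_left = idle_after_polls

    @property
    def idle(self):
        # Report idle only after the configured number of polls.
        self._polls_left -= 1
        return self._polls_left < 0


print(wait_with_timeout(FakeChannel(3), timeout_s=1.0))       # transfer completes
print(wait_with_timeout(FakeChannel(10**9), timeout_s=0.05))  # transfer stalls
```

On the board you would pass dma_recv instead of FakeChannel. A timeout only tells you that the transfer stalled. The channel is still waiting for data, so you still need the register_map and the ILA to find out why.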

To demonstrate that the DMA completes once it receives the remaining data, transfer the input_buffer again. You should see the DMA complete the transfer.

dma_send.transfer(input_buffer)
dma_recv.wait()
print(f'{output_buffer=}')

Not Handling TKEEP

The other reason for the DMA to never finish could be incorrect handling of TKEEP. To demonstrate this, let’s modify the subset converter to always set TKEEP to 0.

With the block design open in Vivado, type the following in the TCL console:

set_property -dict [list \
  CONFIG.S_HAS_TLAST.VALUE_SRC USER \
  CONFIG.M_HAS_TKEEP.VALUE_SRC USER \
  CONFIG.M_HAS_TLAST.VALUE_SRC USER \
  CONFIG.M_HAS_TKEEP {1} \
  CONFIG.M_HAS_TLAST {1} \
  CONFIG.S_HAS_TLAST {1} \
  CONFIG.TKEEP_REMAP {4'b0} \
] [get_bd_cells axis_subset_converter]

The subset converter is now configured to always set TKEEP to 0. This mimics designs that do not have TKEEP or do not set it properly.

Re-generate the bitstream. Once the bitstream is generated, move the new .bit and .hwh files to the board, using the basename dma_notkeep.

Create a new JupyterLab notebook.

from pynq import Overlay, allocate
import numpy as np

ol = Overlay('dma_notkeep.bit')
dma = ol.axi_dma
dma_send = ol.axi_dma.sendchannel
dma_recv = ol.axi_dma.recvchannel

data_size = 20

input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)

input_buffer[:] = np.arange(data_size, dtype=np.uint32)

Open the Hardware Manager and configure the ILA with the following TCL commands:

set_property CONTROL.WINDOW_COUNT 1 [get_hw_ilas -of_objects [get_hw_devices xc7z020_1] -filter {CELL_NAME=~"dma_i/system_ila_0/inst/ila_lib"}]
set_property CONTROL.DATA_DEPTH 32 [get_hw_ilas -of_objects [get_hw_devices xc7z020_1] -filter {CELL_NAME=~"dma_i/system_ila_0/inst/ila_lib"}]
set_property CONTROL.TRIGGER_POSITION 0 [get_hw_ilas -of_objects [get_hw_devices xc7z020_1] -filter {CELL_NAME=~"dma_i/system_ila_0/inst/ila_lib"}]

Now, click the capture trigger button, then in JupyterLab run:

dma_recv.transfer(output_buffer)
dma_send.transfer(input_buffer)
dma_send.wait()
dma_recv.wait()

dma_recv.wait() never completes. Click the Interrupt the kernel button a few times to kill the waiting process. If you explore dma.register_map, you will find that the S2MM channel is neither Idle nor Halted, which means that something was not correct in the AXI4-Stream transactions.

Let’s turn our attention to the ILA waveform. Notice how everything seems correct on the S2MM channel, except that TKEEP is always 0. The DMA IP expects TKEEP to be present and driven as described in the specification. For simplicity, when using the DMA you can set all TKEEP bits to 1.
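As a reference for what the waveform should show instead, here is a small sketch that computes the TKEEP value for each beat of a packet. It assumes the usual convention that all bytes are kept on every beat and that only the last beat may be partial, with the valid bytes in the low-order positions. The function name expected_tkeep is hypothetical.

```python
def expected_tkeep(num_bytes, bus_bytes=4):
    """Return the TKEEP value for each beat of a packet of `num_bytes` bytes.

    Every beat keeps all bytes (all bits set) except possibly the last
    one, where only the low-order bytes that carry data are kept.
    """
    if num_bytes <= 0:
        return []
    full = (1 << bus_bytes) - 1
    beats, remainder = divmod(num_bytes, bus_bytes)
    keep = [full] * beats
    if remainder:
        keep.append((1 << remainder) - 1)
    return keep


# 20 uint32 elements on a 32-bit stream: 20 beats, all with TKEEP = 0xf.
print([hex(k) for k in expected_tkeep(20 * 4)])
# 10 bytes on a 32-bit stream: the last beat keeps only two bytes.
print([bin(k) for k in expected_tkeep(10)])
```

For the buffers used in this post, every beat is a full 32-bit word, which is why driving all TKEEP bits to 1 is enough.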

Every odd-index element in the array is 0

Let’s remove the subset converter and reconnect the FIFO to the DMA.

delete_bd_objs [get_bd_intf_nets axis_data_fifo_M_AXIS] [get_bd_intf_nets axis_data_fifo_M_AXIS1] [get_bd_cells axis_subset_converter]
connect_bd_intf_net [get_bd_intf_pins axis_data_fifo/M_AXIS] [get_bd_intf_pins axi_dma/S_AXIS_S2MM]

Double click on the ZYNQ7 Processing System. In PS-PL Configuration, expand HP Slave AXI Interface, then expand S AXI HP0 DATA WIDTH and set it to 32.

For MPSoC devices, double click on Zynq UltraScale+ MPSoC. In PS-PL Configuration, expand PS-PL Interfaces, then Slave Interface, then AXI HP, then AXI HP0 FPD Data Width, and set it to 32. If set to 64, you will not see this issue.

Re-generate the bitstream. Once the bitstream is generated, move the new .bit and .hwh files to the board, using the basename dma_hp32.

Create a new JupyterLab notebook.

from pynq import Overlay, allocate
import numpy as np

ol = Overlay('dma_hp32.bit')
dma = ol.axi_dma
dma_send = ol.axi_dma.sendchannel
dma_recv = ol.axi_dma.recvchannel

data_size = 20

input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)

input_buffer[:] = np.arange(data_size, data_size+data_size, dtype=np.uint32)

Open the Hardware Manager and set the MM2S TVALID signal as the trigger with Value R.

Now, click the capture trigger button, then in JupyterLab run:

dma_recv.transfer(output_buffer)
dma_send.transfer(input_buffer)
dma_send.wait()
dma_recv.wait()
print(f'{input_buffer=}')
print(f'{output_buffer=}')

You can see that every odd-index element is 0:

input_buffer=PynqBuffer([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
                   35, 36, 37, 38, 39], dtype=uint32)
output_buffer=PynqBuffer([20, 0, 22, 0, 24, 0, 26, 0, 28, 0, 30, 0, 32, 0, 34,
                   0, 36, 0, 38, 0], dtype=uint32)
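If you want to flag this pattern automatically in a loopback test, a small check like the one below can help. It is only a sketch. odd_elements_zeroed is a hypothetical helper, and it is written in plain Python so it runs on lists as well as on PYNQ buffers.

```python
def odd_elements_zeroed(sent, received):
    """Return True if `received` matches `sent` at even indices and is 0 at odd ones."""
    if len(sent) != len(received) or len(sent) < 2:
        return False
    even_match = all(int(s) == int(r) for s, r in zip(sent[0::2], received[0::2]))
    odd_zero = all(int(r) == 0 for r in received[1::2])
    return even_match and odd_zero


sent = list(range(20, 40))
received = [v if i % 2 == 0 else 0 for i, v in enumerate(sent)]
print(odd_elements_zeroed(sent, received))  # pattern present
print(odd_elements_zeroed(sent, sent))      # healthy loopback
```

On the board, the call would be odd_elements_zeroed(input_buffer, output_buffer) after both transfers complete.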

If we look at the ILA waveform, we can see that the data already arrives like this on the MM2S channel.

The reason for this issue is that the PS configuration is pre-defined on the SD card and cannot be changed at runtime. In this configuration, the HP ports have 64-bit wide data signals, so make sure the HP ports in your design are configured to match. You can check this from the notebook:

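# Zynq-7000 design: the PS instance is named processing_system7_0
if ol.ip_dict.get('processing_system7_0'):
    ps = ol.ip_dict.get('processing_system7_0')
    hp_ports = [f'C_S_AXI_HP{idx}_DATA_WIDTH' for idx in range(4)]
    width = [64]  # accepted HP port widths
else:
    # Otherwise, assume a Zynq UltraScale+ MPSoC design
    ps = ol.ip_dict.get('zynq_ultra_ps_e')
    hp_ports = [f'C_SAXIGP{idx}_DATA_WIDTH' for idx in range(6)]
    width = [128, 64]  # accepted HP port widths
# Print the configured width of each port and whether it is an accepted value
for i in hp_ports:
    w = ps['parameters'][i]
    print(f"{i}: {w}, configured correctly? {int(w) in width}")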

This concludes the third part of this blog series. As a bonus, see Using an ILA without Physical JTAG cable.

Please, use the comments section for questions related to the content of this blog. If you have questions about your own design or unrelated topics, please create a new topic in the forum.

For each part, can you please split it into a) reproducing the problem and b) addressing the issue? It is hard to tell where the fix begins and where you are still showing how the issue is generated.

Hi @matthew,

Thank you for the feedback, I’ll see how I can incorporate it.

Thanks for these tutorials, Mario.

I’m an undergraduate student researcher who is temporarily covering for a grad student on an RFSoC 4x2 project. Both the grad student and I are running into issues with DMA, and I’m aware that Part 3 of these tutorials covers troubleshooting. I’m trying to backtrack to a known-working design, so I modified the 1_0_mpsoc_dma_overlay.tcl script to work with the RFSoC 4x2 board, and I’ve been able to get through this tutorial successfully.

After completing Part 1, I started experimenting with the notebook’s commands in order to help develop my understanding of how DMA works so that I can understand why my existing design is failing. In doing so, I’ve developed the following questions:

  1. In your notebook, if I check the idle status of dma_send or dma_recv immediately after the following block of code, both return “False”. However, both will return “True” after a transfer has been completed. This prevents me from setting up a reliable if/else that checks idle status before initiating a transfer. Is there something wrong with my implementation of your script or is this expected?
    from pynq import Overlay, allocate, PL
    import numpy as np
    PL.reset()
    ol = Overlay('dma_wrapper.bit')
    dma = ol.axi_dma
    dma_send = ol.axi_dma.sendchannel
    dma_recv = ol.axi_dma.recvchannel

  2. I’ve found that I can run the following block repeatedly without issue.
    dma_send.transfer(input_buffer)
    dma_recv.transfer(output_buffer)
    dma_send.wait()
    dma_recv.wait()
    However, if I change this block to the following code, the cell will hang during its seventh execution and require kernel interruption.
    dma_send.transfer(input_buffer)
    dma_send.wait()
    After that kernel interruption, I can then run the following code repeatedly, which will hang on its eighth execution.
    dma_recv.transfer(output_buffer)
    dma_recv.wait()
    Is this behavior expected, or is there an issue with my adaptation of your block design? If this is normal, could you please help me understand why this is occurring?

Thanks in advance for any time anyone might spend addressing my questions.

Shawn Feezer

Hi @feezus,

  1. You may also need to check other fields, such as the MM2S and S2MM length registers (which should be 0), in the first iteration.
  2. If I understand correctly, you’re running the following multiple times without starting dma_recv.transfer(output_buffer):
dma_send.transfer(input_buffer)
dma_send.wait()

If my understanding is correct, then the behavior is expected: the FIFO fills up and no more data can be transferred.

Mario

Thanks for the response, Mario. Yes, your understanding of my question is correct, and I think you’ve given me the understanding that I need to make some progress with my designs. I’d like to test that understanding, though, so could you confirm that my thinking is sound, please?

Using this Part 1 diagram as an example:

  1. Since input_buffer is an array allocated in PYNQ, it exists within DRAM.
  2. When dma_send.transfer(input_buffer) is executed, the data within “input_buffer” is sent to axis_data_fifo:S_AXIS from axi_dma:M_AXIS_MM2S, and is stored within the FIFO.
  3. When dma_recv.transfer(output_buffer) is executed, axi_dma begins accepting transfer of data at its S_AXIS_S2MM port. The data is sent (and removed) from the FIFO to the axi_dma, and the DMA module places that data within the “output_buffer” allocation in DRAM.

Shawn

Yes, your understanding is correct.

Awesome. Tyvm for the help, Mario.
