Debugging Common DMA Issues
If you frequent the PYNQ forum, you will know that one of the most common questions we get is why DMA transfers do not work. In part 3 of this series, I will reproduce these issues and use both the register_map and the ILA to identify and address them.
DMA channel not started
Often, we see users reporting something like this.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [4], in <cell line: 2>()
1 dma_recv.transfer(output_buffer)
----> 2 dma_recv.wait()
File /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/lib/dma.py:169, in _SDMAChannel.wait(self)
167 """Wait for the transfer to complete"""
168 if not self.running:
--> 169 raise RuntimeError("DMA channel not started")
170 while True:
171 error = self._mmio.read(self._offset + 4)
RuntimeError: DMA channel not started
Although the error message does not provide all the information, a closer look at dma.register_map helps to debug the problem.
To reproduce this error, we are going to modify our overlay so that TLAST on the S2MM channel is only asserted every 20 (valid) stream beats. For this, we will use the AXI4-Stream Subset Converter.
Based on the initial overlay, open your block design in Vivado IP Integrator and run these commands from the Tcl console.
disconnect_bd_intf_net [get_bd_intf_nets axis_data_fifo_M_AXIS] [get_bd_intf_pins axis_data_fifo/M_AXIS]
create_bd_cell -type ip -vlnv xilinx.com:ip:axis_subset_converter axis_subset_converter
connect_bd_net [get_bd_pins axis_subset_converter/aclk] [get_bd_pins axis_data_fifo/s_axis_aclk]
connect_bd_net [get_bd_pins axis_subset_converter/aresetn] [get_bd_pins axis_data_fifo/s_axis_aresetn]
connect_bd_intf_net [get_bd_intf_pins axis_subset_converter/S_AXIS] [get_bd_intf_pins axis_data_fifo/M_AXIS]
connect_bd_intf_net [get_bd_intf_pins axis_subset_converter/M_AXIS] [get_bd_intf_pins axi_dma/S_AXIS_S2MM]
validate_bd_design
Double click on the AXI4-Stream Subset Converter to set it up. We are not going to use the input TLAST, so we can choose when the output TLAST is generated. Manually disable the TLAST of the Slave interface and enable the TLAST of the Master interface. Set Generate TLAST to 20. This forces the IP to assert TLAST on the M_AXIS AXI4-Stream interface every 20 (valid) stream beats.
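To make the effect of this setting concrete, here is a minimal software model of the resulting TLAST pattern. It is a sketch, not the IP itself: it assumes the converter simply counts valid beats from reset and asserts TLAST on every 20th one, and the function names are made up for this illustration.

```python
def tlast_pattern(num_beats, period=20):
    """Return one flag per valid beat; True where TLAST would be asserted.

    Assumption: the beat counter starts at zero after reset and only
    wraps when it reaches `period`.
    """
    return [(beat + 1) % period == 0 for beat in range(num_beats)]


def ends_with_tlast(num_beats, period=20):
    """True if the final beat of a num_beats-long stream carries TLAST."""
    pattern = tlast_pattern(num_beats, period)
    return bool(pattern) and pattern[-1]


for n in (16, 20, 40):
    flags = tlast_pattern(n)
    positions = [i for i, flag in enumerate(flags) if flag]
    print(f"{n} beats: TLAST on beat index {positions}, "
          f"last beat has TLAST: {ends_with_tlast(n)}")
```

Under this model, a stream shorter than 20 beats never carries TLAST, while a stream of 40 beats is split into two packets of 20.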
Your block design should look like this:
Re-generate the bitstream. Once the bitstream is generated, move the new .bit and .hwh files to the board, using the basename dma_subset.
from pynq import Overlay, allocate
import numpy as np
ol = Overlay('dma_subset.bit')
dma = ol.axi_dma
dma_send = ol.axi_dma.sendchannel
dma_recv = ol.axi_dma.recvchannel
data_size = 16
input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)
input_buffer[:] = np.arange(data_size, dtype=np.uint32)
Use the TVALID signal of the S2MM channel in the trigger setup (rising edge) as shown in part 2, then arm the ILA so that it waits for the trigger.
Launch the DMA transfer from JupyterLab. Note that we start the receive transfer first. The DMA pre-emptively accepts some transactions on the S2MM channel, so starting the receive side first lets us capture the stream beats cleanly.
dma_recv.transfer(output_buffer)
dma_send.transfer(input_buffer)
dma_recv.wait()
This throws RuntimeError: DMA channel not started. Check dma.register_map for more details:
dma.register_map
The S2MM_DMASR register has useful information for debugging (see the AXI DMA documentation):
- Halted=1: the DMA channel is halted
- DMAIntErr=1: this error occurs when the Status AXI4-Stream packet RxLength field does not match the S2MM packet being received by the S_AXIS_S2MM interface
- Err_Irq: indicates that an interrupt event was generated on error
With just these fields, we can deduce that TLAST is not being handled properly.
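If you only have the raw register value at hand, the same flags can be decoded in plain Python. The sketch below assumes the status register bit layout from the AXI DMA product guide (PG021), so check it against your IP version. The helper name and the example value are hypothetical.

```python
# Bit positions of the AXI DMA status register (MM2S_DMASR / S2MM_DMASR).
# Assumption: layout as documented in the AXI DMA product guide (PG021).
DMASR_BITS = {
    "Halted": 0,
    "Idle": 1,
    "SGIncld": 3,
    "DMAIntErr": 4,
    "DMASlvErr": 5,
    "DMADecErr": 6,
    "SGIntErr": 8,
    "SGSlvErr": 9,
    "SGDecErr": 10,
    "IOC_Irq": 12,
    "Dly_Irq": 13,
    "Err_Irq": 14,
}


def decode_dmasr(value):
    """Return the names of all status flags set in a raw DMASR value."""
    return [name for name, bit in DMASR_BITS.items() if (value >> bit) & 1]


# Hypothetical raw value with Halted, DMAIntErr and Err_Irq set
print(decode_dmasr(0x4011))
```

On the board, the raw value could come from something like dma.read(0x34), assuming the standard S2MM_DMASR offset. The register_map view shown above gives the same information without manual decoding.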
To confirm our analysis, check the waveform in the Hardware Manager. Note the 16 valid transactions on the S2MM channel. Although the number of transactions matches the size of output_buffer, TLAST is not asserted on the last beat, and this causes the RuntimeError: DMA channel not started issue.
Using both dma.register_map and the ILA, we confirmed that the issue is TLAST not being asserted.
To prove that the overlay works as intended, run this code from a different notebook.
from pynq import Overlay, allocate
import numpy as np
ol = Overlay('dma_subset.bit')
dma = ol.axi_dma
dma_send = ol.axi_dma.sendchannel
dma_recv = ol.axi_dma.recvchannel
data_size = 40
input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer_0 = allocate(shape=(data_size//2,), dtype=np.uint32)
output_buffer_1 = allocate(shape=(data_size//2,), dtype=np.uint32)
input_buffer[:] = np.arange(data_size, dtype=np.uint32)
Connect the Hardware Manager to the board. So far, we have been using the default capture mode. For this experiment, we are going to use two windows of 32 samples each, and we are also going to set the trigger position to 0.
On top of this, add the S2MM channel (slot_1) TREADY signal to the Trigger Setup (Value R) and then set the trigger condition to Global OR.
Once it is configured, arm the trigger, then in JupyterLab run
dma_recv.transfer(output_buffer_0)
dma_send.transfer(input_buffer)
dma_send.wait()
dma_recv.wait()
dma_recv.transfer(output_buffer_1)
dma_recv.wait()
print(f'{output_buffer_0=}')
print(f'{output_buffer_1=}')
I will leave the analysis of the waveform to the reader, but I would like to highlight that you see two TLAST assertions on the S2MM channel. This matches the configuration of our subset converter, as input_buffer has 40 elements. Not all the valid transactions of the MM2S channel are captured because of the window depth.
Receive DMA .wait() never completes
This issue is way trickier to debug as there are at least a couple of probable reasons.
Data Transfer is Shorter than Expected
To see this, let’s continue using the modified overlay. The output buffer will be twice the size of the input buffer.
from pynq import Overlay, allocate
import numpy as np
ol = Overlay('dma_subset.bit')
dma = ol.axi_dma
dma_send = ol.axi_dma.sendchannel
dma_recv = ol.axi_dma.recvchannel
data_size = 10
input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(20,), dtype=np.uint32)
input_buffer[:] = np.arange(data_size, dtype=np.uint32)
Open the Hardware Manager and configure the ILA settings by entering this in the Tcl console:
set_property CONTROL.WINDOW_COUNT 1 [get_hw_ilas -of_objects [get_hw_devices xc7z020_1] -filter {CELL_NAME=~"dma_i/system_ila_0/inst/ila_lib"}]
set_property CONTROL.DATA_DEPTH 1024 [get_hw_ilas -of_objects [get_hw_devices xc7z020_1] -filter {CELL_NAME=~"dma_i/system_ila_0/inst/ila_lib"}]
set_property CONTROL.TRIGGER_POSITION 0 [get_hw_ilas -of_objects [get_hw_devices xc7z020_1] -filter {CELL_NAME=~"dma_i/system_ila_0/inst/ila_lib"}]
Now, arm the trigger, then in JupyterLab run
dma_recv.transfer(output_buffer)
dma_send.transfer(input_buffer)
dma_recv.wait()
dma_recv.wait() never completes. Click the Interrupt the kernel button a few times to kill the waiting process. If you explore dma.register_map, you will find that the S2MM channel is neither Idle nor Halted, which means that something was not correct in the AXI4-Stream transactions.
Let’s turn our attention to the ILA waveform. Notice how the S2MM channel is only valid for 10 transactions. However, in the PYNQ code we specified that we want to receive an array of 20 elements, so the DMA keeps waiting for the remaining data. PYNQ does not have a DMA timeout capability.
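Since there is no built-in timeout, one option is to poll for completion yourself and give up after a deadline. The helper below is a generic sketch, not part of PYNQ. Its name and parameters are hypothetical, and it takes any callable that reports whether the transfer has finished.

```python
import time


def wait_until(is_done, timeout_s=2.0, poll_s=0.01):
    """Poll is_done() until it returns True or timeout_s elapses.

    Returns True on completion and False on timeout, so the caller can
    inspect the register map instead of hanging forever.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_done():
            return True
        time.sleep(poll_s)
    return is_done()  # one last check right at the deadline


# Stand-in for a channel that never finishes (no hardware needed here)
if not wait_until(lambda: False, timeout_s=0.1):
    print("transfer did not complete; check dma.register_map")
```

On the board, the predicate could be something like lambda: dma_recv.idle, assuming the channel's idle property reflects completion in your PYNQ version. Once it returns True, a subsequent dma_recv.wait() should return without blocking.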
To demonstrate that the DMA will complete once it receives the remaining data, transfer input_buffer again. You should see the DMA complete the transfer.
dma_send.transfer(input_buffer)
dma_recv.wait()
print(f'{output_buffer=}')
Not Handling TKEEP
The other reason for the DMA never finishing could be bad handling of TKEEP. To demonstrate this, let’s modify the subset converter to always set TKEEP to 0.
With the block design open in Vivado, type the following in the TCL console:
set_property -dict [list \
CONFIG.S_HAS_TLAST.VALUE_SRC USER \
CONFIG.M_HAS_TKEEP.VALUE_SRC USER \
CONFIG.M_HAS_TLAST.VALUE_SRC USER \
CONFIG.M_HAS_TKEEP {1} \
CONFIG.M_HAS_TLAST {1} \
CONFIG.S_HAS_TLAST {1} \
CONFIG.TKEEP_REMAP {4'b0} \
] [get_bd_cells axis_subset_converter]
The subset converter is now configured to always set TKEEP to 0. This mimics designs that do not have TKEEP or do not set it properly.
Re-generate the bitstream. Once the bitstream is generated, move the new .bit and .hwh files to the board, using the basename dma_notkeep.
Create a new JupyterLab notebook.
from pynq import Overlay, allocate
import numpy as np
ol = Overlay('dma_notkeep.bit')
dma = ol.axi_dma
dma_send = ol.axi_dma.sendchannel
dma_recv = ol.axi_dma.recvchannel
data_size = 20
input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)
input_buffer[:] = np.arange(data_size, dtype=np.uint32)
Open the Hardware Manager and configure the ILA with the following Tcl commands:
set_property CONTROL.WINDOW_COUNT 1 [get_hw_ilas -of_objects [get_hw_devices xc7z020_1] -filter {CELL_NAME=~"dma_i/system_ila_0/inst/ila_lib"}]
set_property CONTROL.DATA_DEPTH 32 [get_hw_ilas -of_objects [get_hw_devices xc7z020_1] -filter {CELL_NAME=~"dma_i/system_ila_0/inst/ila_lib"}]
set_property CONTROL.TRIGGER_POSITION 0 [get_hw_ilas -of_objects [get_hw_devices xc7z020_1] -filter {CELL_NAME=~"dma_i/system_ila_0/inst/ila_lib"}]
Now, arm the trigger, then in JupyterLab run
dma_recv.transfer(output_buffer)
dma_send.transfer(input_buffer)
dma_send.wait()
dma_recv.wait()
dma_recv.wait() never completes. Click the Interrupt the kernel button a few times to kill the waiting process. If you explore dma.register_map, you will find that the S2MM channel is neither Idle nor Halted, which means that something was not correct in the AXI4-Stream transactions.
Let’s turn our attention to the ILA waveform. Notice how everything seems correct on the S2MM channel, except that TKEEP is always 0. The DMA IP expects TKEEP to be present and handled as described in the specification. For simplicity, when using the DMA you can set all its bits to 1.
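TKEEP carries one bit per byte of TDATA, so "all bits set to 1" depends on the stream width. The following sketch computes that value and counts how many bytes a beat contributes for a given TKEEP. The helper names are made up for this illustration, and it models the stream only, not the DMA IP.

```python
def tkeep_all_valid(tdata_width_bits):
    """TKEEP value marking every byte of a TDATA beat as valid."""
    if tdata_width_bits <= 0 or tdata_width_bits % 8:
        raise ValueError("TDATA width must be a positive multiple of 8 bits")
    return (1 << (tdata_width_bits // 8)) - 1


def kept_bytes(tkeep):
    """Number of bytes a single beat contributes for a given TKEEP value."""
    return bin(tkeep).count("1")


for width in (32, 64):
    print(f"{width}-bit TDATA -> TKEEP = {tkeep_all_valid(width):#x}")

# 20 beats of 32-bit data: bytes carried with TKEEP = 0 versus TKEEP = 0xF
print(20 * kept_bytes(0x0), 20 * kept_bytes(0xF))
```

For the 32-bit stream used in this series, every beat should therefore carry TKEEP = 0xF. With TKEEP stuck at 0, none of the 20 beats contributes any data bytes.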
Every odd-index element in the array is 0
Let’s remove the subset converter and reconnect the FIFO to the DMA.
delete_bd_objs [get_bd_intf_nets axis_data_fifo_M_AXIS] [get_bd_intf_nets axis_data_fifo_M_AXIS1] [get_bd_cells axis_subset_converter]
connect_bd_intf_net [get_bd_intf_pins axis_data_fifo/M_AXIS] [get_bd_intf_pins axi_dma/S_AXIS_S2MM]
Double click on the ZYNQ7 Processing System. In PS-PL Configuration, expand HP Slave AXI Interface, then expand S AXI HP0 DATA WIDTH and set it to 32.
For MPSoC devices, double click on Zynq UltraScale+ MPSoC. In PS-PL Configuration, expand PS-PL Interfaces, then Slave Interface, then AXI HP, then AXI HP0 FPD Data Width, and set it to 32. If set to 64, you will not see this issue.
Re-generate the bitstream. Once the bitstream is generated, move the new .bit and .hwh files to the board, using the basename dma_hp32.
Create a new JupyterLab notebook.
from pynq import Overlay, allocate
import numpy as np
ol = Overlay('dma_hp32.bit')
dma = ol.axi_dma
dma_send = ol.axi_dma.sendchannel
dma_recv = ol.axi_dma.recvchannel
data_size = 20
input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)
input_buffer[:] = np.arange(data_size, data_size+data_size, dtype=np.uint32)
Open the Hardware Manager and set the MM2S TVALID signal as trigger with Value R.
Now, arm the trigger, then in JupyterLab run
dma_recv.transfer(output_buffer)
dma_send.transfer(input_buffer)
dma_send.wait()
dma_recv.wait()
print(f'{input_buffer=}')
print(f'{output_buffer=}')
You can see that every odd-index element is 0:
input_buffer=PynqBuffer([20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
35, 36, 37, 38, 39], dtype=uint32)
output_buffer=PynqBuffer([20, 0, 22, 0, 24, 0, 26, 0, 28, 0, 30, 0, 32, 0, 34,
0, 36, 0, 38, 0], dtype=uint32)
If we look at the ILA waveform, you can see that the MM2S channel already received the data like this.
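This symptom is easy to test for in software before reaching for the ILA. The check below is a small sketch with a made-up helper name. It works on plain Python sequences, so it does not need the board, and it only detects this exact pattern of matching even indices and zeroed odd indices.

```python
def odd_elements_zeroed(sent, received):
    """True if received equals sent with every odd-index element set to 0."""
    sent, received = list(sent), list(received)
    if len(sent) != len(received):
        return False
    evens_match = received[0::2] == sent[0::2]
    odds_are_zero = all(value == 0 for value in received[1::2])
    return evens_match and odds_are_zero


# Synthetic data mimicking the buffers printed above
sent = list(range(20, 40))
received = [v if i % 2 == 0 else 0 for i, v in enumerate(sent)]
print(odd_elements_zeroed(sent, received))  # symptom present
print(odd_elements_zeroed(sent, sent))      # healthy loopback
```

In a notebook, you could pass input_buffer and output_buffer directly, since the helper converts both to lists first.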
The reason for this issue is that the PS configuration is pre-defined in the SD card image and cannot change at runtime. In this configuration, the HP ports have 64-bit wide data signals. So, make sure the HP ports in your design are configured to match:
# Pick the PS entry and its HP-port width parameters for the device family
if ol.ip_dict.get('processing_system7_0'):
    # Zynq-7000: the HP ports are expected to be 64 bits wide
    ps = ol.ip_dict.get('processing_system7_0')
    hp_ports = [f'C_S_AXI_HP{idx}_DATA_WIDTH' for idx in range(4)]
    width = [64]
else:
    # Zynq UltraScale+ MPSoC: 128 or 64 bits are accepted
    ps = ol.ip_dict.get('zynq_ultra_ps_e')
    hp_ports = [f'C_SAXIGP{idx}_DATA_WIDTH' for idx in range(6)]
    width = [128, 64]

# Report each port width and whether it matches the expected value(s)
for i in hp_ports:
    w = ps['parameters'][i]
    print(f"{i}: {w}, configured correctly? {int(w) in width}")
This concludes the third part of this blog series. As a bonus, see Using an ILA without Physical JTAG cable.
Please use the comments section for questions related to the content of this blog. If you have questions about your own design or unrelated topics, please create a new topic in the forum.