Load data to IP via DMA

I am trying to load data (a NumPy array) into my IP so that it can be reused across several operations. To validate that the write was correct, I stream the cached data back out via DMA, but the whole process hangs at dma.recvchannel.wait(). My platform is Ultra96-v2 and the PYNQ version is 2.5.

When streaming data to the IP, I use the IP's control port to configure it to write the incoming DMA stream to BRAM. The Python code is as follows:

conv_accel.write(0x10, 13)
conv_accel.write(0x0, 0)
conv_accel.write(0x0, 1)

I have also monitored the control register of my IP (over AXI-Lite). It reads 4 (idle) initially and changes to 1 right after I write 1 to 0x00; the status then stays at 1 and never changes. This “ipython block” does complete execution (it is not stuck on wait()). This differs from my previous designs, where the status started at 4, changed to 6 (done + idle), and finally returned to 4.
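
For reference, Vivado HLS places the block-level handshake flags in the control register at offset 0x00: bit 0 is ap_start, bit 1 is ap_done, bit 2 is ap_idle, bit 3 is ap_ready, and bit 7 is auto_restart. The small decoder below (the helper name decode_ap_ctrl is made up for illustration) shows how the values mentioned above read: 4 is idle, 6 is done plus idle, and 1 means ap_start is still set, which would be consistent with the core having been started but not yet finished.

```python
# Bit layout of the Vivado HLS ap_ctrl_hs control register at offset 0x00.
AP_CTRL_BITS = {
    0: "ap_start",
    1: "ap_done",
    2: "ap_idle",
    3: "ap_ready",
    7: "auto_restart",
}


def decode_ap_ctrl(value):
    """Return the names of the flags that are set in a control-register value."""
    return [name for bit, name in sorted(AP_CTRL_BITS.items()) if value & (1 << bit)]


# The three values seen in this post.
for status in (4, 6, 1):
    print(status, decode_ap_ctrl(status))
```

On the board, the value to decode would come from something like conv_accel.read(0x0).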

When streaming data from the IP to the PS, the Python code is:

conv_accel.write(0x10, 0)
conv_accel.write(0x0, 1)
dma_recv.wait()  # process hang on here.

Here I configure the IP to stream out the cached data, trigger it by writing 1 to 0x00, and then wait for the DMA. This “ipython block” is stuck at dma.recvchannel.wait().
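
While debugging, a bounded wait can keep the notebook from blocking forever. The sketch below polls instead of calling wait(); it assumes the channel object exposes an idle property, as the PYNQ DMA channels do, and the helper name wait_idle and its default timings are made up for illustration.

```python
import time


def wait_idle(channel, timeout_s=2.0, poll_s=0.01):
    """Poll channel.idle until it is True; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if channel.idle:
            return True
        time.sleep(poll_s)
    # One last check so a transfer that finished right at the deadline still counts.
    return bool(channel.idle)
```

Usage would look like `if not wait_idle(dma.recvchannel): print("receive channel still busy")`, after which the IP status register can still be read to see what state the core is in.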

My top function in C++ is as follows:

// Read IFM_CHANNEL * (KERNEL_DIM*KERNEL_DIM) weights, i.e. 16x9
void write_weight(stream<axis_t> &in){

	for(int i = 0; i < KERNEL_DIM*KERNEL_DIM; i++){
		// read one stream word, i.e. one column of weights
		axis_t data_in = in.read();
		for(int j = 0; j < IFM_CHANNEL; j++){
#pragma HLS UNROLL
			// channel j occupies bits [(j+1)*PRECISION-1 : j*PRECISION]
			ap_uint<PRECISION> temp = data_in.data((j+1)*PRECISION-1, j*PRECISION);
			weights[j][i] = temp;
		}
	}
}

void DoCompute(stream<axis_t> &in, stream<axis_t> &out, int control){
#pragma HLS INTERFACE s_axilite port=return bundle=control_bus
#pragma HLS INTERFACE s_axilite port=control bundle=control_bus
#pragma HLS INTERFACE axis register both port=out
#pragma HLS INTERFACE axis register both port=in

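	if(control == 13){
		// load mode: cache the incoming stream in weights
		write_weight(in);
	}else{
		// read-back mode: stream the cached weights out
		for(int j = 0; j<KERNEL_DIM*KERNEL_DIM; j++){
			axis_t data_out;
			for(int i = 0; i < IFM_CHANNEL; i++){
				data_out.data((i+1)*PRECISION-1, i*PRECISION) = weights[i][j];
			}
			data_out.last = (j == KERNEL_DIM*KERNEL_DIM-1? 1:0);
			out.write(data_out);
		}
	}
}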

My questions are

  1. Is there any issue in my top function? I based it on the implementation from SpooNN.
  2. I think it would be helpful to monitor the AXI stream interface of my IP. Is there any way to do this using PYNQ?
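
To check the read-back data on the host, the packing done by the HLS loops can be modeled in plain Python. This is only a sketch: IFM_CHANNEL = 16 and KERNEL_DIM = 3 are taken from the "16x9" comment, PRECISION = 8 is an assumption (the post does not state it), and the bit positions follow the (j+1)*PRECISION-1 down to j*PRECISION slicing that the C++ fragments suggest. The function names are made up for illustration.

```python
IFM_CHANNEL = 16  # from the "16x9" comment in the post
KERNEL_DIM = 3    # 3 * 3 = 9 stream beats
PRECISION = 8     # assumed bit width per weight; not stated in the post

MASK = (1 << PRECISION) - 1


def pack_weights(weights):
    """Pack weights[channel][k] into one integer word per stream beat.

    Channel j occupies bits [(j+1)*PRECISION-1 : j*PRECISION] of each word.
    Returns a list of (word, tlast) pairs in stream order.
    """
    n_beats = KERNEL_DIM * KERNEL_DIM
    beats = []
    for k in range(n_beats):
        word = 0
        for j in range(IFM_CHANNEL):
            word |= (int(weights[j][k]) & MASK) << (j * PRECISION)
        # TLAST is asserted only on the final beat of the packet.
        beats.append((word, k == n_beats - 1))
    return beats


def unpack_weights(beats):
    """Inverse of pack_weights: recover weights[channel][k] from the words."""
    return [
        [(word >> (j * PRECISION)) & MASK for word, _ in beats]
        for j in range(IFM_CHANNEL)
    ]
```

Comparing unpack_weights() of the words received by the DMA against the array that was sent gives a quick correctness check, and the tlast flags show where the receive channel should see the end of the packet.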


I have exactly the same problem. Seems that something is broken with the DMA on PYNQ V2.5. I ran a working script on PYNQ 2.4 on ultra96 and the same script hangs on DMA receive on V2.5

I believe I am also having the same issue. I will try loading v2.4 and report back if that fixes my issue

I can also confirm, my issue went away when I dropped to 2.4 and removed the wait calls

@rock , @PeterOgden,

Can you confirm if there is a workaround for this issue on PYNQ2.5 ?
Moving back to PYNQ 2.4 would require downgrading everything?

Can you post the block design? I wonder if you are using interrupt 0 for DMA; that is a known issue.

@dimiter I don’t think you should downgrade. I am thinking there is something basic broken or not properly done. We need a little more info.

I’m not really sure what’s going on. Would either of you be able to provide an HWH or TCL file for your design so I can dig deeper?

In the meantime, can you try the PYNQ-HelloWorld example? It also uses the DMA engine and is part of our regression tests prior to each release.


Hi @rock and @PeterOgden,

Thank you for replies. The TCL files, block design and HLS codes are zipped and attached.

The zip file contains

  1. TCL file
  2. Block Design(xpr)
  3. IP HLS source code
  4. Python driver

My board is an Ultra96-v2 with PYNQ 2.5 installed.

Thank you very much!! (15.9 KB)

For the block design, can you export it to pdf file and attach it?

Hi @rock,
Here is the PDF file.
DoCompute.pdf (199.9 KB)

@Jia-Ming_Lin I noticed that you are using the M_AXI_HPM0_FPD for AXI lite interfaces. Can you use M_AXI_HPM0_LPD port instead?

After you check that, if it still does not work, note that the Ultra96 BSP may use a different port bit width than your PS configuration. For example, the boot files may configure the LPD port as 128 bits wide while your bitstream uses 32 bits.
One way to check is that, in,
you can make block_design from the v2 tcl file. There are also ways to adjust your system memory port width directly on your OS. For example, check
And adjust register values accordingly (e.g. look for the afi_fs (LPD_SLCR) register and adjust its width).

Hi @rock,

I changed the design to use M_AXI_HPM0_LPD and set the bit width to the default of 32, which is consistent with the example project “sensors96b”. I built “sensors96b” from the branch image_v2.4_v2 using Vivado 2018.3, and it worked normally.

However, when I run my code to create the DMA object

dma = overlay.axi_dma_0

the system crashes, and a dialog appears with this message:

Connection Failed. 
A connection to the notebook server could not be established. The notebook will continue trying to reconnect. Check your network connection or notebook server configuration.

The following is the updated block design in PDF, and the DMA configuration.
DoCompute_LPD.pdf (199.9 KB)

Thank you very much!

The crash looks strange. I am wondering if you have loaded a bitstream somewhere else outside of pynq, then started to use the overlay class. In that case the system may have probed a non-existing AXI address which leads to system hanging. Can you reboot the board and start cleanly?

Also, please make sure you use *.bit + *.hwh for overlay files, as you have been warned (you probably have used *.bit + *.tcl, which will be deprecated soon).

Hi @rock,

I solved the crash by deleting and re-adding the custom IP and the DMA, and then forcing a design update so that the interface offset addresses were refreshed.

However, the last operation, fetching the cached data, still does not work: the DMA hangs at dma.recvchannel.wait().
Also, in my scenario it is inappropriate to adjust the system configuration for a single application, so I still need to configure the design correctly in Vivado before bitstream generation.

Thank you very much!

I tested the resize IP example that was working on V2.4 of PYNQ.
That fails when reading from the DMA. Then I implemented a simple FIFO DMA readback.
The transaction gets stuck when reading. There are no interrupts at all on this design.

This issue happens for all designs that use the DMA read on V2.5.
I am using .bit and .tcl files in case that is important.
How do we generate .hwh from Vivado?

The simple DMA code is below:

#!/usr/bin/env python
# coding: utf-8

from PIL import Image
import numpy as np
from IPython.display import display
from pynq import Xlnk
from pynq import Overlay
from pynq import allocate

dma_design = Overlay("dmafifo.bit")

get_ipython().run_line_magic('pinfo', 'dma_design')


from pynq.lib import AxiGPIO
led_instance = dma_design.ip_dict['axi_gpio_0']
led = AxiGPIO(led_instance).channel1


xlnk = Xlnk()
in_buffer = xlnk.cma_array(shape=(100, 100, 3), dtype=np.uint8, cacheable=1)
out_buffer = xlnk.cma_array(shape=(100, 100, 3), dtype=np.uint8, cacheable=1)

dma = dma_design.axi_dma_0

input_buffer = allocate(shape=(5,), dtype=np.uint32)
output_buffer = allocate(shape=(5,), dtype=np.uint32)

for i in range(100):
   in_buffer[i] = i


hwh file can be located in paths like:
{overlay_name}.srcs/sources_1/bd/{design_name}/hw_handoff/{design_name}.hwh {overlay_name}.hwh
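
Once the .hwh has been copied next to the bitstream with the same base name, a quick check before loading the overlay can confirm it is actually there. This is a minimal sketch; the helper name find_hwh is made up for illustration, and dmafifo.bit is the file name used in the script above.

```python
from pathlib import Path


def find_hwh(bitfile):
    """Return the .hwh path that sits next to bitfile, or None if it is missing."""
    hwh = Path(bitfile).with_suffix(".hwh")
    return hwh if hwh.is_file() else None


# Prints the path if dmafifo.hwh exists beside dmafifo.bit, otherwise None.
print(find_hwh("dmafifo.bit"))
```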

If you look at our overlays, you can find how we generate it:

Thanks, I will try that but I don’t see how it can make a difference with the DMA xlnk read issue.

@dimiter, I also noticed that you have used both the allocate function and the cma_array function. You don’t have to; you only need

in_buffer = xlnk.cma_array(shape=(100, 100, 3), dtype=np.uint8, cacheable=1)
out_buffer = xlnk.cma_array(shape=(100, 100, 3), dtype=np.uint8, cacheable=1)

@Jia-Ming_Lin Not sure if you are able to resolve it. For me, I am using the DMA calls consistently without having issues. So it is somewhere in your flow that is not done properly. If you still need help, please zip up the entire vivado project (along with the tcl file that generates your block design), your HLS IP, and your notebook, and send it to me. I can have a look when I have some time.

Hi @rock,

Thank you for your help, and the attachments are as follows

  1. zip file of HLS code(compiled with Vivado HLS 2018.3)
  2. zip file of Block Design for Vivado.
    • DoCompute.tcl: Design using M_AXI_HPM0_FPD
    • DoComputeLPD.tcl: Design using M_AXI_HPM0_LPD
  3. conv_load_weights.ipynb: IPython notebook

The overall procedure is

  1. Set the IP to load data from the DMA by writing 13 to register 0x10.
  2. Trigger the DMA and the custom IP (0x0 = 1).
    • The DMA streams data to the IP, and the IP writes it to on-chip memory.
  3. Set the IP to compute mode, which here packs the cached data into a stream for the DMA to write to DRAM, by writing 1 (or any number other than 13) to register 0x10.
  4. Trigger the DMA and the custom IP again (0x0 = 1).
    • The IP packs the cached data into a stream, and the DMA writes it to DRAM.
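
The four steps above can be sketched as two small functions. This is an illustration under assumptions, not the code from the attached notebook: ip and dma stand for the overlay's IP and DMA objects, the function and constant names are made up, and queuing each DMA transfer before writing ap_start is a choice made here so that the DMA is ready when the core begins to stream.

```python
CTRL_REG = 0x00   # control register; bit 0 is ap_start
MODE_REG = 0x10   # the "control" argument of DoCompute
MODE_LOAD = 13    # value that selects the load path


def load_weights(ip, dma, in_buffer):
    """Steps 1-2: select load mode, queue the send transfer, start the IP."""
    ip.write(MODE_REG, MODE_LOAD)
    dma.sendchannel.transfer(in_buffer)
    ip.write(CTRL_REG, 1)
    dma.sendchannel.wait()


def read_back(ip, dma, out_buffer, mode=0):
    """Steps 3-4: select read-back mode, queue the receive transfer, start the IP."""
    ip.write(MODE_REG, mode)
    dma.recvchannel.transfer(out_buffer)
    ip.write(CTRL_REG, 1)
    dma.recvchannel.wait()
```

The only calls used are write() on the IP and transfer()/wait() on the DMA channels, so the same functions can be exercised with stand-in objects before running them on the board.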

I based this on a similar method from SpooNN. I have successfully reproduced the SpooNN experiments, but the same process does not work in my own implementation.

I have monitored the value of the control register (0x0). In my design, the status stays at 1 after I trigger the transfer (send data), whereas in SpooNN the status returns to 4 (idle) after a short time.

My tools version and environment

  • Ubuntu 18.04
  • Vivado HLS 2018.3 and Vivado 2018.3
  • Ultra96 v2
  • PYNQ 2.5

And thank you again for your help. (2.4 MB) (18.2 KB)
conv_load_weights.ipynb (30.6 KB)