Load data to IP via DMA

Jia-Ming_Lin · December 29, 2019, 1:13pm

Hi,
I am trying to load data (a NumPy array) into my IP so that it can be reused across several operations. However, when I try to validate the write by streaming the cached data back out via DMA, the whole process hangs at dma.recvchannel.wait(). My platform is an Ultra96-v2 and the PYNQ version is 2.5.

My IP has a control port that can be configured so that data streamed in via DMA is written to BRAM. The Python code for streaming the data to the IP is as follows:

conv_accel.write(0x10, 13)
conv_accel.write(0x0, 0)
conv_accel.write(0x0, 1)
dma_send.transfer(input_buffer)
dma_send.wait()

I have also monitored the control register of my IP (AXI-Lite). It reads 4 (idle) initially and changes to 1 right after I write 1 to 0x00; after that the status stays at 1 and never changes. This “ipython block” does complete execution (it is not stuck on wait()). This is different from my previous designs, where the register started at 4, changed to 6 (done+idle), and finally returned to 4.
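For reference, the values 4, 1 and 6 mentioned above can be decoded with a small helper. This is a minimal sketch that assumes the standard Vivado HLS block-level control register at offset 0x00 (bit 0 ap_start, bit 1 ap_done, bit 2 ap_idle, bit 3 ap_ready, bit 7 auto_restart). The name decode_ap_ctrl is hypothetical and not part of PYNQ.

```python
# Bit positions of the block-level control register at offset 0x00, as
# generated by Vivado HLS for an s_axilite "return" port (assumed layout).
AP_CTRL_BITS = {
    "ap_start": 0,
    "ap_done": 1,
    "ap_idle": 2,
    "ap_ready": 3,
    "auto_restart": 7,
}


def decode_ap_ctrl(value):
    """Return the names of the flags that are set in an ap_ctrl value."""
    return [name for name, bit in AP_CTRL_BITS.items() if (value >> bit) & 1]


# The three values reported in the post.
for value in (4, 1, 6):
    print(value, decode_ap_ctrl(value))
```

On hardware the value would come from something like conv_accel.read(0x0). A register that stays at 1 has ap_start set without ap_done or ap_idle, which suggests the IP was started but has not finished.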

When streaming data from the IP to the PS, the Python code is:

conv_accel.write(0x10, 0)
dma_recv.transfer(output_buffer)
conv_accel.write(0x0, 1)
dma_recv.wait()  # the process hangs here

Here I configure the IP to stream out the cached data, trigger the IP to start by writing 1 to 0x00, and use the DMA to receive the data. This “ipython block” gets stuck at dma.recvchannel.wait().

My top function in C++ is as follows

static ap_uint<PRECISION> weights[IFM_CHANNEL][KERNEL_DIM*KERNEL_DIM];
/*
 * read to Channel * (k_dim*k_dim), 16x9
 */
void write_weight(stream<axis_t> &in){

	for(int i = 0; i < KERNEL_DIM*KERNEL_DIM; i++){
#pragma HLS PIPELINE
		// read stream data, one column
		axis_t data_in = in.read();
		for(int j = 0; j < IFM_CHANNEL; j++){
#pragma HLS UNROLL
			ap_uint<PRECISION> temp =data_in.data((j+1) * PRECISION-1, j*PRECISION);
			weights[j][i] = temp;
		}
	}
}

void DoCompute(stream<axis_t> &in, stream<axis_t> &out, int control){
#pragma HLS INTERFACE s_axilite port=return bundle=control_bus
#pragma HLS INTERFACE s_axilite port=control bundle=control_bus
#pragma HLS INTERFACE axis register both port=out
#pragma HLS INTERFACE axis register both port=in


	if(control == 13){
		write_weight(in);
	}else{

		for(int j = 0; j<KERNEL_DIM*KERNEL_DIM; j++){
			axis_t data_out;
			for(int i = 0; i < IFM_CHANNEL; i++){
				data_out.data((i+1)*PRECISION-1, i*PRECISION) = weights[i][j];
			}
			data_out.last = (j == KERNEL_DIM*KERNEL_DIM-1? 1:0);
			out.write(data_out);
		}
	}
}
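To make the expected stream layout concrete, here is a small software model of the packing used by write_weight() and the read-back loop. It is only a sketch: pack_beats and unpack_beats are hypothetical helper names, the 16 channels and 3x3 kernel follow the "16x9" comment in the code, and PRECISION = 8 is a guess made purely for illustration.

```python
def pack_beats(weights, precision):
    """Pack weights[channel][k] into one integer per stream beat.

    Beat k carries channel j in bits [(j+1)*precision-1 : j*precision],
    mirroring the bit slicing in write_weight().
    """
    channels = len(weights)
    taps = len(weights[0])
    mask = (1 << precision) - 1
    beats = []
    for k in range(taps):
        word = 0
        for j in range(channels):
            word |= (weights[j][k] & mask) << (j * precision)
        beats.append(word)
    return beats


def unpack_beats(beats, channels, precision):
    """Inverse of pack_beats: recover weights[channel][k] from the beats."""
    mask = (1 << precision) - 1
    return [[(word >> (j * precision)) & mask for word in beats]
            for j in range(channels)]


# Assumed values: 16 channels and a 3x3 kernel follow the "16x9" comment;
# PRECISION = 8 is a guess made only for this illustration.
IFM_CHANNEL, KERNEL_DIM, PRECISION = 16, 3, 8
weights = [[(j * 9 + k) % 256 for k in range(KERNEL_DIM * KERNEL_DIM)]
           for j in range(IFM_CHANNEL)]
beats = pack_beats(weights, PRECISION)
bytes_per_beat = IFM_CHANNEL * PRECISION // 8
print(len(beats), "beats,", len(beats) * bytes_per_beat, "bytes in total")
assert unpack_beats(beats, IFM_CHANNEL, PRECISION) == weights
```

In load mode the IP reads exactly KERNEL_DIM*KERNEL_DIM beats with a blocking in.read(). If the DMA delivers fewer beats than that (for example because the buffer size or stream width does not match), the read would not return and the IP would stay busy.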

My questions are:

1. Is there any issue in my top function? I referred to the implementation from SpooNN.
2. I think it would help to monitor the AXI stream interface of my IP. Is there any way to do this using PYNQ?

Thanks,
Jia-Ming.

dimiter · December 30, 2019, 5:54pm

I have exactly the same problem. It seems that something is broken with the DMA on PYNQ v2.5: a script that works on PYNQ v2.4 on the Ultra96 hangs on DMA receive on v2.5.

r_n · January 14, 2020, 8:31pm

I believe I am having the same issue. I will try loading v2.4 and report back on whether that fixes it.

r_n · January 16, 2020, 3:19pm

I can confirm as well: my issue went away when I dropped to v2.4 and removed the wait calls.

dimiter · January 23, 2020, 5:30pm

@rock , @PeterOgden,

Can you confirm whether there is a workaround for this issue on PYNQ 2.5?
Would moving back to PYNQ 2.4 require downgrading everything?

rock · January 23, 2020, 6:05pm

Can you post the block design? I wonder if you are using interrupt 0 for DMA; that is a known issue.

@dimiter I don’t think you should downgrade. I suspect something basic is broken or not set up properly. We need a little more info.

PeterOgden · January 23, 2020, 6:31pm

I’m not really sure what’s going on. Would either of you be able to provide an HWH or TCL file for your design so I can dig deeper?

In the meantime, can you try the PYNQ-HelloWorld example? It also uses the DMA engine and is part of our regression tests prior to each release.

Peter

Jia-Ming_Lin · January 24, 2020, 5:33pm

Hi @rock and @PeterOgden,

Thank you for your replies. The TCL file, block design, and HLS code are zipped and attached.

The zip file contains:

TCL file
Block Design(xpr)
IP HLS source code
Python driver

My board is an Ultra96-v2 with PYNQ 2.5 installed.

Thank you very much!!

output_file.zip (15.9 KB)

rock · January 24, 2020, 5:40pm

Can you export the block design to a PDF file and attach it?

Jia-Ming_Lin · January 25, 2020, 4:38am

Hi @rock,
Here is the PDF file.
DoCompute.pdf (199.9 KB)

rock · January 25, 2020, 5:45am

@Jia-Ming_Lin I noticed that you are using M_AXI_HPM0_FPD for the AXI-Lite interfaces. Can you use the M_AXI_HPM0_LPD port instead?

If it still does not work after that change, note that the Ultra96 BSP may use a different port bit width than your PS configuration. For example, the boot files may configure the LPD port as 128 bits wide while your bitstream uses 32 bits.
One way to check is to build the block design from the v2 TCL file in https://github.com/Avnet/Ultra96-PYNQ/tree/master/Ultra96/sensors96b and compare its PS settings with yours.
There are also ways to adjust the system memory port width directly from the OS: check the Zynq UltraScale+ Devices Register Reference and adjust the register values accordingly (e.g. look for the afi_fs (LPD_SLCR) register and adjust its width).

Jia-Ming_Lin · January 25, 2020, 4:36pm

Hi @rock,

I switched to M_AXI_HPM0_LPD and left the bit width at the default of 32, which is consistent with the example project “sensors96b”. I built “sensors96b” from the branch image_v2.4_v2 using Vivado 2018.3, and it worked normally.

However, when I run my code to create the DMA object

dma = overlay.axi_dma_0

the system crashed and a dialog appeared with the message:

Connection Failed. 
A connection to the notebook server could not be established. The notebook will continue trying to reconnect. Check your network connection or notebook server configuration.

The following is the updated block design in PDF, and the DMA configuration.
DoCompute_LPD.pdf (199.9 KB)

Thank you very much!

rock · January 25, 2020, 5:10pm

The crash looks strange. I am wondering whether you loaded a bitstream somewhere outside of pynq and then started to use the overlay class. In that case the system may have probed a non-existent AXI address, which hangs the system. Can you reboot the board and start cleanly?

Also, please make sure you use *.bit + *.hwh for overlay files, as you have been warned (you probably have used *.bit + *.tcl, which will be deprecated soon).

Jia-Ming_Lin · January 26, 2020, 4:58pm

Hi @rock,

I solved the crash by deleting and re-adding the custom IP and the DMA, then forcing a design update so that the interface offset addresses were refreshed.

However, the last operation, fetching the cached data, still does not work: the DMA hangs at dma.recvchannel.wait().
Also, in my scenario it is not appropriate to adjust the system configuration for a single application, so I still need to configure the design correctly in Vivado before bitstream generation.

Thank you very much!

dimiter · January 27, 2020, 1:38pm

I tested the resize IP example that was working on PYNQ v2.4.
It fails when reading from the DMA. Then I implemented a simple FIFO DMA readback.
The transaction gets stuck when reading. There are no interrupts at all in this design.

This issue happens for all designs that use the DMA read on v2.5.
I am using .bit and .tcl files, in case that is important.
How do we generate the .hwh from Vivado?

The simple DMA code is below:

#!/usr/bin/env python
# coding: utf-8

import numpy as np
from pynq import Xlnk
from pynq import Overlay
from pynq import allocate
from pynq.lib import AxiGPIO

# Load the overlay and list the IP blocks it contains
dma_design = Overlay("dmafifo.bit")
dma_design.ip_dict

# Write to the LED GPIO
led_instance = dma_design.ip_dict['axi_gpio_0']
led = AxiGPIO(led_instance).channel1
led.write(0, 0x01)

# Buffers allocated through Xlnk (these are the ones passed to the DMA)
xlnk = Xlnk()
in_buffer = xlnk.cma_array(shape=(100, 100, 3), dtype=np.uint8, cacheable=1)
out_buffer = xlnk.cma_array(shape=(100, 100, 3), dtype=np.uint8, cacheable=1)

dma = dma_design.axi_dma_0

# Buffers allocated through allocate() (not used below)
input_buffer = allocate(shape=(5,), dtype=np.uint32)
output_buffer = allocate(shape=(5,), dtype=np.uint32)

# Fill in_buffer: every element of slice i is set to the value i
for i in range(100):
    in_buffer[i] = i

# Start both channels, then wait for them to complete
dma.sendchannel.transfer(in_buffer)
dma.recvchannel.transfer(out_buffer)
dma.sendchannel.wait()
dma.recvchannel.wait()
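One thing that may be worth ruling out for the script above is the size of a single transfer relative to the AXI DMA "Width of Buffer Length Register" setting. This is an assumption, not a confirmed diagnosis for this thread: the sketch below assumes the IP was left at its default width of 14 bits, and the helper names are hypothetical. The actual width should be checked in the block design.

```python
def max_transfer_bytes(length_width_bits):
    """Largest byte count that fits in a length register of the given width."""
    return (1 << length_width_bits) - 1


def buffer_nbytes(shape, itemsize):
    """Total size in bytes of an array with the given shape and item size."""
    n = itemsize
    for dim in shape:
        n *= dim
    return n


ASSUMED_LENGTH_WIDTH = 14  # assumed AXI DMA default; check your own design
nbytes = buffer_nbytes((100, 100, 3), itemsize=1)  # uint8 buffers from the script
limit = max_transfer_bytes(ASSUMED_LENGTH_WIDTH)
print(nbytes, "bytes requested,", limit, "bytes fit in the length register")
print("fits" if nbytes <= limit else "too large for a single transfer")
```

If the requested size does exceed what the length register can hold, either widening the register in Vivado or splitting the data into smaller transfers would be the usual options.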

rock · January 27, 2020, 5:41pm

The hwh file can be found at paths like:
${overlay_name}.srcs/sources_1/bd/${design_name}/hw_handoff/${design_name}.hwh ${overlay_name}.hwh

If you look at our overlays, you can find how we generate it:
https://github.com/Xilinx/PYNQ/blob/master/boards/Pynq-Z1/base/build_bitstream.tcl#L28
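PYNQ looks for a .hwh file with the same base name as the .bit file, in the same directory. A quick way to check this before loading an overlay is sketched below; expected_hwh and has_matching_hwh are hypothetical helper names, and the demonstration uses empty placeholder files in a temporary directory rather than real build outputs.

```python
from pathlib import Path
import tempfile


def expected_hwh(bit_path):
    """Path where a .hwh with the same base name as the bitstream would sit."""
    return Path(bit_path).with_suffix(".hwh")


def has_matching_hwh(bit_path):
    """True if the matching .hwh file exists next to the bitstream."""
    return expected_hwh(bit_path).is_file()


# Demonstration in a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    bit = Path(tmp) / "dmafifo.bit"
    bit.touch()
    print(has_matching_hwh(bit))  # no .hwh has been created yet
    expected_hwh(bit).touch()
    print(has_matching_hwh(bit))  # the .hwh placeholder now exists
```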

dimiter · January 27, 2020, 6:09pm

Thanks, I will try that but I don’t see how it can make a difference with the DMA xlnk read issue.

rock · January 30, 2020, 7:00pm

@dimiter, I also noticed that you used both the allocate function and the cma_array function. You don’t have to; you only need

in_buffer = xlnk.cma_array(shape=(100, 100, 3), dtype=np.uint8, cacheable=1)
out_buffer = xlnk.cma_array(shape=(100, 100, 3), dtype=np.uint8, cacheable=1)

rock · January 30, 2020, 7:12pm

@Jia-Ming_Lin Not sure whether you were able to resolve it. For me, the DMA calls work consistently without issues, so something in your flow is probably not done properly. If you still need help, please zip up the entire Vivado project (along with the TCL file that generates your block design), your HLS IP, and your notebook, and send it to me. I can have a look when I have some time.

Jia-Ming_Lin · January 31, 2020, 12:57pm

Hi @rock,

Thank you for your help. The attachments are as follows:

DW_LoadWeightComp.zip: zip file of the HLS code (compiled with Vivado HLS 2018.3)
DW_LoadWeightComp_Design.zip: zip file of the block design for Vivado
- DoCompute.tcl: design using M_AXI_HPM0_FPD
- DoComputeLPD.tcl: design using M_AXI_HPM0_LPD
conv_load_weights.ipynb: IPython notebook

The overall procedure is:

1. Set the IP to load data from the DMA by setting the register at 0x10 to 13.
2. Trigger the DMA and the custom IP (0x0 = 1).
   - The DMA streams data to the IP, and the IP writes the data to on-chip memory.
3. Set the IP to do computation (here, packing the cached data into a stream for the DMA to write to DRAM) by setting the register at 0x10 to 1 (any number other than 13).
4. Trigger the DMA and the custom IP (0x0 = 1).
   - The IP packs the cached data into a stream, and the DMA writes it to DRAM.
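Between the load phase and the read-back phase of the procedure above, it can help to confirm that the IP has actually returned to idle, and to fail with an error rather than hang if it has not. The following is a minimal sketch: wait_until_idle is a hypothetical helper, the ap_idle bit position assumes the standard Vivado HLS control register layout, and the hardware is replaced by a stand-in sequence of values so the example runs anywhere.

```python
import time

AP_IDLE = 1 << 2  # ap_idle bit in the HLS control register (assumed layout)


def wait_until_idle(read_ctrl, timeout_s=1.0, poll_s=0.01):
    """Poll the control register until ap_idle is set, or raise TimeoutError.

    read_ctrl is any zero-argument callable returning the register value,
    for example `lambda: conv_accel.read(0x0)` on real hardware.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        value = read_ctrl()
        if value & AP_IDLE:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"IP still busy, ap_ctrl = {value:#x}")
        time.sleep(poll_s)


# Stand-in for the hardware: reports "running" twice, then done+idle.
fake_values = iter([1, 1, 6, 4])
print(wait_until_idle(lambda: next(fake_values)))
```

If this times out after step 2, the IP never finished consuming its input, and writing 1 to 0x0 again in step 4 would likely have no effect because ap_start is already set.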

I took a similar method from SpooNN. I have successfully reproduced the experiments in SpooNN; however, the same process does not work when I implement it myself.

I have monitored the value of the control register (0x0). In my design, the status stays at 1 after I trigger the transfer (send data), whereas in SpooNN the status returns to 4 (idle) after a short time.

My tool versions and environment:

Ubuntu 18.04
Vivado HLS 2018.3 and Vivado 2018.3
Ultra96 v2
PYNQ 2.5

And thank you again for your help.

DW_LoadWeightComp.zip (2.4 MB) DW_LoadWeightComp_Design.zip (18.2 KB)
conv_load_weights.ipynb (30.6 KB)
