DMA Stream Output Only Hangs

PYNQ-Z2, v2.6 I think, Vitis HLS and Vivado 2023

Hello! I’m new to PYNQ and FPGAs. I have an IP design that takes in two input streams and outputs two streams. This works correctly. However, the input data is slow to calculate in software, and the board could calculate it by itself if given 6 constant input values. I’m trying to make a new version that takes just those 6 input values (2 are ints, the height and width of the output buffer, which becomes relevant in a sec) and outputs the same streams.

However, the issue I’m experiencing is that the DMA transfer hangs. Writing the 6 input values works fine, but when I call the DMA recv transfer, then write a 1 to AP_START (or in the opposite order), I find that the DMA wait() method hangs forever. It should terminate in under a second.

I had a similar issue to this before, where the problem was that the board did not know the length of the input/output buffers. However, this was fixed once I passed this number to the function, and just called read()/write() inside a for-loop that ran that many times. Now, instead of one input number, the stream length is the product of the two int inputs (height * width, it’s 2D). There are 2 nested for-loops, and the write() method is called inside the inner one, so it should run the correct number of times. However, if there’s some way I’m supposed to manually signal the end of the stream, I’m not aware of it. (Note: the stream uses ap_uint and should already have TLAST taken care of, so I don’t think that’s the issue. If it were, my previous example with input and output streams would not have worked.)

Does anyone know what the problem might be? I know I don’t give a ton of context, please let me know what other info I could provide that would be helpful!

Here is my HLS code:

#include <math.h>
#include <complex>
#include "ap_axi_sdata.h"
#include "hls_stream.h"

typedef ap_axis<32,0,0,0> transPkt;

// For converting byte-wise from float to integer and back, because hls streams use ints
union fp_int {
	int i;
	float fp;
};
inline std::complex<float> func_to_test(std::complex<float> z) {
	return __complex_cos(z);
}

void my_ip(int width_px, int height_px, float xMin, float xMax, float yMin, float yMax,
hls::stream<transPkt>&out_hues, hls::stream<transPkt>&out_brightnesses) {
#pragma HLS INTERFACE mode=axis port=out_hues
#pragma HLS INTERFACE mode=axis port=out_brightnesses
#pragma HLS INTERFACE s_axilite port=width_px
#pragma HLS INTERFACE s_axilite port=height_px
#pragma HLS INTERFACE s_axilite port=xMin
#pragma HLS INTERFACE s_axilite port=xMax
#pragma HLS INTERFACE s_axilite port=yMin
#pragma HLS INTERFACE s_axilite port=yMax
#pragma HLS INTERFACE mode=s_axilite port=return
	fp_int angle, modulus, hue, brightness;
	transPkt io1pkt, io2pkt; // output packets for hue and brightness respectively

// here are the for-loops that should create a height*width length output
	for (unsigned int x = 0; x<width_px; x++){
		for (unsigned int y = 0; y<height_px; y++){
			std::complex<float> z(
				xMin + (x/(float)width_px)*(xMax-xMin),
				yMin + (y/(float)height_px)*(yMax-yMin)
			);
			
[various calculations removed -- it's a complex function color plot generator]

			//AXIS output packets are expecting integer type
			io1pkt.data = (angle / TWOPI) * 255;
			io2pkt.data = frac_lightness * 255; // frac_lightness is generated in the removed code
			out_hues.write(io1pkt); // write values to stream
			out_brightnesses.write(io2pkt);
		}
	}
}

Update on this: It seems like the IP is running correctly, but the data is not being written back to DMA. I know this because I added a register to the HLS, that simply gets set to a certain value when the IP starts, then a different value once the stream operations are done. This register got set to that value, and the AP_IDLE signal got set back to 1, so the IP seems to be finishing its job, but somehow the data is not written back to the DMA buffer correctly.

I found an inefficient but functional workaround. Initially I figured that if an input stream was required for the design to work, I could just use the input stream to transmit the 6 input values, rather than registers. However, this did not work. It seems that simply reading the first few values and discarding the rest does not work: when I transmitted only the inputs, it returned only one output value (leaving the rest of the buffer as 0s), and when I tried to send any other size of input, it gave a DMA internal error (0 values returned). A few times I just got the original hanging problem. Finally I gave up. Since the DMA memory transfer takes very little time at the scale I’m using, I just had the software send input data equal in length to the output data; the IP ignores that data and constructs its own output (based on the 6 input values passed in registers). This works, although it feels like such a dumb solution.

The DMA expects to see TLAST in the AXI stream set to 1 for the last value of the transfer. This may be of use:

https://discuss.pynq.io/t/tutorial-using-a-hls-stream-ip-with-dma-part-1-hls-design/3344

Cathal

Hi Cathal, I’ve read through your tutorial several times in the process of building this project, and it’s been very helpful, but I’m not sure how TLAST applies to this issue. Shouldn’t the dma0_send.transfer() and dma0_recv.transfer() calls handle TLAST? They both get it right when I’m sending equal-length streams, and since the streams use different buffers and different transfer() calls, I don’t see why one would affect the other. (The TLAST signal is present in my design, and transfers work fine when they’re the same length input and output, so TLAST is clearly being set, at least sometimes!)

The only thing I can think of is if there’s really only one TLAST signal for both input and output, and it has to be set by both at the same time or same index of the stream. Is this the case? (If so, why does it work this way??)

> I’ve read through your tutorial several times in the process of building this project, and it’s been very helpful, but I’m not sure how TLAST applies to this issue.

I may not be understanding your problem properly, especially when you say this:

> They both get it right when I’m sending equal-length streams, and since the streams use different buffers and different transfer() calls, I don’t see why one would affect the other.

Can you share your block diagram?

> Shouldn’t the dma0_send.transfer() and dma0_recv.transfer() calls handle TLAST?

No, not really. TLAST is handled at the level of the DMA IP. The data stream may be transferred, but the DMA doesn’t update its own status to show that the transaction is complete until it sees TLAST. You can verify this by using register_map to check the status register of the IP.
Register info for DMA here: https://docs.xilinx.com/r/en-US/pg021_axi_dma/AXI-DMA-Register-Address-Map
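To make the status check concrete, here is a minimal sketch of decoding a raw DMA status word. It assumes the bit layout I recall from PG021 for MM2S_DMASR (offset 0x04) and S2MM_DMASR (offset 0x34): Halted in bit 0, Idle in bit 1, the three error flags in bits 4 to 6, and IOC_Irq in bit 12. The names `DmaStatus` and `decode_dmasr` are made up for this illustration and are not part of any library; please check the bit positions against the register map linked above before relying on them.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the fields of a DMASR word that matter when a
// transfer appears to hang. Bit positions are an assumption based on PG021.
struct DmaStatus {
    bool halted;   // bit 0: channel is stopped
    bool idle;     // bit 1: channel finished its work and is waiting for more
    bool int_err;  // bit 4: DMAIntErr (internal error)
    bool slv_err;  // bit 5: DMASlvErr (slave error)
    bool dec_err;  // bit 6: DMADecErr (decode error)
    bool ioc_irq;  // bit 12: interrupt-on-complete flag
};

// Extract the individual flags from a raw 32-bit status register value.
inline DmaStatus decode_dmasr(std::uint32_t sr) {
    DmaStatus s{};
    s.halted  = (sr >> 0)  & 1u;
    s.idle    = (sr >> 1)  & 1u;
    s.int_err = (sr >> 4)  & 1u;
    s.slv_err = (sr >> 5)  & 1u;
    s.dec_err = (sr >> 6)  & 1u;
    s.ioc_irq = (sr >> 12) & 1u;
    return s;
}
```

If the receive channel reports neither Halted nor Idle while wait() is blocked, that would be consistent with the DMA still waiting for the end of the stream, though other causes are possible.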

In the code you share for my_ip, you are generating two AXI streams of data. The TLAST wire may exist in your IP, but you need to set it to one from your IP, otherwise it will always be 0. You should add code to your IP to set TLAST to 1 on the last value in the stream.
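As a rough illustration of that change, the sketch below raises `last` on the final iteration of the nested loops, separately for each output stream. To keep it compilable without the Vitis headers, `Pkt` and `VecStream` are hypothetical stand-ins for `ap_axis<32,0,0,0>` and `hls::stream<transPkt>`, and `emit_plot` is a made-up name for the loop body of my_ip; the data values are placeholders for the removed calculations. Setting `keep` and `strb` to all ones is an assumption that is commonly recommended for DMA sinks, not something confirmed in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for ap_axis<32,0,0,0>; only the fields used here are modelled.
struct Pkt {
    std::int32_t data;
    std::uint8_t keep;  // byte-enable mask (4 bytes -> 0xF)
    std::uint8_t strb;
    bool last;          // TLAST
};

// Stand-in for hls::stream<Pkt>: records every beat that is written.
struct VecStream {
    std::vector<Pkt> beats;
    void write(const Pkt& p) { beats.push_back(p); }
};

// Writes width_px * height_px beats to each stream and marks only the
// final beat of each stream with last = 1.
template <typename Stream>
void emit_plot(int width_px, int height_px,
               Stream& out_hues, Stream& out_brightnesses) {
    assert(width_px > 0 && height_px > 0);
    for (int x = 0; x < width_px; x++) {
        for (int y = 0; y < height_px; y++) {
            Pkt hue_pkt{}, bright_pkt{};
            hue_pkt.data = x;     // placeholder for the real hue value
            bright_pkt.data = y;  // placeholder for the real brightness value
            hue_pkt.keep = bright_pkt.keep = 0xF;  // all four bytes valid
            hue_pkt.strb = bright_pkt.strb = 0xF;

            // True only on the very last pass through both loops.
            const bool is_last = (x == width_px - 1) && (y == height_px - 1);
            hue_pkt.last = is_last;
            bright_pkt.last = is_last;

            out_hues.write(hue_pkt);
            out_brightnesses.write(bright_pkt);
        }
    }
}
```

In the real IP the same `is_last` expression would be assigned to `io1pkt.last` and `io2pkt.last` just before the two `write()` calls. Each AXI stream carries its own TLAST, so both packets need it.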

You should be able to stop and reset a channel to work around the TLAST issue, but that amounts to building the design incorrectly and then patching it with soft resets, so I wouldn’t suggest it. I’d suggest you try to build the hardware properly in the first place.

Cathal

Thank you Cathal,

My block design is simply my custom IP connected to two (three in later versions) AXI DMA streams, stream 0 in and stream 0 out to DMA 0 read and write respectively, stream 1 to DMA 1, etc., plus a ZYNQ7 processing system and whatever else is added by the DMA’s connection automation. I’m not sure how to share it here.

I’ve now successfully found a workaround (of just transmitting 0s for input) that works well enough that I don’t think I need to investigate this further (I am under a tight deadline). However, I’ll try to give more information in case someone else in the future has this issue.

Basically, I’ve followed this tutorial to create an IP with multiple streams in and out. I found that following the same process but not using any input streams (and disabling the corresponding stream in the DMA blocks) causes the transfer() calls to hang forever, but I do not know why. The IP does know the length of the output that it should be writing.
I also found that if the input streams exist but are a different length than the output streams, the transfer() calls throw an error along the lines of “no data transferred”.
Note that in my use case, I’ve only ever had equal lengths for all output streams, and they’re written to in sequence (value to stream 0, then 1, then next value to 0, then 1, etc., rather than all of 0 then all of 1).
Since streaming the inputs takes on the order of milliseconds, my workaround is to leave the inputs but transmit all 0s, to avoid all preprocessing except the initial allocate() call.