Using AXI DMA with Wide Streams of packed numbers

I’m trying to test an HLS core that outputs a single 256-bit-wide AXI stream (8 complex< ap_uint<16> >) and asserts TLAST every 256 transactions (so 8 kB per packet). I’ve connected it to an AXI DMA’s S2MM port and I thought I’d gotten all the various widths and settings right.

Over in my test notebook I’ve tried allocating an array of 2048 np.uint32 (so 8 kB) and used that to kick off the transfer. I am getting data and things are mostly right, but I think they are shifted by 128 bits and I might be seeing some other issues.

Presently I’m re-implementing with a System ILA to look further, but for now I’m wondering if I’m even on the right track. Perhaps the issue is that I need to override the DMA receive channel’s length to specify 256 transactions instead of 8192, but I thought that length should be the number of bytes.

You are correct that the receive channel’s length should be the number of bytes so that is unlikely to be the issue. The IPI validator would also be issuing warnings if widths aren’t connected up correctly so it’s worth double checking there.
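For anyone checking the same numbers, here is a small sketch of the arithmetic behind the byte length. The helper names `packet_bytes` and `words_for` are made up for illustration and are not part of PYNQ.

```python
def packet_bytes(beats, stream_bits):
    """Bytes in one packet of `beats` transfers on a `stream_bits`-wide stream."""
    if stream_bits % 8:
        raise ValueError("stream width must be a whole number of bytes")
    return beats * (stream_bits // 8)


def words_for(nbytes, word_bytes=4):
    """Smallest number of `word_bytes`-sized elements that holds `nbytes`."""
    return -(-nbytes // word_bytes)  # ceiling division


# 256 beats of a 256-bit stream, received into an array of 32-bit words
length = packet_bytes(beats=256, stream_bits=256)
print(length, words_for(length))
```

So a 2048-element np.uint32 buffer is exactly one packet, and the length the DMA works with is 8192 bytes, not 256 beats.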

Would you be able to say what board you are using and what the processing system’s HP settings are for the memory port you are using?


Sure. I was in the midst of re-implementing the design with changes I thought might help (they didn’t) and a system ILA (which prevents using PYNQ, it seems because I’d never set in uenv.txt, per this other post).

I’m on the ZCU111, PYNQ 2.5, using 2019.2 for all my work and clocking at 100MHz (for testing, eventually the streams must be at 512MHz, but then I won’t be working with a DMA!). My test bench block diagram is:

And some of the key points are:


In case it becomes relevant the HLS core’s code/docs are here. I don’t have the Python testbench into a public git yet.

I can’t see anything wrong with how everything is set up. Any luck getting data out of the ILA?


Sort of. I got the ILA working and placed debug on the streams to and from the DMA. The packets I was trying to send were 256 beats of the 256-bit-wide stream, and no matter how I configured the ILA I could not get it to show more than ~5 beats on either side of the trigger (with a window 1024 long).

In the end I placed an AXI Stream Data FIFO in packet mode before and after my HLS core and then debugged those streams. There I was able to capture a full transfer and find the problem.

It turns out that a function declared as

typedef struct {
    std::complex< ap_uint<16> > data[8];
    ap_uint<1> last;
} stream_t;
void top(..., stream_t &out, ...) {
#pragma HLS INTERFACE axis register forward port=out
#pragma HLS INTERFACE ap_ctrl_none port=return

will pack the complex values as r0 r1 … r7 i0 i1 … i7 on the stream, not interleaved as r0 i0 r1 i1 … r7 i7. I’ve not been able to find ANY documentation that would indicate this, but at least I understand now.
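To make that layout concrete, here is a rough host-side sketch that turns received beats back into complex samples. It assumes the eight real parts sit in the low 128 bits of each beat and the eight imaginary parts in the high 128 bits, as unsigned little-endian 16-bit words; check that ordering against your own capture. The function name `unpack_planar_beats` is hypothetical.

```python
import struct

LANES = 8        # complex samples per 256-bit beat (from the stream_t above)
BEAT_BYTES = 32  # 256 bits


def unpack_planar_beats(raw):
    """Return a list of complex samples from planar-packed 256-bit beats.

    Assumption: each beat is r0..r7 followed by i0..i7, all unsigned
    little-endian 16-bit words. `raw` may be any C-contiguous bytes-like
    object (bytes, bytearray, or a NumPy array).
    """
    data = memoryview(raw).cast("B")  # view the buffer as plain bytes
    if len(data) % BEAT_BYTES:
        raise ValueError("buffer is not a whole number of 256-bit beats")
    samples = []
    for words in struct.iter_unpack("<16H", data):
        re, im = words[:LANES], words[LANES:]
        samples.extend(complex(r, i) for r, i in zip(re, im))
    return samples
```

Because it only needs the buffer protocol, the same call should accept the np.uint32 array handed to the DMA once the transfer has completed.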

An example of the driving test code for posterity is here.

I think to simplify my test benches for future cores I’ll use the AXI Stream FIFO, so that I can manually fill it with a preset number of (partial) packets and then pump them through on command. I think I would just need a small wrapper around the DefaultIP instance that deals with setting the various registers over MMIO. Slow, but still instant for my testing purposes and faster to implement.
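A minimal sketch of what such a wrapper might look like on the transmit side. The register offsets are as I recall them from the AXI4-Stream FIFO register map, so treat them as assumptions and verify them against the product guide for your IP version. The class name `StreamFifoTx` is hypothetical; `ip` can be any object with `read(offset)` and `write(offset, value)`, which is the interface a DefaultIP instance exposes. The AXI4-Lite data port is 32 bits wide, so a 256-bit stream would also need a width converter after the FIFO.

```python
# Assumed AXI4-Stream FIFO register offsets (verify against the product guide)
TDFV = 0x0C  # transmit FIFO vacancy, in 32-bit words
TDFD = 0x10  # transmit data write port
TLR = 0x14   # transmit length in bytes; writing it releases the packet


class StreamFifoTx:
    """Hypothetical transmit-only helper around an MMIO-style object."""

    def __init__(self, ip):
        self.ip = ip  # must provide read(offset) and write(offset, value)

    def send_packet(self, words):
        """Queue 32-bit words, then start transmission as one packet."""
        words = [int(w) & 0xFFFFFFFF for w in words]
        if len(words) > self.ip.read(TDFV):
            raise ValueError("packet does not fit in the transmit FIFO")
        for w in words:
            self.ip.write(TDFD, w)
        self.ip.write(TLR, 4 * len(words))  # length is given in bytes
```

Keeping the wrapper dependent only on `read` and `write` also means it can be exercised off-board with a stand-in object before touching hardware.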