PYNQ: PYTHON PRODUCTIVITY

DMA output all zeros with custom IP

Hi, I have made the following design: design.pdf (68.6 KB). It is a simple DMA flow around a custom IP block. The custom IP block is a test to see how things work and is very simple: it shifts every word 1 bit to the right. The Verilog code is here: shifter_v1_0.zip (856 Bytes).

Now, when I do a sendchannel.transfer and a recvchannel.transfer from a Jupyter notebook, I can see in the ILA that actual data is flowing, but the output buffer in Python contains only zeros. I am not sure why, or how to debug it. My design holds TVALID high at all times, since the internal buffer can always be read. I was unsure how TLAST works and just made it always high as well. Does this perhaps not work with the DMA controller?

Any ideas are welcome. Thanks!

Chrome stops me from downloading your zip file. Can you show the design as an embedded picture and paste the code directly here?

That's weird. Strangely, I am not allowed to upload a .v file on this forum. I'll try to copy and paste it in here:

`timescale 1 ns / 1 ps

module shifter_v1_0 #
(
	parameter integer TDATA_WIDTH	= 32,

	// Parameters of Axi Master Bus Interface M00_AXIS
	parameter integer C_M00_AXIS_TDATA_WIDTH	= TDATA_WIDTH,

	// Parameters of Axi Slave Bus Interface S00_AXIS
	parameter integer C_S00_AXIS_TDATA_WIDTH	= TDATA_WIDTH
)
(
	// Ports of Axi Master Bus Interface M00_AXIS
	input wire  m00_axis_aclk,
	input wire  m00_axis_aresetn,
	output wire  m00_axis_tvalid,
	output wire [C_M00_AXIS_TDATA_WIDTH-1 : 0] m00_axis_tdata,
	output wire [(C_M00_AXIS_TDATA_WIDTH/8)-1 : 0] m00_axis_tstrb,
	output wire  m00_axis_tlast,
	input wire  m00_axis_tready,

	// Ports of Axi Slave Bus Interface S00_AXIS
	input wire  s00_axis_aclk,
	input wire  s00_axis_aresetn,
	output wire  s00_axis_tready,
	input wire [C_S00_AXIS_TDATA_WIDTH-1 : 0] s00_axis_tdata,
	input wire [(C_S00_AXIS_TDATA_WIDTH/8)-1 : 0] s00_axis_tstrb,
	input wire  s00_axis_tlast,
	input wire  s00_axis_tvalid
);

    // internal buffer of one word
    reg [TDATA_WIDTH-1 : 0] buffer;
    // register of output driver
    reg [TDATA_WIDTH-1 : 0] data_out;
    // always be ready to read since our operation is single clock cycle
    assign s00_axis_tready = 1'b1;
    
    // slave driver
    always @(posedge s00_axis_aclk)
    begin
        if(!s00_axis_aresetn)
        begin
            buffer <= {(TDATA_WIDTH){1'b0}};
        end
        else
        begin
            if(s00_axis_tvalid)
            begin
                buffer <= s00_axis_tdata;
            end
        end
    end


    // master driver
    // data can always be read from the buffer so always valid
    assign m00_axis_tvalid = 1'b1;
    assign m00_axis_tstrb  = {(TDATA_WIDTH/8){1'b1}};
    // one cycle per word so tlast always asserted
    assign m00_axis_tlast  = 1'b1;

    // assign to the output the shift operation
    assign m00_axis_tdata = data_out;

    always @(posedge m00_axis_aclk)
    begin
        if(!m00_axis_aresetn)
        begin
           data_out <= {(TDATA_WIDTH){1'b0}};
        end
        else
        begin
            data_out <= buffer >> 1;
        end
    end

endmodule

And my block design:

What does your python code look like? Did you start the IP?

I call the DMA with a simple transfer: I allocate a buffer and execute sendchannel.transfer and recvchannel.transfer. I first tested it with a FIFO block, and that worked flawlessly. I then replaced the FIFO block with the custom IP.

With other IP blocks you don't need to call the IP itself either, only the DMA transfer. I also have a working FFT pipeline, for example.

The code is nothing more than:

dma_in_buffer = xlnk.cma_array(shape=(N,), dtype=np.int32)
dma_out_buffer = xlnk.cma_array(shape=(N,), dtype=np.int32)
data = np.random.randint(2**10, size=N, dtype=np.int32)
np.copyto(dma_in_buffer, data)
dma_in.sendchannel.transfer(dma_in_buffer)
dma_in.sendchannel.wait()
dma_out.recvchannel.transfer(dma_out_buffer)
dma_out.recvchannel.wait()
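
For comparison, PYNQ DMA examples commonly start both channels before waiting on either one. The sketch below shows that ordering. It arms the receive side first, which is an assumption about what suits a streaming IP with no internal buffering, not a confirmed fix for this design. `_RecordingChannel` and `run_transfer` are hypothetical names introduced only so the ordering can be checked without a board.

```python
class _RecordingChannel:
    """Hypothetical stand-in for a PYNQ DMA channel; it only logs calls."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def transfer(self, buf):
        self.log.append(self.name + ".transfer")

    def wait(self):
        self.log.append(self.name + ".wait")


def run_transfer(send_channel, recv_channel, in_buffer, out_buffer):
    """Start both directions before blocking on either one."""
    # Assumption: arm the receiver first so it is ready when data arrives.
    recv_channel.transfer(out_buffer)
    send_channel.transfer(in_buffer)
    send_channel.wait()
    recv_channel.wait()


log = []
run_transfer(_RecordingChannel("send", log),
             _RecordingChannel("recv", log),
             [1, 2, 3], [0, 0, 0])
print(log)
```

On the board, the equivalent call would be `run_transfer(dma_in.sendchannel, dma_out.recvchannel, dma_in_buffer, dma_out_buffer)`.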

I checked the logic of your Verilog. The assertion of TLAST is very suspicious: since it is always asserted, the transaction stops almost immediately after it starts. Since you have an ILA connected, I would recommend running a C program in SDK to debug the AXI interface. The logic does not look correct to me. At the very least I would use an FSM to make sure the handshake signals are generated properly.
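
The effect of a permanently asserted TLAST can be illustrated with a small behavioral model. This is a simplified sketch, not the actual DMA IP: it assumes a receiver that stores accepted beats into a zero-initialized buffer and ends the transfer at the first beat carrying TLAST. `receive_packet` is a hypothetical helper.

```python
def receive_packet(beats, max_words):
    """Store (data, tlast) beats until TLAST is seen or the buffer is full.

    Returns the zero-initialized buffer and the number of words written.
    """
    out = [0] * max_words
    count = 0
    for data, tlast in beats:
        if count == max_words:
            break
        out[count] = data
        count += 1
        if tlast:
            # The packet ends here; later beats are not part of this transfer.
            break
    return out, count


words = [10, 20, 30, 40]
# TLAST tied high: every beat claims to be the last one of the packet.
always_last = [(w, True) for w in words]
# TLAST asserted only on the final beat of the packet.
proper_last = [(w, i == len(words) - 1) for i, w in enumerate(words)]

buf_always, n_always = receive_packet(always_last, len(words))
buf_proper, n_proper = receive_packet(proper_last, len(words))
print(n_always, buf_always)
print(n_proper, buf_proper)
```

Under these assumptions, the always-asserted case writes a single word and leaves the rest of the buffer at its initial zeros, while the proper case fills the whole buffer.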

I don't really have experience with the C SDK on PYNQ. However, I have now inserted a delay line for the TLAST signal, so it is buffered from the input. My idea is that this aligns the transaction from slave to master and is a little simpler than a full FSM. I think this is also compliant with the AXIS spec that I found here

This results in TLAST not being asserted at the beginning, so I would expect to see at least some data. However, everything in the output buffer is again zero.

My adjusted Verilog (the unchanged module declaration is omitted):

    // internal buffer of one word
    reg [TDATA_WIDTH-1 : 0] buffer;
    // delay buffer of tlast signal
    reg buffer_tlast;
    
    // always be ready to read since our operation is single clock cycle
    assign s00_axis_tready = 1'b1;
    
    // slave driver
    always @(posedge s00_axis_aclk)
    begin
        if(!s00_axis_aresetn)
        begin
            buffer <= {(TDATA_WIDTH){1'b0}};
            buffer_tlast <= 1'b0;
        end
        else
        begin
            if(s00_axis_tvalid)
            begin
                buffer <= s00_axis_tdata;
                buffer_tlast <= s00_axis_tlast;
            end
        end
    end


    // master driver
    // data can always be read from the buffer so always valid
    assign m00_axis_tvalid = 1'b1;
    assign m00_axis_tstrb  = {(TDATA_WIDTH/8){1'b1}};

    // register of output driver
    reg [TDATA_WIDTH-1 : 0] data_out;
    // register of output driver of tlast signal
    reg tlast_out;

    // assign output drivers to output wires
    assign m00_axis_tdata = data_out;
    assign m00_axis_tlast = tlast_out;

    always @(posedge m00_axis_aclk)
    begin
        if(!m00_axis_aresetn)
        begin
           data_out <= {(TDATA_WIDTH){1'b0}};
           tlast_out <= 1'b0;
        end
        else
        begin
            data_out <= buffer >> 1;
            tlast_out <= buffer_tlast;
        end
    end
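
The handshake behavior of the adjusted logic above can be explored with a cycle-level Python model. This is a sketch under stated assumptions rather than a diagnosis: it assumes both interfaces share one clock, that the downstream consumer accepts a word on every cycle in which it drives TREADY high (because m00_axis_tvalid is tied high), and that it stops at the first accepted TLAST or when its buffer is full. `simulate` is a hypothetical helper.

```python
def simulate(stimulus, max_words):
    """Cycle-level model of the adjusted shifter.

    stimulus: one (s_tvalid, s_tdata, s_tlast, m_tready) tuple per clock.
    Returns the words captured by the downstream consumer.
    """
    buffer = 0
    buffer_tlast = 0
    data_out = 0
    tlast_out = 0
    captured = []
    for s_tvalid, s_tdata, s_tlast, m_tready in stimulus:
        # m00_axis_tvalid is tied high, so every ready cycle is a transfer.
        if m_tready:
            captured.append(data_out)
            if tlast_out or len(captured) == max_words:
                break
        # Registers update from the values held before the clock edge.
        data_out, tlast_out = buffer >> 1, buffer_tlast
        if s_tvalid:  # s00_axis_tready is tied high
            buffer, buffer_tlast = s_tdata, s_tlast
    return captured


# Case 1: the consumer is ready two cycles before the first input word.
early = simulate(
    [(0, 0, 0, 1), (0, 0, 0, 1), (1, 8, 0, 1), (1, 16, 0, 1),
     (1, 24, 1, 1), (0, 0, 0, 1), (0, 0, 0, 1)],
    max_words=3,
)

# Case 2: the consumer only becomes ready after all input has arrived.
late = simulate(
    [(1, 8, 0, 0), (1, 16, 0, 0), (1, 24, 1, 0), (0, 0, 0, 1), (0, 0, 0, 1)],
    max_words=3,
)

print(early)  # reset-value zeros are accepted as if they were data
print(late)   # the first shifted word (4) is overwritten before it is read
```

In this model the correct result for inputs 8, 16, 24 would be 4, 8, 12. The first case fills the three-word buffer with zeros before any real data reaches the output, and the second case loses a word because the output register ignores m00_axis_tready. Gating m00_axis_tvalid on the presence of fresh data and holding the output while m00_axis_tready is low would avoid both effects.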