Transferring a burst of data with DMA

I am implementing a system using Vivado, Vitis HLS and PYNQ. The first prototype of my system works perfectly: I have some DMAs that work as inputs and 2 DMAs that work as outputs. However, in this first prototype the output DMAs work on a sample-by-sample basis. To make this clearer, here are some more details:

  1. My input data is an array of 512 elements of 32 bits each
  2. My output data is an array of 512 elements of 64 bits each
  3. I send the input data as a burst: I allocate in_buffer_signal = allocate(shape=(512,), dtype=np.uint32) and then call transfer on the send channel of the DMA associated with that buffer
  4. I read the output sample by sample by allocating out_buffer_left = allocate(shape=(2,), dtype=np.uint32) and then reading with:
for j in range(512):
    dma_output_left.recvchannel.transfer(out_buffer_left)
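
To make the sizes in the list above concrete, here is a small sketch of how many bytes each transfer is expected to move. Plain NumPy arrays stand in for the PYNQ buffers so that it runs without a board; this assumes the allocated buffers have the same shape and dtype as written above. The names out_buffer_single and out_buffer_burst are only illustrative.

```python
import numpy as np

# Plain NumPy arrays stand in for the pynq allocate() buffers
# (assumption: same shape and dtype, so the byte counts are identical).
in_buffer_signal = np.zeros((512,), dtype=np.uint32)    # input burst
out_buffer_single = np.zeros((2,), dtype=np.uint32)     # one 64-bit output sample
out_buffer_burst = np.zeros((512, 2), dtype=np.uint32)  # 512 64-bit output samples

print(in_buffer_signal.nbytes)   # bytes sent per input burst
print(out_buffer_single.nbytes)  # bytes received per sample-by-sample read
print(out_buffer_burst.nbytes)   # bytes expected per output burst
```

So a single burst read has to deliver 512 times as many bytes as one of the sample-by-sample reads.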

The system works fine in this setup. What I want is to convert the output side to a burst transfer as well: instead of a for loop with 512 reads, I allocate a buffer out_buffer_left = allocate(shape=(512,2), dtype=np.uint32) and call dma_output_left.recvchannel.transfer(out_buffer_left) once, so that all 512 samples are transferred in a single call.

My system has an output IP which manages the output writing and the tlast, tkeep and tstrb signals.
The code of this final IP (in the sample-by-sample version) is:

#include "overlapper_simplified_HLS.hpp"

void overlapper(hls::stream<pkt_t> &input_signal_stream, hls::stream<pkt_t> &output_signal_stream){
	#pragma HLS INTERFACE axis register port=input_signal_stream
	#pragma HLS INTERFACE axis register port=output_signal_stream
	#pragma HLS INTERFACE ap_ctrl_none port=return
	#pragma HLS PIPELINE

	static ap_fixed_data_type buffer[BUFFER];
	static int samples_counter = 0;
	static int counter = 0;
	pkt_t input_sample;
	pkt_t output_sample;
	//float tmp_buffer_value = 0;

	input_signal_stream.read(input_sample);
	if(samples_counter<INPUT_WINDOW_LENGTH){
		output_sample.data.real_part = input_sample.data.real_part + buffer[samples_counter];
		samples_counter ++;
		// Setting DMA signals
		output_sample.last = true;
		output_sample.keep = -1;
		output_sample.strb = 0;
		output_signal_stream.write(output_sample);
	}
	else{
		buffer[counter] = input_sample.data.real_part;
		samples_counter ++;
		counter ++;
	}

	if(samples_counter == IFFT_LENGTH){
		counter = 0;
		samples_counter = 0;
	}
}

I adapted the above code to work with bursts of 512 samples by asserting tlast only once every 512 samples (tested with a testbench, it works fine):


#include "overlapper_simplified_HLS.hpp"

void overlapper(hls::stream<pkt_t> &input_signal_stream, hls::stream<pkt_t> &output_signal_stream){
	#pragma HLS INTERFACE axis register port=input_signal_stream
	#pragma HLS INTERFACE axis register port=output_signal_stream
	#pragma HLS INTERFACE ap_ctrl_none port=return
	#pragma HLS PIPELINE

	static ap_fixed_data_type buffer[BUFFER];
	static int samples_counter = 0;
	static int counter = 0;
	pkt_t input_sample;
	pkt_t output_sample;
	//float tmp_buffer_value = 0;

	input_signal_stream.read(input_sample);
	if(samples_counter<INPUT_WINDOW_LENGTH){
		output_sample.data.real_part = input_sample.data.real_part + buffer[samples_counter];
		samples_counter ++;
		output_sample.last = false;

		//Overwriting tlast for just the last sample of the data chunk
		if(samples_counter == INPUT_WINDOW_LENGTH){
			output_sample.last = true;
		}

		output_sample.keep = -1;
		output_sample.strb = 0;
		output_signal_stream.write(output_sample);
	}
	else{
		buffer[counter] = input_sample.data.real_part;
		samples_counter ++;
		counter ++;
	}

	if(samples_counter == IFFT_LENGTH){
		counter = 0;
		samples_counter = 0;
	}
}
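
The difference between the two versions can be sketched as a plain Python model of the counter logic (this is not HLS code, and it only tracks the tlast flag, not the data). INPUT_WINDOW_LENGTH = 512 follows from the 512-sample bursts described above; IFFT_LENGTH = 1024 is an assumed value chosen only for illustration, and the function name tlast_pattern is hypothetical.

```python
# Behavioural model of the tlast logic in the two overlapper versions.
INPUT_WINDOW_LENGTH = 512  # from the 512-sample bursts described above
IFFT_LENGTH = 1024         # assumed value, for illustration only


def tlast_pattern(burst_mode, n_inputs):
    """Return the tlast flag of every output sample written for n_inputs reads."""
    samples_counter = 0
    flags = []
    for _ in range(n_inputs):
        if samples_counter < INPUT_WINDOW_LENGTH:
            samples_counter += 1
            if burst_mode:
                # burst version: tlast only on the final sample of the window
                flags.append(samples_counter == INPUT_WINDOW_LENGTH)
            else:
                # sample-by-sample version: tlast on every output sample
                flags.append(True)
        else:
            # remaining samples of the frame are stored, nothing is written out
            samples_counter += 1
        if samples_counter == IFFT_LENGTH:
            samples_counter = 0
    return flags


single = tlast_pattern(burst_mode=False, n_inputs=2 * IFFT_LENGTH)
burst = tlast_pattern(burst_mode=True, n_inputs=2 * IFFT_LENGTH)
print(len(single), sum(single))  # output samples and tlast pulses, first version
print(len(burst), sum(burst))    # output samples and tlast pulses, burst version
```

Under these assumptions the first version raises tlast on every output sample, while the burst version raises it once per 512 output samples.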

However, my system still seems to behave sample by sample. For reference, the output of the sample-by-sample system is:

[4294967196          0]
[4294967022          0]
[4294966609          0]
[4294967210          0]
[762   0]
[2037    0]
[4546    0]
[7594    0]
[11597     0]
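
The large values in the first rows are easier to read as signed numbers. A minimal sketch, assuming the low 32-bit word carries a two's complement value and ignoring any fixed-point scaling of the real part:

```python
import numpy as np

# First column of some of the rows printed above, as stored in the uint32 buffer.
raw = np.array([4294967196, 4294967022, 4294966609, 4294967210, 762, 2037],
               dtype=np.uint32)

# Reinterpret the same 32 bits as signed integers (assumption: two's
# complement data in the low word; fixed-point scaling is not applied).
signed = raw.view(np.int32)
print(signed.tolist())  # [-100, -274, -687, -86, 762, 2037]
```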

The output of the burst system, after calling dma_output_left.recvchannel.transfer(out_buffer_left) twice with out_buffer_left = allocate(shape=(512,2), dtype=np.uint32), is:

[[4294967196          0]
 [         1          1]
 [         1          1]
 ...
 [         1          1]
 [         1          1]
 [         1          1]]

[[4294967022          0]
 [         1          1]
 [         1          1]
 ...
 [         1          1]
 [         1          1]
 [         1          1]]

As you can see, only one sample is received per call to recvchannel.transfer. The rest of the buffer is 1 because I initialize it to that value to debug the DMA transfer.
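
The pattern above can be reproduced with a toy model of the receive channel. It assumes that a receive transfer ends either when the buffer is full or on the first beat with tlast set, which is how the AXI DMA receive side is generally described for simple transfers. The functions receive and beats and the data values are made up for illustration; this is not PYNQ or driver code.

```python
def receive(stream, buffer_len, fill=1):
    """Toy model: copy beats from the stream into a buffer of buffer_len samples.

    Assumption: the transfer stops at the first beat whose tlast is set,
    or when the buffer is full. Unwritten entries keep the fill value.
    """
    buffer = [fill] * buffer_len
    for i in range(buffer_len):
        data, last = next(stream)
        buffer[i] = data
        if last:
            break
    return buffer


def beats(n, last_every):
    """Yield n (data, tlast) pairs with tlast set on every last_every-th beat."""
    for k in range(n):
        yield 1000 + k, (k + 1) % last_every == 0


# tlast on every beat: each transfer stores a single sample.
s = beats(1024, last_every=1)
first = receive(s, 512)
second = receive(s, 512)
print(first[:3], second[:3])  # [1000, 1, 1] [1001, 1, 1]

# tlast every 512 beats: one transfer fills the whole buffer.
s = beats(1024, last_every=512)
full = receive(s, 512)
print(full[:3], full[-1])  # [1000, 1001, 1002] 1511
```

In this model, the burst output shown above looks like the first case, where the stream reaching the DMA still carries tlast on every sample. That is only what the model suggests, not a confirmed diagnosis.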

I thought that asserting tlast at the correct burst size would be enough, but it seems I am missing something. How can I solve this problem?