HLS IP repeated output read gives different output for same input

Hello Everyone,

I have run into a problem using VDMA with a custom IP developed in Vitis HLS. Every time I read the output of the IP, it returns a different, out-of-order permutation of my input (the IP is configured as a pass-through), even though the input does not change. I attached a notebook that shows the output of two consecutive reads of the output buffer to illustrate the issue.

I tried to provide every detail I could think of below, and I hope someone can give me a pointer on where to go from here.

Any help would be greatly appreciated.

Thank you,
Mario

Setup details

  • Windows 10 Pro host machine
  • Ubuntu 18.04.6 VM set up using a modified Vagrant file from the PYNQ repo to assign more disk space and RAM
  • Xilinx tools version 2020.2
  • PYNQ 2.7.0 image
  • Custom ZU+ SoC

Task explanation

I am working on a preprocessing IP in Vitis HLS for channel-wise manipulation of RGB images for ML preprocessing. The IP takes an input stream (AXI4 Stream) of raw 24-bit pixels and writes the processed, packed pixels through an AXI4 Master Interface. Each output pixel packs one 32-bit float per channel (96 bits of output data). I am using an AXI4 Stream input because other IPs doing basic image manipulations will be connected upstream, and an AXI4 Master Interface for the output because, with my limited HLS knowledge, it seemed to be the easier option. I am more than happy to take suggestions on this front.

The final goal is to turn the design into a Vitis platform for DPU integration and use the IP to accelerate preprocessing for ML inference. At this point I am just looking to verify the correct operation of the IP in a basic hardware design.

Below I tried to collect anything that could help with tracking down the issue. As you can see from the IP name and the attached notebook, I have gone through a few iterations trying to resolve the issue. The IP works fine in simulation, but when I run the attached notebook and read the content of the allocated PYNQ buffer, it is a mess and changes between reads, even though I am not passing it a new input. As a bonus, the entries corresponding to the B colour channel are all zeros, which makes no sense to me.
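To make the "changes between reads" symptom easy to reproduce, the check can be scripted. The following is only a minimal sketch: `reads_are_stable` and `read_fn` are hypothetical names, and `read_fn` stands for any callable that returns the current contents of the output buffer as an array.

```python
import numpy as np

def reads_are_stable(read_fn, n_reads=3):
    """Return True if n_reads consecutive snapshots of a buffer are identical.

    read_fn is a hypothetical callable that returns the current buffer
    contents; each result is copied so later reads cannot alias earlier ones.
    """
    first = np.array(read_fn(), copy=True)
    for _ in range(n_reads - 1):
        if not np.array_equal(first, np.array(read_fn(), copy=True)):
            return False
    return True
```

On hardware, `read_fn` would wrap whatever call returns the output buffer after syncing it from the device.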

System ILA debug results

I used System ILA to debug the IP and found that the input data is streamed to the IP as expected, but the B colour channel is missing at the output (all zeros). The data on WDATA of the AXI4 Master Interface is mostly in sequence, but some pixel channel values repeat at random.

HLS code for IP

Note: The HLS IP multiplies by the standard-deviation parameters instead of dividing, as multiplication is quicker in hardware. The parameters passed in are therefore the reciprocals of the standard deviations, computed in software.

#include <hls_stream.h>
#include <ap_int.h>
#include <ap_axi_sdata.h>

// Input stream definition
#define DATA_WIDTH 24
typedef ap_axiu<DATA_WIDTH,1,1,1> interface_t;
typedef hls::stream<interface_t> stream_t;

// Output pointer definition
#define OUTPUT_PTR_WIDTH 96
typedef ap_uint<OUTPUT_PTR_WIDTH> pointer_t;

typedef ap_uint<DATA_WIDTH> data_t;
typedef ap_uint<8> U8;
typedef ap_uint<32> U32;
typedef ap_uint<96> packed_fp_t;

extern "C" {
    void preprocess3_accel(stream_t& stream_in,
                          pointer_t* img_out,
                          unsigned int rows,
                          unsigned int cols,
                          float r_mean,
                          float g_mean,
                          float b_mean,
                          float r_std,
                          float g_std,
                          float b_std) {

        #pragma HLS INTERFACE axis register both port=stream_in
        #pragma HLS INTERFACE m_axi     port=img_out  offset=slave bundle=gmem1
        #pragma HLS INTERFACE s_axilite port=rows
        #pragma HLS INTERFACE s_axilite port=cols
        #pragma HLS INTERFACE s_axilite port=r_mean
        #pragma HLS INTERFACE s_axilite port=g_mean
        #pragma HLS INTERFACE s_axilite port=b_mean
        #pragma HLS INTERFACE s_axilite port=r_std
        #pragma HLS INTERFACE s_axilite port=g_std
        #pragma HLS INTERFACE s_axilite port=b_std
        #pragma HLS INTERFACE s_axilite port=return


        for (unsigned int idx = 0; idx < rows * cols; idx++) {
            #pragma HLS PIPELINE

            data_t pixel = stream_in.read().data;

            // Unpack pixel
            U8 pR = pixel.range(7, 0);
            U8 pG = pixel.range(15, 8);
            U8 pB = pixel.range(23, 16);

            // Preprocess channels
            float fR = (pR.to_float() - r_mean) * r_std;
            float fG = (pG.to_float() - g_mean) * g_std;
            float fB = (pB.to_float() - b_mean) * b_std;

            // Repack pixel
            packed_fp_t packed_pixel;
            packed_pixel.range(31, 0) = reinterpret_cast<U32&>(fR);
            packed_pixel.range(63, 32) = reinterpret_cast<U32&>(fG);
            packed_pixel.range(95, 64) = reinterpret_cast<U32&>(fB);

            img_out[idx] = packed_pixel;
        }

        return;
    }
}
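
For checking hardware output, the per-channel arithmetic of the IP can be mirrored in software. This is a rough NumPy sketch, not part of the original design: `unpack_pixels` and `preprocess_reference` are hypothetical helper names. It follows the bit layout used in the HLS code (R in bits 7:0, G in 15:8, B in 23:16) and takes the true standard deviations, forming the reciprocals itself.

```python
import numpy as np

def unpack_pixels(words):
    """Split 24-bit packed pixels into an (..., 3) uint8 array of R, G, B.

    Layout mirrors the HLS code: R = bits 7:0, G = bits 15:8, B = bits 23:16.
    """
    words = np.asarray(words, dtype=np.uint32)
    r = words & 0xFF
    g = (words >> 8) & 0xFF
    b = (words >> 16) & 0xFF
    return np.stack([r, g, b], axis=-1).astype(np.uint8)

def preprocess_reference(img_u8, means, stds):
    """Software model of the IP: (pixel - mean) * (1 / std) per channel."""
    img = np.asarray(img_u8, dtype=np.uint8).astype(np.float32)
    means = np.asarray(means, dtype=np.float32)
    # The IP multiplies by the reciprocal rather than dividing.
    inv_stds = (1.0 / np.asarray(stds, dtype=np.float32)).astype(np.float32)
    return (img - means) * inv_stds
```

With zero means and unit standard deviations the model is a pass-through, which matches the configuration used in the test bench.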

Test bench

#include <cstdio>    // printf
#include <cstdlib>   // malloc, free
#include <iostream>
#include <hls_stream.h>
#include <ap_int.h>
#include <ap_axi_sdata.h>

// Input stream definition
#define DATA_WIDTH 24
typedef ap_axiu<DATA_WIDTH,1,1,1> interface_t;
typedef hls::stream<interface_t> stream_t;

// Output pointer definition
#define OUTPUT_PTR_WIDTH 96
typedef ap_uint<OUTPUT_PTR_WIDTH> pointer_t;

typedef ap_uint<DATA_WIDTH> data_t;
typedef ap_uint<32> U32;
typedef ap_uint<96> packed_fp_t;

extern "C" {
    void preprocess3_accel(stream_t& stream_in,
                          pointer_t* img_out,
                          unsigned int rows,
                          unsigned int cols,
                          float r_mean,
                          float g_mean,
                          float b_mean,
                          float r_std,
                          float g_std,
                          float b_std);
}

int main() {
	// IP configuration
	unsigned int rows = 5;
	unsigned int cols = 5;
	float r_mean = 0.0f;
	float g_mean = 0.0f;
	float b_mean = 0.0f;
	float r_std = 1.0f;
	float g_std = 1.0f;
	float b_std = 1.0f;

	// Interfaces for IP
	stream_t stream_in("stream_in");
	pointer_t* img_out = (pointer_t*)malloc((rows * cols + 1) * sizeof(pointer_t));

	data_t packed_data;
	for (int idx = 0; idx < rows*cols; idx++) {

		// Pack data for input stream
		packed_data.range(7, 0) = idx;
		packed_data.range(15, 8) = idx + 1;
		packed_data.range(23, 16) = idx + 2;

		interface_t data;
		data.data = packed_data;
		data.user = idx == 0;
		data.id = 0;
		data.dest = 0;
		data.last = idx == ( rows*cols - 1 );
		data.strb = 0x7; // all three data bytes are valid
		data.keep = 0x7;

		stream_in.write(data);
	}

	preprocess3_accel(stream_in, img_out, rows, cols, r_mean, g_mean, b_mean, r_std, g_std, b_std);

	for (int idx = 0; idx < rows*cols; idx++) {
		packed_fp_t packed_pixel = img_out[idx];
		U32 pR = packed_pixel.range(31, 0);
		U32 pG = packed_pixel.range(63, 32);
		U32 pB = packed_pixel.range(95, 64);

		float fR = reinterpret_cast<float&>(pR);
		float fG = reinterpret_cast<float&>(pG);
		float fB = reinterpret_cast<float&>(pB);

		printf("%12.8f %12.8f %12.8f\n", fR, fG, fB);
	}

	free(img_out);
	return 0;
}

Block design

VDMA configuration

Notebook

preprocess_test.ipynb (298.1 KB)

So after quite a bit of headache I found the source of evil.

  1. The VDMA seems to be unhappy with input dimensions that are not multiples of 4. This caused repeated reads to return different outputs. I don’t know why this is the case and would be happy to gain some insight on this matter.
  2. I had a mistake in my config method: I was writing the b_std parameter to offset 88 instead of 84. This resulted in all zeros for the blue channel of the output. To avoid such mistakes in the future I think I will just stick with using the register map.
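
If an image does not already have dimensions that are multiples of 4, one possible workaround is to pad it before streaming. The sketch below is only an illustration of that idea: `round_up` and `pad_image` are hypothetical helpers, and zero padding on the bottom and right edges is an assumption, not something required by the VDMA. Whether the actual constraint applies to rows, columns, or bytes per line was not established here.

```python
import numpy as np

def round_up(n, multiple=4):
    """Round n up to the nearest multiple (n itself if already aligned)."""
    return ((n + multiple - 1) // multiple) * multiple

def pad_image(img, multiple=4):
    """Zero-pad the bottom and right edges so rows and cols are aligned."""
    rows, cols = img.shape[:2]
    pad_r = round_up(rows, multiple) - rows
    pad_c = round_up(cols, multiple) - cols
    # Only the first two axes are padded; channels are left untouched.
    pad_width = [(0, pad_r), (0, pad_c)] + [(0, 0)] * (img.ndim - 2)
    return np.pad(img, pad_width, mode="constant")
```

For example, a 5x5 RGB image would become 8x8, and the original region can be cropped back out of the result with `out[:5, :5]`.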

Here is the new “driver” for the IP with some safety measures.

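from struct import pack, unpack
from typing import List

import numpy as np
from pynq import DefaultIP, allocate

def pack_float(f):
    # Reinterpret the bits of a 32-bit float as an unsigned integer for register writes
    return unpack('I', pack('f', f))[0]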

class Preprocess(DefaultIP):
    bindto = ['xilinx.com:hls:preprocess_accel:1.0',
              'xilinx.com:hls:preprocess2_accel:1.0',
              'xilinx.com:hls:preprocess3_accel:1.0']
    def __init__(self, description):
        super().__init__(description)
        
    def config(self, rows:int, cols:int, means:List[float], stds:List[float], buffer_dtype=np.float32):
        assert rows % 4 == 0, f'Rows for IP must be multiples of 4! Rows specified: {rows}'
        assert cols % 4 == 0, f'Columns for IP must be multiples of 4! Cols specified: {cols}'
        
        r_mean, g_mean, b_mean = means
        r_std, g_std, b_std = stds
        
        # Configure IP for processing
        self.register_map.rows = rows
        self.register_map.cols = cols
        self.register_map.r_mean = pack_float(r_mean)
        self.register_map.g_mean = pack_float(g_mean)
        self.register_map.b_mean = pack_float(b_mean)
        self.register_map.r_std = pack_float(1/r_std)
        self.register_map.g_std = pack_float(1/g_std)
        self.register_map.b_std = pack_float(1/b_std)
       
        # Allocate buffer to retrieve output
        self.output_buffer = allocate(shape=(rows, cols, 4), dtype=buffer_dtype)
        self.register_map.img_out_1 = self.output_buffer.device_address
    
    def results(self):
        self.output_buffer.sync_from_device()
        return self.output_buffer
    
    def start(self):
        self.write(0, 0x81)
        
    def stop(self):
        self.write(0, 0)
        self.output_buffer.freebuffer()
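
The float parameters are written to the registers as raw 32-bit patterns, which is what pack_float does. A small round-trip check can confirm the encoding before touching the hardware. This is a sketch only: `unpack_float` is a hypothetical inverse added for illustration, and `pack_float` is repeated so the snippet runs on its own.

```python
from struct import pack, unpack

def pack_float(f):
    """Reinterpret a float32 as the unsigned 32-bit integer with the same bits."""
    return unpack('I', pack('f', f))[0]

def unpack_float(u):
    """Hypothetical inverse of pack_float, for checking values read back."""
    return unpack('f', pack('I', u))[0]

# IEEE 754 single precision: 1.0 is 0x3F800000, 0.5 is 0x3F000000
print(hex(pack_float(1.0)))
print(unpack_float(pack_float(0.25)))
```

Reading a parameter back through the register map and passing it through `unpack_float` gives a quick way to confirm that each value landed in the intended register.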

I hope this helps someone in the future.

Mario
