Hello Everyone,
I have run into an issue using VDMA with a custom IP developed in Vitis HLS. Every time I read the output of the IP, it returns a different, out-of-order permutation of my input (in pass-through operation), even though the input has not changed. I attached a notebook that includes the output of two consecutive reads of the output buffer, showcasing the issue.
I have tried to provide every detail I could think of below, and I hope someone can give me a pointer on where to go from here.
Any help would be greatly appreciated.
Thank you,
Mario
Setup details
- Windows 10 Pro host machine
- Ubuntu 18.04.6 VM set up using a modified Vagrant file from the PYNQ repo (to assign more disk space and RAM)
- Xilinx tools version 2020.2
- PYNQ 2.7.0 image
- Custom ZU+ SoC
Task explanation
I am working on a preprocessing IP in Vitis HLS for channel-wise manipulation of RGB images for ML preprocessing. The IP takes an AXI4 Stream input carrying the raw 24-bit pixels and writes the processed packed pixels out through an AXI4 Master Interface. Each output pixel packs one 32-bit float per channel (96 bits of output data). I am using an AXI4 Stream input because other IPs doing basic image manipulation will be connected upstream, and an AXI4 Master Interface for the output because, with my limited HLS knowledge, it seemed the easier option, but I am more than happy to take suggestions on this front.
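For reference, the per-pixel behaviour described above can be sketched as a plain C++ host-side model with no HLS headers. This is only an illustration: `PackedPixel`, `float_bits` and `preprocess_pixel` are hypothetical names that do not appear in the IP, and the three 32-bit words stand in for bits 31:0, 63:32 and 95:64 of the 96-bit output word.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical host-side model of one output pixel (not part of the IP):
// word 0 = bits 31:0 (R), word 1 = bits 63:32 (G), word 2 = bits 95:64 (B).
using PackedPixel = std::array<std::uint32_t, 3>;

// Return the raw IEEE-754 bit pattern of a float without type-punning casts.
static std::uint32_t float_bits(float f) {
    std::uint32_t u = 0;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Unpack a 24-bit pixel (R in bits 7:0, G in 15:8, B in 23:16),
// apply (channel - mean) * scale, and repack as three float bit patterns.
PackedPixel preprocess_pixel(std::uint32_t pixel,
                             float r_mean, float g_mean, float b_mean,
                             float r_scale, float g_scale, float b_scale) {
    const float r = static_cast<float>(pixel & 0xFFu);
    const float g = static_cast<float>((pixel >> 8) & 0xFFu);
    const float b = static_cast<float>((pixel >> 16) & 0xFFu);

    PackedPixel out{};
    out[0] = float_bits((r - r_mean) * r_scale);
    out[1] = float_bits((g - g_mean) * g_scale);
    out[2] = float_bits((b - b_mean) * b_scale);
    return out;
}
```

In pass-through operation (means of 0 and scales of 1), a pixel of `0x030201` would be expected to produce the float bit patterns for 1.0, 2.0 and 3.0 in that order.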
The final goal is to turn the design into a Vitis platform for DPU integration and use the IP to accelerate preprocessing for ML inference. At this point I am just looking to verify the correct operation of the IP in a basic hardware design.
Below I have tried to collect anything that could help with tracking down the issue. As you can see from the IP name and the attached notebook, I have gone through a few iterations trying to resolve it. The IP works fine in simulation, but when I run the attached notebook and read the contents of the allocated PYNQ buffer, it is a mess and changes between reads, even though I am not passing in a new input. Bonus fun: the entries corresponding to the B colour channel are all zeros, which makes no sense to me.
System ILA debug results
I used System ILA to debug the IP and found that the input data is streamed to the IP as expected, but the B colour channel is missing at the output (all zeros). The output on the WDATA of the AXI4 Master Interface is for the most part in sequence, but some pixel channel values randomly repeat.
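One way to quantify "mostly in sequence, with some repeats" is to compare a captured list of 32-bit channel words against the expected sequence. The sketch below is a hypothetical host-side helper, not something from the notebook or the ILA tooling, and it assumes the captured words have already been exported into a vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical summary of how an observed word sequence differs from the expected one.
struct SequenceReport {
    std::size_t mismatches = 0;        // positions where observed != expected
    std::size_t unexpected_repeats = 0; // observed repeats the previous word, expected does not
    long first_mismatch = -1;          // index of the first mismatch, or -1 if none
};

// Compare the two sequences over their common length.
SequenceReport check_sequence(const std::vector<std::uint32_t>& expected,
                              const std::vector<std::uint32_t>& observed) {
    SequenceReport report;
    const std::size_t n = expected.size() < observed.size() ? expected.size()
                                                            : observed.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (observed[i] != expected[i]) {
            ++report.mismatches;
            if (report.first_mismatch < 0) {
                report.first_mismatch = static_cast<long>(i);
            }
        }
        // A repeat only counts if the expected data does not repeat at this position.
        if (i > 0 && observed[i] == observed[i - 1] && expected[i] != expected[i - 1]) {
            ++report.unexpected_repeats;
        }
    }
    return report;
}
```

Running this on two consecutive reads of the same buffer, each against the same expected sequence, would also show whether the mismatch positions move between reads.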
HLS code for IP
Note: The HLS IP multiplies by the r_std, g_std and b_std parameters instead of dividing, as multiplication is quicker in hardware. The values passed in will be the reciprocals of the standard deviations, computed in software.
#include <hls_stream.h>
#include <ap_int.h>
#include <ap_axi_sdata.h>

// Input stream definition
#define DATA_WIDTH 24
typedef ap_axiu<DATA_WIDTH,1,1,1> interface_t;
typedef hls::stream<interface_t> stream_t;

// Output pointer definition
#define OUTPUT_PTR_WIDTH 96
typedef ap_uint<OUTPUT_PTR_WIDTH> pointer_t;

typedef ap_uint<DATA_WIDTH> data_t;
typedef ap_uint<8> U8;
typedef ap_uint<32> U32;
typedef ap_uint<96> packed_fp_t;

extern "C" {
void preprocess3_accel(stream_t& stream_in,
                       pointer_t* img_out,
                       unsigned int rows,
                       unsigned int cols,
                       float r_mean,
                       float g_mean,
                       float b_mean,
                       float r_std,
                       float g_std,
                       float b_std) {
    #pragma HLS INTERFACE axis register both port=stream_in
    #pragma HLS INTERFACE m_axi port=img_out offset=slave bundle=gmem1
    #pragma HLS INTERFACE s_axilite port=rows
    #pragma HLS INTERFACE s_axilite port=cols
    #pragma HLS INTERFACE s_axilite port=r_mean
    #pragma HLS INTERFACE s_axilite port=g_mean
    #pragma HLS INTERFACE s_axilite port=b_mean
    #pragma HLS INTERFACE s_axilite port=r_std
    #pragma HLS INTERFACE s_axilite port=g_std
    #pragma HLS INTERFACE s_axilite port=b_std
    #pragma HLS INTERFACE s_axilite port=return

    for (unsigned int idx = 0; idx < rows * cols; idx++) {
        #pragma HLS PIPELINE
        data_t pixel = stream_in.read().data;

        // Unpack pixel
        U8 pR = pixel.range(7, 0);
        U8 pG = pixel.range(15, 8);
        U8 pB = pixel.range(23, 16);

        // Preprocess channels
        float fR = (pR.to_float() - r_mean) * r_std;
        float fG = (pG.to_float() - g_mean) * g_std;
        float fB = (pB.to_float() - b_mean) * b_std;

        // Repack pixel
        packed_fp_t packed_pixel;
        packed_pixel.range(31, 0) = reinterpret_cast<U32&>(fR);
        packed_pixel.range(63, 32) = reinterpret_cast<U32&>(fG);
        packed_pixel.range(95, 64) = reinterpret_cast<U32&>(fB);

        img_out[idx] = packed_pixel;
    }
    return;
}
}
Test bench
#include <cstdio>   // printf
#include <cstdlib>  // malloc, free
#include <iostream>
#include <hls_stream.h>
#include <ap_int.h>
#include <ap_axi_sdata.h>

// Input stream definition
#define DATA_WIDTH 24
typedef ap_axiu<DATA_WIDTH,1,1,1> interface_t;
typedef hls::stream<interface_t> stream_t;

// Output pointer definition
#define OUTPUT_PTR_WIDTH 96
typedef ap_uint<OUTPUT_PTR_WIDTH> pointer_t;

typedef ap_uint<DATA_WIDTH> data_t;
typedef ap_uint<32> U32;
typedef ap_uint<96> packed_fp_t;

extern "C" {
void preprocess3_accel(stream_t& stream_in,
                       pointer_t* img_out,
                       unsigned int rows,
                       unsigned int cols,
                       float r_mean,
                       float g_mean,
                       float b_mean,
                       float r_std,
                       float g_std,
                       float b_std);
}

int main() {
    // IP configuration
    unsigned int rows = 5;
    unsigned int cols = 5;
    float r_mean = 0.0f;
    float g_mean = 0.0f;
    float b_mean = 0.0f;
    float r_std = 1.0f;
    float g_std = 1.0f;
    float b_std = 1.0f;

    // Interfaces for IP
    stream_t stream_in("stream_in");
    pointer_t* img_out = (pointer_t*)malloc((rows * cols + 1) * sizeof(pointer_t));

    data_t packed_data;
    for (unsigned int idx = 0; idx < rows*cols; idx++) {
        // Pack data for input stream
        packed_data.range(7, 0) = idx;
        packed_data.range(15, 8) = idx + 1;
        packed_data.range(23, 16) = idx + 2;

        interface_t data;
        data.data = packed_data;
        data.user = idx == 0;
        data.id = 0;
        data.dest = 0;
        data.last = idx == ( rows*cols - 1 );
        data.strb = 1;
        data.keep = 1;
        stream_in.write(data);
    }

    preprocess3_accel(stream_in, img_out, rows, cols, r_mean, g_mean, b_mean, r_std, g_std, b_std);

    // Unpack each 96-bit output word and print the three channel floats
    for (unsigned int idx = 0; idx < rows*cols; idx++) {
        packed_fp_t packed_pixel = img_out[idx];
        U32 pR = packed_pixel.range(31, 0);
        U32 pG = packed_pixel.range(63, 32);
        U32 pB = packed_pixel.range(95, 64);
        float fR = reinterpret_cast<float&>(pR);
        float fG = reinterpret_cast<float&>(pG);
        float fB = reinterpret_cast<float&>(pB);
        printf("%12.8f %12.8f %12.8f\n", fR, fG, fB);
    }

    free(img_out);
    return 0;
}
Block design
VDMA configuration
Notebook
preprocess_test.ipynb (298.1 KB)