PYNQ: PYTHON PRODUCTIVITY

2D Convolution with Line Buffer from HLS Tiny Tutorials

Hi,

I have been studying how to accelerate image processing applications using the FPGA on the Zynq 7020, namely on the PYNQ-Z2 board. I’ve successfully implemented functions from the Vitis Vision library using a memory-mapped interface.

However, I would like to use AXI Stream, and I do not understand how to convert those functions to that interface in order to use the AXI DMA.

I then tested the 2D Convolution function from the HLS Tiny Tutorials, which is implemented in streaming mode. After generating the IP core, I moved to Vivado and implemented a design with the Zynq processor, the AXI DMA, and the convolution IP core. However, when I validated the design I noticed that the IP does not have the TLAST sideband signal, which the DMA requires.

I would like to know the correct way to use this streaming IP core without the DMA, or what I should change in the HLS .cpp file to generate the TLAST signal. What would be the best practice for incorporating this IP core into my design?

Thank you,

Pedro


@pedropinheiro2 can you share the link of the tutorial?

Hi @marioruiz ,

The datatype in that case is int; you will have to modify the code to use ap_axis, something like in the example below.

Or include the TLAST signal externally.
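To make the type change concrete, here is a minimal host-side sketch. It is an assumption-laden stand-in: `axis_t` hand-rolls the `data`/`last` fields of `ap_axiu<8,0,0,0>` so it compiles without the Xilinx headers, `std::queue` models `hls::stream<>`, and the `passthrough` helper is hypothetical; it only illustrates asserting TLAST on the final beat of a frame.

```cpp
#include <cstdint>
#include <queue>

// Stand-in for ap_axiu<8,0,0,0> from ap_axi_sdata.h: 8-bit TDATA plus the
// TLAST sideband. (The real type also carries TKEEP/TSTRB side channels.)
struct axis_t {
    uint8_t data;  // pixel payload (TDATA)
    bool    last;  // end-of-frame marker (TLAST)
};

// Hypothetical helper: copy width*height beats from src to dst, asserting
// TLAST only on the final beat. std::queue models hls::stream here.
inline void passthrough(std::queue<axis_t>& src, std::queue<axis_t>& dst,
                        int width, int height) {
    const int total = width * height;
    for (int i = 0; i < total; i++) {
        axis_t beat = src.front();
        src.pop();
        beat.last = (i == total - 1);  // TLAST high on the last pixel only
        dst.push(beat);
    }
}
```

In the real IP the same idea applies: the stream element type changes from `int` to the `ap_axiu` struct, and the producer sets `.last` to 1 when it writes the final pixel of the frame.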

Thank you @marioruiz

Then I will also have to change the function code to set last to 1 at the end of the function, right?

Yes, you need to make sure tlast is asserted on the last item.

I have been trying those changes. I changed the variable type inside hls::stream from int to ap_axiu<8,0,0,0> and made the changes to the code, as you can see below. The top function is filter11x11_strm.

convolution.cpp file

#include "convolution.h"

template<typename T, int K>
static void convolution_orig(
        int width, int height,
        const T *src, T *dst,
        const T *hcoeff, const T *vcoeff)
{
    // Convolution kernel size
    const int conv_size = K;
    // Half the convolution window - rounded down - i.e. the border width
    const int border_width = int(conv_size / 2);
#ifndef __SYNTHESIS__
    T * const local = new T[MAX_IMG_ROWS*MAX_IMG_COLS];
#else // Static storage allocation for HLS, dynamic otherwise
    T local[MAX_IMG_ROWS*MAX_IMG_COLS];
#endif

    // Clear local frame buffer
    Clear_Local:for(int i = 0; i < height * width; i++){
        local[i]=0;
    }
    // Horizontal convolution pass - makes O(K*K) reads from input image
    // per output pixel
    HconvH:for(int col = 0; col < height; col++){
        HconvW:for(int row = border_width; row < width - border_width; row++){
            int pixel = col * width + row;
            Hconv:for(int i = - border_width; i <= border_width; i++){
                local[pixel] += src[pixel + i] * hcoeff[i + border_width];
            }
        }
    }
    // Clear dst storage
    Clear_Dst:for(int i = 0; i < height * width; i++){
        dst[i]=0;
    }
    // Vertical convolution pass - makes O(K*K) reads from frame buffer -
    // resulting in only interior, i.e.
    // (border_width < col < height - border_width && border_width < row < width - border_width), pixels being valid
    VconvH:for(int col = border_width; col < height - border_width; col++){
        VconvW:for(int row = 0; row < width; row++){
            int pixel = col * width + row;
            Vconv:for(int i = - border_width; i <= border_width; i++){
                int offset = i * width;
                dst[pixel] += local[pixel + offset] * vcoeff[i + border_width];
            }
        }
    }
    // Populate borders by replicating adjacent valid pixels - uses a separate
    // set of loop nest for each vertical border region - top border; left/right
    // of valid vertical range; bottom. This is problematic for performance...
    int border_width_offset = border_width * width;
    int border_height_offset = (height - border_width - 1) * width;
    Top_Border:for(int col = 0; col < border_width; col++){
        int offset = col * width;
        Top_Left:for(int row = 0; row < border_width; row++){
            int pixel = offset + row;
            dst[pixel] = dst[border_width_offset + border_width];
        }
        Top_Row:for(int row = border_width; row < width - border_width; row++){
            int pixel = offset + row;
            dst[pixel] = dst[border_width_offset + row];
        }
        Top_Right:for(int row = width - border_width; row < width; row++){
            int pixel = offset + row;
            dst[pixel] = dst[border_width_offset + width - border_width - 1];
        }
    }
    Side_Border:for(int col = border_width; col < height - border_width; col++){
        int offset = col * width;
        Left_Col:for(int row = 0; row < border_width; row++){
            int pixel = offset + row;
            dst[pixel] = dst[offset + border_width];
        }
        Right_Col:for(int row = width - border_width; row < width; row++){
            int pixel = offset + row;
            dst[pixel] = dst[offset + width - border_width - 1];
        }
    }
    Bottom_Border:for(int col = height - border_width; col < height; col++){
        int offset = col * width;
        Bottom_Left:for(int row = 0; row < border_width; row++){
            int pixel = offset + row;
            dst[pixel] = dst[border_height_offset + border_width];
        }
        Bottom_Row:for(int row = border_width; row < width - border_width; row++){
            int pixel = offset + row;
            dst[pixel] = dst[border_height_offset + row];
        }
        Bottom_Right:for(int row = width - border_width; row < width; row++){
            int pixel = offset + row;
            dst[pixel] = dst[border_height_offset + width - border_width - 1];
        }
    }
}

template<typename T, int K>
static void convolution_strm(int width, int height,
        hls::stream<axis> &src, hls::stream<axis> &dst,
        const T *hcoeff, const T *vcoeff)
{
    const int border_width = int(K / 2);
    // Horizontal pixel window (cache)
    T hwin[K];
    hls::stream<T> hconv("hconv");
    // Vertical pixel window (cache)
//    T vwin[K];
    // Line-buffers allowing full pixel reuse in vertical pass
    static T linebuf[K - 1][MAX_IMG_COLS];
    hls::stream<T> vconv("vconv");
    const int vconv_xlim = width - (K - 1);
    // Line-buffer for border pixel replication
    T borderbuf[MAX_IMG_COLS - (K - 1)];
#pragma HLS ARRAY_PARTITION variable=linebuf dim=1 complete
#pragma HLS INLINE // Into a DATAFLOW region
    // These assertions let HLS know the upper bounds of loops
    assert(height < MAX_IMG_ROWS);
    assert(width < MAX_IMG_COLS);
    assert(vconv_xlim < MAX_IMG_COLS - (K - 1));
    // Horizontal convolution - consumes each pixel in source image
    // exactly once, reusing values cached in hwin[], producing a stream
    // of pixels required for the following vertical convolution
    HConvH:for(int col = 0; col < height; col++) {
        HConvW:for(int row = 0; row < width; row++) {
#pragma HLS PIPELINE
            axis in_val_sidebands = src.read();
            T in_val = in_val_sidebands.data;
            // Reset pixel value on-the-fly - eliminates an O(height*width) loop
            T out_val = 0;
            HConv:for(int i = 0; i < K; i++) {
                hwin[i] = i < K - 1 ? hwin[i + 1] : in_val;
                out_val += hwin[i] * hcoeff[i];
            }
            if (row >= K - 1)
                hconv << out_val;
        }
    }
    // Vertical convolution - consumes stream generated by the horizontal
    // pass; generates a stream of only the pixels in the valid interior
    // region, i.e. (height - (K - 1)) * (width - (K - 1)) values
    VConvH:for(int col = 0; col < height; col++) {
        VConvW:for(int row = 0; row < vconv_xlim; row++) {
#pragma HLS DEPENDENCE variable=linebuf inter false
#pragma HLS PIPELINE
            T in_val = hconv.read();
            // Reset pixel value on-the-fly - eliminates an O(height*width) loop
            T out_val = 0;
            VConv:for(int i = 0; i < K; i++) {
                T vwin_val = i < K - 1 ? linebuf[i][row] : in_val;
                out_val += vwin_val * vcoeff[i];
                if (i > 0)
                    linebuf[i - 1][row] = vwin_val;
            }
            if (col >= K - 1)
                vconv << out_val;
        }
    }
    //Handle border by replicating the exact same pixels as orig, but in
    // a single loop taking the minimum (height*width) number of cycles
    Border:for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            T pix_in, l_edge_pix, r_edge_pix, pix_out;
#pragma HLS PIPELINE
            if (i == 0 || (i > border_width && i < height - border_width)) {
                // read a pixel out of the input stream and cache it for
                // immediate use and later replication purposes
                if (j < width - (K - 1)) {
                    pix_in = vconv.read();
                    borderbuf[j] = pix_in;
                }
                if (j == 0) {
                    l_edge_pix = pix_in;
                }
                if (j == width - K) {
                    r_edge_pix = pix_in;
                }
            }
            // Select output value from the appropriate cache resource
            if (j <= border_width) {
                pix_out = l_edge_pix;
            } else if (j >= width - border_width - 1) {
                pix_out = r_edge_pix;
            } else {
                pix_out = borderbuf[j - border_width];
            }
            axis pix_out_sidebands;
            pix_out_sidebands.data = pix_out;
            pix_out_sidebands.last = in_val_sidebands.last;
            dst << pix_out_sidebands;
        }
    }
}

void filter11x11_orig(int width, int height, const data_t *src, data_t *dst)
{
#pragma HLS INTERFACE m_axi port=src depth=32400 // TEST_IMG_SIZE
#pragma HLS INTERFACE m_axi port=dst depth=32400 // TEST_IMG_SIZE
#pragma HLS INTERFACE s_axilite port=width  bundle=hls_ctrl 
#pragma HLS INTERFACE s_axilite port=height bundle=hls_ctrl 
#pragma HLS INTERFACE s_axilite port=return bundle=hls_ctrl 

#pragma HLS INLINE
#pragma HLS DATAFLOW

  const data_t filt11_coeff[11] = {
    36, 111, 266, 498, 724, 821, 724, 498, 266, 111, 36
  };

  convolution_orig<data_t, 11>(width, height,
			       src, dst,
			       filt11_coeff, filt11_coeff);
}

void filter11x11_strm(int width, int height,
		      hls::stream<axis> &src, hls::stream<axis> &dst)
{
#pragma HLS INTERFACE axis port=src
#pragma HLS INTERFACE axis port=dst
#pragma HLS INTERFACE s_axilite port=width  bundle=hls_ctrl
#pragma HLS INTERFACE s_axilite port=height bundle=hls_ctrl
#pragma HLS INTERFACE s_axilite port=return bundle=hls_ctrl

#pragma HLS DATAFLOW
#pragma HLS INLINE // bring loops in sub-functions to this DATAFLOW region

  const data_t filt11_coeff[11] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1
  };

  convolution_strm<data_t, 11>(width, height,
			       src, dst,
			       filt11_coeff, filt11_coeff);
}

convolution.h file

#ifndef CONVOLUTION_H_
#define CONVOLUTION_H_

#include <assert.h>
#include <stdint.h>
#include <hls_stream.h>

#include "ap_axi_sdata.h"
#include "ap_int.h"

#define MAX_IMG_ROWS 2496
#define MAX_IMG_COLS 3360

#define TEST_IMG_ROWS 135
#define TEST_IMG_COLS 240
#define TEST_IMG_SIZE (TEST_IMG_ROWS * TEST_IMG_COLS)

typedef uint8_t data_t;
typedef ap_axiu<8,0,0,0> axis;

// External function prototypes
void filter11x11_orig(
        int w, int h,
        const data_t *src_image, data_t *dst_image);

void filter11x11_strm(
        int w, int h,
        hls::stream<axis> &src_image, hls::stream<axis> &dst_image);

#endif // CONVOLUTION_H_ not defined

I can synthesize the IP and generate the design with the DMA in Vivado. The design validates successfully; however, the DMA gets stuck on the send channel and I cannot see why.

Did you simulate this? Is tlast being asserted on the last item?

You assign last, but in_val_sidebands is not defined in the Border loop context, so this is always going to be 0.

You may better do something like:

pix_out_sidebands.last = ((i == height-1) && (j == width)) ? 1 : 0;

The snippet above is just a suggestion. You need to simulate your IP and verify its functionality. If it does not work in simulation, it won’t work in hardware either.

Mario

Thank you! I understand what you say about in_val_sidebands, but on the last iteration of src.read(), shouldn’t the tlast in in_val_sidebands become 1?

I have tried your suggestion, which makes sense, but unfortunately it did not resolve the problem. I believe the problem is elsewhere, since the DMA error appears at the .wait() on the send channel and not on the receive channel.

When you say simulation, do you refer to Vivado simulation? I have done C simulation and co-simulation in Vitis HLS and all tests passed. But I have not done simulations in the Vivado design. Could you suggest some resources/material and/or a procedure for doing simulation in Vivado?

Thank you so much for the support.
Pedro

Yes, but in_val_sidebands only exists in the context of the HConvW loop. It is undefined in the context of the Border loop.

I am referring to HLS simulation. Did you verify that last is generated?

Are you setting width and height in the notebook?

Yes, tlast is being generated.


HLS simulation and co-simulation both pass.

I am setting width and height this way,

I’ve noticed that in your suggestion, maybe it should be j==width-1, right?

By the way, I have tested it that way and the same error occurs. The DMA remains stuck on the send channel.

Yes, it should be

pix_out_sidebands.last = ((i == height-1) && (j == width-1)) ? 1 : 0;
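As a host-side sanity check (a hypothetical helper, not part of the IP), the corrected condition can be evaluated over a whole frame to confirm it fires exactly once, on the very last pixel:

```cpp
// Evaluate the corrected TLAST condition over a height x width frame and
// count how many beats would carry TLAST. With (j == width-1) the count is
// exactly one per frame; with the earlier (j == width) it would be zero,
// because j never reaches width inside the loop.
inline int count_tlast_beats(int width, int height) {
    int count = 0;
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            if ((i == height - 1) && (j == width - 1)) {
                count++;
            }
        }
    }
    return count;
}
```

Exactly one TLAST per frame is what the DMA S2MM channel needs to terminate the receive transfer.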

What I was referring to earlier is verifying in simulation that the tlast signal is asserted when the last item is produced.

Yes, with that modification the tlast signal becomes 1 in the last item produced by the IP.

Since the error is on the send channel, could the problem be with tlast on the input stream? Does the DMA set it to 1 after sending the whole image?

I verified in simulation that tlast on the output stream is correctly set to 1 on the last iteration of the IP. How can I verify it for the input stream?

The next step in the debug process is to insert an ILA and verify what is happening in the output.

But before doing that, can you share the DMA configuration?

The DMA configuration is

The Vivado block design is

And the PS HP0 port is configured in 64 bits

In the DMA, the address width should be 32 bits for Zynq-7000 devices (PYNQ-Z2); this may be causing the problem.
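For reference, this can also be changed from the Tcl console instead of the re-customization GUI. A sketch, assuming the block-design cell is named `axi_dma_0` and that your AXI DMA version exposes the `c_addr_width` parameter (check the name in your IP version):

```tcl
# Set the AXI DMA address width to 32 bits for Zynq-7000 (PYNQ-Z2)
set_property CONFIG.c_addr_width 32 [get_bd_cells axi_dma_0]
# Re-run validation after the change
validate_bd_design
```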

Hey Pedro, can I ask why you decided to change to 8 bits?

Hi @Wardo82,

I changed to 8 bits because my images are 8-bit grayscale.

Hey @marioruiz, do you know where one can find the script to generate the stimulus used in the video? Or, even better, some resource on how to generate it and use the system_ila_0 ourselves? Is there anything in PYNQ for this?
I have followed Pedro’s and your indications and built the bitstream with debug flags as in this video, but I don’t know how to proceed. I jumped into this discussion because I keep hitting the same issue with the cv::Mat object type and the DMA send/recv PYNQ code.