AXI DMA Channel not idle (PYNQ Z1)

Hi, I’m a bit new to PYNQ and I’m having some trouble running DMA in a Jupyter Notebook.

The current issue is that my DMA receive channel is not idle, and I can’t find a way to make it idle again. I have gone through online PYNQ and video tutorials and the documentation, but I am still confused about how to do a reset. I’ve tried restarting the kernel and also the board itself. I am aware that a non-idle DMA channel probably means it is still waiting for more data, but my current goal is just to get it back to the idle state.

Speculations
The entire design was created by packaging an HLS design containing a cvtcolor_rgb2gray function, adding it to Vivado, and connecting it to the AXI DMA and ZYNQ IPs through block connection automation. I noticed that the HLS function uses bundle=gmem for the m_axi ports, which means it is designed to access DDR global memory directly (iirc). As a result, in my Vivado block design, S_AXIS_S2MM and M_AXIS_MM2S of the AXI DMA are not connected to anything. I suspect this is what caused the idle issue in the first place, and I am working on a new HLS function. If you know how I can avoid redesigning the HLS function and keep the current gmem method for memory transfer, please advise.
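For reference, an IP whose ports are m_axi with offset=slave fetches and stores its data in DDR by itself, so in principle it can be driven without the AXI DMA by writing the physical buffer addresses and the image size to its AXI-Lite registers. Below is a minimal sketch of that idea. The register offsets are hypothetical (the real ones are listed in the *_hw.h driver header that HLS generates for the IP), it assumes the m_axi ports are wired to a Zynq HP port, and FakeIP is only a stand-in so the sketch runs off-board.

```python
# Hypothetical register offsets: replace with the values from the *_hw.h
# header generated for your IP. These are NOT taken from the original design.
REGS = {"ctrl": 0x00, "img_rgb": 0x10, "img_gray": 0x1C, "rows": 0x28, "cols": 0x30}

AP_START = 0x1  # bit 0 of the HLS block-level control register
AP_DONE = 0x2   # bit 1 of the HLS block-level control register


def run_maxi_ip(ip, in_addr, out_addr, rows, cols, regs=REGS):
    """Program an m_axi HLS IP over AXI-Lite, start it, and wait for ap_done.

    `ip` is any object with write(offset, value) and read(offset);
    the addresses are physical addresses of the input and output buffers.
    """
    ip.write(regs["img_rgb"], in_addr & 0xFFFFFFFF)    # low 32 bits of source pointer
    ip.write(regs["img_gray"], out_addr & 0xFFFFFFFF)  # low 32 bits of destination pointer
    ip.write(regs["rows"], rows)
    ip.write(regs["cols"], cols)
    ip.write(regs["ctrl"], AP_START)
    while not (ip.read(regs["ctrl"]) & AP_DONE):
        pass  # busy-wait; a real notebook might sleep or add a timeout


class FakeIP:
    """Stand-in for the real IP so the sketch can run without a board."""

    def __init__(self):
        self.mem = {}

    def write(self, offset, value):
        self.mem[offset] = value
        if offset == REGS["ctrl"] and value & AP_START:
            self.mem[offset] = AP_DONE  # pretend the IP finishes immediately

    def read(self, offset):
        return self.mem.get(offset, 0)


fake = FakeIP()
run_maxi_ip(fake, in_addr=0x16849000, out_addr=0x1684A000, rows=480, cols=640)
```

On the board, the call would presumably look like run_maxi_ip(cvt, in_buffer.physical_address, out_buffer.physical_address, rows, cols), with no dma.sendchannel or dma.recvchannel transfer involved. If HLS generated 64-bit pointers, the upper address words would also need to be written.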

Debugging
Here is some information to help with debugging:

Register map

RegisterMap {
  MM2S_DMACR = Register(RS=1, Reset=0, Keyhole=0, Cyclic_BD_Enable=0, IOC_IrqEn=0, Dly_IrqEn=0, Err_IrqEn=0, IRQThreshold=1, IRQDelay=0),
  MM2S_DMASR = Register(Halted=0, Idle=1, SGIncld=0, DMAIntErr=0, DMASlvErr=0, DMADecErr=0, SGIntErr=0, SGSlvErr=0, SGDecErr=0, IOC_Irq=1, Dly_Irq=0, Err_Irq=0, IRQThresholdSts=0, IRQDelaySts=0),
  MM2S_CURDESC = Register(Current_Descriptor_Pointer=0),
  MM2S_CURDESC_MSB = Register(Current_Descriptor_Pointer=0),
  MM2S_TAILDESC = Register(Tail_Descriptor_Pointer=0),
  MM2S_TAILDESC_MSB = Register(Tail_Descriptor_Pointer=0),
  MM2S_SA = Register(Source_Address=377786368),
  MM2S_SA_MSB = Register(Source_Address=0),
  MM2S_LENGTH = Register(Length=4000),
  SG_CTL = Register(SG_CACHE=0, SG_USER=0),
  S2MM_DMACR = Register(RS=1, Reset=0, Keyhole=0, Cyclic_BD_Enable=0, IOC_IrqEn=0, Dly_IrqEn=0, Err_IrqEn=0, IRQThreshold=1, IRQDelay=0),
  S2MM_DMASR = Register(Halted=0, Idle=0, SGIncld=0, DMAIntErr=0, DMASlvErr=0, DMADecErr=0, SGIntErr=0, SGSlvErr=0, SGDecErr=0, IOC_Irq=0, Dly_Irq=0, Err_Irq=0, IRQThresholdSts=0, IRQDelaySts=0),
  S2MM_CURDESC = Register(Current_Descriptor_Pointer=0),
  S2MM_CURDESC_MSB = Register(Current_Descriptor_Pointer=0),
  S2MM_TAILDESC = Register(Tail_Descriptor_Pointer=0),
  S2MM_TAILDESC_MSB = Register(Tail_Descriptor_Pointer=0),
  S2MM_DA = Register(Destination_Address=377790464),
  S2MM_DA_MSB = Register(Destination_Address=0),
  S2MM_LENGTH = Register(Length=4000)
}

Snippet of code I was running on Jupyter Notebook:

### Test on jpeg image

image_path = "/home/xilinx/jupyter_notebooks/kevin/ip24/src/lol.png"
og_image = Image.open(image_path)
og_image.load()

input_array = np.array(og_image)

# Ensure input_array is of type np.uint8 for image display (for completeness)
if input_array.dtype != np.uint8:
    input_array = input_array.astype(np.uint8)

display(Image.fromarray(input_array))

# Allocate buffers with the same dtype as the original image for compatibility
in_buffer = allocate(shape=input_array.shape, dtype=np.uint8)
out_buffer = allocate(shape=input_array.shape, dtype=np.uint8)

# Copy data into input buffer
np.copyto(in_buffer, input_array)
in_buffer.flush()

# Convert buffer back to an image for display (for buffer-check purposes)
buf_image = Image.fromarray(in_buffer)
display(buf_image)

# DMA transfer to PL for processing
dma.sendchannel.transfer(in_buffer)
dma.recvchannel.transfer(out_buffer)

# cvt.write(0x10, frame.shape[0])
# cvt.write(0x18, frame.shape[1])
# cvt.write(0x00, 1)

out_buffer.invalidate()

dma.sendchannel.wait()
dma.recvchannel.wait()

# Free the buffers
del in_buffer, out_buffer

Note: Before I ran the cvt.write lines, the idle issue did not occur. It only appeared after I added them, but now, even with these lines commented out, I can’t get out of the non-idle state.

HLS function

void cvtcolor_rgb2gray(ap_uint<INPUT_PTR_WIDTH>* img_rgb, ap_uint<OUTPUT_PTR_WIDTH>* img_gray, int rows, int cols) {
    // clang-format off
    #pragma HLS INTERFACE m_axi     port=img_rgb  offset=slave bundle=gmem1
    #pragma HLS INTERFACE m_axi     port=img_gray offset=slave bundle=gmem2
    #pragma HLS INTERFACE s_axilite port=rows
    #pragma HLS INTERFACE s_axilite port=cols
    #pragma HLS INTERFACE s_axilite port=return
    // clang-format on

    xf::cv::Mat<XF_8UC3, HEIGHT, WIDTH, NPC1, XF_CV_DEPTH_IN_0> imgInput0;

    imgInput0.rows = rows;
    imgInput0.cols = cols;
    xf::cv::Mat<XF_8UC1, HEIGHT, WIDTH, NPC1, XF_CV_DEPTH_OUT_0> imgOutput0;

    imgOutput0.rows = rows;
    imgOutput0.cols = cols;

    // clang-format off
    #pragma HLS DATAFLOW
    // clang-format on
    xf::cv::Array2xfMat<INPUT_PTR_WIDTH, XF_8UC3, HEIGHT, WIDTH, NPC1, XF_CV_DEPTH_IN_0>(img_rgb, imgInput0);
    xf::cv::rgb2gray<XF_8UC3, XF_8UC1, HEIGHT, WIDTH, NPC1, XF_CV_DEPTH_IN_0, XF_CV_DEPTH_OUT_0>(imgInput0, imgOutput0);
    xf::cv::xfMat2Array<OUTPUT_PTR_WIDTH, XF_8UC1, HEIGHT, WIDTH, NPC1, XF_CV_DEPTH_OUT_0>(imgOutput0, img_gray);
}

Channel register debugging
https://docs.xilinx.com/r/en-US/pg021_axi_dma/S2MM_DMASR-S2MM-DMA-Status-Register-Offset-34h
I found this documentation in an attempt to reset the “idle” status of the DMA channel, but I am unsure of how to do so, even after reading the documentation.
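As far as I can tell from PG021, the Idle bit is read-only, so it cannot be cleared directly. What the core offers instead is the self-clearing Reset bit (bit 2) in the control register, which resets the whole DMA engine and leaves the channel halted. Here is a rough sketch of that sequence against a generic read/write register interface. The offsets and bit positions are my reading of PG021, and FakeDMA is a made-up stand-in so the logic can be run without hardware.

```python
S2MM_DMACR = 0x30  # S2MM control register offset (per PG021)
S2MM_DMASR = 0x34  # S2MM status register offset (per PG021)

RS_BIT = 1 << 0      # DMACR: run/stop
RESET_BIT = 1 << 2   # DMACR: soft reset, self-clearing
HALTED_BIT = 1 << 0  # DMASR: channel halted
IDLE_BIT = 1 << 1    # DMASR: channel idle


def s2mm_status(ip):
    """Decode the Halted and Idle flags of the S2MM status register."""
    sr = ip.read(S2MM_DMASR)
    return {"halted": bool(sr & HALTED_BIT), "idle": bool(sr & IDLE_BIT)}


def soft_reset(ip, max_polls=1000):
    """Set the Reset bit and poll until the core clears it again.

    Returns True if the reset completed within max_polls reads.
    """
    ip.write(S2MM_DMACR, ip.read(S2MM_DMACR) | RESET_BIT)
    for _ in range(max_polls):
        if not (ip.read(S2MM_DMACR) & RESET_BIT):
            return True
    return False


class FakeDMA:
    """Hypothetical stand-in: starts as 'running, not idle' like the stuck channel."""

    def __init__(self):
        self.regs = {S2MM_DMACR: RS_BIT, S2MM_DMASR: 0}

    def read(self, offset):
        return self.regs.get(offset, 0)

    def write(self, offset, value):
        if offset == S2MM_DMACR and value & RESET_BIT:
            self.regs[S2MM_DMACR] = 0            # RS and Reset both cleared
            self.regs[S2MM_DMASR] = HALTED_BIT   # channel halted after reset
        else:
            self.regs[offset] = value


fake = FakeDMA()
before = s2mm_status(fake)
ok = soft_reset(fake)
after = s2mm_status(fake)
```

On a real board the PYNQ DMA object exposes read and write as well, so soft_reset(dma) should be possible, followed by restarting the channels (for example with dma.recvchannel.start()). Re-downloading the overlay is another way to get the core back to its reset state. Note that a reset only clears the symptom: if nothing ever drives S_AXIS_S2MM, the next receive transfer will hang again.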

Finally, if any more information is needed, such as the Vivado block diagrams, please ask; I am happy to provide it. Thank you!

Hi @booth,

Welcome to the PYNQ community.

This issue is very common. It is likely that your custom HLS IP is not generating the TLAST.

In your Python code, the output buffer should be three times smaller than the original one, since the grayscale output has one channel per pixel instead of three. Also, you are not passing rows and cols to the IP.
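To make the size relationship concrete, here is a small sketch that computes the two buffer shapes and byte counts. It assumes 8-bit channels with no padding, and the helper names are made up for illustration.

```python
def buffer_shapes(rows, cols, in_channels=3):
    """Return (input_shape, output_shape) for an RGB-to-gray conversion."""
    in_shape = (rows, cols, in_channels)  # e.g. RGB: three bytes per pixel
    out_shape = (rows, cols)              # gray: one byte per pixel
    return in_shape, out_shape


def nbytes(shape, bytes_per_element=1):
    """Total size in bytes of a buffer with the given shape."""
    total = bytes_per_element
    for dim in shape:
        total *= dim
    return total


in_shape, out_shape = buffer_shapes(rows=480, cols=640)
in_bytes = nbytes(in_shape)
out_bytes = nbytes(out_shape)
```

In the notebook, the output buffer would then be allocated with allocate(shape=out_shape, dtype=np.uint8) rather than with input_array.shape, and rows and cols would be written to the IP before the transfer is started.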

Mario

Hi, thank you for your reply. I’ve taken your advice from this thread and other questions and am trying to fix the issues - it would be great if you could provide some advice on the following work I’ve done:

To try and fix the idling and TLAST issue, I created a new HLS design in the Vitis Unified IDE, based on the updated Vitis Vision Library format:

accel2.cpp

#include "accel2.hpp"

// void func(hls::stream, hls::stream, rows, cols)
// xres <-> cols (image width)
// yres <-> rows (image height)
// e.g. 1920x1080 <-> xres * yres

void cvtcolor(axi_stream& img_in, axi_stream& img_out, int rows, int cols) {
    #pragma HLS INTERFACE axis port=img_in
    #pragma HLS INTERFACE axis port=img_out
    #pragma HLS INTERFACE s_axilite port=rows
    #pragma HLS INTERFACE s_axilite port=cols
    #pragma HLS INTERFACE s_axilite port=return

    gray_image img_mat_gray(rows, cols);
    RGB_image img_mat_rgb(rows, cols);

    #pragma HLS DATAFLOW

    xf::cv::AXIvideo2xfMat(img_in, img_mat_rgb);
    
    xf::cv::rgb2gray<XF_8UC3, XF_8UC1, MAX_HEIGHT, MAX_WIDTH, XF_NPPC1, XF_CV_DEPTH, XF_CV_DEPTH>(img_mat_rgb, img_mat_gray);

    xf::cv::xfMat2AXIvideo(img_mat_gray, img_out);
}

accel2.hpp

#include "ap_fixed.h"
#include "stdint.h"

#include "hls_stream.h"
#include "ap_axi_sdata.h"
#include "ap_int.h"
#include "common/xf_common.hpp"
#include "common/xf_utility.hpp"
#include "common/xf_infra.hpp"
#include "imgproc/xf_cvt_color.hpp"
#include "imgproc/xf_cvt_color_1.hpp"
//#include "imgproc/xf_rgb2hsv.hpp"
#include "imgproc/xf_bgr2hsv.hpp"

#define MAX_WIDTH 1920
#define MAX_HEIGHT 1080

#define XF_CV_DEPTH 2

typedef ap_axiu<32,1,1,1> axi_pixel;
typedef hls::stream<axi_pixel> axi_stream;

// Update Mat definitions to use xf::cv namespace and appropriate types
typedef xf::cv::Mat<XF_8UC3, MAX_HEIGHT, MAX_WIDTH, XF_NPPC1, XF_CV_DEPTH> RGB_image;
typedef xf::cv::Mat<XF_8UC1, MAX_HEIGHT, MAX_WIDTH, XF_NPPC1, XF_CV_DEPTH> gray_image;

void cvtcolor(axi_stream& img_in, axi_stream& img_out, int rows, int cols);

And this is my block design:

According to this answer, TLAST should be generated correctly. However, I cannot find the signal in the synthesis report; it may have been generated but not shown in the new synthesis report format.

Is there any other way to verify that TLAST was generated? Is it necessary for TLAST to be generated to avoid the deadlock waiting state of the DMA?

Also, assuming I want to use the function to analyse a live video stream in the Jupyter Notebook, would it be better to use the VDMA instead of the DMA?

Thank you!

Hi @booth,

I would suggest you check how images are packaged over AXI4-Stream.
The DMA is not the best option for moving images; I suggest you look at the VDMA IP.
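To illustrate the packaging point: in the AXI4-Stream Video convention, which xfMat2AXIvideo follows as far as I know, TUSER marks the first pixel of a frame and TLAST marks the last pixel of every line, not the end of the frame. The toy model below just generates those sideband flags for one frame; the function name is made up.

```python
def axi_video_sideband(rows, cols):
    """Yield (tuser, tlast) for each pixel of one frame, in raster order."""
    for r in range(rows):
        for c in range(cols):
            tuser = 1 if (r == 0 and c == 0) else 0  # start of frame
            tlast = 1 if c == cols - 1 else 0        # end of line
            yield tuser, tlast


flags = list(axi_video_sideband(rows=4, cols=6))
tlast_count = sum(last for _, last in flags)
first_tlast_index = next(i for i, (_, last) in enumerate(flags) if last)
print(tlast_count, first_tlast_index + 1)  # prints "4 6"
```

So a 4x6 frame carries four TLAST pulses, the first one after only six pixels. A plain AXI DMA in simple mode treats the first TLAST as the end of the packet, so it would likely complete the receive transfer after a single row, whereas the VDMA is built around this frame and line framing.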

Mario