Composable pipeline needs to wait before readframe() from VDMA

Setup: Custom ZU+ MPSoC, AMD Xilinx tools 2020.2, PYNQ 2.7, Composable 1.0, DPU 1.4.0

Hi All,

I created a custom composable pipeline to process images before sending them to the DPU. I noticed that I need to wait a couple of milliseconds before reading the output of the pipeline from the VDMA, otherwise my model’s performance drops significantly. The wait time needs to increase for larger image resolutions.

My IPs (generated with Vitis HLS) are configured and started in auto_restart mode along with the VDMA in/out channels. I tried reading a frame before setting the input to avoid stale frames, as suggested by the documentation, but it made no difference.

Is there a way to ensure the output read from the VDMA will correspond to the output of the last IP in the pipeline after it finished processing?

I had success with custom IPs with AXI master interfaces running in single-execution mode and using interrupts, so I am fairly certain my basic interrupt configuration works. I couldn’t find much on interrupts from streaming IPs in auto-restart mode, though, so if anyone has notes on it I am all ears (I would wait for an interrupt from the last IP before reading the frame). My understanding so far is that in auto-restart mode ap_done in the control register is only briefly set, which makes polling an unreliable alternative.
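For reference, this is the kind of blocking wait I have in mind, built around pynq’s `Interrupt` class, which exposes an async `wait()` coroutine. This is only a sketch: the interrupt path name is hypothetical, and on the board the event loop handling may need to follow whatever loop pynq has registered.

```python
import asyncio


def wait_for_irq(interrupt, timeout: float = 1.0) -> None:
    """Synchronously wait until an interrupt fires, or raise on timeout.

    `interrupt` is expected to expose an async `wait()` coroutine, as
    pynq.interrupt.Interrupt does. A fresh event loop is used here for
    self-containment; on a real board you may need pynq's own loop.
    """
    loop = asyncio.new_event_loop()
    try:
        loop.run_until_complete(asyncio.wait_for(interrupt.wait(), timeout))
    finally:
        loop.close()


# On the board (names are hypothetical):
# from pynq import Interrupt
# irq = Interrupt('pipeline/last_ip/interrupt')
# wait_for_irq(irq)
# out = vdma_out.readframe()
```

The idea would be to call this in place of the `time.sleep()` before `readframe()`.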

This is the last piece of my puzzle to combine the composable methodology with a DPU and any help would be highly appreciated. The design will be open sourced once it is stable. :wink:

Thank you,

Mario

Hi @MarioMihaly,

Without seeing the block diagram it is hard to say, but the interrupt should be driven by the VDMA.
The read function should be blocking and return the frame once the interrupt is asserted.

Mario

Hi @marioruiz,

Thank you for your response, and sorry for the lengthy detail storm here :slight_smile:. I have tried to state my understanding of the necessary configurations; I am here to learn, so thank you for your help.

Below is my block design for the composable pipeline. The HLS IPs use the xf::cv::AXIvideo2xfMat and xf::cv::xfMat2AXIvideo functions to convert between the AXI4-Stream interface and the xf::cv::Mat instances used for processing.

I compared my VDMA configuration to the one in the video processing pipeline for Pynq-ZU and it seems identical except for the number of frame buffers (mine uses 3 compared to 4) and some channel configs; the advanced tab is the same.

For now I am working frame by frame, so the video streaming capabilities of the VDMA may be more than my use case needs. Looking at the AXI4-Stream Video protocol, tuser indicates the Start of Frame (SOF), which is used for fsync by the write channel. Doesn’t this mean that s2mm_introut fires at the SOF?


The bus of 6 interrupts from the pipeline is connected to the AXI Interrupt Controller with the following configuration; the 7th interrupt is from the DPU instance. The irq pin is connected to the pl_ps_irq0[0:0] pin of the Zynq UltraScale+ MPSoC instance, all following the PYNQ 2.7.0 documentation on interrupts. While I manually configure the interrupt controller for Edge as per the documentation, Pynq-ZU seems to omit this manual configuration. Could this be the source of my trouble?


I connected an ILA to the interrupt bus and I could see interrupts firing. When evaluating the system, I ran it for 100+ images but only got ~20 interrupts (cat /proc/interrupts | grep fabric), even though I would have expected 2 from the VDMA and 1 from the DPU for each image in the system.
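To track those counts from Python rather than eyeballing /proc/interrupts between runs, a small stdlib-only helper like this (hypothetical, not part of PYNQ) can sum the per-CPU counts for each fabric line:

```python
def fabric_irq_counts(text: str) -> dict:
    """Sum per-CPU interrupt counts for /proc/interrupts lines tagged 'fabric'.

    `text` is the content of /proc/interrupts; a data line looks like
    ' 45:   12   3   0   0  GICv2  121 Level  fabric'.
    Returns a mapping of IRQ number (as a string) to total count.
    """
    counts = {}
    for line in text.splitlines():
        if 'fabric' not in line:
            continue
        irq, _, rest = line.partition(':')
        total = 0
        # The leading fields are per-CPU counts; stop at the first
        # non-integer field (the interrupt controller name).
        for field in rest.split():
            if not field.isdigit():
                break
            total += int(field)
        counts[irq.strip()] = total
    return counts


# Usage on the board:
# with open('/proc/interrupts') as f:
#     print(fabric_irq_counts(f.read()))
```

Snapshotting this before and after a batch of frames makes it easy to see exactly how many interrupts each run generated.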

I included the entire block design with the pipeline hierarchy as a PDF for clarity.
mlvap.pdf (219.6 KB)

I use the following sequence to feed the VDMA and retrieve the output from the composable pipeline.

def run(self, input_data, wait_time: float = 0.01):
    if not self._running:
        raise RuntimeError('start must be called before calling run!')

    in_frame = self._vdma_in.newframe()
    in_frame[...] = input_data
    self._vdma_in.writeframe(in_frame)

    # TODO: replace with an interrupt from the last IP in the pipeline?
    sleep(wait_time)
    return self._vdma_out.readframe()

Now I am working on a version that uses xf::cv::axiStrm2xfMat and xf::cv::xfMat2axiStrm instead of the video versions, with a regular DMA instead of the VDMA, but I would be very happy to get it working with the VDMA if possible.

Thank you very much for looking into this and helping me sort through this :slight_smile:
Mario

Have you added an ILA in the datapath to see if all the signals are correct?

Have you tried a bypass pipeline, meaning no HLS IP on the path?
Also, make sure that all the IPs are correctly configured, as the image size can change depending on the IPs you’ve defined.

Thank you for the suggestions, @marioruiz. I believe I have already covered them, and the problem is likely something a bit more abstract. I added an ILA to the AXI4-Stream from the switch to the VDMA and all the AXI4-Stream Video protocol signals look about right. The simple [ps_in, ps_out] configuration also fails on the first iteration without time.sleep(0.01) but passes all iterations with it.

I don’t think my issue is related to the configuration of the HLS IPs or the pipeline: each HLS IP was developed with self-checking test benches, and I have a test suite using PYNQ as well. Here is an example test setup that fails on the first frame if I remove time.sleep(0.01). The bypass test setup is the same as this one, just without the colour conversion.

import os
import gc
import cv2
import time
import pytest
import numpy as np
import pynq_composable
from pynq import Overlay, DefaultIP
from pynq.lib.video import VideoMode
from mlvap.data import VOCLoader
from mlvap.utils import compare_results

class BGR2RGB(DefaultIP):
    bindto = ['xilinx.com:hls:bgr2rgb_accel:1.0']
    
    START = 0x81
    STOP  = 0x00
    
    def __init__(self, description):
        super().__init__(description)
        
    def config(self, rows:int, cols:int):
        # Configure IP for processing
        self.register_map.rows = rows
        self.register_map.cols = cols
        
    def start(self):
        self.register_map.CTRL = self.START
        
    def stop(self):
        self.register_map.CTRL = self.STOP

ROOT = os.path.dirname(os.path.realpath(__file__))
OVERLAY_PATH = os.path.join(ROOT, '../../overlays/mlvap_test/mlvap_test.bit')

ITERATIONS = 100

BGR2RGB_DIMENSIONS = [
    (160, 320, ITERATIONS),
    (320, 320, ITERATIONS),
    (416, 416, ITERATIONS),
    (480, 640, ITERATIONS)
]

@pytest.mark.parametrize('rows, cols, N', BGR2RGB_DIMENSIONS)
def test_mlvap_bgr2rgb(rows, cols, N):
    ol = Overlay(OVERLAY_PATH)
    
    pipeline = ol.pipeline
    vdma = pipeline.vdma
    bgr2rgb = pipeline.bgr2rgb
    
    # Configure IP and pipeline
    bgr2rgb.config(rows, cols)
    pipeline.compose([pipeline.ps_in, bgr2rgb, pipeline.ps_out])
    
    # Configure VDMA
    vdma_in = vdma.writechannel
    vdma_in.mode = VideoMode(cols, rows, 24)
    vdma_out = vdma.readchannel
    vdma_out.mode = VideoMode(cols, rows, 24)
    
    # Start IP and VDMA
    vdma_in.start()
    vdma_out.start()
    bgr2rgb.start()
    
    for i in range(N):
        # Generate random input
        input_data = (np.random.rand(rows, cols, 3) * 255).astype(np.uint8)
        
        # Process test input using the hardware
        in_frame = vdma_in.newframe()
        in_frame[...] = input_data
        vdma_in.writeframe(in_frame)
        
        # Wait for result -> it is a bit sensitive and needs the wait.
        # If the test fails, try increasing the wait time.
        time.sleep(0.01)
        
        out_data = vdma_out.readframe()

        # Generate reference output
        reference_output = cv2.cvtColor(input_data, cv2.COLOR_BGR2RGB)
        err_per = compare_results(out_data, reference_output, 0.005)

        if err_per == 0.0:
            out_data.freebuffer()
            continue
            
        cv2.imwrite(os.path.join(ROOT, 'sw.jpg'), reference_output)
        cv2.imwrite(os.path.join(ROOT, 'hw.jpg'), out_data)

        out_data.freebuffer()
            
        # Clean up in case of error
        vdma_out.stop()
        vdma_in.stop()
        bgr2rgb.stop()
        del ol
        gc.collect()
        assert False, f'Failed at {i+1}/{N}'
    
    vdma_out.stop()
    vdma_in.stop()
    bgr2rgb.stop()
    del ol
    gc.collect()

Everything works when it comes to the behaviour of the pipeline; I just need to spend some time waiting before reading the output from the VDMA, which is not very elegant. I suspect it has something to do with interrupts, but I am out of ideas on what to check there.

Bonus question: my current issue is likely unrelated to my older question on the forum, Clarification for using VDMA. If you have time, could you please take a look? I am still not clear on some of the issues I raised there.

Thank you,
Mario

Hi @MarioMihaly,

Now that I have the full picture, I recognize the issue with time.sleep(0.01): I think this is a limitation of the current version. Unfortunately, the sleep may be the only solution for now.

Mario

Hi @marioruiz,

Thank you for your help chasing this down and confirming it as a known limitation. I got a version working with a DMA instead of the VDMA, which solved my time.sleep(0.01) issue.

Thank you again,
Mario