DPU input zero copy

Hello there,

I have been looking for a way to accelerate the pre-processing of images in hardware before passing them to an ML model deployed on the DPU. I used HLS to implement a pre-processing IP for YOLOv3, which works nicely with the DPU, but I currently have to copy its output to the DPU input manually.
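
To make that manual step concrete, here is roughly what my current flow looks like (the shapes, dtypes and the `dpu_runner`/`dpu_output` names are just placeholders for my actual setup):

```python
import numpy as np
from pynq import allocate

# Buffer my HLS pre-processing IP writes into (shape/dtype illustrative)
preproc_out = allocate(shape=(416, 416, 3), dtype=np.int8)

# ... start the pre-processing IP and wait for it to finish ...

# DPU input array sized from the runner's input tensor
input_tensor = dpu_runner.get_input_tensors()[0]
dpu_input = np.empty(tuple(input_tensor.dims), dtype=np.int8)

# The host-side copy I would like to get rid of:
dpu_input[0] = preproc_out

job_id = dpu_runner.execute_async([dpu_input], [dpu_output])
dpu_runner.wait(job_id)
```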

Is there a way to achieve zero copy between my IP (AXI4 master input and output interfaces) and the DPU using PYNQ?

I am using PYNQ 2.7.0 on a custom MPSoC board. My overlay integrates a single B4096 DPU core based on the DPU-PYNQ 1.4.0 ZCU104 configuration, built with the DPU-PYNQ flow on a custom platform.

I thought about writing the physical address of the PynqBuffer used by my IP to the DPU instance, but the DPU's register map is not exposed in PYNQ.
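
For my own IP this address-passing pattern works fine through the AXI-Lite register map (the register names below come from my HLS design, so treat them as placeholders); I just have nothing equivalent on the DPU side:

```python
import numpy as np
from pynq import Overlay, allocate

overlay = Overlay("preproc_dpu.bit")       # placeholder bitstream name
preproc = overlay.preprocess_0             # my HLS pre-processing IP

in_buf = allocate(shape=(1080, 1920, 3), dtype=np.uint8)
out_buf = allocate(shape=(416, 416, 3), dtype=np.int8)

# Pass physical addresses to the IP's AXI4 master pointer arguments
preproc.register_map.in_addr = in_buf.physical_address
preproc.register_map.out_addr = out_buf.physical_address
preproc.register_map.CTRL.AP_START = 1     # kick off the kernel
```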

I found the resnet50_zero_copy Whole Application Acceleration (WAA) example, which uses C/C++ and vart::RunnerExt. I modified the DpuOverlay class to use vart.RunnerExt instead of vart.Runner so I could get the vart.TensorBuffer used for the DPU input, as in the example linked above. However, the vart.TensorBuffer I get is labeled HOST_VIRT, which means it is a host-only allocation.
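
For reference, this is roughly how I query the input TensorBuffer after that change (the method names reflect my reading of the vart Python bindings, so please correct me if I got any of them wrong):

```python
import xir
import vart

graph = xir.Graph.deserialize("yolov3.xmodel")    # placeholder model name
dpu_subgraphs = [
    s for s in graph.get_root_subgraph().toposort_child_subgraph()
    if s.has_attr("device") and s.get_attr("device").upper() == "DPU"
]

# RunnerExt instead of Runner so the internal tensor buffers are exposed
runner = vart.RunnerExt.create_runner(dpu_subgraphs[0], "run")

input_tbs = runner.get_inputs()     # list of vart.TensorBuffer
print(input_tbs[0])                 # repr reports location=HOST_VIRT for me
```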

Is there a way to get a HOST_PHY vart.TensorBuffer for the DPU using PYNQ?

Although the Python binding of vart.TensorBuffer is fairly limited, I can parse the buffer address from its string representation with a regex. My only hiccup is the location of the buffer I am getting.
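
The parsing itself is trivial, something along these lines (the exact repr format may well differ between Vitis AI versions):

```python
import re

tb_repr = str(input_tbs[0])          # input_tbs from the RunnerExt snippet above
match = re.search(r"0x[0-9a-fA-F]+", tb_repr)
if match:
    buf_addr = int(match.group(0), 16)
    print(hex(buf_addr))
```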

My theory is that if I can get a vart.TensorBuffer with a HOST_PHY location for the DPU input, I can configure my IP to write its output there and call it a day.
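
If that worked, the end of the pipeline would look roughly like this (again, the register names are from my IP, and calling execute_async with TensorBuffers is an assumption on my part):

```python
# buf_addr parsed from a (hoped-for) HOST_PHY TensorBuffer; preproc and
# runner as in the earlier snippets
preproc.register_map.out_addr = buf_addr   # IP writes straight into the DPU input
preproc.register_map.CTRL.AP_START = 1

output_tbs = runner.get_outputs()
job_id = runner.execute_async(input_tbs, output_tbs)   # no host-side copy
runner.wait(job_id)
```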

Does anyone have experience achieving zero copy to a DPU using Python?

Any advice on this would be highly appreciated, even failed attempts that help prune my search :slight_smile:

Many thanks,
Mario