Hello there,
I have been looking for a way to accelerate image pre-processing in hardware before passing the images to an ML model deployed on the DPU. I used HLS to implement a pre-processing IP for YOLOv3 that works nicely with the DPU, but I currently have to copy its output to the DPU manually.
Is there a way to achieve zero copy between my IP (AXI4 Master interface input/output) and the DPU using PYNQ?
I am using PYNQ 2.7.0 on a custom MPSoC board, and my overlay integrates a single B4096 DPU core based on the DPU-PYNQ 1.4.0 ZCU104 configuration. I integrated the core with the DPU-PYNQ flow on a custom platform.
I thought about writing the address of the PynqBuffer used by my IP into the DPU instance, but I don't have access to the DPU's register map in PYNQ.
I found the resnet50_zero_copy Whole Application Acceleration (WAA) example, which uses C/C++ and vart::RunnerExt. I modified the DpuOverlay to use vart.RunnerExt instead of vart.Runner so I could get the vart.TensorBuffer used for the DPU input, as in the example linked above. However, the vart.TensorBuffer I get is labeled HOST_VIRT, which means it is a host-only allocation.
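For reference, this is roughly what my RunnerExt modification looks like. It is a sketch, not tested code: the exact Python binding names (vart.RunnerExt.create_runner, get_inputs/get_outputs) are my assumptions from mirroring the C++ WAA example, and should be checked against your Vitis AI version.

```python
def get_dpu_io_buffers(subgraph):
    """Sketch of the RunnerExt approach (untested; binding names assumed).
    `subgraph` is the DPU xir subgraph that DpuOverlay normally passes to
    vart.Runner.create_runner()."""
    import vart  # imported lazily: only available on the target board

    # Unlike the plain Runner interface, RunnerExt exposes the
    # TensorBuffers the runner will actually use for input and output.
    runner = vart.RunnerExt.create_runner(subgraph, "run")
    return runner.get_inputs(), runner.get_outputs()
```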
Is there a way to get a HOST_PHY vart.TensorBuffer for the DPU using PYNQ?
Although the Python binding of vart.TensorBuffer is fairly limited, I can parse the address from its string representation with a regex. My only hiccup is the location of the buffer I am getting.
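For completeness, here is a minimal version of that regex parsing. The repr layout is undocumented, so the field names (location=..., Phy=/Virt=) and the example string below are assumptions — verify them against what your board actually prints.

```python
import re

def parse_tensorbuffer_repr(repr_str):
    """Pull the location label and data address out of a TensorBuffer's
    string representation. Assumes fields like 'location=HOST_PHY' and
    'Phy=0x...' (or 'Virt=0x...') appear in the repr."""
    loc = re.search(r"location=(\w+)", repr_str)
    addr = re.search(r"(?:Phy|Virt)=(0x[0-9a-fA-F]+)", repr_str)
    return (
        loc.group(1) if loc else None,
        int(addr.group(1), 16) if addr else None,
    )

# Hypothetical repr string, for illustration only:
example = "TensorBuffer{location=HOST_PHY, data=[(Phy=0x3e000000, 602112)]}"
print(parse_tensorbuffer_repr(example))  # ('HOST_PHY', 1040187392)
```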
My theory is that if I can get a vart.TensorBuffer with a HOST_PHY location for the DPU input, I could configure my IP to write its output there and call it a day.
Does anyone have experience achieving zero copy to a DPU using Python?
Any advice on this would be highly appreciated; even failed attempts would help prune my search.
Many thanks,
Mario