I am trying to use the DPU-PYNQ flow to connect the instruction fetch port (M_AXI_GP0) of a single DPU core to the S_AXI_LPD port on a ZYNQ US+ as suggested in this documentation. I am getting the following error:

ERROR: [CF2SW 83-2178] Memory for component zynq_ultra_ps_e, interface S_AXI_LPD cannot have address segment information automatically inferred. Please annotate the interface with memory segment information.

I have a working design where the instruction fetch port is connected to the S_AXI_HPC1_FPD port. However, I have significant data movement on the S_AXI_HPC0_FPD port, which results in a ~25% performance drop for the DPU and a nominal performance drop for the subsystem connected to S_AXI_HPC0_FPD. I am trying to resolve this by using only the S_AXI_HPC0_FPD port in the design and using the S_AXI_LPD port for instruction fetching.

I looked up the error but haven’t found anything that could help me with this issue so any help would be highly appreciated.

Set up

  • Custom ZU+ SoC
  • PYNQ 2.7.0 image
  • DPU-PYNQ 1.4.0
  • Xilinx tools 2020.2
  • Ubuntu 18.04.6 set up using Vagrant file

Full build log

ERROR: [CF2SW 83-2178] Memory for component zynq_ultra_ps_e, interface S_AXI_LPD cannot have address segment information automatically inferred.  Please annotate the interface with memory segment information
INFO: [v++ 60-1442] [14:48:39] Run run_link: Step cf2sw: Failed
Time (s): cpu = 00:00:01 ; elapsed = 00:00:01 . Memory (MB): peak = 1586.246 ; gain = 0.000 ; free physical = 4545 ; free virtual = 9147
ERROR: [v++ 60-661] v++ link run 'run_link' failed
ERROR: [v++ 60-626] Kernel link failed to complete
ERROR: [v++ 60-703] Failed to finish linking

I managed to resolve the issue by specifying the “memory” field as <PS name> LPD_DDR_LOW when enabling the S_AXI_LPD port for the platform. I didn’t find any documentation about this; I stumbled upon it while reading through the source code of this TRD. Although this is the recommended configuration, it didn’t resolve my bottleneck.
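
For anyone hitting the same CF2SW 83-2178 error, here is a rough sketch of what that annotation looks like. It is a small Python helper that only builds the Tcl `set_property PFM.AXI_PORT` line for the platform script. The cell name `zynq_ultra_ps_e_0`, the `memport` value `S_AXI_HP`, and the `sptag` value `LPD` are my assumptions for illustration; check them against your own block design and the TRD platform script. The only part taken from the fix above is the `memory` field, `<PS name> LPD_DDR_LOW`.

```python
def axi_port_property(ps_cell, port, memport, sptag, segment):
    """Return a Tcl line that annotates one PS AXI port for the platform.

    ps_cell : name of the PS block design cell (assumed, e.g. zynq_ultra_ps_e_0)
    port    : PS interface to expose, e.g. S_AXI_LPD
    memport : platform port type (assumed value, verify against your platform)
    sptag   : tag used to refer to the port at link time (assumed value)
    segment : address segment name, e.g. LPD_DDR_LOW
    """
    # The "memory" field is "<PS name> <segment>", as described in the post.
    entry = (
        f'{port} {{memport "{memport}" sptag "{sptag}" '
        f'memory "{ps_cell} {segment}"}}'
    )
    return f"set_property PFM.AXI_PORT {{{entry}}} [get_bd_cells /{ps_cell}]"


# Hypothetical values; only LPD_DDR_LOW comes from the fix described above.
tcl_line = axi_port_property(
    ps_cell="zynq_ultra_ps_e_0",
    port="S_AXI_LPD",
    memport="S_AXI_HP",
    sptag="LPD",
    segment="LPD_DDR_LOW",
)
print(tcl_line)
```

The printed line is meant to be pasted into the platform Tcl script. Note that a single `set_property PFM.AXI_PORT` call replaces the whole property, so in a real script the S_AXI_LPD entry would sit alongside the entries for the other exposed ports in the same braces.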

Part of the subsystem is responsible for data preprocessing for the DPU. Is there a way to connect a preprocessing system directly to the DPU? I feel like that should help with the bottleneck, as it would avoid using the PS to move data between sections of the DDR (from the output of the preprocessing subsystem to the input of the DPU).

