Pynq Allocate Speed

** PYNQ Version **
Release 2021_11_18 14a7328
Board 2021_11_18 14a7328
GitHub - Xilinx/PYNQ: Python Productivity for ZYNQ

  • Standard Image
  • Board: Pynq-Z1

What I’m trying to do:
I’m trying to get an idea of how fast my M_AXI matrix multiplier IP is by sending in matrices and using the “time” library to measure latency.

I noticed that the following commands can take 0.103 seconds or more:

inbuff = allocate(shape=(mat1_r*mat1_c,), dtype=np.uint64)
in2buff = allocate(shape=(mat2_r*mat2_c,), dtype=np.uint64)
outbuff = allocate(shape=(mat1_r*mat2_c,), dtype=np.uint64)

This is an issue because my matrix multiplier, for the given example, takes only 0.00703 seconds.

Is there a way to overcome this slowdown caused by the “allocate” function?

I’ve attached my code to this post:
fpga_mmult_function.ipynb (3.9 KB)

Thank you!

Hi Nick,

Yes. The allocate function behaves differently depending on the type of device you are running on, either edge or x86. When an allocation occurs, a check is made to determine which type of device you are on, and then a handle for that device is fetched.

If we perform some profiling, we can clearly see that getting a handle for the device is where the majority of the time is spent during allocation. On v2.7 the handle comes from a separate process (the pl_server) that is communicated with via sockets, and in v3.0 a global state file is referenced to get information for the current device.

The actual allocation time is quite small compared to the overhead of all the comms with the pl_server; see the highlighted section above.
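For anyone who wants to reproduce this kind of breakdown, here is a minimal sketch using Python's standard cProfile and pstats modules. The helper profile_call and the two fake_* functions are hypothetical stand-ins (they are not part of PYNQ) so that the snippet also runs off-board; they only mimic a slow "get device handle" step followed by a fast allocation.

```python
import cProfile
import io
import pstats
import time


def profile_call(fn, *args, top=10, **kwargs):
    """Run fn under cProfile; return (result, text report by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


# Hypothetical stand-ins, NOT PYNQ code: a slow handle lookup
# followed by a cheap memory reservation.
def fake_get_device_handle():
    time.sleep(0.05)  # pretend this is the round trip to the server
    return object()


def fake_allocate(n):
    fake_get_device_handle()
    return bytearray(n)


if __name__ == "__main__":
    _, report = profile_call(fake_allocate, 1024)
    print(report)  # the handle lookup should dominate the cumulative time
```

On a board, the same helper could be pointed at the real function instead, e.g. profile_call(allocate, shape=(1024,), dtype=np.uint64), assuming allocate and np are already imported.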

On v3.0, if you load the overlay in the same process in which you perform the allocation, then internally we grab a handle for the device directly from the overlay object. This bypasses the communication with the server and gives a much faster allocation time.

The above example is on v3.0. As we can see, if the overlay has not been loaded (or was loaded in a separate process), then the time taken to allocate a buffer is 0.113s. However, once the overlay has been loaded in the current process, we can grab the device handle directly from it and get a much faster allocation time of 0.0033s. Unfortunately, this will only work if you are on v3.0.
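A rough sketch of how that comparison could be scripted is shown here. The helper names timed and compare_allocation and the bitstream path are assumptions for illustration only; pynq is imported lazily inside the function, so the file can still be loaded off-board, but compare_allocation itself only makes sense on a PYNQ v3.0 image.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_allocation(bitfile, shape, dtype="u8"):
    """Time allocate() before and after loading an overlay in this process.

    bitfile is a placeholder path to your own bitstream; "u8" is the
    NumPy code for uint64.
    """
    # Lazy import: only available on a PYNQ board.
    from pynq import Overlay, allocate

    # Overlay not yet loaded in this process: slow path expected.
    buf_before, t_before = timed(allocate, shape=shape, dtype=dtype)

    # Loading the overlay here lets later allocations take the device
    # handle from it directly (the v3.0 behaviour described above).
    overlay = Overlay(bitfile)
    buf_after, t_after = timed(allocate, shape=shape, dtype=dtype)

    # Release the contiguous memory again.
    buf_before.freebuffer()
    buf_after.freebuffer()
    return t_before, t_after
```

On the board this would be called along the lines of compare_allocation("design.bit", shape=(1024,)), where design.bit stands in for your own bitstream. The practical takeaway is to construct the Overlay first, in the same process, and only then allocate the buffers.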

Hope that helps explain things a bit.

All the best,


Beautiful explanation! Thank you so much!

That clarifies everything and makes total sense. I downgraded my system from v3.0 on my PYNQ-Z1 a couple of months back for a certain reason, but I will be upgrading my system back to v3.0 ASAP.

Thank you again!
