Pynq Allocate Speed

** PYNQ Version **
Release 2021_11_18 14a7328
Board 2021_11_18 14a7328
  • Standard Image
  • Board: Pynq-Z1

What I’m trying to do:
I’m trying to get an idea of how fast my M_AXI matrix multiplier IP is by sending in matrices and using the “time” library to measure latency.

I noticed that the following commands can take 0.103 seconds or more:

inbuff = allocate(shape=(mat1_r*mat1_c,), dtype=np.uint64)
in2buff = allocate(shape=(mat2_r*mat2_c,), dtype=np.uint64)
outbuff = allocate(shape=(mat1_r*mat2_c,), dtype=np.uint64)

This is an issue because my matrix multiplier, for the given example, takes only 0.00703 seconds.

Is there a way to overcome this slowdown caused by the “allocate” function?

I’ve attached my code to this post.

Hi Nick,

Yes, the allocate function behaves differently depending on the type of device you are running on, either edge or x86. When an allocation occurs, a check is made to determine what type of device you are on, and then a handle for that device is fetched.

If we perform some profiling we can clearly see that getting a handle for the device is where the majority of the time is spent during allocation. On v2.7 the handle comes from a separate process (the pl_server) that is communicated with via sockets, and in v3.0 a global state file is referenced to get information for the current device.
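
For anyone who wants to reproduce this kind of profile, here is a minimal sketch using Python's built-in cProfile. The helper name profile_call is made up for illustration and is not part of PYNQ; the commented-out usage assumes you are on a board with pynq and numpy installed.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=10, **kwargs):
    """Run fn(*args, **kwargs) under cProfile.

    Returns (result, report_text), where report_text lists the `top`
    entries sorted by cumulative time. Hypothetical helper, not a PYNQ API.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


# On the board (assumes pynq and numpy are available):
#   import numpy as np
#   from pynq import allocate
#   buf, report = profile_call(allocate, shape=(1024,), dtype=np.uint64)
#   print(report)
```

Sorting by cumulative time should make it easy to spot whether the time goes into fetching the device handle or into the allocation itself.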

The actual allocation time is quite small compared to the overhead of all the communication with the pl_server (see the highlighted entries above).

On v3.0, if you load the overlay in the same process that performs the allocation, then internally we grab a handle for the device directly from the overlay object. This allows you to bypass the communication with the server and get a much faster allocation time.

The above example is on v3.0. As we can see, if the overlay has not been loaded (or loaded in a separate process) then the time taken to allocate a buffer is 0.113s. However, once the overlay has been loaded in the current process, then we can grab the device handle directly from it and get a much faster allocation time of 0.0033s. Unfortunately, this will only work if you are on v3.0.
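
If you want to check this on your own board, a rough sketch of the comparison could look like the following. The helper names time_call and compare_allocation_times, the bitfile argument, and the buffer size are all placeholders for illustration; the pynq imports sit inside the function so the timing helper can be used anywhere.

```python
import time


def time_call(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_allocation_times(bitfile, n_elems=1024):
    """Time allocate() before and after loading an overlay in this process.

    Assumes this runs on a PYNQ board; `bitfile` is a placeholder path to
    your own bitstream. Returns (seconds_before, seconds_after).
    """
    import numpy as np
    from pynq import Overlay, allocate

    # Allocation with no overlay loaded in this process.
    buf, before = time_call(allocate, shape=(n_elems,), dtype=np.uint64)
    buf.freebuffer()

    # Load the overlay in the same process, then allocate again.
    overlay = Overlay(bitfile)
    buf, after = time_call(allocate, shape=(n_elems,), dtype=np.uint64)
    buf.freebuffer()
    return before, after
```

The same time_call helper can wrap only the call that runs your IP, with the buffers allocated once beforehand, so the allocation overhead stays out of the latency you measure.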

Hope that helps explain things a bit.

Beautiful explanation! Thank you so much!

That clarifies everything and makes total sense. I downgraded my system from v3.0 on my PYNQ-Z1 a couple months back for a certain reason, but I will be upgrading my system back to v3.0 ASAP.

Thank you again!

