Pynq Allocate Speed

NNUT · November 22, 2022, 11:34am

** PYNQ Version **
Release 2021_11_18 14a7328
Board 2021_11_18 14a7328
GitHub - Xilinx/PYNQ: Python Productivity for ZYNQ

Standard Image
Board: Pynq-Z1

What I’m trying to do:
I’m trying to get an idea of how fast my M_AXI matrix multiplier IP is by sending in matrices and then using the “time” library to measure latency.

Issue:
I noticed that the following commands can take 0.103 seconds or more:

inbuff = allocate(shape=(mat1_r*mat1_c,), dtype=np.uint64)
in2buff = allocate(shape=(mat2_r*mat2_c,), dtype=np.uint64)
outbuff = allocate(shape=(mat1_r*mat2_c,), dtype=np.uint64)

This is an issue because, for the given example, my matrix multiplier itself takes only 0.00703 seconds.
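
For reference, here is a minimal sketch of how the two costs can be timed separately, so that allocation does not get counted as IP latency. `time_phase` and `benchmark_on_board` are names made up for this sketch, and `run_ip` is a hypothetical callable standing in for whatever code starts the multiplier and waits for it to finish; none of them are part of PYNQ.

```python
import time


def time_phase(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def benchmark_on_board(mat1_r, mat1_c, mat2_r, mat2_c, run_ip):
    """Time buffer allocation and the IP run separately (needs a PYNQ board).

    run_ip is a hypothetical callable taking (inbuff, in2buff, outbuff) that
    programs the IP, starts it, and blocks until it is done.
    """
    # Imported here so the timing helper above works without a board.
    import numpy as np
    from pynq import allocate

    def alloc_all():
        return (
            allocate(shape=(mat1_r * mat1_c,), dtype=np.uint64),
            allocate(shape=(mat2_r * mat2_c,), dtype=np.uint64),
            allocate(shape=(mat1_r * mat2_c,), dtype=np.uint64),
        )

    buffers, alloc_seconds = time_phase(alloc_all)
    _, run_seconds = time_phase(run_ip, *buffers)

    # Release the contiguous memory once the measurement is finished.
    for buf in buffers:
        buf.freebuffer()
    return alloc_seconds, run_seconds
```

If the same matrix sizes are used repeatedly, the buffers could also be allocated once and reused, so the allocation cost is paid only on the first call.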

Question:
Is there a way to overcome this slowdown caused by the “allocate” function?

I’ve attached my code to this post:
fpga_mmult_function.ipynb (3.9 KB)

Thank you!
Nick

stf · November 23, 2022, 12:06pm

Hi Nick,

Yes. The allocate function behaves differently depending on the type of device you are running on, either edge or x86. When an allocation occurs, a check is made to determine which type of device you are on, and then a handle for that device is fetched.

If we perform some profiling we can clearly see that getting a handle for the device is where the majority of the time is spent during allocation. On v2.7 this is a separate process that is communicated with via sockets (the pl_server), and in v3.0 a global state file is referenced to get information for the current device.

The actual allocation time is quite small compared to the overhead of all the communication with the pl_server (see highlighted above).
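
A rough way to get this kind of breakdown yourself is Python's built-in cProfile module. The sketch below is only an illustration: `profile_call` is a helper name introduced here, not something PYNQ provides.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=10, **kwargs):
    """Profile a single call of fn and return the report as text.

    The report lists the `top` entries sorted by cumulative time, which makes
    it easy to see which inner calls dominate.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        fn(*args, **kwargs)
    finally:
        profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()
```

On a board this could be used as `print(profile_call(allocate, shape=(1024,), dtype=np.uint64))`, assuming `allocate` and `np` have already been imported; the buffer size is just an example.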

On v3.0, if you load the overlay in the same process in which you want to perform the allocation, then internally we grab a handle for the device directly from the overlay object. This allows you to bypass the communication with the server and get a much faster allocation time.

The above example is on v3.0. As we can see, if the overlay has not been loaded (or loaded in a separate process) then the time taken to allocate a buffer is 0.113s. However, once the overlay has been loaded in the current process, then we can grab the device handle directly from it and get a much faster allocation time of 0.0033s. Unfortunately, this will only work if you are on v3.0.
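
A minimal sketch of that comparison, assuming v3.0 and a bitstream called `mmult.bit` (a placeholder name; use your own). `timed_allocate` and `compare_on_board` are helper names made up for this sketch, and the timings you get will depend on your board and image.

```python
import time


def timed_allocate(allocate_fn, n_elements, dtype):
    """Call allocate_fn for a 1-D buffer and return (buffer, elapsed_seconds)."""
    start = time.perf_counter()
    buf = allocate_fn(shape=(n_elements,), dtype=dtype)
    return buf, time.perf_counter() - start


def compare_on_board(bitfile="mmult.bit", n_elements=4096):
    """Time allocate() before and after loading the overlay in this process.

    Needs a PYNQ board; bitfile is a placeholder for your own design.
    """
    # Imported here so timed_allocate can be tried without a board.
    import numpy as np
    from pynq import Overlay, allocate

    # No overlay loaded in this process yet: the slow path described above.
    buf, before = timed_allocate(allocate, n_elements, np.uint64)
    buf.freebuffer()

    # Load the overlay in the same process that performs the allocation.
    overlay = Overlay(bitfile)

    # The device handle can now come from the overlay object directly.
    buf, after = timed_allocate(allocate, n_elements, np.uint64)
    buf.freebuffer()
    return before, after
```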

Hope that helps explain things a bit.

All the best,
Shane

NNUT · November 26, 2022, 9:41am

Beautiful explanation! Thank you so much!

That clarifies everything and makes total sense. I downgraded my system from v3.0 on my PYNQ-Z1 a couple months back for a certain reason, but I will be upgrading my system back to v3.0 ASAP.

Thank you again!
