** PYNQ Version **
Release 2021_11_18 14a7328
Board 2021_11_18 14a7328
GitHub - Xilinx/PYNQ: Python Productivity for ZYNQ
- Standard Image
- Board: Pynq-Z1
What I’m trying to do:
I’m trying to get an idea of how fast my M_AXI matrix multiplier IP is by sending in matrices and using the “time” library to measure latency.
I noticed that the following commands can take 0.103 seconds or more:
inbuff = allocate(shape=(mat1_r*mat1_c,), dtype=np.uint64)
in2buff = allocate(shape=(mat2_r*mat2_c,), dtype=np.uint64)
outbuff = allocate(shape=(mat1_r*mat2_c,), dtype=np.uint64)
This is an issue because my matrix multiplier, for the given example, takes only 0.00703 seconds.
Is there a way to overcome this slowdown caused by the “allocate” function?
I’ve attached my code to this post:
fpga_mmult_function.ipynb (3.9 KB)
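One way to keep the allocation cost out of the latency measurement described above is to allocate the buffers once and time only the call that runs the IP. The following is a minimal sketch under that assumption; `time_kernel` and `run_mmult` are hypothetical names, not part of PYNQ or of the attached notebook.

```python
import time


def time_kernel(run_kernel, buffers, repeats=10):
    """Time only the kernel call; the buffers are allocated before timing starts.

    run_kernel is a hypothetical callable that starts the IP and blocks until
    it finishes. Returns (best, mean) wall-clock time in seconds.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_kernel(*buffers)
        timings.append(time.perf_counter() - start)
    return min(timings), sum(timings) / len(timings)


# On the board (assumed usage, requires pynq and numpy):
#   from pynq import allocate
#   import numpy as np
#   inbuff = allocate(shape=(mat1_r * mat1_c,), dtype=np.uint64)
#   in2buff = allocate(shape=(mat2_r * mat2_c,), dtype=np.uint64)
#   outbuff = allocate(shape=(mat1_r * mat2_c,), dtype=np.uint64)
#   best, mean = time_kernel(run_mmult, (inbuff, in2buff, outbuff))
```

Reusing the same three buffers across runs means the allocation overhead is paid once, outside the timed region.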
Yes. The allocate function behaves differently depending on the type of device you are running on, either edge or x86. When an allocation occurs, a check determines which type of device you are on, and then a handle for that device is fetched.
If we perform some profiling, we can clearly see that getting a handle for the device is where the majority of the time is spent during allocation. On v2.7 the device information is held by a separate process that is communicated with via sockets (the pl_server), and in v3.0 a global state file is referenced to get information for the current device.
The actual allocation time is quite small compared with the overhead of all the communication with the pl_server.
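For anyone who wants to reproduce this kind of profile on their own board, a minimal sketch using the standard-library cProfile module might look like the following. `profile_call` is a hypothetical helper, not a PYNQ function, and the exact entries in the report will depend on the PYNQ version.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=10, **kwargs):
    """Run fn under cProfile and return the top entries by cumulative time as text."""
    profiler = cProfile.Profile()
    profiler.runcall(fn, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


# On the board (assumed usage, requires pynq and numpy):
#   from pynq import allocate
#   import numpy as np
#   print(profile_call(allocate, shape=(1024,), dtype=np.uint64))
```

Sorting by cumulative time puts the calls that dominate the allocation at the top of the report.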
On v3.0, if you load the overlay in the same process in which you want to perform the allocation, then internally we grab a handle for the device directly from the overlay object. This allows you to bypass the communication with the server and get a much faster allocation time.
The above example is on v3.0. As we can see, if the overlay has not been loaded (or loaded in a separate process) then the time taken to allocate a buffer is 0.113s. However, once the overlay has been loaded in the current process, then we can grab the device handle directly from it and get a much faster allocation time of 0.0033s. Unfortunately, this will only work if you are on v3.0.
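The comparison can be sketched as below: time one allocation, load the overlay in the same process, then time a second allocation. This is an illustration only; `timed` and `compare_allocation` are hypothetical helpers, "mmult.bit" is a placeholder bitstream name, and the numbers you see will depend on your board and image.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_allocation(allocate, load_overlay, shape, dtype):
    """Time one allocation before and one after loading the overlay in this process."""
    _, before = timed(allocate, shape=shape, dtype=dtype)
    overlay = load_overlay()
    _, after = timed(allocate, shape=shape, dtype=dtype)
    return overlay, before, after


# On the board with v3.0 (assumed usage, requires pynq and numpy):
#   from pynq import Overlay, allocate
#   import numpy as np
#   overlay, before, after = compare_allocation(
#       allocate, lambda: Overlay("mmult.bit"), (1024,), np.uint64)
#   print(before, after)
```

In practice this just means creating the Overlay object first, in the same notebook or script, before any call to allocate.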
Hope that helps explain things a bit.
All the best,
Beautiful explanation! Thank you so much!
That clarifies everything and makes total sense. I downgraded my system from v3.0 on my PYNQ-Z1 a couple months back for a certain reason, but I will be upgrading my system back to v3.0 ASAP.
Thank you again!