What I’m trying to do:
I’m trying to get an idea of how fast my M_AXI matrix multiplier IP is by sending in matrices and using Python’s “time” library to measure the latency.
Issue:
I noticed that the following commands can take 0.103 seconds or more:
Yes, the allocate function behaves differently depending on the type of device you are running on, either edge or x86. When an allocation occurs, a check is made to determine what type of device you are on, and then a handle for that device is fetched.
If we perform some profiling, we can clearly see that getting a handle for the device is where the majority of the time is spent during allocation. On v2.7 the handle comes from a separate process (the pl_server) that is communicated with via sockets, and in v3.0 a global state file is referenced to get information about the current device.
The actual allocation time is quite small compared to the overhead of communicating with the pl_server, see highlighted above.
On v3.0, if you load the overlay in the same process in which you perform the allocation, then internally we grab a handle for the device directly from the overlay object. This allows you to bypass the lookup through the global state file and get a much faster allocation time.
The above example is on v3.0. As we can see, if the overlay has not been loaded (or was loaded in a separate process), the time taken to allocate a buffer is 0.113s. However, once the overlay has been loaded in the current process, we can grab the device handle directly from it and get a much faster allocation time of 0.0033s. Unfortunately, this only works on v3.0.
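A rough sketch of how the before/after comparison could be timed is below. The helpers time_call and compare_allocation are hypothetical names of mine, the bitstream path is whatever your own design uses, and the buffer shape and dtype are arbitrary choices. The pynq import sits inside the function so the timing helper can still be tried off-board.

```python
import time


def time_call(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def compare_allocation(bitstream_path, shape=(64, 64), dtype="u4"):
    """Time pynq.allocate before and after loading an overlay in this process.

    Assumes it runs on a PYNQ board; bitstream_path is supplied by the caller.
    """
    from pynq import Overlay, allocate  # imported lazily: only available on-board

    buf, t_before = time_call(allocate, shape=shape, dtype=dtype)
    buf.freebuffer()

    overlay = Overlay(bitstream_path)  # load the overlay in the current process

    buf, t_after = time_call(allocate, shape=shape, dtype=dtype)
    buf.freebuffer()
    return overlay, t_before, t_after
```

A single call is a noisy measurement, so repeating each allocation a few times and comparing the minimum or median would give a fairer picture than one sample.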
That clarifies everything and makes total sense. I downgraded my PYNQ-Z1 from v3.0 a couple of months back for a certain reason, but I will be upgrading back to v3.0 ASAP.