Hello all,
After a short test of PYNQ 2.7 against PYNQ 3.0.1, with the same overlay (FIR Acceleration), the acceleration rate dropped almost 3 times.
PYNQ 3.0.1
#PYNQ 2.7
Thank you
Hi Brian,
Thanks for trying out the latest image and reporting what you find; it's really helpful for us.
I want to try and recreate the problem on my end to try and take a look at it. Could you share the benchmark you ran and, if possible, the bitstream/hardware handoff files? Even some code snippets would be helpful.
Many thanks,
Shane
Hi Shane,
I just tried to repeat what I did on the ZYNQ and UltraScale ZYNQ with PYNQ 2.7.
Here is the example I followed:
It is also well known as a good common ground for FIR + DMA acceleration.
As my board is a custom board, it cannot be directly plugged in and run, sorry about that.
In case we can align the scripts, here is the Jupyter script:
fir_filter.ipynb (3.6 KB)
Meantime, I also confirmed the behavior is repeatable: PYNQ 3.0.1 maintained a x5 acceleration rate.
Bests
Hi Brian,
Thanks for this.
I've run a couple of experiments on my Z2, timing performance with a simple loopback DMA overlay, and I'm afraid I'm not seeing the issues that you were seeing. We might need to dive a bit deeper into your specific setup, if you're up for it?
On my Z2 I tried to recreate the same data sizes as in your example notebook. It transfers 200K uint32 samples through the looped-back DMA and times the transfer in the same way.
import time
import numpy as np
from pynq import Overlay, allocate

ol = Overlay("dma.bit")
dma = ol.dma
dma_send = dma.sendchannel
dma_recv = dma.recvchannel

# 200K uint32 samples, matching the data size in the example notebook
data_size = 200000
input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)

# time the round trip through the looped-back DMA
start = time.time()
dma_send.transfer(input_buffer)
dma_recv.transfer(output_buffer)
dma_send.wait()
dma_recv.wait()
end = time.time()
print(f"{end - start}")
On v2.7
I got a performance of: 0.007594 seconds
On v3.0.1
I got a performance of: 0.004716 seconds
Could we profile your hw_exec_time calculation a bit more to see what is going on under the hood?
Would you mind using cProfile to get a breakdown of how much cumulative time is spent where? Could you change your code as I have done in my example below using cProfile, and put the results in this thread?
That might help us to see what is going on in a bit more detail.
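Something like the following would work. The body of tx_test here is only a stand-in workload so the snippet runs anywhere; on the board it would issue the two DMA transfers and wait on both channels, as in the timing code above:

```python
import cProfile
import io
import pstats

def tx_test():
    # stand-in workload; replace with dma_send.transfer(...),
    # dma_recv.transfer(...) and the corresponding wait() calls
    return sum(i * i for i in range(100_000))

def profile_report(func):
    """Run func under cProfile and return the report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(20)
    return buf.getvalue()

print(profile_report(tx_test))
```

The `sort_stats("cumulative")` call produces the same "Ordered by: cumulative time" table as in the traces we have been comparing.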
Thanks again so much for your help
All the best,
Shane
I am a bit confused. You are using a dummy DMA test, but could this be removing the very elements responsible for the slow response time?
Also, could reusing the old 2020.2 bit and hwh rather than the new 2022.1 ones be a possible cause?
Meantime, I would also like to know the overlay load time on your side, as the overlay load time is far slower than on PYNQ 2.7.
For the script, the timing is just counting the DMA:
These results are very puzzling: sometimes x5, x3, x20. It is as if a quantum state has been found.
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.009 0.009 {built-in method builtins.exec}
1 0.000 0.000 0.009 0.009 &lt;string&gt;:1(&lt;module&gt;)
1 0.000 0.000 0.009 0.009 /tmp/ipykernel_816/1243142666.py:1(tx_test)
2 0.000 0.000 0.006 0.003 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/lib/dma.py:106(transfer)
86 0.005 0.000 0.005 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/mmio.py:82(read)
4 0.000 0.000 0.004 0.001 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/lib/dma.py:68(running)
2 0.000 0.000 0.002 0.001 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/lib/dma.py:166(wait)
2 0.001 0.001 0.001 0.001 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/_3rdparty/xrt.py:517(xclSyncBO)
1 0.000 0.000 0.001 0.001 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/buffer.py:104(flush)
1 0.000 0.000 0.001 0.001 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/pl_server/xrt_device.py:383(flush)
28 0.000 0.000 0.001 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/lib/dma.py:73(idle)
26 0.000 0.000 0.001 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/lib/dma.py:82(error)
2 0.000 0.000 0.001 0.000 {built-in method builtins.print}
8 0.000 0.000 0.001 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/ipykernel/iostream.py:518(write)
6 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/mmio.py:130(write_mm)
1 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/buffer.py:109(invalidate)
1 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/pl_server/xrt_device.py:394(invalidate)
2 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/buffer.py:93(virtual_address)
8 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/ipykernel/iostream.py:429(_is_master_process)
6 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/buffer.py:89(physical_address)
8 0.000 0.000 0.000 0.000 {built-in method posix.getpid}
8 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/ipykernel/iostream.py:448(_schedule_flush)
8 0.000 0.000 0.000 0.000 {method '__exit__' of '_thread.RLock' objects}
8 0.000 0.000 0.000 0.000 {method 'write' of '_io.StringIO' objects}
8 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
8 0.000 0.000 0.000 0.000 {built-in method builtins.len}
2 0.000 0.000 0.000 0.000 {built-in method time.time}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
Thank you
A small update on the overlay bitstream revision difference:
I have tried both the 2020.2-generated overlay (bit + hwh) and the 2022.1 one.
On PYNQ 3.0.1 the situation is the same: on every first boot and first run the acceleration rate is x5, maybe +/- 0.2, but almost certainly x5.
So the overlay bitstream and hwh are not the cause, as DMA 7.1 has not been updated to a newer version; only the revision changed.
Thank you
Hi Brian,
Many thanks for gathering that trace. It wasn't clear from the profiling results where the potential issue was, so I decided to re-implement the benchmark you were running for the PYNQ-Z2. Unfortunately, I was not able to see the performance issue that you were seeing on my Z2 board.
Below are some screenshots:
v2.7
v3.0.1
The results did vary a bit between runs; the data transfers are probably a bit small for this to be a reliable benchmark, but these are typical values I was seeing.
I'm really not sure what the issue might be with your setup. For your results, was the software execution time varying a lot while the hardware execution time remained constant, or were you seeing large variation in the hardware execution time too?
Regarding the overlay loading time: there has been an increase in the time this takes as we are parsing more of the metadata for various reasons, such as deeper runtime introspection of the configured design and for build time automation flows. This increase is most noticeable on the Zynq7000 devices. To mitigate this, we have implemented a caching feature that caches the relevant metadata for subsequent loads, so it should be less noticeable the second time an overlay is loaded.
All the best,
Shane
Thank you for spending time on this investigation.
A) The result of your PYNQ 2.7 run is 100% aligned with my PYNQ 2.7 setup.
B) For the result on your PYNQ 3.0.1, I would like to know whether this was the first clean run of the board.
Allow me to elaborate on what I mean by a clean run:
i) Power up
ii) Open Jupyter Notebook
iii) Run all cells in the FIR Acceleration script [ONCE ONLY]
With the above setup, PYNQ 3.0.1 will keep giving the same x5 acceleration rate.
However, you can clearly see that if the same FIR script is run several times, the result becomes unpredictable.
That result is almost fixed and repeatable, i.e. a x4.5 acceleration rate.
That is why I first wrapped the code in a method and ran the time check, then added the extra code you are going to compare. Once a recovery state is triggered, the acceleration goes from x5 or x2 back to x20, even better than PYNQ 2.7.
It is very important that a clean run is the key here. If you run the script multiple times, or your board executes other things before this comparison, it might trigger some recovery stage in the environment (which you do see in the extra code you are referring to).
Meantime, I also double-checked the board build flag settings:
Many thanks,
Brian
Hi Brian,
Thanks for trying a few more things. Let me make sure that I understand correctly. You boot the board and run the notebook and find that the overall acceleration shown is low. However, after running the notebook a few times, the execution times become unpredictable. Is that correct?
In the post I made yesterday, I ran the notebook immediately after the board booted, so I'm afraid that I'm not able to recreate the same problem you are facing on my Z2.
I wanted to check the performance variation you were seeing, so I made a script to run the code from the notebook multiple times on both v2.7 and v3.0.1 and gathered the results. Each script ran the code 50 times from a freshly booted board, measuring the hardware execution time and the software execution time. In each case, JupyterLab was open with two terminals running: one to launch the experiment script and one to monitor RAM usage with htop.
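For reference, the timing loop in the experiment script looked roughly like this. The lambda workload is only a placeholder so the sketch runs anywhere; on the board, func would be the HW FIR (DMA transfer) or SW FIR code from the notebook:

```python
import statistics
import time

def time_runs(func, runs=50):
    """Call func `runs` times and return the list of wall-clock durations in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples

# placeholder workload; on the board this would be the HW FIR via DMA
# or the SW FIR from the notebook
samples = time_runs(lambda: sum(i * i for i in range(50_000)))

print(f"min={min(samples):.6f}s max={max(samples):.6f}s "
      f"mean={statistics.mean(samples):.6f}s stdev={statistics.stdev(samples):.6f}s")
```

time.perf_counter is used rather than time.time since it is monotonic and has higher resolution for short intervals.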
Below is a screenshot of the summary results; I have also attached an Excel sheet with the full results. I'm happy to share my experiment scripts as well if you wish:
The results show some increase in variation in the execution time between v2.7 and v3.0.1, but to a much lesser extent than you were seeing. And again, I wonder if this is the best benchmark to use, as the data-transfer size is relatively small and will be sensitive to lots of noise from the OS scheduler.
One thing we have noticed with v3.0.1 is that the RAM usage is higher than in previous versions. If we have many notebooks open, we can use up memory resources, and things can become quite sluggish. Could this explain the significant variation you are seeing in execution time?
All the best,
Shane
summary_results.csv.txt (434 Bytes)
v2.7_res.csv.txt (2.1 KB)
v3.0.1_res.csv.txt (1.2 KB)
Thank you so much for the effort again!
Q1. Correct. A "clean run" shows low acceleration, while after several runs it becomes x16 or above.
Worth mentioning: if profile.run is applied on a "clean run", the acceleration also becomes x16 or above.
Q2. I think you misunderstood my case; my result is a CLEAN RUN, i.e. boot up -> open Jupyter Notebook -> run the FIR script ONCE.
T1. I will try to repeat the task you ran. Let's compare the results.
Thank you
Thanks Brian.
I think I did understand Q2; unfortunately, I still wasn't observing that case.
For Q1, profiling will have quite an effect on execution time; it's more for seeing what is called and how often, so maybe that is affecting the results.
All the best,
Shane
Here is a first-hand result:
(For ACC MAX/MIN I mixed up the labels.)
Allow me to explain the setup a bit more:
A CLEAN RUN is applied to both image tests
Packages are only loaded once
Only the DMA data transfer + HW FIR or SW FIR is run 50 times and compared
Loop body overhead is included
Thank you
Hi Brian,
Oh, interesting. In the average case, the results are similar between us. However, in the worst case, your performance looks a lot worse.
What I suspect is happening is that on v3.0.1, some high-priority process occasionally preempts the task and hogs the processor, causing the occasional long execution time. Unfortunately, it's hard to know what that task is. Which Zynq device do you have on your board? The only way we can try to determine what the preempting process might be is to catch it in action.
All the best,
Shane
The chip is the highest-LE part, the XC7Z100-FFG900-2; the ZYNQ ARM is set to 800 MHz and the DDR is set to 1066, with the full 32-bit width addressed.
I am a bit confused: the build script is just PYNQ 3.0.1's make BOARDS=xxxx, nothing is additionally set on my side, and only system.xsa, base.bit, base.hwh, base.tcl and the board spec are provided. The base overlay is an empty overlay, so I cannot see how this could make such a difference.
Bests
Hi Brian,
I did a bit more digging here, and I was able to recreate the situation you saw occasionally. If I try and run the notebook as fast as I can directly after boot, I observe a slowdown in performance on the first run, and I get the same values you saw.
I did some profiling of the processes running on my Z2 directly after boot and noticed that the unattended-upgrades process was running and consuming some resources.
I tried removing it with this command:
sudo apt remove unattended-upgrades
Then rebooted the board and tried to run the notebook as fast as I could again.
Once that was removed, I could no longer recreate the same performance issue directly after boot. Disabling unattended-upgrades may help with your issue also.
Many thanks,
Shane
Thank you Shane for making a deep investigation here.
I am not a deep software guy; I might need more details on what "unattended-upgrades" does and what the impact of removing it would be.
Please enlighten us, so we can learn more about it.
Meantime, I am also confused about why this is the case on PYNQ 3.0.1 but not on PYNQ 2.7.
I guess there are modifications to the kernel backbone, but the flow should be aligned with the old revision. I didn't pay much attention to the Ubuntu version (maybe that is why there is a difference?)
Bests,
Brian
Hi Brian,
No problem at all; thanks for raising the issue.
So unattended-upgrades is a service that keeps all your Ubuntu packages up to date. I believe it is roughly the same as running:
sudo apt-get update && sudo apt-get upgrade -y
It has been there for a while, but on the latest image it seems to run more often (such as on startup).
These updates are important, so disable them at your own risk, but if you are worried about consistent performance, then it might be better to control the updates yourself.
It is not necessarily just unattended-upgrades that is causing the issue; the system simply takes a while to settle after boot, with various tasks and services running. I just noticed that unattended-upgrades was running, so I removed it and no longer observed the issue you saw.
Here is a StackExchange link discussing unattended-upgrades: Terminate unattended-upgrades or whatever is using apt in ubuntu 18.04 or later editions - Ask Ubuntu
All the best,
Shane
Here is a short test on a CLEAN RUN after removing the update service:
Steps are:
So from the result, the update service is not the cause, and removing it even made things worse, as previously the acceleration rate was about x4.
Q1) Is there any kernel reset or resource-freeing command that can be inserted into the bootup sequence after the Jupyter service is ready?