PYNQ 3.0.1 Performance Degraded?

Hello all,

After a short test, PYNQ 2.7 and PYNQ 3.0.1.
Same overlay (FIR Acceleration) the acceleration rate drop almost 3 times.

PYNQ 3.0.1

#PYNQ 2.7

Thank you

Hi Brian,

Thanks for trying out the latest image and reporting what you find; it’s really helpful for us.

I want to try and recreate the problem on my end to try and take a look at it. Could you share the benchmark you ran and, if possible, the bitstream/hardware handoff files? Even some code snippets would be helpful.

Many thanks,
Shane

@stf

Hi Shane,

I just try to repeat what ZYNQ and Ultra-scale ZYNQ on PYNQ 2.7:
Here is the example I follow:

Which is also welly known as a good acceleration common ground on FIR + DMA.
As my board is a custom board, this cannot directly plug and run sorry about that.

In case we can align the script here is the Jupyter script:
fir_filter.ipynb (3.6 KB)

Meantime, I also confirmed the behavior is repeatable which PYNQ 3.0.1 maintained @ 5 acceleration rate.

Bests

1 Like

Hi Brian,

Thanks for this.
I’ve ran a couple of experiments on my Z2 with a simple loopback DMA overlay timing performance and I’m afraid I’m not seeing the issues that you were seeing. We might need to dive a bit deeper into your specific setup if you’re up for it?

On my Z2 I tried to recreate the same data-sizes in your example notebook. It’s transfering 200K uint32 samples through the loopbacked dma and timing the transfer in the same way.

from pynq import Overlay, allocate
import numpy as np
ol = Overlay("dma.bit")

dma = ol.dma
dma_send = dma.sendchannel
dma_recv = dma.recvchannel

data_size = 200000
input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)

start = time.time()
dma_send.transfer(input_buffer)
dma_recv.transfer(output_buffer)
dma_send.wait()
dma_recv.wait()
end = time.time()
print(f"{end - start}")

On v2.7 I got a performance of: 0.007594 seconds
On v3.0.1 I got a performance of: 0.004716 seconds

Could we profile your calculation for hw_exec_time a bit more to see what is going on under the hood?

Would you mind using cProfile to try and get a breakdown of how much cumulative time is spent where? Would you mind changing your code like I have done in my example below using cProfile and putting the results in this thread?

That might help us to see what is going on in a bit more detail.
Thanks again so much for your help :slight_smile:

All the best,
Shane

2 Likes

@stf

I am a bit confuse. You using dummy DMA test, but could this removing issue elements from the slow response time?

Also, are there any possible cause reusing old 2020.2 bit and hwh rather than new 2022.1?

Meantime, I would also like to know the overlay load time on your side as the overlay load time is far far slower than PYNQ 2.7.

For the script, the timing is just counting the DMA:
image

Here is a function wrapped test case:

Here is what you would like me to add:

These result are very very puzzle, sometimes x5, x3, x20. A quantum state is found.

The print logs

`Ordered by: cumulative time

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.009 0.009 {built-in method builtins.exec}
1 0.000 0.000 0.009 0.009 :1()
1 0.000 0.000 0.009 0.009 /tmp/ipykernel_816/1243142666.py:1(tx_test)
2 0.000 0.000 0.006 0.003 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/lib/dma.py:106(transfer)
86 0.005 0.000 0.005 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/mmio.py:82(read)
4 0.000 0.000 0.004 0.001 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/lib/dma.py:68(running)
2 0.000 0.000 0.002 0.001 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/lib/dma.py:166(wait)
2 0.001 0.001 0.001 0.001 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/_3rdparty/xrt.py:517(xclSyncBO)
1 0.000 0.000 0.001 0.001 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/buffer.py:104(flush)
1 0.000 0.000 0.001 0.001 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/pl_server/xrt_device.py:383(flush)
28 0.000 0.000 0.001 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/lib/dma.py:73(idle)
26 0.000 0.000 0.001 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/lib/dma.py:82(error)
2 0.000 0.000 0.001 0.000 {built-in method builtins.print}
8 0.000 0.000 0.001 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/ipykernel/iostream.py:518(write)
6 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/mmio.py:130(write_mm)
1 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/buffer.py:109(invalidate)
1 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/pl_server/xrt_device.py:394(invalidate)
2 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/buffer.py:93(virtual_address)
8 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/ipykernel/iostream.py:429(_is_master_process)
6 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/buffer.py:89(physical_address)
8 0.000 0.000 0.000 0.000 {built-in method posix.getpid}
8 0.000 0.000 0.000 0.000 /usr/local/share/pynq-venv/lib/python3.10/site-packages/ipykernel/iostream.py:448(_schedule_flush)
8 0.000 0.000 0.000 0.000 {method ‘exit’ of ‘_thread.RLock’ objects}
8 0.000 0.000 0.000 0.000 {method ‘write’ of ‘_io.StringIO’ objects}
8 0.000 0.000 0.000 0.000 {built-in method builtins.isinstance}
8 0.000 0.000 0.000 0.000 {built-in method builtins.len}
2 0.000 0.000 0.000 0.000 {built-in method time.time}
1 0.000 0.000 0.000 0.000 {method ‘disable’ of ‘_lsprof.Profiler’ objects}
`

Thank you

1 Like

@stf

Update a bit also on the overlay bitstream revision different:

I had tried both 2020.2 generated overlay (bit + hwh) and 2022.1 one.
The result on PYNQ 3.0.1 is maintained same situation acceleration rate every first boot and first run are x5 might +/- 0.2 but almost sure it is x5 result.

So overlay bitstream and hwh are not the case as DMA 7.1 are not updated with newer version but only revision changed.

Thank you

Hi Brian,

Many thanks for gathering that trace. It wasn’t clear from the profiling results where the potential issue was, so I decided to re implement the benchmark you were running for the PYNQ-Z2. Unfortunately I was not able to see the performance issue that you were seeing on my Z2 board.

Below are some screenshots:

v2.7

v3.0.1

The results did vary a bit between runs, the size of the data transfers are probably a bit small for this to be a reliable benchmark, but these are typical values I was seeing.

I’m really not sure what the issue might be with your setup. For your results was the software execution time varying a lot and the hardware execution time remaining constant, or were you seeing large variation in the hardware execution time?

Regarding the overlay loading time: there has been an increase in the time this takes as we are parsing more of the metadata for various reasons, such as deeper runtime introspection of the configured design and for build time automation flows. This increase is most noticeable on the Zynq7000 devices. To mitigate this, we have implemented a caching feature that caches the relevant metadata for subsequent loads, so it should be less noticeable the second time an overlay is loaded.

All the best,
Shane

3 Likes

@stf

Thank you for spending time on such investigation.
A) The result of your PYNQ 2.7 is 100% aligned with my PYNQ 2.7 setup.
B) The result on your PYNQ 3.0.1, I would like to know is this the first clean run of the board.
Allow me to elaborate a bit more what do I mean clean run:
i) Power Up
ii) Open Juypter NoteBook
iii) Run All cell in the FIR Acceleration script [ONCE ONLY]

Above the setup of PYNQ 3.0.1 the result will be keep in the same acceleration rate X5.
However, you can clearly see if the same FIR script is ran several times, the result becoming unpredictable.

Result of clean run with FIR-Acceleration PYNQ 3.0.1

This result almost fixed and repeatable aka x4.5 acceleration rate.

The result of clean run with FIR-Acceleration PYNQ 2.7

However, once the block related to DMA is ran couple times on PYNQ 3.0.1

  • Above from these experiment run the code couple times will resolve the degrade state?

That is why I first wrap the code with the method style and ran the time check, then add the extra code you are going to compare. Once a recover state is triggered, the acceleration goes from x5 or x2 back to x20 even better than PYNQ 2.7

This is very important that a clean run is the key in here. As if you run multi times or your board is execute other things that is before this comparison, there might trigger some recover stage on the environment? (Which you do see in the extra code you are referring).

@stf

Meantime, I also double check about the board build flag settings:


The red squared flags are missing before and adding those flags make no different at all.

Result are the same:

Many thanks,
Brian

Hi Brian,

Thanks for trying a few more things. Let me make sure that I understand correctly. You boot the board and run the notebook and find that the overall acceleration shown is low. However, after running the notebook a few times, the execution times become unpredictable. Is that correct?

In the post I made yesterday, I ran the notebook immediately after the board booted, so I’m afraid that I’m not able to recreate the same problem you are facing on my Z2.

I wanted to check the performance variation you were seeing, so I made a script to run the code from the notebook multiple times on both v2.7 and v3.0.1 and gathered the results. Each script ran the code 50 times, measuring the hw execution time and sw execution time from a freshly booted board. In each case, Jupyter lab was open with two terminals running, one to launch the experiment script and one to monitor ram usage with htop.

Below is a screenshot of the summary results; however, I have attached an excel sheet with the full results. I’m also happy to share my experiment scripts if you wish:

image

The results show some increase in variation in the execution time between v2.7 and v3.0.1. However, to a much lesser extent than you were seeing. And again, I wonder if this is the best benchmark to be used as the data-transfer size is relatively small and will be sensitive to lots of noise from the OS scheduler.

One thing we have noticed with v3.0.1 is that the RAM usage is higher than in previous versions. If we have many notebooks open, we can use up memory resources, and things can become quite sluggish. Could this explain the significant variation you are seeing in execution time?

All the best,
Shane

  • ps: Sorry I had to upload the results as .txt files but they are actually .csv as I was not permitted to upload excel worksbooks or csv files.

summary_results.csv.txt (434 Bytes)
v2.7_res.csv.txt (2.1 KB)
v3.0.1_res.csv.txt (1.2 KB)

1 Like

@stf

Thank you so much of the effort again!

Q1. Correct. A “clean run” show acceleration low while after several runs it become x16 or above.
Worth mention is that if the profile.run is applied in a “clean run”, acceleration become x16 or above.

Q2 I think you misunderstand my case, my result is a CLEAN RUN aka bootup->open Jupyter NoteBook-> run the FIR script ONCE.

T1 I will try what I can also repeat the task you had ran. Let’s compare the result.

Thank you

1 Like

Thanks Brian.

I think I did understand Q2. Unfortunately, I still wasn’t observing that case.

For Q1, profiling will have quite an effect on execution time, it’s more to see what is called and how often, so maybe that is affecting the results.

All the best,
Shane

@stf

Here is a first hand result:
image

For ACC MAX MIN I mixed up the label~

Allow me to explain a bit more the setup:

CLEAN RUN is applied to both image test
package only load once
Only DMA data transfer + HW FIR or SW FIR is run 50 times and comparing
Loop body overhead is included

Thank you

Hi Brian,

Oh, interesting. In the average case, the results are similar between us. However, in the worst case, it looks like your performance is a lot worse.

What I suspect is happening for you is that in v3.0.1, some high-priority process is occasionally preempting the task and hogging the processor, causing the occasional long execution time. Unfortunately, it’s hard to know what that task is. What Zynq device do you have on your board? The only way we can try to determine what the preempting process might be is to try to catch it in action.

All the best,
Shane

1 Like

@stf

The chips is the highest LE - XC7Z100ffg900-2 ZYNQ ARM is set yo 800M and DDR is set to 1066 full 32bit addressed.

I am bit confused, the build script is using PYNQ 3.0.1 make BOARDS=xxxx nothing are additionally set on my side and only system.xsa base.bit base.hwh base.tcl and the board spec are set. While the base overlay is just an empty overlay so I cannot see how could this make such a different.

Bests

Hi Brian,

I did a bit more digging here, and I was able to recreate the situation you saw occasionally. If I try and run the notebook as fast as I can directly after boot, I observe a slowdown in performance on the first run, and I get the same values you saw.

I did some profiling of the processes running on my Z2 directly after boot and noticed that the unattended-upgrades process was running and consuming a few resources.

I tried removing it with this command:

sudo apt remove unattended-upgrades

Then rebooted the board and tried to run the notebook as fast as I could again.

Once that was removed, I could no longer recreate the same performance issue directly after boot. Disabling unattended-upgrades may help with your issue also.

Many thanks,
Shane

1 Like

@stf

Thank you Shane for making a deep investigation here.
I am not a deep software guy, might need more details what “unattended-upgrades” does and what are the impacts would this removable.
Please enlighten us, so we can learn more on that.

Meantime, I am also confused on why this is the case in PYNQ 3.0.1 but not the PYNQ 2.7.
I guess there are modification on the kernel backbone but the flow should be aligned with old revision, I didn’t paid more attention to the Ubuntu version (maybe this is the case why there is a different?)

Bests,
Brian

1 Like

Hi Brian,

No problem at all; thanks for raising the issue.

So unattended-upgrades is a service that ensures that all your ubuntu packages are up to date. I believe it is the same as running:

sudo apt-get updates && sudo apt-get upgrade -y

It has been there for a while, but on the latest image it seems to be running more often (like on startup).
These updates are important, so disable them at your own risk, but if you are worried about consistent performance, then it might be better to control the updates yourself.

It is not necessarily just unattended-upgrades that will be causing the issue, just that the system will take a while to settle on boot with various tasks and services running. I just noticed that unattended-upgrades was running so I removed it and was no longer observing the issue you saw.

Here is a StackExchange link discussing unattended-upgrades: Terminate unattended-upgrades or whatever is using apt in ubuntu 18.04 or later editions - Ask Ubuntu

All the best,
Shane

@stf

A short test on a CLEAN RUN after removing the update services:
Steps are:

  • sudo apt remove unattended-upgrades
  • sudo halt
  • power cycle
  • power on and wait Jupyter login
  • A complete 50 run on software then hardware FIR ACC

So from the result, the update service is not the case and even make worst as previously is about x4 acceleration rate.

Result

image

Also I do more experiment on the kernel:

  • Restart and Run All shows a good > x20 fixed acceleration rate: [21.95, 27.04, 26.60]
  • so there must be something holding up the kernel resources after boot and makes the run time slow on the 1st script execute after boot.

Q1) Any kernel reset or resources free command can be insert to the bootup after the Jupyter service is ready?