Is there a value for how many cycles the ARM PS might need to spend to read one value from an AXI peripheral into user-space code?
Is there any example code (in C) to do the reading part as a thread that runs in parallel to another thread, each executing on two different ARM cores?
[Edit: moved this to a new topic; please open new topics for any new questions instead of replying to old threads.]
There are two parts to this: the speed of the CPU, and the speed of the programmable logic (PL). For example, your PL may be running at anywhere from 100 MHz to 300 MHz.
I think a Python MMIO call takes about 100 µs, which is slow. Python is a productivity language. You can check this yourself.
In C, a (very bad) MMIO read could literally be:
int data = *((unsigned int *)register_address);
(You should do some casting to the correct types to avoid warnings/errors).
This would be faster than the Python/PYNQ MMIO call. You can also wrap C code with a thin Python wrapper so you are calling the C from Python, e.g. the CFFI example used in IIC. This can give the “best of both worlds”, and we use this in PYNQ for some code where we want to benefit from the efficiency of C/C++.
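For the plain C read mentioned above, a minimal sketch could look like the following. It assumes the peripheral is reached by mapping a file such as /dev/mem (which normally needs root) at the peripheral's page-aligned physical base address. The helper names mmio_map, mmio_read32 and mmio_unmap are made up for illustration and are not part of PYNQ or any library.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `len` bytes of `path` starting at page-aligned offset `base`.
 * Assumption: on the board, `path` would be "/dev/mem" and `base` the
 * AXI peripheral's physical address. Returns NULL on failure. */
static volatile uint32_t *mmio_map(const char *path, off_t base, size_t len)
{
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, base);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return (p == MAP_FAILED) ? NULL : (volatile uint32_t *)p;
}

/* Read one 32-bit register at `byte_offset` from the mapped base.
 * `volatile` keeps the compiler from caching or dropping the access. */
static uint32_t mmio_read32(volatile uint32_t *regs, size_t byte_offset)
{
    assert(byte_offset % 4 == 0); /* registers are word-aligned */
    return regs[byte_offset / 4];
}

/* Release a mapping created by mmio_map(). */
static void mmio_unmap(volatile uint32_t *regs, size_t len)
{
    munmap((void *)regs, len);
}
```

On a board, the call would be something like mmio_map("/dev/mem", base, 0x1000), with base taken from your own address map, followed by mmio_read32(regs, offset) for each register of interest.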
I’m not sure of any specific examples, but you should be able to compile OpenMP code by passing -fopenmp to the compiler.
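As a rough sketch of the two-thread idea with OpenMP, the code below runs a register-reading loop in one section and unrelated work in another. It is not taken from PYNQ. The function names are hypothetical, other_work is only a placeholder, and reg is assumed to point at an already-mapped register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum `n` consecutive reads of one 32-bit register. */
static uint64_t read_register_n(volatile uint32_t *reg, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += *reg; /* volatile: every iteration really reads */
    return sum;
}

/* Placeholder for whatever the second core should be doing. */
static uint64_t other_work(size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += i;
    return acc;
}

/* Run the reader and the other work as two OpenMP sections.
 * Without -fopenmp the pragmas are ignored and both run one after
 * the other, giving the same results. */
static void run_in_parallel(volatile uint32_t *reg, size_t n,
                            uint64_t *read_sum, uint64_t *work_sum)
{
    assert(reg != NULL && read_sum != NULL && work_sum != NULL);
    uint64_t r = 0, w = 0; /* shared; each section writes only its own */
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        r = read_register_n(reg, n);
        #pragma omp section
        w = other_work(n);
    }
    *read_sum = r;
    *work_sum = w;
}
```

OpenMP decides which thread runs which section, and the OS decides which core each thread lands on. To keep the threads on separate cores, the standard OMP_PROC_BIND and OMP_PLACES environment variables are one option.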
I was able to eke out about 10 M reads per second, and 26 M writes per second in C++, when using mmap in PS. Is this a reasonable value? PS running at 650ish MHz (unmodified).
Curiously, in Python, I was able to get only about 50 K reads per second – that’s about 200 times slower – is that expected or was I maybe doing something wrong? I used the MMIO.mmio_read API.
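Taking the numbers above at face value, 650 MHz divided by 10 M reads per second is roughly 65 CPU cycles per read, and 50 K reads per second in Python is about 20 µs per read. A measurement loop along these lines is one way to reproduce such figures. This is a sketch, not the code used for the numbers above; the function names are hypothetical and reg is assumed to point at an already-mapped register.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime under strict -std= */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Time `n` back-to-back reads of one register and return reads/second.
 * The last value read is stored in `*last`. */
static double reads_per_second(volatile uint32_t *reg, size_t n, uint32_t *last)
{
    assert(reg != NULL && last != NULL);
    struct timespec t0, t1;
    uint32_t v = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++)
        v = *reg; /* volatile access, not optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *last = v;
    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return secs > 0.0 ? (double)n / secs : 0.0;
}

/* Convert a read rate into CPU cycles spent per read. */
static double cycles_per_read(double cpu_hz, double reads_per_sec)
{
    return cpu_hz / reads_per_sec;
}
```

For example, cycles_per_read(650e6, 10e6) gives 65, which matches the estimate above.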
This is probably correct. Python is a productivity language but the downside to this is performance. If you are doing a lot of MMIO, you probably need to reconsider how you are transferring data and use a DMA or similar instead.