TL;DR: is there a way to develop C++ code that interacts with the PL without rebuilding the PYNQ OS image, so that I can combine it with Python and use an off-the-shelf PYNQ image together with my custom hardware?
Extended version of the problem:
I would like to combine some custom high-performance C++ code interacting with the PL (which I think I have to develop within Vitis Embedded) with the Python infrastructure of PYNQ. I know I can wrap C++ code in Python, which is fine.
Correct Dev flow?
I was wondering what the development flow should be for C++-based drivers that interact with the PL, such that I can use off-the-shelf PYNQ images.
Should I recreate an environment for Vitis Embedded that resembles the PYNQ image and then develop the C++ drivers there?
Are there existing C++ APIs I can build on top of?
I like the idea of combining the productivity of PYNQ and Python with the performance of C++. Unfortunately, we don’t have any C++ APIs that you can use to do this directly, but it is something that we have considered in the past. We have some examples where we use pybind11 with DPU-PYNQ that you might be able to take some inspiration from (https://github.com/Xilinx/DPU-PYNQ/blob/master/pynq_dpu/notebooks/dpu_resnet50_pybind11.ipynb).
In the blog post there is a simple example of using PYNQ to configure a simple IP and then using mmap from C++ in the notebook to interact with it. If you have a Kria SoM device, there are also install scripts for setting everything up. We have not provided a flow for other boards, but you might be able to recreate the environment by following our steps.
I’d like to add to Shane’s comment. You can develop natively on target, no need for cross-compilation.
If you are targeting MPSoC devices (aarch64 architecture), you can develop remotely using VS Code
The /sys/ interface provides a way to program bitstreams.
You can read the source code of fpgautils for more information.
But personally, I just flash them with a minimal python code…
Access to memory-mapped devices in C++ is done with mmap as @stf said. It is the same in every language and very well documented.
Xilinx’s AXI IPs are very well documented in their Product Guides. They are also simple to control, so you won’t miss much by not having drivers. Interrupt handling will take most of the effort if you really want it.
The main limitation you will face is the usage of DMA, which requires access to a contiguous memory region with a known physical address. I have yet to find a proper guide on this, and it is the main reason I use Pynq (this and flashing bitstreams). Before Pynq, I was using mmap to fill buffers at a fixed address. It is risky and has some cache problems, but it works.
Indeed, I do not want to eliminate Python from my dev flow and my applications. But to some extent, I might need C++ APIs to handle multi-threaded AXI communications with the PL (both AXI-Lite for configuration and AXI master for large portions of memory), while the rest of my flow uses Python and the PYNQ infrastructure, e.g., for loading the bitstream and many sw(-hw) parts.
I have many accelerators that have to work independently, and I would like the computations to be as independent as possible.
Unfortunately, Python profoundly limits multi-threading/multi-processing (and I could not find a proper solution to this issue, which might be my fault).
Last remark: I am targeting both Alveo and Zynq devices, and that is why exploiting PYNQ is fundamental in my approach. But while on Alveo there is the standard OpenCL way, on Zynq (with an off-the-shelf PYNQ image) I am not aware of such a standard.
I might need C++ APIs to handle multi-threaded AXI communications (both AXI-Lite for configuration and AXI master for large portions of memory)
Then mmap is the way, it is very fast and efficient.
I do not know if concurrent access is safe, but in the worst case you will find out and add a semaphore.
By the way, if you experience the system freezing when using this, it is because you are hitting an address that does not respond, and the AXI bus will hang. On Zynq MPSoC you can activate a watchdog to release the bus if it hangs, but not on Zynq-7000.
Unfortunately, Python profoundly limits multi-threading/multi-processing (and I could not find a proper solution to this issue, which might be my fault).
I also noticed that accessing IP registers is very slow from Pynq: each access takes more than 1 µs, with inconsistent timing.
You can try to use mmap from Pynq or Python directly and see if it improves performance enough for your needs.
Alveo
I do not have experience on Alveo and OpenCL, I am just a Zynq fanboy.
So I am not sure mmap would work with a PCIe card.