Regarding speed of overlay

I am trying an AXI4-Lite IP and an AXI GPIO based IP on PYNQ, and I found that AXI GPIO is a bit slower. How can I accelerate a 12-bit multiplier on PYNQ? Every time I measure it, the software version wins the race.
Do I need to read more? If so, please point me to an article.
Otherwise, please help me solve my issue.

Hi,
Can you increase the clock of your system?
Can you use AXI stream instead of AXI lite to speed up the transfer?

TLDR: To get benefit from the PL, you need to process significant amounts of data in parallel rather than small individual data for simple calculations.

If you are trying to write two 12-bit values to the PL and read back the result of a single 12-bit multiply, then you are correct: the PS (or SW) will be much faster, for the following reasons:

For the calculation itself:
Depending on which Zynq device you have, your PS (Arm Cortex-A9 or Cortex-A53) is likely running at 600 MHz or more, up to 1.5 GHz. The PL will typically run at a few hundred MHz. CPUs excel at sequential calculations on relatively small amounts of data. If you try to do a single multiply, both the CPU and PL can do this, but the CPU will “win” due to the faster clock speed.
The benefit of the FPGA is processing lots of data in parallel, whereas the CPU can only do multiplications sequentially.

A single multiply is a relatively simple operation that (most) CPUs have dedicated hardware for. FPGAs also have dedicated hardware for multipliers (DSP slices).

The PL can do huge numbers of multiplications in parallel. For example, the Zynq 7020 on the PYNQ-Z2 has 220 DSP slices, each of which can do a 25x18-bit multiply and accumulate, all in parallel. The largest Zynq UltraScale+ has 2,520 DSP slices (27x18 bit).
These DSPs can run at a few hundred MHz depending on your design. Even though they may be running at a lower clock speed compared to the PS, they can do ~3 orders of magnitude more multiplies than the PS on just the DSP slices.
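As a rough back-of-the-envelope check on the numbers above, here is a minimal sketch of peak multiply throughput. The DSP counts come from this post; the clock speeds and the "one multiply per cycle per unit" figure are assumptions chosen for illustration, not measurements, and the resulting ratio depends heavily on which device and clocks you plug in.

```python
def peak_multiplies_per_second(units, clock_hz, per_cycle=1):
    """Idealised peak rate: every unit issues `per_cycle` multiplies each clock."""
    return units * clock_hz * per_cycle

# ASSUMPTION: one CPU core issuing one multiply per cycle at 650 MHz.
ps_rate = peak_multiplies_per_second(units=1, clock_hz=650e6)

# ASSUMPTION: all 220 DSP slices of a Zynq 7020 kept busy at 200 MHz.
pl_7020_rate = peak_multiplies_per_second(units=220, clock_hz=200e6)

# ASSUMPTION: all 2,520 DSP slices of a large device kept busy at 300 MHz.
pl_large_rate = peak_multiplies_per_second(units=2520, clock_hz=300e6)

print(f"PS            : {ps_rate:.3g} multiplies/s")
print(f"PL (220 DSP)  : {pl_7020_rate:.3g} multiplies/s, {pl_7020_rate / ps_rate:.0f}x the PS")
print(f"PL (2520 DSP) : {pl_large_rate:.3g} multiplies/s, {pl_large_rate / ps_rate:.0f}x the PS")
```

These are peak figures only. They ignore getting data to and from the DSPs, which is usually the real bottleneck, and they ignore any SIMD support on the CPU side.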

You can also build custom multipliers in the PL using LUTs. You could fit a huge number of 12-bit multipliers in the PL.

The PS-PL interfaces are 32-bit in the case of the AXI GP ports, again running in the low hundreds of MHz. With AXI GPIO on this port, you are likely only transferring 12 bits at a time, which is immediately inefficient on what is already intended as a lower performance interface.
Any data you transfer like this is managed by the CPU, i.e. it reads a value from DRAM and writes it out on this interface. Consider how the CPU implements the multiply: it also retrieves the data from DRAM, processes it in its ALU and writes it back to DRAM.

The 4x AXI HP ports are 64 bits wide, and have direct access to DRAM. You could transfer 5x 12-bit values in a clock cycle on one of these ports.
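To illustrate the "5x 12-bit values per 64-bit word" point, here is a minimal sketch of packing and unpacking on the software side. The function names and the bit layout (first value in the least significant 12 bits) are assumptions for illustration; whatever layout you choose has to match how your PL logic slices the word.

```python
MASK12 = 0xFFF          # 12-bit mask
VALUES_PER_WORD = 5     # 5 * 12 = 60 bits, so 4 bits of each 64-bit word go unused

def pack12(values):
    """Pack up to five 12-bit unsigned values into one 64-bit word (LSB first)."""
    if len(values) > VALUES_PER_WORD:
        raise ValueError("at most 5 values fit in a 64-bit word")
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v <= MASK12:
            raise ValueError(f"value {v} does not fit in 12 bits")
        word |= v << (12 * i)
    return word

def unpack12(word, count=VALUES_PER_WORD):
    """Recover `count` 12-bit values from a word produced by pack12."""
    return [(word >> (12 * i)) & MASK12 for i in range(count)]

samples = [1, 2, 3, 4095, 0]
word = pack12(samples)
print(hex(word), unpack12(word))
```

In practice you would pack a whole buffer of words like this and hand it to a DMA, rather than writing one word at a time from Python.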

There is also an overhead of ~100 ms in transferring a single value from the PS to the PL using PYNQ/Python. You could improve the transfer time in a few different ways, but fundamentally, transferring a single 12-bit value at a time will be relatively slow. It takes the PS longer to move this data to the AXI GP port than to execute the multiply.
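The effect of that fixed overhead can be sketched with a simple cost model: the per-transfer overhead is paid once per call, so it is spread across however many values you send in that call. The function name and the numbers below are placeholders for illustration, not measurements from any board.

```python
def time_per_value(call_overhead_s, per_value_s, batch_size):
    """Average cost per value when a fixed overhead is shared by a batch."""
    return call_overhead_s / batch_size + per_value_s

# ASSUMPTION: placeholder figures only; measure your own system to get real ones.
CALL_OVERHEAD_S = 1e-3   # fixed software cost of starting one transfer
PER_VALUE_S = 1e-8       # incremental cost of each extra value in the transfer

for batch in (1, 100, 10_000, 1_000_000):
    cost = time_per_value(CALL_OVERHEAD_S, PER_VALUE_S, batch)
    print(f"batch={batch:>9}: {cost:.3e} s per value")
```

With a batch of one, the overhead dominates completely. Only with large batches does the per-value cost approach what the hardware itself can sustain, which is why streaming or DMA of large buffers is the usual approach.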

Ideally you would collect data directly on the PL, e.g. take video input, process it inline, and write it back out from the PL.

This is already a long reply! I could go on a lot longer and discuss other factors which will influence this, but I hope this illustrates the points to you.

I’d suggest you try to study FPGAs further and think about how the various operations are working on the different types of processing units you are working with.

Cathal

I think the maximum clock for the PL is 400 MHz.
I have no idea about the AXI speed, though; as suggested, it is 100 MHz.
But I can try AXI stream too.