PYNQ DMA run time

I am not sure why, but I noticed the following during some experiments.

First version, accessing the DMA through the overlay on every call:

start = time.time()
overlay.memory.axi_dma_0.sendchannel.transfer(in_fc)
overlay.memory.axi_dma_0.recvchannel.transfer(out_fc)
overlay.memory.axi_dma_0.sendchannel.wait()
overlay.memory.axi_dma_0.recvchannel.wait()
end = time.time()
exec_time = end - start

Measured time ~= 0.06xxx s, for a transfer of about 1024 x 16-bit values

versus the second version, holding a reference to the DMA in a variable first:

dma = overlay.memory.axi_dma_0
start = time.time()
dma.sendchannel.transfer(in_fc)
dma.recvchannel.transfer(out_fc)
dma.sendchannel.wait()
dma.recvchannel.wait()
end = time.time()
exec_time = end - start

Measured time ~= 0.003xxx s, for a transfer of about 1024 x 16-bit values

The difference is huge. Why does it behave this way?

Hi Brian,

You’re right. This is quite a large diff in performance; thanks for finding this and bringing it to our attention. It’s really helpful.

I was able to recreate your results, and after doing some profiling, I think I have managed to locate the issue. A while back, we added a PL server, which manages the loading of overlays and interactions with the configured bitstream. Every time you access overlay.memory.axi_dma_0 you go through this server, which involves opening UNIX sockets and passing messages back and forth. When you instead create a reference to the driver object with dma = overlay.memory.axi_dma_0, you interact with the server only once (and in your case that one interaction happens before the timer starts, so its overhead is not measured).
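To make the effect concrete, here is a toy model of the behaviour described above. It is not PYNQ's actual implementation: FakeServer, FakeDriver and FakeOverlay are hypothetical stand-ins, and the "server round trip" is just a counter. It only illustrates why repeated attribute access can cost more than reusing a saved reference.

```python
class FakeDriver:
    """Hypothetical stand-in for a driver object returned by the overlay."""

    def __init__(self, name):
        self.name = name

    def transfer(self):
        return "transfer via " + self.name


class FakeServer:
    """Hypothetical stand-in for a server contacted on each lookup."""

    def __init__(self):
        self.round_trips = 0

    def resolve(self, name):
        # Each lookup counts as one round trip to the server.
        self.round_trips += 1
        return FakeDriver(name)


class FakeOverlay:
    """Resolves unknown attributes through the server on every access."""

    def __init__(self, server):
        self._server = server

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        return self._server.resolve(name)


def count_round_trips(cache_reference, calls=4):
    """Count server round trips for `calls` transfers."""
    server = FakeServer()
    overlay = FakeOverlay(server)
    if cache_reference:
        dma = overlay.axi_dma_0          # one lookup, done up front
        for _ in range(calls):
            dma.transfer()
    else:
        for _ in range(calls):
            overlay.axi_dma_0.transfer() # one lookup per call
    return server.round_trips


print("via overlay each time:", count_round_trips(cache_reference=False))
print("via saved reference:  ", count_round_trips(cache_reference=True))
```

In this model the four calls in the first snippet correspond to four lookups, while the second snippet performs a single lookup outside the timed region.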

We are still discussing internally what to do with the PL server and hopefully will be able to rectify this. Thanks again for bringing this particular issue to our attention. In the meantime, I would recommend using the approach in your second snippet for higher performance.
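One way to follow that recommendation consistently is to pass the already-resolved driver object into a small helper, so the lookup can never end up inside the timed region. This is only a sketch: timed_dma_round_trip is a hypothetical name, and it uses time.perf_counter() rather than the time.time() from the original snippets because it is better suited to short intervals. The channel calls are the same ones used in the snippets above.

```python
import time


def timed_dma_round_trip(dma, in_buffer, out_buffer):
    """Run one send/receive pair on an already-resolved DMA driver
    object and return the elapsed time in seconds."""
    start = time.perf_counter()
    dma.sendchannel.transfer(in_buffer)
    dma.recvchannel.transfer(out_buffer)
    dma.sendchannel.wait()
    dma.recvchannel.wait()
    return time.perf_counter() - start


# Intended use on the board (names taken from the snippets above):
#   dma = overlay.memory.axi_dma_0      # resolve once, outside the timer
#   exec_time = timed_dma_round_trip(dma, in_fc, out_fc)
```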

I’ll look into creating an issue on our GitHub repo and post it here so that you can track the progress on this fix; however, we might have another way that we want to track this issue, so I’ll get back to you.

Thanks again,
Shane

P.S. I have included the profiling outputs here.

Going via the overlay class:

Going via a copied reference to the dma driver:

Hello Stf,

Thank you for the in-depth explanation; I appreciate all the investigation.
Great job!

Before reading your reply, my guess was that this comes down to how a variable is resolved.

New variable:

  A mapping has to be created first.
  Pointers for passing the data are set up.
  The data is passed.

Update variable:

  Only the new data is passed.

So it is clear that reusing an existing variable carries much less overhead and is faster.
The same reasoning applies to this issue.

So I am not sure whether the first version should simply be considered bad coding practice, or whether there is another hidden issue here.