You’re right. This is quite a large difference in performance; thanks for finding it and bringing it to our attention. It’s really helpful.
I was able to reproduce your results, and after some profiling I think I have located the issue. A while back, we introduced a PL server, which manages the loading of overlays and interactions with the configured bitstream. When you call through overlay.memory.axi_dma_0 directly, every access goes via this server, which involves opening UNIX sockets and passing messages back and forth. When you instead create a reference to the driver object with dma = overlay.memory.axi_dma_0, you only interact with the server once (and in your case that single interaction falls outside the code you are timing).
We are still discussing internally what to do with the PL server and hopefully will be able to rectify this. Thanks again for bringing this particular issue to our attention. In the meantime, I would recommend using the approach in your second snippet for higher performance.
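To illustrate why holding a reference helps, here is a minimal sketch that does not use PYNQ at all. FakeOverlay and FakeDriver are hypothetical stand-ins I made up for this example, and the lookup counter is only an assumption standing in for the socket round trip to the PL server; it is not how the real overlay code is implemented.

```python
class FakeDriver:
    """Hypothetical stand-in for a driver object such as a DMA."""

    def __init__(self, name):
        self.name = name

    def transfer(self):
        # Placeholder for real work; just returns the driver name.
        return self.name


class FakeOverlay:
    """Hypothetical stand-in for an overlay whose attribute lookups are costly."""

    def __init__(self):
        # Counts how many "expensive" lookups happened (assumed to model
        # one round trip to a server per attribute access).
        self.lookups = 0

    def __getattr__(self, name):
        # Only called for attributes that are not found normally,
        # so reading self.lookups does not recurse into here.
        self.lookups += 1
        return FakeDriver(name)


def slow_loop(overlay, n):
    # Resolves the driver through the overlay on every iteration.
    for _ in range(n):
        overlay.axi_dma_0.transfer()


def fast_loop(overlay, n):
    # Resolves the driver once, then reuses the reference.
    dma = overlay.axi_dma_0
    for _ in range(n):
        dma.transfer()
```

With this stand-in, slow_loop pays the lookup cost on every iteration, while fast_loop pays it once regardless of how many transfers follow. That is the same shape as the difference between your two snippets.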
I’ll look into creating an issue on our GitHub repo and post the link here so that you can track progress on the fix. We may decide to track it another way instead, so I’ll get back to you either way.