Ultra96-V2 DMA performance

Hi all, I am using an Ultra96-V2 board with DMA. I followed the tutorial "Tutorial: PYNQ DMA (Part 1: Hardware design)", but instead of using a FIFO, I connected the input port directly to the output, just to test how long a transfer takes.
However, I found that the result varies between versions of the Ultra96-V2 PYNQ SD card image. Transferring 56 doubles took 0.17 ms on v2.5 but 0.34 ms on the newer v3.0.1, which is twice as long (100% slower).
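
For anyone who wants to reproduce the comparison, here is a minimal sketch of how such a timing could be collected over many runs rather than a single transfer. It is not necessarily how the numbers above were taken. The bitstream name `dma_loopback.bit` and the IP name `axi_dma_0` are placeholders (assumptions); substitute the names from your own design. The helper names `time_call`, `make_loopback` and `summarize` are hypothetical.

```python
import statistics
import time


def time_call(fn, repeats=100):
    """Call fn() `repeats` times; return each elapsed time in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return samples


def summarize(samples):
    """Reduce raw samples to a few robust statistics (all in ms)."""
    return {
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


def make_loopback(bitfile="dma_loopback.bit", n=56):
    """Return a function that performs one DMA round trip (board only).

    Assumption: `bitfile` and the IP name `axi_dma_0` are placeholders.
    """
    import numpy as np
    from pynq import Overlay, allocate  # only importable on the board

    overlay = Overlay(bitfile)
    dma = overlay.axi_dma_0  # placeholder IP name
    src = allocate(shape=(n,), dtype=np.float64)
    dst = allocate(shape=(n,), dtype=np.float64)
    src[:] = np.arange(n)

    def one_transfer():
        # Start both channels, then block until both have finished.
        dma.sendchannel.transfer(src)
        dma.recvchannel.transfer(dst)
        dma.sendchannel.wait()
        dma.recvchannel.wait()

    return one_transfer


# On the board (not run here):
#   print(summarize(time_call(make_loopback())))
```

Running the same script on both images and comparing the median, rather than one sample, should show whether the difference is consistent or just jitter. For a payload this small (56 doubles is 448 bytes), the measured time is likely dominated by per-transfer software overhead rather than by the data movement itself.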

Why does this happen, and how can I fix it?


