This is an interesting question I would like to ask.
For CNNs, TensorFlow Lite inference is also problematic, with memory overflows or other issues (I didn't investigate further).

So rebuilding the basic CNN function blocks myself is simple and much more controllable.
Meanwhile, NumPy, per this link, does mention SIMD support:
Does NumPy on the Xilinx armv7l ARM cores run with SIMD as well?
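One way to check is to ask the NumPy build itself which SIMD features it detected at import time. A hedged sketch (the internal module path is an implementation detail and differs between NumPy versions; on armv7l you would expect NEON entries, on x86 you would see SSE/AVX instead):

```python
import numpy as np

# Hedged check: list the SIMD features this NumPy build detected
# and enabled at import time. The module path is version-dependent.
try:
    from numpy.core._multiarray_umath import __cpu_features__
except ImportError:  # NumPy >= 2.0 renamed numpy.core to numpy._core
    from numpy._core._multiarray_umath import __cpu_features__

enabled = sorted(name for name, on in __cpu_features__.items() if on)
print("Enabled SIMD features:", enabled)  # expect NEON on armv7l builds
```

If the list is empty or NEON is missing, the wheel was built without SIMD dispatch and the ARM numbers below would come from scalar code.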

So, considering fully-connected layers:
it turns out NumPy can actually do a faster job than the FPGA.

FPGA @ 100MHz Runtime # 0.061389923095703125
ARM Runtime # 0.031568288803100586

If the FPGA went to 200 MHz: ≈ 0.061389923095703125 / 2 ≈ 0.0307
Even then, it would only be marginally faster than the ARM SIMD result.
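For reference, the ARM number above was presumably measured with something like the sketch below. The layer size (1024×1024), dtype, and timing method are my assumptions, not stated in the post:

```python
import time
import numpy as np

# Hedged sketch of the fully-connected benchmark on the ARM side.
# Shapes and dtype are assumptions for illustration.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)  # weights
x = rng.standard_normal(1024).astype(np.float32)          # input activations
b = np.zeros(1024, dtype=np.float32)                      # bias

start = time.time()
y = W @ x + b        # fully-connected layer: y = Wx + b
elapsed = time.time() - start
print(f"ARM Runtime # {elapsed}")
```

Note that `W @ x` dispatches to the BLAS library NumPy was linked against, so the measured time reflects that library's SIMD kernels, not pure Python.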

Are there any suggestions or a good explanation that can account for this behavior? My guesses:

  1. The HLS design might not be fully optimized
  2. The DSPs' inherent structure is better suited to pipelined operation, and the parallel addition (the adder tree) is the bottleneck

→ I might be wrong
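To illustrate guess 2: accumulating N partial products in parallel needs a binary adder tree, whose depth grows as log2(N). A hedged NumPy sketch of that reduction (the FPGA/stage mapping in the comments is my assumption for illustration):

```python
import numpy as np

# Sketch: summing N partial products with a binary reduction tree.
# On an FPGA each level would be one pipeline stage, so latency grows
# as log2(N) rather than N (illustrative assumption, not the HLS output).
def tree_sum(values):
    v = np.asarray(values, dtype=np.float64)
    stages = 0
    while v.size > 1:
        if v.size % 2:                 # pad odd lengths with a zero
            v = np.append(v, 0.0)
        v = v[0::2] + v[1::2]          # one parallel addition stage
        stages += 1
    return v[0], stages

total, stages = tree_sum(np.ones(1024))
print(total, stages)  # 1024 partial products -> 10 addition stages
```

If HLS fails to build this tree (e.g. the accumulation loop is not unrolled), the additions serialize and the DSP pipeline stalls, which would fit the measured numbers.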