CNN on PYNQ (using RTL)

I want to implement simple CNN operations (e.g., convolution and fully connected layers) on PYNQ. However, most online resources design the hardware in HLS, which generates the RTL that is then built into a BIT file. I want to write the hardware directly in RTL and use it as a PYNQ overlay. Are there any reference designs or code examples for this approach?

I would also like to base my convolution code on the code at the following URL, but I am having trouble transferring array elements over AXI4. Any help with this problem would be appreciated.