Pynq and Vivado references for block design


I’m starting a new project on machine learning algorithms accelerated on an FPGA. The project involves creating custom overlays that speed up computationally intensive and, above all, repetitive operations, such as Euclidean distance and the many other dissimilarity measures used in typical machine learning applications on large amounts of data.
I have an electronic engineering background, but I have never gone into depth with FPGAs or digital electronics in general.
I read the custom overlay tutorial on Read the Docs, and it seems like a good starting point. After that, I tried to develop a custom overlay that implements a very simple multiplication of a vector by a constant and takes advantage of DMA. The IP core is provided by the website, but the block design is not trivial for me, and I feel I lack the background knowledge for it.
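To make the target concrete, here is a minimal software sketch of the two operations mentioned above, in plain Python. The function names `scale_vector` and `euclidean_distance` are placeholders of my own and do not come from the tutorial or from the PYNQ library; the idea is only to have a reference result to compare against whatever the overlay returns over DMA.

```python
import math


def scale_vector(values, constant):
    """Multiply every element of a vector by the same constant."""
    return [v * constant for v in values]


def euclidean_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Same multiply-accumulate pattern repeated for every element,
    # which is the kind of loop I would like to move to the FPGA.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    print(scale_vector([1, 2, 3], 3))
    print(euclidean_distance([0, 0], [3, 4]))
```

The hardware version would stream the same input data through the IP core, and the output buffer could then be checked element by element against these functions.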
I would be really thankful if someone could provide some references for this design step, such as a book or tutorial specific to Pynq overlay design.

Thank you,