HLS/IP BRAM Usage

Hello. I'm trying to run a neural network with HLS. This is partly a Xilinx issue and partly an issue for this repository, I think. I'm using hls4ml to compile/synthesize the network, but this part of the problem is more Xilinx/PYNQ related. The board is a PYNQ-Z2, and my NN has a 512-element input and 10 outputs (for classification).
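For context, the conversion looks roughly like this. This is only a sketch: the model file, output directory and exact argument names are placeholders for my setup and may differ between hls4ml versions.

```python
import hls4ml
from tensorflow import keras

# Placeholder for my trained classifier: 512 inputs, 10 softmax outputs
model = keras.models.load_model('my_classifier.h5')

# Start from an auto-generated hls4ml config and convert the Keras model
config = hls4ml.utils.config_from_keras_model(model, granularity='model')

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='hls_prj',
    io_type='io_parallel',        # or 'io_serial', see below
    fpga_part='xc7z020clg400-1',  # PYNQ-Z2
)
```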

When the project is built, an HLS project is created. I can then open it with Vivado HLS (2019.2), run C Synthesis, and export the IP. Here is where the issues start.
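If it helps, the same synthesis/export steps can also be driven from Python instead of the Vivado HLS GUI; this is a sketch assuming the `hls_model` object from the conversion snippet above.

```python
# Roughly the equivalent of "Run C Synthesis" + "Export RTL" in the GUI
hls_model.build(csim=False, synth=True, export=True)

# Print the resource/latency report that Vivado HLS wrote into the project dir
hls4ml.report.read_vivado_report('hls_prj')
```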

  1. If I compile with `io_type='io_parallel'`, the following IP block is generated:

     [screenshot of the io_parallel IP block]

     but then this is my utilization estimate:

     [Screenshot from 2021-08-16 11-30-17: utilization estimate for io_parallel]

  2. If I compile with `io_type='io_serial'`, I get a massive IP block with 512 separate inputs, but the utilization estimate fits the board:

     [Screenshot from 2021-08-16 11-32-17: utilization estimate for io_serial]

As you can see, io_parallel gives the nicer IP block interface, but its BRAM usage is way over the board's maximum.

Here are my questions:

  1. MOST IMPORTANT: Is it possible for the PYNQ to implement the smaller io_parallel design and then load the BRAM contents from the SD card? How can I reduce the BRAM usage (see the first sketch after this list), and how would I do that from Python?

  2. If io_serial is used so that the BRAM fits the board, how do I make the connections in the block design so that I can generate the overlay and use it from Python (see the second sketch after this list)?
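To make question 1 concrete, these are the hls4ml configuration knobs I am aware of for trading parallelism against resources. This is only a sketch with arbitrary values, reusing the `model`/`config` objects from the conversion sketch above, and I am not sure it is the right approach for my case:

```python
# Assumed knobs (values are arbitrary): reuse each multiplier over several
# clock cycles and let hls4ml favour resource sharing over latency
config['Model']['ReuseFactor'] = 64
config['Model']['Strategy'] = 'Resource'

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir='hls_prj_reuse',
    io_type='io_parallel',
    fpga_part='xc7z020clg400-1',  # PYNQ-Z2
)
```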
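And for question 2, this is roughly what I would like to end up with on the Python side. Again only a sketch: it assumes the exported IP is wrapped in a Vivado block design together with an AXI DMA, and the bitstream name, the `axi_dma_0` instance name and the float32 data type are all assumptions on my part:

```python
import numpy as np
from pynq import Overlay, allocate

# Load the overlay generated from the block design (file name is a placeholder)
ol = Overlay('nn_classifier.bit')
dma = ol.axi_dma_0  # assumed name of the AXI DMA instance

# One sample: 512 input features in, 10 class scores out
in_buf = allocate(shape=(512,), dtype=np.float32)
out_buf = allocate(shape=(10,), dtype=np.float32)
in_buf[:] = np.random.rand(512).astype(np.float32)  # dummy input

# Stream the input through the accelerator and read back the result
dma.sendchannel.transfer(in_buf)
dma.recvchannel.transfer(out_buf)
dma.sendchannel.wait()
dma.recvchannel.wait()

print('predicted class:', int(np.argmax(out_buf)))
```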