If I recall correctly:
The ZYNQ, with its dual-core ARM A9 at 800MHz, is far, far from good enough to handle that amount of weights and layers.
In the end I just trained offline, exported the weights, and used a lite TensorFlow (I was too lazy to convert the weights into something AXI or LE can handle), then passed them to the neurons via the AXIS DMA.
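For anyone who does want to convert the exported weights into something the fabric can take, here is a rough sketch of the idea, not what I actually ran. It assumes a signed 16-bit fixed-point format with 8 fractional bits and little-endian words; `FRAC_BITS`, `to_fixed` and `pack_weights` are made-up names, and the weight values are stand-ins, not trained ones.

```python
import struct

FRAC_BITS = 8  # assumption: 8 fractional bits in an int16; use what your design expects

def to_fixed(value, frac_bits=FRAC_BITS):
    """Quantize one float to a signed 16-bit fixed-point integer, saturating on overflow."""
    scaled = round(value * (1 << frac_bits))
    return max(-32768, min(32767, scaled))

def pack_weights(weights, frac_bits=FRAC_BITS):
    """Pack floats as little-endian int16 words, e.g. to copy into a DMA buffer."""
    fixed = [to_fixed(w, frac_bits) for w in weights]
    return struct.pack("<%dh" % len(fixed), *fixed)

weights = [0.5, -1.25, 0.0, 200.0]  # stand-in values; 200.0 is there to show saturation
payload = pack_weights(weights)
print(len(payload), struct.unpack("<4h", payload))  # 8 (128, -320, 0, 32767)
```

The resulting bytes are what you would place in the buffer the DMA reads from; word width and byte order have to match whatever the receiving stream interface is built for.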
Meanwhile, I haven’t had time to investigate how much better DDR4 and the UltraScale ZYNQ are. Of course I do have the board, but I have no need to do such an implementation on my side.
So, to make such an installation work:
1st, you need a cross compiler to build the TensorFlow package for the right ARM platform.
2nd, you need hardware that can actually hold that amount of neuron data, aka weights + biases. Remember, this is not about speed but about size: with less than 1GB, what do you expect?
3rd, if you get past both 1 and 2, what you end up with is CPU inference (full precision or fixed precision) at 1.xx GHz CPU speed. Hopefully it is fun to wait.
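To get a feel for point 2, a quick back-of-the-envelope count of weights + biases helps. This is only a sketch for a plain fully connected network; the layer sizes are a hypothetical MNIST-sized example, and `dense_param_count` and `footprint_bytes` are made-up helper names.

```python
def dense_param_count(layer_sizes):
    """Count weights + biases of a fully connected net from its layer widths."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per output
    return total

def footprint_bytes(layer_sizes, bytes_per_param):
    """Raw parameter storage only; activations and the runtime itself come on top."""
    return dense_param_count(layer_sizes) * bytes_per_param

layers = [784, 128, 10]  # hypothetical small MLP, not a real model of mine
print(dense_param_count(layers))   # 101770
print(footprint_bytes(layers, 4))  # 407080 bytes at float32
print(footprint_bytes(layers, 2))  # 203540 bytes at 16-bit fixed point
```

A toy network like this fits easily; plug in the layer sizes of the model you actually want and compare the result against the RAM left over after the OS and the TensorFlow runtime.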
What I did on the ZYNQ 7000 series:
A simple MNIST tutorial:
Enjoy~ =]