My accelerator DMA output is giving zeros

I created the following IP with Vitis HLS 2020.2 (see the attached files pobj.cpp and pobj.hpp). It synthesized successfully and I exported it as an IP. After that I created the following block design:


The bitstream was also generated successfully.
I then wrote the following Python code:

from pynq import (allocate, Overlay)
import numpy as np
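ol = Overlay('pobj.bit')

# Handles to the DMA engines and the HLS IP used below. The attribute names
# are assumptions and must match the instance names in the block design.
dma_A = ol.dma_A
dma_X = ol.dma_X
dma_YZ = ol.dma_YZ
dma_param = ol.dma_param
pobj_ip = ol.pobj_ip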

# Define dimensions
M = 64
N = 512
# Allocate memory for DMA transfers
A_buffer = allocate(shape=(M,N), dtype=np.float64, cacheable=False)
X_buffer = allocate(shape=(N,1), dtype=np.float64, cacheable=False)
Y_buffer = allocate(shape=(M,1), dtype=np.float64, cacheable=False)
Z_buffer = allocate(shape=(M,1), dtype=np.float64, cacheable=False)
P_buffer = allocate(shape=(2,1), dtype=np.float64, cacheable=False)

CTRL_REG = 0x00
AP_START = (1<<0) # bit 0
AUTO_RESTART = (1<<7) # bit 7
def run_kernel():
    dma_A.sendchannel.transfer(A_buffer)
    dma_X.sendchannel.transfer(X_buffer)
    dma_YZ.sendchannel.transfer(Y_buffer)
    dma_param.sendchannel.transfer(P_buffer)
    
    dma_YZ.recvchannel.transfer(Z_buffer)
    dma_param.recvchannel.transfer(P_buffer)
    
    pobj_ip.write(CTRL_REG, (AP_START | AUTO_RESTART))  # set ap_start and auto_restart to start the kernel
    
    dma_A.sendchannel.wait()
    dma_X.sendchannel.wait()
    dma_YZ.sendchannel.wait()
    dma_param.sendchannel.wait()
    
    dma_YZ.recvchannel.wait()
    dma_param.recvchannel.wait()

A = np.random.rand(M, N).astype(dtype=np.float64)
X = np.random.rand(N,1).astype(dtype=np.float64)
Y = np.random.rand(M,1).astype(dtype=np.float64)
P = np.zeros((2,1)).astype(dtype=np.float64)
P[0] = 0.5  # lambda

A_buffer[:] = A
X_buffer[:] = X
Y_buffer[:] = Y
P_buffer[:] = P

%%timeit
run_kernel()
# 100 loops, best of 3: 17.1 ms per loop

# (separate notebook cell, so the print is not part of the timed code)
print(Z_buffer)
# prints all zeros
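
One check that might help narrow this down is reading the control register back after the transfers. The sketch below decodes such a value, assuming the IP uses the default ap_ctrl_hs block-level protocol from Vitis HLS with its control register at offset 0x00 (the same register written in run_kernel()). The helper name decode_ctrl is hypothetical, and the bit layout should be confirmed against the generated driver header for the IP.

```python
# Bit positions in the control register at offset 0x00, assuming the
# ap_ctrl_hs block-level protocol (an assumption; check the generated
# *_hw.h header of the exported IP to confirm).
AP_START = 1 << 0
AP_DONE = 1 << 1
AP_IDLE = 1 << 2
AP_READY = 1 << 3
AUTO_RESTART = 1 << 7


def decode_ctrl(value):
    """Split a raw control register value into named flags."""
    return {
        "ap_start": bool(value & AP_START),
        "ap_done": bool(value & AP_DONE),
        "ap_idle": bool(value & AP_IDLE),
        "ap_ready": bool(value & AP_READY),
        "auto_restart": bool(value & AUTO_RESTART),
    }


# On the board this would be decode_ctrl(pobj_ip.read(0x00)), where pobj_ip
# is the IP handle used in run_kernel(). Here two sample values are decoded.
print(decode_ctrl(AP_START | AUTO_RESTART))  # the value written in run_kernel()
print(decode_ctrl(AP_IDLE))                  # a core that is not running
```

If ap_idle is set right after the write, the kernel may never have started. If ap_start stays set and the receive channels never finish, the kernel is probably still waiting on one of its streams.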

First, the IP is very slow (about 17 ms per call). More importantly, its output is all zeros. What is wrong?
How can I solve these issues?
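
For scale, here is a rough sketch of how many bytes each DMA moves for the shapes above and what the 17.1 ms timing implies. It assumes that the AXI DMA "Width of Buffer Length Register" setting limits a single transfer to 2**width - 1 bytes (the Vivado default is 14 bits, as far as I know). The helper names are hypothetical, and the throughput figure includes Python and driver overhead, so it is only an estimate.

```python
import numpy as np

M, N = 64, 512
ITEMSIZE = np.dtype(np.float64).itemsize  # 8 bytes per element

# Bytes per transfer, taken from the buffer shapes in the listing above.
sent = {
    "A": M * N * ITEMSIZE,
    "X": N * ITEMSIZE,
    "Y": M * ITEMSIZE,
    "P": 2 * ITEMSIZE,
}
received = {
    "Z": M * ITEMSIZE,
    "P": 2 * ITEMSIZE,
}


def max_transfer_bytes(length_width_bits):
    """Largest single transfer for a given buffer length register width
    (assumed to be 2**width - 1 bytes)."""
    return (1 << length_width_bits) - 1


def min_length_width(nbytes):
    """Smallest register width in bits whose limit covers nbytes."""
    return int(nbytes).bit_length()


for name, nbytes in sent.items():
    print(f"send {name}: {nbytes} bytes, needs width >= {min_length_width(nbytes)} bits")

# Effective throughput implied by the measured 17.1 ms per run_kernel() call.
total_bytes = sum(sent.values()) + sum(received.values())
measured_s = 17.1e-3
throughput_mb_s = total_bytes / measured_s / 1e6
print(f"total {total_bytes} bytes, about {throughput_mb_s:.1f} MB/s")
```

Under that assumption, the A transfer only fits in a single transfer if the width for dma_A is at least the value printed for it. The low effective throughput suggests that most of the time goes to something other than moving data.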