My accelerator DMA output is giving zeros

I created an IP with Vitis HLS 2020.2 (please see the attached files pobj.cpp (2.5 KB) and pobj.hpp (327 Bytes)). It synthesized and exported successfully. After that I created the following block design:


The bitstream was also generated successfully.
Then I wrote the Python code as follows:

from pynq import (allocate, Overlay)
import numpy as np
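ol = Overlay('pobj.bit')

# Handles to the IP blocks used below. The attribute names are an assumption:
# they must match the instance names in the block design.
dma_A = ol.dma_A
dma_X = ol.dma_X
dma_YZ = ol.dma_YZ
dma_param = ol.dma_param
pobj_ip = ol.pobj_ip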

# Define dimensions
M = 64
N = 512
# Allocate memory for DMA transfers
A_buffer = allocate(shape=(M,N), dtype=np.float64, cacheable=False)
X_buffer = allocate(shape=(N,1), dtype=np.float64, cacheable=False)
Y_buffer = allocate(shape=(M,1), dtype=np.float64, cacheable=False)
Z_buffer = allocate(shape=(M,1), dtype=np.float64, cacheable=False)
P_buffer = allocate(shape=(2,1), dtype=np.float64, cacheable=False)

CTRL_REG = 0x00
AP_START = (1<<0) # bit 0
AUTO_RESTART = (1<<7) # bit 7
def run_kernel():
    dma_A.sendchannel.transfer(A_buffer)
    dma_X.sendchannel.transfer(X_buffer)
    dma_YZ.sendchannel.transfer(Y_buffer)
    dma_param.sendchannel.transfer(P_buffer)
    
    dma_YZ.recvchannel.transfer(Z_buffer)
    dma_param.recvchannel.transfer(P_buffer)
    
    pobj_ip.write(CTRL_REG, (AP_START | AUTO_RESTART))  # start the kernel with auto-restart enabled
    
    dma_A.sendchannel.wait()
    dma_X.sendchannel.wait()
    dma_YZ.sendchannel.wait()
    dma_param.sendchannel.wait()
    
    dma_YZ.recvchannel.wait()
    dma_param.recvchannel.wait()

A = np.random.rand(M, N).astype(dtype=np.float64)
X = np.random.rand(N,1).astype(dtype=np.float64)
Y = np.random.rand(M,1).astype(dtype=np.float64)
P = np.zeros((2,1)).astype(dtype=np.float64)
P[0] = 0.5           #lambda

A_buffer[:] = A
X_buffer[:] = X
Y_buffer[:] = Y
P_buffer[:] = P

%%timeit
run_kernel()
# 100 loops, best of 3: 17.1 ms per loop
print(Z_buffer)
# all zeros
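
When the output stays at zero, one quick check is whether the kernel ever started or finished. The sketch below decodes the value of the control register at CTRL_REG into named flags. Bits 0 and 7 are the ones used in the code above; bits 1 to 3 (ap_done, ap_idle, ap_ready) are an assumption that the IP uses the default ap_ctrl_hs block-level interface with an s_axilite control port. The helper name decode_ctrl is hypothetical.

```python
# Bit positions in the HLS block-level control register at offset 0x00.
# Bits 0 and 7 come from the code above; bits 1-3 assume the default
# ap_ctrl_hs layout generated by Vitis HLS.
CTRL_BITS = {
    "ap_start": 0,
    "ap_done": 1,
    "ap_idle": 2,
    "ap_ready": 3,
    "auto_restart": 7,
}


def decode_ctrl(value):
    """Return a dict mapping each control flag to True or False."""
    return {name: bool((value >> bit) & 1) for name, bit in CTRL_BITS.items()}


# Value written by run_kernel(): AP_START | AUTO_RESTART
written = (1 << 0) | (1 << 7)
print(hex(written), decode_ctrl(written))

# A kernel that is idle and was never started would have only ap_idle set.
print(decode_ctrl(0x04))
```

On the board, the same helper could be applied to the value returned by pobj_ip.read(CTRL_REG) right after the wait() calls, to see which flags the IP actually reports.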

First of all, the IP is very slow and, more importantly, its output is all zeros. What is wrong?
How can I solve these issues?
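
To put the 17.1 ms figure in context, here is a rough calculation of how much data one run_kernel() call moves and what effective throughput that implies. It only uses the buffer shapes and the %%timeit result from the code above. The helper nbytes is hypothetical, and the measured time includes Python driver overhead, so the result is only a rough estimate.

```python
FLOAT64_BYTES = 8  # size of one np.float64 element


def nbytes(shape, itemsize=FLOAT64_BYTES):
    """Return the number of bytes in a dense buffer of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * itemsize


M, N = 64, 512

# Buffers sent to the IP (A, X, Y, P) and received from it (Z, P).
sent = {"A": (M, N), "X": (N, 1), "Y": (M, 1), "P": (2, 1)}
received = {"Z": (M, 1), "P": (2, 1)}

sent_bytes = sum(nbytes(shape) for shape in sent.values())
received_bytes = sum(nbytes(shape) for shape in received.values())
total_bytes = sent_bytes + received_bytes

elapsed_s = 17.1e-3  # best time per call reported by %%timeit
throughput_mb_s = total_bytes / elapsed_s / 1e6

print(f"sent {sent_bytes} B, received {received_bytes} B")
print(f"effective throughput: {throughput_mb_s:.1f} MB/s")
```

Almost all of the traffic is the A matrix (64 x 512 doubles), so the time per call is dominated by how fast the IP consumes that one stream.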

I tried to run your code, but the string.h file is missing:
INFO: [HLS 200-10] Analyzing design file ‘roua/pobj.cpp’ …
ERROR: [HLS 207-812] ‘string.h’ file not found: roua/pobj.cpp:2:10
INFO: [HLS 200-111] Finished Command csynth_design CPU user time: 0.76 seconds. CPU system time: 0.55 seconds. Elapsed time: 0.9 seconds; current allocated memory: 197.892 MB.
command ‘ap_source’ returned error code
while executing
“source /home/roua/Desktop/roua/solution1/csynth.tcl”
invoked from within
“hls::main /home/roua/Desktop/roua/solution1/csynth.tcl”