Hi,

I’m dealing with a small deep neural network on PYNQ-Z2.

Briefly, about my design: twelve BRAMs are connected to one CDMA block to store the input feature map data. The PS then asserts a run signal to my logic core via AXI-Lite and continuously polls until the logic core reaches the done state.

My logic core is implemented in the PL and seems to work fine (it takes about 1 ms).

But the **problem is the time consumption in the PS**. First, let me clarify that I’m not very familiar with Python.

Each implemented BRAM is 4096 × 64 bits, where each 64-bit word consists of eight 8-bit values.

**The process of converting eight 8-bit integers into one 64-bit word takes about 100 ms,** which is excessively long compared to the PL operation.

**I’m wondering if there is a more time-efficient way than my method in the code below.**
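Before walking through the code: as a point of comparison, a common NumPy trick packs eight bytes into one 64-bit word with a zero-copy reinterpretation instead of a multiply-and-sum. This is a sketch with toy data, and it assumes a little-endian CPU (which the Zynq ARM cores are):

```python
import numpy as np

# On a little-endian CPU, viewing a contiguous (N, 8) uint8 array as uint64
# puts column 0 in the lowest byte of each word -- exactly
# sum(col_i * 2**(8*i)) -- with no multiply/sum loop at all.
rows = np.arange(16, dtype=np.uint8).reshape(2, 8)   # toy 8-bit input
packed = np.ascontiguousarray(rows).view(np.uint64).ravel()

# Reference: the explicit weighted sum, as in the approach below.
weights = (2 ** (8 * np.arange(8))).astype(np.uint64)
reference = (rows.astype(np.uint64) * weights).sum(axis=1)
assert np.array_equal(packed, reference)
```

The `.view()` call reinterprets the existing bytes in place, so it costs essentially nothing regardless of array size.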

**2) To generate one 64-bit word from eight 8-bit integers, I first read the data off-board from a MATLAB file into `in_fmap`, then declare NumPy buffers and store the read data.**

```python
## DECLARE DATA BUFFER
buf00 = np.zeros((data_depth, 8), dtype=np.uint64)
buf01 = np.zeros((data_depth, 8), dtype=np.uint64)
buf02 = np.zeros((data_depth, 8), dtype=np.uint64)
buf03 = np.zeros((data_depth, 8), dtype=np.uint64)
buf10 = np.zeros((data_depth, 8), dtype=np.uint64)
buf11 = np.zeros((data_depth, 8), dtype=np.uint64)
buf12 = np.zeros((data_depth, 8), dtype=np.uint64)
buf13 = np.zeros((data_depth, 8), dtype=np.uint64)
buf20 = np.zeros((data_depth, 8), dtype=np.uint64)
buf21 = np.zeros((data_depth, 8), dtype=np.uint64)
buf22 = np.zeros((data_depth, 8), dtype=np.uint64)
buf23 = np.zeros((data_depth, 8), dtype=np.uint64)
```
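As an aside, the twelve identical declarations could be generated in a loop. A sketch; `data_depth` is assumed to be 4096 from the BRAM size:

```python
import numpy as np

data_depth = 4096  # assumed from the 4096x64-bit BRAM size

# Create the twelve buffers in a dict instead of twelve statements.
bufs = {f"buf{r}{c}": np.zeros((data_depth, 8), dtype=np.uint64)
        for r in range(3) for c in range(4)}
```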

```python
## FEATURE DATA IN
ts = time.time()
buf10[1:3585, :] = in_fmap[:, 0, 0:8]
buf11[1:3585, :] = in_fmap[:, 0, 8:16]
buf12[1:3585, :] = in_fmap[:, 0, 16:24]
buf13[1:3585, :] = in_fmap[:, 0, 24:32]
buf20[1:3585, :] = in_fmap[:, 1, 0:8]
buf21[1:3585, :] = in_fmap[:, 1, 8:16]
buf22[1:3585, :] = in_fmap[:, 1, 16:24]
buf23[1:3585, :] = in_fmap[:, 1, 24:32]
te = time.time()
print("2 took ", str(te - ts), "s")
```
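If `in_fmap` is laid out as shape `(3584, n_ch, 32)` (an assumption based on the slices above), the four copies per channel can be collapsed into one bulk copy into a staging array, with the four buffers taken as views of it:

```python
import numpy as np

# Sketch: a zero array stands in for the real in_fmap data here.
in_fmap = np.zeros((3584, 3, 32), dtype=np.uint8)

stage1 = np.zeros((4096, 32), dtype=np.uint64)
stage1[1:3585, :] = in_fmap[:, 0, :]            # one bulk copy instead of four
# The four 8-column buffers become views into the staging array (no copies).
buf10, buf11, buf12, buf13 = (stage1[:, 8 * k:8 * (k + 1)] for k in range(4))
```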

**3) Then each NumPy buffer is scaled column by column: column i is multiplied by 2^(8·i).**

```python
ts = time.time()
for i in range(8):
    buf10[:, i] = buf10[:, i] * (2**(8*i))
    buf11[:, i] = buf11[:, i] * (2**(8*i))
    buf12[:, i] = buf12[:, i] * (2**(8*i))
    buf13[:, i] = buf13[:, i] * (2**(8*i))
    buf20[:, i] = buf20[:, i] * (2**(8*i))
    buf21[:, i] = buf21[:, i] * (2**(8*i))
    buf22[:, i] = buf22[:, i] * (2**(8*i))
    buf23[:, i] = buf23[:, i] * (2**(8*i))
te = time.time()
print("3 took ", str(te - ts), "s")
```
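For reference, this per-column loop can be replaced by a single broadcasted multiply per buffer. A sketch with a stand-in buffer:

```python
import numpy as np

# weights[i] == 2**(8*i), so one broadcasted multiply scales all eight
# columns at once, replacing the Python loop over columns.
buf = np.ones((4096, 8), dtype=np.uint64)             # stand-in for buf10 etc.
weights = (2 ** (8 * np.arange(8))).astype(np.uint64)

# Reference: the original per-column loop.
looped = buf.copy()
for i in range(8):
    looped[:, i] = looped[:, i] * np.uint64(2 ** (8 * i))

buf *= weights                                        # broadcasts over rows
assert np.array_equal(buf, looped)
```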

**4) Sum each buffer along the row axis, and 5) assign the result to the SRAM buffers.**

```python
ts = time.time()
buf00[:, 0] = np.sum(buf00, axis=1)
buf01[:, 0] = np.sum(buf01, axis=1)
buf02[:, 0] = np.sum(buf02, axis=1)
buf03[:, 0] = np.sum(buf03, axis=1)
buf10[:, 0] = np.sum(buf10, axis=1)
buf11[:, 0] = np.sum(buf11, axis=1)
buf12[:, 0] = np.sum(buf12, axis=1)
buf13[:, 0] = np.sum(buf13, axis=1)
buf20[:, 0] = np.sum(buf20, axis=1)
buf21[:, 0] = np.sum(buf21, axis=1)
buf22[:, 0] = np.sum(buf22, axis=1)
buf23[:, 0] = np.sum(buf23, axis=1)
te = time.time()
print("4 took ", str(te - ts), "s")
```
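Steps 3) and 4) can also be fused into one weighted row sum per buffer, avoiding the per-column Python loop entirely. A sketch with stand-in data:

```python
import numpy as np

# The scale-then-sum is just a weighted sum along axis 1,
# done in one NumPy pass per buffer.
buf = np.random.randint(0, 256, size=(4096, 8)).astype(np.uint64)  # stand-in
weights = (2 ** (8 * np.arange(8))).astype(np.uint64)

packed = (buf * weights).sum(axis=1)

# Reference: per-column scale followed by a row sum, as in steps 3) and 4).
ref = np.zeros(4096, dtype=np.uint64)
for i in range(8):
    ref += buf[:, i] * weights[i]
assert np.array_equal(packed, ref)
```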

```python
ts = time.time()
input_buf00[:] = buf00[:, 0]
input_buf01[:] = buf01[:, 0]
input_buf02[:] = buf02[:, 0]
input_buf03[:] = buf03[:, 0]
input_buf10[:] = buf10[:, 0]
input_buf11[:] = buf11[:, 0]
input_buf12[:] = buf12[:, 0]
input_buf13[:] = buf13[:, 0]
input_buf20[:] = buf20[:, 0]
input_buf21[:] = buf21[:, 0]
input_buf22[:] = buf22[:, 0]
input_buf23[:] = buf23[:, 0]
te = time.time()
print("5 took ", str(te - ts), "s")
```
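Finally, if the `input_buf*` arrays come from `pynq.allocate` (an assumption), the packed words can be written straight into them, skipping the intermediate `buf*` arrays. A sketch with a plain NumPy array standing in for the pynq buffer:

```python
import numpy as np

# Stand-in for pynq.allocate(shape=(4096,), dtype=np.uint64).
input_buf00 = np.zeros(4096, dtype=np.uint64)

# Keep the staging data as raw bytes and reinterpret on a little-endian CPU:
# one fused copy into the DMA-able buffer, no scale/sum steps at all.
stage = np.zeros((4096, 8), dtype=np.uint8)
stage[:] = 7                                          # toy data
input_buf00[:] = stage.view(np.uint64).ravel()
```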

**The time consumption of each step is shown by the printed timings.**

**Please give me any advice on reducing the operation time in the PS…**

Thanks.