Hi,

I’m dealing with a small deep neural network on PYNQ-Z2.

Briefly, about my design: twelve BRAMs are connected to one CDMA block to store the input feature map data. The PS then asserts a run signal to my logic core via AXI-Lite and continuously polls until the logic core reaches the done state.

My logic core is implemented in the PL and seems to work fine (it takes about 1 ms).

But the **problem is the time consumption in the PS**. First, let me clarify that I’m not very familiar with Python.

Each implemented BRAM is 4096 × 64 bits, where each 64-bit word consists of eight 8-bit values.

**The process of converting eight 8-bit integers into one 64-bit word takes about 100 ms,** which is excessively long compared to the PL operation.

**I’m wondering if there is a more time-efficient way than my method in the code below.**
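Before walking through the code: as a point of comparison, a common NumPy trick packs eight bytes into one 64-bit word with a zero-copy reinterpretation instead of a multiply-and-sum. This is a sketch with toy data, and it assumes a little-endian CPU (which the Zynq ARM cores are):

```python
import numpy as np

# On a little-endian CPU, viewing a contiguous (N, 8) uint8 array as uint64
# puts column 0 in the lowest byte of each word -- exactly
# sum(col_i * 2**(8*i)) -- with no multiply/sum loop at all.
rows = np.arange(16, dtype=np.uint8).reshape(2, 8)   # toy 8-bit input
packed = np.ascontiguousarray(rows).view(np.uint64).ravel()

# Reference: the explicit weighted sum, as in the approach below.
weights = (2 ** (8 * np.arange(8))).astype(np.uint64)
reference = (rows.astype(np.uint64) * weights).sum(axis=1)
assert np.array_equal(packed, reference)
```

The `.view()` call reinterprets the existing bytes in place, so it costs essentially nothing regardless of array size.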

**2) To generate one 64-bit word from eight 8-bit integers, I first read the data off-board from a MATLAB file into `in_fmap`, then declare NumPy buffers and store the read data.**

```python
## DECLARE DATA BUFFER
buf00 = np.zeros((data_depth, 8), dtype=np.uint64)
buf01 = np.zeros((data_depth, 8), dtype=np.uint64)
buf02 = np.zeros((data_depth, 8), dtype=np.uint64)
buf03 = np.zeros((data_depth, 8), dtype=np.uint64)
buf10 = np.zeros((data_depth, 8), dtype=np.uint64)
buf11 = np.zeros((data_depth, 8), dtype=np.uint64)
buf12 = np.zeros((data_depth, 8), dtype=np.uint64)
buf13 = np.zeros((data_depth, 8), dtype=np.uint64)
buf20 = np.zeros((data_depth, 8), dtype=np.uint64)
buf21 = np.zeros((data_depth, 8), dtype=np.uint64)
buf22 = np.zeros((data_depth, 8), dtype=np.uint64)
buf23 = np.zeros((data_depth, 8), dtype=np.uint64)
```
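As an aside, the twelve identical declarations could be generated in a loop. A sketch; `data_depth` is assumed to be 4096 from the BRAM size:

```python
import numpy as np

data_depth = 4096  # assumed from the 4096x64-bit BRAM size

# Create the twelve buffers in a dict instead of twelve statements.
bufs = {f"buf{r}{c}": np.zeros((data_depth, 8), dtype=np.uint64)
        for r in range(3) for c in range(4)}
```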

```python
## FEATURE DATA IN
ts = time.time()
buf10[1:3585, :] = in_fmap[:, 0, 0:8]
buf11[1:3585, :] = in_fmap[:, 0, 8:16]
buf12[1:3585, :] = in_fmap[:, 0, 16:24]
buf13[1:3585, :] = in_fmap[:, 0, 24:32]
buf20[1:3585, :] = in_fmap[:, 1, 0:8]
buf21[1:3585, :] = in_fmap[:, 1, 8:16]
buf22[1:3585, :] = in_fmap[:, 1, 16:24]
buf23[1:3585, :] = in_fmap[:, 1, 24:32]
te = time.time()
print("2 took ", str(te - ts), "s")
```
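If `in_fmap` is laid out as shape `(3584, n_ch, 32)` (an assumption based on the slices above), the four copies per channel can be collapsed into one bulk copy into a staging array, with the four buffers taken as views of it:

```python
import numpy as np

# Sketch: a zero array stands in for the real in_fmap data here.
in_fmap = np.zeros((3584, 3, 32), dtype=np.uint8)

stage1 = np.zeros((4096, 32), dtype=np.uint64)
stage1[1:3585, :] = in_fmap[:, 0, :]            # one bulk copy instead of four
# The four 8-column buffers become views into the staging array (no copies).
buf10, buf11, buf12, buf13 = (stage1[:, 8 * k:8 * (k + 1)] for k in range(4))
```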

**3) Then each NumPy buffer is scaled column by column: column i is multiplied by 2^(8·i).**

```python
ts = time.time()
for i in range(8):
    buf10[:, i] = buf10[:, i] * (2**(8*i))
    buf11[:, i] = buf11[:, i] * (2**(8*i))
    buf12[:, i] = buf12[:, i] * (2**(8*i))
    buf13[:, i] = buf13[:, i] * (2**(8*i))
    buf20[:, i] = buf20[:, i] * (2**(8*i))
    buf21[:, i] = buf21[:, i] * (2**(8*i))
    buf22[:, i] = buf22[:, i] * (2**(8*i))
    buf23[:, i] = buf23[:, i] * (2**(8*i))
te = time.time()
print("3 took ", str(te - ts), "s")
```
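For reference, this per-column loop can be replaced by a single broadcasted multiply per buffer. A sketch with a stand-in buffer:

```python
import numpy as np

# weights[i] == 2**(8*i), so one broadcasted multiply scales all eight
# columns at once, replacing the Python loop over columns.
buf = np.ones((4096, 8), dtype=np.uint64)             # stand-in for buf10 etc.
weights = (2 ** (8 * np.arange(8))).astype(np.uint64)

# Reference: the original per-column loop.
looped = buf.copy()
for i in range(8):
    looped[:, i] = looped[:, i] * np.uint64(2 ** (8 * i))

buf *= weights                                        # broadcasts over rows
assert np.array_equal(buf, looped)
```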

**4) Sum each buffer along the row axis, and 5) assign the result to the SRAM buffers.**

```python
ts = time.time()
buf00[:, 0] = np.sum(buf00, axis=1)
buf01[:, 0] = np.sum(buf01, axis=1)
buf02[:, 0] = np.sum(buf02, axis=1)
buf03[:, 0] = np.sum(buf03, axis=1)
buf10[:, 0] = np.sum(buf10, axis=1)
buf11[:, 0] = np.sum(buf11, axis=1)
buf12[:, 0] = np.sum(buf12, axis=1)
buf13[:, 0] = np.sum(buf13, axis=1)
buf20[:, 0] = np.sum(buf20, axis=1)
buf21[:, 0] = np.sum(buf21, axis=1)
buf22[:, 0] = np.sum(buf22, axis=1)
buf23[:, 0] = np.sum(buf23, axis=1)
te = time.time()
print("4 took ", str(te - ts), "s")
```
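Steps 3) and 4) can also be fused into one weighted row sum per buffer, avoiding the per-column Python loop entirely. A sketch with stand-in data:

```python
import numpy as np

# The scale-then-sum is just a weighted sum along axis 1,
# done in one NumPy pass per buffer.
buf = np.random.randint(0, 256, size=(4096, 8)).astype(np.uint64)  # stand-in
weights = (2 ** (8 * np.arange(8))).astype(np.uint64)

packed = (buf * weights).sum(axis=1)

# Reference: per-column scale followed by a row sum, as in steps 3) and 4).
ref = np.zeros(4096, dtype=np.uint64)
for i in range(8):
    ref += buf[:, i] * weights[i]
assert np.array_equal(packed, ref)
```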

```python
ts = time.time()
input_buf00[:] = buf00[:, 0]
input_buf01[:] = buf01[:, 0]
input_buf02[:] = buf02[:, 0]
input_buf03[:] = buf03[:, 0]
input_buf10[:] = buf10[:, 0]
input_buf11[:] = buf11[:, 0]
input_buf12[:] = buf12[:, 0]
input_buf13[:] = buf13[:, 0]
input_buf20[:] = buf20[:, 0]
input_buf21[:] = buf21[:, 0]
input_buf22[:] = buf22[:, 0]
input_buf23[:] = buf23[:, 0]
te = time.time()
print("5 took ", str(te - ts), "s")
```
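Finally, if the `input_buf*` arrays come from `pynq.allocate` (an assumption), the packed words can be written straight into them, skipping the intermediate `buf*` arrays. A sketch with a plain NumPy array standing in for the pynq buffer:

```python
import numpy as np

# Stand-in for pynq.allocate(shape=(4096,), dtype=np.uint64).
input_buf00 = np.zeros(4096, dtype=np.uint64)

# Keep the staging data as raw bytes and reinterpret on a little-endian CPU:
# one fused copy into the DMA-able buffer, no scale/sum steps at all.
stage = np.zeros((4096, 8), dtype=np.uint8)
stage[:] = 7                                          # toy data
input_buf00[:] = stage.view(np.uint64).ravel()
```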

**The time consumption of each step is shown by the printed timings.**

**Please give me any advice on reducing the operation time in the PS…**

Thanks.