Any method to reduce the calculation time on PS (PYNQ-Z2)?

Hi,
I’m dealing with a small deep neural network on PYNQ-Z2.

Briefly, about my design: twelve BRAMs are connected to one CDMA block to store the input feature map data. The PS then asserts a run signal to my logic core via AXI-Lite and continuously polls to check whether the logic core is in the done state.
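
For context, a minimal sketch of this run/done handshake using PYNQ's MMIO (the bitstream name, base address, and register offsets below are hypothetical placeholders, not my actual design's map):

from pynq import Overlay, MMIO

overlay = Overlay("design.bit")          # placeholder bitstream name

ctrl = MMIO(0x43C00000, 0x1000)          # placeholder AXI-Lite base address
RUN_OFFSET, DONE_OFFSET = 0x00, 0x04     # placeholder register offsets

ctrl.write(RUN_OFFSET, 1)                # assert the run signal
while ctrl.read(DONE_OFFSET) == 0:       # poll until the core reports done
    pass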

My logic core is implemented in PL and seems to work fine. (My logic takes about 1 ms.)

But the problem is the time consumption in PS. First, let me clarify that I'm not familiar with writing Python code.

Each implemented BRAM is 4096 × 64 bits, where each 64-bit word consists of eight 8-bit values.
The process of converting eight 8-bit integers into one 64-bit word takes about 100 ms, which is excessively long compared to the PL operation.

I’m wondering if there is a way to be more time-efficient than my method in the code below.

[image: step 1 code]

2) To generate one 64-bit value from eight 8-bit integers, I first load the data from a MATLAB file into in_fmap on the board, then declare the numpy buffers below and store the loaded data.

DECLARE DATA BUFFER

buf00 = np.zeros((data_depth,8), dtype=np.uint64)
buf01 = np.zeros((data_depth,8), dtype=np.uint64)
buf02 = np.zeros((data_depth,8), dtype=np.uint64)
buf03 = np.zeros((data_depth,8), dtype=np.uint64)

buf10 = np.zeros((data_depth,8), dtype=np.uint64)
buf11 = np.zeros((data_depth,8), dtype=np.uint64)
buf12 = np.zeros((data_depth,8), dtype=np.uint64)
buf13 = np.zeros((data_depth,8), dtype=np.uint64)

buf20 = np.zeros((data_depth,8), dtype=np.uint64)
buf21 = np.zeros((data_depth,8), dtype=np.uint64)
buf22 = np.zeros((data_depth,8), dtype=np.uint64)
buf23 = np.zeros((data_depth,8), dtype=np.uint64)

FEATURE DATA IN

ts = time.time()
buf10[1:3585,:] = in_fmap[:,0,0:8]
buf11[1:3585,:] = in_fmap[:,0,8:16]
buf12[1:3585,:] = in_fmap[:,0,16:24]
buf13[1:3585,:] = in_fmap[:,0,24:32]

buf20[1:3585,:] = in_fmap[:,1,0:8]
buf21[1:3585,:] = in_fmap[:,1,8:16]
buf22[1:3585,:] = in_fmap[:,1,16:24]
buf23[1:3585,:] = in_fmap[:,1,24:32]
te = time.time()

print("2 took ", str(te-ts), “s”)

3) Each numpy buffer then performs the per-column weighting shown in the code below (column i is multiplied by 2^(8·i)).

ts = time.time()
for i in range(8):
    buf10[:,i] = buf10[:,i] * (2**(8*i))
    buf11[:,i] = buf11[:,i] * (2**(8*i))
    buf12[:,i] = buf12[:,i] * (2**(8*i))
    buf13[:,i] = buf13[:,i] * (2**(8*i))
    buf20[:,i] = buf20[:,i] * (2**(8*i))
    buf21[:,i] = buf21[:,i] * (2**(8*i))
    buf22[:,i] = buf22[:,i] * (2**(8*i))
    buf23[:,i] = buf23[:,i] * (2**(8*i))
te = time.time()

print("3 took ", str(te-ts), “s”)

4) Sum along the row axis, and 5) assign the result to the SRAM buffers.

ts = time.time()
buf00[:,0] = np.sum(buf00, axis=1)
buf01[:,0] = np.sum(buf01, axis=1)
buf02[:,0] = np.sum(buf02, axis=1)
buf03[:,0] = np.sum(buf03, axis=1)

buf10[:,0] = np.sum(buf10, axis=1)
buf11[:,0] = np.sum(buf11, axis=1)
buf12[:,0] = np.sum(buf12, axis=1)
buf13[:,0] = np.sum(buf13, axis=1)

buf20[:,0] = np.sum(buf20, axis=1)
buf21[:,0] = np.sum(buf21, axis=1)
buf22[:,0] = np.sum(buf22, axis=1)
buf23[:,0] = np.sum(buf23, axis=1)
te = time.time()

print("4 took ", str(te-ts), “s”)

ts = time.time()
input_buf00[:] = buf00[:,0]
input_buf01[:] = buf01[:,0]
input_buf02[:] = buf02[:,0]
input_buf03[:] = buf03[:,0]
input_buf10[:] = buf10[:,0]
input_buf11[:] = buf11[:,0]
input_buf12[:] = buf12[:,0]
input_buf13[:] = buf13[:,0]
input_buf20[:] = buf20[:,0]
input_buf21[:] = buf21[:,0]
input_buf22[:] = buf22[:,0]
input_buf23[:] = buf23[:,0]
te = time.time()

print("5 took ", str(te-ts), “s”)

The time consumption of each process is shown here:
[image: per-step timing results]

Please give me any advice on reducing the operation time in PS… :frowning:
Thanks.


Hi,

If your data type is 8-bit, you can use dtype=np.uint8 to allocate pynq buffers; then you do not need to manipulate bytes to get the 64-bit representation.

Another thing: only use pynq buffers for data that will be sent to or received from the PL; otherwise use numpy.
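
For example, a minimal sketch of what I mean (assuming in_fmap is a numpy array of 8-bit values; the buffer name and data_depth follow your code above):

import numpy as np
from pynq import allocate

# Allocate the PL-facing buffer directly as 8-bit elements
input_buf00 = allocate(shape=(data_depth, 8), dtype=np.uint8)
input_buf00[:] = in_fmap[:, 0, 0:8]    # plain assignment, no manual packing

# If the 64-bit words are ever needed on the PS side, reinterpret without copying
words = input_buf00.view(np.uint64)    # shape (data_depth, 1)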

Mario


Thanks for your reply, Mario.

There seems to be some misunderstanding because my English is not good enough.

The point is that the computation time to convert eight 8-bit values into one 64-bit value on the PS is too long. (The designed BRAM width is 64 bits.)

For example, the variable 'in_fmap' is loaded from a MATLAB data file and contains [10, 20, 30, 40, 50, 60, 70, 80].

  1. I declared a numpy buffer like 'buf00 = np.zeros((1,8), dtype=np.uint64)' and assigned 'buf00 = in_fmap'.

And then, to generate one 64-bit value from the eight 8-bit values,

  2. 10*(2^56) + 20*(2^48) + 30*(2^40) + 40*(2^32) + … + 80*(2^0) is calculated, which behaves like a bit shift (see the quick check after step 3).

Until this point, only numpy buffers are used.

  3. Finally, I declared a PYNQ buffer for transmission to BRAM via CDMA,
    'input_buf00 = allocate(shape=(1,), dtype=np.uint64)', and assigned the summed value (10*(2^56) + 20*(2^48) + 30*(2^40) + 40*(2^32) + … + 80*(2^0)) to it.
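
As a quick check that this weighted sum is just a byte-order interpretation (plain numpy, values from the example above; note that giving the first element the 2^56 weight corresponds to big-endian order, while the per-column weights 2^(8*i) in step 3 of my first post correspond to little-endian order):

import numpy as np

vals = np.array([10, 20, 30, 40, 50, 60, 70, 80], dtype=np.uint8)

# First element weighted by 2^56, as written above -> big-endian interpretation
packed_be = sum(int(v) << (8 * (7 - i)) for i, v in enumerate(vals))
assert packed_be == int.from_bytes(vals.tobytes(), byteorder="big")

# Element i weighted by 2^(8*i), as in the loop in my first post -> little-endian interpretation
packed_le = sum(int(v) << (8 * i) for i, v in enumerate(vals))
assert packed_le == int.from_bytes(vals.tobytes(), byteorder="little")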

I measured each processing time with the time.time() function;
step 1 takes ~15 ms, step 2 takes ~80 ms, and step 3 takes ~15 ms.

The operation time is too long compared with my logic, which takes less than 1 ms. Is there any other method to save calculation time in PS?

No misunderstanding, I understood your question.

Is it correct to assume that each element in in_fmap is 8-bit (un)signed? If this is the case, you can create your input buffer like this:

input_buf00 = allocate(shape=(1, <size>), dtype=np.uint8)

Then you can assign input_buf00[:] = in_fmap

input_buf00 will be allocated contiguously in memory, and you can access it from the PL as 64-bit elements. Only the way the elements are interpreted in software changes. Instead of having to manually pack eight 8-bit elements into a 64-bit element, you let numpy do it, which should be much more efficient.

Mario


As proof of this you can make a very quick check.

import pynq
import numpy as np

#### Download your overlay here ####

# Declare test array, an 8-element array from 0 to 7
test_array = np.arange(0, 8, dtype=np.uint8)

# Assign test array to an 8-element (8-bit) pynq buffer
in_buff1 = pynq.allocate((8,), dtype=np.uint8)
in_buff1[:] = test_array

# Assign test array to a 1-element (64-bit) pynq buffer
in_buff2 = pynq.allocate((1,), dtype=np.uint64)
aux = 0
for i in range(len(test_array)):
    aux += int(test_array[i]) * 2**(i*8)   # cast to Python int to avoid uint8 overflow
in_buff2[:] = aux

# Compare byte representation (this is how the elements are stored in memory)
print("Byte representation equal: {}".format(in_buff2.tobytes() == in_buff1.tobytes()))

Hope this helps
Mario


@louishin,

I see that you marked the answer as solution.
Can you please comment on the results you are getting? Is it faster?

Hi again, Mario
I appreciate your kind reply.
I solved the problem in this topic with your help.
Compared to the previous ~110 ms time consumption, it now takes ~20 ms.


The method you taught me is really useful.

I'm not sure if it is because the input data is large, but the PS still spends considerably more time than the PL operation.