PYNQ: PYTHON PRODUCTIVITY FOR ZYNQ

MMIO write is too slow

Hi,

I am running a small deep neural network on PYNQ-Z1 PL. The input of this network is a 70x80 image.

The PS (i.e. a Python script) first has to write each input image, pixel by pixel, to the off-chip DRAM using MMIO write() before setting the start flag of the PL.

The problem is that writing these 5600 input pixels to the DRAM takes a very long time (~ 150-200ms) as compared to the total time needed by the PL to finish the computation (~ 1ms).

So I’m in a silly situation where communication takes much more time than computation :)

Is there any way to speed up the data transfer? Is it possible to move blocks or bursts of data for example?

Below is the corresponding code:

########################################
....
    # Takes about 200 ms
    #------------------------------
    # Stream image to memory, one 32-bit word per pixel
    for i in range(image_in.shape[0]):
        for j in range(image_in.shape[1]):
            PL.write(write_address, image_in[i][j])
            write_address = write_address + 4

########################################
    # Takes about 1 ms
    #------------------------
    # Start PL
    PL.write(0x00, 1)

    # Wait until PL finishes
    while True:
        bits = PL.read(0x00)
        ap_start = bits & 0x1
        ap_done = (bits >> 1) & 0x1
        ap_idle = (bits >> 2) & 0x1
        ap_ready = (bits >> 3) & 0x1
        if ap_done == 1:
            break
...
########################################
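
For reference, the status-bit decoding and busy-wait above could be wrapped with a timeout so the script cannot hang if ap_done never goes high. This is only a sketch: `decode_ap_ctrl`, `wait_for_done`, and the `read_ctrl` callback are hypothetical names, and the bit layout is taken from the loop above rather than from any IP documentation.

```python
import time

def decode_ap_ctrl(bits):
    """Split a control-register value into the four flags used in the loop above."""
    return {
        "ap_start": bits & 0x1,
        "ap_done": (bits >> 1) & 0x1,
        "ap_idle": (bits >> 2) & 0x1,
        "ap_ready": (bits >> 3) & 0x1,
    }

def wait_for_done(read_ctrl, timeout_s=1.0):
    """Poll read_ctrl() until ap_done is set; return False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if decode_ap_ctrl(read_ctrl())["ap_done"]:
            return True
    return False
```

On the board, `read_ctrl` would presumably be something like `lambda: PL.read(0x00)`.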

I appreciate your suggestions.

Thank you so much !

The quick fix is to write the data through the underlying numpy array: instances of the MMIO class expose an .array attribute that provides direct access to the registers. When I do this, I generally take a slice of the array covering the registers I want to update frequently and then assign to it directly.

input_registers = mmio.array[0x100:0x200] # address 0x400 to 0x800
input_registers[:] = input_data

This will be substantially faster than calling write for each entry separately.
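
To illustrate why the two approaches end up with the same register contents, here is a small sketch that runs without a board. The `regs` array is only a stand-in for `mmio.array` (an assumption: a uint32 view over a 0x10000-byte window), so it shows the indexing and assignment pattern, not real hardware timing.

```python
import numpy as np

# Stand-in for mmio.array: a uint32 view of a 0x10000-byte register window.
# On hardware this would come from the MMIO object; here it is plain host memory.
regs = np.zeros(0x10000 // 4, dtype=np.uint32)

input_data = np.arange(0x100, dtype=np.uint32)

# Per-word writes: one operation per value, like calling write() in a loop.
byte_addr = 0x400
for value in input_data:
    regs[byte_addr // 4] = value  # plays the role of mmio.write(byte_addr, value)
    byte_addr += 4
loop_result = regs[0x100:0x200].copy()

# Bulk write: take the slice once, then assign the whole block in one statement.
regs[:] = 0
input_registers = regs[0x100:0x200]  # a view, so assignment writes through to regs
input_registers[:] = input_data
bulk_result = regs[0x100:0x200].copy()
```

Because the slice is a view rather than a copy, it can be created once and reused for every new input.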

For even better performance you will want to use a DMA engine to pull the data directly out of the DDR memory. There’s some documentation on this on our readthedocs page.

Peter

Hi Peter,
Thanks for your quick reply.

Normally I use MMIO as:
PL = MMIO(0x43C00000,0x10000) #(IP_BASE_ADDRESS, ADDRESS_RANGE)
and then I use the PL.read( ) or PL.write( )

But I’m not sure how to use mmio.array, or how to define the base address and range, for example.
Can you please suggest any documentation about it? I also didn’t understand what 0x100:0x200 has to do with 0x400 and 0x800.

Moreover, I have the DMA as a part of my PL (IP), so that the stream is only internal to the PL.
I will take a look at the link you mentioned.

Thanks again !

Hi again,
I guess I figured out how to get this done.
So for initializing the MMIO, the code will be the same:
PL = MMIO(0x43C00000,0x10000) #(IP_BASE_ADDRESS, ADDRESS_RANGE)

Instead of looping to transfer each pixel of the 70x80 input image, I did the following:
input_registers = PL.array[(base_input_address)//4:(base_input_address//4+80*70)]
I wasn’t sure at first why I divided by 4; I saw that Peter divided the byte range 0x400 to 0x800 by 4 to get the mmio.array index range (PL.array in my case). I guess it is because each array element is a 32-bit word while addresses are per byte (right?).
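
The divide-by-4 step can be written down as a small helper. This is a sketch under the assumption that each array element is one 4-byte word; `word_slice` is a hypothetical name, and the base address 0x1000 is made up for illustration.

```python
def word_slice(byte_offset, n_words, word_bytes=4):
    """Return the slice of a 32-bit word array that starts at byte_offset
    (relative to the MMIO base) and covers n_words consecutive words."""
    if byte_offset % word_bytes:
        raise ValueError("byte offset must be word-aligned")
    start = byte_offset // word_bytes
    return slice(start, start + n_words)

# Peter's example: bytes 0x400 up to (not including) 0x800 are 0x100 words.
s = word_slice(0x400, 0x100)

# A 70x80 image, one word per pixel, at a hypothetical base offset of 0x1000.
img = word_slice(0x1000, 70 * 80)
```

The resulting slice would then be used as `PL.array[img]`.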

Whenever I want to transfer the input image to the memory, I simply do:
input_registers[:] = image_in.ravel()

By the way, I was using 16 bits per pixel (not shown in the code I posted before), but I changed it to 32 bits so that I don’t have to concatenate pixels before transferring them to the registers; direct assignment then works as shown above.

It might be interesting for others if you could post the transfer time, and whether this is good enough for your application. (I would be interested ;) )
Sending 16-bit pixels in 32-bit words effectively halves your bandwidth/doubles the transfer time, although you need to balance this against the time taken to pack your data in software. There are Python packages that do this packing more efficiently than hand-written Python code.

As Peter mentioned, this is a quick fix. You can improve the data transfer by using a DMA, or using a Master interface on your IP that can access memory directly rather than relying on the CPU to transfer data.

Cathal

Gladly.

The transfer time, originally 150-200 ms, is now about 1.1 ms.
What you said is correct. Halving the bandwidth is a drawback in this case but it really improved the data transfer. Again, it’s a quick fix.

Would you please post an example of the python packages that do the packing work for me?
My code is based on bit shifts. I’m curious to see a more efficient implementation.
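
One possible way to do the packing with NumPy, offered as a sketch rather than as the package Cathal had in mind: two 16-bit pixels go into each 32-bit word, with the even-indexed pixel in the low half. That ordering is an assumption and has to match what the PL expects. `pack_pairs` is a hypothetical helper name.

```python
import numpy as np

def pack_pairs(pixels16):
    """Pack pairs of 16-bit pixels into 32-bit words (even pixel in the low half)."""
    flat = np.ascontiguousarray(pixels16, dtype=np.uint16).ravel()
    if flat.size % 2:
        raise ValueError("need an even number of pixels")
    lo = flat[0::2].astype(np.uint32)
    hi = flat[1::2].astype(np.uint32)
    return lo | (hi << 16)  # vectorised shift/or over the whole image

image_in = np.arange(70 * 80, dtype=np.uint16).reshape(70, 80)
packed = pack_pairs(image_in)  # 2800 words instead of 5600

# On a little-endian host, reinterpreting the same bytes gives the same
# words without any arithmetic or copying.
packed_by_view = image_in.reshape(-1).view(np.uint32)
```

The packed array has half as many words, so the register slice it is assigned to would need to be 2800 entries long rather than 5600.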

I will have to try these two techniques (DMA / Master PL Interface) … Thanks !