MMIO data loading issue

Hi
I am using a Zynq UltraScale+ device.

Here the send size is 4098 (sendchannel.wait() completes), so the DMA expects 4098 on the receive side (recvchannel.wait()) to finish the transaction, but it only gets 4096. That is why it keeps running without returning a result. Where can I modify this to get 4098?
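For reference, the length the DMA is asked to complete comes from the buffer handed to transfer(). A minimal check of what my 4098 int16 values amount to in bytes, assuming the standard pynq.allocate buffer (run on the board):

import numpy as np
from pynq import allocate

# The DMA length is programmed from the buffer size in bytes:
# elements * bytes-per-element.
buf = allocate(shape=(4098,), dtype=np.int16)
print(buf.size, buf.itemsize, buf.nbytes)   # 4098 elements, 2 bytes each, 8196 bytes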

Screenshot 2022-05-30 at 08-06-56 FPGA_CNN - Jupyter Notebook

Can you share your Python code?

Cathal

Nagendra

I am quite sure there is nothing wrong with the PYNQ API.
I successfully ran inference with that GitHub repository (a lot of modifications were needed).
It all came down to the HLS code design.
There were many mistakes in the sizes and dimensions of the CONV and FC layers.
Please confirm that you tested which blocks are faulty before posting the issue here.
From your previous topic you had connected an ILA and run a simulation.

Here is my Jupyter Notebook history log:
mnist.ipynb (11.2 KB)

Hi @briansune @cathalmccabe

Thanks for guiding me.

I’m working with Vivado 2019.2 and the ZCU111 board.
I took this example and modified it to fit the needs of my model. https://github.com/ZhaoqxCN/PYNQ-CNN-ATTEMPT/tree/master/Minst-CNN/CNN HLS

Yes, I changed the HLS code of the convolution and FC layers (input dimensions, kernel dimensions, input and output channels), as well as the parameters in AXI Master.h, AXI slave.h, and config.h. However, no pragmas were changed.

@briansune, what do you mean by a sanity test on HLS?

MNIST example
FPGA_CNN (2).ipynb (23.5 KB)

My Python code:

import time
import numpy as np
from pynq import allocate, DefaultHierarchy


class Convolutional_Neural_Network(DefaultHierarchy):

    def __init__(self, description):
        super().__init__(description)

    def loadweight(self, W, index, IFMDim, OFMDim, IFMDim1, OFMDim1):
        # Header [index, 0, KerDim, IFMCH, IFMDim, OFMDim, OFMCH,
        # IFMDim1, OFMDim1] followed by the scaled kernel weights.
        KerDim = W.shape[2]
        IFMCH = W.shape[1]
        OFMCH = W.shape[0]
        kernel_val = W.ravel() * 43000
        kernel = np.append([index, 0, KerDim, IFMCH,
                            IFMDim, OFMDim, OFMCH, IFMDim1, OFMDim1],
                           kernel_val)

        print('kernel = ', ([index, 0, KerDim, IFMCH,
                             IFMDim, OFMDim, OFMCH, IFMDim1, OFMDim1],
                            kernel_val))

        in_buffer = allocate(shape=(kernel.shape[0],), dtype=np.int16)
        out_buffer = allocate(shape=(kernel.shape[0],), dtype=np.int16)

        print('input buffer  = ', kernel.shape[0])
        print('output buffer = ', kernel.shape[0])

        for i, v in enumerate(kernel):
            in_buffer[i] = v

        # Queue both directions, then wait for each channel to finish.
        self.axi_dma_0.sendchannel.transfer(in_buffer)
        print('finished sendchannel.transfer')
        self.axi_dma_0.recvchannel.transfer(out_buffer)
        print('finished recvchannel.transfer')
        self.axi_dma_0.sendchannel.wait()
        print('finished sendchannel.wait')
        self.axi_dma_0.recvchannel.wait()
        print('finished recvchannel.wait')

    def execute(self, test_data, batch_size, input_ch, input_dim,
                input_dim1, output_ch, output_dim, output_dim1):
        input_mat = test_data[0:batch_size]
        print('finished input_mat')
        print('input_mat = ', test_data[0:batch_size])

        # Header followed by the flattened input batch.
        input_val = np.append([0, batch_size, 0, input_ch, input_dim,
                               input_dim1, output_ch, output_dim, output_dim1],
                              input_mat.ravel())
        print('finished input_val')
        print('input_val = ', [0, batch_size, 0, input_ch, input_dim,
                               input_dim1, output_ch, output_dim, output_dim1],
              input_mat.ravel())

        in_buffer = allocate(shape=input_val.shape, dtype=np.int16)
        out_buffer = allocate(
            shape=(9 + output_ch * batch_size * output_dim * output_dim1,),
            dtype=np.int16)
        # np.copyto(in_buffer, input_val.astype(np.int16))

        for i, v in enumerate(input_val):
            in_buffer[i] = v
        print('input buffer  = ', input_val.shape)
        print('output buffer = ',
              9 + output_ch * batch_size * output_dim * output_dim1)

        start_time = time.process_time()
        print('finished copying')
        self.axi_dma_0.sendchannel.transfer(in_buffer)
        print('finished sendchannel.transfer')
        self.axi_dma_0.recvchannel.transfer(out_buffer)
        print('finished recvchannel.transfer')
        self.axi_dma_0.sendchannel.wait()
        print('finished sendchannel.wait')
        self.axi_dma_0.recvchannel.wait()
        print('finished recvchannel.wait')
        end_time = time.process_time()
        print('Elapsed Test Time: ', end_time - start_time)

        # Drop the 9-value header and reshape to one row per image.
        output_mat = out_buffer[9:].reshape(batch_size, -1).astype(np.float32)
        print('finished output_mat')
        return output_mat

    @staticmethod
    def checkhierarchy(description):
        # Bind this driver to any hierarchy containing an 'axi_dma_0' instance.
        return 'axi_dma_0' in description['ip']
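For completeness, this is roughly how the driver is picked up and used once the overlay is loaded. The bitstream name, hierarchy name, and the weight/test arrays below are placeholders, not the real design:

from pynq import Overlay

# Loading the overlay binds Convolutional_Neural_Network to any hierarchy
# whose description contains an 'axi_dma_0' instance (via checkhierarchy).
ol = Overlay('cnn.bit')                       # placeholder bitstream name
cnn = ol.cnn_hierarchy                        # placeholder hierarchy name
# W_conv1 and test_images are NumPy arrays prepared elsewhere.
cnn.loadweight(W_conv1, 1, 28, 24, 28, 24)    # example layer dimensions only
out = cnn.execute(test_images, 100, 1, 28, 28, 16, 24, 24)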

A sanity test is a very basic concept in HW, SW, or any design.
The design flow is:
design constraints
design
smoke test
re-design
sanity test
re-design

So what you are currently facing is a smoke-test failure.
Follow the flow I suggested here:
DMA engine with the Xilinx default AXI-Stream FIFO.
Can the Jupyter Notebook drive it without any problem? For example:
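A minimal loopback check, assuming a bitstream with just an AXI DMA feeding an AXI4-Stream Data FIFO back to itself (the bitstream and instance names are whatever you used in your block design):

import numpy as np
from pynq import Overlay, allocate

# Loopback smoke test: AXI DMA MM2S -> AXI4-Stream Data FIFO -> AXI DMA S2MM.
# If this hangs or the data does not match, the problem is in the DMA setup,
# not in the CNN IP.
ol = Overlay('dma_loopback.bit')        # placeholder bitstream name
dma = ol.axi_dma_0                      # instance name from the block design

n = 1024
tx = allocate(shape=(n,), dtype=np.int16)
rx = allocate(shape=(n,), dtype=np.int16)
tx[:] = np.arange(n, dtype=np.int16)

dma.sendchannel.transfer(tx)
dma.recvchannel.transfer(rx)
dma.sendchannel.wait()
dma.recvchannel.wait()

print('loopback OK' if np.array_equal(tx, rx) else 'loopback FAILED')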

Also, an ILA is used for JTAG debug, not for simulation.
Why would you need an ILA when you could probe every signal in ModelSim or Verdi?

Next, for each HLS function, use the same method to test the weight load and run an inference test.
Any issue found?
Then unify the functions and run a final test.
Any issue found?

Sometimes the boundary between smoke and sanity testing is blurry.

So just consider the sanity of each block to be the main goal here.
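Concretely, that per-block check is just: drive one layer with small known data and compare against a pure software reference before chaining anything. A rough NumPy sketch of the reference side (x_test, w_test, fpga_out and the fixed-point scale are placeholders for whatever your execute() works with):

import numpy as np

def conv2d_reference(x, w):
    # Naive valid 2-D cross-correlation, i.e. the usual CNN convolution layer.
    # x: (IFMCH, H, W), w: (OFMCH, IFMCH, K, K).
    ofmch, ifmch, k, _ = w.shape
    oh, ow = x.shape[1] - k + 1, x.shape[2] - k + 1
    out = np.zeros((ofmch, oh, ow), dtype=np.float32)
    for oc in range(ofmch):
        for ic in range(ifmch):
            for i in range(oh):
                for j in range(ow):
                    out[oc, i, j] += np.sum(x[ic, i:i+k, j:j+k] * w[oc, ic])
    return out

# x_test, w_test: small known tensors sent through the single HLS conv layer.
# fpga_out: the result read back over the DMA, rescaled by your fixed-point factor.
# np.allclose(conv2d_reference(x_test, w_test), fpga_out) should then hold.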

Hi @briansune

Thank you for your suggestions. As you said, the ILA is just there to check whether the IP exposes all the ports that were defined in the HLS.

My question is:

My data is 4098 bytes for transmission, but it should be 4096 because the DMA seems to be limited to 4096, which means my data overflows by 2 bytes.

E.g. the DMA sizes are 2, 4, 8, 16, …, 1024, 2048, 4096.

Is it possible to send 4096 + 2 bytes?

I don't understand what you mean by "DMA size".
If it were really limited to 4096, how could the MNIST kernel data, i.e. 28x28x128+7 16-bit values, be fed into memory via AXI4-Stream?
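Putting numbers on that (the length-register widths below are only examples; check the "Width of Buffer Length Register" setting of your own AXI DMA):

# Transfer size in bytes vs. what the AXI DMA length register can describe.
mnist_words = 28 * 28 * 128 + 7    # 100359 16-bit values
print(mnist_words * 2)             # 200718 bytes in one transfer
print(4098 * 2)                    # the 4098 int16 kernel: 8196 bytes
print(2 ** 14 - 1, 2 ** 26 - 1)    # max bytes for 14-bit / 26-bit length registers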

I printed the mmio.read() values in dma.py.

It showed 4096 for both sendchannel.wait() and recvchannel.wait() for MNIST.

You can see the file here Mmio data loading issue - #3 by briansune

But in my case, why is it 4098? I thought this could be the problem, because the sizes should match for the transaction to complete.

Screenshot 2022-05-30 at 08-06-56 FPGA_CNN - Jupyter Notebook
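Rather than editing dma.py, the same values can be read from the notebook with the stock PYNQ driver; the offsets below are the standard AXI DMA registers from Xilinx PG021, and ol.axi_dma_0 stands for wherever the DMA sits in your overlay:

# Inspect the AXI DMA registers directly from Jupyter.
dma = ol.axi_dma_0                           # or ol.<your_hierarchy>.axi_dma_0
print('MM2S status:', hex(dma.read(0x04)))
print('MM2S length:', dma.read(0x28))        # bytes requested on the send side
print('S2MM status:', hex(dma.read(0x34)))
print('S2MM length:', dma.read(0x58))        # bytes received once the transfer completes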

Can you connect a System ILA, open JTAG, trigger the ILA on the rising edge of "tlast", and post the capture here so I can see what is actually happening?

Hi,

I have connected the ILA to the IP, but an error pops up about the debug core clock connected to the ILA. I have tried two methods to debug but still get the same problem.

Methods:

  1. Select the IP ports and click Debug; a System ILA is then added to the block design automatically.
  2. Add the ILA manually after synthesis.

Error:

WARNING: [Labtools 27-3361] The debug hub core was not detected.
Resolution:

  1. Make sure the clock connected to the debug hub (dbg_hub) core is a free running clock and is active.
  2. Make sure the BSCAN_SWITCH_USER_MASK device property in Vivado Hardware Manager reflects the user scan chain setting in the design and refresh the device. To determine the user scan chain setting in the design, open the implemented design and use ‘get_property C_USER_SCAN_CHAIN [get_debug_cores dbg_hub]’. For more details on setting the scan chain property, consult the Vivado Debug and Programming User Guide (UG908).

WARNING: [Labtools 27-3413] Dropping logic core with cellname: ‘u_ila_0’ at location ‘uuid_23E7D65A79BC59F7BC47406C1714DFAE’ from probes file, since it cannot be found on the programmed device.

Nagendra, what are you actually doing?

As I said: implement the design with the System ILA attached, connect the board with the JTAG cable, and set up the trigger in the waveform window on tlast.

Without a waveform, how am I supposed to see anything that would let me draw conclusions about the behavior of your design?

This is what you should see when debugging:

Hi @briansune

Thank you so much for guiding me.

I have tried to run a DMA example through my IP using Vitis, but the transaction does not complete.

Example file:
xaxidma_example_simple_poll.c (10.1 KB)


It keeps running without showing DMA success results in Vitis.

ILA Waveform

TLAST is zero.

Can you tell me where I made a mistake? What could prevent the data from being processed through the DMA?


Nagendra

Since day one Cathal has nicely pointed out that tlast is critical on the DMA transfer from your core back to the DMA engine.
You have all the tools and skills you need to debug and develop. I will step out at this point, since you are the one who has to learn from and build this project, not me.

I am not sure I am a good teacher, but that is what I can do without overdoing it.

Hi @briansune

Thank you for responding. I'm sorry if my messages came across badly; PYNQ is new to me.

This is a massive project that includes HLS, Vivado, and PYNQ. When I encounter a problem, I become extremely perplexed.
It took some time to realize the significance of each step.

Hi @cathalmccabe

Do the TVALID and TLAST triggers need to rise at the same time to complete the transaction?
I have modified the TLAST value from 0 to 1 and the trigger now fires on the rising edge, but the DMA transaction still does not complete.

What modifications should I make? Can you give me any suggestions based on the ILA waveform?

Transactions occur when TVALID AND TREADY are high at the same time. When TLAST is also high it indicates the last transaction in the data transfer.

I’m not sure how you are modifying TLAST. When you send data from the DMA to your IP, TLAST will be set by the DMA, based on the size of the array you transfer.

When you receive data from the DMA, TLAST is set by the IP (which is the source of the data that is sent to the DMA).

If the DMA transaction doesn’t complete because your IP doesn’t send enough data, then you need to increase the size of the “output” array so that you read enough data to complete the transfer.

If you aren’t expecting to read more data, then TLAST is not being set correctly. This may be a problem in the IP, or a problem in the amount of data you send to the IP.

Why do you think the DMA is limited to 4096? What did you set the length register to? If the DMA only supported 4096, it would set TLAST early when it reached 4096 and the transfer would complete there; you would see the first 4096 results in your array.
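One practical check, assuming the standard PYNQ DMA driver and the PG021 register map: compare the number of bytes you asked the receive channel for with what the S2MM length register reports after the transfer:

# out_buffer is the array passed to recvchannel.transfer().
dma = ol.axi_dma_0                 # your DMA instance
requested = out_buffer.nbytes      # bytes the driver programmed
received = dma.read(0x58)          # S2MM length register, updated on completion
print('requested:', requested, 'received:', received)
# If recvchannel.wait() never returns, the IP did not deliver 'requested' bytes
# ending in TLAST: either size out_buffer to match the IP's real output, or fix
# how the IP asserts TLAST.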

Cathal