How to use AXI4 Burst to increase throughput?

Hi,
I am trying to use burst mode to increase the performance of an application on PYNQ. I ran several experiments and expected the throughput to improve as I increased the burst length, but it did not. The details of my implementation and the results are below.

The HLS code

  1. it uses a 512-bit port width
  2. internally, it runs the memory access test for several rounds

#include <ap_int.h>  // provides ap_uint<>

#define W 512
#define PRECISION 8

typedef ap_uint<W> word_t;
typedef ap_uint<PRECISION> data_t;

void DoCompute(word_t *input, word_t *output, int depth, int test_rnds){
#pragma HLS INTERFACE s_axilite port=return bundle=CTRL_BUS
#pragma HLS INTERFACE s_axilite port=test_rnds bundle=CTRL_BUS
#pragma HLS INTERFACE s_axilite port=depth bundle=CTRL_BUS
#pragma HLS INTERFACE m_axi port=output offset=slave bundle=DATA_OUT max_write_burst_length=256
#pragma HLS INTERFACE m_axi port=input offset=slave bundle=DATA_IN max_read_burst_length=256

	word_t temp;
	for(int rnd = 0; rnd < test_rnds; rnd++){
#pragma HLS LOOP_TRIPCOUNT min=100 max=100 avg=100
		for(int i = 0; i < depth; i++){
#pragma HLS LOOP_TRIPCOUNT min=100 max=100 avg=100
#pragma HLS PIPELINE II=1
			temp = input[i];
		}
	}

	output[0] = temp;
	return;
}

Block design

PS-PL data access HP port setting

DRAM settings

My PYNQ script

import numpy as np 
from pynq import Xlnk
from pynq import Overlay
import pynq

xlnk = Xlnk()
xlnk.xlnk_reset()

dt = np.uint8

BIT_WIDTH=512
PRECISION=8

COL=int(BIT_WIDTH/PRECISION)
DEPTH=1024*32

TEST_ROUNDS=10

input_data = xlnk.cma_array(shape=(DEPTH, COL), dtype=dt)
output_data = xlnk.cma_array(shape=(DEPTH, COL), dtype=dt)

print("Allocating memory done")

################### Download the overlay
overlay = Overlay("./design_1.bit")
print("Bitstream loaded")

DoCompute = overlay.DoCompute_0

DoCompute.write(0x10, input_data.physical_address)
DoCompute.write(0x18, output_data.physical_address)
DoCompute.write(0x20, DEPTH)
DoCompute.write(0x28, TEST_ROUNDS)

def start_run(ip_instance):
    ip_instance.write(0x00, 1)
    isready = ip_instance.read(0x00)

    while( isready == 1 ):
        isready = ip_instance.read(0x00)

#######################
# HW execution latency
#######################
import timeit, functools
t = timeit.Timer(functools.partial(start_run, DoCompute))  
etime = t.timeit(1) # in sec.

transfer_size = ((BIT_WIDTH/8)*DEPTH)  # in byte
bandwidth = transfer_size/(etime/TEST_ROUNDS)
print("Effective bandwidth(GB/s) = ", bandwidth*(10**(-9)))
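The polling loop in `start_run` above compares the whole control register against 1. As a rough sketch only, the helper below decodes the individual status bits instead. It assumes the usual HLS `ap_ctrl_hs` layout at offset 0x00 (bit 0 `ap_start`, bit 1 `ap_done`, bit 2 `ap_idle`, bit 3 `ap_ready`), which should be checked against the driver header generated for the IP. The names `decode_ap_ctrl` and `wait_for_done` are hypothetical, and a fake register sequence stands in for `ip_instance.read(0x00)`.

```python
# Assumed ap_ctrl_hs bit layout; verify against the generated driver header.
AP_START = 0x01  # bit 0
AP_DONE = 0x02   # bit 1
AP_IDLE = 0x04   # bit 2
AP_READY = 0x08  # bit 3


def decode_ap_ctrl(value):
    """Split a control-register value into its named status bits."""
    return {
        "ap_start": bool(value & AP_START),
        "ap_done": bool(value & AP_DONE),
        "ap_idle": bool(value & AP_IDLE),
        "ap_ready": bool(value & AP_READY),
    }


def wait_for_done(read_ctrl, max_polls=1_000_000):
    """Poll read_ctrl() until ap_done is set; return the number of polls used."""
    for polls in range(1, max_polls + 1):
        if read_ctrl() & AP_DONE:
            return polls
    raise TimeoutError("ap_done was never observed")


# Stand-in for ip_instance.read(0x00): busy twice, then done/idle/ready.
fake_values = iter([0x01, 0x01, 0x0E])
polls_needed = wait_for_done(lambda: next(fake_values))
print(polls_needed, decode_ap_ctrl(0x0E))
```

On the board, the same helper could be called as `wait_for_done(lambda: DoCompute.read(0x00))` after writing 1 to offset 0x00.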

The result: for both the BURST=256 and BURST=16 cases, the measured bandwidth is 1.5 GB/s, which is weird.

What am I missing, or where is my mistake?

Thanks.

Hi,
A few suggestions:

It looks like the loops in your HLS design are functionally equivalent to this:
output[0] = input[depth - 1];
… And you have an asymmetry in the amount of input and output data.
I’d suggest you add some extra logic to prevent the tools from optimizing the reads away, and balance the input and output data.

E.g. for the inner loop:
out[i] = in[i] + 1;

Timing this from Python will not be very accurate, so there will be some variability.
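To get a feel for that variability, one option is to repeat the measurement and look at the spread rather than a single sample. The sketch below is only illustrative: `measure` and `effective_bandwidth_gbps` are hypothetical helper names, and `fake_run` is a stand-in workload used here in place of the real `start_run` call so that the snippet runs anywhere.

```python
import statistics
import timeit


def measure(run_once, repeats=5):
    """Time run_once() several times and summarise the spread in seconds."""
    samples = timeit.repeat(run_once, repeat=repeats, number=1)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def effective_bandwidth_gbps(bytes_per_round, rounds, seconds):
    """Bytes moved per second, in GB/s (10**9 bytes)."""
    return bytes_per_round * rounds / seconds / 1e9


def fake_run():
    # Stand-in for start_run(DoCompute); replace with the real call on the board.
    sum(range(10_000))


stats = measure(fake_run)
print(stats)
# 64 bytes per word * 32768 words, 10 rounds, using the fastest sample.
print(effective_bandwidth_gbps(64 * 32768, 10, stats["min"]))
```

The minimum is often the most useful number for a bandwidth estimate, since the slower samples mostly reflect software overhead on the Python side rather than the hardware.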

If you want to see whether bursts are being used, and to get cycle-accurate timing measurements, I’d suggest you add a ChipScope ILA to your design and probe the signals from the PL to see what is actually happening.

Cathal

Thanks, I will try that.

Hi Cathal,

I followed your suggestion and used an ILA to capture several AXI4 signals, and I moved the experiment to a ZCU104 with DDR4-2133. From the waveform, it seems that my IP takes 4 cycles to receive one 512-bit input word. The off-chip memory bandwidth in my system is therefore at most about 4 GB/s, far below the theoretical peak bandwidth of a single DDR4-2133 bank, 17 GB/s.
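The two figures being compared can be reproduced with some quick arithmetic. This is a back-of-the-envelope sketch, not a measurement: it assumes a 64-bit DDR4 data bus (consistent with the 17 GB/s quoted above) and uses the 266 MHz IP clock reported later in this thread. The function names are hypothetical.

```python
def ddr_peak_gbps(transfers_per_second, bus_width_bits=64):
    """Theoretical peak DRAM bandwidth in GB/s (assumes a 64-bit data bus)."""
    return transfers_per_second * bus_width_bits / 8 / 1e9


def observed_gbps(word_bits, cycles_per_word, clock_hz):
    """Bandwidth when one word of word_bits arrives every cycles_per_word cycles."""
    return word_bits / 8 / cycles_per_word * clock_hz / 1e9


peak = ddr_peak_gbps(2133e6)          # DDR4-2133: 2133 mega-transfers per second
seen = observed_gbps(512, 4, 266e6)   # one 512-bit word every 4 cycles at 266 MHz
print(f"peak     = {peak:.3f} GB/s")
print(f"observed = {seen:.3f} GB/s ({seen / peak:.0%} of peak)")
```

Under these assumptions the IP reaches roughly a quarter of the theoretical DRAM peak.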

Is there anything I should do to achieve higher bandwidth when using a single DDR4 bank?

The following figure shows the waveform captured with the ILA debugging tools.

And my design details
(1) Block Diagram

(2) DRAM Configuration

(3) PS-PL Configuration

(4) HLS Code

#include <ap_int.h>  // provides ap_uint<>

#define WIDTH 512
#define DEPTH 1024
typedef ap_uint<WIDTH> DTYPE ;

void foo_top (DTYPE *in, DTYPE *out, int length) {

#pragma HLS INTERFACE m_axi port=in offset=slave bundle=IN max_read_burst_length=256
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=OUT max_write_burst_length=256
#pragma HLS INTERFACE s_axilite port=length bundle=BUS_A
#pragma HLS INTERFACE s_axilite port=return bundle=BUS_A

	DTYPE buff[DEPTH];
	int i, r;

		for (i = 0; i < length; i++){
#pragma HLS LOOP_TRIPCOUNT min=1024 max=1024 avg=1024
#pragma HLS PIPELINE II=1
			buff[i] = in[i];
		}

		for(i = 0; i < length; i++){
#pragma HLS LOOP_TRIPCOUNT min=1024 max=1024 avg=1024
#pragma HLS PIPELINE II=1
			out[i] = buff[i] + 1;
		}
	return;
}

What clock speed is your IP running at?
The M_AXI_FPD ports your IP connects to are 128 bits wide, so the maximum read or write bandwidth your IP can use at 100 MHz is 128 bits * 100 MHz = 12,800 Mbit/s, i.e. 1.6 GB/s (about 1.5 GiB/s), per port. If you can increase the clock speed, you can increase the bandwidth available to your IP, i.e. double the clock = double the bandwidth.
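The width-times-clock estimate can be written down as a small helper. This is a minimal sketch that assumes one data beat per clock cycle with no stalls, so it gives an upper bound rather than an achievable figure; `port_bandwidth` is a hypothetical name.

```python
def port_bandwidth(width_bits, clock_mhz):
    """Upper bound for one AXI port, assuming one beat per cycle and no stalls."""
    bits_per_second = width_bits * clock_mhz * 1e6
    bytes_per_second = bits_per_second / 8
    return {
        "mbit_per_s": bits_per_second / 1e6,
        "gb_per_s": bytes_per_second / 1e9,      # 10**9 bytes
        "gib_per_s": bytes_per_second / 2**30,   # 2**30 bytes
    }


at_100 = port_bandwidth(128, 100)
at_200 = port_bandwidth(128, 200)
print(at_100)
print(at_200)  # doubling the clock doubles every figure
```

The gap between the GB/s and GiB/s values explains why the same port can be quoted as either 1.6 or roughly 1.5 depending on the unit.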

Also, at the moment both the input and output AXI masters on your IP share the interconnect and connect to the same M_AXI port. I’m not sure whether the reads and writes currently overlap or, if they do, what effect interleaving them has on your DRAM performance. Ignoring this for now, you could connect each M_AXI on your IP to a dedicated port on the PS to avoid sharing the port.

By the way, I notice you are trying to set your ports to 512 bits wide. If they synthesize at this width, they will be converted down to 128 bits by the SmartConnect, so each 512-bit word takes 512 / 128 = 4 beats on the 128-bit port. This may be where you see a delay of 4 clock cycles.
You can see in the waveform that the AXI master port you are monitoring is 128 bits wide.
In theory you could split your memory accesses across more AXI master ports and connect each one to a dedicated PS AXI port. I’m not sure if you want to try this for this design.
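As a rough illustration of that idea, the sketch below computes how many 128-bit beats one wide word needs and how a buffer could be divided into contiguous slices, one per AXI master. It only models the address arithmetic; the helper names are hypothetical, and whether the extra ports actually help depends on the interconnect and DRAM controller.

```python
def beats_per_word(word_bits, port_bits):
    """Number of port-width beats needed to move one word (ceiling division)."""
    return -(-word_bits // port_bits)


def split_ranges(total_words, n_ports):
    """Divide total_words into n_ports contiguous (start, count) slices."""
    base, extra = divmod(total_words, n_ports)
    ranges = []
    start = 0
    for port in range(n_ports):
        count = base + (1 if port < extra else 0)  # spread any remainder
        ranges.append((start, count))
        start += count
    return ranges


beats = beats_per_word(512, 128)
slices = split_ranges(1024, beats)
print(beats)
print(slices)
```

With a 512-bit word and 128-bit ports, four masters each handling a quarter of the buffer would be needed to approach one full word per cycle in aggregate.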

Finally, when you look at the waveforms of the AXI masters, you should be able to see whether burst transactions are occurring. These signals are not included in the waveform screenshot you posted. However, I don’t think burst transactions are the problem at this point in your design. I’m guessing it may be the clock speed of your IP.

Cathal

Hi Cathal,
The clock speed is 266 MHz, so the bandwidth is 4.2 GB/s.
There seems to be no easy way to improve this, so I would like to pause this experiment for now and move on to experiments with multiple concurrent accesses.

Thank you very much for your suggestions.
Lin.