PYNQ DDR+DMA for Arrays (not streams!)

Hello all,
I’m new to FPGAs and currently trying to implement the following top function:

int fit(float I[N][d], float b[N], float x[N], float p[N], float q[N], float r[N], float lambda, float gamma) {

#pragma HLS INTERFACE m_axi bundle=inI port=I
#pragma HLS INTERFACE m_axi bundle=inb port=b
#pragma HLS INTERFACE m_axi bundle=inx port=x
#pragma HLS INTERFACE m_axi bundle=inp port=p
#pragma HLS INTERFACE m_axi bundle=inr port=r
#pragma HLS INTERFACE m_axi bundle=inq port=q
#pragma HLS INTERFACE s_axilite port=lambda
#pragma HLS INTERFACE s_axilite port=gamma
#pragma HLS INTERFACE s_axilite port=return

    // ... (solver body omitted)
}

I implemented the hardware as in the attached PDF, and it works functionally, but it is very, very slow.
design_1_fit.pdf (79.6 KB)

In Vitis HLS the latency was estimated at about 1 second at 100 MHz;
in reality it takes 6.5 seconds at 100 MHz.

I think the issue is that I didn’t use a DMA, but instead used the Xlnk() buffers as I saw in one tutorial.

from pynq import Overlay
from pynq import Xlnk  # allocates physically contiguous buffers the IP can reach over AXI
import numpy as np
import time

ol = Overlay('design_1_fit.bit')
ol.download()
fit_ip = ol.fit

# create contiguous (CMA) buffers for the IP's array arguments
N = 1024
d = 16
xlnk = Xlnk()
I = xlnk.cma_array(shape=(N, d), dtype=np.float32)
b = xlnk.cma_array(shape=(N,), dtype=np.float32)
x = xlnk.cma_array(shape=(N,), dtype=np.float32)
r = xlnk.cma_array(shape=(N,), dtype=np.float32)
p = xlnk.cma_array(shape=(N,), dtype=np.float32)
q = xlnk.cma_array(shape=(N,), dtype=np.float32)

#read data
z = np.zeros(N)
with open("d1024_16f.csv") as file_name:
    array = np.loadtxt(file_name, delimiter=",")

# copy the data into the buffers
np.copyto(I, array[:, 0:d])
np.copyto(b, array[:, d])
np.copyto(x, z)
np.copyto(r, z)
np.copyto(p, z)
np.copyto(q, z)
fit_ip.write(0x00, 0x00)  # clear the control register (ap_start deasserted)

# register offsets of the pointer arguments:
# I: 0x18
fit_ip.write(0x18, I.physical_address)
# b: 0x24
fit_ip.write(0x24, b.physical_address)
# x: 0x30
fit_ip.write(0x30, x.physical_address)
# p: 0x3c
fit_ip.write(0x3c, p.physical_address)
# q: 0x48
fit_ip.write(0x48, q.physical_address)
# r: 0x54
fit_ip.write(0x54, r.physical_address)
# lambda: 0x60 (0x3f000000 is the IEEE-754 bit pattern of 0.5)
fit_ip.write(0x60, 0x3f000000)
# gamma: 0x68 (0x3dcccccd is the IEEE-754 bit pattern of 0.1)
fit_ip.write(0x68, 0x3dcccccd)

# start the IP and measure its run time
t_start = time.time()
fit_ip.write(0x00, 0x01)  # set ap_start
while fit_ip.read(0x00) & 0b10 != 0b10:  # poll ap_done (bit 1 of the control register)
    pass
t_stop = time.time()
print(t_stop-t_start)

print("Number of iterations " ,fit_ip.read(0x10)) #return of fit function
#output results
for i in range(N) :
    print(x[i])

Can anyone please help me work out which IP blocks I need to store all of these vectors and the matrix in DDR memory, and then have the IP talk directly to DDR (maybe using a DMA)?
I tried looking for tutorials online, but everything I could find relates to DMA with AXI-Stream (axis) interfaces. My design doesn't use streams, and I'm still quite confused about what all of the IP blocks do.


I think you are taking the right approach.
However, you have six m_axi ports, and you need to consider how the data will be transferred: how much data moves on each port and what the sequence or access pattern is. By default, HLS tools don't model the DRAM latencies that will exist in your real system, so I expect contention between your ports is causing the slowdown.
For example, if your design tries to access all six ports simultaneously, the transfers will have to be sequenced.
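One common way to reduce that kind of contention is to burst a block of each array into local BRAM buffers, do the work on-chip, and burst the results back out. Below is a minimal sketch of the idea only: the function name, the CHUNK size, the reduced port list and the placeholder compute loop are illustrations, not your actual fit() code.

// Sketch: burst a chunk into BRAM, compute locally, burst the result back.
// CHUNK, the port list and the compute loop are placeholders.
#include <string.h>

#define N 1024
#define CHUNK 128

void fit_chunked(const float *b, float *x) {
#pragma HLS INTERFACE m_axi port=b bundle=inb max_read_burst_length=128
#pragma HLS INTERFACE m_axi port=x bundle=inx max_write_burst_length=128
#pragma HLS INTERFACE s_axilite port=return

    float b_loc[CHUNK], x_loc[CHUNK];   // on-chip working buffers

    for (int base = 0; base < N; base += CHUNK) {
        // one long burst read per chunk instead of many interleaved single-word reads
        memcpy(b_loc, b + base, CHUNK * sizeof(float));

        for (int i = 0; i < CHUNK; i++) {
#pragma HLS PIPELINE II=1
            x_loc[i] = 2.0f * b_loc[i];   // placeholder for the real update
        }

        // one long burst write back
        memcpy(x + base, x_loc, CHUNK * sizeof(float));
    }
}

With the data staged in BRAM like this, each port issues a few long bursts per chunk rather than competing for the HP port on every single element.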

I’d suggest you study your data access patterns, put an ILA on your memory ports, and check how the data transfers actually behave.

Some questions:
Which version of HLS are you using?
How many HP ports did you enable, and how did you connect them?
i.e. How are your ports connected to the Zynq HP ports?
Are your ports 32-bit? If so, can you widen them to 64-bit and access two elements at a time? This would match the width of the Zynq HP ports and improve your bandwidth efficiency.
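To make the widening idea concrete, here is a hedged, hand-written sketch: declare the port as a 64-bit word and unpack two floats per beat. The function and variable names are placeholders, and I believe newer Vitis HLS versions can also widen m_axi ports automatically (config_interface -m_axi_max_widen_bitwidth), so treat this as one possible approach rather than the required one.

// Sketch: read two packed floats per 64-bit beat on one m_axi port.
#include <ap_int.h>

#define N 1024

void read_b_wide(const ap_uint<64> *b64, float b_loc[N]) {
#pragma HLS INTERFACE m_axi port=b64 bundle=inb

    for (int i = 0; i < N / 2; i++) {
#pragma HLS PIPELINE II=1
        ap_uint<64> word = b64[i];
        ap_uint<32> lo_bits = word.range(31, 0);
        ap_uint<32> hi_bits = word.range(63, 32);

        // reinterpret each 32-bit half as a float (the usual HLS union trick)
        union { unsigned int u; float f; } lo, hi;
        lo.u = lo_bits.to_uint();
        hi.u = hi_bits.to_uint();

        b_loc[2 * i]     = lo.f;
        b_loc[2 * i + 1] = hi.f;
    }
}

On the PYNQ side the same float32 cma_array should still work unchanged, since two consecutive floats are simply read as one 64-bit word.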

Cathal


Thank you for the reply!
Basically, I only need 3 ports in parallel.
To answer your questions:

  • I’m using Vitis HLS 2021.1.
  • I enabled HP0 (as in the hardware PDF attached to the post). I also tried HP0 and HP2 together, but it didn’t affect the performance.
  • To be honest, I just used the auto connection, and it added two AXI interconnects (one for m_axi, and one for s_axilite and return).
  • My data is float (32-bit), and it would be worth widening to 64-bit, but that seems like an advanced step for now!
    Right now I can’t even figure out how to connect a DMA: the auto connection option didn’t generate one, only AXI interconnects. I will try using the ILA (I didn’t know it existed before ^^).