PYNQ Z1 Matrix Multiplication Accelerator Overlay Example

Hi there! I’m studying FPGA acceleration with the PYNQ-Z1 board.
As a first project, I tried to build a matrix multiplication accelerator.

  • Reference YouTube video title: Matrix Multiplication using Xilinx Vivado and Vitis
    I don’t know why, but the link won’t attach, so I’m leaving the name of the video instead.

In this video, the author used Vivado HLS 2019.2 and a PYNQ-Z2 board.

  • My environment
    PYNQ-Z1
    Vivado 2018.3

I tried to build an FPGA matrix multiplication accelerator.
I used the YouTube video above as a reference; it uses Vivado 2019.2 and a PYNQ-Z2 board.

I followed the same steps as the video, but the result of the matrix multiplication was always [18. 18. 18. 18. 27. 27. 27. 27. 27. 27. 27] on the first run.
From the next run onward, the output stays fixed at [27. 27. 27. 27. 27. 27. 27. 27. 27. 27. 27. 27. 27].
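
For comparison, here is a minimal software reference for what the accelerator should return. It is only a sketch, and it assumes that the 18 input samples in my script below are two 3x3 matrices stored row-major back to back (A first, then B), and that the 9 outputs are A x B flattened row-major. That layout is my assumption about the HLS design, not something I have confirmed.

```python
import numpy as np

# Same 18 samples that are copied into in_buffer in the script below
samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 8, 7, 6, 5, 4, 3, 2, 1]

# Assumption: first 9 values are A, last 9 values are B, both 3x3 row-major
a = np.array(samples[:9], dtype=np.float32).reshape(3, 3)
b = np.array(samples[9:], dtype=np.float32).reshape(3, 3)

# Software reference result, flattened the same way out_buffer would be
expected = (a @ b).ravel()
print(expected)  # values: 30, 24, 18, 84, 69, 54, 138, 114, 90
```

On the board, `np.allclose(out_buffer, expected)` would then show whether the hardware result matches.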

Since the results are wrong even though I configured everything the same way, I suspect that the difference between the PYNQ-Z1 and PYNQ-Z2 boards is causing the problem, but I’d like to get advice from the experts.

What can I do?

import time
from pynq import Overlay
import pynq.lib.dma
from pynq import allocate
import numpy as np
from pynq import MMIO
import random

ol = Overlay('/home/xilinx/pynq/overlays/matmul/matmul.bit') # check your path
ol.download() # it downloads your bit to FPGA

# check IP Blocks --> "pynq.lib.dma.DMA" name --> next line : dma = ol."name"
# ol?
dma = ol.axi_dma_0 # creating a dma instance. Note that we packed smul_dma into streamMul
sadd_ip = MMIO(0x43C00000, 0x1000) # we got this IP from Address Editor
length = 18
length_out = 9

in_buffer = allocate(shape=(length,), dtype=np.float32) # input buffer
out_buffer = allocate(shape=(length_out,), dtype=np.float32) # output buffer

samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 8, 7, 6, 5, 4, 3, 2, 1]
np.copyto(in_buffer, samples) # copy samples to the input buffer

sadd_ip.write(0x10, length) # we got this address from Vivado source
t_start = time.time()
dma.sendchannel.transfer(in_buffer)
dma.recvchannel.transfer(out_buffer)
dma.sendchannel.wait() # wait for send channel
dma.recvchannel.wait() # wait for recv channel
t_stop = time.time()

print(in_buffer)
print(out_buffer)

# free the buffers only after their contents have been printed
in_buffer.close()
out_buffer.close()
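
Since the script hard-codes the base address 0x43C00000 from the Address Editor, one thing that could be checked is whether that address matches what the loaded overlay reports. The sketch below is a small helper for that. The name `find_ip`, the stand-in dictionary, and the IP name `matmul_0` are hypothetical and only for illustration; the real entries would come from `ol.ip_dict`, which PYNQ fills in from the .hwh file.

```python
def find_ip(ip_dict, type_substring):
    """Return (name, phys_addr, addr_range) for every IP whose type contains type_substring."""
    matches = []
    for name, info in ip_dict.items():
        if type_substring in info.get("type", ""):
            matches.append((name, info["phys_addr"], info["addr_range"]))
    return matches

# Hypothetical stand-in for ol.ip_dict; names and values are illustrative only
example_ip_dict = {
    "axi_dma_0": {"type": "xilinx.com:ip:axi_dma:7.1",
                  "phys_addr": 0x40400000, "addr_range": 0x10000},
    "matmul_0": {"type": "xilinx.com:hls:matmul:1.0",
                 "phys_addr": 0x43C00000, "addr_range": 0x10000},
}

# List the HLS-generated cores and print their base addresses in hex
for name, addr, size in find_ip(example_ip_dict, ":hls:"):
    print(name, hex(addr), hex(size))
```

On the board, calling `find_ip(ol.ip_dict, ":hls:")` and comparing the reported address with the one passed to `MMIO` would confirm whether the register write at offset 0x10 reaches the intended core.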

I’ve attached the code and the bitstream files I used.

(Second run ~ Last run)

matmul.bit (3.9 MB)
matmul.hwh (301.6 KB)