Using multiple accelerators simultaneously (multithreading)

Hi,
I am a beginner with FPGAs.
Currently, I am practicing using the FPGA to accelerate computationally heavy workloads.

I have implemented a cross-correlation (FFT-based) accelerator as an HLS IP.
I have also instantiated four of these accelerators on the FPGA in a Vivado block design.

In my application, I drive these accelerators simultaneously from multiple threads.
However, I have noticed that the more threads I use, the slower each accelerator runs.
Is this expected? (Since the accelerators are independent, I would expect their performance to be unaffected.)
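For context, the general pattern of driving the four IPs from Python threads looks roughly like the sketch below (illustrative only: the overlay name, IP names, buffer size, and register offsets are placeholders, not the actual contents of project.zip).

```python
import threading
import numpy as np
from pynq import Overlay, allocate

# NOTE: bitstream name, IP names, buffer size and register offsets below are
# assumptions for illustration; the real values come from the HLS-generated
# driver header and the actual design in project.zip.
ol = Overlay("correlation.bit")
ips = [ol.corr_0, ol.corr_1, ol.corr_2, ol.corr_3]  # four HLS accelerator instances

N = 4096  # assumed transfer size

def run_ip(ip):
    """Drive one accelerator through its AXI-Lite control interface."""
    in_buf = allocate(shape=(N,), dtype=np.float32)
    out_buf = allocate(shape=(N,), dtype=np.float32)
    in_buf[:] = np.random.rand(N).astype(np.float32)
    in_buf.sync_to_device()

    ip.write(0x10, in_buf.physical_address)   # offsets depend on the HLS-generated header
    ip.write(0x18, out_buf.physical_address)
    ip.write(0x00, 1)                         # ap_start
    while (ip.read(0x00) & 0x2) == 0:         # poll ap_done
        pass
    out_buf.sync_from_device()

# One thread per accelerator, all started together.
threads = [threading.Thread(target=run_ip, args=(ip,)) for ip in ips]
for t in threads:
    t.start()
for t in threads:
    t.join()
```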

Additionally, I have observed that different IP objects report the same memory address for their control registers.
Is this normal?
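For reference, this is one way to inspect which physical address each IP driver is mapped to (a sketch; the bitstream and IP names are placeholders):

```python
from pynq import Overlay

ol = Overlay("correlation.bit")  # placeholder bitstream name

# ip_dict is parsed from the .hwh file; each entry records the physical
# base address and address range of that IP's control (AXI-Lite) interface.
for name, desc in ol.ip_dict.items():
    print(name, hex(desc["phys_addr"]), hex(desc["addr_range"]))

# The MMIO object bound to each driver should report the same base address
# as the corresponding ip_dict entry.
print(hex(ol.corr_0.mmio.base_addr), hex(ol.corr_1.mmio.base_addr))
```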

Thank you.

[PYNQ ver.] 3.0.1
[Board] KV260
[Vivado ver.] 2021.1
[HLS ver.] 2021.1

[Reproduce]

  1. Unzip project.zip on the PYNQ board
  2. sudo su
  3. source /etc/profile.d/pynq_venv.sh
  4. Enter the project folder
  5. python ./multi-thread_test.py

[File] project.zip (1.8 MB)

  1. HLS code:
     - /project/hls/
  2. Vivado block design:
     - /project/vivado_block_design.pdf
  3. Python code:
     - /project/multi-thread_test.py
  4. Bitstream:
     - /project/bitstream/correlation/

[Images: Vivado block design, Driver, Runtime comparison, Control signal issue]


Hi @Oscar_Lin,

Welcome to the PYNQ community.

You are probably running into memory contention. The bandwidth to memory is limited, so when you run multiple IPs at the same time they may end up competing for memory access, which slows down the run time.
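One rough way to check this is to time the same workload run sequentially and then in parallel (a sketch, reusing a per-IP call like the run_ip function sketched in the first post):

```python
import threading
import time

# run_ip(ip) is assumed to be the kind of per-accelerator call sketched in
# the first post; any function that runs one IP to completion will do.

def time_sequential(ips):
    start = time.time()
    for ip in ips:
        run_ip(ip)
    return time.time() - start

def time_parallel(ips):
    start = time.time()
    threads = [threading.Thread(target=run_ip, args=(ip,)) for ip in ips]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

# If the parallel time is well above sequential_time / len(ips), the IPs are
# not scaling independently, which points to shared DDR bandwidth as the
# bottleneck rather than the accelerators themselves.
```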

Mario


Hi @marioruiz,

Thank you for your response.

May I ask if there is any method to add a separate memory for each IP to avoid memory contention?
Or are there any documents or tutorials that I can refer to?

Thanks
Oscar


Hi @Oscar_Lin,

I am not sure which board you're using, but typically Zynq MPSoC boards have only one memory module.

Mario

@marioruiz

If this is the case, I guess @Oscar_Lin should use a better design hierarchy.
Each accelerator should have its own memory pool, i.e. PL DDR memory, rather than sharing the DDR memory attached to the PS.
Once a result is finalized, the interface between accelerators can be shared memory.

Sharing memory through the PS is a low-cost but sub-optimal structure.

It is much like Harvard vs. Von Neumann: each has its pros and cons.

Does multithreading imply that the memory is physically split into separate channels? I would say no. In most cases a single memory is shared and accesses are simply arbitrated (scoreboard-style).

So an AXI MIG (memory controller) per accelerator, with the CPU work split across the separate MIGs, is the most efficient and fastest design. Mutexes and fork/join of the data are another story, unless there is a better way to handle them.

ENJOY~