PYNQ DFX Partial Reconfiguration: DMA Fails or Returns Garbage After Loading Partial Bitstreams

PYNQ DFX Partial Reconfiguration Issue

Hello,

I’m a beginner learning Dynamic Function eXchange (DFX) partial reconfiguration on PYNQ 3.1 with Ubuntu 22.04.5 LTS and Vivado 2024.1. I built a base overlay with one reconfigurable partition (RP) containing two reconfigurable modules (RMs): vec_add and vec_mul.

Here is my block design (BD): [image not available]

Base Python Code

from pynq import Overlay
import time

# Load base overlay
print("Loading base overlay...")
ol = Overlay("design_1_wrapper_full.bit")
print("Full overlay loaded")

# Decoupler GPIO
gpio = ol.axi_gpio_0

# Load vec_add partial
gpio.write(0, 1)       # Decouple RP
time.sleep(0.2)
ol.rp1.download("rm1_partial.bit")
time.sleep(0.5)
gpio.write(0, 0)       # Reconnect RP
print("vec_add partial loaded")

# Load vec_mul partial
gpio.write(0, 1)
time.sleep(0.2)
ol.rp1.download("rm2_partial.bit")
time.sleep(0.5)
gpio.write(0, 0)
print("vec_mul partial loaded")

Observed Output

1. Loading base overlay...
Full overlay loaded
vec_add partial loaded
TEST 2: Load rm2_partial.bit (vec_mul)
vec_mul partial loaded
DFX Partial Bitstream Reload Test Complete!

Problem

When I run the RM computations through the DMAs, the transfers either hang or return garbage:

Base design loaded
IP blocks: ['rp1', 'axi_dma_0', 'axi_dma_1', 'axi_dma_2', 'axi_gpio_0', 'processing_system7_0']
Decoupler: ON
Decoupler: OFF
VecAdd out:  [524288      0      0      0      0]
Expected: [ 0  3  6  9 12]
✗ VecAdd FAILED
Decoupler: ON
Decoupler: OFF
VecMul out:  [524288      0      0      0      0]
Expected: [ 0  2  8 18 32]
✗ VecMul FAILED

Steps Tried

  • Verified GPIO decoupler toggling.
  • Added sleep after RP decouple and download.
  • Checked DMA channels after partial load.

Edited: Update

I checked the HP port: it was set to 32, so I changed it to 64. I also fixed the apertures. With these changes, only vec_add works.

from pynq import Overlay, allocate
import time
import numpy as np

# Load the base overlay once
overlay = Overlay("design_1.bit")
gpio = overlay.axi_gpio_0
dmaA = overlay.axi_dma_0
dmaB = overlay.axi_dma_1
dmaC = overlay.axi_dma_2

# Define shared data buffers
N = 16
input_a = allocate(shape=(N,), dtype=np.int32)
input_b = allocate(shape=(N,), dtype=np.int32)
output_c = allocate(shape=(N,), dtype=np.int32)

for i in range(N):
    input_a[i] = i + 1
    input_b[i] = (i + 1) * 10

# --- Run Partial Bitstream 1 (vec_add) ---
print("--- Running vec_add ---")
gpio.write(0, 1)  # Decouple ON
time.sleep(0.2)
overlay.rp1.download('rp1rm1_inst_0.bit')
time.sleep(0.5)
gpio.write(0, 0)  # Decouple OFF
time.sleep(0.2)

# GET the IP *after* the partial bitstream is loaded
vec_add = overlay.rp1.vec_add_0

# Configure and run the new IP
vec_add.write(0x00, 0x00)  # reset
vec_add.write(0x10, N)     # set length
time.sleep(0.05)
vec_add.write(0x00, 0x01)  # start

# Perform DMA transfers (starting receiver first is a good practice)
dmaC.recvchannel.transfer(output_c)
time.sleep(0.01)
dmaA.sendchannel.transfer(input_a)
dmaB.sendchannel.transfer(input_b)
dmaA.sendchannel.wait()
dmaB.sendchannel.wait()
dmaC.recvchannel.wait()

print("Input A:", input_a)
print("Input B:", input_b)
print("Output C:", output_c)
print("Expected :", input_a + input_b)


# --- Run Partial Bitstream 2 (vec_mul) ---
print("\n--- Running vec_mul ---")
gpio.write(0, 1)  # Decouple ON
time.sleep(0.2)
overlay.rp1.download('rp1rm2_inst_0.bit')
time.sleep(0.5)
gpio.write(0, 0)  # Decouple OFF
time.sleep(0.2)

vec_mul = overlay.rp1.vec_mul_0

# CLEAR output buffer
output_c[:] = 0

# Reset IP but DON'T start yet
vec_mul.write(0x00, 0x00)
time.sleep(0.1)
vec_mul.write(0x10, N)
time.sleep(0.05)

# Start DMA transfers first
dmaC.recvchannel.transfer(output_c)
time.sleep(0.01)
dmaA.sendchannel.transfer(input_a)
dmaB.sendchannel.transfer(input_b)
time.sleep(0.05)

# NOW start the IP
vec_mul.write(0x00, 0x01)

dmaA.sendchannel.wait()
dmaB.sendchannel.wait()
dmaC.recvchannel.wait()

print("Output C:", output_c)
print("Expected :", input_a * input_b)

Output

--- Running vec_add ---
Input A: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16]
Input B: [ 10  20  30  40  50  60  70  80  90 100 110 120 130 140 150 160]
Output C: [ 11  22  33  44  55  66  77  88  99 110 121 132 143 154 165 176]
Expected : [ 11  22  33  44  55  66  77  88  99 110 121 132 143 154 165 176]

--- Running vec_mul ---
Output C: [1675100160 1675100160 1675164988 1675164988         10         40
          0          0          0          0          0          0
          0          0          0          0]
Expected : [  10   40   90  160  250  360  490  640  810 1000 1210 1440 1690 1960
 2250 2560]
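When the output is wrong like this, it can help to list exactly which elements differ and to print them in hex, since values such as 524288 (0x80000) look more like stale memory or addresses than like results. The helper below is only a sketch; `report_mismatches` is a hypothetical name, and it works on plain lists as well as on PYNQ buffers because it only iterates and compares.

```python
def report_mismatches(got, expected, max_show=4):
    """Print the first few differing elements and return all of them.

    Each entry in the returned list is (index, got_value, expected_value).
    """
    bad = [(i, int(g), int(e))
           for i, (g, e) in enumerate(zip(got, expected)) if g != e]
    for i, g, e in bad[:max_show]:
        # Mask to 32 bits so negative int32 values still print as 8 hex digits.
        print(f"[{i}] got {g} (0x{g & 0xFFFFFFFF:08x}), expected {e}")
    print(f"{len(bad)} of {len(expected)} elements differ")
    return bad
```

In the script above it would be called as `report_mismatches(output_c, input_a * input_b)` right after the final `wait()`.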

Question

How should I properly isolate the RP, load the partial, and ensure DMAs are working correctly? Do I need to reinitialize DMA channels after every partial load?

Is there anything in my design that could cause the DMA to fail or produce garbage results after a partial load?

Any guidance or example code for safe partial reconfiguration with DMAs would be very appreciated.

Solved

Fixed Flow Steps

The static region includes:

  • Three AXI DMAs (two input DMAs for A and B, one output DMA for C)
  • AXI GPIO for decoupling the reconfigurable region
  • One Reconfigurable Partition (RP1)

Two HLS-based accelerators share the same reconfigurable region:

  • vec_add — performs element-wise addition
  • vec_mul — performs element-wise multiplication

1. Make sure all RP interfaces are correctly decoupled

Assert the decoupler before loading any partial bitstream, and release it only after the download has finished:

gpio.write(0, 1)  # Decouple ON before PR
# ... download partial bitstream ...
gpio.write(0, 0)  # Decouple OFF after PR
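The decouple, download, recouple sequence can be wrapped in one function so that every partial load goes through the same steps. This is a minimal sketch, not part of the PYNQ API: `load_rm` and `settle_s` are names I made up, `gpio` is assumed to be the AXI GPIO driving the decoupler (write 1 to decouple, as in this thread), and `region` is assumed to be the `overlay.rp1` object with its `download()` method.

```python
import time


def load_rm(gpio, region, bitfile, settle_s=0.2):
    """Download one partial bitstream while the RP is decoupled.

    gpio   : object with write(offset, value); 1 decouples, 0 reconnects
    region : object with download(path), e.g. overlay.rp1
    """
    gpio.write(0, 1)           # Decouple ON
    time.sleep(settle_s)
    # If the download raises, the function exits here and the RP stays
    # decoupled, so a half-configured region is never reconnected.
    region.download(bitfile)
    time.sleep(settle_s)
    gpio.write(0, 0)           # Decouple OFF
    time.sleep(settle_s)
```

Usage would look like `load_rm(overlay.axi_gpio_0, overlay.rp1, "vec_add2.bit")`. Leaving the RP decoupled on failure is a deliberate choice here; a `try/finally` that always reconnects would expose the static region to whatever ended up in the RP.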

2. Enable reset and snapping mode in XDC

In your .xdc file, ensure the following properties are applied to the RP pblock:

set_property RESET_AFTER_RECONFIG true [get_pblocks pblock_rp1]
set_property SNAPPING_MODE ON [get_pblocks pblock_rp1]

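RESET_AFTER_RECONFIG makes the logic in the RP start from a known initial state once a partial bitstream has been loaded, and SNAPPING_MODE lets Vivado adjust the pblock so that it lines up with legal reconfigurable boundaries.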


Final Python Script

from pynq import Overlay, allocate
import numpy as np
import time

# ==========================================================
# 1. Load Static Region
# ==========================================================
overlay = Overlay("static_f.bit")
print("Static overlay loaded.")

gpio = overlay.axi_gpio_0

# ==========================================================
# 2. PARTIAL RECONFIGURATION: Load vec_add RM
# ==========================================================
print("\n--- Loading vec_add Partial Bitstream ---")
gpio.write(0, 1)  # Decouple ON
time.sleep(0.5)

overlay.rp1.download('vec_add2.bit')
time.sleep(0.5)
gpio.write(0, 0)  # Decouple OFF
time.sleep(0.2)
print("vec_add RM loaded successfully.\n")

# ==========================================================
# 3. DMA and IP Setup
# ==========================================================
input_a = overlay.axi_dma_0
input_b = overlay.axi_dma_1
output_c = overlay.axi_dma_2
vec_add = overlay.rp1.vec_add_0

def print_dma_status(dma, direction):
    """ direction: 0 = read channel (MM2S), 1 = write channel (S2MM) """
    offset = 0 if direction == 0 else 0x30
    kind = "Read" if direction == 0 else "Write"
    print(f"{kind} Channel:")
    print(f"  Control: {hex(dma.read(0x0 + offset))}")
    print(f"  Status : {hex(dma.read(0x4 + offset))}\n")

print_dma_status(input_a, 0)
print_dma_status(input_b, 0)
print_dma_status(output_c, 1)

# ==========================================================
# 4. Prepare Buffers and Data
# ==========================================================
N = 32
input_buffer_a = allocate(shape=(N,), dtype=np.uint32)
input_buffer_b = allocate(shape=(N,), dtype=np.uint32)
output_buffer_c = allocate(shape=(N,), dtype=np.uint32)

for i in range(N):
    input_buffer_a[i] = i + 1
    input_buffer_b[i] = (i + 1) * 10

print("Input A:", input_buffer_a)
print("Input B:", input_buffer_b)

# ==========================================================
# 5. Run vec_add IP
# ==========================================================
vec_add.write(0x10, N)      # LEN register
vec_add.write(0x0, 0x1)     # Start IP

input_a.sendchannel.transfer(input_buffer_a)
input_b.sendchannel.transfer(input_buffer_b)
output_c.recvchannel.transfer(output_buffer_c)

input_a.sendchannel.wait()
input_b.sendchannel.wait()
output_c.recvchannel.wait()

print("DMA Transfers Completed.\n")

print("Output C:", output_buffer_c)
expected_add = input_buffer_a + input_buffer_b
if np.array_equal(output_buffer_c, expected_add):
    print("vec_add output is correct!\n")
else:
    print("vec_add output mismatch!")
    print("Expected:", expected_add)
    print("Got     :", output_buffer_c)

# ==========================================================
# 6. PARTIAL RECONFIGURATION: Load vec_mul RM
# ==========================================================
print("\n--- Loading vec_mul Partial Bitstream ---")
gpio.write(0, 1)  # Decouple ON
time.sleep(0.5)

overlay.rp1.download('vec_mul1.bit')
time.sleep(0.5)
gpio.write(0, 0)  # Decouple OFF
time.sleep(0.2)
print("vec_mul RM loaded successfully.\n")

# ==========================================================
# 7. Reconnect IP and Run vec_mul
# ==========================================================
vec_mul = overlay.rp1.vec_mul_0  # new HLS block after PR

# Reset output buffer
output_buffer_c[:] = 0

vec_mul.write(0x10, N)      # LEN register
vec_mul.write(0x0, 0x1)     # Start IP

input_a.sendchannel.transfer(input_buffer_a)
input_b.sendchannel.transfer(input_buffer_b)
output_c.recvchannel.transfer(output_buffer_c)

input_a.sendchannel.wait()
input_b.sendchannel.wait()
output_c.recvchannel.wait()

print("DMA Transfers Completed (vec_mul).\n")

print("Output C:", output_buffer_c)
expected_mul = input_buffer_a * input_buffer_b
if np.array_equal(output_buffer_c, expected_mul):
    print("vec_mul output is correct!\n")
else:
    print("vec_mul output mismatch!")
    print("Expected:", expected_mul)
    print("Got     :", output_buffer_c)

# ==========================================================
# 8. Clean Up
# ==========================================================
input_buffer_a.freebuffer()
input_buffer_b.freebuffer()
output_buffer_c.freebuffer()

print("All done. Both PR modules (vec_add & vec_mul) tested successfully.")
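The vec_add and vec_mul runs in the script follow the same sequence, so they can be factored into one function. The sketch below keeps the order used in the script (program the IP, start it, queue the three transfers, then wait). `run_rm` is a hypothetical name, and the register offsets 0x10 and 0x0 are the ones used in this thread, so they are assumptions for any other HLS core.

```python
def run_rm(ip, dma_a, dma_b, dma_c, buf_a, buf_b, buf_out, n):
    """Run one vector operation through the currently loaded RM.

    ip              : HLS IP handle with write(offset, value)
    dma_a, dma_b    : input DMAs (sendchannel is used)
    dma_c           : output DMA (recvchannel is used)
    """
    ip.write(0x10, n)      # LEN register
    ip.write(0x0, 0x1)     # Start IP

    # Queue all three transfers before blocking on any of them.
    dma_a.sendchannel.transfer(buf_a)
    dma_b.sendchannel.transfer(buf_b)
    dma_c.recvchannel.transfer(buf_out)

    dma_a.sendchannel.wait()
    dma_b.sendchannel.wait()
    dma_c.recvchannel.wait()
    return buf_out
```

With this, sections 5 and 7 of the script reduce to `run_rm(vec_add, input_a, input_b, output_c, input_buffer_a, input_buffer_b, output_buffer_c, N)` and the same call with `vec_mul`.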

Example Console Output

Static overlay loaded.

--- Loading vec_add Partial Bitstream ---
vec_add RM loaded successfully.

Read Channel:
  Control: 0x10003
  Status : 0x0

Read Channel:
  Control: 0x10003
  Status : 0x0

Write Channel:
  Control: 0x10003
  Status : 0x0

Input A: [ 1  2  3 ... 31 32]
Input B: [ 10  20  30 ... 310 320]
DMA Transfers Completed.

Output C: [ 11  22  33 ... 341 352]
vec_add output is correct!

--- Loading vec_mul Partial Bitstream ---
vec_mul RM loaded successfully.

DMA Transfers Completed (vec_mul).

Output C: [   10    40    90 ... 9610 10240]
vec_mul output is correct!

All done. Both PR modules (vec_add & vec_mul) tested successfully.
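The raw Status values printed above are easier to read when the individual flags are named. The sketch below decodes an AXI DMA status register (DMASR) value; the bit positions are as I understand them from the AXI DMA register map, so please check them against the product guide for your IP version. `decode_dmasr` and `has_error` are hypothetical helper names.

```python
# DMASR bit positions (assumed from the AXI DMA register map).
DMASR_FLAGS = {
    0: "Halted",
    1: "Idle",
    3: "SGIncld",
    4: "DMAIntErr",
    5: "DMASlvErr",
    6: "DMADecErr",
    8: "SGIntErr",
    9: "SGSlvErr",
    10: "SGDecErr",
    12: "IOC_Irq",
    13: "Dly_Irq",
    14: "Err_Irq",
}

# Mask covering the six error flags (bits 4-6 and 8-10).
DMASR_ERROR_MASK = 0x770


def decode_dmasr(value):
    """Return the names of all flags set in a DMASR value, lowest bit first."""
    return [name for bit, name in sorted(DMASR_FLAGS.items())
            if value & (1 << bit)]


def has_error(value):
    """True if any DMA or scatter-gather error flag is set."""
    return bool(value & DMASR_ERROR_MASK)
```

For example, `decode_dmasr(dma.read(0x34))` would name the flags of the write (S2MM) channel, using the same offsets as `print_dma_status`. A status of 0x0 decodes to an empty list: not halted, no error flags.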

Final Verification:
Both vec_add and vec_mul reconfigurable modules successfully loaded, executed, and validated via DMA transfers under the same static design.
