AXI Stream data shows on simulation, not in PYNQ

Hey there,

I’m using a PYNQ-Z2 board and built my IP, a brute-forcer, with Vitis HLS.

There are two interfaces: CTRL_BUS uses AXI-Lite and progress uses AXI-Stream.
In the main loop I’m streaming i to progress, and co-simulation shows everything works.
When I load the overlay into PYNQ and start the IP, the logic in the IP works and performs the brute force as expected, but the progress stays 0.

The main code in the IP:

void bruteforce(const unsigned int md5bf[4], const unsigned char charset[CHARSET_LENGTH], unsigned char out[LENGTH], int &found, hls::stream<progress_t> &progress)
{
#pragma HLS INTERFACE axis port=progress
#pragma HLS INTERFACE s_axilite port=md5bf bundle=CTRL_BUS
#pragma HLS INTERFACE s_axilite port=charset bundle=CTRL_BUS
#pragma HLS INTERFACE s_axilite port=out bundle=CTRL_BUS
#pragma HLS INTERFACE s_axilite port=found bundle=CTRL_BUS
#pragma HLS INTERFACE s_axilite port=return bundle=CTRL_BUS

    // Byte-swap the target hash once, before the search loop
    unsigned int md5bf_reversed[4];
    for (int i = 0; i < 4; i++) {
        md5bf_reversed[i] = reorderBytes(md5bf[i]);
    }

    found = 0;

    // Start from the first candidate: every position set to charset[0]
    unsigned char current[LENGTH];
    for (int i = 0; i < LENGTH; i++)
        current[i] = charset[0];

    int n = CHARSET_LENGTH;
    int k = LENGTH;
    ap_uint<32> max_iterations = compute_permutations_with_repetition(n, k);

    for (ap_uint<32> i = 0; i < max_iterations; i++) {

        // Send progress data (assumes progress_t has a .data field, as ap_axiu does)
        progress_t progress_data;
        progress_data.data = i;
        progress_data.last = (i == max_iterations-1) ? 1 : 0;  // Signal the end of the stream
        progress.write(progress_data);

        char padded[BUFFER_SIZE];
        unsigned int padded_size;
        copy_pad(padded, current, LENGTH, padded_size);

        unsigned int md5out[4];
        md5((unsigned int*)padded, md5out, padded_size);

        // Compare the candidate's hash with the target, word by word
        bool match = true;
        for (int j = 0; j < 4; j++) {
            if (md5bf_reversed[j] != md5out[j]) {
                match = false;
            }
        }

        if (match) {
            copy_uchar(out, current, LENGTH);
            found = 1;
        }
        bool carry;
        generate_next_combination(carry, current, charset);
        if (carry) {
            // ... (remainder not shown in the post)
        }
    }
}


The co-simulation waveform in Vitis HLS shows the progress data as expected over time.

In the PYNQ Jupyter notebook I’m running:

from pynq import Overlay, allocate
import numpy as np

overlay = Overlay('my_overlay.bit')

size = 1
# Create a contiguous memory buffer to hold data
output_buffer = allocate(shape=(size,), dtype=np.uint32)  # change size and dtype accordingly
dma = overlay.axi_dma # replace with your DMA's name
dma.recvchannel.transfer(output_buffer)  # Receiving streamed output

There are no errors, and the IP performs the task as expected and returns the result through the AXI-Lite interface, but when I read output_buffer it contains 0 and never changes.

I would highly appreciate any advice on why the progress does not update.

Thank you!

Hi @dvirdc,

I think output_buffer is not big enough: you are allocating only one 32-bit word.
The first element that gets written to the stream is 0, so that is all the buffer can hold.

I don’t know the value of max_iterations, but size in your Python code should be equal to max_iterations.
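
To make that sizing concrete, here is a minimal sketch in plain Python. It assumes the IP writes exactly one 32-bit word per loop iteration and that compute_permutations_with_repetition(n, k) returns n^k, which is what the name suggests. The helper names stream_words and buffer_bytes are hypothetical, and the values 36 and 5 are only example inputs.

```python
WORD_BYTES = 4  # one np.uint32 element per streamed progress value


def stream_words(charset_length: int, length: int) -> int:
    """Words the IP would stream: one per iteration, n**k iterations in total."""
    return charset_length ** length


def buffer_bytes(words: int, word_bytes: int = WORD_BYTES) -> int:
    """Bytes the receive buffer needs to hold every streamed word."""
    return words * word_bytes


words = stream_words(36, 5)   # example: 36-character charset, length 5
print("size (words):", words)
print("buffer bytes:", buffer_bytes(words))
```

The first number is what size would have to be in allocate(shape=(size,), dtype=np.uint32) for the buffer to hold the whole stream.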



Thanks for the quick response @marioruiz !
max_iterations is a 32-bit unsigned integer and in my implementation cannot exceed 36^5.
Regarding output_buffer, I tried allocating 32 words and every word still remains 0.
Vivado validation was successful, but I’m wondering whether I connected the DMA incorrectly in my block design. I’m still checking that and trying to reimplement that part.

Would really appreciate more tips please!

@marioruiz I missed your point at first and only now understand what you meant. I tried setting size to max_iterations, which is 36^5, and I’m now getting an error.
The code:

size = 36**5
# Create a contiguous memory buffer to hold data
# input_buffer = allocate(shape=(size,), dtype=np.uint32)  # change size and dtype accordingly
output_buffer = allocate(shape=(size,), dtype=np.uint32)  # change size and dtype accordingly
dma = overlay.axi_dma_0 # replace with your DMA's name
# dma.sendchannel.transfer(input_buffer)  # Sending streamed input (not used here)
dma.recvchannel.transfer(output_buffer)  # Receiving streamed output

the error:

File /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/pl_server/, in _xrt_allocate(shape, dtype, device, memidx, cacheable, pointer, cache)
    102     bo, buf, device_address = pointer
    103 else:
--> 104     bo = device.allocate_bo(size, memidx, cacheable)
    105     buf = device.map_bo(bo)
    106     device_address = device.get_device_address(bo)

File /usr/local/share/pynq-venv/lib/python3.10/site-packages/pynq/pl_server/, in XrtDevice.allocate_bo(self, size, idx, cacheable)
    408 bo = xrt.xclAllocBO(self.handle, size, xrt.xclBOKind.XCL_BO_DEVICE_RAM, idx)
    409 if bo >= 0x80000000:
--> 410     raise RuntimeError("Allocate failed: " + str(bo))
    411 return bo

RuntimeError: Allocate failed: 4294967295

I would suggest you try allocating less memory.

@marioruiz thank you for helping. I’ve tried reducing the number of writes to progress (5 at most over all iterations). I’m still unable to receive any data with that code. The IP functions as expected and executes the logic successfully.

Do you have any other ways to debug this issue?
Thank you!


(36^5) * 4 bytes = 241,864,704 bytes ≈ 231 MiB
Ask yourself a few questions to reduce the problem:
What is the maximum single-transfer size the DMA allows?
231 MiB / 5 ≈ 46 MiB
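
The question about the DMA limit can be worked through with a short calculation. This is only a sketch under an assumption: that one AXI DMA transfer is capped at 2^width - 1 bytes, where width is the buffer length register width chosen when the DMA IP was configured (14 bits by default and up to 26, as far as I know). The function names are hypothetical, and the width for a particular design has to be checked in the block design.

```python
WORD_BYTES = 4  # each progress value is one 32-bit word


def max_transfer_bytes(length_width_bits: int) -> int:
    """Assumed per-transfer cap: 2**width - 1 bytes."""
    return (1 << length_width_bits) - 1


def transfers_needed(total_bytes: int, length_width_bits: int) -> int:
    """Transfers required if each one carries a whole number of words."""
    cap = max_transfer_bytes(length_width_bits)
    per_transfer = (cap // WORD_BYTES) * WORD_BYTES  # round down to a word multiple
    return (total_bytes + per_transfer - 1) // per_transfer  # ceiling division


total = 36**5 * WORD_BYTES
for width in (14, 26):
    print(f"width={width}: cap={max_transfer_bytes(width)} bytes, "
          f"transfers={transfers_needed(total, width)}")
```

Under that assumption, even the widest setting cannot move the full stream in one shot, which is one more reason to send far fewer progress values.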

Same as what Mario had mentioned:
I would suggest you try allocating less memory.
I will suggest you do the


@briansune thank you for making this clear, I appreciate your help.
I will definitely try another approach to solving this issue, since if anything I intend to increase the memory use when working with larger charsets.

Is there any other approach I can take to retrieve the progress from the IP while it’s running?

Thank you


Does the readback function normally? If so, try slicing the transfer into smaller pieces, then run one big readback and compare the two as a sanity check.

I can’t help much more, since the design is not open and there is too little information.



I was able to solve my issue and read the progress over AXI-Lite. In my main IP core HLS code, I added an ap_uint<32> internal_progress outside the main loop, updated it on each iteration, and assigned it to the &progress argument so that it can be read from the IP core.
I intend to publish the full solution in a post and will share the link here when I’m done.
To everyone who helped: thank you for your direction and support!
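
On the Python side, reading a progress value over AXI-Lite could look roughly like the sketch below. It is not the poster's notebook code. It assumes the IP object exposes read(offset), as PYNQ's default IP driver does, and PROGRESS_OFFSET is a made-up register offset; the real one comes from the driver header that Vitis HLS generates for the core. FakeIP is a stand-in so the sketch runs without a board.

```python
import time

PROGRESS_OFFSET = 0x40  # hypothetical offset; check the HLS-generated *_hw.h file


class FakeIP:
    """Stand-in for the overlay's IP object; replays canned register values."""

    def __init__(self, values):
        self._values = list(values)

    def read(self, offset):
        # Return the next canned value and hold the last one once exhausted.
        if len(self._values) > 1:
            return self._values.pop(0)
        return self._values[0]


def poll_progress(ip, max_iterations, offset=PROGRESS_OFFSET,
                  interval=0.0, max_polls=100):
    """Read the progress register until the final iteration or max_polls."""
    seen = []
    for _ in range(max_polls):
        value = ip.read(offset)
        seen.append(value)
        if value >= max_iterations - 1:
            break
        time.sleep(interval)
    return seen


readings = poll_progress(FakeIP([0, 10, 50, 99]), max_iterations=100)
print(readings)
```

On real hardware the FakeIP instance would be replaced by the IP object from the overlay. If the core can finish before the last iteration, for example once a match is found, the loop should also check the core's done status instead of relying on the counter alone.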