Double array

Hey :slight_smile:

So I am using vivado/vitis 2022.2, and have been messing around with a project which I thought would extend this example a little, since the original works very nicely (for me):

The idea is it will just add two vectors together pairwise.

#include "ap_axi_sdata.h"
#include "hls_stream.h"

typedef ap_axiu<32, 0, 0, 0> trans_pkt;

void ssmul(hls::stream< trans_pkt > &INPUT0, hls::stream< trans_pkt > &INPUT1, hls::stream< trans_pkt > &OUTPUT)
{
        #pragma HLS INTERFACE axis port=INPUT0
        #pragma HLS INTERFACE axis port=INPUT1
        #pragma HLS INTERFACE axis port=OUTPUT
        trans_pkt data_p;
        trans_pkt data_q;
        trans_pkt data_r;
        
        INPUT0.read(data_p);
        INPUT1.read(data_q);

        data_r.data = data_p.data + data_q.data;
        OUTPUT.write(data_r);
}

And for the design, my plan was to run automation on the following config:

It compiles nicely all the way through, but stalls in jupyter :frowning:

This is the notebook:

import time, random, numpy
from pynq import Overlay, allocate, MMIO
import pynq.lib.dma

ol = Overlay('./design_1.bit')
ol.download()

dma0 = ol.axi_dma_0
dma1 = ol.axi_dma_1

length = 10

in_buffer0 = allocate(shape=(length,), dtype=numpy.int32) 
in_buffer1 = allocate(shape=(length,), dtype=numpy.int32)
out_buffer = allocate(shape=(length,), dtype=numpy.int32) 

samples = random.sample(range(0, length), length)
numpy.copyto(in_buffer0, samples)
numpy.copyto(in_buffer1, samples)
numpy.copyto(out_buffer, samples)

t_start = time.time()
dma0.sendchannel.transfer(in_buffer0)
dma1.sendchannel.transfer(in_buffer1)
dma0.recvchannel.transfer(out_buffer)
dma0.sendchannel.wait()
dma1.sendchannel.wait()
dma0.recvchannel.wait()
t_stop = time.time()

in_buffer0.close()
in_buffer1.close()
out_buffer.close()

print(t_stop - t_start)

Anyone got any ideas?

It would be helpful if you could post the finished block design, so we can check whether the problem (or an additional problem) is there.

In this case it looks like a problem with your HLS code. You are not setting “LAST” which means the DMA transactions don’t complete.
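
To illustrate why a missing LAST stalls the receive side, here is a simplified pure-Python model. It is not PYNQ or DMA driver code: `Beat` and `receive_completes` are hypothetical names, and the model assumes the receive channel only finishes once it sees a beat with `last` set.

```python
from dataclasses import dataclass


@dataclass
class Beat:
    """One AXI4-Stream transfer: a data word plus the TLAST flag."""
    data: int
    last: bool = False


def receive_completes(beats):
    """Simplified receive-channel model (an assumption, not the real driver).

    Returns True once a beat with last=True has been seen, and False if the
    stream ends without one. On hardware the False case is an endless wait.
    """
    for beat in beats:
        if beat.last:
            return True
    return False


# Kernel output where LAST is never asserted.
no_last = [Beat(data=i) for i in range(10)]
# Same data, but the final beat is marked.
with_last = [Beat(data=i, last=(i == 9)) for i in range(10)]

print(receive_completes(no_last))    # False: the receive wait would hang
print(receive_completes(with_last))  # True
```

In this model the data words are identical in both cases; only the flag on the final beat decides whether the receive side ever finishes.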

Cathal

Ace :slight_smile:

About setting LAST, in this example … does this look right?

#include "ap_axi_sdata.h"
#include "hls_stream.h"

typedef ap_axiu<32, 0, 0, 0> trans_pkt;

void ssmul(hls::stream< trans_pkt > &INPUT0, hls::stream< trans_pkt > &INPUT1, hls::stream< trans_pkt > &OUTPUT)
{
        #pragma HLS INTERFACE axis port=INPUT0
        #pragma HLS INTERFACE axis port=INPUT1
        #pragma HLS INTERFACE axis port=OUTPUT
        trans_pkt data_p;
        trans_pkt data_q;
        trans_pkt data_r;
        
        INPUT0.read(data_p);
        INPUT1.read(data_q);

        data_r.data = data_p.data + data_q.data;
        OUTPUT.write(data_r);

        if(data_r.last){
        	return;
        }
}

?

Also, here is the final block design :slight_smile:

K

@alienflip

Can you expand the interfaces of the HLS IP block so we can check that the generated signals are correct?
Apart from your HLS design, at a minimum the interface signals need to be correct.

Thank you

Here it is as a hierarchy in .pdf, with expansions:

design_1.pdf (63.0 KB)

On a side note; after some testing, I am sure that it is the line of code

dma0.recvchannel.wait()

which is causing the stall … not sure what to do with that information, though :sweat_smile:

@alienflip
In short, the TLAST signals are present on the HLS-generated interfaces.

So this is now down to the HLS design itself.
Debug the AXI-Stream interfaces with a System ILA.

As the previous post and this one already mentioned,
the DMA will only complete when TLAST is asserted or the expected amount of data has been returned.

Try learning from this example:

Cool thanks!

Not sure I understand what you mean though, this all seems very like advanced NN stuff which my smooth brain can’t fathom!

Could you point out where exactly to execute the fix in my example? I know once I see how tlast works one time in a simple example like this, it will stick :slight_smile:

The beginning of that tutorial explains how AXI-Stream is defined. @cathalmccabe's tutorial is also a good one, but the two cases use different signal interfaces, so you can learn from both. Even though it is an NN example, it still follows the same AXI-Stream protocol, which is a very basic and standard one.

As I mentioned, the interface side looks good, so you need to debug why data is not flowing out to trigger TLAST.

I read through @cathalmccabe tutorial, and it suggests to do something like:

#include "ap_axi_sdata.h"
#include "hls_stream.h"

typedef ap_axiu<32, 2, 5, 6> pkt;

void ssmul(hls::stream< pkt > &INPUT0, hls::stream< pkt > &INPUT1, hls::stream< pkt > &OUTPUT)
{
#pragma hls interface s_axilite port=return
#pragma HLS INTERFACE axis port=INPUT0
#pragma HLS INTERFACE axis port=INPUT1
#pragma HLS INTERFACE axis port=OUTPUT

	pkt data_p;
	pkt data_q;
	pkt data_r;

	while(1){
		INPUT0.read(data_p);
		INPUT1.read(data_q);

		data_r.data = data_p.data.to_int() + data_q.data.to_int();
		OUTPUT.write(data_r);

		if(data_r.last){
			break;
		}
	}
}

Which I have now tried as well … it results in an identical stall :frowning:

This is the block design:

design_1.pdf (63.0 KB)

Gonna sleep for a bit! But thanks for the help so far :slight_smile:

First of all, nothing is ever assigned to data_r.last, so I don’t see how it could ever be true. You have to assign 1 to data_r.last at the right moment and then write the data to the output.
Also, you should debug your Vivado project with an ILA to see where it goes wrong.

Hi,

Just to expand a little here on what @mizan kindly stated. For your output HLS stream data_r you need to set data_r.last to a bool false until it is the last one. For just the last data chunk you need to set data_r.last to bool true. You’re using C++ so bool is a valid data type here, if you like, 1 or 0 also works. When the .last member is set appropriately and the DMA wired properly it should work.

In your latest source example, if the input data streams data_p or data_q have a proper .last set in them, you could assign data_r.last = data_p.last and that would also work.
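
As a sketch of the "false until the last one" rule, this plain-Python model tags each output word with a flag that is true only for the final word. `make_packet` is a hypothetical helper for illustration, not part of HLS or PYNQ.

```python
def make_packet(words):
    """Return (data, last) pairs; last is True only on the final word."""
    n = len(words)
    return [(w, i == n - 1) for i, w in enumerate(words)]


packet = make_packet([5, 7, 9])
print(packet)  # [(5, False), (7, False), (9, True)]
```

In a kernel the same decision is made per word, either by comparing a counter against the known length or by forwarding the `last` flag of an input stream that already carries it.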

Kind regards

Hi,

Don’t forget to set keep as well, otherwise the design may not work.
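
For reference, the AXI4-Stream keep signal carries one bit per byte of the data word, and a set bit marks that byte as valid. The small Python sketch below computes the "all bytes valid" mask for a given data width; `full_keep` is a hypothetical helper name used only for illustration.

```python
def full_keep(data_bits):
    """Return the keep mask that marks every byte of the data word valid.

    keep has one bit per data byte, so a 32-bit word needs a 4-bit mask.
    """
    if data_bits <= 0 or data_bits % 8 != 0:
        raise ValueError("data width must be a positive whole number of bytes")
    return (1 << (data_bits // 8)) - 1


print(hex(full_keep(32)))  # 0xf
```

For the 32-bit packets in this thread that is a 4-bit mask with all bits set.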

Mario

I didn’t realise my first post wasn’t clear enough.
@pynqzen and @mizan have given the answer above. (Thanks to you both :slight_smile: )
To be really clear, and in case it helps someone else, you can set last like this (or similar):

void ssmul(hls::stream< trans_pkt > &INPUT0, hls::stream< trans_pkt > &INPUT1, hls::stream< trans_pkt > &OUTPUT)
{
        #pragma HLS INTERFACE axis port=INPUT0
        #pragma HLS INTERFACE axis port=INPUT1
        #pragma HLS INTERFACE axis port=OUTPUT
        trans_pkt data_p;
        trans_pkt data_q;
        trans_pkt data_r;
        
        INPUT0.read(data_p);
        INPUT1.read(data_q);

        data_r.data = data_p.data + data_q.data;
        data_r.last = data_p.last;  // Set LAST
        OUTPUT.write(data_r);
}

If you have the same number of inputs as outputs you can just assign the value of last for one of the inputs to the last of the output. If you have a different number of inputs to outputs you would add some extra code to calculate when to set last.

In this example, your two input vectors should be the same length, so you could set last equal to the value of last for either input. You could also do an AND or an OR of the two LAST values if you wanted to, but an AND may cause your design to stall if the inputs have different lengths, and an OR would cause the transaction to complete early if one of the inputs was too short.
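
The AND/OR trade-off can be seen in a small pure-Python model. This is a sketch under assumptions: it models a kernel that loops until the combined `last` is set, and it treats a read from an empty stream as a stall. `run_kernel` and `packet` are hypothetical names, not HLS or PYNQ API.

```python
from collections import deque


def packet(words):
    """Return (data, last) pairs; last is True only on the final word."""
    return [(w, i == len(words) - 1) for i, w in enumerate(words)]


def run_kernel(in0, in1, combine):
    """Model of a read/add/write loop over two input streams.

    in0 and in1 are lists of (data, last) pairs; combine merges the two
    last flags into the output flag. Returns (outputs, status).
    """
    q0, q1 = deque(in0), deque(in1)
    out = []
    while True:
        if not q0 or not q1:
            # A blocking read on an empty stream never returns on hardware.
            return out, "stalled"
        d0, l0 = q0.popleft()
        d1, l1 = q1.popleft()
        last = combine(l0, l1)
        out.append((d0 + d1, last))
        if last:
            return out, "done"


short = packet([1, 2, 3])
long_ = packet([10, 20, 30, 40, 50])

out_or, status_or = run_kernel(short, long_, lambda a, b: a or b)
out_and, status_and = run_kernel(short, long_, lambda a, b: a and b)

print(status_or, len(out_or))    # done 3
print(status_and, len(out_and))  # stalled 3
```

With mismatched lengths the OR version finishes after three words, dropping the rest of the longer input, while the AND version never produces a `last` and ends up waiting on the exhausted stream.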

One other point, you should really be simulating designs like this to test them. Either C simulation or RTL simulation. This should help resolve problems like this.

Cathal

Wow, thanks for the suggestions all! Learning a lot here :slight_smile:

So taking a combination of advice from above, I put this together:

#include "ap_axi_sdata.h"
#include "hls_stream.h"

typedef ap_axiu<32, 1, 1, 1> pkt;

void ssmul(hls::stream< pkt > &INPUT0, hls::stream< pkt > &INPUT1, hls::stream< pkt > &OUTPUT, ap_int<32> constant0, ap_int<32> constant1)
{
#pragma HLS interface s_axilite port=constant0
#pragma HLS interface s_axilite port=constant1
#pragma HLS INTERFACE axis port=INPUT0
#pragma HLS INTERFACE axis port=INPUT1
#pragma HLS INTERFACE axis port=OUTPUT

	pkt data_p;
	pkt data_q;
	pkt data_r;

	while(1){
		INPUT0.read(data_p);
		INPUT1.read(data_q);

		data_r.data = data_p.data + data_q.data;
		data_r.dest = data_p.dest;
		data_r.id = data_p.id;
		data_r.keep = data_p.keep;
		data_r.last = data_p.last;
		data_r.strb = data_p.strb;
		data_r.user = data_p.user;

		OUTPUT.write(data_r);

		if(data_r.last){
			break;
		}
	}
}

Vitis HLS produced the following output (image: output), along with this generated register map:

// ==============================================================
// Vitis HLS - High-Level Synthesis from C, C++ and OpenCL v2022.1 (64-bit)
// Tool Version Limit: 2022.04
// Copyright 1986-2022 Xilinx, Inc. All Rights Reserved.
// ==============================================================
// control
// 0x00 : reserved
// 0x04 : reserved
// 0x08 : reserved
// 0x0c : reserved
// 0x10 : Data signal of constant0
//        bit 31~0 - constant0[31:0] (Read/Write)
// 0x14 : reserved
// 0x18 : Data signal of constant1
//        bit 31~0 - constant1[31:0] (Read/Write)
// 0x1c : reserved
// (SC = Self Clear, COR = Clear on Read, TOW = Toggle on Write, COH = Clear on Handshake)

#define XSSMUL_CONTROL_ADDR_CONSTANT0_DATA 0x10
#define XSSMUL_CONTROL_BITS_CONSTANT0_DATA 32
#define XSSMUL_CONTROL_ADDR_CONSTANT1_DATA 0x18
#define XSSMUL_CONTROL_BITS_CONSTANT1_DATA 32


The finished Vivado block design:
design_1.pdf (109.1 KB)

In my notebook, I tried:

from pynq import Overlay, allocate
import pynq.lib.dma
import time, random, numpy

ol = Overlay('./design_1.bit')
ol.download()

CONST0_ADDRESS = 0x10
CONST1_ADDRESS = 0x18
ip = ol.ssmul_0
dma0 = ol.axi_dma_0
dma1 = ol.axi_dma_1

length = 100

in_buffer0 = allocate(shape=(length,), dtype=numpy.int32) 
in_buffer1 = allocate(shape=(length,), dtype=numpy.int32)
out_buffer_silicon = allocate(shape=(length,), dtype=numpy.int32) 

samples0 = random.sample(range(0, length), length)
samples1 = random.sample(range(0, length), length)
zeros = numpy.zeros((length,), dtype=int)

numpy.copyto(in_buffer0, samples0)
numpy.copyto(in_buffer1, samples1)
numpy.copyto(out_buffer_silicon, zeros)
ip.write(CONST0_ADDRESS, 1)
ip.write(CONST1_ADDRESS, 1)

t_start = time.time()
dma0.sendchannel.transfer(in_buffer0)
dma1.sendchannel.transfer(in_buffer1)
dma0.recvchannel.transfer(out_buffer_silicon)
dma0.sendchannel.wait()
dma1.sendchannel.wait()
dma0.recvchannel.wait()
t_stop = time.time()

print("Hardware execution speed: ", t_stop - t_start)
print("Array out: ", out_buffer_silicon)

# Free the buffers only after the results have been read
in_buffer0.close()
in_buffer1.close()
out_buffer_silicon.close()

And it works! Thanks everyone :slight_smile:
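
Printing the output shows that data came back, but not that it is correct. A minimal check, sketched here in plain Python, compares the result with a software element-wise sum. `check_sum` is a hypothetical helper, and the lists below are stand-in data rather than real hardware output.

```python
def check_sum(in0, in1, out):
    """Compare a result with the software element-wise sum of in0 and in1.

    Returns the indices where they differ; an empty list means a match.
    """
    expected = [a + b for a, b in zip(in0, in1)]
    return [i for i, (e, o) in enumerate(zip(expected, out)) if e != o]


# Stand-in data. In the notebook these would be samples0, samples1 and
# the contents of out_buffer_silicon, read before the buffer is closed.
a = [1, 2, 3]
b = [10, 20, 30]
print(check_sum(a, b, [11, 22, 33]))  # []
print(check_sum(a, b, [11, 0, 33]))   # [1]
```

Returning the mismatching indices rather than a single boolean makes it easier to spot patterns such as only the first word being processed.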

Great, and thanks for posting back your solution and letting us know it works for you.

Cathal
