PYNQ: PYTHON PRODUCTIVITY

Vector operation with Stream in HLS and Vivado

Hi, I’m getting started with Vivado and Vitis HLS integration. I’ve already worked through some basic examples, and now I’m trying to design an overlay that performs simple vector operations such as vector sum, dot product, and multiplication by a scalar, but it has been very hard. Sometimes the code in the PYNQ notebook gets stuck when I run the DMA send and receive functions, and sometimes the overlay performs only the first element-wise operation and then stops. I think it could be something to do with the TLAST signal, but I don’t really know how to make it work.

In brief: I need to send one or two fixed-length vectors over AXI stream, perform a simple operation on the PL side, then send the result back to the PS side and print it in a PYNQ notebook.

I’d appreciate your help, thanks in advance.

I’m using an Ultra96V2 board with PYNQ 2.6 and Vivado 2020.2.

#include "ap_int.h"
#include <ap_axi_sdata.h>

typedef ap_axis<256,0,0,0> axis_t;

void smul(hls::stream<axis_t> &A, hls::stream<axis_t> &B) {

#pragma HLS INTERFACE s_axilite port = return bundle = CTRL
//#pragma HLS INTERFACE s_axilite port = length bundle = CTRL
#pragma HLS INTERFACE axis port = A
#pragma HLS INTERFACE axis port = B

	int length = 8;
	float V[length];

	for(unsigned int i=0; i<length; i++){
		axis_t tmp = A.read();
		V[i] = tmp.data + 5;
	}

	for(unsigned int i=0; i<length; i++){
		axis_t tmp;
		tmp.data = V[i];

		ap_uint<1> last = 0;
		if (i == length - 1) {
		  last = 1;
		}
		tmp.last = last;
		B.write(tmp);
	}
}


Hi @Jadacuor,

The Python code you are using would be useful as well.

One issue I see is that you are not assigning a value to the keep signal. There is some more about this here

Mario


I already tested with the keep signal, yet I’m getting the same result.

.
.
.
tmp.last = last;
tmp.keep = -1;
B.write(tmp);
.
.
.

This is the Python Code:

from pynq import (allocate, Overlay)
import numpy as np

overlay = Overlay('vector2.bit')
test_ip = overlay.smul_0
dma = overlay.axi_dma_0

DIM = 8
CTRL_REG = 0x00
AP_START = (1<<0) # bit 0
AUTO_RESTART = (1<<7) # bit 7
  
input_buffer = allocate(shape=(DIM,), dtype=np.int32, cacheable=False)
output_buffer = allocate(shape=(DIM,), dtype=np.int32, cacheable=False)
input_buffer[:] = (100*np.random.rand(DIM,)).astype(dtype=np.int32)

def sumvector():   
    dma.sendchannel.transfer(input_buffer)
    dma.recvchannel.transfer(output_buffer)
    test_ip.write(CTRL_REG, (AP_START | AUTO_RESTART))  # initialize the module
    dma.sendchannel.wait()
    dma.recvchannel.wait()

sumvector()  # here it gets stuck and doesn't print anything
Data = np.column_stack((input_buffer, output_buffer))
print(Data)

The only problem is the AXI stream data width specified in the HLS code. As far as I know, PYNQ only supports 32 bits. If it does support more, I don’t know how, but not this way. Here is the output data I got when I set typedef ap_axis<32,0,0,0> axis_t;


@mizan has a point.

The amount of data the IP expects and the amount of data you send don’t match.

The IP uses a vector type that holds 8 32-bit words, and on top of that it expects 8 of these vectors. In the notebook, however, you are providing only 8 integers, which translates to a single vector arriving at the IP.

So the IP is going to wait forever in this loop:

	for(unsigned int i=0; i<length; i++){
		axis_t tmp = A.read();
		V[i] = tmp.data + 5;
	}

Also, note that the local array is float V[length] while the input vector is ap_axis<256,0,0,0> axis_t. This means that you are dropping 224 bits of each input vector.
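
The mismatch can be checked with a little arithmetic. The sketch below is plain Python and not part of the original design; beats_needed is a hypothetical helper name, and it assumes the DMA packs 32-bit elements back to back into each stream word.

```python
def beats_needed(n_elements, element_bits=32, stream_bits=256):
    """Number of stream words (beats) needed to carry n_elements."""
    total_bits = n_elements * element_bits
    return -(-total_bits // stream_bits)  # ceiling division


sent = beats_needed(8)   # what the notebook sends: 8 int32 values
expected = 8             # what the HLS loop reads: 8 words of 256 bits each
print("beats sent:", sent, "beats the IP waits for:", expected)
```

With a 32-bit stream the same 8 integers arrive as 8 beats, which is why switching the typedef to ap_axis<32,0,0,0> makes the original loop terminate.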

Mario


Thanks for the explanation. Yes, you are right: it is actually dropping the remaining 224 bits. That is why I did not get the result in my project, even when using a stream width converter.
Edit: I have checked, and it actually works with a depth of 8*8, taking the first integer of each vector and dropping the other 7 of the 8 integers. I now clearly understand the behaviour. I had thought that 8 integers would automatically fit into the 256-bit data coming in, since they are 32 bits each.

This is correct: 8 integers will be packed into each input stream word. But the local memory being used is 32 bits wide, so 7 of them are not used.

So, if you want to make the IP work properly, you may have to do something like this:


	ap_int<256> V[length];
	for(unsigned int i=0; i<length; i++){
		axis_t tmp = A.read();
		ap_int<256> tmp1 = 0;
		for (unsigned int j=0; j<256; j+=32) {
			// add unroll pragma here to do the computation in parallel
			tmp1(j+31,j) = tmp.data(j+31,j) + 5;
		}
		V[i] = tmp1;
	}

	for(unsigned int i=0; i<length; i++){
		axis_t tmp;
		tmp.data = V[i];

		ap_uint<1> last = 0;
		if (i == length - 1) {
		  last = 1;
		}
		tmp.last = last;
		tmp.keep= -1;
		B.write(tmp);
	}

This part tmp1(j+31,j) = tmp.data(j+31,j) + 5; is key if you want to operate on each element individually. Otherwise, you will be adding 5 to the 256-bit element as a whole.
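
To see the difference between the two additions, here is a small software model in plain Python (not HLS). The names pack, unpack, add_per_lane and add_whole_word are made up for this illustration, and it assumes element 0 sits in bits 31:0 of the 256-bit word.

```python
LANES, BITS = 8, 32
LANE_MASK = (1 << BITS) - 1
WORD_MASK = (1 << (LANES * BITS)) - 1


def pack(values):
    """Pack 8 signed 32-bit values into one 256-bit integer, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (BITS * i)
    return word


def unpack(word):
    """Split a 256-bit integer back into 8 signed 32-bit values."""
    out = []
    for i in range(LANES):
        lane = (word >> (BITS * i)) & LANE_MASK
        if lane >= 1 << (BITS - 1):   # restore the sign of the lane
            lane -= 1 << BITS
        out.append(lane)
    return out


def add_per_lane(word, k):
    """Model of the inner j loop: add k to every 32-bit slice."""
    return pack([v + k for v in unpack(word)])


def add_whole_word(word, k):
    """Model of tmp.data + k on the full 256-bit value."""
    return (word + k) & WORD_MASK


word = pack([1, 2, 3, 4, 5, 6, 7, 8])
print(unpack(add_per_lane(word, 5)))    # [6, 7, 8, 9, 10, 11, 12, 13]
print(unpack(add_whole_word(word, 5)))  # [6, 2, 3, 4, 5, 6, 7, 8]
```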

If there are further questions about the IP, I would suggest posting them in the Xilinx forums.

Mario


Yes, I got your point. I have already done something similar. Maybe I described it wrongly. So, one read and one write:

	axis_t tmp = A.read();
	for(unsigned int i=0; i<length; i++){
		V[i] = (tmp.data & 0xFFFFFFFF) + 5;
		tmp.data = tmp.data >> 32;
	}
	tmp.data = 0;
	for(unsigned int i=0; i<length; i++){
		// go through ap_uint<32> first so a negative value is not sign-extended into the other lanes
		tmp.data = tmp.data | ((ap_uint<256>)(ap_uint<32>) V[i] << (32 * i));
	}
	tmp.last = 1;
	tmp.keep = -1;
	B.write(tmp);

Maybe something like this will do the trick. But it is still not clear to me how to read the data back in PYNQ and unpack the 256 bits into 8 32-bit ints.
So the primary solution, changing the stream to 32 bits wide, is kind of a minimal solution.

  • It was not my post; I was just trying things out.

You can define the arrays linearly like this

input_buffer = allocate(shape=(8*8,), dtype=np.int32, cacheable=False)
output_buffer = allocate(shape=(8*8,), dtype=np.int32, cacheable=False)

or multidimensional like this

input_buffer = allocate(shape=(8,8), dtype=np.int32, cacheable=False)
output_buffer = allocate(shape=(8,8), dtype=np.int32, cacheable=False)

The underlying infrastructure will take care of the rest.
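
As a rough illustration of why no explicit unpacking is needed on the Python side, the sketch below uses only the standard struct module (no PYNQ or NumPy). It assumes the buffer is little-endian int32, that the DMA reads it in address order, and that the lowest address ends up in the lowest bits of each 256-bit word. The helper name lanes is made up.

```python
import struct

values = list(range(64))             # stand-in for a flat (8*8,) int32 buffer
raw = struct.pack("<64i", *values)   # the bytes as they sit in memory

# Cut the memory image into 32-byte chunks: one chunk per 256-bit stream word.
beats = [raw[k:k + 32] for k in range(0, len(raw), 32)]


def lanes(beat):
    """Interpret one 32-byte beat as 8 little-endian int32 values."""
    return list(struct.unpack("<8i", beat))


print(len(beats))        # 8
print(lanes(beats[1]))   # [8, 9, 10, 11, 12, 13, 14, 15]

# Seen as one 256-bit number, element 1 of the first beat sits in bits 63:32.
word0 = int.from_bytes(beats[0], "little")
print((word0 >> 32) & 0xFFFFFFFF)   # 1
```

Under these assumptions a (8,8) buffer has exactly the same memory image, with each row matching one 256-bit word.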

Mario

Hi, thanks a lot.

I used ap_axis<256,0,0,0> axis_t because I did this calculation: a float needs 4 bytes = 32 bits and I will send 8 elements, so 32*8 = 256. So I thought my bus width should be 256.

In the notebook code I was using float data, and later I changed to int32 just for a test. I need to send a float vector of 360 elements, but I’m trying to understand it first with a shorter length.

I’m going to check everything you both said and will let you know what happens.
Thanks again.

Jairo.

Hi, did you change something else? … I did what you said but it does not work. Did you include the keep signal or not?

Yes, of course you have to include the keep signal, otherwise it won’t work.

You are right about this. The problem is that I was making the same mistake you are making now, assigning the whole 256-bit word to one array element that is a 32-bit float, in the following line:

So the other bits are discarded, and the DMA stalls waiting for new data. Either change the data width to 32 bits and read 8 times (the easiest way, I think), or read the 256-bit data once and split it into the 8 elements of the array.

Here is sample code for finding the difference between array elements:
#include "ap_int.h"
#include <ap_axi_sdata.h>
#include <hls_stream.h>

typedef ap_axis<256, 0, 0, 0> axis_t;
signed int temp, temp2;   // state kept between calls (last element of the previous word). Can be removed along with the 2 associated if statements in the for loops.

void difference(hls::stream < axis_t > & A, hls::stream < axis_t > & B) {

  #pragma HLS INTERFACE axis port = A
  #pragma HLS INTERFACE axis port = B
  #pragma HLS INTERFACE s_axilite port = return

  const int length = 8;
  signed int V[length];
  signed int V_in[length];


// Reading the input data and splitting it into the array elements
  axis_t tmp = A.read();
  for (unsigned int i = 0; i < length; i++) {
	#pragma HLS UNROLL
    V_in[i] = tmp.data & 0xFFFFFFFF;	// take the lowest 32 bits for this array element
    tmp.data = tmp.data >> 32;			// shift the word right by 32 bits for the next array element
    if (i == 7) {						// needed only for carrying data over to the next call
        	temp2=temp;					// keep the last element of the previous array in temp2
        	temp = V_in[i];
    }
  }

// performing some calculations
  for (unsigned int i = 0; i < length; i++) {
	#pragma HLS UNROLL					// unroll the loop so all elements are computed in parallel (reduces latency)
    if (i == 0) V[i] = temp2 - V_in[i]; // difference between the last element of the previous array and the first element of this one
    else V[i] = V_in[i - 1] - V_in[i];	// difference between all other neighbouring elements
  }

// gathering all the array elements and writing the result
  tmp.data = 0;
  for (unsigned int i = 0; i < length; i++) {
	#pragma HLS UNROLL
    tmp.data(i * 32 + 31, i * 32) = V[i];
  }

  tmp.last = 1;							// the DMA needs both the last and keep signals
  tmp.keep = -1;
  B.write(tmp);
}
And the Python code:
from pynq import (allocate, Overlay)
import numpy as np

overlay = Overlay('difference.bit')
difference = overlay.difference
dma = overlay.dma
difference.write(0x00, 0x81)   # AP_START (bit 0) | AUTO_RESTART (bit 7)

DIM = 8*8      # must be multiple of 8 as the ip expects 8 32bit integer in a single stream data.
input_buffer = allocate(shape=(DIM,), dtype=np.int32, cacheable=False)
output_buffer = allocate(shape=(DIM,), dtype=np.int32,cacheable=False)
input_buffer[:] = (100*np.random.rand(DIM,)).astype(dtype=np.int32)

dma.sendchannel.transfer(input_buffer)
dma.sendchannel.wait()
# the IP asserts last on every 256-bit word, so receive 8 integers at a time
for j in range(DIM // 8):
    dma.recvchannel.transfer(output_buffer[j*8:j*8+8])
    dma.recvchannel.wait()
Data = np.column_stack((input_buffer, output_buffer))
print(Data)
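
A note on why the receive side is a loop of 8-integer transfers: since the IP sets last on every output word, each receive transfer is expected to finish after a single word. The toy model below is plain Python, not the PYNQ API; receive is a hypothetical function and the stream is just a list of (data, last) pairs. It assumes a receive transfer ends as soon as a beat with last set arrives.

```python
def receive(stream, max_beats):
    """Collect beats until one has last set or the buffer is full."""
    got = []
    while stream and len(got) < max_beats:
        data, last = stream.pop(0)
        got.append(data)
        if last:
            break
    return got


# Output of an IP that sets last = 1 on every word, as in the code above.
stream = [("word%d" % i, 1) for i in range(8)]
first = receive(stream, max_beats=8)   # a single large transfer stops early
rest = [receive(stream, max_beats=1) for _ in range(7)]
print(first, len(rest))

# Output of an IP that sets last only on the final word.
packet = [("word%d" % i, 1 if i == 7 else 0) for i in range(8)]
print(len(receive(packet, max_beats=8)))
```

In this model, setting last only on the final word of the whole result would let one receive transfer collect everything, which is what the 32-bit version later in the thread does.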

Hi, I’ve just got it working. I divided the problem into three steps: 1. reading the data, 2. performing the operations, and 3. writing the data. I used typedef ap_axis<32,0,0,0> axis_t.

void smul(hls::stream<axis_t> &in, hls::stream<axis_t> &out) {
#pragma HLS INTERFACE s_axilite port = return bundle = control
#pragma HLS INTERFACE axis depth=360 port = in
#pragma HLS INTERFACE axis depth=360 port = out

  DataType l_A[N];
  DataType l_B[N];
 
  int i_limit = 360;
  converter_t converter;

load_A:
  for (int i = 0; i < i_limit; i++) {
    axis_t temp = in.read();
    converter.i = temp.data;
    l_A[i] = converter.d;
  }

  kernel_lidar<DataType>(l_A, l_B);

writeB:
  for (int i = 0; i < i_limit; i++) {
    axis_t temp;
    converter.d = l_B[i];
    temp.data = converter.i;

    ap_uint<1> last = 0;
    if (i == i_limit - 1) {
      last = 1;
    }
    temp.last = last;
    temp.keep = -1; // enabling all bytes
    out.write(temp);
  }
}
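
The definition of converter_t is not shown in the post. Presumably it is a union of a 32-bit integer (i) and a float (d), used to move the raw bits of a float through the integer data field. The Python sketch below shows the same bit reinterpretation with the standard struct module; float_to_bits and bits_to_float are names made up for this illustration.

```python
import struct


def float_to_bits(x):
    """Return the 32-bit pattern of a float as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b):
    """Interpret a 32-bit unsigned integer as a float."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


bits = float_to_bits(1.5)
print(hex(bits))                # 0x3fc00000
print(bits_to_float(bits))      # 1.5
print(bits_to_float(bits) + 5)  # 6.5, arithmetic on the float value
print(bits + 5)                 # arithmetic on the raw bits, a different thing
```

This is also why an expression like tmp.data + 5 does not add 5 to a float sent from the notebook: it adds 5 to the integer bit pattern.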

I also tried with typedef ap_axis<256,0,0,0> axis_t, but to make it work I used #pragma HLS ARRAY_PARTITION to store the data in 8 groups of 45 (8 × 45 = 360).
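
The post does not say which partition type was used. Assuming a cyclic partition with factor 8, so that the 8 elements of one 256-bit word land in 8 different memories, the index mapping would look like the Python sketch below. bank_and_offset is a hypothetical helper name.

```python
N, LANES = 360, 8


def bank_and_offset(i):
    """Map element index i to (bank, position inside the bank)."""
    return i % LANES, i // LANES


banks = [[] for _ in range(LANES)]
for i in range(N):
    bank, _ = bank_and_offset(i)
    banks[bank].append(i)

print([len(b) for b in banks])   # 8 banks of 45 elements each
print(bank_and_offset(359))      # (7, 44)
```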

This code was very helpful.

Thanks a lot, @marioruiz and @mizan.
