Hi, I’m starting out with Vivado and Vitis HLS integration. I’ve already done some basic examples, but now I’m trying to design an overlay that performs simple vector operations (vector sum, dot product, multiplication by a scalar), and it has been harder than expected. Sometimes the code in the PYNQ notebook gets stuck when I run the DMA send and receive functions, and sometimes the overlay only performs the first element-wise operation and then stops. I suspect it could be something with the TLAST signal, but I don’t really understand how to make it work.
In brief: I need to send one or two fixed-length vectors over AXI Stream, perform a simple operation on the PL side, then send the result back to the PS side and print it in a PYNQ notebook.
I’d appreciate your help, thanks in advance.
I’m using an Ultra96-V2 board with PYNQ 2.6 and Vivado 2020.2.
#include "ap_int.h"
#include <ap_axi_sdata.h>
#include <hls_stream.h>

typedef ap_axis<256, 0, 0, 0> axis_t;

void smul(hls::stream<axis_t> &A, hls::stream<axis_t> &B) {
#pragma HLS INTERFACE s_axilite port = return bundle = CTRL
//#pragma HLS INTERFACE s_axilite port = length bundle = CTRL
#pragma HLS INTERFACE axis port = A
#pragma HLS INTERFACE axis port = B
    const int length = 8;
    float V[length];
    for (unsigned int i = 0; i < length; i++) {
        axis_t tmp = A.read();
        V[i] = tmp.data + 5;
    }
    for (unsigned int i = 0; i < length; i++) {
        axis_t tmp;
        tmp.data = V[i];
        ap_uint<1> last = 0;
        if (i == length - 1) {
            last = 1;
        }
        tmp.last = last;
        B.write(tmp);
    }
}
The only problem I see is the AXIS data width specified in the HLS code. As far as I know, PYNQ only supports 32 bits; if it supports more, I don’t know how, but not this way. Here is the output data I got when I set typedef ap_axis<32,0,0,0> axis_t;
The amount of data the IP expects and the data you send don’t match.
The IP uses a vector type that holds 8 x 32-bit words, and on top of that it expects 8 of these vectors. In the notebook, however, you are only providing 8 integers, which translates to only one vector arriving at the IP.
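To make the mismatch concrete, here is a small plain-Python sketch (no PYNQ or hardware needed, just arithmetic on the stated widths) showing that eight 32-bit integers fill exactly one 256-bit stream beat, so a notebook buffer of eight int32 values produces a single beat, not the eight beats the IP’s read loop waits for:

```python
import struct

# Eight 32-bit integers, as the notebook buffer holds them.
values = [10, 20, 30, 40, 50, 60, 70, 80]

# Packed little-endian, they occupy 8 * 4 = 32 bytes = 256 bits:
# exactly one beat of a 256-bit AXI stream.
beat = struct.pack('<8i', *values)
print(len(beat) * 8)   # -> 256

# The IP's loop calls A.read() eight times, so it waits for
# 8 beats * 256 bits = 2048 bits, i.e. 64 int32 values in total.
expected_ints = 8 * 256 // 32
print(expected_ints)   # -> 64
```

With only 8 integers sent, the DMA delivers one beat and the IP stalls on the remaining seven reads, which matches the "stuck on send/receive" symptom.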
Thanks for the explanation. Yes, you are right, it’s actually dropping the remaining 224 bits; that’s why I did not get the result in my project even when using a stream width converter.
Edit: I have checked, and it actually works with a depth of 8*8, taking every 1st integer of each vector and dropping the other 7 of the 8 integers. Now I clearly understand the phenomenon: I had assumed the 8 integers would automatically be fitted into the 256-bit input, since they were 32 bits each.
This is correct: 8 integers will be packed into each beat of the input stream. But the local memory being used is 32-bit, so 7 of them are discarded.
So, if you want to make the IP work properly, you may have to do something like this:
ap_int<256> V[length];
for (unsigned int i = 0; i < length; i++) {
    axis_t tmp = A.read();
    ap_int<256> tmp1 = 0;
    for (unsigned int j = 0; j < 256; j += 32) {
#pragma HLS UNROLL // do the per-element computation in parallel
        tmp1(j + 31, j) = tmp.data(j + 31, j) + 5;
    }
    V[i] = tmp1;
}
for (unsigned int i = 0; i < length; i++) {
    axis_t tmp;
    tmp.data = V[i];
    ap_uint<1> last = 0;
    if (i == length - 1) {
        last = 1;
    }
    tmp.last = last;
    tmp.keep = -1;
    B.write(tmp);
}
The line tmp1(j+31, j) = tmp.data(j+31, j) + 5; is the key if you want to operate on each element individually. Otherwise, you will be adding 5 to the whole 256-bit element.
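The difference between the two cases can be reproduced with plain Python integers (a rough model of the ap_int bit slicing; the 8 x 32-bit lane layout is the only assumption):

```python
import struct

lanes = [1, 2, 3, 4, 5, 6, 7, 8]  # eight 32-bit lanes of one 256-bit word
word = int.from_bytes(struct.pack('<8I', *lanes), 'little')

# Lane-wise add, like tmp1(j+31, j) = tmp.data(j+31, j) + 5 in HLS:
lane_sum = 0
for j in range(0, 256, 32):
    lane = (word >> j) & 0xFFFFFFFF
    lane_sum |= ((lane + 5) & 0xFFFFFFFF) << j

# Whole-word add, which is what a plain "+ 5" on the 256-bit value does:
word_sum = word + 5

print(list(struct.unpack('<8I', lane_sum.to_bytes(32, 'little'))))
# -> [6, 7, 8, 9, 10, 11, 12, 13]   (every lane got +5)
print(list(struct.unpack('<8I', word_sum.to_bytes(32, 'little'))))
# -> [6, 2, 3, 4, 5, 6, 7, 8]       (only the lowest lane changed)
```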
If you have further questions about the IP, I would suggest posting them in the Xilinx forums.
Maybe something like this will do the trick. But it is still not clear how to read the data back in PYNQ and unpack the 256 bits into 8 x 32-bit integers.
So the primary solution, changing the stream to 32 bits wide, is the minimal one.
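On the unpacking question: once a 256-bit beat arrives back in the notebook as raw bytes, it can be split into eight signed 32-bit integers with the stdlib struct module (on the board, a numpy int32 buffer view would do the same job; the little-endian lane order is an assumption about the DMA byte layout):

```python
import struct

# Pretend this is one 256-bit beat received from the DMA,
# carrying the eight int32 results [-1, 2, 3, 4, 5, 6, 7, 8].
beat = struct.pack('<8i', -1, 2, 3, 4, 5, 6, 7, 8)

# Unpack the 32-byte beat into eight signed 32-bit integers.
results = list(struct.unpack('<8i', beat))
print(results)   # -> [-1, 2, 3, 4, 5, 6, 7, 8]
```

In practice, allocating the PYNQ buffer with dtype=np.int32 lets numpy do this split implicitly, which is why the 32-bit stream is the simpler route.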
I used typedef ap_axis<256,0,0,0> axis_t because of this calculation: a float needs 4 bytes = 32 bits, and I will send 8 elements, so 32 * 8 = 256. So I thought my bus width should be 256.
In the notebook I was using float data, and later I changed to int32 just for a test. I actually need to send a float vector of 360 elements, but I’m trying to understand the flow first with a shorter length.
I’m going to check everything you both said and will let you know what happens.
Thanks again.
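For the 360-element float case, the beat count for a 256-bit stream works out evenly, which is worth checking before committing to a width (a quick sanity calculation, nothing board-specific):

```python
N = 360          # vector length
ELEM_BITS = 32   # float32 element width
BEAT_BITS = 256  # candidate stream width

total_bits = N * ELEM_BITS
beats, remainder = divmod(total_bits, BEAT_BITS)
print(beats, remainder)   # -> 45 0  (45 full beats, no padding needed: 360 = 45 * 8)
```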
You are right about this. The problem is I was making the same mistake of assigning the whole 256 bits to one 32-bit element of the array, as you are doing now in the following line:
So the other bits are discarded, and the DMA halts waiting for new data. Either you change the data width to 32 bits and read it 8 times (the easiest way, I think), or you read the 256-bit data once and divide it into 8 array elements.
Here is sample code for finding the difference between array elements:
#include "ap_int.h"
#include <ap_axi_sdata.h>
#include <hls_stream.h>

typedef ap_axis<256, 0, 0, 0> axis_t;

signed int temp, temp2; // needed only for carrying data to the next cycle; can be removed along with the two associated if statements

void difference(hls::stream<axis_t> &A, hls::stream<axis_t> &B) {
#pragma HLS INTERFACE axis port = A
#pragma HLS INTERFACE axis port = B
#pragma HLS INTERFACE s_axilite port = return
    const int length = 8;
    signed int V[length];
    signed int V_in[length];
    // Read the input beat and split it into the array elements
    axis_t tmp = A.read();
    for (unsigned int i = 0; i < length; i++) {
#pragma HLS UNROLL
        V_in[i] = tmp.data & 0xFFFFFFFF; // take the low 32 bits as the next element
        tmp.data = tmp.data >> 32;       // shift down by 32 bits for the next element
        if (i == length - 1) {           // needed only for carrying data to the next cycle
            temp2 = temp;                // last element of the previous beat
            temp = V_in[i];
        }
    }
    // Perform the calculation
    for (unsigned int i = 0; i < length; i++) {
#pragma HLS UNROLL // unrolling the loop greatly reduces latency
        if (i == 0) V[i] = temp2 - V_in[i]; // difference of the first element from the last element of the previous beat
        else V[i] = V_in[i - 1] - V_in[i];  // difference of all other elements
    }
    // Gather the array elements and write the result
    tmp.data = 0;
    for (unsigned int i = 0; i < length; i++) {
#pragma HLS UNROLL
        tmp.data(i * 32 + 31, i * 32) = V[i];
    }
    tmp.last = 1; // the last and keep signals are required by the DMA
    tmp.keep = -1;
    B.write(tmp);
}
And the Python code:
from pynq import allocate, Overlay
import numpy as np

overlay = Overlay('difference.bit')
difference = overlay.difference
dma = overlay.dma
difference.write(0x00, 0x81)  # set ap_start and auto_restart

DIM = 8 * 8  # must be a multiple of 8, as the IP expects 8 32-bit integers per stream beat
input_buffer = allocate(shape=(DIM,), dtype=np.int32, cacheable=False)
output_buffer = allocate(shape=(DIM,), dtype=np.int32, cacheable=False)
input_buffer[:] = (100 * np.random.rand(DIM,)).astype(np.int32)

dma.sendchannel.transfer(input_buffer)
dma.sendchannel.wait()
for j in range(DIM // 8):
    dma.recvchannel.transfer(output_buffer[j * 8:j * 8 + 8])
    dma.recvchannel.wait()

Data = np.column_stack((input_buffer, output_buffer))
print(Data)
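As a sanity check for the notebook output, the IP’s arithmetic can be modelled in plain Python: each output element is the previous input element minus the current one, and the temp/temp2 globals carry the last element of one 8-element beat into the next (this mirrors the HLS code above, with the globals starting at zero as they do in the IP; it is a reference model, not board code):

```python
def difference_model(data, block=8):
    """Model of the difference IP: out[i] = in[i-1] - in[i],
    with the carry across blocks held in prev_last (the temp/temp2 pair)."""
    prev_last = 0  # globals are zero-initialized in the IP
    out = []
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        out.append(prev_last - chunk[0])  # first element uses the previous block's last
        out.extend(chunk[i - 1] - chunk[i] for i in range(1, len(chunk)))
        prev_last = chunk[-1]
    return out

print(difference_model([10, 12, 9, 9, 20, 1, 1, 5]))
# -> [-10, -2, 3, 0, -11, 19, 0, -4]
```

Comparing this against the output_buffer column in the notebook quickly shows whether the DMA delivered the beats in the expected order.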
Hi, I’ve just got it working. I divided the problem into three steps: 1. reading data, 2. performing operations, and 3. writing data. I used typedef ap_axis<32,0,0,0> axis_t.
// Definitions not shown in the original post, reconstructed from context:
#include "ap_int.h"
#include <ap_axi_sdata.h>
#include <hls_stream.h>

typedef ap_axis<32, 0, 0, 0> axis_t;
typedef float DataType;
const int N = 360;

// Union to reinterpret the 32-bit stream payload as float or int
typedef union {
    int i;
    DataType d;
} converter_t;

template <typename T> void kernel_lidar(T A[N], T B[N]); // the compute kernel, defined elsewhere

void smul(hls::stream<axis_t> &in, hls::stream<axis_t> &out) {
#pragma HLS INTERFACE s_axilite port = return bundle = control
#pragma HLS INTERFACE axis depth = 360 port = in
#pragma HLS INTERFACE axis depth = 360 port = out
    DataType l_A[N];
    DataType l_B[N];
    const int i_limit = 360;
    converter_t converter;
load_A:
    for (int i = 0; i < i_limit; i++) {
        axis_t temp = in.read();
        converter.i = temp.data;
        l_A[i] = converter.d;
    }
    kernel_lidar<DataType>(l_A, l_B);
writeB:
    for (int i = 0; i < i_limit; i++) {
        axis_t temp;
        converter.d = l_B[i];
        temp.data = converter.i;
        ap_uint<1> last = 0;
        if (i == i_limit - 1) {
            last = 1;
        }
        temp.last = last;
        temp.keep = -1; // enable all bytes
        out.write(temp);
    }
}
I also tried typedef ap_axis<256,0,0,0> axis_t, but to make it work I used #pragma HLS ARRAY_PARTITION to store the data in 8 groups of 45 = 360 elements.
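The converter_t union in that code reinterprets the same 32 bits as either float or int. If you need the same trick on the notebook side without numpy views, the stdlib struct module does it (a sketch independent of PYNQ; the little-endian byte order is an assumption):

```python
import struct

def float_to_bits(x):
    """Reinterpret a float32's bit pattern as a signed int32 (like converter.d -> converter.i)."""
    return struct.unpack('<i', struct.pack('<f', x))[0]

def bits_to_float(i):
    """Reinterpret a signed int32 as a float32 (like converter.i -> converter.d)."""
    return struct.unpack('<f', struct.pack('<i', i))[0]

bits = float_to_bits(1.0)
print(hex(bits & 0xFFFFFFFF))   # -> 0x3f800000, the IEEE-754 pattern for 1.0
print(bits_to_float(bits))      # -> 1.0
```

On the board, viewing a np.float32 buffer as np.int32 (or just allocating the DMA buffers as float32) achieves the same reinterpretation without copying.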