Matrix Multiply

Hey again :slight_smile:

Does anyone know why something like this causes HLS compilation to stall (Vitis 2022.1)?

#include "ap_axi_sdata.h"
#include "hls_stream.h"

typedef ap_axiu<32, 1, 1, 1> pkt;

#define M 256

void mmul(hls::stream<pkt> s_in0[M][M], hls::stream<pkt> s_in1[M][M], hls::stream<pkt> s_out[M][M]) {
#pragma HLS INTERFACE axis port=s_in0
#pragma HLS INTERFACE axis port=s_in1
#pragma HLS INTERFACE axis port=s_out

	pkt data_in0[M][M];
	pkt data_in1[M][M];
	pkt data_out[M][M];

	for (unsigned i = 0; i < M; i++) {
		for (unsigned j = 0; j < M; j++) {
			data_in0[i][j] = s_in0[i][j].read();
			data_in1[i][j] = s_in1[i][j].read();
		}
	}

	for (unsigned i = 0; i < M; i++) {
		for (unsigned j = 0; j < M; j++) {
			for(unsigned k = 0; k < M; k++){
				data_out[i][j].data = data_out[i][j].data + data_in0[i][k].data * data_in1[k][j].data;
			}

			data_out[i][j].dest = data_in1[i][j].dest;
			data_out[i][j].id = data_in1[i][j].id;
			data_out[i][j].keep = data_in1[i][j].keep;
			data_out[i][j].last = data_in1[i][j].last;
			data_out[i][j].strb = data_in1[i][j].strb;
			data_out[i][j].user = data_in1[i][j].user;

			s_out[i][j].write(data_out[i][j]);

			if(data_out[i][j].last){
				break;
			}
		}
	}
}

What am I missing? :sweat_smile:

@alienflip

Alien, in my last post I enclosed a neural-network example that already explains element-wise multiplication and filter-wise computation in detail.

See here

So you can still use those techniques, no matter what you plan to do.

I think the admins will tell you that this is not a PYNQ-related question, but here is my reply anyway:

In “hls::stream<pkt> s_in0[M][M]”, what does “[M][M]” mean?
This creates M*M streams, all of type pkt. You don’t need this! Remove the “[M][M]” from all streams and it should work fine. In your implementation, you read each of these M*M streams only once. What you want is something like this:

void mmul(hls::stream<pkt> &s_in0, .......) {

...
	for (unsigned i = 0; i < M; i++) {
		for (unsigned j = 0; j < M; j++) {
			data_in0[i][j] = s_in0.read();
			data_in1[i][j] = s_in1.read();
		}
	}
...
}
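
If you want to check the single-stream idea on a desktop compiler before going back to the tool, here is a minimal sketch. It is a software model under stated assumptions, not Vitis HLS code: `Stream` is a hypothetical stand-in for `hls::stream` built on `std::queue`, `N` is a small test size instead of M = 256, the payload is a plain `uint32_t` rather than `pkt`, and the name `mmul_single_stream` is mine.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Hypothetical stand-in for hls::stream<T>, so the sketch builds without the HLS headers.
template <typename T>
struct Stream {
    std::queue<T> q;
    void write(const T &v) { q.push(v); }
    T read() {
        assert(!q.empty());  // a real hls::stream read would block instead
        T v = q.front();
        q.pop();
        return v;
    }
    bool empty() const { return q.empty(); }
};

constexpr unsigned N = 2;  // small test size (assumption); the thread uses M = 256

// One stream per port, passed by reference; each input is read N*N times in row-major order.
void mmul_single_stream(Stream<uint32_t> &s_in0, Stream<uint32_t> &s_in1,
                        Stream<uint32_t> &s_out) {
    uint32_t a[N][N];
    uint32_t b[N][N];

    // Fill both local buffers from the two input streams.
    for (unsigned i = 0; i < N; i++) {
        for (unsigned j = 0; j < N; j++) {
            a[i][j] = s_in0.read();
            b[i][j] = s_in1.read();
        }
    }

    // Classic triple loop; one result is written per (i, j) pair.
    for (unsigned i = 0; i < N; i++) {
        for (unsigned j = 0; j < N; j++) {
            uint32_t acc = 0;  // start each dot product from zero
            for (unsigned k = 0; k < N; k++) {
                acc += a[i][k] * b[k][j];
            }
            s_out.write(acc);
        }
    }
}
```

Feeding it {1, 2, 3, 4} and {5, 6, 7, 8} in row-major order should produce the 2x2 product on `s_out`. Note the accumulator starts at zero here, which the `data_out` array in the original post does not do.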

Personal Notes:

  1. Do you really need the dest, id and user signals?
  2. You can define data_in0, data_in1, and data_out as simple u32 (ap_uint<32>) instead of pkt, and save some resources.
  3. This implementation of mmul has a lot of latency, as you may notice, since you need to read all the data in first and only then do the calculation. Why not do it on the fly? Since you’re dealing with uint, accumulation with II=1 is easily achievable. Also, if you do it on the fly, you wouldn’t need so many resources.
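
Note 3 could look roughly like the sketch below. Again this is a plain C++ model, not synthesizable HLS code: `std::queue` stands in for the streams, `N` is a small test size, and `pop_front` and `mmul_on_the_fly` are hypothetical names. It also assumes the whole second matrix arrives first and the first matrix then arrives one element at a time, which is an ordering I am choosing for illustration, not something the original code does.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

constexpr unsigned N = 2;  // small test size (assumption); the thread uses M = 256

// Hypothetical helper: take one value from a std::queue used as a software stream.
static uint32_t pop_front(std::queue<uint32_t> &s) {
    assert(!s.empty());
    uint32_t v = s.front();
    s.pop();
    return v;
}

// Buffers only the second matrix. Each element of the first matrix is consumed
// as it arrives and folded into one row of running sums.
void mmul_on_the_fly(std::queue<uint32_t> &s_a, std::queue<uint32_t> &s_b,
                     std::queue<uint32_t> &s_out) {
    uint32_t b[N][N];
    for (unsigned i = 0; i < N; i++) {
        for (unsigned j = 0; j < N; j++) {
            b[i][j] = pop_front(s_b);
        }
    }

    for (unsigned i = 0; i < N; i++) {
        uint32_t acc[N] = {0};  // running sums for output row i
        for (unsigned k = 0; k < N; k++) {
            const uint32_t a_ik = pop_front(s_a);  // a[i][k] is never stored
            for (unsigned j = 0; j < N; j++) {
                acc[j] += a_ik * b[k][j];
            }
        }
        // Row i is complete once its N inputs have been seen, so emit it now.
        for (unsigned j = 0; j < N; j++) {
            s_out.push(acc[j]);
        }
    }
}
```

Compared with buffering everything, this keeps one N x N array plus N accumulators instead of three N x N arrays. In HLS the inner j loop would be the natural candidate for pipelining or unrolling, but I have not synthesized this version.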

Best of luck!

Thanks for your time, I will try this!

Hi, this really isn’t a PYNQ question :slight_smile: :frowning:

HLS is an amazing tool but not perfect. HLS allows a software programmer to quickly create hardware designs. It greatly reduces the verification effort vs HDL tools. But at the end of the day you’re still “running” on hardware and still have to have a little understanding of how FPGAs work.

The tools and the PL have only a limited amount of resources, and the tools don’t always make it clear when they run out of them. I suspect that is your issue here.

On many CPU-based systems a 64K-element array is very doable, but I don’t think it makes sense to ask the HLS tool to create multiple banks of 64K elements in hardware. You have probably exceeded the “stack” here.
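
To put numbers on the 64K figure, here is a small back-of-the-envelope sketch in standard C++. The function names are hypothetical, and the bit count is a rough assumption: it counts only the 32-bit data field of each pkt and ignores the side-channel signals (keep, strb, last and so on).

```cpp
#include <cassert>
#include <cstdint>

// Number of separate streams declared by one port of the form hls::stream<pkt> s[m][m].
constexpr uint64_t streams_per_port(uint64_t m) { return m * m; }

// Total streams over the three ports (s_in0, s_in1, s_out) of the original mmul.
constexpr uint64_t total_streams(uint64_t m) { return 3 * streams_per_port(m); }

// Data bits held by the three local m x m buffers, 32 data bits per element (assumption).
constexpr uint64_t buffer_data_bits(uint64_t m) { return 3 * m * m * 32; }

// Checked at compile time for M = 256, the value used in the thread.
static_assert(streams_per_port(256) == 65536, "64K streams per port");
static_assert(total_streams(256) == 196608, "three ports in total");
static_assert(buffer_data_bits(256) == 6291456, "data bits across the three buffers");
```

So with M = 256 the original signature asks for 196,608 separate AXI streams, on top of roughly 6.3 Mbit of local buffer data. With M = 2 the same formulas give 12 streams, which is why a small M is a reasonable first test.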

Even if this code “compiled”, you would likely not be able to run the logic very fast, or it might not meet timing at any clock speed.

Try setting M to 2 and see if that compiles. Also look at the Fmax estimate from HLS and see if it runs fast enough for you. Then increase M to see how far you can push it.

Kind regards

To HLS or to CHISEL, that’s the question.

Zen, people are starting to forget what FPGAs, like their ancestors the CPLDs, were made for. They are made for small-volume designs and proofs of concept rather than as ASIC replacements. They also serve as pre-ASIC test gear. Nowadays they are more likely to be a learning device. Accelerators are a subset of the same idea: proving an algorithm on hardware at the pre-ASIC stage.

OpenCL asks: why do I exist?