Bottleneck

Hi everyone,

I’m trying to evaluate the performance of a specific IP with different parallelism settings on the PYNQ-Z1. In particular, I’m using an IP generated by FINN, and I want to compare the accelerator’s performance for different values of PE and SIMD.

These are the results I obtain:

16 PE and 8 SIMD: 36.5ms (this uses 30% of LUTs)
8 PE and 8 SIMD: 37.1ms (this uses 26% of LUTs)
4 PE and 4 SIMD: 38.0ms (this uses 18% of LUTs)
2 PE and 2 SIMD: 73ms (this uses 12% of LUTs)

From my understanding, it seems that below ~38ms there is some kind of bottleneck. I read in another post on the forum that someone mentioned about 100ms of overhead due to the PYNQ libraries (Execution time in PYNQ-Z2 - #2 by cathalmccabe), but in my case the overhead looks more like 38ms.
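
For reference, this is the rough back-of-envelope compute-time estimate I’m comparing against, assuming the usual FINN folding (roughly (MW/SIMD) * (MH/PE) cycles per input) and a 100 MHz clock; both the formula and the clock are assumptions on my part:

    # Rough estimate of the MVAU compute time (a sketch, not a measurement).
    # Assumes cycles per image ~= (MW/SIMD) * (MH/PE) and a 100 MHz fabric
    # clock -- adjust both to the actual design.
    CLK_HZ = 100e6
    MW, MH, NUM_REPS = 576, 64, 784

    for pe, simd in [(16, 8), (8, 8), (4, 4), (2, 2)]:
        cycles = NUM_REPS * (MW // simd) * (MH // pe)
        print(f"PE={pe:2d} SIMD={simd:2d}: ~{cycles / CLK_HZ * 1e3:.1f} ms of compute")

If those assumptions hold, only the 2 PE / 2 SIMD case should be compute-bound (~72 ms), which seems consistent with the plateau around 37-38 ms being dominated by a fixed overhead rather than by the accelerator itself.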

This is the kernel I’m trying to test:

#define AP_INT_MAX_W 512

#include "bnn-library.h"

// includes for network parameters
#include "weights.hpp"
#include "activations.hpp"
#include "mvau.hpp"
#include "thresh.h"
#include "utility.hpp"

// defines for network parameters
#define MW1 576
#define MH1 64

#define SIMD1 8
#define PE1 16
#define WMEM1 288

#define TMEM1 4
#define numReps 784
#define WP1 8

// This is the MatrixVectorActivation_1 of the cnv2w2a, extended to 8 bit and with 8 SIMD and 8 PE instead of 16 and 16; it has been renamed for convenience
void MatrixVectorActivation_0(
                    hls::stream<ap_uint<64>> &in0,
                    hls::stream<ap_uint<1024>> &weights,
                    hls::stream<ap_uint<128>> &out,
                    const ap_uint<1> is_input_faulty,
                    const ap_uint<1> is_weight_faulty,
                    const ap_uint<3> bit_faulty,
                    const ap_uint<1> mac_faulty[16][8]
                    )
{
#pragma HLS INTERFACE axis port=in0 name=in0_V
#pragma HLS INTERFACE axis port=out name=out_V
#pragma HLS INTERFACE s_axilite port=is_input_faulty bundle=fault_parameters
#pragma HLS INTERFACE s_axilite port=is_weight_faulty bundle=fault_parameters
#pragma HLS INTERFACE s_axilite port=bit_faulty bundle=fault_parameters
#pragma HLS INTERFACE s_axilite port=return bundle=fault_parameters
#pragma HLS INTERFACE s_axilite port=mac_faulty bundle=fault_parameters
#pragma HLS INTERFACE axis port=weights name=weights_V

// ap_uint<1> is_input_faulty = 1;
// ap_uint<1> is_weight_faulty = 0;
// ap_uint<3> bit_faulty = 2;
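// Build single-bit fault masks: bit 'bit_faulty' is set to the corresponding
// fault flag, so each mask is non-zero only when that fault is injected.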
ap_uint<8> input_mask = 0;
input_mask.set_bit(bit_faulty, is_input_faulty);
ap_uint<8> weight_mask = 0;
weight_mask.set_bit(bit_faulty, is_weight_faulty);

Matrix_Vector_Activate_Stream_Batch<MW1, MH1, SIMD1, PE1, 8, 8, 3, Slice<ap_int<8>>, Slice<ap_int<8>>, Identity, ap_int<8> >
                (in0, out, weights, PassThroughActivation<ap_int<8>>(), numReps, ap_resource_lut(), input_mask, weight_mask, bit_faulty, mac_faulty);

}

The function Matrix_Vector_Activate_Stream_Batch basically executes the convolution given the input and the weights.

On the host side, I’m using two DMAs, and to measure the execution time I use this piece of code in the Python host:

    # Code to initialize everything I need

    start = time.time()
    dma_send_a.transfer(input_buffer)
    dma_send_c.transfer(weights_buffer)
    dma_recv_b.transfer(output_buffer)
    dma_send_a.wait()
    dma_send_c.wait()
    end = time.time()

    # Code to do what I need after the computation of the data

I’ve only included the measurement part of the host code since the full script is pretty big. If you need further details, just let me know.
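
For what it’s worth, this is how I was planning to break the measurement down further (just a sketch with the same DMA and buffer names as above; I have not run it yet):

    # Sketch of a finer-grained measurement: time the transfer calls and the
    # waits separately, also wait on the receive channel, and repeat a few
    # times to separate fixed overhead from compute time.
    import time

    N_RUNS = 10
    for _ in range(N_RUNS):
        t0 = time.time()
        dma_send_a.transfer(input_buffer)
        dma_send_c.transfer(weights_buffer)
        dma_recv_b.transfer(output_buffer)
        t1 = time.time()
        dma_send_a.wait()
        dma_send_c.wait()
        dma_recv_b.wait()  # also wait for the output transfer to complete
        t2 = time.time()
        print(f"issue: {(t1 - t0) * 1e3:.2f} ms, wait: {(t2 - t1) * 1e3:.2f} ms")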

Do you have any idea what the problem may be? And if so, any possible solutions?

Thanks in advance

Giovanni

Hi @giop98,

I would say that the FINN forum is the best place to ask questions about FINN accelerator performance for different configurations.

I would also suggest profiling your Python code using the Python profiler tools. This should help you identify where the time is spent.
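
For example, something along these lines would show in which calls the time is actually spent (just a sketch; run_accelerator() is a placeholder for a function wrapping the DMA code you posted):

    import cProfile
    import pstats

    # run_accelerator() is assumed to wrap the transfer/wait code above
    cProfile.run("run_accelerator()", "accel.prof")
    pstats.Stats("accel.prof").sort_stats("cumulative").print_stats(20)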

Mario

Thank you very much @marioruiz
