Hello PYNQ community:
I hope you are all doing great =]
Several previous support posts have run into trouble with AXI Stream in HLS for CNN designs and DMA:
So I am going to share the details of a design built with the 2020.2 HLS tool and PYNQ 2.7.
The GitHub repository link for this tutorial:
HLS AXI Stream
People keep asking why TLAST is missing. This is common: after some tool version (I am not sure exactly which), a plain C struct is no longer interpreted as the AXI Stream side channels we expected:
Old
struct AXI_DMA_IO {
    ap_int<16> data;
    ap_int<1>  last;
};
New
#include <ap_axi_sdata.h>
typedef qdma_axis<16,0,0,0> AXI_DMA_IF;
Some may ask why it is qdma_axis rather than ap_axiu. In my experience the difference comes down to the extra control signals, such as TSTRB, that ap_axiu carries. If your TX/RX data is always aligned, qdma_axis is enough!
Next, how do we TX/RX on the HLS side?
Block slave → Block master
Remember that connections between internal blocks can only use
hls::stream<data type> &<name>
rather than an AXI Stream interface.
#include <hls_stream.h>
#include <ap_axi_sdata.h>
#include <ap_fixed.h>

typedef qdma_axis<16, 0, 0, 0> AXI_DMA_IF; // AXI Stream beat with data/keep/last
typedef ap_int<16> AXI_VAL;                // payload type for internal streams

// Slave side: read one beat from the DMA and forward only the payload.
void AXI_DMA_SLAVE(
    hls::stream<AXI_DMA_IF> &stream_in,
    hls::stream<AXI_VAL> &stream_out)
{
#pragma HLS INTERFACE axis port=stream_in
    AXI_DMA_IF Inbuf = stream_in.read();
    stream_out.write(Inbuf.data);
}

// Master side: take a payload from the internal stream and rebuild the
// side-channel signals before sending it back to the DMA.
void AXI_DMA_MASTER(
    hls::stream<AXI_VAL> &stream_in,
    hls::stream<AXI_DMA_IF> &stream_out)
{
#pragma HLS INTERFACE axis port=stream_out
    AXI_VAL tmp_val = stream_in.read();
    AXI_DMA_IF Outbuf;
    Outbuf.data = tmp_val;
    Outbuf.last = 1;  // assert TLAST so the DMA knows the transfer is done
    Outbuf.keep = -1; // all bytes valid
    stream_out.write(Outbuf);
}
Now we have a complete picture of the HLS side.
Next, we can construct our Vivado project:
Top design view
DMA blocks with HLS block
DMA Engine settings
Wonderful! After synthesizing and implementing our design, generate the bitstream.
Put both the .bit and .hwh files in the same folder on the PYNQ board's storage:
This is the Jupyter Notebook design:
mnist.ipynb (12.9 KB)
For the MNIST CNN, we are going to train on the host PC:
Reference https://www.kaggle.com/code/oricou/mnist-without-cnn-and-softmax/notebook
These are the steps:
- Float32 training with the minimum layers required to achieve good accuracy: mnist_keras.py (1.9 KB)
- Quantization + CNN inference comparison: load_mnist.py (1.7 KB)
- Model format conversion: convert.py (246 Bytes)
We can see that the weights are converted into 1+7 fixed-point format here.
Accuracy changes from 95.4% to 95.4%, only about a 0.02% loss. Great!
After the conversion we get a file named ‘model.tflite’; this is loaded in the Jupyter Notebook with TensorFlow Lite.
Final FPGA inference result:
Accuracy actually increases by 0.75%, to roughly 93%.
ARM runtime (10000 images) = 290.4783687591553 s
FPGA runtime (10000 images) = 39.431140661239624 s
Total acceleration = 7.366724976452236×