MNIST + AXI Stream on PYNQ 2.7 (Attention to Details)

Hello PYNQ community:

I hope you are all doing great =]
Several earlier support posts ran into trouble with AXI Stream in HLS for CNN designs and DMA,
so I am going to share the details of a design built with the 2020.2 HLS tool and PYNQ 2.7.

The GitHub repository link for this tutorial:

HLS AXI Stream

People keep asking why TLAST is missing. This is a common problem: after some tool version (I am not sure which), a plain C struct is no longer interpreted as an AXI Stream interface the way we expect:


struct AXI_DMA_IO{
	ap_int<16> data;
	...
};

Instead, use the AXI Stream type that ap_axi_sdata.h provides:

#include <ap_axi_sdata.h>
typedef qdma_axis<16,0,0,0> AXI_DMA_IF;

Some may ask why I use qdma_axis rather than ap_axiu.
In my experience, the difference in control signals comes down to whether strb is present.
So if TX / RX transfers are always aligned, qdma_axis is enough!

Next, how do we TX / RX in HLS?

Block slave → Block master
Remember that connections between internal blocks can only use
hls::stream<data width> &<name>
rather than the AXI Stream interface type.

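#include <hls_stream.h>
#include <ap_axi_sdata.h>
#include <ap_fixed.h>

typedef qdma_axis<16,0,0,0> AXI_DMA_IF;
typedef ap_int<16> AXI_VAL; // assumed: the AXI_VAL typedef is missing from the post

// DMA-facing input: AXI Stream in, plain hls::stream out.
// The function name is a placeholder; the original name is missing from the post.
void axi_dma_slave(
	hls::stream<AXI_DMA_IF> &stream_in,
	hls::stream<AXI_VAL> &stream_out)
{
	AXI_DMA_IF Inbuf;

	Inbuf = stream_in.read();     // one beat from the DMA
	AXI_VAL status = Inbuf.data;  // assumed right-hand side: keep only the payload
	stream_out.write(status);
}

// DMA-facing output: plain hls::stream in, AXI Stream out (placeholder name).
void axi_dma_master(
	hls::stream<AXI_VAL> &stream_in,
	hls::stream<AXI_DMA_IF> &stream_out)
{
	AXI_VAL tmp_val;
	AXI_DMA_IF Outbuf;

	tmp_val = stream_in.read();
	Outbuf.data = tmp_val;
	Outbuf.last = 1;   // TLAST: must be high on the final beat of the transfer
	Outbuf.keep = -1;  // all bytes valid
	stream_out.write(Outbuf);
}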

Now we have the complete picture on the HLS side.

Next we can build our Vivado project:

Top design view

DMA blocks with HLS block

DMA Engine settings

Wonderful! After synthesizing and compiling our design,

put both the .bit and .hwh files in the same folder on the PYNQ disk:

This is the Jupyter notebook for the design:

mnist.ipynb (12.9 KB)
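For readers who only want the DMA part of the notebook, here is a minimal sketch of one send / receive round trip. It is not taken from mnist.ipynb. It assumes `dma` is a PYNQ DMA object with both channels enabled, and that the buffers come from `pynq.allocate`. The helper name `run_dma`, the bitstream name `design.bit`, and the instance name `axi_dma_0` are hypothetical.

```python
def run_dma(dma, in_buf, out_buf):
    """Send in_buf through the DMA and block until out_buf has been written."""
    # Arm the receive channel first so the return path is ready for data.
    dma.recvchannel.transfer(out_buf)
    dma.sendchannel.transfer(in_buf)
    # Block until both directions report completion.
    dma.sendchannel.wait()
    dma.recvchannel.wait()
    return out_buf

# On the board this could be used roughly as follows (names are hypothetical):
#   from pynq import Overlay, allocate
#   ol = Overlay("design.bit")        # the .hwh file must sit next to the .bit
#   dma = ol.axi_dma_0
#   in_buf = allocate(shape=(784,), dtype="int16")
#   out_buf = allocate(shape=(10,), dtype="int16")
#   run_dma(dma, in_buf, out_buf)
```

If the HLS block never asserts TLAST on its output stream, the receive-side `wait()` will typically never return, which is the symptom described in the earlier support posts.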

We train the MNIST CNN on the host PC:


These are the steps:

  1. Float32 training with minimum layers required to achieve good accuracy (1.9 KB)
  2. Quantization + CNN inference compare (1.7 KB)
  3. Model format convert (246 Bytes)

We can see that the weights are converted into a 1+7 fixed-point format here.

Accuracy goes from 95.4 to 95.4 after rounding: only 0.02% is lost. Great!
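To make the 1+7 format concrete, here is a small sketch of rounding a float weight to 8 bits. It assumes "1+7" means 1 sign bit plus 7 fractional bits, so the representable range is -1.0 to 127/128 in steps of 1/128. The function names are hypothetical, and this is not the code from the quantization script; the TensorFlow Lite converter works with a per-tensor scale and zero point, which can differ from this fixed scaling.

```python
def quantize_q1_7(x):
    """Round a float to a signed 8-bit code with 7 fractional bits (assumed format)."""
    code = int(round(x * 128))          # scale by 2**7
    return max(-128, min(127, code))    # saturate to the int8 range

def dequantize_q1_7(code):
    """Map an 8-bit code back to the float value it represents."""
    return code / 128.0

weights = [0.5, -0.25, 0.999, -1.5]
codes = [quantize_q1_7(w) for w in weights]
restored = [dequantize_q1_7(c) for c in codes]
print(codes)     # [64, -32, 127, -128]
print(restored)  # [0.5, -0.25, 0.9921875, -1.0]
```

Values outside the range saturate (0.999 and -1.5 above), so this format only works well when the trained weights stay within roughly ±1.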

After the conversion we get a file named ‘model.tflite’, which is loaded in the Jupyter notebook with TensorFlow Lite.

Final FPGA inference result

Accuracy actually increases by 0.75%, to ~93%

ARM Run Time # 10000 = 290.4783687591553
FPGA Run Time # 10000 = 39.431140661239624
Total Acceleration 7.366724976452236
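The acceleration figure is simply the ratio of the two run times, which can be checked in a few lines. This sketch assumes the run times are in seconds and that "# 10000" is the number of inferences; the post states neither explicitly.

```python
arm_time = 290.4783687591553      # ARM run time for all inferences
fpga_time = 39.431140661239624    # FPGA run time for all inferences
n_images = 10000                  # assumed meaning of "# 10000"

speedup = arm_time / fpga_time
arm_per_image = arm_time / n_images
fpga_per_image = fpga_time / n_images

print(f"speedup: {speedup:.2f}x")  # speedup: 7.37x
```

Under those assumptions this works out to roughly 29 ms per image on the ARM core and roughly 4 ms per image through the FPGA path.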


Hi Brian, thanks for sharing, great work :smiley: . Having tflite running on PYNQ increases the range of possible applications. Also, thanks for giving some performance measurements. That’s really helpful when we need to decide which development path is better for our application.


Hello All,

A CONV+ACT+POOL->FC CNN has also been committed to the GitHub repository.
The acceleration is even more amazing:
Total Acceleration x 672.19
Accuracy is ~95.x with 8-bit fixed point.
