Residual networks revolutionised DNN research when they were first introduced, allowing deep networks to be effectively trained to high accuracy. Since then, ResNet50 in particular has become one of the most popular neural networks for image classification, and is used both as a benchmark for ML training and inference acceleration and as the core of many image processing workloads such as object detection, semantic segmentation, and others.
In this article we describe the first fully quantized, all-dataflow ResNet50 inference accelerator for Xilinx Alveo boards. The source code is available on GitHub and we provide a Python package and Jupyter Notebook to get you started and show how the accelerator is controlled using PYNQ for Alveo.
Built using the FINN dataflow processing architecture, this accelerator showcases the advantage of deep quantization for FPGA acceleration of DNN workloads in the datacenter.
The key performance metrics are:
| FPGA Device | ImageNet Accuracy | Max FPS | Batch-1 Latency | Power @ Max FPS | Power @ Batch-1 |
|---|---|---|---|---|---|
| Alveo U250 | 65% Top-1 / 85% Top-5 | 2000 | 2 ms | 70 W | 40 W |
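A useful derived metric for the datacenter is energy per inference. The numbers below are taken straight from the table; the calculation itself (and the assumption that batch-1 frames are processed back-to-back) is ours:

```python
# Energy per frame derived from the headline numbers in the table above.
max_fps = 2000          # frames/s at the largest batch size
power_max_fps = 70.0    # W at max FPS
power_batch1 = 40.0     # W at batch size 1
batch1_latency = 2e-3   # s per frame at batch 1

# Joules per frame = watts / (frames per second)
energy_max = power_max_fps / max_fps            # 35 mJ/frame at max FPS
# Assuming back-to-back batch-1 calls, one frame occupies one latency window
energy_batch1 = power_batch1 * batch1_latency   # 80 mJ/frame at batch 1

print(f"{energy_max * 1e3:.0f} mJ/frame at max FPS, "
      f"{energy_batch1 * 1e3:.0f} mJ/frame at batch 1")
```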
In addition to demonstrating the achievable performance of low-precision dataflow acceleration on Alveo, the ResNet50 design serves as a proof of concept for two key features of future FINN releases: modular build flows based on Vivado IP Integrator, and a pure Python interface to the accelerator. By leveraging PYNQ, we gain access to useful functionality such as power measurement, as demonstrated below.
Modular build flow
FINN accelerators targeting embedded parts, such as the BNN-PYNQ accelerators, implement the entire acceleration functionality in a single monolithic HLS C++ description. For large datacenter-class designs this approach is not feasible, as the HLS simulation and synthesis times become very large.
Instead, we identify the key computational pattern, the residual block, which we implement as an HLS C++ IP block by assembling multiple Matrix-Vector-Activation Units from the FINN HLS Library. We then construct the accelerator by instantiating and connecting multiple residual blocks together in a Vivado IPI block design; the blocks are synthesized in parallel and the design is exported as a netlist IP.
In our flow, this IP is linked by Vitis into an Alveo platform, but users are free to integrate the ResNet50 IP in their own Vivado-based flows and augment it with other HLS or RTL IP. See our build scripts and documentation for more information.
Pure Python host interface
Using PYNQ for Alveo, users can interface directly with the ResNet50 accelerator in Python.
To program the accelerator, an Overlay object is created from an XCLBin file produced by Vitis.
```python
import pynq

ol = pynq.Overlay("resnet50.xclbin")
accelerator = ol.resnet50_1
```
Before using the accelerator, we must configure the weights of the fully-connected layer in DDR Bank 0. Assuming the weights are already loaded in the NumPy array `fcweights`, we allocate a buffer of appropriate size, copy the weights into it, and flush it to the Alveo DDR Bank 0.
```python
fcbuf = pynq.allocate((1000, 2048), dtype=np.int8, target=ol.bank0)
fcbuf[:] = fcweights
fcbuf.sync_to_device()
```
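If the trained FC weights are only available in floating point, they must first be quantized to int8. The exact quantization scheme must match how the network was quantized during training; the snippet below is only an illustrative symmetric per-tensor sketch using a stand-in random weight matrix, not the scheme used by the released model:

```python
import numpy as np

def quantize_int8(w_float):
    """Symmetric per-tensor int8 quantization (illustrative only)."""
    scale = np.abs(w_float).max() / 127.0
    q = np.round(w_float / scale).astype(np.int8)
    return q, scale

# Random stand-in for the (1000, 2048) float FC weight matrix
rng = np.random.default_rng(0)
w = rng.standard_normal((1000, 2048)).astype(np.float32)
fcweights, fc_scale = quantize_int8(w)
```

The resulting `fcweights` array can then be copied into `fcbuf` as shown above, with `fc_scale` folded into the host-side post-processing.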
To perform inference we first allocate input and output buffers for one image, and copy the contents of the NumPy array `img` into the input buffer. We then flush the input data to Alveo DDR Bank 0, and call the accelerator, providing as arguments the input and output buffers, the FC layer weights buffer, and the number of images to process, in this case just one. After the call finishes, we pull the output buffer data from the accelerator DDR to host memory and copy its contents to user memory in a NumPy array.
```python
inbuf = pynq.allocate((224, 224, 3), dtype=np.int8, target=ol.bank0)
outbuf = pynq.allocate((5,), dtype=np.uint32, target=ol.bank0)
inbuf[:] = img
inbuf.sync_to_device()
accelerator.call(inbuf, outbuf, fcbuf, 1)
outbuf.sync_from_device()
results = np.copy(outbuf)
```
It’s that easy! See our Jupyter Notebook demo and application examples for more details.
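The accelerator expects a 224×224×3 int8 image, so the host must prepare `img` accordingly. The exact preprocessing (resize, crop, normalisation, channel order) must match what the network was trained with; the snippet below is only a sketch of one plausible host-side step, shifting a uint8 image into a signed range:

```python
import numpy as np

def preprocess(raw_u8):
    """Map a uint8 HWC image into int8 range (illustrative only:
    the real scale/offset depend on the trained model)."""
    assert raw_u8.shape == (224, 224, 3) and raw_u8.dtype == np.uint8
    # Shift [0, 255] to the signed range [-128, 127]
    return (raw_u8.astype(np.int16) - 128).astype(np.int8)

img = preprocess(np.zeros((224, 224, 3), dtype=np.uint8))
```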
Using the PYNQ PMBus extension, we can inspect the power rails of the Alveo card and determine the total card power in real time while the accelerator is executing. Here we plot the total power while changing the batch size. As expected, larger batches provide more frames per second but at an increased power and latency cost. See our Jupyter Notebook demo for an interactive version of this demonstration.
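Under the hood, total card power is the sum over the card's power rails of voltage times current, averaged over a sampling window. A hardware-free sketch of that aggregation (the rail names and sample values here are made up for illustration):

```python
import numpy as np

# Hypothetical per-rail sensor samples: (voltage in V, current in A) over time
samples = {
    "12v_pex": (np.array([12.0, 12.1, 11.9]), np.array([3.0, 3.2, 3.1])),
    "12v_aux": (np.array([12.0, 12.0, 12.0]), np.array([2.0, 2.1, 1.9])),
}

# Instantaneous total power per sample, then the mean over the window
total_power = sum(v * i for v, i in samples.values())
print(f"mean total card power: {total_power.mean():.1f} W")
```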