FPGA Data Movement Using NumPy

gnatale · December 10, 2019, 11:28am


Without a doubt, NumPy is today the de facto standard Python package for array processing. It provides a high-performance array object and a wide variety of powerful tools to work with these arrays. It is used pretty much everywhere, from scientific computing to machine learning to computer vision, and is integrated into many Python libraries. And the list goes on!

PYNQ lives and breathes Python. We believe in the potential of this powerful productivity language and, whenever we can, we try to leverage what is available in the Python ecosystem. Did you know that PYNQ provides native integration with NumPy to manage your Programmable Logic (PL) buffers?
That’s what we are going to explore in this article!

In particular, we are going to use the PYNQ-HelloWorld project, which showcases how to run a resizer IP to resize an image on the FPGA. Instructions on how to deploy this example on your PYNQ-enabled board can be found on the project GitHub page.

The code shown here was run on a PYNQ-Z2 board. We refer specifically to the resizer_PL notebook, which makes use of a PL resizer IP from the Xilinx xfopencv library, but we will update it slightly to reflect the API changes made with PYNQ v2.5. Let’s get started!

numpy.ndarray

Before diving into the example, let’s take a quick look at the numpy.ndarray interface. It represents the low-level API to instantiate multidimensional arrays in NumPy. Let’s create a simple array of 10 integer elements:

import numpy as np
foo = np.ndarray(shape=(10,), dtype=int)

As you can see, the constructor takes a shape tuple describing the shape of the array (in this case a one-dimensional array of 10 elements) and a dtype argument that specifies the data type of the elements. This simple snippet of code will be important later on, when we show how to instantiate PL buffers with PYNQ.
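
As a small aside, here is a minimal sketch (plain NumPy, nothing PYNQ-specific) of what that constructor gives you. One caveat worth keeping in mind is that np.ndarray allocates memory without initializing it, so the contents are arbitrary until written. The variable names are illustrative only.

```python
import numpy as np

# Low-level constructor: memory is allocated but NOT initialized,
# so the values are arbitrary until we write to the array.
foo = np.ndarray(shape=(10,), dtype=int)

print(foo.shape)                   # (10,)
print(foo.dtype == np.dtype(int))  # True

# Fill the array in place before reading from it.
foo[:] = 0

# Helpers such as np.zeros allocate and initialize in one step.
bar = np.zeros(shape=(10,), dtype=int)
print(np.array_equal(foo, bar))    # True
```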

Image Resizing in Programmable Logic

This is a block diagram representing the overlay we are going to use. The image is borrowed directly from the original notebook.

Load the Original Image to be Resized

The first thing we need to do is import the required libraries and download the resizer overlay onto the FPGA. For this example, we are going to use PIL in conjunction with IPython.display to visualize images in Jupyter, the pynq package to interact with the FPGA and, of course, the focus of this article: numpy!
We are going to show how PL buffers instantiated with pynq can be integrated with popular Python libraries, like the ones we mentioned.

from PIL import Image
import numpy as np
from IPython.display import display
from pynq import Overlay, allocate

resize_design = Overlay("/usr/local/lib/python3.6/dist-packages/helloworld/bitstream/resizer.bit")

We then assign the DMA and resizer IPs from resize_design to the variables dma and resizer respectively.

dma = resize_design.axi_dma_0
resizer = resize_design.resize_accel_0

The next step is to load the target image that we want to resize. To do so, we are going to use PIL.Image.

image_path = "/home/xilinx/jupyter_notebooks/helloworld/images/paris.jpg"
original_image = Image.open(image_path)
original_image.load()

In order to proceed, we now need to create an array of pixels from the loaded image. We rely on numpy.array for this:

input_array = np.array(original_image)
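
For an 8-bit RGB image, the resulting array has shape (height, width, 3) and dtype uint8. The sketch below illustrates that layout with a small synthetic array instead of the actual photo; the dimensions are made up for the example.

```python
import numpy as np

# Hypothetical dimensions standing in for a real photo.
height, width = 4, 6

# Same layout np.array() produces for an RGB PIL image:
# rows first, then columns, then the 3 color channels.
pixels = np.zeros((height, width, 3), dtype=np.uint8)

# Paint the top-left pixel pure red: index order is [row, column, channel].
pixels[0, 0] = (255, 0, 0)

print(pixels.shape)   # (4, 6, 3)
print(pixels.dtype)   # uint8
print(pixels.nbytes)  # 72
```

Note that PIL reports the image size as (width, height), while the NumPy shape lists the height first. This is why the buffers allocated later use (height, width, 3).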

And just as done in the original notebook, we are going to retrieve the original image size

old_width, old_height = original_image.size
print("Image size: {}x{} pixels.".format(old_width, old_height))

and display it using PIL.Image.fromarray

input_image = Image.fromarray(input_array)
display(input_image)

Resizing

We are now ready to resize the image using the PL!

We first set a resize_factor (of 2 in the example) and compute the new_width and new_height accordingly

resize_factor = 2
new_width = int(old_width/resize_factor)
new_height = int(old_height/resize_factor)
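
To make the arithmetic concrete, here is a small worked example with a hypothetical 640x360 input (the actual image used in the notebook may have a different resolution). It also computes how many bytes the corresponding 8-bit RGB buffers would occupy.

```python
# Hypothetical input resolution; the article's image may differ.
old_width, old_height = 640, 360
resize_factor = 2

# For positive integers, floor division gives the same result as int(a / b).
new_width = old_width // resize_factor
new_height = old_height // resize_factor
assert new_width == int(old_width / resize_factor)

# Bytes needed for 8-bit RGB buffers (3 bytes per pixel).
in_bytes = old_height * old_width * 3
out_bytes = new_height * new_width * 3

print(new_width, new_height)  # 320 180
print(in_bytes, out_bytes)    # 691200 172800
```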

Then, we allocate the input and output PL buffers in_buffer and out_buffer using pynq.allocate. Notice how the interface mirrors the shape and dtype arguments of numpy.ndarray. Indeed, the buffers are effectively NumPy arrays, but they are physically contiguous in memory and also provide the physical address we need in order to use them with our overlay’s IPs.

in_buffer = allocate(shape=(old_height, old_width, 3), dtype=np.uint8, cacheable=True)
out_buffer = allocate(shape=(new_height, new_width, 3), dtype=np.uint8, cacheable=True)

That’s where the magic happens! Again, PYNQ buffers are effectively NumPy arrays (with some special properties), so they can be used seamlessly with NumPy functions and with any other function that supports the numpy.ndarray interface (including other libraries like PIL, as we will see in a moment).
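
If you want to try this away from a board, the following is a minimal sketch. The make_buffer helper is hypothetical (it is not part of PYNQ): it uses pynq.allocate when a PYNQ runtime is available and otherwise falls back to a plain NumPy array, which of course has no physical address and cannot be handed to a DMA.

```python
import numpy as np

def make_buffer(shape, dtype):
    """Hypothetical helper: use pynq.allocate on a board, NumPy elsewhere."""
    try:
        from pynq import allocate
        return allocate(shape=shape, dtype=dtype)
    except Exception:
        # No PYNQ runtime available: plain array, no physical address.
        return np.zeros(shape, dtype=dtype)

buf = make_buffer((4, 6, 3), np.uint8)
buf[:] = 7  # fill every element in place

# Ordinary NumPy operations work on the buffer in both cases.
print(isinstance(buf, np.ndarray))  # True
print(int(buf.sum()))               # 504
print(buf[::2, ::2].shape)          # (2, 3, 3)
```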

Quick digression: Cacheable vs Non-cacheable Buffers

You may have noticed that when allocating the two PL buffers, we passed an additional keyword argument, cacheable, set to True. Let’s briefly discuss cacheable vs non-cacheable buffers. Cacheable buffers make use of the CPU cache, while non-cacheable buffers do not. This means that cacheable buffers must be flushed (when data goes from the Processing System (PS) to the PL) and invalidated (when data goes from the PL to the PS) to ensure that changes are visible on the other side. By default, pynq.allocate creates non-cacheable buffers, but this can be overridden by setting cacheable to True.

So when is it better to use cacheable buffers, and when non-cacheable ones?

Non-cacheable buffers are useful when you need to exchange data between IPs in your overlay. If the PS only rarely needs to access these buffers, or not at all, this is the more sensible choice. Of course, accessing these buffers from the PS will be slower, as no cache will be used.

On the other hand, if the buffer holds data that is constantly shared between PS and PL, a cacheable buffer is more suitable. However, you will have to pay the price of flushing and invalidating these buffers, which imposes a performance penalty (and one that might be particularly high on 32-bit ARM architectures).
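
The ordering described above (write then flush, invalidate then read) can be sketched as follows. PYNQ buffers expose flush() and invalidate() methods, but the MockBuffer class here is a hypothetical stand-in that only records the calls, so the snippet runs without any hardware. The function names are also made up for illustration.

```python
import numpy as np

class MockBuffer(np.ndarray):
    """Hypothetical stand-in for a cacheable PYNQ buffer.

    It only records the cache-maintenance calls; no hardware is involved.
    """
    def __new__(cls, shape, dtype):
        obj = np.zeros(shape, dtype=dtype).view(cls)
        obj.calls = []
        return obj

    def __array_finalize__(self, obj):
        # Views share the call log of the array they come from.
        self.calls = getattr(obj, "calls", [])

    def flush(self):
        self.calls.append("flush")

    def invalidate(self):
        self.calls.append("invalidate")

def ps_writes_for_pl(buf, data):
    buf[:] = data   # PS writes land in the CPU cache first
    buf.flush()     # push them out to memory so the PL can see them

def ps_reads_from_pl(buf):
    buf.invalidate()      # drop stale cache lines before reading
    return np.array(buf)  # plain copy of the fresh contents

buf = MockBuffer((4,), np.uint8)
ps_writes_for_pl(buf, [1, 2, 3, 4])
result = ps_reads_from_pl(buf)
print(buf.calls)        # ['flush', 'invalidate']
print(result.tolist())  # [1, 2, 3, 4]
```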

Back to it…

Let’s now get back to the example and copy the content of input_array into in_buffer using slice assignment.

in_buffer[:] = input_array
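
The [:] matters here. The sketch below uses plain NumPy arrays as stand-ins for the PYNQ buffer (the names are illustrative) to show that slice assignment copies values into the already allocated memory, whereas a plain assignment would merely rebind the Python name and leave the buffer untouched.

```python
import numpy as np

# Stand-in for a buffer obtained from pynq.allocate (plain NumPy here).
buffer = np.zeros((2, 3), dtype=np.uint8)
original = buffer  # second reference to the same memory
data = np.arange(6, dtype=np.uint8).reshape(2, 3)

# Slice assignment copies the values INTO the existing memory.
buffer[:] = data
print(buffer is original)              # True
print(np.shares_memory(buffer, data))  # False
print(np.array_equal(original, data))  # True

# Plain assignment only rebinds the name to a brand-new array;
# the originally allocated memory is left as it was.
buffer = data + 1
print(buffer is original)              # False
print(original.tolist())               # [[0, 1, 2], [3, 4, 5]]
```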

and to double-check, let’s visualize the image using PIL.Image.fromarray and print the associated resolution.

buf_image = Image.fromarray(in_buffer)
display(buf_image)
print("Image size: {}x{} pixels.".format(old_width, old_height))

As we mentioned previously, notice how the PYNQ buffer in_buffer works seamlessly with PIL since it behaves as a fully-fledged NumPy array! Also, it’s important to remark that calling PIL.Image.fromarray on in_buffer is a zero-copy operation: no data has been copied in this process.

And to visualize what is going to happen, let’s also display out_buffer. Since the resizer has not been run yet, the buffer does not hold meaningful data (in fact, all elements should be 0). The resulting image will show this.

buf_image = Image.fromarray(out_buffer)
display(buf_image)

We can now run the resizer IP with the code shown below. For details on what is going on, please refer to the original notebook.

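# Configure the resizer with the input and output image dimensions.
resizer.write(0x10, old_height)
resizer.write(0x18, old_width)
resizer.write(0x20, new_height)
resizer.write(0x28, new_width)

def run_kernel():
    # Queue both DMA transfers, then start the resizer.
    dma.sendchannel.transfer(in_buffer)
    dma.recvchannel.transfer(out_buffer)
    resizer.write(0x00, 0x81)  # start
    # Block until both transfers have completed.
    dma.sendchannel.wait()
    dma.recvchannel.wait()

run_kernel()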

The PYNQ buffers are designed to work as efficiently as possible with the FPGA: there’s no pre-processing going on under the hood. As a matter of fact, this is also a zero-copy operation! Data is already visible to the PL and consumable as it is!

And to finish up, we can display the resulting image from out_buffer, and the associated resolution, as we did for in_buffer. Again, we use the same code as before, and again, we leverage the fact that PYNQ buffers are NumPy arrays, so getting the result from out_buffer is zero-copy.

result = Image.fromarray(out_buffer)
display(result)
print("Resized in Hardware(PL): {}x{} pixels.".format(new_width, new_height))

And this concludes the article. The code shown is available as a Gist. Also notice that, aside from a few small differences, like the use of pynq.allocate introduced with PYNQ v2.5, the code is almost identical to the original resizer_PL notebook.

Thanks for taking the time to read till the end!
