PYNQ-Torch: a framework to develop PyTorch accelerators on the PYNQ platform

Hi All,

We are pleased to share our work in bringing PyTorch to the world of PYNQ. You can find our GitHub page here.

We have developed a framework which can be used to accelerate any PyTorch-developed Neural Network on the PYNQ platform. In addition, using our guides, developers can

  1. Port PyTorch (v0.4.1 & v1.2.0) on the PYNQ-Z1 board
  2. Design efficient hardware with good practices
  3. Avoid complexities in the overall development process

Happy Developing!


@manoharvhr Do you have an example notebook that we can reproduce?

Hi Naveen,

Check out this link from our Github page.

Hi Naveen,

Is it possible to feature my work on the community page? We also have an IEEE-published paper available here.


Hi sir,

Thank you for the hard work in developing this useful project.

I am having some trouble getting the “Overlay” command to work with the Regression example you provided in PyTorch_Installation/Example: the kernel seems to freeze when I call this command to load the backward_lite_features.bit bitstream.

I recently successfully built the PYNQv2.4 image for the ZCU102 using this reference.

I followed the helpful PyTorch installation instructions that you provided here.

Due to my workplace network restrictions, I was unable to get an internet connection directly to my board, so I was actually unable to complete most of the beginning of your tutorial, including the “apt-get update/install”, “dd/SWAP file”, and “git clone” related sections.

In fact, I ended up skipping to the step where we clone the PyTorch repository. I cloned the PyTorch repository and checked out the appropriate branch on my host PC, then used Cyberduck SFTP to copy the repo from my host PC to my ZCU102 board. After copying the PyTorch repo to the board, I ran the “python3 setup.py build/develop” commands and verified that it seemed to work with your simple test example, shown below:

python3
import torch
x = torch.randn(5,5)
y = torch.randn(5,5)
print(x+y)

Next, I moved to the PyTorch_Installation/Example folder, and tried to work through the RegressionApplication.ipynb example.

Everything went smoothly until I tried to call the Overlay command in cell #5, at which point the board immediately froze: the TCP/IP connection was lost and the serial terminal became unresponsive.
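
For context, cell #5 essentially boils down to a call like the one below (a minimal sketch; the exact file name/path passed in the notebook may differ from what I show here):

from pynq import Overlay

# Loading the provided bitstream - this is the call after which the board freezes
overlay = Overlay("backward_lite_features.bit")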

Is this because I am trying to use this command to load a bitstream that is incompatible with my board, or made for a different board? Or maybe I did not install your PYNQ-Torch package correctly? Or maybe I did not build the PYNQv2.4 image correctly for the ZCU102?

Best,
-Gheorghe Schreiber

Hi Gheorghe,

I suspect the issue is that the bitstream I provided targets a different FPGA. I recommend you try recreating it for your board with the files I have provided. If you face difficulty doing so, look at our published paper or reply here.

Good luck!

Kind Regards,
Manohar

Hi sir,

Thank you for the advice, it seems promising.

I took your project Regression.xpr, swapped the Zynq processing system block for the Zynq UltraScale+ MPSoC block, and changed the project’s board part to that of my board, resulting in the block diagram shown in the attached screenshot.

Unfortunately this project failed to build; it seems I must regenerate the backwards_lite_0 and equation_matrix_0 IP cores using my board part instead of that of the PYNQ-Z1. I will post a screenshot of the error messages in a follow-up reply.

Do you know how I can update your IP cores to be compatible with my board, the ZCU102?

Thank you for the help again.

Best,

-Gheorghe S


Attached are the error messages shown when I try to build the project; Vivado complains that the IP cores were compiled for a different board part.
Thank you for the help.
Best,
-Gheorghe

Hi Gheorghe,

You are right that the IP cores are also not built for your board. To solve this, use Vivado HLS: set up a project with your board configuration and copy in the .cpp files for those IPs, which are on the GitHub page.


The exported IP only lists support for Zynq; you can see this in the xilinx:supportedFamilies section of the IP’s component.xml.

Assuming the IP is compatible with other devices, as a quick fix, you could manually edit the component.xml to add a line to support ZU+ or other devices.

This is the line you might add to allow the IP to be added to a Zynq UltraScale+ design:
<xilinx:family xilinx:lifeCycle="Production">zynquplus</xilinx:family>

Cathal


Hi Cathal,

That’s actually a much more convenient solution. I will give that a go, as it should be compatible.


Hello all,

Thank you for providing such helpful and direct answers to my question on regenerating IP cores.

I was able to use HLS to regenerate these IP cores for my board (backwards_lite and equation_matrix on the ZCU102).
I connected them up in a Vivado project for my board, starting with the BSP design and adding components to match your Vivado project for the Z1. I have included this Vivado project along with the output .bit, .tcl and .hwh files with this post. backward_lite_features.zip (1.1 MB)

I auto-assigned addresses to all IP cores in this project (my board seems to have different address restrictions than the Z1, so I was unable to directly copy the Z1 addresses from your project to their equivalents in my project). I built the bitstream without errors, and further loaded it onto my board without the board freezing up (improvement from last time!).

I was able to run up to cell #6 of the RegressionApplication.ipynb Jupyter notebook that comes with this example project.
While running cell #6, I encountered the DMA-related error shown in the screenshot below. The command dma2.recvchannel.wait() hangs unless the user force-quits with a keyboard interrupt.
As also shown, the dma2.recvchannel object is successfully initialized with my bitstream but is not completely working, which surprises me because dma2.sendchannel.wait() works fine.
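
For reference, my minimal test cell drives the DMA roughly as follows (the DMA instance name and buffer shapes below are placeholders, not necessarily the exact ones from the notebook):

import numpy as np
from pynq import Overlay, Xlnk

overlay = Overlay("backward_lite_features.bit")  # my rebuilt bitstream
dma2 = overlay.axi_dma_1                         # instance name assumed from my block design

xlnk = Xlnk()
in_buf = xlnk.cma_array(shape=(5,), dtype=np.float32)   # placeholder sizes
out_buf = xlnk.cma_array(shape=(5,), dtype=np.float32)
in_buf[:] = np.random.randn(5).astype(np.float32)

dma2.sendchannel.transfer(in_buf)
dma2.recvchannel.transfer(out_buf)
dma2.sendchannel.wait()   # returns fine
dma2.recvchannel.wait()   # hangs here until I interrupt the kernel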


Do you have any advice for me to help debug my DMA problem?
Thank you so much for your previous help; I look forward to hearing back from you soon.
Best,
-Gheorghe

Hi Gheorghe,

Glad you were able to recreate the IP. I have faced DMA issues in my project as well. This usually happens due to the settings of the DMA IP in Vivado. Have you changed these? For instance, you have to keep the Scatter Gather Engine option unchecked.

Another thing to watch out for is whether the ports are connected correctly. If you are uncertain, re-download my project and have a look at how things are hooked up.

Finally, you can also refer to Section 3 of my paper which highlights the entire development process.

If you wish to try a simpler application, I have attached my entire thesis document. I recommend reading Section 4, which describes the framework developed. Honestly, the Regression example is overkill as a first application.

Hardware Accelerator for Recurrent Neural Network-Based Sound Synthesis.pdf (3.1 MB)

Hello Manohar,

Thank you for attaching your paper. I am very impressed with its rigor, and am surprised that it was only for your Bachelor’s degree; I think this work is worthy of a Master’s!

I checked my Vivado project against yours. It seems we have the same connections between blocks, but I suspect that my addressing, which differs from yours, may be causing my DMA problem. I attached a screenshot of my addressing below (your Vivado project is shown on the left, mine on the right).

Do you recommend starting with your SampleRNN example instead of your Regression example? I have already successfully worked through the PYNQ overlay tutorial.

Maybe there is another simple example/tutorial for using DMA, similar to the PYNQ Overlay tutorial linked above?

In the end, I am trying to use your PYNQ-Torch framework to accelerate the PyTorch neural networks in this project: GitHub - PhysicsNAS/PhysicsNAS

Do you think this is possible? My plan for achieving this is to get the PYNQ-Torch neural network accelerators from your examples working, and then re-use them for my own PyTorch project.

Best,

-Gheorghe S

Hi Gheorghe,

I suspect that you may be short on memory. Can you try giving it 64K for data? I recall this being a very important factor.

As for a simpler example, the last subsection of Section 4 in my thesis covers a very simple RNN implementation. I highly recommend giving that a go after you try the above.

Kind Regards,
Manohar

Dear Manohar,

Thank you for the helpful advice. Even after changing my memory settings, I was not able to get the Regression notebook working on my ZCU102 board. I saw others online with similar problems; one person seemed to have solved theirs by increasing the “Width of Buffer Length Register” parameter of the DMA block. I will explore these options later.

In the meantime I decided to buy the PYNQ-Z1 board that you have used for the Regression project, to hopefully get your regression model working out of the box.

I would like to re-use/modify the neural network accelerators that you have built for your Regression project and apply them to accelerate my own PyTorch project.

Do you think you can outline how one would go about using the accelerators that you have built to accelerate another PyTorch project, containing different neural networks?

Using your overlay, where would I “drop in” your accelerators in my PyTorch project? For example, we can look at “Embedded_physics_collision.ipynb”, where we train a neural network.

If we comment out the “x = x.cuda()…” lines in the third cell of the above example notebook (the cell that defines the training and test functions), we can run the code on a PYNQ board, although it is very slow. How can I use your accelerators to speed up these training and test functions? A sketch of the kind of change I mean is below.
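
This is a minimal sketch of the CPU fallback I am describing (not yet using your accelerators; the helper name is just illustrative, the notebook’s own names may differ):

import torch

use_cuda = torch.cuda.is_available()  # False on the PYNQ board

def to_device(t):
    # Route tensors to the GPU only when one is available, so the notebook
    # code runs unchanged on the board (slowly, on the ARM cores).
    return t.cuda() if use_cuda else t

# In the training/test functions, instead of x = x.cuda():
# x = to_device(x)
# y = to_device(y)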

Do I have to modify parts of the C-code defining your accelerators for my different neural network? What parts can I re-use?

Thank you for the help and support.

Best,
-Gheorghe S

Hi Gheorghe,

One way to decide whether you should use the accelerators I designed is to profile your own application. This will help you identify where the bottleneck is, which can then be moved into hardware.
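
For example, something along these lines with PyTorch’s built-in autograd profiler (the model below is just a stand-in; swap in your own network, a representative input and your real training step):

import torch
from torch.autograd import profiler

# Stand-in model and input; replace with your own
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
x = torch.randn(64, 16)

with profiler.profile() as prof:
    out = model(x)
    loss = out.sum()
    loss.backward()

# The slowest operators are the candidates to move into hardware
print(prof.key_averages().table(sort_by="cpu_time_total"))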

As for the modifications required to the accelerators I have designed, you will indeed have to tailor the code to suit your input/output shapes. Remember, the benefit of an FPGA comes not from abstract, general-purpose hardware but from very specific hardware designed to maximise performance for your requirements.

Kind Regards,
Manohar

Dear Manohar,

Thank you for your close guidance in my efforts at learning from your paper and example projects.

After studying your Regression example, it appears to me that you accelerate only two computations: data generation (get_batch() calls make_features(), which sends and receives Xlnk arrays via dma2), and backpropagation (during the model training function, the loss, y output data and x input data are sent to dma1, and gradient data is received back from dma1).

To start, I would like to accelerate one of my Pytorch NNs using just the latter, backprop acceleration.

I noticed that your accelerated NN consists of only a single linear layer with 5 inputs and 1 output. My NN has several linear layers with ReLUs in between: the first has 7 inputs and 128 outputs, followed by a middle layer with 128 inputs and 128 outputs, and a final layer with 128 inputs and 2 outputs. A rough PyTorch sketch of it is below.
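
In PyTorch terms, the network I want to accelerate is roughly the following (a sketch only; the actual definition lives in the PhysicsNAS repository and may differ in detail):

import torch.nn as nn

# Rough shape of the network: 7 -> 128 -> 128 -> 2, with ReLUs in between
net = nn.Sequential(
    nn.Linear(7, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 2),
)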

To accelerate the backpropagation for my multilayer NN, I would guess that one must thoroughly rewrite the C driver for backwards_lite_ip, not only changing the input and output shapes, but also adding similar code for my large 128x128 hidden layer and the ReLU layers?

Maybe a pseudocode example of a C-driver for this multilayer backpropagation computation could help?
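
To make clear what I mean, this is roughly the computation such a driver would need to perform, written out in NumPy (a sketch of the math only, not HLS code; I am assuming a simple squared-error loss here):

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def backward_multilayer(x, target, W1, b1, W2, b2, W3, b3):
    # Forward pass (7 -> 128 -> 128 -> 2), keeping pre-activations for backprop
    z1 = W1 @ x + b1
    h1 = relu(z1)
    z2 = W2 @ h1 + b2
    h2 = relu(z2)
    y = W3 @ h2 + b3

    # Backward pass, assuming a loss L = 0.5 * ||y - target||^2
    dy = y - target                # dL/dy
    dW3 = np.outer(dy, h2)
    db3 = dy
    dh2 = W3.T @ dy
    dz2 = dh2 * (z2 > 0)           # ReLU derivative
    dW2 = np.outer(dz2, h1)
    db2 = dz2
    dh1 = W2.T @ dz2
    dz1 = dh1 * (z1 > 0)
    dW1 = np.outer(dz1, x)
    db1 = dz1

    return dW1, db1, dW2, db2, dW3, db3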

The PyTorch NN I wish to accelerate is linked below.

Thank you again for your support, and for your hard work in building this PyTorch-on-FPGA platform!
Best,
-Gheorghe Schreiber
PS Please feel free to reach out to me at:
3232290084
gheorgheschreiber@gmail.com

Dear Manohar,
I have made some progress in an effort to generalize your backward_lite driver to single-layer networks of arbitrary shape. My plan is to eventually generalize this backprop driver to multilayer networks too. main.cpp (5.0 KB)

I had some questions about your backward_lite driver, written for the Regression example provided in the PYNQ-Torch Github.

  1. What is the purpose of lines 55-59, particularly line 58? They seem to normalize the value of “dif” to +/-1 when abs_dif evaluates to less than 1. Is this to avoid vanishing gradients?
  2. Why do you use #pragma HLS PIPELINE to optimize the first four for-loops, but not the 5th and 6th for-loop?
  3. Why do you use pipelining instead of unrolling to optimize these for-loops?

I have attached my edited version of your backward_lite driver, adapted to accelerate a single 2x7 linear layer instead of a single 1x5 linear layer. main.cpp (5.0 KB)

Thanks again for your help and hard work in making this PYNQ-Torch project.

Best,