PYNQ: PYTHON PRODUCTIVITY FOR ZYNQ

A (No-Cross) Compile of PyTorch and OpenCV using AWS A1 Instances

If you have ever used an embedded ARM platform (e.g. a Raspberry Pi or Xilinx’s Zynq) and tried to compile big software natively, you have probably run out of memory (or patience) before the build completed.

At this point, you get creative: you try a cmake or cross-compile flow, you fire up QEMU to see if you can get enough of an image running to build the software, or you pull up Yocto and craft a recipe to do what you need. You spend a good 15 minutes searching the Internet hoping you’re not alone. You hope those Raspberry Pi tutorials work for your platform. You find a solution and move on.

In this blog, I’ll talk about how we started solving the big-software compile problem within the PYNQ team for our Zynq UltraScale+ (ZU+) boards. Specifically, how we use AWS A1 instances to compile (not cross-compile!) native ARM 64-bit binaries and pull them back onto our ZU+ boards for final use. I’ll show this with two popular open-source projects: PyTorch and OpenCV.

Backgrounder: Cross Compilation…

… is not a bad thing. Cross compilation is the standard way to compile software for most embedded devices. If a device does not ship with a native compiler (e.g. gcc), then software binaries must be built in another environment and delivered to the device.

If you’ve ever run Xilinx tools, you will have seen commands like aarch64-linux-gnu-gcc used to build ZU+ binaries. That executable is a cross-compiler build of gcc that runs on x86 machines and produces binaries for 64-bit ARM processors.

bash$  echo 'int main(){}' > main.c
bash$
bash$ gcc main.c -o main_x86
bash$ file main_x86
main_x86: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked 
          ...
bash$ 
bash$ aarch64-linux-gnu-gcc main.c -o main_aarch64
bash$ file main_aarch64
main_aarch64: ELF 64-bit LSB executable, ARM aarch64, version 1 (SYSV), dynamically linked 
              ...

Xilinx ships both Linux and bare-metal versions of the gcc and g++ cross-compilers. Above, you can see a native compile and a cross compile of a simple C file. The Linux file utility determines the type of each output executable and recognizes both the x86-64 and aarch64 binaries. Clearly, only the x86 version can run on this x86 build machine.

If you’d like to see what a manual build of a cross-compiler looks like, check out this article. Once you’ve looked at that process, check out a toolflow we like, crosstool-ng, which automates the compiler tool dependencies and is an actively maintained project.

How can AWS A1 built binaries run on ZU+ devices?

A1 instances use AWS Graviton processors, which are 64-bit ARM processors compatible with the Cortex-A53 cores on Zynq UltraScale+ devices, as both implement the Armv8-A architecture. Additionally, when you go to AWS to get an EC2 A1 instance, you can select Ubuntu 18.04 as the operating system. PYNQ is based on Ubuntu 18.04, so we have confidence software builds can be moved between the two platforms at both the instruction-set and operating-system levels.
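As a quick sanity check before moving binaries around, you can confirm that both machines report the same instruction set. A minimal sketch you might run on each machine (the guard at the end is the kind of check you could put at the top of a deploy script):

```shell
# Both the A1 instance and the ZU+ board should report "aarch64" here;
# an x86 build machine will report "x86_64" instead.
arch="$(uname -m)"
echo "This machine reports: ${arch}"
if [ "${arch}" != "aarch64" ]; then
    echo "warning: not a 64-bit ARM machine - binaries may not transfer" >&2
fi
```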

This isn’t completely automatic porting, as you’ll see. Some packages need to be brought onto either the A1 or the ZU+ device to complete an installation.

Why not just build on ZU+?

The AWS A1s are ‘bigger’ than Zynq UltraScale+ devices in terms of CPU count and available memory. A typical Zynq UltraScale+ device has 2-4GB of memory and 4 CPUs running at 1.2 GHz. When we build our binaries on the A1s, we target the 16-core, 32GB-RAM a1.4xlarge instances.
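To actually use those extra cores, pass a parallel job count to the build. A small sketch, with the actual build commands shown as comments since they depend on the project (OpenCV is cmake/make-based; PyTorch’s setup.py reads the MAX_JOBS environment variable):

```shell
# nproc reports the core count: 16 on an a1.4xlarge, 4 on a typical ZU+ board
JOBS="$(nproc)"
echo "building with ${JOBS} parallel jobs"
# e.g. in an OpenCV build directory:
#   make -j"${JOBS}"
# PyTorch's setup.py respects MAX_JOBS instead:
#   MAX_JOBS="${JOBS}" python3 setup.py bdist_wheel
```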

The hope, of course, is to get a significant speedup in build time by using the A1s. Also remember that since the A1s are in the cloud, we can spin up any number of instances depending on how many binaries need to be built.

Finally, if you are wondering about how much an a1.4xlarge costs, looking at AWS on-demand pricing today, these A1 instances cost about 40 cents an hour. That’s correct, 40 cents an hour… this blog cost me about $2.
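If you want to script that spin-up, the AWS CLI can launch an a1.4xlarge on demand. This is a hypothetical sketch: the AMI ID and key name below are placeholders you must replace with your own Ubuntu 18.04 arm64 values.

```shell
# Hypothetical on-demand launch of a build instance (placeholder AMI/key)
aws ec2 run-instances \
    --instance-type a1.4xlarge \
    --image-id ami-0123456789abcdef0 \
    --key-name my-build-key \
    --count 1
# and terminate it once the build artifacts are copied off:
#   aws ec2 terminate-instances --instance-ids <instance-id>
```

Terminating promptly is what keeps the bill in the tens-of-cents range.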

Two Example Compilations: PyTorch and OpenCV

PyTorch is the latest and greatest machine learning framework, popular with researchers and quickly growing to compete with TensorFlow in terms of research papers and deployments. OpenCV is the most popular vision software library and is used heavily in video, vision, and machine-learning pipelines.

Both packages most often require a source build to run on ARM devices, which, as you will see, can take a sizable chunk of time on embedded hardware.

The build scripts are here and can be re-run as-is on both A1s and PYNQ-enabled Zynq UltraScale+ platforms. The Ultra96v2 SD card image is here. Finally, if you have a PYNQ-enabled ZU+ board and access to an AWS A1, you should be able to replicate all of this.

I actually ran all the A1 scripts starting from an Ultra96v2 board, using ssh/scp to move files and commands between the two platforms. This can be seen in the Jupyter terminal screenshot below, where PYNQ is ssh’ing onto a machine.

Results

The build configurations can be found at the gist link above.

As you can see from the table below, there is a 15x speedup in compilation for OpenCV, while PyTorch never finishes on the Ultra96v2. We have seen this magnitude of speedup across other software builds; intuitively, more cores, more memory, and more networking bandwidth all contribute to the A1 performance.

          OpenCV    PyTorch
A1        4m24s     36m23s
Ultra96   66m24s    never finishes
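The 15x figure comes straight from those OpenCV times; a quick back-of-the-envelope check:

```shell
# OpenCV build times from the table, converted to seconds
a1=$((4 * 60 + 24))        # 4m24s on the a1.4xlarge
ultra96=$((66 * 60 + 24))  # 66m24s on the Ultra96v2
echo "speedup: $((ultra96 / a1))x"   # integer division: prints "speedup: 15x"
```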

As for the package-deployment cost of using A1s: PyTorch builds a Python wheel that can easily be moved to and deployed on the Ultra96 board. For OpenCV, there is a bit of tricking the final make install step for the new target (i.e. making the folder paths identical), but this is a small tradeoff for the compile-time savings.
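The OpenCV path trick can be sketched with a scratch prefix: install into a tree whose layout matches the prefix the board expects, tar it with relative paths, and unpack it at / on the ZU+ board. The directory and file names below are hypothetical stand-ins for the real make install output:

```shell
# stand-in for "make install" into a staging tree on the A1
mkdir -p staging/usr/local/lib
echo "stub" > staging/usr/local/lib/libopencv_core.so

# archive with paths relative to /, so it unpacks into the same locations
tar -czf opencv-install.tar.gz -C staging usr/local

# on the board (simulated here with a scratch root), after scp'ing the tarball:
mkdir -p board-root
tar -xzf opencv-install.tar.gz -C board-root
ls board-root/usr/local/lib   # the library lands at the expected path
```

Because the paths inside the archive match the configured install prefix, the unpacked files behave as if make install had run on the board itself.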

I also included a Jupyter notebook at the gist link that runs some quick tests on the PyTorch install.

In summary - what else can I compile?

We have compiled several big-software packages on the A1s – both Python and Debian packages are doable, and we try out A1s whenever we come across software that takes over an hour (if it finishes at all) to compile on ZU+ devices. Also, a source-only distribution for ARM 64-bit devices is a good sign that an A1 native compile may be helpful.

I hope this blog was informative and gives you another compilation option for your embedded platforms – hopefully on PYNQ, hopefully on Xilinx devices, but in any case, the A1s are a great option when your embedded platform needs a cloud-based boost.
