Help debugging chronic PYNQ System Hang

Hi Folks,
Since updating to the 2.5 image we’ve been seeing quasi-regular ZCU111 board hangs, possibly even without an overlay loaded. We’ve also been starting to test new cores much more heavily, so we did change the cpuidle setting so we could use the System ILA in Vivado.

I had been chalking up the need to power cycle the board ~2-3 times a week to us causing deadlocks or not properly connecting to/disconnecting from the ILA, but this weekend it hung after a fresh boot, without us having loaded any of our test software or started anything other than the hw_server on the host PC.

Two questions:

  1. How can I even start to debug this? When it happens after testing for a while and walking away, I can tell from the JTAG connection in Vivado that some things are still alive, but the Jupyter server isn’t responsive. (This is NOT directly a cpuidle issue, as I set that and can use Python and the ILA concurrently.)

  2. Is there a way to trigger a reboot of the board from Vivado or some Xilinx tool on the Windows host PC? This would move this from a major headache to a minor inconvenience. The “boot from configuration memory” option in the hardware manager doesn’t seem to “just work”, I’d guess because it isn’t configured properly to be aware of the U-Boot loader or some such.

Thanks!

Generally, full-system hangs are the result of AXI transactions hitting the fabric and not being acknowledged for one reason or another, although why that’s happening at boot I don’t know. One thing to try is setting up the AXI watchdog timers on the AXI master connections, which will trigger a slave error rather than a hang, and seeing if that helps.

The following code will do this for the LPD master.

import pynq

mmio = pynq.MMIO(0xFF416000, 64)

mmio.write(0x18, 0x3)     # Return slave errors when timeouts occur
mmio.write(0x20, 0x1020)  # Set and enable prescale of 32 which should be about 10 ms
mmio.write(0x10, 0x3)     # Enable transaction tracking
mmio.write(0x14, 0x3)     # Enable timeouts

And again for the FPD:

mmio = pynq.MMIO(0xFD610000, 64)

mmio.write(0x18, 0x7)     # Return slave errors when timeouts occur
mmio.write(0x20, 0x1020)  # Set and enable prescale of 32 which should be about 10 ms
mmio.write(0x10, 0x7)     # Enable transaction tracking
mmio.write(0x14, 0x7)     # Enable timeouts
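
The two sequences above differ only in the base address and the port mask, so they can be folded into one helper. This is just a sketch of that idea: `enable_axi_timeouts` and `RecordingMMIO` are names made up for illustration, the offsets and values are copied from the snippets above, and the stand-in class exists only so the write order can be checked without a board attached.

```python
def enable_axi_timeouts(mmio, port_mask):
    """Apply the timeout setup above to any object with write(offset, value).

    Offsets and values are taken from the snippets in this post;
    port_mask is 0x3 for the LPD block and 0x7 for the FPD block.
    """
    mmio.write(0x18, port_mask)  # Return slave errors when timeouts occur
    mmio.write(0x20, 0x1020)     # Set and enable the prescaler
    mmio.write(0x10, port_mask)  # Enable transaction tracking
    mmio.write(0x14, port_mask)  # Enable timeouts


class RecordingMMIO:
    """Hypothetical stand-in for pynq.MMIO that only records writes."""

    def __init__(self):
        self.writes = []

    def write(self, offset, value):
        self.writes.append((offset, value))


# On a ZCU111 (not run here, it needs the pynq package and root access):
#   import pynq
#   enable_axi_timeouts(pynq.MMIO(0xFF416000, 64), 0x3)  # LPD
#   enable_axi_timeouts(pynq.MMIO(0xFD610000, 64), 0x7)  # FPD
```

Passing the MMIO object in rather than creating it inside the function keeps the ZCU111-specific addresses in one visible place, which matters because they are not valid on other boards.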

You can also enable a system-wide watchdog that’ll reset the board if Linux locks up for some reason. I don’t know of a way of triggering a reboot via JTAG in a nice way. You might be able to write to the watchdog registers via JTAG and use that to reset the board.
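
For the system-wide watchdog, the standard Linux interface is a character device: opening it arms the timer, any write resets the countdown, and writing the character `V` just before closing asks the driver to disarm it again (unless the driver was built with "nowayout"). Here is a minimal sketch of a feeder loop. It assumes the kernel on your image has a watchdog driver enabled and exposes it as `/dev/watchdog`, which I haven't checked for the 2.5 image; `feed_watchdog` and its parameters are made-up names.

```python
import time


def feed_watchdog(device_path="/dev/watchdog", interval_s=1.0, beats=5):
    """Open the watchdog device, feed it `beats` times, then disarm it.

    device_path is an assumption: it only exists if the kernel has a
    watchdog driver enabled. If this process stops feeding for longer
    than the driver's timeout, the hardware resets the board.
    """
    # Unbuffered, so every write reaches the driver immediately.
    with open(device_path, "wb", buffering=0) as wd:
        for _ in range(beats):
            wd.write(b"\0")  # Any write resets the countdown
            time.sleep(interval_s)
        wd.write(b"V")  # "Magic close": ask the driver to disarm on close


# On the board (needs root), something like:
#   feed_watchdog(interval_s=1.0, beats=3600)
```

In practice you would run the loop indefinitely from a small service, so that a hang in user space or the kernel stops the feeding and the board resets itself.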

Peter

Hey there!
I am facing an issue where my board locks up when writing to some specific memory-mapped registers, and I thought a watchdog timer to reset everything would help with debugging.
I tried to run the code you posted above, and while I am able to instantiate the pynq.MMIO class, writing to it causes the kernel to die. I am using PYNQ 2.5 on a PYNQ-Z2.
Am I doing something wrong? Do I need to enable the watchdog and associated timers in the Vivado block design? I am quite new to the platform, so I might be missing something obvious.

Thanks!

The code posted above is for the ZCU111, which has a Zynq UltraScale+ with a 64-bit Arm Cortex-A53 processor.
The PYNQ-Z2 has a Zynq-7000 with a 32-bit Arm Cortex-A9, so this code won’t work there. Rather than replying to old threads, please open a new post with details of your problem.

Cathal