Allocation problem on run time (pynq 2.7)

Hi all,
I have already posted about this twice and found no satisfactory solution, so I stayed on version 2.6, where the problem does not appear. Still, any insight into how to get rid of this memory allocation problem in version 2.7 would be appreciated, because it prevents me from upgrading. I am using the ZCU104 board.

The error looks like this:

Traceback (most recent call last):
  File "untitled", line 986, in <module>
    frame = allocate(shape=(h, w, 3), dtype=np.uint8, cacheable=True)
  File "/usr/local/share/pynq-venv/lib/python3.8/site-packages/pynq/buffer.py", line 172, in allocate
    return target.allocate(shape, dtype, **kwargs)
  File "/usr/local/share/pynq-venv/lib/python3.8/site-packages/pynq/pl_server/device.py", line 292, in allocate
    return self.default_memory.allocate(shape, dtype, **kwargs)
  File "/usr/local/share/pynq-venv/lib/python3.8/site-packages/pynq/pl_server/xrt_device.py", line 169, in allocate
    buf = _xrt_allocate(shape, dtype, self.device, self.idx, **kwargs)
  File "/usr/local/share/pynq-venv/lib/python3.8/site-packages/pynq/pl_server/xrt_device.py", line 124, in _xrt_allocate
    bo = device.allocate_bo(size, memidx, cacheable)
  File "/usr/local/share/pynq-venv/lib/python3.8/site-packages/pynq/pl_server/xrt_device.py", line 414, in allocate_bo
    raise RuntimeError("Allocate failed: " + str(bo))
RuntimeError: Allocate failed: 4294967295

How the problem occurs:
It happens when allocate is called on every iteration of a while loop that runs 60 times per second.
I think each call takes a new buffer instead of reusing the previous one, so memory fills up; this does not happen in version 2.6. The frame rate also drops to 4.6 FPS instead of 60.

import numpy as np
from pynq import allocate

h = 200
w = 200
while run:
    frame = allocate(shape=(h, w, 3), dtype=np.uint8, cacheable=True)

If I call allocate once outside the loop and fill the buffer with zeros before each operation on it, that adds extra latency, which I cannot afford in this project.
So the question is: how can I solve this problem? Any ideas would be greatly appreciated.

Mizan


Hi @mizan,

Why do you have to “flood it with zeros” if you allocate outside of the loop? You should be able to overwrite without clearing the buffer before.

Did you try to del frame after you are done with it in the loop?

Since you have both 2.7 and 2.6: what frame.physical_address do you get after allocating inside the loop on 2.6 and on 2.7? The same address each time, or a different one?

Mario


Why do you have to “flood it with zeros” if you allocate outside of the loop? You should be able to overwrite without clearing the buffer before.

Because each part of the menu has different attributes (different image data), I can’t keep the data from before.

Did you try to del frame after you are done with it in the loop?

No, I did not. Edit: I have tried it now, and the same error still appears after some time. The frame rate is also greatly reduced (4.6 FPS).

Since you have both 2.7 and 2.6: what frame.physical_address do you get after allocating inside the loop on 2.6 and on 2.7? The same address each time, or a different one?

I have not checked the address; I will do so and update you. I don’t have any issues running the program on version 2.6. I just don’t want to be left behind.

Thanks,
Mizan

If you are overwriting the full frame, this shouldn’t be a problem. Setting that aside, though: do you get the performance you need if you allocate inside the loop without “flooding the buffer with zeros”? It wouldn’t make sense to go this route if the basic operations do not give you the performance.
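To illustrate the point about overwriting the full frame, here is a minimal sketch. It assumes a plain NumPy array as a stand-in for the buffer returned by allocate (so it runs without a board), and render_menu is a hypothetical function that produces a complete image for each iteration.

```python
import numpy as np

h, w = 200, 200

# Stand-in for the buffer returned by pynq.allocate (an assumption made so
# this sketch runs without a board); created once, outside the loop.
frame = np.empty((h, w, 3), dtype=np.uint8)


def render_menu(index):
    """Hypothetical renderer: returns a complete h x w x 3 image."""
    return np.full((h, w, 3), index % 256, dtype=np.uint8)


for i in range(3):
    # Every pixel is replaced, so nothing from the previous iteration
    # survives and no separate zero-fill pass is needed.
    frame[:] = render_menu(i)
```

If only part of the frame is redrawn on each iteration, the untouched pixels keep their old values, and that is the case where some form of clearing is still required.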

In pynq 2.7, we moved to XRT for memory allocation, so this is a difference with 2.6.

Can you try something like this?

frame_aux = allocate(shape=(h, w, 3), dtype=np.uint8, cacheable=True)
pointer = frame_aux.physical_address
while (run):
        frame = allocate(shape=(h, w, 3), dtype=np.uint8, cacheable=True, pointer=pointer)

This should reuse the same contiguous memory for each iteration.

Mario

So, the physical address in version 2.7 increases continuously until the error occurs:

0x6b500000
0x6b600000
0x6bd00000
0x6be00000
0x6bf00000
0x6c000000
0x6c100000
0x6c200000
0x6c300000
0x6c400000
0x6c500000
0x6c600000

In version 2.6, after the first allocation it alternates between just two addresses:

0x6b000000
0x6b700000
0x6b800000
0x6b700000
0x6b800000
0x6b700000

With the pointer argument you suggested, I get this error:

  File "untitled", line 987, in <module>
    frame = allocate(shape=(h, w, 3), dtype=np.uint8, cacheable=True, pointer=pointer)
  File "/usr/local/share/pynq-venv/lib/python3.8/site-packages/pynq/buffer.py", line 172, in allocate
    return target.allocate(shape, dtype, **kwargs)
  File "/usr/local/share/pynq-venv/lib/python3.8/site-packages/pynq/pl_server/device.py", line 292, in allocate
    return self.default_memory.allocate(shape, dtype, **kwargs)
  File "/usr/local/share/pynq-venv/lib/python3.8/site-packages/pynq/pl_server/xrt_device.py", line 169, in allocate
    buf = _xrt_allocate(shape, dtype, self.device, self.idx, **kwargs)
  File "/usr/local/share/pynq-venv/lib/python3.8/site-packages/pynq/pl_server/xrt_device.py", line 122, in _xrt_allocate
    bo, buf, device_address = pointer
TypeError: cannot unpack non-iterable int object

I also have tried something like this:

frame_buffer = allocate(shape=(h, w, 3), dtype=np.uint8, cacheable=True)
frame = allocate(shape=(h, w, 3), dtype=np.uint8, cacheable=True)
while True:
    frame[:] = frame_buffer[:]

This produces a noisy overlay on the image. Could you please explain how to use the pointer you mentioned?

The pointer is a bit more complex than what I mentioned earlier; please try this:

frame_aux = allocate(shape=(h, w, 3), dtype=np.uint8, cacheable=True)
pointer = (frame_aux.bo, frame_aux.device.map_bo(frame_aux.bo), frame_aux.physical_address)
while True:
    frame = allocate(shape=(h, w, 3), dtype=np.uint8, cacheable=True, pointer=pointer)

I tried this locally, and I am getting the same physical address inside and outside the loop. Not sure about the performance.

Mario

Yes, I have tried it this way. The frame rate increased to 8.9 FPS (with a lot of noise in the image) and no errors appear anymore, but unfortunately that does not help my project, as I need 60 FPS output.

Ok, thank you for checking.

To confirm: if you allocate outside of the loop and reuse the same buffer, without clearing it inside the loop, what performance do you get?

Where is this buffer ultimately being used? Do you write to the HDMI or DisplayPort?

Mario

I am using the FMC connector for DVI output. My custom TMDS IP drives its output through that connector. I use the buffer to write to the video mixer IP, which overlays the image on the main stream.

If I put the buffer outside the loop, there is a lot of noise (black and image data mixed up) and I can’t see the overlay anymore, but the frame rate is 60 FPS with no error.
If I put it inside the loop without your suggested pointer, it is 4.6 FPS (no noise), with the error coming at the end.
If I put it inside the loop with the pointer, it is 8.9 FPS (less noise than in the first case, no error).

Is there any other way to allocate memory as fast as before?

On the composable overlay I am using a VDMA instead of a DMA, and I am getting 60 FPS.

For instance, this class moves images from the PS, either from a file or from a webcam, to the PL: video.py in the Xilinx/PYNQ_Composable_Pipeline repository on GitHub.

I am also using a VDMA. If you are not doing much on the PS and only issue the transfer command, it stays at 60 FPS.

The difference I have found between v2.6 and v2.7 is that two things are considerably slower in v2.7:

  1. allocating a memory space.
  2. doing some operation on that buffer.

I came up with a workaround (which I will not use, though): doing the operations on a NumPy array and copying the array into the memory buffer (allocated outside the loop) in one go. This reaches 52 FPS (some noise is still present, but not much). The problem with this approach is that adding more operations on the array reduces the output speed.
In v2.6, by comparison, even twice as many operations on the buffer itself do not drop the frame rate. I wish v2.7 offered that too, if possible.
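The workaround described above can be sketched as follows. This is only an illustration: frame is a plain NumPy array standing in for the pynq buffer allocated outside the loop, and draw_overlay is a hypothetical placeholder for the OpenCV drawing and blending steps.

```python
import numpy as np

h, w = 480, 640

# Stand-in for the pynq buffer allocated once outside the loop (assumption)
frame = np.zeros((h, w, 3), dtype=np.uint8)
# Ordinary NumPy memory where all drawing operations take place
scratch = np.empty_like(frame)


def draw_overlay(img, step):
    """Hypothetical drawing routine; clears the image and draws a white bar."""
    img[:] = 0
    img[10:20, 10:10 + step] = 255


for step in range(1, 4):
    draw_overlay(scratch, step)   # many small operations on normal memory
    np.copyto(frame, scratch)     # one bulk copy into the shared buffer
```

The design choice is that the shared buffer sees a single write per frame, so the cost of the drawing operations is paid on ordinary memory rather than on the allocated buffer.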


Do you have a small example with which we can reproduce these numbers, so we can try to track down the problem and hopefully solve it for the next release?

Unfortunately, I don’t have one. The data comes from the CMOS sensor, and everything is done on the PL apart from some control logic, some mathematics, and the creation of some custom images (overlays, logos, symbols, menus). The menu is a big part of showing what is going on in the project. It changes constantly based on the status and user inputs, and so does the buffer. Basically, it uses OpenCV drawing, text writing, and blending functions.
I think you can reproduce this just by comparing the time needed to allocate a buffer in both versions and then doing some OpenCV operations on those buffers.

import time
import numpy as np
from pynq import allocate

start_time = time.time()
for i in range(100):
    frame = allocate(shape=(480, 640, 3), dtype=np.uint8, cacheable=True)
print(time.time() - start_time)

This took about 0.08554 seconds in v2.6 and about 11.74823 seconds in v2.7.
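Divided by the 100 iterations, those totals come to roughly 0.86 ms per allocation in v2.6 and roughly 117 ms per allocation in v2.7, while a 60 FPS loop has about 16.7 ms per frame in total. A small helper for repeating the measurement with any allocator might look like the sketch below. The function name time_allocations is hypothetical, and bytearray is used only as a stand-in so the snippet runs without a board; on the board one would pass something like lambda: allocate(shape=(480, 640, 3), dtype=np.uint8, cacheable=True).

```python
import time


def time_allocations(alloc, n=100):
    """Return the average wall-clock seconds taken by one call to alloc()."""
    start = time.perf_counter()
    for _ in range(n):
        buf = alloc()  # rebinding buf drops the previous buffer each time
    return (time.perf_counter() - start) / n


# Stand-in allocator of the same byte size as a 480 x 640 x 3 uint8 frame
per_call = time_allocations(lambda: bytearray(480 * 640 * 3))
print(f"{per_call * 1e3:.3f} ms per allocation")
```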


We’ll have a look. It may take time to resolve this, and the fix may be incorporated in the next release.


Hi @mizan,

Can you help us profile this in the two images?

If you run the following code in two different cells, you will get a detailed profiling of the call. With this we may be able to narrow down the problem.

from pynq import Overlay, allocate
import numpy as np

ol = Overlay('base.bit')

import cProfile
import pstats
from pstats import SortKey
profile = cProfile.Profile()
profile.run('frame = allocate(shape=(1080,1920,3), dtype=np.uint8, cacheable=True)')
ps = pstats.Stats(profile)
ps.sort_stats(SortKey.TIME).print_stats()
ps.sort_stats(SortKey.TIME).print_callees()

If possible, please provide the logs in separate files for each image.

Mario


Python 3.6 in v2.6 doesn’t support SortKey.
v2.6.txt (10.8 KB)
v2.7.txt (40.3 KB)
v2.7_with_sortkey.txt (40.3 KB)
