Pynq.allocate: allocation and address alignment using PL DDR4

Hi Folks,
I’m working on a subclass of DefaultIP to drive the Xilinx Multichannel DMA (MCDMA) in Scatter Gather mode so we can capture to the PL DDR4. I’ve got an overlay that uses the MIG and a SmartConnect to tie both the MCDMA and an AXI HP master from the PS into the PL DDR. The plan is to keep the SG buffer descriptors in PS DDR and send the data to PL DDR.
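
The driver shell itself is just the usual DefaultIP pattern; here’s a minimal sketch of what I have so far (the bindto version string is a guess and depends on the MCDMA version in your Vivado release):

from pynq import DefaultIP

class MCDMADriver(DefaultIP):
    """Skeleton driver for the AXI MCDMA; SG setup still to come."""
    # VLNV must match the IP in the HWH file -- version string may differ in your build
    bindto = ['xilinx.com:ip:axi_mcdma:1.1']

    def __init__(self, description):
        super().__init__(description=description)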

I suspect that I could manually populate all the addresses for the PL DDR4 based on the address space, and that the MCDMA would merrily clobber anything already there or otherwise using those addresses (nothing is). Ultimately I’m going to be repeatedly filling up the DDR4 and doing analysis on the results, so getting a numpy array of the good bits of the data will be important. While I think I could pull it out via MMIO.read, my understanding is that this is going to be painfully slow.

Thoughts? Is this not a good approach?

Roughly speaking I’d like to be able to modify this snippet or extend allocate so I can do something like this:

import numpy as np
from pynq import allocate

class MCS2MMBufferChain:
    """Chain of S2MM buffer descriptors plus the data buffers they point at."""
    def __init__(self, n, buffer_size=8096, contiguous=True, zero_buffers=True):
        # One 64-byte (16 x uint32) descriptor per buffer
        # TODO ensure starting address is a multiple of 0x40
        self._chain = allocate(16*n, dtype=np.uint32)
        self.contiguous = contiguous
        if contiguous:
            # TODO ensure starting address is on a stream width (here 0x40) boundary as DRE is disabled
            # TODO allocate in the PL DDR4
            buff = allocate(buffer_size*n, dtype=np.uint8)
            self._buffers = [buff]
            self.bd_addr = [buff.device_address + buffer_size*i for i in range(n)]
        else:
            # TODO ensure starting address is on a stream width (here 0x40) boundary as DRE is disabled
            # TODO allocate in the PL DDR4
            self._buffers = [allocate(buffer_size, dtype=np.uint8) for _ in range(n)]
            self.bd_addr = [b.device_address for b in self._buffers]
        if zero_buffers:
            for b in self._buffers:
                b[:] = 0
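
For concreteness, here’s roughly how I’d hope to handle the two TODOs, assuming a PYNQ version where allocate accepts a target memory from the overlay (the ddr4_0 name below is just a placeholder for whatever the MIG is called in my design) and that over-allocating by one stream width is an acceptable way to guarantee the 0x40 boundary:

import numpy as np
from pynq import allocate

ALIGN = 0x40  # stream width in bytes; DRE is disabled so buffers must sit on this boundary

def allocate_aligned(n_bytes, target=None, align=ALIGN):
    # Over-allocate by one alignment unit, then compute the padded physical address.
    # target would be the PL DDR4 memory exposed by the overlay (e.g. ol.ddr4_0);
    # None falls back to the normal PS DDR CMA allocation.
    raw = allocate(n_bytes + align, dtype=np.uint8, target=target)
    pad = (-raw.device_address) % align       # bytes to skip to reach the boundary
    start = raw.device_address + pad          # physical address to write into the BD
    view = raw[pad:pad + n_bytes]             # numpy view of the aligned region
    return raw, view, start                   # keep raw around so it can be freed later

# usage sketch -- ddr4_0 is a placeholder for the PL DDR4 name in the overlay
# ol = Overlay('mcdma_plddr.bit')
# raw, data, phys = allocate_aligned(8096*16, target=ol.ddr4_0)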

Have you tried the MMIO read? I don’t think the performance is too bad. Also, if you design your own API, it’s better to keep it intuitive rather than baking in assumptions like starting addresses, boundaries, etc.
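
Something like this is enough to get a feel for it (a quick sketch; the base address and window size should come from whatever the address editor assigned to your MIG):

import numpy as np
from pynq import MMIO

PL_DDR_BASE = 0x4_0000_0000       # example only -- use the MIG base from your address map
WINDOW = 16 * 1024 * 1024         # map a 16 MiB window rather than the whole 4 GiB

mmio = MMIO(PL_DDR_BASE, WINDOW)

word = mmio.read(0x0)             # single 32-bit read
mmio.write(0x0, 0xDEADBEEF)       # single 32-bit write

# bulk copy of the mapped window into ordinary numpy memory
chunk = np.array(mmio.array, copy=True)   # mmio.array is a uint32 view of the window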

I’ve not tried MMIO beyond verifying that I can read and write to the PL DDR4. My understanding is that MMIO amounts to 32-bit, non-bursted reads and writes, which would impose a serious bottleneck. With a 512-bit DDR interface and a 128-bit PL master at 333 MHz, I gather I’m looking at a (naive) difference of roughly 400 ms vs. 100 ms to read out the full 4 GB of PL DDR4 (assuming one AXI4 transfer per clock, which is wrong). I’m not sure what the transaction overhead is for a single read cycle off hand, and I’d imagine that Linux on the PS side is going to dominate here. Without an allocated address (ultimately backed by the xlnk library, right?) I’m not going to be able to do things like stream from the PL DDR4 to the gigabit port.

Our instrument is going to be repeatedly capturing data streams that fill or mostly fill the DDR4 (we may even try swapping in a larger SODIMM per the PDF docs) and then shipping that data off. The gigabit NIC should be the ultimate speed bottleneck here, but my understanding was that for robust access to that memory space I needed the Xilinx memory library to have an allocation for it. Am I misunderstanding something here?
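
For the ship-it-off step I was picturing something as simple as pushing the captured buffer straight through a TCP socket (only a sketch; the host/port are placeholders, and it relies on the buffer supporting the buffer protocol so sendall doesn’t copy):

import socket

def ship_buffer(buf, host='192.168.1.100', port=5000):
    # Stream a captured buffer to the analysis host; the gigabit NIC is the bottleneck.
    with socket.create_connection((host, port)) as sock:
        sock.sendall(memoryview(buf))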

I’m inferring from your response that what I was asking about is either unsupported or a VERY HARD THING (e.g. involving a rebuild of a shared lib or some such)?

@baileyji
I’m going down the same route, writing an MCDMA driver and trying to allocate a NumPy array in PL DDR. How far did you get with this? Are you reading the data using MMIO, or were you able to allocate and move blocks of data?

Thanks for any insight.
-Pat