AXI burst size in Pynq Z2 vs Vitis HLS

Hello everyone,

I am having a hard time understanding a peculiarity of how to write Vitis HLS code for the Pynq Z2. In particular, I am working on the memory interface, and I would like to connect my kernel to the DDR memory on the board.

The Vitis HLS documentation says the port width of an AXI interface can be widened up to 512 bits, because “Vitis HLS supports” it. Still, the HLS code must eventually be translated into a hardware implementation, and this is where I get lost. On the Pynq Z2 the AXI HP ports are at most 64 bits wide, so what happens when I widen the port of the M_AXI Adapter in Vitis HLS up to 512 bits? How can it be implemented if the hardware eventually faces a 64-bit-wide port?
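
For concreteness, here is roughly the kind of kernel and interface pragma I mean (a minimal sketch only; the kernel name, bundle names, and loop are made up for illustration):

// Sketch of a Vitis HLS kernel whose AXI master ports are declared 512 bits wide.
#include <ap_int.h>

void copy_kernel(const ap_uint<512> *in, ap_uint<512> *out, int n) {
#pragma HLS INTERFACE m_axi     port=in  offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi     port=out offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=return
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in[i];  // each access moves one 512-bit word on the m_axi port
    }
}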

I have gone through lots of documentation, the internet, and ChatGPT, but I can’t find an answer. I am a CS guy and I don’t know much about these low-level implementation details.

Thank you so much in advance.

Best,
Davide

@davide-giacomini

Unaligned transfers,
AKA masking the bus content. When going from a high bit-width to a low bit-width:
buffer the content into blocks and transfer them sequentially.
When going from a low bit-width to a high bit-width:
simply offset the address and apply masking.

So you can see that high-to-low introduces a bottleneck,
and low-to-high introduces throughput wastage.

ENJOY~

Thank you for your immediate answer.

I’m sorry, but I did not understand. What does “masking bus content when going from high to low” mean?

Thank you again
Davide

@davide-giacomini

Remarks:
by “high/low” I mean a high or low bit-width (BW) bus on AXI.

Masking is a simple method to remove unused content from the bus.
For example, say I need the bytes at addresses [0x00, 0x03] out of a 64-bit-wide data bus:

Data (addr 0x07 … 0x00):  [HH HH HH HH HH HH HH HH]
Mask (byte enables):      [00 00 00 00 FF 00 00 FF]
Result (kept bytes):      [XX XX XX XX DD XX XX DD]

So, simply said, the resulting bytes are pushed into a byte/word buffer and passed on to the next block.
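
In C-like code the idea is just a bitwise AND with a byte-lane mask (a conceptual sketch only, with made-up values; on real AXI writes this role is played by the WSTRB byte-enable signals):

// Conceptual model of byte-lane masking on one 64-bit (8-byte) bus beat.
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t beat = 0x1122334455667788ULL;  // raw 64-bit beat, byte 7 ... byte 0
    uint64_t mask = 0x00000000FF0000FFULL;  // keep only byte lanes 3 and 0 (addresses 0x03 and 0x00)
    uint64_t kept = beat & mask;            // every other lane becomes "don't care"
    printf("kept bytes: 0x%016llx\n", (unsigned long long)kept);
    return 0;
}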

ENJOY~

I understand the masking now, thank you. However, this doesn’t seem very robust. What I mean is: how can I be sure that this method always provides a constant port width of 512 bits?

Do you know where I can find more detailed information? It isn’t in the Pynq or Vitis user guides.

Best,
Davide

@davide-giacomini

The AXI data width can only be set to power-of-two sizes, i.e. 32, 64, 128, 256, or 512 bits.
This is predefined in HLS by the data type you are trying to handle.

The HLS C++ header is where you define the structure (data type) that determines the bus bit-width.
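
For example (a sketch with made-up names), the types used on the top-function arguments are what fix the AXI data width:

// Sketch only: the declared data type determines the AXI bus bit-width,
// which must be one of the power-of-two sizes (32/64/128/256/512 bits).
#include <ap_int.h>

typedef ap_uint<512> bus512_t;  // one 512-bit beat
typedef ap_uint<64>  bus64_t;   // one 64-bit beat

void top(const bus512_t *src, bus64_t *dst, int n);  // port widths follow these types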

ENJOY~

Still, I don’t understand how Vitis HLS can read 512 bits at a time while the Pynq Z2 sends 64 bits at a time… I mean, even if the DDR runs at a higher frequency, how does the PL have the capability to implement ports 512 bits wide? How is it possible?

Davide

@davide-giacomini

How do you share a pizza with your friends when you only ordered a 12-inch one?
You split it and pass the slices one by one.
And how do you share pizza with a group of people?
You wait until the pizza is ready from the kitchen and serve part of the group at a time.
Simple~

ENJOY~

Thank you, but do you know how Vitis HLS implements ports 512 bits wide? I cannot find it.

Thank you again,
Davide

@davide-giacomini

I had mentioned this already:
there is a structure in the header or define section that defines the interfaces of the function’s I/O.
Look for it, and you can also find an AXI-Stream tutorial here or online.
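
A typical AXI-Stream interface definition looks roughly like this (a generic sketch with made-up names and data width, not taken from any particular tutorial):

// Generic AXI-Stream pass-through in Vitis HLS (illustration only).
#include <ap_axi_sdata.h>
#include <hls_stream.h>

typedef ap_axiu<64, 0, 0, 0> axis64_t;  // 64-bit data plus TKEEP/TLAST side channels

void passthrough(hls::stream<axis64_t> &in, hls::stream<axis64_t> &out) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE ap_ctrl_none port=return
    axis64_t v = in.read();  // one beat in
    out.write(v);            // one beat out, side channels preserved
}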

You can find similar interface definitions, for example, in an MNIST CNN design.

ENJOY~

I am sorry, I think we are misunderstanding each other. I know how to tell Vitis, with pragmas or directives, how to implement it, but I would like to know how it is implemented, more or less, in hardware. I need this in order to adapt my directives to the FPGA that I am using under the hood. If, for example, I am using an FPGA that doesn’t have enough resources, maybe I cannot actually widen the port to 512 bits.

I only need to know where this information is: I can easily find the port widths of the DDR Controller and other features, but nothing about the M_AXI Adapter of Vitis HLS. I can only find how to tell Vitis what to do (pragmas, directives, etc.).

I am sorry again, and thank you.
Davide

@davide-giacomini

This is not logical.
A wider bus uses more resources, since the buffers are certainly larger.
Meanwhile, a narrow-bus design can use pipelining to reduce resources, but trades off the number of cycles needed to pass the information. Hence, buffer size vs. stall time.

What HLS does is what the hardware does: HLS is just a conversion from a high-level language, and the final result is whatever the logic elements implement.
These are simply netlists mapped to LUTs, FFs, and MUXes.

So I am not sure what you are trying to ask here.
If you are not familiar with FPGAs from the beginning, you need more study, e.g. videos on VLSI design.
AXI and AXI-Stream can both use a crossbar design to handle different interface sizes, as long as the protocol is the same, AKA the original AXI4 from ARM.

However, for the Zynq ARM hard core you cannot modify the bus width; it is not programmable and is already fixed by the inherent IC design.

ENJOY~

I feel so dumb. All right, I will write down my understanding here so that you can tell me if and where I am wrong, because I don’t want to waste your time.

I have a Pynq Z2 and I want to set up a master AXI interface between the PL and the DDR of the board, using the AXI HP ports.
I saw that the DDR Controller runs at a much higher frequency than the PL, so the data can and will be buffered in a FIFO (a block RAM, I think).
The DDR Controller has 2 ports of at most 64 bits connected to the Memory Interconnect, which in turn connects to the HP AXI Controller (PL) through 4 AXI HP ports, each 64 bits wide.
At this point, the HP AXI Controller is PL fabric, so it is implemented in the FPGA. If I tell Vitis HLS to widen each port up to 512 bits, the PL will be implemented so that I will in fact be able to read 512 bits at a time on each of the four PL ports.

Hence, the DDR can only transfer 64 bits on each of 2 ports, but the PL will be able to read 512 bits on each of 4 ports. I am wondering how many resources would be needed to implement all of this, and also whether this is feasible on any FPGA or only on certain FPGA families.

I have attached the block diagram here as a reference. I am pretty sure I am missing something, but I hope I am being clear now, sorry.

Thank you,
Davide

@davide-giacomini

No, you cannot reason from the DDR speed you are referring to, i.e. 533 MHz
(AKA 1066 MT/s).
That is not the actual data arrival rate you can hope to get.
The maximum ARM DDR width is 32 bits, and considering the time to switch CAS, RAS, etc.,
I guess around 256 / 32 plus the additional turnaround time; it would NOT even be higher than 200 MHz.
You need some tests to verify this. Try a simple long read and a long write with different DMA bus bit-width settings.

Can you see the top-left corner of your diagram?
Where does the source data come from?
It is the same as what I mentioned in the pizza example.
What do you think happens in this case?
Remember that the DDR controller cannot break the physical rules here.

You are able to use four AXI channels, BUT what do you think the possible issue is here?
Apply the pizza-feeding rule:
what actions do you think are required to make a quad channel form a 512-bit bus?

This inherently needs a real implementation to verify the final answer.
But from what I can guess, forming one bus out of four channels won’t do any good.
I would suggest forming one out of two, AKA a 128-bit bus, if you really need to.

Now this is no longer a PYNQ question; it is more and more bound to information theory.

ENJOY~

Hi Davide,
If you are accessing the DRAM through the HP ports, you are correct, these HP ports are 64 bits wide.
HLS can generate IP interfaces of varying sizes, including 512 bits. To connect a 512-bit interface from your IP to the HP port, you need an adapter that will convert between the two interfaces (see the “AXI Interconnect” or “AXI SmartConnect” IP in Vivado).
Your 512-bit interface will be broken down by the adapter into 8 blocks of 64 bits each, and the adapter will manage the transfer of the data over a number of clock cycles.
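Conceptually, the width conversion looks something like this (a rough sketch of the idea only, not the actual Interconnect/SmartConnect implementation):

// Rough model: one 512-bit beat is re-emitted as eight 64-bit beats.
#include <ap_int.h>
#include <hls_stream.h>

void downsize_512_to_64(hls::stream<ap_uint<512> > &wide,
                        hls::stream<ap_uint<64> >  &narrow) {
    ap_uint<512> beat = wide.read();  // one wide beat from the IP side
    for (int i = 0; i < 8; i++) {
#pragma HLS PIPELINE II=1
        ap_uint<64> slice = beat.range(64 * i + 63, 64 * i);  // one 64-bit slice per cycle
        narrow.write(slice);
    }
}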
You can run your IP and the HP ports at different clock speeds; you can specify different clocks for each side of the adapter. However, it is probably unlikely that you want to build a design like this with a ratio of 512:64, as you probably won’t be running your IP 8x slower than the max speed of the HP ports. It is much more likely you would use, say, a 128-bit interface, so you only have a 2x step up or down.
The 512 bit interface on your IP will also consume more FPGA resources than a smaller width interface.
Unless you can benefit from a wider width, you will be better off using a 64-bit width for your HLS IP.
For info, if you have internal interfaces between IP in your design, you may choose to use wider interfaces for these.
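
For example, switching an HLS kernel from a 512-bit to a 128-bit master port is usually just a change of the port data type (a sketch with placeholder names):

// Sketch: the same style of kernel with a 128-bit AXI master port.
#include <ap_int.h>

void copy_kernel_128(const ap_uint<128> *in, ap_uint<128> *out, int n) {
#pragma HLS INTERFACE m_axi     port=in  offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi     port=out offset=slave bundle=gmem1
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=return
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1
        out[i] = in[i];  // one 128-bit word per access; the adapter handles the 128:64 conversion
    }
}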

One nice thing about HLS and how the design is connected in Vivado is that it is relatively easy to try different configurations. The adapter should automatically manage the different port sizes.

Cathal


Thank you so much! You have been extremely helpful and clear. I will follow your advice.

Best,
Davide


@cathalmccabe
@davide-giacomini

I appreciate Cathal summarizing my posts. But instead of just summarizing the idea, why not provide real measurement data?
I think it is better to provide real measurement results,
referenced from:

Quad concurrent read channels, maximum measured: 255 MByte/s
Quad concurrent write channels, maximum measured: 340 MByte/s

With interleaved accesses, a single channel does not show great write/read bandwidth.
This is inherently what interleaving does in a DRAM topology.

BTW, 255 MB/s / 150 MHz ≈ 1.7 bytes per clock, i.e. an equivalent transfer bus width of about 13.6 bits.
So it suggests that the AXI addressing is
not using a Translation Lookaside Buffer (TLB)
but a simple circular buffer.

Hope this can end this thread.

ENJOY~

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.