PYNQ: PYTHON PRODUCTIVITY

DMA Channel Not Started when changing data rates

Hi all!
I am using a Pynq Z2 and I am attempting to accelerate a FIR filter function as detailed in this tutorial: http://www.fpgadeveloper.com/2018/03/how-to-accelerate-a-python-function-with-pynq.html

I wish to implement an interpolating filter, which seems to increase the output data rate by the zero-packing factor.

I have configured the IP to have an interpolation rate of 4.

On the DMA configuration side, it seems like the output data width is correctly updated to match the higher data rate: the Stream Data width is 128-bit, 4 times the input data width.

I also disabled the scatter-gather engine as the above tutorial mentioned that Pynq-Z2 does not support this option.

Subsequently, I created this top level block design including the Zynq PS:


I added a concat with an AXI interrupt controller to the DMA controller, as I read that “Interrupt” interferes with importing some of the underlying Pynq framework (correct me if my implementation is wrong).

After generating bitstream, I call the hardware function as follows:

# Import the hardware overlay
from pynq import Overlay
import pynq.lib.dma

overlay = Overlay('/home/xilinx/pynq/overlays/fir_interpolation/fir_interpolation.bit')

dma = overlay.filter.fir_dma
sendstatus = dma.sendchannel.running
recvstatus = dma.recvchannel.running
print("DMA Channel status: Send: ", sendstatus, " Recv: ", recvstatus)

overlay?

The print statement returns the following, showing that initially, the DMA channel is running.
DMA Channel status: Send: True Recv: True
The result of the overlay? is:

Type:            Overlay
String form:     <pynq.overlay.Overlay object at 0xac352a30>
File:            /usr/local/lib/python3.6/dist-packages/pynq/overlay.py
Docstring:      
Default documentation for overlay /home/xilinx/pynq/overlays/fir_interpolation/fir_interpolation.bit. The following
attributes are available on this overlay:

IP Blocks
----------
filter/fir_dma       : pynq.lib.dma.DMA
axi_intc_0           : pynq.overlay.DefaultIP

Hierarchies
-----------
filter               : pynq.overlay.DefaultHierarchy

Interrupts
----------
None

Oddly enough, the FIR_compiler IP does not appear in the overlay list, and interrupts are not reflected either.
I have included a .hwh file in the same directory as the .bit file imported above, with the same file name.

Subsequently, I execute the following:

# Test the overlay
from pynq import Xlnk
import numpy as np

# Allocate buffers for the input and output signals
xlnk = Xlnk() # create a Xlnk object, xlnk.
in_buffer = xlnk.cma_array(shape=len(x_in), dtype=np.int32)  
out_buffer = xlnk.cma_array(shape=len(x_in*rate), dtype=np.int32) 

# Copy the samples to the in_buffer
np.copyto(in_buffer, x_in)

# Trigger the DMA transfer and wait for the result
print("Sending")
dma.sendchannel.transfer(in_buffer)
print("Recieving")
dma.recvchannel.transfer(out_buffer)
print("Waiting for buffers to be idle")
dma.sendchannel.wait()
dma.recvchannel.wait()

print("Output:", out_buffer)

# Free the buffers!
in_buffer.close()
out_buffer.close()

However, this returns:

Sending
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-23-a63a2d2d630f> in <module>()
     17 # Trigger the DMA transfer and wait for the result
     18 print("Sending")
---> 19 dma.sendchannel.transfer(in_buffer)
     20 print("Recieving")
     21 dma.recvchannel.transfer(out_buffer)

/usr/local/lib/python3.6/dist-packages/pynq/lib/dma.py in transfer(self, array)
    118                               array.nbytes, self._size))
    119         if not self.running:
--> 120             raise RuntimeError('DMA channel not started')
    121         if not self.idle and not self._first_transfer:
    122             raise RuntimeError('DMA channel not idle')

RuntimeError: DMA channel not started

This is a rather obscure error. What could be causing it?

Thanks in advance for the help and advice!
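
A side note on the output buffer allocation in the post above, offered as a sketch rather than a diagnosis of the DMA error. Assuming x_in is a NumPy array (the post does not say), len(x_in*rate) scales the sample values and leaves the length unchanged, so the output buffer would be no larger than the input. If x_in is a plain Python list, the same expression repeats the list and the length does grow. The array x_in below is a hypothetical stand-in for the real samples.

```python
import numpy as np

rate = 4                               # interpolation factor from the post
x_in = np.arange(8, dtype=np.int32)    # hypothetical stand-in for the input samples

# For a NumPy array, x_in * rate multiplies every element; the length is unchanged.
len_scaled = len(x_in * rate)

# Multiplying the length itself leaves room for `rate` output samples per input sample.
len_interp = len(x_in) * rate

print(len_scaled, len_interp)
```

With a NumPy input, the second form is the one that sizes the buffer for the interpolated output.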

Hi, I seem to be having a similar problem, outlined here: https://discuss.pynq.io/t/running-fft-with-axi-dma-channel-wait-dma-channel-not-started/1479 Did you find any solutions in the meantime?

Unfortunately not. I plan to downsample before passing the data off the board through the DMA anyway, so the output data rate remains similar to the input, but I would like to know the reasons behind this nonetheless!

I am just guessing: try to avoid using PS interrupt 0? You can actually have a concat block in between axi interrupt and PS.

Also, it might be even easier to avoid using interrupts at all? That would help with debugging at least.

Hi rock, I removed the PS interrupt as well. In addition, I downsampled the output data from the FIR compiler, such that the input and output data rates are now the same. The “DMA Channel not started” error does not show now - but I am getting all zeroes from my output data buffer.

How would you suggest to debug this issue? Thanks!

Looks like the IP (filter) has some issues? Have you done any testing of the IP? You might have some control registers for your IP, but in your Python code I did not see you start the IP. It looks like you only started the DMA, and I am not sure the IP is ready to process data. Is there a status register in your IP that you can poke (through the AXI-lite port of the IP)?


Hi Rock,
Thank you for the reply. I have not checked this, and have only configured the IP core for use with the AXI-streaming interface from the DMA in mind. I did not include any AXI-lite control or status ports with the IP, only a reset pin that is controlled via the GPIO outputs of the ZYNQ.

I can share how my IP is laid out when I get back to the office on Monday.

Thank you very much for the help!

Hi Rock,
Below is my overlay.
It consists of a few IP blocks written in VHDL to zero-pad the data, push into a FIR filter, and subsequently perform an interpolation.

The output data rate should be roughly the same as the input data rate.
I am wondering if there are some best practices that I am missing out on?

Thanks for the help!

To check the DMA is working, you can do a loopback test. Are you initializing your output buffer with some known data and do you see it being overwritten with zeros? i.e. if you are getting zeros, make sure to initialize the buffer to something else, so you know it is being written.

Is your RTL output block setting TLAST properly? This can mess up DMA transactions.

If this is working, then your problem probably isn’t specific to PYNQ, and is more likely an FPGA design problem. In this case, you would be better off asking on the Xilinx forums.
Have you simulated your IP and verified it is all working correctly together?
If it simulates correctly, you could also add ILA probes to the data into the FIR compiler, and on the output. This should check if there is an issue with your IP or the DMA.

Cathal
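
The suggestion above about pre-filling the output buffer can be sketched as follows. This is a minimal illustration, not PYNQ-specific code: a plain NumPy array stands in for the DMA output buffer, the SENTINEL value and the untouched_count helper are hypothetical names, and the slice assignment stands in for whatever the hardware actually writes. On the board, the same fill and check should apply to a buffer returned by allocate, since those buffers behave like NumPy arrays.

```python
import numpy as np

SENTINEL = np.int32(-1)   # hypothetical marker, assumed not to occur in real filter output

def untouched_count(buf, sentinel=SENTINEL):
    """Return how many elements still hold the sentinel, i.e. were never written."""
    return int(np.count_nonzero(buf == sentinel))

# Plain NumPy array standing in for the DMA output buffer, pre-filled with the marker.
out_buffer = np.full(16, SENTINEL, dtype=np.int32)

# Stand-in for the DMA transfer: pretend the hardware wrote zeros to the first 4 words only.
out_buffer[:4] = 0

print(untouched_count(out_buffer))
```

If every element still holds the marker after the transfer, nothing was written at all; if only the first few changed, the transfer ended early, which points toward TLAST.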

Hi Cathal,

My DMA buffer (at least on the PYNQ side) is not initialised. I simply use the copyto function to copy the output from the overlay to an array into the Jupyter Notebook.

What are some things to note for the RTL output? For now, I simply hold TLAST high.

I have simulated the IP within the sub-hierarchy and ascertained that it works, but not with the attached ZYNQ modules. How would you suggest for me to do so (and what are the best practices in general?)

Thanks a lot!

TLAST signals the end of a transfer. If you hold it high, it ends after the first data beat no matter what the transfer length was.
If you have an AXI stream input and output, you could pass through TLAST (from input to output) adding any required delays. Otherwise you would need to calculate when you need TLAST to be high and generate this yourself.

Cathal
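
To make the TLAST point above concrete, here is a small software model (not hardware code, and not how the DMA is implemented internally). It treats a stream as a list of per-beat TLAST flags and counts how many beats belong to the first packet. The function name and the example packet length of 8 are assumptions for illustration.

```python
def beats_in_first_packet(tlast_flags):
    """Count beats up to and including the first beat with TLAST set.

    Returns None if TLAST is never asserted, i.e. the packet never terminates.
    """
    for i, last in enumerate(tlast_flags):
        if last:
            return i + 1
    return None

n = 8                                   # hypothetical transfer length in beats
held_high = [1] * n                     # TLAST tied high
last_beat_only = [0] * (n - 1) + [1]    # TLAST asserted on the final beat only
never_asserted = [0] * n                # TLAST never asserted

print(beats_in_first_packet(held_high))
print(beats_in_first_packet(last_beat_only))
print(beats_in_first_packet(never_asserted))
```

In this model, tying TLAST high ends the packet after one beat, while asserting it only on the final beat gives the full-length packet.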

Dear Cathal,

I did not realize that! I changed my TLAST implementation to calculate when TLAST should go high, and pull it high till the PL overlay is reset.

I also took a look at the DMA tutorial here, which worked on my device. https://github.com/Xilinx/PYNQ_Workshop/blob/master/Session_4/6_dma_tutorial.ipynb

Hence, I tried adding an AXIS FIFO in between the last RTL block and the DMA.

However, in the PYNQ notebook, it once again reads “DMA Channel not started”.

My code for transferring data is as follows:

from pynq import allocate
import numpy as np

sine_hw_interp = np.zeros(len(i_prime), dtype=np.int32)
# Allocate buffers for the input and output signals
in_buffer = allocate(shape=(len(sine_test) + 3), dtype=np.int32)  
out_buffer = allocate(shape=len(sine_hw_interp), dtype=np.int32)

metadata = np.array([i_max_write, x_n2_write, var_i_write])
sine_test_in = np.append(metadata, sine_test) # Add in metadata 

# Copy the samples to the in_buffer
np.copyto(in_buffer, sine_test_in)

# Trigger the DMA transfer and wait for the result
overlay_rst.write(1) # toggle reset of overlay
overlay_rst.write(0)
dma.sendchannel.transfer(in_buffer)     # On subsequent runs, error arises here
dma.recvchannel.transfer(out_buffer)
dma.sendchannel.wait()     # On initial run, error arises here
dma.sendchannel.wait()
dma.recvchannel.wait()

np.copyto(sine_hw_interp, out_buffer)

print(out_buffer)

# Free the buffers!
in_buffer.close()
out_buffer.close()

I feel like I do not understand the operation of the AXI DMA and how to use it sufficiently, but at the same time lack the experience/resources to learn how to do so effectively.

This kind of problem is usually because your IP AXI interfaces have some issues. But that is all I know. If you can paste the AXI interface RTL code, maybe we can take a quick look there.

Hi Rock,

It is below. Broadly, there is an input for a maximum value of i, i_max_in. This is input as a 2’s complement negative number, so it is inverted beforehand. There is also a counter which increments by way of another IP. The current value of the counter is i_curr. If both i_max_in and the relevant bits of i_curr are equal, then TLAST is asserted to the downstream DMA.

The rest of the IP is a linear interpolator.

A screenshot of the DMA settings is below.

Lastly, does TKEEP affect anything? I have set it to always high, under the impression that it acts as a sort of bit mask for keeping the bytes being transmitted.

Thank you so much for the help!

library ieee;
use ieee.std_logic_1164.all; -- for standard logic vector
use ieee.numeric_std.all; -- for unsigned/signed/integers

-- Based on fixed point interpolation logic developed previously
entity lin_interp is
    generic(
        BIT_WIDTH : integer := 32;
        CTR_WIDTH : integer := 16;
        PAD_WIDTH : integer := 4; -- Upsampling rate
        WHOLE_FP : integer := 16; -- Whole part of i'prime
        FRAC_FP : integer := 32 -- Fractional part of i'prime
        );   -- Input/output bit width
    port(
        -- tvalid signals (both in and out) are not required as this takes and passes every sample.
        CLK : in std_logic;
        RST : in std_logic; 
        din_tdata : in std_logic_vector(47 downto 0); -- [32.0]
        din_tvalid : in std_logic;
        din_tready : out std_logic;
        i_max_in : in std_logic_vector(BIT_WIDTH-1 downto 0); -- Stop signal
        i_prime_valid : in std_logic; -- Drives the tvalid signal
        i_prime : in std_logic_vector(WHOLE_FP+FRAC_FP-1 downto 0); -- [16.32]
        i_curr : in std_logic_vector(CTR_WIDTH+PAD_WIDTH-1 downto 0); -- [16.7]
        calc_done :  in std_logic;
        fir_done : in std_logic;
        dout_tdata : out std_logic_vector(BIT_WIDTH-1 downto 0);
        dout_tvalid: out std_logic;
        dout_tready : in std_logic; -- Not needed. Must connect to a buffer.
        dout_tlast : out std_logic;
        dout_tkeep : out std_logic_vector(3 downto 0) := "1111"
        -- Output hooked up to a AXI buffer before being sent toward the DMA.
    );
end entity;

architecture rtl of lin_interp is
constant PAD_MAX : integer := 2**PAD_WIDTH-1; -- max val that pad val can hold

type shiftreg is array (PAD_MAX downto 0) of std_logic_vector(BIT_WIDTH-1 downto 0); -- How a shift reg is declared in vhdl
signal sr : shiftreg;

signal i_prime_max : signed(WHOLE_FP-1 downto 0); -- Convert
signal i_prime_store : std_logic_vector(i_prime'high downto 0);

-- GENERICISED!
alias i_prime_whole : unsigned(WHOLE_FP-1 downto 0) is unsigned(i_prime_store(i_prime'high downto i_prime'high+1-WHOLE_FP)); -- 
alias i_prime_frac : unsigned(FRAC_FP-1 downto 0) is unsigned(i_prime_store(FRAC_FP-1 downto 0)); -- Fractional part is 32bit wide
alias i_curr_whole : unsigned(WHOLE_FP-1 downto 0) is unsigned(i_curr(i_curr'high downto i_curr'high+1-WHOLE_FP)); -- Whole portion; 
alias i_prime_index : unsigned(PAD_WIDTH-1 downto 0) is unsigned(i_prime_store(FRAC_FP-1 downto FRAC_FP-PAD_WIDTH)); -- 

begin
    -- clock
    process (CLK) is
        variable diff_full : signed(BIT_WIDTH-1 downto 0); -- Full precision to be multiplied
        variable mult_trunc : signed(FRAC_FP-1 downto 0);
        variable mult_shift : signed(BIT_WIDTH+FRAC_FP-1 downto 0) := (others => '0'); -- Holding point for the shifted value
        variable i_prime_mult : signed(FRAC_FP-PAD_WIDTH downto 0) := (others => '0'); -- Force unsigned
        -- Because i_prime'frac - pad_width = 25.
        -- Product is [32.0]*[1.25] = [33.25] -> [32.0]
        variable product : signed(BIT_WIDTH+FRAC_FP-PAD_WIDTH downto 0);
        variable interp : signed(BIT_WIDTH-1 downto 0); -- hold value to be put out
        
--        variable i_curr_cmp : std_logic_vector(15 downto 0);
        variable index_cmp : std_logic;
        
    begin
        -- Write i' vs current_i selection
        if rising_edge(CLK) then
            if (RST = '1') then
                dout_tvalid <= '0';
                dout_tlast <= '0';
            else
                if (calc_done='1' and fir_done='1') then -- Precalc and fir_calculations have completed. Time to begin interpolating.
                
                    -- If i_prime_whole and i_curr_whole values match, run interpolation
                    -- The interp should happen as the i_curr value JUST overflows (xxxx.0000) while i_prime is xxxx-1.yyyy
                    
                    -- Perform interp on whole CTR values
                    -- Assume pad width=7
                    if (i_curr(PAD_WIDTH-1 downto 0) = std_logic_vector(to_unsigned(0, PAD_WIDTH)) ) then
                        -- diff: reg[i+1] - reg[i] (32.0)
                        -- Formula: diff*i_prime + reg[i] (32.0*0.26)=32.26
                        -- 
                        diff_full := signed(sr(PAD_MAX-to_integer(unsigned(i_prime_store(FRAC_FP-1 downto FRAC_FP-PAD_WIDTH)))-1)) - signed(sr(PAD_MAX-to_integer(unsigned(i_prime(FRAC_FP-1 downto FRAC_FP-PAD_WIDTH)))));
                        -- Don't overwrite the MSB
                        i_prime_mult(i_prime_mult'high-1 downto 0) := signed(i_prime_store(i_prime_mult'high-1 downto 0));
                        product := diff_full * i_prime_mult; -- CHECK THIS!
                        -- [32.0]*[1.25] = [33.25]=>[32.0]
                        interp := product(product'high-1 downto product'high-BIT_WIDTH) + signed(sr(PAD_MAX-to_integer(unsigned(i_prime_store(FRAC_FP-1 downto FRAC_FP-PAD_WIDTH))))); -- 60:28->32.0
                        -- Use simple bit truncation; can add a better rounding scheme in the future.
                        --  https://zipcpu.com/dsp/2017/07/22/rounding.html
                        dout_tdata <= std_logic_vector(interp);
                        dout_tvalid <= '1';
                    else
                        dout_tvalid <= '0';
                    end if;
                    
                    -- Last (controls interface to DMA)
                    -- Triggers last when i_prime_max is the same as the largest permssible i_ctr value
                    if (signed(i_prime_store(i_prime_store'high downto i_prime_store'high-WHOLE_FP+1)) = i_prime_max) then
                        dout_tlast <= '1';
                    else
                        dout_tlast <= '0';
                    end if;
                    
                    -- Hold the i_prime value
                    if (i_prime_valid = '1') then
                        i_prime_store <= i_prime;
                    end if;
                    
                    -- Update shift register. It assumes that the data in is always valid, 
                    -- as the counter cannot wait.
                    for i in sr'high downto sr'low +1 loop
                        sr(i) <= sr(i-1);
                    end loop;
                    sr(sr'low) <= din_tdata(din_tdata'high downto din_tdata'high+1-BIT_WIDTH);
                    din_tready <= '1'; -- Always ready to accept data
                    
                    
                    i_prime_max <= -signed(i_max_in(WHOLE_FP-1 downto 0));
                    
                end if;
            end if;
        end if;
    end process;

end architecture ; -- rtl

At a quick glance, it doesn’t look like you have implemented the AXI protocol.

You are ignoring dout_tready, which you can’t do. You can’t just assert and deassert valid whenever you want. Data is only transferred when ready and valid are both high, i.e. when the downstream is ready to accept it.

I’d suggest you use the Vivado IP packager to create the AXI interface logic, and add your RTL logic into this design.

To add, yes, Keep is like a byte mask.

Cathal
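
The ready/valid point above can be illustrated with a cycle-by-cycle software model. This is only a sketch: the simulate function, the sample values, and the ready pattern are all hypothetical, and the source is assumed to always have data available (TVALID effectively high). It compares a source that holds its data until the sink accepts it with one that advances every cycle regardless of TREADY, which is what the posted RTL appears to do.

```python
def simulate(samples, ready_pattern, honour_ready):
    """Return the samples the sink actually receives.

    samples: values the source wants to send, in order.
    ready_pattern: the sink's TREADY for each clock cycle (truthy = ready).
    honour_ready: if True, the source holds the current sample until it is
        accepted; if False, it moves to the next sample every cycle.
    """
    received = []
    idx = 0
    for ready in ready_pattern:
        if idx >= len(samples):
            break
        if ready:
            # Valid and ready are both high this cycle: the beat transfers.
            received.append(samples[idx])
            idx += 1
        elif not honour_ready:
            # The sink stalled but the source moved on: this sample is lost.
            idx += 1
    return received

samples = [10, 20, 30, 40]
ready = [1, 0, 1, 1, 0, 1, 1, 1]   # the sink stalls on cycles 1 and 4

print(simulate(samples, ready, honour_ready=True))
print(simulate(samples, ready, honour_ready=False))
```

With this ready pattern, the source that honours TREADY delivers all four samples, while the one that ignores it loses the sample presented during the stall.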