Hello

I followed a PYNQ blog post that shows how to use an AXI master interface to compute a square root. I verified the blog's example and it works without problems. After that, I modified the project to add a more complex computation, but the result is wrong on the PYNQ board (it is correct in HLS simulation).

I am not sure where the problem is, so I tried adding my algorithm step by step. After 20 projects, I found that if I add more computation after a certain point in the algorithm, the result on the PYNQ board becomes wrong (it is still correct in HLS simulation). I don't know how to deal with this. Whenever I add more code after the point marked `B:`, the board's result is wrong!

```
#include "sampen.hpp"
#include <string.h>
#include <math.h>

void axi4_sampen(float *in, float *out, int len)
{
#pragma HLS INTERFACE s_axilite port=return bundle=sqrt
#pragma HLS INTERFACE s_axilite port=len bundle=sqrt
#pragma HLS INTERFACE m_axi depth=50 port=out offset=slave bundle=output
#pragma HLS INTERFACE m_axi depth=50 port=in offset=slave bundle=input
#pragma HLS INTERFACE s_axilite port=in
#pragma HLS INTERFACE s_axilite port=out
    float buff[100];
    float sampen[1];
    float D[100][100];
    int N = len;
    int m = 2; float r = 20;

    // Copy the input samples into a local buffer.
    memcpy(buff, (const float*) in, len * sizeof(float));

    // Set D[i][j] = 1 when samples i and j differ by at most r.
    // Entries that fail the test are not written.
    for(int i = 0; i < len; i ++){
        for(int j = 0; j < len; j ++){
            if(abs(buff[i] - buff[j]) <= r){
                D[i][j] = 1;
            }
        }
    }

    // Count matching pairs of length-2 templates.
    float count1[1] = {0};
    for(int i = 0; i < len - m + 1; i ++){
        for(int j = 0; j < len - m + 1; j ++){
            count1[0] = count1[0] + (D[i][j] and D[i+1][j+1]);
        }
    }
    count1[0] = count1[0] - len + m - 1;
B:
    float B[1] = {0};
    B[0] = (float)count1[0]/((len-m+1)*(len-m));

    // Count matching pairs of length-3 templates.
    float count2[1] = {0};
    for(int i = 0; i < len - m ; i ++){
        for(int j = 0; j < len - m ; j ++){
            count2[0] = count2[0] + (D[i][j] and D[i+1][j+1] and D[i+2][j+2]);
        }
    }
    count2[0] = count2[0] - len + m;
    float A[1] = {0};
    A[0] = (float)count2[0]/((len-m)*(len-m-1));

    // Write the single result value back through the AXI master port.
    memcpy(out, (const float*) A, 1 * sizeof(float));
}
```
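
For comparing the board output against a known-good value, the same computation can be sketched in plain Python. This is only a reference sketch: the function name `sampen_reference` is hypothetical, and it assumes every entry of `D` that fails the threshold test is 0, which the HLS code above does not guarantee because `D` is never cleared.

```python
def sampen_reference(x, m=2, r=20.0):
    """Software model of the A and B values computed by the HLS function.

    Assumption: D is zero wherever the threshold test fails.
    Requires len(x) >= m + 2 so that neither denominator is zero.
    """
    n = len(x)

    # D[i][j] = 1 when samples i and j differ by at most r, else 0.
    D = [[1 if abs(x[i] - x[j]) <= r else 0 for j in range(n)]
         for i in range(n)]

    # Matching pairs of length-m templates, minus the self-matches.
    count1 = 0
    for i in range(n - m + 1):
        for j in range(n - m + 1):
            if D[i][j] and D[i + 1][j + 1]:
                count1 += 1
    count1 = count1 - n + m - 1
    B = count1 / ((n - m + 1) * (n - m))

    # Matching pairs of length-(m+1) templates, minus the self-matches.
    count2 = 0
    for i in range(n - m):
        for j in range(n - m):
            if D[i][j] and D[i + 1][j + 1] and D[i + 2][j + 2]:
                count2 += 1
    count2 = count2 - n + m
    A = count2 / ((n - m) * (n - m - 1))

    return A, B
```

Like the HLS code, the loops hard-code the two- and three-sample comparisons, so the sketch is only meaningful for `m = 2`. Calling `sampen_reference(list(inpt[:length]))` in the notebook would give a value of `A` to compare with the single float the board writes to `outpt`.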

On the PYNQ board, I test it from Jupyter:

```
sqrt_ip.write(0x20, length)                  # len argument
sqrt_ip.write(0x10, inpt.physical_address)   # address of the input buffer
sqrt_ip.write(0x18, outpt.physical_address)  # address of the output buffer
sqrt_ip.write(0x00, 1)                       # start the core
```

Is the algorithm so long that the computation is still running when I read the data from the `out` port? If so, how can I wait for it to finish after starting the core by writing 1 to 0x00?
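
One way to wait is to poll the control register until the core reports completion. The sketch below assumes the usual layout of the control register that HLS generates for an `s_axilite` `return` port (bit 0 is `ap_start`, bit 1 is `ap_done`, at offset 0x00); that layout should be checked against the register map in the generated driver files. The helper `run_and_wait` and the `FakeIP` class are hypothetical names; `FakeIP` only exists so the sketch can run without a board.

```python
import time

AP_CTRL = 0x00   # control register offset, as used in the post
AP_START = 0x1   # bit 0 (assumed layout)
AP_DONE = 0x2    # bit 1 (assumed layout)


def run_and_wait(ip, timeout_s=1.0, poll_interval_s=0.0):
    """Start the core, then poll until ap_done is set.

    `ip` is any object with write(offset, value) and read(offset).
    Returns True when ap_done was seen, False on timeout.
    """
    ip.write(AP_CTRL, AP_START)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if ip.read(AP_CTRL) & AP_DONE:
            return True
        time.sleep(poll_interval_s)
    return False


class FakeIP:
    """Hypothetical stand-in for the board IP: busy for a fixed number of reads."""

    def __init__(self, busy_reads):
        self.busy_reads = busy_reads
        self.started = False
        self.reads = 0

    def write(self, offset, value):
        if offset == AP_CTRL and (value & AP_START):
            self.started = True
            self.reads = 0

    def read(self, offset):
        if offset != AP_CTRL or not self.started:
            return 0
        self.reads += 1
        # Report ap_done only after the simulated busy period has passed.
        return AP_DONE if self.reads > self.busy_reads else 0


if __name__ == "__main__":
    print(run_and_wait(FakeIP(busy_reads=3)))
```

On the board, the three argument writes would stay as they are, the final `sqrt_ip.write(0x00, 1)` would be replaced by `run_and_wait(sqrt_ip)`, and `outpt` would be read only after it returns `True`. This relies on `sqrt_ip` exposing a `read(offset)` method alongside `write`.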