Implementation issues for a SHA256 hasher on PYNQ-Z2

Board: TUL PYNQ-Z2 (XC7Z020-1CLG400C)
PYNQ v2.5 framework

I have been working on HLS code for a SHA256 hashing engine for some time now, and I have successfully implemented the SHA256 algorithm as C code. The issue I face is that the design runs quite slowly: a software implementation of the same code runs at about the same speed as the FPGA.

The implemented block design runs at a maximum of 142 MHz, whereas HLS synthesis reports an estimated maximum frequency of about 250 MHz, sometimes higher. The gap between implementation and synthesis is understandable, since synthesis makes some assumptions about the design.

The issue is that when the overlay is downloaded to the PYNQ board and the design is run using driver code in Python, the average time taken to get the output is about 1.9 ms, which is far too slow.
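
For comparison, here is a minimal sketch of how the software baseline could be timed on the board's ARM cores. It assumes Python's standard hashlib as the reference and averages many calls with time.perf_counter. The helper name average_seconds and the placeholder message are illustrative and are not taken from the attached sha256_pynq.py.

```python
import hashlib
import time


def average_seconds(func, repeats=1000):
    """Return the mean wall-clock time of func() over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) / repeats


# Placeholder input; substitute the same bytes that are sent to the overlay.
message = b"abc"

software_time = average_seconds(lambda: hashlib.sha256(message).digest())
print(f"software SHA-256: {software_time * 1e6:.2f} us per call")
```

Averaging over many calls smooths out the jitter of a single measurement, so the same helper could wrap the hardware call to give a like-for-like comparison.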

Down below is the block design:-

HLS code is as follows:-

sha256_hls.cpp (5.9 KB)

The Python code running on PYNQ is as follows:-

sha256_pynq.py (1.2 KB)

This is the output I get:-

[ 84 104 101  32 103 105  97 110 116  32 109 111 117 115 101  32 108 101
109 117 114 115  32  40 103 101 110 117 115  32  77 105 114 122  97  41
32  97 114 101  32 112 114 105 109  97 116 101 115  32 110  97 116 105
118 101  32 116 111  32  77  97 100  97 103  97 115  99  97 114  44  32
108 105 107 101  32  97 108 108  32 111 116 104 101 114  32 108 101 109
117 114 115  46  32  84 104 101  32 116 119 111  32 100 101 115  99 114
105  98 101 100  32 115 112 101  99 105 101 115  44  32 116 104 101  32
110 111 114 116 104 101 114 110  32  40 112 105  99 116 117 114 101 100
41  32  97 110 100  32  67 111 113 117 101 114 101 108  39 115  32 103
105  97 110 116  32 109 111 117 115 101  32 108 101 109 117 114 115  44
32  97 114 101  32 102 111 117 110 100  32 105 110  32 116 104 101  32
119 101 115 116 101 114 110  32 100 114 121  32 100 101  99 105 100 117
111 117 115  32 102 111 114 101 115 116 115  44  32  83  97 109  98 105
114  97 110 111  32 118  97 108 108 101 121  32  97 110 100  32  83  97
104  97 109  97 108  97 122  97  32  80 101 110 105 110 115 117 108  97
46  32  73 110  32  49  56  55  48  44  32  66 114 105 116 105 115 104
32 122 111 111 108 111 103 105 115 116  32  74 111 104 110  32  69 100
119  97 114 100  32  71 114  97 121  32  97 115 115 105 103 110 101 100
32 116 104 101 109  32 116 111  32  77 105 114 122  97  44  32  98 117
116  32 116 104 101  32  99 108  97 115 115 105 102 105  99  97 116 105
111 110  32 119  97 115  32 110 111 116  32 119 105 100 101 108 121  32
97  99  99 101 112 116 101 100  32 117 110 116 105 108  32 116 104 101
32  49  57  57  48 115  44  32 102 111 108 108 111 119 105 110 103  32
116 104 101  32 114 101 118 105 118  97 108  32 111 102  32 116 104 101
32 103 101 110 117 115  32  98 121  32  65 109 101 114 105  99  97 110
32 112  97 108 101 111  97 110 116 104 114 111 112 111 108 111 103 105
115 116  32  73  97 110  32  84  97 116 116 101 114 115  97 108 108  32
105 110  32  49  57  56  50  46  32  71 105  97 110 116  32 109 111 117
115 101  32 108 101 109 117 114 115  32 119 101 105 103 104  32  97 112
112 114 111 120 105 109  97 116 101 108 121  32  51  48  48  32 103  32
40  49  49  32 111 122  41  32  97 110 100  32 104  97 118 101  32  97
32 108 111 110 103  44  32  98 117 115 104 121  32 116  97 105 108  46
0]
[ 61  97 199 242  83 125   2  59  63  34  66 130  14 130 237  74 149 103
228  58 108  58 115 222 216   1 190  12 234 244 108  98]
0.0019867420196533203

If I change the input to a much smaller string such as “abc”, the time remains approximately the same (for “abc”: 0.001973867416381836).
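
To put the two timings side by side, here is a rough back-of-the-envelope sketch. It assumes the fabric really is clocked at the 142 MHz quoted above and that the long input is 576 bytes (counted from the printed array without the trailing 0). Both are assumptions, and the function name sha256_block_count is only illustrative.

```python
CLOCK_HZ = 142e6  # assumed fabric clock, taken from the implementation figure above


def sha256_block_count(message_bytes):
    """Number of 64-byte blocks after SHA-256 padding (one 0x80 byte + 8-byte length)."""
    return (message_bytes + 1 + 8 + 63) // 64


# name -> (input length in bytes, measured seconds); 576 is an assumed length
runs = {
    "abc": (3, 0.001973867416381836),
    "long text": (576, 0.0019867420196533203),
}

for name, (length, seconds) in runs.items():
    blocks = sha256_block_count(length)
    cycles = seconds * CLOCK_HZ
    print(f"{name}: {blocks} block(s), about {cycles:,.0f} fabric cycles")
```

Under these assumptions the long input needs ten blocks and “abc” needs one, yet the two measurements differ by only about 13 microseconds. That suggests nearly all of the 1.9 ms is fixed per-call overhead rather than hashing work.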

  1. Is the latency observed due to the way I am calling the DMA in PYNQ?

  2. How can I improve my HLS and Python code?

Thanks in advance.

I think you are right: if different data sizes result in the same long latency, it probably means the DMA is taking the majority of the time. I think the streaming is not done properly, so the computation waits for all the DMA data to come in, and the DMA output only starts going out after the whole computation is done. The correct way is to overlay them, i.e. let the transfers and the computation run at the same time.

Thanks again for your reply @rock. Yes, you are right: it inputs the data, processes it, and then outputs it, all sequentially.

By overlaying do you mean pipelining?

I think pipelining the above flow is not possible when only a single string is handled, because the SHA256 of the first message can only be computed after the whole message has been received. It would, however, help with the intermediate data when a batch of messages is sent to be hashed.
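
The single-message versus batch argument can be sketched with a simple timing model. This is an idealised model, not a measurement: it assumes three stages (input transfer, hashing, output transfer) with fixed per-message times and enough buffering between them, and the function names and stage times are hypothetical.

```python
def sequential_time(n_messages, t_in, t_compute, t_out):
    """Each message is fully transferred in, hashed and transferred out before the next starts."""
    return n_messages * (t_in + t_compute + t_out)


def pipelined_time(n_messages, t_in, t_compute, t_out):
    """Ideal three-stage pipeline: the first message pays the full latency,
    every later message adds only the time of the slowest stage."""
    if n_messages == 0:
        return 0
    bottleneck = max(t_in, t_compute, t_out)
    return (t_in + t_compute + t_out) + (n_messages - 1) * bottleneck


# Hypothetical stage times in arbitrary units
t_in, t_compute, t_out = 1, 2, 3

for n in (1, 10):
    print(n, sequential_time(n, t_in, t_compute, t_out),
          pipelined_time(n, t_in, t_compute, t_out))
```

For one message both functions return the same value, which matches the point above: overlapping the stages cannot shorten the latency of a single hash. It only raises throughput once several messages are in flight.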

Could you suggest some possible ways to improve the initial latency observed?

Thanks