Board: TUL PYNQ-Z2 (XC7Z020-1CLG400C)
Framework: PYNQ v2.5
I have been developing HLS code for a SHA-256 hashing engine for some time now, and I have successfully implemented the SHA-256 algorithm in C. The issue is that the design runs quite slowly: the same code running in software is as fast as the FPGA implementation.
The implemented block design runs at a maximum clock frequency of 142 MHz, whereas HLS synthesis reports an estimated maximum frequency of about 250 MHz, sometimes higher. Some gap between implementation and synthesis is understandable, because the synthesis estimate rests on assumptions that the implemented design does not necessarily meet.
The real problem is that when the overlay is downloaded to the PYNQ board and the design is run from the Python driver code, the average time to get the output is about 1.9 ms, which is far too slow.
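To put the 1.9 ms in perspective, here is a rough back-of-the-envelope check in Python. The 142 MHz clock and the 1.9 ms figure are the measurements above; the figure of 64 clock cycles per 64-byte block (one cycle per SHA-256 round) is purely an illustrative assumption and is not taken from the HLS report, and the 576-byte message length is approximate.

```python
def sha256_block_count(msg_len_bytes):
    """Number of 64-byte blocks SHA-256 processes for a message.

    Padding adds one 0x80 byte and an 8-byte length field, then the
    total is rounded up to a multiple of 64 bytes.
    """
    return (msg_len_bytes + 1 + 8 + 63) // 64


def cycles_elapsed(clock_hz, seconds):
    """Clock cycles that pass in `seconds` at `clock_hz`."""
    return round(clock_hz * seconds)


CLOCK_HZ = 142e6        # implemented clock, from the post
MEASURED_S = 1.9e-3     # measured time per call, from the post
CYCLES_PER_BLOCK = 64   # ASSUMPTION: one cycle per round, illustrative only

budget = cycles_elapsed(CLOCK_HZ, MEASURED_S)
for n_bytes in (3, 576):  # "abc" and roughly the long test message
    blocks = sha256_block_count(n_bytes)
    ideal = blocks * CYCLES_PER_BLOCK
    print(f"{n_bytes} bytes: {blocks} block(s), ~{ideal} ideal cycles "
          f"vs {budget} cycles measured")
```

Under these assumptions the hashing itself would need well under a thousand cycles even for the long message, far below the measured budget, so the clock frequency is unlikely to be what limits the 1.9 ms.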
The block design is shown below:
The HLS code is as follows:
sha256_hls.cpp (5.9 KB)
The Python code running on the PYNQ board is as follows:
sha256_pynq.py (1.2 KB)
This is the output I get (input bytes, resulting digest, and elapsed time):
[ 84 104 101 32 103 105 97 110 116 32 109 111 117 115 101 32 108 101
109 117 114 115 32 40 103 101 110 117 115 32 77 105 114 122 97 41
32 97 114 101 32 112 114 105 109 97 116 101 115 32 110 97 116 105
118 101 32 116 111 32 77 97 100 97 103 97 115 99 97 114 44 32
108 105 107 101 32 97 108 108 32 111 116 104 101 114 32 108 101 109
117 114 115 46 32 84 104 101 32 116 119 111 32 100 101 115 99 114
105 98 101 100 32 115 112 101 99 105 101 115 44 32 116 104 101 32
110 111 114 116 104 101 114 110 32 40 112 105 99 116 117 114 101 100
41 32 97 110 100 32 67 111 113 117 101 114 101 108 39 115 32 103
105 97 110 116 32 109 111 117 115 101 32 108 101 109 117 114 115 44
32 97 114 101 32 102 111 117 110 100 32 105 110 32 116 104 101 32
119 101 115 116 101 114 110 32 100 114 121 32 100 101 99 105 100 117
111 117 115 32 102 111 114 101 115 116 115 44 32 83 97 109 98 105
114 97 110 111 32 118 97 108 108 101 121 32 97 110 100 32 83 97
104 97 109 97 108 97 122 97 32 80 101 110 105 110 115 117 108 97
46 32 73 110 32 49 56 55 48 44 32 66 114 105 116 105 115 104
32 122 111 111 108 111 103 105 115 116 32 74 111 104 110 32 69 100
119 97 114 100 32 71 114 97 121 32 97 115 115 105 103 110 101 100
32 116 104 101 109 32 116 111 32 77 105 114 122 97 44 32 98 117
116 32 116 104 101 32 99 108 97 115 115 105 102 105 99 97 116 105
111 110 32 119 97 115 32 110 111 116 32 119 105 100 101 108 121 32
97 99 99 101 112 116 101 100 32 117 110 116 105 108 32 116 104 101
32 49 57 57 48 115 44 32 102 111 108 108 111 119 105 110 103 32
116 104 101 32 114 101 118 105 118 97 108 32 111 102 32 116 104 101
32 103 101 110 117 115 32 98 121 32 65 109 101 114 105 99 97 110
32 112 97 108 101 111 97 110 116 104 114 111 112 111 108 111 103 105
115 116 32 73 97 110 32 84 97 116 116 101 114 115 97 108 108 32
105 110 32 49 57 56 50 46 32 71 105 97 110 116 32 109 111 117
115 101 32 108 101 109 117 114 115 32 119 101 105 103 104 32 97 112
112 114 111 120 105 109 97 116 101 108 121 32 51 48 48 32 103 32
40 49 49 32 111 122 41 32 97 110 100 32 104 97 118 101 32 97
32 108 111 110 103 44 32 98 117 115 104 121 32 116 97 105 108 46
0]
[ 61 97 199 242 83 125 2 59 63 34 66 130 14 130 237 74 149 103
228 58 108 58 115 222 216 1 190 12 234 244 108 98]
0.0019867420196533203
If I change the input to a much shorter string such as “abc”, the time stays approximately the same (for “abc”: 0.001973867416381836 s).
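Since the time barely changes with input size, most of it looks like a fixed per-call cost rather than a per-byte cost. A minimal sketch of how to separate the two, assuming the total time is roughly `fixed + per_byte * size`. The helper name `fit_overhead` is made up for illustration, the two times are the measurements above, and the byte counts are approximate.

```python
def fit_overhead(sizes, times):
    """Least-squares fit of times ~= fixed + per_byte * size.

    Returns (fixed_seconds, per_byte_seconds).
    """
    n = len(sizes)
    if n < 2 or n != len(times):
        raise ValueError("need at least two (size, time) pairs")
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    per_byte = sxy / sxx
    fixed = mean_y - per_byte * mean_x
    return fixed, per_byte


# Times are the two measurements from the post; the byte counts are
# approximate (ASSUMPTION: 3 bytes for "abc", 577 for the long buffer).
sizes = [3, 577]
times = [0.001973867416381836, 0.0019867420196533203]

fixed, per_byte = fit_overhead(sizes, times)
print(f"fixed cost per call: {fixed * 1e3:.3f} ms")
print(f"cost per byte      : {per_byte * 1e9:.1f} ns")
print(f"fixed share at 577 bytes: {fixed / times[1]:.1%}")
```

With only two single-shot measurements the fit is crude, but it already suggests that almost all of the 1.9 ms is fixed cost. Timing more sizes, each averaged over many runs, would make the estimate more trustworthy.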
- Is the latency I observe due to the way I am calling the DMA in PYNQ?
- How can I improve my HLS and Python code?
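One way to approach the first question is to time each phase of a DMA call separately, with buffer allocation and any data copying kept outside the timed region. The sketch below is not the code from sha256_pynq.py: the function name `time_dma_phases` is hypothetical, and it assumes a standard PYNQ AXI DMA object exposing `sendchannel` and `recvchannel` with `transfer()` and `wait()`, plus buffers created once beforehand with `pynq.allocate`.

```python
import time


def time_dma_phases(dma, in_buf, out_buf, repeats=100):
    """Average seconds spent starting and waiting on one DMA round trip.

    `dma` is expected to look like a pynq.lib.dma.DMA instance and the
    buffers like arrays from pynq.allocate (both ASSUMPTIONS about the
    design). Allocation happens outside this function on purpose, so it
    is never part of the measurement.
    """
    start_total = 0.0
    wait_total = 0.0
    for _ in range(repeats):
        t0 = time.perf_counter()
        # Arm the receive side first, then push the input data.
        dma.recvchannel.transfer(out_buf)
        dma.sendchannel.transfer(in_buf)
        t1 = time.perf_counter()
        # Block until both directions report completion.
        dma.sendchannel.wait()
        dma.recvchannel.wait()
        t2 = time.perf_counter()
        start_total += t1 - t0
        wait_total += t2 - t1
    return {"start": start_total / repeats, "wait": wait_total / repeats}


# Usage on the board (all names are placeholders for the actual design):
#   import numpy as np
#   from pynq import Overlay, allocate
#   ol = Overlay("sha256.bit")      # hypothetical bitstream name
#   dma = ol.axi_dma_0              # hypothetical DMA instance name
#   in_buf = allocate(shape=(577,), dtype=np.uint8)
#   out_buf = allocate(shape=(32,), dtype=np.uint8)
#   print(time_dma_phases(dma, in_buf, out_buf))
```

If most of the time lands in `start` rather than `wait`, the cost is probably in the Python driver calls rather than in the hardware. If the HLS core also has an AXI-Lite control interface that must be started for each message, that would be a further phase worth timing in the same way.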
Thanks in advance.