# GEORGE MASON UNIVERSITY

# **A Zynq-based Testbed for the Experimental Benchmarking of Algorithms Competing in Cryptographic Contests**



**Cryptographic Engineering Research Group (CERG)**

**Farnoud Farahmand, Ekawat Homsirikamol, and Kris Gaj**

**Department of Electrical and Computer Engineering, George Mason University, Fairfax, Virginia 22030, USA**

### INTRODUCTION

- $\blacktriangleright$  Hardware performance evaluation of candidates competing in cryptographic contests, such as SHA-3 and CAESAR, is very important for ranking their suitability for standardization.
- $\triangleright$  One of the most essential performance metrics is throughput, which depends strongly on the algorithm, the hardware implementation architecture, the coding style, and the tool options. The maximum throughput is calculated from the maximum clock frequency supported by the implementation of each algorithm.
- $\blacktriangleright$  In this project, we have developed a universal testbed, which is capable of measuring the maximum clock frequency experimentally, using a prototyping board. We are targeting cryptographic hardware cores, such as implementations of SHA-3 candidates. Our testbed is designed using a Zynq platform and takes advantage of software/hardware co-design.

- $\triangleright$  We measured the maximum clock frequency and the execution time of 12 Round 2 SHA-3 candidates experimentally on ZedBoard and compared the results with the frequencies reported by Xilinx Vivado.



Experimental benchmarking of cryptographic algorithms has previously been performed on platforms other than Zynq.

# RESULTS: Maximum Frequency

1. The maximum frequency of SHA-256 has been measured experimentally using the SLAAC-1V board, based on the Xilinx Virtex XCV1000.
2. Experimental measurement of the hardware performance of 14 Round 2 SHA-3 candidates has been performed using the SASEBO-GII FPGA board.

**Simplified block diagram of the PL side with the indication of two independent clocks:**



**Block Diagram of the Testbed with the division into Programmable logic (PL), Interconnects, and Processing System (PS):**

- $\triangleright$  A universal testbench has been developed in the Vivado environment to verify the operation of our testbed using simulation.
- $\triangleright$  The ATG (AXI Traffic Generator) IP is limited in its ability to generate specific data through the tdata port in AXI Stream mode. We therefore used a separate FIFO pre-filled with the desired data, so that the AXI Stream ATG provides only the control signals.
- $\triangleright$  AXI Lite ATGs are used to configure the Output FIFO and the AXI Stream ATG.

#### SYSTEM DESIGN



- $\triangleright$  ZedBoard and Vivado 2015.4 were used to generate the results. All Vivado Design Suite options, including synthesis and implementation settings, were left at their defaults.
- $\triangleright$  On the software side, a bare-metal environment and Xilinx SDK were used to run the C code on the ARM core of the Zynq.

► Max Freq. Experimental was determined as the worst-case value across all investigated input sizes, from 10 to 5000 kB.



Throughput Based on Exp. HW Exe. Time was obtained by dividing the message input size by the actual execution time of hashing in hardware, measured using AXI Timer for the input size equal to 1000 kB.
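The calculation described above can be sketched as follows. This is a minimal illustration, not the code used in the project: the function names, the 100 MHz timer clock, the tick count, and the convention 1 kB = 1000 bytes are all assumptions made for the example.

```python
def ticks_to_seconds(ticks: int, timer_clock_hz: float) -> float:
    """Convert a timer tick count into seconds (timer clock is an assumed parameter)."""
    if timer_clock_hz <= 0:
        raise ValueError("timer clock must be positive")
    return ticks / timer_clock_hz


def throughput_mbit_per_s(message_bytes: int, exe_time_s: float) -> float:
    """Throughput in Mbit/s: message size divided by the measured execution time."""
    if exe_time_s <= 0:
        raise ValueError("execution time must be positive")
    return message_bytes * 8 / exe_time_s / 1e6


# Hypothetical measurement: 400,000 ticks of a timer assumed to run at 100 MHz.
exe_time = ticks_to_seconds(400_000, 100e6)

# 1000 kB input, assuming 1 kB = 1000 bytes.
message_bytes = 1000 * 1000

print(f"{throughput_mbit_per_s(message_bytes, exe_time):.1f} Mbit/s")
```

With real data, the tick count would come from the AXI Timer readout and the timer clock from the actual design configuration.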

#### **Custom IPs:**



# VERIFICATION METHODOLOGY

 $\rightarrow$  Only algorithms with optimized software implementation and ARM architecture support are shown in this graph.

#### RESULTS: Data Transaction Overhead

Legend: Max Freq. Experimental Board 1 [MHz]; Max Freq. Experimental Board 2 [MHz]; Avg. Max Freq. Experimental [MHz]

#### **Universal testbench for Vivado environment**



- $\triangleright$  The testbed can be used to correctly measure the performance of designs with a maximum throughput of up to $64\ \text{bit} \cdot 150\ \text{MHz} = 9.6\ \text{Gbit/s}$.
- $\triangleright$  For all the investigated hash functions, the overhead of the communication between PS and PL was below 5% for 100 kB messages and negligible for messages above 500 kB.
- $\triangleright$  All algorithms have also demonstrated significant speed up vs. their execution in software on the same chip, in spite of the substantial speed of the ARM core, operating at 667 MHz.
- $\triangleright$  Our experiments have also demonstrated that the maximum experimental clock frequency was always higher than the post-place and route frequency calculated by Vivado using static timing analysis.
- $\triangleright$  At the same time, somewhat unexpectedly, the spread of the ratios of experimental to post-place and route frequency is very large, ranging from 1 to 2. This can be explained by the different influence of parameter variations and operating conditions on the critical path of each hash core, due to the different physical location (placement) of these critical paths in the FPGA fabric.
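The two quantities referred to in the points above, the interface throughput ceiling and the ratio of experimental to static-timing frequency, can be sketched as below. The function names and the example frequencies (300 MHz and 200 MHz) are hypothetical; only the 64-bit width and the 150 MHz clock come from the text.

```python
def interface_limit_gbit_per_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Upper bound on measurable throughput: bus width times interface clock."""
    return bus_width_bits * clock_mhz / 1000.0


def freq_ratio(experimental_mhz: float, static_mhz: float) -> float:
    """Ratio of experimental maximum frequency to the static timing analysis estimate."""
    if static_mhz <= 0:
        raise ValueError("static timing frequency must be positive")
    return experimental_mhz / static_mhz


# Values from the text: 64-bit data path at 150 MHz.
print(interface_limit_gbit_per_s(64, 150))  # 9.6

# Hypothetical core: 300 MHz measured on the board, 200 MHz reported by the tool.
print(freq_ratio(300.0, 200.0))  # 1.5
```

A core whose own throughput exceeds the interface limit would be bottlenecked by data transfer rather than by its clock, which is why the limit bounds what the testbed can measure correctly.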

**Maximum clock frequencies obtained using static timing analysis and the experimental measurement, respectively**



Legend: Max Freq. Static Timing Analysis [MHz]; Max Freq. Experimental [MHz]

#### **Maximum frequencies and throughputs**





#### **Formulas for the execution time and throughput**

Notation: T - clock period in ns, N - number of input blocks
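As a generic illustration of how such formulas are typically structured (not the per-algorithm formulas of this work), the execution time of an iterative hash core can be modeled as a fixed number of cycles plus a per-block cost. The symbols $c_0$ (fixed overhead cycles), $c_b$ (cycles per input block), and $b$ (block size in bits) are assumptions introduced only for this sketch; $T$ and $N$ follow the notation above.

$$
t_{exe}(N) = (c_0 + c_b \cdot N) \cdot T
$$

$$
\text{Throughput} = \lim_{N \to \infty} \frac{b \cdot N}{t_{exe}(N)} = \frac{b}{c_b \cdot T}
$$

With $T$ in ns and $b$ in bits, the throughput is obtained in Gbit/s. For long messages the fixed term $c_0$ becomes negligible, which is why the throughput depends only on the block size, the cycles per block, and the clock period.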



# RESULTS: SpeedUp vs. Software



**Algorithms shown: BLAKE, ECHO, Fugue, Luffa, Shabal, SHAvite-3, Skein. Input sizes: 10 kB, 500 kB, 5000 kB.**

**DMA core running at 100 MHz for all algorithms**



**DMA core running at 150 MHz in case of Luffa and 100 MHz for all other algorithms**

Axis label: Input Size (kB)

#### RESULTS: Experiment on Two Different ZedBoards



#### CONCLUSIONS

#### Acknowledgment

This material is based upon work supported by the National Science Foundation under Grant No. 1314540.

Cryptographic Engineering Research Group (CERG), Department of Electrical and Computer Engineering, George Mason University, http://cryptography.gmu.edu