SHA-2 Preimage Attacks with Intel SIMD Extensions

--[ 1. Introduction

Successfully carrying out a preimage or second preimage attack on modern
cryptographic hashing algorithms is hard. Very hard. However, in certain
circumstances, practical application of such attacks remains feasible due
to non-algorithmic factors, such as a predictable or limited preimage
space. Typically this refers to the determination of hashed password
preimages, although the principles carry over to other applications, such
as the "proof of work" algorithms used in most modern cryptocurrencies.
This article investigates the viability of using SIMD instructions to
increase the rate at which the key space can be exhausted in preimage
attacks.

---[ 1.1. SIMD Instructions

SIMD is by no means a recent development. Intel introduced its MMX
extensions to the x86 instruction set in 1996, and prior to that,
specialist microprocessors capable of performing SIMD instructions date
back as far as the early 1970s. However, it is fair to say that the
introduction of MMX and the later AVX extensions truly cemented SIMD as a
viable candidate for consumer-level high performance computing.

But what is SIMD? Single Instruction Multiple Data is a class of
instructions in which a single compute operation is performed across
multiple data lanes. The theory, of course, is that this yields a
significant gain in code performance; the recognised trade-off is
increased code complexity and development time.

Whilst the application of SIMD is usually associated with the processing
of graphics and audio data, it is generally applicable to any scenario
wherein precisely the same instructions can be applied to multiple data
sources without conditions derived from the resultant data itself. For
example, in image processing SIMD is often used to apply simple filters
to bitmaps, as the same operations are performed on each pixel without
variance in the filtering algorithm itself.
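To make the bitmap example concrete, here is a minimal sketch of a
brightness filter (the function and parameter names are illustrative,
not taken from any real library). The loop body is identical for every
pixel and contains no data-dependent branching, which is exactly the
shape a compiler can map onto SIMD lanes, e.g. 32 pixels per iteration
with AVX2's vpaddusb:

```c
#include <stdint.h>
#include <stddef.h>

/* Brighten an 8-bit grayscale bitmap in place. Every pixel undergoes
 * the same saturating add, so the loop vectorises cleanly. */
void brighten(uint8_t *pix, size_t n, uint8_t delta)
{
    for (size_t i = 0; i < n; i++) {
        unsigned v = pix[i] + delta;            /* widen to avoid wrap */
        pix[i] = (uint8_t)(v > 255 ? 255 : v);  /* saturate at white   */
    }
}
```

The saturation is expressed as a conditional here, but it compiles to a
branch-free select (or a single saturating-add instruction), so the
"no conditions derived from the data" property still holds.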
---[ 1.2. Preimage and Second Preimage Attacks

In cryptography, a preimage is the term given to the input value of a
message digest (hash). As hashing algorithms are, by design,
unidirectional and lossy, it is often desirable to compute the possible
set of preimages given a resultant digest. Or, in the case of second
preimage attacks, an additional preimage. Naturally this is a difficult
problem, and whilst several popular digests are known to have algorithmic
weaknesses, attacks on modern hashing algorithms commonly require
resorting to brute-force means. The use of SIMD in this article is a
means by which we can optimise the use of compute resources to exhaust
the possible preimage space.

The techniques discussed here are also applicable to the computation of
the proof of work defined in modern cryptocurrency standards. Whilst the
algorithms and techniques are similar, the work required to complete a
Bitcoin block is to find a value which, when combined with a known data
block and put through a double SHA256 digest, results in a hash with at
least n leading 0 bits. However, this is not a second preimage attack,
so it is mentioned only as a point of interest.

---[ 1.3. SHA2 & SIMD

The SHA-2 family of hashing algorithms operates on fixed-size words
(32-bit in the case of SHA256) and contains no conditional branches.
These properties make SHA-2 a perfect candidate for instruction-level
parallelisation. Additionally, there are several optimisations which can
be applied to the final four rounds of the SHA family of algorithms to
save valuable compute cycles. These optimisations are described in depth
in a later section.

--[ 2. Implementation of SHA256

This article will focus on the SHA256 compression function. Whilst the
benefits of SIMD are applicable throughout the entirety of the digest,
the compression function forms the core of the routine, and it is where
most of the compute cycles are expended.
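As an aside, the leading-zero test mentioned above can be sketched as
follows. This is a simplified illustration assuming a raw digest buffer
scanned from its most significant byte; the function and parameter
names are my own, not from any cryptocurrency implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Count the leading 0 bits of a digest. A candidate block meets a
 * difficulty of n when this returns at least n. */
unsigned leading_zero_bits(const uint8_t *digest, size_t len)
{
    unsigned bits = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = digest[i];
        if (b == 0) { bits += 8; continue; }  /* whole byte is zero   */
        while (!(b & 0x80)) {                 /* count zero MSBs      */
            bits++;
            b <<= 1;
        }
        break;                                /* first set bit found  */
    }
    return bits;
}
```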
The SHA2 compression function is as follows:

    for i from 0 to 63
        S1    := (e rotr 6) xor (e rotr 11) xor (e rotr 25)
        ch    := (e and f) xor ((not e) and g)
        temp1 := h + S1 + ch + k[i] + w[i]
        S0    := (a rotr 2) xor (a rotr 13) xor (a rotr 22)
        maj   := (a and b) xor (a and c) xor (b and c)
        temp2 := S0 + maj

        h := g
        g := f
        f := e
        e := d + temp1
        d := c
        c := b
        b := a
        a := temp1 + temp2

It is extremely difficult to draw comparisons between the computational
cost of SIMD and scalar instructions. We can compare the sum of the
cycle counts of each instruction, but modern pipelining, caching and
speculative execution techniques make this comparison somewhat
unreliable. Additionally, the concept of port contention in Intel's SIMD
implementation is not addressed here.

The following table outlines the counts of each instruction in the
compression function, along with the latency and throughput of the AVX2
equivalent from Intel's manual. Please note that there is no single SIMD
rotate-right instruction in x86 prior to AVX-512; a rotate therefore
equates to a right shift, a left shift and an or, as follows:

    rotr(x, n) = (x >> n) | (x << (32 - n))

Additionally, assignment operations are assumed to incur no cost,
although this is unlikely to be true.

    Ins. | Count | AVX2 Ins. | Latency | Throughput
    -----|-------|-----------|---------|------------
    shr  |    64 | vpsrld    |       1 | 0.5
    shl  |    64 | vpslld    |       1 | 0.5
    sub  |    64 | vpsubd    |       1 | 0.33
    or   |    64 | vpor      |       1 | 0.33
    xor  |   448 | vpxor     |       1 | 0.33
    and  |   320 | vpand     |       1 | 0.33
    not  |    64 | vpandn    |       1 | 0.33
    add  |   320 | vpaddd    |       1 | 0.33 / 0.5

As can be seen, the latency of every SIMD instruction used in SHA2's
compression function is 1. This means that the performance gain from
using SIMD should scale roughly with the width of the register.

--[ 3. Results

The following are the results from an implementation of SHA256 using
both SIMD instructions (AVX2) and a linear implementation. Care was
taken to ensure that the two implementations differ in the use of SIMD
alone.
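For reference, the pseudocode above translates directly into scalar C
as follows; in a SIMD build each uint32_t simply becomes one 32-bit
lane of a vector register. This is a sketch with the state laid out in
an array s[0..7] for the working variables a..h (a layout chosen here
for illustration):

```c
#include <stdint.h>

/* rotr as described above: a right shift, a left shift and an or.
 * Valid for 0 < n < 32. */
uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* One round of the SHA256 compression function, transcribed from the
 * pseudocode. ki and wi are the round constant and message schedule
 * word for this round. */
void sha256_round(uint32_t s[8], uint32_t ki, uint32_t wi)
{
    uint32_t a = s[0], b = s[1], c = s[2], d = s[3];
    uint32_t e = s[4], f = s[5], g = s[6], h = s[7];

    uint32_t S1    = rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25);
    uint32_t ch    = (e & f) ^ (~e & g);
    uint32_t temp1 = h + S1 + ch + ki + wi;
    uint32_t S0    = rotr32(a, 2) ^ rotr32(a, 13) ^ rotr32(a, 22);
    uint32_t maj   = (a & b) ^ (a & c) ^ (b & c);
    uint32_t temp2 = S0 + maj;

    /* rotate the working variables one position */
    s[7] = g; s[6] = f; s[5] = e; s[4] = d + temp1;
    s[3] = c; s[2] = b; s[1] = a; s[0] = temp1 + temp2;
}
```

Note that (not e) and g maps to the single vpandn instruction, which is
why "not" is counted together with its and in the table above.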
    SIMD Performance over 1000000 iterations
    Fastest (8h/s): 2040816
    Slowest (8h/s): 67485
    Average (8h/s): 1928033
    Hashes/core/s : 15424264

    Linear Performance over 1000000 iterations
    Fastest (h/s): 2857142
    Slowest (h/s): 268312
    Average (h/s): 2783811
    Hashes/core/s: 2783811

Taking the average hashes per core per second (h/c/s), we see a 5.54x
performance increase over the linear implementation.

--[ 4. Attacks on the Final 4 Rounds

The SHA2 compression function is repeated a total of 64 times. In each
iteration, new values are computed for the working variables a and e,
and all variables rotate one position (a => b, b => c, etc.). There is a
small weakness in this algorithm which means that the final values of
the fourth and eighth dwords of the digest can be determined in round
61, without the need to complete rounds 62-64. When performing a
preimage attack on the digest, this means that we can abandon
computation of the hash before having to complete all 64 rounds should
these values not match the target.

For example, the value of d + temp1 is assigned to e in round 61. In
round 62, e is simply assigned (unmodified) to f; in round 63 it is
assigned to g; and in round 64 it is assigned to h. Finally, the value
of h is added to the eighth DWORD of the rolling hash value (H) to form
the final digest. Therefore we can take the eighth dword of the target
digest, subtract the eighth dword of H from the start of the compression
function, and compare the result to the value of e in round 61. If they
do not match, then this candidate is incorrect. Similar logic can be
applied to all variables (a-h) throughout rounds 61-64. This technique
reduces the work required to perform a preimage attack by up to 1/16th.
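The round-61 check described above can be sketched as follows. The
names are illustrative: h_init is the rolling hash value H at the start
of the compression function, target is the digest under attack, and e61
is the value d + temp1 computed for e in round 61:

```c
#include <stdint.h>

/* Returns non-zero when the candidate can be abandoned at round 61.
 * Because e passes unmodified through f, g and h during rounds 62-64
 * and is then added to H[7], a correct preimage must satisfy
 * e61 == target[7] - H[7] (mod 2^32). */
int can_abort_round61(const uint32_t h_init[8],
                      const uint32_t target[8],
                      uint32_t e61)
{
    return e61 != (uint32_t)(target[7] - h_init[7]);
}
```

The expected value target[7] - H[7] is constant for a given target, so
it can be computed once outside the brute-force loop, making the abort
test a single compare per candidate.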