SHA-2 Preimage Attacks with Intel SIMD Extensions

--[ 1. Introduction

Successfully carrying out a preimage or second preimage attack on modern
cryptographic hashing algorithms is hard. Very hard. However, in certain
circumstances, practical application of such attacks remains feasible due
to non-algorithmic factors, such as a predictable or limited preimage
space. Typically this refers to the determination of hashed password
preimages, although the principles carry over to other applications, such
as the "proof of work" algorithms used in most modern cryptocurrencies.
This article investigates the viability of using SIMD instructions to
increase the rate at which the key space can be exhausted in preimage
attacks.

---[ 1.1. SIMD Instructions

SIMD is by no means a recent development. Intel introduced its MMX
extensions to the x86 instruction set in 1996, and prior to that,
specialist microprocessors capable of performing SIMD instructions date
back as far as the early 1970s. However, it is fair to say that the
introduction of MMX and the later AVX extensions truly cemented SIMD as a
viable candidate for consumer-level high performance computing.

But what is SIMD? Single Instruction Multiple Data is a class of
instructions in which a single compute operation is performed across
multiple data lanes. The theory, of course, is that this yields a
significant gain in code performance; the recognised trade-off is
increased code complexity and development time.

Whilst the application of SIMD is usually associated with the processing
of graphics and audio data, it is generally applicable to any scenario
wherein precisely the same instructions can be applied to multiple data
sources without conditions derived from the resultant data itself. For
example, in image processing SIMD is often used to apply simple filters
to bitmaps, as the same operations are performed on each pixel without
variance in the filtering algorithm itself.
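To make the bitmap example concrete, here is a minimal sketch of a
brightness filter (the function and parameter names are illustrative,
not taken from any real library). The loop body is identical for every
pixel and contains no data-dependent branching, which is exactly the
shape a compiler can map onto SIMD lanes, e.g. 32 pixels per iteration
with AVX2's vpaddusb:

```c
#include <stdint.h>
#include <stddef.h>

/* Brighten an 8-bit grayscale bitmap in place. Every pixel undergoes
 * the same saturating add, so the loop vectorises cleanly. */
void brighten(uint8_t *pix, size_t n, uint8_t delta)
{
    for (size_t i = 0; i < n; i++) {
        unsigned v = pix[i] + delta;            /* widen to avoid wrap */
        pix[i] = (uint8_t)(v > 255 ? 255 : v);  /* saturate at white   */
    }
}
```

The saturation is expressed as a conditional here, but it compiles to a
branch-free select (or a single saturating-add instruction), so the
"no conditions derived from the data" property still holds.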
---[ 1.2. Preimage and Second Preimage Attacks

In cryptography, a preimage is the term given to the input value of a
message digest (hash). As hashing algorithms are, by design,
unidirectional and lossy, it is often desirable to compute the possible
set of preimages given a resultant digest. Or, in the case of second
preimage attacks, an additional preimage. Naturally this is a difficult
problem, and whilst several popular digests are known to have algorithmic
weaknesses, attacks on modern hashing algorithms commonly require
resorting to brute-force means. The use of SIMD in this article is a
means by which we can optimise the use of compute resources to exhaust
the possible preimage space.

The techniques discussed here are also applicable to the computation of
the proof of work defined in modern cryptocurrency standards. Whilst the
algorithms and techniques are similar, the work required to complete a
Bitcoin block is to find a value which, when combined with a known data
block and put through a double SHA256 digest, results in a hash with at
least n leading 0 bits. However, this is not a second preimage attack,
so it is mentioned only as a point of interest.

---[ 1.3. SHA2 & SIMD

The SHA-2 family of hashing algorithms operates on fixed-size words
(32-bit in the case of SHA256) and contains no conditional branches.
These properties make SHA-2 a perfect candidate for instruction-level
parallelisation. Additionally, there are several optimisations which can
be applied to the final four rounds of the SHA family of algorithms to
save valuable compute cycles. These optimisations are described in depth
in a later section.

--[ 2. Implementation of SHA256

This article will focus on the SHA256 compression function. Whilst the
benefits of SIMD are applicable throughout the entirety of the digest,
the compression function forms the core of the routine, and it is where
most of the compute cycles are expended.
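As an aside, the leading-zero test mentioned above can be sketched as
follows. This is a simplified illustration assuming a raw digest buffer
scanned from its most significant byte; the function and parameter
names are my own, not from any cryptocurrency implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Count the leading 0 bits of a digest. A candidate block meets a
 * difficulty of n when this returns at least n. */
unsigned leading_zero_bits(const uint8_t *digest, size_t len)
{
    unsigned bits = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = digest[i];
        if (b == 0) { bits += 8; continue; }  /* whole byte is zero   */
        while (!(b & 0x80)) {                 /* count zero MSBs      */
            bits++;
            b <<= 1;
        }
        break;                                /* first set bit found  */
    }
    return bits;
}
```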
The SHA2 compression function is as follows:

    for i from 0 to 63
        S1    := (e rotr 6) xor (e rotr 11) xor (e rotr 25)
        ch    := (e and f) xor ((not e) and g)
        temp1 := h + S1 + ch + k[i] + w[i]
        S0    := (a rotr 2) xor (a rotr 13) xor (a rotr 22)
        maj   := (a and b) xor (a and c) xor (b and c)
        temp2 := S0 + maj

        h := g
        g := f
        f := e
        e := d + temp1
        d := c
        c := b
        b := a
        a := temp1 + temp2

It is extremely difficult to draw comparisons between the computational
cost of SIMD and scalar instructions. We can compare the sum of the
cycle counts of each instruction, but modern pipelining, caching and
speculative execution techniques make this comparison somewhat
unreliable. Additionally, the concept of port contention in Intel's SIMD
implementation is not addressed here.

The following table outlines the counts of each instruction in the
compression function, along with the latency and throughput of the AVX2
equivalent from Intel's manual. Please note that there is no single SIMD
rotate-right instruction in x86 prior to AVX-512; a rotate therefore
equates to a right shift, a left shift and an or, as follows:

    rotr(x, n) = (x >> n) | (x << (32 - n))

Additionally, assignment operations are assumed to incur no cost,
although this is unlikely to be true.

    Ins. | Count | AVX2 Ins. | Latency | Throughput
    -----|-------|-----------|---------|------------
    shr  |    64 | vpsrld    |       1 | 0.5
    shl  |    64 | vpslld    |       1 | 0.5
    sub  |    64 | vpsubd    |       1 | 0.33
    or   |    64 | vpor      |       1 | 0.33
    xor  |   448 | vpxor     |       1 | 0.33
    and  |   320 | vpand     |       1 | 0.33
    not  |    64 | vpandn    |       1 | 0.33
    add  |   320 | vpaddd    |       1 | 0.33 / 0.5

As can be seen, the latency of every SIMD instruction used in SHA2's
compression function is 1. This means that the performance gain from
using SIMD should scale roughly with the width of the register.

--[ 3. Results

The following are the results from an implementation of SHA256 using
both SIMD instructions (AVX2) and a linear implementation. Care was
taken to ensure that the two implementations differ in the use of SIMD
alone.
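For reference, the pseudocode above translates directly into scalar C
as follows; in a SIMD build each uint32_t simply becomes one 32-bit
lane of a vector register. This is a sketch with the state laid out in
an array s[0..7] for the working variables a..h (a layout chosen here
for illustration):

```c
#include <stdint.h>

/* rotr as described above: a right shift, a left shift and an or.
 * Valid for 0 < n < 32. */
uint32_t rotr32(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}

/* One round of the SHA256 compression function, transcribed from the
 * pseudocode. ki and wi are the round constant and message schedule
 * word for this round. */
void sha256_round(uint32_t s[8], uint32_t ki, uint32_t wi)
{
    uint32_t a = s[0], b = s[1], c = s[2], d = s[3];
    uint32_t e = s[4], f = s[5], g = s[6], h = s[7];

    uint32_t S1    = rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25);
    uint32_t ch    = (e & f) ^ (~e & g);
    uint32_t temp1 = h + S1 + ch + ki + wi;
    uint32_t S0    = rotr32(a, 2) ^ rotr32(a, 13) ^ rotr32(a, 22);
    uint32_t maj   = (a & b) ^ (a & c) ^ (b & c);
    uint32_t temp2 = S0 + maj;

    /* rotate the working variables one position */
    s[7] = g; s[6] = f; s[5] = e; s[4] = d + temp1;
    s[3] = c; s[2] = b; s[1] = a; s[0] = temp1 + temp2;
}
```

Note that (not e) and g maps to the single vpandn instruction, which is
why "not" is counted together with its and in the table above.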
    SIMD Performance over 1000000 iterations
    Fastest (8h/s): 2040816
    Slowest (8h/s): 67485
    Average (8h/s): 1928033
    Hashes/core/s : 15424264

    Linear Performance over 1000000 iterations
    Fastest (h/s): 2857142
    Slowest (h/s): 268312
    Average (h/s): 2783811
    Hashes/core/s: 2783811

Taking the average hashes per core per second (h/c/s), we see a 5.54x
performance increase over the linear implementation.

--[ 4. Attacks on the Final 4 Rounds

The SHA2 compression function is repeated a total of 64 times. In each
iteration, new values are computed for the working variables a and e,
and all variables rotate one position (a => b, b => c, etc.). There is a
small weakness in this algorithm which means that the final values of
the fourth and eighth dwords of the digest can be determined in round
61, without the need to complete rounds 62-64. When performing a
preimage attack on the digest, this means that we can abandon
computation of the hash before having to complete all 64 rounds should
these values not match the target.

For example, the value of d + temp1 is assigned to e in round 61. In
round 62, e is simply assigned (unmodified) to f; in round 63 it is
assigned to g; and in round 64 it is assigned to h. Finally, the value
of h is added to the eighth DWORD of the rolling hash value (H) to form
the final digest. Therefore we can take the eighth dword of the target
digest, subtract the eighth dword of H from the start of the compression
function, and compare the result to the value of e in round 61. If they
do not match, then this candidate is incorrect. Similar logic can be
applied to all variables (a-h) throughout rounds 61-64. This technique
reduces the work required to perform a preimage attack by up to 1/16th.
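The round-61 check described above can be sketched as follows. The
names are illustrative: h_init is the rolling hash value H at the start
of the compression function, target is the digest under attack, and e61
is the value d + temp1 computed for e in round 61:

```c
#include <stdint.h>

/* Returns non-zero when the candidate can be abandoned at round 61.
 * Because e passes unmodified through f, g and h during rounds 62-64
 * and is then added to H[7], a correct preimage must satisfy
 * e61 == target[7] - H[7] (mod 2^32). */
int can_abort_round61(const uint32_t h_init[8],
                      const uint32_t target[8],
                      uint32_t e61)
{
    return e61 != (uint32_t)(target[7] - h_init[7]);
}
```

The expected value target[7] - H[7] is constant for a given target, so
it can be computed once outside the brute-force loop, making the abort
test a single compare per candidate.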