Computing Adler32 Checksums at 41 GB/s

While looking through the fpng source code, I noticed that its vectorized adler32 implementation seemed somewhat complicated, especially given how simple the scalar version of adler32 is. I was curious to see if I could come up with a simpler method, and in doing so, I came up with an algorithm that can be up to 7x faster than fpng's version, and 109x faster than the simple scalar version.

If you are unfamiliar with adler32, it is a simple checksumming algorithm used by rsync and zlib (and by extension PNG), designed to optimize for speed over reliability.

The 32-bit check value is composed of two unsigned 16 bit counters, modulo 65521 (the highest unsigned 16-bit prime):

sum, which is incremented by the value of the current byte each iteration
sum2, which is incremented by the value of sum each iteration

The final checksum is then created by setting the lower 16 bits to sum, and the upper 16 bits to sum2.

A simple implementation usually looks something like this:

#define ADLER_HASH_INIT 1
#define ADLER_MOD 65521

uint32_t Adler32Simple(uint32_t start, uint8_t *data, size_t len) {
  uint32_t sum  = start & 0xFFFF;
  uint32_t sum2 = start >> 16;

  const uint8_t *end = data + len;
  while (data != end) {
    sum += *data;
    sum %= ADLER_MOD;
    sum2 += sum;
    sum2 %= ADLER_MOD;
    data++;
  }

  return  (sum2 << 16) | sum;
}

There are two main issue with performance here:

Processing each byte requires two very expensive modulo operations by a non power of two (so it cannot be sped up by using strength reduction to bitwise and)
Each iteration has a dependency on the previous iteration, so unrolling won't allow instruction level parallelism to help much, and the modulo operation will always bottleneck the loop.

Deferring modulo

Both of these are relatvely easy to fix and produce a fast scalar implementation similar to what zlib uses (somewhat surprisingly, zlib appears to lack any form of vectorized checksumming algorithm).

Knowing that (a+b) mod n = ((a mod n) + (b mod n)) mod n means that the modulo operations can be moved outside of the loop, removing the bottleneck:

#define ADLER_HASH_INIT 1
#define ADLER_MOD 65521

uint32_t Adler32DeferMod(uint32_t start, uint8_t *data, size_t len) {
  uint32_t sum  = start & 0xFFFF;
  uint32_t sum2 = start >> 16;

  const uint8_t *end = data + len;
  while (data != end) {
    sum += *data;
    sum2 += sum;
    data++;
  }

  sum %= ADLER_MOD;
  sum2 %= ADLER_MOD;

  return  (sum2 << 16) | sum;
}

However, this does not account for overflow in sum or sum2.

To calculate the maximum number of bytes we can process before overflow might happen, we need to find a number n such that:

max_sum(n) = maximum value of sum after n bytes
max_sum2(n) = maximum value of sum2 after n bytes
max(max_sum(n), max_sum2(n)) < 2^bits - 1

We can easily determine that max_sum2(n) >= max_sum(n), because sum is added to sum2 each iteration, and both are always positive or zero, so we can simplify to max_sum2(n) < 2^bits - 1.

A loose upper bound of max_sum(n) * n can be put on max_sum2(n), and we know that max_sum(n) = 255 * n, because 255 is the max value for each byte, giving us 255*n*n. However, this is a conservative upper bound, because sum will not start out at max_sum(n).

To find an accurate upper bound, we can recognize that max_sum2(n) is an arithmetic series:

sum(x) = 255 * x
max_sum2(n) = sum 255*i where i=1..n
; sum of an arithmetic series is (n/2) * (2a + (n-1)d)
; where a is the first term, and d is the difference between each term
max_sum2(n) = (n/2) * (2(255) * (n-1)(255))
max_sum2(n) = (n/2) * (n-1+2)(255)
max_sum2(n) = (n/2) * (n+1)(255)
max_sum2(n) = 255n(n+1))/2

255n(n+1)/2 < 2^bits - 1

This will get us close to the true upper bound, but undershoots a bit because it does not account for the starting value of each counter. The maximum starting value for each counter is the modulo (65521) minus one:

255n(n+1)/2 + (n+1)(65521-1) < 2^bits - 1
when bits = 32, floor(n) = 5552 bytes
when bits = 64, floor(n) = 380368439 bytes (~380 megabytes)

Now that we have determined the upper bound, we can process input in n-sized chunks accordingly:

#define ADLER_INIT 1
#define ADLER_MOD 65521
#define ADLER_CHUNK_LEN_32 5552

/* 64 bit version omitted due to it's similarity */
uint32_t Adler32DeferMod32(uint32_t start, uint8_t *data, size_t len) {
  uint32_t sum  = start & 0xFFFF;
  uint32_t sum2 = start >> 16;

  while (len) {
    size_t chunk_len = len;
    if (chunk_len > ADLER_CHUNK_LEN_32)
      chunk_len = ADLER_CHUNK_LEN_32;
    len -= chunk_len;

    const uint8_t *chunk_end = data + chunk_len;
    while (data != chunk_end) {
      sum  += *data;
      sum2 += sum;
      data++;
    }

    sum  %= ADLER_MOD;
    sum2 %= ADLER_MOD;
  }

  return  (sum2 << 16) | sum;
}

The 32 bit version ends up around 5x faster than the original version, but surprisingly the 64 bit version is either the same speed as the 32 bit version, or slower in some cases.

SIMD/AVX2

We can speed this up further by using SIMD to operate on chunks of bytes at the same time. However, we need to break the dependency between each iteration of the loop to do so.

By manually unrolling the loop, we can simplify it algebraically into a form that can easily be computed through SIMD:

/* unrolled two loop iterations */
sum += data[0];
sum2 += sum;
sum += data[1];
sum2 += sum;

/* split dependency between sum and sum2 for each element */
start_sum = sum;
sum += data[0];
sum2 += start_sum + data[0];
sum += data[1];
sum += start_sum + data[0] + data[1];

/* combine like terms */
start_sum = sum;
sum += data[0] + data[1];
sum2 += 2*start_sum + 2*data[0] + data[1];

/* remove temporary variable */
sum2 += 2*start_sum + 2*data[0] + data[1];
sum += data[0] + data[1];

/* generalize into sum for n bytes */
sum += data[0] + data[1] + ... + data[n];
sum2 += n*start_sum + n*data[0] + (n-1)*data[1] + ... + (n-n)*data[n];

Now that we have a generalized algorithm for computing adler32 in blocks, we can implement it using 32-byte blocks using AVX2. Due to the chunk size of 32, the maximum chunk length must be lowered to the highest multiple of 32 under the 32 bit chunk length to avoid having to compute a remainder every cycle.

#define ADLER_HASH_INIT 1
#define ADLER_MOD 65521
#define ADLER_CHUNK_LEN_32 5552
#define ADLER_CHUNK_LEN_SIMD_32\
  (ADLER_CHUNK_LEN_32/32)*32

uint32_t Adler32AVX(uint32_t start, uint8_t *data, size_t len) {
  const __m256i zero_v = _mm256_setzero_si256();
  const __m256i one_epi16_v = _mm256_set1_epi16(1);
  const __m256i coeff_v = _mm256_set_epi8(
    1,   2,  3,  4,  5,  6,  7,  8,
    9,  10, 11, 12, 13, 14, 15, 16,
    17, 18, 19, 20, 21, 22, 23, 24,
    25, 26, 27, 28, 29, 30, 31, 32
  );

  uint32_t sum  = start & 0xFFFF;
  uint32_t sum2 = start >> 16;

  while (len >= 32) {
    size_t chunk_len = len;
    chunk_len -= chunk_len % 32;
    if (chunk_len > ADLER_CHUNK_LEN_SIMD_32)
      chunk_len = ADLER_CHUNK_LEN_SIMD_32;
    len -= chunk_len;

    __m256i sum_v = _mm256_setzero_si256();
    __m256i sum2_v = _mm256_setzero_si256();

    uint8_t *chunk_end = data + chunk_len;
    while (data < chunk_end) {
      __m256i chunk_v = _mm256_loadu_si256(data);
      data += 32;

      /* multiply each byte by the coefficient, and sum adjacent bytes into
       * 16 bit integers */
      __m256i mad = _mm256_maddubs_epi16(chunk_v, coeff_v);
      sum2_v = _mm256_add_epi32(sum2_v, _mm256_madd_epi16(mad, one_epi16_v));

      /* add n*sum to sum2 */
      sum2_v = _mm256_add_epi32(sum2_v, _mm256_slli_epi32(sum_v, 5));

      /* sum every consecutive 8 bytes together into 4 64-bit integers, then
       * add to sum_v */
      sum_v = _mm256_add_epi32(sum_v, _mm256_sad_epu8(chunk_v, zero_v));
    }

    sum2 += sum * chunk_len;
    sum2 += Sum256epi32(sum2_v);
    sum += Sum256epi32(sum_v);

    sum %= ADLER_MOD;
    sum2 %= ADLER_MOD;
  }
  return Adler32DeferMod32((sum2 << 16) | sum, data, len);
}

This algorithm is considerably simpler than the fpng version, using just 2 loop variables for sum and sum2, and 8 instructions per iteration.

Results

# environment:
#  * i5-10210U (ultranotebook/low power cpu, 256KiB L1, 1MiB L2, 6MiB L3)
#  * 2667 MHz DDR4 memory
#  * compiled using g++ 10.2.1 & benchmarked using nanobenchmark
# implementations:
#  * normal, defer32, defer64, and avx are all compiled as written above
#  * fpng-sse is taken directly from fpng (https://github.com/richgel999/fpng)
#  * avx64 is very similar to avx, except it works on 64 bytes per iteration,
#    and computes the sum*n section outside of the loop.
|- 16KB chunks -----------------------------|
| Fits within L1 cache.                     |
| 20,000 iterations per benchmark           |
|-------------------------------------------|
| normal   |   381 MB/s |  0.18 bytes/cycle |
| defer32  |  2.33 GB/s |  1.17 bytes/cycle |
| defer64  |  1.24 GB/s |  0.62 bytes/cycle |
| fpng-sse |  7.34 GB/s |  3.70 bytes/cycle |
| avx      | 26.54 GB/s | 13.55 bytes/cycle |
| avx64    | 41.70 GB/s | 21.39 bytes/cycle |
|- 30MB chunks -----------------------------|
| Roughly the size of an uncompressed 4k    |
| image. Won't fit in my CPU cache at all.  |
| 200 iterations per benchmark              |
|-------------------------------------------|
| normal   |   355 MB/s |  0.18 bytes/cycle |
| defer32  |  2.08 GB/s |  1.06 bytes/cycle |
| defer64  |  2.06 GB/s |  1.05 bytes/cycle |
| fpng-sse |  5.89 GB/s |  2.99 bytes/cycle |
| avx      | 11.19 GB/s |  5.85 bytes/cycle |
| avx64    | 11.91 GB/s |  6.16 bytes/cycle |
|- 256MB chunks ----------------------------|
| 20 iterations per benchmark               |
|-------------------------------------------|
| normal   |   358 MB/s |  0.18 bytes/cycle |
| defer32  |  2.15 GB/s |  1.09 bytes/cycle |
| defer64  |  2.16 GB/s |  1.09 bytes/cycle |
| fpng-sse |  6.80 GB/s |  3.44 bytes/cycle |
| avx      | 26.19 GB/s | 12.50 bytes/cycle |
| avx64    | 30.72 GB/s | 15.32 bytes/cycle |
|-------------------------------------------|

As expected, the normal version is by far the slowest, and all of the vectorized implementations are considerably faster than the scalar versions.

However, some things that are surprising here:

As mentioned before, defer64 is either slower or the same speed at best compared to defer32, despite needing to compute modulo less often.
Both avx and avx64 are considerably slower on 30MB chunks compared to 16KB and 256MB. It isn't surprising that it is slower than 16KB, because it won't fit in the CPU cache, but it's unclear to me why it's faster on 256MB, which also doesn't fit in cache.

There is still a lot of room to micro-optimize both the avx and avx64 implementation, but there is diminishing returns especially due to it working faster than the speed of my RAM (2667MT/s * 8 = ~21 GB/s).