Project: Part3 – Optimizing and porting argon2 package using C and Assembler language (Progress 4)

Requirements/ System Specifications.

Argon2 Password hashing function package:

https://github.com/P-H-C/phc-winner-argon2

Machine 1:

Aarch64 Fedora 28 version of Linux operating system

Cortex-A57 8 core processor

Two sticks of Dual-Channel DIMM DDR3 8GB RAM (16GB in total)

Machine 2:

Intel(R) Xeon(R) CPU E5-1630 v4 @ 3.70GHz

Four sticks of 8GB DIMM DDR4 RAM at 2.4 GHz (32 GB of RAM in total)

x86_64 Fedora 28 version of Linux Operating System

Continuation of the Project: Part3 – Optimizing and porting argon2 package using C and Assembler language (Progress 3) blog:

I have tested the modified code shown here:

/*
* Argon2 reference source code package - reference C implementations
*
* Copyright 2015
* Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
*
* You may use this work under the terms of a Creative Commons CC0 1.0
* License/Waiver or the Apache Public License 2.0, at your option. The terms of
* these licenses can be found at:
*
* - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
* - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0
*
* You should have received a copy of both of these licenses along with this
* software. If not, they may be obtained at the above URLs.
*/

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#define BILLION 1000000000L
#ifdef _MSC_VER
#include <intrin.h>
#endif

#include "argon2.h"

/*
static uint64_t rdtsc(void) {
#ifdef _MSC_VER
return __rdtsc();
#else
#if defined(__amd64__) || defined(__x86_64__)
uint64_t rax, rdx;
__asm__ __volatile__("rdtsc" : "=a"(rax), "=d"(rdx) : :);
return (rdx << 32) | rax;
#elif defined(__i386__) || defined(__i386) || defined(__X86__)
uint64_t rax;
__asm__ __volatile__("rdtsc" : "=A"(rax) : :);
return rax;
#elif defined(__aarch64__)
return 1;
#else
return 0;
#endif
#endif
}

*/


/*
* Benchmarks Argon2 with salt length 16, password length 16, t_cost 3,
and different m_cost and threads
*/
static void benchmark() {
#define BENCH_OUTLEN 16
#define BENCH_INLEN 16
    const uint32_t inlen = BENCH_INLEN;
    const unsigned outlen = BENCH_OUTLEN;
    unsigned char out[BENCH_OUTLEN];
    unsigned char pwd_array[BENCH_INLEN];
    unsigned char salt_array[BENCH_INLEN];
#undef BENCH_INLEN
#undef BENCH_OUTLEN

    struct timespec start, stop;
    double accum;

    uint32_t t_cost = 3;
    uint32_t m_cost;
    uint32_t thread_test[4] = {1, 2, 4, 8};
    argon2_type types[3] = {Argon2_i, Argon2_d, Argon2_id};

    memset(pwd_array, 0, inlen);
    memset(salt_array, 1, inlen);

    for (m_cost = (uint32_t)1 << 10; m_cost <= (uint32_t)1 << 22; m_cost *= 2) {
        unsigned i;
        for (i = 0; i < 4; ++i) {
            double run_time = 0;
            uint32_t thread_n = thread_test[i];
            unsigned j;
            for (j = 0; j < 3; ++j) {
                /*clock_t start_time, stop_time;
                uint64_t start_cycles, stop_cycles;
                uint64_t delta;
                double mcycles;*/

                argon2_type type = types[j];

                /*start_time = clock();
                start_cycles = rdtsc();*/

                /* Read the start time once; abort if the clock fails. */
                if (clock_gettime(CLOCK_REALTIME, &start) == -1) {
                    perror("clock gettime");
                    exit(EXIT_FAILURE);
                }

                argon2_hash(t_cost, m_cost, thread_n, pwd_array, inlen,
                            salt_array, inlen, out, outlen, NULL, 0, type,
                            ARGON2_VERSION_NUMBER);

                /*stop_cycles = rdtsc();
                stop_time = clock();*/

                /*delta = (stop_cycles - start_cycles) / (m_cost);
                mcycles = (double)(stop_cycles - start_cycles) / (1UL << 20);
                run_time += ((double)stop_time - start_time) / (CLOCKS_PER_SEC);*/

                /* Read the stop time once; abort if the clock fails. */
                if (clock_gettime(CLOCK_REALTIME, &stop) == -1) {
                    perror("clock gettime");
                    exit(EXIT_FAILURE);
                }

                /* Elapsed wall-clock time in seconds. */
                accum = ((double)stop.tv_sec - (double)start.tv_sec)
                        + ((double)stop.tv_nsec - (double)start.tv_nsec) / BILLION;

                /* Both values below are derived from nanoseconds, not from a
                 * cycle counter: mcycles is ns / 2^20, delta is ns per KiB. */
                double mcycles = accum * BILLION;
                mcycles = mcycles / (1UL << 20);
                uint64_t delta = accum * BILLION;
                delta = delta / (m_cost);

                printf("%s %d iterations %d MiB %d threads: %2.2f cpb %2.2f "
                       "Mcycles \n", argon2_type2string(type, 1), t_cost,
                       m_cost >> 10, thread_n, (float)delta / 1024, mcycles);

                /*run_time += accum;
                printf("%2.4f seconds\n\n", (double)run_time);*/
            }

            /*run_time = 0;*/
            /* accum still holds the last (Argon2id) run, so this prints that
             * run's time rather than the total for all three variants. */
            run_time += accum;
            printf("%2.4f seconds\n\n", run_time);
        }
    }
}

int main() {
    benchmark();
    return ARGON2_OK;
}

This is the modified bench.c file from the argon2 password hashing package.

The following are the results from machine 2 running the modified program:

Argon2i 3 iterations 1 MiB 1 threads: 3.54 cpb 3.54 Mcycles
Argon2d 3 iterations 1 MiB 1 threads: 3.20 cpb 3.20 Mcycles
Argon2id 3 iterations 1 MiB 1 threads: 2.73 cpb 2.73 Mcycles
0.0029 seconds

Argon2i 3 iterations 1 MiB 2 threads: 2.92 cpb 2.92 Mcycles
Argon2d 3 iterations 1 MiB 2 threads: 2.34 cpb 2.34 Mcycles
Argon2id 3 iterations 1 MiB 2 threads: 2.40 cpb 2.40 Mcycles
0.0025 seconds

Argon2i 3 iterations 1 MiB 4 threads: 1.97 cpb 1.97 Mcycles
Argon2d 3 iterations 1 MiB 4 threads: 1.87 cpb 1.87 Mcycles
Argon2id 3 iterations 1 MiB 4 threads: 1.94 cpb 1.94 Mcycles
0.0020 seconds

Argon2i 3 iterations 1 MiB 8 threads: 3.21 cpb 3.21 Mcycles
Argon2d 3 iterations 1 MiB 8 threads: 3.00 cpb 3.00 Mcycles
Argon2id 3 iterations 1 MiB 8 threads: 2.81 cpb 2.81 Mcycles
0.0030 seconds

Argon2i 3 iterations 2 MiB 1 threads: 1.40 cpb 2.79 Mcycles
Argon2d 3 iterations 2 MiB 1 threads: 1.21 cpb 2.42 Mcycles
Argon2id 3 iterations 2 MiB 1 threads: 1.04 cpb 2.08 Mcycles
0.0022 seconds

Argon2i 3 iterations 2 MiB 2 threads: 1.44 cpb 2.88 Mcycles
Argon2d 3 iterations 2 MiB 2 threads: 1.36 cpb 2.72 Mcycles
Argon2id 3 iterations 2 MiB 2 threads: 1.37 cpb 2.73 Mcycles
0.0029 seconds

Argon2i 3 iterations 2 MiB 4 threads: 0.99 cpb 1.99 Mcycles
Argon2d 3 iterations 2 MiB 4 threads: 1.11 cpb 2.21 Mcycles
Argon2id 3 iterations 2 MiB 4 threads: 1.05 cpb 2.11 Mcycles
0.0022 seconds

Argon2i 3 iterations 2 MiB 8 threads: 1.67 cpb 3.35 Mcycles
Argon2d 3 iterations 2 MiB 8 threads: 1.54 cpb 3.08 Mcycles
Argon2id 3 iterations 2 MiB 8 threads: 1.51 cpb 3.02 Mcycles
0.0032 seconds

Argon2i 3 iterations 4 MiB 1 threads: 1.41 cpb 5.65 Mcycles
Argon2d 3 iterations 4 MiB 1 threads: 1.09 cpb 4.38 Mcycles
Argon2id 3 iterations 4 MiB 1 threads: 0.98 cpb 3.92 Mcycles
0.0041 seconds

Argon2i 3 iterations 4 MiB 2 threads: 1.28 cpb 5.13 Mcycles
Argon2d 3 iterations 4 MiB 2 threads: 1.21 cpb 4.85 Mcycles
Argon2id 3 iterations 4 MiB 2 threads: 1.23 cpb 4.93 Mcycles
0.0052 seconds

Argon2i 3 iterations 4 MiB 4 threads: 0.79 cpb 3.18 Mcycles
Argon2d 3 iterations 4 MiB 4 threads: 0.79 cpb 3.18 Mcycles
Argon2id 3 iterations 4 MiB 4 threads: 0.81 cpb 3.22 Mcycles
0.0034 seconds

Argon2i 3 iterations 4 MiB 8 threads: 1.00 cpb 4.00 Mcycles
Argon2d 3 iterations 4 MiB 8 threads: 0.89 cpb 3.58 Mcycles
Argon2id 3 iterations 4 MiB 8 threads: 0.91 cpb 3.64 Mcycles
0.0038 seconds

Argon2i 3 iterations 8 MiB 1 threads: 1.47 cpb 11.79 Mcycles
Argon2d 3 iterations 8 MiB 1 threads: 1.13 cpb 9.08 Mcycles
Argon2id 3 iterations 8 MiB 1 threads: 0.97 cpb 7.80 Mcycles
0.0082 seconds

Argon2i 3 iterations 8 MiB 2 threads: 1.27 cpb 10.18 Mcycles
Argon2d 3 iterations 8 MiB 2 threads: 0.87 cpb 6.95 Mcycles
Argon2id 3 iterations 8 MiB 2 threads: 0.88 cpb 7.00 Mcycles
0.0073 seconds

Argon2i 3 iterations 8 MiB 4 threads: 0.91 cpb 7.31 Mcycles
Argon2d 3 iterations 8 MiB 4 threads: 0.80 cpb 6.42 Mcycles
Argon2id 3 iterations 8 MiB 4 threads: 0.59 cpb 4.70 Mcycles
0.0049 seconds

Argon2i 3 iterations 8 MiB 8 threads: 0.82 cpb 6.53 Mcycles
Argon2d 3 iterations 8 MiB 8 threads: 0.83 cpb 6.63 Mcycles
Argon2id 3 iterations 8 MiB 8 threads: 0.81 cpb 6.47 Mcycles
0.0068 seconds

Argon2i 3 iterations 16 MiB 1 threads: 1.89 cpb 30.20 Mcycles
Argon2d 3 iterations 16 MiB 1 threads: 1.33 cpb 21.22 Mcycles
Argon2id 3 iterations 16 MiB 1 threads: 1.17 cpb 18.70 Mcycles
0.0196 seconds

Argon2i 3 iterations 16 MiB 2 threads: 1.17 cpb 18.80 Mcycles
Argon2d 3 iterations 16 MiB 2 threads: 0.81 cpb 13.03 Mcycles
Argon2id 3 iterations 16 MiB 2 threads: 0.79 cpb 12.57 Mcycles
0.0132 seconds

Argon2i 3 iterations 16 MiB 4 threads: 0.80 cpb 12.79 Mcycles
Argon2d 3 iterations 16 MiB 4 threads: 0.56 cpb 8.97 Mcycles
Argon2id 3 iterations 16 MiB 4 threads: 0.53 cpb 8.45 Mcycles
0.0089 seconds

Argon2i 3 iterations 16 MiB 8 threads: 0.60 cpb 9.57 Mcycles
Argon2d 3 iterations 16 MiB 8 threads: 0.64 cpb 10.22 Mcycles
Argon2id 3 iterations 16 MiB 8 threads: 0.68 cpb 10.83 Mcycles
0.0114 seconds

Argon2i 3 iterations 32 MiB 1 threads: 1.64 cpb 52.53 Mcycles
Argon2d 3 iterations 32 MiB 1 threads: 1.50 cpb 47.89 Mcycles
Argon2id 3 iterations 32 MiB 1 threads: 1.49 cpb 47.84 Mcycles
0.0502 seconds

Argon2i 3 iterations 32 MiB 2 threads: 1.28 cpb 41.08 Mcycles
Argon2d 3 iterations 32 MiB 2 threads: 1.29 cpb 41.17 Mcycles
Argon2id 3 iterations 32 MiB 2 threads: 1.38 cpb 44.31 Mcycles
0.0465 seconds

Argon2i 3 iterations 32 MiB 4 threads: 0.86 cpb 27.46 Mcycles
Argon2d 3 iterations 32 MiB 4 threads: 0.74 cpb 23.58 Mcycles
Argon2id 3 iterations 32 MiB 4 threads: 0.65 cpb 20.68 Mcycles
0.0217 seconds

Argon2i 3 iterations 32 MiB 8 threads: 0.68 cpb 21.81 Mcycles
Argon2d 3 iterations 32 MiB 8 threads: 0.69 cpb 22.09 Mcycles
Argon2id 3 iterations 32 MiB 8 threads: 0.68 cpb 21.73 Mcycles
0.0228 seconds

Argon2i 3 iterations 64 MiB 1 threads: 1.61 cpb 103.11 Mcycles
Argon2d 3 iterations 64 MiB 1 threads: 1.58 cpb 101.05 Mcycles
Argon2id 3 iterations 64 MiB 1 threads: 1.58 cpb 101.25 Mcycles
0.1062 seconds

Argon2i 3 iterations 64 MiB 2 threads: 1.44 cpb 92.42 Mcycles
Argon2d 3 iterations 64 MiB 2 threads: 1.18 cpb 75.76 Mcycles
Argon2id 3 iterations 64 MiB 2 threads: 1.18 cpb 75.28 Mcycles
0.0789 seconds

Argon2i 3 iterations 64 MiB 4 threads: 0.76 cpb 48.48 Mcycles
Argon2d 3 iterations 64 MiB 4 threads: 0.65 cpb 41.49 Mcycles
Argon2id 3 iterations 64 MiB 4 threads: 0.63 cpb 40.49 Mcycles
0.0425 seconds

Argon2i 3 iterations 64 MiB 8 threads: 0.58 cpb 37.08 Mcycles
Argon2d 3 iterations 64 MiB 8 threads: 0.61 cpb 38.88 Mcycles
Argon2id 3 iterations 64 MiB 8 threads: 0.61 cpb 39.02 Mcycles
0.0409 seconds

Argon2i 3 iterations 128 MiB 1 threads: 1.72 cpb 220.68 Mcycles
Argon2d 3 iterations 128 MiB 1 threads: 1.65 cpb 211.20 Mcycles
Argon2id 3 iterations 128 MiB 1 threads: 1.61 cpb 206.66 Mcycles
0.2167 seconds

Argon2i 3 iterations 128 MiB 2 threads: 1.12 cpb 143.16 Mcycles
Argon2d 3 iterations 128 MiB 2 threads: 1.11 cpb 142.53 Mcycles
Argon2id 3 iterations 128 MiB 2 threads: 1.11 cpb 142.67 Mcycles
0.1496 seconds

Argon2i 3 iterations 128 MiB 4 threads: 0.68 cpb 87.52 Mcycles
Argon2d 3 iterations 128 MiB 4 threads: 0.68 cpb 86.96 Mcycles
Argon2id 3 iterations 128 MiB 4 threads: 0.68 cpb 86.78 Mcycles
0.0910 seconds

Argon2i 3 iterations 128 MiB 8 threads: 0.59 cpb 75.56 Mcycles
Argon2d 3 iterations 128 MiB 8 threads: 0.55 cpb 70.96 Mcycles
Argon2id 3 iterations 128 MiB 8 threads: 0.58 cpb 74.02 Mcycles
0.0776 seconds

Argon2i 3 iterations 256 MiB 1 threads: 1.75 cpb 447.73 Mcycles
Argon2d 3 iterations 256 MiB 1 threads: 1.62 cpb 414.48 Mcycles
Argon2id 3 iterations 256 MiB 1 threads: 1.62 cpb 415.25 Mcycles
0.4354 seconds

Argon2i 3 iterations 256 MiB 2 threads: 1.17 cpb 299.72 Mcycles
Argon2d 3 iterations 256 MiB 2 threads: 1.07 cpb 274.17 Mcycles
Argon2id 3 iterations 256 MiB 2 threads: 1.14 cpb 291.48 Mcycles
0.3056 seconds

Argon2i 3 iterations 256 MiB 4 threads: 0.70 cpb 180.25 Mcycles
Argon2d 3 iterations 256 MiB 4 threads: 0.71 cpb 182.79 Mcycles
Argon2id 3 iterations 256 MiB 4 threads: 0.70 cpb 180.23 Mcycles
0.1890 seconds

Argon2i 3 iterations 256 MiB 8 threads: 0.54 cpb 137.75 Mcycles
Argon2d 3 iterations 256 MiB 8 threads: 0.54 cpb 139.23 Mcycles
Argon2id 3 iterations 256 MiB 8 threads: 0.53 cpb 134.82 Mcycles
0.1414 seconds

This is strange, as the original program produced the following result:

2292451852727619283Argon2i 3 iterations 1 MiB 1 threads: 10574.63 cpb 10574.64 Mcycles
9176590593415145417Argon2d 3 iterations 1 MiB 1 threads: 10573.79 cpb 10573.79 Mcycles
16050798784100622823Argon2id 3 iterations 1 MiB 1 threads: 10571.93 cpb 10571.94 Mcycles
0.0100 seconds

2290633554493452044Argon2i 3 iterations 1 MiB 2 threads: 10574.07 cpb 10574.07 Mcycles
29783368801178634129Argon2d 3 iterations 1 MiB 2 threads: 10571.67 cpb 10571.67 Mcycles
36635109851864293143Argon2id 3 iterations 1 MiB 2 threads: 10572.13 cpb 10572.13 Mcycles
0.0160 seconds
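Note: Each line begins with a seemingly random number. The cpb and Mcycles values are also much larger than in the modified program. The modified program computes both columns from elapsed nanoseconds rather than from a cycle counter, so the two sets of figures are not in the same units, and a larger value here does not by itself mean slower hashing.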

I will now change the optimization level to -O3 and retest the program.

Result:
Argon2i 3 iterations 1 MiB 1 threads: 3.42 cpb 3.42 Mcycles
Argon2d 3 iterations 1 MiB 1 threads: 3.18 cpb 3.18 Mcycles
Argon2id 3 iterations 1 MiB 1 threads: 2.72 cpb 2.72 Mcycles
0.0029 seconds

Argon2i 3 iterations 1 MiB 2 threads: 2.49 cpb 2.49 Mcycles
Argon2d 3 iterations 1 MiB 2 threads: 2.33 cpb 2.33 Mcycles
Argon2id 3 iterations 1 MiB 2 threads: 2.30 cpb 2.31 Mcycles
0.0024 seconds

Argon2i 3 iterations 1 MiB 4 threads: 2.23 cpb 2.23 Mcycles
Argon2d 3 iterations 1 MiB 4 threads: 2.06 cpb 2.06 Mcycles
Argon2id 3 iterations 1 MiB 4 threads: 1.71 cpb 1.71 Mcycles
0.0018 seconds

Argon2i 3 iterations 1 MiB 8 threads: 3.17 cpb 3.17 Mcycles
Argon2d 3 iterations 1 MiB 8 threads: 3.00 cpb 3.00 Mcycles
Argon2id 3 iterations 1 MiB 8 threads: 2.99 cpb 2.99 Mcycles
0.0031 seconds

Argon2i 3 iterations 2 MiB 1 threads: 1.41 cpb 2.82 Mcycles
Argon2d 3 iterations 2 MiB 1 threads: 1.23 cpb 2.47 Mcycles
Argon2id 3 iterations 2 MiB 1 threads: 1.04 cpb 2.07 Mcycles
0.0022 seconds

Argon2i 3 iterations 2 MiB 2 threads: 1.39 cpb 2.79 Mcycles
Argon2d 3 iterations 2 MiB 2 threads: 1.36 cpb 2.73 Mcycles
Argon2id 3 iterations 2 MiB 2 threads: 1.34 cpb 2.69 Mcycles
0.0028 seconds

Argon2i 3 iterations 2 MiB 4 threads: 1.02 cpb 2.04 Mcycles
Argon2d 3 iterations 2 MiB 4 threads: 0.99 cpb 1.99 Mcycles
Argon2id 3 iterations 2 MiB 4 threads: 1.00 cpb 1.99 Mcycles
0.0021 seconds

Argon2i 3 iterations 2 MiB 8 threads: 1.71 cpb 3.43 Mcycles
Argon2d 3 iterations 2 MiB 8 threads: 1.68 cpb 3.37 Mcycles
Argon2id 3 iterations 2 MiB 8 threads: 1.64 cpb 3.29 Mcycles
0.0034 seconds

Argon2i 3 iterations 4 MiB 1 threads: 1.37 cpb 5.49 Mcycles
Argon2d 3 iterations 4 MiB 1 threads: 1.10 cpb 4.40 Mcycles
Argon2id 3 iterations 4 MiB 1 threads: 1.01 cpb 4.06 Mcycles
0.0043 seconds

Argon2i 3 iterations 4 MiB 2 threads: 1.35 cpb 5.40 Mcycles
Argon2d 3 iterations 4 MiB 2 threads: 1.18 cpb 4.71 Mcycles
Argon2id 3 iterations 4 MiB 2 threads: 1.19 cpb 4.78 Mcycles
0.0050 seconds

Argon2i 3 iterations 4 MiB 4 threads: 0.91 cpb 3.65 Mcycles
Argon2d 3 iterations 4 MiB 4 threads: 0.91 cpb 3.63 Mcycles
Argon2id 3 iterations 4 MiB 4 threads: 0.90 cpb 3.62 Mcycles
0.0038 seconds

Argon2i 3 iterations 4 MiB 8 threads: 1.02 cpb 4.08 Mcycles
Argon2d 3 iterations 4 MiB 8 threads: 1.01 cpb 4.03 Mcycles
Argon2id 3 iterations 4 MiB 8 threads: 0.95 cpb 3.80 Mcycles
0.0040 seconds

Argon2i 3 iterations 8 MiB 1 threads: 1.40 cpb 11.22 Mcycles
Argon2d 3 iterations 8 MiB 1 threads: 1.16 cpb 9.25 Mcycles
Argon2id 3 iterations 8 MiB 1 threads: 0.99 cpb 7.93 Mcycles
0.0083 seconds

Argon2i 3 iterations 8 MiB 2 threads: 1.42 cpb 11.40 Mcycles
Argon2d 3 iterations 8 MiB 2 threads: 0.88 cpb 7.03 Mcycles
Argon2id 3 iterations 8 MiB 2 threads: 0.75 cpb 6.02 Mcycles
0.0063 seconds

Argon2i 3 iterations 8 MiB 4 threads: 0.94 cpb 7.49 Mcycles
Argon2d 3 iterations 8 MiB 4 threads: 0.74 cpb 5.96 Mcycles
Argon2id 3 iterations 8 MiB 4 threads: 0.55 cpb 4.44 Mcycles
0.0047 seconds

Argon2i 3 iterations 8 MiB 8 threads: 0.71 cpb 5.67 Mcycles
Argon2d 3 iterations 8 MiB 8 threads: 0.76 cpb 6.11 Mcycles
Argon2id 3 iterations 8 MiB 8 threads: 0.75 cpb 5.97 Mcycles
0.0063 seconds

Argon2i 3 iterations 16 MiB 1 threads: 1.62 cpb 25.97 Mcycles
Argon2d 3 iterations 16 MiB 1 threads: 1.27 cpb 20.26 Mcycles
Argon2id 3 iterations 16 MiB 1 threads: 1.14 cpb 18.20 Mcycles
0.0191 seconds

Argon2i 3 iterations 16 MiB 2 threads: 1.35 cpb 21.65 Mcycles
Argon2d 3 iterations 16 MiB 2 threads: 0.98 cpb 15.62 Mcycles
Argon2id 3 iterations 16 MiB 2 threads: 0.92 cpb 14.74 Mcycles
0.0155 seconds

Argon2i 3 iterations 16 MiB 4 threads: 0.84 cpb 13.44 Mcycles
Argon2d 3 iterations 16 MiB 4 threads: 0.54 cpb 8.65 Mcycles
Argon2id 3 iterations 16 MiB 4 threads: 0.58 cpb 9.27 Mcycles
0.0097 seconds

Argon2i 3 iterations 16 MiB 8 threads: 0.61 cpb 9.80 Mcycles
Argon2d 3 iterations 16 MiB 8 threads: 0.61 cpb 9.72 Mcycles
Argon2id 3 iterations 16 MiB 8 threads: 0.67 cpb 10.75 Mcycles
0.0113 seconds

Argon2i 3 iterations 32 MiB 1 threads: 1.58 cpb 50.49 Mcycles
Argon2d 3 iterations 32 MiB 1 threads: 1.47 cpb 46.95 Mcycles
Argon2id 3 iterations 32 MiB 1 threads: 1.47 cpb 47.09 Mcycles
0.0494 seconds

Argon2i 3 iterations 32 MiB 2 threads: 1.46 cpb 46.79 Mcycles
Argon2d 3 iterations 32 MiB 2 threads: 1.39 cpb 44.55 Mcycles
Argon2id 3 iterations 32 MiB 2 threads: 1.42 cpb 45.41 Mcycles
0.0476 seconds

Argon2i 3 iterations 32 MiB 4 threads: 0.85 cpb 27.25 Mcycles
Argon2d 3 iterations 32 MiB 4 threads: 0.63 cpb 20.09 Mcycles
Argon2id 3 iterations 32 MiB 4 threads: 0.67 cpb 21.30 Mcycles
0.0223 seconds

Argon2i 3 iterations 32 MiB 8 threads: 0.65 cpb 20.74 Mcycles
Argon2d 3 iterations 32 MiB 8 threads: 0.67 cpb 21.54 Mcycles
Argon2id 3 iterations 32 MiB 8 threads: 0.67 cpb 21.34 Mcycles
0.0224 seconds

Argon2i 3 iterations 64 MiB 1 threads: 1.60 cpb 102.66 Mcycles
Argon2d 3 iterations 64 MiB 1 threads: 1.55 cpb 99.24 Mcycles
Argon2id 3 iterations 64 MiB 1 threads: 1.55 cpb 99.25 Mcycles
0.1041 seconds

Argon2i 3 iterations 64 MiB 2 threads: 1.22 cpb 78.43 Mcycles
Argon2d 3 iterations 64 MiB 2 threads: 1.26 cpb 80.65 Mcycles
Argon2id 3 iterations 64 MiB 2 threads: 1.20 cpb 76.73 Mcycles
0.0805 seconds

Argon2i 3 iterations 64 MiB 4 threads: 0.76 cpb 48.88 Mcycles
Argon2d 3 iterations 64 MiB 4 threads: 0.68 cpb 43.39 Mcycles
Argon2id 3 iterations 64 MiB 4 threads: 0.74 cpb 47.31 Mcycles
0.0496 seconds

Argon2i 3 iterations 64 MiB 8 threads: 0.65 cpb 41.82 Mcycles
Argon2d 3 iterations 64 MiB 8 threads: 0.63 cpb 40.18 Mcycles
Argon2id 3 iterations 64 MiB 8 threads: 0.67 cpb 42.62 Mcycles
0.0447 seconds

Argon2i 3 iterations 128 MiB 1 threads: 1.66 cpb 212.21 Mcycles
Argon2d 3 iterations 128 MiB 1 threads: 1.72 cpb 219.73 Mcycles
Argon2id 3 iterations 128 MiB 1 threads: 1.64 cpb 209.82 Mcycles
0.2200 seconds

Argon2i 3 iterations 128 MiB 2 threads: 1.24 cpb 158.31 Mcycles
Argon2d 3 iterations 128 MiB 2 threads: 1.11 cpb 142.63 Mcycles
Argon2id 3 iterations 128 MiB 2 threads: 1.19 cpb 152.53 Mcycles
0.1599 seconds

Argon2i 3 iterations 128 MiB 4 threads: 0.75 cpb 95.45 Mcycles
Argon2d 3 iterations 128 MiB 4 threads: 0.68 cpb 86.76 Mcycles
Argon2id 3 iterations 128 MiB 4 threads: 0.68 cpb 87.00 Mcycles
0.0912 seconds

Argon2i 3 iterations 128 MiB 8 threads: 0.57 cpb 72.78 Mcycles
Argon2d 3 iterations 128 MiB 8 threads: 0.58 cpb 74.95 Mcycles
Argon2id 3 iterations 128 MiB 8 threads: 0.59 cpb 75.34 Mcycles
0.0790 seconds

Argon2i 3 iterations 256 MiB 1 threads: 1.76 cpb 451.19 Mcycles
Argon2d 3 iterations 256 MiB 1 threads: 1.69 cpb 433.36 Mcycles
Argon2id 3 iterations 256 MiB 1 threads: 1.60 cpb 408.90 Mcycles
0.4288 seconds

Argon2i 3 iterations 256 MiB 2 threads: 1.16 cpb 296.43 Mcycles
Argon2d 3 iterations 256 MiB 2 threads: 1.09 cpb 279.88 Mcycles
Argon2id 3 iterations 256 MiB 2 threads: 1.18 cpb 301.38 Mcycles
0.3160 seconds

Argon2i 3 iterations 256 MiB 4 threads: 0.74 cpb 189.06 Mcycles
Argon2d 3 iterations 256 MiB 4 threads: 0.68 cpb 174.25 Mcycles
Argon2id 3 iterations 256 MiB 4 threads: 0.71 cpb 180.84 Mcycles
0.1896 seconds

Argon2i 3 iterations 256 MiB 8 threads: 0.50 cpb 128.98 Mcycles
Argon2d 3 iterations 256 MiB 8 threads: 0.55 cpb 141.48 Mcycles
Argon2id 3 iterations 256 MiB 8 threads: 0.52 cpb 132.25 Mcycles
0.1387 seconds

Argon2i 3 iterations 512 MiB 1 threads: 1.75 cpb 895.61 Mcycles
Argon2d 3 iterations 512 MiB 1 threads: 1.65 cpb 844.13 Mcycles
Argon2id 3 iterations 512 MiB 1 threads: 1.65 cpb 843.89 Mcycles
0.8849 seconds

Argon2i 3 iterations 512 MiB 2 threads: 1.10 cpb 563.01 Mcycles
Argon2d 3 iterations 512 MiB 2 threads: 1.12 cpb 573.63 Mcycles
Argon2id 3 iterations 512 MiB 2 threads: 1.12 cpb 575.07 Mcycles
0.6030 seconds

Argon2i 3 iterations 512 MiB 4 threads: 0.67 cpb 341.87 Mcycles
Argon2d 3 iterations 512 MiB 4 threads: 0.69 cpb 351.20 Mcycles
Argon2id 3 iterations 512 MiB 4 threads: 0.66 cpb 337.59 Mcycles
0.3540 seconds

Argon2i 3 iterations 512 MiB 8 threads: 0.50 cpb 255.14 Mcycles
Argon2d 3 iterations 512 MiB 8 threads: 0.49 cpb 253.08 Mcycles
Argon2id 3 iterations 512 MiB 8 threads: 0.50 cpb 258.21 Mcycles
0.2708 seconds

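The program runs fairly fast, as expected at optimization level -O3. The timings are nevertheless close to those of the previous run, for example 0.1387 seconds versus 0.1414 seconds for 256 MiB with 8 threads.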

Test 3 (Extra):

I will be testing on a third machine.

Specifications:

8 core aarch64 X-Gene CPU
Two sticks of DDR3 4096 MB RAM @ 1600 MHz
Fedora 28 64-bit Linux Operating System

Result:

This is with optimization level -O2.

Building without optimizations
cc -std=c89 -O2 -Wall -g -Iinclude -Isrc -pthread src/argon2.c src/core.c src/blake2/blake2b.c src/thread.c src/encoding.c src/ref.c src/bench.c -o bench
Argon2i 3 iterations 1 MiB 1 threads: 5.51 cpb 5.51 Mcycles
Argon2d 3 iterations 1 MiB 1 threads: 5.18 cpb 5.18 Mcycles
Argon2id 3 iterations 1 MiB 1 threads: 4.78 cpb 4.78 Mcycles
0.0050 seconds

Argon2i 3 iterations 1 MiB 2 threads: 4.00 cpb 4.00 Mcycles
Argon2d 3 iterations 1 MiB 2 threads: 3.67 cpb 3.67 Mcycles
Argon2id 3 iterations 1 MiB 2 threads: 3.76 cpb 3.76 Mcycles
0.0039 seconds

Argon2i 3 iterations 1 MiB 4 threads: 3.16 cpb 3.16 Mcycles
Argon2d 3 iterations 1 MiB 4 threads: 2.95 cpb 2.95 Mcycles
Argon2id 3 iterations 1 MiB 4 threads: 3.07 cpb 3.07 Mcycles
0.0032 seconds

Argon2i 3 iterations 1 MiB 8 threads: 5.75 cpb 5.75 Mcycles
Argon2d 3 iterations 1 MiB 8 threads: 5.90 cpb 5.90 Mcycles
Argon2id 3 iterations 1 MiB 8 threads: 6.04 cpb 6.04 Mcycles
0.0063 seconds

Argon2i 3 iterations 2 MiB 1 threads: 5.48 cpb 10.96 Mcycles
Argon2d 3 iterations 2 MiB 1 threads: 5.27 cpb 10.53 Mcycles
Argon2id 3 iterations 2 MiB 1 threads: 4.80 cpb 9.59 Mcycles
0.0101 seconds

Argon2i 3 iterations 2 MiB 2 threads: 3.18 cpb 6.35 Mcycles
Argon2d 3 iterations 2 MiB 2 threads: 3.14 cpb 6.27 Mcycles
Argon2id 3 iterations 2 MiB 2 threads: 3.05 cpb 6.10 Mcycles
0.0064 seconds

Argon2i 3 iterations 2 MiB 4 threads: 2.38 cpb 4.76 Mcycles
Argon2d 3 iterations 2 MiB 4 threads: 2.33 cpb 4.67 Mcycles
Argon2id 3 iterations 2 MiB 4 threads: 2.36 cpb 4.72 Mcycles
0.0050 seconds

Argon2i 3 iterations 2 MiB 8 threads: 3.62 cpb 7.23 Mcycles
Argon2d 3 iterations 2 MiB 8 threads: 3.58 cpb 7.15 Mcycles
Argon2id 3 iterations 2 MiB 8 threads: 3.67 cpb 7.34 Mcycles
0.0077 seconds

Argon2i 3 iterations 4 MiB 1 threads: 5.58 cpb 22.32 Mcycles
Argon2d 3 iterations 4 MiB 1 threads: 5.09 cpb 20.35 Mcycles
Argon2id 3 iterations 4 MiB 1 threads: 4.84 cpb 19.36 Mcycles
0.0203 seconds

Argon2i 3 iterations 4 MiB 2 threads: 2.87 cpb 11.49 Mcycles
Argon2d 3 iterations 4 MiB 2 threads: 2.86 cpb 11.45 Mcycles
Argon2id 3 iterations 4 MiB 2 threads: 2.84 cpb 11.38 Mcycles
0.0119 seconds

Argon2i 3 iterations 4 MiB 4 threads: 1.89 cpb 7.54 Mcycles
Argon2d 3 iterations 4 MiB 4 threads: 1.82 cpb 7.30 Mcycles
Argon2id 3 iterations 4 MiB 4 threads: 1.80 cpb 7.21 Mcycles
0.0076 seconds

Argon2i 3 iterations 4 MiB 8 threads: 2.47 cpb 9.90 Mcycles
Argon2d 3 iterations 4 MiB 8 threads: 2.55 cpb 10.19 Mcycles
Argon2id 3 iterations 4 MiB 8 threads: 2.63 cpb 10.51 Mcycles
0.0110 seconds

Argon2i 3 iterations 8 MiB 1 threads: 5.82 cpb 46.54 Mcycles
Argon2d 3 iterations 8 MiB 1 threads: 5.33 cpb 42.66 Mcycles
Argon2id 3 iterations 8 MiB 1 threads: 5.04 cpb 40.33 Mcycles
0.0423 seconds

Argon2i 3 iterations 8 MiB 2 threads: 2.84 cpb 22.69 Mcycles
Argon2d 3 iterations 8 MiB 2 threads: 2.78 cpb 22.22 Mcycles
Argon2id 3 iterations 8 MiB 2 threads: 2.83 cpb 22.65 Mcycles
0.0237 seconds

Argon2i 3 iterations 8 MiB 4 threads: 1.65 cpb 13.20 Mcycles
Argon2d 3 iterations 8 MiB 4 threads: 1.63 cpb 13.07 Mcycles
Argon2id 3 iterations 8 MiB 4 threads: 1.64 cpb 13.11 Mcycles
0.0137 seconds

Argon2i 3 iterations 8 MiB 8 threads: 2.09 cpb 16.73 Mcycles
Argon2d 3 iterations 8 MiB 8 threads: 1.95 cpb 15.62 Mcycles
Argon2id 3 iterations 8 MiB 8 threads: 2.36 cpb 18.85 Mcycles
0.0198 seconds

Argon2i 3 iterations 16 MiB 1 threads: 6.14 cpb 98.25 Mcycles
Argon2d 3 iterations 16 MiB 1 threads: 5.70 cpb 91.25 Mcycles
Argon2id 3 iterations 16 MiB 1 threads: 5.47 cpb 87.54 Mcycles
0.0918 seconds

Argon2i 3 iterations 16 MiB 2 threads: 2.98 cpb 47.67 Mcycles
Argon2d 3 iterations 16 MiB 2 threads: 2.93 cpb 46.88 Mcycles
Argon2id 3 iterations 16 MiB 2 threads: 2.94 cpb 47.08 Mcycles
0.0494 seconds

Argon2i 3 iterations 16 MiB 4 threads: 1.62 cpb 25.96 Mcycles
Argon2d 3 iterations 16 MiB 4 threads: 1.61 cpb 25.72 Mcycles
Argon2id 3 iterations 16 MiB 4 threads: 1.62 cpb 25.90 Mcycles
0.0272 seconds

Argon2i 3 iterations 16 MiB 8 threads: 1.79 cpb 28.67 Mcycles
Argon2d 3 iterations 16 MiB 8 threads: 1.75 cpb 28.07 Mcycles
Argon2id 3 iterations 16 MiB 8 threads: 1.82 cpb 29.16 Mcycles
0.0306 seconds

Argon2i 3 iterations 32 MiB 1 threads: 6.34 cpb 203.00 Mcycles
Argon2d 3 iterations 32 MiB 1 threads: 6.26 cpb 200.26 Mcycles
Argon2id 3 iterations 32 MiB 1 threads: 6.27 cpb 200.72 Mcycles
0.2105 seconds

Argon2i 3 iterations 32 MiB 2 threads: 3.42 cpb 109.52 Mcycles
Argon2d 3 iterations 32 MiB 2 threads: 3.38 cpb 108.09 Mcycles
Argon2id 3 iterations 32 MiB 2 threads: 3.38 cpb 108.12 Mcycles
0.1134 seconds

Argon2i 3 iterations 32 MiB 4 threads: 1.93 cpb 61.63 Mcycles
Argon2d 3 iterations 32 MiB 4 threads: 1.90 cpb 60.92 Mcycles
Argon2id 3 iterations 32 MiB 4 threads: 1.94 cpb 62.00 Mcycles
0.0650 seconds

Argon2i 3 iterations 32 MiB 8 threads: 1.94 cpb 62.07 Mcycles
Argon2d 3 iterations 32 MiB 8 threads: 1.96 cpb 62.58 Mcycles
Argon2id 3 iterations 32 MiB 8 threads: 1.92 cpb 61.30 Mcycles
0.0643 seconds

Argon2i 3 iterations 64 MiB 1 threads: 6.48 cpb 414.84 Mcycles
Argon2d 3 iterations 64 MiB 1 threads: 6.40 cpb 409.88 Mcycles
Argon2id 3 iterations 64 MiB 1 threads: 6.41 cpb 410.55 Mcycles
0.4305 seconds

Argon2i 3 iterations 64 MiB 2 threads: 3.47 cpb 221.90 Mcycles
Argon2d 3 iterations 64 MiB 2 threads: 3.43 cpb 219.27 Mcycles
Argon2id 3 iterations 64 MiB 2 threads: 3.43 cpb 219.69 Mcycles
0.2304 seconds

Argon2i 3 iterations 64 MiB 4 threads: 1.92 cpb 123.08 Mcycles
Argon2d 3 iterations 64 MiB 4 threads: 1.90 cpb 121.74 Mcycles
Argon2id 3 iterations 64 MiB 4 threads: 1.93 cpb 123.49 Mcycles
0.1295 seconds

Argon2i 3 iterations 64 MiB 8 threads: 1.82 cpb 116.51 Mcycles
Argon2d 3 iterations 64 MiB 8 threads: 1.79 cpb 114.79 Mcycles
Argon2id 3 iterations 64 MiB 8 threads: 1.80 cpb 115.02 Mcycles
0.1206 seconds

Argon2i 3 iterations 128 MiB 1 threads: 6.60 cpb 844.52 Mcycles
Argon2d 3 iterations 128 MiB 1 threads: 6.52 cpb 835.11 Mcycles
Argon2id 3 iterations 128 MiB 1 threads: 6.54 cpb 836.68 Mcycles
0.8773 seconds

Argon2i 3 iterations 128 MiB 2 threads: 3.52 cpb 450.00 Mcycles
Argon2d 3 iterations 128 MiB 2 threads: 3.47 cpb 444.85 Mcycles
Argon2id 3 iterations 128 MiB 2 threads: 3.49 cpb 446.23 Mcycles
0.4679 seconds

Argon2i 3 iterations 128 MiB 4 threads: 1.94 cpb 247.84 Mcycles
Argon2d 3 iterations 128 MiB 4 threads: 1.91 cpb 245.05 Mcycles
Argon2id 3 iterations 128 MiB 4 threads: 1.92 cpb 245.15 Mcycles
0.2571 seconds

Argon2i 3 iterations 128 MiB 8 threads: 1.73 cpb 221.21 Mcycles
Argon2d 3 iterations 128 MiB 8 threads: 1.70 cpb 217.79 Mcycles
Argon2id 3 iterations 128 MiB 8 threads: 1.64 cpb 209.97 Mcycles
0.2202 seconds

Argon2i 3 iterations 256 MiB 1 threads: 6.69 cpb 1712.64 Mcycles
Argon2d 3 iterations 256 MiB 1 threads: 6.62 cpb 1694.77 Mcycles
Argon2id 3 iterations 256 MiB 1 threads: 6.63 cpb 1696.72 Mcycles
1.7791 seconds

Argon2i 3 iterations 256 MiB 2 threads: 3.55 cpb 909.09 Mcycles
Argon2d 3 iterations 256 MiB 2 threads: 3.51 cpb 899.22 Mcycles
Argon2id 3 iterations 256 MiB 2 threads: 3.52 cpb 900.67 Mcycles
0.9444 seconds

Argon2i 3 iterations 256 MiB 4 threads: 1.95 cpb 499.72 Mcycles
Argon2d 3 iterations 256 MiB 4 threads: 1.94 cpb 497.66 Mcycles
Argon2id 3 iterations 256 MiB 4 threads: 1.94 cpb 496.66 Mcycles
0.5208 seconds

Argon2i 3 iterations 256 MiB 8 threads: 1.48 cpb 379.07 Mcycles
Argon2d 3 iterations 256 MiB 8 threads: 1.55 cpb 398.15 Mcycles
Argon2id 3 iterations 256 MiB 8 threads: 1.58 cpb 403.45 Mcycles
0.4230 seconds

Argon2i 3 iterations 512 MiB 1 threads: 6.75 cpb 3458.96 Mcycles
Argon2d 3 iterations 512 MiB 1 threads: 6.68 cpb 3419.92 Mcycles
Argon2id 3 iterations 512 MiB 1 threads: 6.69 cpb 3426.03 Mcycles
3.5925 seconds

Argon2i 3 iterations 512 MiB 2 threads: 3.58 cpb 1835.84 Mcycles
Argon2d 3 iterations 512 MiB 2 threads: 3.55 cpb 1816.11 Mcycles
Argon2id 3 iterations 512 MiB 2 threads: 3.55 cpb 1819.26 Mcycles
1.9076 seconds

Argon2i 3 iterations 512 MiB 4 threads: 1.97 cpb 1009.56 Mcycles
Argon2d 3 iterations 512 MiB 4 threads: 1.95 cpb 997.45 Mcycles
Argon2id 3 iterations 512 MiB 4 threads: 2.01 cpb 1028.11 Mcycles
1.0780 seconds

Argon2i 3 iterations 512 MiB 8 threads: 1.41 cpb 721.65 Mcycles
Argon2d 3 iterations 512 MiB 8 threads: 1.64 cpb 839.50 Mcycles
Argon2id 3 iterations 512 MiB 8 threads: 1.69 cpb 865.63 Mcycles
0.9077 seconds

This machine runs noticeably slower than machine 2: for 256 MiB with 1 thread it takes 1.7791 seconds, compared with 0.4354 seconds on machine 2. This machine also has less memory than the other two machines, so I guess this is expected.

Moving on to the next optimization level -O3.

Result:
Building without optimizations
cc -std=c89 -O3 -Wall -g -Iinclude -Isrc -pthread src/argon2.c src/core.c src/blake2/blake2b.c src/thread.c src/encoding.c src/ref.c src/bench.c -o bench
Argon2i 3 iterations 1 MiB 1 threads: 5.75 cpb 5.75 Mcycles
Argon2d 3 iterations 1 MiB 1 threads: 5.45 cpb 5.45 Mcycles
Argon2id 3 iterations 1 MiB 1 threads: 5.04 cpb 5.04 Mcycles
0.0053 seconds

Argon2i 3 iterations 1 MiB 2 threads: 3.97 cpb 3.97 Mcycles
Argon2d 3 iterations 1 MiB 2 threads: 3.59 cpb 3.59 Mcycles
Argon2id 3 iterations 1 MiB 2 threads: 3.54 cpb 3.54 Mcycles
0.0037 seconds

Argon2i 3 iterations 1 MiB 4 threads: 3.00 cpb 3.00 Mcycles
Argon2d 3 iterations 1 MiB 4 threads: 2.84 cpb 2.84 Mcycles
Argon2id 3 iterations 1 MiB 4 threads: 2.77 cpb 2.77 Mcycles
0.0029 seconds

Argon2i 3 iterations 1 MiB 8 threads: 5.19 cpb 5.20 Mcycles
Argon2d 3 iterations 1 MiB 8 threads: 5.07 cpb 5.07 Mcycles
Argon2id 3 iterations 1 MiB 8 threads: 4.92 cpb 4.93 Mcycles
0.0052 seconds

Argon2i 3 iterations 2 MiB 1 threads: 5.70 cpb 11.40 Mcycles
Argon2d 3 iterations 2 MiB 1 threads: 5.49 cpb 10.98 Mcycles
Argon2id 3 iterations 2 MiB 1 threads: 5.07 cpb 10.14 Mcycles
0.0106 seconds

Argon2i 3 iterations 2 MiB 2 threads: 3.19 cpb 6.39 Mcycles
Argon2d 3 iterations 2 MiB 2 threads: 3.15 cpb 6.30 Mcycles
Argon2id 3 iterations 2 MiB 2 threads: 3.21 cpb 6.43 Mcycles
0.0067 seconds

Argon2i 3 iterations 2 MiB 4 threads: 2.20 cpb 4.41 Mcycles
Argon2d 3 iterations 2 MiB 4 threads: 2.22 cpb 4.44 Mcycles
Argon2id 3 iterations 2 MiB 4 threads: 2.16 cpb 4.32 Mcycles
0.0045 seconds

Argon2i 3 iterations 2 MiB 8 threads: 3.68 cpb 7.36 Mcycles
Argon2d 3 iterations 2 MiB 8 threads: 2.80 cpb 5.61 Mcycles
Argon2id 3 iterations 2 MiB 8 threads: 2.79 cpb 5.58 Mcycles
0.0058 seconds

Argon2i 3 iterations 4 MiB 1 threads: 5.81 cpb 23.23 Mcycles
Argon2d 3 iterations 4 MiB 1 threads: 5.34 cpb 21.38 Mcycles
Argon2id 3 iterations 4 MiB 1 threads: 5.11 cpb 20.43 Mcycles
0.0214 seconds

Argon2i 3 iterations 4 MiB 2 threads: 2.98 cpb 11.93 Mcycles
Argon2d 3 iterations 4 MiB 2 threads: 2.93 cpb 11.73 Mcycles
Argon2id 3 iterations 4 MiB 2 threads: 2.93 cpb 11.71 Mcycles
0.0123 seconds

Argon2i 3 iterations 4 MiB 4 threads: 1.82 cpb 7.28 Mcycles
Argon2d 3 iterations 4 MiB 4 threads: 1.77 cpb 7.08 Mcycles
Argon2id 3 iterations 4 MiB 4 threads: 1.77 cpb 7.07 Mcycles
0.0074 seconds

Argon2i 3 iterations 4 MiB 8 threads: 2.50 cpb 9.99 Mcycles
Argon2d 3 iterations 4 MiB 8 threads: 2.70 cpb 10.82 Mcycles
Argon2id 3 iterations 4 MiB 8 threads: 2.89 cpb 11.54 Mcycles
0.0121 seconds

Argon2i 3 iterations 8 MiB 1 threads: 6.05 cpb 48.43 Mcycles
Argon2d 3 iterations 8 MiB 1 threads: 5.58 cpb 44.62 Mcycles
Argon2id 3 iterations 8 MiB 1 threads: 5.31 cpb 42.46 Mcycles
0.0445 seconds

Argon2i 3 iterations 8 MiB 2 threads: 2.95 cpb 23.60 Mcycles
Argon2d 3 iterations 8 MiB 2 threads: 2.91 cpb 23.26 Mcycles
Argon2id 3 iterations 8 MiB 2 threads: 2.90 cpb 23.23 Mcycles
0.0244 seconds

Argon2i 3 iterations 8 MiB 4 threads: 1.66 cpb 13.24 Mcycles
Argon2d 3 iterations 8 MiB 4 threads: 1.64 cpb 13.13 Mcycles
Argon2id 3 iterations 8 MiB 4 threads: 1.64 cpb 13.10 Mcycles
0.0137 seconds

Argon2i 3 iterations 8 MiB 8 threads: 2.03 cpb 16.25 Mcycles
Argon2d 3 iterations 8 MiB 8 threads: 2.29 cpb 18.37 Mcycles
Argon2id 3 iterations 8 MiB 8 threads: 1.92 cpb 15.33 Mcycles
0.0161 seconds

Argon2i 3 iterations 16 MiB 1 threads: 6.37 cpb 102.00 Mcycles
Argon2d 3 iterations 16 MiB 1 threads: 5.97 cpb 95.50 Mcycles
Argon2id 3 iterations 16 MiB 1 threads: 5.74 cpb 91.90 Mcycles
0.0964 seconds

Argon2i 3 iterations 16 MiB 2 threads: 3.12 cpb 49.90 Mcycles
Argon2d 3 iterations 16 MiB 2 threads: 3.07 cpb 49.17 Mcycles
Argon2id 3 iterations 16 MiB 2 threads: 3.08 cpb 49.33 Mcycles
0.0517 seconds

Argon2i 3 iterations 16 MiB 4 threads: 1.70 cpb 27.26 Mcycles
Argon2d 3 iterations 16 MiB 4 threads: 1.68 cpb 26.94 Mcycles
Argon2id 3 iterations 16 MiB 4 threads: 1.69 cpb 27.04 Mcycles
0.0283 seconds

Argon2i 3 iterations 16 MiB 8 threads: 1.81 cpb 28.91 Mcycles
Argon2d 3 iterations 16 MiB 8 threads: 1.87 cpb 29.85 Mcycles
Argon2id 3 iterations 16 MiB 8 threads: 1.87 cpb 29.86 Mcycles
0.0313 seconds

Argon2i 3 iterations 32 MiB 1 threads: 6.57 cpb 210.38 Mcycles
Argon2d 3 iterations 32 MiB 1 threads: 6.51 cpb 208.24 Mcycles
Argon2id 3 iterations 32 MiB 1 threads: 6.52 cpb 208.70 Mcycles
0.2188 seconds

Argon2i 3 iterations 32 MiB 2 threads: 3.53 cpb 112.92 Mcycles
Argon2d 3 iterations 32 MiB 2 threads: 3.49 cpb 111.63 Mcycles
Argon2id 3 iterations 32 MiB 2 threads: 3.50 cpb 111.91 Mcycles
0.1173 seconds

Argon2i 3 iterations 32 MiB 4 threads: 1.97 cpb 63.21 Mcycles
Argon2d 3 iterations 32 MiB 4 threads: 1.96 cpb 62.57 Mcycles
Argon2id 3 iterations 32 MiB 4 threads: 1.96 cpb 62.68 Mcycles
0.0657 seconds

Argon2i 3 iterations 32 MiB 8 threads: 1.89 cpb 60.42 Mcycles
Argon2d 3 iterations 32 MiB 8 threads: 2.00 cpb 63.85 Mcycles
Argon2id 3 iterations 32 MiB 8 threads: 2.03 cpb 64.85 Mcycles
0.0680 seconds

Argon2i 3 iterations 64 MiB 1 threads: 6.72 cpb 430.30 Mcycles
Argon2d 3 iterations 64 MiB 1 threads: 6.66 cpb 426.03 Mcycles
Argon2id 3 iterations 64 MiB 1 threads: 6.67 cpb 426.61 Mcycles
0.4473 seconds

Argon2i 3 iterations 64 MiB 2 threads: 3.58 cpb 229.32 Mcycles
Argon2d 3 iterations 64 MiB 2 threads: 3.54 cpb 226.89 Mcycles
Argon2id 3 iterations 64 MiB 2 threads: 3.55 cpb 227.27 Mcycles
0.2383 seconds

Argon2i 3 iterations 64 MiB 4 threads: 1.98 cpb 126.75 Mcycles
Argon2d 3 iterations 64 MiB 4 threads: 1.96 cpb 125.35 Mcycles
Argon2id 3 iterations 64 MiB 4 threads: 1.96 cpb 125.71 Mcycles
0.1318 seconds

Argon2i 3 iterations 64 MiB 8 threads: 1.87 cpb 119.64 Mcycles
Argon2d 3 iterations 64 MiB 8 threads: 1.94 cpb 123.96 Mcycles
Argon2id 3 iterations 64 MiB 8 threads: 1.90 cpb 121.41 Mcycles
0.1273 seconds

Argon2i 3 iterations 128 MiB 1 threads: 6.83 cpb 874.04 Mcycles
Argon2d 3 iterations 128 MiB 1 threads: 6.77 cpb 866.06 Mcycles
Argon2id 3 iterations 128 MiB 1 threads: 6.78 cpb 867.69 Mcycles
0.9098 seconds

Argon2i 3 iterations 128 MiB 2 threads: 3.62 cpb 464.03 Mcycles
Argon2d 3 iterations 128 MiB 2 threads: 3.60 cpb 460.44 Mcycles
Argon2id 3 iterations 128 MiB 2 threads: 3.59 cpb 460.12 Mcycles
0.4825 seconds

Argon2i 3 iterations 128 MiB 4 threads: 2.00 cpb 255.49 Mcycles
Argon2d 3 iterations 128 MiB 4 threads: 1.97 cpb 251.78 Mcycles
Argon2id 3 iterations 128 MiB 4 threads: 1.97 cpb 252.45 Mcycles
0.2647 seconds

Argon2i 3 iterations 128 MiB 8 threads: 1.85 cpb 236.45 Mcycles
Argon2d 3 iterations 128 MiB 8 threads: 1.71 cpb 218.54 Mcycles
Argon2id 3 iterations 128 MiB 8 threads: 1.71 cpb 219.59 Mcycles
0.2303 seconds

Argon2i 3 iterations 256 MiB 1 threads: 6.92 cpb 1771.62 Mcycles
Argon2d 3 iterations 256 MiB 1 threads: 6.86 cpb 1756.04 Mcycles
Argon2id 3 iterations 256 MiB 1 threads: 6.87 cpb 1759.49 Mcycles
1.8450 seconds

It looks like the results show a slight improvement in time.

Conclusion:

I do not know what removed those random numbers from the x86_64 basic test, but I consider this a success in porting the argon2 password hashing function's bench test tool so that it works on any Linux device, whether Aarch64 or x86_64.

Project: Part3 – Optimizing and porting argon2 package using C and Assembler language(Progress 3)

Requirements/ System Specifications.

Argon2 Password hashing function package:

https://github.com/P-H-C/phc-winner-argon2

Machine 1:

Aarch64 Fedora 28 version of Linux operating system

Cortex-A57 8 core processor

Two sticks of Dual-Channel DIMM DDR3 8GB RAM (16GB in total)

Machine 2:

Intel(R) Xeon(R) CPU E5-1630 v4 @ 3.70GHz

Four sticks of 8GB DIMM DDR4 RAM at 2.4 GHz (32 GB of RAM in total)

x86_64 Fedora 28 version of Linux Operating System

Approach:

I will test the changed code on machine 1. This is a continuation of the previous blog post, titled “Project: Part3 – Optimizing and porting argon2 package using C and Assembler language(Progress 2)”.

Here is the modified version of bench.c from the argon2 password hashing function:

/*
* Argon2 reference source code package - reference C implementations
*
* Copyright 2015
* Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
*
* You may use this work under the terms of a Creative Commons CC0 1.0
* License/Waiver or the Apache Public License 2.0, at your option. The terms of
* these licenses can be found at:
*
* - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
* - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0
*
* You should have received a copy of both of these licenses along with this
* software. If not, they may be obtained at the above URLs.
*/

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#define BILLION 1000000000L;
#ifdef _MSC_VER
#include <intrin.h>
#endif

#include "argon2.h"

/*
static uint64_t rdtsc(void) {
#ifdef _MSC_VER
return __rdtsc();
#else
#if defined(__amd64__) || defined(__x86_64__)
uint64_t rax, rdx;
__asm__ __volatile__("rdtsc" : "=a"(rax), "=d"(rdx) : :);
return (rdx << 32) | rax;
#elif defined(__i386__) || defined(__i386) || defined(__X86__)
uint64_t rax;
__asm__ __volatile__("rdtsc" : "=A"(rax) : :);
return rax;
#else
#error "Not implemented!"
#endif
#endif
}

*/


/*
* Benchmarks Argon2 with salt length 16, password length 16, t_cost 3,
and different m_cost and threads
*/
static void benchmark() {
#define BENCH_OUTLEN 16
#define BENCH_INLEN 16
const uint32_t inlen = BENCH_INLEN;
const unsigned outlen = BENCH_OUTLEN;
unsigned char out[BENCH_OUTLEN];
unsigned char pwd_array[BENCH_INLEN];
unsigned char salt_array[BENCH_INLEN];
#undef BENCH_INLEN
#undef BENCH_OUTLEN

struct timespec start, stop;
double accum;

uint32_t t_cost = 3;
uint32_t m_cost;
uint32_t thread_test[4] = {1, 2, 4, 8};
argon2_type types[3] = {Argon2_i, Argon2_d, Argon2_id};

memset(pwd_array, 0, inlen);
memset(salt_array, 1, inlen);

for (m_cost = (uint32_t)1 << 10; m_cost <= (uint32_t)1 << 22; m_cost *= 2) {
unsigned i;
for (i = 0; i < 4; ++i) {
double run_time = 0;
uint32_t thread_n = thread_test[i];
unsigned j;
for (j = 0; j < 3; ++j) {
/*clock_t start_time, stop_time;
uint64_t start_cycles, stop_cycles;
uint64_t delta;
double mcycles;*/

argon2_type type = types[j];

/*start_time = clock();
start_cycles = rdtsc();*/

if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) {
perror( "clock gettime" );
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &start);
}

argon2_hash(t_cost, m_cost, thread_n, pwd_array, inlen,
salt_array, inlen, out, outlen, NULL, 0, type,
ARGON2_VERSION_NUMBER);

/*stop_cycles = rdtsc();
stop_time = clock();*/

/*delta = (stop_cycles - start_cycles) / (m_cost);
mcycles = (double)(stop_cycles - start_cycles) / (1UL << 20);
run_time += ((double)stop_time - start_time) / (CLOCKS_PER_SEC);*/

if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) {
perror( "clock gettime" );
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &stop);
}

accum = ( (double)stop.tv_sec - start.tv_sec )
+ ( (double)stop.tv_nsec - start.tv_nsec );

double mcycles = accum / (1UL << 20);
uint64_t delta = accum / (m_cost);

printf("%s %d iterations %d MiB %d threads: %2.2f cpb %2.2f "
"Mcycles \n", argon2_type2string(type, 1), t_cost,
m_cost >> 10, thread_n, (float)delta / 1024, mcycles);

run_time = 0;
run_time += accum / BILLION;

/*run_time += accum;
printf("%2.4f seconds\n\n", (double)run_time);*/
}

printf("%2.4f seconds\n\n", run_time);
}
}

}

int main() {
benchmark();
return ARGON2_OK;
}

The x86_64 basic test done in the previous blog shows how the program is intended to run. The program is supposed to count the number of CPU cycles spent while running the program’s main code, “argon2_hash(t_cost, m_cost, thread_n, pwd_array, inlen, salt_array, inlen, out, outlen, NULL, 0, type, ARGON2_VERSION_NUMBER);”. I did not expect the rdtsc counter found in the x86_64 architecture to be such a complicated problem.

This is the portion of the code that I assume calculates the CPU cycles:

delta = (stop_cycles - start_cycles) / (m_cost);
mcycles = (double)(stop_cycles - start_cycles) / (1UL << 20);
run_time += ((double)stop_time - start_time) / (CLOCKS_PER_SEC);

The calculation is straightforward: delta is the stop cycle count minus the start cycle count, divided by the variable m_cost. m_cost is generated by the for loop seen below:

for (m_cost = (uint32_t)1 << 10; m_cost <= (uint32_t)1 << 22; m_cost *= 2)
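As a rough illustration of that arithmetic, here is a small sketch that restates it for a single measurement. It assumes m_cost is the memory size in KiB, which matches the "m_cost >> 10" MiB figure that the benchmark prints. The helper names cycles_per_byte and mega_cycles are made up for this example and are not part of argon2.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not part of argon2: they restate the bench.c
 * arithmetic for one measurement. */

/* Cycles per byte. m_cost is assumed to be in KiB, so the count is first
 * divided by m_cost (integer division, as in bench.c) and then by 1024. */
static double cycles_per_byte(uint64_t cycles, uint32_t m_cost_kib) {
    uint64_t per_kib;
    assert(m_cost_kib != 0); /* avoid dividing by zero */
    per_kib = cycles / m_cost_kib;
    return (double)per_kib / 1024.0;
}

/* "Mcycles" here means units of 2^20 cycles, as in bench.c. */
static double mega_cycles(uint64_t cycles) {
    return (double)cycles / (double)(1UL << 20);
}
```

With m_cost = 1024 (1 MiB) both helpers divide by 2^20 in total, which would explain why the cpb and Mcycles columns are nearly identical on the 1 MiB lines of the output. Small differences can come from the integer division in the cpb path.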

My mistake:

When looking at the original code I noticed that the program had a run_time calculation that I forgot to include.

run_time += ((double)stop_time - start_time) / (CLOCKS_PER_SEC);

I made the change and rebuilt the program using the Makefile.

cc -std=c89 -O2 -Wall -g -Iinclude -Isrc -pthread src/argon2.c src/core.c src/blake2/blake2b.c src/thread.c src/encoding.c src/ref.c src/bench.c -o bench

Here is the changed code:

/*
* Argon2 reference source code package - reference C implementations
*
* Copyright 2015
* Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
*
* You may use this work under the terms of a Creative Commons CC0 1.0
* License/Waiver or the Apache Public License 2.0, at your option. The terms of
* these licenses can be found at:
*
* - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
* - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0
*
* You should have received a copy of both of these licenses along with this
* software. If not, they may be obtained at the above URLs.
*/

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#define BILLION 1000000000L;
#ifdef _MSC_VER
#include <intrin.h>
#endif

#include "argon2.h"

/*
static uint64_t rdtsc(void) {
#ifdef _MSC_VER
return __rdtsc();
#else
#if defined(__amd64__) || defined(__x86_64__)
uint64_t rax, rdx;
__asm__ __volatile__("rdtsc" : "=a"(rax), "=d"(rdx) : :);
return (rdx << 32) | rax;
#elif defined(__i386__) || defined(__i386) || defined(__X86__)
uint64_t rax;
__asm__ __volatile__("rdtsc" : "=A"(rax) : :);
return rax;
#elif defined(__aarch64__)
return 1;
#else
return 0;
#endif
#endif
}

*/


/*
* Benchmarks Argon2 with salt length 16, password length 16, t_cost 3,
and different m_cost and threads
*/
static void benchmark() {
#define BENCH_OUTLEN 16
#define BENCH_INLEN 16
const uint32_t inlen = BENCH_INLEN;
const unsigned outlen = BENCH_OUTLEN;
unsigned char out[BENCH_OUTLEN];
unsigned char pwd_array[BENCH_INLEN];
unsigned char salt_array[BENCH_INLEN];
#undef BENCH_INLEN
#undef BENCH_OUTLEN

struct timespec start, stop;
double accum;

uint32_t t_cost = 3;
uint32_t m_cost;
uint32_t thread_test[4] = {1, 2, 4, 8};
argon2_type types[3] = {Argon2_i, Argon2_d, Argon2_id};

memset(pwd_array, 0, inlen);
memset(salt_array, 1, inlen);

for (m_cost = (uint32_t)1 << 10; m_cost <= (uint32_t)1 << 22; m_cost *= 2) {
unsigned i;
for (i = 0; i < 4; ++i) {
double run_time = 0;
uint32_t thread_n = thread_test[i];
unsigned j;
for (j = 0; j < 3; ++j) {
/*clock_t start_time, stop_time;
uint64_t start_cycles, stop_cycles;
uint64_t delta;
double mcycles;*/

argon2_type type = types[j];

/*start_time = clock();
start_cycles = rdtsc();*/

if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) {
perror( "clock gettime" );
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &start);
}

argon2_hash(t_cost, m_cost, thread_n, pwd_array, inlen,
salt_array, inlen, out, outlen, NULL, 0, type,
ARGON2_VERSION_NUMBER);

/*stop_cycles = rdtsc();
stop_time = clock();*/

/*delta = (stop_cycles - start_cycles) / (m_cost);
mcycles = (double)(stop_cycles - start_cycles) / (1UL << 20);
run_time += ((double)stop_time - start_time) / (CLOCKS_PER_SEC);*/

if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) {
perror( "clock gettime" ); 
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &stop);
}

accum = ( (double)stop.tv_sec - (double)start.tv_sec )
+ ( (double)stop.tv_nsec - (double)start.tv_nsec );

double mcycles = accum / (1UL << 20);
uint64_t delta = accum / (m_cost);

printf("%s %d iterations %d MiB %d threads: %2.2f cpb %2.2f "
"Mcycles \n", argon2_type2string(type, 1), t_cost,
m_cost >> 10, thread_n, (float)delta / 1024, mcycles);

run_time += accum / BILLION
run_time += run_time / (CLOCKS_PER_SEC);

/*run_time += accum;
printf("%2.4f seconds\n\n", (double)run_time);*/
}

/*run_time = 0;
run_time += accum / BILLION;*/
printf("%2.4f seconds\n\n", run_time);
}
}

}

int main() {
benchmark();
return ARGON2_OK;
}

Command to run the program:

./bench

Result:
Argon2i 3 iterations 1 MiB 1 threads: 5.38 cpb 5.38 Mcycles
Argon2d 3 iterations 1 MiB 1 threads: 4.97 cpb 4.97 Mcycles
Argon2id 3 iterations 1 MiB 1 threads: 4.45 cpb 4.45 Mcycles
0.0155 seconds

Argon2i 3 iterations 1 MiB 2 threads: 3.50 cpb 3.50 Mcycles
Argon2d 3 iterations 1 MiB 2 threads: 3.21 cpb 3.21 Mcycles
Argon2id 3 iterations 1 MiB 2 threads: 3.20 cpb 3.20 Mcycles
0.0104 seconds

Argon2i 3 iterations 1 MiB 4 threads: 2.69 cpb 2.69 Mcycles
Argon2d 3 iterations 1 MiB 4 threads: 2.61 cpb 2.61 Mcycles
Argon2id 3 iterations 1 MiB 4 threads: 2.65 cpb 2.65 Mcycles
0.0083 seconds

Argon2i 3 iterations 1 MiB 8 threads: 4.43 cpb 4.43 Mcycles
Argon2d 3 iterations 1 MiB 8 threads: 4.41 cpb 4.41 Mcycles
Argon2id 3 iterations 1 MiB 8 threads: 4.39 cpb 4.39 Mcycles
0.0139 seconds

Argon2i 3 iterations 2 MiB 1 threads: 5.21 cpb 10.42 Mcycles
Argon2d 3 iterations 2 MiB 1 threads: 4.98 cpb 9.95 Mcycles
Argon2id 3 iterations 2 MiB 1 threads: 4.42 cpb 8.84 Mcycles
0.0306 seconds

Argon2i 3 iterations 2 MiB 2 threads: 2.81 cpb 5.63 Mcycles
Argon2d 3 iterations 2 MiB 2 threads: 2.73 cpb 5.47 Mcycles
Argon2id 3 iterations 2 MiB 2 threads: 0.00 cpb -948.16 Mcycles
-0.9826 seconds

Argon2i 3 iterations 2 MiB 4 threads: 1.88 cpb 3.76 Mcycles
Argon2d 3 iterations 2 MiB 4 threads: 1.90 cpb 3.80 Mcycles
Argon2id 3 iterations 2 MiB 4 threads: 1.88 cpb 3.76 Mcycles
0.0119 seconds

Argon2i 3 iterations 2 MiB 8 threads: 2.52 cpb 5.04 Mcycles
Argon2d 3 iterations 2 MiB 8 threads: 2.54 cpb 5.08 Mcycles
Argon2id 3 iterations 2 MiB 8 threads: 2.60 cpb 5.20 Mcycles
0.0161 seconds

Argon2i 3 iterations 4 MiB 1 threads: 5.29 cpb 21.18 Mcycles
Argon2d 3 iterations 4 MiB 1 threads: 4.75 cpb 19.00 Mcycles
Argon2id 3 iterations 4 MiB 1 threads: 4.43 cpb 17.72 Mcycles
0.0607 seconds

Argon2i 3 iterations 4 MiB 2 threads: 2.60 cpb 10.41 Mcycles
Argon2d 3 iterations 4 MiB 2 threads: 2.57 cpb 10.27 Mcycles
Argon2id 3 iterations 4 MiB 2 threads: 2.58 cpb 10.31 Mcycles
0.0325 seconds

Argon2i 3 iterations 4 MiB 4 threads: 1.61 cpb 6.42 Mcycles
Argon2d 3 iterations 4 MiB 4 threads: 1.59 cpb 6.37 Mcycles
Argon2id 3 iterations 4 MiB 4 threads: 1.60 cpb 6.39 Mcycles
0.0201 seconds

Argon2i 3 iterations 4 MiB 8 threads: 2.09 cpb 8.35 Mcycles
Argon2d 3 iterations 4 MiB 8 threads: 2.06 cpb 8.25 Mcycles
Argon2id 3 iterations 4 MiB 8 threads: 2.41 cpb 9.64 Mcycles
0.0275 seconds

Argon2i 3 iterations 8 MiB 1 threads: 5.52 cpb 44.13 Mcycles
Argon2d 3 iterations 8 MiB 1 threads: 5.00 cpb 40.03 Mcycles
Argon2id 3 iterations 8 MiB 1 threads: 4.61 cpb 36.90 Mcycles
0.1269 seconds

Argon2i 3 iterations 8 MiB 2 threads: 2.59 cpb 20.76 Mcycles
Argon2d 3 iterations 8 MiB 2 threads: 2.57 cpb 20.56 Mcycles
Argon2id 3 iterations 8 MiB 2 threads: 2.56 cpb 20.52 Mcycles
0.0648 seconds

Argon2i 3 iterations 8 MiB 4 threads: 1.48 cpb 11.85 Mcycles
Argon2d 3 iterations 8 MiB 4 threads: 1.49 cpb 11.88 Mcycles
Argon2id 3 iterations 8 MiB 4 threads: 1.48 cpb 11.84 Mcycles
0.0373 seconds

Argon2i 3 iterations 8 MiB 8 threads: 2.24 cpb 17.95 Mcycles
Argon2d 3 iterations 8 MiB 8 threads: 0.00 cpb -939.59 Mcycles
Argon2id 3 iterations 8 MiB 8 threads: 2.02 cpb 16.16 Mcycles
-0.9495 seconds

Argon2i 3 iterations 16 MiB 1 threads: 5.77 cpb 92.33 Mcycles
Argon2d 3 iterations 16 MiB 1 threads: 5.31 cpb 84.99 Mcycles
Argon2id 3 iterations 16 MiB 1 threads: 5.01 cpb 80.18 Mcycles
0.2700 seconds

Argon2i 3 iterations 16 MiB 2 threads: 2.75 cpb 44.05 Mcycles
Argon2d 3 iterations 16 MiB 2 threads: 2.73 cpb 43.68 Mcycles
Argon2id 3 iterations 16 MiB 2 threads: 2.74 cpb 43.80 Mcycles
0.1379 seconds

Argon2i 3 iterations 16 MiB 4 threads: 1.54 cpb 24.66 Mcycles
Argon2d 3 iterations 16 MiB 4 threads: 1.51 cpb 24.24 Mcycles
Argon2id 3 iterations 16 MiB 4 threads: 1.52 cpb 24.33 Mcycles
0.0768 seconds

Argon2i 3 iterations 16 MiB 8 threads: 1.62 cpb 25.92 Mcycles
Argon2d 3 iterations 16 MiB 8 threads: 1.68 cpb 26.85 Mcycles
Argon2id 3 iterations 16 MiB 8 threads: 1.76 cpb 28.13 Mcycles
0.0848 seconds

Argon2i 3 iterations 32 MiB 1 threads: 5.96 cpb 190.66 Mcycles
Argon2d 3 iterations 32 MiB 1 threads: 5.88 cpb 188.16 Mcycles
Argon2id 3 iterations 32 MiB 1 threads: 0.00 cpb -765.51 Mcycles
-0.4055 seconds

Argon2i 3 iterations 32 MiB 2 threads: 3.29 cpb 105.24 Mcycles
Argon2d 3 iterations 32 MiB 2 threads: 3.25 cpb 104.07 Mcycles
Argon2id 3 iterations 32 MiB 2 threads: 3.26 cpb 104.20 Mcycles
0.3287 seconds

Argon2i 3 iterations 32 MiB 4 threads: 1.85 cpb 59.35 Mcycles
Argon2d 3 iterations 32 MiB 4 threads: 1.84 cpb 58.92 Mcycles
Argon2id 3 iterations 32 MiB 4 threads: 1.85 cpb 59.15 Mcycles
0.1860 seconds

Argon2i 3 iterations 32 MiB 8 threads: 1.92 cpb 61.44 Mcycles
Argon2d 3 iterations 32 MiB 8 threads: 1.84 cpb 58.89 Mcycles
Argon2id 3 iterations 32 MiB 8 threads: 1.99 cpb 63.67 Mcycles
0.1929 seconds

Argon2i 3 iterations 64 MiB 1 threads: 0.00 cpb -564.65 Mcycles
Argon2d 3 iterations 64 MiB 1 threads: 6.02 cpb 385.31 Mcycles
Argon2id 3 iterations 64 MiB 1 threads: 0.00 cpb -567.80 Mcycles
-0.7834 seconds

Argon2i 3 iterations 64 MiB 2 threads: 3.33 cpb 213.04 Mcycles
Argon2d 3 iterations 64 MiB 2 threads: 3.30 cpb 210.98 Mcycles
Argon2id 3 iterations 64 MiB 2 threads: 3.30 cpb 211.29 Mcycles
0.6662 seconds

Argon2i 3 iterations 64 MiB 4 threads: 1.86 cpb 119.27 Mcycles
Argon2d 3 iterations 64 MiB 4 threads: 0.00 cpb -835.44 Mcycles
Argon2id 3 iterations 64 MiB 4 threads: 1.85 cpb 118.59 Mcycles
-0.6266 seconds

Argon2i 3 iterations 64 MiB 8 threads: 1.88 cpb 120.44 Mcycles
Argon2d 3 iterations 64 MiB 8 threads: 1.94 cpb 124.37 Mcycles
Argon2id 3 iterations 64 MiB 8 threads: 1.63 cpb 104.46 Mcycles
0.3662 seconds

Argon2i 3 iterations 128 MiB 1 threads: 0.00 cpb -158.98 Mcycles
Argon2d 3 iterations 128 MiB 1 threads: 0.00 cpb -167.45 Mcycles
Argon2id 3 iterations 128 MiB 1 threads: 0.00 cpb -165.81 Mcycles
-0.5162 seconds

Argon2i 3 iterations 128 MiB 2 threads: 3.38 cpb 432.10 Mcycles
Argon2d 3 iterations 128 MiB 2 threads: 3.34 cpb 427.70 Mcycles
Argon2id 3 iterations 128 MiB 2 threads: 0.00 cpb -525.12 Mcycles
0.3509 seconds

Argon2i 3 iterations 128 MiB 4 threads: 1.88 cpb 240.61 Mcycles
Argon2d 3 iterations 128 MiB 4 threads: 1.86 cpb 238.46 Mcycles
Argon2id 3 iterations 128 MiB 4 threads: 0.00 cpb -715.31 Mcycles
-0.2477 seconds

Argon2i 3 iterations 128 MiB 8 threads: 1.56 cpb 199.22 Mcycles
Argon2d 3 iterations 128 MiB 8 threads: 1.72 cpb 219.92 Mcycles
Argon2id 3 iterations 128 MiB 8 threads: 1.69 cpb 216.88 Mcycles
0.6669 seconds
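
A likely explanation for the negative values above: in this version accum adds the difference in tv_sec (whole seconds) directly to the difference in tv_nsec (nanoseconds) without converting the seconds to nanoseconds. Whenever the clock crosses a second boundary between the two readings, the tv_nsec difference is negative and the unscaled "1" from tv_sec cannot make up for it. For example, -948.16 Mcycles is consistent with a run of roughly 5.8 ms that happened to cross a second boundary. Below is a minimal sketch of the subtraction with the units made consistent. The names elapsed_ns and NSEC_PER_SEC are assumptions for this example, not something from bench.c.

```c
#define _POSIX_C_SOURCE 199309L /* for struct timespec under strict modes */
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L /* note: no trailing semicolon */

/* Hypothetical helper: elapsed time from start to stop in nanoseconds.
 * The seconds difference is scaled to nanoseconds before the (possibly
 * negative) tv_nsec difference is added. */
static int64_t elapsed_ns(const struct timespec *start,
                          const struct timespec *stop) {
    int64_t sec = (int64_t)stop->tv_sec - (int64_t)start->tv_sec;
    int64_t nsec = (int64_t)stop->tv_nsec - (int64_t)start->tv_nsec;
    return sec * NSEC_PER_SEC + nsec;
}
```

Since the ported benchmark reads a wall clock instead of rdtsc, the value it prints under "cpb" and "Mcycles" is really derived from nanoseconds, not CPU cycles.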

Several runs report negative Mcycles and negative seconds, so I will change the placement of the equations in the hope of fixing the results.

Here is the changed code:
/*
* Argon2 reference source code package - reference C implementations
*
* Copyright 2015
* Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
*
* You may use this work under the terms of a Creative Commons CC0 1.0
* License/Waiver or the Apache Public License 2.0, at your option. The terms of
* these licenses can be found at:
*
* - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
* - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0
*
* You should have received a copy of both of these licenses along with this
* software. If not, they may be obtained at the above URLs.
*/

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#define BILLION 1000000000L;
#ifdef _MSC_VER
#include <intrin.h>
#endif

#include "argon2.h"

/*
static uint64_t rdtsc(void) {
#ifdef _MSC_VER
return __rdtsc();
#else
#if defined(__amd64__) || defined(__x86_64__)
uint64_t rax, rdx;
__asm__ __volatile__("rdtsc" : "=a"(rax), "=d"(rdx) : :);
return (rdx << 32) | rax;
#elif defined(__i386__) || defined(__i386) || defined(__X86__)
uint64_t rax;
__asm__ __volatile__("rdtsc" : "=A"(rax) : :);
return rax;
#elif defined(__aarch64__)
return 1;
#else
return 0;
#endif
#endif
}

*/


/*
* Benchmarks Argon2 with salt length 16, password length 16, t_cost 3,
and different m_cost and threads
*/
static void benchmark() {
#define BENCH_OUTLEN 16
#define BENCH_INLEN 16
const uint32_t inlen = BENCH_INLEN;
const unsigned outlen = BENCH_OUTLEN;
unsigned char out[BENCH_OUTLEN];
unsigned char pwd_array[BENCH_INLEN];
unsigned char salt_array[BENCH_INLEN];
#undef BENCH_INLEN
#undef BENCH_OUTLEN

struct timespec start, stop;
double accum;

uint32_t t_cost = 3;
uint32_t m_cost;
uint32_t thread_test[4] = {1, 2, 4, 8};
argon2_type types[3] = {Argon2_i, Argon2_d, Argon2_id};

memset(pwd_array, 0, inlen);
memset(salt_array, 1, inlen);

for (m_cost = (uint32_t)1 << 10; m_cost <= (uint32_t)1 << 22; m_cost *= 2) {
unsigned i;
for (i = 0; i < 4; ++i) {
double run_time = 0;
uint32_t thread_n = thread_test[i];
unsigned j;
for (j = 0; j < 3; ++j) {
/*clock_t start_time, stop_time;
uint64_t start_cycles, stop_cycles;
uint64_t delta;
double mcycles;*/

argon2_type type = types[j];

/*start_time = clock();
start_cycles = rdtsc();*/

if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) {
perror( "clock gettime" );
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &start);
}

argon2_hash(t_cost, m_cost, thread_n, pwd_array, inlen,
salt_array, inlen, out, outlen, NULL, 0, type,
ARGON2_VERSION_NUMBER);

/*stop_cycles = rdtsc();
stop_time = clock();*/

/*delta = (stop_cycles - start_cycles) / (m_cost);
mcycles = (double)(stop_cycles - start_cycles) / (1UL << 20);
run_time += ((double)stop_time - start_time) / (CLOCKS_PER_SEC);*/

if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) {
perror( "clock gettime" ); 
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &stop);
}

accum = ( (double)stop.tv_sec - (double)start.tv_sec )
+ ( (double)stop.tv_nsec - (double)start.tv_nsec ) / BILLION;

double mcycles = accum * BILLION;
mcycles = mcycles / (1UL << 20);
uint64_t delta = accum * BILLION;
delta = delta / (m_cost);

printf("%s %d iterations %d MiB %d threads: %2.2f cpb %2.2f "
"Mcycles \n", argon2_type2string(type, 1), t_cost,
m_cost >> 10, thread_n, (float)delta / 1024, mcycles);

run_time += run_time / (CLOCKS_PER_SEC);

/*run_time += accum;
printf("%2.4f seconds\n\n", (double)run_time);*/
}

printf("%2.4f seconds\n\n", run_time);
}
}

}

int main() {
benchmark();
return ARGON2_OK;
}

Here is the result:
Argon2i 3 iterations 1 MiB 1 threads: 5.61 cpb 5.61 Mcycles
Argon2d 3 iterations 1 MiB 1 threads: 5.18 cpb 5.18 Mcycles
Argon2id 3 iterations 1 MiB 1 threads: 4.64 cpb 4.64 Mcycles
0.0000 seconds

Argon2i 3 iterations 1 MiB 2 threads: 3.64 cpb 3.64 Mcycles
Argon2d 3 iterations 1 MiB 2 threads: 3.26 cpb 3.26 Mcycles
Argon2id 3 iterations 1 MiB 2 threads: 3.29 cpb 3.29 Mcycles
0.0000 seconds

Argon2i 3 iterations 1 MiB 4 threads: 2.69 cpb 2.69 Mcycles
Argon2d 3 iterations 1 MiB 4 threads: 2.69 cpb 2.69 Mcycles
Argon2id 3 iterations 1 MiB 4 threads: 2.64 cpb 2.64 Mcycles
0.0000 seconds

Argon2i 3 iterations 1 MiB 8 threads: 4.44 cpb 4.44 Mcycles
Argon2d 3 iterations 1 MiB 8 threads: 4.41 cpb 4.41 Mcycles
Argon2id 3 iterations 1 MiB 8 threads: 4.45 cpb 4.45 Mcycles
0.0000 seconds

Argon2i 3 iterations 2 MiB 1 threads: 5.45 cpb 10.90 Mcycles
Argon2d 3 iterations 2 MiB 1 threads: 5.19 cpb 10.39 Mcycles
Argon2id 3 iterations 2 MiB 1 threads: 4.67 cpb 9.34 Mcycles
0.0000 seconds

Argon2i 3 iterations 2 MiB 2 threads: 2.95 cpb 5.90 Mcycles
Argon2d 3 iterations 2 MiB 2 threads: 2.88 cpb 5.75 Mcycles
Argon2id 3 iterations 2 MiB 2 threads: 2.91 cpb 5.83 Mcycles
0.0000 seconds

Argon2i 3 iterations 2 MiB 4 threads: 2.09 cpb 4.18 Mcycles
Argon2d 3 iterations 2 MiB 4 threads: 2.09 cpb 4.17 Mcycles
Argon2id 3 iterations 2 MiB 4 threads: 1.94 cpb 3.88 Mcycles
0.0000 seconds

Argon2i 3 iterations 2 MiB 8 threads: 2.44 cpb 4.88 Mcycles
Argon2d 3 iterations 2 MiB 8 threads: 2.48 cpb 4.96 Mcycles
Argon2id 3 iterations 2 MiB 8 threads: 2.63 cpb 5.26 Mcycles
0.0000 seconds

Argon2i 3 iterations 4 MiB 1 threads: 5.52 cpb 22.07 Mcycles
Argon2d 3 iterations 4 MiB 1 threads: 5.01 cpb 20.06 Mcycles
Argon2id 3 iterations 4 MiB 1 threads: 4.70 cpb 18.79 Mcycles
0.0000 seconds

Argon2i 3 iterations 4 MiB 2 threads: 2.78 cpb 11.13 Mcycles
Argon2d 3 iterations 4 MiB 2 threads: 2.69 cpb 10.76 Mcycles
Argon2id 3 iterations 4 MiB 2 threads: 2.71 cpb 10.83 Mcycles
0.0000 seconds

Argon2i 3 iterations 4 MiB 4 threads: 1.68 cpb 6.73 Mcycles
Argon2d 3 iterations 4 MiB 4 threads: 1.67 cpb 6.69 Mcycles
Argon2id 3 iterations 4 MiB 4 threads: 1.68 cpb 6.74 Mcycles
0.0000 seconds

Argon2i 3 iterations 4 MiB 8 threads: 2.24 cpb 8.98 Mcycles
Argon2d 3 iterations 4 MiB 8 threads: 2.47 cpb 9.87 Mcycles
Argon2id 3 iterations 4 MiB 8 threads: 1.94 cpb 7.76 Mcycles
0.0000 seconds

Argon2i 3 iterations 8 MiB 1 threads: 5.71 cpb 45.69 Mcycles
Argon2d 3 iterations 8 MiB 1 threads: 5.24 cpb 41.95 Mcycles
Argon2id 3 iterations 8 MiB 1 threads: 4.87 cpb 38.96 Mcycles
0.0000 seconds

Argon2i 3 iterations 8 MiB 2 threads: 2.71 cpb 21.71 Mcycles
Argon2d 3 iterations 8 MiB 2 threads: 2.68 cpb 21.48 Mcycles
Argon2id 3 iterations 8 MiB 2 threads: 2.68 cpb 21.46 Mcycles
0.0000 seconds

Argon2i 3 iterations 8 MiB 4 threads: 1.55 cpb 12.43 Mcycles
Argon2d 3 iterations 8 MiB 4 threads: 1.54 cpb 12.31 Mcycles
Argon2id 3 iterations 8 MiB 4 threads: 1.56 cpb 12.46 Mcycles
0.0000 seconds

Argon2i 3 iterations 8 MiB 8 threads: 1.77 cpb 14.15 Mcycles
Argon2d 3 iterations 8 MiB 8 threads: 1.72 cpb 13.77 Mcycles
Argon2id 3 iterations 8 MiB 8 threads: 1.80 cpb 14.39 Mcycles
0.0000 seconds

Argon2i 3 iterations 16 MiB 1 threads: 5.97 cpb 95.46 Mcycles
Argon2d 3 iterations 16 MiB 1 threads: 5.52 cpb 88.28 Mcycles
Argon2id 3 iterations 16 MiB 1 threads: 5.21 cpb 83.43 Mcycles
0.0000 seconds

Argon2i 3 iterations 16 MiB 2 threads: 2.87 cpb 45.92 Mcycles
Argon2d 3 iterations 16 MiB 2 threads: 2.83 cpb 45.30 Mcycles
Argon2id 3 iterations 16 MiB 2 threads: 2.84 cpb 45.51 Mcycles
0.0000 seconds

Argon2i 3 iterations 16 MiB 4 threads: 1.59 cpb 25.43 Mcycles
Argon2d 3 iterations 16 MiB 4 threads: 1.57 cpb 25.17 Mcycles
Argon2id 3 iterations 16 MiB 4 threads: 1.58 cpb 25.32 Mcycles
0.0000 seconds

Argon2i 3 iterations 16 MiB 8 threads: 1.92 cpb 30.72 Mcycles
Argon2d 3 iterations 16 MiB 8 threads: 1.71 cpb 27.37 Mcycles
Argon2id 3 iterations 16 MiB 8 threads: 1.78 cpb 28.47 Mcycles
0.0000 seconds

Argon2i 3 iterations 32 MiB 1 threads: 6.19 cpb 198.09 Mcycles
Argon2d 3 iterations 32 MiB 1 threads: 6.10 cpb 195.33 Mcycles
Argon2id 3 iterations 32 MiB 1 threads: 6.11 cpb 195.65 Mcycles
0.0000 seconds

Argon2i 3 iterations 32 MiB 2 threads: 3.39 cpb 108.50 Mcycles
Argon2d 3 iterations 32 MiB 2 threads: 3.36 cpb 107.50 Mcycles
Argon2id 3 iterations 32 MiB 2 threads: 3.36 cpb 107.38 Mcycles
0.0000 seconds

Argon2i 3 iterations 32 MiB 4 threads: 1.91 cpb 61.22 Mcycles
Argon2d 3 iterations 32 MiB 4 threads: 1.90 cpb 60.79 Mcycles
Argon2id 3 iterations 32 MiB 4 threads: 1.90 cpb 60.86 Mcycles
0.0000 seconds

Argon2i 3 iterations 32 MiB 8 threads: 1.90 cpb 60.93 Mcycles
Argon2d 3 iterations 32 MiB 8 threads: 1.90 cpb 60.83 Mcycles
Argon2id 3 iterations 32 MiB 8 threads: 1.97 cpb 62.99 Mcycles
0.0000 seconds

Argon2i 3 iterations 64 MiB 1 threads: 6.32 cpb 404.43 Mcycles
Argon2d 3 iterations 64 MiB 1 threads: 6.23 cpb 398.94 Mcycles
Argon2id 3 iterations 64 MiB 1 threads: 6.24 cpb 399.53 Mcycles
0.0000 seconds

Argon2i 3 iterations 64 MiB 2 threads: 3.45 cpb 220.50 Mcycles
Argon2d 3 iterations 64 MiB 2 threads: 3.41 cpb 218.07 Mcycles
Argon2id 3 iterations 64 MiB 2 threads: 3.42 cpb 218.96 Mcycles
0.0000 seconds

Argon2i 3 iterations 64 MiB 4 threads: 1.92 cpb 123.16 Mcycles
Argon2d 3 iterations 64 MiB 4 threads: 1.91 cpb 122.17 Mcycles
Argon2id 3 iterations 64 MiB 4 threads: 1.91 cpb 122.42 Mcycles
0.0000 seconds

Argon2i 3 iterations 64 MiB 8 threads: 1.82 cpb 116.25 Mcycles
Argon2d 3 iterations 64 MiB 8 threads: 1.84 cpb 117.60 Mcycles
Argon2id 3 iterations 64 MiB 8 threads: 1.87 cpb 119.54 Mcycles
0.0000 seconds

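The results now show positive numbers for Mcycles, but I accidentally removed the equation that calculates the time at the end. run_time starts at 0 and the only remaining update is "run_time += run_time / (CLOCKS_PER_SEC);", which leaves it at 0, so every total prints as 0.0000 seconds. I will fix that now.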

Here is the changed code:
/*
* Argon2 reference source code package - reference C implementations
*
* Copyright 2015
* Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
*
* You may use this work under the terms of a Creative Commons CC0 1.0
* License/Waiver or the Apache Public License 2.0, at your option. The terms of
* these licenses can be found at:
*
* - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
* - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0
*
* You should have received a copy of both of these licenses along with this
* software. If not, they may be obtained at the above URLs.
*/

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#define BILLION 1000000000L;
#ifdef _MSC_VER
#include <intrin.h>
#endif

#include "argon2.h"

/*
static uint64_t rdtsc(void) {
#ifdef _MSC_VER
return __rdtsc();
#else
#if defined(__amd64__) || defined(__x86_64__)
uint64_t rax, rdx;
__asm__ __volatile__("rdtsc" : "=a"(rax), "=d"(rdx) : :);
return (rdx << 32) | rax;
#elif defined(__i386__) || defined(__i386) || defined(__X86__)
uint64_t rax;
__asm__ __volatile__("rdtsc" : "=A"(rax) : :);
return rax;
#elif defined(__aarch64__)
return 1;
#else
return 0;
#endif
#endif
}

*/


/*
* Benchmarks Argon2 with salt length 16, password length 16, t_cost 3,
and different m_cost and threads
*/
static void benchmark() {
#define BENCH_OUTLEN 16
#define BENCH_INLEN 16
const uint32_t inlen = BENCH_INLEN;
const unsigned outlen = BENCH_OUTLEN;
unsigned char out[BENCH_OUTLEN];
unsigned char pwd_array[BENCH_INLEN];
unsigned char salt_array[BENCH_INLEN];
#undef BENCH_INLEN
#undef BENCH_OUTLEN

struct timespec start, stop;
double accum;

uint32_t t_cost = 3;
uint32_t m_cost;
uint32_t thread_test[4] = {1, 2, 4, 8};
argon2_type types[3] = {Argon2_i, Argon2_d, Argon2_id};

memset(pwd_array, 0, inlen);
memset(salt_array, 1, inlen);

for (m_cost = (uint32_t)1 << 10; m_cost <= (uint32_t)1 << 22; m_cost *= 2) {
unsigned i;
for (i = 0; i < 4; ++i) {
double run_time = 0;
uint32_t thread_n = thread_test[i];
unsigned j;
for (j = 0; j < 3; ++j) {
/*clock_t start_time, stop_time;
uint64_t start_cycles, stop_cycles;
uint64_t delta;
double mcycles;*/

argon2_type type = types[j];

/*start_time = clock();
start_cycles = rdtsc();*/

if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) {
perror( "clock gettime" );
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &start);
}

argon2_hash(t_cost, m_cost, thread_n, pwd_array, inlen,
salt_array, inlen, out, outlen, NULL, 0, type,
ARGON2_VERSION_NUMBER);

/*stop_cycles = rdtsc();
stop_time = clock();*/

/*delta = (stop_cycles - start_cycles) / (m_cost);
mcycles = (double)(stop_cycles - start_cycles) / (1UL << 20);
run_time += ((double)stop_time - start_time) / (CLOCKS_PER_SEC);*/

if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) {
perror( "clock gettime" ); 
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &stop);
}

accum = ( (double)stop.tv_sec - (double)start.tv_sec )
+ ( (double)stop.tv_nsec - (double)start.tv_nsec ) / BILLION;

double mcycles = accum * BILLION;
mcycles = mcycles / (1UL << 20);
uint64_t delta = accum * BILLION;
delta = delta / (m_cost);

printf("%s %d iterations %d MiB %d threads: %2.2f cpb %2.2f "
"Mcycles \n", argon2_type2string(type, 1), t_cost,
m_cost >> 10, thread_n, (float)delta / 1024, mcycles);

run_time += run_time / (CLOCKS_PER_SEC);

/*run_time += accum;
printf("%2.4f seconds\n\n", (double)run_time);*/
}

/*run_time = 0;*/
run_time += accum;
printf("%2.4f seconds\n\n", run_time);
}
}

}

int main() {
benchmark();
return ARGON2_OK;
}

Hopefully it works now.

Rebuild and test.

Result:
Argon2i 3 iterations 1 MiB 1 threads: 5.24 cpb 5.24 Mcycles
Argon2d 3 iterations 1 MiB 1 threads: 4.89 cpb 4.90 Mcycles
Argon2id 3 iterations 1 MiB 1 threads: 4.40 cpb 4.40 Mcycles
0.0046 seconds

Argon2i 3 iterations 1 MiB 2 threads: 3.46 cpb 3.46 Mcycles
Argon2d 3 iterations 1 MiB 2 threads: 3.13 cpb 3.13 Mcycles
Argon2id 3 iterations 1 MiB 2 threads: 3.16 cpb 3.16 Mcycles
0.0033 seconds

Argon2i 3 iterations 1 MiB 4 threads: 2.65 cpb 2.65 Mcycles
Argon2d 3 iterations 1 MiB 4 threads: 2.58 cpb 2.58 Mcycles
Argon2id 3 iterations 1 MiB 4 threads: 2.61 cpb 2.61 Mcycles
0.0027 seconds

Argon2i 3 iterations 1 MiB 8 threads: 4.36 cpb 4.36 Mcycles
Argon2d 3 iterations 1 MiB 8 threads: 4.27 cpb 4.27 Mcycles
Argon2id 3 iterations 1 MiB 8 threads: 4.25 cpb 4.25 Mcycles
0.0045 seconds

Argon2i 3 iterations 2 MiB 1 threads: 5.20 cpb 10.41 Mcycles
Argon2d 3 iterations 2 MiB 1 threads: 4.93 cpb 9.86 Mcycles
Argon2id 3 iterations 2 MiB 1 threads: 4.41 cpb 8.82 Mcycles
0.0092 seconds

Argon2i 3 iterations 2 MiB 2 threads: 2.83 cpb 5.65 Mcycles
Argon2d 3 iterations 2 MiB 2 threads: 2.72 cpb 5.44 Mcycles
Argon2id 3 iterations 2 MiB 2 threads: 2.73 cpb 5.47 Mcycles
0.0057 seconds

Argon2i 3 iterations 2 MiB 4 threads: 1.87 cpb 3.73 Mcycles
Argon2d 3 iterations 2 MiB 4 threads: 1.99 cpb 3.98 Mcycles
Argon2id 3 iterations 2 MiB 4 threads: 1.87 cpb 3.74 Mcycles
0.0039 seconds

Argon2i 3 iterations 2 MiB 8 threads: 2.46 cpb 4.93 Mcycles
Argon2d 3 iterations 2 MiB 8 threads: 2.52 cpb 5.05 Mcycles
Argon2id 3 iterations 2 MiB 8 threads: 2.55 cpb 5.10 Mcycles
0.0053 seconds

Argon2i 3 iterations 4 MiB 1 threads: 5.28 cpb 21.11 Mcycles
Argon2d 3 iterations 4 MiB 1 threads: 4.80 cpb 19.21 Mcycles
Argon2id 3 iterations 4 MiB 1 threads: 4.56 cpb 18.22 Mcycles
0.0191 seconds

Argon2i 3 iterations 4 MiB 2 threads: 2.67 cpb 10.66 Mcycles
Argon2d 3 iterations 4 MiB 2 threads: 2.56 cpb 10.25 Mcycles
Argon2id 3 iterations 4 MiB 2 threads: 2.57 cpb 10.27 Mcycles
0.0108 seconds

Argon2i 3 iterations 4 MiB 4 threads: 1.61 cpb 6.42 Mcycles
Argon2d 3 iterations 4 MiB 4 threads: 1.57 cpb 6.29 Mcycles
Argon2id 3 iterations 4 MiB 4 threads: 2.26 cpb 9.03 Mcycles
0.0095 seconds

Argon2i 3 iterations 4 MiB 8 threads: 2.43 cpb 9.74 Mcycles
Argon2d 3 iterations 4 MiB 8 threads: 1.99 cpb 7.95 Mcycles
Argon2id 3 iterations 4 MiB 8 threads: 2.15 cpb 8.61 Mcycles
0.0090 seconds

Argon2i 3 iterations 8 MiB 1 threads: 5.50 cpb 43.97 Mcycles
Argon2d 3 iterations 8 MiB 1 threads: 5.06 cpb 40.49 Mcycles
Argon2id 3 iterations 8 MiB 1 threads: 4.63 cpb 37.06 Mcycles
0.0389 seconds

Argon2i 3 iterations 8 MiB 2 threads: 2.62 cpb 20.97 Mcycles
Argon2d 3 iterations 8 MiB 2 threads: 2.56 cpb 20.48 Mcycles
Argon2id 3 iterations 8 MiB 2 threads: 2.57 cpb 20.53 Mcycles
0.0215 seconds

Argon2i 3 iterations 8 MiB 4 threads: 1.49 cpb 11.91 Mcycles
Argon2d 3 iterations 8 MiB 4 threads: 1.46 cpb 11.69 Mcycles
Argon2id 3 iterations 8 MiB 4 threads: 1.47 cpb 11.74 Mcycles
0.0123 seconds

Argon2i 3 iterations 8 MiB 8 threads: 1.96 cpb 15.66 Mcycles
Argon2d 3 iterations 8 MiB 8 threads: 1.73 cpb 13.82 Mcycles
Argon2id 3 iterations 8 MiB 8 threads: 1.86 cpb 14.86 Mcycles
0.0156 seconds

Argon2i 3 iterations 16 MiB 1 threads: 5.75 cpb 92.08 Mcycles
Argon2d 3 iterations 16 MiB 1 threads: 5.29 cpb 84.71 Mcycles
Argon2id 3 iterations 16 MiB 1 threads: 5.01 cpb 80.20 Mcycles
0.0841 seconds

Argon2i 3 iterations 16 MiB 2 threads: 2.75 cpb 44.01 Mcycles
Argon2d 3 iterations 16 MiB 2 threads: 2.73 cpb 43.66 Mcycles
Argon2id 3 iterations 16 MiB 2 threads: 2.72 cpb 43.55 Mcycles
0.0457 seconds

Argon2i 3 iterations 16 MiB 4 threads: 1.52 cpb 24.39 Mcycles
Argon2d 3 iterations 16 MiB 4 threads: 1.50 cpb 24.08 Mcycles
Argon2id 3 iterations 16 MiB 4 threads: 1.51 cpb 24.14 Mcycles
0.0253 seconds

Argon2i 3 iterations 16 MiB 8 threads: 1.70 cpb 27.21 Mcycles
Argon2d 3 iterations 16 MiB 8 threads: 1.67 cpb 26.80 Mcycles
Argon2id 3 iterations 16 MiB 8 threads: 1.70 cpb 27.21 Mcycles
0.0285 seconds

Argon2i 3 iterations 32 MiB 1 threads: 5.93 cpb 189.81 Mcycles
Argon2d 3 iterations 32 MiB 1 threads: 5.88 cpb 188.10 Mcycles
Argon2id 3 iterations 32 MiB 1 threads: 5.86 cpb 187.57 Mcycles
0.1967 seconds

Argon2i 3 iterations 32 MiB 2 threads: 3.29 cpb 105.13 Mcycles
Argon2d 3 iterations 32 MiB 2 threads: 3.25 cpb 103.96 Mcycles
Argon2id 3 iterations 32 MiB 2 threads: 3.25 cpb 104.06 Mcycles
0.1091 seconds

Argon2i 3 iterations 32 MiB 4 threads: 1.85 cpb 59.28 Mcycles
Argon2d 3 iterations 32 MiB 4 threads: 1.84 cpb 58.83 Mcycles
Argon2id 3 iterations 32 MiB 4 threads: 1.84 cpb 58.88 Mcycles
0.0617 seconds

Argon2i 3 iterations 32 MiB 8 threads: 1.82 cpb 58.35 Mcycles
Argon2d 3 iterations 32 MiB 8 threads: 1.99 cpb 63.75 Mcycles
Argon2id 3 iterations 32 MiB 8 threads: 1.88 cpb 60.21 Mcycles
0.0631 seconds

Argon2i 3 iterations 64 MiB 1 threads: 6.07 cpb 388.65 Mcycles
Argon2d 3 iterations 64 MiB 1 threads: 6.01 cpb 384.52 Mcycles
Argon2id 3 iterations 64 MiB 1 threads: 6.02 cpb 385.18 Mcycles
0.4039 seconds

Argon2i 3 iterations 64 MiB 2 threads: 3.34 cpb 213.63 Mcycles
Argon2d 3 iterations 64 MiB 2 threads: 3.30 cpb 211.42 Mcycles
Argon2id 3 iterations 64 MiB 2 threads: 3.30 cpb 211.20 Mcycles
0.2215 seconds

Argon2i 3 iterations 64 MiB 4 threads: 1.87 cpb 119.59 Mcycles
Argon2d 3 iterations 64 MiB 4 threads: 1.84 cpb 118.12 Mcycles
Argon2id 3 iterations 64 MiB 4 threads: 1.85 cpb 118.15 Mcycles
0.1239 seconds

Argon2i 3 iterations 64 MiB 8 threads: 1.74 cpb 111.63 Mcycles
Argon2d 3 iterations 64 MiB 8 threads: 1.76 cpb 112.49 Mcycles
Argon2id 3 iterations 64 MiB 8 threads: 1.85 cpb 118.57 Mcycles
0.1243 seconds

Argon2i 3 iterations 128 MiB 1 threads: 6.20 cpb 793.29 Mcycles
Argon2d 3 iterations 128 MiB 1 threads: 6.14 cpb 785.44 Mcycles
Argon2id 3 iterations 128 MiB 1 threads: 6.14 cpb 786.33 Mcycles
0.8245 seconds

Argon2i 3 iterations 128 MiB 2 threads: 3.38 cpb 432.51 Mcycles
Argon2d 3 iterations 128 MiB 2 threads: 3.35 cpb 428.33 Mcycles
Argon2id 3 iterations 128 MiB 2 threads: 3.35 cpb 428.92 Mcycles
0.4498 seconds

Argon2i 3 iterations 128 MiB 4 threads: 1.88 cpb 240.65 Mcycles
Argon2d 3 iterations 128 MiB 4 threads: 1.86 cpb 238.37 Mcycles
Argon2id 3 iterations 128 MiB 4 threads: 1.86 cpb 238.47 Mcycles
0.2501 seconds

Argon2i 3 iterations 128 MiB 8 threads: 1.60 cpb 205.20 Mcycles
Argon2d 3 iterations 128 MiB 8 threads: 1.71 cpb 218.40 Mcycles
Argon2id 3 iterations 128 MiB 8 threads: 1.77 cpb 227.16 Mcycles
0.2382 seconds

Argon2i 3 iterations 256 MiB 1 threads: 6.30 cpb 1611.99 Mcycles
Argon2d 3 iterations 256 MiB 1 threads: 6.24 cpb 1597.32 Mcycles
Argon2id 3 iterations 256 MiB 1 threads: 6.25 cpb 1600.12 Mcycles
1.6778 seconds

Argon2i 3 iterations 256 MiB 2 threads: 3.42 cpb 874.77 Mcycles
Argon2d 3 iterations 256 MiB 2 threads: 3.39 cpb 867.53 Mcycles
Argon2id 3 iterations 256 MiB 2 threads: 3.39 cpb 868.38 Mcycles
0.9106 seconds

Argon2i 3 iterations 256 MiB 4 threads: 1.92 cpb 491.15 Mcycles
Argon2d 3 iterations 256 MiB 4 threads: 1.88 cpb 481.03 Mcycles
Argon2id 3 iterations 256 MiB 4 threads: 1.89 cpb 484.98 Mcycles
0.5085 seconds

Argon2i 3 iterations 256 MiB 8 threads: 1.44 cpb 369.10 Mcycles
Argon2d 3 iterations 256 MiB 8 threads: 1.63 cpb 418.42 Mcycles
Argon2id 3 iterations 256 MiB 8 threads: 1.67 cpb 428.07 Mcycles
0.4489 seconds

The results look successful. I will try again, this time with the GNU gcc compiler optimization level set to -O3.

I can change the option using the Vim editor.

command:
vi Makefile

I will change the following line:

CFLAGS += -std=c89 -O2 -Wall -g -Iinclude -Isrc

The change will look like this:

CFLAGS += -std=c89 -O3 -Wall -g -Iinclude -Isrc

I will save the file with the new changes and rebuild the program to test it.

command:
make bench
Result:
Argon2i 3 iterations 1 MiB 1 threads: 4.80 cpb 4.80 Mcycles
Argon2d 3 iterations 1 MiB 1 threads: 4.52 cpb 4.52 Mcycles
Argon2id 3 iterations 1 MiB 1 threads: 3.96 cpb 3.96 Mcycles
0.0042 seconds

Argon2i 3 iterations 1 MiB 2 threads: 3.33 cpb 3.33 Mcycles
Argon2d 3 iterations 1 MiB 2 threads: 2.92 cpb 2.92 Mcycles
Argon2id 3 iterations 1 MiB 2 threads: 2.91 cpb 2.91 Mcycles
0.0031 seconds

Argon2i 3 iterations 1 MiB 4 threads: 2.46 cpb 2.46 Mcycles
Argon2d 3 iterations 1 MiB 4 threads: 2.43 cpb 2.43 Mcycles
Argon2id 3 iterations 1 MiB 4 threads: 2.48 cpb 2.48 Mcycles
0.0026 seconds

Argon2i 3 iterations 1 MiB 8 threads: 4.52 cpb 4.52 Mcycles
Argon2d 3 iterations 1 MiB 8 threads: 4.39 cpb 4.39 Mcycles
Argon2id 3 iterations 1 MiB 8 threads: 4.33 cpb 4.33 Mcycles
0.0045 seconds

Argon2i 3 iterations 2 MiB 1 threads: 4.79 cpb 9.57 Mcycles
Argon2d 3 iterations 2 MiB 1 threads: 4.52 cpb 9.04 Mcycles
Argon2id 3 iterations 2 MiB 1 threads: 4.00 cpb 8.00 Mcycles
0.0084 seconds

Argon2i 3 iterations 2 MiB 2 threads: 2.62 cpb 5.25 Mcycles
Argon2d 3 iterations 2 MiB 2 threads: 2.58 cpb 5.17 Mcycles
Argon2id 3 iterations 2 MiB 2 threads: 2.59 cpb 5.18 Mcycles
0.0054 seconds

Argon2i 3 iterations 2 MiB 4 threads: 1.85 cpb 3.69 Mcycles
Argon2d 3 iterations 2 MiB 4 threads: 1.85 cpb 3.70 Mcycles
Argon2id 3 iterations 2 MiB 4 threads: 1.77 cpb 3.53 Mcycles
0.0037 seconds

Argon2i 3 iterations 2 MiB 8 threads: 2.31 cpb 4.62 Mcycles
Argon2d 3 iterations 2 MiB 8 threads: 2.42 cpb 4.84 Mcycles
Argon2id 3 iterations 2 MiB 8 threads: 2.46 cpb 4.93 Mcycles
0.0052 seconds

Argon2i 3 iterations 4 MiB 1 threads: 4.87 cpb 19.47 Mcycles
Argon2d 3 iterations 4 MiB 1 threads: 4.39 cpb 17.55 Mcycles
Argon2id 3 iterations 4 MiB 1 threads: 4.03 cpb 16.11 Mcycles
0.0169 seconds

Argon2i 3 iterations 4 MiB 2 threads: 2.45 cpb 9.81 Mcycles
Argon2d 3 iterations 4 MiB 2 threads: 2.40 cpb 9.61 Mcycles
Argon2id 3 iterations 4 MiB 2 threads: 2.39 cpb 9.56 Mcycles
0.0100 seconds

Argon2i 3 iterations 4 MiB 4 threads: 1.48 cpb 5.93 Mcycles
Argon2d 3 iterations 4 MiB 4 threads: 1.47 cpb 5.87 Mcycles
Argon2id 3 iterations 4 MiB 4 threads: 1.50 cpb 5.98 Mcycles
0.0063 seconds

Argon2i 3 iterations 4 MiB 8 threads: 2.21 cpb 8.84 Mcycles
Argon2d 3 iterations 4 MiB 8 threads: 2.05 cpb 8.19 Mcycles
Argon2id 3 iterations 4 MiB 8 threads: 2.13 cpb 8.53 Mcycles
0.0089 seconds

Argon2i 3 iterations 8 MiB 1 threads: 5.14 cpb 41.16 Mcycles
Argon2d 3 iterations 8 MiB 1 threads: 4.62 cpb 36.95 Mcycles
Argon2id 3 iterations 8 MiB 1 threads: 4.23 cpb 33.81 Mcycles
0.0355 seconds

Argon2i 3 iterations 8 MiB 2 threads: 2.42 cpb 19.33 Mcycles
Argon2d 3 iterations 8 MiB 2 threads: 2.38 cpb 19.03 Mcycles
Argon2id 3 iterations 8 MiB 2 threads: 2.38 cpb 19.03 Mcycles
0.0200 seconds

Argon2i 3 iterations 8 MiB 4 threads: 1.38 cpb 11.09 Mcycles
Argon2d 3 iterations 8 MiB 4 threads: 1.38 cpb 11.00 Mcycles
Argon2id 3 iterations 8 MiB 4 threads: 1.38 cpb 11.07 Mcycles
0.0116 seconds

Argon2i 3 iterations 8 MiB 8 threads: 1.73 cpb 13.88 Mcycles
Argon2d 3 iterations 8 MiB 8 threads: 1.81 cpb 14.47 Mcycles
Argon2id 3 iterations 8 MiB 8 threads: 1.90 cpb 15.24 Mcycles
0.0160 seconds

Argon2i 3 iterations 16 MiB 1 threads: 5.39 cpb 86.31 Mcycles
Argon2d 3 iterations 16 MiB 1 threads: 4.93 cpb 78.84 Mcycles
Argon2id 3 iterations 16 MiB 1 threads: 4.66 cpb 74.55 Mcycles
0.0782 seconds

Argon2i 3 iterations 16 MiB 2 threads: 2.59 cpb 41.41 Mcycles
Argon2d 3 iterations 16 MiB 2 threads: 2.56 cpb 40.95 Mcycles
Argon2id 3 iterations 16 MiB 2 threads: 2.57 cpb 41.09 Mcycles
0.0431 seconds

Argon2i 3 iterations 16 MiB 4 threads: 1.47 cpb 23.47 Mcycles
Argon2d 3 iterations 16 MiB 4 threads: 1.46 cpb 23.35 Mcycles
Argon2id 3 iterations 16 MiB 4 threads: 1.44 cpb 23.05 Mcycles
0.0242 seconds

Argon2i 3 iterations 16 MiB 8 threads: 1.69 cpb 27.07 Mcycles
Argon2d 3 iterations 16 MiB 8 threads: 1.71 cpb 27.36 Mcycles
Argon2id 3 iterations 16 MiB 8 threads: 1.60 cpb 25.60 Mcycles
0.0268 seconds

Argon2i 3 iterations 32 MiB 1 threads: 5.56 cpb 178.05 Mcycles
Argon2d 3 iterations 32 MiB 1 threads: 5.48 cpb 175.31 Mcycles
Argon2id 3 iterations 32 MiB 1 threads: 5.49 cpb 175.62 Mcycles
0.1841 seconds

Argon2i 3 iterations 32 MiB 2 threads: 3.10 cpb 99.33 Mcycles
Argon2d 3 iterations 32 MiB 2 threads: 3.07 cpb 98.24 Mcycles
Argon2id 3 iterations 32 MiB 2 threads: 3.07 cpb 98.39 Mcycles
0.1032 seconds

Argon2i 3 iterations 32 MiB 4 threads: 1.78 cpb 56.83 Mcycles
Argon2d 3 iterations 32 MiB 4 threads: 1.76 cpb 56.34 Mcycles
Argon2id 3 iterations 32 MiB 4 threads: 1.76 cpb 56.46 Mcycles
0.0592 seconds

Argon2i 3 iterations 32 MiB 8 threads: 1.80 cpb 57.72 Mcycles
Argon2d 3 iterations 32 MiB 8 threads: 1.75 cpb 56.17 Mcycles
Argon2id 3 iterations 32 MiB 8 threads: 1.80 cpb 57.75 Mcycles
0.0606 seconds

Argon2i 3 iterations 64 MiB 1 threads: 5.69 cpb 364.37 Mcycles
Argon2d 3 iterations 64 MiB 1 threads: 5.63 cpb 360.52 Mcycles
Argon2id 3 iterations 64 MiB 1 threads: 5.64 cpb 361.19 Mcycles
0.3787 seconds

Argon2i 3 iterations 64 MiB 2 threads: 3.17 cpb 203.00 Mcycles
Argon2d 3 iterations 64 MiB 2 threads: 3.14 cpb 200.72 Mcycles
Argon2id 3 iterations 64 MiB 2 threads: 3.14 cpb 201.11 Mcycles
0.2109 seconds

Argon2i 3 iterations 64 MiB 4 threads: 1.79 cpb 114.35 Mcycles
Argon2d 3 iterations 64 MiB 4 threads: 1.77 cpb 113.36 Mcycles
Argon2id 3 iterations 64 MiB 4 threads: 1.78 cpb 114.01 Mcycles
0.1195 seconds

Argon2i 3 iterations 64 MiB 8 threads: 1.69 cpb 108.44 Mcycles
Argon2d 3 iterations 64 MiB 8 threads: 1.72 cpb 109.93 Mcycles
Argon2id 3 iterations 64 MiB 8 threads: 1.70 cpb 108.90 Mcycles
0.1142 seconds

Argon2i 3 iterations 128 MiB 1 threads: 5.81 cpb 743.61 Mcycles
Argon2d 3 iterations 128 MiB 1 threads: 5.76 cpb 737.17 Mcycles
Argon2id 3 iterations 128 MiB 1 threads: 5.76 cpb 737.74 Mcycles
0.7736 seconds

Argon2i 3 iterations 128 MiB 2 threads: 3.23 cpb 413.39 Mcycles
Argon2d 3 iterations 128 MiB 2 threads: 3.20 cpb 409.93 Mcycles
Argon2id 3 iterations 128 MiB 2 threads: 3.20 cpb 410.16 Mcycles
0.4301 seconds

Argon2i 3 iterations 128 MiB 4 threads: 1.80 cpb 230.53 Mcycles
Argon2d 3 iterations 128 MiB 4 threads: 1.79 cpb 228.66 Mcycles
Argon2id 3 iterations 128 MiB 4 threads: 1.78 cpb 228.44 Mcycles
0.2395 seconds

Argon2i 3 iterations 128 MiB 8 threads: 1.69 cpb 216.05 Mcycles
Argon2d 3 iterations 128 MiB 8 threads: 1.62 cpb 207.76 Mcycles
Argon2id 3 iterations 128 MiB 8 threads: 1.65 cpb 211.43 Mcycles
0.2217 seconds

Argon2i 3 iterations 256 MiB 1 threads: 5.93 cpb 1517.87 Mcycles
Argon2d 3 iterations 256 MiB 1 threads: 5.87 cpb 1503.31 Mcycles
Argon2id 3 iterations 256 MiB 1 threads: 5.88 cpb 1505.68 Mcycles
1.5788 seconds

Argon2i 3 iterations 256 MiB 2 threads: 3.27 cpb 838.35 Mcycles
Argon2d 3 iterations 256 MiB 2 threads: 3.25 cpb 831.07 Mcycles
Argon2id 3 iterations 256 MiB 2 threads: 3.25 cpb 831.79 Mcycles
0.8722 seconds

Argon2i 3 iterations 256 MiB 4 threads: 1.81 cpb 464.17 Mcycles
Argon2d 3 iterations 256 MiB 4 threads: 1.81 cpb 463.87 Mcycles
Argon2id 3 iterations 256 MiB 4 threads: 1.80 cpb 461.07 Mcycles
0.4835 seconds

Argon2i 3 iterations 256 MiB 8 threads: 1.53 cpb 390.76 Mcycles
Argon2d 3 iterations 256 MiB 8 threads: 1.59 cpb 406.13 Mcycles
Argon2id 3 iterations 256 MiB 8 threads: 1.60 cpb 409.85 Mcycles
0.4298 seconds

The results are quite similar to those at optimization level -O2; most rows differ by less than 10%. This could be from the additional writing of variables into memory.

Test 2(x86_64):

I will try the changed code on machine 2.

This machine, as mentioned before, has these specifications:

Machine 2:

Intel(R) Xeon(R) CPU E5-1630 v4 @ 3.70GHz

Four sticks of 8GB DIMM DDR4 RAM at 2.4 GHz (32 GB of RAM in total)

x86_64 Fedora 28 version of Linux Operating System

I will run this test with optimization level -O2.

Compile the program:

cc -std=c89 -O2 -Wall -g -Iinclude -Isrc -pthread -march=native src/argon2.c src/core.c src/blake2/blake2b.c src/thread.c src/encoding.c src/opt.c src/bench.c -o bench
Result:
Argon2i 3 iterations 1 MiB 1 threads: 3.54 cpb 3.54 Mcycles
Argon2d 3 iterations 1 MiB 1 threads: 3.20 cpb 3.20 Mcycles
Argon2id 3 iterations 1 MiB 1 threads: 2.73 cpb 2.73 Mcycles
0.0029 seconds

Argon2i 3 iterations 1 MiB 2 threads: 2.92 cpb 2.92 Mcycles
Argon2d 3 iterations 1 MiB 2 threads: 2.34 cpb 2.34 Mcycles
Argon2id 3 iterations 1 MiB 2 threads: 2.40 cpb 2.40 Mcycles
0.0025 seconds

Argon2i 3 iterations 1 MiB 4 threads: 1.97 cpb 1.97 Mcycles
Argon2d 3 iterations 1 MiB 4 threads: 1.87 cpb 1.87 Mcycles
Argon2id 3 iterations 1 MiB 4 threads: 1.94 cpb 1.94 Mcycles
0.0020 seconds

Argon2i 3 iterations 1 MiB 8 threads: 3.21 cpb 3.21 Mcycles
Argon2d 3 iterations 1 MiB 8 threads: 3.00 cpb 3.00 Mcycles
Argon2id 3 iterations 1 MiB 8 threads: 2.81 cpb 2.81 Mcycles
0.0030 seconds

Argon2i 3 iterations 2 MiB 1 threads: 1.40 cpb 2.79 Mcycles
Argon2d 3 iterations 2 MiB 1 threads: 1.21 cpb 2.42 Mcycles
Argon2id 3 iterations 2 MiB 1 threads: 1.04 cpb 2.08 Mcycles
0.0022 seconds

Argon2i 3 iterations 2 MiB 2 threads: 1.44 cpb 2.88 Mcycles
Argon2d 3 iterations 2 MiB 2 threads: 1.36 cpb 2.72 Mcycles
Argon2id 3 iterations 2 MiB 2 threads: 1.37 cpb 2.73 Mcycles
0.0029 seconds

Argon2i 3 iterations 2 MiB 4 threads: 0.99 cpb 1.99 Mcycles
Argon2d 3 iterations 2 MiB 4 threads: 1.11 cpb 2.21 Mcycles
Argon2id 3 iterations 2 MiB 4 threads: 1.05 cpb 2.11 Mcycles
0.0022 seconds

Argon2i 3 iterations 2 MiB 8 threads: 1.67 cpb 3.35 Mcycles
Argon2d 3 iterations 2 MiB 8 threads: 1.54 cpb 3.08 Mcycles
Argon2id 3 iterations 2 MiB 8 threads: 1.51 cpb 3.02 Mcycles
0.0032 seconds

Argon2i 3 iterations 4 MiB 1 threads: 1.41 cpb 5.65 Mcycles
Argon2d 3 iterations 4 MiB 1 threads: 1.09 cpb 4.38 Mcycles
Argon2id 3 iterations 4 MiB 1 threads: 0.98 cpb 3.92 Mcycles
0.0041 seconds

Argon2i 3 iterations 4 MiB 2 threads: 1.28 cpb 5.13 Mcycles
Argon2d 3 iterations 4 MiB 2 threads: 1.21 cpb 4.85 Mcycles
Argon2id 3 iterations 4 MiB 2 threads: 1.23 cpb 4.93 Mcycles
0.0052 seconds

Argon2i 3 iterations 4 MiB 4 threads: 0.79 cpb 3.18 Mcycles
Argon2d 3 iterations 4 MiB 4 threads: 0.79 cpb 3.18 Mcycles
Argon2id 3 iterations 4 MiB 4 threads: 0.81 cpb 3.22 Mcycles
0.0034 seconds

Argon2i 3 iterations 4 MiB 8 threads: 1.00 cpb 4.00 Mcycles
Argon2d 3 iterations 4 MiB 8 threads: 0.89 cpb 3.58 Mcycles
Argon2id 3 iterations 4 MiB 8 threads: 0.91 cpb 3.64 Mcycles
0.0038 seconds

Argon2i 3 iterations 8 MiB 1 threads: 1.47 cpb 11.79 Mcycles
Argon2d 3 iterations 8 MiB 1 threads: 1.13 cpb 9.08 Mcycles
Argon2id 3 iterations 8 MiB 1 threads: 0.97 cpb 7.80 Mcycles
0.0082 seconds

Argon2i 3 iterations 8 MiB 2 threads: 1.27 cpb 10.18 Mcycles
Argon2d 3 iterations 8 MiB 2 threads: 0.87 cpb 6.95 Mcycles
Argon2id 3 iterations 8 MiB 2 threads: 0.88 cpb 7.00 Mcycles
0.0073 seconds

Argon2i 3 iterations 8 MiB 4 threads: 0.91 cpb 7.31 Mcycles
Argon2d 3 iterations 8 MiB 4 threads: 0.80 cpb 6.42 Mcycles
Argon2id 3 iterations 8 MiB 4 threads: 0.59 cpb 4.70 Mcycles
0.0049 seconds

Argon2i 3 iterations 8 MiB 8 threads: 0.82 cpb 6.53 Mcycles
Argon2d 3 iterations 8 MiB 8 threads: 0.83 cpb 6.63 Mcycles
Argon2id 3 iterations 8 MiB 8 threads: 0.81 cpb 6.47 Mcycles
0.0068 seconds

Argon2i 3 iterations 16 MiB 1 threads: 1.89 cpb 30.20 Mcycles
Argon2d 3 iterations 16 MiB 1 threads: 1.33 cpb 21.22 Mcycles
Argon2id 3 iterations 16 MiB 1 threads: 1.17 cpb 18.70 Mcycles
0.0196 seconds

Argon2i 3 iterations 16 MiB 2 threads: 1.17 cpb 18.80 Mcycles
Argon2d 3 iterations 16 MiB 2 threads: 0.81 cpb 13.03 Mcycles
Argon2id 3 iterations 16 MiB 2 threads: 0.79 cpb 12.57 Mcycles
0.0132 seconds

Argon2i 3 iterations 16 MiB 4 threads: 0.80 cpb 12.79 Mcycles
Argon2d 3 iterations 16 MiB 4 threads: 0.56 cpb 8.97 Mcycles
Argon2id 3 iterations 16 MiB 4 threads: 0.53 cpb 8.45 Mcycles
0.0089 seconds

Argon2i 3 iterations 16 MiB 8 threads: 0.60 cpb 9.57 Mcycles
Argon2d 3 iterations 16 MiB 8 threads: 0.64 cpb 10.22 Mcycles
Argon2id 3 iterations 16 MiB 8 threads: 0.68 cpb 10.83 Mcycles
0.0114 seconds

Argon2i 3 iterations 32 MiB 1 threads: 1.64 cpb 52.53 Mcycles
Argon2d 3 iterations 32 MiB 1 threads: 1.50 cpb 47.89 Mcycles
Argon2id 3 iterations 32 MiB 1 threads: 1.49 cpb 47.84 Mcycles
0.0502 seconds

Argon2i 3 iterations 32 MiB 2 threads: 1.28 cpb 41.08 Mcycles
Argon2d 3 iterations 32 MiB 2 threads: 1.29 cpb 41.17 Mcycles
Argon2id 3 iterations 32 MiB 2 threads: 1.38 cpb 44.31 Mcycles
0.0465 seconds

Argon2i 3 iterations 32 MiB 4 threads: 0.86 cpb 27.46 Mcycles
Argon2d 3 iterations 32 MiB 4 threads: 0.74 cpb 23.58 Mcycles
Argon2id 3 iterations 32 MiB 4 threads: 0.65 cpb 20.68 Mcycles
0.0217 seconds

Argon2i 3 iterations 32 MiB 8 threads: 0.68 cpb 21.81 Mcycles
Argon2d 3 iterations 32 MiB 8 threads: 0.69 cpb 22.09 Mcycles
Argon2id 3 iterations 32 MiB 8 threads: 0.68 cpb 21.73 Mcycles
0.0228 seconds

Argon2i 3 iterations 64 MiB 1 threads: 1.61 cpb 103.11 Mcycles
Argon2d 3 iterations 64 MiB 1 threads: 1.58 cpb 101.05 Mcycles
Argon2id 3 iterations 64 MiB 1 threads: 1.58 cpb 101.25 Mcycles
0.1062 seconds

Argon2i 3 iterations 64 MiB 2 threads: 1.44 cpb 92.42 Mcycles
Argon2d 3 iterations 64 MiB 2 threads: 1.18 cpb 75.76 Mcycles
Argon2id 3 iterations 64 MiB 2 threads: 1.18 cpb 75.28 Mcycles
0.0789 seconds

Argon2i 3 iterations 64 MiB 4 threads: 0.76 cpb 48.48 Mcycles
Argon2d 3 iterations 64 MiB 4 threads: 0.65 cpb 41.49 Mcycles
Argon2id 3 iterations 64 MiB 4 threads: 0.63 cpb 40.49 Mcycles
0.0425 seconds

Argon2i 3 iterations 64 MiB 8 threads: 0.58 cpb 37.08 Mcycles
Argon2d 3 iterations 64 MiB 8 threads: 0.61 cpb 38.88 Mcycles
Argon2id 3 iterations 64 MiB 8 threads: 0.61 cpb 39.02 Mcycles
0.0409 seconds

Argon2i 3 iterations 128 MiB 1 threads: 1.72 cpb 220.68 Mcycles
Argon2d 3 iterations 128 MiB 1 threads: 1.65 cpb 211.20 Mcycles
Argon2id 3 iterations 128 MiB 1 threads: 1.61 cpb 206.66 Mcycles
0.2167 seconds

Argon2i 3 iterations 128 MiB 2 threads: 1.12 cpb 143.16 Mcycles
Argon2d 3 iterations 128 MiB 2 threads: 1.11 cpb 142.53 Mcycles
Argon2id 3 iterations 128 MiB 2 threads: 1.11 cpb 142.67 Mcycles
0.1496 seconds

Argon2i 3 iterations 128 MiB 4 threads: 0.68 cpb 87.52 Mcycles
Argon2d 3 iterations 128 MiB 4 threads: 0.68 cpb 86.96 Mcycles
Argon2id 3 iterations 128 MiB 4 threads: 0.68 cpb 86.78 Mcycles
0.0910 seconds

Argon2i 3 iterations 128 MiB 8 threads: 0.59 cpb 75.56 Mcycles
Argon2d 3 iterations 128 MiB 8 threads: 0.55 cpb 70.96 Mcycles
Argon2id 3 iterations 128 MiB 8 threads: 0.58 cpb 74.02 Mcycles
0.0776 seconds

Argon2i 3 iterations 256 MiB 1 threads: 1.75 cpb 447.73 Mcycles
Argon2d 3 iterations 256 MiB 1 threads: 1.62 cpb 414.48 Mcycles
Argon2id 3 iterations 256 MiB 1 threads: 1.62 cpb 415.25 Mcycles
0.4354 seconds

Argon2i 3 iterations 256 MiB 2 threads: 1.17 cpb 299.72 Mcycles
Argon2d 3 iterations 256 MiB 2 threads: 1.07 cpb 274.17 Mcycles
Argon2id 3 iterations 256 MiB 2 threads: 1.14 cpb 291.48 Mcycles
0.3056 seconds

Argon2i 3 iterations 256 MiB 4 threads: 0.70 cpb 180.25 Mcycles
Argon2d 3 iterations 256 MiB 4 threads: 0.71 cpb 182.79 Mcycles
Argon2id 3 iterations 256 MiB 4 threads: 0.70 cpb 180.23 Mcycles
0.1890 seconds

Argon2i 3 iterations 256 MiB 8 threads: 0.54 cpb 137.75 Mcycles
Argon2d 3 iterations 256 MiB 8 threads: 0.54 cpb 139.23 Mcycles
Argon2id 3 iterations 256 MiB 8 threads: 0.53 cpb 134.82 Mcycles
0.1414 seconds

(This blog is getting too long. I will continue in Project: Part3 – Optimizing and porting argon2 package using C and Assembler language(Progress 4))

 

Project: Part3 – Optimizing and porting argon2 package using C and Assembler language(Progress 2)

Requirements/ System Specifications.

Argon2 Password hashing function package:

https://github.com/P-H-C/phc-winner-argon2

Aarch64 Fedora 28 version of Linux operating system

Cortex-A57 8 core processor

One set of Dual-Channel DIMM DDR3 8GB RAM (16GB in total)

New Plan

The new plan is to change the benchmark program to use the internal system timer to calculate the run time (the time required to run the program or a specific piece of code) instead of relying on RDTSC (the time-stamp counter instruction, available on x86 and x86_64 processors only).

Here is the original benchmark program file (bench.c):

/*
* Argon2 reference source code package - reference C implementations
*
* Copyright 2015
* Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
*
* You may use this work under the terms of a Creative Commons CC0 1.0
* License/Waiver or the Apache Public License 2.0, at your option. The terms of
* these licenses can be found at:
*
* - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
* - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0
*
* You should have received a copy of both of these licenses along with this
* software. If not, they may be obtained at the above URLs.
*/

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#ifdef _MSC_VER
#include <intrin.h>
#endif

#include "argon2.h"

static uint64_t rdtsc(void) {
#ifdef _MSC_VER
return __rdtsc();
#else
#if defined(__amd64__) || defined(__x86_64__)
uint64_t rax, rdx;
__asm__ __volatile__("rdtsc" : "=a"(rax), "=d"(rdx) : :);
return (rdx << 32) | rax;
#elif defined(__i386__) || defined(__i386) || defined(__X86__)
uint64_t rax;
__asm__ __volatile__("rdtsc" : "=A"(rax) : :);
return rax;
#else
#error "Not implemented!"
#endif
#endif
}

/*
* Benchmarks Argon2 with salt length 16, password length 16, t_cost 3,
and different m_cost and threads
*/
static void benchmark() {
#define BENCH_OUTLEN 16
#define BENCH_INLEN 16
const uint32_t inlen = BENCH_INLEN;
const unsigned outlen = BENCH_OUTLEN;
unsigned char out[BENCH_OUTLEN];
unsigned char pwd_array[BENCH_INLEN];
unsigned char salt_array[BENCH_INLEN];
#undef BENCH_INLEN
#undef BENCH_OUTLEN

uint32_t t_cost = 3;
uint32_t m_cost;
uint32_t thread_test[4] = {1, 2, 4, 8};
argon2_type types[3] = {Argon2_i, Argon2_d, Argon2_id};

memset(pwd_array, 0, inlen);
memset(salt_array, 1, inlen);

for (m_cost = (uint32_t)1 << 10; m_cost <= (uint32_t)1 << 22; m_cost *= 2) {
unsigned i;
for (i = 0; i < 4; ++i) {
double run_time = 0;
uint32_t thread_n = thread_test[i];

unsigned j;
for (j = 0; j < 3; ++j) {
clock_t start_time, stop_time;
uint64_t start_cycles, stop_cycles;
uint64_t delta;
double mcycles;

argon2_type type = types[j];
start_time = clock();
start_cycles = rdtsc();

argon2_hash(t_cost, m_cost, thread_n, pwd_array, inlen,
salt_array, inlen, out, outlen, NULL, 0, type,
ARGON2_VERSION_NUMBER);

stop_cycles = rdtsc();
stop_time = clock();

delta = (stop_cycles - start_cycles) / (m_cost);
mcycles = (double)(stop_cycles - start_cycles) / (1UL << 20);
run_time += ((double)stop_time - start_time) / (CLOCKS_PER_SEC);

printf("%s %d iterations %d MiB %d threads: %2.2f cpb %2.2f "
"Mcycles \n", argon2_type2string(type, 1), t_cost,
m_cost >> 10, thread_n, (float)delta / 1024, mcycles);
}

printf("%2.4f seconds\n\n", run_time);
}
}
}

int main() {
benchmark();
return ARGON2_OK;
}

I will change the bench.c file by removing the rdtsc function. The rdtsc function only reads the processor's time-stamp counter; the benchmark calls it before and after the hash to count how long the code runs. I will also remove any code that is affected by the change (marked in red).

The main part/chunk of the program is this section:

argon2_hash(t_cost, m_cost, thread_n, pwd_array, inlen,
salt_array, inlen, out, outlen, NULL, 0, type,
ARGON2_VERSION_NUMBER);

The code sits inside nested for loops. The outer loop doubles m_cost from 1 << 10 to 1 << 22 (printed as 1 MiB up to 4096 MiB), so a full run takes a long time unless the user stops it early (using CTRL+C or the kill command). For each memory size and thread count, the code runs each of the three argon2 hashing types before printing a calculated run time. The three argon2 hashing types are argon2_i, argon2_d, and argon2_id.

The Linux function for reading the system timer is called clock_gettime. The link also contains an example that I will use to get the run time I need for my test. Here is the example:

/*
 * This program calculates the time required to
 * execute the program specified as its first argument.
 * The time is printed in seconds, on standard out.
 */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <time.h>

#define BILLION  1000000000L;

int main( int argc, char **argv )
  {
    struct timespec start, stop;
    double accum;

    if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) {
      perror( "clock gettime" );
      exit( EXIT_FAILURE );
    }

    system( argv[1] );

    if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) {
      perror( "clock gettime" );
      exit( EXIT_FAILURE );
    }

    /* cast to double so the nanosecond part is not lost to integer division */
    accum = ( stop.tv_sec - start.tv_sec )
          + (double)( stop.tv_nsec - start.tv_nsec )
            / (double)BILLION;
    printf( "%lf\n", accum );
    return( EXIT_SUCCESS );
  }

The code that I require is marked in red.

The final code will look like this:

/*
* Argon2 reference source code package - reference C implementations
*
* Copyright 2015
* Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
*
* You may use this work under the terms of a Creative Commons CC0 1.0
* License/Waiver or the Apache Public License 2.0, at your option. The terms of
* these licenses can be found at:
*
* - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
* - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0
*
* You should have received a copy of both of these licenses along with this
* software. If not, they may be obtained at the above URLs.
*/

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#define BILLION 1000000000L;
#ifdef _MSC_VER
#include <intrin.h>
#endif

#include "argon2.h"

/*
static uint64_t rdtsc(void) {
#ifdef _MSC_VER
return __rdtsc();
#else
#if defined(__amd64__) || defined(__x86_64__)
uint64_t rax, rdx;
__asm__ __volatile__("rdtsc" : "=a"(rax), "=d"(rdx) : :);
return (rdx << 32) | rax;
#elif defined(__i386__) || defined(__i386) || defined(__X86__)
uint64_t rax;
__asm__ __volatile__("rdtsc" : "=A"(rax) : :);
return rax;
#else
#error "Not implemented!"
#endif
#endif
}

*/


/*
* Benchmarks Argon2 with salt length 16, password length 16, t_cost 3,
and different m_cost and threads
*/
static void benchmark() {
#define BENCH_OUTLEN 16
#define BENCH_INLEN 16
const uint32_t inlen = BENCH_INLEN;
const unsigned outlen = BENCH_OUTLEN;
unsigned char out[BENCH_OUTLEN];
unsigned char pwd_array[BENCH_INLEN];
unsigned char salt_array[BENCH_INLEN];
#undef BENCH_INLEN
#undef BENCH_OUTLEN

struct timespec start, stop;
double accum;

uint32_t t_cost = 3;
uint32_t m_cost;
uint32_t thread_test[4] = {1, 2, 4, 8};
argon2_type types[3] = {Argon2_i, Argon2_d, Argon2_id};

memset(pwd_array, 0, inlen);
memset(salt_array, 1, inlen);

for (m_cost = (uint32_t)1 << 10; m_cost <= (uint32_t)1 << 22; m_cost *= 2) {
unsigned i;
for (i = 0; i < 4; ++i) {
double run_time = 0;
uint32_t thread_n = thread_test[i];
unsigned j;
for (j = 0; j < 3; ++j) {
/*clock_t start_time, stop_time;
uint64_t start_cycles, stop_cycles;
uint64_t delta;
double mcycles;*/

argon2_type type = types[j];

/*start_time = clock();
start_cycles = rdtsc();*/

if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) {
perror( "clock gettime" );
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &start);
}

argon2_hash(t_cost, m_cost, thread_n, pwd_array, inlen,
salt_array, inlen, out, outlen, NULL, 0, type,
ARGON2_VERSION_NUMBER);

/*stop_cycles = rdtsc();
stop_time = clock();*/

/*delta = (stop_cycles - start_cycles) / (m_cost);
mcycles = (double)(stop_cycles - start_cycles) / (1UL << 20);
run_time += ((double)stop_time - start_time) / (CLOCKS_PER_SEC);*/

if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) {
perror( "clock gettime" );
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &stop);
}

accum = ( (double)stop.tv_sec - start.tv_sec )
+ ( (double)stop.tv_nsec - start.tv_nsec );

double mcycles = accum / (1UL << 20);
uint64_t delta = accum / (m_cost);

printf("%s %d iterations %d MiB %d threads: %2.2f cpb %2.2f "
"Mcycles \n", argon2_type2string(type, 1), t_cost,
m_cost >> 10, thread_n, (float)delta / 1024, mcycles);

run_time = 0;
run_time += accum / BILLION;

}

printf("%2.4f seconds\n\n", run_time);
}
}

}

int main() {
benchmark();
return ARGON2_OK;
}
NOTE: /* */ marks a comment block, which forces the compiler to ignore that section of code.

I will now explain what the code does.

#include <time.h>
#include <unistd.h>
#define BILLION 1000000000L;

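The #include directive pulls in declarations from existing library headers, so I do not have to write that code myself. These headers are installed on the system along with the GNU gcc C language compiler and the C library.

NOTE: The name of a system header is enclosed in angle brackets (<>).

The #define directive creates a macro: a name that the preprocessor replaces with the given text wherever the name appears later in the program.

NOTE: The format requires a name, then the replacement value. Everything after the name is substituted, so the trailing semicolon in the BILLION line above becomes part of the value.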
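The trailing semicolon in the BILLION definition is harmless in a statement such as run_time += accum / BILLION; but it breaks the macro inside a larger expression, for example printf("%f", accum / BILLION). Below is a minimal sketch of a safer definition. The names NSEC_PER_SEC_SKETCH and ns_to_seconds are hypothetical and are not part of the argon2 package or of my program.

```c
#include <stdint.h>

/* No trailing semicolon: the preprocessor pastes the replacement text
 * verbatim, so a stray ';' would end up in the middle of any expression
 * that uses the macro. */
#define NSEC_PER_SEC_SKETCH 1000000000L

/* Hypothetical helper: convert a nanosecond count to seconds. */
static double ns_to_seconds(int64_t ns) {
    return (double)ns / NSEC_PER_SEC_SKETCH;
}
```

With this definition the macro can be used anywhere a number can, including inside a function call argument.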

The next set of code:

struct timespec start, stop;
double accum;

The struct keyword declares variables of a structure type (a pre-made group of fields with a specific format). This line declares two variables, start and stop, of type struct timespec. Each one holds a time reading split into seconds (tv_sec) and nanoseconds (tv_nsec).

NOTE: The format of the struct declaration requires the structure’s name followed by the variable names. The line must also be closed with a semicolon (;), like most C/C++ language code.

The double line declares a variable that will hold a value used later in the program.

NOTE: The format requires a type name followed by the variable’s name. Example: double is the variable type, accum is the variable’s name.

Here is the next piece of code:

double run_time = 0;

This is another variable that I will assign a value of zero.

NOTE: I have placed this code within a for loop to constantly reset the run_time variable. I will have to reset the time counter each time the program runs the main chunk of code mentioned before.

The next piece of code:

if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) {
perror( "clock gettime" );
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &start);
}

The red highlighted part is from the example found here: (https://users.pja.edu.pl/~jms/qnx/help/watcom/clibref/qnx/clock_gettime.html). The code checks whether the program cannot read the system time and, if so, reports an error to the user and exits.

I added an else branch to start the timer if the system timer is accessible. The call inside the if condition has already stored a reading in start, so the else branch takes a second reading that overwrites the first.

NOTE: An else branch always comes directly after an if statement.

The next section of code:

if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) {
perror( "clock gettime" );
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &stop);
}

I will stop the timer after the main chunk of code is done running. This code is similar to the code that starts the timer. The if statement checks for any error that might occur if the system time cannot be read.

The next section of code:

accum = ( (double)stop.tv_sec - start.tv_sec )
+ ( (double)stop.tv_nsec - start.tv_nsec );

The calculated run time is assigned to the variable accum.

NOTE: This is similar to the code from the example (https://users.pja.edu.pl/~jms/qnx/help/watcom/clibref/qnx/clock_gettime.html) but I have removed the / BILLION at the end because I will need the number in the original form for the next lines of code. Without that division, the seconds difference and the nanoseconds difference are added together in different units.

double mcycles = accum / (1UL << 20);
uint64_t delta = accum / (m_cost);

The variable mcycles takes the value of the variable accum and divides it by (1UL << 20). The explanation of 1UL is found here (https://stackoverflow.com/questions/14467173/bit-setting-in-ansi-c). It is the value 1 as an unsigned long integer. The << 20 is a bit shift that moves the bit to the left twenty positions, which gives 2^20 = 1,048,576. Similar to basic algebra, the brackets are evaluated first. In the original benchmark this variable reports millions of CPU cycles (Mcycles); here it is computed from the timed value instead.

The variable delta has the type uint64_t, an unsigned 64-bit integer. In the original benchmark it feeds the cpb (cycles per byte) column. Here the value is the timed value divided by the memory cost (2^n).
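The (1UL << 20) expression can be checked in isolation. The sketch below uses two hypothetical helper names, pow2_sketch and scale_by_2_20, which I made up for illustration; they are not in bench.c.

```c
/* 1UL is the value 1 as an unsigned long. Shifting it left by n bits
 * multiplies it by 2^n, so 1UL << 20 is 2^20 = 1048576.
 * The shift count must be smaller than the bit width of unsigned long. */
static unsigned long pow2_sketch(unsigned shift) {
    return 1UL << shift;
}

/* Scale a raw count down by 2^20, the same division the mcycles line
 * performs on accum. */
static double scale_by_2_20(double raw) {
    return raw / (double)(1UL << 20);
}
```

For example, a raw count of 2097152 (2 * 2^20) scales down to 2.0.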

The m_cost variable comes from the for loop shown here:

for (m_cost = (uint32_t)1 << 10; m_cost <= (uint32_t)1 << 22; m_cost *= 2)

NOTE: The variable type uint32_t is an unsigned 32-bit integer variable.
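To see which memory sizes this loop visits, here is a small sketch that repeats the same loop header and counts the steps. The function names count_m_cost_steps and m_cost_to_mib are hypothetical. The conversion assumes m_cost is given in KiB, which matches the m_cost >> 10 used for the MiB column in the printf call.

```c
#include <stdint.h>

/* Counts how many memory sizes the benchmark loop visits. The loop
 * doubles m_cost each time, from 2^10 up to and including 2^22. */
static unsigned count_m_cost_steps(void) {
    unsigned steps = 0;
    uint32_t m_cost;
    for (m_cost = (uint32_t)1 << 10; m_cost <= (uint32_t)1 << 22; m_cost *= 2) {
        ++steps;
    }
    return steps;
}

/* Converts a memory cost in KiB to MiB, as the printf call does. */
static uint32_t m_cost_to_mib(uint32_t m_cost_kib) {
    return m_cost_kib >> 10;
}
```

The loop therefore runs for 13 memory sizes, starting at 1 MiB and doubling each time.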

The next section of code:

run_time = 0;
run_time += accum / BILLION;

The run_time variable is set to 0 again because the GNU gcc C language compiler kept complaining about the variable not being used. (This may be an issue later; I will have to check it.)

I then add the value of (accum divided by BILLION) to run_time. This is where the division from the example found here (https://users.pja.edu.pl/~jms/qnx/help/watcom/clibref/qnx/clock_gettime.html) is required. The division converts the value into seconds: clock_gettime reports the fractional part of the time in nanoseconds, and there are 1,000,000,000 nanoseconds in one second, so the timed result is divided by one billion.

The final section of code is:

printf("%2.4f seconds\n\n", run_time);

This line of code outputs the calculated time value to the user. In the %2.4f format, the 2 is the minimum total width of the printed number and the 4 is the number of digits printed after the decimal point.
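To see what the %2.4f format produces in practice, the sketch below formats a value into a buffer with snprintf instead of printing it. The helper name format_seconds is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats a time value the same way the benchmark's final printf does,
 * but into a caller-supplied buffer. Returns the snprintf result. */
static int format_seconds(char *buf, size_t size, double seconds) {
    /* Width 2 is only a minimum, so larger numbers are not truncated;
     * precision 4 always prints four digits after the decimal point. */
    return snprintf(buf, size, "%2.4f seconds", seconds);
}
```

A value of 0.5 is written as "0.5000 seconds", and a larger value such as 123.456789 is rounded to "123.4568 seconds".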

Result:

The result was strange, as the calculated time had negative values. Also, the benchmark program ran really fast compared to the original benchmark program run on an x86_64 processor system.
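A likely cause of the negative values, offered here as my reading of the code rather than a confirmed diagnosis: tv_nsec starts again from zero every second, so stop.tv_nsec - start.tv_nsec is negative whenever a measurement crosses a second boundary. The accum line adds that nanosecond difference, unscaled, to the seconds difference, so the sum stays far below zero. A minimal sketch of the calculation with the units matched is shown below; the helper name elapsed_seconds is hypothetical.

```c
#include <time.h>

/* Hypothetical helper: elapsed time in seconds between two readings.
 * The nanosecond difference is scaled to seconds before it is added,
 * so a negative nanosecond difference is cancelled out by the extra
 * whole second in the tv_sec difference. */
static double elapsed_seconds(const struct timespec *start,
                              const struct timespec *stop) {
    double sec  = (double)(stop->tv_sec - start->tv_sec);
    double nsec = (double)(stop->tv_nsec - start->tv_nsec);
    return sec + nsec / 1e9;
}
```

For example, a start reading of 10 s + 900,000,000 ns and a stop reading of 11 s + 100,000,000 ns are 0.2 seconds apart. The unscaled sum would instead be 1 + (-800,000,000), which is negative.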

This is the command used to build the program with the Makefile included in the argon2 package.

Building without optimizations
cc -std=c89 -O2 -Wall -g -Iinclude -Isrc -pthread src/argon2.c src/core.c src/blake2/blake2b.c src/thread.c src/encoding.c src/ref.c src/bench.c -o bench
NOTE: I have changed the build flag to -O2 instead of -O3.

The next test (on an x86_64 architecture; basic test of the original program)

The x86_64 system has this hardware:

Intel(R) Xeon(R) CPU E5-1630 v4 @ 3.70GHz

Four sticks of 8GB DIMM DDR4 RAM at 2.4 GHz (32 GB of RAM in total)

x86_64 Fedora 28 version of Linux Operating System

I will first show the x86_64 system results:

37312731171934821544Argon2i 3 iterations 1 MiB 1 threads: 10574.77 cpb 10574.77 Mcycles
1247866662622202070Argon2d 3 iterations 1 MiB 1 threads: 10573.72 cpb 10573.72 Mcycles
8121691903307694325Argon2id 3 iterations 1 MiB 1 threads: 10571.92 cpb 10571.92 Mcycles
0.0100 seconds

14977167733997818648Argon2i 3 iterations 1 MiB 2 threads: 10576.28 cpb 10576.28 Mcycles
2187773163388595072Argon2d 3 iterations 1 MiB 2 threads: 10572.17 cpb 10572.17 Mcycles
28735233341075268257Argon2id 3 iterations 1 MiB 2 threads: 10573.05 cpb 10573.05 Mcycles
0.0171 seconds

35601863931760751719Argon2i 3 iterations 1 MiB 4 threads: 10571.92 cpb 10571.93 Mcycles
42457421402446018348Argon2d 3 iterations 1 MiB 4 threads: 10571.65 cpb 10571.65 Mcycles
6359495663131182784Argon2id 3 iterations 1 MiB 4 threads: 10571.64 cpb 10571.64 Mcycles
0.0220 seconds

13211386193827312231Argon2i 3 iterations 1 MiB 8 threads: 10582.07 cpb 10582.07 Mcycles
2017380744228771863Argon2d 3 iterations 1 MiB 8 threads: 10582.25 cpb 10582.25 Mcycles
2713660177924949126Argon2id 3 iterations 1 MiB 8 threads: 10582.15 cpb 10582.15 Mcycles
0.0401 seconds

34098789171626307623Argon2i 3 iterations 2 MiB 1 threads: 5293.53 cpb 10587.05 Mcycles
41112087152325301561Argon2d 3 iterations 2 MiB 1 threads: 5292.41 cpb 10584.83 Mcycles
5152598753020665429Argon2id 3 iterations 2 MiB 1 threads: 5290.67 cpb 10581.34 Mcycles
0.0193 seconds

12106284303712976859Argon2i 3 iterations 2 MiB 2 threads: 5289.21 cpb 10578.43 Mcycles
1902890366109077386Argon2d 3 iterations 2 MiB 2 threads: 5288.64 cpb 10577.29 Mcycles
2593963985801686024Argon2id 3 iterations 2 MiB 2 threads: 5289.38 cpb 10578.75 Mcycles
0.0248 seconds

32865988571490376646Argon2i 3 iterations 2 MiB 4 threads: 5287.49 cpb 10574.99 Mcycles
39752915512178772514Argon2d 3 iterations 2 MiB 4 threads: 5287.35 cpb 10574.71 Mcycles
3687074192867256577Argon2id 3 iterations 2 MiB 4 threads: 5287.40 cpb 10574.80 Mcycles
0.0308 seconds

10572264843566742997Argon2i 3 iterations 2 MiB 8 threads: 5292.63 cpb 10585.26 Mcycles
17567059124266120428Argon2d 3 iterations 2 MiB 8 threads: 5292.58 cpb 10585.16 Mcycles
2456075227669645387Argon2id 3 iterations 2 MiB 8 threads: 5292.16 cpb 10584.33 Mcycles
0.0521 seconds

31545606251397912479Argon2i 3 iterations 4 MiB 1 threads: 2653.18 cpb 10612.73 Mcycles
38828398532114725197Argon2d 3 iterations 4 MiB 1 threads: 2650.45 cpb 10601.79 Mcycles
3047237062830488385Argon2id 3 iterations 4 MiB 1 threads: 2650.19 cpb 10600.76 Mcycles
0.0368 seconds

10204770293535003798Argon2i 3 iterations 4 MiB 2 threads: 2647.51 cpb 10590.04 Mcycles
17249317724238457938Argon2d 3 iterations 4 MiB 2 threads: 2647.27 cpb 10589.09 Mcycles
2428403684647788185Argon2id 3 iterations 4 MiB 2 threads: 2647.47 cpb 10589.87 Mcycles
0.0431 seconds

31327130301342861294Argon2i 3 iterations 4 MiB 4 threads: 2645.27 cpb 10581.06 Mcycles
38277567492037602970Argon2d 3 iterations 4 MiB 4 threads: 2645.19 cpb 10580.78 Mcycles
2276116172732562416Argon2id 3 iterations 4 MiB 4 threads: 2645.23 cpb 10580.91 Mcycles
0.0495 seconds

9225480733436003113Argon2i 3 iterations 4 MiB 8 threads: 2647.25 cpb 10589.02 Mcycles
16259668664138962378Argon2d 3 iterations 4 MiB 8 threads: 2647.14 cpb 10588.58 Mcycles
2328880090548145081Argon2id 3 iterations 4 MiB 8 threads: 2647.44 cpb 10589.76 Mcycles
0.0744 seconds

30330952961289279353Argon2i 3 iterations 8 MiB 1 threads: 1328.12 cpb 10624.97 Mcycles
37742017952023717309Argon2d 3 iterations 8 MiB 1 threads: 1327.33 cpb 10618.61 Mcycles
2136636212757009041Argon2id 3 iterations 8 MiB 1 threads: 1327.19 cpb 10617.52 Mcycles
0.0498 seconds

9469966913485517122Argon2i 3 iterations 8 MiB 2 threads: 1326.61 cpb 10612.92 Mcycles
16754634484211165330Argon2d 3 iterations 8 MiB 2 threads: 1326.28 cpb 10610.23 Mcycles
2401273148642647180Argon2id 3 iterations 8 MiB 2 threads: 1326.35 cpb 10610.84 Mcycles
0.0771 seconds

31275726791350457380Argon2i 3 iterations 8 MiB 4 threads: 1324.15 cpb 10593.21 Mcycles
38353773162057823887Argon2d 3 iterations 8 MiB 4 threads: 1324.10 cpb 10592.79 Mcycles
2478461132765912450Argon2id 3 iterations 8 MiB 4 threads: 1324.18 cpb 10593.42 Mcycles
0.0866 seconds

9558952213482208994Argon2i 3 iterations 8 MiB 8 threads: 1325.16 cpb 10601.28 Mcycles
16721510404198952967Argon2d 3 iterations 8 MiB 8 threads: 1325.22 cpb 10601.75 Mcycles
2388942706619242490Argon2id 3 iterations 8 MiB 8 threads: 1325.04 cpb 10600.28 Mcycles
0.1194 seconds

31042168541427536599Argon2i 3 iterations 16 MiB 1 threads: 668.06 cpb 10688.99 Mcycles
39124498882214906246Argon2d 3 iterations 16 MiB 1 threads: 666.82 cpb 10669.10 Mcycles
4048743402995991599Argon2id 3 iterations 16 MiB 1 threads: 666.44 cpb 10663.08 Mcycles
0.0951 seconds

11859953773762067761Argon2i 3 iterations 16 MiB 2 threads: 665.54 cpb 10648.73 Mcycles
1952051050242384679Argon2d 3 iterations 16 MiB 2 threads: 666.10 cpb 10657.54 Mcycles
27273213391008354924Argon2id 3 iterations 16 MiB 2 threads: 665.54 cpb 10648.67 Mcycles
0.1325 seconds

34933301581740323590Argon2i 3 iterations 16 MiB 4 threads: 663.51 cpb 10616.20 Mcycles
42252147552471078403Argon2d 3 iterations 16 MiB 4 threads: 663.45 cpb 10615.13 Mcycles
6610278363201576241Argon2id 3 iterations 16 MiB 4 threads: 663.43 cpb 10614.86 Mcycles
0.1505 seconds

13915654333937508899Argon2i 3 iterations 16 MiB 8 threads: 663.75 cpb 10620.00 Mcycles
2127460927377542831Argon2d 3 iterations 16 MiB 8 threads: 663.70 cpb 10619.15 Mcycles
28623010321111202089Argon2id 3 iterations 16 MiB 8 threads: 663.63 cpb 10618.02 Mcycles
0.1908 seconds

35961333192018513362Argon2i 3 iterations 32 MiB 1 threads: 336.98 cpb 10783.46 Mcycles
2084438022925688382Argon2d 3 iterations 32 MiB 1 threads: 336.98 cpb 10783.37 Mcycles
11156363583834542778Argon2id 3 iterations 32 MiB 1 threads: 337.03 cpb 10784.95 Mcycles
0.1887 seconds

2024511488406553503Argon2i 3 iterations 32 MiB 2 threads: 335.78 cpb 10745.00 Mcycles
28914570101263429122Argon2d 3 iterations 32 MiB 2 threads: 335.48 cpb 10735.39 Mcycles
37483569582110660894Argon2id 3 iterations 32 MiB 2 threads: 335.19 cpb 10726.17 Mcycles
0.2657 seconds

3006611252891444371Argon2i 3 iterations 32 MiB 4 threads: 333.21 cpb 10662.76 Mcycles
10814032083673982863Argon2d 3 iterations 32 MiB 4 threads: 333.26 cpb 10664.48 Mcycles
1863928371162819675Argon2id 3 iterations 32 MiB 4 threads: 333.30 cpb 10665.70 Mcycles
0.2917 seconds

2647894462945096885Argon2i 3 iterations 32 MiB 8 threads: 333.25 cpb 10664.09 Mcycles
34300370941726852693Argon2d 3 iterations 32 MiB 8 threads: 333.24 cpb 10663.72 Mcycles
42118400902507645479Argon2id 3 iterations 32 MiB 8 threads: 333.21 cpb 10662.75 Mcycles
0.3958 seconds

6975984133638080775Argon2i 3 iterations 64 MiB 1 threads: 171.82 cpb 10996.26 Mcycles
1828026929462843471Argon2d 3 iterations 64 MiB 1 threads: 171.66 cpb 10986.06 Mcycles
29477903061591876255Argon2id 3 iterations 64 MiB 1 threads: 171.79 cpb 10994.90 Mcycles
0.3656 seconds

40768340232642930179Argon2i 3 iterations 64 MiB 2 threads: 170.63 cpb 10920.52 Mcycles
8329564143653322544Argon2d 3 iterations 64 MiB 2 threads: 170.03 cpb 10881.71 Mcycles
1843354044359643162Argon2id 3 iterations 64 MiB 2 threads: 169.89 cpb 10873.02 Mcycles
0.4956 seconds

28446343191230071866Argon2i 3 iterations 64 MiB 4 threads: 167.94 cpb 10748.23 Mcycles
37150261572105910736Argon2d 3 iterations 64 MiB 4 threads: 168.02 cpb 10753.43 Mcycles
2958699032978678975Argon2id 3 iterations 64 MiB 4 threads: 167.98 cpb 10750.53 Mcycles
0.5300 seconds

11686546543836486008Argon2i 3 iterations 64 MiB 8 threads: 167.75 cpb 10736.24 Mcycles
2026438434392927139Argon2d 3 iterations 64 MiB 8 threads: 167.66 cpb 10730.16 Mcycles
28778995791260930582Argon2id 3 iterations 64 MiB 8 threads: 167.91 cpb 10745.94 Mcycles
0.7020 seconds

Here are the results of the new code change:

Argon2i 3 iterations 1 MiB 1 threads: 5.53 cpb 5.53 Mcycles
Argon2d 3 iterations 1 MiB 1 threads: 5.14 cpb 5.15 Mcycles
Argon2id 3 iterations 1 MiB 1 threads: 4.63 cpb 4.63 Mcycles
0.0049 seconds

Argon2i 3 iterations 1 MiB 2 threads: 3.57 cpb 3.57 Mcycles
Argon2d 3 iterations 1 MiB 2 threads: 3.23 cpb 3.23 Mcycles
Argon2id 3 iterations 1 MiB 2 threads: 3.29 cpb 3.30 Mcycles
0.0035 seconds

Argon2i 3 iterations 1 MiB 4 threads: 2.62 cpb 2.62 Mcycles
Argon2d 3 iterations 1 MiB 4 threads: 2.53 cpb 2.53 Mcycles
Argon2id 3 iterations 1 MiB 4 threads: 2.59 cpb 2.59 Mcycles
0.0027 seconds

Argon2i 3 iterations 1 MiB 8 threads: 4.20 cpb 4.20 Mcycles
Argon2d 3 iterations 1 MiB 8 threads: 4.14 cpb 4.14 Mcycles
Argon2id 3 iterations 1 MiB 8 threads: 4.41 cpb 4.41 Mcycles
0.0046 seconds

Argon2i 3 iterations 2 MiB 1 threads: 5.43 cpb 10.86 Mcycles
Argon2d 3 iterations 2 MiB 1 threads: 5.20 cpb 10.40 Mcycles
Argon2id 3 iterations 2 MiB 1 threads: 4.67 cpb 9.33 Mcycles
0.0098 seconds

Argon2i 3 iterations 2 MiB 2 threads: 2.93 cpb 5.85 Mcycles
Argon2d 3 iterations 2 MiB 2 threads: 2.84 cpb 5.69 Mcycles
Argon2id 3 iterations 2 MiB 2 threads: 2.86 cpb 5.72 Mcycles
0.0060 seconds

Argon2i 3 iterations 2 MiB 4 threads: 1.96 cpb 3.91 Mcycles
Argon2d 3 iterations 2 MiB 4 threads: 1.94 cpb 3.89 Mcycles
Argon2id 3 iterations 2 MiB 4 threads: 1.95 cpb 3.90 Mcycles
0.0041 seconds

Argon2i 3 iterations 2 MiB 8 threads: 2.56 cpb 5.12 Mcycles
Argon2d 3 iterations 2 MiB 8 threads: 2.51 cpb 5.01 Mcycles
Argon2id 3 iterations 2 MiB 8 threads: 2.53 cpb 5.06 Mcycles
0.0053 seconds

Argon2i 3 iterations 4 MiB 1 threads: 5.52 cpb 22.10 Mcycles
Argon2d 3 iterations 4 MiB 1 threads: 5.00 cpb 19.98 Mcycles
Argon2id 3 iterations 4 MiB 1 threads: 4.70 cpb 18.79 Mcycles
0.0197 seconds

Argon2i 3 iterations 4 MiB 2 threads: 2.78 cpb 11.11 Mcycles
Argon2d 3 iterations 4 MiB 2 threads: 2.68 cpb 10.74 Mcycles
Argon2id 3 iterations 4 MiB 2 threads: 2.70 cpb 10.79 Mcycles
0.0113 seconds

Argon2i 3 iterations 4 MiB 4 threads: 1.66 cpb 6.63 Mcycles
Argon2d 3 iterations 4 MiB 4 threads: 1.64 cpb 6.56 Mcycles
Argon2id 3 iterations 4 MiB 4 threads: 1.65 cpb 6.61 Mcycles
0.0069 seconds

Argon2i 3 iterations 4 MiB 8 threads: 2.37 cpb 9.47 Mcycles
Argon2d 3 iterations 4 MiB 8 threads: 2.24 cpb 8.95 Mcycles
Argon2id 3 iterations 4 MiB 8 threads: 1.89 cpb 7.57 Mcycles
0.0079 seconds

Argon2i 3 iterations 8 MiB 1 threads: 5.78 cpb 46.22 Mcycles
Argon2d 3 iterations 8 MiB 1 threads: 5.29 cpb 42.36 Mcycles
Argon2id 3 iterations 8 MiB 1 threads: 4.89 cpb 39.12 Mcycles
0.0410 seconds

Argon2i 3 iterations 8 MiB 2 threads: 2.70 cpb 21.64 Mcycles
Argon2d 3 iterations 8 MiB 2 threads: 2.67 cpb 21.32 Mcycles
Argon2id 3 iterations 8 MiB 2 threads: 0.00 cpb -932.22 Mcycles
-0.9775 seconds

Argon2i 3 iterations 8 MiB 4 threads: 1.53 cpb 12.27 Mcycles
Argon2d 3 iterations 8 MiB 4 threads: 1.52 cpb 12.14 Mcycles
Argon2id 3 iterations 8 MiB 4 threads: 1.52 cpb 12.14 Mcycles
0.0127 seconds

Argon2i 3 iterations 8 MiB 8 threads: 1.84 cpb 14.72 Mcycles
Argon2d 3 iterations 8 MiB 8 threads: 1.77 cpb 14.19 Mcycles
Argon2id 3 iterations 8 MiB 8 threads: 1.74 cpb 13.91 Mcycles
0.0146 seconds

Argon2i 3 iterations 16 MiB 1 threads: 5.97 cpb 95.55 Mcycles
Argon2d 3 iterations 16 MiB 1 threads: 5.50 cpb 88.01 Mcycles
Argon2id 3 iterations 16 MiB 1 threads: 5.21 cpb 83.43 Mcycles
0.0875 seconds

Argon2i 3 iterations 16 MiB 2 threads: 2.87 cpb 45.87 Mcycles
Argon2d 3 iterations 16 MiB 2 threads: 2.83 cpb 45.24 Mcycles
Argon2id 3 iterations 16 MiB 2 threads: 2.84 cpb 45.39 Mcycles
0.0476 seconds

Argon2i 3 iterations 16 MiB 4 threads: 1.58 cpb 25.29 Mcycles
Argon2d 3 iterations 16 MiB 4 threads: 1.56 cpb 24.91 Mcycles
Argon2id 3 iterations 16 MiB 4 threads: 1.56 cpb 24.98 Mcycles
0.0262 seconds

Argon2i 3 iterations 16 MiB 8 threads: 1.78 cpb 28.54 Mcycles
Argon2d 3 iterations 16 MiB 8 threads: 1.78 cpb 28.55 Mcycles
Argon2id 3 iterations 16 MiB 8 threads: 1.77 cpb 28.28 Mcycles
0.0297 seconds

Argon2i 3 iterations 32 MiB 1 threads: 6.18 cpb 197.69 Mcycles
Argon2d 3 iterations 32 MiB 1 threads: 0.00 cpb -758.56 Mcycles
Argon2id 3 iterations 32 MiB 1 threads: 6.12 cpb 195.79 Mcycles
0.2053 seconds

Argon2i 3 iterations 32 MiB 2 threads: 3.38 cpb 108.24 Mcycles
Argon2d 3 iterations 32 MiB 2 threads: 3.34 cpb 106.87 Mcycles
Argon2id 3 iterations 32 MiB 2 threads: 3.36 cpb 107.44 Mcycles
0.1127 seconds

Argon2i 3 iterations 32 MiB 4 threads: 1.92 cpb 61.53 Mcycles
Argon2d 3 iterations 32 MiB 4 threads: 1.89 cpb 60.38 Mcycles
Argon2id 3 iterations 32 MiB 4 threads: 1.89 cpb 60.60 Mcycles
0.0635 seconds

Argon2i 3 iterations 32 MiB 8 threads: 1.85 cpb 59.29 Mcycles
Argon2d 3 iterations 32 MiB 8 threads: 1.96 cpb 62.65 Mcycles
Argon2id 3 iterations 32 MiB 8 threads: 0.00 cpb -893.19 Mcycles
-0.9366 seconds

Argon2i 3 iterations 64 MiB 1 threads: 6.29 cpb 402.50 Mcycles
Argon2d 3 iterations 64 MiB 1 threads: 6.22 cpb 397.86 Mcycles
Argon2id 3 iterations 64 MiB 1 threads: 0.00 cpb -554.57 Mcycles
-0.5815 seconds

Argon2i 3 iterations 64 MiB 2 threads: 3.45 cpb 220.73 Mcycles
Argon2d 3 iterations 64 MiB 2 threads: 3.41 cpb 218.22 Mcycles
Argon2id 3 iterations 64 MiB 2 threads: 3.42 cpb 218.81 Mcycles
0.2294 seconds

Argon2i 3 iterations 64 MiB 4 threads: 0.00 cpb -830.95 Mcycles
Argon2d 3 iterations 64 MiB 4 threads: 1.90 cpb 121.72 Mcycles
Argon2id 3 iterations 64 MiB 4 threads: 1.90 cpb 121.88 Mcycles
0.1278 seconds

Argon2i 3 iterations 64 MiB 8 threads: 1.93 cpb 123.78 Mcycles
Argon2d 3 iterations 64 MiB 8 threads: 1.97 cpb 126.37 Mcycles
Argon2id 3 iterations 64 MiB 8 threads: 1.81 cpb 115.84 Mcycles
0.1215 seconds

The Aarch64 program (my code changes) runs extremely fast. The speed increase also comes with the negative time value problem mentioned earlier. A negative time value does not make sense, as time is always running and moving forward. The original program (x86_64 only) had a noticeable delay before outputting the results. The difference can also be seen in the Mcycles (millions of cycles) and cpb (cycles per byte) values.

(I will continue the testing in Project: Part3, Progress 3)

Project: Part3 – Optimizing and porting argon2 package using C and Assembler language(Progress 1)

Requirements/ System Specifications.

Argon2 Password hashing function package:

https://github.com/P-H-C/phc-winner-argon2

Aarch64 Fedora 28 version of Linux operating system

Cortex-A57 8 core processor

One set of Dual-Channel DIMM DDR3 8GB RAM (16GB in total)

Change of plans for the project.

The original plan for my project stage 3 was this:

Modified Testing Stage(Modified Program):

I will change some of the code in the base program to attempt optimization. The changes can be algorithm changes, new inline assembler code, or removing unnecessary code.
I will run the changed program multiple times to check if there are any bugs or problems compared to the base program. The testing requirements/variables are: compile time, program run time, file size, and program produced result.
I will test if the GNU C program compiler is able to further optimize the changed program by using optimization levels 0 to 3.

I did not foresee the package having thorough optimization for the following hardware: Amd64, x86_64, i386, and x86. This can be seen in the source code file called bench.c.

Here is a sample of the code, starting from line 29:

static uint64_t rdtsc(void) {
#ifdef _MSC_VER
return __rdtsc();
#else
#if defined(__amd64__) || defined(__x86_64__)
uint64_t rax, rdx;
__asm__ __volatile__("rdtsc" : "=a"(rax), "=d"(rdx) : :);
return (rdx << 32) | rax;
#elif defined(__i386__) || defined(__i386) || defined(__X86__)
uint64_t rax;
__asm__ __volatile__("rdtsc" : "=A"(rax) : :);
return rax;
#else
#error "Not implemented!"
#endif

The red highlighted lines check which processor architecture is being used.

I did notice that the bench testing program did not have the ability to test the Aarch64 type of processor.

I will try to optimize the program for testing in any architecture.

(I will continue in progress 2)

Project: Part2 – Initial build testing on argon2 package using C and Assembler language(Progress 5)

Following the last blog called “Project: Part2 – Initial build testing on argon2 package using C and Assembler language(Progress 4)”

I will be testing using the following criteria (The crossed out points are completed in previous progress blogs)(Red highlighted are the current topic(s)):

  • Check how many inline assembler code already exist
  • Check how many dedicated/ separate assembler files exist in the package (File Extension: s or S)
  • Check how many files use the C programming language (File Extension: c)
  • Use a profiling tool to check how optimized the program is in the current build package (Tools such as: gprof, stap, etc.)
  • Build the package and check the default results by testing the build against a password file. The file will be made using a Microsoft Excel formula.
  • Check that the results are relatively consistent. (This will be from program compile time, program run time, and the program file size) (This blog’s topic)
  • Introduce a minor or major change that will optimize or reduce performance of the original program. This is done by changing the character(s) of the password file generated by the Microsoft Excel formula.
  • Compare the results of the changes and the original program results.
  • Building/ Testing the original argon2 package

This is the system specifications:

Argon2 package on an Aarch64 Fedora 28 Linux operating system. The system has 8 CPU cores; Cortex-A57 Model.

My current directory is /home/username/projects/phc-winner-argon2/.

I will be testing the argon2 program with the already made compiling file “Makefile” to check the following:

  • Program compile time
  • Program run time
  • Program file size

I will be using this command to build the program:

time make argon2

Note: The time command records the real run time (elapsed wall-clock time), the user run time (CPU time spent in the program's own code), and the system run time (CPU time spent in the kernel on the program's behalf).

I will be using this command to check the file size:

ls -l

I will also need to remove the existing argon2 program. The make command builds a new argon2 program only when the existing one is missing or out of date. This is the command:

rm argon2

I will now begin testing.

Test 1

Start a new build by removing any existing argon2 program.

rm argon2

Build and time the argon2 program:

time make argon2

This is the result:

Building without optimizations
cc -std=c89 -O3 -Wall -g -Iinclude -Isrc -pthread src/argon2.c src/core.c src/blake2/blake2b.c src/thread.c src/encoding.c src/ref.c src/run.c -o argon2

real 0m2.506s
user 0m2.336s
sys 0m0.163s
  • real 0m2.506s is the elapsed wall-clock time from the start of the build to its end.
  • user 0m2.336s is the CPU time spent running the build's own code.
  • sys 0m0.163s is the CPU time spent in the kernel on behalf of the build.

File size: 252944 bytes (about 253 kilobytes)

Note: Notice the highlighted -O3. This normally is the optimization level in GNU gcc compiler for maximum optimization.

Test 2

I will instead try to change the highlighted -O3 in Test 1 to a lower level of optimization called -O2.

I will need to edit the “Makefile” using this command:

vi Makefile

Move the cursor in front of the -O3 and press the Insert key. Type the -O2. Press the ESC key. Type :x. This will change the “Makefile” to test if a lower level of optimization will have an impact on any of the argon2 program.

Start a new build by removing any existing argon2 program.

rm argon2

Build and time the argon2 program:

time make argon2

This is the result:

Building without optimizations
cc -std=c89 -O2 -Wall -g -Iinclude -Isrc -pthread src/argon2.c src/core.c src/blake2/blake2b.c src/thread.c src/encoding.c src/ref.c src/run.c -o argon2

real 0m1.783s
user 0m1.566s
sys 0m0.214s

File size: 206464 bytes (about 206 kilobytes)

Note: Notice that the real and user times have both lowered, while the sys time is slightly higher. The file size is also smaller than in Test 1. This test was performed in the middle of the night, which means there should be few users accessing the system. I can check the number of users using the command: w. (Reference: https://www.cyberciti.biz/faq/unix-linux-list-current-logged-in-users/).

Output:

01:38:13 up 74 days, 4:56, 1 user, load average: 0.06, 0.03, 0.01
USER TTY LOGIN@ IDLE JCPU PCPU WHAT
username pts/0 23:01 0.00s 0.14s 0.02s w

I will now check the argon2 program with the current setting of -O2 optimization level.

The command to test:

sudo time operf echo -n "Ch(329nE" | ./argon2 somesalt -t 2 -m 16 -p 4 -l 24

Result:

operf: Profiler started

Profiling done.
0.05user 0.12system 0:00.25elapsed 70%CPU (0avgtext+0avgdata 5120maxresident)k
0inputs+40outputs (3major+2052minor)pagefaults 0swaps
Type: Argon2i
Iterations: 2
Memory: 65536 KiB
Parallelism: 4
Hash: 6271154a35ed64acc752368ca97460c0e295a404d0ba0d2a
Encoded: $argon2i$v=19$m=65536,t=2,p=4$c29tZXNhbHQ$YnEVSjXtZKzHUjaMqXRgwOKVpATQug0q
0.320 seconds
Verification ok

Detailed Report(using opreport -d):

Using /home/username/projects/phc-winner-argon2/oprofile_data/samples/ for samples directory.
CPU: ARM Cortex-A57, speed 750 MHz (estimated)
Counted CPU_CYCLES events (Cycle) with a unit mask of 0x00 (No unit mask) count 100000
vma samples % image name symbol name
00008ce0 2 40.0000 ld-2.27.so do_lookup_x
00008dc0 1 50.0000
00008e14 1 50.0000
000096c0 1 20.0000 ld-2.27.so _dl_lookup_symbol_x
000096f4 1 100.000
00109ea8 1 20.0000 libc-2.27.so _dl_addr
00109f78 1 100.000
000c2d10 1 20.0000 libc-2.27.so write
000c2d3c 1 100.000

Note: This program is split into 4 threads from the -p 4 option. This optimization level split the processes into multiple CPU cores compared to the baseline tests in the previous blog (Project: Part2 – Initial build testing on argon2 package using C and Assembler language(Progress 4)).

Test 3

I will change the optimization level back to -O3.

I will need to edit the “Makefile” using this command:

vi Makefile

Move the cursor in front of the -O2 and press the Insert key. Type the -O3. Press the ESC key. Type :x. This will change the “Makefile” back to the original optimization level.

Start a new build by removing any existing argon2 program.

rm argon2

Build and time the argon2 program:

time make argon2

This is the result:

Building without optimizations
cc -std=c89 -O3 -Wall -g -Iinclude -Isrc -pthread src/argon2.c src/core.c src/blake2/blake2b.c src/thread.c src/encoding.c src/ref.c src/run.c -o argon2

real 0m2.498s
user 0m2.307s
sys 0m0.189s

File Size: 252944 bytes (about 253 kilobytes)

Note: The time is about the same as the results from the previous blog (Project: Part2 – Initial build testing on argon2 package using C and Assembler language(Progress 4)).

Test 4

This test will see what will happen if I add another character into the Microsoft Generated Password “Ch(329nE”. I will add a 0 to the end of the generated password.

This is the new password that I will use for testing, “Ch(329nE0”.

The command to test:

sudo time operf echo -n "Ch(329nE0" | ./argon2 somesalt -t 2 -m 16 -p 4 -l 24

Result:

operf: Profiler started

Profiling done.
0.04user 0.12system 0:00.23elapsed 75%CPU (0avgtext+0avgdata 5120maxresident)k
0inputs+40outputs (3major+2046minor)pagefaults 0swaps
Type: Argon2i
Iterations: 2
Memory: 65536 KiB
Parallelism: 4
Hash: 9a7a5b4a3055934595908ba9bad1e959c1f5cc7fbdf37394
Encoded: $argon2i$v=19$m=65536,t=2,p=4$c29tZXNhbHQ$mnpbSjBVk0WVkIuputHpWcH1zH+983OU
0.297 seconds
Verification ok

The initial build result:

operf: Profiler started 
Profiling done. 0.03user 0.14system 0:00.21elapsed 82%CPU (0avgtext+0avgdata 5196maxresident)k
0inputs+48outputs (3major+2051minor)pagefaults 0swaps 
Type: Argon2i 
Iterations: 2 
Memory: 65536 KiB 
Parallelism: 4 
Hash: 6271154a35ed64acc752368ca97460c0e295a404d0ba0d2a 
Encoded: $argon2i$v=19$m=65536,t=2,p=4$c29tZXNhbHQ$YnEVSjXtZKzHUjaMqXRgwOKVpATQug0q 
0.296 seconds 
Verification ok

Test 5

I did one more test for timing of the argon2 program. Here is the result:

Profiling done.
0.06user 0.11system 0:00.21elapsed 84%CPU (0avgtext+0avgdata 5172maxresident)k
0inputs+40outputs (2major+2061minor)pagefaults 0swaps
Type: Argon2i
Iterations: 2
Memory: 65536 KiB
Parallelism: 4
Hash: 6271154a35ed64acc752368ca97460c0e295a404d0ba0d2a
Encoded: $argon2i$v=19$m=65536,t=2,p=4$c29tZXNhbHQ$YnEVSjXtZKzHUjaMqXRgwOKVpATQug0q
0.297 seconds
Verification ok

Test 6

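I noticed that the argon2 package's default testing program (the “smoke test” benchmark in src/bench.c) does not include any support for the AArch64 architecture. Its rdtsc() function reads the time stamp counter with x86-only code and stops the build with #error on every other architecture.

This will not work on an AArch64 architecture.

Here is the section of code, found in src/bench.c: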

static uint64_t rdtsc(void) {
#ifdef _MSC_VER
return __rdtsc();
#else
#if defined(__amd64__) || defined(__x86_64__)
uint64_t rax, rdx;
__asm__ __volatile__("rdtsc" : "=a"(rax), "=d"(rdx) : :);
return (rdx << 32) | rax;
#elif defined(__i386__) || defined(__i386) || defined(__X86__)
uint64_t rax;
__asm__ __volatile__("rdtsc" : "=A"(rax) : :);
return rax;
#else
#error "Not implemented!"
#endif
#endif
}

This is the command to build the testing program of argon2:

make bench

The testing program will run the test with randomly generated input and various settings until the user stops the program using CTRL+C.

Here is a sample output (on an x86_64 system):

Argon2i 3 iterations 64 MiB 2 threads: 4.67 cpb 299.20 Mcycles
Argon2d 3 iterations 64 MiB 2 threads: 4.40 cpb 281.41 Mcycles
Argon2id 3 iterations 64 MiB 2 threads: 4.35 cpb 278.59 Mcycles
0.4259 seconds

Each line of the output reports the argon2 type, the number of iterations, the memory usage (in MiB), the number of threads, the cycles per byte (cpb, the cycle count divided by the number of bytes of memory used), and the total number of CPU cycles (in Mcycles).
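
As a rough check of how the cpb and Mcycles columns relate, the sketch below recomputes both from a raw cycle count. It assumes that cpb is the cycle count divided by the memory size in bytes and that one Mcycle is 2^20 cycles; that assumption is chosen because it reproduces the Argon2i sample line above, and the function names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per byte: total cycles divided by the memory used, in bytes.
 * mem_kib is the memory size in KiB (for example 65536 for 64 MiB). */
double cycles_per_byte(uint64_t cycles, uint32_t mem_kib) {
    assert(mem_kib > 0);
    return (double)cycles / ((double)mem_kib * 1024.0);
}

/* Total cycles in units of 2^20 cycles (the assumed meaning of "Mcycles"). */
double mcycles(uint64_t cycles) {
    return (double)cycles / (double)(1UL << 20);
}
```

With about 313.7 million cycles and 65536 KiB (64 MiB) of memory, this gives roughly 4.67 cpb and 299.2 Mcycles, in line with the first sample line.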

I also found that AArch64 (ARMv8) has no direct equivalent of the rdtsc time stamp counter instruction, and that the MRC/MCR coprocessor instructions used on 32-bit ARM to read such counters no longer exist on AArch64. This is discussed on this site: https://stackoverflow.com/questions/32374599/mcr-and-mrc-does-not-exist-on-aarch64.

Another method is to enable the performance counters on the AArch64 system so that they can be read from assembler code. Here is a link that explains it: https://stackoverflow.com/questions/34590846/enabling-performance-monitoring-register-to-user-access-mode. This requires privileged mode (administrative rights) to enable the performance counters.
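
One alternative that does not need privileged setup is the AArch64 generic timer. The sketch below reads its virtual counter register, CNTVCT_EL0, which Linux normally leaves readable from user space; that is an assumption about the system configuration. The counter ticks at a fixed frequency (reported by CNTFRQ_EL0) rather than once per CPU cycle, so it is not a drop-in replacement for rdtsc. On other architectures the sketch falls back to clock_gettime(). The function name read_ticks is hypothetical.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Returns a monotonically increasing tick count.
 * AArch64: reads the generic timer's virtual counter (fixed frequency,
 * NOT a CPU cycle count). Elsewhere: nanoseconds from CLOCK_MONOTONIC. */
uint64_t read_ticks(void) {
#if defined(__aarch64__)
    uint64_t ticks;
    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
#else
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* avoid an unused-variable warning when NDEBUG is set */
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
}
```

The difference between two read_ticks() calls measures elapsed ticks, not CPU cycles, so a cpb figure computed from it would not be directly comparable with the x86_64 numbers.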

Conclusion

Comparing Test 4 with the initial build result shows that CPU usage is high, 75-85%, over about 0.3 seconds. This may seem like a lot of CPU usage for hashing a password, but the argon2 program finishes in a short amount of time and does not hold system resources continuously. Also, the GNU gcc compiler is already running at the -O3 optimization level. The “Makefile” is already optimized for Linux, Darwin, CYGWIN, MINGW, MSYS, and SunOS. The next step is to check the code for any unnecessary code.

(I will continue the next step in Project Part 3, Progress 1)

Project: Part2 – Initial build testing on argon2 package using C and Assembler language(Progress 4)

This follows the last blog, “Project: Part2 – Initial build testing on argon2 package using C and Assembler language(Progress 3)”.

I will be testing using the following criteria (The crossed out points are completed in previous progress blogs):

  • Check how many inline assembler code already exist
  • Check how many dedicated/ separate assembler files exist in the package (File Extension: s or S)
  • Check how many files use the C programming language (File Extension: c)
  • Use a profiling tool to check how optimized the program is in the current build package (Tools such as: gprof, stap, etc.) (This blog’s topic)
  • Build the package and check the default results by testing the build against a password file. The file will be made using a Microsoft Excel formula.
  • Check that the results are relatively consistent. (This will be from program compile time, program run time, and the program file size)
  • Introduce a minor or major change that will optimize or reduce performance of the original program. This is done by changing the character(s) of the password file generated by the Microsoft Excel formula.
  • Compare the results of the changes and the original program results.
  • Building/ Testing the original argon2 package

This post is about using a profiling tool to evaluate the optimization and performance level of the argon2 package on an AArch64 Fedora 28 Linux operating system. The system has 8 CPU cores (Cortex-A57 model).

I will be using the profiling tool named oprofile. This tool is open source and there is a manual that comes with the software. The oprofile site is in the following link: http://oprofile.sourceforge.net/news/.

I can access the manual using this command in Linux:

man operf

oprofile is a tool that tracks the performance of the system and generates a report.

The basic command template for oprofile is:

operf [ options ] [ --system-wide | --pid <pid> | [ command [ args ] ] ]

Here are some of the options for oprofile:

--system-wide: this option requires elevated (root) permissions before operf will run. It makes operf sample the performance of the entire system. The operf manual recommends running it from /root or a subdirectory of /root, so that the sample data is not stored in a location that regular users can access.

--pid <pid>: this option samples a single process, selected by its process ID (pid). Every task running on the system uses CPU cycles, and each running task is identified by a process ID. operf keeps sampling the process until the user stops it, either with CTRL+C or by sending the operf process a signal with the following command:

kill -SIGINT <operf-PID>

[ command [ args ] ] is simply any command whose performance you want to sample.

I will perform the profiling in my current directory as I do not want to save the profiling sampled data in the root directory.

My current directory is:

/home/username/projects/phc-winner-argon2/

I will be running the command with these requirements:

  • Run as a root user
  • Use the Microsoft Excel generated password (from my “Project: Part2 – Initial build testing on argon2 package using C and Assembler language(Progress 3)” blog): Ch(329nE
  • Start from the argon2 example command: echo -n "password" | ./argon2 somesalt -t 2 -m 16 -p 4 -l 24

This will be the command to run oprofile:

sudo operf echo -n "Ch(329nE" | ./argon2 somesalt -t 2 -m 16 -p 4 -l 24

Here is an explanation of what the command does:

  • sudo runs the command that follows it as the root user.
  • operf is the oprofile profiling command.
  • echo writes the text in the double quotes that follow it to standard output.
  • -n is the echo option that suppresses the trailing newline.
  • “Ch(329nE” is the Microsoft Excel generated password from my “Project: Part2 – Initial build testing on argon2 package using C and Assembler language(Progress 3)” blog.
  • | is the pipe operator, which sends the output of the command on its left as input to the command on its right.
  • ./argon2 runs the built argon2 program in my current directory, /home/username/projects/phc-winner-argon2/.
  • somesalt is the salt, an extra input that is hashed together with the password. The definition of salt that I found is from Wikipedia (https://en.wikipedia.org/wiki/Argon2).
  • -t 2 is the number of iterations to perform.
  • -m 16 sets the memory usage to 2^16 KiB (the 65536 KiB shown in the output), not 16 kilobytes.
  • -p 4 sets the parallelism, which is the number of threads this process will use.
  • -l 24 sets the hash output length in bytes.
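
The argon2 command-line usage text describes -m N as setting the memory usage to 2^N KiB, which is why -m 16 appears as 65536 KiB in the program output. A minimal sketch of that arithmetic, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Memory in KiB selected by the command-line option "-m exponent",
 * assuming the option means 2^exponent KiB. */
uint32_t memory_kib_from_exponent(unsigned exponent) {
    assert(exponent < 32); /* keep the shift inside a 32-bit value */
    return (uint32_t)1 << exponent;
}
```

For -m 16 this returns 65536 KiB, which is 64 MiB.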

Initial Testing

This was my output:

operf: Profiler started

Profiling done.
Type:           Argon2i
Iterations:     2
Memory:         65536 KiB
Parallelism:    4
Hash:           6271154a35ed64acc752368ca97460c0e295a404d0ba0d2a
Encoded:        $argon2i$v=19$m=65536,t=2,p=4$c29tZXNhbHQ$YnEVSjXtZKzHUjaMqXRgwOKVpATQug0q
0.295 seconds
Verification ok
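
The Encoded line repeats the parameters from the command line in its header: the type, the version (v), the memory in KiB (m), the iterations (t), and the parallelism (p), followed by the Base64 salt and hash. As an illustration only, this sketch pulls those fields out with sscanf; the struct and function names are hypothetical, and this is not how the argon2 library itself decodes the string.

```c
#include <assert.h>
#include <stdio.h>

/* Parameters carried in the header of an encoded Argon2 hash string. */
struct argon2_params {
    char type[16];        /* for example "argon2i" */
    unsigned version;     /* v= field */
    unsigned memory_kib;  /* m= field */
    unsigned iterations;  /* t= field */
    unsigned parallelism; /* p= field */
};

/* Returns 1 if all five header fields were parsed, 0 otherwise. */
int parse_encoded_header(const char *encoded, struct argon2_params *out) {
    assert(encoded != NULL && out != NULL);
    return sscanf(encoded, "$%15[^$]$v=%u$m=%u,t=%u,p=%u$",
                  out->type, &out->version, &out->memory_kib,
                  &out->iterations, &out->parallelism) == 5;
}
```

Applied to the Encoded line above, the fields come back as argon2i, 19, 65536, 2, and 4, matching the Type, Memory, Iterations, and Parallelism lines of the output.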

I will now test using the built-in timer command in Linux. The following is the command:

sudo time operf echo -n "Ch(329nE" | ./argon2 somesalt -t 2 -m 16 -p 4 -l 24

Note: Notice the position of the commands. I have placed the time command after sudo and in front of operf.

The time command keeps track of how much user CPU time, system CPU time, and elapsed time the command after it needs to run.

This is my output:

sudo time operf echo -n "Ch(329nE" | ./argon2 somesalt -t 2 -m 16 -p 4 -l 24
operf: Profiler started

Profiling done.
0.03user 0.14system 0:00.21elapsed 82%CPU (0avgtext+0avgdata 5196maxresident)k
0inputs+48outputs (3major+2051minor)pagefaults 0swaps
Type: Argon2i
Iterations: 2
Memory: 65536 KiB
Parallelism: 4
Hash: 6271154a35ed64acc752368ca97460c0e295a404d0ba0d2a
Encoded: $argon2i$v=19$m=65536,t=2,p=4$c29tZXNhbHQ$YnEVSjXtZKzHUjaMqXRgwOKVpATQug0q
0.296 seconds
Verification ok

There are two new lines in the output, beginning with “0.03user”. They show the time the command took to run.

  • 0.03user is the CPU time the command spent running in user mode (in seconds).
  • 0.14system is the CPU time the kernel spent working on behalf of the command (in seconds).
  • 0:00.21elapsed is the wall-clock time from the start of the command to its end.
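
The 82%CPU figure can be cross-checked by hand: GNU time documents it as (user + system) / elapsed. A small sketch of that calculation, with a hypothetical function name:

```c
#include <assert.h>

/* Percentage of one CPU used: CPU time (user + system) over wall-clock time.
 * All arguments are in seconds. */
double cpu_percent(double user_s, double system_s, double elapsed_s) {
    assert(elapsed_s > 0.0);
    return 100.0 * (user_s + system_s) / elapsed_s;
}
```

For 0.03, 0.14, and 0.21 this gives about 81%, close to the reported 82%; the small gap is presumably because time divides the unrounded values.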

A new directory should be created called oprofile_data under the current directory.

This can be seen by running the list directory command:

ls

I will now check the report generated by the previous oprofile command.

The command is:

opreport

This is my result:

Using /home/username/projects/phc-winner-argon2/oprofile_data/samples/ for samples directory.
CPU: ARM Cortex-A57, speed 750 MHz (estimated)
Counted CPU_CYCLES events (Cycle) with a unit mask of 0x00 (No unit mask) count 100000
CPU_CYCLES:100000|
samples| %|
------------------
30 100.000 echo
CPU_CYCLES:100000|
samples| %|
------------------
26 86.6667 kallsyms
3 10.0000 ld-2.27.so
1 3.3333 libc-2.27.so

Initial Testing (Continued)

I will be running the profiling tool oprofile two more times to get a baseline.

oprofile test 2

Command:

sudo time operf echo -n "Ch(329nE" | ./argon2 somesalt -t 2 -m 16 -p 4 -l 24

Result:

operf: Profiler started

Profiling done.
0.05user 0.12system 0:00.25elapsed 69%CPU (0avgtext+0avgdata 5184maxresident)k
0inputs+32outputs (2major+2050minor)pagefaults 0swaps
Type: Argon2i
Iterations: 2
Memory: 65536 KiB
Parallelism: 4
Hash: 6271154a35ed64acc752368ca97460c0e295a404d0ba0d2a
Encoded: $argon2i$v=19$m=65536,t=2,p=4$c29tZXNhbHQ$YnEVSjXtZKzHUjaMqXRgwOKVpATQug0q
0.297 seconds
Verification ok

Check Report(using command: opreport):

Using /home/username/projects/phc-winner-argon2/oprofile_data/samples/ for samples directory.
CPU: ARM Cortex-A57, speed 750 MHz (estimated)
Counted CPU_CYCLES events (Cycle) with a unit mask of 0x00 (No unit mask) count 100000
CPU_CYCLES:100000|
samples| %|
------------------
25 100.000 echo
CPU_CYCLES:100000|
samples| %|
------------------
21 84.0000 kallsyms
4 16.0000 ld-2.27.so

oprofile test 3

Command:

sudo time operf echo -n "Ch(329nE" | ./argon2 somesalt -t 2 -m 16 -p 4 -l 24

Result:

operf: Profiler started

Profiling done.
0.02user 0.15system 0:00.23elapsed 75%CPU (0avgtext+0avgdata 5132maxresident)k
0inputs+40outputs (3major+2050minor)pagefaults 0swaps
Type: Argon2i
Iterations: 2
Memory: 65536 KiB
Parallelism: 4
Hash: 6271154a35ed64acc752368ca97460c0e295a404d0ba0d2a
Encoded: $argon2i$v=19$m=65536,t=2,p=4$c29tZXNhbHQ$YnEVSjXtZKzHUjaMqXRgwOKVpATQug0q
0.297 seconds
Verification ok

Check Report(using command: opreport):

Using /home/username/projects/phc-winner-argon2/oprofile_data/samples/ for samples directory.
CPU: ARM Cortex-A57, speed 750 MHz (estimated)
Counted CPU_CYCLES events (Cycle) with a unit mask of 0x00 (No unit mask) count 100000
CPU_CYCLES:100000|
samples| %|
------------------
25 100.000 echo
CPU_CYCLES:100000|
samples| %|
------------------
21 84.0000 kallsyms
3 12.0000 ld-2.27.so
1 4.0000 libc-2.27.so

I realized that the opreport command was not using the option for a detailed output.

This is the command to show a detailed report:

opreport -d

This is my result:

Using /home/username/projects/phc-winner-argon2/oprofile_data/samples/ for samples directory.
CPU: ARM Cortex-A57, speed 750 MHz (estimated)
Counted CPU_CYCLES events (Cycle) with a unit mask of 0x00 (No unit mask) count 100000
vma samples % image name symbol name
000096c0 1 25.0000 ld-2.27.so _dl_lookup_symbol_x
00009710 1 100.000
0000ac88 1 25.0000 ld-2.27.so _dl_relocate_object
0000b170 1 100.000
00008ce0 1 25.0000 ld-2.27.so do_lookup_x
00008ebc 1 100.000
00109ea8 1 25.0000 libc-2.27.so _dl_addr
00109f74 1 100.000

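I can see two image names and four symbol names in the detailed report.

Image Names: ld-2.27.so, libc-2.27.so

Symbol Names: _dl_lookup_symbol_x, _dl_relocate_object, do_lookup_x, _dl_addr

Each symbol name accounts for one sample, which is 25% of the four samples listed. One sample is recorded every 100000 CPU cycles.

These symbols belong to the dynamic linker (ld-2.27.so) and the C library (libc-2.27.so): _dl_lookup_symbol_x and do_lookup_x look up symbols while a program's shared libraries are being loaded, _dl_relocate_object applies relocations, and _dl_addr maps an address back to a symbol. They are part of program start-up rather than of the hashing itself, and the opreport summaries above attribute all of the samples to echo, the command that operf was placed in front of in the pipeline. This is the report generated when profiling at about 75% CPU load/usage.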
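
The percentages in the opreport output are sample shares, and the sampling count gives a rough cycle estimate. The sketch below assumes one sample is recorded per 100000 CPU_CYCLES events, as the report header states; the function names are hypothetical.

```c
#include <assert.h>

/* Share of samples attributed to one image or symbol, as a percentage. */
double sample_percent(unsigned long samples, unsigned long total_samples) {
    assert(total_samples > 0);
    return 100.0 * (double)samples / (double)total_samples;
}

/* Approximate number of cycles represented by some samples, given that
 * one sample is recorded every `count` CPU_CYCLES events. */
unsigned long long approx_cycles(unsigned long samples, unsigned long count) {
    return (unsigned long long)samples * count;
}
```

For the first report, 26 of 30 samples is about 86.67% for kallsyms, and 30 samples correspond to roughly 3,000,000 cycles.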

(To be continued at Progress 5 blog)

Project: Part2 – Initial build testing on argon2 package using C and Assembler language(Progress 3)

This follows the last blog, “Project: Part2 – Initial build testing on argon2 package using C and Assembler language(Progress 2)”.

I will continue to blog about the build of argon2 password hashing function.

Testing Requirements

I will be testing using the following criteria(The crossed out are done in the Progress 2):

  • Check how many inline assembler code already exist
  • Check how many dedicated/ separate assembler files exist in the package (File Extension: s or S)
  • Check how many files use the C programming language (File Extension: c)
  • Use a profiling tool to check how optimized the program is in the current build package (Tools such as: gprof, stap, etc.)
  • Build the package and check the default results by testing the build against a password file. The file will be made using a Microsoft Excel formula.
  • Check that the results are relatively consistent. (This will be from program compile time, program run time, and the program file size)
  • Introduce a minor or major change that will optimize or reduce performance of the original program. This is done by changing the character(s) of the password file generated by the Microsoft Excel formula.
  • Compare the results of the changes and the original program results.

Use a profiling tool to check how optimized the program is in the current build package (Tools such as: gprof, stap, etc.)

I will be using a profiling tool.

(This continues in Progress 4)