Project: Part3 – Optimizing and porting the argon2 package using C and Assembler language (Progress 2)

Requirements/ System Specifications.

Argon2 Password hashing function package:

https://github.com/P-H-C/phc-winner-argon2

Aarch64 Fedora 28 version of Linux operating system

Cortex-A57 8 core processor

One set of Dual-Channel DIMM DDR3 8GB RAM (16GB in total)

New Plan

The new plan is to change the benchmark program so that it measures the run time (the time required to run the program or a specific piece of code) with the system clock, instead of relying on RDTSC. RDTSC is the instruction that reads the time stamp counter register, and it exists only on x86 and x86_64 processors.
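As a rough sketch of this plan, the helper below reads the system clock and returns the time in nanoseconds. The name now_ns is my own and is not part of argon2. It also uses CLOCK_MONOTONIC, which is an assumption on my side: that clock is not changed when the wall-clock time is adjusted, so it may suit interval timing better than CLOCK_REALTIME.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* makes clock_gettime visible under strict -std=c89 */
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical helper (not part of argon2): monotonic time in nanoseconds,
 * or 0 if the clock cannot be read. */
static uint64_t now_ns(void) {
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
        return 0;
    }
    /* Scale the seconds to nanoseconds before adding the nanosecond field. */
    return (uint64_t)ts.tv_sec * UINT64_C(1000000000) + (uint64_t)ts.tv_nsec;
}
```

Calling now_ns() once before and once after the hash and subtracting the two readings would give the elapsed time in nanoseconds on any architecture that has clock_gettime.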

Here is the original benchmark program file (bench.c):

/*
* Argon2 reference source code package - reference C implementations
*
* Copyright 2015
* Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
*
* You may use this work under the terms of a Creative Commons CC0 1.0
* License/Waiver or the Apache Public License 2.0, at your option. The terms of
* these licenses can be found at:
*
* - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
* - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0
*
* You should have received a copy of both of these licenses along with this
* software. If not, they may be obtained at the above URLs.
*/

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#ifdef _MSC_VER
#include <intrin.h>
#endif

#include "argon2.h"

static uint64_t rdtsc(void) {
#ifdef _MSC_VER
return __rdtsc();
#else
#if defined(__amd64__) || defined(__x86_64__)
uint64_t rax, rdx;
__asm__ __volatile__("rdtsc" : "=a"(rax), "=d"(rdx) : :);
return (rdx << 32) | rax;
#elif defined(__i386__) || defined(__i386) || defined(__X86__)
uint64_t rax;
__asm__ __volatile__("rdtsc" : "=A"(rax) : :);
return rax;
#else
#error "Not implemented!"
#endif
#endif
}

/*
* Benchmarks Argon2 with salt length 16, password length 16, t_cost 3,
and different m_cost and threads
*/
static void benchmark() {
#define BENCH_OUTLEN 16
#define BENCH_INLEN 16
const uint32_t inlen = BENCH_INLEN;
const unsigned outlen = BENCH_OUTLEN;
unsigned char out[BENCH_OUTLEN];
unsigned char pwd_array[BENCH_INLEN];
unsigned char salt_array[BENCH_INLEN];
#undef BENCH_INLEN
#undef BENCH_OUTLEN

uint32_t t_cost = 3;
uint32_t m_cost;
uint32_t thread_test[4] = {1, 2, 4, 8};
argon2_type types[3] = {Argon2_i, Argon2_d, Argon2_id};

memset(pwd_array, 0, inlen);
memset(salt_array, 1, inlen);

for (m_cost = (uint32_t)1 << 10; m_cost <= (uint32_t)1 << 22; m_cost *= 2) {
unsigned i;
for (i = 0; i < 4; ++i) {
double run_time = 0;
uint32_t thread_n = thread_test[i];

unsigned j;
for (j = 0; j < 3; ++j) {
clock_t start_time, stop_time;
uint64_t start_cycles, stop_cycles;
uint64_t delta;
double mcycles;

argon2_type type = types[j];
start_time = clock();
start_cycles = rdtsc();

argon2_hash(t_cost, m_cost, thread_n, pwd_array, inlen,
salt_array, inlen, out, outlen, NULL, 0, type,
ARGON2_VERSION_NUMBER);

stop_cycles = rdtsc();
stop_time = clock();

delta = (stop_cycles - start_cycles) / (m_cost);
mcycles = (double)(stop_cycles - start_cycles) / (1UL << 20);
run_time += ((double)stop_time - start_time) / (CLOCKS_PER_SEC);

printf("%s %d iterations %d MiB %d threads: %2.2f cpb %2.2f "
"Mcycles \n", argon2_type2string(type, 1), t_cost,
m_cost >> 10, thread_n, (float)delta / 1024, mcycles);
}

printf("%2.4f seconds\n\n", run_time);
}
}
}

int main() {
benchmark();
return ARGON2_OK;
}

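I will change the bench.c file by removing the rdtsc function. The rdtsc function reads the processor's time stamp counter; the benchmark calls it once before and once after the hash and subtracts the two readings to get the number of cycles used. I will also take out the code that is affected by the change: the start_cycles, stop_cycles, delta, and mcycles variables and the lines in the timing loop that use them, together with the clock() calls.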

The main part/chunk of the program is this section:

argon2_hash(t_cost, m_cost, thread_n, pwd_array, inlen,
salt_array, inlen, out, outlen, NULL, 0, type,
ARGON2_VERSION_NUMBER);

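This call sits inside three nested for loops. The outer loop doubles the memory cost each time, starting at 1 MiB, the middle loop tries 1, 2, 4, and 8 threads, and the inner loop runs each of the three Argon2 hashing types (Argon2i, Argon2d, and Argon2id) before the calculated run time is printed. The program keeps going through larger memory sizes until it finishes or is stopped by the user (using CTRL+C or the kill command).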

The Linux function for reading the system clock is called clock_gettime (https://users.pja.edu.pl/~jms/qnx/help/watcom/clibref/qnx/clock_gettime.html). The linked page also contains an example that I will use to get the run time I need for my test. Here is the example:

/*
 * This program calculates the time required to
 * execute the program specified as its first argument.
 * The time is printed in seconds, on standard out.
 */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <time.h>

#define BILLION  1000000000L;

int main( int argc, char **argv )
  {
    struct timespec start, stop;
    double accum;

    if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) {
      perror( "clock gettime" );
      exit( EXIT_FAILURE );
    }

    system( argv[1] );

    if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) {
      perror( "clock gettime" );
      exit( EXIT_FAILURE );
    }

    accum = ( stop.tv_sec - start.tv_sec )
          + (double)( stop.tv_nsec - start.tv_nsec )
            / BILLION;
    printf( "%lf\n", accum );
    return( EXIT_SUCCESS );
  }

The parts that I require are the BILLION define, the start and stop declarations, the two clock_gettime checks, and the accum calculation.

The final code will look like this:

/*
* Argon2 reference source code package - reference C implementations
*
* Copyright 2015
* Daniel Dinu, Dmitry Khovratovich, Jean-Philippe Aumasson, and Samuel Neves
*
* You may use this work under the terms of a Creative Commons CC0 1.0
* License/Waiver or the Apache Public License 2.0, at your option. The terms of
* these licenses can be found at:
*
* - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
* - Apache 2.0 : http://www.apache.org/licenses/LICENSE-2.0
*
* You should have received a copy of both of these licenses along with this
* software. If not, they may be obtained at the above URLs.
*/

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#define BILLION 1000000000L;
#ifdef _MSC_VER
#include <intrin.h>
#endif

#include "argon2.h"

/*
static uint64_t rdtsc(void) {
#ifdef _MSC_VER
return __rdtsc();
#else
#if defined(__amd64__) || defined(__x86_64__)
uint64_t rax, rdx;
__asm__ __volatile__("rdtsc" : "=a"(rax), "=d"(rdx) : :);
return (rdx << 32) | rax;
#elif defined(__i386__) || defined(__i386) || defined(__X86__)
uint64_t rax;
__asm__ __volatile__("rdtsc" : "=A"(rax) : :);
return rax;
#else
#error "Not implemented!"
#endif
#endif
}

*/


/*
* Benchmarks Argon2 with salt length 16, password length 16, t_cost 3,
and different m_cost and threads
*/
static void benchmark() {
#define BENCH_OUTLEN 16
#define BENCH_INLEN 16
const uint32_t inlen = BENCH_INLEN;
const unsigned outlen = BENCH_OUTLEN;
unsigned char out[BENCH_OUTLEN];
unsigned char pwd_array[BENCH_INLEN];
unsigned char salt_array[BENCH_INLEN];
#undef BENCH_INLEN
#undef BENCH_OUTLEN

struct timespec start, stop;
double accum;

uint32_t t_cost = 3;
uint32_t m_cost;
uint32_t thread_test[4] = {1, 2, 4, 8};
argon2_type types[3] = {Argon2_i, Argon2_d, Argon2_id};

memset(pwd_array, 0, inlen);
memset(salt_array, 1, inlen);

for (m_cost = (uint32_t)1 << 10; m_cost <= (uint32_t)1 << 22; m_cost *= 2) {
unsigned i;
for (i = 0; i < 4; ++i) {
double run_time = 0;
uint32_t thread_n = thread_test[i];
unsigned j;
for (j = 0; j < 3; ++j) {
/*clock_t start_time, stop_time;
uint64_t start_cycles, stop_cycles;
uint64_t delta;
double mcycles;*/

argon2_type type = types[j];

/*start_time = clock();
start_cycles = rdtsc();*/

if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) {
perror( "clock gettime" );
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &start);
}

argon2_hash(t_cost, m_cost, thread_n, pwd_array, inlen,
salt_array, inlen, out, outlen, NULL, 0, type,
ARGON2_VERSION_NUMBER);

/*stop_cycles = rdtsc();
stop_time = clock();*/

/*delta = (stop_cycles - start_cycles) / (m_cost);
mcycles = (double)(stop_cycles - start_cycles) / (1UL << 20);
run_time += ((double)stop_time - start_time) / (CLOCKS_PER_SEC);*/

if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) {
perror( "clock gettime" );
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &stop);
}

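/* tv_sec is in seconds and tv_nsec in nanoseconds; the two differences
   are added here without scaling the seconds. */
accum = ( (double)stop.tv_sec - start.tv_sec )
+ ( (double)stop.tv_nsec - start.tv_nsec );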

double mcycles = accum / (1UL << 20);
uint64_t delta = accum / (m_cost);

printf("%s %d iterations %d MiB %d threads: %2.2f cpb %2.2f "
"Mcycles \n", argon2_type2string(type, 1), t_cost,
m_cost >> 10, thread_n, (float)delta / 1024, mcycles);

/* Added to silence a compiler warning; it also clears the total before every hash. */
run_time = 0;
run_time += accum / BILLION;

}

printf("%2.4f seconds\n\n", run_time);
}
}

}

int main() {
benchmark();
return ARGON2_OK;
}

NOTE: /* */ marks a comment block; the compiler ignores everything between the two markers.

I will now explain what the code does.

#include <time.h>
#include <unistd.h>
#define BILLION 1000000000L;

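The #include lines pull in header files, so I can use existing library functions instead of writing them myself. These headers are installed together with the system C library and the GNU gcc compiler, so nothing extra is needed. The time.h header declares clock_gettime and struct timespec.

NOTE: The name of a system header is enclosed in angle brackets (<>).

The #define creates a macro: a name that the preprocessor replaces with the given text wherever it appears later in the program. Here BILLION stands for the number of nanoseconds in one second.

NOTE: The format requires a name, then the replacement text. No semicolon is needed; the one at the end of my BILLION line becomes part of the replacement text, which only works because BILLION is used at the very end of a statement.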
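To show why the semicolon matters, here is a small sketch with the define written without it. The function name to_seconds is hypothetical and only serves the example. With a trailing semicolon in the define, the return line below would still be accepted by the compiler, but the statement would end right after the division and the seconds would be dropped.

```c
#define BILLION 1000000000L /* no trailing semicolon */

/* Hypothetical helper: seconds plus nanoseconds as one value in seconds. */
static double to_seconds(long sec, long nsec) {
    /* BILLION sits in the middle of the expression here, so the macro
     * must expand to a plain number. */
    return (double)nsec / BILLION + (double)sec;
}
```

For example, to_seconds(2, 500000000L) gives 2.5.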

The next set of code:

struct timespec start, stop;
double accum;

The struct keyword refers to a structure (a type that groups several values together). This line declares two variables, start and stop, of the type struct timespec, which is defined in time.h. Each one holds a time as a number of whole seconds (tv_sec) plus a number of nanoseconds (tv_nsec).

NOTE: The format is the struct keyword, the structure's name, then the variable names separated by commas. The line must be closed with a semicolon (;), like most C/C++ statements.
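The two fields of struct timespec can be seen in this small sketch. The helper make_timespec is made up for illustration; in the benchmark the fields are filled in by clock_gettime instead.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* makes struct timespec visible under strict -std=c89 */
#endif
#include <time.h>

/* Hypothetical helper: build a timespec by hand to show its two fields. */
static struct timespec make_timespec(long sec, long nsec) {
    struct timespec ts;
    ts.tv_sec = (time_t)sec; /* whole seconds */
    ts.tv_nsec = nsec;       /* nanoseconds past that second, 0 to 999999999 */
    return ts;
}
```

Because the time is split over two fields with different units, any calculation on it has to bring both fields to the same unit first.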

The second line declares accum, a variable of type double (a floating-point number) that will hold the elapsed time later in the program.

NOTE: The format is the variable's type followed by the variable's name. Example: double is the type, accum is the name.

Here is the next piece of code:

double run_time = 0;

This is another variable, which starts with a value of zero.

NOTE: This line is inside the for loop over the thread counts, so run_time is reset to zero for each thread count before the main chunk of code mentioned before is timed.

The next piece of code:

if( clock_gettime( CLOCK_REALTIME, &start) == -1 ) {
perror( "clock gettime" );
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &start);
}

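The if statement is from the example found here: (https://users.pja.edu.pl/~jms/qnx/help/watcom/clibref/qnx/clock_gettime.html). clock_gettime stores the current time in start and returns -1 if the clock cannot be read; in that case the program prints an error and exits.

I added an else branch that calls clock_gettime again when the first call succeeds. This second call is not actually needed: clock_gettime does not start a timer, it only reads the current time, and the call inside the if condition has already stored it in start.

NOTE: An else always comes after an if statement.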
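The check can be done with a single call, as in the sketch below. The function name read_clock is hypothetical, and unlike the benchmark it returns an error code instead of exiting.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* makes clock_gettime visible under strict -std=c89 */
#endif
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: read the clock once; 0 on success, -1 on failure. */
static int read_clock(struct timespec *ts) {
    /* The call in the condition already fills *ts when it succeeds. */
    if (clock_gettime(CLOCK_REALTIME, ts) == -1) {
        perror("clock_gettime");
        return -1;
    }
    return 0;
}
```

Reading the clock only once also keeps the measured interval tight, because no extra call sits between the reading and the code being timed.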

The next section of code:

if( clock_gettime( CLOCK_REALTIME, &stop) == -1 ) {
perror( "clock gettime" );
exit( EXIT_FAILURE );
}
else
{
clock_gettime(CLOCK_REALTIME, &stop);
}

After the main chunk of code is done running, I read the clock a second time, into stop. This code is the same as before, with stop in place of start. The if statement checks for the error that occurs when the system clock cannot be read.

The next section of code:

accum = ( (double)stop.tv_sec - start.tv_sec )
+ ( (double)stop.tv_nsec - start.tv_nsec );

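The elapsed time is stored in the variable accum. The first bracket is the difference in whole seconds and the second bracket is the difference in nanoseconds, and the two are added together as they are, without converting the seconds to nanoseconds.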
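For comparison, here is a sketch of the same subtraction with the seconds converted to nanoseconds first, so that both parts have the same unit. The function name elapsed_ns is my own and is not in bench.c.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* makes struct timespec visible under strict -std=c89 */
#endif
#include <time.h>

/* Hypothetical helper: elapsed time between two readings, in nanoseconds. */
static double elapsed_ns(const struct timespec *start, const struct timespec *stop) {
    /* One second is 1e9 nanoseconds. */
    double sec_part = (double)(stop->tv_sec - start->tv_sec) * 1e9;
    /* This part is negative when stop falls early in a later second. */
    double nsec_part = (double)(stop->tv_nsec - start->tv_nsec);
    return sec_part + nsec_part;
}
```

With made-up readings of 10 s + 990,000,000 ns for start and 11 s + 12,500,000 ns for stop, the function gives 22,500,000 ns (22.5 ms).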

NOTE: My accum line is similar to the code from the example (https://users.pja.edu.pl/~jms/qnx/help/watcom/clibref/qnx/clock_gettime.html), but I have removed the / BILLION at the end because I need the unconverted number for the next lines of code:

double mcycles = accum / (1UL << 20);
uint64_t delta = accum / (m_cost);

The variable mcycles takes the value of the variable accum and divides it by (1UL << 20). The explanation of 1UL is found here (https://stackoverflow.com/questions/14467173/bit-setting-in-ansi-c): it is the value 1 as an unsigned long integer. The << 20 is a bit shift that moves the bit twenty positions to the left, which gives 2^20 = 1,048,576. As in basic algebra, the brackets are evaluated first. In the original program this variable held the number of processor cycles in millions (Mcycles); in my version accum is a clock reading, so the number is no longer a cycle count.

The variable delta is of type uint64_t, an unsigned 64-bit integer. In the original program it is the cycle count divided by the memory cost, and the printf divides it by 1024 again to report cycles per byte (cpb). In my version it is the timed value divided by the memory cost (2^n) value instead.
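Since the timed value is in nanoseconds and not in cycles, one option is to report nanoseconds per byte instead of cycles per byte. This is only a suggestion on my part and not something the argon2 package does; the function name ns_per_byte is hypothetical, and it assumes the elapsed time has already been converted fully to nanoseconds.

```c
#include <stdint.h>

/* Hypothetical helper: nanoseconds spent per byte of memory filled.
 * m_cost_kib is the Argon2 memory cost, which is given in KiB. */
static double ns_per_byte(double elapsed_ns, uint32_t m_cost_kib) {
    return elapsed_ns / ((double)m_cost_kib * 1024.0);
}
```

For example, 2,097,152 ns spent on a memory cost of 1024 KiB (1 MiB) gives 2.0 ns per byte.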

The m_cost variable is the memory cost in KiB. It is set by the outer for loop:

for (m_cost = (uint32_t)1 << 10; m_cost <= (uint32_t)1 << 22; m_cost *= 2)

NOTE: The variable type uint32_t is an unsigned 32-bit integer variable.

The next section of code:

run_time = 0;
run_time += accum / BILLION;

The run_time variable is set to 0 again because the GNU gcc C language compiler kept complaining about the variable not being used. (This may be an issue later; I will have to check it.) One side effect is that run_time is cleared before every hash, so the seconds printed at the end cover only the last of the three types.

I then add the value of (accum divided by BILLION) to run_time. This is where the remaining part of the example found here (https://users.pja.edu.pl/~jms/qnx/help/watcom/clibref/qnx/clock_gettime.html) is used. The division converts the value into seconds: tv_nsec counts nanoseconds, and there are 1,000,000,000 nanoseconds in one second, so dividing the timed result by one billion gives a number in seconds.
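If the goal is one total for the three hash types, the reset has to happen once before the loop and not inside it. The sketch below shows the idea on an array of elapsed times; the function name total_seconds and the array are made up for the example.

```c
#include <stddef.h>

/* Hypothetical helper: add up several elapsed times (in nanoseconds)
 * and return the total in seconds. */
static double total_seconds(const double *elapsed_ns, size_t count) {
    double run_time = 0.0; /* cleared once, before the loop */
    size_t i;
    for (i = 0; i < count; ++i) {
        run_time += elapsed_ns[i] / 1e9; /* nanoseconds to seconds */
    }
    return run_time;
}
```

With three runs of 500,000,000 ns, 250,000,000 ns, and 250,000,000 ns the total is 1.0 second.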

The final section of code is:

printf("%2.4f seconds\n\n", run_time);

This line of code prints the calculated time value to the user. In %2.4f the 2 is the minimum width of the whole number in characters, and the .4 prints four digits after the decimal point.
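The effect of the format can be checked without running the whole benchmark. The sketch below writes the text into a buffer with snprintf; the function name format_seconds is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: format a time the same way the benchmark prints it.
 * Returns the number of characters snprintf wanted to write. */
static int format_seconds(char *buf, size_t size, double run_time) {
    return snprintf(buf, size, "%2.4f seconds", run_time);
}
```

A value of 12.5 comes out as "12.5000 seconds", and a negative value such as -0.5 comes out as "-0.5000 seconds" with the minus sign in front.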

Result:

The result was strange, as some of the calculated times had negative values. Also, the benchmark program appeared to run really fast compared to the original benchmark program run on an x86_64 processor system.
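One possible reading of the negative values, which I have not yet verified on the hardware: they show up when the second changes between the two clock readings. The seconds difference then counts as only 1, while the nanoseconds difference is a large negative number. The sketch below repeats the calculation as it is written in my bench.c, using made-up clock readings; the function name accum_as_written is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* makes struct timespec visible under strict -std=c89 */
#endif
#include <time.h>

/* The calculation as written in the modified bench.c: the seconds and
 * nanoseconds differences are added without scaling the seconds. */
static double accum_as_written(const struct timespec *start, const struct timespec *stop) {
    return ((double)stop->tv_sec - start->tv_sec)
         + ((double)stop->tv_nsec - start->tv_nsec);
}
```

With start at 10 s + 990,000,000 ns and stop at 11 s + 12,500,000 ns, the real elapsed time is 22.5 ms, but the function returns 1 - 977,500,000 = -977,499,999. Divided by one billion this prints as -0.9775 seconds, which would be consistent with the negative lines in my results.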

The program is built using the Makefile included in the argon2 package:

Building without optimizations
cc -std=c89 -O2 -Wall -g -Iinclude -Isrc -pthread src/argon2.c src/core.c src/blake2/blake2b.c src/thread.c src/encoding.c src/ref.c src/bench.c -o bench

NOTE: I have changed the build flag to -O2 instead of -O3.

The next test (on the x86_64 architecture; basic test of the original program)

The x86_64 system has this hardware:

Intel(R) Xeon(R) CPU E5-1630 v4 @ 3.70GHz

Four sticks of 8GB DIMM DDR4 RAM at 2.4 GHz (32 GB of RAM in total)

x86_64 Fedora 28 version of Linux Operating System

I will first show the x86_64 system results:

37312731171934821544Argon2i 3 iterations 1 MiB 1 threads: 10574.77 cpb 10574.77 Mcycles
1247866662622202070Argon2d 3 iterations 1 MiB 1 threads: 10573.72 cpb 10573.72 Mcycles
8121691903307694325Argon2id 3 iterations 1 MiB 1 threads: 10571.92 cpb 10571.92 Mcycles
0.0100 seconds

14977167733997818648Argon2i 3 iterations 1 MiB 2 threads: 10576.28 cpb 10576.28 Mcycles
2187773163388595072Argon2d 3 iterations 1 MiB 2 threads: 10572.17 cpb 10572.17 Mcycles
28735233341075268257Argon2id 3 iterations 1 MiB 2 threads: 10573.05 cpb 10573.05 Mcycles
0.0171 seconds

35601863931760751719Argon2i 3 iterations 1 MiB 4 threads: 10571.92 cpb 10571.93 Mcycles
42457421402446018348Argon2d 3 iterations 1 MiB 4 threads: 10571.65 cpb 10571.65 Mcycles
6359495663131182784Argon2id 3 iterations 1 MiB 4 threads: 10571.64 cpb 10571.64 Mcycles
0.0220 seconds

13211386193827312231Argon2i 3 iterations 1 MiB 8 threads: 10582.07 cpb 10582.07 Mcycles
2017380744228771863Argon2d 3 iterations 1 MiB 8 threads: 10582.25 cpb 10582.25 Mcycles
2713660177924949126Argon2id 3 iterations 1 MiB 8 threads: 10582.15 cpb 10582.15 Mcycles
0.0401 seconds

34098789171626307623Argon2i 3 iterations 2 MiB 1 threads: 5293.53 cpb 10587.05 Mcycles
41112087152325301561Argon2d 3 iterations 2 MiB 1 threads: 5292.41 cpb 10584.83 Mcycles
5152598753020665429Argon2id 3 iterations 2 MiB 1 threads: 5290.67 cpb 10581.34 Mcycles
0.0193 seconds

12106284303712976859Argon2i 3 iterations 2 MiB 2 threads: 5289.21 cpb 10578.43 Mcycles
1902890366109077386Argon2d 3 iterations 2 MiB 2 threads: 5288.64 cpb 10577.29 Mcycles
2593963985801686024Argon2id 3 iterations 2 MiB 2 threads: 5289.38 cpb 10578.75 Mcycles
0.0248 seconds

32865988571490376646Argon2i 3 iterations 2 MiB 4 threads: 5287.49 cpb 10574.99 Mcycles
39752915512178772514Argon2d 3 iterations 2 MiB 4 threads: 5287.35 cpb 10574.71 Mcycles
3687074192867256577Argon2id 3 iterations 2 MiB 4 threads: 5287.40 cpb 10574.80 Mcycles
0.0308 seconds

10572264843566742997Argon2i 3 iterations 2 MiB 8 threads: 5292.63 cpb 10585.26 Mcycles
17567059124266120428Argon2d 3 iterations 2 MiB 8 threads: 5292.58 cpb 10585.16 Mcycles
2456075227669645387Argon2id 3 iterations 2 MiB 8 threads: 5292.16 cpb 10584.33 Mcycles
0.0521 seconds

31545606251397912479Argon2i 3 iterations 4 MiB 1 threads: 2653.18 cpb 10612.73 Mcycles
38828398532114725197Argon2d 3 iterations 4 MiB 1 threads: 2650.45 cpb 10601.79 Mcycles
3047237062830488385Argon2id 3 iterations 4 MiB 1 threads: 2650.19 cpb 10600.76 Mcycles
0.0368 seconds

10204770293535003798Argon2i 3 iterations 4 MiB 2 threads: 2647.51 cpb 10590.04 Mcycles
17249317724238457938Argon2d 3 iterations 4 MiB 2 threads: 2647.27 cpb 10589.09 Mcycles
2428403684647788185Argon2id 3 iterations 4 MiB 2 threads: 2647.47 cpb 10589.87 Mcycles
0.0431 seconds

31327130301342861294Argon2i 3 iterations 4 MiB 4 threads: 2645.27 cpb 10581.06 Mcycles
38277567492037602970Argon2d 3 iterations 4 MiB 4 threads: 2645.19 cpb 10580.78 Mcycles
2276116172732562416Argon2id 3 iterations 4 MiB 4 threads: 2645.23 cpb 10580.91 Mcycles
0.0495 seconds

9225480733436003113Argon2i 3 iterations 4 MiB 8 threads: 2647.25 cpb 10589.02 Mcycles
16259668664138962378Argon2d 3 iterations 4 MiB 8 threads: 2647.14 cpb 10588.58 Mcycles
2328880090548145081Argon2id 3 iterations 4 MiB 8 threads: 2647.44 cpb 10589.76 Mcycles
0.0744 seconds

30330952961289279353Argon2i 3 iterations 8 MiB 1 threads: 1328.12 cpb 10624.97 Mcycles
37742017952023717309Argon2d 3 iterations 8 MiB 1 threads: 1327.33 cpb 10618.61 Mcycles
2136636212757009041Argon2id 3 iterations 8 MiB 1 threads: 1327.19 cpb 10617.52 Mcycles
0.0498 seconds

9469966913485517122Argon2i 3 iterations 8 MiB 2 threads: 1326.61 cpb 10612.92 Mcycles
16754634484211165330Argon2d 3 iterations 8 MiB 2 threads: 1326.28 cpb 10610.23 Mcycles
2401273148642647180Argon2id 3 iterations 8 MiB 2 threads: 1326.35 cpb 10610.84 Mcycles
0.0771 seconds

31275726791350457380Argon2i 3 iterations 8 MiB 4 threads: 1324.15 cpb 10593.21 Mcycles
38353773162057823887Argon2d 3 iterations 8 MiB 4 threads: 1324.10 cpb 10592.79 Mcycles
2478461132765912450Argon2id 3 iterations 8 MiB 4 threads: 1324.18 cpb 10593.42 Mcycles
0.0866 seconds

9558952213482208994Argon2i 3 iterations 8 MiB 8 threads: 1325.16 cpb 10601.28 Mcycles
16721510404198952967Argon2d 3 iterations 8 MiB 8 threads: 1325.22 cpb 10601.75 Mcycles
2388942706619242490Argon2id 3 iterations 8 MiB 8 threads: 1325.04 cpb 10600.28 Mcycles
0.1194 seconds

31042168541427536599Argon2i 3 iterations 16 MiB 1 threads: 668.06 cpb 10688.99 Mcycles
39124498882214906246Argon2d 3 iterations 16 MiB 1 threads: 666.82 cpb 10669.10 Mcycles
4048743402995991599Argon2id 3 iterations 16 MiB 1 threads: 666.44 cpb 10663.08 Mcycles
0.0951 seconds

11859953773762067761Argon2i 3 iterations 16 MiB 2 threads: 665.54 cpb 10648.73 Mcycles
1952051050242384679Argon2d 3 iterations 16 MiB 2 threads: 666.10 cpb 10657.54 Mcycles
27273213391008354924Argon2id 3 iterations 16 MiB 2 threads: 665.54 cpb 10648.67 Mcycles
0.1325 seconds

34933301581740323590Argon2i 3 iterations 16 MiB 4 threads: 663.51 cpb 10616.20 Mcycles
42252147552471078403Argon2d 3 iterations 16 MiB 4 threads: 663.45 cpb 10615.13 Mcycles
6610278363201576241Argon2id 3 iterations 16 MiB 4 threads: 663.43 cpb 10614.86 Mcycles
0.1505 seconds

13915654333937508899Argon2i 3 iterations 16 MiB 8 threads: 663.75 cpb 10620.00 Mcycles
2127460927377542831Argon2d 3 iterations 16 MiB 8 threads: 663.70 cpb 10619.15 Mcycles
28623010321111202089Argon2id 3 iterations 16 MiB 8 threads: 663.63 cpb 10618.02 Mcycles
0.1908 seconds

35961333192018513362Argon2i 3 iterations 32 MiB 1 threads: 336.98 cpb 10783.46 Mcycles
2084438022925688382Argon2d 3 iterations 32 MiB 1 threads: 336.98 cpb 10783.37 Mcycles
11156363583834542778Argon2id 3 iterations 32 MiB 1 threads: 337.03 cpb 10784.95 Mcycles
0.1887 seconds

2024511488406553503Argon2i 3 iterations 32 MiB 2 threads: 335.78 cpb 10745.00 Mcycles
28914570101263429122Argon2d 3 iterations 32 MiB 2 threads: 335.48 cpb 10735.39 Mcycles
37483569582110660894Argon2id 3 iterations 32 MiB 2 threads: 335.19 cpb 10726.17 Mcycles
0.2657 seconds

3006611252891444371Argon2i 3 iterations 32 MiB 4 threads: 333.21 cpb 10662.76 Mcycles
10814032083673982863Argon2d 3 iterations 32 MiB 4 threads: 333.26 cpb 10664.48 Mcycles
1863928371162819675Argon2id 3 iterations 32 MiB 4 threads: 333.30 cpb 10665.70 Mcycles
0.2917 seconds

2647894462945096885Argon2i 3 iterations 32 MiB 8 threads: 333.25 cpb 10664.09 Mcycles
34300370941726852693Argon2d 3 iterations 32 MiB 8 threads: 333.24 cpb 10663.72 Mcycles
42118400902507645479Argon2id 3 iterations 32 MiB 8 threads: 333.21 cpb 10662.75 Mcycles
0.3958 seconds

6975984133638080775Argon2i 3 iterations 64 MiB 1 threads: 171.82 cpb 10996.26 Mcycles
1828026929462843471Argon2d 3 iterations 64 MiB 1 threads: 171.66 cpb 10986.06 Mcycles
29477903061591876255Argon2id 3 iterations 64 MiB 1 threads: 171.79 cpb 10994.90 Mcycles
0.3656 seconds

40768340232642930179Argon2i 3 iterations 64 MiB 2 threads: 170.63 cpb 10920.52 Mcycles
8329564143653322544Argon2d 3 iterations 64 MiB 2 threads: 170.03 cpb 10881.71 Mcycles
1843354044359643162Argon2id 3 iterations 64 MiB 2 threads: 169.89 cpb 10873.02 Mcycles
0.4956 seconds

28446343191230071866Argon2i 3 iterations 64 MiB 4 threads: 167.94 cpb 10748.23 Mcycles
37150261572105910736Argon2d 3 iterations 64 MiB 4 threads: 168.02 cpb 10753.43 Mcycles
2958699032978678975Argon2id 3 iterations 64 MiB 4 threads: 167.98 cpb 10750.53 Mcycles
0.5300 seconds

11686546543836486008Argon2i 3 iterations 64 MiB 8 threads: 167.75 cpb 10736.24 Mcycles
2026438434392927139Argon2d 3 iterations 64 MiB 8 threads: 167.66 cpb 10730.16 Mcycles
28778995791260930582Argon2id 3 iterations 64 MiB 8 threads: 167.91 cpb 10745.94 Mcycles
0.7020 seconds

Here are the results of the new code change:

Argon2i 3 iterations 1 MiB 1 threads: 5.53 cpb 5.53 Mcycles
Argon2d 3 iterations 1 MiB 1 threads: 5.14 cpb 5.15 Mcycles
Argon2id 3 iterations 1 MiB 1 threads: 4.63 cpb 4.63 Mcycles
0.0049 seconds

Argon2i 3 iterations 1 MiB 2 threads: 3.57 cpb 3.57 Mcycles
Argon2d 3 iterations 1 MiB 2 threads: 3.23 cpb 3.23 Mcycles
Argon2id 3 iterations 1 MiB 2 threads: 3.29 cpb 3.30 Mcycles
0.0035 seconds

Argon2i 3 iterations 1 MiB 4 threads: 2.62 cpb 2.62 Mcycles
Argon2d 3 iterations 1 MiB 4 threads: 2.53 cpb 2.53 Mcycles
Argon2id 3 iterations 1 MiB 4 threads: 2.59 cpb 2.59 Mcycles
0.0027 seconds

Argon2i 3 iterations 1 MiB 8 threads: 4.20 cpb 4.20 Mcycles
Argon2d 3 iterations 1 MiB 8 threads: 4.14 cpb 4.14 Mcycles
Argon2id 3 iterations 1 MiB 8 threads: 4.41 cpb 4.41 Mcycles
0.0046 seconds

Argon2i 3 iterations 2 MiB 1 threads: 5.43 cpb 10.86 Mcycles
Argon2d 3 iterations 2 MiB 1 threads: 5.20 cpb 10.40 Mcycles
Argon2id 3 iterations 2 MiB 1 threads: 4.67 cpb 9.33 Mcycles
0.0098 seconds

Argon2i 3 iterations 2 MiB 2 threads: 2.93 cpb 5.85 Mcycles
Argon2d 3 iterations 2 MiB 2 threads: 2.84 cpb 5.69 Mcycles
Argon2id 3 iterations 2 MiB 2 threads: 2.86 cpb 5.72 Mcycles
0.0060 seconds

Argon2i 3 iterations 2 MiB 4 threads: 1.96 cpb 3.91 Mcycles
Argon2d 3 iterations 2 MiB 4 threads: 1.94 cpb 3.89 Mcycles
Argon2id 3 iterations 2 MiB 4 threads: 1.95 cpb 3.90 Mcycles
0.0041 seconds

Argon2i 3 iterations 2 MiB 8 threads: 2.56 cpb 5.12 Mcycles
Argon2d 3 iterations 2 MiB 8 threads: 2.51 cpb 5.01 Mcycles
Argon2id 3 iterations 2 MiB 8 threads: 2.53 cpb 5.06 Mcycles
0.0053 seconds

Argon2i 3 iterations 4 MiB 1 threads: 5.52 cpb 22.10 Mcycles
Argon2d 3 iterations 4 MiB 1 threads: 5.00 cpb 19.98 Mcycles
Argon2id 3 iterations 4 MiB 1 threads: 4.70 cpb 18.79 Mcycles
0.0197 seconds

Argon2i 3 iterations 4 MiB 2 threads: 2.78 cpb 11.11 Mcycles
Argon2d 3 iterations 4 MiB 2 threads: 2.68 cpb 10.74 Mcycles
Argon2id 3 iterations 4 MiB 2 threads: 2.70 cpb 10.79 Mcycles
0.0113 seconds

Argon2i 3 iterations 4 MiB 4 threads: 1.66 cpb 6.63 Mcycles
Argon2d 3 iterations 4 MiB 4 threads: 1.64 cpb 6.56 Mcycles
Argon2id 3 iterations 4 MiB 4 threads: 1.65 cpb 6.61 Mcycles
0.0069 seconds

Argon2i 3 iterations 4 MiB 8 threads: 2.37 cpb 9.47 Mcycles
Argon2d 3 iterations 4 MiB 8 threads: 2.24 cpb 8.95 Mcycles
Argon2id 3 iterations 4 MiB 8 threads: 1.89 cpb 7.57 Mcycles
0.0079 seconds

Argon2i 3 iterations 8 MiB 1 threads: 5.78 cpb 46.22 Mcycles
Argon2d 3 iterations 8 MiB 1 threads: 5.29 cpb 42.36 Mcycles
Argon2id 3 iterations 8 MiB 1 threads: 4.89 cpb 39.12 Mcycles
0.0410 seconds

Argon2i 3 iterations 8 MiB 2 threads: 2.70 cpb 21.64 Mcycles
Argon2d 3 iterations 8 MiB 2 threads: 2.67 cpb 21.32 Mcycles
Argon2id 3 iterations 8 MiB 2 threads: 0.00 cpb -932.22 Mcycles
-0.9775 seconds

Argon2i 3 iterations 8 MiB 4 threads: 1.53 cpb 12.27 Mcycles
Argon2d 3 iterations 8 MiB 4 threads: 1.52 cpb 12.14 Mcycles
Argon2id 3 iterations 8 MiB 4 threads: 1.52 cpb 12.14 Mcycles
0.0127 seconds

Argon2i 3 iterations 8 MiB 8 threads: 1.84 cpb 14.72 Mcycles
Argon2d 3 iterations 8 MiB 8 threads: 1.77 cpb 14.19 Mcycles
Argon2id 3 iterations 8 MiB 8 threads: 1.74 cpb 13.91 Mcycles
0.0146 seconds

Argon2i 3 iterations 16 MiB 1 threads: 5.97 cpb 95.55 Mcycles
Argon2d 3 iterations 16 MiB 1 threads: 5.50 cpb 88.01 Mcycles
Argon2id 3 iterations 16 MiB 1 threads: 5.21 cpb 83.43 Mcycles
0.0875 seconds

Argon2i 3 iterations 16 MiB 2 threads: 2.87 cpb 45.87 Mcycles
Argon2d 3 iterations 16 MiB 2 threads: 2.83 cpb 45.24 Mcycles
Argon2id 3 iterations 16 MiB 2 threads: 2.84 cpb 45.39 Mcycles
0.0476 seconds

Argon2i 3 iterations 16 MiB 4 threads: 1.58 cpb 25.29 Mcycles
Argon2d 3 iterations 16 MiB 4 threads: 1.56 cpb 24.91 Mcycles
Argon2id 3 iterations 16 MiB 4 threads: 1.56 cpb 24.98 Mcycles
0.0262 seconds

Argon2i 3 iterations 16 MiB 8 threads: 1.78 cpb 28.54 Mcycles
Argon2d 3 iterations 16 MiB 8 threads: 1.78 cpb 28.55 Mcycles
Argon2id 3 iterations 16 MiB 8 threads: 1.77 cpb 28.28 Mcycles
0.0297 seconds

Argon2i 3 iterations 32 MiB 1 threads: 6.18 cpb 197.69 Mcycles
Argon2d 3 iterations 32 MiB 1 threads: 0.00 cpb -758.56 Mcycles
Argon2id 3 iterations 32 MiB 1 threads: 6.12 cpb 195.79 Mcycles
0.2053 seconds

Argon2i 3 iterations 32 MiB 2 threads: 3.38 cpb 108.24 Mcycles
Argon2d 3 iterations 32 MiB 2 threads: 3.34 cpb 106.87 Mcycles
Argon2id 3 iterations 32 MiB 2 threads: 3.36 cpb 107.44 Mcycles
0.1127 seconds

Argon2i 3 iterations 32 MiB 4 threads: 1.92 cpb 61.53 Mcycles
Argon2d 3 iterations 32 MiB 4 threads: 1.89 cpb 60.38 Mcycles
Argon2id 3 iterations 32 MiB 4 threads: 1.89 cpb 60.60 Mcycles
0.0635 seconds

Argon2i 3 iterations 32 MiB 8 threads: 1.85 cpb 59.29 Mcycles
Argon2d 3 iterations 32 MiB 8 threads: 1.96 cpb 62.65 Mcycles
Argon2id 3 iterations 32 MiB 8 threads: 0.00 cpb -893.19 Mcycles
-0.9366 seconds

Argon2i 3 iterations 64 MiB 1 threads: 6.29 cpb 402.50 Mcycles
Argon2d 3 iterations 64 MiB 1 threads: 6.22 cpb 397.86 Mcycles
Argon2id 3 iterations 64 MiB 1 threads: 0.00 cpb -554.57 Mcycles
-0.5815 seconds

Argon2i 3 iterations 64 MiB 2 threads: 3.45 cpb 220.73 Mcycles
Argon2d 3 iterations 64 MiB 2 threads: 3.41 cpb 218.22 Mcycles
Argon2id 3 iterations 64 MiB 2 threads: 3.42 cpb 218.81 Mcycles
0.2294 seconds

Argon2i 3 iterations 64 MiB 4 threads: 0.00 cpb -830.95 Mcycles
Argon2d 3 iterations 64 MiB 4 threads: 1.90 cpb 121.72 Mcycles
Argon2id 3 iterations 64 MiB 4 threads: 1.90 cpb 121.88 Mcycles
0.1278 seconds

Argon2i 3 iterations 64 MiB 8 threads: 1.93 cpb 123.78 Mcycles
Argon2d 3 iterations 64 MiB 8 threads: 1.97 cpb 126.37 Mcycles
Argon2id 3 iterations 64 MiB 8 threads: 1.81 cpb 115.84 Mcycles
0.1215 seconds

The AArch64 program (my code changes) appears to run extremely fast, and some runs report a negative time. A negative time value does not make sense, as time is always running and moving forward, so the calculation needs another look. The original program (x86_64 only) had a noticeable delay before outputting the results. The difference can also be seen in the Mcycles (millions of cycles) and cpb (cycles per byte) columns, although in my version those columns are calculated from the clock reading instead of a cycle count, so the two sets of numbers cannot be compared directly.

(I will continue the testing in Project: Part3, Progress 3)
