Extension: ExBLAS
#include "dg/algorithm.h" (or as a standalone library as "dg/exblas/exblas.h")
dg::exblas Namespace Reference

This is the namespace for all functions and classes defined and used in the exblas library. More...

Namespaces

namespace  cpu
 cpu versions of the primitive functions
 
namespace  gpu
 gpu (CUDA) versions of primitive functions
 

Classes

union  udouble
 Utility union to display all bits of a double (using type-punning) More...
 

Functions

template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 3>
__host__ void exdot_gpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t *d_superacc, int *status)
 GPU version of exact dot product. More...
 
template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 3>
__host__ void exdot_gpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t *d_superacc, int *status)
 GPU version of exact dot product. More...
 
template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8>
void exdot_omp (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t *h_superacc, int *status)
 OpenMP parallel version of exact triple dot product. More...
 
template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8>
void exdot_omp (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t *h_superacc, int *status)
 OpenMP parallel version of exact triple dot product. More...
 
template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8>
void exdot_cpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t *h_superacc, int *status)
 Serial version of exact dot product. More...
 
template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8>
void exdot_cpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t *h_superacc, int *status)
 Serial version of exact dot product. More...
 
static void mpi_reduce_communicator (MPI_Comm comm, MPI_Comm *comm_mod, MPI_Comm *comm_mod_reduce)
 This function can be used to partition communicators for the exblas::reduce_mpi_cpu function. More...
 
static void reduce_mpi_cpu (unsigned num_superacc, int64_t *in, int64_t *out, MPI_Comm comm, MPI_Comm comm_mod, MPI_Comm comm_mod_reduce)
 reduce a number of superaccumulators distributed among mpi processes More...
 

Detailed Description

This is the namespace for all functions and classes defined and used in the exblas library.

In principle you can use this as a standalone library but it is much easier to just use the dg::blas1::dot and dg::blas2::dot functions for general purpose usage

Function Documentation

◆ exdot_cpu() [1/2]

template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8>
void dg::exblas::exdot_cpu ( unsigned  size,
PointerOrValue1  x1_ptr,
PointerOrValue2  x2_ptr,
int64_t *  h_superacc,
int *  status 
)

Serial version of exact dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note
The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version
Attention
the product \( x_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)
Note
The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.
Template Parameters
NBFPEsize of the floating point expansion (should be between 3 and 8)
PointerOrValuemust be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
sizesize N of the arrays to sum
x1_ptrfirst array
x2_ptrsecond array
h_superaccpointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_cpu() [2/2]

template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8>
void dg::exblas::exdot_cpu ( unsigned  size,
PointerOrValue1  x1_ptr,
PointerOrValue2  x2_ptr,
PointerOrValue3  x3_ptr,
int64_t *  h_superacc,
int *  status 
)

Serial version of exact dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i w_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note
The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version
Attention
the product \( x_iw_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)
Note
The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.
Template Parameters
NBFPEsize of the floating point expansion (should be between 3 and 8)
PointerOrValuemust be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
sizesize N of the arrays to sum
x1_ptrfirst array
x2_ptrsecond array
x3_ptrthird array
h_superaccpointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_gpu() [1/2]

template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 3>
__host__ void dg::exblas::exdot_gpu ( unsigned  size,
PointerOrValue1  x1_ptr,
PointerOrValue2  x2_ptr,
int64_t *  d_superacc,
int *  status 
)

GPU version of exact dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note
The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version
Attention
the product \( x_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)
Note
The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.
Template Parameters
NBFPEsize of the floating point expansion (should be between 3 and 8)
PointerOrValuemust be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
sizesize N of the arrays to sum
x1_ptrfirst array
x2_ptrsecond array
d_superaccpointer to an array of 64 bit integegers (the superaccumulator) in device memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_gpu() [2/2]

template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 3>
__host__ void dg::exblas::exdot_gpu ( unsigned  size,
PointerOrValue1  x1_ptr,
PointerOrValue2  x2_ptr,
PointerOrValue3  x3_ptr,
int64_t *  d_superacc,
int *  status 
)

GPU version of exact dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i w_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note
The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version
Attention
the product \( x_iw_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)
Note
The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.
Template Parameters
NBFPEsize of the floating point expansion (should be between 3 and 8)
PointerOrValuemust be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
sizesize N of the arrays to sum
x1_ptrfirst array
x2_ptrsecond array
x3_ptrthird array
d_superaccpointer to an array of 64 bit integegers (the superaccumulator) in device memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_omp() [1/2]

template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8>
void dg::exblas::exdot_omp ( unsigned  size,
PointerOrValue1  x1_ptr,
PointerOrValue2  x2_ptr,
int64_t *  h_superacc,
int *  status 
)

OpenMP parallel version of exact triple dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note
The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version
Attention
the product \( x_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)
Note
The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.
Template Parameters
NBFPEsize of the floating point expansion (should be between 3 and 8)
PointerOrValuemust be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
sizesize N of the arrays to sum
x1_ptrfirst array
x2_ptrsecond array
h_superaccpointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_omp() [2/2]

template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8>
void dg::exblas::exdot_omp ( unsigned  size,
PointerOrValue1  x1_ptr,
PointerOrValue2  x2_ptr,
PointerOrValue3  x3_ptr,
int64_t *  h_superacc,
int *  status 
)

OpenMP parallel version of exact triple dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i w_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note
The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version
Attention
the product \( x_iw_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)
Note
The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.
Template Parameters
NBFPEsize of the floating point expansion (should be between 3 and 8)
PointerOrValuemust be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
sizesize N of the arrays to sum
x1_ptrfirst array
x2_ptrsecond array
x3_ptrthird array
h_superaccpointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status0 indicates success, 1 indicates an input value was NaN or Inf

◆ mpi_reduce_communicator()

static void dg::exblas::mpi_reduce_communicator ( MPI_Comm  comm,
MPI_Comm *  comm_mod,
MPI_Comm *  comm_mod_reduce 
)
static

This function can be used to partition communicators for the exblas::reduce_mpi_cpu function.

Parameters
commthe input communicator (unmodified, may not be MPI_COMM_NULL)
comm_moda subgroup of comm (comm is split)
comm_mod_reducea subgroup of comm, consists of all rank 0 processes in comm_mod
Note
the creation of new communicators involves communication between all participation processes (comm in this case).
Attention
In order to avoid excessive creation of new MPI communicators (there is a limit to how many a program can create), the function keeps record of which communicators it has been called with. If you repeatedly call this function with the same comm only the first call will actually create new communicators.

◆ reduce_mpi_cpu()

static void dg::exblas::reduce_mpi_cpu ( unsigned  num_superacc,
int64_t *  in,
int64_t *  out,
MPI_Comm  comm,
MPI_Comm  comm_mod,
MPI_Comm  comm_mod_reduce 
)
static

reduce a number of superaccumulators distributed among mpi processes

We cannot sum more than 256 accumulators before we need to normalize again, so we need to split the reduction into several steps if more than 256 processes are involved. This function normalizes, reduces, normalizes, reduces and broadcasts the result to all participating processes. As usual the resulting superaccumulator is unnormalized.

Parameters
num_superaccnumber of Superaccumulators eaach process holds
inunnormalized input superaccumulators ( must be of size num_superacc*exblas::BIN_COUNT, allocated on the cpu) (read/write, undefined on out)
outeach process contains the result on output( must be of size num_superacc*exblas::BIN_COUNT, allocated on the cpu) (write, may not alias in)
commThe complete MPI communicator
comm_modThis is comm modulo 128 ( or any other number <256)
comm_mod_reduceThis is the communicator consisting of all rank 0 processes in comm_mod, may be MPI_COMM_NULL
See also
exblas::mpi_reduce_communicator to generate the required communicators