Extension: ExBLAS
#include "dg/algorithm.h" (or as a standalone library as "dg/exblas/exblas.h")
Loading...
Searching...
No Matches
dg::exblas Namespace Reference

This is the namespace for all functions and classes defined and used in the exblas library. More...

Namespaces

namespace  cpu
 cpu versions of the primitive functions
 
namespace  gpu
 gpu (CUDA) versions of primitive functions
 

Classes

union  udouble
 Utility union to display all bits of a double (using type-punning) More...
 
union  ufloat
 Utility union to display all bits of a float (using type-punning) More...
 

Functions

template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 3>
__host__ void exdot_gpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t *d_superacc, int *status)
 GPU version of exact dot product.
 
template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 3>
__host__ void exdot_gpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t *d_superacc, int *status)
 GPU version of exact dot product.
 
template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8>
void exdot_omp (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t *h_superacc, int *status)
 OpenMP parallel version of exact triple dot product.
 
template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8>
void exdot_omp (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t *h_superacc, int *status)
 OpenMP parallel version of exact triple dot product.
 
template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8>
void exdot_cpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t *h_superacc, int *status)
 Serial version of exact dot product.
 
template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8>
void exdot_cpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t *h_superacc, int *status)
 Serial version of exact dot product.
 
template<class T , size_t N, class Functor , class ... PointerOrValues>
__host__ void fpedot_gpu (int *status, unsigned size, T *fpe, Functor f, PointerOrValues ...xs_ptr)
 GPU version of fpe generalized dot product.
 
template<class T , size_t N, class Functor , class ... PointerOrValues>
void fpedot_omp (int *status, unsigned size, std::array< T, N > &fpe, Functor f, PointerOrValues ...xs_ptr)
 OpenMP version of fpe generalized dot product.
 
template<class T , size_t N, class Functor , class ... PointerOrValues>
void fpedot_cpu (int *status, unsigned size, std::array< T, N > &fpe, Functor f, PointerOrValues ...xs_ptr)
 serial version of extended precision general dot product
 
void mpi_reduce_communicator (MPI_Comm comm, MPI_Comm *comm_mod, MPI_Comm *comm_mod_reduce)
 This function can be used to partition communicators for the exblas::reduce_mpi_cpu function.
 
void reduce_mpi_cpu (unsigned num_superacc, int64_t *in, int64_t *out, MPI_Comm comm, MPI_Comm comm_mod, MPI_Comm comm_mod_reduce)
 reduce a number of superaccumulators distributed among mpi processes
 

Detailed Description

This is the namespace for all functions and classes defined and used in the exblas library.

In principle you can use this as a standalone library but it is much easier to just use the dg::blas1::dot and dg::blas2::dot functions for general purpose usage

Function Documentation

◆ exdot_cpu() [1/2]

template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8>
void dg::exblas::exdot_cpu ( unsigned size,
PointerOrValue1 x1_ptr,
PointerOrValue2 x2_ptr,
int64_t * h_superacc,
int * status )

Serial version of exact dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note
The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version
Attention
the product \( x_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)
Note
The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.
Template Parameters
NBFPEsize of the floating point expansion (should be between 3 and 8)
PointerOrValuemust be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
sizesize N of the arrays to sum
x1_ptrfirst array
x2_ptrsecond array
Parameters
h_superaccpointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_cpu() [2/2]

template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8>
void dg::exblas::exdot_cpu ( unsigned size,
PointerOrValue1 x1_ptr,
PointerOrValue2 x2_ptr,
PointerOrValue3 x3_ptr,
int64_t * h_superacc,
int * status )

Serial version of exact dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i w_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note
The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version
Attention
the product \( x_iw_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)
Note
The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.
Template Parameters
NBFPEsize of the floating point expansion (should be between 3 and 8)
PointerOrValuemust be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
sizesize N of the arrays to sum
x1_ptrfirst array
x2_ptrsecond array
x3_ptrthird array
Parameters
h_superaccpointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_gpu() [1/2]

template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 3>
__host__ void dg::exblas::exdot_gpu ( unsigned size,
PointerOrValue1 x1_ptr,
PointerOrValue2 x2_ptr,
int64_t * d_superacc,
int * status )

GPU version of exact dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note
The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version
Attention
the product \( x_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)
Note
The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.
Template Parameters
NBFPEsize of the floating point expansion (should be between 3 and 8)
PointerOrValuemust be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
sizesize N of the arrays to sum
x1_ptrfirst array
x2_ptrsecond array
Parameters
d_superaccpointer to an array of 64 bit integegers (the superaccumulator) in device memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_gpu() [2/2]

template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 3>
__host__ void dg::exblas::exdot_gpu ( unsigned size,
PointerOrValue1 x1_ptr,
PointerOrValue2 x2_ptr,
PointerOrValue3 x3_ptr,
int64_t * d_superacc,
int * status )

GPU version of exact dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i w_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note
The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version
Attention
the product \( x_iw_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)
Note
The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.
Template Parameters
NBFPEsize of the floating point expansion (should be between 3 and 8)
PointerOrValuemust be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
sizesize N of the arrays to sum
x1_ptrfirst array
x2_ptrsecond array
x3_ptrthird array
Parameters
d_superaccpointer to an array of 64 bit integegers (the superaccumulator) in device memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_omp() [1/2]

template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8>
void dg::exblas::exdot_omp ( unsigned size,
PointerOrValue1 x1_ptr,
PointerOrValue2 x2_ptr,
int64_t * h_superacc,
int * status )

OpenMP parallel version of exact triple dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note
The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version
Attention
the product \( x_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)
Note
The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.
Template Parameters
NBFPEsize of the floating point expansion (should be between 3 and 8)
PointerOrValuemust be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
sizesize N of the arrays to sum
x1_ptrfirst array
x2_ptrsecond array
Parameters
h_superaccpointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_omp() [2/2]

template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8>
void dg::exblas::exdot_omp ( unsigned size,
PointerOrValue1 x1_ptr,
PointerOrValue2 x2_ptr,
PointerOrValue3 x3_ptr,
int64_t * h_superacc,
int * status )

OpenMP parallel version of exact triple dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i w_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note
The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version
Attention
the product \( x_iw_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)
Note
The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.
Template Parameters
NBFPEsize of the floating point expansion (should be between 3 and 8)
PointerOrValuemust be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
sizesize N of the arrays to sum
x1_ptrfirst array
x2_ptrsecond array
x3_ptrthird array
Parameters
h_superaccpointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status0 indicates success, 1 indicates an input value was NaN or Inf

◆ fpedot_cpu()

template<class T , size_t N, class Functor , class ... PointerOrValues>
void dg::exblas::fpedot_cpu ( int * status,
unsigned size,
std::array< T, N > & fpe,
Functor f,
PointerOrValues ... xs_ptr )

serial version of extended precision general dot product

Computes the extended precision reduction

\[ \sum_{i=0}^{N-1} f(x_{0i}, x_{1i}, ... )\]

using floating point expansions

Template Parameters
Tthe return type of Functor.
Nsize of the floating point expansion (should be between 3 and 8)
Functora Functor
PointerOrValuesmust be one of T, T&&, T&, const T&, T* or const T* , where T is a scalar type. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
status0 indicates success, 2 means the FPE overflowed
sizesize of the arrays to sum
fpethe FPE holding the result (write-only)
fthe functor
xs_ptrthe input arrays to sum
See also
exblas::cpu::Round to convert the FPE into a double precision number

◆ fpedot_gpu()

template<class T , size_t N, class Functor , class ... PointerOrValues>
__host__ void dg::exblas::fpedot_gpu ( int * status,
unsigned size,
T * fpe,
Functor f,
PointerOrValues ... xs_ptr )

GPU version of fpe generalized dot product.

Computes the extended precision reduction

\[ \sum_{i=0}^{N-1} f(x_{0i}, x_{1i}, ... )\]

using floating point expansions

Template Parameters
Tthe return type of Functor.
Nsize of the floating point expansion (should be between 3 and 8)
Functora Functor
PointerOrValuesmust be one of T, T&&, T&, const T&, T* or const T* , where T is a scalar type. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
status0 indicates success, 2 means the FPE overflowed
sizesize of the arrays to sum
fpethe FPE holding the result (write-only)
fthe functor
xs_ptrthe input arrays to sum
See also
exblas::cpu::Round to convert the FPE into a double precision number

◆ fpedot_omp()

template<class T , size_t N, class Functor , class ... PointerOrValues>
void dg::exblas::fpedot_omp ( int * status,
unsigned size,
std::array< T, N > & fpe,
Functor f,
PointerOrValues ... xs_ptr )

OpenMP version of fpe generalized dot product.

Computes the extended precision reduction

\[ \sum_{i=0}^{N-1} f(x_{0i}, x_{1i}, ... )\]

using floating point expansions

Template Parameters
Tthe return type of Functor.
Nsize of the floating point expansion (should be between 3 and 8)
Functora Functor
PointerOrValuesmust be one of T, T&&, T&, const T&, T* or const T* , where T is a scalar type. If it is a pointer type, then we iterate through the pointed data from 0 to size, else we consider the value constant in every iteration.
Parameters
status0 indicates success, 2 means the FPE overflowed
sizesize of the arrays to sum
fpethe FPE holding the result (write-only)
fthe functor
xs_ptrthe input arrays to sum
See also
exblas::cpu::Round to convert the FPE into a double precision number

◆ mpi_reduce_communicator()

void dg::exblas::mpi_reduce_communicator ( MPI_Comm comm,
MPI_Comm * comm_mod,
MPI_Comm * comm_mod_reduce )
inline

This function can be used to partition communicators for the exblas::reduce_mpi_cpu function.

If the ranks in comm are aligned in rows of 128 (or any number <= 256) comm_mod is the 1d communicator for the rows and comm_mod_reduce is the communicator for the columns. The last row may contain fewer than 128 ranks and comm_mod may contain 1 rank fewer than the others in the last columns

Parameters
commthe input communicator (unmodified, may not be MPI_COMM_NULL)
comm_modthe line communicator in comm, consists of all rank/mod ranks
comm_mod_reducethe column communicator in comm consists of all rankmod ranks
Attention
The creation of new communicators involves communication between all participation processes (comm in this case). In order to avoid excessive creation of new MPI communicators (there is a limit to how many a program can create), the function keeps a (static) record of which communicators it has been called with. If you repeatedly call this function with the same comm only the first call will actually create new communicators.

◆ reduce_mpi_cpu()

void dg::exblas::reduce_mpi_cpu ( unsigned num_superacc,
int64_t * in,
int64_t * out,
MPI_Comm comm,
MPI_Comm comm_mod,
MPI_Comm comm_mod_reduce )
inline

reduce a number of superaccumulators distributed among mpi processes

We cannot sum more than 256 accumulators before we need to normalize again, so we need to split the reduction into several steps if more than 256 processes are involved. This function normalizes, reduces, normalizes, reduces and broadcasts the result to all participating processes. As usual the resulting superaccumulator is unnormalized.

Parameters
num_superaccnumber of Superaccumulators eaach process holds
inunnormalized input superaccumulators ( must be of size num_superacc*exblas::BIN_COUNT, allocated on the cpu) (read/write, undefined on out)
outeach process contains the result on output( must be of size num_superacc*exblas::BIN_COUNT, allocated on the cpu) (write, may not alias in)
commThe complete MPI communicator
comm_modThis is the line communicator of up to 128 ranks ( or any other number <256)
comm_mod_reduceThis is the column communicator consisting of all rank 0, 1, 2, ...,127 processes in all comm_mod
See also
exblas::mpi_reduce_communicator to generate the required communicators