Extension: ExBLAS
#include "dg/algorithm.h" (or as a standalone library as "dg/exblas/exblas.h")
|
This is the namespace for all functions and classes defined and used in the exblas library. More...
Namespaces | |
namespace | cpu |
cpu versions of the primitive functions | |
namespace | gpu |
gpu (CUDA) versions of primitive functions | |
Classes | |
union | udouble |
Utility union to display all bits of a double (using type-punning) More... | |
union | ufloat |
Utility union to display all bits of a float (using type-punning) More... | |
Functions | |
template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 3> | |
__host__ void | exdot_gpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t *d_superacc, int *status) |
GPU version of exact dot product. | |
template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 3> | |
__host__ void | exdot_gpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t *d_superacc, int *status) |
GPU version of exact dot product. | |
template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8> | |
void | exdot_omp (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t *h_superacc, int *status) |
OpenMP parallel version of exact triple dot product. | |
template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8> | |
void | exdot_omp (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t *h_superacc, int *status) |
OpenMP parallel version of exact triple dot product. | |
template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8> | |
void | exdot_cpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t *h_superacc, int *status) |
Serial version of exact dot product. | |
template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8> | |
void | exdot_cpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t *h_superacc, int *status) |
Serial version of exact dot product. | |
template<class T , size_t N, class Functor , class ... PointerOrValues> | |
__host__ void | fpedot_gpu (int *status, unsigned size, T *fpe, Functor f, PointerOrValues ...xs_ptr) |
GPU version of fpe generalized dot product. | |
template<class T , size_t N, class Functor , class ... PointerOrValues> | |
void | fpedot_omp (int *status, unsigned size, std::array< T, N > &fpe, Functor f, PointerOrValues ...xs_ptr) |
OpenMP version of fpe generalized dot product. | |
template<class T , size_t N, class Functor , class ... PointerOrValues> | |
void | fpedot_cpu (int *status, unsigned size, std::array< T, N > &fpe, Functor f, PointerOrValues ...xs_ptr) |
serial version of extended precision general dot product | |
void | mpi_reduce_communicator (MPI_Comm comm, MPI_Comm *comm_mod, MPI_Comm *comm_mod_reduce) |
This function can be used to partition communicators for the exblas::reduce_mpi_cpu function. | |
void | reduce_mpi_cpu (unsigned num_superacc, int64_t *in, int64_t *out, MPI_Comm comm, MPI_Comm comm_mod, MPI_Comm comm_mod_reduce) |
reduce a number of superaccumulators distributed among mpi processes | |
This is the namespace for all functions and classes defined and used in the exblas library.
In principle you can use this as a standalone library but it is much easier to just use the dg::blas1::dot
and dg::blas2::dot
functions for general purpose usage
void dg::exblas::exdot_cpu | ( | unsigned | size, |
PointerOrValue1 | x1_ptr, | ||
PointerOrValue2 | x2_ptr, | ||
int64_t * | h_superacc, | ||
int * | status ) |
Serial version of exact dot product.
Accumulate the exact sum
\[ \sum_{i=0}^{N-1} x_i y_i \]
into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT
(39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.
NBFPE | size of the floating point expansion (should be between 3 and 8) |
PointerOrValue | must be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double . If it is a pointer type, then we iterate through the pointed data from 0 to size , else we consider the value constant in every iteration. |
size | size N of the arrays to sum |
x1_ptr | first array |
x2_ptr | second array |
h_superacc | pointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function) |
status | 0 indicates success, 1 indicates an input value was NaN or Inf |
void dg::exblas::exdot_cpu | ( | unsigned | size, |
PointerOrValue1 | x1_ptr, | ||
PointerOrValue2 | x2_ptr, | ||
PointerOrValue3 | x3_ptr, | ||
int64_t * | h_superacc, | ||
int * | status ) |
Serial version of exact dot product.
Accumulate the exact sum
\[ \sum_{i=0}^{N-1} x_i w_i y_i \]
into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT
(39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.
NBFPE | size of the floating point expansion (should be between 3 and 8) |
PointerOrValue | must be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double . If it is a pointer type, then we iterate through the pointed data from 0 to size , else we consider the value constant in every iteration. |
size | size N of the arrays to sum |
x1_ptr | first array |
x2_ptr | second array |
x3_ptr | third array |
h_superacc | pointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function) |
status | 0 indicates success, 1 indicates an input value was NaN or Inf |
__host__ void dg::exblas::exdot_gpu | ( | unsigned | size, |
PointerOrValue1 | x1_ptr, | ||
PointerOrValue2 | x2_ptr, | ||
int64_t * | d_superacc, | ||
int * | status ) |
GPU version of exact dot product.
Accumulate the exact sum
\[ \sum_{i=0}^{N-1} x_i y_i \]
into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT
(39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.
NBFPE | size of the floating point expansion (should be between 3 and 8) |
PointerOrValue | must be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double . If it is a pointer type, then we iterate through the pointed data from 0 to size , else we consider the value constant in every iteration. |
size | size N of the arrays to sum |
x1_ptr | first array |
x2_ptr | second array |
d_superacc | pointer to an array of 64 bit integegers (the superaccumulator) in device memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function) |
status | 0 indicates success, 1 indicates an input value was NaN or Inf |
__host__ void dg::exblas::exdot_gpu | ( | unsigned | size, |
PointerOrValue1 | x1_ptr, | ||
PointerOrValue2 | x2_ptr, | ||
PointerOrValue3 | x3_ptr, | ||
int64_t * | d_superacc, | ||
int * | status ) |
GPU version of exact dot product.
Accumulate the exact sum
\[ \sum_{i=0}^{N-1} x_i w_i y_i \]
into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT
(39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.
NBFPE | size of the floating point expansion (should be between 3 and 8) |
PointerOrValue | must be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double . If it is a pointer type, then we iterate through the pointed data from 0 to size , else we consider the value constant in every iteration. |
size | size N of the arrays to sum |
x1_ptr | first array |
x2_ptr | second array |
x3_ptr | third array |
d_superacc | pointer to an array of 64 bit integegers (the superaccumulator) in device memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function) |
status | 0 indicates success, 1 indicates an input value was NaN or Inf |
void dg::exblas::exdot_omp | ( | unsigned | size, |
PointerOrValue1 | x1_ptr, | ||
PointerOrValue2 | x2_ptr, | ||
int64_t * | h_superacc, | ||
int * | status ) |
OpenMP parallel version of exact triple dot product.
Accumulate the exact sum
\[ \sum_{i=0}^{N-1} x_i y_i \]
into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT
(39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.
NBFPE | size of the floating point expansion (should be between 3 and 8) |
PointerOrValue | must be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double . If it is a pointer type, then we iterate through the pointed data from 0 to size , else we consider the value constant in every iteration. |
size | size N of the arrays to sum |
x1_ptr | first array |
x2_ptr | second array |
h_superacc | pointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function) |
status | 0 indicates success, 1 indicates an input value was NaN or Inf |
void dg::exblas::exdot_omp | ( | unsigned | size, |
PointerOrValue1 | x1_ptr, | ||
PointerOrValue2 | x2_ptr, | ||
PointerOrValue3 | x3_ptr, | ||
int64_t * | h_superacc, | ||
int * | status ) |
OpenMP parallel version of exact triple dot product.
Accumulate the exact sum
\[ \sum_{i=0}^{N-1} x_i w_i y_i \]
into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT
(39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.
NBFPE | size of the floating point expansion (should be between 3 and 8) |
PointerOrValue | must be one of T, T&&, T&, const T&, T* or const T* , where T is either float or double . If it is a pointer type, then we iterate through the pointed data from 0 to size , else we consider the value constant in every iteration. |
size | size N of the arrays to sum |
x1_ptr | first array |
x2_ptr | second array |
x3_ptr | third array |
h_superacc | pointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function) |
status | 0 indicates success, 1 indicates an input value was NaN or Inf |
void dg::exblas::fpedot_cpu | ( | int * | status, |
unsigned | size, | ||
std::array< T, N > & | fpe, | ||
Functor | f, | ||
PointerOrValues ... | xs_ptr ) |
serial version of extended precision general dot product
Computes the extended precision reduction
\[ \sum_{i=0}^{N-1} f(x_{0i}, x_{1i}, ... )\]
using floating point expansions
T | the return type of Functor . |
N | size of the floating point expansion (should be between 3 and 8) |
Functor | a Functor |
PointerOrValues | must be one of T, T&&, T&, const T&, T* or const T* , where T is a scalar type. If it is a pointer type, then we iterate through the pointed data from 0 to size , else we consider the value constant in every iteration. |
status | 0 indicates success, 2 means the FPE overflowed |
size | size of the arrays to sum |
fpe | the FPE holding the result (write-only) |
f | the functor |
xs_ptr | the input arrays to sum |
exblas::cpu::Round
to convert the FPE into a double precision number __host__ void dg::exblas::fpedot_gpu | ( | int * | status, |
unsigned | size, | ||
T * | fpe, | ||
Functor | f, | ||
PointerOrValues ... | xs_ptr ) |
GPU version of fpe generalized dot product.
Computes the extended precision reduction
\[ \sum_{i=0}^{N-1} f(x_{0i}, x_{1i}, ... )\]
using floating point expansions
T | the return type of Functor . |
N | size of the floating point expansion (should be between 3 and 8) |
Functor | a Functor |
PointerOrValues | must be one of T, T&&, T&, const T&, T* or const T* , where T is a scalar type. If it is a pointer type, then we iterate through the pointed data from 0 to size , else we consider the value constant in every iteration. |
status | 0 indicates success, 2 means the FPE overflowed |
size | size of the arrays to sum |
fpe | the FPE holding the result (write-only) |
f | the functor |
xs_ptr | the input arrays to sum |
exblas::cpu::Round
to convert the FPE into a double precision number void dg::exblas::fpedot_omp | ( | int * | status, |
unsigned | size, | ||
std::array< T, N > & | fpe, | ||
Functor | f, | ||
PointerOrValues ... | xs_ptr ) |
OpenMP version of fpe generalized dot product.
Computes the extended precision reduction
\[ \sum_{i=0}^{N-1} f(x_{0i}, x_{1i}, ... )\]
using floating point expansions
T | the return type of Functor . |
N | size of the floating point expansion (should be between 3 and 8) |
Functor | a Functor |
PointerOrValues | must be one of T, T&&, T&, const T&, T* or const T* , where T is a scalar type. If it is a pointer type, then we iterate through the pointed data from 0 to size , else we consider the value constant in every iteration. |
status | 0 indicates success, 2 means the FPE overflowed |
size | size of the arrays to sum |
fpe | the FPE holding the result (write-only) |
f | the functor |
xs_ptr | the input arrays to sum |
exblas::cpu::Round
to convert the FPE into a double precision number
|
inline |
This function can be used to partition communicators for the exblas::reduce_mpi_cpu
function.
If the ranks in comm
are aligned in rows of 128 (or any number <= 256) comm_mod
is the 1d communicator for the rows and comm_mod_reduce
is the communicator for the columns. The last row may contain fewer than 128 ranks and comm_mod
may contain 1 rank fewer than the others in the last columns
comm | the input communicator (unmodified, may not be MPI_COMM_NULL ) |
comm_mod | the line communicator in comm , consists of all rank/mod ranks |
comm_mod_reduce | the column communicator in comm consists of all rankmod ranks |
comm
only the first call will actually create new communicators.
|
inline |
reduce a number of superaccumulators distributed among mpi processes
We cannot sum more than 256 accumulators before we need to normalize again, so we need to split the reduction into several steps if more than 256 processes are involved. This function normalizes, reduces, normalizes, reduces and broadcasts the result to all participating processes. As usual the resulting superaccumulator is unnormalized.
num_superacc | number of Superaccumulators eaach process holds |
in | unnormalized input superaccumulators ( must be of size num_superacc*exblas::BIN_COUNT , allocated on the cpu) (read/write, undefined on out) |
out | each process contains the result on output( must be of size num_superacc*exblas::BIN_COUNT , allocated on the cpu) (write, may not alias in) |
comm | The complete MPI communicator |
comm_mod | This is the line communicator of up to 128 ranks ( or any other number <256) |
comm_mod_reduce | This is the column communicator consisting of all rank 0, 1, 2, ...,127 processes in all comm_mod |
exblas::mpi_reduce_communicator
to generate the required communicators