Extension: ExBLAS
#include "dg/algorithm.h" (or as a standalone library as "dg/exblas/exblas.h")
Namespaces | |
namespace | cpu |
cpu versions of the primitive functions | |
namespace | gpu |
gpu (CUDA) versions of primitive functions | |
Classes | |
union | udouble |
Utility union to display all bits of a double (using type-punning) More... | |
Functions | |
template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 3> | |
__host__ void | exdot_gpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t *d_superacc, int *status) |
GPU version of exact dot product. More... | |
template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 3> | |
__host__ void | exdot_gpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t *d_superacc, int *status) |
GPU version of exact triple dot product. More... | |
template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8> | |
void | exdot_omp (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t *h_superacc, int *status) |
OpenMP parallel version of exact dot product. More... | |
template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8> | |
void | exdot_omp (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t *h_superacc, int *status) |
OpenMP parallel version of exact triple dot product. More... | |
template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8> | |
void | exdot_cpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t *h_superacc, int *status) |
Serial version of exact dot product. More... | |
template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8> | |
void | exdot_cpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t *h_superacc, int *status) |
Serial version of exact triple dot product. More... | |
static void | mpi_reduce_communicator (MPI_Comm comm, MPI_Comm *comm_mod, MPI_Comm *comm_mod_reduce) |
This function can be used to partition communicators for the exblas::reduce_mpi_cpu function. More... | |
static void | reduce_mpi_cpu (unsigned num_superacc, int64_t *in, int64_t *out, MPI_Comm comm, MPI_Comm comm_mod, MPI_Comm comm_mod_reduce) |
Reduce a number of superaccumulators distributed among MPI processes. More... | |
This is the namespace for all functions and classes defined and used in the exblas library.
In principle you can use this as a standalone library, but for general-purpose use it is much easier to call the dg::blas1::dot and dg::blas2::dot functions.
void dg::exblas::exdot_cpu | ( | unsigned | size, |
PointerOrValue1 | x1_ptr, | ||
PointerOrValue2 | x2_ptr, | ||
int64_t * | h_superacc, | ||
int * | status | ||
) |
Serial version of exact dot product.
Accumulate the exact sum
\[ \sum_{i=0}^{N-1} x_i y_i \]
into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT
(39) 64-bit integers that represents a large fixed-point number, so the summation is computed with virtually infinite precision and is therefore bitwise reproducible, even in a parallel environment.
NBFPE | size of the floating point expansion (should be between 3 and 8) |
PointerOrValue | must be one of T, T&&, T&, const T&, T* or const T*, where T is either float or double. If it is a pointer type, the function iterates through the pointed-to data from 0 to size; otherwise the value is treated as constant in every iteration. |
size | size N of the arrays to sum |
x1_ptr | first array |
x2_ptr | second array |
h_superacc | pointer to an array of 64-bit integers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39). The contents are overwritten. The function does not allocate memory, so the array must be allocated before the call. |
status | 0 indicates success, 1 indicates an input value was NaN or Inf |
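The reproducibility claim above rests on the fact that integer addition is associative while floating-point addition is not. The following is a minimal toy sketch of that idea, not the ExBLAS algorithm: the names naive_sum and fixed_point_sum are hypothetical, a single 64-bit integer stands in for the 39-bin superaccumulator, and the inputs are assumed to be multiples of 2^-8 whose scaled sum fits into an int64_t.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Plain floating-point sum: the result depends on the order of the terms.
inline double naive_sum(const std::vector<double>& v)
{
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Toy fixed-point sum (hypothetical helper, not part of ExBLAS):
// every term is scaled by 2^8 and added as a 64-bit integer.
// Assumes each term is a multiple of 2^-8 and that all partial sums
// fit into an int64_t, so every step is exact and order-independent.
inline std::int64_t fixed_point_sum(const std::vector<double>& v)
{
    std::int64_t acc = 0;
    for (double x : v)
        acc += static_cast<std::int64_t>(std::ldexp(x, 8)); // x * 2^8
    return acc;
}
```

Summing 2^53, 1 and -2^53 in two different orders gives two different results with naive_sum, whereas fixed_point_sum returns the same integer for both orders.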
void dg::exblas::exdot_cpu | ( | unsigned | size, |
PointerOrValue1 | x1_ptr, | ||
PointerOrValue2 | x2_ptr, | ||
PointerOrValue3 | x3_ptr, | ||
int64_t * | h_superacc, | ||
int * | status | ||
) |
Serial version of exact triple dot product.
Accumulate the exact sum
\[ \sum_{i=0}^{N-1} x_i w_i y_i \]
into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT
(39) 64-bit integers that represents a large fixed-point number, so the summation is computed with virtually infinite precision and is therefore bitwise reproducible, even in a parallel environment.
NBFPE | size of the floating point expansion (should be between 3 and 8) |
PointerOrValue | must be one of T, T&&, T&, const T&, T* or const T*, where T is either float or double. If it is a pointer type, the function iterates through the pointed-to data from 0 to size; otherwise the value is treated as constant in every iteration. |
size | size N of the arrays to sum |
x1_ptr | first array |
x2_ptr | second array |
x3_ptr | third array |
h_superacc | pointer to an array of 64-bit integers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39). The contents are overwritten. The function does not allocate memory, so the array must be allocated before the call. |
status | 0 indicates success, 1 indicates an input value was NaN or Inf |
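The PointerOrValue rule (pointers are iterated, values are held constant) can be illustrated with a small overload set. This is only a sketch of the dispatch idea under stated assumptions: get_element and naive_triple_dot are hypothetical names, the library's actual mechanism is not shown on this page, and the loop below is an ordinary rounded floating-point sum, not an exact one.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Pointer argument: read the i-th element of the pointed-to data.
template<class T>
inline T get_element(const T* ptr, std::size_t i) { return ptr[i]; }

// Value argument: the same value is returned in every iteration.
template<class T,
         class = typename std::enable_if<std::is_arithmetic<T>::value>::type>
inline T get_element(T value, std::size_t) { return value; }

// Hypothetical stand-in for a triple dot product that accepts a pointer
// or a value in each slot. Not exact: it accumulates in a plain double.
template<class P1, class P2, class P3>
double naive_triple_dot(unsigned size, P1 x1, P2 x2, P3 x3)
{
    double sum = 0.0;
    for (unsigned i = 0; i < size; i++)
        sum += double(get_element(x1, i)) * get_element(x2, i) * get_element(x3, i);
    return sum;
}
```

With arrays x = {1, 2, 3} and y = {4, 5, 6}, the call naive_triple_dot(3u, x, 2.0, y) treats the middle argument as a constant weight in every iteration.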
__host__ void dg::exblas::exdot_gpu | ( | unsigned | size, |
PointerOrValue1 | x1_ptr, | ||
PointerOrValue2 | x2_ptr, | ||
int64_t * | d_superacc, | ||
int * | status | ||
) |
GPU version of exact dot product.
Accumulate the exact sum
\[ \sum_{i=0}^{N-1} x_i y_i \]
into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT
(39) 64-bit integers that represents a large fixed-point number, so the summation is computed with virtually infinite precision and is therefore bitwise reproducible, even in a parallel environment.
NBFPE | size of the floating point expansion (should be between 3 and 8) |
PointerOrValue | must be one of T, T&&, T&, const T&, T* or const T*, where T is either float or double. If it is a pointer type, the function iterates through the pointed-to data from 0 to size; otherwise the value is treated as constant in every iteration. |
size | size N of the arrays to sum |
x1_ptr | first array |
x2_ptr | second array |
d_superacc | pointer to an array of 64-bit integers (the superaccumulator) in device memory with size at least exblas::BIN_COUNT (39). The contents are overwritten. The function does not allocate memory, so the array must be allocated before the call. |
status | 0 indicates success, 1 indicates an input value was NaN or Inf |
__host__ void dg::exblas::exdot_gpu | ( | unsigned | size, |
PointerOrValue1 | x1_ptr, | ||
PointerOrValue2 | x2_ptr, | ||
PointerOrValue3 | x3_ptr, | ||
int64_t * | d_superacc, | ||
int * | status | ||
) |
GPU version of exact triple dot product.
Accumulate the exact sum
\[ \sum_{i=0}^{N-1} x_i w_i y_i \]
into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT
(39) 64-bit integers that represents a large fixed-point number, so the summation is computed with virtually infinite precision and is therefore bitwise reproducible, even in a parallel environment.
NBFPE | size of the floating point expansion (should be between 3 and 8) |
PointerOrValue | must be one of T, T&&, T&, const T&, T* or const T*, where T is either float or double. If it is a pointer type, the function iterates through the pointed-to data from 0 to size; otherwise the value is treated as constant in every iteration. |
size | size N of the arrays to sum |
x1_ptr | first array |
x2_ptr | second array |
x3_ptr | third array |
d_superacc | pointer to an array of 64-bit integers (the superaccumulator) in device memory with size at least exblas::BIN_COUNT (39). The contents are overwritten. The function does not allocate memory, so the array must be allocated before the call. |
status | 0 indicates success, 1 indicates an input value was NaN or Inf |
void dg::exblas::exdot_omp | ( | unsigned | size, |
PointerOrValue1 | x1_ptr, | ||
PointerOrValue2 | x2_ptr, | ||
int64_t * | h_superacc, | ||
int * | status | ||
) |
OpenMP parallel version of exact dot product.
Accumulate the exact sum
\[ \sum_{i=0}^{N-1} x_i y_i \]
into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT
(39) 64-bit integers that represents a large fixed-point number, so the summation is computed with virtually infinite precision and is therefore bitwise reproducible, even in a parallel environment.
NBFPE | size of the floating point expansion (should be between 3 and 8) |
PointerOrValue | must be one of T, T&&, T&, const T&, T* or const T*, where T is either float or double. If it is a pointer type, the function iterates through the pointed-to data from 0 to size; otherwise the value is treated as constant in every iteration. |
size | size N of the arrays to sum |
x1_ptr | first array |
x2_ptr | second array |
h_superacc | pointer to an array of 64-bit integers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39). The contents are overwritten. The function does not allocate memory, so the array must be allocated before the call. |
status | 0 indicates success, 1 indicates an input value was NaN or Inf |
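The status convention (0 for success, 1 if an input value was NaN or Inf) can be mirrored by a caller-side pre-check. The sketch below is an assumption-laden illustration only: finite_status is a hypothetical helper, and this page does not describe how the library itself detects non-finite inputs.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper that mirrors the documented status convention:
// returns 0 if all inputs are finite, 1 if any input is NaN or Inf.
inline int finite_status(unsigned size, const double* x1, const double* x2)
{
    for (unsigned i = 0; i < size; i++)
        if (!std::isfinite(x1[i]) || !std::isfinite(x2[i]))
            return 1;
    return 0;
}
```

A caller would typically inspect the returned status before interpreting the superaccumulator contents.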
void dg::exblas::exdot_omp | ( | unsigned | size, |
PointerOrValue1 | x1_ptr, | ||
PointerOrValue2 | x2_ptr, | ||
PointerOrValue3 | x3_ptr, | ||
int64_t * | h_superacc, | ||
int * | status | ||
) |
OpenMP parallel version of exact triple dot product.
Accumulate the exact sum
\[ \sum_{i=0}^{N-1} x_i w_i y_i \]
into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT
(39) 64-bit integers that represents a large fixed-point number, so the summation is computed with virtually infinite precision and is therefore bitwise reproducible, even in a parallel environment.
NBFPE | size of the floating point expansion (should be between 3 and 8) |
PointerOrValue | must be one of T, T&&, T&, const T&, T* or const T*, where T is either float or double. If it is a pointer type, the function iterates through the pointed-to data from 0 to size; otherwise the value is treated as constant in every iteration. |
size | size N of the arrays to sum |
x1_ptr | first array |
x2_ptr | second array |
x3_ptr | third array |
h_superacc | pointer to an array of 64-bit integers (the superaccumulator) in host memory with size at least exblas::BIN_COUNT (39). The contents are overwritten. The function does not allocate memory, so the array must be allocated before the call. |
status | 0 indicates success, 1 indicates an input value was NaN or Inf |
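static void dg::exblas::mpi_reduce_communicator | ( | MPI_Comm | comm, |
MPI_Comm * | comm_mod, | ||
MPI_Comm * | comm_mod_reduce | ||
) |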
This function can be used to partition communicators for the exblas::reduce_mpi_cpu
function.
comm | the input communicator (unmodified, may not be MPI_COMM_NULL ) |
comm_mod | a subgroup of comm (comm is split) |
comm_mod_reduce | a subgroup of comm, consists of all rank 0 processes in comm_mod |
If this function is called repeatedly with the same comm, only the first call will actually create new communicators.
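To make the roles of the two subgroups concrete, the sketch below computes where a rank would land if consecutive ranks were grouped in blocks of mod processes. That blockwise layout is an assumption made for illustration; this page only states that comm is split and that comm_mod_reduce consists of the rank 0 processes of comm_mod. RankPlacement, place_rank and num_groups are hypothetical names, and no MPI calls are made.

```cpp
#include <cassert>

// Where a rank of the full communicator ends up (illustration only).
struct RankPlacement {
    int group;      // index of the comm_mod-like subgroup
    int local_rank; // rank inside that subgroup
    bool leader;    // true if the rank would be part of comm_mod_reduce
};

// Assumed layout: consecutive blocks of `mod` ranks form one subgroup,
// and the first rank of each block (local rank 0) is its leader.
inline RankPlacement place_rank(int rank, int mod)
{
    return RankPlacement{rank / mod, rank % mod, rank % mod == 0};
}

// Number of subgroups, i.e. the number of leaders.
inline int num_groups(int size, int mod)
{
    return (size + mod - 1) / mod;
}
```

For 300 processes and mod = 128 this yields three subgroups, so both the per-group size and the number of leaders stay below 256.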
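static void dg::exblas::reduce_mpi_cpu | ( | unsigned | num_superacc, |
int64_t * | in, | ||
int64_t * | out, | ||
MPI_Comm | comm, | ||
MPI_Comm | comm_mod, | ||
MPI_Comm | comm_mod_reduce | ||
) |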
Reduce a number of superaccumulators distributed among MPI processes.
We cannot sum more than 256 accumulators before we need to normalize again, so we need to split the reduction into several steps if more than 256 processes are involved. This function normalizes, reduces, normalizes, reduces and broadcasts the result to all participating processes. As usual the resulting superaccumulator is unnormalized.
num_superacc | number of superaccumulators each process holds |
in | unnormalized input superaccumulators (must be of size num_superacc*exblas::BIN_COUNT, allocated on the CPU; read/write, contents are undefined on output) |
out | each process contains the result on output (must be of size num_superacc*exblas::BIN_COUNT, allocated on the CPU; write only, may not alias in) |
comm | the complete MPI communicator |
comm_mod | comm modulo 128 (or any other number < 256) |
comm_mod_reduce | This is the communicator consisting of all rank 0 processes in comm_mod, may be MPI_COMM_NULL |
See also exblas::mpi_reduce_communicator to generate the required communicators.
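The two-step structure described above (reduce inside each subgroup, reduce across the subgroup leaders, then broadcast) can be simulated without MPI. The sketch below is a simplified model under stated assumptions: each simulated process holds a single int64_t instead of a full superaccumulator, the normalization steps are omitted, ranks are assumed to be grouped in consecutive blocks of mod, and two_stage_sum is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simulates the two-stage reduction on one value per process.
// in[r] is the value held by rank r; the returned vector holds the
// value every rank sees after the final broadcast.
inline std::vector<std::int64_t>
two_stage_sum(const std::vector<std::int64_t>& in, int mod)
{
    const int size = static_cast<int>(in.size());
    const int groups = (size + mod - 1) / mod;
    // Neither stage may add more than 256 terms.
    assert(mod <= 256 && groups <= 256);

    // Stage 1: every subgroup of up to `mod` ranks sums onto its leader.
    std::vector<std::int64_t> partial(groups, 0);
    for (int r = 0; r < size; r++)
        partial[r / mod] += in[r];

    // Stage 2: the leaders sum their partial results.
    std::int64_t total = 0;
    for (std::int64_t p : partial)
        total += p;

    // Broadcast: every rank receives the same total.
    return std::vector<std::int64_t>(size, total);
}
```

With mod = 128 this scheme covers up to 128 * 256 = 32768 simulated processes while keeping each stage within the 256-term limit.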