This is the namespace for all functions and classes defined and used in the exblas library. More...

Namespaces
namespace	cpu
	cpu versions of the primitive functions

namespace	gpu
	gpu (CUDA) versions of primitive functions

Classes
union	udouble
	Utility union to display all bits of a double (using type-punning) More...

Functions
template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 3>
__host__ void	exdot_gpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t d_superacc, int status)
	GPU version of exact dot product. More...

template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 3>
__host__ void	exdot_gpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t d_superacc, int status)
	GPU version of exact dot product. More...

template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8>
void	exdot_omp (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t h_superacc, int status)
	OpenMP parallel version of exact triple dot product. More...

template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8>
void	exdot_omp (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t h_superacc, int status)
	OpenMP parallel version of exact triple dot product. More...

template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8>
void	exdot_cpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, int64_t h_superacc, int status)
	Serial version of exact dot product. More...

template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8>
void	exdot_cpu (unsigned size, PointerOrValue1 x1_ptr, PointerOrValue2 x2_ptr, PointerOrValue3 x3_ptr, int64_t h_superacc, int status)
	Serial version of exact dot product. More...

static void	mpi_reduce_communicator (MPI_Comm comm, MPI_Comm comm_mod, MPI_Comm comm_mod_reduce)
	This function can be used to partition communicators for the `exblas::reduce_mpi_cpu` function. More...

static void	reduce_mpi_cpu (unsigned num_superacc, int64_t in, int64_t out, MPI_Comm comm, MPI_Comm comm_mod, MPI_Comm comm_mod_reduce)
	reduce a number of superaccumulators distributed among mpi processes More...

Detailed Description

This is the namespace for all functions and classes defined and used in the exblas library.

In principle you can use this as a standalone library but it is much easier to just use the dg::blas1::dot and dg::blas2::dot functions for general purpose usage

Function Documentation

◆ exdot_cpu() [1/2]

template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8>

void dg::exblas::exdot_cpu	(	unsigned	size,
		PointerOrValue1	x1_ptr,
		PointerOrValue2	x2_ptr,
		int64_t *	h_superacc,
		int *	status
	)

Serial version of exact dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note: The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version

Attention: the product \( x_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)

Note: The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.

Template Parameters

NBFPE	size of the floating point expansion (should be between 3 and 8)
PointerOrValue	must be one of `T, T&&, T&, const T&, T* or const T*` , where `T` is either `float` or `double`. If it is a pointer type, then we iterate through the pointed data from 0 to `size`, else we consider the value constant in every iteration.

Parameters

size	size N of the arrays to sum
x1_ptr	first array
x2_ptr	second array
h_superacc	pointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least `exblas::BIN_COUNT` (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status	0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_cpu() [2/2]

template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8>

void dg::exblas::exdot_cpu	(	unsigned	size,
		PointerOrValue1	x1_ptr,
		PointerOrValue2	x2_ptr,
		PointerOrValue3	x3_ptr,
		int64_t *	h_superacc,
		int *	status
	)

Serial version of exact dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i w_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note: The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version

Attention: the product \( x_iw_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)

Note: The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.

Template Parameters

NBFPE	size of the floating point expansion (should be between 3 and 8)
PointerOrValue	must be one of `T, T&&, T&, const T&, T* or const T*` , where `T` is either `float` or `double`. If it is a pointer type, then we iterate through the pointed data from 0 to `size`, else we consider the value constant in every iteration.

Parameters

size	size N of the arrays to sum
x1_ptr	first array
x2_ptr	second array
x3_ptr	third array
h_superacc	pointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least `exblas::BIN_COUNT` (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status	0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_gpu() [1/2]

template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 3>

__host__ void dg::exblas::exdot_gpu	(	unsigned	size,
		PointerOrValue1	x1_ptr,
		PointerOrValue2	x2_ptr,
		int64_t *	d_superacc,
		int *	status
	)

GPU version of exact dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note: The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version

Attention: the product \( x_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)

Note: The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.

Template Parameters

NBFPE	size of the floating point expansion (should be between 3 and 8)
PointerOrValue	must be one of `T, T&&, T&, const T&, T* or const T*` , where `T` is either `float` or `double`. If it is a pointer type, then we iterate through the pointed data from 0 to `size`, else we consider the value constant in every iteration.

Parameters

size	size N of the arrays to sum
x1_ptr	first array
x2_ptr	second array
d_superacc	pointer to an array of 64 bit integegers (the superaccumulator) in device memory with size at least `exblas::BIN_COUNT` (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status	0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_gpu() [2/2]

template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 3>

__host__ void dg::exblas::exdot_gpu	(	unsigned	size,
		PointerOrValue1	x1_ptr,
		PointerOrValue2	x2_ptr,
		PointerOrValue3	x3_ptr,
		int64_t *	d_superacc,
		int *	status
	)

GPU version of exact dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i w_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note: The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version

Attention: the product \( x_iw_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)

Note: The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.

Template Parameters

NBFPE	size of the floating point expansion (should be between 3 and 8)
PointerOrValue	must be one of `T, T&&, T&, const T&, T* or const T*` , where `T` is either `float` or `double`. If it is a pointer type, then we iterate through the pointed data from 0 to `size`, else we consider the value constant in every iteration.

Parameters

size	size N of the arrays to sum
x1_ptr	first array
x2_ptr	second array
x3_ptr	third array
d_superacc	pointer to an array of 64 bit integegers (the superaccumulator) in device memory with size at least `exblas::BIN_COUNT` (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status	0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_omp() [1/2]

template<class PointerOrValue1 , class PointerOrValue2 , size_t NBFPE = 8>

void dg::exblas::exdot_omp	(	unsigned	size,
		PointerOrValue1	x1_ptr,
		PointerOrValue2	x2_ptr,
		int64_t *	h_superacc,
		int *	status
	)

OpenMP parallel version of exact triple dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note: The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version

Attention: the product \( x_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)

Note: The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.

Template Parameters

NBFPE	size of the floating point expansion (should be between 3 and 8)
PointerOrValue	must be one of `T, T&&, T&, const T&, T* or const T*` , where `T` is either `float` or `double`. If it is a pointer type, then we iterate through the pointed data from 0 to `size`, else we consider the value constant in every iteration.

Parameters

size	size N of the arrays to sum
x1_ptr	first array
x2_ptr	second array
h_superacc	pointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least `exblas::BIN_COUNT` (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status	0 indicates success, 1 indicates an input value was NaN or Inf

◆ exdot_omp() [2/2]

template<class PointerOrValue1 , class PointerOrValue2 , class PointerOrValue3 , size_t NBFPE = 8>

void dg::exblas::exdot_omp	(	unsigned	size,
		PointerOrValue1	x1_ptr,
		PointerOrValue2	x2_ptr,
		PointerOrValue3	x3_ptr,
		int64_t *	h_superacc,
		int *	status
	)

OpenMP parallel version of exact triple dot product.

Accumulate the exact sum

\[ \sum_{i=0}^{N-1} x_i w_i y_i \]

into a superaccumulator. The superaccumulator is an array of exblas::BIN_COUNT (39) 64 bit integers that represents a large fixed point number such that the summation is computed with virtually infinite precision and is thus bitwise reproducible even in a parallel environment.

Note: The superaccumulator can be converted to a double precision number using the Round function: dg::exblas::cpu::Round for cpu and omp version or dg::exblas::gpu::Round for gpu version

Attention: the product \( x_iw_iy_i\) of numbers is not computed with infinite precision only the sum is (this does not break the reproducibility)

Note: The algorithm is described in the paper Sylvain Collange, David Defour, Stef Graillat, Roman Iakymchuk. "Numerical Reproducibility for the Parallel Reduction on Multi- and Many-Core Architectures", 2015.

Template Parameters

NBFPE	size of the floating point expansion (should be between 3 and 8)
PointerOrValue	must be one of `T, T&&, T&, const T&, T* or const T*` , where `T` is either `float` or `double`. If it is a pointer type, then we iterate through the pointed data from 0 to `size`, else we consider the value constant in every iteration.

Parameters

size	size N of the arrays to sum
x1_ptr	first array
x2_ptr	second array
x3_ptr	third array
h_superacc	pointer to an array of 64 bit integegers (the superaccumulator) in host memory with size at least `exblas::BIN_COUNT` (39) (contents are overwritten, the function does not allocate memory i.e. the memory needs to be allocated before calling the function)
status	0 indicates success, 1 indicates an input value was NaN or Inf

◆ mpi_reduce_communicator()

static void dg::exblas::mpi_reduce_communicator	(	MPI_Comm	comm,
		MPI_Comm *	comm_mod,
		MPI_Comm *	comm_mod_reduce
	)

static

This function can be used to partition communicators for the exblas::reduce_mpi_cpu function.

Parameters

comm	the input communicator (unmodified, may not be `MPI_COMM_NULL`)
comm_mod	a subgroup of comm (comm is split)
comm_mod_reduce	a subgroup of comm, consists of all rank 0 processes in comm_mod

Note: the creation of new communicators involves communication between all participation processes (comm in this case).

Attention: In order to avoid excessive creation of new MPI communicators (there is a limit to how many a program can create), the function keeps record of which communicators it has been called with. If you repeatedly call this function with the same comm only the first call will actually create new communicators.

◆ reduce_mpi_cpu()

static void dg::exblas::reduce_mpi_cpu	(	unsigned	num_superacc,
		int64_t *	in,
		int64_t *	out,
		MPI_Comm	comm,
		MPI_Comm	comm_mod,
		MPI_Comm	comm_mod_reduce
	)

static

reduce a number of superaccumulators distributed among mpi processes

We cannot sum more than 256 accumulators before we need to normalize again, so we need to split the reduction into several steps if more than 256 processes are involved. This function normalizes, reduces, normalizes, reduces and broadcasts the result to all participating processes. As usual the resulting superaccumulator is unnormalized.

Parameters

num_superacc	number of Superaccumulators eaach process holds
in	unnormalized input superaccumulators ( must be of size num_superacc*`exblas::BIN_COUNT`, allocated on the cpu) (read/write, undefined on out)
out	each process contains the result on output( must be of size num_superacc*`exblas::BIN_COUNT`, allocated on the cpu) (write, may not alias in)
comm	The complete MPI communicator
comm_mod	This is comm modulo 128 ( or any other number <256)
comm_mod_reduce	This is the communicator consisting of all rank 0 processes in comm_mod, may be `MPI_COMM_NULL`

See also: exblas::mpi_reduce_communicator to generate the required communicators

Namespaces

Classes

Functions

Detailed Description

Function Documentation

◆ exdot_cpu() [1/2]

◆ exdot_cpu() [2/2]

◆ exdot_gpu() [1/2]

◆ exdot_gpu() [2/2]

◆ exdot_omp() [1/2]

◆ exdot_omp() [2/2]

◆ mpi_reduce_communicator()

◆ reduce_mpi_cpu()