Determine the latencies and bandwidths quantitatively

import numpy as np
from collections import OrderedDict as odict
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm

This notebook computes the latencies and bandwidths of the three primitive function types. Again, start with a list of files and architectures that we want to investigate.

#(hardware name, number of nodes)
files = odict({})
files['i5'] = ('i5',1)
files['gtx1060'] = ('gtx1060',1)
files['skl_mpi1'] = ('skl',1)
files['skl_mpi2'] = ('skl',2)
files['skl_mpi4'] = ('skl',4)
files['knl_mpi1'] = ('knl',1)
files['knl_mpi2'] = ('knl',2)
files['knl_mpi4'] = ('knl',4)
files['p100nv_mpi1'] = ('p100',1)
files['p100nv_mpi2'] = ('p100',2)
files['p100nv_mpi4'] = ('p100',4)
files['titanxp'] = ('titanXp',1)
files['v100nv_mpi1'] = ('v100',1)
files['v100nv_mpi2'] = ('v100',2)
files['v100nv_mpi4'] = ('v100',4)

pd.set_option('display.precision', 2)

Assumptions

  • there are three basic function types: trivially parallel (axpby), nearest neighbor (dxdy), and global reduction (dot)

  • each can be characterized by its single-node bandwidth, single-node latency, and multi-node latency

But

  • this model does not capture cache effects, e.g. on SKL
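The assumptions above amount to a simple linear cost model, runtime = latency + transfer time. A minimal sketch (function and parameter names are illustrative, with sizes in MB and bandwidths in GB/s to match the units used below):

```python
def model_runtime(size_mb, t_lat_us, bandwidth_gb_s, memops):
    """Sketch of the assumed performance model: T(S) = T_lat + S*memops/B.

    size_mb:        data size per node in MB
    t_lat_us:       latency in microseconds
    bandwidth_gb_s: sustained memory bandwidth in GB/s
    memops:         memory operations per element (e.g. 3 for axpby, 2 for dot)
    Returns the predicted runtime in seconds.
    """
    latency = t_lat_us * 1e-6                             # us -> s
    transfer = size_mb * memops / bandwidth_gb_s * 1e-3   # MB/(GB/s) -> s
    return latency + transfer

# Example: 100 MB axpby at 400 GB/s with 10 us latency
# transfer = 100*3/400 ms = 0.75 ms, total = 0.76 ms
```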

First, we read in the files, extract the runtimes, and compute the bandwidth for every measurement. We also compute the average runtime and bandwidth for every combination of n, Nx and Ny.

Axpby and Dot Bandwidths

The bandwidths are determined by taking the average of the 30 bandwidth measurements corresponding to the 3 largest sizes. The error is the standard deviation of these bandwidths.
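In code this amounts to averaging the tail of the size-sorted bandwidth column; a minimal sketch (function and column names are illustrative, not the notebook's exact ones):

```python
import pandas as pd

def tail_bandwidth(df, col, s=30):
    """Estimate the asymptotic bandwidth from the s largest-size
    measurements: the mean is the estimate, the std is the error."""
    df = df.sort_values(by='size')
    tail = df[col].iloc[-s:]
    return tail.mean(), tail.std()

# Synthetic example: bandwidth saturating towards 400 GB/s with size
df = pd.DataFrame({'size': range(1, 41),
                   'bw': [400 * s / (s + 5) for s in range(1, 41)]})
bw, err = tail_bandwidth(df, 'bw', s=10)
```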

Axpby and Dot Latencies

We try three different methods to compute the latency. First, we take the mean of the average runtimes for the 3 smallest sizes. Second, we take the minimum of the average runtimes among all combinations of n, Nx, and Ny; the error is the standard deviation corresponding to this minimum. Finally, we correct the minimum using the more accurately determined bandwidths.
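The third method subtracts the bandwidth-dominated transfer time S*memops/B from the minimum observed runtime. A sketch of this correction (names are illustrative; the clamp at zero mirrors the clamp used in the cell below):

```python
def corrected_latency(min_runtime_s, size_mb, memops, bandwidth_gb_s):
    """Subtract the transfer part S*memops/B from the minimum runtime,
    leaving an estimate of the pure latency in microseconds.
    Clamped at zero, since the correction can overshoot on noisy data."""
    transfer_s = size_mb * memops / bandwidth_gb_s * 1e-3  # MB/(GB/s) -> s
    lat_us = (min_runtime_s - transfer_s) / 1e-6
    return max(lat_us, 0.0)

# Example: a 20 us minimum runtime for 1 MB of dot (2 memops) at 200 GB/s
# has a 10 us transfer part, so the corrected latency is 10 us
```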

Dx-Dy Bandwidths

Since the efficiency of the matrix-vector multiplications depends on the number of polynomial coefficients, we compute these bandwidths separately for each coefficient n and use only sizes between 10 MB and 1000 MB.
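In the cell below the dx and dy bandwidths are combined via the harmonic mean, because the two kernels move the same data volume back-to-back, so their runtimes add while their bandwidths do not. A sketch of that combination:

```python
def combined_bandwidth(bw_x, bw_y):
    """Harmonic mean of two bandwidths for back-to-back kernels that each
    move the same data volume S: total time = S/bw_x + S/bw_y, so the
    effective bandwidth is 2*S / total time."""
    return 2.0 * bw_x * bw_y / (bw_x + bw_y)

# Example: 300 GB/s and 150 GB/s combine to 200 GB/s,
# not the arithmetic mean of 225 GB/s
```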

Dx-Dy Latencies

Computed with the same three methods as for axpby and dot.

names={'axpby':3,'dot':2,'dx':3, 'dy':3}
#ns=[3,4]
values = []
for f, v in files.items() :
    runtimes=pd.read_csv('benchmark_'+f+'.csv', delimiter=' ')
    #add size and bandwidth columns
    runtimes.insert(0,'size', 8*runtimes['n']*runtimes['n']
                    *runtimes['Nx']*runtimes['Ny']/1e6/v[1]) #inplace transformation
    for name,memops in names.items() :
        runtimes.insert(0,name+'_bw',runtimes['size']/1000*memops/runtimes[name])
    runtimes = runtimes.assign( dxdy=(runtimes['dx']+runtimes['dy'])/2)
    runtimes = runtimes.assign( dxdy_bw=2.0*runtimes['dx_bw']*runtimes['dy_bw']
                               /(runtimes['dx_bw']+runtimes['dy_bw']))
    #compute one version with aggregated grouped sizes and one without
    avgruntimes=runtimes.groupby(['n', 'Nx','Ny','size']).agg(['mean', 'std'])
    avgruntimes=avgruntimes.reset_index(level=['n','Nx','Ny','size'])
    avgruntimes.sort_values(by='size',inplace=True) #sort by size
    runtimes.sort_values(by='size',inplace=True)
    ## first compute axpby and dot latencies and bandwidths
    nmax = 3  # number of smallest sizes for the mean-latency estimate
    s = 30    # number of largest-size measurements for the bandwidth estimate

    line = []
    l=len(runtimes)
    line.append(v[0]) #0
    line.append(v[1]) #1
    for q in ['axpby','dot']:
        bandwidth       = runtimes[l-s:l][q+'_bw'].mean()
        err_bandwidth   = runtimes[l-s:l][q+'_bw'].std()
        mean_latency    = avgruntimes[0:nmax][(q,'mean')].mean()/1e-6
        min_latency     = avgruntimes[(q,'mean')].min()/1e-6
        idx_min_latency = avgruntimes[(q,'mean')].idxmin()
        corr_min_latency= min_latency - avgruntimes['size'].loc[idx_min_latency]*names[q]/bandwidth/1e-3
        err_latency     = avgruntimes[(q,'std')].loc[idx_min_latency]/1e-6
        if corr_min_latency < 0:
            corr_min_latency = 0
        line.append(bandwidth)        #2 bandwidth
        line.append(err_bandwidth)    #3 err_bandwidth
        line.append(mean_latency)     #4 latency mean
        line.append(min_latency)      #5 latency min
        line.append(corr_min_latency) #6 corrected latency
        line.append(err_latency)      #7 latency error
    ## now compute latencies and bandwidths of dx and dy
    for n in [2,3,4,5]:
        #take n
        dxdy=runtimes[runtimes['n']==n]
        
        avgdxdy = avgruntimes[avgruntimes['n']==n]
        dxdy=dxdy.sort_values(by='size')
        avgdxdy=avgdxdy.sort_values(by='size') #sort by size
        bandwidth       = dxdy[(dxdy['size']>10)&(dxdy['size']<1000)]['dxdy_bw'].mean()
        err_bandwidth   = dxdy[(dxdy['size']>10)&(dxdy['size']<1000)]['dxdy_bw'].std()
        mean_latency    = avgdxdy[0:nmax][('dxdy','mean')].mean()/1e-6
        min_latency     = avgdxdy[('dxdy','mean')].min()/1e-6
        idx_min_latency = avgdxdy[('dxdy','mean')].idxmin()
        corr_min_latency= min_latency - avgdxdy['size'].loc[idx_min_latency]*names['dx']/bandwidth/1e-3
        err_latency     = avgdxdy[('dxdy','std')].loc[idx_min_latency]/1e-6
        if corr_min_latency < 0:
            corr_min_latency = 0
        line.append(bandwidth)        #2 bandwidth
        line.append(err_bandwidth)    #3 err_bandwidth
        line.append(mean_latency)     #4 latency mean
        line.append(min_latency)      #5 latency min
        line.append(corr_min_latency) #6 corrected latency
        line.append(err_latency)      #7 latency error
    values.append(line)

In the following cells we just display the results in a more accessible way or extract values to show in publications.

#now construct new more ordered table with values from previous cell      
tuples=[('arch','',''),('nodes','','')]        
for q in ['axpby','dot','dxdy2','dxdy3','dxdy4','dxdy5']:
    tuples.append((q,'bw','avg'))
    tuples.append((q,'bw','std'))
    tuples.append((q,'lat','avg'))
    tuples.append((q,'lat','min'))
    tuples.append((q,'lat','corr'))
    tuples.append((q,'lat','std'))
    
cols=pd.MultiIndex.from_tuples(tuples)
arr = pd.DataFrame(values,index=files.keys(), columns=cols)
arr.sort_values(by='arch',inplace=True)
arr.set_index(['arch','nodes'],inplace=True)
#arr.loc[:,[('dot','bw','avg'),('dot','lat','avg')]]
arr
axpby dot ... dxdy4 dxdy5
bw lat bw lat ... lat bw lat
avg std avg min corr std avg std avg min ... avg min corr std avg std avg min corr std
arch nodes
gtx1060 1 157.05 0.06 23.19 3.51 0.00 0.24 26.50 0.10 199.90 131.63 ... 322.56 72.50 0.00 0.27 69.26 17.04 571.43 125.05 0.00 0.45
i5 1 29.99 0.19 30.58 12.38 0.00 1.18 9.31 0.04 316.67 117.43 ... 1212.12 208.32 0.00 12.42 21.45 1.91 1956.57 340.41 0.00 1.31
knl 1 393.15 22.19 11.81 9.98 5.47 0.30 141.36 6.63 77.05 63.20 ... 215.74 52.06 0.00 1.59 101.34 14.94 451.55 98.29 0.00 37.85
2 420.41 40.35 10.71 10.01 7.89 0.21 128.36 2.71 99.39 91.51 ... 207.80 92.39 59.93 1.70 88.47 7.45 363.12 133.93 71.21 8.96
4 420.70 43.72 10.53 10.21 9.16 0.09 109.74 3.94 126.08 122.29 ... 155.14 82.89 67.54 5.47 89.24 13.82 235.93 108.08 76.99 6.04
p100 1 550.51 1.23 14.36 10.38 7.52 0.02 375.61 1.94 158.73 68.34 ... 190.41 138.27 17.53 2.44 171.81 17.67 275.12 71.80 14.58 5.15
2 553.30 0.97 7.03 5.75 0.00 0.65 369.20 2.37 76.43 59.87 ... 125.22 65.47 49.16 3.43 166.21 14.15 189.63 80.20 50.63 5.60
4 554.54 0.50 8.16 7.31 0.00 0.27 356.21 5.79 57.54 52.41 ... 101.07 69.34 60.86 2.02 160.39 16.98 139.32 89.97 74.65 1.59
skl 1 206.71 5.87 4.56 4.07 0.00 0.11 192.05 18.31 32.29 24.19 ... 340.88 82.83 15.38 1.05 110.87 8.03 442.70 132.54 20.32 2.07
2 216.29 7.02 4.17 3.95 0.00 0.18 182.31 11.77 31.34 25.87 ... 149.05 63.79 30.71 0.86 114.29 7.18 249.57 89.13 34.70 0.79
4 232.62 15.11 4.11 4.02 0.00 0.26 166.86 23.97 43.80 42.40 ... 90.99 47.41 29.99 1.16 110.39 9.24 142.54 61.74 33.57 2.50
titanXp 1 431.24 3.45 6.56 2.58 0.00 0.21 61.37 0.12 87.86 61.46 ... 115.69 28.69 3.20 0.12 197.91 31.44 211.14 48.84 0.00 1.62
v100 1 846.42 0.95 5.01 4.68 0.50 0.34 610.15 5.99 95.96 92.35 ... 45.70 11.74 2.70 0.02 589.97 65.17 78.97 22.98 6.32 0.08
2 846.01 1.07 5.66 5.58 0.00 0.03 578.96 12.17 102.71 99.70 ... 47.17 36.54 31.64 0.17 567.50 46.64 61.97 38.84 30.18 0.27
4 841.47 3.03 5.73 5.60 0.00 0.31 526.26 16.99 103.29 98.08 ... 42.52 38.74 36.29 2.10 556.61 56.29 47.54 39.42 35.01 1.77

15 rows × 36 columns

#arr=arr.reset_index()
#define conversion function for writing nice output tables
def toString(x):
    if pd.isnull(x): return 'n/a'
    #string = '%.1f' % x
    string = '%d' % np.ceil(x)  # round up to integer
    #if np.ceil(x) < 100: string = '0'+string
    if np.ceil(x) < 10: string = '0'+string  # zero-pad for alignment
    return string
addto = []
for n in ['axpby','dot','dxdy2','dxdy3','dxdy4','dxdy5']:
    arr.loc[:,(n,'bw','string')]= arr[n]['bw']['avg'].apply(toString) +" ± "+arr[n]['bw']['std'].apply(toString)
    arr.loc[:,(n,'lat','string')]= arr[n]['lat']['corr'].apply(toString) +" ± "+arr[n]['lat']['std'].apply(toString)
    addto.append((n,'lat','string'))
    addto.append((n,'bw','string'))

#make a table for display
nicetable=arr[addto]
drop = nicetable.columns.droplevel(2)
nicetable.columns=drop
#nicetable.reset_index(inplace=True)
#nicetable.set_index('arch')
newindex=[('i5',1)]
newindex.append(('gtx1060',1))
newindex.append(('titanXp',1))
for n in ['skl','knl']:
    for m in [1,2,4]:
        newindex.append((n,m))
for n in ['p100','v100']:
    for m in [1,2,4]:
        newindex.append((n,m))
    
nicetable=nicetable.reindex(newindex)

nicetable
axpby dot dxdy2 dxdy3 dxdy4 dxdy5
lat bw lat bw lat bw lat bw lat bw lat bw
arch nodes
i5 1 00 ± 02 30 ± 01 05 ± 01 10 ± 01 00 ± 02 28 ± 03 00 ± 04 30 ± 03 00 ± 13 26 ± 02 00 ± 02 22 ± 02
gtx1060 1 00 ± 01 158 ± 01 93 ± 09 27 ± 01 00 ± 01 131 ± 01 03 ± 01 112 ± 02 00 ± 01 84 ± 14 00 ± 01 70 ± 18
titanXp 1 00 ± 01 432 ± 04 45 ± 06 62 ± 01 03 ± 01 373 ± 05 02 ± 01 309 ± 10 04 ± 01 247 ± 08 00 ± 02 198 ± 32
skl 1 00 ± 01 207 ± 06 18 ± 03 193 ± 19 23 ± 03 182 ± 36 17 ± 01 162 ± 13 16 ± 02 119 ± 19 21 ± 03 111 ± 09
2 00 ± 01 217 ± 08 23 ± 01 183 ± 12 30 ± 03 175 ± 45 31 ± 03 158 ± 17 31 ± 01 121 ± 21 35 ± 01 115 ± 08
4 00 ± 01 233 ± 16 38 ± 05 167 ± 24 29 ± 03 168 ± 44 30 ± 02 160 ± 08 30 ± 02 115 ± 25 34 ± 03 111 ± 10
knl 1 06 ± 01 394 ± 23 55 ± 02 142 ± 07 10 ± 01 240 ± 18 08 ± 02 173 ± 27 00 ± 02 127 ± 19 00 ± 38 102 ± 15
2 08 ± 01 421 ± 41 87 ± 02 129 ± 03 49 ± 01 176 ± 28 56 ± 03 142 ± 23 60 ± 02 110 ± 15 72 ± 09 89 ± 08
4 10 ± 01 421 ± 44 120 ± 06 110 ± 04 53 ± 04 156 ± 25 59 ± 04 128 ± 24 68 ± 06 116 ± 22 77 ± 07 90 ± 14
p100 1 08 ± 01 551 ± 02 51 ± 08 376 ± 02 27 ± 01 294 ± 08 09 ± 02 239 ± 13 18 ± 03 209 ± 08 15 ± 06 172 ± 18
2 00 ± 01 554 ± 01 51 ± 09 370 ± 03 51 ± 06 257 ± 19 51 ± 05 223 ± 17 50 ± 04 193 ± 17 51 ± 06 167 ± 15
4 00 ± 01 555 ± 01 52 ± 01 357 ± 06 55 ± 01 240 ± 12 57 ± 02 200 ± 26 61 ± 03 186 ± 16 75 ± 02 161 ± 17
v100 1 01 ± 01 847 ± 01 89 ± 05 611 ± 06 05 ± 01 795 ± 21 03 ± 01 736 ± 34 03 ± 01 697 ± 16 07 ± 01 590 ± 66
2 00 ± 01 847 ± 02 99 ± 05 579 ± 13 32 ± 01 703 ± 59 34 ± 01 689 ± 48 32 ± 01 642 ± 55 31 ± 01 568 ± 47
4 00 ± 01 842 ± 04 98 ± 01 527 ± 17 38 ± 01 678 ± 37 38 ± 04 627 ± 84 37 ± 03 641 ± 47 36 ± 02 557 ± 57
index = ['i5','gtx1060','skl','knl','p100','titanXp','v100']  
lines = []
for arch in  index: 
    line = []
    line.append(arch)
    #first the bandwidths
    for n in ['axpby','dot','dxdy2','dxdy3','dxdy4','dxdy5']:
        line.append( arr.loc[(arch,1),(n,'bw','avg')] )
        line.append( arr.loc[(arch,1),(n,'bw','std')])
    for n in ['axpby','dot','dxdy2'] :
        line.append( arr.loc[(arch,1),(n,'lat','corr')] )
        line.append( arr.loc[(arch,1),(n,'lat','std')])
        if arch == 'i5' or arch == 'gtx1060' or arch == 'titanXp':
            line.append(None)
            line.append(None)
        else:
            line.append( arr.loc[(arch,4),(n,'lat','corr')] )
            line.append( arr.loc[(arch,4),(n,'lat','std')] )
    lines.append(line)
    
tuples=['arch']     

for n in ['axpby','dot','dxdy2','dxdy3','dxdy4','dxdy5']:
    tuples.append(n+'_bw')
    tuples.append(n+'_bw_err')
for n in ['axpby','dot','dxdy']:
    tuples.append(n+'_lat_shared')
    tuples.append(n+'_lat_shared_err')
    tuples.append(n+'_lat_dist')
    tuples.append(n+'_lat_dist_err')
cols=tuples
toDisk = pd.DataFrame(lines, columns=cols)
toDisk.to_csv('performance.csv',sep=' ',index=False)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
test = pd.read_csv('performance.csv',delimiter=' ')
test
arch axpby_bw axpby_bw_err dot_bw dot_bw_err dxdy2_bw dxdy2_bw_err dxdy3_bw dxdy3_bw_err dxdy4_bw ... axpby_lat_dist axpby_lat_dist_err dot_lat_shared dot_lat_shared_err dot_lat_dist dot_lat_dist_err dxdy_lat_shared dxdy_lat_shared_err dxdy_lat_dist dxdy_lat_dist_err
0 i5 29.99 0.19 9.31 0.04 27.79 2.97 29.12 2.84 25.58 ... NaN NaN 4.76 0.23 NaN NaN 0.00 1.44 NaN NaN
1 gtx1060 157.05 0.06 26.50 0.10 130.63 0.40 111.23 1.11 83.82 ... NaN NaN 92.06 8.70 NaN NaN 0.00 0.82 NaN NaN
2 skl 206.71 5.87 192.05 18.31 181.56 35.38 161.75 13.00 118.06 ... 0.00 0.26 17.28 2.32 37.93 4.14 22.70 2.11 28.52 2.10
3 knl 393.15 22.19 141.36 6.63 239.04 17.02 172.69 26.80 126.04 ... 9.16 0.09 54.83 1.79 119.59 5.14 9.93 0.70 52.67 3.72
4 p100 550.51 1.23 375.61 1.94 293.25 7.11 238.99 12.63 208.44 ... 0.00 0.27 50.89 7.06 51.67 0.59 26.23 0.05 54.40 0.35
5 titanXp 431.24 3.45 61.37 0.12 372.85 4.16 308.92 9.47 246.73 ... NaN NaN 44.37 5.15 NaN NaN 2.38 0.57 NaN NaN
6 v100 846.42 0.95 610.15 5.99 794.43 20.52 735.42 33.02 696.49 ... 0.00 0.31 88.49 4.68 97.58 0.79 4.20 0.02 37.19 0.42

7 rows × 25 columns

Observations

  • Note the high latency in the KNL MPI implementation of dxdy. It seems to suffer from the same problem as the GPUs (is this the PCIe speed we see?).

#index = ['i5','gtx1060','titanXp', 'skl','knl','p100','v100']  

lines = []
for arch in  index: 
    line = []        
    for n in ['axpby','dot','dxdy2','dxdy3','dxdy4','dxdy5']:
        bw  = arr.loc[(arch,1),(n,'bw','avg')]
        bw_err = arr.loc[(arch,1),(n,'bw','std')]
        lat1 = arr.loc[(arch,1),(n,'lat','corr')]
        lat1_err = arr.loc[(arch,1),(n,'lat','std')]
        line.append( toString(bw) + r" $\pm$ " + toString(bw_err))
        line.append( toString(lat1) + r" $\pm$ " + toString(lat1_err) )
        if arch == 'i5' or arch == 'gtx1060' or arch == 'titanXp':
            line.append(toString(None))
        else:
            if (n == 'dot') or (n == 'axpby'):
                lat4 = arr.loc[(arch,4),(n,'lat','corr')]
                lat4_err = arr.loc[(arch,4),(n,'lat','std')]
            else:
                lat4 = arr.loc[(arch,4),('dxdy2','lat','corr')]
                lat4_err = arr.loc[(arch,4),('dxdy2','lat','std')]  
            line.append( toString(lat4) + r" $\pm$ " + toString(lat4_err))

                
    lines.append(line)
    
tuples=[]  


for p in ['axpby','dot','B(P=2)','B(P=3)','B(P=4)','B(P=5)']:
    #for q in ['efficiency [\% bw]','lat s [us]','lat d [us]']:
    for q in ['bandwidth [GB/s]', r'$T_{lat}(1)$ [$\mu$s]', r'$T_{lat}(4)$ [$\mu$s]']:
        tuples.append((p,q))
tuples[0] = ('axpby','bandwidth [GB/s]')
    

cols=pd.MultiIndex.from_tuples(tuples)

toDisk = pd.DataFrame(lines, index=index, columns=cols)
theo = [34,192,256,'>400',732,547,898]
toDisk.insert(0,('peak','bandwidth [GB/s]'),theo)
pd.set_option('display.float_format', lambda x: '%.0f' % x)
filename='axpby-dot.tex'
with open(filename, 'wb') as f:
    f.write(bytes(toDisk.iloc[:,0:7].to_latex(
        escape=False,
        column_format='@{}lp{1.5cm}p{1.5cm}p{1.2cm}p{1.2cm}p{1.5cm}p{1.2cm}p{1.4cm}@{}',
        bold_rows=True),'UTF-8'))
toDisk.iloc[:,0:7]
peak axpby dot
bandwidth [GB/s] bandwidth [GB/s] $T_{lat}(1)$ [$\mu$s] $T_{lat}(4)$ [$\mu$s] bandwidth [GB/s] $T_{lat}(1)$ [$\mu$s] $T_{lat}(4)$ [$\mu$s]
i5 34 30 $\pm$ 01 00 $\pm$ 02 n/a 10 $\pm$ 01 05 $\pm$ 01 n/a
gtx1060 192 158 $\pm$ 01 00 $\pm$ 01 n/a 27 $\pm$ 01 93 $\pm$ 09 n/a
skl 256 207 $\pm$ 06 00 $\pm$ 01 00 $\pm$ 01 193 $\pm$ 19 18 $\pm$ 03 38 $\pm$ 05
knl >400 394 $\pm$ 23 06 $\pm$ 01 10 $\pm$ 01 142 $\pm$ 07 55 $\pm$ 02 120 $\pm$ 06
p100 732 551 $\pm$ 02 08 $\pm$ 01 00 $\pm$ 01 376 $\pm$ 02 51 $\pm$ 08 52 $\pm$ 01
titanXp 547 432 $\pm$ 04 00 $\pm$ 01 n/a 62 $\pm$ 01 45 $\pm$ 06 n/a
v100 898 847 $\pm$ 01 01 $\pm$ 01 00 $\pm$ 01 611 $\pm$ 06 89 $\pm$ 05 98 $\pm$ 01
dxdy = toDisk.loc[:,[('B(P=2)','bandwidth [GB/s]'),
                     ('B(P=3)','bandwidth [GB/s]'),
                     ('B(P=4)','bandwidth [GB/s]'),
                     ('B(P=5)','bandwidth [GB/s]'),
                     ('B(P=2)', r'$T_{lat}(1)$ [$\mu$s]'),
                     ('B(P=2)', r'$T_{lat}(4)$ [$\mu$s]'),
                    ]]
dxdy.columns = ['B(P=2) [GB/s]','B(P=3) [GB/s]','B(P=4) [GB/s]','B(P=5) [GB/s]',
                r'$T_{lat}(1)$ [$\mu$s]', r'$T_{lat}(4)$ [$\mu$s]']

filename='dxdy.tex'
with open(filename, 'wb') as f:
    f.write(bytes(dxdy.to_latex(
        escape=False,
        column_format='lp{1.5cm}p{1.5cm}p{1.5cm}p{1.5cm}p{1.2cm}p{1.2cm}',
        bold_rows=True),'UTF-8'))
dxdy
B(P=2) [GB/s] B(P=3) [GB/s] B(P=4) [GB/s] B(P=5) [GB/s] $T_{lat}(1)$ [$\mu$s] $T_{lat}(4)$ [$\mu$s]
i5 28 $\pm$ 03 30 $\pm$ 03 26 $\pm$ 02 22 $\pm$ 02 00 $\pm$ 02 n/a
gtx1060 131 $\pm$ 01 112 $\pm$ 02 84 $\pm$ 14 70 $\pm$ 18 00 $\pm$ 01 n/a
skl 182 $\pm$ 36 162 $\pm$ 13 119 $\pm$ 19 111 $\pm$ 09 23 $\pm$ 03 29 $\pm$ 03
knl 240 $\pm$ 18 173 $\pm$ 27 127 $\pm$ 19 102 $\pm$ 15 10 $\pm$ 01 53 $\pm$ 04
p100 294 $\pm$ 08 239 $\pm$ 13 209 $\pm$ 08 172 $\pm$ 18 27 $\pm$ 01 55 $\pm$ 01
titanXp 373 $\pm$ 05 309 $\pm$ 10 247 $\pm$ 08 198 $\pm$ 32 03 $\pm$ 01 n/a
v100 795 $\pm$ 21 736 $\pm$ 34 697 $\pm$ 16 590 $\pm$ 66 05 $\pm$ 01 38 $\pm$ 01