Gauge Updates (HB and OR)

The pure gauge update class contains both heat bath (HB) and over-relaxation (OR) updating for pure gauge fields. The heat bath is implemented using the Kennedy-Pendleton algorithm, which is extended from SU(2) to SU(3) via the method of Cabibbo and Marinari.
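To make the heat bath step more concrete, here is a minimal CPU-only sketch of the textbook Kennedy-Pendleton sampling of one SU(2) element, written as a unit 4-vector (a0, a1, a2, a3). It is not the library's GPU implementation. The names `sampleA0` and `sampleSU2` are hypothetical, and `alpha` stands for the positive coefficient in the target density sqrt(1 - a0^2) exp(alpha a0); how `alpha` is built from beta and the local staple, and how the SU(2) element is embedded into SU(3) subgroups in the Cabibbo-Marinari scheme, is assumed to happen elsewhere and is not shown.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <random>

// Hypothetical helper: draw a0 in [-1, 1] with density proportional to
// sqrt(1 - a0^2) * exp(alpha * a0), using Kennedy-Pendleton rejection sampling.
template <class Rng>
double sampleA0(double alpha, Rng& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const double twoPi = 6.283185307179586;
    while (true) {
        // 1 - u maps [0, 1) to (0, 1], which keeps the logarithms finite.
        const double r1 = 1.0 - uni(rng);
        const double r2 = uni(rng);
        const double r3 = 1.0 - uni(rng);
        const double r4 = uni(rng);
        const double c  = std::cos(twoPi * r2);
        // lambda^2 follows a density proportional to lambda^2 * exp(-2 alpha lambda^2).
        const double lambda2 = -(std::log(r1) + c * c * std::log(r3)) / (2.0 * alpha);
        // Accept with probability sqrt(1 - lambda^2); lambda^2 > 1 is always rejected.
        if (r4 * r4 <= 1.0 - lambda2) {
            return 1.0 - 2.0 * lambda2;
        }
    }
}

// Hypothetical helper: complete a0 to a unit 4-vector by choosing (a1, a2, a3)
// uniformly on the sphere of radius sqrt(1 - a0^2).
template <class Rng>
std::array<double, 4> sampleSU2(double alpha, Rng& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const double twoPi = 6.283185307179586;
    const double a0   = sampleA0(alpha, rng);
    const double r    = std::sqrt(std::max(0.0, 1.0 - a0 * a0));
    const double cosT = 2.0 * uni(rng) - 1.0;
    const double sinT = std::sqrt(std::max(0.0, 1.0 - cosT * cosT));
    const double phi  = twoPi * uni(rng);
    return {a0, r * sinT * std::cos(phi), r * sinT * std::sin(phi), r * cosT};
}
```

Calling `sampleSU2(alpha, rng)` with, for example, a `std::mt19937_64` generator returns a vector of unit length. The rejection step becomes more efficient as `alpha` grows, which is the regime the Kennedy-Pendleton method was designed for.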

To use this class, include the src/gauge/pureGaugeUpdates.h header file and add src/gauge/pureGaugeUpdates.cpp as a source file for your program in CMakeLists.txt. The pure gauge update class is initialized with, for example,

GaugeUpdate<PREC,true,HaloDepth>    gUpdate(gauge);

and one OR sweep of the entire lattice can be performed with

gUpdate.updateOR();

The HB update requires you to initialize a random number generator on the host, then pass the state to the device. This can be done by

int seed = 0;
grnd_state<false> host_state;
grnd_state<true>  dev_state;
host_state.make_rng_state(seed);
dev_state = host_state;

More information about the random number generator can be found in Random Number Generator. The state is then passed as an argument to the HB function as

gUpdate.updateHB(dev_state.state,beta);
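The two calls can be combined into a sweep loop. The sketch below uses the pattern of the benchmarks in this section (1 HB update with 4 OR updates per sweep); doing the HB update first is an assumption. `runSweeps` and `CountingUpdater` are hypothetical names, not part of SIMULATeQCD. The stand-in updater only counts calls so that the example compiles on its own; the helper is templated so that any object exposing `updateHB(state, beta)` and `updateOR()` could be passed instead.

```cpp
// Hypothetical helper (not part of SIMULATeQCD): run nSweeps sweeps, each
// consisting of one heat bath update and nORperHB over-relaxation updates.
// The ordering HB first, then OR, is an assumption of this sketch.
template <class Updater, class RngState>
void runSweeps(Updater& updater, RngState rngState, double beta,
               int nSweeps, int nORperHB) {
    for (int sweep = 0; sweep < nSweeps; ++sweep) {
        updater.updateHB(rngState, beta);
        for (int i = 0; i < nORperHB; ++i) {
            updater.updateOR();
        }
    }
}

// Stand-in updater for illustration only: it records how often each
// update is requested instead of touching a gauge field.
struct CountingUpdater {
    int nHB = 0;
    int nOR = 0;
    double lastBeta = 0.0;
    void updateHB(int /*rngState*/, double beta) { ++nHB; lastBeta = beta; }
    void updateOR() { ++nOR; }
};
```

With the objects from the snippets above, a call of the form `runSweeps(gUpdate, dev_state.state, beta, 50, 4)` would correspond to one benchmark timing, assuming the argument types convert as expected.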

Some benchmarks

The following benchmarks use HaloDepth=1. Each sweep consists of 1 HB with 4 OR updates. Times are given in ms, with the uncertainty in the last digits in parentheses. Each timing uses 50 sweeps, and each number is the average time over 3 to 4 test runs. Timing was done with the SIMULATeQCD code's built-in timer. Only hyperplanes and planes are communicated. The tests were originally carried out on NVIDIA Pascal GPUs, and more tests were carried out later on NVIDIA Volta GPUs. Both sets of results are included to show the improvement from the older hardware to the newer hardware. Attached are plots of the improvement \(I\) versus number of GPUs for both machines, where the improvement is defined as \(I=\frac{\text{number of GPUs}}{\text{time}/\text{1 GPU time}}\).
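As a worked example of the improvement defined above, the following sketch evaluates \(I\) for given timings. The function name `improvement` is hypothetical. The usage note reads the Volta entries "71 603(40)" and "109 018(29)" from the tables below as 71603 ms and 109018 ms, which is an assumption about the digit grouping.

```cpp
#include <stdexcept>

// Hypothetical helper: improvement I = nGPUs / (time / singleGpuTime),
// following the definition in the text. Both times must use the same unit.
double improvement(int nGPUs, double timeMs, double singleGpuTimeMs) {
    if (nGPUs <= 0 || timeMs <= 0.0 || singleGpuTimeMs <= 0.0) {
        throw std::invalid_argument("nGPUs and both times must be positive");
    }
    return nGPUs / (timeMs / singleGpuTimeMs);
}
```

Under that reading, `improvement(2, 109018.0, 71603.0)` gives approximately 1.31 for the 2-GPU t split on Volta. Since the lattice grows with the number of GPUs in these tests, ideal scaling would correspond to \(I\) equal to the number of GPUs.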

Pascal GPU 16 GB

1 processor: \(68^4\):

| no split |
| --- |
| 106 855(3) |

2 processors: \(136\times68^3\):

| x split | y split | z split | t split |
| --- | --- | --- | --- |
| 171 726(8) | 154 139(8) | 152 932(3) | 152 064(5) |

4 processors: \(136^2\times68^2\):

| xy split | xz split | xt split | yz split | yt split | zt split |
| --- | --- | --- | --- | --- | --- |
| 179 900(4) | 179 590(370) | 178 833(7) | 163 206(2) | 163 480(220) | 162 950(500) |

Volta GPU 32 GB

1 processor: \(68^4\):

| no split |
| --- |
| 71 603(40) |

2 processors: \(136\times68^3\):

| x split | y split | z split | t split |
| --- | --- | --- | --- |
| 135 389(12) | 110 609(13) | 109 598(16) | 109 018(29) |

4 processors: \(136^2\times68^2\):

| xy split | xz split | xt split | yz split | yt split | zt split |
| --- | --- | --- | --- | --- | --- |
| 143 664(16) | 143 432(16) | 143 138(11) | 120 595(34) | 120 581(12) | 120 180(15) |

4 processors: \(272\times68^3\):

| x split | y split | z split | t split |
| --- | --- | --- | --- |
| 134 423(17) | 99 072(30) | 98 397(22) | 97 810(29) |