# Inverter (Conjugate Gradient) The Conjugate Gradient (CG) inverter solves the equation $Ax=b$, where $A$ is a symmetric and positive definite matrix, $b$ an input vector and $x$ the solution vector. Different versions of the CG inverter are implemented in our code in the two classes `ConjugateGradient` and `AdvancedMultiShiftCG`. The matrix $A$ in our case is, up to now, always $M^{\dagger}M$, with $M$ being the fermion matrix. ## The Basic Algorithm Given an initial guess $x_0$ and a target residual $\epsilon$, compute $r_0=b-Ax$ and set $p_0=r_0$. > **while** $|r_{i+1}|^2 < \epsilon$ \ > $\qquad \alpha_{i} = \frac{|r_i|^2}{p^{\dagger}Ap}$\ > $\qquad x_{i+1} = x_{i} + \alpha_{i}p_{i}$\ > $\qquad r_{i+1} = r_{i} - \alpha_{i}r_{i}$\ > $\qquad \beta_{i} = \frac{|r_{i+1}|^2}{|r_i|^2}$\ > $\qquad p_{i+1} = r_{i+1} + \beta_{i}p_{i}$\ > $\qquad i = i+1$\ > **end while**\ > **return** $x_{i+1}$ ## Residual drift and precision In exact arithmetic, the iteratively computed residual $r_i$ is always equivalent to the true residual $r=b-Ax_{i}$. Due to rounding errors in floating point calculations, this is not true in practical use and the iterated residual $r_i$ will drift away from the true residual $r=b-Ax_{i}$. The magnitude of this drift depends on the floating point precision that you are using and the number of iterations the CG needs until convergence. This problem can be resolved by replacing the iterated residual with the exact residual occasionally. ## Multi-RHS and Multi-shift Two important improvements of the basic CG algorithm that we employ in our code are the use of multiple right-hand sides and multiple shifts. These two optimizations cannot be used at the same time. The multi rhs version of the CG solves $Ax_{k}=b_{k}$ for multiple "right-hand sides" $b_k$ simultaneously. This can improve performance significantly since the most expensive part in a CG iteration is the matrix-vector product $Ap$ that enters $\alpha$. The performance of this matrix-vector product, the Dslash kernel, is limited by the GPU's memory bandwidth. By applying the same matrix $A$ to multiple vectors at once, $A$ only needs to be loaded once, thereby saving memory bandwidth and increasing performance. The multi-shift CG solves $\left(A+\sigma_{k}\mathbb{1}\right)x_{k}=b$ with multiple shifts $\sigma_{k}$. Without going into too much technical details, the multi-shift CG essentially solves the system $\left(A+\sigma_{0}\mathbb{1}\right)x_0=b$ with the smallest shift $\sigma_0$ with the basic CG algorithm and computes the solutions for the larger shifts through some smart linear algebra on the way, thereby solving multiple problems at the cost of only one. ## The ConjugateGradient Class This class implements the CG algorithm shown above and supports using multiple right hand sides. It has two template parameters, `floatT` and `NStacks` that define the floating point type to be used and the number of right hand sides, respectively. ```C++ ConjugateGradient cg; ``` The member functions of ConjugateGradient are different flavors of the basic algorithm listed above. Specifically, we have ```C++ template void invert(LinearOperator& dslash, Spinor_t& spinorOut, Spinor_t& spinorIn, int max_iter, double precision); ``` which implements a multi-rhs CG with an absolute stopping criterion, i.e. the algorithm stops when $|r|^2<\mathrm{precision}$. The first argument `LinearOperator& dslash` can be any class that has the function `void applyMdaggM(Spinor_t&, Spinor_t&, bool update)` defined. In our case, this will usually be some dslash class. The second and third arguments are output and input `Spinorfields`. `max_iter` is the maximum number of iterations after which the CG exits and `precision` is the target residual used in the stopping criterion. ```C++ template void invert_new(LinearOperator& dslash, Spinor_t& spinorOut, const Spinor_t& spinorIn, const int max_iter, const double precision); ``` invert_new does the same as invert but uses a relative stopping criterion, i.e. the algorithm stops when $\frac{|r|^2}{|r_{0}|^2}<\mathrm{precision}$, where $r_0$ is the starting residual. ```C++ template void invert_res_replace(LinearOperator& dslash, Spinor_t& spinorOut, const Spinor_t& spinorIn, const int max_iter, const double precision); ``` invert_res_replace employs a residual replacement strategy where the iterated residual is replaced by the exact residual occasionally. ## The AdvancedMultiShiftCG Class This class implements the multi-shift CG algorithm. The template parameters are the same as for the ConjugateGradient class. The member function ```C++ template void invert(LinearOperator& dslash, SpinorOut_t& spinorOut, const SpinorIn_t& spinorIn, SimpleArray sigma, const int max_iter, const double precision); ``` performs the multi-shift inversion. In contrast to ConjugateGradient::invert, this function has two template parameters, one for the input spinor type and one for the output spinor type. The input spinor is a `Spinorfield` with `NStacks=1` and the output spinor is a `Spinorfield` where `NStacks` matches the number of shifts. The shifts are passed to the function via `SimpleArray sigma`. ## Mixed precision The time to solution of the CG can be significantly decreased by using mixed precision approaches. Since it is an iterative method, we can use half precision floating point arithmetic for the bulk part of the computation and inject full precision residuals occasionally to correct for rounding errors accumulated along the way. This is implemented in the member function `invert_mixed` of `ConjugateGradient`: ```C++ template void invert_mixed(LinearOperator& dslash, LinearOperator& dslash_inner, Spinor_t& spinorOut, const Spinor_t& spinorIn, const int max_iter, const double precision, double delta); ``` `void invert_mixed` recomputes a full precision true residual if the norm of the current residual decreased by a factor $\frac{1}{\delta}$ compared to the residual from the last restart. It then resets the CG by re-projecting the gradient vector $p_{i}$ such that it is orthogonal to the new, true residual. In contrast to the other invert methods, one has to pass an additional, lower precision dslash operator to these methods. An example how to use the mixed precision inverters is given below ```C++ //declare target precision gauge fields and half precision gauge fields Gaugefield gauge_smeared(commBase); Gaugefield gauge_naik(commBase); Gaugefield gauge_smeared_lowprec(commBase); Gaugefield gauge_naik_lowprec(commBase); //declare spinors Spinorfield spinorIn(commBase); Spinorfield spinorOut(commBase); //declare CG, regular dslash and half precision dslash ConjugateGradient cg; HisqDSlash dslash(gauge_smeared, gauge_naik, param.m_ud()); HisqDSlash dslash_lowprec(gauge_smeared_lowprec, gauge_naik_lowprec, param.m_ud()); //do stuff with the regular gaugefield and spinors, like smearing and some physics //copy results into half prec gauge fields. THIS STEP IS IMPORTANT! gauge_smeared_lowprec.convert_precision(gauge_smeared); gauge_naik_lowprec.convert_precision(gauge_naik); //invert using floatT-floatT_inner CG solver cg.invert_mixed(dslash, dslash_lowprec, spinorOut, spinorIn, param.cgMax(), param.residue(), param.cgMixedPrec_delta()); ```