# Adding Spin Polarisation and van der Waals Energy Correction to Conquest


This report describes the work done in the distributed Computational Science and Engineering (dCSE) project aimed to extend functionality of the linear scaling ab initio Density Functional Theory code Conquest by adding spin polarisation and van der Waals functional described by works of Dion et al.[3] and Roman-Perez et al.[15]. This work would allow Conquest to perform simulations on magnetic systems and on bio-molecular systems in which van der Waals interactions are essential. The implementations has been successful.

## 1 Introduction

There are various classes of important problems in biochemistry which rely quantum mechanical properties of the biological systems dominate. Due the complexity of biomolecules and the importance of environment (such as the presence of water), isolated quantum mechanical studies on a small part of the molecule may not be sufficient in reproducing and understanding the key properties of the molecule. The ideal quantum mechanical method would allow simulation of a large part of the biomolecule (or even the entire molecule) inside an environment in a highly scalable manner on modern high performance computation platforms. The method should also correctly model the systems which contain transition metal complexes and van dar Waals interactions.

Conquest is a linear scaling Density Functional Theory (DFT) code which has demonstrated excellent scaling for computation on large number of cores. The good scaling makes it an appealing platform for large scale quantum mechanical studies on biomolecules. The aim of this dCSE project is to extend the infrastructure of Conquest to include spin polarisation, and to incorporate a first principle density functional for van der Waals interactions[3].

This project aims to add the collinear form of spin polarisation into Conquest. This means there are two spin states ($$\up$$ and $$\dn$$). Conquest was originally written for spin non-polarised calculations only, and the data structures in the code reflect this. For this reason a significant portion of the code has to be updated to accommodate the extra degrees of freedom introduced by spin. Never-the-less, linear scaling is generally not effected by adding spin. This is because the spin implementation always uses the existing Conquest kernels for distributing data and computation, so the underlying communication and computation mechanisms have not been changed. Since the implementation of spin spreads through-out the code and involves some changes to data structure while the van der Waals functional can be written within additional modules and uses spin dependent quantities, it was decided that spin implementation should be done before the work on van der Waals functional.

There are two flavours of spin polarised calculation implemented for this project:

1. Calculations with spin populations fixed to user given values through-out.
2. Calculations that start with spin populations set by the user, but then are allowed to change—while still keeping the total electron number fixed—and self-adjust to give minimum energy. The final difference of the spin populations will be the relaxed spin polarisation of the system.

In contrast to the implementation of spin polarisation, van der Waals functional can be implemented as a relatively self-contained module. The implementation follows closely to the work done by Dion et al.[3] and Roman-Perez et al.[15].

This report describes the work done for implementing the updates to Conquest mentioned above. Since this work is oriented around extending features and functionality of the code, most resources has been allocated to code development rather than profiling and optimisation. For work on spin since large parts of the code has been updated, and most involve simply adding a loop in the calculation for the second spin component. It would be impractical and tedious to include in this report all the changes made to the code. I will therefore only give an account of the relatively important parts of the modification.

## 2 Implementing Spin Polarisation

Conquest before the project started only handles spin non-polarised calculations. This means that only the matrices and grid data associated to one spin component are stored and calculated, and any extensive quantities are calculated with an additional factor of 2 knowing the contribution from spin up and down components are identical. The aim of this dCSE project is to extend this to allow collinear1 spin polarisation calculations, where physical quantities associated to each spin channels ($$\up$$ and $$\dn$$) are calculated and are allowed to differ with the constraint that total number of electrons has to be conserved.

Conquest uses a set of optimised support functions as a basis-set to give matrix representations of the various quantum mechanical operators. The support functions them-selves are generated by using linear combination of a set of basis functions either in the form of psudo-atomic orbitals (PAOs) or localised blip (B-spline) functions[2]. Each support function centers on a given atom and only basis functions within a given range contributes to the support function. This ensures strict locality of the support functions and hence making linear-ordering calculations possible. The coefficients of the linear combinations are then optimised during the energy minimisation step to ensure the minimum energy is found with the given set of basis functions.

Ideally for spin polarised calculations, in addition to separate matrices and data on grid for physical quantities in each spin channel, one would also benefit from making the support functions them-selves spin dependent. This offers more flexibility in optimising the basis coefficients and hence leads to more optimised ground-state energies. However allowing the support functions to be spin dependent would mean rewriting a significant portion of Conquest given virtually all subroutines and quantities in the code except the core linear algebra, FFT and MPI communication kernels depend or operate on the support functions and their derived quantities. Therefore within the given time-frame of the project, we have decided to use the existing support functions in Conquest for both spin channels.

### 2.1 Changes To Data Structure

There are in general three types of data storage for physical quantities in Conquest:

• Matrices: These store the overlap of the support functions and matrix representations of the quantum operators in a compressed (sparse) format and distributes them to processors with respect to the atomic labels in rows. Conquest stores the matrices in a stack and refers to each matrix using an unique integer tag.
• Functions-on-grid: These store the support functions them-selves and Kleinman-Bylander (K.B.) non local pseudopotential projectors (including those linearly operated by a matrix), which are functions of a given label (be it support function label or KB projector label) defined on every grid point. Conquest distributes these functions according to the real-space grid point blocks, so that each processor gets a subset of such functions (with values associated to all atomic and support labels) on a set of real-space grid points (organised in blocks) it is in charge of. Again the data associated to the functions are stored in a stack and there is an unique integer tag referring to each function. Note that the set of all support functions is considered to be a single "function on grid".
• Numbers-on-grid: These store the function of the form $$f(\vec{r})$$ defined on the real-space or FFT grid. They differ from "functions on grid" in that there is no other index dependencies other than that associated to the grid point. These are stored as ordinary arrays of real numbers. Conquest distributes them among the processors again according to the real-space grid point blocks.

For collinear spin polarised calculations the matrices are given as direct sums of its spin components: \begin{equation*} \mat{M} = \mat{M}^\up \oplus \mat{M}^\dn \equiv \begin{pmatrix} \mat{M}^\up & 0 \\ 0 & \mat{M}^\dn \end{pmatrix} \end{equation*} Hence the matrices have the properties: ($$\alpha, \beta \in \mathbb{C}$$) \begin{align*} (\alpha\mat{A} + \beta\mat{C})\mat{B} = \bigl( (\alpha\mat{A}^\up + \beta\mat{C}^{\up}) \mat{B}^\up \bigr) \oplus \bigl( (\alpha\mat{A}^\dn + \beta\mat{C}^{\dn}) \mat{B}^\dn \bigr) \end{align*}

The following global matrices in Conquest are spin dependent:

 Matrix Conquest Variable Description $$\mat{L}$$ matL auxiliary matrix, all indices contravariant $$\mat{K}$$ matK density matrix, all indices contravariant $$\mat{H}$$ matH Hamiltonian matrix, all indices covariant $$\mat{L}\mat{S}$$ matLS auxiliary matrix, second index covariant $$\mat{S}\mat{L}$$ matSL auxiliary matrix, first index covariant $$\mat{\Phi}$$ matphi derivative of electron no. w.r.t. auxiliary matrix $$\mat{M}_{12}$$ matM12 part of derivative of energy w.r.t. support functions $$\mat{M}_{4}$$ matM4 part of derivative of electron no. w.r.t. support functions

All of these matrices were originally implemented in Conquest as integer variables storing the tag to the internal matrix stack. For spin polarised calculations, we need a second copy for the other spin component. We also need to make sure that there will be no memory wastage when doing spin non-polarised calculations and only one set of the above matrices are allocated and used. The easiest way to achieve this is to add a global control variable nspin which can be either 1 for spin non-polarised calculations or 2 for spin polarised calculations. Then the matrices in the above list (and all other similar matrices used as local variables within subroutines) are redefined as allocatable integer arrays of dimension nspin. Hence once allocated and initialised at the beginning of execution, matM(1) corresponds to $$\mat{M}^\up$$ and matM(2) corresponds to $$\mat{M}^\dn$$.

Conquest was not consistent in the definition of the spin non-polarised matrices in its original implementation. For example, matL and matH were defined as $$\mat{L}^\up$$ and $$\mat{H}^\up$$ correspondingly, however matphi was defined as $$2 \mat{\Phi}^\up = \sum_\sigma \mat{\Phi}^{\sigma}$$ ($$\sigma$$ denotes spin). To be consistent and avoid future confusion—that leads to bugs, I modified this definition so that matM(1) always refers to the spin up component of the matrix $$\mat{M}$$ irrespective to whether the calculation is spin polarised or not.

As we have discussed in section Implementing Spin Polarisation the overlap matrix $$\mat{S} = S_{i\alpha j\beta} = \braket{\phi_{i\alpha}}{\phi_{j\beta}}$$—where $$\phi_{i\alpha}(\vec{r})$$ are the support functions—is treated as spin independent.

Extensive quantities such as the energy and electron numbers are calculated as traces of spin dependent matrices. The trace of any matrix $$\mat{M}$$ is simply the sum of the trace of its spin components: \begin{equation*} \tr{\mat{M}} = \tr\mat{M}^\up + \tr\mat{M}^\dn \end{equation*} To take into account that only one of the spin component is stored for spin non-polarised calculations, I have introduced a factor spin_factor which is set to 1.0 for spin polarised calculations and 2.0 otherwise. so that the above sum is performed in the code as follows:

value = zero
do spin = 1, nspin
value = value + spin_factor * trace(matM(spin))
end do


Conquest stores three functions-on-grid as global variables available through using functions_on_grid module. They correspond to the support functions, the result of Hamiltonian matrix acting on the support functions (H_on_support) and the pseudopotential functions. Only H_on_support should be spin dependent as the Hamiltonian matrix is spin dependent. Therefore following the same method for matrices, I redefined the integer variable H_on_support (which stores the tag referencing the functions-on-grid object in the internal stack) as an array of dimension 2. Because of the way the functions on grid reference tags are implemented, it would be difficult to allow the dimension of H_on_support to vary according to value of nspin. However to save memory for non-spin polarised calculations, there will still only one functions-on-grid object allocated in the stack for H_on_support and only H_on_support(1) references to the correct location in the stack.

Spin dependent numbers-on-grid global quantities in Conquest are density, and potential, both are redefined as allocatable real arrays of dimension (maxngrid, nspin). maxngrid stores the maximum number of grid points on the node. Again to avoid confusion, density(1) always stores the electron density for the spin up channel irrespective to whether the calculation is spin polarised or not. The original Conquest had a factor of 2 contained inside density so that it gives the total density. This has been redefined after the spin implementations.

### 2.2 Order $$N$$: Spin Polarised Energy Minimisation

Under linear scaling scheme, Conquest finds the ground-state energy of a system by minimising band-energy: \begin{equation*} E = \tr(\mat{H}\mat{K}) = \sum_\sigma\tr(\mat{H}^{\sigma}\mat{K}^{\sigma}) \end{equation*} by varying the auxiliary matrices $$\mat{L}^{\sigma}$$ under the constraint on the conservation number of electrons. The density matrix $$\mat{K}^{\sigma}$$ is defined in terms of the auxiliary and overlap matrices \begin{equation*} \mat{K}^{\sigma} = 3\mat{L}^{\sigma}\mat{S}\mat{L}^{\sigma} - 2\mat{L}^{\sigma}\mat{S}\mat{L}^{\sigma}\mat{S}\mat{L}^{\sigma} \end{equation*}

The minimisation is an iterative process by working out a search direction based on the gradient of the energy with respect to $$\mat{L}^\sigma$$; propagate the auxiliary matrix along the search direction until a local minimum is found; and a new search direction is worked out from there and repeat the process. At $$n + 1$$-th step, the auxiliary matrix (written as a direct sum) may be written as \begin{equation*} \mat{L}_{n+1} = \mat{L}_n + \lambda_n \mat{G}_n \end{equation*} For spin polarised calculations we treat the problem as a minimisation procedure on the direct sum of the spin dependent matrices.

The search direction $$\mat{G}_n = \mat{G}^\up_n \oplus \mat{G}^\dn_n$$ is calculated to be the direction opposite to the slope in energy surface $$\pdiff{E}{\mat{L}^{\sigma}}$$ and then corrected so that it remains on the tangent plane to the iso-electron-number surface. The step $$\lambda_n$$ is calculated by minimising E along $$\mat{G}_n$$.

There are two choices for the search direction, which in turn define the type of problem we wish to calculate.

1. The spin populations (i.e. total magnetisation) are fixed. In this case the search direction in each spin channel is corrected so that the it stays on the tangent plane of the iso-electron-number surface for the particular spin: \begin{align*} \mat{G}_n^\up & = - \pdiff{E}{\mat{L}^\up} + \frac{ \left( \pdiff{E}{\mat{L}^\up} \cdot \pdiff{N}{\mat{L}^\up} \right) }{ \Norm{\pdiff{N}{\mat{L}^\up}}^2 } \pdiff{N}{\mat{L}^\up} \\ \mat{G}_n^\dn & = - \pdiff{E}{\mat{L}^\dn} + \frac{ \left( \pdiff{E}{\mat{L}^\dn} \cdot \pdiff{N}{\mat{L}^\dn} \right) }{ \Norm{\pdiff{N}{\mat{L}^\dn}}^2 } \pdiff{N}{\mat{L}^\dn} \end{align*}
2. The spin populations are allowed to vary, but the total electron number $$N = N^\up + N^\dn$$ are required to be fixed. In this case the search direction is corrected so that the direct sum $$\mat{G}$$ needs to stay on the tangent plane $$\pdiff{N}{\mat{L}}$$ of the iso-surface of the total electron number. Hence \begin{align*} \mat{G}_n^\up & = - \pdiff{E}{\mat{L}^\up} + \frac{ \left( \pdiff{E}{\mat{L}} \cdot \pdiff{N}{\mat{L}} \right) }{ \Norm{\pdiff{N}{\mat{L}}}^2 } \pdiff{N}{\mat{L}^\up} \\ \mat{G}_n^\dn & = - \pdiff{E}{\mat{L}^\dn} + \frac{ \left( \pdiff{E}{\mat{L}} \cdot \pdiff{N}{\mat{L}} \right) }{ \Norm{\pdiff{N}{\mat{L}}}^2 } \pdiff{N}{\mat{L}^\dn} \end{align*} where $$\pdiff{E}{\mat{L}} \cdot \pdiff{N}{\mat{L}} \equiv \pdiff{E}{\mat{L}^\up} \cdot \pdiff{N}{\mat{L}^\up} + \pdiff{E}{\mat{L}^\dn} \cdot \pdiff{N}{\mat{L}^\dn}$$ and $$\pdiff{N}{\mat{L}} \cdot \pdiff{N}{\mat{L}} \equiv \pdiff{N}{\mat{L}^\up} \cdot \pdiff{N}{\mat{L}^\up} + \pdiff{N}{\mat{L}^\dn} \cdot \pdiff{N}{\mat{L}^\dn}$$.

Note that in Conquest $$\pdiff{N}{\mat{L}}$$ is named as $$\mat{\Phi}$$ and stored in matphi. And the dot product $$\mat{A} \cdot \mat{B}$$ between matrices is defined as trace of their product $$\tr(\mat{A}\mat{B})$$ and norm of a matrix is defined as the square root of its own dot product $$\norm{\mat{A}} = \sqrt{\mat{A} \cdot \mat{A}}$$.

The difference in the definition of search direction is crucial for calculations where spin is allowed to vary, because the constraint on total electron number means electron populations within each channel are allowed to readjust, whereas the definition of $$\mat{G}$$ in the first case (fixed spin population) prohibits this.

Special attention should be paid to the "step-size" $$\lambda_n$$. While for spin polarised case it is possible to make $$\lambda_n = \lambda_n^\sigma$$ to be spin dependent, it is important to bear in mind that each minimisation step is should be a linear propagation of the auxiliary matrix (direct sum of the spin components), and making the propagation step spin dependent breaks this linearity.

#### 2.2.1 Pulay Minimisation

To find the minimum band energy the procedure $$\mat{L}_{n+1} + \lambda_n \mat{G}_n$$ is repeated until $$\mat{G}$$ becomes less than a given tolerance. One can readily see that this procedure is identical to the linear mixing method used for calculating self-consistent densities. This means we can employ the Pulay methods[14][6] for making the minimisation faster. Conquest implements this by regarding $$\mat{G}$$ as the residual.

The key to extending the implementation to spin polarised calculations is to understand that the minimisation process sees the over all direct sum of the spin matrices as the minimisation variable, not its individual spin components. Therefore the correct form to use for magnitude for the residual is given as $$\norm{\mat{G}}$$, or for more simplicity $$\norm{\mat{G}}^2 = \tr(\mat{G}\mat{G}) = \sum_\sigma \tr(\mat{G}^\sigma\mat{G}^\sigma)$$. Because the Pulay minimisation procedure concerns the direct sum matrix, there is no difference to whether the spin population is fixed or not. The different constraints on electron number are included in the definition of $$\mat{G}^{\sigma}$$.

#### 2.2.2 Constraint on Electron Numbers

While the definition of the search directions ensures the new auxiliary matrices stay on tangent plane of the iso-electron number energy surfaces, this still only conserves the electron numbers up to the first order. It was found that the accuracy and stability of ground-state energy calculations are highly dependent on electron numbers being correct at each minimisation iteration step. Hence Conquest employs a further step to correct $$\mat{L}$$ to give the correct number of electrons.

For spin polarised calculations things are slightly more complex. For the case of fixed spin populations, we want to apply electron number corrections so that populations in each spin channels remain the same as those at the beginning of the calculation. On the other hand for spin relaxation calculations, we would need to allow the electron populations in each spin channels to vary while keeping the total electron number the same. The spin non-polarised calculations are in effect the same as spin-fixed calculations with same populations in each channel, and therefore it is easy to convert the existing electron number correction subroutine in Conquest to work with both spin non-polarised and fixed polarisation cases. For spin relaxation calculations, a new subroutine has to be written.

The basic idea behind the procedure for electron number correction is to find the correct step-size $$x$$ along a search direction so that \begin{equation*} \mat{L}' = \mat{L} + x \mat{\Gamma} \end{equation*} gives the correct number of electrons. For spin-fixed or non-spin polarised case, this means we need to find $$x^\sigma$$ for each spin channel $$\sigma$$ (for spin non-polarised calculations we of course only need to correct electron number for one spin channel as the other is assumed to be identical), so that \begin{equation*} N(\mat{L}'^\sigma) = \tr(3\mat{S}\mat{L}'^{\sigma}\mat{S}\mat{L}'^\sigma) - \tr(2\mat{S}\mat{L}'^{\sigma}\mat{S}\mat{L}'^\sigma\mat{S}\mat{L}'^\sigma) = N^\sigma \end{equation*} where $$N^\sigma$$ are the spin populations defined by user at initialisation. For spin relaxation calculations, we should treat the electron correction process as an linear operator acting on the direct sum $$\mat{L}$$, so that the step $$x$$ should not be spin dependent. We need to find $$x$$ so that \begin{equation*} \sum_\sigma N(\mat{L}'^\sigma) = \sum_\sigma \left( \tr(3\mat{S}\mat{L}'^{\sigma}\mat{S}\mat{L}'^\sigma) - \tr(2\mat{S}\mat{L}'^{\sigma}\mat{S}\mat{L}'^\sigma\mat{S}\mat{L}'^\sigma) \right) = N \end{equation*} where $$N$$ is the total number of electrons in the system under calculation.

Since the energy minimisation step takes $$\mat{L}$$ slightly away from the iso-electron number surface, the search direction thus should be perpendicular to that surface. This means \begin{equation*} \mat{\Gamma}^\sigma = \mat{S}^{-1} \pdiff{N^\sigma}{\mat{L}^{\sigma}} \mat{S}^{-1} \equiv \mat{S}^{-1} \mat{\Phi}^\sigma \mat{S}^{-1} \end{equation*} where $$N^\sigma = N(\mat{L}^{\sigma})$$, and the inverse of overlap matrix appears because the support functions are non-orthogonal. Note that because electron number is an extensive quantity and $$N(\mat{L}) = N(\mat{L}^\up) + N(\mat{L}^\dn) = N^\up + N^\dn$$, $$\pdiff{N}{\mat{L}^\sigma} = \pdiff{N^\sigma}{\mat{L}^\sigma}$$ so that the search direction is the same for both fixed spin and spin relaxation calculations.

The step size $$x$$ (or $$x^\sigma$$ for fixed spin or non-polarised case) is calculated by recognising that $$N(\mat{L} + x\mat{\Gamma})$$ is a cubic polynomial of $$x$$, and $$x$$ (or $$x^\sigma$$) needs to be chosen so that $$N$$ (or $$N^\sigma$$) is stationary—that is its derivative with respect to $$x$$ is zero.

The electron number correction procedure is done after every Pulay energy minimisation step to ensure the electron numbers are correct through out minimisation process.

### 2.3 Order $$N$$: McWeeny Initialisation

For order $$N$$ calculations, before entering the energy minimisation procedures, the auxiliary matrices are first initialised to make sure its eigenvalues are in between 0 and 1. It is then passed through a McWeeny iteration loop[8][1] so that all of the eigenvalues of $$\mat{L}$$ are either 0 or 1. This ensures idempotency of the density matrix, and hence conservation of electron number.

The McWeeny initialisation and iteration procedures used by Conquest is described by the works of Palser and Manolopoulos[8], and extended to non-orthogonal basis sets (support functions) by Bowler[1].

#### 2.3.1 Case for Fixed Spin Populations

To extend the procedure for spin polarised calculations, I noted again that the fixed spin case is equivalent to spin non-polarised case, one only has to repeat the original procedures for the other spin channel. The initial guess for the $$\sigma$$ channel auxiliary matrix thus should be given as \begin{equation*} \mat{L}_0^\sigma = \frac{\lambda^\sigma}{N} \left( \mu^\sigma \mat{S}^{-1} - \mat{S}^{-1} \mat{H}^{\sigma} \mat{S}^{-1} \right) + \frac{N^\sigma_e}{N} \mat{S}^{-1} \end{equation*} where $$N$$ is the dimension of $$\mat{L}^{\sigma}$$ matrix, which equals to the total number of support functions available for the calculation. $$N_e^\sigma$$ is the fixed population for spin channel $$\sigma$$, and \begin{equation*} \lambda^\sigma = \min\left( \frac{N^\sigma_e}{H^\sigma_{\text{max} - \mu^\sigma}}, \frac{N - N^\sigma_e}{\mu^\sigma - H^\sigma_{\text{min}}} \right) \end{equation*} where $$H^\sigma_{\text{max}}$$ and $$H^\sigma_{\text{min}}$$ are bounds to the maximum and minimum eigenvalues of the Hamiltonian component in spin channel $$\sigma$$.

$$\mu^\sigma$$ takes values depending on the value of \begin{equation*} \mu^\sigma = \begin{cases} \frac{\tr(\mat{S}\mat{S}^{-1}\mat{H}^\sigma\mat{S}^{-1})}{N} + A H^\sigma_{\text{max}} & \text{if} \quad H^\sigma_{\text{min}} < \mu^\sigma < H^\sigma_{\text{max}} \\ \frac{\frac{\tr(\mat{S}\mat{S}^{-1}\mat{H}^\sigma\mat{S}^{-1})}{N} - \frac{AN^\sigma_eH^\sigma_{\text{min}}}{N - N^\sigma_e}} {1 - \frac{AN}{N - N^\sigma_e}} & \text{if} \quad H^\sigma_{\text{min}} < \mu^\sigma < H^\sigma_{\text{max}} \end{cases} \end{equation*} $$A$$ is the "error" which comes from the fact that $$\mat{S}^{-1}$$ is not exact for order $$N$$ calculations: $$A = 1 - \frac{\mat{S}\mat{S}^{-1}}{N}$$.

The McWeeny iteration step is given as: \begin{equation*} \mat{L}^{\sigma}_{n+1} = \begin{cases} \frac{(1 - 2c^\sigma_n)\mat{L}^{\sigma}_n + (1 + c^\sigma_n)\mat{L}^{\sigma}_n\mat{S}\mat{L}^{\sigma}_n - \mat{L}^{\sigma}_n\mat{S}\mat{L}^{\sigma}_n\mat{S}\mat{L}^{\sigma}_n} {1 - c^\sigma_n} & \text{if} \quad c^\sigma_n \le \frac{1}{2} \\ \frac{(1 + c_n^\sigma)\mat{L}^{\sigma}_n\mat{S}\mat{L}^{\sigma}_n - \mat{L}^{\sigma}_n\mat{S}\mat{L}^{\sigma}_n\mat{S}\mat{L}^{\sigma}_n} {c_n^\sigma} & \text{if} \quad c^\sigma_n \ge \frac{1}{2} \end{cases} \end{equation*} where $$c^\sigma_n$$, which should lie between $$[0,1]$$ is given by \begin{equation*} c_n^\sigma = \frac{\tr(\mat{L}^\sigma_n\mat{S}\mat{L}^\sigma_n - \mat{L}^\sigma_n\mat{S}\mat{L}^\sigma_n\mat{S}\mat{L}^\sigma_n)} {\tr{\mat{L}^\sigma_n - \mat{L}^\sigma_n\mat{S}\mat{L}^\sigma_n}} \end{equation*}

#### 2.3.2 Case for Spin Relaxation

If the spin population is allowed to vary, then we should do McWeeny initialisation and iteration on the whole direct sum $$\mat{L} = \mat{L}^\up \oplus \mat{L}^\dn$$. In this case the initial guess for $$\mat{L}^\sigma$$ should be \begin{equation*} \mat{L}_0^\sigma = \frac{\lambda}{2N} \left( \mu \mat{S}^{-1} - \mat{S}^{-1} \mat{H}^{\sigma} \mat{S}^{-1} \right) + \frac{N_e}{2N} \mat{S}^{-1} \end{equation*} where $$N$$ is still the number of support functions, however because we are working with the whole direct sum, the dimension of matrix doubles. $$N_e = \sum_{\sigma} N_e^\sigma$$ now should be the total electron number in the cell.

$$\lambda$$ should no longer be spin dependent: \begin{equation*} \lambda = \min \left( \frac{N_e}{H_\text{max} - \mu}, \frac{2N - N_e}{\mu - H_\text{min}} \right) \end{equation*} where now $$H_\text{min} = \min(H^\up_{\text{min}}, H^\dn_{\text{min}})$$ and $$H_\text{max} = \max(H^\up_{\text{max}}, H^\dn_{\text{max}})$$.

$$\mu$$ is also spin independent, and should be given as \begin{equation*} \mu = \begin{cases} \frac{\sum_\sigma \tr(\mat{S}\mat{S}^{-1}\mat{H}^\sigma\mat{S}^{-1})}{2N} + A H_{\text{max}} & \text{if} \quad H_{\text{min}} < \mu < H_{\text{max}} \\ \frac{\frac{\sum_\sigma \tr(\mat{S}\mat{S}^{-1}\mat{H}^\sigma\mat{S}^{-1})}{2N} - \frac{AN_eH_{\text{min}}}{2N - N_e}} {1 - \frac{2AN}{2N - N_e}} & \text{if} \quad H_{\text{min}} < \mu < H_{\text{max}} \end{cases} \end{equation*} Note that value of $$A$$ should remain the same, as $$\mat{S}$$ extends to both spin components.

The McWeeny iteration step should be given as \begin{equation*} \mat{L}^{\sigma}_{n+1} = \begin{cases} \frac{(1 - 2c_n)\mat{L}^{\sigma}_n + (1 + c_n)\mat{L}^{\sigma}_n\mat{S}\mat{L}^{\sigma}_n - \mat{L}^{\sigma}_n\mat{S}\mat{L}^{\sigma}_n\mat{S}\mat{L}^{\sigma}_n} {1 - c_n} & \text{if} \quad c_n \le \frac{1}{2} \\ \frac{(1 + c_n^\sigma)\mat{L}^{\sigma}_n\mat{S}\mat{L}^{\sigma}_n - \mat{L}^{\sigma}_n\mat{S}\mat{L}^{\sigma}_n\mat{S}\mat{L}^{\sigma}_n} {c_n^\sigma} & \text{if} \quad c_n \ge \frac{1}{2} \end{cases} \end{equation*} where $$c_n$$ is given by \begin{equation*} c_n = \frac{\sum_\sigma\tr(\mat{L}^\sigma_n\mat{S}\mat{L}^\sigma_n - \mat{L}^\sigma_n\mat{S}\mat{L}^\sigma_n\mat{S}\mat{L}^\sigma_n)} {\sum_{\sigma}\tr(\mat{L}^\sigma_n - \mat{L}^\sigma_n\mat{S}\mat{L}^\sigma_n)} \end{equation*}

### 2.4 Diagonalisation: Finding Fermi Energy

As well as the order $$N$$ methods described above, Conquest also implemented the standard order $$N^3$$ method of calculating the ground-state energy from finding eigenvectors of the Hamiltonian matrix by diagonalisation using ScaLAPACK. The diagonalisation methods are useful for calculations involving metallic systems where the charge densities are not sufficiently localised for order $$N$$ method to function well.

The most significant changes need to be done to the code for spin polarised calculations is to work out the occupancies of the electrons associated to the eigen-states of the Hamiltonian matrices $$\mat{H}^\sigma$$.

Since spin non-polarised calculations is simply a fixed spin calculation with zero polarisation, it is very easy to extend the original spin non-polarised Fermi energy searching subroutines to fixed spin case. All one needs to do is to repeat the same code for the second spin component. In this case it is important to note that the Fermi energies for spin up and spin down components differ2. This is a consequence of spin populations being forced to stay the same.

For spin relaxation calculations, there is only one Fermi energy, and spin populations in each channel are calculated by filling up occupancies up to to the Fermi level. To calculate Fermi level, we fill up the spin eigen-states of $$\hat{H}^{\sigma}$$ from the lowest eigenstate (in /both/channels), until an upper bound is found. Fermi energy is then calculated using bisection.

### 2.5 Spin Polarised Pulay Mixing

To achieve self-consistency in the calculation, the input density used to generate the Hamiltonian matrix should be the same as the output density calculated from the Hamiltonian. One of the most commonly used method for helping to achieve self-consistency is Pulay mixing[14][6].

Given an input electron density $$\rho_{\text{in}}^n$$ at $$n$$-th step, the residual is defined as \begin{equation*} R^n \equiv R[\rho_{\text{in}}^n] = \rho_{\text{out}}^n - \rho_{\text{in}}^n \end{equation*} The standard Pulay mixing method tries to replace $$\rho_{\text{in}}^n$$ with an optimal $$\rho_{\text{opt}}^n$$—constructed from a linear combination of the input densities of the previous steps—in the mixing step: \begin{equation*} \rho_{\text{in}}^{n+1} = \rho_{\text{in}}^n + \lambda_n R[\rho_{\text{in}}^n] \end{equation*} where $$\lambda_n$$ is the mixing parameter, so that the norm3 of residual $$\norm{R_n}^2$$ is minimised. The minimisation is done with respect to the coefficients $$\alpha_i$$ and with the constraint that $$\sum_i^n \alpha_i = 1$$. This is to ensure the conservation of number of electrons.

For spin polarised calculations however, things are not as straight-forward. At first glance, we may define \begin{equation*} \rho_{\text{opt}}^{\sigma n} = \sum_{i}^n \alpha^\sigma_i \rho_{\text{in}}^{\sigma i} \end{equation*} and require the Pulay method to minimise the quantity $$\norm{R(\rho_{\text{opt}}^{\sigma n})}^2$$, where \begin{equation*} R(\rho_{\text{opt}}^{\sigma n}) = \sum_{i}^n \alpha^\sigma_i R(\rho_{\text{in}}^{\sigma i}) = \sum_{i}^n \alpha^\sigma_i R^{\sigma i} \end{equation*} with the constraint that:

1. For fixed spin: $$N^\sigma = \sum_i^n \alpha_i^\sigma N(\rho_{\text{in}}^{\sigma,i}) \label{eq:3}$$
2. For spin relaxation: $$N = \sum_\sigma\sum_i^n \alpha_i^{\sigma} N(\rho_{\text{in}}^{\sigma,i}) \label{eq:4}$$

It is important to note that if we define the minimisation problem this way then the coefficients $$\alpha$$ must be spin dependent, because the associated Lagrange Multiplier equations for solving the minimisation problem is given as $$\pdiff{}{\alpha_i^\sigma} \norm{R_{\text{opt}}^{\sigma n}}^2 + \gamma \pdiff{}{G}{\alpha_i^\sigma} = 0 \label{eq:5}$$ where $$\gamma$$ is the Lagrange multiplier. $$G$$ is one of the constraint equation on electron numbers described above (equation \eqref{eq:3} or \eqref{eq:4}). This is a set of $$2n$$ equations, and if we have a non-spin dependent coefficients $$\alpha$$ then there will only be $$n$$ variables.

#### 2.5.1 Case for Fixed Spin Population

This approach is correct for fixed spin calculations. Since we require for every mixing step $$N(\rho_{\text{in}}^{\sigma,i}) = N^\sigma$$, equation \eqref{eq:3} becomes $$\sum_{i}^n \alpha^\sigma_i = 1 \qquad (\forall \sigma) \label{eq:6}$$ These constraints (note there are two equations one for each value of $$\sigma$$) also ensure conservation of total charge, as \begin{equation*} N^{n}_{\text{opt}} = \sum_{\sigma} N^{\sigma n} = \sum_\sigma \underbrace{\sum_i^n \alpha^\sigma_i}_{1} N^\sigma = N \end{equation*}

Hence substituting equation \eqref{eq:6} into the Lagrange Multiplier equation \eqref{eq:5} we obtain $$2\sum_{j}^n A^{\sigma ij} \alpha^\sigma_j + \gamma^\sigma = 0 \label{eq:7}$$ where $$A^{\sigma ij} \equiv \inner{R^{\sigma i}}{R^{\sigma j}} \equiv \int\D^3\vec{r}\, R^{\sigma i}(\vec{r}) R^{\sigma j}(\vec{r}) \label{eq:Aij}$$

We can see that for fixed spin case we just have the ordinary Pulay mixing procedure for each spin channel. The coefficients $$\alpha^\sigma_i$$ can be found by treating the equation \eqref{eq:7} as a matrix equation, and provided $$A^{\sigma ij}$$ is invertible, we get \begin{equation*} \alpha_i = - \frac{\gamma^\sigma}{2} \sum_j^n (A^\sigma)^{-1}_{ij} \end{equation*} Substituting this back into the constraints given in equation \eqref{eq:6} we obtain the value of the multiplier, and hence \begin{equation*} \alpha^\sigma_i = \frac{\sum_j^n (A^\sigma)^{-1}_{ij}} {\sum_{ij}^n (A^\sigma)^{-1}_{ij}} \end{equation*}

#### 2.5.2 Case for Spin Relaxation

If we follow the same steps taken for the fixed spin case, and use equation \eqref{eq:4} as the constraint, then the form of $$\alpha^\sigma_i$$ is given by \begin{equation*} \alpha^\sigma_i = N \frac{\sum_{j}^n (A^{\sigma})^{-1}_{ij} N^{\sigma j}} {\sum_{\sigma'} \sum_{jk}^n (A^{\sigma'})^{-1}_{jk} N^{\sigma' j} N^{\sigma' k}} \end{equation*} where $$N^{\sigma i} = N(\rho_{\text{in}}^{\sigma i})$$ are the spin populations for the previous iterations steps.

However, there is a flaw in our arguments. By constraining the total electron number while allowing the spin populations to vary, and by allowing the Pulay coefficients for each spin channel to differ, we have unintentionally created a situation where the minimisation of $$\norm{R(\rho_{\text{opt}}^{\sigma n})}^2$$ becomes ill-defined. Pulay mixing procedure always tries to minimise $$\norm{R(\rho_{\text{opt}}^{\sigma n})}^2$$ w.r.t. the coefficients $$\alpha^{\sigma}_i$$, and if during a charge mixing step one of the spin channels (let us call it $$\sigma_1$$) has either a larger spin population or lower residual (this amounts to higher values in $$(A^{\sigma})^{-1}_{ij}$$) than the other, then $$\alpha^{\sigma}_i$$ for this spin channel will always have a larger value than that of the other spin channels. This means after mixing more electrons will "flow" to channel $$\sigma_1$$. This causes spin populations in channel $$\sigma_1$$ to increase while the population in the other channel decrease. This further increases the weight to the coefficients in channel $$\sigma_1$$ and hence the self-consistency cycle will be trapped and inevitably all electrons will falsely go to channel $$\sigma_1$$.

This problem is not present for fixed spin case because the $$\alpha$$ coefficients in each channel are both constrained to give the sum of 1. However the same constraints cannot be applied to spin relaxation case because the spin populations will be expected to be different at every mixing step. Thus constraint $$\sum_i \alpha^\sigma_i = 1$$ will not be able to conserve total number of electrons.

The solution to this problem is to recognise that the Pulay coefficients $$\alpha$$ must be the same for both spin channels in the case for spin relaxation. This also means we need to redefine the quantity we wish to minimise in Pulay mixing procedure. Instead of minimising $$\norm{R(\rho_{\text{opt}}^{\sigma n})}^2$$ for each of the spin channels, we minimise the quantity \begin{equation*} \sum_\sigma \norm{R(\rho_{\text{opt}}^{\sigma n})}^2 = \norm{R(\rho_{\text{opt}}^{\up n})}^2 + \norm{R(\rho_{\text{opt}}^{\dn n})}^2 \end{equation*} Note that $$\sum_\sigma \norm{R(\rho_{\text{opt}}^{\sigma n})}^2 = 0$$ implies both $$\norm{R(\rho_{\text{opt}}^{\up n})}^2$$ and $$\norm{R(\rho_{\text{opt}}^{\dn n})}^2$$ must also be zero. Thus $$\sum_\sigma \norm{R(\rho_{\text{opt}}^{\sigma n})}^2 \to 0$$ is still a well defined condition for achieving self-consistency.

Hence the modified Pulay mixing procedure for spin relaxation calculation is given as follows: we define \begin{equation*} \rho_{\text{opt}}^{\sigma n} = \sum_i^n \alpha_i \rho_{\text{in}}^{\sigma i} \end{equation*} and assuming \begin{equation*} R_{\text{opt}}^{\sigma n} \equiv R[\rho_{\text{opt}}^{\sigma n}] = \sum_i^n \alpha_i R^{\sigma i} \end{equation*} The optimised input density for mixing is given by minimising $$\sum_\sigma \norm{R_{\text{opt}}^{\sigma n}}^2$$ w.r.t. the coefficients $$\alpha_i$$, subject to the constraint equation \eqref{eq:4}. Note that now we are solving $$n$$ equations with $$n$$ degrees of freedom.

We have \begin{equation*} \sum_\sigma \norm{R_{\text{opt}}^{\sigma n}}^2 = \sum_\sigma \sum_{ij} \alpha_i A^{\sigma ij} \alpha_j \end{equation*} and hence by noting $$A^{\sigma ij}$$ defined by equation \eqref{eq:Aij} is symmetric \begin{equation*} \pdiff{}{\alpha_i} \sum_\sigma \norm{R_{\text{opt}}^{\sigma n}}^2 = 2 \sum_\sigma \sum_{j} A^{\sigma ij} \alpha_j \end{equation*} substituting into the Lagrange Multiplier equation \eqref{eq:5} gives \begin{equation*} 2 \sum_\sigma \sum_j A^{\sigma ij} \alpha_j + \gamma \sum_\sigma N^{\sigma i} = 0 \end{equation*} If we define \begin{align*} A^{ij} & \equiv \sum_{\sigma} A^{\sigma ij} \\ N^{i} & \equiv \sum_{\sigma} N^{\sigma i} \end{align*} then assuming $$A^{ij}$$ is invertible \begin{equation*} \alpha_i = \frac{-\gamma}{2} \sum_{j} A^{-1}_{ij} N^j \end{equation*} substitute this into constraint equation \eqref{eq:4}, we therefore get \begin{equation*} \alpha_i = N \frac{\sum_j A^{-1}_{ij} N^j} {\sum_{ij} A^{-1}_{ij} N^i N^j} \end{equation*} It is worth noting that the issues with constraints and conservation of electron numbers do not feature in the Pulay minimisation scheme described in section Order $$N$$: Spin Polarised Energy Minimisation. This is because the eigen-values of auxiliary matrices $$\mat{L}^{\sigma}$$ are already restricted within $$[0,1]$$ due to the McWeeny initialisation steps, and therefore the standard Pulay constraint of $$\sum_i \alpha_i = 1$$ is enough to ensure the eigenvalues of $$\mat{L}$$ to remain inside $$[0,1]$$. Further more, the electron numbers are constrained at each step anyway by the added electron number correction step.

#### 2.5.3 Kerker Preconditioning and Wave-dependent Metric

Following the work of the previous dCSE project[16], the Kerker preconditioning and wave-dependent metric methods were implemented in Conquest. Due to the need to minimise the number of FFTs, separate storages were provided for the pre-conditioned residuals and the covariant versions of the residuals needed for the wave-dependent metric. Because of this there was a lot of repeats of the Pulay mixing code as the code requires to access different arrays depending on the user options on whether to switch on the preconditioning and wave-dependent metric.

Since the Pulay mixing algorithms needs to be rewritten for spin polarised calculations, I took this opportunity to tidy up the code by introducing pointers to residual and covariant residual arrays so that the Pulay mixing step only has to work with the pointers, while a separate subroutine manages these pointers to make sure they point to the correct data arrays based on user options.

### 2.6 Spin Polarised Exchange-Correlation Functionals

The ground-state energy Conquest or any other density functional code tries to calculate is the sum of several parts[7]: \begin{equation*} E_{\text{DFT}} = T + E_{\text{H}} + E_{\text{xc}} + E_{\text{local}} + E_{\text{nl}} + E_{\text{ion}} \end{equation*} where $$T$$ is the non-interacting electron kinetic energy, $$E_{\text{H}}$$ is the Hartree energy corresponding to the classical coulomb interaction between te charge densities, $$E_{\text{xc}}$$ is the exchange-correlation energy that takes into account all the non-classical interactions between the electrons, $$E_{\text{local}}$$ and $$E_{\text{nl}}$$ are the local and non-local pseudopotential energies describing electron-ion interactions and finally $$E_{\text{ion}}$$ is the Ewald ion-ion interaction energy.

$$E_{\text{ion}}$$ does not depend on electronic charge density. The kinetic, Hartree and both pseudopotential energies depend only on the total charge density $$\rho^\up + \rho^\dn$$. Only the exchange-correlation energy depends also on the spin polarisation $$\rho^\up - \rho^\dn$$. Therefore for spin implementation in Conquest I have to extend the versions of the exchange-correlation functionals already implemented in the code to include dependences on $$\rho^\up - \rho^\dn$$.

Since the main goal of the project is to add spin infrastructure to Conquest, priority is given to just implement the minimum set of spin polarised exchange-correlation functionals to ensure the code functions correctly, while implementation of the spin polarised version of the full set of different flavours of exchange-correlation functionals already implemented in the code would take too much time and is put-aside as possible projects for the future.

The table below describes the status of the various functionals implemented in Conquest

Functional TypeNameStatus
LDAPZ81[12]Spin non-polarised, disabled for spin polarised calculations
LDAGTH96[4]Spin non-polarised, disabled for spin polarised calculations
LDAPW92[11]Fully Spin Polarised
GGAPBE[9]Spin Polarised, disabled for non-self-consistent force calc.
GGArevPBE[10]Spin Polarised, disabled for non-self-consistent force calc.
GGARPBE[5]Spin Polarised, disabled for non-self-consistent force calc.

I aimed to provide at least one spin polarised LDA functional and one GGA functional for Conquest. Since the three flavours of the GGA functionals already included in Conquest are very similar, minimum amount of work was required to upgrade all of the GGA functionals to work with spin.

Conquest also allows the user to calculate non-self-consistent atomic forces. This requires the code to be able to calculate the derivative of exchange-correlation potentials (double derivative with respect to the energy functionals) with respect to the spin charge density. The spin polarised form of the derivative for PW92 LDA functional could be derived relatively easily with the help of Mathematica[18], and thus has been implemented in Conquest. For the GGAs, the form of the non-self-consistent forces are given by the work of Torralba et al. [17]. However an extension to spin-polarised case would involve very complex algebra and would have taken too long. Since this is not part of the main aim of this dCSE project, I have left this work for possible future projects.

### 2.7 Test Results

To test the spin implementations I have done calculations mainly on two systems: bulk Silicon in FCC lattice, which is not ferromagnetic and should have a net spin polarisation of 0 at ground state; and bulk iron, in BCC lattice, which is ferromagnetic and should have an experimental magnetic moment of 2.12 per atom (the difference between the number of electrons in different spin channels).

For bulk Si I used a simulation cell of 8 atoms with lattice parameter being 5.4282 Ang, integration grid size of $$72 \times 72 \times 72$$ and $$13 \times 13 \times 13$$ grid $$\vec{k}$$-points. The functional used was PW92 LDA, and a single-zeta minimum basis set was used.

For bulk Fe I used a simulation cell of 2 atoms with lattice parameter being 2.8690 Ang, integration grid size of $$36 \times 36 \times 36$$ and $$13 \times 13 \times 13$$ grid $$\vec{k}$$-points. The functional used was also PW92 LDA and a single-zeta minimum basis set was used.

The case of spin fixed calculations with initial spin polarisation of 0 should be the same as spin non-polarised calculations (vanilla Conquest), and indeed this is the case

bulk Si
Calculation TypeGround State Energy (eV)
Diagonalisation (spin non-polarised)-855.811987
Diagonalisation (spin fixed, polarisation = 0)-855.811987
Order $$N$$ (spin non-polarised)-855.603347
Order $$N$$ (spin fixed, polarisation = 0)-855.603347

The small differences in the total energy calculated amounts to the different minimisation schemes implemented. For diagonalisation calculations, Fermi smearing has been used with a temperature of 300K to help convergence. For order $$N$$ methods however there is no need to calculate Fermi energies and occupancies and so the calculation is strictly zero temperature.

Figures 1 and 2 show the total spin polarisation in the simulation cell for bulk silicon during the self-consistency iteration step for a spin relaxation calculation using diagonalisation and order $$N$$ methods. In both cases spin polarisation converges to 0.0, confirming our expectations.

Figures 3 and 4 show the total spin polarisation in the simulation cell for bulk iron during the self-consistency iteration steps for a spin relaxation calculation. Both diagonalisation and order $$N$$ methods are used. As we can see if the calculation starts of with 0.0 magnetisation (i.e. populations in each spin channels equal) then the magnetisation stays 0.0 through-out the self-consistency steps. This corresponds to one of the stable magnetisation states of bulk iron. Once we break the symmetry the calculations converges to a total spin polarisation of around 5.2 per cell (see the table below for more details). This corresponds to spin polarisation of 2.6 per atom, which is higher than the experimental values of 2.12. However this is expected as LDA is known to over estimate spin polarisations and we have used minimum basis set. When compared with calculations done with other codes (see below) this value is reasonable.

I have also tested calculations using the Conquest blip (B-spline) basis set, with the same number of support functions as the PAO base. At each energy minimisation step the blip basis coefficients are also optimised. The results on convergence of spin polarisation are presented in figures spinpolarisationvsiterationFeDiagblip and spinpolarisationvsiterationFeONblip. They are comparable to the PAO calculations.

The final spin polarisations per atom calculated using different methods for bulk iron with non-zero initial magnetisation is given in the following table

Method/Codeinit. mag = 2.04.06.08.0
Diagonalisation, LDA, SZ (PAO)2.626362.626362.626622.62637
Diagonalisation, LDA, Blips2.696372.696992.696832.69670
Order $$N$$, LDA, SZ (PAO)2.598532.598922.597952.59901
Order $$N$$, LDA, Blips2.687642.687482.687112.68829
Siesta, LDA, Blips2.62397

Note that for Siesta calculation, the same pseudopotential and PAO SZ basis set, integration grid ($$36 \times 36 \times 36$$) and the same Fermi smearing temperature of 300K were used.

## 3 Implementing van der Waals Energy Correction

The aim for the second part of this dCSE project is to implement the ab initio van der Waals energy correction term to density functional calculations described by Dion et al.[3] in Conquest. The energy correction term provides a excange-correlation functional which includes long-range correlations essencial for modelling van der Waals interactions.

The new functional takes the form \begin{equation*} E_{\text{vdW}} = E_{\text{x}}^{\text{revPBE}} + E_{\text{c}}^{\text{PW92}} + E_{\text{c}}^{\text{nl}} \end{equation*} so that the functional takes the exchange term from a GGA functional and the correlation term from a LDA functional together with contribution from a non-local functional.

The non-local functional is given as $$E_{\text{c}}^{\text{nl}} = \frac{1}{2} \iint \D^3\vec{r}_1\D^3\vec{r}_2\, \rho(\vec{r}_1) \rho(\vec{r}_2) \phi(q_1, q_2, r_{12}) \label{eq:Enl}$$ where $$r_{12} = \norm{\vec{r}_1 - \vec{r}_2}$$, $$q_i = q_0(\vec{r}_i)$$, and $$q_0(\vec{r})$$ is a function of $$\vec{r}$$ that is dependent on charge density $$\rho(\vec{r})$$ and its gradient. Its form is given by equations (11) and (12) in Dion's paper[3]. The form of the non-local functional kernel $$\phi(q_1, q_2, r_{12}) = \phi(d_1, d_2)$$, where $$d_i = q_ir_{12}$$ is given by equations (14) to (16) in Dion's paper.

It would be impractical to try to evaluate equation \eqref{eq:Enl} directly due to the double integral on real-space coordinates. Even for a modest grid of $$64 \times 64 \times 64$$ in the simulation cell, this corresponds to $$64^6 = 68719476736$$ number of evaluation of $$\phi$$ and products with the two densities to evaluate the integral.

Approximations must be applied.

### 3.1 Overview of Approach

Roman-Perez et al.[15] proposed an efficient approximation to equation \eqref{eq:Enl}, by interpolating the non-local kernel $$\phi(q_1, q_2, r_{12})$$ on a logarithmic grid of the $$q_0(\vec{r})$$ values: \begin{equation*} \phi(q_1, q_2, r_{12} \approx \sum_{\alpha\beta} \phi(q_\alpha, q_\beta, r_{12}) p_\alpha(q_1) p_\beta(q_2) \equiv \sum_{\alpha\beta} \phi_{\alpha\beta}(r_{12}) p_\alpha(q_1) p_\beta(q_2) \end{equation*} where $$q_\alpha$$ are points on the logarithmic grid, and $$p_{\alpha}(q_1)$$ are a set of cubic spline interpolations of the delta functions $$\delta(q_1 - q_\alpha)$$ whose values at the grid points are $$\delta_\beta = \delta_{\alpha,\beta}$$.

By approximating the kernel this way, the dependence on $$q_1$$ and $$q_2$$ are now separated, and hence equation \eqref{eq:Enl} can be approximated as \begin{equation*} E_{\text{c}}^{\text{nl}} \approx \frac{1}{2} \iint \D^3\vec{r}_1\D^3\vec{r}_2\, \sum_{\alpha\beta} \theta_{\alpha}(\vec{r}_1) \phi_{\alpha\beta}(r_{12}) \theta_{\beta}(\vec{r}_2) \end{equation*} where $$\theta_{\alpha}(\vec{r}_i) = \rho(\vec{r}_i) p_\alpha(q_i)$$. The equation above now contains a convolution and hence when working in the Fourier transform space we have $$E_{\text{c}}^{\text{nl}} = \frac{1}{2} \int\D^3\vec{k}\, \tilde{\theta}^*_{\alpha}(\vec{k}) \tilde{\theta}_{\beta}(\vec{k}) \tilde{\phi}_{\alpha\beta}(k) \label{eq:EnlFFT}$$ The calculation thus becomes manageable.

$$\phi_{\alpha\beta}(r_{12})$$ is a radial function, and hence a radial FFT subroutine needs to be implemented in Conquest in order to obtain $$\tilde{\phi}_{\alpha\beta}(k)$$. The $$\theta_\alpha(\vec{r})$$ functions are ordinary functions on the 3D real grid, and hence can be Fourier transformed by the existing FFT algorithms implemented in Conquest.

### 3.2 Grids

There are several different types of grids used for the implementation.

1. First of all there is the Conquest's own integration/FFT grid, distributed among the processors, on which the values of $$\theta_\alpha(\vec{r})$$ and their Fourier transforms will be stored.
2. There is the grid for the $$q$$ values, and upon which each value of $$\phi_{\alpha\beta}(r)$$ (for fixed $$r$$) and polynomials $$p_{\alpha}$$ are defined. These are logarithmic grids, and following advice from the paper[15], I have defined 30 grid points for this grid. And this grid is local to each processor.
3. Further more, there is the radial grid, upon which the values of $$\phi_{\alpha\beta}(r)$$ is calculated for fixed $$\alpha$$ and $$\beta$$. Since the radial grid is one dimensional (and hence relatively small) and Fourier transforms will be done on this grid, it is sensible to make this grid local to each processor as well. For the moment I have fixed the grid size to 1024. This is dense enough for most calculations, but still tiny when compared with the 3D real-space grid.

The final integration for calculating the non-local energy $$E_{\text{c}}^{\text{nl}}$$ is done on the Conquest integration/FFT grid. However if we simply use the radial Fourier transforms of $$\phi_{\alpha\beta}(r_{12})$$ calculated on the radial grid, we will cause errors due to inconsistencies of the two different grids. Hence the $$\tilde{\phi}_{\alpha\beta}(k)$$ calculated will be interpolated results from the values calculated on the radial grid.

### 3.3 Implementation of Radial FFT

To calculate the Fourier transforms of $$q(r)$$ defined on the radial grid, I have to implement a version of radial FFT in Conquest.

We note that for any radial function $$f(\vec{r}) = f(r)$$ the Fourier transform may be written in spherical polar coordinates with the $$z$$-axis defined to coincide with vector $$\vec{k}$$, as: \begin{equation*} F(\vec{k}) = 2\pi \int_0^\infty \D{r}\, \int_0^\pi \D\theta\, r^2 \sin(\theta) e^{-i2\pi k r \cos(\theta)} f(r) \end{equation*} where $$k = \norm{\vec{k}}$$. The $$\theta$$ integral may be evaluated to give \begin{equation*} \int_0^\pi \D\theta \sin(\theta) e^{-i2\pi k r \cos(\theta)} = \left[ e^{-i2\pi k r \cos(\theta)} \right]^\pi_0 \frac{1}{i2\pi k r} = \frac{2\sin(2\pi k r)}{2\pi k r} \end{equation*} Therefore: $$F(\vec{k}) = F(k) = \int_0^\infty \D{r}\, \frac{2\sin(2\pi k r)}{2\pi k} r f(r) \label{eq:contFFT}$$ In other words the Fourier transform of a radial function is also radial.

Now we need to discretise equation \eqref{eq:contFFT}. If the real space radial grid goes from 0 to $$r_{\text{max}}$$ with $$N$$ data points then \begin{equation*} F_j \approx dr \sum_{i=0}^{N-1} \frac{2 \sin(2\pi idr jdk)}{jdk} idr f_i \end{equation*} where $$i = 0, N-1$$ indices the real space radial grid and $$j = 0, N-1$$ indices the reciprocal grid. $$dr = r_{\text{max}} / N$$, and $$dk$$ is given in terms of the Nyquist critical frequency associated to the chosen real space grid. The critical frequency is given[13] as $$f_c = \frac{1}{2dr}$$. This is the maximum possible $$k$$ value beyond which point any non-zero F(k) values will be folded back and causing aliasing. Hence it is the usual practice to set the maximum $$k$$ value on the discrete FFT radial grid to be $$f_c$$. So going from 0 to $$f_c$$ with $$N$$ points, $$dk = \frac{1}{2Ndr}$$. Therefore substitute the value of $$dk$$ into the above equation, we obtain \begin{equation*} F_j \approx \frac{dr}{jdk} \sum_{i=0}^{N-1} 2 \sin(2\pi ij/N) idr f_i \end{equation*} Define $$g_i \equiv 2idrf_i$$ we thus have $$F_j \approx \frac{dr}{jdk} \sum_{i=0}^{N-1} \sin(2\pi ij/N) g_i \label{eq:discFFT}$$ where unless $$f_i \sim \frac{1}{idr^n}$$ for integer $$n > 1$$, $$g_i = 0$$ at $$i = 0$$. Hence baring the special case for $$f_i$$ (which will be outside the range of functions we will need to consider) the sum in equation \eqref{eq:discFFT} is just a standard discrete sin transform[13].

There is a logarithmic singularity at $$k = 0$$ due to the $$\frac{1}{k}$$ factor in equation \eqref{eq:contFFT}, and therefore the $$k = 0$$ point is treated separately. If $$k = 0$$ then $$\lim_{k \to 0} \sin(2\pi k r){k} = 2\pi r$$. Hence after discretisation \begin{equation*} F_0 \approx dr \sum_{i=0}^{N-1} 4\pi (idr)^2 f_i \end{equation*} Thus the algorithm for radial FFT implemented in Conquest is as follows

subroutine radfft(func, func_out, n_r, dr)
dk = half / real(dr * n_r, double)
! set the real data to 2 * r * func(r)
do ir = 1, n_r
rr = (ir - 1) * dr
data(ir) = two * rr * func(ir)
end do
! do sin FT
call sinft(data, n_r) ! from Numerical Recipes
! leave the k = 0 point aside, this is treated separately
do ik = 2, n_r, 1
kk = (ik - 1) * dk
func_out(ik) = data(ik) / kk
end do
! doing the k = 0 point
! the limit (k -> 0) of 2 * r * sin(2*pi*k*r) / k is 4*pi*r**2
func_out(1) = zero
do ir = 1, n_r
rr = (ir - 1) * dr
func_out(1) = func_out(1) + (four * pi * rr**2 * func(ir))
end do
! scale func_out by to give approximation to non-discrete form
func_out = dr * func_out
end subroutine


It is important to note that radfft calculates the approximation of the continuous radial Fourier transform, unlike a standard FFT subroutine, which calculates the discrete Fourier transform of a given set of data points.

The grid size must be of integer powers of two.

The subroutine has been tested against and shows good agreement to the analytical results calculated from Mathematica on various gaussian functions.

### 3.4 Implementation of van der Waal Energy Correction

The non local correlation energy $$E_{\text{c}}^{\text{nl}}$$ given in equation \eqref{eq:EnlFFT} are calculated in Conquest as follows.

subroutine vdWXC_nl_energy(rho, E_vdW_nl)
1. From the Conquest integration/FFT mesh get mesh spacing
2. From FFT mesh spacing, calculate Nyquist critical frequency and
hence kcut and dk
3. Set up q-mesh (logarithmic)
4. Calculate phi(r, q1, q2), and its Fourier transform for each
points q1, q2 on the q-mesh and r (and k) on the radial mesh as
well as the second derivatives and store the interpolation table
in file vdW_kernel.tab
5. Calculate gradient of rho (density) and store in grho
6. Calculate polynomials p(q1) as a function of rho and grho
6.1 Calculate the value of q from rho and grho
6.2 Interpolate p(q1) from given q for every q1 on q-mesh
7. From p(q1) and rho calculate theta(vec_r,q1)
8. FFT theta(vec_r,q1) on Conquest FFT grid for every q1 on q-mesh
9. Loop over FFT grid points
Calculate mag_k = sqrt(sum(vec_k**2)) for a given vec_k on grid point
if (mag_k < kcut) then
Interpolate from the vdW_kernel table to get phi(k,q1,q2)
calculate the complex matrix multiplication and dot product on q-mesh
phi(k,q1,q2) * theta(vec_k,q1) theta(vec_k,q2)
Accumulate into E_vdW_nl (for non-local energy)
end if
end Loop
10. Gather and reduce values of E_vdW_nl distributed among the processors
end subroutine vdWXC_nl_energy


The over all van der Waals exchange-correlation energy is the calculated as a sum of $$E_{\text{c}}^{\text{nl}}$$ (calculated above), the exchange part of revPBE functional and the correlation part of PW92 function.

When used in actual applications, Conquest first runs a self-consistent calculation with either one of the LDA or GGA functionals, and after it reached self-consistency the van der Waals exchange-correlation energy is then calculated using the self-consistent density. The van der Waal energy correction is then calculated by subtracting the original LDA/GGA exchange-correlation energy from the calculated value.

### 3.5 Test Results

Test runs have been performed on a system of two benzene rings, stacked one directly on top of the other along the concentric axis. Interaction energies of the benzene rings can be calculated by taking the difference in total energy between the two ring system and 2 times a single benzene system. The van der Waals correction to interaction energies are calculated by first calculating interaction energies calculated using either LDA or GGA functionals, and then subtract this from the interaction energies calculated with van der Waals energy correction turned on4. The calculated results are then compared with results calculated from Siesta, which has implemented the same van der Waals functional. It has to be noted however that the comparison can not be exactly due to the fact that in Siesta the van der Waals functional is self-consistent and hence unlike in Conquest calculation the results are calculated using van der Waals function (potential) through out.

The simulation cell (for both Conquest and Siesta calculations) was set to be cubic, with lattice parameter set to 20.0 Ang on all sides. The number of grid points are set to be $$128 \times 128 \times 128$$. For Conquest calculations pseudopotential and basis sets are generated by either LDA or GGA functionals (which ever is under comparison), for Siesta calculations, the LDA and GGA results are calculated using identical pseudopotential and basis sets as that used by Conquest, however the van der Waals used pseudopotentials and basis sets generated by van der Waals functional.

Figures 5 and 6 shows the van der Waals correction to interaction energies for system calculated either using PW92 LDA or revPBE GGA. The values are plotted against separation distances between the two benzene rings. The results show reasonable agreement between the Conquest implementation and Siesta implementation, bare in mind that the Siesta calculation is self-consistent while Conquest is not.

The table below shows the total execution time taken for the van der Waals correction implementation and its percentage in relation to the total program run time. The test calculation is done on two benzene rings using revPBE functional with van der Waals correction with an integration grid size of $$128 \times 128 \times 128$$. It took 6 SCF steps for the calculation to converge

Number of NodesTotal run time (s)vdW time (s)Ratio
1339.730000156.1250000.459000
2217.320000116.0820300.534000
4131.87000093.3984400.708000

It can be seen that the van der Waals correction subroutines does cost a significant portion of the total run time. There is also an indication that it becomes less efficient compared to the rest of the code when number of computation nodes increases.

The code generates the interpolations of the non-local kernels when it first runs. For the subsequent runs the results are simply read from the table at start up. The following shows the times taken if the table has been pre-generated

Number of NodesTotal run time (s)vdW time (s)Ratio
1309.060000133.9140600.433000
2196.29000095.9335900.488000
4113.66000076.6367200.674000

Slight improvements on performance can be observed.

The long processing time can be expected, due to the nature of the algorithm. For the calculation of the non-local correlation energy $$E_{nl}$$, we have to do a matrix product and a dot product on the $$q$$-mesh, for every integration point on the 3D FFT mesh. This amounts to $$30^2 = 900$$ times the work-load compared to a typical electron number calculation or an inner product between the residuals.

This can be confirmed by following table, which shows break downs on the major parts of the van der Waals code:

AreaTime/Ratio (s) (1 node)2 nodes4 nodes
Generate $$\phi$$ table19.390630 (0.124)18.445310 (0.158)18.195310 (0.194)
build $$\grad{\rho}$$2.613280 (0.016)2.832030 (0.024)2.156250 (0.023)
build $$\theta$$2.253910 (0.014)1.066410 (0.009)0.625000 (0.006)
FFT $$\theta$$19.429690 (0.124)21.257810 (0.183)24.472660 (0.262)
Integrate $$E_{nl}$$78.988280 (0.505)49.218750 (0.423)25.640630 (0.274)

This clear shows that for small number of nodes the $$\vec{k}$$ space integration is the most expensive part of the implementation. However as the number of nodes increase FFT quickly becomes the new bottle neck. This also also explains the apparently poor scaling displayed by the code. Because the implementation requires 30 FFTs to be done the communication inefficiencies of the FFT routines quickly becomes a significant factor. Note that the poor scaling in the $$\phi$$ table generation part is mainly due to only one node is doing all the IO during the writing to file phase, and this process is expected to be slow and non-scaling.

If the non-local kernel table has been pre-generated, then we can see clearly that significant amount of work can be avoided.

AreaTime/Ratio (s), 1 node2 nodes4 nodes
Generate $$\phi$$ table0.187500 (0.001)0.187500 (0.001)0.203130 (0.002)
build $$\grad{\rho}$$2.851560 (0.021)2.175780 (0.022)2.847660 (0.037)
build $$\theta$$1.878910 (0.014)1.062500 (0.011)0.507810 (0.006)
FFT $$\theta$$18.902340 (0.141)18.785160 (0.195)24.929690 (0.325)
Integrate $$E_{nl}$$76.058590 (0.567)50.558590 (0.527)25.847660 (0.337)

The table below shows memory usage by the implementation:

Calc TypeMaximum RSS (MB) per node, 1 node2 nodes4 nodes
vdW without table2021.9101561053.257812549.960938
vdW with table2021.8515621053.210938547.882812
Vanilla (no vdW)1017.695312545.023438276.968750

The van der Waals implementation uses twice the amount of memory as the original code. My initial version of the implementation uses even more memory, due to the fact that original Conquest implementations of exchange-correlation functionals takes in the grid data arrays as input and output. This required a number of temporary storages for passing grid data from one subroutine to another. To reduce memory I changed the implementation of these functional act point-wise on the grid data. While this introduced inefficiency as the exchange-correlation functional subroutines have to be called at every grid point, it reduced memory usage significantly (by about factor of three). However even after this memory usage is still large.

This high memory usage is caused by the fact that there are 30 $$\theta$$ functions—one for every point on the $$q$$-mesh, all of which are data points on the real/FFT grid—we need to store at the same time. We cannot do this one by one because a pair of the functions are required at any one time in the integral for calculating $$E^{\text{nl}}_{\text{c}}$$, and we have to go through all permutations of the pairings. One could in theory recalculate the $$\theta$$ functions at every integration accumulation step, but that will lead to a lot of unnecessary re-evaluation of the $$\theta$$ functions and their Fourier transforms. Considering FFT becomes the bottle neck for more processors I believe doing this would be unwise.

## 4 Conclusion and Future Work

Both spin polarisation and van der Waals functional have been successfully implemented and test calculations gives expected results. For spin relaxation calculations the new energy minimisation scheme and Pulay mixing scheme for self-consistency has been tested and demonstrated to be working. The data structure of Conquest now handles spin polarisation naturally, and efforts has been made to make the existing subroutines to handle both spin non-polarised and polarised calculations. This helps to make future developments in Conquest easier, the developer do not have to develop separate versions codes for spin non-polarised and polarised calculations.

Four out of the six exchange-correlation functionals implemented in Conquest has been extended to work with spin polarised calculations. Future work remains for converting the rest. While the spin polarised version of non-self-consistent forces has been implemented for a LDA functional, the derivation of equations for the GGA functionals are left for future work due to its complexity. Protective warning messages and default behaviours have been added in the code to prevent the user from making wrong choice of functionals and force settings when running a spin polarised calculation.

Memory usage is a big concern with the implementation of van der Waals functions. As it is shown in test calculations, the memory usage with van der Waals correction turned on is almost twice of standard calculations. The fundamental reason for high memory usage seems to be due to the functional form used to approximate the functional. This same problem has been observed in Siesta. Never-the-less this approximation made a functional previously impractical to use (due to its double integration and non-local nature) practical.

Due to the large number of grid integrations one needs to perform, van der Waals functional is relatively expensive computationally, in our test calculations it has the equivalent cost of roughly 6 energy minimisation operations. While integration on grid shows good scalability with number of nodes, the over all scalability of the code is hampered by the extensive use of FFT. For calculation of large systems, a more efficient FFT engine would benefit. At the moment Conquest uses its own FFT implementation, and future work could be done to allow Conquest to use standard external FFT libraries optimised for HCP platforms.

It is also possible in the future to extend the van der Waals implementation for self-consistent calculations. This involves finding the potential by calculating the derivatives of the functional with respect to density.

At the moment most of the grid settings for the logarithmic and radial grid used by van der Waals functional are defined as parameters in the code. While those seldom needs to be changed, in the future it would be beneficial to include them as part of the input changeable by the user.

The spin implementations has already been submitted to Conquest code repository and implementation of van der Waals functional will be submitted shortly. The beta release of Conquest can be obtained from http://www.order-n.org/.

## 5 Acknowledgement

This project was funded under the HECToR Distributed Computational Science and Engineering (CSE) Service operated by NAG Ltd. HECToR—A Research Councils UK High End Computing Service—is the UK’s national supercomputing service, managed by EPSRC on behalf of the participating Research Councils. Its mission is to support capability science and engineering in UK academia. The HECToR supercomputers are managed by UoE HPCx Ltd and the CSE Support Service is provided by NAG Ltd. http://www.hector.ac.uk

## Footnotes:

1 This means the spins of the electrons in the system under study are aligned, so spin states can be either "up" or "down".

2 In the case of spin non-polarised calculation the Fermi energies of the spin channels happen to be the same due to symmetry of the two channels.

3 The inner product between two residual functions $$R_n$$ and $$R_m$$ is defined by $$\inner{R_n}{R_m} \equiv \int\D^3\vec{r}\, R_n(\vec{r}) R_m{\vec{r}}$$; and the norm of $$R_n$$ is simply $$\norm{R_n} = \sqrt{\inner{R_n}{R_n}}$$.

4 The same LDA or GGA functional is used to compute the self-consistent density prior the van der Waals correction step.

## References

 [1] D. R. Bowler. Linear-scaling density matrix minimisation and electron number conservation. Technical report, University College London, 2005. [2] D. R. Bowler, I. J. Bush, and M. J. Gillan. Practical methods for ab initio calculations on thousands of atoms. Int. J. Quantum Chem., 77(5):831-842, 2000. [DOI | http ] [3] M. Dion, H. Rydberg, E. Schroder, D. C. Langreth, and B. I. Lundqvist. Van der waals density functional for general geometries. Phys. Rev. Lett., 92(24):246401, 2004. [http ] [4] S. Goedecker, M. Teter, and J. Hutter. Separable dual-space gaussian pseudopotentials. Phys. Rev. B, 54(3):1703-1710, Jul 1996. [DOI | http ] [5] B. Hammer, L. B. Hansen, and J. K. Nørskov. Improved adsorption energetics within density-functional theory using revised perdew-burke-ernzerhof functionals. Phys. Rev. B, 59:7413-7421, Mar 1999. [DOI | http ] [6] G. Kresse and J. Furthmüller. Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set. Computational Materials Science, 6(1):15-50, July 1996. [http ] [7] R. M. Martin. Electronic Structure: Basic Theory and Practical Methods. Cambridge University Press, 2004. [8] A. H. R. Palser and D. E. Manolopoulos. Canonical purification of the density matrix in electronic-structure theory. Phys. Rev. B, 58(19):12704-12711, Nov 1998. [DOI | http ] [9] J. P. Perdew, K. Burke, and M. Ernzerhof. Generalized gradient approximation made simple. Phys. Rev. Lett., 77(18):3865-3868, Oct 1996. [DOI ] [10] J. P. Perdew, K. Burke, and M. Ernzerhof. Perdew, burke, and ernzerhof reply:. Phys. Rev. Lett., 80(4):891, Jan 1998. [DOI ] [11] J. P. Perdew and Y. Wang. Accurate and simple analytic representation of the electron-gas correlation-energy. Phys. Rev. B, 45(23):13244-13249, June 1992. [12] J. P. Perdew and A. Zunger. Self-interaction correction to density-functional approximations for many-electron systems. Phys. Rev. B, 23(10):5048-5079, 1981. [13] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in Fortran 77: The Art of Scientific Computing, volume Volume 1 of Fortran Numerical Recipes. Cambridge University Press, second edition edition, 2001. [14] P. Pulay. Convergence acceleration of iterative sequences. the case of scf iteration. Chem. Phys. Lett., 73(2):393-398, 1980. [DOI | http ] [15] G. Román-Pérez and J. M. Soler. Efficient implementation of a van der waals density functional: Application to double-wall carbon nanotubes. Phys. Rev. Lett., 103(9):096102, Aug. 2009. [http ] [16] L. Tong. Metal conquest. Technical report, UCL and NAG, Mar. 2011. [http ] [17] A. S. Torralba, D. R. Bowler, T. Miyazaki, and M. J. Gillan. Non-self-consistent density-functional theory exchange-correlation forces for gga functionals. J. Chem. Theory Comput., 5(6):1499-1505, June 2009. [DOI ] [18] I. Wolfram Research. Mathematica edition: Version 7.0, 2008.

This file was generated by bibtex2html 1.96.

Date: Tuesday, 2012/06/05

Org version 7.8.03 with Emacs version 23

Validate XHTML 1.0