All of the proposed changes to the CONQUEST code have been successfully implemented and tested. Kerker preconditioning shows a clear advantage over linear mixing in allowing calculations to converge under charge-sloshing conditions. However, it is not clear from the examples we have tested how much improvement Kerker and wave-dependent metric preconditioning offer over Pulay mixing, as the bulk aluminium system with a defect also reached self-consistency relatively quickly. It is particularly difficult to test the effectiveness of wave-dependent metric preconditioning in this case because it has to be used together with Pulay mixing. Further testing may therefore be required to realise the true potential of the Kerker preconditioning and wave-dependent metric implementations, possibly on a larger system with more complicated defects.
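To make the role of Kerker preconditioning concrete, the following is a minimal sketch of the standard Kerker damping factor $A q^2 / (q^2 + q_0^2)$ applied to the reciprocal-space components of a density residual. It is not taken from the CONQUEST source: the function names, the default values of `q0` and `amplitude`, and the list-based representation of the residual are assumptions made for illustration only.

```python
def kerker_factor(q, q0, amplitude):
    """Standard Kerker factor A * q^2 / (q^2 + q0^2) for a wavevector of norm q."""
    q2 = q * q
    return amplitude * q2 / (q2 + q0 * q0)


def precondition_residual(residual_q, q_norms, q0=1.0, amplitude=0.5):
    """Scale each reciprocal-space residual component by its Kerker factor.

    residual_q : residual components R(q) (hypothetical list representation)
    q_norms    : |q| for each component, in the same order
    q0, amplitude : illustrative defaults, not values used by CONQUEST
    """
    return [kerker_factor(q, q0, amplitude) * r
            for q, r in zip(q_norms, residual_q)]


def mix_step(rho_in_q, residual_q, q_norms, q0=1.0, amplitude=0.5):
    """One preconditioned linear mixing step: rho_new(q) = rho_in(q) + K(q) R(q)."""
    damped = precondition_residual(residual_q, q_norms, q0, amplitude)
    return [rho + d for rho, d in zip(rho_in_q, damped)]
```

Long-wavelength components (small $|q|$), which drive charge sloshing, are strongly damped, while short-wavelength components are mixed with a weight approaching `amplitude`, as in plain linear mixing.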

A technical complication that can lead to a non-unique Fermi energy was discovered in the Methfessel-Paxton method for approximating the step function. This is an intrinsic problem originating from the form of the Hermite polynomials. Standard search methods will always find a Fermi energy, but in the rare case where more than one solution exists, which solution they return is effectively arbitrary. This causes problems later in the calculation, especially if one wants to calculate forces. We have developed a search method that always finds the lowest Fermi energy. The implementation in CONQUEST has been tested and works as expected, with the Methfessel-Paxton approximation allowing much higher smearing temperatures while giving more accurate ground-state energies than Fermi-Dirac smearing. As demonstrated, this significantly reduces the number of $\vec{k}$ points required for a calculation and hence the computational cost.
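The non-uniqueness can be illustrated with a small sketch. The occupation function below follows the published Methfessel-Paxton form, $S_N(x) = \tfrac{1}{2}\,\mathrm{erfc}(x) + \sum_{n=1}^{N} A_n H_{2n-1}(x)\, e^{-x^2}$ with $A_n = (-1)^n / (n!\, 4^n \sqrt{\pi})$ and $x = (\epsilon - \mu)/\sigma$. The search routine is only one possible way to obtain the lowest solution (an upward scan followed by bisection) and is not necessarily the scheme implemented in CONQUEST; the function names, the scan step of $\sigma/10$, the starting point of $10\sigma$ below the lowest eigenvalue, and the spin degeneracy of 2 are all assumptions.

```python
import math


def hermite(n, x):
    """Physicists' Hermite polynomial H_n(x) via H_{k+1} = 2x H_k - 2k H_{k-1}."""
    h_prev, h = 1.0, 2.0 * x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h = h, 2.0 * x * h - 2.0 * k * h_prev
    return h


def mp_occupation(energy, mu, sigma, order):
    """Methfessel-Paxton occupation of order `order` (can be < 0 or > 1)."""
    x = (energy - mu) / sigma
    occ = 0.5 * math.erfc(x)
    for n in range(1, order + 1):
        a_n = (-1) ** n / (math.factorial(n) * 4 ** n * math.sqrt(math.pi))
        occ += a_n * hermite(2 * n - 1, x) * math.exp(-x * x)
    return occ


def electron_count(eigenvalues, mu, sigma, order, degeneracy=2.0):
    """Total electron number for a trial Fermi energy mu (assumed spin factor 2)."""
    return degeneracy * sum(mp_occupation(e, mu, sigma, order) for e in eigenvalues)


def lowest_fermi_energy(eigenvalues, n_electrons, sigma, order, tol=1e-10):
    """Scan mu upward and bisect inside the first bracket where the count reaches
    n_electrons, so that the lowest solution is returned (illustrative scheme)."""
    step = sigma / 10.0                   # assumed scan resolution
    lo = min(eigenvalues) - 10.0 * sigma  # count is essentially zero here
    limit = max(eigenvalues) + 10.0 * sigma
    while lo < limit:
        hi = lo + step
        if electron_count(eigenvalues, hi, sigma, order) >= n_electrons:
            while hi - lo > tol:          # bisection: count(lo) < N <= count(hi)
                mid = 0.5 * (lo + hi)
                if electron_count(eigenvalues, mid, sigma, order) < n_electrons:
                    lo = mid
                else:
                    hi = mid
            return 0.5 * (lo + hi)
        lo = hi
    raise ValueError("no Fermi energy found in the scanned range")
```

For two levels at $-1$ and $+1$ with two electrons, $\sigma = 0.1$ and first-order smearing, a Fermi energy in the middle of the gap gives the correct electron count, but so does a value just above the lower level, where the first-order occupation passes through 1 on its way to values slightly above 1. The scan returns the latter, lower solution, whereas a bisection started from an arbitrary bracket could return either.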

As expected, the bottleneck of the calculations was found to be the diagonalisation process. Within diagonalisation, the main bottleneck appears to come from imbalances in the MPI communications initiated by ScaLAPACK. Changing the ScaLAPACK block sizes changes the performance of the code significantly: the smaller the block sizes, the smaller the load imbalance, but the more communication is required. Further study over a wide range of system sizes and ScaLAPACK parameters is required; this would allow us to develop a better automatic parameterisation scheme for CONQUEST and make it more user friendly. It is still unresolved why the calculation with a $2 \times 2$ processor grid and non-square block sizes fails. There may be a bug in the code, and more work is needed to resolve this issue.
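The dependence of load balance on block size can be estimated without running ScaLAPACK. The sketch below counts how many rows or columns each process owns in a one-dimensional block-cyclic distribution, following the same logic as ScaLAPACK's `NUMROC` tool routine with the first block on process 0, and combines the two dimensions into a simple imbalance measure. The function names and the choice of maximum-over-mean as the measure are assumptions made for illustration; communication cost is not modelled.

```python
def local_count(n, block, proc, nprocs):
    """Number of rows (or columns) of an n-long dimension owned by process `proc`
    in a block-cyclic distribution with block size `block` over `nprocs` processes,
    with the first block assigned to process 0."""
    full_blocks, remainder = divmod(n, block)
    count = (full_blocks // nprocs) * block   # complete rounds over all processes
    extra = full_blocks % nprocs              # leftover full blocks
    if proc < extra:
        count += block                        # one more full block
    elif proc == extra:
        count += remainder                    # the final partial block, if any
    return count


def load_imbalance(n, row_block, col_block, grid_rows, grid_cols):
    """Ratio of the largest local matrix size to the mean local size for an
    n x n matrix on a grid_rows x grid_cols process grid (1.0 is perfect balance)."""
    loads = [local_count(n, row_block, r, grid_rows) *
             local_count(n, col_block, c, grid_cols)
             for r in range(grid_rows) for c in range(grid_cols)]
    mean = sum(loads) / len(loads)
    return max(loads) / mean
```

For a $10 \times 10$ matrix on a $2 \times 2$ grid, block size 1 gives perfect balance, block size 2 leaves one process with 36 elements against a mean of 25, and block size 10 places the whole matrix on a single process.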

The modification of CONQUEST for $\vec{k}$ point parallelisation has been successful. There is currently a limiting requirement that each processor group must hold the same number of processors, and no redundant processors are allowed. This means $N_G$ must be chosen to be a factor of the total number of processors. We have shown that dividing the processors into subgroups, each working on a $\vec{k}$ point, offers more flexibility and in most cases improves the efficiency of the code. This is especially true when running CONQUEST on a machine that does not have a highly optimised linear algebra library. One limitation of the current implementation is that $\vec{k}$ point parallelisation applies only to the processes involved in diagonalisation; for all other processes the matrices are still shared between all available processors. This limits the number of processors that can be used for calculations with more $\vec{k}$ points than atoms: atoms are distributed over all processors, and no processor is allowed to have zero atoms, so only a certain number of processors is allowed in a given calculation. For example, it is not possible to calculate the 32-atom bulk aluminium cell with 2 processor groups each having a $1 \times 4$ processor grid, because this requires 8 processors, but in the 32-atom case some processors would not be allocated any atoms. More tests are needed to show the potential that $\vec{k}$ point parallelisation has for much larger calculations on HECToR.
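The constraint that $N_G$ must divide the total number of processors can be turned into a small helper that lists the admissible layouts. This is only an illustrative sketch: the function names are hypothetical, every factorisation of the group size is treated as a possible processor grid, and the additional restriction that every processor must hold at least one atom is not modelled, since it depends on how CONQUEST distributes atoms.

```python
def valid_group_counts(n_procs):
    """All N_G that are factors of n_procs, so every group has the same size
    and no processor is left unused."""
    return [g for g in range(1, n_procs + 1) if n_procs % g == 0]


def process_grids(group_size):
    """All (rows, cols) processor grids with rows * cols == group_size."""
    return [(r, group_size // r)
            for r in range(1, group_size + 1) if group_size % r == 0]


def enumerate_layouts(n_procs):
    """All (N_G, rows, cols) combinations compatible with n_procs processors."""
    return [(g, r, c)
            for g in valid_group_counts(n_procs)
            for r, c in process_grids(n_procs // g)]
```

With 8 processors, `enumerate_layouts(8)` includes `(2, 1, 4)`, the layout of 2 groups with a $1 \times 4$ grid discussed above. Whether such a layout can actually be run still depends on the atom distribution.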

The implementations in CONQUEST will be submitted to the code repository after further testing and will be available in the future (beta) release of the code obtainable from

Lianheng Tong 2011-03-02