All of the proposed changes to the CONQUEST code have been successfully implemented and tested. Kerker preconditioning shows a clear advantage over linear mixing in allowing calculations to converge under charge-sloshing conditions. However, it is not clear from the examples we have tested how much improvement Kerker and wave-dependent metric preconditioning offer over Pulay mixing, as the bulk aluminium system with a defect also reached self-consistency relatively quickly with Pulay mixing alone. It is particularly difficult to test the effectiveness of wave-dependent metric preconditioning in this case because it has to be used together with Pulay mixing. Further testing may therefore be required to realise the true potential of the Kerker preconditioning and wave-dependent metric implementations, for example on a larger system with more complicated defects.
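As a reminder of why Kerker preconditioning suppresses charge sloshing, the following is a minimal Python sketch of the textbook Kerker factor applied to a density residual in reciprocal space. It is not the CONQUEST implementation; the function names, the parameter `q0` (assumed to be positive) and the mixing amplitude `amplitude` are illustrative assumptions.

```python
def kerker_factor(g_squared, q0, amplitude=1.0):
    """Textbook Kerker damping factor A * G^2 / (G^2 + q0^2).

    Assumes q0 > 0. The factor tends to 0 for long wavelengths (small G),
    which is where charge sloshing occurs, and to A for short wavelengths.
    """
    return amplitude * g_squared / (g_squared + q0 * q0)


def linear_mix(rho_in, rho_out, amplitude=1.0):
    """Plain linear mixing: every component gets the same step length."""
    return [r_in + amplitude * (r_out - r_in)
            for r_in, r_out in zip(rho_in, rho_out)]


def kerker_mix(rho_in, rho_out, g_squared, q0, amplitude=1.0):
    """Linear mixing with the residual scaled per reciprocal-space component."""
    return [r_in + kerker_factor(g2, q0, amplitude) * (r_out - r_in)
            for r_in, r_out, g2 in zip(rho_in, rho_out, g_squared)]


if __name__ == "__main__":
    # Hypothetical squared wavevectors, from G = 0 to a short wavelength.
    for g2 in [0.0, 0.01, 1.0, 100.0]:
        print(g2, kerker_factor(g2, q0=1.0))
```

With linear mixing all components of the residual are mixed with the same weight, whereas here the long-wavelength components are strongly damped and the G = 0 component is left unchanged.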
A technical complication that can lead to a non-unique Fermi energy
was discovered in the Methfessel-Paxton method for approximating the
step function. This is an intrinsic problem originating from the form
of the Hermite polynomials. The standard search methods still always
find a Fermi energy, but in the rare case where more than one solution
exists, which of them is returned is essentially arbitrary. This
causes problems later in the calculation, especially if one wants to
calculate forces. We have developed a search method that ensures the
lowest Fermi energy is always found. The implementation in CONQUEST
has been tested and works as expected, with the Methfessel-Paxton
approximation allowing much higher smearing temperatures while giving
more accurate ground-state energies than Fermi-Dirac smearing. As
demonstrated, this allows one to significantly reduce the number of
k points required for a calculation and hence reduces the
computational cost.
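To illustrate the non-uniqueness and one possible way of selecting the lowest solution, the following is a minimal Python sketch. It is not the search method implemented in CONQUEST. It assumes first-order Methfessel-Paxton smearing, a spin degeneracy of 2, and a hypothetical scan step `step_fraction` given as a fraction of the smearing width; all function names are illustrative. The search scans upward from well below the lowest eigenvalue and bisects inside the first interval in which the electron count crosses the target, so it returns the lowest crossing only if the scan step is fine enough not to step over a pair of closely spaced solutions.

```python
import math


def mp1_occupation(x):
    """First-order Methfessel-Paxton occupation for x = (eps - mu) / sigma.

    S_1(x) = erfc(x)/2 + A_1 * H_1(x) * exp(-x^2), with A_1 = -1/(4 sqrt(pi))
    and H_1(x) = 2x. It is not monotonic: it exceeds 1 below the Fermi
    level and drops below 0 above it.
    """
    return 0.5 * math.erfc(x) - x * math.exp(-x * x) / (2.0 * math.sqrt(math.pi))


def electron_count(eigenvalues, mu, sigma, degeneracy=2.0):
    """Number of electrons for a trial Fermi energy mu."""
    return degeneracy * sum(mp1_occupation((e - mu) / sigma) for e in eigenvalues)


def lowest_fermi_energy(eigenvalues, n_electrons, sigma, step_fraction=0.05):
    """Scan upward and bisect the first crossing of the target electron count."""
    def excess(mu):
        return electron_count(eigenvalues, mu, sigma) - n_electrons

    step = step_fraction * sigma
    start = min(eigenvalues) - 10.0 * sigma
    stop = max(eigenvalues) + 10.0 * sigma
    n_steps = int(math.ceil((stop - start) / step))

    lo, f_lo = start, excess(start)
    for i in range(1, n_steps + 1):
        hi = start + i * step
        f_hi = excess(hi)
        if f_lo < 0.0 <= f_hi:
            # First upward crossing found: refine it by bisection.
            for _ in range(100):
                mid = 0.5 * (lo + hi)
                if excess(mid) < 0.0:
                    lo = mid
                else:
                    hi = mid
            return 0.5 * (lo + hi)
        lo, f_lo = hi, f_hi
    raise ValueError("no Fermi energy found in the scanned range")


if __name__ == "__main__":
    # Toy spectrum in arbitrary units: two levels, two electrons.
    levels = [0.0, 3.0]
    mu = lowest_fermi_energy(levels, n_electrons=2.0, sigma=1.0)
    print(mu, electron_count(levels, mu, 1.0))
    print(electron_count(levels, 1.5, 1.0))  # the midpoint is also a solution
```

For this toy spectrum the midpoint 1.5 satisfies the electron count by symmetry, but the scan returns a lower solution lying between 0.85 and 0.9, which shows that a search started from an arbitrary bracket could land on a different Fermi energy.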
The bottleneck of the calculations was found to be the diagonalisation
process, as expected. Within the diagonalisation, the main bottleneck
appears to come from load imbalances in the MPI communications
initiated by ScaLAPACK. Changing the ScaLAPACK block sizes changes the
performance of the code significantly: the smaller the block sizes,
the smaller the load imbalance, but the more communication is
required. Further study over a wide range of system sizes and
ScaLAPACK parameters is required, which will allow us to develop a
better automatic parameterisation scheme for CONQUEST and make it more
user friendly. It remains unresolved why the calculation with
processor grid and non-square block sizes fails. There may yet be a
bug in the code, and more work needs to be done to resolve this issue.
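The trade-off between load balance and communication can be seen from the block-cyclic layout itself. The following minimal Python sketch counts how many rows of a matrix each process owns in a one-dimensional block-cyclic distribution, with the first block assigned to process 0. The function names and the example sizes are illustrative assumptions, and the number of blocks is used only as a rough proxy for the number of communication steps.

```python
def rows_per_process(n_rows, block_size, n_procs):
    """Rows owned by each process in a 1D block-cyclic distribution.

    Blocks of `block_size` consecutive rows are dealt out to the
    processes in turn, starting from process 0; the last block may be
    shorter than the others.
    """
    counts = [0] * n_procs
    for block_index, start in enumerate(range(0, n_rows, block_size)):
        counts[block_index % n_procs] += min(block_size, n_rows - start)
    return counts


def imbalance(counts):
    """Largest share divided by the average share (1.0 is perfect balance)."""
    return max(counts) / (sum(counts) / len(counts))


if __name__ == "__main__":
    n_rows, n_procs = 100, 4  # hypothetical matrix dimension and process count
    for block_size in [32, 4, 1]:
        counts = rows_per_process(n_rows, block_size, n_procs)
        n_blocks = -(-n_rows // block_size)  # ceiling division
        print(block_size, counts, imbalance(counts), n_blocks)
```

For 100 rows on 4 processes, a block size of 32 leaves one process with only 4 rows, while a block size of 4 gives shares of 28, 24, 24 and 24 rows at the cost of 25 blocks instead of 4.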
The modification of CONQUEST for k-point parallelisation has been
successful. There is currently a limiting requirement that each
processor group must hold the same number of processors and that no
redundant processors are allowed. This means that the number of
processor groups must be chosen to be a factor of the total number of
processors. We have shown that dividing the processors into subgroups,
each working on a k point, offers more flexibility and in most cases
improves the efficiency of the code. This is especially true when
running CONQUEST on a machine that does not have a highly optimised
linear algebra library. One limitation of the current implementation
is that k-point parallelisation only applies to the processes involved
in diagonalisation; for other processes the matrices are still shared
between all available processors. This prevents one from using more
processors than the number of atoms allows, even in calculations with
more k points than atoms: only a certain number of processors is
allowed in a given calculation because no processor may hold zero
atoms, yet atoms are distributed over all processors. For example, it
is not possible to calculate the 32-atom bulk aluminium cell with 2
processor groups each having a processor grid, because this requires 8
processors, but for the 32-atom case some processors will not be
allocated any atoms. More tests are needed to show the potential that
k-point parallelisation has for much larger calculations on HECToR.
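The grouping constraint can be made concrete with a minimal Python sketch. It is not taken from the CONQUEST source: the function names are hypothetical, processor ranks are assumed to be numbered contiguously within a group, and the round-robin assignment of k points to groups is an assumption made purely for illustration.

```python
def split_into_groups(n_procs, n_groups):
    """Divide ranks 0..n_procs-1 into equally sized, contiguous groups.

    Raises ValueError unless n_groups is a factor of n_procs, mirroring
    the requirement that all groups have the same size and that no
    processor is left over.
    """
    if n_groups < 1 or n_procs % n_groups != 0:
        raise ValueError("number of groups must be a factor of the number of processors")
    size = n_procs // n_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(n_groups)]


def assign_kpoints(n_kpoints, n_groups):
    """Assign k-point indices to groups in round-robin order (an assumption)."""
    return [list(range(g, n_kpoints, n_groups)) for g in range(n_groups)]


if __name__ == "__main__":
    print(split_into_groups(8, 2))   # two groups of four processors
    print(assign_kpoints(5, 2))      # five k points shared between two groups
```

With 8 processors and 2 groups, each group of 4 processors diagonalises its own subset of the k points, whereas a request for 3 groups is rejected because 3 is not a factor of 8.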
The implementations in CONQUEST will be submitted to the code repository after further testing and will be available in the future (beta) release of the code obtainable from http://hamlin.phys.ucl.ac.uk/NewCQWeb/bin/view.
Lianheng Tong 2011-03-02