Preliminary test calculations were run on both HECToR XT5h (with Cray LibSci 10.5.0) and a local Sun workstation with 2 AMD Opteron 2214 processors (dual-core, 2.2 GHz) (with ACML for LAPACK and a locally compiled ScaLAPACK). The test system was bulk aluminium with a 32-atom unit cell, with a k-point mesh and Fermi-Dirac smearing with a temperature of 0.001 Ha. In all cases we did a non-self-consistent calculation on 4 nodes. The results are shown in the table below.
Processor Grid | ScaLAPACK Block | Wall Time
<#3801#> | 1 | 2318.599
2 | |
4 | |
<#3814#> | 1 | 8794.051
2 | |
4 | |
The results clearly show that k-point parallelisation significantly improves calculation speed compared with the original implementation, given the same amount of resources. This improvement is more apparent on platforms where the ScaLAPACK libraries are not highly optimised.
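To make the idea of using the same resources in a different way concrete, the following minimal Python sketch divides a fixed set of processes into groups and hands each group its own subset of k-points. It is an illustration only, not the scheme used in the code under test: the function name `split_kpoints`, the round-robin assignment, and the example values (8 k-points, 4 processes, group counts of 1, 2 and 4) are all assumptions made for this sketch.

```python
def split_kpoints(n_kpoints, n_procs, n_groups):
    """Divide processes into equal groups and assign k-points round-robin.

    Hypothetical helper for illustration; not taken from the tested code.
    Returns one dict per group with its process ranks and k-point indices.
    """
    if n_groups < 1 or n_procs % n_groups != 0:
        raise ValueError("n_groups must be a positive divisor of n_procs")
    procs_per_group = n_procs // n_groups
    groups = []
    for g in range(n_groups):
        first_rank = g * procs_per_group
        groups.append({
            # Consecutive ranks form one group (an assumed layout).
            "procs": list(range(first_rank, first_rank + procs_per_group)),
            # Group g takes k-points g, g + n_groups, g + 2 * n_groups, ...
            "kpoints": list(range(g, n_kpoints, n_groups)),
        })
    return groups


if __name__ == "__main__":
    # Example values are assumptions, chosen only to show the layouts.
    for n_groups in (1, 2, 4):
        print(f"{n_groups} group(s):")
        for group in split_kpoints(n_kpoints=8, n_procs=4, n_groups=n_groups):
            print("  procs", group["procs"], "-> k-points", group["kpoints"])
```

With one group, every process works on every k-point and each diagonalisation is spread over all processes. With more groups, each diagonalisation runs on fewer processes while several k-points are handled at the same time.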
Lianheng Tong 2011-03-02