Test Results

Preliminary test calculations were run on both HECToR XT5h (with Cray LibSci 10.5.0) and a local Sun workstation with 2 $\times$ AMD Opteron 2214 processors (dual-core, 2.2 GHz; ACML for LAPACK and a locally compiled ScaLAPACK). The test system was bulk aluminium with a 32-atom unit cell, a $13 \times 13 \times 13$ $\vec{k}$ point mesh, and Fermi-Dirac smearing with a temperature of 0.001 Ha. In all cases we ran a non-self-consistent calculation on 4 nodes. The results are shown in the table below.

Platform	Process Groups	Processor Grid	ScaLAPACK Block	Wall Time
HECToR XT5h	1	$1 \times 4$	$26 \times 26$	2318.599
	2	$1 \times 2$	$26 \times 26$
	4	$1 \times 1$	$26 \times 26$
Sun Workstation	1	$1 \times 4$	$26 \times 26$	8794.051
	2	$1 \times 2$	$26 \times 26$
	4	$1 \times 1$	$26 \times 26$

The results show that $\vec{k}$ point parallelisation significantly improves calculation speed over the original implementation for the same amount of resources. The improvement is more apparent on platforms where the ScaLAPACK libraries are not highly optimised.
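
The relation between the number of process groups and the processor grid in the table can be illustrated with a minimal Python sketch. It is not taken from the code under test. It assumes that the 4 nodes correspond to 4 processes, that each group uses a $1 \times P$ row grid as listed in the table, and that $\vec{k}$ points are dealt out round-robin. The function names `group_layout` and `assign_kpoints` are hypothetical, and the count of $13^3$ $\vec{k}$ points ignores any symmetry reduction.

```python
def group_layout(n_procs, n_groups):
    """Per-group processor grid, assuming a 1 x P row layout (as in the table)."""
    if n_groups < 1 or n_procs % n_groups != 0:
        raise ValueError("n_procs must be a positive multiple of n_groups")
    return (1, n_procs // n_groups)


def assign_kpoints(n_kpoints, n_groups):
    """Round-robin assignment of k-point indices to groups (illustrative choice)."""
    return [list(range(g, n_kpoints, n_groups)) for g in range(n_groups)]


if __name__ == "__main__":
    n_procs = 4            # assumption: 4 nodes == 4 processes
    n_kpoints = 13 ** 3    # full mesh, no symmetry reduction (assumption)
    for n_groups in (1, 2, 4):
        grid = group_layout(n_procs, n_groups)
        sizes = [len(ks) for ks in assign_kpoints(n_kpoints, n_groups)]
        print(f"groups={n_groups} grid={grid[0]}x{grid[1]} k-points per group={sizes}")
```

With more groups, each group diagonalises fewer $\vec{k}$ points on a smaller processor grid, so less of the work depends on how well the distributed ScaLAPACK routines scale.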

Lianheng Tong 2011-03-02