ScaLAPACK Performance Profiling

The profiling of CONQUEST was done using CrayPAT on the HECToR XT5h (compiled with Cray LibSci 10.5.0). The test is based on a calculation for bulk aluminium with a 32-atom unit cell and a $1 \times 4$ BLACS processor grid. This processor grid was used because it is the one recommended by BLACS for small matrices, and because the calculation failed at the diagonalisation stage with a $2 \times 2$ process grid. The table below shows the performance comparison between different ScaLAPACK block dimensions.

ScaLAPACK Block     CONQUEST Wall Time (s)    CrayPAT Wall Time (s)
$13 \times 13$      8062.169                  7981.555
$26 \times 26$      7673.137                  7626.500
$52 \times 52$      8197.790                  8163.051
$104 \times 104$    8477.819                  8451.286
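The block size matters because ScaLAPACK distributes matrices over the process grid in a 2D block-cyclic fashion, so the block dimension controls how evenly the matrix is spread across processes. A minimal pure-Python sketch of the local-size count, mirroring ScaLAPACK's NUMROC utility (with the distribution starting at process 0); the matrix dimension 104 used below is purely illustrative and is not the actual CONQUEST matrix size:

```python
def numroc(n, nb, iproc, nprocs):
    """Number of rows/columns of an n-sized global dimension stored
    locally on process iproc (0-based), for block size nb distributed
    over nprocs processes -- a sketch of ScaLAPACK's NUMROC with
    the source process ISRCPROC = 0."""
    nblocks = n // nb                  # complete blocks in this dimension
    count = (nblocks // nprocs) * nb   # full rounds of blocks per process
    extra = nblocks % nprocs           # leftover complete blocks
    if iproc < extra:
        count += nb                    # one more complete block
    elif iproc == extra:
        count += n % nb                # the trailing partial block
    return count

# Illustration: with a 1x4 grid and 26x26 blocks, 104 columns split
# evenly (26 per process); with 104x104 blocks, process 0 holds all
# of them and the other three hold nothing.
print([numroc(104, 26, p, 4) for p in range(4)])   # [26, 26, 26, 26]
print([numroc(104, 104, p, 4) for p in range(4)])  # [104, 0, 0, 0]
```

This illustrates why overly large blocks hurt on a small grid: past a certain size, whole processes are left without any local data to work on.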

Results for non-square blocks are omitted because in these cases the calculation again failed at diagonalisation. As one can see, the optimum block dimension is $26 \times 26$, which marks roughly a 10% improvement over the CONQUEST default input value of $104 \times 104$. The main bottleneck and the largest load imbalance in the calculations were found (after specifying the trace group MPI in CrayPAT) to be the MPI_Recv calls within the ScaLAPACK subroutine pzhegvx used for diagonalisation. The table below lists the largest load imbalances, as percentages, for the different block sizes. For calculations with larger block sizes, however, the large load imbalances in the MPI_Recv calls are partially offset by the relatively smaller (but still significant) number of calls, and the main bottleneck shifts from MPI_Recv to MPI_Bcast.

ScaLAPACK Block     Largest Load Imbalance (MPI_Recv) %    % Time    Largest Load Imbalance (MPI_Bcast) %    % Time
$13 \times 13$      18.7806                                29.4      3.0170                                  16.7
$26 \times 26$      10.1233                                32.6      7.2327                                  15.2
$52 \times 52$      38.3164                                48.0      13.1732                                 18.0
$104 \times 104$    50.6869                                35.8      30.7551                                 27.5
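The imbalance figures above are CrayPAT's "Imb%" metric. As a rough sketch, the formula commonly quoted for it is shown below; this is an assumption about the exact definition (the CrayPAT documentation is authoritative). The $n/(n-1)$ factor rescales the metric so that a single overloaded process among otherwise idle peers scores 100%:

```python
def imbalance_pct(times):
    """CrayPAT-style load-imbalance percentage for per-process times:
    100 * (max - mean) / max * n / (n - 1).
    Assumed formula -- check the CrayPAT manual for the exact definition."""
    n = len(times)
    tmax = max(times)
    tavg = sum(times) / n
    return 100.0 * (tmax - tavg) / tmax * n / (n - 1)

# One process taking twice as long as the other three:
print(imbalance_pct([4.0, 2.0, 2.0, 2.0]))  # 50.0
```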

This indicates clearly that the choice of ScaLAPACK block size is a determining factor in the efficiency of the ScaLAPACK routines. For the 32-atom bulk aluminium calculation the optimal value appears to be $26 \times 26$.

We also compared the efficiency of the ScaLAPACK subroutine pzhegvx with its LAPACK equivalent zhegvx for a single-processor case. These calculations were performed on a local Sun workstation with 2 $\times$ AMD Opteron 2214 processors (2.2 GHz, 2 cores each). The table below shows the results of the comparison.

            Total Wall Time (s)    Time Spent Calculating Density Matrix (s)
LAPACK      6590.180               5803.934
ScaLAPACK   8767.027               7973.105

So it appears that the LAPACK subroutine is inherently more efficient, although this may have much to do with the differing quality of the libraries used: the LAPACK implementation is supplied by the AMD Core Math Library (ACML) as part of the standard set of libraries already on the machine, whereas the ScaLAPACK implementation is a locally compiled version.
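Both zhegvx and its parallel counterpart pzhegvx solve the generalized Hermitian eigenproblem $A x = \lambda B x$; internally this is reduced to a standard eigenproblem via a Cholesky factorisation of $B$. A minimal pure-Python sketch of that reduction for a real symmetric $2 \times 2$ pencil (illustrative only; the library routines handle complex Hermitian matrices of arbitrary size and also compute eigenvectors):

```python
import math

def chol2(b):
    """Cholesky factor L (lower triangular) of a 2x2 symmetric
    positive-definite matrix b, so that b = L L^T."""
    l00 = math.sqrt(b[0][0])
    l10 = b[1][0] / l00
    l11 = math.sqrt(b[1][1] - l10 * l10)
    return [[l00, 0.0], [l10, l11]]

def eig2_sym(c):
    """Eigenvalues of a 2x2 symmetric matrix, in ascending order,
    from the closed-form roots of the characteristic polynomial."""
    tr = c[0][0] + c[1][1]
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    d = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    return (tr / 2.0 - d, tr / 2.0 + d)

def gen_eigvals2(a, b):
    """Eigenvalues of A x = lambda B x via reduction to standard form:
    with B = L L^T, the matrix C = inv(L) A inv(L)^T is symmetric and
    has the same eigenvalues as the pencil (A, B)."""
    L = chol2(b)
    # Forward-solve L X = A, column by column.
    X = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        X[0][j] = a[0][j] / L[0][0]
        X[1][j] = (a[1][j] - L[1][0] * X[0][j]) / L[1][1]
    # Solve C L^T = X row by row, giving C = inv(L) A inv(L)^T.
    C = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        C[i][0] = X[i][0] / L[0][0]
        C[i][1] = (X[i][1] - C[i][0] * L[1][0]) / L[1][1]
    return eig2_sym(C)

# det(A - lambda B) = 2 lambda^2 - 7 lambda + 5 = 0  =>  lambda = 1, 2.5
print(gen_eigvals2([[2.0, 1.0], [1.0, 3.0]], [[1.0, 0.0], [0.0, 2.0]]))
```

The parallel routine performs the same mathematical steps on block-cyclically distributed matrices, which is where the communication overhead measured above comes from.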

Lianheng Tong 2011-03-02