ScaLAPACK Performance Profiling

The profiling of CONQUEST was done using CrayPAT on the HECToR XT5h (compiled with Cray LibSci 10.5.0). The test is based on a calculation for bulk aluminium with a 32-atom unit cell and a $1 \times 4$ BLACS processor grid. This processor grid was used because it is the one recommended by BLACS for small matrices, and because the calculation failed at the diagonalisation stage with a $2 \times 2$ process grid. The table below shows the performance comparison between different ScaLAPACK block dimensions.

ScaLAPACK Block     CONQUEST Wall Time (s)    CrayPAT Wall Time (s)
$13 \times 13$      8062.169                  7981.555
$26 \times 26$      7673.137                  7626.500
$52 \times 52$      8197.790                  8163.051
$104 \times 104$    8477.819                  8451.286
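The block size matters because ScaLAPACK distributes matrices over the process grid in a 2D block-cyclic fashion, so the block dimension controls how evenly the matrix is spread across processes. A minimal pure-Python sketch of the local-size count, mirroring ScaLAPACK's NUMROC utility (with the distribution starting at process 0); the matrix dimension 104 used below is purely illustrative and is not the actual CONQUEST matrix size:

```python
def numroc(n, nb, iproc, nprocs):
    """Number of rows/columns of an n-sized global dimension stored
    locally on process iproc (0-based), for block size nb distributed
    over nprocs processes -- a sketch of ScaLAPACK's NUMROC with
    the source process ISRCPROC = 0."""
    nblocks = n // nb                  # complete blocks in this dimension
    count = (nblocks // nprocs) * nb   # full rounds of blocks per process
    extra = nblocks % nprocs           # leftover complete blocks
    if iproc < extra:
        count += nb                    # one more complete block
    elif iproc == extra:
        count += n % nb                # the trailing partial block
    return count

# Illustration: with a 1x4 grid and 26x26 blocks, 104 columns split
# evenly (26 per process); with 104x104 blocks, process 0 holds all
# of them and the other three hold nothing.
print([numroc(104, 26, p, 4) for p in range(4)])   # [26, 26, 26, 26]
print([numroc(104, 104, p, 4) for p in range(4)])  # [104, 0, 0, 0]
```

This illustrates why overly large blocks hurt on a small grid: past a certain size, whole processes are left without any local data to work on.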

Results for non-square blocks are omitted because in these cases the calculation again failed at diagonalisation. As one can see, the optimum block dimension is $26 \times 26$, which marks roughly a 10% improvement over the CONQUEST default input value of $104 \times 104$. The main bottleneck and the largest load imbalance in the calculations were found (after specifying the trace group MPI in CrayPAT) to be the MPI_Recv calls within the ScaLAPACK subroutine pzhegvx used for diagonalisation. The table below lists the largest load imbalances, as percentages, for the different block sizes. For calculations with larger block sizes, however, the large load imbalances in the MPI_Recv calls are partially offset by the relatively smaller (but still significant) number of calls, and the main bottleneck shifts from MPI_Recv to MPI_Bcast.

ScaLAPACK Block     Largest Load Imbalance (MPI_Recv) %    % Time    Largest Load Imbalance (MPI_Bcast) %    % Time
$13 \times 13$      18.7806                                29.4      3.0170                                  16.7
$26 \times 26$      10.1233                                32.6      7.2327                                  15.2
$52 \times 52$      38.3164                                48.0      13.1732                                 18.0
$104 \times 104$    50.6869                                35.8      30.7551                                 27.5
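The imbalance figures above are CrayPAT's "Imb%" metric. As a rough sketch, the formula commonly quoted for it is shown below; this is an assumption about the exact definition (the CrayPAT documentation is authoritative). The $n/(n-1)$ factor rescales the metric so that a single overloaded process among otherwise idle peers scores 100%:

```python
def imbalance_pct(times):
    """CrayPAT-style load-imbalance percentage for per-process times:
    100 * (max - mean) / max * n / (n - 1).
    Assumed formula -- check the CrayPAT manual for the exact definition."""
    n = len(times)
    tmax = max(times)
    tavg = sum(times) / n
    return 100.0 * (tmax - tavg) / tmax * n / (n - 1)

# One process taking twice as long as the other three:
print(imbalance_pct([4.0, 2.0, 2.0, 2.0]))  # 50.0
```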

This indicates clearly that the choice of ScaLAPACK block size is a determining factor in the efficiency of the ScaLAPACK routines. For the 32-atom bulk aluminium calculation the optimal value appears to be $26 \times 26$.

We also compared the efficiency of the ScaLAPACK subroutine pzhegvx with its LAPACK equivalent zhegvx for a single-processor case. These calculations were performed on a local Sun workstation with 2 $\times$ AMD Opteron 2214 processors (2.2 GHz, 2 cores each). The table below shows the results of the comparison.

            Total Wall Time (s)    Time Spent Calculating Density Matrix (s)
LAPACK      6590.180               5803.934
ScaLAPACK   8767.027               7973.105

So it appears that the LAPACK subroutine is inherently more efficient, although this may have much to do with the differing quality of the libraries used: the LAPACK implementation is supplied by the AMD Core Math Library (ACML) as part of the standard set of libraries already on the machine, whereas the ScaLAPACK implementation is a locally compiled version.
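Both zhegvx and its parallel counterpart pzhegvx solve the generalized Hermitian eigenproblem $A x = \lambda B x$; internally this is reduced to a standard eigenproblem via a Cholesky factorisation of $B$. A minimal pure-Python sketch of that reduction for a real symmetric $2 \times 2$ pencil (illustrative only; the library routines handle complex Hermitian matrices of arbitrary size and also compute eigenvectors):

```python
import math

def chol2(b):
    """Cholesky factor L (lower triangular) of a 2x2 symmetric
    positive-definite matrix b, so that b = L L^T."""
    l00 = math.sqrt(b[0][0])
    l10 = b[1][0] / l00
    l11 = math.sqrt(b[1][1] - l10 * l10)
    return [[l00, 0.0], [l10, l11]]

def eig2_sym(c):
    """Eigenvalues of a 2x2 symmetric matrix, in ascending order,
    from the closed-form roots of the characteristic polynomial."""
    tr = c[0][0] + c[1][1]
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    d = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    return (tr / 2.0 - d, tr / 2.0 + d)

def gen_eigvals2(a, b):
    """Eigenvalues of A x = lambda B x via reduction to standard form:
    with B = L L^T, the matrix C = inv(L) A inv(L)^T is symmetric and
    has the same eigenvalues as the pencil (A, B)."""
    L = chol2(b)
    # Forward-solve L X = A, column by column.
    X = [[0.0, 0.0], [0.0, 0.0]]
    for j in range(2):
        X[0][j] = a[0][j] / L[0][0]
        X[1][j] = (a[1][j] - L[1][0] * X[0][j]) / L[1][1]
    # Solve C L^T = X row by row, giving C = inv(L) A inv(L)^T.
    C = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(2):
        C[i][0] = X[i][0] / L[0][0]
        C[i][1] = (X[i][1] - C[i][0] * L[1][0]) / L[1][1]
    return eig2_sym(C)

# det(A - lambda B) = 2 lambda^2 - 7 lambda + 5 = 0  =>  lambda = 1, 2.5
print(gen_eigvals2([[2.0, 1.0], [1.0, 3.0]], [[1.0, 0.0], [0.0, 2.0]]))
```

The parallel routine performs the same mathematical steps on block-cyclically distributed matrices, which is where the communication overhead measured above comes from.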

Lianheng Tong 2011-03-02