Profiling of CONQUEST was carried out with CrayPAT on HECToR XT5h (the code was compiled with Cray LibSci 10.5.0). The test is based on a calculation for bulk aluminium with a 32-atom unit cell and a BLACS processor grid given by . This processor grid was used because it is the one recommended by BLACS for small matrices, and because the calculation failed at the diagonalisation stage for a process grid of . The table below compares performance for different ScaLAPACK block dimensions.
ScaLAPACK Block | CONQUEST Wall Time (s) | CrayPAT Wall Time (s)
 | 8062.169 | 7981.555
 | 7673.137 | 7626.500
 | 8197.790 | 8163.051
 | 8477.819 | 8451.286
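For reference, both the processor grid and the block dimensions enter the calculation through the BLACS context and the ScaLAPACK array descriptors. The following is a minimal, self-contained Fortran sketch of that setup (not code taken from CONQUEST); the 2x2 grid and 8x8 blocks are illustrative values chosen for the example, not the settings used in the runs above.

```fortran
program blacs_grid_demo
  implicit none
  ! Illustrative values only, not the HECToR settings.
  integer, parameter :: n  = 32          ! global matrix order
  integer, parameter :: mb = 8, nb = 8   ! ScaLAPACK block dimensions
  integer :: iam, nprocs, ictxt, nprow, npcol, myrow, mycol
  integer :: locrows, loccols, lld, info
  integer :: desca(9)
  integer, external :: numroc

  call blacs_pinfo(iam, nprocs)          ! rank and number of processes
  call blacs_get(-1, 0, ictxt)           ! default system context
  nprow = 2; npcol = 2                   ! processor grid (run on 4 processes)
  call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
  call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)

  if (myrow >= 0 .and. mycol >= 0) then
     ! Local dimensions implied by the block-cyclic distribution
     locrows = numroc(n, mb, myrow, 0, nprow)
     loccols = numroc(n, nb, mycol, 0, npcol)
     lld = max(1, locrows)
     ! mb and nb here are the block dimensions varied in the table above
     call descinit(desca, n, n, mb, nb, 0, 0, ictxt, lld, info)
     if (info /= 0) print *, 'descinit failed, info =', info
     print *, 'process (', myrow, ',', mycol, ') holds a ', locrows, ' x ', loccols, ' local array'
     call blacs_gridexit(ictxt)
  end if
  call blacs_exit(0)
end program blacs_grid_demo
```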
Results for non-square blocks are omitted because in these cases the calculation again failed at diagonalisation. As one can see, the optimum block dimension for this calculation is , which marks roughly a 10% improvement over the CONQUEST default input value of . The main bottleneck and the largest load imbalance in the calculations were found (after specifying the MPI trace group in CrayPAT) to be the MPI_Recv calls within the ScaLAPACK subroutine pzhegvx used for diagonalisation. The table below lists the largest load imbalances, as percentages, for different block sizes. For calculations with larger block sizes, however, the large load imbalance in the MPI_Recv calls is partially offset by the smaller (but still significant) number of such calls, and the main bottleneck shifts from MPI_Recv to MPI_Bcast.
ScaLAPACK Block | Largest Load Imbalance (MPI_Recv) % | % Time (MPI_Recv) | Largest Load Imbalance (MPI_Bcast) % | % Time (MPI_Bcast)
 | 18.7806 | 29.4 | 3.0170 | 16.7
 | 10.1233 | 32.6 | 7.2327 | 15.2
 | 38.3164 | 48.0 | 13.1732 | 18.0
 | 50.6869 | 35.8 | 30.7551 | 27.5
This indicates clearly that the choice of ScaLAPACK block size is a determining factor in the efficiency of the ScaLAPACK routines. For the 32-atom bulk aluminium calculation the optimal value appears to be .
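For context, the call pattern for pzhegvx follows the usual ScaLAPACK convention of a workspace query (lwork = lrwork = liwork = -1) followed by the actual solve. The sketch below illustrates this in an invented wrapper (solve_gep and its argument names are hypothetical, not CONQUEST's interface); the pzhegvx argument list itself follows the standard ScaLAPACK documentation.

```fortran
! Hypothetical wrapper: solve the generalized problem A x = lambda B x with pzhegvx.
! a, b and z are the local parts of block-cyclically distributed n x n matrices
! described by desca, descb and descz (set up as in the descinit sketch above).
subroutine solve_gep(n, nprow, npcol, a, desca, b, descb, z, descz, w, nev)
  implicit none
  integer, intent(in)           :: n, nprow, npcol
  integer, intent(in)           :: desca(9), descb(9), descz(9)
  complex*16, intent(inout)     :: a(*), b(*)
  complex*16, intent(out)       :: z(*)
  double precision, intent(out) :: w(n)      ! eigenvalues
  integer, intent(out)          :: nev       ! number of eigenvalues found

  integer :: lwork, lrwork, liwork, info, nz
  integer, allocatable :: iwork(:), ifail(:), iclustr(:)
  double precision, allocatable :: rwork(:), gap(:)
  complex*16, allocatable :: work(:)
  double precision :: abstol, orfac
  double precision, external :: pdlamch

  allocate(ifail(n), iclustr(2*nprow*npcol), gap(nprow*npcol))
  abstol = 2.0d0*pdlamch(desca(2), 'S')      ! desca(2) is the BLACS context
  orfac  = -1.0d0                            ! default reorthogonalisation threshold

  ! Workspace query: negative lwork/lrwork/liwork return the required sizes
  allocate(work(1), rwork(1), iwork(1))
  call pzhegvx(1, 'V', 'A', 'U', n, a, 1, 1, desca, b, 1, 1, descb,   &
               0.d0, 0.d0, 0, 0, abstol, nev, nz, w, orfac,           &
               z, 1, 1, descz, work, -1, rwork, -1, iwork, -1,        &
               ifail, iclustr, gap, info)
  lwork  = int(dble(work(1)))
  lrwork = int(rwork(1))
  liwork = iwork(1)
  deallocate(work, rwork, iwork)
  allocate(work(lwork), rwork(lrwork), iwork(liwork))

  ! Actual diagonalisation: eigenvalues in w, eigenvectors in z
  call pzhegvx(1, 'V', 'A', 'U', n, a, 1, 1, desca, b, 1, 1, descb,   &
               0.d0, 0.d0, 0, 0, abstol, nev, nz, w, orfac,           &
               z, 1, 1, descz, work, lwork, rwork, lrwork, iwork, liwork, &
               ifail, iclustr, gap, info)
  if (info /= 0) print *, 'pzhegvx returned info =', info

  deallocate(work, rwork, iwork, ifail, iclustr, gap)
end subroutine solve_gep
```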
We also compared the efficiency of the ScaLAPACK subroutine pzhegvx with its LAPACK equivalent zhegvx for a single-processor case. These calculations were performed on a local Sun workstation with two AMD Opteron 2214 processors (2.2 GHz, 2 cores each). The table below shows the results of the comparison.
 | Total Wall Time (s) | Time Spent Calculating Density Matrix (s)
LAPACK | 6590.180 | 5803.934
ScaLAPACK | 8767.027 | 7973.105
So it appears that the LAPACK subroutine is inherently more efficient. This may have much to do with the difference in quality of the libraries used: the LAPACK implementation is supplied by the AMD Core Math Library (ACML), which is part of the standard set of libraries already on the machine, whereas the ScaLAPACK implementation is a locally compiled version.
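For comparison, the serial LAPACK path calls zhegvx with essentially the same query-then-solve convention. Below is a minimal, self-contained sketch using a toy 4x4 Hermitian pair (a Hilbert-like A and an identity B) rather than the actual CONQUEST Hamiltonian and overlap matrices.

```fortran
program zhegvx_demo
  implicit none
  integer, parameter :: n = 4
  complex*16 :: a(n,n), b(n,n), z(n,n), query(1)
  complex*16, allocatable :: work(:)
  double precision :: w(n), rwork(7*n), abstol
  integer :: iwork(5*n), ifail(n), info, m, lwork, i, j
  double precision, external :: dlamch

  ! Toy Hermitian A (Hilbert-like) and Hermitian positive-definite B (identity)
  do j = 1, n
     do i = 1, n
        a(i,j) = cmplx(1.0d0/dble(i+j-1), 0.0d0, kind=kind(0.0d0))
        b(i,j) = (0.0d0, 0.0d0)
     end do
     b(j,j) = (1.0d0, 0.0d0)
  end do

  abstol = 2.0d0*dlamch('S')   ! tolerance recommended for most accurate eigenvalues

  ! Workspace query (lwork = -1)
  call zhegvx(1, 'V', 'A', 'U', n, a, n, b, n, 0.d0, 0.d0, 0, 0, abstol, &
              m, w, z, n, query, -1, rwork, iwork, ifail, info)
  lwork = int(dble(query(1)))
  allocate(work(lwork))

  ! Solve A x = lambda B x
  call zhegvx(1, 'V', 'A', 'U', n, a, n, b, n, 0.d0, 0.d0, 0, 0, abstol, &
              m, w, z, n, work, lwork, rwork, iwork, ifail, info)
  print *, 'info =', info, '  eigenvalues:', w(1:m)
end program zhegvx_demo
```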
Lianheng Tong 2011-03-02