
Node Usage

Each node on HECToR has two cores, or PEs, so we ran a series of Castep calculations to see how performance and scaling depend on the number of PEs used per node. We also used these runs to double-check the results of our investigation into the different libraries. The results are shown in figure 3.5.

Figure 3.5: Comparison of Castep performance for the ACML and LibSci (Goto) BLAS libraries, and the generic GPFA and FFTW3 FFT libraries, run using two cores per node (3.5(a)) and one core per node (3.5(b))
[Figure: two panels -- (a) execution time using 2 cores per node; (b) execution time using 1 core per node]

As figure 3.5 shows, the best performance was achieved with the Goto BLAS in Cray's LibSci version 10.2.0, coupled with the FFTW3 FFT library.

The best performance per core was achieved using only one core per node, though the improvement over using both cores was not large enough to justify the extra cost, since jobs are charged per node rather than per core. Using Castep's facility for optimising communications within an SMP node, the scaling improved dramatically and rivals that of the one-core-per-node runs.
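The cost argument above can be made concrete with a little arithmetic. A minimal sketch, using hypothetical walltimes rather than measured HECToR data: a fixed-size run on 16 PEs occupies 8 nodes at 2 cores per node, but 16 nodes at 1 core per node, so under per-node charging the 1-core-per-node layout is only economical if it more than halves the walltime.

```python
def node_hours(nodes, walltime_hours):
    """Cost when the machine charges per node occupied, not per core used."""
    return nodes * walltime_hours

# Hypothetical, illustrative timings (not measured results): one core per
# node is somewhat faster, but nowhere near twice as fast.
t_two_per_node = 1.00   # walltime on 8 nodes, 2 cores per node
t_one_per_node = 0.85   # walltime on 16 nodes, 1 core per node

cost_two = node_hours(8, t_two_per_node)    # 8.0 node-hours
cost_one = node_hours(16, t_one_per_node)   # 13.6 node-hours

print(cost_two, cost_one)
```

With these assumed numbers the 1-core-per-node run costs 70% more despite its better per-core performance, which is the trade-off described above.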

It may be possible to run one MPI process per node for the main Castep calculation and use the node's other core for threaded BLAS, but this would be a major project in itself and we do not propose to undertake it in this work.
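The idea of letting a library spread its dense linear algebra across a node's cores can be illustrated with a toy analogue. This is a minimal Python sketch, not Castep's Fortran code or a real threaded BLAS: it splits the rows of a matrix multiply between two worker threads, standing in for one BLAS call using both cores of a node.

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_rows(A, B, rows):
    """Compute the given rows of the product A*B for list-of-lists matrices."""
    k, n = len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in rows]

def matmul_two_threads(A, B):
    """Split the row range between two threads, mimicking a 2-way threaded BLAS."""
    m = len(A)
    half = m // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        top = pool.submit(matmul_rows, A, B, range(0, half))
        bottom = pool.submit(matmul_rows, A, B, range(half, m))
        return top.result() + bottom.result()

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_two_threads(A, B))  # [[19, 22], [43, 50]]
```

In a real hybrid scheme the MPI decomposition and the BLAS thread count would both need tuning, which is part of why this is a substantial project rather than a configuration change.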


Sarfraz A Nadeem 2008-09-01