[Figure panels: execution time using 2 cores per node; execution time using 1 core per node]
The best performance was achieved with the Goto BLAS in Cray's libsci version 10.2.0 coupled with the FFTW3 library, as can be seen in figure 3.5.
The best performance per core was achieved using only one core per node, though the improvement over using both cores was not sufficient to justify the expense, since jobs are charged per node, not per core. Using Castep's facility for optimising communications within an SMP node (footnote 3.1), the scaling when using both cores per node improved dramatically and rivals that of the one-core-per-node runs.
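The cost argument can be made concrete with a small sketch. It assumes only that jobs are charged in node-hours regardless of how many cores per node are used; the process count and wall times below are hypothetical placeholders chosen for illustration, not measurements from this work, and the function names are our own.

```python
import math


def nodes_needed(mpi_processes: int, cores_used_per_node: int) -> int:
    """Number of nodes required to place the given number of MPI processes."""
    return math.ceil(mpi_processes / cores_used_per_node)


def node_hours(mpi_processes: int, cores_used_per_node: int,
               wall_time_hours: float) -> float:
    """Charge for a run when billing is per node, not per core."""
    return nodes_needed(mpi_processes, cores_used_per_node) * wall_time_hours


def break_even_wall_time(mpi_processes: int, two_core_wall_time_hours: float) -> float:
    """Wall time a one-core-per-node run must beat to cost less than
    the two-cores-per-node run with the same number of MPI processes."""
    two_core_cost = node_hours(mpi_processes, 2, two_core_wall_time_hours)
    return two_core_cost / nodes_needed(mpi_processes, 1)


# Hypothetical example: 64 MPI processes.
processes = 64
t_two_cores = 1.0   # hours, hypothetical
t_one_core = 0.8    # hours, hypothetical (faster, but not twice as fast)

print(node_hours(processes, 2, t_two_cores))         # 32 nodes charged
print(node_hours(processes, 1, t_one_core))          # 64 nodes charged
print(break_even_wall_time(processes, t_two_cores))  # time the 1-core run must beat
```

For an even process count, the one-core-per-node run occupies twice as many nodes, so it must finish in under half the wall time of the two-cores-per-node run before it becomes the cheaper option.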
It may be possible to run one MPI process per node for the main Castep calculation and use the node's other core for threaded BLAS, but this would be a major project in itself and we do not propose to undertake it in this work.