
Porting to GPU(s)

The initial aim of this task was to speed up the slowest part of the TB calculations, the diagonalisation, by offloading it to a single GPU. Once this was achieved, the plan was to use multiple GPUs attached to the nodes of a cluster, combining ScaLAPACK's MPI parallelisation with accelerated CUBLAS running locally on each GPU. The other two expensive parts of the code, the structure constants evaluation and the density matrix construction, were to be ported later. The Hamiltonian assembly involves rather divergent code paths; it is also very quick on a CPU, so porting it was deemed unnecessary.

Initial testing of the new GPU revealed disappointing results for our greatest hope, the diagonalisation routines. For sizes of $\approx 10000$ one card was nearly twice as fast as the 16 cores of the Xeon E5-2650 2.0GHz machine, but it was slower than the Xeon E5-2690 2.9GHz machine. It was originally thought that this might be due to a problem with either the test or the setup; however, this possibility was dismissed after the results were confirmed by the MAGMA team. The 2.0GHz cores are not twice as slow as the 2.9GHz cores of the same processor family, but the two machines do differ in memory capacity, latency and frequency. Memory access affects dgemv severely, and this is the major cause of the difference in the performance of the diagonalisation drivers.


Table 2: Time in seconds for diagonalisation with ScaLAPACK's pdsyevd (Xeon E5 hosts) and MAGMA's dsyevd (K20c).

     N   E5 2.9GHz    K20c   E5 2.0GHz
  6144       11.58   12.23       27.27
 10240       47.62   50.45       121.7
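
The GPU-side call used for these measurements follows the familiar LAPACK dsyevd interface. Below is a minimal sketch of how such a call looks with MAGMA's CPU-interface routine magma_dsyevd, assuming the symmetric matrix is held in host memory and the MagmaVec/MagmaLower constants from magma.h; the wrapper name gpu_dsyevd is introduced here only for illustration and error checking is omitted.

/* Minimal sketch: diagonalising a symmetric matrix with MAGMA's
 * CPU-interface dsyevd.  A is n x n, column-major, in host memory;
 * eigenvalues are returned in w and eigenvectors overwrite A.
 * Build against MAGMA, e.g.  cc eig.c -lmagma -lcublas -lcudart  */
#include <stdlib.h>
#include <magma.h>

int gpu_dsyevd(magma_int_t n, double *A, double *w)
{
    magma_int_t info, lwork, liwork, liwork_query;
    double lwork_query;

    magma_init();                     /* normally done once per run */

    /* Workspace query, as with LAPACK: lwork = liwork = -1. */
    magma_dsyevd(MagmaVec, MagmaLower, n, A, n, w,
                 &lwork_query, -1, &liwork_query, -1, &info);
    lwork  = (magma_int_t) lwork_query;
    liwork = liwork_query;

    double      *work  = malloc(lwork  * sizeof(double));
    magma_int_t *iwork = malloc(liwork * sizeof(magma_int_t));

    /* Actual decomposition; the heavy work runs on the GPU. */
    magma_dsyevd(MagmaVec, MagmaLower, n, A, n, w,
                 work, lwork, iwork, liwork, &info);

    free(work);
    free(iwork);
    magma_finalize();                 /* normally done once per run */
    return (int) info;
}

Since magma_dsyevd takes a host-resident matrix, it can stand in for a single-process dsyevd call without changing how the matrix is stored.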


For matrix multiplication, however, the situation was more satisfactory, with one card being two to four times faster for sizes ranging in the mid thousands. For smaller sizes a CPU may still outperform a K20 card, even with the highly optimised cublasDgemm. This is likely the reason why simply plugging CUBLAS/MAGMA into TBE as a drop-in replacement for the BLAS/LAPACK called by ScaLAPACK showed a significant slowdown: in that configuration the GPU is fed matrices of the same block size as a single BLAS/LAPACK process, and block sizes of 2048-3072, which would be needed just to break even against the slower Xeon E5-2650 2.0GHz CPU, are impractical for TBE.


Table 3: Time in seconds for dgemm from MKL (Xeon E5 hosts) and CUBLAS (K20c).

     N   E5 2.9GHz    K20c   E5 2.0GHz
  6144        3.70    0.45        4.16
 10240        9.82    2.05       16.93
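
To make the break-even behaviour concrete, the sketch below shows a host-resident cublasDgemm call together with the transfers it implies, assuming the cuBLAS v2 API; the wrapper name gpu_dgemm is introduced here only for illustration and error checking is omitted. For small block sizes the two cublasSetMatrix copies and the final cublasGetMatrix copy dominate, which is why the CPU BLAS still wins there.

/* Minimal sketch: C = A*B on the GPU for n x n column-major host
 * matrices, including the host<->device traffic that must be paid
 * before the K20c's dgemm advantage shows up. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

void gpu_dgemm(int n, const double *A, const double *B, double *C)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    size_t bytes = (size_t) n * n * sizeof(double);
    double *dA, *dB, *dC;
    cudaMalloc((void **) &dA, bytes);
    cudaMalloc((void **) &dB, bytes);
    cudaMalloc((void **) &dC, bytes);

    /* Host-to-device copies: O(n^2) traffic per matrix. */
    cublasSetMatrix(n, n, sizeof(double), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(double), B, n, dB, n);

    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    /* Device-to-host copy of the result. */
    cublasGetMatrix(n, n, sizeof(double), dC, n, C, n);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
}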



