Structure constants on GPUs and MPI

With the density matrix obtained, albeit not in the way envisioned at the outset, the next and final challenge was the structure constants matrix. All code needed for the structure constants was rewritten in C and tested for correctness and performance on the CPU. An initial naive single-GPU implementation, using a single thread for each block of the matrix corresponding to one atom pair, was developed first; its performance, however, was an order of magnitude slower than what 16 CPUs could achieve. After a number of incremental improvements, the time to build the full matrix on the GPU was brought below that of the CPUs.

The final implementation first determines the maximal block size present in the matrix and then launches the corresponding template function instance, known to give reasonable performance. It is worth noting that setting the thread block size to the maximal B block size is fairly wasteful; on the other hand, a thread block that is too small (fewer than 32 threads) may be divergent and also overloads each thread with local memory requirements, so that not very many can be launched. There is thus a balance to be struck. Inside each thread block, one thread initialises all shared variables to the appropriate values, all threads then cooperate in the evaluation of the Hankel function, and afterwards each thread computes one or more elements of the structure constants. This routine is integrated into the overarching MPI parallelisation, giving each GPU a proportionally smaller amount of data to process. This setup yields a significant performance increase over the 16-core system.


DP 2013-08-01