Next: Conclusion and future work
Up: Porting to GPU(s)
Previous: Multi GPU
Contents
With the density matrix obtained, albeit in a way not envisioned in the beginning, the next and final challenge was the structure
constants matrix. All code necessary for the structure constants was rewritten in C and tested for bugs and performance on the CPU.
An initial naive implementation on a single GPU, using a single thread for the block corresponding to each atom pair, was developed first. However,
it ran an order of magnitude slower than what could be achieved with 16 CPU cores.
After a number of incremental improvements, the time to build the full matrix on the GPU was brought below that achieved on the CPUs. The approach
finally adopted was to determine the maximal block size present in the matrix and then launch an appropriate template function instance which is
known to give reasonable performance. It is worth noting that a thread block size set to the maximal B block size is fairly wasteful. On the other hand, a thread
block that is too small (i.e. fewer than 32 threads) may suffer from divergence and also burdens each thread with larger local memory requirements, so that fewer threads can be
launched; there is a balance to be struck. Inside each thread block, one thread initialises all shared variables to the appropriate
values and all threads cooperate in the evaluation of the Hankel function. Afterwards each thread computes one or more elements
of the structure constants. This routine is integrated within the overarching MPI parallelisation, giving each GPU a proportionally smaller
share of the data to fill. This setup gives a significant performance increase over the 16-core system.
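
The selection of a template instance from the maximal block size can be illustrated with a minimal host-side sketch in C++. It is not the code described above: the names LaunchPlan, build_structure_constants, max_block_size and dispatch_structure_constants are hypothetical, and the set of instances (32, 64, 128 and 256 threads) and the thresholds between them are assumptions chosen only to show the idea. In a CUDA build the selected instance would presumably be a kernel launched with that many threads per thread block; here it only reports the choice it stands for.

```cpp
#include <cassert>
#include <vector>

// Hypothetical summary of what a launch would look like.
struct LaunchPlan {
    int threads_per_block;     // template instance that was selected
    int max_elems_per_thread;  // elements per thread in the largest block
};

// Hypothetical stand-in for the templated GPU routine. THREADS is a
// compile-time constant, so a real kernel could size its shared arrays
// with it and have its loops unrolled by the compiler.
template <int THREADS>
LaunchPlan build_structure_constants(int max_block) {
    static_assert(THREADS % 32 == 0, "keep the thread block a multiple of 32");
    // When a block has more elements than there are threads, each thread
    // handles several elements (ceiling division).
    return LaunchPlan{THREADS, (max_block + THREADS - 1) / THREADS};
}

// Largest block, in matrix elements, over all atom pairs.
int max_block_size(const std::vector<int>& block_sizes) {
    int largest = 0;
    for (int size : block_sizes) {
        if (size > largest) largest = size;
    }
    return largest;
}

// Map the runtime maximal block size onto a small set of compile-time
// instances. The lower bound of 32 and the cap of 256 are assumptions.
LaunchPlan dispatch_structure_constants(const std::vector<int>& block_sizes) {
    const int largest = max_block_size(block_sizes);
    assert(largest > 0 && "at least one non-empty block is expected");
    if (largest <= 32)  return build_structure_constants<32>(largest);
    if (largest <= 64)  return build_structure_constants<64>(largest);
    if (largest <= 128) return build_structure_constants<128>(largest);
    return build_structure_constants<256>(largest);
}
```

With block sizes of 9, 36 and 81 elements, for instance, the largest block has 81 elements, so this sketch selects the 128-thread instance and each thread handles at most one element. A block of 1024 elements would fall back to the 256-thread instance, with four elements per thread.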
DP 2013-08-01