
Firstly, Table () shows that the original code does not scale at all beyond 32 cores. The FFT communication cost becomes the bottleneck, and it is not possible to perform the computation in less than 400 seconds with the original code.
Table (), on the other hand, shows that with the kpoints parallelized code we can employ 4 times more cores and complete the simulation in 122 seconds (a speedup of 3.6). The problem, though, is that we cannot use more than 128 cores in this case, whereas potentially we could use 640 (number of kpoints (20) × 32). This is because new kpoint meshes are generated during this particular calculation. When KPAR is an exact divisor of the number of kpoints in the new mesh, our kpoints parallelized code performs the calculation efficiently. When it is not, the code exits. In this case the original kmesh had 20 kpoints, the second kmesh 52 and the third 68. Apart from the trivial value 1, only 2 and 4 are common divisors of these three numbers. Hence the largest value that can be used for KPAR is 4, which at 32 cores per kpoint group gives the 128-core limit.
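The divisor argument above can be checked with a short script. This is only an illustrative sketch, not part of the code discussed here: the function name valid_kpar_values and the variable cores_per_group are hypothetical, and the figure of 32 cores per kpoint group is assumed from the 20 × 32 = 640 estimate in the text.

```python
from functools import reduce
from math import gcd


def valid_kpar_values(kpoint_counts):
    """Return every KPAR that exactly divides all the given kpoint counts.

    Hypothetical helper for illustration: a value divides every count
    exactly when it divides their greatest common divisor.
    """
    common = reduce(gcd, kpoint_counts)
    return [d for d in range(1, common + 1) if common % d == 0]


# Kpoint counts of the three meshes quoted in the text.
kpoint_counts = [20, 52, 68]

# Assumption: 32 cores per kpoint group, as implied by 20 x 32 = 640.
cores_per_group = 32

kpar_values = valid_kpar_values(kpoint_counts)
max_kpar = max(kpar_values)
max_cores = max_kpar * cores_per_group

print("valid KPAR values:", kpar_values)   # [1, 2, 4]
print("largest usable KPAR:", max_kpar)    # 4
print("largest core count:", max_cores)    # 128
```

Running the same check before a calculation, with the kpoint counts of every mesh that will be generated, would show in advance which KPAR values avoid the early exit.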
Asimina Maniopoulou 20110709