- Test 1: A hydrogen defect in 32 atoms of palladium. 10 k-points are used.
- Test 2: A unit cell of litharge (α-PbO), a total of 4 atoms. 108 k-points are used.
- Test 3: A cell of PbO using 126 k-points.
- Test 4: Again α-PbO, using 24 k-points.
- Test 5: A phonon calculation with 20 k-points.

In Tests 1-3 the PBE exchange-correlation functional is used and the k-mesh is generated by the
Monkhorst-Pack method.
Tests 4-5 involve Hartree-Fock calculations, and the k-mesh is generated with the Gamma-centered method.
All runs except for the phonon calculation are single-point
energy calculations.

All runs were performed on the phase 2b component of the HECToR system, the UK's national supercomputing service. This is a large Cray XE6 system. Each node is based on AMD Magny-Cours processors and contains 24 cores clocked at 2.1 GHz. There are 32 GB of memory associated with each node, and inter-node communication is via Cray's Gemini network. More details may be found at the HECToR web site [7].

In Tables (), (), (), () and () we compare the performance of VASP 5.2.2 with the new k-point parallelized code and study the scaling of the new code. In
each case the original code is compared with the k-point code with increasing numbers of k-point
groups. All times reported are total run times, i.e. not just the time for the energy minimisation.
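
As a rough guide to reading the scaling with the number of k-point groups, the following Python sketch estimates an upper bound on the speedup from the k-point counts of the five tests. It is only an illustration under stated assumptions: the k-points are assumed to be split as evenly as possible between the groups, every k-point is assumed to cost the same, and all communication and serial overheads are ignored. The function names are hypothetical and are not part of VASP.

```python
import math

# k-point counts for the five tests, taken from the list of tests above.
TEST_KPOINTS = {
    "Test 1": 10,
    "Test 2": 108,
    "Test 3": 126,
    "Test 4": 24,
    "Test 5": 20,
}


def max_kpoints_per_group(n_kpoints: int, n_groups: int) -> int:
    """Number of k-points handled by the busiest group.

    Assumes the k-points are divided as evenly as possible between groups.
    """
    if n_kpoints < 1 or n_groups < 1:
        raise ValueError("n_kpoints and n_groups must be positive")
    return math.ceil(n_kpoints / n_groups)


def ideal_speedup(n_kpoints: int, n_groups: int) -> float:
    """Upper bound on the speedup over a single k-point group.

    Assumes equal cost per k-point and no overheads, so the run time is
    proportional to the load of the busiest group.
    """
    return n_kpoints / max_kpoints_per_group(n_kpoints, n_groups)


if __name__ == "__main__":
    for name, n_kpoints in TEST_KPOINTS.items():
        for n_groups in (2, 4, 8):
            bound = ideal_speedup(n_kpoints, n_groups)
            print(f"{name}: {n_kpoints} k-points, {n_groups} groups "
                  f"-> ideal speedup <= {bound:.2f}")
```

Under these assumptions, Test 1 with 10 k-points on 4 groups leaves the busiest group with 3 k-points, so the bound is 10/3 rather than 4. Group counts that divide the number of k-points exactly avoid this imbalance.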

In Tables (), (), () and () we compare the performance of VASP 5.2.2 using the optimal NPAR value with the performance
of the k-point parallelized code when the same number of cores is used. We demonstrate that efficient use of a large number of cores is now possible for cases with more than one k-point.

It should be noted that choosing an appropriate NPAR value is essential for running VASP efficiently. The optimal value of NPAR in the original code depends on the total number of cores employed. For the k-point parallelized code, the optimal value of NPAR depends on the number of cores in one k-group. Hence the
value of NPAR that was optimal for the original code on n cores will also be the most efficient choice for m k-groups on m × n cores when using the k-point parallelized code.
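
The rule for carrying an NPAR value over from the original code can be sketched in a few lines of Python. This is a minimal illustration, not part of VASP: it assumes that the total core count is divided evenly between the k-groups, and it takes a user-supplied table of NPAR values already tuned for the original code. The function names and the table are hypothetical, and the numbers in the usage example are placeholders rather than measured optima.

```python
from typing import Mapping


def cores_per_kgroup(total_cores: int, n_kgroups: int) -> int:
    """Cores available to each k-group, assuming an even split."""
    if total_cores < 1 or n_kgroups < 1:
        raise ValueError("total_cores and n_kgroups must be positive")
    if total_cores % n_kgroups != 0:
        raise ValueError("total_cores must be divisible by n_kgroups")
    return total_cores // n_kgroups


def npar_for_kpoint_code(
    total_cores: int,
    n_kgroups: int,
    optimal_npar_original: Mapping[int, int],
) -> int:
    """NPAR to use with the k-point parallelized code.

    optimal_npar_original maps a core count to the NPAR value found to be
    optimal for the original code on that many cores (hypothetical,
    user-supplied tuning data). The lookup uses the cores in ONE k-group,
    not the total core count.
    """
    group_cores = cores_per_kgroup(total_cores, n_kgroups)
    if group_cores not in optimal_npar_original:
        raise KeyError(f"no tuned NPAR recorded for {group_cores} cores")
    return optimal_npar_original[group_cores]


if __name__ == "__main__":
    # PLACEHOLDER values for illustration only; these are not measurements.
    placeholder_table = {24: 2, 48: 4, 96: 8}
    # 192 cores in 4 k-groups gives 48 cores per group, so the value tuned
    # for the original code on 48 cores is reused.
    print(npar_for_kpoint_code(192, 4, placeholder_table))
```

The point of the lookup is that only the size of one k-group matters: doubling both the total core count and the number of k-groups leaves the chosen NPAR unchanged.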