
Application Benchmarks

To demonstrate the benefit of these changes to the user, the full PW.X executable was compiled in three versions: with the original alltoallv implementation, with the new padded alltoall, and with the SHM alltoallv modification. Each version was then used to run the first two stages of the GWW calculation on HECToR Phase 2a. The results are shown in Table 2.


Table 2: Comparison of FFT transpose methods for application benchmarks (times in seconds except where specified)
                     CNT40 (16 cores)     Silicon (64 cores)   CNT80 (64 cores)
Version              exc_scf  exc_nscf    exc_scf  exc_nscf    exc_scf  exc_nscf1  exc_nscf2
Original alltoallv   18.9     33.0        102      657         595      2h15m      4h55m
Padded alltoall      16.7     29.0        76       559         441      2h10m      5h15m
Speedup              12%      12%         25%      15%         26%      4%         -7%
SHM alltoallv        14.6     21.0        77       466         424      2h10m      3h16m
Speedup              23%      36%         25%      29%         29%      4%         34%

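In all cases but one, both the padded alltoall and SHM alltoallv methods are faster than the original alltoallv implementation; the exception is the padded alltoall on the second part of the CNT80 nscf step, which is 7% slower. Note that for CNT80 the nscf step is split in two (see section 2) and the first part is dominated by linear algebra rather than the FFT, so the choice of transpose method makes little difference there (4% in both cases). The SHM alltoallv matches or outperforms the padded alltoall in every case except the Silicon exc_scf step, where it is 1 second slower (77 against 76), so it is recommended that this method always be used on HECToR.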
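
The Speedup rows of Table 2 are not defined explicitly in the text. They appear to be consistent with the percentage reduction in run time relative to the original alltoallv version, rounded to the nearest whole percent. The short Python sketch below checks this reading against two table entries; the function names parse_seconds and speedup_percent are hypothetical helpers introduced here for illustration and are not part of PW.X or the benchmark scripts.

```python
def parse_seconds(text):
    """Convert a Table 2 entry such as '18.9' or '2h15m' to seconds."""
    if "h" in text:
        hours, rest = text.split("h")
        minutes = rest.rstrip("m") or "0"
        return int(hours) * 3600 + int(minutes) * 60
    return float(text)


def speedup_percent(original, modified):
    """Assumed definition: percentage reduction in run time
    relative to the original alltoallv version."""
    t0 = parse_seconds(original)
    t1 = parse_seconds(modified)
    return round(100.0 * (t0 - t1) / t0)


# CNT40 exc_scf, SHM alltoallv against original alltoallv
print(speedup_percent("18.9", "14.6"))    # 23
# CNT80 exc_nscf2, padded alltoall against original alltoallv
print(speedup_percent("4h55m", "5h15m"))  # -7
```

Under this assumed definition a negative value, as for the padded alltoall on CNT80 exc_nscf2, means the modified version took longer than the original.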


Iain Bethune
2010-12-10