Application Benchmarks

To demonstrate the benefits of these changes to the user, the full PW.X executable was compiled in three versions: one with the original alltoallv implementation, one with the new padded alltoall, and one with the SHM alltoallv modification. Each version was then used to run the first two stages of the GWW calculation (on HECToR Phase 2a). The results are shown in Table 2.

Table 2: Comparison of FFT transpose methods for application benchmarks (times in seconds except where specified)

	CNT40 (16 cores)		Silicon (64 cores)		CNT80 (64 cores)
Version	exc_scf	exc_nscf	exc_scf	exc_nscf	exc_scf	exc_nscf1	exc_nscf2
Original alltoallv	18.9	33.0	102	657	595	2h15m	4h55m
Padded alltoall	16.7	29.0	76	559	441	2h10m	5h15m
Speedup	12%	12%	25%	15%	26%	4%	-7%
SHM alltoallv	14.6	21.0	77	466	424	2h10m	3h16m
Speedup	23%	36%	25%	29%	29%	4%	34%

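In all cases but one, both the padded alltoall and SHM alltoallv methods are faster than the original alltoallv implementation; the exception is the second nscf step for CNT80, where the padded alltoall is 7% slower. Note that for CNT80 the nscf step is split in two (see section 2) and the first part is dominated by linear algebra rather than the FFT, so the choice of transpose method has little effect there (4% for both methods). The SHM alltoallv matches or outperforms the padded alltoall in every case except the Silicon exc_scf step, where the two differ by only one second (77 s versus 76 s), so it is recommended that this method always be used on HECToR.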
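
The source does not define how the Speedup rows in Table 2 are calculated. The figures are consistent with the fractional reduction in run time relative to the original alltoallv build, (t_orig - t_new) / t_orig, rounded to the nearest percent. The short Python sketch below checks that reading against the table. The formula is an inference from the numbers, and the helper names `to_seconds` and `speedup_percent` are hypothetical.

```python
def to_seconds(t):
    """Convert a Table 2 entry (a number of seconds, or a string like "2h15m") to seconds."""
    if isinstance(t, (int, float)):
        return float(t)
    hours, rest = t.split("h")
    minutes = rest.rstrip("m")
    return int(hours) * 3600.0 + int(minutes) * 60.0


def speedup_percent(t_orig, t_new):
    """Assumed definition: percentage reduction in run time relative to the original."""
    o, n = to_seconds(t_orig), to_seconds(t_new)
    return round(100.0 * (o - n) / o)


# Timings copied from Table 2, in column order:
# CNT40 scf, CNT40 nscf, Silicon scf, Silicon nscf, CNT80 scf, CNT80 nscf1, CNT80 nscf2
original = [18.9, 33.0, 102, 657, 595, "2h15m", "4h55m"]
padded = [16.7, 29.0, 76, 559, 441, "2h10m", "5h15m"]
shm = [14.6, 21.0, 77, 466, 424, "2h10m", "3h16m"]

padded_speedup = [speedup_percent(o, n) for o, n in zip(original, padded)]
shm_speedup = [speedup_percent(o, n) for o, n in zip(original, shm)]

print(padded_speedup)  # [12, 12, 25, 15, 26, 4, -7]
print(shm_speedup)     # [23, 36, 25, 29, 29, 4, 34]
```

Under this definition a negative value, such as the -7% for the padded alltoall on the second CNT80 nscf step, means the modified build ran longer than the original.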

Iain Bethune
2010-12-10