The following two code regions would benefit from vectorisation. Each belongs to a subroutine that is called on every time step iteration, so any optimisation there is critical to the performance of CABARET. In particular, the PHASE2, BOUND and VISCOSITY subroutines contain loops over the cell sides. CrayPat identifies these loops as the most computationally intensive part of the code, accounting for around 60% of the CPU time:
PHASE1 and PHASE3
DO K=1,NCELL
   ...
   KEYE=KEY(K,1)
   ...
   NSI=GEMCELLSIDE(K,1)
   ...
   UI=SIDE(NSI,1)
   ...
   SXI=SIDE(NSI,11)*DFLOAT(KEYE)
   ...
END DO
PHASE2, BOUND and VISCOSITY
DO I=1,NSIDE
   NCF=GEMSIDECELL(I,1)
   NCB=GEMSIDECELL(I,2)
   IF((NCF/=0).AND.(NCB/=0))THEN
      CALL TAKESTENCIL1F(I)
      CALL TAKESTENCIL1B(I)
      IF (ABS(CHAR3B)<DEPS) CHAR3B=0
      IF (ABS(CHAR3F)<DEPS) CHAR3F=0
      IF(CHAR3B+CHAR3F.LE.0)THEN
         ...
      ENDIF
   ENDIF
   IF(NCF<0) THEN
      ...
   ENDIF
   IF(NCB<0) THEN
      ...
   ENDIF
   ...
   NCF=GEMSIDECELL(I,1)
   NCB=GEMSIDECELL(I,2)
   NSTYPE=GEMSIDECELL(I,3)
   NC=NCF
   IF(NCF==0)NC=NCB
   NA1=GEMCELLAPEX(NC,1)
   ...
   NA7=GEMCELLAPEX(NC,7)
   IF(NSTYPE==2651)THEN
      NSA1=NA2
      NSA2=NA6
      NSA3=NA5
      NSA4=NA1
   ENDIF
   ...
END DO
If the compiler is able to generate packed SSE instructions for these loops, some performance improvement would certainly be gained. However, as the loops stand there is little scope for vectorisation, and the constraints of the algorithm do not allow the code to be restructured. We have compiled the CABARET code with the default versions of the main compilers currently available on HECToR, i.e. PGI 10.9.0, Pathscale 3.2.99, Cray 7.2.8 and GNU 4.5.1.
The results show that none of the compilers can perform any level of vectorisation on PHASE2, BOUND and VISCOSITY. However, Pathscale and Cray both manage to produce partially vectorised instructions for the loops in PHASE1 and PHASE3. The Cray compiler reports that it has estimated the number of vector registers required. The Pathscale compiler reports that the LOOP WAS VECTORIZED. Removing the mixed data types in PHASE1 and PHASE3, i.e. eliminating the conversion of the INTEGER KEYE to REAL(KIND=8) via the intrinsic function DFLOAT, also allows PGI to produce partially vectorised instructions for these loops.
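The removal of the mixed data type can be illustrated with a minimal sketch. It is not the CABARET source: the array RKEY, the array sizes and the fill values are hypothetical, and REAL(..., KIND=dp) is used as the standard-conforming equivalent of DFLOAT. The idea is to convert the INTEGER array KEY to REAL(KIND=8) once, outside the time step loop, so that the loop body contains only double precision arithmetic. Whether a given compiler then vectorises the loop is not guaranteed, since the indirect access through GEMCELLSIDE remains.

```fortran
program keye_conversion_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! double precision, i.e. REAL(KIND=8) on most compilers
  integer, parameter :: ncell = 8          ! hypothetical sizes, for illustration only
  integer, parameter :: nside = 16
  integer  :: key(ncell, 1)                ! integer key, as in the PHASE1/PHASE3 listing
  real(dp) :: rkey(ncell, 1)               ! hypothetical pre-converted copy of KEY
  integer  :: gemcellside(ncell, 1)
  real(dp) :: side(nside, 11)
  real(dp) :: sxi_mixed(ncell), sxi_real(ncell)
  integer  :: k, nsi

  ! Fill the arrays with illustrative data (not CABARET data).
  do k = 1, ncell
     key(k, 1) = merge(1, -1, mod(k, 2) == 0)
     gemcellside(k, 1) = 2*k - 1           ! stays within 1..nside
  end do
  side = 0.0_dp
  do k = 1, nside
     side(k, 11) = 0.5_dp * real(k, dp)
  end do

  ! One-off conversion, done before the time step loop.
  rkey = real(key, dp)

  ! Mixed-type form: integer-to-real conversion inside the loop.
  do k = 1, ncell
     nsi = gemcellside(k, 1)
     sxi_mixed(k) = side(nsi, 11) * real(key(k, 1), dp)
  end do

  ! Single-type form: only double precision operands in the loop body.
  do k = 1, ncell
     nsi = gemcellside(k, 1)
     sxi_real(k) = side(nsi, 11) * rkey(k, 1)
  end do

  ! Both forms must give identical results.
  if (any(sxi_mixed /= sxi_real)) then
     print '(a)', 'MISMATCH'
     stop 1
  end if
  print '(a)', 'OK'
end program keye_conversion_sketch
```

The cost of this approach is one extra REAL(KIND=8) array of the same shape as KEY. The conversion itself is exact for the small integer values used here.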
For all three compilers, the optimisation options (i.e. compilation flags) used enable inter-procedural analysis, loop unrolling and loop nesting (where possible) for the target processor in 64-bit mode. On HECToR, the Pathscale-compiled version of CABARET performs on average 5% faster than the PGI- or Cray-generated object code. The GNU-generated object code performs worst, on average 5% slower than the PGI- and Cray-generated object code.
Phil Ridley 2011-02-01