Loop Structure

The following two code extracts highlight regions that would benefit from vectorisation. Each belongs to a subroutine that is called on every time step, so any optimisation here is critical to the performance of CABARET. In particular, the PHASE2, BOUND and VISCOSITY subroutines contain loops over the sides of each cell. CrayPat identifies these loops as the most computationally intensive part of the code, accounting for around 60% of the CPU time:

PHASE1 and PHASE3

DO K=1,NCELL
  ...
  KEYE=KEY(K,1)
  ...
  NSI=GEMCELLSIDE(K,1)
  ...
  UI=SIDE(NSI,1)
  ...
  SXI=SIDE(NSI,11)*DFLOAT(KEYE)
  ... 
END DO

PHASE2, BOUND and VISCOSITY

DO I=1,NSIDE
  NCF=GEMSIDECELL(I,1) 
  NCB=GEMSIDECELL(I,2) 
  IF((NCF/=0).AND.(NCB/=0))THEN
    CALL TAKESTENCIL1F(I)
    CALL TAKESTENCIL1B(I)
    IF (ABS(CHAR3B)<DEPS) CHAR3B=0
    IF (ABS(CHAR3F)<DEPS) CHAR3F=0
    IF(CHAR3B+CHAR3F.LE.0) THEN ... ENDIF
  ENDIF
  IF(NCF<0) THEN ... ENDIF
  IF(NCB<0) THEN ... ENDIF
  ...
  NCF=GEMSIDECELL(I,1)
  NCB=GEMSIDECELL(I,2)
  NSTYPE=GEMSIDECELL(I,3)
  NC=NCF
  IF(NCF==0)NC=NCB
  NA1=GEMCELLAPEX(NC,1)
  ...
  NA7=GEMCELLAPEX(NC,7)
  IF(NSTYPE==2651)THEN
    NSA1=NA2
    NSA2=NA6
    NSA3=NA5
    NSA4=NA1
  ENDIF
  ...
END DO

If the compiler can generate packed SSE instructions for these loops, some performance improvement would be gained. As written, however, the loops offer little scope for vectorisation (note the indirect addressing, subroutine calls and conditional branches), and the constraints of the algorithm do not allow the code to be restructured. We compiled the CABARET code with the default versions of the main compilers currently available on HECToR: PGI 10.9.0, Pathscale 3.2.99, Cray 7.2.8 and GNU 4.5.1.

The results show that none of the compilers can vectorise the loops in PHASE2, BOUND and VISCOSITY to any degree. However, Pathscale and Cray both produce partially vectorised instructions for the loops in PHASE1 and PHASE3: the Cray compiler reports that it has estimated the number of vector registers required, and the Pathscale compiler reports LOOP WAS VECTORIZED. With PGI, removing the mixed data types in PHASE1 and PHASE3, which eliminates the conversion of the INTEGER KEYE to REAL(KIND=8) by the intrinsic function DFLOAT, also helps to achieve partially vectorised instructions for these loops.
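The report does not show how the mixed data type was removed. One possible approach, given here only as a minimal sketch, is to convert the integer key to REAL(KIND=8) once, outside the time-step loop, so that the hot loop contains only real operands. The array RKEY, the one-dimensional stand-ins for KEY and SIDE, and the test data are all hypothetical and are not taken from CABARET. The standard intrinsic DBLE is used in place of the DFLOAT extension.

```fortran
! Sketch only: RKEY, SX and the test data are hypothetical, not CABARET code.
PROGRAM KEY_CONVERSION_SKETCH
  IMPLICIT NONE
  INTEGER, PARAMETER :: NCELL = 8
  INTEGER :: KEY(NCELL), K
  REAL(KIND=8) :: RKEY(NCELL), SX(NCELL)
  REAL(KIND=8) :: SXI_MIXED(NCELL), SXI_REAL(NCELL)

  ! Made-up data: orientation keys of +1/-1 and arbitrary side values.
  DO K = 1, NCELL
    KEY(K) = 1 - 2*MOD(K, 2)
    SX(K) = 0.5D0*DBLE(K)
  END DO

  ! Original pattern: INTEGER -> REAL(KIND=8) conversion inside the loop.
  DO K = 1, NCELL
    SXI_MIXED(K) = SX(K)*DBLE(KEY(K))
  END DO

  ! Assumed alternative: convert once, outside the time-step loop ...
  RKEY = DBLE(KEY)

  ! ... so that the hot loop contains only REAL(KIND=8) operands.
  DO K = 1, NCELL
    SXI_REAL(K) = SX(K)*RKEY(K)
  END DO

  ! Both forms apply the same conversion and multiply, so they agree exactly.
  IF (ANY(SXI_MIXED /= SXI_REAL)) STOP 1
  PRINT *, 'mixed-type and real-only loops agree'
END PROGRAM KEY_CONVERSION_SKETCH
```

The trade-off in this sketch is extra memory for the real-valued copy of the key in exchange for a loop body without type conversions. Whether a given compiler then vectorises the loop still has to be checked in its optimisation report.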

For all three compilers, the optimisation options (i.e. compilation flags) used enable inter-procedural analysis, loop unrolling and loop nesting (where possible) for the target processor in 64-bit mode. In terms of the best performing object code for CABARET on HECToR, the Pathscale-compiled version runs on average 5% faster than the PGI- or Cray-generated object code. The GNU-generated object code performs worst, running on average 5% slower than the PGI- and Cray-generated object code.

Phil Ridley 2011-02-01