Hybrid Parallelism in CABARET

In the pure distributed-data approach to parallel CABARET, data parallelism is carried out on partitioned sub-grids of the Gambit-generated computational grid, with each sub-grid assigned to a single MPI process within the global communicator. Non-blocking MPI calls are implemented within PHASE1, PHASE2 and PHASE3 and are placed so that computation is performed while communication is in progress. The unstructured decomposition used within the core CABARET algorithm has an indirect referencing scheme which manages access to halo data and the associated communications. However, to improve performance for Phase 2b we need to reduce contention for interconnect bandwidth by reducing the number of off-node MPI communications, using a shared memory approach for on-node parallelism.
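
As an outline of how the non-blocking calls are used to overlap communication with computation, the following is a minimal sketch of the receive side of a halo exchange. It is illustrative only: the routine name HALO_EXCHANGE_SKETCH and the names NNEIGH, NHALO, HALOBUF and REQ are hypothetical and are not taken from the CABARET source.

      SUBROUTINE HALO_EXCHANGE_SKETCH(NNEIGH,NHALO,NEIGH,HALOBUF,REQ)
! Illustrative sketch only: all names other than the MPI routines and
! constants are hypothetical, not taken from the CABARET source.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER NNEIGH,NHALO
      INTEGER NEIGH(NNEIGH),REQ(NNEIGH),IERR,I
      DOUBLE PRECISION HALOBUF(NHALO,NNEIGH)

! pre-post non-blocking receives of halo data from each neighbour
      DO I=1,NNEIGH
       CALL MPI_IRECV(HALOBUF(1,I),NHALO,MPI_DOUBLE_PRECISION,
     &  NEIGH(I),0,MPI_COMM_WORLD,REQ(I),IERR)
      END DO

! ... update halo-independent cells while messages are in flight ...

! complete the receives before updating halo-dependent cells
      CALL MPI_WAITALL(NNEIGH,REQ,MPI_STATUSES_IGNORE,IERR)

      END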

There are two possible strategies for implementing hybrid parallelism in CABARET:

  1. Assign on-node partitions to OpenMP threads rather than to MPI processes.
  2. Parallelise the loops involving the conservative and flux variable updates within PHASE1, VISCOSITY, PHASE2, BOUND and PHASE3.

Both methods will potentially reduce the number of off-node MPI communications, but option 1 should effectively already be happening through MPT's ability to manage on-node MPI processes, and should therefore bring little benefit. Choosing option 2 therefore makes more sense, and it introduces a new level of parallelism within CABARET.

The computational stencil for CABARET involves only nearest-neighbour grid points, which gives great potential for thread-safe OpenMP within the main computational loops. For the loops involving the conservative and flux variable updates within PHASE1, VISCOSITY, PHASE2, BOUND and PHASE3, shared memory OpenMP PARALLEL DO directives with a static schedule were implemented. These divide each loop into thread-safe chunks of size NCELL/OMP_NUM_THREADS or NSIDE/OMP_NUM_THREADS and assign each thread a separate (local) chunk of the loop. The MPI_THREAD_FUNNELED approach is used, allowing only the master thread to perform the inter-node communication, e.g.

!$OMP BARRIER 
!$OMP MASTER
        DO I=1,APEXNEIGHS
! pre-post required for CELL transfers

         CALL MPI_IRECV(CELLD(0,1,I),10*(MAXNCELL+1)
     &   ,MPI_DOUBLE_PRECISION,NEIGH(I),0,MPI_COMM_WORLD
     &   ,REQUESTIN(I),IERR)
        
        END DO ! APEXNEIGHS
!$OMP END MASTER
!$OMP BARRIER
...
!$OMP MASTER
        DO I=1,APEXNEIGHS
! post non-blocking synchronous sends of CELL data to each neighbour

         CALL MPI_ISSEND(CELL(0,1),10*(MAXNCELL+1)
     &  ,MPI_DOUBLE_PRECISION,NEIGH(I),0,MPI_COMM_WORLD
     &  ,REQUESTOUT(I),IERR )

        END DO ! APEXNEIGHS
!$OMP END MASTER
...
! wait for completion of the outgoing sends before CELL is reused
       DO I=1,APEXNEIGHS
        CALL MPI_WAIT(REQUESTOUT(I),STATUS,IERR)
       END DO ! APEXNEIGHS
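
Since only the master thread makes MPI calls, the library need only guarantee funnelled thread support at start-up. The following is a minimal sketch of such an initialisation; the surrounding program structure is illustrative and is not the actual CABARET start-up code.

      PROGRAM INIT_SKETCH
! Illustrative sketch of requesting funnelled thread support; the
! granted level returned in PROVIDED should always be checked.
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER PROVIDED,IERR

      CALL MPI_INIT_THREAD(MPI_THREAD_FUNNELED,PROVIDED,IERR)
      IF (PROVIDED.LT.MPI_THREAD_FUNNELED) THEN
       WRITE(*,*) 'MPI library does not provide funnelled threads'
       CALL MPI_ABORT(MPI_COMM_WORLD,1,IERR)
      END IF

! ... main CABARET computation would run here ...

      CALL MPI_FINALIZE(IERR)
      END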

All multi-threaded loops are able to use the NOWAIT clause, again because the computation involves only nearest-neighbour grid points.
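
In outline, one of these parallelised update loops takes the form sketched below. This is a minimal sketch, assuming the loops sit inside a single parallel region so that NOWAIT can take effect; UOLD, UNEW and the loop body are hypothetical placeholders for the actual CABARET conservative and flux variable updates.

      PROGRAM NOWAIT_SKETCH
! Illustrative sketch only: NCELL size, UOLD and UNEW are hypothetical
! stand-ins for the CABARET conservative/flux variable arrays.
      IMPLICIT NONE
      INTEGER NCELL
      PARAMETER (NCELL=100000)
      DOUBLE PRECISION UOLD(NCELL),UNEW(NCELL)
      INTEGER I

      UOLD=1.0D0
!$OMP PARALLEL PRIVATE(I)
! static schedule gives each thread one contiguous chunk of the loop
!$OMP DO SCHEDULE(STATIC)
      DO I=1,NCELL
! placeholder for the real nearest-neighbour update of cell I
       UNEW(I)=UOLD(I)
      END DO
! NOWAIT drops the barrier at the end of this worksharing loop
!$OMP END DO NOWAIT
! ... further NCELL/NSIDE loops follow within the same region ...
!$OMP END PARALLEL
      END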

Implementing OpenMP within the PHASE1, VISCOSITY, PHASE2, BOUND and PHASE3 loops is also the most worthwhile target, since these loops take around 60% of the CPU time and will not vectorise, for the reasons discussed in Section [*].

Phil Ridley 2011-02-01