A communicator named bc_comm is set up, and within this communicator each MPI process inserts the rows that it has been assigned into arrays to be sorted. Since each process sorts the rows that it will subsequently own during the diagonalization phase, little communication is required during the PETSc assembly stage. Process numbers in the figure are process IDs within bc_comm. Note that although the schematic might imply parallelism during the construction of the BC block, each process in fact runs through the entire set of nested loops involved in the construction, i.e., there is no partitioning of work during the construction of the BC block. Rather, during the construction, each process only inserts the rows that it has been assigned into arrays to be sorted.
While the BB block is difficult to handle, the BC block has caused more issues during this
project. Initially, the BC block was parallelized in a straightforward manner. The BC block
is built row-by-row but the columns within each row are unordered. The construction of the
BC block was initially parallelized by partitioning equal blocks of rows across all MPI processes.
This was implemented by a straightforward parallelization over the outer-most loop of all the loops associated
with the BC block construction. As each process constructs its BC rows, it subsequently inserts the
sparse matrix elements into ``BC arrays'' to be sorted. Upon sorting the arrays, the elements are
inserted into the PETSc matrix. The problem with this mechanism is that the parallelization over the outer-most loop lends itself to
the insertion of the elements into a lower-triangular matrix. However, as noted above, PETSc only accepts an upper-triangular
matrix. While the matrix elements can be easily transposed and subsequently inserted into the PETSc Mat object,
the subsequent assembly stage of the PETSc Mat object is extremely costly as too much data needs
to be communicated between processes. In essence, if the construction of the BC block is parallelized
over the outer-most loop, each process is not constructing the elements that it will later
own during the diagonalization stage. Each process should construct, as far as possible,
the elements that it will own during the diagonalization stage so as to cut down on communication during
the PETSc assembly stage. Re-engineering the construction of the BC block so that this can be the case for
a parallelized build appears to be a non-trivial task.
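The ownership mismatch described above can be illustrated with a small toy model. The sketch below is not the project's code: it assumes a dense, strictly lower-triangular block, a contiguous and nearly even split of rows across processes, and it simulates the processes in an ordinary loop rather than with MPI. The names row_range, owner_of and count_off_process are hypothetical. It counts how many elements, once transposed into the upper triangle, land in a row owned by the process that built them and how many would have to be sent to another process.

```python
def row_range(rank, nprocs, nrows):
    """Rows [start, end) owned by `rank`, assuming a contiguous, nearly even split."""
    base, extra = divmod(nrows, nprocs)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


def owner_of(row, nprocs, nrows):
    """Rank that owns `row` under the assumed contiguous layout."""
    for rank in range(nprocs):
        start, end = row_range(rank, nprocs, nrows)
        if start <= row < end:
            return rank
    raise ValueError("row out of range")


def count_off_process(nprocs, nrows):
    """Count (local, remote) elements when rows are partitioned over the outer loop.

    Each simulated process builds lower-triangular elements (i, j) with j < i
    for its own block of rows i. After transposition the element sits in
    row j, which may belong to a different process.
    """
    local = remote = 0
    for builder in range(nprocs):
        start, end = row_range(builder, nprocs, nrows)
        for i in range(start, end):      # outer loop, partitioned across processes
            for j in range(i):           # toy assumption: every j < i is non-zero
                if owner_of(j, nprocs, nrows) == builder:
                    local += 1
                else:
                    remote += 1
    return local, remote


if __name__ == "__main__":
    print(count_off_process(4, 8))
```

In this toy case with 4 processes and 8 rows, only 4 of the 28 elements stay on the process that built them, which is the kind of imbalance that makes the assembly stage expensive.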
In the current approach, each MPI process within bc_comm sweeps through the loops associated with the construction of the BC block in full, but inserts only the rows that have been assigned to it. This amounts to a serial construction of the BC block, but it significantly reduces the communication during the PETSc matrix assembly stage because each process inserts only the elements that it will later own during the diagonalization stage.
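A minimal sketch of this filtering scheme is given below. It is an illustration under stated assumptions rather than the actual implementation: sweep_bc_block is a hypothetical stand-in for the real nested loops, the elements are assumed to be generated already in the orientation in which they will be inserted, and row ownership is assumed to be a contiguous, nearly even split (in a PETSc code the range would more likely be obtained from the matrix, for example via MatGetOwnershipRange). The processes are again simulated by calling the function once per rank.

```python
def row_range(rank, nprocs, nrows):
    """Rows [start, end) owned by `rank`, assuming a contiguous, nearly even split."""
    base, extra = divmod(nrows, nprocs)
    start = rank * base + min(rank, extra)
    return start, start + base + (1 if rank < extra else 0)


def sweep_bc_block(nrows):
    """Hypothetical stand-in for the full nested BC loops.

    Yields (row, col, value) triples row by row, with the columns within
    each row deliberately out of order, as in the text.
    """
    for row in range(nrows):
        for col in range(nrows - 1, row, -1):
            yield row, col, float(row * nrows + col)


def build_owned_rows(rank, nprocs, nrows):
    """Run the whole sweep, keep only the rows owned by `rank`, then sort."""
    start, end = row_range(rank, nprocs, nrows)
    kept = [(r, c, v) for r, c, v in sweep_bc_block(nrows) if start <= r < end]
    kept.sort()  # order by row, then column, ready for insertion
    return kept


if __name__ == "__main__":
    for rank in range(3):
        print(rank, build_owned_rows(rank, 3, 6))
```

Every simulated process executes the complete sweep, so the work is not divided, but the sorted arrays on each process contain only rows that the process itself owns, and no element needs to move at assembly time.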
Paul Roberts 2012-06-01