A communicator named bb_comm is set up. Within this communicator each MPI process constructs the rows that it has been assigned. Since each process constructs the rows that it will subsequently own during the diagonalization phase, little communication is required during the PETSc assembly stage. Note also that, due to the underlying algorithm used during the construction of the BB block, the BB elements are produced in an ordered fashion and so no sorting is required. Process numbers in the figure are process IDs within bb_comm.
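The row ownership described above can be sketched as a simple block distribution. The function below is illustrative only (the name and the exact partitioning rule are assumptions, not taken from scatci_slepc): given a process's rank within bb_comm and the communicator size, it returns the contiguous range of rows that process both constructs and later owns during diagonalization.

```python
def local_row_range(rank, size, n_rows):
    """Block-distribute n_rows rows over `size` processes.

    Returns the half-open range [start, end) owned by `rank`, with any
    remainder rows spread over the lowest-ranked processes.  Hypothetical
    helper for illustration; not the actual scatci_slepc routine.
    """
    base, rem = divmod(n_rows, size)
    start = rank * base + min(rank, rem)
    end = start + base + (1 if rank < rem else 0)
    return start, end
```

Because each process inserts only rows from its own range, the subsequent MatAssemblyBegin/MatAssemblyEnd calls move little or no data between processes.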
Although the BB block is the largest block in the matrix and is also its genuinely sparse (and unstructured) part, from an algorithmic viewpoint its construction is quite straightforward: it relies on conditional statements, associated with symmetry principles, occurring within nested loops. Fortunately, the BB block is constructed in an ordered fashion, column by column. Again, because PETSc accepts only the upper-triangular part of a symmetric matrix, the indices associated with column number in scatci_serial now point to row number, and the indices associated with row number in scatci_serial now point to column number. To parallelize the construction of the BB block, the outermost of the do loops associated with the BB block construction has been modified so that each MPI process runs over its own local range of that loop.
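The parallelized loop structure can be sketched as follows. This is a minimal sketch, not the actual Fortran code: the helper names `element` and `symmetry_allowed` are hypothetical stand-ins for the matrix-element evaluation and the symmetry-based conditionals, and the outermost loop is assumed to run over columns, so that each process emits its elements column by column, already in order.

```python
def build_bb_local(col_start, col_end, element, symmetry_allowed):
    """Sketch: each process runs the outermost (column) loop over its
    local range [col_start, col_end) only.  Because PETSc stores only the
    upper triangle of a symmetric matrix, the inner loop is restricted
    to row indices i <= j (the row/column roles swapped relative to
    scatci_serial).  Hypothetical names, for illustration only.
    """
    local = []
    for j in range(col_start, col_end):       # outermost loop: locally owned columns
        for i in range(j + 1):                # upper triangle only: i <= j
            if symmetry_allowed(i, j):        # symmetry-based conditional
                local.append((i, j, element(i, j)))  # (row, col, value) triple
    return local
```

Since `j` increases monotonically and `i` increases within each `j`, the triples come out in the same ordered, column-by-column fashion as in the text, so no sorting pass is needed before insertion into the Mat object.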
While the structure of the CC and BC blocks is known before the matrix is constructed, the structure of the unstructured BB block is not known in advance, and so there is little possibility of load-balancing the construction of the matrix. Likewise, there is little possibility of load-balancing the diagonalization stage of scatci_slepc unless information on the structure of the BB block is available before the Hamiltonian matrix elements are inserted into the PETSc Mat object.

A crude approach taken within this version of scatci_slepc has been to first sweep through the nested loops associated with the construction of the BB block. During this initial sweep, the BB block is not constructed; rather, an array is filled that holds the number of non-zeros in each row of the BB block. This initial sweep is parallelized so that each process is assigned an equal number of rows to sweep through, and it is carried out by all MPI processes in PETSC_COMM_WORLD (MPI_COMM_WORLD). Once each process has filled its array of per-row non-zero counts, these arrays are gathered onto the master process. A load-balancing mechanism is then invoked that returns the number of rows to assign to each MPI process, so that each process holds approximately the same number of non-zero values for the diagonalization stage. A second sweep through the loops associated with the construction of the BB block then takes place, this time with each process constructing its assigned rows and inserting these elements directly into the PETSc Mat object. Only processes within a newly created communicator called bb_comm invoke this second sweep. Fortunately, the BB elements do not need to be sorted prior to being inserted into the PETSc Mat object. A schematic of how the BB elements are inserted can be seen in the figure.
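The load-balancing step between the two sweeps can be illustrated with a simple greedy split. This is a sketch under stated assumptions, not the actual scatci_slepc mechanism: given the gathered per-row non-zero counts, it cuts the rows into contiguous blocks so that each process receives approximately the same number of non-zeros.

```python
def balance_rows(nnz_per_row, n_procs):
    """Greedy sketch of the load-balancing mechanism described above.

    Given the non-zero count of every BB row (as gathered on the master
    process), split the rows into n_procs contiguous blocks whose total
    non-zero counts are roughly equal.  Returns the number of rows
    assigned to each process.  Hypothetical implementation, for
    illustration only.
    """
    target = sum(nnz_per_row) / n_procs       # ideal non-zeros per process
    counts = [0] * n_procs                    # rows assigned to each process
    proc, acc = 0, 0                          # current process and its nnz total
    for nnz in nnz_per_row:
        # Move to the next process once adding this row would exceed the
        # target, provided the current process already has at least one row.
        if proc < n_procs - 1 and acc + nnz > target and counts[proc] > 0:
            proc += 1
            acc = 0
        counts[proc] += 1
        acc += nnz
    return counts
```

With these row counts in hand, each process in bb_comm knows exactly which contiguous block of rows to construct during the second sweep, so the per-process non-zero totals (and hence the diagonalization workload) are roughly equalized.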
Paul Roberts 2012-06-01