The Hamiltonian-building routine has been substantially rewritten, giving a tenfold speedup from memory access optimisations. The cost of the routine was already linear, and the real-space Hamiltonian was already stored in a sparse format. Parallelisation over the atoms did show a speedup, but the gather it required negated any gains. This was not because the gather was slow but because the routine itself is particularly fast. The time spent in this routine is now negligible, and it is executed only once, before the self-consistency cycle begins.
The Bloch transform applied at each k-point produces k-dependent matrices from the real-space ones. This routine has also been modified so that only the necessary local pieces of the global 2D block-cyclic distributed arrays are built, thus avoiding the use of darray_scatter.
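The idea of building only the locally owned piece of a k-dependent matrix can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the routine described above: the names `local_indices` and `bloch_local` are hypothetical, the real-space Hamiltonian is held as dense blocks H(R) rather than in the sparse format the code uses, and the phase convention exp(i k.R) is assumed. The index rule is the standard block-cyclic one, applied separately to rows and columns of the process grid.

```python
import numpy as np


def local_indices(n_global, block, n_procs, rank):
    """Global indices owned by `rank` in a 1D block-cyclic layout.

    Block b (of size `block`) belongs to process b mod n_procs. A 2D
    block-cyclic layout applies this rule to rows and columns separately.
    """
    idx = np.arange(n_global)
    return idx[(idx // block) % n_procs == rank]


def bloch_local(h_blocks, r_vectors, k, rows, cols):
    """Local piece of H(k) = sum_R exp(i k.R) H(R), for given rows and cols.

    h_blocks  : (n_R, n, n) array, dense real-space blocks H(R) (assumption:
                dense storage is used here only to keep the sketch short)
    r_vectors : (n_R, 3) array of lattice translation vectors R
    k         : (3,) array, the k-point
    rows,cols : 1D integer arrays of the global indices owned locally
    """
    # One phase factor per lattice vector R.
    phases = np.exp(1j * (r_vectors @ k))
    # Pick out only the locally owned rows and columns of every H(R);
    # the result has shape (n_R, len(rows), len(cols)).
    local = h_blocks[:, rows[:, None], cols[None, :]]
    # Contract over R, so the full n x n matrix is never formed.
    return np.tensordot(phases, local, axes=(0, 0))
```

Each process would call `local_indices` once for its row and once for its column coordinate in the process grid, then call `bloch_local` per k-point. Because every process fills its own piece directly, no global matrix needs to be assembled and scattered afterwards.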
It is possible to avoid the repeated Bloch transforms during self-consistency by saving the matrices for all k-points and then transforming only the diagonal blocks. This works because the off-diagonal updates, when needed, are applied directly to the saved k-dependent matrices. Table 1 lists the estimated memory requirements for typical use scenarios. There are cases with few atoms, 100-300, whose computational cost would suit small computers but for which the memory usage will be prohibitively high.
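A rough estimate shows why storing the matrices for every k-point becomes expensive even for modest atom counts. The sketch below is a back-of-the-envelope calculation, not the method used to produce Table 1: the function name `stored_matrices_bytes` is hypothetical, and it assumes dense double-precision complex matrices (16 bytes per element) with the same number of orbitals on every atom.

```python
def stored_matrices_bytes(n_atoms, orbitals_per_atom, n_kpoints,
                          n_matrices=1, bytes_per_element=16):
    """Total bytes needed to keep dense k-dependent matrices for all k-points.

    n_matrices counts how many matrices are kept per k-point (for example
    1 for the Hamiltonian alone, 2 if an overlap matrix is stored as well).
    bytes_per_element=16 assumes double-precision complex numbers.
    """
    dim = n_atoms * orbitals_per_atom          # matrix dimension
    per_matrix = dim * dim * bytes_per_element  # one dense dim x dim matrix
    return per_matrix * n_matrices * n_kpoints


# Hypothetical example: 300 atoms, 9 orbitals each, 1000 k-points.
total = stored_matrices_bytes(300, 9, 1000)
print(f"{total / 1e9:.1f} GB")
```

With these illustrative inputs a single 2700 x 2700 matrix takes about 117 MB, so one matrix per k-point over 1000 k-points needs roughly 117 GB in total. The total is spread across processes when the matrices are distributed, but it still has to fit in the aggregate memory of the machine.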