Therefore, the solution was to use the cards for what they were good at, namely large matrix multiplication (dgemm)3, with an algorithm for obtaining the density matrix only through multiplication and no diagonalisation.
The basic idea behind the method is as follows. The Hamiltonian matrix and the eigenvalue diagonal matrix are representations of the same operator
but in different bases. Exactly the same transformation links the occupancy function and the density matrix. The occupancy is a function of the eigenvalues and for the simple
case of insulators it is a Heaviside step function positioned anywhere within the band gap. If a fast converging polynomial
approximation is found for the occupancy function and its position (Fermi level) is known a priori, then evaluating the polynomial
directly on the Hamiltonian by only using matrix multiplications and scaling will result in the full density matrix. A suitable
polynomial may be generated by recursively applying
[8]. Each iteration brings the resultant
polynomial closer to the step function. For water and titania approximately 7 iterations are enough to converge
to within machine
precision, from the one computed through diagonalisation. There are two matrix multiplications per iteration and a serial diagonalisation
is roughly 5-6 times slower than a matrix multiplication in the same conditions. In the end a full density matrix obtained from a single
GPU card was about 1.5-2 times faster than one obtained from the fast Xeon E5-2690 2.90Hz machine. A number of polynomials were found or developed and
tested and the most efficient and convenient is the sign function approximation:
with a
subsequent flip and shift of the resulting array. It is desirable to pick the
chemical potential closer to the centre because the order of the polynomial will be lower and the number of operations will be
reduced.