Bootstrapping and support vector machines with R and SPRINT on HECToR
This Distributed Computational Science and Engineering (dCSE) project optimised two further statistical functions in the SPRINT package, bootstrapping and support vector machines (SVM), for HECToR. SPRINT aims to provide parallelised bio-statistical analysis functions to the R community, giving R users user-friendly access to HPC systems. This work follows on from two previous projects, SPRINTing with HECToR and SPRINTing further with HECToR.
Bootstrapping is a popular, generic resampling method that applies wherever estimates or results are calculated from data. Prior to this project, EPCC and the Division of Pathway Medicine (DPM) had established that bootstrapping in R could be parallelised using SPRINT, but this prototype implementation was not optimised for HECToR. The R implementation of bootstrapping is, however, similar in structure to the R Random Forest and Rank Product functions, which were successfully optimised for HECToR in the SPRINTing further with HECToR project. This dCSE project therefore took the existing SPRINT prototype implementation of bootstrapping and optimised it for HECToR using approaches similar to those used for Random Forest and Rank Product. Depending on the complexity of the statistic(s) and the number of replicates used, the new SPRINT implementation of R bootstrapping can achieve a speedup of between 20 and 40 over the original serial code on HECToR. The new implementation has been publicly available in SPRINT via the Comprehensive R Archive Network (CRAN) from V1.0.3 onwards.
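Bootstrapping parallelises well because each replicate resamples the data independently of every other replicate. The following is a minimal sketch of that idea in Python, not SPRINT's implementation: the function names (bootstrap_chunk, parallel_bootstrap), the per-worker seeding scheme and the sample data are all assumptions made for illustration. A thread pool is used only to keep the example self-contained; on an HPC system the chunks would be distributed across separate processes.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def bootstrap_chunk(data, statistic, n_replicates, seed):
    """Compute n_replicates bootstrap replicates using a private RNG (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate resamples the data with replacement and re-evaluates the statistic.
    return [statistic(rng.choices(data, k=n)) for _ in range(n_replicates)]


def parallel_bootstrap(data, statistic, n_replicates, n_workers=4, base_seed=0):
    """Split the replicates across workers and concatenate the results."""
    # Share the replicates as evenly as possible between workers.
    base, extra = divmod(n_replicates, n_workers)
    counts = [base + (1 if i < extra else 0) for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Assumption: one distinct seed per worker so that chunks do not repeat each other.
        futures = [
            pool.submit(bootstrap_chunk, data, statistic, count, base_seed + i)
            for i, count in enumerate(counts)
        ]
        # Collect in submission order so that a given seed reproduces the same output.
        return [value for future in futures for value in future.result()]


if __name__ == "__main__":
    sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]  # made-up data
    replicates = parallel_bootstrap(sample, statistics.mean, n_replicates=1000)
    print("bootstrap standard error of the mean:", statistics.stdev(replicates))
```

Because the workers share nothing except the read-only input data, the achievable speedup depends mainly on how expensive the statistic is and how many replicates are requested, which is consistent with the dependence on those two factors noted above.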
Some parallel implementations of SVM exist, but none for R. This means that typical life scientists who use R cannot exploit these advances in their existing analysis workflows. This dCSE project therefore took the existing R implementation of SVM and integrated it with SPRINT so that it can exploit HECToR. In particular, the new implementation allows SVM cross-validation runs to be executed in parallel. It has been publicly available in SPRINT via the Comprehensive R Archive Network (CRAN) from V1.0.4 onwards.
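Cross-validation can be run in parallel because each fold trains and tests a model without reference to the other folds. The sketch below illustrates only that fold-level structure in Python and is not SPRINT's implementation. To stay self-contained it swaps the SVM for a simple nearest-centroid classifier as a stand-in; all function names and the round-robin fold assignment are assumptions made for illustration. As before, a thread pool stands in for the separate processes an HPC run would use.

```python
from concurrent.futures import ThreadPoolExecutor


def make_folds(n_samples, k):
    """Assign sample indices to k folds in round-robin order (an assumed scheme)."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def train_centroids(xs, ys):
    """Stand-in for SVM training: compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared Euclidean distance)."""
    def dist2(centre):
        return sum((a - b) ** 2 for a, b in zip(x, centre))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def evaluate_fold(xs, ys, test_idx):
    """Train on everything outside the fold, then return accuracy on the fold."""
    held_out = set(test_idx)
    train_idx = [i for i in range(len(xs)) if i not in held_out]
    model = train_centroids([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    correct = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
    return correct / len(test_idx)


def parallel_cross_validation(xs, ys, k=5, n_workers=4):
    """Evaluate the k folds concurrently; return the mean accuracy and per-fold scores."""
    folds = make_folds(len(xs), k)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(lambda idx: evaluate_fold(xs, ys, idx), folds))
    return sum(scores) / k, scores
```

With k folds there are at most k independent tasks at this level, so in a scheme like this one the fold count bounds how many workers can be kept busy by a single cross-validation run.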
Please see here for a report which summarises the bootstrapping work and here for a report on the SVM work.