SPRINTing further with HECToR
The aim of this Distributed Computational Science and Engineering (dCSE) project was to develop SPRINT, an add-on package for the R language and environment for statistical computing and graphics. The Simple Parallel R INTerface (SPRINT) offers both a library of parallel functions and an interface for adding parallel functions to R. This work follows on from the previous project, SPRINTing with HECToR.
The key aims of this project were:
- Optimise the randomForest decision tree classifier for parallel implementation on HECToR and then make it available for general R usage on HECToR through SPRINT.
- Analyse the performance of SPRINT's rank product implementation and optimise it.
- Benchmark both randomForest and rank product for up to 512 processes on Phase 2b.
The individual achievements of the project are summarised below:
- A parallel wrapper was added around the serial randomForest algorithm, along with a tree reduction approach for combining results in parallel. For typical cases, a 40-fold speedup can now be achieved. However, the serial randomForest code was designed for datasets with fewer variables than in bioscientific cases, which limits scalability to around 64 processes.
- A task-parallel method using the existing serial rank product calculation was also implemented by distributing the bootstrap samples across processes. For certain problem sizes, excellent scalability was shown on Phase 2b (XT6) for 512+ processes.
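The two patterns named above, distributing independent bootstrap samples among workers and combining partial results with a tree reduction, can be illustrated with a minimal Python sketch. This is not SPRINT's code or API: the function names (`split_work`, `tree_reduce`, `bootstrap_means`, `run`) are hypothetical, bootstrap means stand in for the real rank product or randomForest computation, and a thread pool stands in for the parallel processes purely to show the structure.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def split_work(n_items, n_workers):
    """Return how many items each worker handles, as evenly as possible."""
    base, extra = divmod(n_items, n_workers)
    return [base + (1 if rank < extra else 0) for rank in range(n_workers)]


def tree_reduce(parts, combine):
    """Combine partial results pairwise, roughly halving the list each round."""
    parts = list(parts)
    if not parts:
        raise ValueError("nothing to reduce")
    while len(parts) > 1:
        merged = [combine(parts[i], parts[i + 1])
                  for i in range(0, len(parts) - 1, 2)]
        if len(parts) % 2:
            # An unpaired last element is carried into the next round.
            merged.append(parts[-1])
        parts = merged
    return parts[0]


def bootstrap_means(task):
    """Hypothetical worker task: the means of `count` bootstrap resamples."""
    data, count, seed = task
    rng = random.Random(seed)  # one independent generator per worker
    n = len(data)
    return [sum(rng.choice(data) for _ in range(n)) / n for _ in range(count)]


def run(data, n_samples, n_workers):
    """Distribute the bootstrap samples, then merge the partial results."""
    counts = split_work(n_samples, n_workers)
    tasks = [(data, count, rank) for rank, count in enumerate(counts)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(bootstrap_means, tasks))
    # List concatenation is the combine step in this sketch.
    return tree_reduce(partial, lambda a, b: a + b)
```

Calling `run([1.0, 2.0, 3.0], 10, 4)` gives each of four workers two or three of the ten resamples and returns all ten means in one list. With P workers, a tree reduction needs about log2(P) combining rounds rather than P - 1 sequential merges, which is the usual motivation for the pattern.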
Please see here for a report which summarises the randomForest work and here for a report on the rank product work.