SPRINT

From IBERS Bioinformatics and HPC Wiki

SPRINT[1] is a parallel framework for R that allows certain functions to be run across several cores at the same time.

The full manual for SPRINT can be found here [2].

Below are some instructions on using SPRINT with the IBERS HPC.

Available Parallel Functions

The following functions can be run in parallel using SPRINT. More information on each can be found in the manual linked above.

  • papply() - parallel version of apply() or lapply()
  • pboot() - parallel bootstrapping
  • pcor() - parallel Pearson's Correlation
  • pmaxT() - parallel version of mt.maxT (from multtest package)
  • ppam() - parallel version of pam() (from cluster package)
  • prandomForest() - parallel implementation of randomForest (from randomForest package)
  • pRP() - parallel rank product analysis algorithm (comparable to RP() from RankProd package)


Using SPRINT

SPRINT requires two HPC modules to be loaded:

module load R/R-3.0.2
module load openmpi/gcc

To make use of SPRINT, you need to import the sprint library into your R script using the command:

library(sprint)

You can then load data and carry on with your R processing as normal. To use a parallel function, simply call one of the functions listed above; when the R interpreter reaches the call, SPRINT automatically handles the parallel processing. For example:

papply(my_data, some_function)

When you have finished working in parallel, you need to call a function to tell R so. The following code does that:

pterminate()

Example SPRINT Script

To make use of the parallel implementations of these functions supplied by the SPRINT framework, we also need the MPI framework, which allows SPRINT to execute R code across multiple cores at the same time. The following is a test script for SPRINT.

First, the R script, saved as sprint_test.R:

#Load the sprint library
library(sprint) 

#Function provided by SPRINT to test functionality
ptest() 

#Needed to end MPI calls and return to serial processing
pterminate() 

#End R Script
quit()

Now the Sun Grid Engine script, saved as sprint_test_run.sge:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q

#Remember to add hard/soft limits for memory if required.
#$ -l h_vmem=6G

#Specify number of cores required (in this case five)
#$ -pe mpich 5

#Load our required modules
module load R/R-3.0.2
module load openmpi/gcc

# The following command tells the HPC to run the script in parallel using the MPI framework.
# You also need to tell MPI how many cores it can use; this number must match the number
# given to the Grid Engine above. MPI is started with a command of the form:
# mpiexec -n num_of_cores command_to_run
# "R -f filename" tells R to run the commands in the given file.
mpiexec -n 5 R -f sprint_test.R
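
Because the core count appears twice in the script above (once for the Grid Engine and once for MPI), the two values can drift apart. Grid Engine exports the number of granted slots to the job as the NSLOTS environment variable, so one possible way to keep them in step is to build the mpiexec command from it. The following is only a sketch: the helper name build_mpi_cmd and the fallback value of 1 are assumptions made for illustration, and the command is printed rather than executed so that it can be checked first.

```shell
#!/bin/sh
# Build the MPI launch command from the slot count granted by Grid Engine.
# NSLOTS is set by Grid Engine inside a parallel-environment job.
# The fallback of 1 is an assumption so the sketch also runs outside a job.
build_mpi_cmd() {
    script="$1"
    cores="${NSLOTS:-1}"
    echo "mpiexec -n $cores R -f $script"
}

# Print the command instead of running it, so it can be inspected first.
build_mpi_cmd sprint_test.R
```

In the job script itself, the equivalent would be to replace the hard-coded 5 on the mpiexec line with "$NSLOTS", leaving the "#$ -pe" line as the only place where the core count is set.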

Then, from the command line, we submit sprint_test_run.sge:

qsub -N sprint_test sprint_test_run.sge

The output file should then contain:

R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

.... Omitted Output To Reduce Wiki Page Size ....


> library(sprint)
> library(sprint)
> library(sprint)
> library(sprint)
> library(sprint)
>
> ptest()
[1] "HELLO, FROM PROCESSOR: 0" "HELLO, FROM PROCESSOR: 4"
[3] "HELLO, FROM PROCESSOR: 2" "HELLO, FROM PROCESSOR: 3"
[5] "HELLO, FROM PROCESSOR: 1"
> pterminate()
> quit()
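
A quick way to confirm that every requested core took part is to count the distinct processor numbers reported by ptest() in the output file. The following is a minimal sketch, not part of SPRINT: the helper name count_ranks is hypothetical, and the sample file simply reproduces the ptest() lines shown above so that the sketch runs on its own.

```shell
#!/bin/sh
# Count the distinct "HELLO, FROM PROCESSOR: N" entries in an output file.
count_ranks() {
    grep -o 'HELLO, FROM PROCESSOR: [0-9]*' "$1" | sort -u | wc -l | tr -d ' '
}

# Sample input copied from the ptest() output shown above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[1] "HELLO, FROM PROCESSOR: 0" "HELLO, FROM PROCESSOR: 4"
[3] "HELLO, FROM PROCESSOR: 2" "HELLO, FROM PROCESSOR: 3"
[5] "HELLO, FROM PROCESSOR: 1"
EOF

found=$(count_ranks "$sample")
echo "processors found: $found"
rm -f "$sample"
```

For a real job, count_ranks would be pointed at the job's output file instead of the sample. By default Grid Engine names this file after the job name and job ID, so for the job above it would start with sprint_test.o. The count should match the number of cores requested with "-pe".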