From IBERS Bioinformatics and HPC Wiki

ABySS is a de novo assembler for short-read sequencing data.


I used the following script to build and install ABySS in my home folder (/home/rov). I first had to install the sparsehash and Boost libraries.


#make and install abyss in my local directory
#cd into abyss source distribution dir
#and run this script using ../build_abyss.sh

module load openmpi


#boost does not need to be compiled

#found using find / 2> /dev/null | grep openmpi

#see https://groups.google.com/forum/#!msg/abyss-users/6NXwP959RTI/tqLtO14a4A8J

#probably only one of these required
RJVCPPFLAGS="-I${MYDIR}/sparsehash-2.0.2/src/google/sparsehash -I${MYDIR}/sparsehash-2.0.2/src/google -I${MYDIR}/sparsehash-2.0.2/src"

#where to put abyss binaries

#ensure max kmer size is > 64

make install
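
The configure step that the comments above describe is not shown in the script. The following is a minimal sketch of what it might look like, assuming ABySS's standard ./configure options (--prefix, --enable-maxk, --with-boost, --with-mpi) and a CPPFLAGS include path for sparsehash. MYDIR, PREFIX, BOOST_DIR, MPI_DIR and the value 96 are hypothetical placeholders, not the values used for the original install. The function only prints the command, so it can be checked before being run for real.

```shell
#!/bin/sh
# Sketch only: every path below is a hypothetical placeholder.
MYDIR="${MYDIR:-$HOME/programs}"          # where sparsehash and boost were unpacked (assumed)
PREFIX="${PREFIX:-$MYDIR/abyss-local}"    # where to put the ABySS binaries (assumed)
BOOST_DIR="${BOOST_DIR:-$MYDIR/boost}"    # Boost headers, used without compiling (assumed)
MPI_DIR="${MPI_DIR:-/usr/lib64/openmpi}"  # Open MPI install root (assumed)
MAXK="${MAXK:-96}"                        # largest k-mer size to support (assumed)

# Print the configure command instead of running it, so it can be reviewed first.
abyss_configure_cmd() {
    printf '%s\n' "./configure --prefix=${PREFIX} --enable-maxk=${MAXK} --with-boost=${BOOST_DIR} --with-mpi=${MPI_DIR} CPPFLAGS=-I${MYDIR}/sparsehash-2.0.2/src"
}

abyss_configure_cmd
```

Once the printed command looks right, it would be run from the ABySS source directory before make install.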

Preparing reads

Before assembling my Illumina reads, I first used Trimmomatic to remove adapter sequences. I then organised the FASTQ files into libraries, one for each fragment size.

I then used a custom Python script to filter the mate-pair libraries, discarding read pairs that did not contain a valid mate pair.

Finally, I renamed the reads from the default Illumina names to ABySS-compatible names. ABySS expects the name of the first read of a pair to end with /1, and the second read to have the same name but end with /2. I used my own Python script to do this renaming.
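
The renaming script itself is not reproduced here. As an illustration of the /1 and /2 convention only, the sketch below does the same kind of renaming with awk. It assumes strict four-line FASTQ records and Illumina-style headers in which the read name is everything before the first space; rename_reads is a hypothetical helper name, not part of ABySS.

```shell
#!/bin/sh
# Rewrite FASTQ headers so that every read name ends in /1 or /2.
# Usage: rename_reads 1 < lib-R1.fq > lib-R1.renamed.fq
#        rename_reads 2 < lib-R2.fq > lib-R2.renamed.fq
# Assumes 4-line FASTQ records; header lines are found by position, not by "@".
rename_reads() {
    awk -v mate="$1" '
        NR % 4 == 1 {
            name = $1                   # keep the read name, drop the trailing comment
            sub(/\/[12]$/, "", name)    # remove an existing /1 or /2 suffix, if any
            print name "/" mate
            next
        }
        { print }                       # sequence, separator and quality lines unchanged
    '
}
```

Locating headers by line position rather than by a leading @ avoids misreading quality lines, which can also start with @.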


I used the following script to assemble my Illumina paired end and mate pair data.

I found that using /dev/shm as the temporary directory (TMPDIR in the script below) made the assembly run faster. /dev/shm is a RAM-based file system, so the temporary files stay in memory. The default temporary folder (probably /tmp, which is only 2 GB) is too small, and a folder on the scratch drive is too slow because the data has to cross the network.

Using /dev/shm means that the temporary files take up RAM on the node. For my data set only about 25 GB of temporary files were needed, so they fit into RAM comfortably. /dev/shm is limited to half the total RAM on the machine (i.e. 500/2 = 250 GB on the large node). Be sure to factor this into the -l h_vmem request in your SGE script. Files in /dev/shm are not deleted automatically when the job ends, so delete them explicitly, even if the job aborts. /dev/shm is shared between all jobs running on the node, so put your files under a folder such as /dev/shm/[your username] to avoid interfering with other users' files.
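
One way to make the cleanup happen even when the job fails is a shell trap. The sketch below is an illustration under assumptions rather than the script used for this assembly: setup_scratch and SCRATCH_ROOT are hypothetical names, and the per-user, per-process directory layout is only one possible choice.

```shell
#!/bin/sh
# Create a private scratch directory under /dev/shm and remove it on exit.
# SCRATCH_ROOT and the directory layout are assumptions for illustration.
SCRATCH_ROOT="${SCRATCH_ROOT:-/dev/shm}"

setup_scratch() {
    # One directory per user and per process, so jobs on a shared node do not collide.
    TMPDIR="${SCRATCH_ROOT}/$(id -un)/abyss.$$"
    export TMPDIR
    mkdir -p "${TMPDIR}" || return 1
    # The EXIT trap removes the directory when the script ends, normally or on error.
    trap 'rm -rf "${TMPDIR}"' EXIT
    # Turning INT and TERM into an exit makes the EXIT trap run for those signals too.
    trap 'exit 1' INT TERM
}
```

Calling setup_scratch near the top of the job script, before running the assembler, would replace the explicit rm -rf and mkdir steps. A job that is killed with SIGKILL cannot run any trap, so stale directories may still need to be removed by hand.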

#$ -S /bin/sh
#$ -N abyss
#$ -o ../logs/$JOB_NAME.out.$JOB_ID
#$ -e ../logs/$JOB_NAME.err.$JOB_ID
#$ -cwd
#$ -l h_vmem=245G

module load openmpi

export PATH=${PATH}:/ibers/ernie/home/rov/programs/abyss-local/bin
#export TMPDIR=/ibers/ernie/scratch/rov/abyss_atlantica_assembly_2013-08-08/tmp
export TMPDIR=/dev/shm/rov-abyss-tmp/k51_001

rm -rf ${TMPDIR}
mkdir -p ${TMPDIR}

# OUT, SLOTS, KMER, RDIR and HOSTFILE must already be set at this point
mkdir -p "$OUT"
cd "$OUT"
echo "$(hostname) slots=${SLOTS}" > "${HOSTFILE}"

abyss-pe np=${SLOTS} mpirun="mpirun -hostfile ${HOSTFILE}" \
k=${KMER} \
name=${OUT} \
lib='pe200 pe700 peMP1' \
mp='mpMP1' \
pe200="${RDIR}pe200-R1.fq ${RDIR}pe200-R2.fq" \
pe700="${RDIR}pe700-R1.fq ${RDIR}pe700-R2.fq" \
peMP1="${RDIR}peMP1-R1.fq ${RDIR}peMP1-R2.fq" \
mpMP1="${RDIR}mpMP1-R1.fq ${RDIR}mpMP1-R2.fq"

# remove the temporary files from /dev/shm once the assembly has finished
rm -rf "${TMPDIR}"

Any questions, email rov@aber.ac.uk

Below is another working script. As it shows, both the openmpi and abyss modules had to be loaded inside the script before ABySS would run. This example assembles a paired-end library, sequenced across six lanes, with ABySS using a k-mer size of 25.

#$ -S /bin/sh
#$ -N [running name]
#$ -j y
#$ -M [email address]
#$ -q intel.q
#$ -cwd

module load openmpi/open64
module load abyss/1.3.7 

abyss-pe -j6 k=25 n=10 name=McCabeABySS lib='L001 L002 L003 L004 L006 L007' \
L001='McCabe-gDNA-RichardDewhurst-LAB_NoIndex_L001_R1_001_qualtrim.trimmed5P.fastq.oneline.matched McCabe-gDNA-RichardDewhurst-LAB_NoIndex_L001_R2_001_qualtrim.trimmed5P.fastq.oneline.matched' \
L002='McCabe-gDNA-RichardDewhurst-LAB_NoIndex_L002_R1_001_qualtrim.trimmed5P.fastq.oneline.matched McCabe-gDNA-RichardDewhurst-LAB_NoIndex_L002_R2_001_qualtrim.trimmed5P.fastq.oneline.matched' \
L003='McCabe-gDNA-RichardDewhurst-LAB_NoIndex_L003_R1_001_qualtrim.trimmed5P.fastq.oneline.matched McCabe-gDNA-RichardDewhurst-LAB_NoIndex_L003_R2_001_qualtrim.trimmed5P.fastq.oneline.matched' \
L004='McCabe-gDNA-RichardDewhurst-SAB_NoIndex_L004_R1_001_qualtrim.trimmed5P.fastq.oneline.matched McCabe-gDNA-RichardDewhurst-SAB_NoIndex_L004_R2_001_qualtrim.trimmed5P.fastq.oneline.matched' \
L006='McCabe-gDNA-RichardDewhurst-SAB_NoIndex_L006_R1_001_qualtrim.trimmed5P.fastq.oneline.matched McCabe-gDNA-RichardDewhurst-SAB_NoIndex_L006_R2_001_qualtrim.trimmed5P.fastq.oneline.matched' \
L007='McCabe-gDNA-RichardDewhurst-SAB_NoIndex_L007_R1_001_qualtrim.trimmed5P.fastq.oneline.matched McCabe-gDNA-RichardDewhurst-SAB_NoIndex_L007_R2_001_qualtrim.trimmed5P.fastq.oneline.matched'