Difference between revisions of "5. Trimming"

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search
Line 36: Line 36:
  
 
'''MINLEN''': minimum read length
 
'''MINLEN''': minimum read length
 +
At  the  end  of  the  run,  you  should  get  a  report  on  screen  telling  how  many  reads  passed  the filters, for example:
 +
Input Read Pairs: 786742 Both Surviving: 770010 (97.87%) Forward Only Surviving: 14831 (1.89%) Reverse Only Surviving: 1596 (0.20%) Dropped: 305 (0.04%)
 +
TrimmomaticPE: Completed successfully
 +
With these parameters, all output files are written in the data folder, where the original  fastq files are, but you can chose a different setting. The files '''2cells_1.trim.fastq''' and '''2cells_2.trim.fastq''' contain  trimmed  reads  that  after  the  filtering  have  maintained  their  mate,  while  the  files  '''2cells_1.trim.unpaired.fastq''' and '''2cells_2.trim.unpaired.fastq''' contain  reads  that  have  lost  their  mate.  Using  the  Unix  commands  that  you  used  before,  count  the  number  of  reads  in  the  trimmed  fastq  files,  and  verify  that  the  forward  and  reverse  fastq  files  have  the  same  number  of  reads.  Moreover,  by  looking  at  the  beginning  of  the  forward  and  reverse  file,  verify  that  the pairing order of reads has been maintained (i.e. the first read of the forward file must  be the mate of the first read of the reverse file).

Revision as of 16:13, 27 January 2016

We will use Trimmomatic for read trimming and filtering, and adapters removal.

Let’s see its general usage. To run trimmomatic you need to specify its full path: /cm/shared/apps/trimmomatic/0.33/trimmomatic-0.33.jar):

$ java -jar /cm/shared/apps/trimmomatic/0.33/trimmomatic-0.33.jar -h

Usage:

 PE [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-basein <inputBase> | <inputFile1> <inputFile2>] 
      [-baseout <outputBase> | <outputFile1P> <outputFile1U><outputFile2P> <outputFile2U>] <trimmer1>...

or:

SE [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] <inputFile> <outputFile> <trimmer1>...

For paired end reads (PE), after specifying some options, the arguments are the input forward and reverse fastq files, the names of the paired and unpaired remaining reads after trimming for the forward reads, and the names of the paired and unpaired remaining reads after trimming for the reverse reads. Finally, you need to specify the parameters for the trimming operations. Since the Trimmomatic output files are fastq files, you must be careful of not mixing them with the original un-trimmed fastq files, or to overwrite the un-trimmed fastq files. You can create a specific folder for the trimmed files, or use specific names that will remind you what they are. Let’s try for the forward 2-cells fastq file. We will use again the submission script but with larger memory limits (lets say 40GB. The java VM consumes a lot of memory) and the running command for trimmomatic:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=40G
#$ -e run_trimmomatic.e
#$ -N run_trimmomatic
#$ -o run_trimmomatic.o
java -jar /cm/shared/apps/trimmomatic/0.33/trimmomatic-0.33.jar PE -threads 2 -phred33 data/2cells_1.fastq data/2cells_2.fastq \
data/2cells_1.trim.fastq data/2cells_1.trim.unpaired.fastq data/2cells_2.trim.fastq data/2cells_2.trim.unpaired.fastq \
LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:25

The arguments are:

PE: specify that reads are paired end

-threads: number of threads

-phred33: quality scale paired forward reads output file unpaired forward reads output file paired reverse reads output file unpaired reverse reads output file

LEADING: quality threshold for removing nucleotides from the 5’ end

TRAILING: quality threshold for removing nucleotides from the 3’ end

AVGQUAL: mean read quality threshold

MINLEN: minimum read length At the end of the run, you should get a report on screen telling how many reads passed the filters, for example:

Input Read Pairs: 786742 Both Surviving: 770010 (97.87%) Forward Only Surviving: 14831 (1.89%) Reverse Only Surviving: 1596 (0.20%) Dropped: 305 (0.04%)
TrimmomaticPE: Completed successfully

With these parameters, all output files are written in the data folder, where the original fastq files are, but you can chose a different setting. The files 2cells_1.trim.fastq and 2cells_2.trim.fastq contain trimmed reads that after the filtering have maintained their mate, while the files 2cells_1.trim.unpaired.fastq and 2cells_2.trim.unpaired.fastq contain reads that have lost their mate. Using the Unix commands that you used before, count the number of reads in the trimmed fastq files, and verify that the forward and reverse fastq files have the same number of reads. Moreover, by looking at the beginning of the forward and reverse file, verify that the pairing order of reads has been maintained (i.e. the first read of the forward file must be the mate of the first read of the reverse file).