Difference between revisions of "5. Trimming"

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search
 
(12 intermediate revisions by the same user not shown)
Line 6: Line 6:
 
Usage:
 
Usage:
 
   PE [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-basein <inputBase> | <inputFile1> <inputFile2>]  
 
   PE [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-basein <inputBase> | <inputFile1> <inputFile2>]  
[-baseout <outputBase> | <outputFile1P> <outputFile1U><outputFile2P> <outputFile2U>] <trimmer1>...
+
      [-baseout <outputBase> | <outputFile1P> <outputFile1U><outputFile2P> <outputFile2U>] <trimmer1>...
 
or:
 
or:
 
  SE [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] <inputFile> <outputFile> <trimmer1>...
 
  SE [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] <inputFile> <outputFile> <trimmer1>...
 +
For  paired  end  reads  (PE),  after  specifying  some  options,  the  arguments  are  the  input  forward  and  reverse  fastq  files,  the  names  of  the  paired  and  unpaired  remaining  reads  after  trimming  for  the  forward  reads,  and  the  names  of  the  paired  and  unpaired  remaining  reads  after  trimming  for  the  reverse  reads.  Finally,  you  need  to  specify  the  parameters for the trimming operations. Since the Trimmomatic output files are fastq  files,  you  must  be  careful  of  not  mixing  them  with  the  original  un-trimmed  fastq  files,  or  to  overwrite  the  un-trimmed  fastq  files.  You  can  create  a  specific  folder  for  the  trimmed  files,  or  use  specific  names  that  will  remind  you  what  they  are.  Let’s  try  for  the  forward  2-cells fastq file.
 +
We will use again the submission script but with larger memory limits (lets say 40GB. The java VM consumes a lot of memory) and the running command for trimmomatic:
 +
#$ -S /bin/sh
 +
#$ -cwd
 +
#$ -q amd.q,large.q,intel.q
 +
#$ -l h_vmem=40G
 +
#$ -e run_trimmomatic.e
 +
#$ -N run_trimmomatic
 +
#$ -o run_trimmomatic.o
 +
java -jar /cm/shared/apps/trimmomatic/0.33/trimmomatic-0.33.jar PE -threads 2 -phred33 data/2cells_1.fastq data/2cells_2.fastq \
 +
data/2cells_1.trim.fastq data/2cells_1.trim.unpaired.fastq data/2cells_2.trim.fastq data/2cells_2.trim.unpaired.fastq \
 +
LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:25
 +
The arguments are:
 +
 +
'''PE''': specify that reads are paired end
 +
 +
'''-threads''': number of threads
 +
 +
'''-phred33''': quality scale paired forward reads output file unpaired forward reads output file paired reverse reads output file unpaired reverse reads output file
 +
 +
'''LEADING''': quality threshold for removing nucleotides from the 5’ end
 +
 +
'''TRAILING''': quality threshold for removing nucleotides from the 3’ end
 +
 +
'''AVGQUAL''': mean read quality threshold
 +
 +
'''MINLEN''': minimum read length
 +
At  the  end  of  the  run,  you  should  get  a  report  on  screen  telling  how  many  reads  passed  the filters, for example:
 +
Input Read Pairs: 786742 Both Surviving: 770010 (97.87%) Forward Only Surviving: 14831 (1.89%) Reverse Only Surviving:1596 (0.20%)
 +
Dropped: 305 (0.04%)
 +
TrimmomaticPE: Completed successfully
 +
With these parameters, all output files are written in the '''data''' folder, where the original  fastq files are, but you can chose a different setting. The files '''2cells_1.trim.fastq''' and '''2cells_2.trim.fastq''' contain  trimmed  reads  that  after  the  filtering  have  maintained  their  mate,  while  the  files  '''2cells_1.trim.unpaired.fastq''' and '''2cells_2.trim.unpaired.fastq''' contain  reads  that  have  lost  their  mate.  Using  the  Unix  commands  that  you  used  before,  count  the  number  of  reads  in  the  trimmed  fastq  files,  and  verify  that  the  forward  and  reverse  fastq  files  have  the  same  number  of  reads.  Moreover,  by  looking  at  the  beginning  of  the  forward  and  reverse  file,  verify  that  the pairing order of reads has been maintained (i.e. the first read of the forward file must  be the mate of the first read of the reverse file).
 +
Since paired end read files must have the same number of reads and with the same order  in the forward and reverse files, unpaired reads must kept separated. 
 +
Count  the  number  of  the  reads  that  have  lost  their  mate  for  the  forward  and  reverse  reads.
 +
In  this  tutorial,  unpaired  reads  will  be  ignored  from  now  on,  but  they  can  be  used  in  principle by treating them as single end reads.  If you want to verify that the trimming procedure worked, you could run again '''FastQC''' on  the  '''trimmed  fastq'''  files  and  compare  the  output  reports  against  the  raw  fastq  files,  but  let’s skip this step.
 +
Now try yourself repeating the same steps for the other sample (the 6-houts embryo) in  the  data  folder  (be  careful  on  the  input  and  output  file  names),  and  once  finished  check that in the data folder now you have all trimmed fastq files, paired and unpaired  (8 trimmed files, 4 for the 2-cells embryo and 4 for the 6-hours embryo).

Latest revision as of 10:51, 4 February 2016

We will use Trimmomatic for read trimming and filtering, and adapters removal.

Let’s see its general usage. To run trimmomatic you need to specify its full path: /cm/shared/apps/trimmomatic/0.33/trimmomatic-0.33.jar):

$ java -jar /cm/shared/apps/trimmomatic/0.33/trimmomatic-0.33.jar -h

Usage:

 PE [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-basein <inputBase> | <inputFile1> <inputFile2>] 
      [-baseout <outputBase> | <outputFile1P> <outputFile1U><outputFile2P> <outputFile2U>] <trimmer1>...

or:

SE [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] <inputFile> <outputFile> <trimmer1>...

For paired end reads (PE), after specifying some options, the arguments are the input forward and reverse fastq files, the names of the paired and unpaired remaining reads after trimming for the forward reads, and the names of the paired and unpaired remaining reads after trimming for the reverse reads. Finally, you need to specify the parameters for the trimming operations. Since the Trimmomatic output files are fastq files, you must be careful of not mixing them with the original un-trimmed fastq files, or to overwrite the un-trimmed fastq files. You can create a specific folder for the trimmed files, or use specific names that will remind you what they are. Let’s try for the forward 2-cells fastq file. We will use again the submission script but with larger memory limits (lets say 40GB. The java VM consumes a lot of memory) and the running command for trimmomatic:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=40G
#$ -e run_trimmomatic.e
#$ -N run_trimmomatic
#$ -o run_trimmomatic.o
java -jar /cm/shared/apps/trimmomatic/0.33/trimmomatic-0.33.jar PE -threads 2 -phred33 data/2cells_1.fastq data/2cells_2.fastq \
data/2cells_1.trim.fastq data/2cells_1.trim.unpaired.fastq data/2cells_2.trim.fastq data/2cells_2.trim.unpaired.fastq \
LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:25

The arguments are:

PE: specify that reads are paired end

-threads: number of threads

-phred33: quality scale paired forward reads output file unpaired forward reads output file paired reverse reads output file unpaired reverse reads output file

LEADING: quality threshold for removing nucleotides from the 5’ end

TRAILING: quality threshold for removing nucleotides from the 3’ end

AVGQUAL: mean read quality threshold

MINLEN: minimum read length At the end of the run, you should get a report on screen telling how many reads passed the filters, for example:

Input Read Pairs: 786742 Both Surviving: 770010 (97.87%) Forward Only Surviving: 14831 (1.89%) Reverse Only Surviving:1596 (0.20%)
Dropped: 305 (0.04%)
TrimmomaticPE: Completed successfully

With these parameters, all output files are written in the data folder, where the original fastq files are, but you can chose a different setting. The files 2cells_1.trim.fastq and 2cells_2.trim.fastq contain trimmed reads that after the filtering have maintained their mate, while the files 2cells_1.trim.unpaired.fastq and 2cells_2.trim.unpaired.fastq contain reads that have lost their mate. Using the Unix commands that you used before, count the number of reads in the trimmed fastq files, and verify that the forward and reverse fastq files have the same number of reads. Moreover, by looking at the beginning of the forward and reverse file, verify that the pairing order of reads has been maintained (i.e. the first read of the forward file must be the mate of the first read of the reverse file). Since paired end read files must have the same number of reads and with the same order in the forward and reverse files, unpaired reads must kept separated. Count the number of the reads that have lost their mate for the forward and reverse reads. In this tutorial, unpaired reads will be ignored from now on, but they can be used in principle by treating them as single end reads. If you want to verify that the trimming procedure worked, you could run again FastQC on the trimmed fastq files and compare the output reports against the raw fastq files, but let’s skip this step. Now try yourself repeating the same steps for the other sample (the 6-houts embryo) in the data folder (be careful on the input and output file names), and once finished check that in the data folder now you have all trimmed fastq files, paired and unpaired (8 trimmed files, 4 for the 2-cells embryo and 4 for the 6-hours embryo).