5. Trimming

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search

We will use Trimmomatic for read trimming and filtering, and adapters removal.

Let’s see its general usage. To run trimmomatic you need to specify its full path: /cm/shared/apps/trimmomatic/0.33/trimmomatic-0.33.jar):

$ java -jar /cm/shared/apps/trimmomatic/0.33/trimmomatic-0.33.jar -h

Usage:

 PE [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] [-basein <inputBase> | <inputFile1> <inputFile2>] 
      [-baseout <outputBase> | <outputFile1P> <outputFile1U><outputFile2P> <outputFile2U>] <trimmer1>...

or:

SE [-threads <threads>] [-phred33|-phred64] [-trimlog <trimLogFile>] <inputFile> <outputFile> <trimmer1>...

For paired end reads (PE), after specifying some options, the arguments are the input forward and reverse fastq files, the names of the paired and unpaired remaining reads after trimming for the forward reads, and the names of the paired and unpaired remaining reads after trimming for the reverse reads. Finally, you need to specify the parameters for the trimming operations. Since the Trimmomatic output files are fastq files, you must be careful of not mixing them with the original un-trimmed fastq files, or to overwrite the un-trimmed fastq files. You can create a specific folder for the trimmed files, or use specific names that will remind you what they are. Let’s try for the forward 2-cells fastq file. We will use again the submission script but with larger memory limits (lets say 40GB. The java VM consumes a lot of memory) and the running command for trimmomatic:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=40G
#$ -e run_trimmomatic.e
#$ -N run_trimmomatic
#$ -o run_trimmomatic.o
java -jar /cm/shared/apps/trimmomatic/0.33/trimmomatic-0.33.jar PE -threads 2 -phred33 data/2cells_1.fastq data/2cells_2.fastq \
data/2cells_1.trim.fastq data/2cells_1.trim.unpaired.fastq data/2cells_2.trim.fastq data/2cells_2.trim.unpaired.fastq \
LEADING:20 TRAILING:20 AVGQUAL:20 MINLEN:25

The arguments are:

PE: specify that reads are paired end

-threads: number of threads

-phred33: quality scale paired forward reads output file unpaired forward reads output file paired reverse reads output file unpaired reverse reads output file

LEADING: quality threshold for removing nucleotides from the 5’ end

TRAILING: quality threshold for removing nucleotides from the 3’ end

AVGQUAL: mean read quality threshold

MINLEN: minimum read length At the end of the run, you should get a report on screen telling how many reads passed the filters, for example:

Input Read Pairs: 786742 Both Surviving: 770010 (97.87%) Forward Only Surviving: 14831 (1.89%) Reverse Only Surviving:1596 (0.20%)
Dropped: 305 (0.04%)
TrimmomaticPE: Completed successfully

With these parameters, all output files are written in the data folder, where the original fastq files are, but you can chose a different setting. The files 2cells_1.trim.fastq and 2cells_2.trim.fastq contain trimmed reads that after the filtering have maintained their mate, while the files 2cells_1.trim.unpaired.fastq and 2cells_2.trim.unpaired.fastq contain reads that have lost their mate. Using the Unix commands that you used before, count the number of reads in the trimmed fastq files, and verify that the forward and reverse fastq files have the same number of reads. Moreover, by looking at the beginning of the forward and reverse file, verify that the pairing order of reads has been maintained (i.e. the first read of the forward file must be the mate of the first read of the reverse file). Since paired end read files must have the same number of reads and with the same order in the forward and reverse files, unpaired reads must kept separated. Count the number of the reads that have lost their mate for the forward and reverse reads. In this tutorial, unpaired reads will be ignored from now on, but they can be used in principle by treating them as single end reads. If you want to verify that the trimming procedure worked, you could run again FastQC on the trimmed fastq files and compare the output reports against the raw fastq files, but let’s skip this step. Now try yourself repeating the same steps for the other sample (the 6-houts embryo) in the data folder (be careful on the input and output file names), and once finished check that in the data folder now you have all trimmed fastq files, paired and unpaired (8 trimmed files, 4 for the 2-cells embryo and 4 for the 6-hours embryo).