Difference between revisions of "6.2 Read mapping using tophat"

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search
Line 27: Line 27:
  
 
Now write in the script file the tophat command (all in one line):  
 
Now write in the script file the tophat command (all in one line):  
  tophat --solexa-quals -g 2 --library-type fr-unstranded -o tophat_out/2cells bt2Index/ZV9 data/2cells_1.trim.fastq \
+
  tophat --solexa-quals -g 2 --library-type fr-unstranded -o tophat_out/2cells bt2Index/Z12 data/2cells_1.trim.fastq \
 
  data/2cells_2.trim.fastq
 
  data/2cells_2.trim.fastq
 +
 +
The parameters are:
 +
 +
'''--solexa-quals''': quality scale of the reads
 +
 +
'''-g''': maximum number of multihits allowed. Short reads are likely to map to more than  one locations in the genome even though these reads can have originated from only one  of these regions. In general, only a restricted number of multihits is tolerated, and in this  case you are asking tophat to report only reads that map at most onto 2 different loci.
 +
 +
'''--library-type''': type of the sequencing (in your case paired end forward/reverse  unstranded)
 +
 +
'''-o''': output folder name. Given that for every run the name of the output files is the same, you need to specify different folders for each run, lest the output files will be overwritten.

Revision as of 13:39, 4 February 2016

There are numerous tools performing short read alignment and the choice of aligner should be carefully made according to the analysis goals/requirements. Here you will use tophat2, a widely used ultrafast aligner that performs spliced alignments. Let’s first load the tophat module so that you can see its several parameters:

$ module load tophat/2.0.14
$ tophat --help

The general format of the tophat command is:

$ tophat [options]* <index_base> <reads_1> <reads_2>

where the last two arguments are the fastq files of the paired end trimmed reads, and the argument before is the prefix of the indexed genome which in your case is Z12 (unless you choose a different one). Let’s create a folder in which you will run and store the tophat output:

$ mkdir tophat_out

Now open a text file with a text editor that you will use to align the 2-cells zebrafish embryo sample, calling the file with a meaningful name (for example tophat_2cells.sh), and write the header for the scheduler as you did before for running bowtie2-build. Load the needed modules and point the script to the folder you just created. You script should look something like this:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=16G
#$ -e run_tophat_2cells.e
#$ -N tophat_2cells
#$ -o tophat_2cells.o
module load tophat/2.0.14
module load samtools/0.1.19
cd /ibers/ernie/home/vpl/zebrafish

Here you need to take an extra step to solve a compatibility issue between tophat and some versions of samtools (a suite of utilities for handling SAM/BAM files). You will need to add to your path an older version of samtools. Write the following in your script (in one line):

export PATH=//cineca/prod/applications/samtools/0.1.19/gnu-- 4.8.3/bin/:$PATH

Now write in the script file the tophat command (all in one line):

tophat --solexa-quals -g 2 --library-type fr-unstranded -o tophat_out/2cells bt2Index/Z12 data/2cells_1.trim.fastq \
data/2cells_2.trim.fastq

The parameters are:

--solexa-quals: quality scale of the reads

-g: maximum number of multihits allowed. Short reads are likely to map to more than one locations in the genome even though these reads can have originated from only one of these regions. In general, only a restricted number of multihits is tolerated, and in this case you are asking tophat to report only reads that map at most onto 2 different loci.

--library-type: type of the sequencing (in your case paired end forward/reverse unstranded)

-o: output folder name. Given that for every run the name of the output files is the same, you need to specify different folders for each run, lest the output files will be overwritten.