Difference between revisions of "6.2 Read mapping using tophat"

Revision as of 13:46, 4 February 2016

There are numerous tools performing short read alignment and the choice of aligner should be carefully made according to the analysis goals/requirements. Here you will use tophat2, a widely used ultrafast aligner that performs spliced alignments. Let’s first load the tophat module so that you can see its several parameters:

$ module load tophat/2.0.14
$ tophat --help

The general format of the tophat command is:

$ tophat [options]* <index_base> <reads_1> <reads_2>

where the last two arguments are the fastq files of the paired end trimmed reads, and the argument before is the prefix of the indexed genome which in your case is Z12 (unless you choose a different one). Let’s create a folder in which you will run and store the tophat output:

$ mkdir tophat_out

Now open a text file with a text editor that you will use to align the 2-cells zebrafish embryo sample, calling the file with a meaningful name (for example tophat_2cells.sh), and write the header for the scheduler as you did before for running bowtie2-build. Load the needed modules and point the script to the folder you just created. You script should look something like this:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=16G
#$ -e run_tophat_2cells.e
#$ -N tophat_2cells
#$ -o tophat_2cells.o
module load tophat/2.0.14
module load samtools/0.1.19
cd /ibers/ernie/home/vpl/zebrafish

Notice:Due to compatibility issue between tophat and some versions of samtools (a suite of utilities for handling SAM/BAM files) You have to load the module that includes an older version of samtools (samtools-0.1.19).

Now write in the script file the tophat command (all in one line):

tophat --solexa-quals -g 2 --library-type fr-unstranded -o tophat_out/2cells bt2Index/Z12 data/2cells_1.trim.fastq \
data/2cells_2.trim.fastq

The parameters are:

--solexa-quals: quality scale of the reads

-g: maximum number of multihits allowed. Short reads are likely to map to more than one locations in the genome even though these reads can have originated from only one of these regions. In general, only a restricted number of multihits is tolerated, and in this case you are asking tophat to report only reads that map at most onto 2 different loci.

--library-type: type of the sequencing (in your case paired end forward/reverse unstranded)

-o: output folder name. Given that for every run the name of the output files is the same, you need to specify different folders for each run, lest the output files will be overwritten. Now save and close the file, and submit it using qsub as you did before. This is going to be the longest step of the pipeline, requiring few minutes (10/20 minutes at most). While the job is running, you can see the files that tophat is writing (using ls –l). A temporary folder will be created where large files will be written, then deleted after completion.

In the meantime, let’s prepare the script file for the 6-hours post-fertilization sample. Open a file (call it for example tophat_6h.sh), write the header, load the modules and point to the right folder, identically as you did for the 2-cells sample. Then write the tophat command for the 6 hours sample (remember to change the output folder name, otherwise the second run will overwrite the first run files). You should know now how to do that, so try by yourself before looking at the command below (all in one line):

tophat --solexa-quals -g 2 --library-type fr-unstranded -o 6hours ../bt2Index/ZV9 ../data/6h_1.trim.fastq \
../data/6h_2.trim.fastq

@@ Line 10: / Line 10: @@
   $ mkdir tophat_out
 Now  open  a  text  file  with  a  text  editor  that  you  will  use  to  align  the  2-cells  zebrafish  embryo  sample,  calling  the  file  with  a  meaningful  name  (for  example  '''tophat_2cells.sh'''),  and  write  the  header  for  the  scheduler  as  you  did  before  for  running  '''bowtie2-build'''.  Load  the  needed  modules  and  point  the  script  to  the  folder  you just created. You script should look something like this:
@@ Line 39: / Line 38: @@
 '''-o''': output folder name. Given that for every run the name of the output files is the same, you need to specify different folders for each run, lest the output files will be overwritten.
+Now save and close the file, and submit it using '''qsub''' as you did before. This is going to  be  the  longest  step  of  the  pipeline,  requiring  few  minutes  (10/20  minutes  at  most).  While the job is running, you can see the files that tophat is writing (using '''ls –l'''). A  temporary  folder  will  be  created  where  large  files  will  be  written,  then  deleted  after  completion.
+In  the  meantime,  let’s  prepare  the  script  file  for  the  6-hours  post-fertilization  sample.  Open  a  file  (call  it  for  example  '''tophat_6h.sh'''),  write  the  header,  load  the  modules and point  to  the  right  folder, identically as you did for the 2-cells sample. Then write the '''tophat''' command for the 6  hours  sample  ('''remember  to  change  the  output  folder  name,  otherwise  the  second  run  will overwrite the first run files'''). You should know now how to do that, so try by yourself  before looking at the command below (all in one line):
+ tophat --solexa-quals -g 2 --library-type fr-unstranded -o 6hours ../bt2Index/ZV9 ../data/6h_1.trim.fastq \
+ ../data/6h_2.trim.fastq

Difference between revisions of "6.2 Read mapping using tophat"

Revision as of 13:46, 4 February 2016

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools