6.2 Read mapping using tophat

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search

There are numerous tools performing short read alignment and the choice of aligner should be carefully made according to the analysis goals/requirements. Here you will use tophat2, a widely used ultrafast aligner that performs spliced alignments. Let’s first load the tophat module so that you can see its several parameters:

$ module load tophat/2.0.14
$ tophat --help

The general format of the tophat command is:

$ tophat [options]* <index_base> <reads_1> <reads_2>

where the last two arguments are the fastq files of the paired end trimmed reads, and the argument before is the prefix of the indexed genome which in your case is Z12 (unless you choose a different one). Let’s create a folder in which you will run and store the tophat output:

$ mkdir tophat_out

Now open a text file with a text editor that you will use to align the 2-cells zebrafish embryo sample, calling the file with a meaningful name (for example tophat_2cells.sh), and write the header for the scheduler as you did before for running bowtie2-build. Load the needed modules and point the script to the folder you just created. You script should look something like this:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=16G
#$ -e run_tophat_2cells.e
#$ -N tophat_2cells
#$ -o run_tophat_2cells.o
module load tophat/2.0.14
module load samtools/0.1.19
cd /ibers/ernie/home/vpl/zebrafish

Notice:Due to compatibility issue between tophat and some versions of samtools (a suite of utilities for handling SAM/BAM files) You have to load the module that includes an older version of samtools (samtools-0.1.19).

Now write in the script file the tophat command (all in one line):

tophat --solexa-quals -g 2 --library-type fr-unstranded -o tophat_out/2cells bt2Index/Z12 data/2cells_1.trim.fastq \
data/2cells_2.trim.fastq

The parameters are:

--solexa-quals: quality scale of the reads

-g: maximum number of multihits allowed. Short reads are likely to map to more than one locations in the genome even though these reads can have originated from only one of these regions. In general, only a restricted number of multihits is tolerated, and in this case you are asking tophat to report only reads that map at most onto 2 different loci.

--library-type: type of the sequencing (in your case paired end forward/reverse unstranded)

-o: output folder name. Given that for every run the name of the output files is the same, you need to specify different folders for each run, lest the output files will be overwritten. Now save and close the file, and submit it using qsub as you did before. This is going to be the longest step of the pipeline, requiring few minutes (10/20 minutes at most). While the job is running, you can see the files that tophat is writing (using ls –l). A temporary folder will be created where large files will be written, then deleted after completion.

In the meantime, let’s prepare the script file for the 6-hours post-fertilization sample. Open a file (call it for example tophat_6h.sh), write the header, load the modules and point to the right folder, identically as you did for the 2-cells sample. Then write the tophat command for the 6 hours sample (remember to change the output folder name, otherwise the second run will overwrite the first run files). You should know now how to do that, so try by yourself before looking at the command below (all in one line):

tophat --solexa-quals -g 2 --library-type fr-unstranded -o 6hours ../bt2Index/ZV9 ../data/6h_1.trim.fastq \
../data/6h_2.trim.fastq

Save and close, run it with qsub, check the status with qstat. When the two jobs are finished, look into the output folders. The file that we need is the accepted_hits.bam, where the read alignments are stored in a binary version of the Sequence Alignment Map (SAM) format. For more information regarding the SAM format please see [1] Another interesting file is the align_summary.txt file, where some statistics on the run outcome are reported. For the 2-cells embryo sample, your align_summary.txt file should look something like this:

Left reads:
         Input     :     25930
          Mapped   :     25247 (97.4% of input)
           of these:      4230 (16.8%) have multiple alignments (4227 have >2)
Right reads:
         Input     :     25930
          Mapped   :     25247 (97.4% of input)
           of these:      4230 (16.8%) have multiple alignments (4227 have >2)
97.4% overall read mapping rate.
Aligned pairs:     25246
    of these:      4230 (16.8%) have multiple alignments
                  24642 (97.6%) are discordant alignments
2.3% concordant pair alignment rate.

The BAM files cannot be looked, but their SAM counterparts can. You will see later that the samotools suite of tools offers several utilities for conversion and analysis of SAM/BAM files. For example, to convert a BAM into SAM and check the first lines of the SAM file you can write (but don’t do it now):

$ samtools view file.bam | head