Difference between revisions of "6.1 Genome indexing for bowtie"

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search
Line 52: Line 52:
 
'''Z12''': the prefix that I chose for all files written by bowtie2-build (but you can chose a  different one)
 
'''Z12''': the prefix that I chose for all files written by bowtie2-build (but you can chose a  different one)
  
Save and close the file. Now you can submit it to the scheduler using the qsub command,  like this:  
+
Save and close the file. Now you can submit it to the scheduler using the '''qsub''' command,  like this:  
 
   
 
   
 
  $ sub bowtie2index.sh
 
  $ sub bowtie2index.sh
The job is submitted to the scheduler, and the job identifier is written on screen. You can  check the status of the job using the qstat command. This command returns the status  of all jobs from all users, hence to check only your job you can write something like this:  
+
The job is submitted to the scheduler, and the job identifier is written on screen. You can  check the status of the job using the '''qstat''' command. This command returns the status  of all jobs from all users, hence to check only your job you can write something like this:  
 
   
 
   
 
  $ qstat | grep <your user name>
 
  $ qstat | grep <your user name>

Revision as of 11:21, 4 February 2016

To map reads to a reference genome, all mapping tools require the genome to be converted into particular structures for quick access and to keep the memory footprint small. In this tutorial we will use tophat2 for the transcriptome reconstruction, which in turn requires bowtie2 to run for the read mapping to the genome. Bowtie2 needs the genome to be indexed using the BurrowsAWheeler transform, and provides a tool (bowtie2-build) to obtain this transformation starting from the genome sequence stored in a text file in fasta format. You will also try another mapper, STAR, which performs quite well and can be easily integrated with cufflinks/cuffdif, as you will see later.

Let’s first look at the bowtie2-build options:

$ module load bowtie2
$ bowtie2-build --help

The general usage is:

Usage:

bowtie2-build [options]* <reference_in> <bt2_index_base>

You need to specify the genomic sequence file (reference_in) and a label to identify the index (bt2_index_base), which will be the prefix of all files written by bowtie2-build. Copy a folder containing the genomic sequence with the following command:

$ cp –r /ibers/repository/public/courses/Rna-Seq/genome .

Check the content of the genome folder that you just copied:

$ ls –l genome

Have a look at the file with the commands more or head. This is a file in fasta format, widely used for storing nucleotide or amino acid sequences. The first row, and any row starting with >, is the name of the sequence. For this tutorial, only the zebrafish chromosome 12 is in this file, and all the reads were already filtered for mapping on it. Try to count the number of nucleotides in the chromosome 12: knowing that each row in the fasta file contains 60 nucleotides, how would you know the chromosome size? To keep things well organized, you can create a folder to store the index (name it for example bt2Index or any other name you like):

$ mkdir bt2Index

Once you created the folder, let’s open a text file with a text editor. You can call the file any name you want, for example bowtie2index.sh. Write at the beginning of the file the header for the scheduler, and then load the required modules:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=10G
#$ -e run_bowtie_index.e
#$ -N run_bowtie_index
#$ -o run_bowtie_index.o
module load bowtie/2.2.3

Now write the Unix command to instruct the script to go in the folder in which to execute the commands. For my account, I would write in the script:

cd /ibers/ernie/home/vpl/zebrafish/bt2Index 

If you are not sure the full path, i.e. where your folder is, check the path of the folder using pwd. Now you can write the command to index the genome:

bowtie2-build -f ../genome/zebrafish.fa Z12

-f: specify that the genome is in fast format

../genome/zebrafish.fa: path to the genome fasta file (remember that the command .. indicates the parent folder, i.e. the one right above that in which you currently are)

Z12: the prefix that I chose for all files written by bowtie2-build (but you can chose a different one)

Save and close the file. Now you can submit it to the scheduler using the qsub command, like this:

$ sub bowtie2index.sh

The job is submitted to the scheduler, and the job identifier is written on screen. You can check the status of the job using the qstat command. This command returns the status of all jobs from all users, hence to check only your job you can write something like this:

$ qstat | grep <your user name>