Difference between revisions of "6.3 Transcriptome reconstruction and expression using cufflinks"

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search
Line 30: Line 30:
 
The options that you used above are:  
 
The options that you used above are:  
 
'''-o''': output directory  
 
'''-o''': output directory  
 +
 
'''-G''':  tells  cufflinks  to  use  the  supplied  annotation  in  order  to  estimate  isoform  annotation.  
 
'''-G''':  tells  cufflinks  to  use  the  supplied  annotation  in  order  to  estimate  isoform  annotation.  
'''-b''':  instructs  cufflinks  to  run  a  bias  detection  and  correction  algorithm  which  can  significantly improve accuracy of transcript abundance estimates (the genome sequence  is required to perform this).  
+
 
 +
'''-b''':  instructs  cufflinks  to  run  a  bias  detection  and  correction  algorithm  which  can  significantly improve accuracy of transcript abundance estimates (the genome sequence  is required to perform this).
 +
 
'''-u''':  tells  cufflinks  to  do  an  initial  estimation  procedure  to  more  accurately  weight  reads mapping to multiple locations in the genome.  
 
'''-u''':  tells  cufflinks  to  do  an  initial  estimation  procedure  to  more  accurately  weight  reads mapping to multiple locations in the genome.  
 +
 
'''--library-type''': specify the employed sequencing strategy  
 
'''--library-type''': specify the employed sequencing strategy  
 
   
 
   
 
Save,  close  and  run  using  qsub.  With  your  small  dataset,  it  should  not  take  more  than  few  minutes.  While  waiting,  you  can  prepare  and  run  a  similar  file  for  the  6-hours  embryo  sample  (again,  remember  to  change  the  output  folder  name,  otherwise  the  second run will overwrite the first run files). You should know now how to do that, so try  by yourself before looking at the command below:
 
Save,  close  and  run  using  qsub.  With  your  small  dataset,  it  should  not  take  more  than  few  minutes.  While  waiting,  you  can  prepare  and  run  a  similar  file  for  the  6-hours  embryo  sample  (again,  remember  to  change  the  output  folder  name,  otherwise  the  second run will overwrite the first run files). You should know now how to do that, so try  by yourself before looking at the command below:

Revision as of 17:05, 15 February 2016

There are a number of tools that perform reconstruction of the transcriptome and for this workshop you are going to use cufflinks, which can do transcriptome assembly with and without a reference annotation. It also quantifies the isoform expression in FPKMs. Let’s import the cufflinks module and check its (rather long) list of parameters:

$ module load cufflinks/2.1.1
$ cufflinks --help

You are going to use cufflinks with the UCSC annotation file for zebrafish limiting the transcriptome reconstruction strictly to the available annotations. You could also use the annotations as a guide but letting the algorithm look also for transcribed loci not included in the annotations. In the first case cufflinks will only report isoforms that are included in the annotation, while in the latter case it will report novel isoforms as well. First, copy the annotation file from UCSC for chromosome 12 of Danio rerio:

$ cp –r /ibers/repository/public/courses/Rna-Seq/annotations .  

Then let’s create a folder for the cufflinks output storage:

$ mkdir cuff_out

Now you are ready to run cufflinks. The general format of the cufflinks command is:

cufflinks [options]* <aligned_reads.(sam/bam)>

where the input is the aligned reads (either in SAM or BAM format). Prepare a script file to launch cufflinks, by opening it with a text editor (you can call it for example cufflinks_2cells.sh). Copy the same header for the scheduler that you used previously, use the commands to load the required modules, and point the script to your working space. Your script should look something like this:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=24G
#$ -e cff_12.e
#$ -N cff
#$ -o cff_12.o
module load cufflinks/2.2.1
cd /ibers/ernie/home/vpl/zebrafish

Then you can write (all in one line) the command:

cufflinks -o cuff_out_chr12/2cells -G annotations/Zebrafish_refGene_chr12.gtf \
-b /ibers/ernie/home/vpl/zebrafish/bt2Index/chr12.fa \
-u --library-type fr-unstranded /ibers/ernie/scratch/vpl/zebra_fish/tophat_out_chr12/accepted_hits.bam

The options that you used above are: -o: output directory

-G: tells cufflinks to use the supplied annotation in order to estimate isoform annotation.

-b: instructs cufflinks to run a bias detection and correction algorithm which can significantly improve accuracy of transcript abundance estimates (the genome sequence is required to perform this).

-u: tells cufflinks to do an initial estimation procedure to more accurately weight reads mapping to multiple locations in the genome.

--library-type: specify the employed sequencing strategy

Save, close and run using qsub. With your small dataset, it should not take more than few minutes. While waiting, you can prepare and run a similar file for the 6-hours embryo sample (again, remember to change the output folder name, otherwise the second run will overwrite the first run files). You should know now how to do that, so try by yourself before looking at the command below: