Difference between revisions of "6.3 Transcriptome reconstruction and expression using cufflinks"

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search
 
(10 intermediate revisions by the same user not shown)
Line 25: Line 25:
 
  cd /ibers/ernie/home/vpl/zebrafish
 
  cd /ibers/ernie/home/vpl/zebrafish
 
Then you can write (all in one line) the command:
 
Then you can write (all in one line) the command:
  cufflinks -o cuff_out_chr12/2cells -G annotations/Zebrafish_refGene_chr12.gtf -b /ibers/ernie/home/vpl/zebrafish/bt2Index/chr12.fa \
+
  cufflinks -o cuff_out/2cells -G annotations/Zebrafish_refGene_chr12.gtf \
  -u --library-type fr-unstranded /ibers/ernie/scratch/vpl/zebra_fish/tophat_out_chr12/accepted_hits.bam
+
-b bt2Index/chr12.fa \
 +
-u --library-type fr-unstranded tophat_out/2cells/accepted_hits.bam
 +
The options that you used above are:
 +
'''-o''': output directory
 +
 
 +
'''-G''':  tells  cufflinks  to  use  the  supplied  annotation  in  order  to  estimate  isoform  annotation.
 +
 
 +
'''-b''':  instructs  cufflinks  to  run  a  bias  detection  and  correction  algorithm  which  can  significantly improve accuracy of transcript abundance estimates (the genome sequence  is required to perform this).
 +
 +
'''-u''':  tells  cufflinks  to  do  an  initial  estimation  procedure  to  more  accurately  weight  reads mapping to multiple locations in the genome.
 +
 
 +
'''--library-type''': specify the employed sequencing strategy
 +
 +
Save,  close  and  run  using  qsub.  With  your  small  dataset,  it  should  not  take  more  than  few  minutes.  While  waiting,  you  can  prepare  and  run  a  similar  file  for  the  6-hours  embryo  sample  (again,  remember  to  change  the  output  folder  name,  otherwise  the  second run will overwrite the first run files). You should know now how to do that, so try  by yourself before looking at the command below:
 +
 
 +
cufflinks -o cuff_out/6hours -G annotations/Zebrafish_refGene_chr12.gtf \
 +
-b /ibers/ernie/home/vpl/zebrafish/bt2Index/chr12.fa \
 +
  -u --library-type fr-unstranded /ibers/ernie/scratch/vpl/zebra_fish/tophat_out_6hours/accepted_hits.bam
 +
When  both  have  finished,  you  can  take  a  look  at  the  output  folders  that  have  been  created. The results from cufflinks are stored in 4 different files:
 +
'''genes.fpkm_tracking''': This file contains the estimated geneAlevel expression values  in the generic FPKM Tracking Format.
 +
 
 +
'''isoforms.fpkm_tracking''': This file contains the estimated isoform level expression  values in the generic FPKM Tracking Format.
 +
 
 +
'''transcripts.gtf''': This GTF file contains the assembled isoforms.
 +
 
 +
'''skipped.gtf''': This GTF file contains the skipped loci (it’s often empty)
 +
 
 +
Let’s have a look at the file '''genes.fpkm_tracking''', where gene expression is stored.  The  first  column  reports  the  UCSC  identifier  of  the  gene,  the  fifth  its  common  name,  the  seventh  its  genomic  coordinates,  and  the  tenth  its  expression  in  FPKM.  You  would  like  to  know  which  are  the  most  expressed  genes  in  the  sample.  This  can  be  done  in  many  ways,  for  example  opening  the  file  as  a  spreadsheet  with  Excel,  but  let’s  do  it  using  Unix  commands.  You  can  use  the  '''sort'''  command,  which  can  order  a  file  based  on  the  values in a specific column:
 +
$ sort -t$'\t' -r -g -k 10 cuff_out/2cells/genes.fpkm_tracking > cuff_out/2cells/genes.sorted.fpkm_tracking
 +
 +
If  you  want  to  better  understand  the  command  that  you  just  typed,  check  the  manual  page for the sort command using:
 +
$ man sort
 +
The  first  parameter  of  sort  tells  the  command  that  the  file  is  divided  in  columns  separated by tabs; the '''–k''' parameter specify which column you are interested in; the '''–g'''  tells sort that the column contains numbers (the default sort behavior is to order by  lexicographic  order);  the  '''–r'''  tells  sort  to  order  from  the  largest  to  the  smallest  value.  The  sort  output  is  redirected  into  a  new  file.  If  you  open  this  new  file,  the  genes  are  now  ordered  by  decreasing  expression.  If  you  now  want  to  know,  for  example,  which  is  the most expressed gen, you can copy UCSC identifier in the first row of the sorted  file, the  UCSC  genome  browser  to  get  more  information  about  it . Do that for a few genes. 
 +
 +
Now  do  the  same  also  for  the  6-hours  embryo  sample.  Which  are  the  most  expressed  genes in this other sample? Are they in common with the 2-cells embryo?
 +
 +
Now  that  you  have  all  the  strongly  expressed  genes  in  the  two  embryos,  you  might  wonder  in  which  biological  processes  they  participate.  Let’s  see  how  you  can  use  an  automatic  web-based  functional  annotation  tool  ('''DAVID''')  to  help  you  in  the rationalization  of  which  are  the  most  represented  process  in  a  biological  sample.  To  use  '''DAVID'''  you  need  to  submit  a  list  of  gene  identifiers.  In  the  '''genes.sorted.fpkm_tracking''', the first column contains the UCSC gene ids of  the  genes  in  the  annotation  '''GTF'''  file.  You  need  to  extract  these  identifiers  from  this  file.  Let’s  take  the  top  100  of  the  genes  sorted  by  '''FPKM'''. Let’s use the command '''head''' to select the top 100 genes, and the command '''cut'''  to  extract  only  the  first  column  from  these  rows  (if  you  want,  use  '''man'''  to  have  more  details on how these commands work):
 +
 +
$ head -100 cuff_out/2cells/genes.sorted.fpkm_tracking | cut -f 1 > cuff_out/2cells/top100byFpkm.txt
 +
 
 +
Remember that the '''>''' operator redirects the output of a command into a file. Now open  the '''top100byFpkm.txt''' file, verify that it’s a single-column file with the UCSC gene  identifiers, and copy its content. Open any browser and go to: [https://david.ncifcrf.gov/home.jsp]
 +
Click the '''Start Analysis''' tab at the top of the screen, click the '''Upload''' tab on the left  of  the  screen  and  paste  the  list  of  genes  into  the  box  labeled  Step 1: '''Enter Gene List'''. In Step 2: '''Select Identifiers''' choose UCSC_GENE_ID from the list,  and mark '''Gene List''' in Step 3: '''List Type''', then click '''Submit List''' in Step 4. After submission '''DAVID''' will ask you to convert the gene list into the appropriate format ('''DAVID''' format). Click on the '''Submit to conversion tool''' and save it as new converted list. After  the  page  ends  the  processing,  click  '''Functional Annotation Tool'''  to  be  redirected  to  the  result  page.  Here  you  can  expand  all  tabs  and  click  on  the  different  functional  categories  to  see  if  there  is  some  enrichment  in  your  gene  list.  The  one  that  might  interest  you  the  most  is  '''GOTERM_BP_FAT'''  under  '''Gene_Ontology'''
 +
Do  these  categories make sense given the samples you are studying?
 +
 +
Now you can repeat  for  the  6-hours  embryo  sample and see if there  are differences  between  the  two  samples.

Latest revision as of 10:58, 18 February 2016

There are a number of tools that perform reconstruction of the transcriptome and for this workshop you are going to use cufflinks, which can do transcriptome assembly with and without a reference annotation. It also quantifies the isoform expression in FPKMs. Let’s import the cufflinks module and check its (rather long) list of parameters:

$ module load cufflinks/2.1.1
$ cufflinks --help

You are going to use cufflinks with the UCSC annotation file for zebrafish limiting the transcriptome reconstruction strictly to the available annotations. You could also use the annotations as a guide but letting the algorithm look also for transcribed loci not included in the annotations. In the first case cufflinks will only report isoforms that are included in the annotation, while in the latter case it will report novel isoforms as well. First, copy the annotation file from UCSC for chromosome 12 of Danio rerio:

$ cp –r /ibers/repository/public/courses/Rna-Seq/annotations .  

Then let’s create a folder for the cufflinks output storage:

$ mkdir cuff_out

Now you are ready to run cufflinks. The general format of the cufflinks command is:

cufflinks [options]* <aligned_reads.(sam/bam)>

where the input is the aligned reads (either in SAM or BAM format). Prepare a script file to launch cufflinks, by opening it with a text editor (you can call it for example cufflinks_2cells.sh). Copy the same header for the scheduler that you used previously, use the commands to load the required modules, and point the script to your working space. Your script should look something like this:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=24G
#$ -e cff_12.e
#$ -N cff
#$ -o cff_12.o
module load cufflinks/2.2.1
cd /ibers/ernie/home/vpl/zebrafish

Then you can write (all in one line) the command:

cufflinks -o cuff_out/2cells -G annotations/Zebrafish_refGene_chr12.gtf \
-b bt2Index/chr12.fa \
-u --library-type fr-unstranded tophat_out/2cells/accepted_hits.bam

The options that you used above are: -o: output directory

-G: tells cufflinks to use the supplied annotation in order to estimate isoform annotation.

-b: instructs cufflinks to run a bias detection and correction algorithm which can significantly improve accuracy of transcript abundance estimates (the genome sequence is required to perform this).

-u: tells cufflinks to do an initial estimation procedure to more accurately weight reads mapping to multiple locations in the genome.

--library-type: specify the employed sequencing strategy

Save, close and run using qsub. With your small dataset, it should not take more than few minutes. While waiting, you can prepare and run a similar file for the 6-hours embryo sample (again, remember to change the output folder name, otherwise the second run will overwrite the first run files). You should know now how to do that, so try by yourself before looking at the command below:

cufflinks -o cuff_out/6hours -G annotations/Zebrafish_refGene_chr12.gtf \
-b /ibers/ernie/home/vpl/zebrafish/bt2Index/chr12.fa \
-u --library-type fr-unstranded /ibers/ernie/scratch/vpl/zebra_fish/tophat_out_6hours/accepted_hits.bam

When both have finished, you can take a look at the output folders that have been created. The results from cufflinks are stored in 4 different files: genes.fpkm_tracking: This file contains the estimated geneAlevel expression values in the generic FPKM Tracking Format.

isoforms.fpkm_tracking: This file contains the estimated isoform level expression values in the generic FPKM Tracking Format.

transcripts.gtf: This GTF file contains the assembled isoforms.

skipped.gtf: This GTF file contains the skipped loci (it’s often empty)

Let’s have a look at the file genes.fpkm_tracking, where gene expression is stored. The first column reports the UCSC identifier of the gene, the fifth its common name, the seventh its genomic coordinates, and the tenth its expression in FPKM. You would like to know which are the most expressed genes in the sample. This can be done in many ways, for example opening the file as a spreadsheet with Excel, but let’s do it using Unix commands. You can use the sort command, which can order a file based on the values in a specific column:

$ sort -t$'\t' -r -g -k 10 cuff_out/2cells/genes.fpkm_tracking > cuff_out/2cells/genes.sorted.fpkm_tracking

If you want to better understand the command that you just typed, check the manual page for the sort command using:

$ man sort

The first parameter of sort tells the command that the file is divided in columns separated by tabs; the –k parameter specify which column you are interested in; the –g tells sort that the column contains numbers (the default sort behavior is to order by lexicographic order); the –r tells sort to order from the largest to the smallest value. The sort output is redirected into a new file. If you open this new file, the genes are now ordered by decreasing expression. If you now want to know, for example, which is the most expressed gen, you can copy UCSC identifier in the first row of the sorted file, the UCSC genome browser to get more information about it . Do that for a few genes.

Now do the same also for the 6-hours embryo sample. Which are the most expressed genes in this other sample? Are they in common with the 2-cells embryo?

Now that you have all the strongly expressed genes in the two embryos, you might wonder in which biological processes they participate. Let’s see how you can use an automatic web-based functional annotation tool (DAVID) to help you in the rationalization of which are the most represented process in a biological sample. To use DAVID you need to submit a list of gene identifiers. In the genes.sorted.fpkm_tracking, the first column contains the UCSC gene ids of the genes in the annotation GTF file. You need to extract these identifiers from this file. Let’s take the top 100 of the genes sorted by FPKM. Let’s use the command head to select the top 100 genes, and the command cut to extract only the first column from these rows (if you want, use man to have more details on how these commands work):

$ head -100 cuff_out/2cells/genes.sorted.fpkm_tracking | cut -f 1 > cuff_out/2cells/top100byFpkm.txt

Remember that the > operator redirects the output of a command into a file. Now open the top100byFpkm.txt file, verify that it’s a single-column file with the UCSC gene identifiers, and copy its content. Open any browser and go to: [1] Click the Start Analysis tab at the top of the screen, click the Upload tab on the left of the screen and paste the list of genes into the box labeled Step 1: Enter Gene List. In Step 2: Select Identifiers choose UCSC_GENE_ID from the list, and mark Gene List in Step 3: List Type, then click Submit List in Step 4. After submission DAVID will ask you to convert the gene list into the appropriate format (DAVID format). Click on the Submit to conversion tool and save it as new converted list. After the page ends the processing, click Functional Annotation Tool to be redirected to the result page. Here you can expand all tabs and click on the different functional categories to see if there is some enrichment in your gene list. The one that might interest you the most is GOTERM_BP_FAT under Gene_Ontology Do these categories make sense given the samples you are studying?

Now you can repeat for the 6-hours embryo sample and see if there are differences between the two samples.