Difference between revisions of "6.4 Differential Expression using cuffdiff"

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search
Line 31: Line 31:
  
 
'''-b''', '''-u''', '''--library-type''': same meaning as described before for cufflinks
 
'''-b''', '''-u''', '''--library-type''': same meaning as described before for cufflinks
 +
The output folder (check its contents with '''ls –l''') contains a number of files reporting  differential  usage  for  different  genomic  entities.  You  are  mostly  interested  in  the  differential  expression  at  the  gene  level.  This  is  reported  in  the  file  '''gene_exp.diff'''.  Look at the first few lines of the file using the following command:
 +
 +
$ head cdiff_out/gene_exp.diff
 +
You would like to see which are the most significantly differentially expressed genes. To  do  this,  you  can  '''sort'''  the  above  file  according  to  the  '''q-value'''  (corrected  p-value  for  multiple testing). Let’s use a sort command to sort the file and write the sorted file in a  different  one  called  '''gene_exp_sorted.diff'''.  Write  the  following  command  (all  in  one line):
 +
 +
$ sort -t$'\t' -g -k 13 cdiff_out/gene_exp.diff > cdiff_out/gene_exp_sorted.diff
 +
Look again at the first few lines of the sorted file. If everything went as it should have, the first  18  genes  should  have  q-value  <  0.05,  chosen  as  significance  threshold,  and  are  therefore differentially expressed between the zebrafish 2-cells embryo and the 6 hours  embryo.
 +
 +
Now  you  have  all  the  genes  that  change  between  the  two  conditions.  In  your  example,  the  number  of  these  genes  is  quite  small,  but  in  more  realistic  cases  you  might  end  up  with hundreds of differentially expressed genes. Let’s use again DAVID to help you in the  rationalization  of  which  biological  functions  change  between  the  two  analyzed  conditions.  Let’s  take  the  top  100  of  the  genes  sorted  by  q-value  from  the  '''gene_exp_sorted.diff''' file (in more realistic cases you would have taken only genes  having q-value < 0.05, but in your example the 18 genes satisfying this condition are too  few  to  have  a  significant  functional  characterization).  As  you  did  before,  use  the  command  head  to  select  the  top  100  genes,  and  the  command  cut  to  extract  only  the  first column from these rows:

Revision as of 10:45, 18 February 2016

One of the stand-alone tools that are part of the cufflinks package, and that performs differential expression estimation, is cuffdiff. You can use this tool to compare between two conditions; for example control and disease, or wild-type and mutant, or, as in your case, we want to identify genes that are differentially expressed between two developmental stages. You don’t need to load a different module to use cuffdiff since it is part of the cufflinks module. So let’s just create an output folder to store its results:

$ mkdir cdiff_out

The general format of the cuffdiff command is:

cuffdiff [options]* <transcripts.gtf>\
<sample1_replicate1.sam[,...,sample1_replicateM]>\
<sample2_replicate1.sam[,...,sample2_replicateM.sam]>\

where the input is an annotation file and the aligned reads (either in SAM or BAM format) for the two conditions to be compared. You can now prepare the script for running cuffdiff, by opening a text file (let’s call it cuffdiff.sh), and writing the header, the loading module commands, and pointing to your working space:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=16G
#$ -e cuffdiff.e
#$ -N cuffdiff
#$ -o cuffdiff.o
module load cufflinks/2.2.1

Now write the command to run cuffdiff (all in one line):

cuffdiff -o cdiff_out -L ZV9_2cells,ZV9_6h -b genome/Danio_rerio.Zv9.66.dna.fa -u \
--library-type fr-unstranded \
annotations/Zebrafish_refGene.gtf tophat_out/2cells/accepted_hits.bam tophat_out/6hours/accepted_hits.bam

The options that you used above are:

-o: output directory,

-L: labels for the different conditions,

-b, -u, --library-type: same meaning as described before for cufflinks The output folder (check its contents with ls –l) contains a number of files reporting differential usage for different genomic entities. You are mostly interested in the differential expression at the gene level. This is reported in the file gene_exp.diff. Look at the first few lines of the file using the following command:

$ head cdiff_out/gene_exp.diff

You would like to see which are the most significantly differentially expressed genes. To do this, you can sort the above file according to the q-value (corrected p-value for multiple testing). Let’s use a sort command to sort the file and write the sorted file in a different one called gene_exp_sorted.diff. Write the following command (all in one line):

$ sort -t$'\t' -g -k 13 cdiff_out/gene_exp.diff > cdiff_out/gene_exp_sorted.diff

Look again at the first few lines of the sorted file. If everything went as it should have, the first 18 genes should have q-value < 0.05, chosen as significance threshold, and are therefore differentially expressed between the zebrafish 2-cells embryo and the 6 hours embryo.

Now you have all the genes that change between the two conditions. In your example, the number of these genes is quite small, but in more realistic cases you might end up with hundreds of differentially expressed genes. Let’s use again DAVID to help you in the rationalization of which biological functions change between the two analyzed conditions. Let’s take the top 100 of the genes sorted by q-value from the gene_exp_sorted.diff file (in more realistic cases you would have taken only genes having q-value < 0.05, but in your example the 18 genes satisfying this condition are too few to have a significant functional characterization). As you did before, use the command head to select the top 100 genes, and the command cut to extract only the first column from these rows: