Difference between revisions of "6.4 Differential Expression using cuffdiff"

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search
(Created page with "One of the stand-alone tools that are part of the '''cufflinks''' package, and that performs differential expression estimation, is '''cuffdiff'''. You can use this tool to c...")
 
 
(3 intermediate revisions by the same user not shown)
Line 12: Line 12:
 
  #$ -cwd
 
  #$ -cwd
 
  #$ -q amd.q,large.q,intel.q
 
  #$ -q amd.q,large.q,intel.q
  #$ -l h_vmem=24G
+
  #$ -l h_vmem=16G
  #$ -e cff_whole.e
+
  #$ -e cuffdiff.e
  #$ -N cff
+
  #$ -N cuffdiff
  #$ -o cff_whole.o
+
  #$ -o cuffdiff.o
 
  module load cufflinks/2.2.1
 
  module load cufflinks/2.2.1
 +
 +
Now write the command to run cuffdiff (all in one line):
 +
 +
cuffdiff -o cdiff_out -L ZV9_2cells,ZV9_6h -b genome/Danio_rerio.Zv9.66.dna.fa -u \
 +
--library-type fr-unstranded \
 +
annotations/Zebrafish_refGene.gtf tophat_out/2cells/accepted_hits.bam tophat_out/6hours/accepted_hits.bam
 +
 +
The options that you used above are:
 +
 +
'''-o''': output directory,
 +
 +
'''-L''': labels for the different conditions,
 +
 +
'''-b''', '''-u''', '''--library-type''': same meaning as described before for cufflinks
 +
The output folder (check its contents with '''ls –l''') contains a number of files reporting  differential  usage  for  different  genomic  entities.  You  are  mostly  interested  in  the  differential  expression  at  the  gene  level.  This  is  reported  in  the  file  '''gene_exp.diff'''.  Look at the first few lines of the file using the following command:
 +
 +
$ head cdiff_out/gene_exp.diff
 +
You would like to see which are the most significantly differentially expressed genes. To  do  this,  you  can  '''sort'''  the  above  file  according  to  the  '''q-value'''  (corrected  p-value  for  multiple testing). Let’s use a sort command to sort the file and write the sorted file in a  different  one  called  '''gene_exp_sorted.diff'''.  Write  the  following  command  (all  in  one line):
 +
 +
$ sort -t$'\t' -g -k 13 cdiff_out/gene_exp.diff > cdiff_out/gene_exp_sorted.diff
 +
Look again at the first few lines of the sorted file. If everything went as it should have, the first  18  genes  should  have  q-value  <  0.05,  chosen  as  significance  threshold,  and  are  therefore differentially expressed between the zebrafish 2-cells embryo and the 6 hours  embryo.
 +
 +
Now  you  have  all  the  genes  that  change  between  the  two  conditions.  In  your  example,  the  number  of  these  genes  is  quite  small,  but  in  more  realistic  cases  you  might  end  up  with hundreds of differentially expressed genes. Let’s use again DAVID to help you in the  rationalization  of  which  biological  functions  change  between  the  two  analyzed  conditions.  Let’s  take  the  top  100  of  the  genes  sorted  by  q-value  from  the  '''gene_exp_sorted.diff''' file (in more realistic cases you would have taken only genes  having q-value < 0.05, but in your example the 18 genes satisfying this condition are too  few  to  have  a  significant  functional  characterization).  As  you  did  before,  use  the  command  head  to  select  the  top  100  genes,  and  the  command  cut  to  extract  only  the  first column from these rows:
 +
$ head -100 cdiff_out/gene_exp_sorted.diff | cut -f 1 > top100genes.txt
 +
Remember that the '''>''' operator redirects the output of a command into a file. Now open  the '''top100genes.txt''' file, verify that it’s a single-column file with the UCSC gene  identifiers, and copy its content. Open any browser and go to: [https://david.ncifcrf.gov/home.jsp]
 +
 +
The  procedure  is  the  same  that  you  did  for  the  cufflinks  output.  Click  the  '''Start Analysis''' tab at the top of the screen, click the '''Upload tab''' on the left of the screen and  paste the list of genes into the box labeled Step 1: '''Enter Gene List'''. In Step 2: Select Identifiers  choose '''UCSC_GENE_ID'''  from  the  list,  and  mark  '''Gene List''' in Step 3: '''List Type''', then click '''Submit List''' in Step 4.
 +
After  the  page  ends  the  processing,  click  '''Functional Annotation Tool'''  to  be  redirected  to  the  result  page.  Here  you  can  expand  all  tabs  and  click  on  the  different  functional  categories  to  se  if  there  is  some  enrichment  in  your  gene  list.  The  one  that  might  interest  you  the  most  is  '''GOTERM_BP_FAT'''  under  '''Gene_Ontology''';  do  these  categories make sense given the samples you are studying?

Latest revision as of 10:48, 18 February 2016

One of the stand-alone tools that are part of the cufflinks package, and that performs differential expression estimation, is cuffdiff. You can use this tool to compare between two conditions; for example control and disease, or wild-type and mutant, or, as in your case, we want to identify genes that are differentially expressed between two developmental stages. You don’t need to load a different module to use cuffdiff since it is part of the cufflinks module. So let’s just create an output folder to store its results:

$ mkdir cdiff_out

The general format of the cuffdiff command is:

cuffdiff [options]* <transcripts.gtf>\
<sample1_replicate1.sam[,...,sample1_replicateM]>\
<sample2_replicate1.sam[,...,sample2_replicateM.sam]>\

where the input is an annotation file and the aligned reads (either in SAM or BAM format) for the two conditions to be compared. You can now prepare the script for running cuffdiff, by opening a text file (let’s call it cuffdiff.sh), and writing the header, the loading module commands, and pointing to your working space:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=16G
#$ -e cuffdiff.e
#$ -N cuffdiff
#$ -o cuffdiff.o
module load cufflinks/2.2.1

Now write the command to run cuffdiff (all in one line):

cuffdiff -o cdiff_out -L ZV9_2cells,ZV9_6h -b genome/Danio_rerio.Zv9.66.dna.fa -u \
--library-type fr-unstranded \
annotations/Zebrafish_refGene.gtf tophat_out/2cells/accepted_hits.bam tophat_out/6hours/accepted_hits.bam

The options that you used above are:

-o: output directory,

-L: labels for the different conditions,

-b, -u, --library-type: same meaning as described before for cufflinks The output folder (check its contents with ls –l) contains a number of files reporting differential usage for different genomic entities. You are mostly interested in the differential expression at the gene level. This is reported in the file gene_exp.diff. Look at the first few lines of the file using the following command:

$ head cdiff_out/gene_exp.diff

You would like to see which are the most significantly differentially expressed genes. To do this, you can sort the above file according to the q-value (corrected p-value for multiple testing). Let’s use a sort command to sort the file and write the sorted file in a different one called gene_exp_sorted.diff. Write the following command (all in one line):

$ sort -t$'\t' -g -k 13 cdiff_out/gene_exp.diff > cdiff_out/gene_exp_sorted.diff

Look again at the first few lines of the sorted file. If everything went as it should have, the first 18 genes should have q-value < 0.05, chosen as significance threshold, and are therefore differentially expressed between the zebrafish 2-cells embryo and the 6 hours embryo.

Now you have all the genes that change between the two conditions. In your example, the number of these genes is quite small, but in more realistic cases you might end up with hundreds of differentially expressed genes. Let’s use again DAVID to help you in the rationalization of which biological functions change between the two analyzed conditions. Let’s take the top 100 of the genes sorted by q-value from the gene_exp_sorted.diff file (in more realistic cases you would have taken only genes having q-value < 0.05, but in your example the 18 genes satisfying this condition are too few to have a significant functional characterization). As you did before, use the command head to select the top 100 genes, and the command cut to extract only the first column from these rows: $ head -100 cdiff_out/gene_exp_sorted.diff | cut -f 1 > top100genes.txt Remember that the > operator redirects the output of a command into a file. Now open the top100genes.txt file, verify that it’s a single-column file with the UCSC gene identifiers, and copy its content. Open any browser and go to: [1]

The procedure is the same that you did for the cufflinks output. Click the Start Analysis tab at the top of the screen, click the Upload tab on the left of the screen and paste the list of genes into the box labeled Step 1: Enter Gene List. In Step 2: Select Identifiers choose UCSC_GENE_ID from the list, and mark Gene List in Step 3: List Type, then click Submit List in Step 4. After the page ends the processing, click Functional Annotation Tool to be redirected to the result page. Here you can expand all tabs and click on the different functional categories to se if there is some enrichment in your gene list. The one that might interest you the most is GOTERM_BP_FAT under Gene_Ontology; do these categories make sense given the samples you are studying?