6.4 Differential Expression using cuffdiff

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search

One of the stand-alone tools that are part of the cufflinks package, and that performs differential expression estimation, is cuffdiff. You can use this tool to compare between two conditions; for example control and disease, or wild-type and mutant, or, as in your case, we want to identify genes that are differentially expressed between two developmental stages. You don’t need to load a different module to use cuffdiff since it is part of the cufflinks module. So let’s just create an output folder to store its results:

$ mkdir cdiff_out

The general format of the cuffdiff command is:

cuffdiff [options]* <transcripts.gtf>\
<sample1_replicate1.sam[,...,sample1_replicateM]>\
<sample2_replicate1.sam[,...,sample2_replicateM.sam]>\

where the input is an annotation file and the aligned reads (either in SAM or BAM format) for the two conditions to be compared. You can now prepare the script for running cuffdiff, by opening a text file (let’s call it cuffdiff.sh), and writing the header, the loading module commands, and pointing to your working space:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=16G
#$ -e cuffdiff.e
#$ -N cuffdiff
#$ -o cuffdiff.o
module load cufflinks/2.2.1

Now write the command to run cuffdiff (all in one line):

cuffdiff -o cdiff_out -L ZV9_2cells,ZV9_6h -b genome/Danio_rerio.Zv9.66.dna.fa -u \
--library-type fr-unstranded \
annotations/Zebrafish_refGene.gtf tophat_out/2cells/accepted_hits.bam tophat_out/6hours/accepted_hits.bam

The options that you used above are:

-o: output directory,

-L: labels for the different conditions,

-b, -u, --library-type: same meaning as described before for cufflinks The output folder (check its contents with ls –l) contains a number of files reporting differential usage for different genomic entities. You are mostly interested in the differential expression at the gene level. This is reported in the file gene_exp.diff. Look at the first few lines of the file using the following command:

$ head cdiff_out/gene_exp.diff

You would like to see which are the most significantly differentially expressed genes. To do this, you can sort the above file according to the q-value (corrected p-value for multiple testing). Let’s use a sort command to sort the file and write the sorted file in a different one called gene_exp_sorted.diff. Write the following command (all in one line):

$ sort -t$'\t' -g -k 13 cdiff_out/gene_exp.diff > cdiff_out/gene_exp_sorted.diff

Look again at the first few lines of the sorted file. If everything went as it should have, the first 18 genes should have q-value < 0.05, chosen as significance threshold, and are therefore differentially expressed between the zebrafish 2-cells embryo and the 6 hours embryo.

Now you have all the genes that change between the two conditions. In your example, the number of these genes is quite small, but in more realistic cases you might end up with hundreds of differentially expressed genes. Let’s use again DAVID to help you in the rationalization of which biological functions change between the two analyzed conditions. Let’s take the top 100 of the genes sorted by q-value from the gene_exp_sorted.diff file (in more realistic cases you would have taken only genes having q-value < 0.05, but in your example the 18 genes satisfying this condition are too few to have a significant functional characterization). As you did before, use the command head to select the top 100 genes, and the command cut to extract only the first column from these rows: $ head -100 cdiff_out/gene_exp_sorted.diff | cut -f 1 > top100genes.txt Remember that the > operator redirects the output of a command into a file. Now open the top100genes.txt file, verify that it’s a single-column file with the UCSC gene identifiers, and copy its content. Open any browser and go to: [1]

The procedure is the same that you did for the cufflinks output. Click the Start Analysis tab at the top of the screen, click the Upload tab on the left of the screen and paste the list of genes into the box labeled Step 1: Enter Gene List. In Step 2: Select Identifiers choose UCSC_GENE_ID from the list, and mark Gene List in Step 3: List Type, then click Submit List in Step 4. After the page ends the processing, click Functional Annotation Tool to be redirected to the result page. Here you can expand all tabs and click on the different functional categories to se if there is some enrichment in your gene list. The one that might interest you the most is GOTERM_BP_FAT under Gene_Ontology; do these categories make sense given the samples you are studying?