Difference between revisions of "6.4 Differential Expression using cuffdiff"
Line 31: | Line 31: | ||
'''-b''', '''-u''', '''--library-type''': same meaning as described before for cufflinks | '''-b''', '''-u''', '''--library-type''': same meaning as described before for cufflinks | ||
+ | The output folder (check its contents with '''ls –l''') contains a number of files reporting differential usage for different genomic entities. You are mostly interested in the differential expression at the gene level. This is reported in the file '''gene_exp.diff'''. Look at the first few lines of the file using the following command: | ||
+ | |||
+ | $ head cdiff_out/gene_exp.diff | ||
+ | You would like to see which are the most significantly differentially expressed genes. To do this, you can '''sort''' the above file according to the '''q-value''' (corrected p-value for multiple testing). Let’s use a sort command to sort the file and write the sorted file in a different one called '''gene_exp_sorted.diff'''. Write the following command (all in one line): | ||
+ | |||
+ | $ sort -t$'\t' -g -k 13 cdiff_out/gene_exp.diff > cdiff_out/gene_exp_sorted.diff | ||
+ | Look again at the first few lines of the sorted file. If everything went as it should have, the first 18 genes should have q-value < 0.05, chosen as significance threshold, and are therefore differentially expressed between the zebrafish 2-cells embryo and the 6 hours embryo. | ||
+ | |||
+ | Now you have all the genes that change between the two conditions. In your example, the number of these genes is quite small, but in more realistic cases you might end up with hundreds of differentially expressed genes. Let’s use again DAVID to help you in the rationalization of which biological functions change between the two analyzed conditions. Let’s take the top 100 of the genes sorted by q-value from the '''gene_exp_sorted.diff''' file (in more realistic cases you would have taken only genes having q-value < 0.05, but in your example the 18 genes satisfying this condition are too few to have a significant functional characterization). As you did before, use the command head to select the top 100 genes, and the command cut to extract only the first column from these rows: |
Revision as of 10:45, 18 February 2016
One of the stand-alone tools that are part of the cufflinks package, and that performs differential expression estimation, is cuffdiff. You can use this tool to compare between two conditions; for example control and disease, or wild-type and mutant, or, as in your case, we want to identify genes that are differentially expressed between two developmental stages. You don’t need to load a different module to use cuffdiff since it is part of the cufflinks module. So let’s just create an output folder to store its results:
$ mkdir cdiff_out
The general format of the cuffdiff command is:
cuffdiff [options]* <transcripts.gtf>\ <sample1_replicate1.sam[,...,sample1_replicateM]>\ <sample2_replicate1.sam[,...,sample2_replicateM.sam]>\
where the input is an annotation file and the aligned reads (either in SAM or BAM format) for the two conditions to be compared. You can now prepare the script for running cuffdiff, by opening a text file (let’s call it cuffdiff.sh), and writing the header, the loading module commands, and pointing to your working space:
#$ -S /bin/sh #$ -cwd #$ -q amd.q,large.q,intel.q #$ -l h_vmem=16G #$ -e cuffdiff.e #$ -N cuffdiff #$ -o cuffdiff.o module load cufflinks/2.2.1
Now write the command to run cuffdiff (all in one line):
cuffdiff -o cdiff_out -L ZV9_2cells,ZV9_6h -b genome/Danio_rerio.Zv9.66.dna.fa -u \ --library-type fr-unstranded \ annotations/Zebrafish_refGene.gtf tophat_out/2cells/accepted_hits.bam tophat_out/6hours/accepted_hits.bam
The options that you used above are:
-o: output directory,
-L: labels for the different conditions,
-b, -u, --library-type: same meaning as described before for cufflinks The output folder (check its contents with ls –l) contains a number of files reporting differential usage for different genomic entities. You are mostly interested in the differential expression at the gene level. This is reported in the file gene_exp.diff. Look at the first few lines of the file using the following command:
$ head cdiff_out/gene_exp.diff
You would like to see which are the most significantly differentially expressed genes. To do this, you can sort the above file according to the q-value (corrected p-value for multiple testing). Let’s use a sort command to sort the file and write the sorted file in a different one called gene_exp_sorted.diff. Write the following command (all in one line):
$ sort -t$'\t' -g -k 13 cdiff_out/gene_exp.diff > cdiff_out/gene_exp_sorted.diff
Look again at the first few lines of the sorted file. If everything went as it should have, the first 18 genes should have q-value < 0.05, chosen as significance threshold, and are therefore differentially expressed between the zebrafish 2-cells embryo and the 6 hours embryo.
Now you have all the genes that change between the two conditions. In your example, the number of these genes is quite small, but in more realistic cases you might end up with hundreds of differentially expressed genes. Let’s use again DAVID to help you in the rationalization of which biological functions change between the two analyzed conditions. Let’s take the top 100 of the genes sorted by q-value from the gene_exp_sorted.diff file (in more realistic cases you would have taken only genes having q-value < 0.05, but in your example the 18 genes satisfying this condition are too few to have a significant functional characterization). As you did before, use the command head to select the top 100 genes, and the command cut to extract only the first column from these rows: