8. Expression analysis

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search

A plethora of tools are currently available for identifying differentially expressed transcripts based on RNA-Seq data, and of these, edgeR and DESeq2 are very popular and highly accurate. The edgeR software is part of the R Bioconductor package, and we provide support for using it in the Trinity package.

Having biological replicates for each of your samples is crucial for accurate detection of differentially expressed transcripts. Lets say that in your data set, you have three biological replicates for each of your conditions, and in general, having three or more replicates for each experimental condition is highly recommended.

Create a samples.txt file containing the contents below (tab-delimited), indicating the name of the condition followed by the name of the biological replicate. The replicate names must match up with the column headings of your counts matrix:

head -n1 Trinity_trans.counts.matrix | tee samples.txt
A_rep1  A_rep2  A_rep3  B_rep1  B_rep2  B_rep3  C_rep1   C_rep2   D_rep3

Now edit file 'samples.txt' to contain the tab-delimited 2-column format:

sample_name    unique_replicate_name

and it should look like so:

cat samples.txt
A     A_rep1
A     A_rep2
A     A_rep3
B     B_rep1
B     B_rep2
B     B_rep3
C     C_rep1
C     C_rep2
C     C_rep3

Another thing to validate would be to ensure that there are physical tab characters that delimit the columns. You can do this using 'cat -te filename' like so:

cat -te samples.txt
A^A_rep1$
A^A_rep2$
A^A_rep3$
B^B_rep1$
B^B_rep2$
B^B_rep3$
C^C_rep1$
C^C_rep2$
C^C_rep3$

Here, you should see '^I' at positions of tab characters, and '$' at newline characters. This is invaluable for verifying contents of text files when the formatting has to be very precise. If your view doesn't look exactly as above, then you'll need to re-edit your file and be sure to replace any space characters you see with single tab characters, then re-examine it using the 'cat -te' approach.

The condition name (left column) can be named more or less arbitrarily but should reflect your experimental condition. Importantly, again, the replicate names (right column) need to match up exactly with the column headers of your RNA-Seq counts matrix.

Unsupervised Data Exploration Before delving into differential expression analysis, you should examine your data to determine if there are any confounding issues. For example, your experimental replicates should look highly similar, and they should ideally be more similar to each other than to different sample types, assuming that the signal in the data is stronger than noise.

Trinity comes with a 'PtR' script that we use to simplify making various charts and plots based on a matrix input file. We'll use PtR for our plotting below. Note, PtR uses the 'R' software and environment, essentially generating an R script to generate an image file in pdf format, which you can view using the 'xpdf' utility.

Compare your replicates Run the following to compare your biological replicates:

$TRINITY_HOME/Analysis/DifferentialExpression/PtR \

-m Trinity_trans.counts.matrix -s samples.txt \
--log2 --compare_replicates 

This should have generated pdf files, with one pdf report for each of your sample types. See what files were most recently created (via. 'ls -ltr') and you should find the following:

A.rep_compare.pdf
B.rep_compare.pdf
C.rep_compare.pdf

Compare your replicates across all samples One common way to compare the sample replicates in the context of all samples is to generate a replicate correlation plot containing all replicates across all samples:

$TRINITY_HOME/Analysis/DifferentialExpression/PtR \
     -m Trinity_trans.counts.matrix -s samples.txt \
     --log2 --sample_cor_matrix

This should generate a heatmap. (use 'ls -ltr' to find the name of the file, which should be 'Trinity_trans.counts.matrix.log2.sample_cor_matrix.pdf').

Ideally, you should see that your replicates are well correlated and cluster together according to sample type, being more similar to each other than to the different samples. This will generally be the case when the signal in the data is strong.

Another powerful technique for comparing samples and replicates is to use Principal Components Analysis (PCA). Run PCA on your data like so:

$TRINITY_HOME/Analysis/DifferentialExpression/PtR\
      -m Trinity_trans.counts.matrix -s samples.txt\
      --log2 --CPM --prin_comp 3

This generates a pdf file containing the PCA plots.

PCA is one of the best ways to identify potentially confounding effects in your data. If your replicates do not cluster according to sample and instead cluster according to some other aspect such as by the person who processed the replicates, the machine the replicates were sequenced on, etc., then you have batch effects that will need to be accounted for. It's best to avoid potential batch effects whenever and wherever possible.

Differential expression analysis To detect differentially expressed transcripts, you have to run sophisticated statistical analysis methods integrated into the R Bioconductor suite, such as edgeR or DESeq2. These are both integrated into the Trinity analysis framework to simplify execution.

Try running edgeR using the counts matrix:

$TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl \
     --matrix Trinity_trans.counts.matrix \
     --samples_file samples.txt \
     --method edgeR \
     --output edgeR_trans

The output files '*.DE_results' contain the output from running EdgeR to identify differentially expressed transcripts in each of the pairwise sample comparisons.

head edgeR_trans/Trinity_trans.counts.matrix.A_vs_B.edgeR.DE_results | column -t
                          logFC                    logCPM             PValue            FDR
TRINITY_DN5022_c0_g2_i1  9.49707206927776   10.8064019152136  2.76394324232293e-290  2.11773331226783e-286
TRINITY_DN1623_c0_g1_i1  7.07626413415546   10.7248178757945  1.72659888876553e-263  6.61460034286075e-260
TRINITY_DN5038_c0_g1_i1  8.2143888197428    12.590824463825   8.16475148164346e-260  2.08527752841174e-256
TRINITY_DN611_c0_g1_i1   6.99748245283561   10.3886702337903  2.35164882236904e-251  4.50458331924789e-248
TRINITY_DN1196_c0_g1_i1  7.66693385377398   9.62429477927643  3.14625702637954e-240  4.82132426722401e-237
TRINITY_DN235_c0_g1_i1   -6.84340232453515  9.31940620686212  1.2712197490933e-233   1.62334761959215e-230
TRINITY_DN6470_c0_g1_i1  -7.52247429960845  7.02692650884081  1.5022536219591e-222   1.64432389306437e-219
TRINITY_DN348_c0_g1_i1   -8.34481183802738  7.20863899236057  5.99388275515287e-216  5.74064120874766e-213
TRINITY_DN2419_c0_g1_i1  7.87366689800258   8.47510448730953  4.70441688285631e-213  4.00502690627167e-210

These data include the log fold change (logFC), log counts per million (logCPM), P- value from an exact test, and false discovery rate (FDR).

The EdgeR analysis above generated both MA and Volcano plots based on these data.

Trinity facilitates analysis of these data, including scripts for extracting transcripts that are above some statistical significance (FDR thre shold) and fold-change in expression, and generating figures such as heatmaps and other useful plots.

Extracting differentially expressed transcripts and generating heatmaps Perform the following operations from within the edgeR_trans/ directory:

Extract those differentially expressed (DE) transcripts that are at least 4-fold differentially expressed at a significance of <= 0.001 in any of the pairwise sample comparisons:

$TRINITY_HOME/Analysis/DifferentialExpression/analyze_diff_expr.pl \
     --matrix ../Trinity_trans.TMM.EXPR.matrix \
     --samples ../samples.txt \
     -P 1e-3 -C 2 

The above generates several output files with a prefix diffExpr.P1e-3_C2', indicating the parameters chosen for filtering, where P (FDR actually) is set to 0.001, and fold change (C) is set to 2^(2) or 4-fold. (These are default parameters for the above script).

Included among these files are: ‘diffExpr.P1e-3_C2.matrix’ : the subset of the FPKM matrix corresponding to the DE transcripts identified at this threshold. The number of DE transcripts identified at the specified thresholds can be obtained by examining the number of lines in this file.

wc -l diffExpr.P1e-3_C2.matrix

Also included among these files is a heatmap 'diffExpr.P1e-3_C2.matrix.log2.centered.genes_vs_samples_heatmap.pdf' with transcripts clustered along the vertical axis and samples clustered along the horizontal axis.

Using PtR, you can revise the heatmap view, adjusting the scaling to further brighten those features that are most highly differentially expressed:

$TRINITY_HOME/Analysis/DifferentialExpression/PtR  \
    -m diffExpr.P1e-3_C2.matrix.log2.centered.dat \
    -s ../samples.txt \
    --gene_dist euclidean --sample_dist euclidean \
     --heatmap --heatmap_scale_limits "-4,4"

Samples of volcano plots and heatMaps you can find on Trinity assembly wiki: [1]