Difference between revisions of "6. Construct an expression matrix"
Line 28: | Line 28: | ||
TRINITY_DN4819_c0_g1_i1 136 128 165 58 64 | TRINITY_DN4819_c0_g1_i1 136 128 165 58 64 | ||
TRINITY_DN1223_c0_g1_i1 7 4 6 6 9 | TRINITY_DN1223_c0_g1_i1 7 4 6 6 9 | ||
+ | |||
+ | You'll see that the above matrix has integer values representing the number of RNA-Seq paired-end fragments that are estimated to have been derived from that corresponding transcript in each of the samples. Don't be surprised if you see some values that are not exact integers but rather fractional read counts. This happens if there are multiply-mapped reads (such as to common sequence regions of different isoforms), in which case the multiply-mapped reads are fractionally assigned to the corresponding transcripts according to their maximum likelihood. | ||
+ | |||
+ | The counts matrix will be used by edgeR (or other tools in Bioconductor, eg. DESeq2) for statistical analysis and identifying significantly differentially expressed transcripts. | ||
+ | |||
+ | Now take a look at the first few lines of the normalized expression matrix: | ||
+ | |||
+ | head -n5 Trinity_trans.TMM.EXPR.matrix | column -t | ||
+ | A B C D E | ||
+ | TRINITY_DN323_c0_g1_i1 48.683 46.605 41.263 86.390 84.111 | ||
+ | TRINITY_DN2438_c0_g1_i1 171.798 155.077 134.165 5.737 4.305 | ||
+ | TRINITY_DN4819_c0_g1_i1 174.458 170.392 206.578 80.622 86.288 | ||
+ | TRINITY_DN1223_c0_g1_i1 9.443 5.597 7.932 8.778 12.765 | ||
+ | |||
+ | These are the normalized expression values, which have been further cross-sample normalized using TMM normalization to adjust for any differences in sample composition. TMM normalization assumes that most transcripts are not differentially expressed, and linearly scales the expression values of samples to better enforce this property. TMM normalization is described in A scaling normalization method for differential expression analysis of RNA-Seq data, Robinson and Oshlack, Genome Biology 2010. [http://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25] | ||
+ | |||
+ | You will use the TMM-normalized expression matrix when you will plot expression values in heatmaps and other expression analyses. |
Latest revision as of 15:42, 6 March 2017
Now, given the expression estimates for each of the transcripts in each of the samples, you're going to pull together expression values into matrices containing transcript IDs in the rows, and sample replicate names in the columns. you'll make two matrices, one containing the estimated counts, and another containing the TPM expression values that are cross-sample normalized using the TMM method. This is all done for you by the following script in Trinity, indicating the method you used for expression estimation and providing the list of individual sample abundance estimate files:
#$ -S /bin/sh #$ -cwd #$ -q large.q #$ -l h_vmem=300G #$ -N Matrix
Module you have to load:
module load perl/5.22.2 module load trinity/2.2.0 module load R/R-3.1.2
/cm/shared/apps/trinity/2.2.0/util/abundance_estimates_to_matrix.pl --est_method RSEM --out_prefix S5_Trinity_trans_all \ --name_sample_by_basedir /ibers/ernie/scratch/userName/RSEM_A/A.isoforms.results \ /ibers/ernie/scratch/userName/RSEM_B/B.isoforms.results \ /ibers/ernie/scratch/userName/RSEM_C/Cl.isoforms.results \ /ibers/ernie/scratch/userName/RSEM_D/D.isoforms.results \ /ibers/ernie/scratch/userName/RSEM_E/E.isoforms.results \
You should find a matrix file called 'Trinity_trans.counts.matrix', which contains the counts of RNA-Seq fragments mapped to each transcript. Examine the first few lines of the counts matrix:
head -n5 Trinity_trans.counts.matrix | column -t
A B C D E TRINITY_DN323_c0_g1_i1 846 782 792 1403 1397 TRINITY_DN2438_c0_g1_i1 418 364 353 13 10 TRINITY_DN4819_c0_g1_i1 136 128 165 58 64 TRINITY_DN1223_c0_g1_i1 7 4 6 6 9
You'll see that the above matrix has integer values representing the number of RNA-Seq paired-end fragments that are estimated to have been derived from that corresponding transcript in each of the samples. Don't be surprised if you see some values that are not exact integers but rather fractional read counts. This happens if there are multiply-mapped reads (such as to common sequence regions of different isoforms), in which case the multiply-mapped reads are fractionally assigned to the corresponding transcripts according to their maximum likelihood.
The counts matrix will be used by edgeR (or other tools in Bioconductor, eg. DESeq2) for statistical analysis and identifying significantly differentially expressed transcripts.
Now take a look at the first few lines of the normalized expression matrix:
head -n5 Trinity_trans.TMM.EXPR.matrix | column -t A B C D E TRINITY_DN323_c0_g1_i1 48.683 46.605 41.263 86.390 84.111 TRINITY_DN2438_c0_g1_i1 171.798 155.077 134.165 5.737 4.305 TRINITY_DN4819_c0_g1_i1 174.458 170.392 206.578 80.622 86.288 TRINITY_DN1223_c0_g1_i1 9.443 5.597 7.932 8.778 12.765
These are the normalized expression values, which have been further cross-sample normalized using TMM normalization to adjust for any differences in sample composition. TMM normalization assumes that most transcripts are not differentially expressed, and linearly scales the expression values of samples to better enforce this property. TMM normalization is described in A scaling normalization method for differential expression analysis of RNA-Seq data, Robinson and Oshlack, Genome Biology 2010. [1]
You will use the TMM-normalized expression matrix when you will plot expression values in heatmaps and other expression analyses.