6. Construct an expression matrix

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search

Now, given the expression estimates for each of the transcripts in each of the samples, you're going to pull together expression values into matrices containing transcript IDs in the rows, and sample replicate names in the columns. you'll make two matrices, one containing the estimated counts, and another containing the TPM expression values that are cross-sample normalized using the TMM method. This is all done for you by the following script in Trinity, indicating the method you used for expression estimation and providing the list of individual sample abundance estimate files:

#$ -S /bin/sh
#$ -cwd
#$ -q large.q
#$ -l h_vmem=300G
#$ -N Matrix

Module you have to load:

module load perl/5.22.2
module load trinity/2.2.0
module load R/R-3.1.2
/cm/shared/apps/trinity/2.2.0/util/abundance_estimates_to_matrix.pl --est_method RSEM --out_prefix S5_Trinity_trans_all \
--name_sample_by_basedir /ibers/ernie/scratch/userName/RSEM_A/A.isoforms.results \
/ibers/ernie/scratch/userName/RSEM_B/B.isoforms.results \
/ibers/ernie/scratch/userName/RSEM_C/Cl.isoforms.results \
/ibers/ernie/scratch/userName/RSEM_D/D.isoforms.results \
/ibers/ernie/scratch/userName/RSEM_E/E.isoforms.results \

You should find a matrix file called 'Trinity_trans.counts.matrix', which contains the counts of RNA-Seq fragments mapped to each transcript. Examine the first few lines of the counts matrix:

head -n5 Trinity_trans.counts.matrix | column -t
                                              A                       B                    C                      D                      E  
TRINITY_DN323_c0_g1_i1    846                782                 792                 1403                1397                
TRINITY_DN2438_c0_g1_i1  418                364                353                   13                    10                  
TRINITY_DN4819_c0_g1_i1  136                 128                165                   58                  64                 
TRINITY_DN1223_c0_g1_i1   7                     4                      6                      6                     9

You'll see that the above matrix has integer values representing the number of RNA-Seq paired-end fragments that are estimated to have been derived from that corresponding transcript in each of the samples. Don't be surprised if you see some values that are not exact integers but rather fractional read counts. This happens if there are multiply-mapped reads (such as to common sequence regions of different isoforms), in which case the multiply-mapped reads are fractionally assigned to the corresponding transcripts according to their maximum likelihood.

The counts matrix will be used by edgeR (or other tools in Bioconductor, eg. DESeq2) for statistical analysis and identifying significantly differentially expressed transcripts.

Now take a look at the first few lines of the normalized expression matrix:

head -n5 Trinity_trans.TMM.EXPR.matrix | column -t 
                                               A                      B                        C                       D                        E
TRINITY_DN323_c0_g1_i1   48.683              46.605              41.263              86.390              84.111              
TRINITY_DN2438_c0_g1_i1  171.798             155.077             134.165             5.737               4.305               
TRINITY_DN4819_c0_g1_i1  174.458             170.392             206.578             80.622              86.288             
TRINITY_DN1223_c0_g1_i1  9.443               5.597               7.932               8.778               12.765              

These are the normalized expression values, which have been further cross-sample normalized using TMM normalization to adjust for any differences in sample composition. TMM normalization assumes that most transcripts are not differentially expressed, and linearly scales the expression values of samples to better enforce this property. TMM normalization is described in A scaling normalization method for differential expression analysis of RNA-Seq data, Robinson and Oshlack, Genome Biology 2010. [1]

You will use the TMM-normalized expression matrix when you will plot expression values in heatmaps and other expression analyses.