5. Transcript expression quantitation

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search

Abundance estimation using RSEM

To estimate the expression levels of the Trinity-reconstructed transcripts, you can use the strategy supported by the RSEM software involving read alignment followed by expectation maximization to assign reads according to maximum likelihood estimates. In essence, you have to first align the original rna-seq reads back against the Trinity transcripts, then run RSEM to estimate the number of rna-seq fragments that map to each contig. Because the abundance of individual transcripts may significantly differ between samples, the reads from each sample (and each biological replicate) must be examined separately, obtaining sample-specific abundance values.

There is a script to facilitate running of RSEM on Trinity transcript assemblies. The script runs the Bowtie aligner to align reads to the Trinity transcripts, and RSEM will then evaluate those alignments to estimate expression values. Again, you need to run this separately for each sample and biological replicate (ie. each pair of fastq files).

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=50G
#$ -N RSEM
module load perl/5.22.2
module load trinity/2.2.0

RSEM command:

/cm/shared/apps/trinity/2.2.0/util/align_and_estimate_abundance.pl --seqType fq \
--left /ibers/ernie/scratch/userName/Fastq_trimmed/reads_1_1.fq.gz.p \
--right /ibers/ernie/scratch/userName/Fastq_trimmed/reads_1_2.fq.gz.p \
--transcripts /ibers/ernie/scratch/userName/trinity_out_dir/Trinity.fasta --output_prefix RSEM_name \
--est_method RSEM  --aln_method bowtie --trinity_mode --coordsort_bam \
--output_dir /ibers/ernie/scratch/userName/outputName --prep_reference

The outputs generated from running the command above will exist in the RSEM_name/ directory, as is indicated in the --output_dir parameter above. Since this is the first time of running RSEM, you need to prepare the Trinity.fasta file for alignment with the bowtie software, and so you have tol simply include the --prep_reference parameter to invoke that functionality.

The primary output generated by RSEM is the file containing the expression values for each of the transcripts. Examine this file like so:

head RSEM_name/name.isoforms.results | column -t
transcript_id            gene_id               length  effective_length  expected_count  TPM     FPKM    IsoPct
TRINITY_DN0_c0_g1_i1     TRINITY_DN0_c0_g1     2010    1881.38           225.00          72.01   65.65   100.00
TRINITY_DN0_c0_g2_i1     TRINITY_DN0_c0_g2     2011    1882.38           0.00            0.00    0.00    100.00
TRINITY_DN1000_c0_g1_i1  TRINITY_DN1000_c0_g1  1664    1535.38           208.00          81.57   74.37   100.00
TRINITY_DN1001_c0_g1_i1  TRINITY_DN1001_c0_g1  2948    2819.38           8.00            1.71    1.56    100.00
TRINITY_DN1002_c0_g1_i1  TRINITY_DN1002_c0_g1  351     222.54            1.00            2.71    2.47    100.00
TRINITY_DN1003_c0_g1_i1  TRINITY_DN1003_c0_g1  1913    1784.38           91.00           30.71   28.00   100.00
TRINITY_DN1004_c0_g1_i1  TRINITY_DN1004_c0_g1  2395    2266.38           167.00          44.37   40.45   100.00
TRINITY_DN1005_c0_g1_i1  TRINITY_DN1005_c0_g1  3076    2947.38           425.00          86.82   79.16   100.00
TRINITY_DN1006_c0_g1_i1  TRINITY_DN1006_c0_g1  1141    1012.38           253.00          150.47  137.19  100.00

The key columns in the above RSEM output are the transcript identifier, the 'expected_count' corresponding to the number of RNA-Seq fragments predicted to be derived from that transcript, and the 'TPM' or 'FPKM' columns, which provide normalized expression values for the expression of that transcript in the sample.

The FPKM expression measurement normalizes read counts according to the length of transcripts from which they are derived (as longer transcripts generate more reads at the same expression level), and normalized according to sequencing depth. The FPKM acronym stand for 'fragments per kilobase of cDNA per million fragments mapped'.

TPM 'transcripts per million' is generally the favored metric, as all TPM values should sum to 1 million, and TPM nicely reflects the relative molar concentration of that transcript in the sample. FPKM values, on the other hand, do not always sum to the same value, and do not have the similar property of inherrently representing a proportion within a sample, making comparisons between samples less straightforward. TPM values can be easily computed from FPKM values like so: TPMi = FPKMi / (sum all FPKM values) * 1 million.

Note, a similar RSEM outputted gene expression file exists for the 'gene' level expression data. For example:

head RSEM_name/name.genes.results | column -t 
gene_id               transcript_id(s)         length   effective_length  expected_count  TPM     FPKM
TRINITY_DN0_c0_g1     TRINITY_DN0_c0_g1_i1     2010.00  1881.38           225.00          72.01   65.65
TRINITY_DN0_c0_g2     TRINITY_DN0_c0_g2_i1     2011.00  1882.38           0.00            0.00    0.00
TRINITY_DN1000_c0_g1  TRINITY_DN1000_c0_g1_i1  1664.00  1535.38           208.00          81.57   74.37
TRINITY_DN1001_c0_g1  TRINITY_DN1001_c0_g1_i1  2948.00  2819.38           8.00            1.71    1.56
TRINITY_DN1002_c0_g1  TRINITY_DN1002_c0_g1_i1  351.00   222.54            1.00            2.71    2.47
TRINITY_DN1003_c0_g1  TRINITY_DN1003_c0_g1_i1  1913.00  1784.38           91.00           30.71   28.00
TRINITY_DN1004_c0_g1  TRINITY_DN1004_c0_g1_i1  2395.00  2266.38           167.00          44.37   40.45
TRINITY_DN1005_c0_g1  TRINITY_DN1005_c0_g1_i1  3076.00  2947.38           425.00