Difference between revisions of "5. Transcript expression quantitation"
Line 37: | Line 37: | ||
TRINITY_DN1005_c0_g1_i1 TRINITY_DN1005_c0_g1 3076 2947.38 425.00 86.82 79.16 100.00 | TRINITY_DN1005_c0_g1_i1 TRINITY_DN1005_c0_g1 3076 2947.38 425.00 86.82 79.16 100.00 | ||
TRINITY_DN1006_c0_g1_i1 TRINITY_DN1006_c0_g1 1141 1012.38 253.00 150.47 137.19 100.00 | TRINITY_DN1006_c0_g1_i1 TRINITY_DN1006_c0_g1 1141 1012.38 253.00 150.47 137.19 100.00 | ||
+ | |||
+ | The key columns in the above RSEM output are the transcript identifier, the 'expected_count' corresponding to the number of RNA-Seq fragments predicted to be derived from that transcript, and the 'TPM' or 'FPKM' columns, which provide normalized expression values for the expression of that transcript in the sample. | ||
+ | |||
+ | The FPKM expression measurement normalizes read counts according to the length of transcripts from which they are derived (as longer transcripts generate more reads at the same expression level), and normalized according to sequencing depth. The FPKM acronym stand for 'fragments per kilobase of cDNA per million fragments mapped'. | ||
+ | |||
+ | TPM 'transcripts per million' is generally the favored metric, as all TPM values should sum to 1 million, and TPM nicely reflects the relative molar concentration of that transcript in the sample. FPKM values, on the other hand, do not always sum to the same value, and do not have the similar property of inherrently representing a proportion within a sample, making comparisons between samples less straightforward. TPM values can be easily computed from FPKM values like so: TPMi = FPKMi / (sum all FPKM values) * 1 million. | ||
+ | |||
+ | Note, a similar RSEM outputted gene expression file exists for the 'gene' level expression data. For example: | ||
+ | |||
+ | head RSEM_name/name.genes.results | column -t | ||
+ | |||
+ | gene_id transcript_id(s) length effective_length expected_count TPM FPKM | ||
+ | TRINITY_DN0_c0_g1 TRINITY_DN0_c0_g1_i1 2010.00 1881.38 225.00 72.01 65.65 | ||
+ | TRINITY_DN0_c0_g2 TRINITY_DN0_c0_g2_i1 2011.00 1882.38 0.00 0.00 0.00 | ||
+ | TRINITY_DN1000_c0_g1 TRINITY_DN1000_c0_g1_i1 1664.00 1535.38 208.00 81.57 74.37 | ||
+ | TRINITY_DN1001_c0_g1 TRINITY_DN1001_c0_g1_i1 2948.00 2819.38 8.00 1.71 1.56 | ||
+ | TRINITY_DN1002_c0_g1 TRINITY_DN1002_c0_g1_i1 351.00 222.54 1.00 2.71 2.47 | ||
+ | TRINITY_DN1003_c0_g1 TRINITY_DN1003_c0_g1_i1 1913.00 1784.38 91.00 30.71 28.00 | ||
+ | TRINITY_DN1004_c0_g1 TRINITY_DN1004_c0_g1_i1 2395.00 2266.38 167.00 44.37 40.45 | ||
+ | TRINITY_DN1005_c0_g1 TRINITY_DN1005_c0_g1_i1 3076.00 2947.38 425.00 |
Latest revision as of 15:17, 6 March 2017
Abundance estimation using RSEM
To estimate the expression levels of the Trinity-reconstructed transcripts, you can use the strategy supported by the RSEM software involving read alignment followed by expectation maximization to assign reads according to maximum likelihood estimates. In essence, you have to first align the original rna-seq reads back against the Trinity transcripts, then run RSEM to estimate the number of rna-seq fragments that map to each contig. Because the abundance of individual transcripts may significantly differ between samples, the reads from each sample (and each biological replicate) must be examined separately, obtaining sample-specific abundance values.
There is a script to facilitate running of RSEM on Trinity transcript assemblies. The script runs the Bowtie aligner to align reads to the Trinity transcripts, and RSEM will then evaluate those alignments to estimate expression values. Again, you need to run this separately for each sample and biological replicate (ie. each pair of fastq files).
#$ -S /bin/sh #$ -cwd #$ -q amd.q,large.q,intel.q #$ -l h_vmem=50G #$ -N RSEM module load perl/5.22.2 module load trinity/2.2.0
RSEM command:
/cm/shared/apps/trinity/2.2.0/util/align_and_estimate_abundance.pl --seqType fq \ --left /ibers/ernie/scratch/userName/Fastq_trimmed/reads_1_1.fq.gz.p \ --right /ibers/ernie/scratch/userName/Fastq_trimmed/reads_1_2.fq.gz.p \ --transcripts /ibers/ernie/scratch/userName/trinity_out_dir/Trinity.fasta --output_prefix RSEM_name \ --est_method RSEM --aln_method bowtie --trinity_mode --coordsort_bam \ --output_dir /ibers/ernie/scratch/userName/outputName --prep_reference
The outputs generated from running the command above will exist in the RSEM_name/ directory, as is indicated in the --output_dir parameter above. Since this is the first time of running RSEM, you need to prepare the Trinity.fasta file for alignment with the bowtie software, and so you have tol simply include the --prep_reference parameter to invoke that functionality.
The primary output generated by RSEM is the file containing the expression values for each of the transcripts. Examine this file like so:
head RSEM_name/name.isoforms.results | column -t
transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct TRINITY_DN0_c0_g1_i1 TRINITY_DN0_c0_g1 2010 1881.38 225.00 72.01 65.65 100.00 TRINITY_DN0_c0_g2_i1 TRINITY_DN0_c0_g2 2011 1882.38 0.00 0.00 0.00 100.00 TRINITY_DN1000_c0_g1_i1 TRINITY_DN1000_c0_g1 1664 1535.38 208.00 81.57 74.37 100.00 TRINITY_DN1001_c0_g1_i1 TRINITY_DN1001_c0_g1 2948 2819.38 8.00 1.71 1.56 100.00 TRINITY_DN1002_c0_g1_i1 TRINITY_DN1002_c0_g1 351 222.54 1.00 2.71 2.47 100.00 TRINITY_DN1003_c0_g1_i1 TRINITY_DN1003_c0_g1 1913 1784.38 91.00 30.71 28.00 100.00 TRINITY_DN1004_c0_g1_i1 TRINITY_DN1004_c0_g1 2395 2266.38 167.00 44.37 40.45 100.00 TRINITY_DN1005_c0_g1_i1 TRINITY_DN1005_c0_g1 3076 2947.38 425.00 86.82 79.16 100.00 TRINITY_DN1006_c0_g1_i1 TRINITY_DN1006_c0_g1 1141 1012.38 253.00 150.47 137.19 100.00
The key columns in the above RSEM output are the transcript identifier, the 'expected_count' corresponding to the number of RNA-Seq fragments predicted to be derived from that transcript, and the 'TPM' or 'FPKM' columns, which provide normalized expression values for the expression of that transcript in the sample.
The FPKM expression measurement normalizes read counts according to the length of transcripts from which they are derived (as longer transcripts generate more reads at the same expression level), and normalized according to sequencing depth. The FPKM acronym stand for 'fragments per kilobase of cDNA per million fragments mapped'.
TPM 'transcripts per million' is generally the favored metric, as all TPM values should sum to 1 million, and TPM nicely reflects the relative molar concentration of that transcript in the sample. FPKM values, on the other hand, do not always sum to the same value, and do not have the similar property of inherrently representing a proportion within a sample, making comparisons between samples less straightforward. TPM values can be easily computed from FPKM values like so: TPMi = FPKMi / (sum all FPKM values) * 1 million.
Note, a similar RSEM outputted gene expression file exists for the 'gene' level expression data. For example:
head RSEM_name/name.genes.results | column -t
gene_id transcript_id(s) length effective_length expected_count TPM FPKM TRINITY_DN0_c0_g1 TRINITY_DN0_c0_g1_i1 2010.00 1881.38 225.00 72.01 65.65 TRINITY_DN0_c0_g2 TRINITY_DN0_c0_g2_i1 2011.00 1882.38 0.00 0.00 0.00 TRINITY_DN1000_c0_g1 TRINITY_DN1000_c0_g1_i1 1664.00 1535.38 208.00 81.57 74.37 TRINITY_DN1001_c0_g1 TRINITY_DN1001_c0_g1_i1 2948.00 2819.38 8.00 1.71 1.56 TRINITY_DN1002_c0_g1 TRINITY_DN1002_c0_g1_i1 351.00 222.54 1.00 2.71 2.47 TRINITY_DN1003_c0_g1 TRINITY_DN1003_c0_g1_i1 1913.00 1784.38 91.00 30.71 28.00 TRINITY_DN1004_c0_g1 TRINITY_DN1004_c0_g1_i1 2395.00 2266.38 167.00 44.37 40.45 TRINITY_DN1005_c0_g1 TRINITY_DN1005_c0_g1_i1 3076.00 2947.38 425.00