Difference between revisions of "4. Evaluating the assembly"
Line 40: | Line 40: | ||
right_only 867 0.87 | right_only 867 0.87 | ||
Total aligned rnaseq fragments: 99213 | Total aligned rnaseq fragments: 99213 | ||
+ | |||
+ | Generally, in a high quality assembly, you would expect to see at least ~70% and at least ~70% of the reads to exist as proper pairs. If you find a large number of 'improper_pairs', it generally indicates that both paired reads align to the assembly, but they align to different transcript contigs - which is usually the sign of a fractured assembly. In that case, deeper sequencing and assembly of more reads would be expected to lead to major improvements here. |
Latest revision as of 14:50, 6 March 2017
There are several ways to quantitatively as well as qualitatively assess the overall quality of the assembly. Check Trinity wiki [1].
First of all you can check how many contains you have:
grep '>' Trinity.fasta | wc -l
You also can capture some basic statistics about the Trinity assembly:
$TRINITY_HOME/util/TrinityStats.pl Trinity.fasta
You can run this command on fly. It is quite fast and you don't have to submit it as a job :)
Representation of reads:
A high quality transcriptome assembly is expected to have strong representation of the reads input to the assembler. By aligning the RNA-Seq reads back to the transcriptome assembly, we can quantify read representation. In Trinity,there is a script that will align each of the fastq files of the paired read set to the assembly separately, and then link up the pairs. This way, we can count the number of reads that are found as properly paired alignments in addition to those that align to separate contigs (improperly paired), in addition to those cases where only one of the paired reads (the left or the right read of the pair) aligns to a contig. The following script wraps the Bowtie software to do the read alignments:
$TRINITY_HOME/util/bowtie_PE_separate_then_join.pl \ --target Trinity.fasta \ --seqType fq \ --left left.fastq --right right.fastq \ --aligner bowtie -- -p 2 --all --best --strata -m 300
Submit the above command as a job.
The output files will exist in a bowtie_out/ subdirectory. You can look at the list of files generated like so:
ls -ltr bowtie_out/
Among the outputs, you should find a name-sorted bam file (where aligned reads are sorted by their name, instead of by their coordinate) called 'bowtie_out/bowtie_out.nameSorted.bam'. Using this file and another script in the Trinity software suite, you can count the number of reads that correspond to the different alignment categories:
$TRINITY_HOME/util/SAM_nameSorted_to_uniq_count_stats.pl \ bowtie_out/bowtie_out.nameSorted.bam
The output could be like:
#read_type count pct proper_pairs 96611 97.38 improper_pairs 868 0.87 left_only 867 0.87 right_only 867 0.87 Total aligned rnaseq fragments: 99213
Generally, in a high quality assembly, you would expect to see at least ~70% and at least ~70% of the reads to exist as proper pairs. If you find a large number of 'improper_pairs', it generally indicates that both paired reads align to the assembly, but they align to different transcript contigs - which is usually the sign of a fractured assembly. In that case, deeper sequencing and assembly of more reads would be expected to lead to major improvements here.