4. Evaluating the assembly

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search

There are several ways to quantitatively as well as qualitatively assess the overall quality of the assembly. Check Trinity wiki [1].

First of all you can check how many contains you have:

grep '>' Trinity.fasta | wc -l

You also can capture some basic statistics about the Trinity assembly:

$TRINITY_HOME/util/TrinityStats.pl Trinity.fasta

You can run this command on fly. It is quite fast and you don't have to submit it as a job :)

Representation of reads:

A high quality transcriptome assembly is expected to have strong representation of the reads input to the assembler. By aligning the RNA-Seq reads back to the transcriptome assembly, we can quantify read representation. In Trinity,there is a script that will align each of the fastq files of the paired read set to the assembly separately, and then link up the pairs. This way, we can count the number of reads that are found as properly paired alignments in addition to those that align to separate contigs (improperly paired), in addition to those cases where only one of the paired reads (the left or the right read of the pair) aligns to a contig. The following script wraps the Bowtie software to do the read alignments:

$TRINITY_HOME/util/bowtie_PE_separate_then_join.pl \
     --target Trinity.fasta \
     --seqType fq \
     --left left.fastq --right right.fastq \
     --aligner bowtie -- -p 2 --all --best --strata -m 300

Submit the above command as a job.

The output files will exist in a bowtie_out/ subdirectory. You can look at the list of files generated like so:

ls -ltr bowtie_out/

Among the outputs, you should find a name-sorted bam file (where aligned reads are sorted by their name, instead of by their coordinate) called 'bowtie_out/bowtie_out.nameSorted.bam'. Using this file and another script in the Trinity software suite, you can count the number of reads that correspond to the different alignment categories:

$TRINITY_HOME/util/SAM_nameSorted_to_uniq_count_stats.pl \
      bowtie_out/bowtie_out.nameSorted.bam

The output could be like:

#read_type  count   pct
proper_pairs    96611   97.38
improper_pairs  868 0.87
left_only   867 0.87
right_only  867 0.87
Total aligned rnaseq fragments: 99213

Generally, in a high quality assembly, you would expect to see at least ~70% and at least ~70% of the reads to exist as proper pairs. If you find a large number of 'improper_pairs', it generally indicates that both paired reads align to the assembly, but they align to different transcript contigs - which is usually the sign of a fractured assembly. In that case, deeper sequencing and assembly of more reads would be expected to lead to major improvements here.