4. Evaluating the assembly

From IBERS Bioinformatics and HPC Wiki
Revision as of 14:49, 6 March 2017 by Val1 (talk | contribs)
Jump to: navigation, search

There are several ways to quantitatively as well as qualitatively assess the overall quality of the assembly. Check Trinity wiki [1].

First of all you can check how many contains you have:

grep '>' Trinity.fasta | wc -l

You also can capture some basic statistics about the Trinity assembly:

$TRINITY_HOME/util/TrinityStats.pl Trinity.fasta

You can run this command on fly. It is quite fast and you don't have to submit it as a job :)

Representation of reads:

A high quality transcriptome assembly is expected to have strong representation of the reads input to the assembler. By aligning the RNA-Seq reads back to the transcriptome assembly, we can quantify read representation. In Trinity,there is a script that will align each of the fastq files of the paired read set to the assembly separately, and then link up the pairs. This way, we can count the number of reads that are found as properly paired alignments in addition to those that align to separate contigs (improperly paired), in addition to those cases where only one of the paired reads (the left or the right read of the pair) aligns to a contig. The following script wraps the Bowtie software to do the read alignments:

$TRINITY_HOME/util/bowtie_PE_separate_then_join.pl \
     --target Trinity.fasta \
     --seqType fq \
     --left left.fastq --right right.fastq \
     --aligner bowtie -- -p 2 --all --best --strata -m 300

Submit the above command as a job.

The output files will exist in a bowtie_out/ subdirectory. You can look at the list of files generated like so:

ls -ltr bowtie_out/

Among the outputs, you should find a name-sorted bam file (where aligned reads are sorted by their name, instead of by their coordinate) called 'bowtie_out/bowtie_out.nameSorted.bam'. Using this file and another script in the Trinity software suite, you can count the number of reads that correspond to the different alignment categories:

$TRINITY_HOME/util/SAM_nameSorted_to_uniq_count_stats.pl \
      bowtie_out/bowtie_out.nameSorted.bam

The output could be like:

#read_type  count   pct
proper_pairs    96611   97.38
improper_pairs  868 0.87
left_only   867 0.87
right_only  867 0.87
Total aligned rnaseq fragments: 99213