7. Evaluating the assembly with ExN50

From IBERS Bioinformatics and HPC Wiki

This section describes an alternative statistic, the ExN50 value, which is arguably more useful than the plain contig N50 for assessing the quality of a transcriptome assembly. The ExN50 is the N50 contig statistic restricted to the most highly expressed transcripts, namely those that together account for x% of the total normalized expression. Compute it as follows:

$TRINITY_HOME/util/misc/contig_ExN50_statistic.pl Trinity_trans.TMM.EXPR.matrix \
       Trinity.fasta > ExN50.stats
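
To make the definition concrete, here is a minimal Python sketch of the idea behind ExN50. It is not the code of contig_ExN50_statistic.pl: the function names n50 and exn50 and the toy data are hypothetical, and each transcript is assumed to have a single expression value (how the Trinity script collapses the per-sample columns of the matrix is not reproduced here).

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


def exn50(transcripts, x):
    """Illustrative ExN50 for percentage x (0 < x <= 100).

    transcripts: list of (length, expression) pairs (assumed input format).
    Returns (min_expr, ExN50, num_transcripts) for the most highly
    expressed transcripts that account for x% of the total expression.
    """
    # Rank transcripts from most to least expressed.
    ranked = sorted(transcripts, key=lambda t: t[1], reverse=True)
    target = sum(expr for _, expr in ranked) * x / 100
    selected = []
    cumulative = 0.0
    for length, expr in ranked:
        selected.append((length, expr))
        cumulative += expr
        if cumulative >= target:
            break
    lengths = [length for length, _ in selected]
    return selected[-1][1], n50(lengths), len(selected)


# Toy data (made up for illustration): (length, expression)
toy = [(290, 50.0), (1200, 30.0), (2500, 15.0), (400, 4.0), (150, 1.0)]
print(exn50(toy, 50))   # (50.0, 290, 1)
print(exn50(toy, 90))   # (15.0, 2500, 3)
```

In the toy data a single short, very highly expressed transcript dominates the low Ex values, which is the same pattern seen in the first rows of a typical ExN50.stats file.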

View the contents of the output file, ExN50.stats:

cat ExN50.stats | column -t

A sample file:

#E    min_expr    E-N50  num_transcripts
E3    320852.974  290    1
E5    20156.591   290    2
E6    20156.591   290    3
E7    20156.591   415    4
E8    20156.591   427    5
E9    14609.172   610    6
E10   9892.739    801    7
...
E79   151.033     1716   1030
E80   149.749     1757   1107
E81   139.449     1780   1189
E82   133.932     1801   1278
E83   118.854     1819   1375
E84   101.459     1848   1481
E85   93.910      1860   1596
E86   87.649      1897   1722
E87   80.252      1920   1860
E88   72.408      1939   2011
E89   65.075      1984   2178
E90   57.569      2008   2361
E91   51.728      2022   2565
E92   47.303      2043   2794
E93   41.027      2091   3053
E94   35.334      2132   3350
E95   30.830      2166   3695
E96   25.734      2220   4107
E97   20.764      2245   4613
E98   14.500      2234   5273
E99   12.416      2152   6181
E100  0.037       2066   7704

The N50 based on all expressed transcript contigs (E100 in the sample above) is 2066, but the E-N50 peaks at E97 with a value of 2245. The peak ExN50 can vary considerably between assemblies and is often a useful indicator of assembly quality.
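
The peak can also be located programmatically rather than by eye. The following Python sketch assumes the four-column, whitespace-separated layout shown in the sample (other Trinity versions may write different columns); the function names read_exn50_stats and peak_exn50 are hypothetical, and the embedded rows are copied from the sample above.

```python
def read_exn50_stats(lines):
    """Parse ExN50.stats-style lines into (label, min_expr, e_n50, count).

    Skips the '#'-prefixed header and any line that does not have
    exactly four whitespace-separated fields.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or line.startswith("#"):
            continue
        label, min_expr, e_n50, count = fields
        rows.append((label, float(min_expr), int(e_n50), int(count)))
    return rows


def peak_exn50(rows):
    """Return the row with the largest E-N50 (first one on ties)."""
    return max(rows, key=lambda row: row[2])


# Last rows of the sample file shown above.
sample = """\
#E    min_expr    E-N50  num_transcripts
E96   25.734      2220   4107
E97   20.764      2245   4613
E98   14.500      2234   5273
E99   12.416      2152   6181
E100  0.037       2066   7704
"""

rows = read_exn50_stats(sample.splitlines())
label, min_expr, peak_value, count = peak_exn50(rows)
print(f"peak ExN50 = {peak_value} at {label} ({count} transcripts)")
# prints: peak ExN50 = 2245 at E97 (4613 transcripts)
```

To run this on a real file, pass an open file handle instead of the sample text, for example read_exn50_stats(open("ExN50.stats")).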

Plotting the ExN50 statistics:

$TRINITY_HOME/util/misc/plot_ExN50_statistic.Rscript ExN50.stats
xpdf ExN50.stats.plot.pdf

Examples of ExN50 plots based on assemblies varying the number of input reads are available here: [1]