Difference between revisions of "8. De novo assembly with Velvet"

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search
 
(5 intermediate revisions by the same user not shown)
Line 3: Line 3:
  
 
'''Velvet''' is composed by two modules, '''velveth''' and '''velvetg''', that need to be run one  after the other. The first module analyzes the reads, decomposes them in sub-sequences  of  fixed  length  k  (called  '''k-mers''')  and  builds  an  '''index'''  of  all  k-mers.  The  second  module  takes  as  input  the  output  of  '''velveth'''  and  builds  structures  in  which  reads  overlap,  called '''contigs''', corresponding to '''transcripts''' when the input contains RNA-seq reads.  Let’s  open  a  script  for  running  '''velvet''',  and  write  the  header,  load  the  modules  and  point to the working directory:
 
'''Velvet''' is composed by two modules, '''velveth''' and '''velvetg''', that need to be run one  after the other. The first module analyzes the reads, decomposes them in sub-sequences  of  fixed  length  k  (called  '''k-mers''')  and  builds  an  '''index'''  of  all  k-mers.  The  second  module  takes  as  input  the  output  of  '''velveth'''  and  builds  structures  in  which  reads  overlap,  called '''contigs''', corresponding to '''transcripts''' when the input contains RNA-seq reads.  Let’s  open  a  script  for  running  '''velvet''',  and  write  the  header,  load  the  modules  and  point to the working directory:
 +
 +
#$ -S /bin/sh
 +
#$ -cwd
 +
#$ -q amd.q,large.q,intel.q
 +
#$ -l h_vmem=16G
 +
#$ -e velveth.e
 +
#$ -N velveth
 +
#$ -o velveth.o
 +
module load velvet/1.2.10
 +
cd /ibers/ernie/scratch/vpl/zebra_fish
 +
 +
The general usage of the velveth module is:
 +
 +
Usage:
 +
./velveth directory hash_length {[-file_format][-read_type][- separate|-interleaved] filename1 [filename2 ...]} {...} [options]
 +
 +
where '''directory''' is the name of the output folder and '''hash_length''' is the size of the  '''k-mers'''  in  which  reads  are  decomposed  (must  be  an  even  number).  Now  let’s  write  the  velveth  command  to  assemble  reads  from  the 6 hours  zebrafish  embryo (all  in  one  line):
 +
 +
velveth velvet_out 29 -fastq data/6h_1.trim.fastq data/6h_2.trim.fastq
 +
The parameters are:
 +
 +
'''velvet_out''': the output folder
 +
 +
'''29''': the value of k
 +
 +
-'''fastq''': the input format
 +
 +
The second module, '''velvetg''', has many options, but can also simply run just specifying  the  folder  with  the  '''velveth'''  output.  Let’s  now  add  the  '''velvetg'''  command  in  the  file  that you are writing:
 +
velvetg velvet_out
 +
Now  save  and  close  the  file  and  run  with  '''qsub'''.  When  it  is  finished,  have  a  look  at  the  output  folder  (using  for  example  '''ls –l velvet_out''').  Some  files  (e.g.  Roadmaps,  Sequences)  report  the  reads  decomposition  and  indexing.  The  file  with  the  reconstructed transcript is '''contigs.txt'''. Open it, and you will see the sequence of all  transcripts in fasta format. 
 +
 +
Let’s  now  see  if  '''velvet'''  was  able  to  correctly  assemble  zebrafish  transcripts.  Open contigs.fa,  choose  one  transcript  sufficiently  long  and  copy  it.  Now  open  your
 +
browser and go to: [http://blast.ncbi.nlm.nih.gov/Blast.cgi]
 +
 +
Click on nucleotide blast on the left size of the screen. Paste the chosen sequence  from  the  velvet  output  into  the  text  box,  select  '''Nucleotide collection''' (nr/nt)  from '''Choose Search Set''',  then  click  the  '''BLAST'''  button  and  wait  until  completion.  Verify  that  the  best  match  in  the  BLAST  output  is  a  zebrafish  gene.  If  you  want,  try  again  with  other  reconstructed  transcripts,  trying  some  large  ones  and  some  small ones. Do all of them have a good match with zebrafish genes?

Latest revision as of 16:30, 22 February 2016

Velvet is an assembly algorithm developed for genomic assembly, but it can be applied to transcriptome assembly as well. A more recent algorithm from the same authors, called Oases, was developed specifically for un-guided transcriptome reconstruction, but for our purposes the original velvet is enough. Other de novo transcriptome reconstruction algorithms exist, for example Trinity, which is generally more accurate but has very large computational requirements, especially for memory. Trinity offers also modules for gene expression and differential expression estimation, while velvet is limited to transcript reconstruction. We will make a similar usage guide for Trinity in another session.

Velvet is composed by two modules, velveth and velvetg, that need to be run one after the other. The first module analyzes the reads, decomposes them in sub-sequences of fixed length k (called k-mers) and builds an index of all k-mers. The second module takes as input the output of velveth and builds structures in which reads overlap, called contigs, corresponding to transcripts when the input contains RNA-seq reads. Let’s open a script for running velvet, and write the header, load the modules and point to the working directory:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=16G
#$ -e velveth.e
#$ -N velveth
#$ -o velveth.o
module load velvet/1.2.10
cd /ibers/ernie/scratch/vpl/zebra_fish

The general usage of the velveth module is:

Usage:

./velveth directory hash_length {[-file_format][-read_type][- separate|-interleaved] filename1 [filename2 ...]} {...} [options]

where directory is the name of the output folder and hash_length is the size of the k-mers in which reads are decomposed (must be an even number). Now let’s write the velveth command to assemble reads from the 6 hours zebrafish embryo (all in one line):

velveth velvet_out 29 -fastq data/6h_1.trim.fastq data/6h_2.trim.fastq

The parameters are:

velvet_out: the output folder

29: the value of k

-fastq: the input format

The second module, velvetg, has many options, but can also simply run just specifying the folder with the velveth output. Let’s now add the velvetg command in the file that you are writing:

velvetg velvet_out

Now save and close the file and run with qsub. When it is finished, have a look at the output folder (using for example ls –l velvet_out). Some files (e.g. Roadmaps, Sequences) report the reads decomposition and indexing. The file with the reconstructed transcript is contigs.txt. Open it, and you will see the sequence of all transcripts in fasta format.

Let’s now see if velvet was able to correctly assemble zebrafish transcripts. Open contigs.fa, choose one transcript sufficiently long and copy it. Now open your browser and go to: [1]

Click on nucleotide blast on the left size of the screen. Paste the chosen sequence from the velvet output into the text box, select Nucleotide collection (nr/nt) from Choose Search Set, then click the BLAST button and wait until completion. Verify that the best match in the BLAST output is a zebrafish gene. If you want, try again with other reconstructed transcripts, trying some large ones and some small ones. Do all of them have a good match with zebrafish genes?