Difference between revisions of "9.3 SNP calling using varscan"

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search
 
(6 intermediate revisions by the same user not shown)
Line 1: Line 1:
First,  you  need  to  sort  the  '''SAM'''  file, and  then generate  a  so-called  '''pileup  file''',  which  describes how the reads align to the reference genome position by position. You will use  '''samtools'''  for  both steps.  Go  into  the  '''bwaMapping'''  folder  and  type  the  following  command (in one line):  
+
First,  you  need  to  convert the  '''SAM'''  file into '''BAM''' file and to sort it. Then,  generate  a  so-called  '''pileup  file''',  which  describes how the reads align to the reference genome position by position. You will use  '''samtools'''  for  these steps steps.  Go  into  the  '''bwaMapping'''  folder  and  type  the  following  command (in one line):  
 
$ samtools sort -T tmp -O bam -o 2cells.sorted.bam 2cells.bwa.sam
 
 
   
 
   
 +
$samtools view -bS 2cells.bwa.sam > 2cells.bwa.bam
 +
$samtools sort 2cells.bwa.bam 2cells.bwa.sorted
 +
$samtools index 2cells.bwa.sorted.bam
 +
 
Now let’s create the '''pileup''':  
 
Now let’s create the '''pileup''':  
 
   
 
   
 
  $ samtools mpileup -f /ibers/ernie/scratch/vpl/zebra_fish/bwaMapping/danRer10.fa 2cells.sorted.bam > 2cells.pileup
 
  $ samtools mpileup -f /ibers/ernie/scratch/vpl/zebra_fish/bwaMapping/danRer10.fa 2cells.sorted.bam > 2cells.pileup
 
 
the  '''–f'''  parameter  specifies  the  path  to  the  genomic  sequence.  Now  open  the  '''2cells.pileup''' file with the '''more''' command (is generally a very large file, so do not  open it with a text editor) and scroll few pages to see its format.   
 
the  '''–f'''  parameter  specifies  the  path  to  the  genomic  sequence.  Now  open  the  '''2cells.pileup''' file with the '''more''' command (is generally a very large file, so do not  open it with a text editor) and scroll few pages to see its format.   
 +
(You can use the commands that we show in the previous chapters in order to submit these commands on the queue system).
 +
 
You  are  now  ready  for  the  identification  of  '''SNP'''.  '''Varscan'''  has  several  parameters  for  setting  the  reliability  of  the  '''SNP'''  identification.  All  '''SNP'''  calling  algorithms  face  the  problem of discriminating between true variations and sequencing errors.  Open a text file called for example '''varscan.sh''', and write the header like this:  
 
You  are  now  ready  for  the  identification  of  '''SNP'''.  '''Varscan'''  has  several  parameters  for  setting  the  reliability  of  the  '''SNP'''  identification.  All  '''SNP'''  calling  algorithms  face  the  problem of discriminating between true variations and sequencing errors.  Open a text file called for example '''varscan.sh''', and write the header like this:  
 
  #$ -S /bin/sh
 
  #$ -S /bin/sh
 
  #$ -cwd
 
  #$ -cwd
 
  #$ -q amd.q,large.q,intel.q
 
  #$ -q amd.q,large.q,intel.q
  #$ -l h_vmem=8G
+
  #$ -l h_vmem=50G
 
  #$ -e snp_calling.e
 
  #$ -e snp_calling.e
 
  #$ -N snp_calling
 
  #$ -N snp_calling
Line 18: Line 21:
 
  cd /ibers/ernie/scratch/vpl/zebra_fish/bwaMapping
 
  cd /ibers/ernie/scratch/vpl/zebra_fish/bwaMapping
 
Now write the '''varscan command''' (all in one line):  
 
Now write the '''varscan command''' (all in one line):  
  java -jar /cm/shared/apps/varscan2/2.3.7/jre--1.7.0_72/varscan2-2.3.7.jar pileup2snp 2cells.pileup --p-value 0.01 --min-coverage 50 --min-reads2 20 --min-avg-qual 20 > 2cells.varscan.out   
+
  java -jar /cm/shared/apps/varscan2/VarScan.v2.3.7.jar pileup2snp 2cells.pileup \
 +
--p-value 0.01 --min-coverage 50 --min-reads2 20 --min-avg-qual 20 > 2cells.varscan.out   
 
the parameters are:
 
the parameters are:
 
   
 
   
Line 30: Line 34:
 
   
 
   
 
Now sort the output file by increasing pAvalues:  
 
Now sort the output file by increasing pAvalues:  
 
 
  $ sort -t$'\t' -g -k 12 2cells.varscan.out > 2cells.varscan.sorted.out  
 
  $ sort -t$'\t' -g -k 12 2cells.varscan.out > 2cells.varscan.sorted.out  
Open  the  sorted  file.  The  first  two  columns  indicate  the  genomic  coordinate  of  the  '''SNP''',  the third is the nucleotide in the reference genome in that position, and the fourth is the  nucleotide  found  there  in  the  aligned  reads.  Copy  the  coordinate  of  the  first  (or  of  any)  SNP,  go  to  the  '''UCSC  Genome  Browser'''  and  point  it  to  the  '''SNP'''  coordinate  (remember  to  set danRer10 as  genome  assembly).  In  which  gene  the  '''SNP'''  falls  into?  Once you have the '''SNPs''' list, you can try to assess the potential effect of the variation, for  example a '''SNP''' changing an amino acid into another with different characteristics can be  deleterious.
+
Open  the  sorted  file.  The  first  two  columns  indicate  the  genomic  coordinate  of  the  '''SNP''',  the third is the nucleotide in the reference genome in that position, and the fourth is the  nucleotide  found  there  in  the  aligned  reads.  Copy  the  coordinate  of  the  first  (or  of  any)  SNP,  go  to  the  '''UCSC  Genome  Browser'''  and  point  it  to  the  '''SNP'''  coordinate  (remember  to  set '''danRer10''' as  genome  assembly).  In  which  gene  the  '''SNP'''  falls  into?  Once you have the '''SNPs''' list, you can try to assess the potential effect of the variation, for  example a '''SNP''' changing an amino acid into another with different characteristics can be  deleterious.

Latest revision as of 14:40, 25 February 2016

First, you need to convert the SAM file into BAM file and to sort it. Then, generate a so-called pileup file, which describes how the reads align to the reference genome position by position. You will use samtools for these steps steps. Go into the bwaMapping folder and type the following command (in one line):

$samtools view -bS 2cells.bwa.sam > 2cells.bwa.bam
$samtools sort 2cells.bwa.bam 2cells.bwa.sorted
$samtools index 2cells.bwa.sorted.bam

Now let’s create the pileup:

$ samtools mpileup -f /ibers/ernie/scratch/vpl/zebra_fish/bwaMapping/danRer10.fa 2cells.sorted.bam > 2cells.pileup

the –f parameter specifies the path to the genomic sequence. Now open the 2cells.pileup file with the more command (is generally a very large file, so do not open it with a text editor) and scroll few pages to see its format. (You can use the commands that we show in the previous chapters in order to submit these commands on the queue system).

You are now ready for the identification of SNP. Varscan has several parameters for setting the reliability of the SNP identification. All SNP calling algorithms face the problem of discriminating between true variations and sequencing errors. Open a text file called for example varscan.sh, and write the header like this:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=50G
#$ -e snp_calling.e
#$ -N snp_calling
#$ -o snp_calling.o
cd /ibers/ernie/scratch/vpl/zebra_fish/bwaMapping

Now write the varscan command (all in one line):

java -jar /cm/shared/apps/varscan2/VarScan.v2.3.7.jar pileup2snp 2cells.pileup \
--p-value 0.01 --min-coverage 50 --min-reads2 20 --min-avg-qual 20 > 2cells.varscan.out  

the parameters are:

--p-value: p-value threshold for calling a SNP

--min-coverage: minimum number of reads covering a genomic position

--min-reads2: minimum number of reads supporting the variant

--min-avg-qual: minimum average quality at the position

Now sort the output file by increasing pAvalues:

$ sort -t$'\t' -g -k 12 2cells.varscan.out > 2cells.varscan.sorted.out 

Open the sorted file. The first two columns indicate the genomic coordinate of the SNP, the third is the nucleotide in the reference genome in that position, and the fourth is the nucleotide found there in the aligned reads. Copy the coordinate of the first (or of any) SNP, go to the UCSC Genome Browser and point it to the SNP coordinate (remember to set danRer10 as genome assembly). In which gene the SNP falls into? Once you have the SNPs list, you can try to assess the potential effect of the variation, for example a SNP changing an amino acid into another with different characteristics can be deleterious.