Difference between revisions of "9.3 SNP calling using varscan"
(Created page with "First, you need to sort the '''SAM''' file, and then generate a so-called '''pileup file''', which describes how the reads align to the reference genome positi...") |
|||
Line 15: | Line 15: | ||
#PBS -W group_list="IscrC_RepiTOP" | #PBS -W group_list="IscrC_RepiTOP" | ||
cd /pico/home/userexternal/fferre00/zebrafish/bwaMapping | cd /pico/home/userexternal/fferre00/zebrafish/bwaMapping | ||
− | + | Now write the varscan command (all in one line): | |
+ | java -jar /cm/shared/apps/varscan2/2.3.7/jre--1.7.0_72/varscan2-2.3.7.jar pileup2snp 2cells.pileup --p-value 0.01 --min-coverage 50 --min-reads2 20 --min-avg-qual 20 > 2cells.varscan.out | ||
+ | |||
+ | the parameters are: | ||
− | + | '''--p-value''': p-value threshold for calling a SNP | |
− | + | ||
− | + | '''--min-coverage''': minimum number of reads covering a genomic position | |
− | + | ||
+ | '''--min-reads2''': minimum number of reads supporting the variant | ||
+ | |||
+ | '''--min-avg-qual''': minimum average quality at the position | ||
Now sort the output file by increasing pAvalues: | Now sort the output file by increasing pAvalues: | ||
− | $ sort -t$'\t' -g -k 12 2cells.varscan.out > 2cells.varscan.sorted.out | + | $ sort -t$'\t' -g -k 12 2cells.varscan.out > 2cells.varscan.sorted.out |
− | Open the sorted file. The first two columns indicate the genomic coordinate of the SNP, the third is the nucleotide in the reference genome in that position, and the fourth is the nucleotide found there in the aligned reads. Copy the coordinate of the first (or of any) SNP, go to the UCSC Genome Browser and point it to the SNP coordinate (remember to set | + | Open the sorted file. The first two columns indicate the genomic coordinate of the '''SNP''', the third is the nucleotide in the reference genome in that position, and the fourth is the nucleotide found there in the aligned reads. Copy the coordinate of the first (or of any) SNP, go to the '''UCSC Genome Browser''' and point it to the '''SNP''' coordinate (remember to set danRer10 as genome assembly). In which gene the '''SNP''' falls into? Once you have the '''SNPs''' list, you can try to assess the potential effect of the variation, for example a '''SNP''' changing an amino acid into another with different characteristics can be deleterious. |
Revision as of 16:34, 20 February 2016
First, you need to sort the SAM file, and then generate a so-called pileup file, which describes how the reads align to the reference genome position by position. You will use samtools for both steps. Go into the bwaMapping folder and type the following command (in one line):
$ samtools sort -T tmp -O bam -o 2cells.sorted.bam 2cells.bwa.sam
Now let’s create the pileup:
$ samtools mpileup -f /ibers/ernie/scratch/vpl/zebra_fish/bwaMapping/danRer10.fa 2cells.sorted.bam > 2cells.pileup
the –f parameter specifies the path to the genomic sequence. Now open the 2cells.pileup file with the more command (is generally a very large file, so do not open it with a text editor) and scroll few pages to see its format. You are now ready for the identification of SNP. Varscan has several parameters for setting the reliability of the SNP identification. All SNP calling algorithms face the problem of discriminating between true variations and sequencing errors. Open a text file called for example varscan.sh, and write the header like this:
- /bin/bash
- PBS -l walltime=4:00:00
- PBS -l select=1:ncpus=4:mpiprocs=1:mem=8GB
- PBS -A IscrC_RepiTOP
- PBS -W group_list="IscrC_RepiTOP"
cd /pico/home/userexternal/fferre00/zebrafish/bwaMapping Now write the varscan command (all in one line):
java -jar /cm/shared/apps/varscan2/2.3.7/jre--1.7.0_72/varscan2-2.3.7.jar pileup2snp 2cells.pileup --p-value 0.01 --min-coverage 50 --min-reads2 20 --min-avg-qual 20 > 2cells.varscan.out
the parameters are:
--p-value: p-value threshold for calling a SNP
--min-coverage: minimum number of reads covering a genomic position
--min-reads2: minimum number of reads supporting the variant
--min-avg-qual: minimum average quality at the position
Now sort the output file by increasing pAvalues:
$ sort -t$'\t' -g -k 12 2cells.varscan.out > 2cells.varscan.sorted.out
Open the sorted file. The first two columns indicate the genomic coordinate of the SNP, the third is the nucleotide in the reference genome in that position, and the fourth is the nucleotide found there in the aligned reads. Copy the coordinate of the first (or of any) SNP, go to the UCSC Genome Browser and point it to the SNP coordinate (remember to set danRer10 as genome assembly). In which gene the SNP falls into? Once you have the SNPs list, you can try to assess the potential effect of the variation, for example a SNP changing an amino acid into another with different characteristics can be deleterious.