4. Quality control

From IBERS Bioinformatics and HPC Wiki

Latest revision as of 15:26, 27 January 2016

You are now going to use FastQC to verify read quality. First, you need to create a folder in which to store the FastQC output files. Let’s call it Fastqc, or any other name you like:

$ mkdir Fastqc

Now we have to load the FastQC module and display the program’s options by typing the following commands:

$ module load fastqc/0.11.2
$ fastqc -h

Most options are needed only in particular cases, and you can generally ignore them. Let’s run FastQC on the forward reads fastq file of the 2-cells sample. To do that, we will add the following line to the script that we showed in the previous chapter (3. Your environment):

$ fastqc -o Fastqc -t 2 --extract --nogroup data/2cells_1.fastq

So, the script that we will run is:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=40G
#$ -e run_fastqc.e
#$ -N run_fastqc
#$ -o run_fastqc.o
module load fastqc/0.11.2
fastqc -o Fastqc -t 2 --extract --nogroup data/2cells_1.fastq

The parameters that we use:

-o: the output folder

-t: the number of threads (FastQC is fast, so you don’t need to dedicate lots of resources to run it)

--extract: instruct FastQC to extract the compressed output files

--nogroup: show output data for each position in the read, instead of grouping neighbouring positions.

The last argument is the path to the fastq file. More than one fastq file can be analyzed in one run, but let’s do it one file at a time. Now let’s look at the FastQC output:

$ ls -l Fastqc

You should see two files (a zip archive and an html file) and a folder. The html file can be opened in any browser and shows the report plots. To look at these plots, you must copy the files to your laptop with scp. Alternatively, the folder contains two text files that you can look at. The summary.txt file reports the outcome of all the tests performed. Let’s look at it:

$ more Fastqc/2cells_1_fastqc/summary.txt

You should see printed on screen something like this:

PASS Basic Statistics 2cells_1.fastq
PASS Per base sequence quality 2cells_1.fastq
PASS Per tile sequence quality 2cells_1.fastq
PASS Per sequence quality scores 2cells_1.fastq
FAIL Per base sequence content 2cells_1.fastq
WARN Per sequence GC content 2cells_1.fastq
PASS Per base N content 2cells_1.fastq
PASS Sequence Length Distribution 2cells_1.fastq
FAIL Sequence Duplication Levels 2cells_1.fastq
PASS Overrepresented sequences 2cells_1.fastq
PASS Adapter Content 2cells_1.fastq
FAIL Kmer Content 2cells_1.fastq
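
With twelve tests per file, it can be handy to list only the ones that did not pass. The following is a minimal sketch, assuming that the fields in summary.txt are separated by tabs (status, test name, file name); the function name non_passing is made up for this example and is not part of FastQC.

```shell
# List the FastQC tests that did not PASS in a summary.txt file.
# Assumes tab-separated fields: status, test name, input file name.
non_passing() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}

# Example, using the path from the command above (skipped if the file is absent).
if [ -f Fastqc/2cells_1_fastqc/summary.txt ]; then
    non_passing Fastqc/2cells_1_fastqc/summary.txt
fi
```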

Most tests were successful, and we can ignore the failed ones. The file fastqc_data.txt contains a more detailed report, equivalent to the plots in the html output file but in text format. In either case, let’s check whether the Per base sequence quality report contains positions where the mean or median quality drops below 20. Now repeat the same steps for the three remaining fastq files in the data folder:

$ fastqc -o Fastqc -t 2 --extract --nogroup data/2cells_2.fastq
$ fastqc -o Fastqc -t 2 --extract --nogroup data/6h_1.fastq
$ fastqc -o Fastqc -t 2 --extract --nogroup data/6h_2.fastq
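
Rather than typing each command by hand, the lines for the job script can be generated from a list of file names. This is only a sketch: fastqc_commands is a hypothetical helper, and it prints the commands without running them, so that they can be checked and then added to the script.

```shell
# Print one FastQC command line per fastq file given as an argument.
# Nothing is executed here; the output is meant to be copied into the job script.
fastqc_commands() {
    for fq in "$@"; do
        printf 'fastqc -o Fastqc -t 2 --extract --nogroup %s\n' "$fq"
    done
}

# Preview the commands for the three remaining files.
fastqc_commands data/2cells_2.fastq data/6h_1.fastq data/6h_2.fastq
```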

At the end of the runs you should see that all four files seem to be of good quality, without any glaring problems. In any case, it is always a good idea to perform some trimming: even if the overall quality is good, individual reads may still have low quality.
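
The check for positions where the mean or median quality drops below 20 can also be done on the text report instead of by eye. The sketch below assumes that fastqc_data.txt is tab-separated, that the section starts with a line beginning ">>Per base sequence quality" and ends with ">>END_MODULE", and that the mean and median are the second and third columns. The function name low_quality_positions is invented for this example.

```shell
# Print read positions whose mean or median quality is below a threshold.
# Usage: low_quality_positions fastqc_data.txt [threshold]   (default threshold: 20)
low_quality_positions() {
    awk -F '\t' -v min="${2:-20}" '
        /^>>Per base sequence quality/ { in_module = 1; next }  # section start
        /^>>END_MODULE/                { in_module = 0 }         # section end
        # Inside the section, skip the "#Base ..." header and test columns 2 and 3.
        in_module && !/^#/ && ($2 < min || $3 < min) {
            print "position " $1 ": mean " $2 ", median " $3
        }
    ' "$1"
}

# Example: check every report in the Fastqc folder (skipped if none exist).
for f in Fastqc/*_fastqc/fastqc_data.txt; do
    [ -f "$f" ] || continue
    echo "== $f"
    low_quality_positions "$f" 20
done
```

If nothing is printed under a file name, no position in that file fell below the threshold.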