Difference between revisions of "4. Quality control"

From IBERS Bioinformatics and HPC Wiki

Revision as of 15:40, 25 January 2016

You are now going to use FastQC to verify read quality. First, you need to create a folder in which to store the FastQC output files. Let’s call it Fastqc or any other name you like:

$ mkdir Fastqc

Now, let’s see the FastQC options. In the pico environment, several bioinformatics modules are ready to be loaded. So let’s first load the FastQC module and then see its command line options:

$ module load fastqc
$ fastqc -h

Most options are needed only for particular cases, and you can generally ignore them. Let’s run FastQC on the forward reads fastq file of the 2cells sample:

$ fastqc -o Fastqc -t 2 --extract --nogroup data/2cells_1.fastq

-o: the output folder

-t: the number of threads (FastQC is fast, you don’t need to dedicate lots of resources to run it)

--extract: instruct FastQC to extract the compressed output files

--nogroup: show output data for each position in the read, instead of grouping neighbouring positions.

The last argument is the path to the fastq file. More than one fastq file can be analyzed in one run, but let’s do it one file at a time. Now let’s look at the FastQC output:

$ ls -l Fastqc

You should see two files (a zip archive and an html file) and a folder. The html file can be opened in any browser, and shows plot reports similar to those that you saw during the presentation, with quality distributions, GC content and so on. To look at these plots you must copy the files to your laptop by sftp or scp. Alternatively, the folder contains two text files that you can look at. The summary.txt file reports the outcome of all the performed tests. Let’s look at it:

$ more Fastqc/2cells_1_fastqc/summary.txt

You should see printed on screen something like this:

PASS Basic Statistics 2cells_1.fastq
PASS Per base sequence quality 2cells_1.fastq
PASS Per tile sequence quality 2cells_1.fastq
PASS Per sequence quality scores 2cells_1.fastq
FAIL Per base sequence content 2cells_1.fastq
WARN Per sequence GC content 2cells_1.fastq
PASS Per base N content 2cells_1.fastq
PASS Sequence Length Distribution 2cells_1.fastq
FAIL Sequence Duplication Levels 2cells_1.fastq
PASS Overrepresented sequences 2cells_1.fastq
PASS Adapter Content 2cells_1.fastq
FAIL Kmer Content 2cells_1.fastq
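
When the summary is long, it can help to print only the tests that did not pass. The following is a minimal sketch, assuming that summary.txt is tab-separated with the status in the first column and the test name in the second; the function name non_pass_modules is hypothetical and not part of FastQC.

```shell
# Print the status and name of every test that is not PASS.
# Assumption: summary.txt columns are status, test name, input file (tab-separated).
# non_pass_modules is a hypothetical helper name, not a FastQC command.
non_pass_modules() {
    awk -F'\t' '$1 != "PASS" { print $1 "\t" $2 }' "$1"
}

# Example call, using the path from the run above:
#   non_pass_modules Fastqc/2cells_1_fastqc/summary.txt
```

For the summary shown above, this should list only the WARN and FAIL lines.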

Most tests passed, and we can ignore the failed ones. The file fastqc_data.txt contains a more detailed report, equivalent to the plots in the html output file but in text format. In either case, check whether the Per base sequence quality report contains positions where the mean or median quality drops below 20. Now repeat the same steps for the three remaining fastq files in the data folder:

$ fastqc -o Fastqc -t 2 --extract --nogroup data/2cells_2.fastq
$ fastqc -o Fastqc -t 2 --extract --nogroup data/6h_1.fastq
$ fastqc -o Fastqc -t 2 --extract --nogroup data/6h_2.fastq
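
The commands above differ only in the input file, so they could also be wrapped in a small loop. This is a sketch rather than a required step; the function name run_fastqc_all is hypothetical, and it assumes the fastqc module has already been loaded.

```shell
# Run FastQC once per input file, with the same options used above.
# run_fastqc_all is a hypothetical helper name; it assumes fastqc is on the PATH.
run_fastqc_all() {
    outdir=$1
    shift
    mkdir -p "$outdir"
    for fq in "$@"; do
        # Skip arguments that are not existing files (e.g. an unmatched glob).
        if [ ! -f "$fq" ]; then
            echo "skipping missing file: $fq" >&2
            continue
        fi
        fastqc -o "$outdir" -t 2 --extract --nogroup "$fq"
    done
}

# Example call for the remaining files:
#   run_fastqc_all Fastqc data/2cells_2.fastq data/6h_1.fastq data/6h_2.fastq
```

Running one file per call keeps the behaviour identical to typing the commands by hand.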

At the end of the runs you should see that all four files seem to be of good quality, without any glaring problems. In any case, it is always a good idea to perform some trimming: even if the overall quality is good, individual reads may still have low quality.
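
The check on the Per base sequence quality report (mean or median below 20) can also be done on the text report instead of the plots. The sketch below assumes that fastqc_data.txt is tab-separated, that each section starts with a line beginning with ">>" followed by the section name and ends with ">>END_MODULE", and that the first three columns of this section are base position, mean and median. The function name low_quality_positions is hypothetical.

```shell
# Print position, mean and median for every position whose mean or median
# quality is below a threshold (default 20).
# Assumptions: sections are delimited by ">>Name" and ">>END_MODULE" lines,
# and columns 1-3 are base position, mean, median (tab-separated).
# low_quality_positions is a hypothetical helper name, not a FastQC command.
low_quality_positions() {
    awk -F'\t' -v min="${2:-20}" '
        /^>>Per base sequence quality/ { in_module = 1; next }
        /^>>END_MODULE/                { in_module = 0 }
        in_module && !/^#/ && ($2 < min || $3 < min) { print $1 "\t" $2 "\t" $3 }
    ' "$1"
}

# Example call for one sample:
#   low_quality_positions Fastqc/2cells_1_fastqc/fastqc_data.txt
```

If the function prints nothing, no position fell below the threshold for that file.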