4. Quality control

From IBERS Bioinformatics and HPC Wiki

You are now going to use FastQC to check read quality. First, create a folder in which to store the FastQC output files. Let’s call it Fastqc, or any other name you like:

$ mkdir Fastqc

Now load the FastQC module and list the options of the program by typing the following commands:

$ module load fastqc/0.11.2
$ fastqc -h

Most options are needed only in particular cases, and you can generally ignore them. Let’s run FastQC on the forward-reads fastq file of the 2-cells sample. To do that, we will add the following line to the script shown in the previous chapter (3. Your environment).

$ fastqc -o Fastqc -t 2 --extract --nogroup data/2cells_1.fastq

So, the script that we will run is:

#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=40G
#$ -e run_fastqc.e
#$ -N run_fastqc
#$ -o run_fastqc.o
module load fastqc/0.11.2
fastqc -o Fastqc -t 2 --extract --nogroup data/2cells_1.fastq

The parameters that we use:

-o: the output folder

-t: the number of threads (FastQC is fast, you don’t need to dedicate lots of resources to run it)

--extract: instruct FastQC to extract the compressed output files

--nogroup: show output data for each position in the read, instead of grouping neighbouring positions.

The last argument is the path to the fastq file. More than one fastq file can be analyzed in one run, but let’s do it one file at a time. Now let’s look at the FastQC output:

$ ls -l Fastqc

You should see two files (a zip archive and an html file) and a folder. The html file can be opened in any browser and shows the report plots. To look at these plots, you must copy the file to your laptop with scp. Alternatively, the folder contains two text files that you can look at. The summary.txt file reports the outcome of all the tests performed. Let’s look at it:

$ more Fastqc/2cells_1_fastqc/summary.txt

You should see something like this printed on screen:

PASS Basic Statistics 2cells_1.fastq
PASS Per base sequence quality 2cells_1.fastq
PASS Per tile sequence quality 2cells_1.fastq
PASS Per sequence quality scores 2cells_1.fastq
FAIL Per base sequence content 2cells_1.fastq
WARN Per sequence GC content 2cells_1.fastq
PASS Per base N content 2cells_1.fastq
PASS Sequence Length Distribution 2cells_1.fastq
FAIL Sequence Duplication Levels 2cells_1.fastq
PASS Overrepresented sequences 2cells_1.fastq
PASS Adapter Content 2cells_1.fastq
FAIL Kmer Content 2cells_1.fastq
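When there are several reports to check, it can be quicker to print only the tests that did not pass. The following is a minimal sketch, assuming summary.txt is tab-separated with the status in the first column, the test name in the second and the input file name in the third. The function name non_pass is hypothetical and is not part of FastQC.

```shell
# List every FastQC test whose status is not PASS.
# Assumes summary.txt is tab-separated: status, test name, input file name.
# Usage: non_pass Fastqc/2cells_1_fastqc/summary.txt [more summary files...]
non_pass() {
    # Print file name, status and test name for each WARN or FAIL line
    awk -F '\t' '$1 != "PASS" { printf "%s\t%s\t%s\n", $3, $1, $2 }' "$@"
}
```

For a report like the one above, this would print only the FAIL and WARN lines.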

Most tests were successful, and we can ignore the failed ones. The file fastqc_data.txt contains a more detailed report, the equivalent of the plots in the html output file but in text format. In either case, check whether the Per base sequence quality report contains positions where the mean or median quality drops below 20. Now repeat the same steps for the three remaining fastq files in the data folder:

$ fastqc -o Fastqc -t 2 --extract --nogroup data/2cells_2.fastq
$ fastqc -o Fastqc -t 2 --extract --nogroup data/6h_1.fastq
$ fastqc -o Fastqc -t 2 --extract --nogroup data/6h_2.fastq
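Instead of writing one line per file, the same commands could be wrapped in a small loop inside the job script. This is only a sketch: the function name run_fastqc_each and the FASTQC variable are hypothetical helpers introduced here for illustration, and the options are the ones used above.

```shell
# Run FastQC on each fastq file given as an argument, one file at a time.
# FASTQC is a hypothetical override for the program name (useful for a dry
# run); by default the fastqc command provided by the loaded module is used.
run_fastqc_each() {
    for fq in "$@"; do
        # Stop at the first file for which FastQC reports an error
        "${FASTQC:-fastqc}" -o Fastqc -t 2 --extract --nogroup "$fq" || return 1
    done
}

# Usage, in the job script after "module load fastqc/0.11.2":
#   run_fastqc_each data/2cells_2.fastq data/6h_1.fastq data/6h_2.fastq
```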

At the end of the runs you should see that all four files seem to be of good quality, without any glaring problems. In any case, it is always a good idea to perform some trimming: even if the overall quality is good, individual reads may still have low quality.
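The check suggested earlier, whether the mean or median quality drops below 20 at any position, can also be done on the text report rather than on the plots. The sketch below assumes the usual layout of fastqc_data.txt: each module starts with a line beginning with ">>" followed by the module name, ends with ">>END_MODULE", and the Per base sequence quality module has tab-separated columns with the position first, the mean second and the median third. The function name low_quality_positions is hypothetical.

```shell
# Print the read positions whose mean or median quality is below a threshold
# (default 20) in the "Per base sequence quality" module of fastqc_data.txt.
# Assumed columns (tab-separated): Base, Mean, Median, ...
# Usage: low_quality_positions Fastqc/2cells_1_fastqc/fastqc_data.txt [threshold]
low_quality_positions() {
    awk -F '\t' -v min="${2:-20}" '
        /^>>Per base sequence quality/ { in_module = 1; next }  # module starts
        /^>>END_MODULE/                { in_module = 0 }        # module ends
        # Inside the module, skip the "#" header line and test each position
        in_module && !/^#/ && ($2 < min || $3 < min) {
            printf "position %s: mean %s, median %s\n", $1, $2, $3
        }
    ' "$1"
}
```

If the function prints nothing, no position in that file falls below the threshold.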