4. Quality control
Latest revision as of 15:26, 27 January 2016
You are now going to use FastQC to verify read quality. First, you need to create a folder in which to store the FastQC output files. Let’s call it Fastqc, or any other name you like:
$ mkdir Fastqc
Now, we have to load the FastQC module and see the options of the program by typing the following commands:
$ module load fastqc/0.11.2
$ fastqc -h
Most options are needed only for particular cases, and you can generally ignore them. Let’s run FastQC on the forward reads fastq file of the 2-cells sample. To do that, we will add the following line to the script that we showed in the previous chapter (3. Your environment).
$ fastqc -o Fastqc -t 2 --extract --nogroup data/2cells_1.fastq
So, the script that we will run is:
#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=40G
#$ -e run_fastqc.e
#$ -N run_fastqc
#$ -o run_fastqc.o
module load fastqc/0.11.2
fastqc -o Fastqc -t 2 --extract --nogroup data/2cells_1.fastq
The parameters that we use are:
-o: the output folder
-t: the number of threads (FastQC is fast, you don’t need to dedicate lots of resources to run it)
--extract: instruct FastQC to extract the compressed output files
--nogroup: show output data for each position in the read, instead of grouping neighbouring positions.
The last argument is the path to the fastq file. More than one fastq file can be analyzed in one run, but let’s do it one file at a time. Now let’s look at the FastQC output:
$ ls -l Fastqc
You should see two files (a zip archive and an html file) and a folder. The html file can be opened in any browser and shows the report plots. To look at these plots, you must copy the files to your laptop with scp. Alternatively, the folder contains two text files that you can look at. The summary.txt file reports the outcome of all the tests performed. Let’s look at it:
$ more Fastqc/2cells_1_fastqc/summary.txt
You should see printed on screen something like this:
PASS Basic Statistics 2cells_1.fastq
PASS Per base sequence quality 2cells_1.fastq
PASS Per tile sequence quality 2cells_1.fastq
PASS Per sequence quality scores 2cells_1.fastq
FAIL Per base sequence content 2cells_1.fastq
WARN Per sequence GC content 2cells_1.fastq
PASS Per base N content 2cells_1.fastq
PASS Sequence Length Distribution 2cells_1.fastq
FAIL Sequence Duplication Levels 2cells_1.fastq
PASS Overrepresented sequences 2cells_1.fastq
PASS Adapter Content 2cells_1.fastq
FAIL Kmer Content 2cells_1.fastq
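When the summary gets long, it can help to list only the tests that did not pass. The following is a minimal sketch, assuming that summary.txt is tab-separated (status, test name, file name); the function name non_pass_tests is a hypothetical helper, not part of FastQC.

```shell
# List the FastQC tests whose status is not PASS (i.e. WARN or FAIL).
# Assumption: summary.txt has three tab-separated columns:
# status, test name, file name.
non_pass_tests() {
    # $1: path to a summary.txt file
    awk -F '\t' '$1 != "PASS"' "$1"
}

# Usage (folder created by the --extract option):
# non_pass_tests Fastqc/2cells_1_fastqc/summary.txt
```

For the summary shown above, this would keep only the WARN and FAIL lines.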
Most tests were successful, and we can ignore the failed ones. The file fastqc_data.txt contains a more detailed report, equivalent to the plots in the html output file but in text format. In either case, let’s check whether the Per base sequence quality report contains positions where the mean or median quality drops below 20.
Now repeat the same steps for the three remaining fastq files in the data folder:
$ fastqc -o Fastqc -t 2 --extract --nogroup data/2cells_2.fastq
$ fastqc -o Fastqc -t 2 --extract --nogroup data/6h_1.fastq
$ fastqc -o Fastqc -t 2 --extract --nogroup data/6h_2.fastq
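Instead of typing one command per file, the same call can be wrapped in a loop. This is only a sketch: run_fastqc_all is a hypothetical helper name, and it assumes that all reads are stored as .fastq files in a single folder. Note that it processes every .fastq file it finds, so it would also run 2cells_1.fastq again.

```shell
# Run FastQC on every .fastq file in a folder, one file at a time,
# with the same options used in the commands above.
run_fastqc_all() {
    # $1: folder containing the .fastq files
    # $2: output folder (created if it does not exist yet)
    mkdir -p "$2"
    for fq in "$1"/*.fastq; do
        [ -e "$fq" ] || continue   # skip if no file matches the pattern
        fastqc -o "$2" -t 2 --extract --nogroup "$fq"
    done
}

# Usage in the job script, after "module load fastqc/0.11.2":
# run_fastqc_all data Fastqc
```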
At the end of the runs, you should see that all four files appear to be of good quality, without any glaring problems. In any case, it is always a good idea to perform some trimming: even if the overall quality is good, individual reads may still have low quality.
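The check on the Per base sequence quality report mentioned earlier (positions where the mean or median quality drops below 20) can also be done on the text report. The sketch below assumes the usual layout of fastqc_data.txt: each report starts with a line beginning with ">>" followed by its name, ends with ">>END_MODULE", and the Per base sequence quality rows are tab-separated with the position, mean and median in the first three columns. The function name low_quality_positions is hypothetical.

```shell
# Print the positions whose mean or median quality is below a threshold.
# Assumption: inside the "Per base sequence quality" report, the rows are
# tab-separated with columns: Base, Mean, Median, ...
low_quality_positions() {
    # $1: path to fastqc_data.txt
    # $2: quality threshold (optional, default 20)
    awk -F '\t' -v min="${2:-20}" '
        /^>>Per base sequence quality/ { in_module = 1; next }
        /^>>END_MODULE/                { in_module = 0 }
        # Skip the "#Base ..." header line; compare the values as numbers.
        in_module && !/^#/ && ($2 + 0 < min || $3 + 0 < min) {
            print $1 "\t" $2 "\t" $3
        }
    ' "$1"
}

# Usage:
# low_quality_positions Fastqc/2cells_1_fastqc/fastqc_data.txt
# low_quality_positions Fastqc/6h_1_fastqc/fastqc_data.txt 25
```

If the function prints nothing, no position falls below the threshold. Otherwise each output line gives the position, the mean and the median.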