Difference between revisions of "3. Your data"

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search
Line 15: Line 15:
 
*'''2cells_1.fastq''' and '''2cells_2.fastq''': these files are derived from paired-end  RNA-seq data of a 2-cell zebrafish embryo, the first file containing the forward reads and  the other the reverse reads;   
 
*'''2cells_1.fastq''' and '''2cells_2.fastq''': these files are derived from paired-end  RNA-seq data of a 2-cell zebrafish embryo, the first file containing the forward reads and  the other the reverse reads;   
 
* '''6h_1.fastq''' and '''6h_2.fastq''' : these files are derived from paired-end RNA-seq  data  of  zebrafish  embryos  6  hours  post-fertilization,  the  first  file  containing  the  forward  reads and the other the reverse reads.  
 
* '''6h_1.fastq''' and '''6h_2.fastq''' : these files are derived from paired-end RNA-seq  data  of  zebrafish  embryos  6  hours  post-fertilization,  the  first  file  containing  the  forward  reads and the other the reverse reads.  
  Let’s look at a fastq file. These are text files that you could open with any text editor ('''vi''',  '''emacs''',  '''pico''',  etc.),  but  these  files  are  generally  so  large  that  trying  to  open  them  will  cause lots of problems. Remember the commands '''more''' (to look at a file one page at the  time), '''head''' (to see the file beginning) and '''tail''' (to see its end).   
+
Let’s look at a fastq file. These are text files that you could open with any text editor ('''vi''',  '''emacs''',  '''pico''',  etc.),  but  these  files  are  generally  so  large  that  trying  to  open  them  will  cause lots of problems. Remember the commands '''more''' (to look at a file one page at the  time), '''head''' (to see the file beginning) and '''tail''' (to see its end).   
 
   
 
   
 
In this case let’s look at the beginning of a fastq file with  <span style="colour:#41 69 E1"> '''head <filename>'''  </span> (choose  
 
In this case let’s look at the beginning of a fastq file with  <span style="colour:#41 69 E1"> '''head <filename>'''  </span> (choose  
 
any file in the data folder).  For example:  
 
any file in the data folder).  For example:  
 
   
 
   
$ head data/2cells_1.fastq
+
$ head data/2cells_1.fastq
 
   
 
   
  4
+
By  default,  '''head''' shows  the  first  ten  rows  of  the  file.  How  many  complete  reads  are  you  looking  at?  Remember  that  in  the  fastq  file  each  read  is  represented  by  four  rows  (read  name, read sequence, spacer, phred scores).  The file in the example contains the forward  reads; let’s look at the corresponding reverse reads:  
By  default,  head  shows  the  first  ten  rows  of  the  file.  How  many  complete  reads  are  you  looking  at?  Remember  that  in  the  fastq  file  each  read  is  represented  by  four  rows  (read  name, read sequence, spacer, phred scores).  The file in the example contains the forward  reads; let’s look at the corresponding reverse reads:  
 
 
   
 
   
$ head data/2cells_2.fastq
+
$ head data/2cells_2.fastq
 
   
 
   
 
Check  the  name  of  the  first  read  in  the  forward  file  and  of  its  mate  in  the  corresponding  reverse  file.  In  the  two  files,  paired  reads  occupy  the  same  position  within  the  files  (e.g.  the mate of the first read in the forward file is the first also in the reverse file, and so on).  Hence, the two files must contain the same number of reads. How can you quickly known  how many reads are in the fastq file? Run the following commands:  
 
Check  the  name  of  the  first  read  in  the  forward  file  and  of  its  mate  in  the  corresponding  reverse  file.  In  the  two  files,  paired  reads  occupy  the  same  position  within  the  files  (e.g.  the mate of the first read in the forward file is the first also in the reverse file, and so on).  Hence, the two files must contain the same number of reads. How can you quickly known  how many reads are in the fastq file? Run the following commands:  
 
   
 
   
$ wc –l data/2cells_1.fastq
+
$ wc –l data/2cells_1.fastq
$ wc –l data/2cells_2.fastq
+
$ wc –l data/2cells_2.fastq
 
   
 
   
 
You  will  get  the  number  of  rows  in  the  two  files,  which  must  be  identical.  Dividing  this  number by four gives you the number of reads. If the number of rows in the forward and  reverse  files  is  not  the  same,  and  these  numbers  are  not  multiples  of  four,  then  something is very wrong.   
 
You  will  get  the  number  of  rows  in  the  two  files,  which  must  be  identical.  Dividing  this  number by four gives you the number of reads. If the number of rows in the forward and  reverse  files  is  not  the  same,  and  these  numbers  are  not  multiples  of  four,  then  something is very wrong.   
Line 37: Line 36:
 
http://en.wikipedia.org/wiki/FASTQ_format#Encoding
 
http://en.wikipedia.org/wiki/FASTQ_format#Encoding
 
   
 
   
Knowing  the  encoding  of  the  quality  scores  is  required  for  some  steps  of  the  RNAAseq pipelines.  Some  tools  can  try  to  guess  the  encoding,  while  others  require  the  user  to  specify  it.  The  Wikipedia  page  provides  lots  of  information  on  how  the  encodings  changed over the years.  
+
Knowing  the  encoding  of  the  quality  scores  is  required  for  some  steps  of  the  RNA-seq pipelines.  Some  tools  can  try  to  guess  the  encoding,  while  others  require  the  user  to  specify  it.  The  Wikipedia  page  provides  lots  of  information  on  how  the  encodings  changed over the years.  
 
You are now ready for checking how good your reads are.
 
You are now ready for checking how good your reads are.

Revision as of 14:07, 25 January 2016

You will use a dataset derived from sequencing of mRNA from zebrafish (Danio rerio) embryos in two different developmental stages. Sequencing was performed on the Illumina platform and generated 76bp pairedAend sequence data using polyAAselected RNA. Due to the time constraints of the practical you will only use a subset of the reads, but the steps that you are going to follow are the same whether you are using the whole data or only a subset. All the data can be found on the repository at the following path:

  /ibers/repository/public/courses/Rna-Seq/

(Original data can be found here: http://www.ebi.ac.uk/ena/data/view/ERR022484 and http://www.ebi.ac.uk/ena/data/view/ERR022485)

You need first to create a directory called zebrafish (or whatever other name you like) as a subdirectory of your home, to store all data and results. The sequencing data are in another folder in repository called data. You need to copy this folder into the folder you just created. In this tutorial, each step will be written explicitly, but you should be able by now to perform most of the above and the following steps using Unix commands, so try by yourself before resorting to help. Don’t worry about doing something wrong, you should not be able to do major damage. Below is the list of commands; as before, $ is the shell prompt, press enter after each command, and everything after the # is a comment ignored by the shell, and you don’t need to type it:

 $ pwd                                                                           # where are you
 $ ls                                                                               # what’s in your home
 $ mkdir zebrafish                                                        # create the work folder
 $ cd zebrafish                                                             # and move into it
 $ cp –r  /hove/vasilis/repository/.../tutorial/data .     # copy the data in zebrafish folder
 $ ls –l data                                                                  # what’s in the folder

Note that the –r parameter for the cp command instructs cp to copy entire folders. In the data folder you will find the following data files:

  • 2cells_1.fastq and 2cells_2.fastq: these files are derived from paired-end RNA-seq data of a 2-cell zebrafish embryo, the first file containing the forward reads and the other the reverse reads;
  • 6h_1.fastq and 6h_2.fastq : these files are derived from paired-end RNA-seq data of zebrafish embryos 6 hours post-fertilization, the first file containing the forward reads and the other the reverse reads.

Let’s look at a fastq file. These are text files that you could open with any text editor (vi, emacs, pico, etc.), but these files are generally so large that trying to open them will cause lots of problems. Remember the commands more (to look at a file one page at the time), head (to see the file beginning) and tail (to see its end).

In this case let’s look at the beginning of a fastq file with head <filename> (choose any file in the data folder). For example:

$ head data/2cells_1.fastq

By default, head shows the first ten rows of the file. How many complete reads are you looking at? Remember that in the fastq file each read is represented by four rows (read name, read sequence, spacer, phred scores). The file in the example contains the forward reads; let’s look at the corresponding reverse reads:

$ head data/2cells_2.fastq

Check the name of the first read in the forward file and of its mate in the corresponding reverse file. In the two files, paired reads occupy the same position within the files (e.g. the mate of the first read in the forward file is the first also in the reverse file, and so on). Hence, the two files must contain the same number of reads. How can you quickly known how many reads are in the fastq file? Run the following commands:

$ wc –l data/2cells_1.fastq
$ wc –l data/2cells_2.fastq

You will get the number of rows in the two files, which must be identical. Dividing this number by four gives you the number of reads. If the number of rows in the forward and reverse files is not the same, and these numbers are not multiples of four, then something is very wrong. Can you tell which quality encoding our fastq formatted reads are in? Look at the first few reads of the file 2cells_1.fastq using the head command, and then compare the quality strings with the table found at:

http://en.wikipedia.org/wiki/FASTQ_format#Encoding

Knowing the encoding of the quality scores is required for some steps of the RNA-seq pipelines. Some tools can try to guess the encoding, while others require the user to specify it. The Wikipedia page provides lots of information on how the encodings changed over the years. You are now ready for checking how good your reads are.