4. Your data

From IBERS Bioinformatics and HPC Wiki

For this tutorial you will use a newly sequenced scaffold from xxxx organism as if it were a whole genome. It can be found in the repository at the following path:

  /ibers/repository/public/courses/Genome-annotation/

First, create a directory called xxxx (or any other name you like) as a subdirectory of your home directory, to store all data and results. The fasta file is in another folder of the repository called data; copy this folder into the directory you just created. In this tutorial each step is written out explicitly, but by now you should be able to perform most of the steps above and below using Unix commands, so try by yourself before resorting to help. Don’t worry about doing something wrong: you should not be able to do major damage. Below is the list of commands. As before, $ is the shell prompt, press Enter after each command, and everything after # is a comment that the shell ignores, so you don’t need to type it:

 $ pwd  # where are you
 $ ls  # what’s in your home
 $ mkdir xxxx  # create the work folder
 $ cd xxxx  # and move into it
 $ cp -r /ibers/repository/public/courses/Genome-annotation/data .  # copy the data folder into xxxx
 $ ls -l data  # what’s in the folder

Note that the -r option instructs cp to copy entire folders, including everything inside them. In the data folder you will find the following data files:


Let’s look at the fasta file. Fasta files are plain text files that you could open with any text editor (vi, emacs, pico, etc.), but they are generally so large that opening them in an editor tends to be slow and may use a lot of memory. Remember the commands more (to look at a file one page at a time), head (to see the beginning of the file) and tail (to see its end).

In this case let’s look at the beginning of a fasta file with head <filename> (choose any file in the data folder). For example:

$ head data/xxxx.fa

By default, head shows the first ten rows of the file. What types of rows do you see?
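
As a small illustration of the commands mentioned above, the sketch below builds a made-up fasta file and uses head, tail and grep to look at different parts of it. The file name sample.fa and its content are invented for this example and are not part of the course data; the -n option, which sets how many lines head and tail print, is a standard option of both commands.

```shell
# Build a tiny demo fasta file (hypothetical content, not course data)
printf '>demo_scaffold example\nACGTACGTAC\nGGCCTTAAGG\nTTGA\n' > sample.fa

first_line=$(head -n 1 sample.fa)     # first line only
last_line=$(tail -n 1 sample.fa)      # last line only
header_lines=$(grep '^>' sample.fa)   # every line that starts with >

echo "first:  $first_line"
echo "last:   $last_line"
echo "header: $header_lines"
```

The same commands can be pointed at any file in your data folder; on a real scaffold, grep '^>' is a quick way to see how many sequences the file contains without paging through it.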

$ wc -l data/xxxx.fa

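This gives the number of rows in the file. If each line contains 60 bases, how many nucleotides does this scaffold have? Remember that the header line (the one starting with >) contains no bases, and that the last sequence line may be shorter than 60. You can find more details about the fasta format here: [1]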
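
To check an estimate based on the line count, the bases can also be counted directly. The sketch below is one possible way to do this with standard Unix tools: it removes the header lines, strips the newline characters and counts what is left. It works on a tiny made-up file (demo.fa is a hypothetical name, not one of the course files) so that it can be run anywhere; replace it with your own fasta file from the data folder.

```shell
# Build a small demo fasta file (hypothetical content, not course data)
printf '>demo_scaffold\nACGTACGTAC\nGGCCTTAA\n' > demo.fa

# Count all lines, including the header line
total_lines=$(wc -l < demo.fa)

# Count header lines (those starting with >)
headers=$(grep -c '^>' demo.fa)

# Count bases: drop header lines, strip newlines, count remaining characters
bases=$(grep -v '^>' demo.fa | tr -d '\n' | wc -c)

echo "lines: $total_lines, headers: $headers, bases: $bases"
```

Note that this count includes every character on the sequence lines, so any N characters marking unknown bases are counted as well.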

Now create another directory inside your xxxx directory; let’s call it tools. Copy the faSize tool from the course repository into it, then run faSize on the fasta file. Is the number of nucleotides the same as the one you calculated before?

 $ cd ~/xxxx  # make sure you are in the work folder
 $ mkdir tools  # create the tools directory
 $ cp -r /ibers/repository/public/courses/Genome-annotation/tools .  # copy the tools into the tools folder
 $ ls -l tools  # what’s in the folder
 $ ./tools/faSize data/xxxx.fa  # run faSize (assumed to sit directly in tools) on the fasta file