5. Identification and masking of repeat elements

From IBERS Bioinformatics and HPC Wiki
Revision as of 17:24, 19 March 2016 by Vpl (talk | contribs)

Repeat identification

The first step in genome annotation is usually the identification and masking of repeats. The term "repeat" covers several types of sequence: low-complexity sequences such as homopolymeric runs of nucleotides, transposable elements, viral sequences, long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs).
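As a small illustration of what a homopolymeric run looks like, the sketch below searches a toy sequence with grep. The sequence and the threshold of 8 bases are made up for this example; this is not how RepeatModeler itself detects repeats.

```shell
# Toy sequence (invented for illustration) containing a long run of A's.
seq='ACGTTGCAAAAAAAAAAGCTAGC'

# Report every run of 8 or more identical bases.
# The threshold of 8 is an arbitrary choice for this sketch.
runs=$(printf '%s\n' "$seq" | grep -oE 'A{8,}|C{8,}|G{8,}|T{8,}')
printf '%s\n' "$runs"
```

Only the stretch of A's is printed; the mixed bases on either side are not low complexity and are ignored.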

Masking a genome consists of two steps: 1) building the repeat database and 2) masking the genome using that database.

To construct the repeat database we use RepeatModeler, a de novo repeat family identification and modeling package.

First, load the RepeatModeler module and display the program's options by typing the following commands:

 $ module load repeatmodeler/1.0.7
 $ RepeatModeler -h

The database is then built with a job script like the following, where the lines starting with #$ are directives for the job scheduler:

 #$ -S /bin/sh
 #$ -cwd
 #$ -q amd.q,large.q,intel.q
 #$ -l h_vmem=20G
 #$ -e RepeatmodDB.e
 #$ -o RepeatmodDB.o
 #$ -N RMDatabase
 
 module load repeatmodeler/1.0.7
 
 BuildDatabase -name Lp_v1_database -engine ncbi /ibers/ernie/scratch/seb19/Lperenne_V1/Lp_v1.fa
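If the same database build is needed for more than one genome, the job script can be generated rather than edited by hand. The following is a minimal sketch: the function name write_builddb_job and the file name builddb_job.sh are hypothetical, while the scheduler directives and the BuildDatabase command are copied from the script above.

```shell
# Hypothetical helper: write a job script that builds a RepeatModeler database.
#   $1 = database name, $2 = genome FASTA file, $3 = job script to write
write_builddb_job() {
    db_name=$1
    genome_fasta=$2
    job_file=$3
    {
        # Quoted heredoc: the scheduler directives are written out verbatim.
        cat <<'EOF'
#$ -S /bin/sh
#$ -cwd
#$ -q amd.q,large.q,intel.q
#$ -l h_vmem=20G
#$ -e RepeatmodDB.e
#$ -o RepeatmodDB.o
#$ -N RMDatabase

module load repeatmodeler/1.0.7

EOF
        # Only this line changes from one genome to the next.
        printf 'BuildDatabase -name %s -engine ncbi %s\n' "$db_name" "$genome_fasta"
    } > "$job_file"
}

# Write the script for the genome used on this page and show the result.
write_builddb_job Lp_v1_database /ibers/ernie/scratch/seb19/Lperenne_V1/Lp_v1.fa builddb_job.sh
cat builddb_job.sh
```

The generated file could then be submitted with qsub builddb_job.sh, assuming the cluster runs Grid Engine as the #$ directives suggest.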