Splitting Multifastas

From IBERS Bioinformatics and HPC Wiki

There are several ways to split a multifasta file, and many scripts, programs and online services are available for the task. Here we present the Perl script "split_multifasta.pl", developed by Joshua Orvis (jorvis @ tigr.org). To download the script: http://iubio.bio.indiana.edu/gmod/genogrid/scripts/split_multifasta.pl

This script allows some interesting options to work with multifasta files:

NAME

   split_multifasta.pl - split a single FASTA file containing multiple
   sequences into separate files.

SYNOPSIS

   USAGE: split_multifasta.pl --input_file=/path/to/some_file.fsa
   --output_dir=/path/to/somedir [ --output_list=/path/to/somefile.list
   --output_subdir_size=1000 --output_subdir_prefix=fasta --seqs_per_file=1
   --compress_output=1 ]
   EXAMPLE: split_multifasta.pl --in snapdmel.aa --output_dir=./ --f=snaa --seqs_per_file=1000

OPTIONS

   --input_file,-i The input multi-fasta file to split.
   --output_dir,-o The directory to which the output files will be written.
   --output_list,-s Write a list file containing the paths of each of the
   regular output files. This may be useful for later scripts that can
   accept a list as input.
   --output_file_prefix,-f If defined, each file created will have this
   string prepended to its name. This is ignored unless writing multiple
   sequences to each output file using the --seqs_per_file option with a
   value greater than 1, else each file created will just be a number.
   --output_subdir_size,-u If defined, this script will create numbered
   subdirectories in the output directory, each containing this many
   sequence files. Once this limit is reached, another subdirectory is
   created.
   --output_subdir_prefix,-p To be used along with --output_subdir_size,
   this allows more control of the names of the subdirectories created.
   Rather than just incrementing numbers (like 10), each subdirectory will
   be named with this prefix (like prefix10).
   --compress_output,-c Output fasta files will be gzipped when written.
   --debug,-d Debug level. Use a large number to turn on verbose debugging.
   --log,-l Log file
   --help,-h This help message

DESCRIPTION

   This script is used to split a single FASTA file containing multiple
   sequences into separate files containing one sequence each.

INPUT

   The input is defined with --input_file and should be a single fasta
   file. File extensions are ignored. When creating this multi-entry FASTA
   file, one should take care to make the first *word* after the > symbol a
   unique value, as it will be used as the file name for that sequence. For
   example:
       >gi53791237 Tragulus javanicus p97bcnt gene for p97Bcnt
       ACAGGAGAAGAGACTGAAGAGACACGTTCAGGAGAAGAGCAAGAGAAGCCTAAAGAAATGCAAGAAGTTA
       AACTCACCAAATCACTTGTTGAAGAAGTCAGGTAACATGACATTCACAAACTTCAAAACTAGTTCTTTAA
       AAAGGAACATCTCTCTTTTAATATGTATGCATTATTAATTTATTTACTCATTGGCGTGGAGGAGGAAATG
       >gi15387669 Corynebacterium callunae pCC1 plasmid
       ATGCATGCTAGTGTGGTGAGTATGAGCACACACATTCATGGGCACCGCCGGGGTGCAGGGGGGCTTGCCC
       CTTGTCCATGCGGGGTGTGGGGCTTGCCCCGCCGATAGAGACCGGCCACCACCATGGCACCCGGTCGCGG
       GGTGATCGGCCACCACCACCGCCCCCGGCCACTCTCCCCCTGTCTAGGCCATATTTCAGGCCGTCCACTG
   Whitespace is ignored within the input file. See the OUTPUT section for more on creation of output files.
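
Because the first word of each header becomes a file name, it can be worth checking an input file for repeated IDs before splitting it. The following is a minimal Python sketch of such a check, not part of split_multifasta.pl; the function name duplicate_ids is hypothetical, and the sketch assumes that header lines begin with > once surrounding whitespace is stripped.

```python
from collections import Counter


def duplicate_ids(fasta_lines):
    """Return the sorted list of IDs that appear in more than one header.

    The ID is taken to be the first word after '>' (text up to the first
    whitespace), matching the definition used for output file names.
    """
    ids = []
    for line in fasta_lines:
        line = line.strip()
        if not line.startswith(">"):
            continue  # sequence line, not a header
        words = line[1:].split()
        if words:  # skip headers that have no ID at all
            ids.append(words[0])
    counts = Counter(ids)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


if __name__ == "__main__":
    example = [
        ">gi53791237 Tragulus javanicus p97bcnt gene for p97Bcnt",
        "ACAGGAGAAGAGACTGAAGAGACACGTTCAGG",
        ">gi15387669 Corynebacterium callunae pCC1 plasmid",
        "ATGCATGCTAGTGTGGTGAGTATGAGCACACA",
    ]
    print(duplicate_ids(example))  # an empty list means every ID is unique
```

To check a real file, the open file handle can be passed directly, since the function only iterates over lines.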

OUTPUT

   The name of each output sequence file is pulled from the FASTA header of
   that sequence. The first *word* after the > symbol will be used as the
   file name, along with the extension .fsa. The word is defined as all the
   text after the > symbol up to the first whitespace.
   If the above example were your input file, two files would be created:
       gi53791237.fsa
       gi15387669.fsa
   Any characters other than a-z A-Z 0-9 . _ - in the ID will be changed
   into an underscore. This only occurs in the file name; the original
   FASTA header within the file will be unmodified.
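
The naming rule described above can be illustrated with a short Python sketch. This is an approximation written from the description, not the script's own code; the function name output_file_name is hypothetical, and the second header in the usage example is invented to show the substitution.

```python
import re


def output_file_name(header):
    """Derive an output file name from a FASTA header line.

    Follows the rule described above: take the first word after '>',
    replace every character other than a-z A-Z 0-9 . _ - with an
    underscore, and append the .fsa extension.
    """
    # The ID is the text after '>' up to the first whitespace.
    seq_id = header.strip().lstrip(">").split()[0]
    # Only the file name is sanitised; the header itself is left alone.
    safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", seq_id)
    return safe_id + ".fsa"


if __name__ == "__main__":
    print(output_file_name(">gi53791237 Tragulus javanicus p97bcnt gene for p97Bcnt"))
    print(output_file_name(">gi|15387669|emb Corynebacterium callunae pCC1 plasmid"))
```

Note that two different IDs can map to the same file name after substitution (for example a|b and a:b), which is one more reason to keep the first word of each header simple and unique.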
   You can pass a path to the optional --output_list to create a text file
   containing the full paths to each of the FASTA files created by this
   script.
   Two other optional arguments, --output_subdir_size and
   --output_subdir_prefix, can be used on input sets that are too large to
   write out to one directory. This depends on the limitations of your file
   system, but you usually don't want 100,000 files written in the same
   directory.
   If you have a FASTA file containing 95000 sequences, and use the
   following options:
       --output_dir=/some/path
       --output_subdir_size=30000
   The following will be created:
       directory              file count
       ---------------------------------
       /some/path/1/          30000
       /some/path/2/          30000
       /some/path/3/          30000
       /some/path/4/           5000
   If you choose to create a list file (and you probably want to), it will
   contain these proper paths.
   You may not want the subdirectories to simply be numbers, as above, so
   you can use the --output_subdir_prefix option. For example:
       --output_dir=/some/path
       --output_subdir_size=30000
       --output_subdir_prefix=fasta
   The following will be created:
       directory              file count
       ---------------------------------
       /some/path/fasta1/     30000
       /some/path/fasta2/     30000
       /some/path/fasta3/     30000
       /some/path/fasta4/      5000
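
The two directory tables above follow from simple arithmetic: subdirectories are filled in order until each holds --output_subdir_size files, and the last one takes the remainder. The Python sketch below reproduces the tables under that assumption; the function name subdir_file_counts is hypothetical and the code is not taken from the script.

```python
def subdir_file_counts(n_files, subdir_size, prefix=""):
    """Return (subdirectory name, file count) pairs.

    Subdirectories are numbered from 1 and optionally carry a prefix,
    as with --output_subdir_prefix.
    """
    if subdir_size < 1:
        raise ValueError("subdir_size must be at least 1")
    counts = []
    index = 1
    remaining = n_files
    while remaining > 0:
        in_this_dir = min(subdir_size, remaining)
        counts.append((f"{prefix}{index}", in_this_dir))
        remaining -= in_this_dir
        index += 1
    return counts


if __name__ == "__main__":
    for name, count in subdir_file_counts(95000, 30000, prefix="fasta"):
        print(f"/some/path/{name}/  {count}")
```

Calling the function without a prefix gives the plain numbered layout of the first table.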
   Finally, you can write multiple sequences to each output file using the
   --seqs_per_file option, which can be used along with
   --output_subdir_size and --output_subdir_prefix. The main difference to
   note is that, if you use --seqs_per_file, the fasta file created will no
   longer be named using values taken from the header, since it will
   contain multiple headers. Instead, the file will simply be named using
   sequential numbers starting at 1 (like 1.fsa). For example:
       --output_dir=/some/path
       --output_subdir_size=3000
       --output_subdir_prefix=fasta
       --seqs_per_file=10
   For the same 95000-sequence input, 9500 files are written (10 sequences
   each), and the following will be created:
       directory              file count
       ---------------------------------
       /some/path/fasta1/     3000
       /some/path/fasta2/     3000
       /some/path/fasta3/     3000
       /some/path/fasta4/      500
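
When --seqs_per_file is combined with the subdirectory options, the file count is worked out first and the files are then spread over the subdirectories. The Python sketch below shows that calculation for the example above; it is an illustration under the assumption that a final, partly filled file still counts as one file, and the function name grouped_layout is hypothetical.

```python
def grouped_layout(n_seqs, seqs_per_file, subdir_size, prefix=""):
    """Return (total file count, [(subdirectory name, file count), ...])."""
    if seqs_per_file < 1 or subdir_size < 1:
        raise ValueError("seqs_per_file and subdir_size must be at least 1")
    # Ceiling division: a last, partly filled file is still a file.
    n_files = (n_seqs + seqs_per_file - 1) // seqs_per_file
    n_dirs = (n_files + subdir_size - 1) // subdir_size
    layout = []
    for d in range(1, n_dirs + 1):
        first = (d - 1) * subdir_size + 1      # first file number in this subdirectory
        last = min(d * subdir_size, n_files)   # last file number in this subdirectory
        layout.append((f"{prefix}{d}", last - first + 1))
    return n_files, layout


if __name__ == "__main__":
    total, layout = grouped_layout(95000, 10, 3000, prefix="fasta")
    print("files written:", total)
    for name, count in layout:
        print(f"/some/path/{name}/  {count}")
```

The sketch only counts files; how the script numbers the files inside each subdirectory is not covered here.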