Splitting Multifastas
Latest revision as of 11:41, 10 November 2017
There are several ways to split a multifasta file. The Internet is full of scripts, software and online services for the task. Here we present the Perl script "split_multifasta.pl", developed by Joshua Orvis (jorvis @ tigr.org). To download the script: http://iubio.bio.indiana.edu/gmod/genogrid/scripts/split_multifasta.pl
The script offers several useful options for working with multifasta files:
NAME
split_multifasta.pl - split a single FASTA file containing multiple sequences into separate files.
SYNOPSIS
USAGE: split_multifasta.pl
            --input_file=/path/to/some_file.fsa
            --output_dir=/path/to/somedir
          [ --output_list=/path/to/somefile.list
            --output_subdir_size=1000
            --output_subdir_prefix=fasta
            --seqs_per_file=1
            --compress_output=1
          ]

       split_multifasta.pl --in snapdmel.aa --output_dir=./ --f=snaa --seqs_per_file=1000
OPTIONS
--input_file,-i
    The input multi-fasta file to split.

--output_dir,-o
    The directory to which the output files will be written.

--output_list,-s
    Write a list file containing the paths of each of the regular output files. This may be useful for later scripts that can accept a list as input.

--output_file_prefix,-f
    If defined, each file created will have this string prepended to its name. This is ignored unless writing multiple sequences to each output file using the --seqs_per_file option with a value greater than 1; without a prefix, each such file is named with just a number.

--output_subdir_size,-u
    If defined, this script will create numbered subdirectories in the output directory, each containing this many sequence files. Once this limit is reached, another subdirectory is created.

--output_subdir_prefix,-p
    To be used along with --output_subdir_size, this allows more control of the names of the subdirectories created. Rather than just incrementing numbers (like 10), each subdirectory will be named with this prefix (like prefix10).

--compress_output,-c
    Output fasta files will be gzipped when written.

--debug,-d
    Debug level. Use a large number to turn on verbose debugging.

--log,-l
    Log file

--help,-h
    This help message
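When the script is driven from a pipeline, it can help to assemble the command line programmatically. The following is a minimal Python sketch that builds an argument list from the long options listed above. The function name `build_split_command` and its parameters are hypothetical and not part of split_multifasta.pl; the sketch also assumes the script is executable and on the PATH.

```python
import shlex


def build_split_command(input_file, output_dir, seqs_per_file=None,
                        output_file_prefix=None, output_subdir_size=None,
                        output_subdir_prefix=None, output_list=None,
                        compress_output=False,
                        script="split_multifasta.pl"):
    """Assemble an argument list using only the long options documented above.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    cmd = [script,
           f"--input_file={input_file}",
           f"--output_dir={output_dir}"]
    optional = [
        ("output_list", output_list),
        ("output_file_prefix", output_file_prefix),
        ("output_subdir_size", output_subdir_size),
        ("output_subdir_prefix", output_subdir_prefix),
        ("seqs_per_file", seqs_per_file),
    ]
    # Only pass the options the caller actually set.
    for name, value in optional:
        if value is not None:
            cmd.append(f"--{name}={value}")
    if compress_output:
        cmd.append("--compress_output=1")
    return cmd


# Roughly the second synopsis example, spelled with full option names.
cmd = build_split_command("snapdmel.aa", "./",
                          seqs_per_file=1000, output_file_prefix="snaa")
print(shlex.join(cmd))
```

The resulting list could then be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids shell quoting problems with unusual paths.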
DESCRIPTION
This script is used to split a single FASTA file containing multiple sequences into separate files containing one sequence each.
INPUT
The input is defined with --input_file and should be a single fasta file. File extensions are ignored. When creating this multi-entry FASTA file, one should take care to make the first *word* after the > symbol a unique value, as it will be used as the file name for that sequence. For example:

    >gi53791237 Tragulus javanicus p97bcnt gene for p97Bcnt
    ACAGGAGAAGAGACTGAAGAGACACGTTCAGGAGAAGAGCAAGAGAAGCCTAAAGAAATGCAAGAAGTTA
    AACTCACCAAATCACTTGTTGAAGAAGTCAGGTAACATGACATTCACAAACTTCAAAACTAGTTCTTTAA
    AAAGGAACATCTCTCTTTTAATATGTATGCATTATTAATTTATTTACTCATTGGCGTGGAGGAGGAAATG
    >gi15387669 Corynebacterium callunae pCC1 plasmid
    ATGCATGCTAGTGTGGTGAGTATGAGCACACACATTCATGGGCACCGCCGGGGTGCAGGGGGGCTTGCCC
    CTTGTCCATGCGGGGTGTGGGGCTTGCCCCGCCGATAGAGACCGGCCACCACCATGGCACCCGGTCGCGG
    GGTGATCGGCCACCACCACCGCCCCCGGCCACTCTCCCCCTGTCTAGGCCATATTTCAGGCCGTCCACTG

Whitespace is ignored within the input file. See the OUTPUT section for more on creation of output files.
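Because the first word of each header becomes a file name, it can be worth checking an input file for repeated IDs before splitting it. The following Python sketch shows one way to do that. The helper names (`read_fasta`, `first_word`, `duplicate_ids`) are hypothetical and are not taken from split_multifasta.pl, and the sequences in the example are shortened from the ones above.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Drop any whitespace inside sequence lines.
            chunks.append("".join(line.split()))
    if header is not None:
        yield header, "".join(chunks)


def first_word(header):
    """Text after '>' up to the first whitespace."""
    parts = header.split()
    return parts[0] if parts else ""


def duplicate_ids(lines):
    """Return the sorted list of first words that occur more than once."""
    counts = Counter(first_word(h) for h, _ in read_fasta(lines))
    return sorted(i for i, n in counts.items() if n > 1)


example = """>gi53791237 Tragulus javanicus p97bcnt gene for p97Bcnt
ACAGGAGAAG
AACTCACCAA
>gi15387669 Corynebacterium callunae pCC1 plasmid
ATGCATGCTA
""".splitlines()

# An empty list means every sequence would map to its own file name.
print(duplicate_ids(example))
```

For a real file, the same function can be called on an open file handle, for instance `duplicate_ids(open("some_file.fsa"))`.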
OUTPUT
The name of each output sequence file is pulled from the FASTA header of that sequence. The first *word* after the > symbol will be used as the file name, along with the extension .fsa. The word is defined as all the text after the > symbol up to the first whitespace.

If the above example were your input file, two files would be created:

    gi53791237.fsa
    gi15387669.fsa

Any characters other than a-z A-Z 0-9 . _ - in the ID will be changed into an underscore. This only occurs in the file name; the original FASTA header within the file will be unmodified.

You can pass a path to the optional --output_list to create a text file containing the full paths to each of the FASTA files created by this script.

Two other optional arguments, --output_subdir_size and --output_subdir_prefix, can be used on input sets that are too large to write out to one directory. This depends on the limitations of your file system, but you usually don't want 100,000 files written in the same directory.

If you have a FASTA file containing 95000 sequences, and use the following options:

    --output_dir=/some/path
    --output_subdir_size=30000

The following will be created:

    directory              file count
    ---------------------------------
    /some/path/1/          30000
    /some/path/2/          30000
    /some/path/3/          30000
    /some/path/4/           5000

If you choose to create a list file (and you probably want to), it will contain these proper paths.

You may not want the subdirectories to simply be numbers, as above, so you can use the --output_subdir_prefix option. For example:

    --output_dir=/some/path
    --output_subdir_size=30000
    --output_subdir_prefix=fasta

The following will be created:

    directory              file count
    ---------------------------------
    /some/path/fasta1/     30000
    /some/path/fasta2/     30000
    /some/path/fasta3/     30000
    /some/path/fasta4/      5000

Finally, you can write multiple sequences to each output file using the --seqs_per_file option, which can be used along with --output_subdir_size and --output_subdir_prefix.
The main difference to note is that, if you use --seqs_per_file, the FASTA files created will no longer be named using values taken from the headers, since each file will contain multiple headers. Instead, the files will simply be named using sequential numbers starting at 1 (like 1.fsa). For example, with the same 95000-sequence input:

    --output_dir=/some/path
    --output_subdir_size=3000
    --output_subdir_prefix=fasta
    --seqs_per_file=10

The following will be created (9500 files of 10 sequences each):

    directory              file count
    ---------------------------------
    /some/path/fasta1/     3000
    /some/path/fasta2/     3000
    /some/path/fasta3/     3000
    /some/path/fasta4/      500
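The naming rule and the directory arithmetic described in this section can be reproduced with a few lines of Python, which is handy for predicting what a run will produce. This is only a sketch of the documented behaviour, not the script's actual code; the function names `output_name` and `subdir_layout` are hypothetical.

```python
import re


def output_name(header, extension=".fsa"):
    """File name derived from the first word after '>' in a FASTA header.

    Characters other than a-z A-Z 0-9 . _ - become underscores.
    """
    word = header.lstrip(">").split()[0]
    return re.sub(r"[^A-Za-z0-9._-]", "_", word) + extension


def subdir_layout(n_sequences, subdir_size, seqs_per_file=1, prefix=""):
    """Return (subdirectory name, file count) pairs, numbered from 1."""
    n_files = -(-n_sequences // seqs_per_file)  # ceiling division
    layout = []
    index = 1
    while n_files > 0:
        count = min(subdir_size, n_files)
        layout.append((f"{prefix}{index}", count))
        n_files -= count
        index += 1
    return layout


print(output_name(">gi53791237 Tragulus javanicus p97bcnt gene for p97Bcnt"))
for name, count in subdir_layout(95000, 3000, seqs_per_file=10, prefix="fasta"):
    print(f"/some/path/{name}/", count)
```

Run with the figures from the examples above (95000 sequences, subdirectories of 30000 or 3000 files), `subdir_layout` gives the same per-directory counts as the tables.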