OrthoMCL

From IBERS Bioinformatics and HPC Wiki
Jump to: navigation, search

Using orthoMCL

An important consideration before you start: Orthomcl doesn't run in the server, but in a Virtual machine (VM). I had to request an account in the VM with a password so that I could access mysql there. Contact the server administrator for help to set this up.

This VM is almost bare of software, but it has mysql. It doesn't even have emacs. So if you need to modify scripts there, you can either use vi (if you are familiar with it) or change them somewhere else and then scp or upload these files into your account in the VM.

Also, create a `~/bin/` directory in your home directory in the VM:

 mkdir ~/bin/

Go into the new bin directory and download the orthomcl software (comes in tar.gz), un zip and untar it. You will need the commands in the directory : "~/bin/orthomclSoftware-v2.0.9/bin/" in my case.

You can actually run the first part of the analysis in the server (using the "module load orthomcl"). But once you get to the to stage where you need to upload into a database, you'll have to run those commands calling your ~/bin/orthomcl/bin/" path.

orthomcl Tutorial:

In the server...

First you need to sort the files in the fasta format so that they will be correctly input by the program (orthomclAdjustfasta). Basically just make sure the names will be unique after choosing the unique identifier from the sequencer name.

Follow the tutorial for orthomcl (which explains how it parses or separates the fields for the sequence names.

Then to run the orthomcl scripts I followed the tutorial here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3196566/

and this other user-friendly (copy/paste commands) tutorial here: https://www.biostars.org/p/199083/

Once input sequences for the analysis are in the right format required (1 fasta file per organism or taxa), create 2 directories:

 mkdir my_orthomcl_dir;
 cd my_orthomcl_dir;
 mkdir compliantFasta;

"cp" your original fasta files in the compliantFasta directory. Note: file names don't need to end up in .fasta (but they must be in fasta format).

cd compliantFasta/

Run the "orthomclAdjustFasta". Special note: XXXX is the unique species tag, whatever you want to name that species. Then you give it the compliant fasta file, and finally you specify the number of field that you want to use from the sequence name that will become your unique sequence tag (it has to be unique, so make sure there are no duplicates!!). otherwise you can play around changing the names of the sequences to make it unique. The orthomclAdjustfasta script will accept as field separators either "|" or " " (space). In the example below, it will take the first field before either "|" or " ".

 module load orthomcl;
 orthomclAdjustFasta XXXX my_protein_seqs.fasta 1

Actually, since I had a lot of files, I used a script file, where I listed all the commands per file. Just execute it when it's ready:

 module load orthomcl;
 orthomclAdjustFasta atig my_protein_seqs_sp1.fasta 1
 orthomclAdjustFasta cyno my_protein_seqs_sp2.fasta 1
 orthomclAdjustFasta hyno my_protein_seqs_sp3.fasta 4
 orthomclAdjustFasta newt my_protein_seqs_sp4.fasta 1
 orthomclAdjustFasta anol my_protein_seqs_sp5.fasta 1

Once you obtain the output file (your fasta files), you'll have to move the original ones and the links to them (if you have any) to a different directory. Just keep in the compliantFasta directory

Then do the filtering step:

 orthomclFilterFasta compliantFasta 10 20

all-vs-all BLAST

The previous step produces the goodProteins.fasta file. Use this file to create your BLAST database (in my case, I used Diamond instead of BLAST). Although orthomcl calls the normal Blast program. In such case, just use the command:

 module load blast;
 makeblastdb -in goodProteins.fasta -dbtype prot -out goodprotdb

Then carry out the all vs all blast. Please note this step can take days depending on the number of sequences to Blast. This should be submitted through a job in the queue. The command in the submission script should be:

 blastall -p blastp -F 'm S' -v 100000 -b 100000 -z 55 -e 1e-5 -d goodprotdb -i goodProteins.fasta -o all-all.blastp -m 8


Alternatively, if you want to carry out the Blast with Diamond:

 module load diamond;
 diamond makedb --in goodProteins.fasta -p 10 -d goodprotdb;

Then carry out the all vs all blast. Please note this step can take days depending on the number of sequences to Blast. This should be submitted through a job in the queue. A suggested command in the submission script would be:

 diamond blastp -q goodProteins.fasta --min-score 60 --db goodprotdb.dmnd -a all-all.blastp.daa -t /ibers/ernie/scratch/my_user/tmp/ --threads 6 --max-target-seqs 1000000;

The "min-score", "threads" and "max-target-seqs" values can be modified to meet the users' preferences.

Once you have your .daa file, you can get your tabular format file with the following command:

 diamond view -a all-all.blastp.daa -o all-all.blastp;




BLAST shortcut

In my case, to save time as I was in a hurry, I partitioned the query sequences into species (so 35 fasta files in total) so that it would run faster. I separated them from the "goodProteins.fasta" file. To do so I had a script convert_fasta_to_1l_fasta.sh that changes sequences files that are in multiple fasta lines (the way they download from ncbi) into a 1 long line fasta format.

 sh convert_fasta_to_1l_fasta.sh goodProteins.fasta
 emacs grep_separate.sh;

Inside emacs copy and paste the following snippet of code:

 grep -A1 ">$1|" goodProteins.1l.fasta > $1.fas

Exit emacs. Now run it, giving it the list of 3 or 4 characters that were used to tag the species. In our case, from this example the "list_of_shortspecies" file would look like this:

 atig
 cyno
 hyno
 newt
 anol
 cat list_of_shortspecies | xargs -I % sh grep_separate.sh %; 

Then I created submission scripts for diamond for all and each of my files with a mkdiamond.sh command. The resulting submission scripts looked like this:

 #$ -e /my_path_to_my_files/my_orthomcl_dir/atig.error
 #$ -o /my_path_to_my_files/my_orthomcl_dir/atig.output
 #$ -N atig
 #$ -cwd
 #$ -j y
 #$ -S /bin/bash
 #$ -m beas
 #$ -M user@email.com
 #$ -V
 #$ -l h_vmem=20G
 #$ -l h_rt=24:00:00
 cd /my_path_to_my_files/my_orthomcl_dir
 module load diamond
 diamond blastp --db goodprotdb.dmnd -q atig.fas -a atig_all_all.daa -t /ibers/ernie/scratch/my_user/tmp --threads 10 --max-target-seqs 1000000

Finally I submitted all 35 jobs and all run across multiple nodes. Job finished in 1 night.

Then I transformed my *.daa files into *.m8 files tabular format:

 module load diamond;
 for i in *.daa; do diamond view -a $i -o ${i}.m8; done;

Then I concatenate them all together with

 cat *.m8 > all-all.blastp 




Now run orthomclBlastParser into the resulting huge all-all.blastp file: Since in my case it would take several hours, I ran it in a submission script (also making sure I got a log file or .output file to make sure there were no errors).

 module load orthomcl;
 orthomclBlastParser all-all.blastp compliantFasta > similarSequences.txt

Now you have your similarSequences.txt file. You have to go to the VM... This resulting file was huge (44G). I had to make sure my VM would cope with it. So I had to request that the VM could have access to x5 my file size (so at least 220G of hard drive).

In the Virtual Machine...

ssh into your account in the VM. Upload / scp your similarSequences.txt file into either the my_orthomcl_dir or whichever hard disk you have the capacity to run it. In my case I had to cp it into /mnt/storage/orthomcl_kas/, which was a temporary HD installed for me to run this.

cp the orthomcl.config file from the template available in the `~/bin/orthomclSoftware-v2.0.9/doc/OrthoMCLEngine/Main/` directory. Sometimes it could have problems. I had errors running for the loadBlast command, so I had to change it, you can see my orthomcl.config file that finally ran with no errors.

Next, clean and create your orthomcl database using the following commands: This has cleaned from memory from whatever Martin had done before and it frees memory as well.

 mysql -u (the-orthomcl-database-user) -p
 (insert password here)
 DROP DATABASE orthomcldb;
 CREATE DATABASE orthomcldb;
 quit;

Then create the tables (schema) inside the database:

 ~/bin/orthomclSoftware-v2.0.9/bin/orthomclInstallSchema ~/my_orthomcl_dir/orthomcl.config ~/my_orthomcl_dir/install_schema.log

When it finishes, check that the log file has no errors. Ok now I can proceed to loading the Blast into the database. Note this step takes a few hours and there is no way to know if it crashed, so you have to check it every so often. I suggest running in "screen".

 screen -S load;
 ~/bin/orthomclSoftware-v2.0.9/bin/orthomclLoadBlast ~/my_orthomcl_dir/orthomcl.config similarSequences.txt

exit screen (ctrl+ A then D). Once it finishes (it will give you a prompt for next command); then you can do next step also in a different screen:

 screen -S Pairs;
 ~/bin/orthomclSoftware-v2.0.9/bin/orthomclPairs ~/my_orthomcl_dir/orthomcl.config orthomcl_pairs.log cleanup=no

exit screen.

Create output tables: (I ran it in nohup to keep a log file in case of errors:

 nohup ~/bin/orthomclSoftware-v2.0.9/bin/orthomclDumpPairsFiles ~/my_orthomcl_dir/orthomcl.config 2>dump.err &

The database step is to match most similar things in order to find the more finer scale orthology / paralogies. Hence we end up with files called "orthologs.txt", "coorthologs.txt" and "inparalogs.txt", which are just files with pairs of proteins with a score for each pair in a third column. The program creates a summary of all these ("mclInput") which can be used in a subsequent step of clustering with an mcl algorithm. The MCL step I ran in the server, so I "scp" the resulting files back to my account in the server.

Now I can quit the database VM. Also, you can let the administrator know in case they need to free up space, etc. Since this file took a lot of space from the disk, someone else who wants to run something there might have to clean the database.

Back to the server...

MCL clustering step:

Make sure you have the MCL software installed or available in your directory. In my case, I installed mine in my "~/bin" . I ran this step in a submission script. It took about 30 mins and for my size of file it used 2G of memory.

 ~/bin/mcl mclInput --abc -I 1.5 -o mclOutput

Then just parse the output into the groups.txt: (takes 2 mins)

 ~/bin/orthomclSoftware-v2.0.9/bin/orthomclMclToGroups cluster- 1000 < mclOutput > groups.txt

And finally obtain the singleton files (so everything in your goodProteins.fasta file that was not included in the clusters in "groups.txt"):

 ~/bin/orthomclSoftware-v2.0.9/bin/orthomclSingletons goodProteins.fasta groups.txt >> singletons.txt

And you are done! do a final count of the final clusters obtained yielded for your dataset:

 cut -d":" -f1 groups.txt | wc -l;

The number you obtained should match what was written in the log file (in my case called "mcl_ortho.output") for the "mcl" command you ran hopefully in the cluster...

 grep "clusters found" mcl_ortho.output