OrthoMCL

From IBERS Bioinformatics and HPC Wiki

Latest revision as of 17:30, 3 October 2016

Using orthoMCL

An important consideration before you start: OrthoMCL doesn't run on the server, but in a virtual machine (VM). I had to request an account on the VM with a password so that I could access mysql there. Contact the server administrator for help setting this up.

This VM is almost bare of software, but it has mysql. It doesn't even have emacs, so if you need to modify scripts there, you can either use vi (if you are familiar with it) or edit them somewhere else and then scp or upload the files into your account on the VM.

Also, create a `~/bin/` directory in your home directory in the VM:

 mkdir ~/bin/

Go into the new bin directory and download the orthomcl software (it comes as a tar.gz), then unzip and untar it. You will need the commands in the directory "~/bin/orthomclSoftware-v2.0.9/bin/" in my case.

You can actually run the first part of the analysis on the server (using "module load orthomcl"). But once you get to the stage where you need to upload into a database, you'll have to run those commands calling your "~/bin/orthomcl/bin/" path.

orthomcl Tutorial:

In the server...

First you need to sort out the files in fasta format so that they will be read correctly by the program (orthomclAdjustFasta). Basically, just make sure the names will be unique after choosing the unique identifier from the sequence name.
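A quick way to sanity-check uniqueness before running orthomclAdjustFasta is to extract the candidate field from every header and look for duplicates. The fasta below is fabricated purely for illustration; only the extraction logic matters:

```shell
# Fabricated example fasta (headers are hypothetical)
cat > example.fasta <<'EOF'
>abc123|some description
MKVL
>def456|another description
MTTA
>abc123|a duplicated identifier
MGGS
EOF

# Extract field 1 of each header, splitting on '|' or space (the separators
# orthomclAdjustFasta accepts), and print any identifier that appears twice
grep '^>' example.fasta | sed 's/^>//' | awk -F'[| ]' '{print $1}' | sort | uniq -d
```

If this prints anything, rename those sequences (or pick a different field) before proceeding.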

Follow the tutorial for orthomcl (which explains how it parses or separates the fields of the sequence names).

Then to run the orthomcl scripts I followed the tutorial here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3196566/

and this other user-friendly (copy/paste commands) tutorial here: https://www.biostars.org/p/199083/

Once input sequences for the analysis are in the right format required (1 fasta file per organism or taxa), create 2 directories:

 mkdir my_orthomcl_dir;
 cd my_orthomcl_dir;
 mkdir compliantFasta;

"cp" your original fasta files into the compliantFasta directory. Note: file names don't need to end in .fasta (but they must be in fasta format).

cd compliantFasta/

Run "orthomclAdjustFasta". Special note: XXXX is the unique species tag, whatever you want to name that species. Then you give it the compliant fasta file, and finally you specify the number of the field from the sequence name that will become your unique sequence tag (it has to be unique, so make sure there are no duplicates!!). Otherwise you can play around with the names of the sequences to make them unique. The orthomclAdjustFasta script accepts as field separators either "|" or " " (space). In the example below, it will take the first field before either "|" or " ".

 module load orthomcl;
 orthomclAdjustFasta XXXX my_protein_seqs.fasta 1

Actually, since I had a lot of files, I used a script file, where I listed all the commands per file. Just execute it when it's ready:

 module load orthomcl;
 orthomclAdjustFasta atig my_protein_seqs_sp1.fasta 1
 orthomclAdjustFasta cyno my_protein_seqs_sp2.fasta 1
 orthomclAdjustFasta hyno my_protein_seqs_sp3.fasta 4
 orthomclAdjustFasta newt my_protein_seqs_sp4.fasta 1
 orthomclAdjustFasta anol my_protein_seqs_sp5.fasta 1
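Since the field number differs per species, one way to keep such a script tidy is to drive the calls from a small table. A sketch (with echo as a stand-in so it can be tried without orthomcl loaded; swap in the real orthomclAdjustFasta call once the module is available):

```shell
# Stand-in driver: tag, fasta file and field index per species.
# Replace 'echo' with the real orthomclAdjustFasta once orthomcl is loaded.
while read -r tag fasta field; do
  echo "orthomclAdjustFasta $tag $fasta $field"
done <<'EOF' > adjust_commands.txt
atig my_protein_seqs_sp1.fasta 1
cyno my_protein_seqs_sp2.fasta 1
hyno my_protein_seqs_sp3.fasta 4
EOF
cat adjust_commands.txt
```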

Once you obtain the output files (your adjusted fasta files), you'll have to move the original ones, and any links to them, to a different directory. Keep only the compliant fasta files in the compliantFasta directory.

Then do the filtering step (the arguments are the minimum protein length, 10 aa, and the maximum percentage of stop codons, 20%):

 orthomclFilterFasta compliantFasta 10 20

all-vs-all BLAST

The previous step produces the goodProteins.fasta file. Use this file to create your BLAST database (in my case, I used Diamond instead of BLAST; orthomcl itself assumes the normal Blast program). If you use regular Blast, the commands are:

 module load blast;
 makeblastdb -in goodProteins.fasta -dbtype prot -out goodprotdb

Then carry out the all-vs-all blast. Please note this step can take days depending on the number of sequences to Blast, so it should be submitted as a job in the queue. The command in the submission script should be:

 blastall -p blastp -F 'm S' -v 100000 -b 100000 -z 55 -e 1e-5 -d goodprotdb -i goodProteins.fasta -o all-all.blastp -m 8


Alternatively, if you want to carry out the Blast with Diamond:

 module load diamond;
 diamond makedb --in goodProteins.fasta -p 10 -d goodprotdb;

Then carry out the all vs all blast. Please note this step can take days depending on the number of sequences to Blast. This should be submitted through a job in the queue. A suggested command in the submission script would be:

 diamond blastp -q goodProteins.fasta --min-score 60 --db goodprotdb.dmnd -a all-all.blastp.daa -t /ibers/ernie/scratch/my_user/tmp/ --threads 6 --max-target-seqs 1000000;

The "min-score", "threads" and "max-target-seqs" values can be adjusted to the user's preferences.

Once you have your .daa file, you can get your tabular format file with the following command:

 diamond view -a all-all.blastp.daa -o all-all.blastp;




BLAST shortcut

In my case, to save time as I was in a hurry, I partitioned the query sequences by species (35 fasta files in total) so that it would run faster. I separated them from the "goodProteins.fasta" file. To do so I had a script, convert_fasta_to_1l_fasta.sh, that converts fasta files whose sequences span multiple lines (the way they download from NCBI) into one-long-line fasta format.
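The core of that conversion can be sketched in awk (a minimal sketch on fabricated data; the actual convert_fasta_to_1l_fasta.sh script may differ):

```shell
# Fabricated multi-line fasta for illustration
cat > multi.fasta <<'EOF'
>atig|seq1
MKV
LQT
>cyno|seqA
MTT
EOF

# Print each header, then its sequence joined onto a single line
awk '/^>/ {if (seq) print seq; print; seq=""; next} {seq = seq $0} END {if (seq) print seq}' multi.fasta > multi.1l.fasta
cat multi.1l.fasta
```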

 sh convert_fasta_to_1l_fasta.sh goodProteins.fasta
 emacs grep_separate.sh;

Inside emacs copy and paste the following snippet of code:

 grep -A1 ">$1|" goodProteins.1l.fasta > $1.fas

Exit emacs. Now run it, giving it the list of 3- or 4-character tags that were used to label the species. In this example, the "list_of_shortspecies" file would look like this:

 atig
 cyno
 hyno
 newt
 anol

Then run it over each tag:

 cat list_of_shortspecies | xargs -I % sh grep_separate.sh %;
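One caveat with grep -A1: when a species' sequences are not contiguous in the file, grep inserts "--" separator lines between match groups, which would corrupt the per-species fasta. A toy run on fabricated data, using a variant of grep_separate.sh that strips the separators:

```shell
# Fabricated one-line fasta with non-contiguous atig entries
cat > goodProteins.1l.fasta <<'EOF'
>atig|seq1
MKVL
>cyno|seqA
MTTA
>atig|seq2
MGGS
EOF
printf 'atig\ncyno\n' > list_of_shortspecies

# Variant of grep_separate.sh that drops grep's "--" group separators
cat > grep_separate.sh <<'EOF'
grep -A1 ">$1|" goodProteins.1l.fasta | grep -v -e '^--$' > $1.fas
EOF

cat list_of_shortspecies | xargs -I % sh grep_separate.sh %
grep -c '^>' atig.fas
```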

Then I created the submission scripts for diamond, one per file, with a mkdiamond.sh command. The resulting submission scripts looked like this:

 #$ -e /my_path_to_my_files/my_orthomcl_dir/atig.error
 #$ -o /my_path_to_my_files/my_orthomcl_dir/atig.output
 #$ -N atig
 #$ -cwd
 #$ -j y
 #$ -S /bin/bash
 #$ -m beas
 #$ -M user@email.com
 #$ -V
 #$ -l h_vmem=20G
 #$ -l h_rt=24:00:00
 cd /my_path_to_my_files/my_orthomcl_dir
 module load diamond
 diamond blastp --db goodprotdb.dmnd -q atig.fas -a atig_all_all.daa -t /ibers/ernie/scratch/my_user/tmp --threads 10 --max-target-seqs 1000000
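The generator itself is not shown on this page; a hypothetical sketch of the idea, writing one submission script per species tag (the header values and file names are assumptions mirroring the example above, trimmed for brevity), could look like this:

```shell
# Hypothetical generator: emits one SGE submission script per species tag
cat > mkdiamondPBS.sh <<'OUTER'
tag=$1
cat > diamond_${tag}.sub <<SUB
#$ -N ${tag}
#$ -cwd
#$ -l h_vmem=20G
#$ -l h_rt=24:00:00
module load diamond
diamond blastp --db goodprotdb.dmnd -q ${tag}.fas -a ${tag}_all_all.daa --threads 10 --max-target-seqs 1000000
SUB
OUTER

# Generate one script per tag; each would then be submitted with qsub
printf 'atig\ncyno\n' > list_of_shortspecies
cat list_of_shortspecies | xargs -I % sh mkdiamondPBS.sh %
ls diamond_*.sub
```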

Finally I submitted all 35 jobs, and they ran across multiple nodes. The jobs finished in one night.

Then I transformed my *.daa files into tabular *.m8 files:

 module load diamond;
 for i in *.daa; do diamond view -a $i -o ${i}.m8; done;

Then I concatenated them all together:

 cat *.m8 > all-all.blastp




Now run orthomclBlastParser on the resulting huge all-all.blastp file. Since in my case it would take several hours, I ran it in a submission script (also making sure I kept a log or .output file to check that there were no errors).

 module load orthomcl;
 orthomclBlastParser all-all.blastp compliantFasta > similarSequences.txt

Now you have your similarSequences.txt file, and you have to go to the VM. This resulting file was huge (44G), so I had to make sure the VM would cope with it: I had to request that the VM have access to 5x my file size (so at least 220G of hard drive).

In the Virtual Machine...

ssh into your account on the VM. Upload / scp your similarSequences.txt file into either the my_orthomcl_dir or whichever hard disk has the capacity to hold it. In my case I had to cp it into /mnt/storage/orthomcl_kas/, which was a temporary HD installed for me to run this.

cp the orthomcl.config file from the template available in the `~/bin/orthomclSoftware-v2.0.9/doc/OrthoMCLEngine/Main/` directory. Sometimes it can cause problems: I had errors running the loadBlast command and had to change it. You can see my orthomcl.config file that finally ran with no errors.

Next, clean and create your orthomcl database using the following commands. This clears out whatever the previous user (Martin, in my case) had stored in the database and frees up space as well.

 mysql -u (the-orthomcl-database-user) -p
 (insert password here)
 DROP DATABASE orthomcldb;
 CREATE DATABASE orthomcldb;
 quit;

Then create the tables (schema) inside the database:

 ~/bin/orthomclSoftware-v2.0.9/bin/orthomclInstallSchema ~/my_orthomcl_dir/orthomcl.config ~/my_orthomcl_dir/install_schema.log
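A quick way to check the schema log afterwards is to grep it for problems (a toy log is fabricated below for illustration; point the grep at your real install_schema.log):

```shell
# Fabricated clean log for illustration
printf 'creating table SimilarSequences\ndone\n' > install_schema.log

# Fail loudly if the log mentions errors
if grep -qi 'error' install_schema.log; then
  echo "schema install reported errors"
else
  echo "log clean"
fi
```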

When it finishes, check that the log file has no errors. Now you can proceed to loading the Blast results into the database. Note this step takes a few hours and there is no way to know if it has crashed, so you have to check on it every so often. I suggest running it in "screen".

 screen -S load;
 ~/bin/orthomclSoftware-v2.0.9/bin/orthomclLoadBlast ~/my_orthomcl_dir/orthomcl.config similarSequences.txt

Detach from the screen (Ctrl+A, then D). Once it finishes (it will give you back the prompt), you can do the next step, also in a different screen:

 screen -S Pairs;
 ~/bin/orthomclSoftware-v2.0.9/bin/orthomclPairs ~/my_orthomcl_dir/orthomcl.config orthomcl_pairs.log cleanup=no

Detach from the screen.

Create the output tables (I ran it with nohup to keep a log file in case of errors):

 nohup ~/bin/orthomclSoftware-v2.0.9/bin/orthomclDumpPairsFiles ~/my_orthomcl_dir/orthomcl.config 2>dump.err &

The database step matches the most similar sequences in order to find the finer-scale orthology/paralogy relationships. Hence we end up with files called "orthologs.txt", "coorthologs.txt" and "inparalogs.txt", which are just files of protein pairs with a score for each pair in a third column. The program creates a summary of all these ("mclInput") which is used in the subsequent clustering step with the mcl algorithm. The MCL step I ran on the server, so I "scp"-ed the resulting files back to my account on the server.

Now I can quit the database VM. Also, let the administrator know in case they need to free up space, etc. Since this file took up a lot of disk space, someone else who wants to run something there might have to clean out the database.

Back to the server...

MCL clustering step:

Make sure you have the MCL software installed or available in your directory. In my case, I installed mine in my "~/bin". I ran this step in a submission script; it took about 30 mins and, for my size of file, used 2G of memory.

 ~/bin/mcl mclInput --abc -I 1.5 -o mclOutput

Then just parse the output into groups.txt (takes about 2 mins):

 ~/bin/orthomclSoftware-v2.0.9/bin/orthomclMclToGroups cluster- 1000 < mclOutput > groups.txt

And finally obtain the singletons file (everything in your goodProteins.fasta file that was not included in the clusters in "groups.txt"):

 ~/bin/orthomclSoftware-v2.0.9/bin/orthomclSingletons goodProteins.fasta groups.txt >> singletons.txt

And you are done! Do a final count of the clusters obtained for your dataset:

 cut -d":" -f1 groups.txt | wc -l;
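On a toy groups.txt (fabricated cluster names in the style orthomclMclToGroups produces) the count works like this:

```shell
# Fabricated groups.txt: one cluster per line, "name: member member ..."
cat > groups.txt <<'EOF'
cluster-1000: atig|seq1 cyno|seqA
cluster-1001: atig|seq2 newt|seqB hyno|seqC
EOF
cut -d":" -f1 groups.txt | wc -l
```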

The number you obtain should match what was written in the log file (in my case called "mcl_ortho.output") by the "mcl" command that you (hopefully) ran on the cluster:

 grep "clusters found" mcl_ortho.output