OrthoMCL


! Using OrthoMCL

An important consideration before you start: OrthoMCL doesn't run on the server, but in a virtual machine (VM). I had to request an account on the VM with a password so that I could access MySQL there. Contact the server administrator for help setting this up.

This VM is almost bare of software, but it has MySQL. It doesn't even have emacs. So if you need to modify scripts there, either use vi (if you are familiar with it) or edit them somewhere else and then `scp` / upload the files into your account on the VM.

Also, create a `~/bin/` directory in your home directory in the VM:

``` mkdir ~/bin/ ```

Go into the new bin directory and download the OrthoMCL software (it comes as a tar.gz), then gunzip and untar it. You will need the commands in the resulting bin directory: `~/bin/orthomclSoftware-v2.0.9/bin/` in my case.
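For example, something along these lines (the exact download URL is illustrative; check orthomcl.org for the current release):

```
cd ~/bin/
# Download and unpack OrthoMCL v2.0.9 (URL may differ; get the current link from orthomcl.org)
wget https://orthomcl.org/common/downloads/software/v2.0/orthomclSoftware-v2.0.9.tar.gz
tar -xzf orthomclSoftware-v2.0.9.tar.gz
ls orthomclSoftware-v2.0.9/bin/   # the scripts used below live here
```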

You can actually run the first part of the analysis on the server (using `module load orthomcl`). But once you get to the stage where you need to load data into the database, you'll have to run those commands on the VM, calling them via your `~/bin/orthomclSoftware-v2.0.9/bin/` path.

! OrthoMCL Tutorial:

!!!On the server...

First you need to sort out the fasta files so that they will be read correctly by the program (orthomclAdjustFasta). Basically, make sure the sequence names will be unique once the unique identifier has been chosen from the sequence name. For example, the pipeline below strips `*` characters and removes entries with duplicate names:

```
# Replace the first two '|' in each header with spaces, convert spaces to tabs,
# strip '*' characters, sort by sequence name (field 2, ties broken by field 9
# in reverse order) and keep only the first entry for each name.
cat duplicates_ESTs.phy \
  | sed 's/|/ /1' | sed 's/|/ /1' \
  | tr ' ' '\t' \
  | sed 's/*//g' \
  | sort -k2,2 -k9,9r \
  | awk '{ if (!($2 in seen)) { seen[$2] = 1; print } }' \
  > EST_Atigrinum_ncbi_renamed_transdecoder.phy
```

Follow the OrthoMCL tutorial, which explains how orthomclAdjustFasta parses and separates the fields of the sequence names.
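For reference, the compliant fasta produced by orthomclAdjustFasta uses headers of the form taxon-code|unique-id; the taxon codes, IDs and sequences below are just illustrative:

```
>atig|contig_00123
MSTAVLENPGLGRKLSDFGQETSYIEDN...
>newt|TR4567_c0_g1_i1
MEEPQSDPSVEPPLSQETFSDLWKLLPE...
```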

Then to run the orthomcl scripts I followed the tutorial here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3196566/

and this other user-friendly (copy/paste commands) tutorial here: https://www.biostars.org/p/199083/

Once the input sequences are in the required format (one fasta file per organism or taxon), create two directories:

``` mkdir my_orthomcl_dir; cd my_orthomcl_dir; mkdir compliantFasta ```

Then cp your original fasta files into the compliantFasta directory. Note: the file names don't need to end in .fasta (but they must be in fasta format).
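For example (the source path here is hypothetical):

```
# Copy the per-species fasta files into compliantFasta (source path is illustrative)
cp /path/to/original/fastas/*.fasta compliantFasta/
```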

run the "orthomclAdjustFasta":

``` module load orthomcl; orthomclAdjustFasta atig All_Atigrinum_EST_contigs.fasta.transdecoder.pep.filtered.fasta 1 ```

Actually, since I had a lot of files, I used a script file (following Martin's example) in which I listed the command for each file (it was called adjust_new.sh):

```
orthomclAdjustFasta atig All_Atigrinum_EST_contigs.fasta.transdecoder.pep.filtered.fasta 1
orthomclAdjustFasta cyno Cynops_all_filtered_trinity_renamed.fasta.transdecoder.pep.compliant 1
orthomclAdjustFasta hyno Hynobius_assembly_ncbi.fasta.transdecoder.pep.compliant 4
orthomclAdjustFasta newt Newt_transcriptome_renamed.fasta.transdecoder.pep.compliant 1
orthomclAdjustFasta anol download_ensembl_Anolis.fasta.longest.aa.filt.fasta.compliant 1
orthomclAdjustFasta chic download_ensembl_Chicken.fasta.longest.aa.filt.fasta.compliant 1
orthomclAdjustFasta cows download_ensembl_Cow.fasta.longest.aa.filt.fasta.compliant 1
orthomclAdjustFasta dani download_ensembl_Danio.fasta.longest.aa.filt.fasta.compliant 1
```

Once you obtain the output files (your compliant fasta files), move the original files and any links to them to a different directory. Keep only the compliant fasta files in the compliantFasta directory.

Then do the filtering step:

``` orthomclFilterFasta compliantFasta 10 20 ```

This step produces the goodProteins.fasta file (the two numbers are the minimum protein length, 10, and the maximum percent of stop codons, 20). Use this file to create your BLAST database (in my case, I used Diamond):

``` module load diamond; diamond makedb ... ```
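The exact makedb options used aren't shown above; a typical invocation might look like this (the database name goodprotdb matches the --db used in the blastp submission scripts below):

```
module load diamond
# Build a Diamond database from the filtered proteins; "goodprotdb" is the
# name referenced later by "diamond blastp --db goodprotdb.dmnd"
diamond makedb --in goodProteins.fasta -d goodprotdb
```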

Then carry out the all-vs-all BLAST. In my case, I partitioned the query sequences by species (so 35 fasta files in total) so that it would run faster. I separated them out of the goodProteins.fasta file:

```
# Convert goodProteins.fasta to a one-line-per-sequence fasta
convert_fasta_to_1l_fasta.sh goodProteins.fasta

# Create the per-species extraction script
emacs grep_separate.sh
# Inside emacs, copy and paste the following snippet, then save and exit:
grep -A1 ">$1|" goodProteins.1l.fasta > $1.fas
```

Now run it for each species:

``` cat list_of_shortspecies | xargs -I % sh grep_separate.sh %; ```
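The `convert_fasta_to_1l_fasta.sh` and `list_of_shortspecies` files above are local helpers whose contents aren't shown here. As a rough sketch, a linearising script compatible with the `grep -A1` trick might look like this (the file naming is an assumption):

```
#!/bin/bash
# Hypothetical sketch of convert_fasta_to_1l_fasta.sh: put each sequence on a
# single line so that "grep -A1" returns one complete record per header.
# Given goodProteins.fasta as $1, writes goodProteins.1l.fasta.
awk '/^>/ {if (seq) print seq; print; seq=""} !/^>/ {seq = seq $0} END {if (seq) print seq}' \
  "$1" > "${1%.fasta}.1l.fasta"
```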

Then I created submission scripts for Diamond for all 35 files with a mkdiamondPBS.sh script. It looked like this:

```
emacs mkdiamondPBS.sh
# Inside emacs, copy and paste the following, then save and exit.
# Usage: sh mkdiamondPBS.sh Speciesname   (for example: sh mkdiamondPBS.sh Scinax)

subheader.sh $1 60G 48:00:00 > diamond_${1}.sub
echo "module load diamond" >> diamond_${1}.sub
echo "diamond blastp --db goodprotdb.dmnd -q ${1}.fas -a ${1}_all_all.daa -t /ibers/ernie/scratch/kas63/tmp --threads 10 --max-target-seqs 1000000" >> diamond_${1}.sub

# Then generate a submission script for every species:
cat list_of_shortspecies | xargs -I % sh mkdiamondPBS.sh %
```

Finally I submitted all 35 jobs, and they ran across multiple nodes. The jobs finished overnight.

Then I converted my *.daa files into tabular *.m8 files:

``` module load diamond; for i in *.daa; do diamond view -a $i -o ${i}.m8; done; ```

Then I concatenated them all together with `cat *.m8 > all-all.blastp`.

Now run orthomclBlastParser on the resulting huge all-all.blastp file. Since in my case it would take several hours, I ran it in a submission script (making sure I kept a log / .output file so I could check there were no errors); a sketch of such a submission script is given a little further down.

``` module load orthomcl; orthomclBlastParser all-all.blastp compliantFasta > similarSequences.txt ```

Now you have your similarSequences.txt file, and you have to move over to the VM. This resulting file was huge (44 GB), so I had to make sure the VM could cope with it: I requested that the VM have access to at least 5× the file size of disk space (so at least 220 GB).
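For reference, wrapping the parser step in a submission script along the same lines as the Diamond scripts might look roughly like this; the reuse of subheader.sh and the memory/walltime requests are my assumptions, not the exact script used:

```
# Hypothetical submission script for the blast-parsing step (resources are guesses)
subheader.sh parseblast 20G 24:00:00 > parseblast.sub
echo "module load orthomcl" >> parseblast.sub
echo "orthomclBlastParser all-all.blastp compliantFasta > similarSequences.txt" >> parseblast.sub
# submit parseblast.sub with qsub (or your scheduler's equivalent)
```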

!!!In the Virtual Machine...

ssh into your account on the VM. Upload / scp your similarSequences.txt file into either my_orthomcl_dir or whichever disk has the capacity to hold it. In my case I had to cp it to /mnt/storage/orthomcl_kas/, which was a temporary hard drive installed for me to run this.

Now I would suggest copying the orthomcl.config file from either Martin's Glossina run, or you can use the template available in the `~/bin/orthomclSoftware-v2.0.9/doc/OrthoMCLEngine/Main/` directory. Sometimes it can cause problems: I had errors when running the orthomclLoadBlast command, so I had to change the config; see my orthomcl.config file that finally ran with no errors.
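For orientation, the config follows the template shipped with the software; mine looked roughly like this (the connect string, login and password values below are placeholders, and the cutoffs are the template defaults rather than anything tuned for this run):

```
dbVendor=mysql
dbConnectString=dbi:mysql:orthomcldb
dbLogin=orthomcl_user
dbPassword=yourpassword
similarSequencesTable=SimilarSequences
orthologTable=Ortholog
inParalogTable=InParalog
coOrthologTable=CoOrtholog
interTaxonMatchView=InterTaxonMatch
percentMatchCutoff=50
evalueExponentCutoff=-5
oracleIndexTblSpc=NONE
```

If orthomclLoadBlast cannot connect, the dbConnectString may need the host and port spelled out (for example dbi:mysql:orthomcldb:localhost:3306).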

Next, clean out and re-create your OrthoMCL database using the following commands. This clears whatever Martin had loaded before and frees up space as well:

```
mysql -u orthomcl_user -p
# enter the password when prompted, then at the mysql> prompt:
DROP DATABASE orthomcldb;
CREATE DATABASE orthomcldb;
quit;
```

Then create the tables (schema) inside the database:

``` ~/bin/orthomclSoftware-v2.0.9/bin/orthomclInstallSchema ~/my_orthomcl_dir/orthomcl.config ~/my_orthomcl_dir/install_schema.log ```

When it finishes, check that the log file has no errors. Now you can proceed to loading the BLAST results into the database. Note that this step takes a few hours and there is no way to know if it has crashed, so you have to check on it every so often. I suggest running it in "screen".


``` screen -S load; ~/bin/orthomclSoftware-v2.0.9/bin/orthomclLoadBlast ~/my_orthomcl_dir/orthomcl.config similarSequences.txt ```

Detach from the screen (Ctrl+A then D). Once it finishes (you will get your prompt back), you can run the next step, also in a different screen:

``` screen -S Pairs; ~/bin/orthomclSoftware-v2.0.9/bin/orthomclPairs ~/my_orthomcl_dir/orthomcl.config orthomcl_pairs.log cleanup=no ```

Detach from the screen again.

Create the output tables (I ran it with nohup to keep a log file in case of errors):

``` nohup ~/bin/orthomclSoftware-v2.0.9/bin/orthomclDumpPairsFiles ~/my_orthomcl_dir/orthomcl.config 2>dump.err & ```

The database steps match up the most similar sequences in order to resolve the finer-scale orthology / paralogy relationships. Hence we end up with files called "orthologs.txt", "coorthologs.txt" and "inparalogs.txt", which are simply files of protein pairs with a score for each pair in a third column. The program also creates a summary of all of these ("mclInput"), which is used in the subsequent clustering step with the MCL algorithm. I ran the MCL step on the server, so I `scp`'d the resulting files back to my account there.
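For orientation, mclInput (and the pair files) are plain tab-separated text, one pair per line with the score in the third column; the IDs and scores below are made up for illustration:

```
atig|contig_00123	newt|TR4567_c0_g1	154.3
atig|contig_00123	cyno|comp99_c0_seq1	98.7
hyno|isotig01987	dani|ENSDARP00000012345	75.2
```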

Now I can quit the database VM. Also, let the administrator know in case they need to free up space, etc.; since this file took up a lot of disk space, someone else wanting to run something there might need the database cleaned out first.

!!!Back on the server...

!!!MCL clustering step

Make sure you have the MCL software installed or available in your directory. In my case, I installed mine in my `~/bin`. I ran this step in a submission script; it took about 30 minutes and, for my file size, used about 2 GB of memory.

``` ~/bin/mcl mclInput --abc -I 1.5 -o mclOutput ```

Then just parse the output into groups.txt (takes about 2 minutes):

``` ~/bin/orthomclSoftware-v2.0.9/bin/orthomclMclToGroups 03Oct16-35sp- 1000 < mclOutput > groups.txt ```
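The two arguments are a prefix for the group names and the number to start counting from; each line of groups.txt is then a group name followed by its member proteins, roughly like this (IDs are illustrative):

```
03Oct16-35sp-1000: atig|contig_00123 newt|TR4567_c0_g1 cyno|comp99_c0_seq1
03Oct16-35sp-1001: hyno|isotig01987 chic|ENSGALP00000001234
```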

And finally obtain the singletons file, i.e. everything in your goodProteins.fasta file that was not included in any of the clusters in "groups.txt":

``` ~/bin/orthomclSoftware-v2.0.9/bin/orthomclSingletons goodProteins.fasta groups.txt >> singletons.txt ```

And you are done! A final count of the clusters obtained for this dataset gave a total of 117,282 clusters of two or more species or sequences.