Readme

This pipeline, FRIGG (Fungal ResIstance Gene-directed Genome mining), is designed to identify putative resistance genes in secondary metabolite gene clusters.

Copyright (C) 2019  Inge Kjærbølling

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.


Workflow
1. Run SMG_prox_fam_ver3_Pip.py
	To run: python SMG_prox_fam_ver3_Pip.py -c "TRUE" -rr "TRUE"
	
	
	User input:
	-clear / -c, default='FALSE', Clear and replace data if org_id is already found in output table.
	-plimit / -pl, default=0, Set the number of genes to include before and after the cluster.
	-input / -i, default='smurf', Set the table to use as input for SMG_proximity, containing SMURF data (predicted secondary metabolite gene clusters). This also determines which organisms the analysis is run on (Input_data_pipeline_smurf).
	-output / -o, default='clust_prox', Set the Table to use as output from SMG_proximity function, giving the sec met genes and the location.
	-input_fam / -if, default='clust_prox', Set the Table to use as input for the SMG_family function.
	-output_fam / -of, default='prox_fam_count', Set the table to use as output for SMG_family. This is the final output of SMG_prox_fam_ver3_Pip.py, giving the sec met genes, the protein family, and the counts of homologs in total and per organism, inside and outside clusters.
	-delete / -de, default='TRUE', Delete the intermediate table if it already exists.
	-rerun / -rr, default='FALSE', Rerun SMG_proximity to create a new ik_clust_prox table; otherwise the existing table is used.
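
	As a rough illustration of how the options and defaults listed above fit together, they could be declared with argparse as below. This is a sketch reconstructed from this README only, not the parser used in SMG_prox_fam_ver3_Pip.py; the function name build_parser is hypothetical and the real script may handle types and flags differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the option list in this README."""
    parser = argparse.ArgumentParser(description="SMG_prox_fam options (sketch)")
    # Flags are passed as the strings 'TRUE'/'FALSE', as in the example command.
    parser.add_argument("-clear", "-c", default="FALSE")
    parser.add_argument("-plimit", "-pl", type=int, default=0)
    parser.add_argument("-input", "-i", default="smurf")
    parser.add_argument("-output", "-o", default="clust_prox")
    parser.add_argument("-input_fam", "-if", default="clust_prox")
    parser.add_argument("-output_fam", "-of", default="prox_fam_count")
    parser.add_argument("-delete", "-de", default="TRUE")
    parser.add_argument("-rerun", "-rr", default="FALSE")
    return parser


# Equivalent to: python SMG_prox_fam_ver3_Pip.py -c "TRUE" -rr "TRUE"
args = build_parser().parse_args(["-c", "TRUE", "-rr", "TRUE"])
```

	Options that are not given on the command line keep the defaults listed above, so here only clear and rerun change.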

	To change the organisms in the dataset, provide a smurf table and an hfam table containing the desired organisms as input.

	Output: MySQL table: ik_prox_fam_count

	Prerequisite:
		- countfams table with the hfam and the count per organism
	
	The script runs SMG_proximity from SMG_prox_ver3_Pip.py and SMG_family from SMG_fam_ver3_Pip.py.

	SMG_proximity requires the presence of this table:
		- gff_table, which must be named gff_ultimo (Input_data_pipeline_gff)
	SMG_family requires these tables to be set:
		- intermediate_table = 'ik_prox_fam' # intermediate table with clust+prox genes coupled to hfam
		- hfam_table = 'hfam' # table containing the proteins and the protein family they belong to
		(Input_data_pipeline_resistancepipeline_biblast_hfam_july2018)
		- count_fam = 'countfams' # table containing the protein families, organisms and the number of homologs in each organism. 
		(has to be generated from the hfam_table; a guide on how to make it is found in the script)
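
	The countfams table holds, for each protein family, the number of homologs in each organism. The authoritative guide for building it is in the script itself; the sketch below only shows the underlying counting logic in memory, assuming hfam rows of the form (hfam, org_id, protein_id). The function name count_fams and the toy rows are illustrative.

```python
from collections import Counter


def count_fams(hfam_rows):
    """Count homologs per (hfam, org_id) from (hfam, org_id, protein_id) rows."""
    counts = Counter((hfam, org_id) for hfam, org_id, _protein_id in hfam_rows)
    # One output row per family and organism: (hfam, org_id, count).
    return [(hfam, org_id, n) for (hfam, org_id), n in sorted(counts.items())]


# Toy data: family 1 has two members in organism 10 and one in organism 11.
rows = [(1, 10, "p1"), (1, 10, "p2"), (1, 11, "p3"), (2, 10, "p4")]
countfams = count_fams(rows)
```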

	# This part is the slowest of the pipeline and takes 4-8 hours, depending on the number of organisms included and on the computer used. 


2. Run SMG_resistance_ver3_Pip.py
	To run:
	
	-input/-i, default='prox_fam_count', Set the table to use as input, generated by SMG_prox_fam_ver3_Pip.py.
	-output/-o, default='StrictCase_WholeHFAM', Set the Table to use as output, a table with all proteins from each hfam identified with putative resistance genes.	
	-smurf_table/-s, default='smurf', Set which secondary metabolite prediction table to use (SMURF).
	-hfamTable/-ht, default='hfam', Set the hfam table, the table containing the protein families (columns: hfam, org_id, org_name, protein_id).
	-Overview_file/-Of, default='*Overview_name*', The name of the overview output file with key numbers from the filtering process
	# user input for selection patterns and filtering criteria
	-Hfam_cutoff/-hc, default=10000, The size cut-off above which hfams are disregarded in the count and pattern selection; leave at 10000 to effectively disable this setting, step 2.
	-Flag_moreClust/-fmc, default=0, A flag for whether to require that the potential resistance gene is found in more than 1 cluster, step 3 (1 = filtering is done, 0 = filtering is not done).
	-Nhfam_org/-Nho, default='98perc', Set the number of organisms the protein family has to be found in (all/98perc/95perc/90perc), step 4.
	-Flag_singleOrg/-fso, default=1, A flag for whether to filter on the number of organisms having a single copy in the genome, step 5 (1 = filtering is done, 0 = filtering is not done).

	Output: 
	- MySQL table: -output with proteins from all the protein families that have been selected based on the selection and filtering criteria (columns: hfam, org_id, org_name, real_name, section, protein_id, prot_seq, flag,
	  count (the number of copies in the organism))
	- FASTA files for each selected protein family containing all the proteins in that family.
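
	To illustrate the second output, the sketch below groups proteins by family and builds one FASTA text per family. It assumes rows of (hfam, protein_id, prot_seq) taken from the output table; the function name, the hfam_<id>.fasta file naming, and the 60-character line width are assumptions for illustration and may not match what the script writes.

```python
from collections import defaultdict


def fasta_per_family(rows, width=60):
    """Return {file name: FASTA text}, one entry per protein family."""
    families = defaultdict(list)
    for hfam, protein_id, prot_seq in rows:
        # Wrap the sequence at a fixed width, as is common for FASTA files.
        wrapped = "\n".join(
            prot_seq[i:i + width] for i in range(0, len(prot_seq), width)
        )
        families[hfam].append(f">{protein_id}\n{wrapped}\n")
    # File naming is hypothetical.
    return {f"hfam_{hfam}.fasta": "".join(recs) for hfam, recs in families.items()}


files = fasta_per_family([(7, "p1", "MKV"), (7, "p2", "MAA"), (9, "p3", "MT")])
```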

	What the program does:
	# Selects the potential resistance clusters (strict case scenario): the gene copy number is 1 for all genes in the cluster, and one gene also has a copy outside the cluster. If Hfam_cutoff has been set, homologs are allowed when the protein family is larger than the cut-off.
	# Makes the intermediate table ‘StrictCase_genes_Pipeline_’, which holds all the strict case cluster genes, i.e. the potential resistance genes found in clusters.
	# If Flag_moreClust is set to 1, keeps only the strict case genes whose hfam is also found in another strict case gene, i.e. the protein families with at least two strict case clusters.
	# Filters the protein families based on the percentage of organisms in which the family has a member, according to the threshold set by Nhfam_org.
	# If Flag_singleOrg is set to 1, filters the protein families so that 50% of the organisms can only have one copy.
	# Creates a table with the selected protein families and a flag telling whether the specific protein is found in a cluster, whether it is in the ‘selected’ cluster (strict_clust), or whether it is the corresponding copy outside the cluster -> output table: StrictCase_WholeHFAM
	# Creates FASTA files for all the selected protein families.

	# This takes between 8-40 minutes depending on the selection and filtering criteria used.
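
	The two organism-based filters (steps 4 and 5) can be sketched as below. This is an interpretation of the description above, not the script's code: it assumes each family is given as a dict of copies per organism, that the Nhfam_org choices map to 100/98/95/90 percent of all organisms, and that step 5 means at least half of the organisms carrying the family have exactly one copy. All names are illustrative.

```python
# Percentages assumed for the Nhfam_org choices (all/98perc/95perc/90perc).
NHFAM_ORG_PERCENT = {"all": 100, "98perc": 98, "95perc": 95, "90perc": 90}


def passes_org_coverage(copies_per_org, n_orgs_total, nhfam_org="98perc"):
    """Step 4: the family has a member in the required share of organisms."""
    present = sum(1 for n in copies_per_org.values() if n > 0)
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    return present * 100 >= NHFAM_ORG_PERCENT[nhfam_org] * n_orgs_total


def passes_single_copy(copies_per_org):
    """Step 5 (assumed reading): at least 50% of carriers have one copy."""
    present = [n for n in copies_per_org.values() if n > 0]
    single = sum(1 for n in present if n == 1)
    return bool(present) and 2 * single >= len(present)


family_a = {"org1": 1, "org2": 1, "org3": 2, "org4": 1}
family_b = {"org1": 3, "org2": 2, "org3": 1, "org4": 0}
kept = [
    name
    for name, fam in (("a", family_a), ("b", family_b))
    if passes_org_coverage(fam, 4, "all") and passes_single_copy(fam)
]
```

	With these toy counts only family_a is kept: family_b is missing from one organism and has a single copy in only one of the three organisms that carry it.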

3. Python script SMG_Align_Trim.py
	Input: the FASTA files with all the sequences from one hfam, read from one directory
	Align
	- Run clustalo
	Trim
	- Run Gblocks	
	Realign
	- Run clustalo
	# output aligned sequences in fasta format
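
	The align, trim, realign sequence above can be sketched as three external commands per FASTA file. The sketch below only builds the command lists and does not run them. It assumes Clustal Omega is installed as clustalo and Gblocks as Gblocks, that protein mode (-t=p) is wanted, and that Gblocks writes its result next to the input with a "-gb" suffix (its default). The intermediate file names are hypothetical and SMG_Align_Trim.py may use different names and options.

```python
from pathlib import Path


def align_trim_commands(fasta_path):
    """Build the align, trim, realign commands for one hfam FASTA file."""
    fasta = Path(fasta_path)
    aligned = fasta.with_suffix(".aln.fasta")      # hypothetical file name
    trimmed = Path(str(aligned) + "-gb")           # default Gblocks output name
    realigned = fasta.with_suffix(".realn.fasta")  # hypothetical file name
    return [
        ["clustalo", "-i", str(fasta), "-o", str(aligned), "--force"],
        ["Gblocks", str(aligned), "-t=p"],
        ["clustalo", "-i", str(trimmed), "-o", str(realigned), "--force"],
    ]


commands = align_trim_commands("hfam_12.fasta")
```

	Each command list can be passed to subprocess.run in order. Gblocks has been reported to return a non-zero exit status even after a successful run, so checking that the "-gb" file exists may be more reliable than checking the return code.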

4. SMG_PCA_Phylo.R - make graphical figures for further analysis
	# The input is the aligned sequences in fasta format
	
	Functions:
	PCA_analysis - running principal component analysis on an aligned protein family - output pdf format
	ML_phylo - creating phylogenetic trees with 500 bootstraps - outputting in newick and pdf format 
	
	Notes: the PCA is fast while the ML_phylo is slow