Readme

This pipeline, FRIGG (Fungal ResIstance Gene-directed Genome mining), has been designed to identify putative resistance genes in secondary metabolite gene clusters.

Copyright (C) 2019 Inge Kjærbølling

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Workflow

1. Run SMG_prox_fam_ver3_Pip.py

To run:
    python SMG_prox_fam_ver3_Pip.py -c "TRUE" -rr "TRUE"

User input:
    -clear / -c, default='FALSE'
        Clear and replace data if org_id is already found in the output table.
    -plimit / -pl, default=0
        Set the number of genes to include before and after the cluster.
    -input / -i, default='smurf'
        Set the table to use as input for SMG_proximity, containing SMURF data (predicted secondary metabolite gene clusters). This also determines which organisms the analysis is run on (Input_data_pipeline_smurf).
    -output / -o, default='clust_prox'
        Set the table to use as output from the SMG_proximity function, giving the sec met genes and their location.
    -input_fam / -if, default='clust_prox'
        Set the table to use as input for the SMG_family function.
    -output_fam / -of, default='prox_fam_count'
        Set the table to use as output for SMG_family. This is the final output of SMG_prox_fam_ver3_Pip.py, giving the sec met genes, the protein family, and the counts of homologs in total and per organism, inside and outside clusters.
    -delete / -de, default='TRUE'
        Delete the (intermediate) table if it already exists.
    -rerun / -rr, default='FALSE'
        Rerun SMG_proximity to create a new ik_clust_prox table, or use the already existing one.
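The step 1 options listed above could be declared roughly as follows. This is only a sketch that mirrors the documented flag names and defaults using Python's argparse; whether SMG_prox_fam_ver3_Pip.py actually uses argparse, and the helper name build_parser, are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented step 1 flags and defaults."""
    parser = argparse.ArgumentParser(description="Sketch of the step 1 options")
    # The TRUE/FALSE options are kept as strings, as in the documented defaults.
    parser.add_argument("-clear", "-c", dest="clear", default="FALSE")
    parser.add_argument("-plimit", "-pl", dest="plimit", type=int, default=0)
    parser.add_argument("-input", "-i", dest="input", default="smurf")
    parser.add_argument("-output", "-o", dest="output", default="clust_prox")
    parser.add_argument("-input_fam", "-if", dest="input_fam", default="clust_prox")
    parser.add_argument("-output_fam", "-of", dest="output_fam", default="prox_fam_count")
    parser.add_argument("-delete", "-de", dest="delete", default="TRUE")
    parser.add_argument("-rerun", "-rr", dest="rerun", default="FALSE")
    return parser


# Equivalent to: python SMG_prox_fam_ver3_Pip.py -c "TRUE" -rr "TRUE"
args = build_parser().parse_args(["-c", "TRUE", "-rr", "TRUE"])
print(args.clear, args.rerun, args.plimit, args.output_fam)
```

Options that are not given on the command line keep the defaults listed above, so the example call only overrides -clear and -rerun.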
To change the organisms in the dataset, supply a smurf table with the desired organisms as input, together with the hfam table.

Output:
    MySQL table: ik_prox_fam_count

Prerequisite:
    - countfams table with the hfam and count per organism

The script runs SMG_proximity from SMG_prox_ver3_Pip.py and SMG_family from SMG_fam_ver3_Pip.py.

SMG_proximity requires the presence of this table:
    - gff_table: a table named gff_ultimo is needed (Input_data_pipeline_gff)

SMG_family requires these tables to be set:
    - intermediate_table = 'ik_prox_fam'   # intermediate table with clust+prox genes coupled to hfam
    - hfam_table = 'hfam'                  # table containing the proteins and the protein family they belong to (Input_data_pipeline_resistancepipeline_biblast_hfam_july2018)
    - count_fam = 'countfams'              # table containing the protein families, organisms and the number of homologs in each organism (has to be generated from the hfam_table; a guide on how to make it is found in the script)

# This part is the slowest of the pipeline and takes 4-8 hours, depending on the number of organisms included and on the computer used.

2. Run SMG_resistance_ver3_Pip.py

To run:
    -input / -i, default='prox_fam_count'
        Set the table to use as input, generated by SMG_prox_fam_ver3_Pip.py.
    -output / -o, default='StrictCase_WholeHFAM'
        Set the table to use as output: a table with all proteins from each hfam identified with putative resistance genes.
    -smurf_table / -s, default='smurf'
        Set which secondary metabolite prediction table to use (SMURF).
    -hfamTable / -ht, default='hfam'
        Set the hfam table, the table containing the protein families (columns: hfam, org_id, org_name, protein_id).
    -Overview_file / -Of, default='*Overview_name*'
        The name of the overview output file with key numbers from the filtering process.

    # user input for selection patterns and filtering criteria
    -Hfam_cutoff / -hc, default=10000
        Size cut-off above which hfams are disregarded in the count and pattern selection; leave at 10000 to avoid this setting (step 2).
    -Flag_moreClust / -fmc, default=0
        Flag for filtering on whether the potential resistance gene has to be found in more than 1 cluster (step 3). If set to 1 the filtering is done; if 0 it is not.
    -Nhfam_org / -Nho, default='98perc'
        Set the number of organisms the protein family has to be found in (all/98perc/95perc/90perc) (step 4).
    -Flag_singleOrg / -fso, default=1
        Flag for filtering on the number of organisms having a single copy in the genome (step 5). If set to 1 the filtering is done; if 0 it is not.

Output:
    - MySQL table: the -output table, with proteins from all the protein families that have been selected based on the selection and filtering criteria (columns: hfam, org_id, org_name, real_name, section, protein_id, prot_seq, flag, count (the number of copies in the organism))
    - FASTA files for each selected protein family, containing all the proteins in that family.

What the program is doing:
# Selecting the potential resistance clusters - strict case scenario: gene copy number 1 for all genes in the cluster, and one gene also has a copy outside the cluster; if Hfam_cutoff has been set, homologs are allowed when the size of the protein family is bigger than the cut-off
# Making the intermediate table 'StrictCase_genes_Pipeline_', holding all the strict case cluster genes, i.e. a table with the potential resistance genes found in clusters
# Filtering and selecting the strict case genes where the hfam is also found in another strict case gene, i.e. keeping the hfam protein families with at least two strict cluster cases, if Flag_moreClust is set to 1
# Filtering the protein families based on the percentage of organisms in which the family has a member, using the threshold given by Nhfam_org
# Filtering the protein families on the requirement that 50% of the organisms have only one copy, if Flag_singleOrg is set to 1
# Creating a table with the selected protein families and a flag telling whether the specific protein is found in a cluster, whether it is the 'selected' cluster (strict_clust), or whether it is the corresponding outside-cluster copy -> output table: StrictCase_WholeHFAM
# Creating FASTA files for all the selected protein families
# This takes between 8 and 40 minutes, depending on the selection and filtering criteria used.

3. Python script SMG_Align_Trim.py

Input: the FASTA files, each with all the sequences from one hfam, taken from one directory.
    Align   - run clustalo
    Trim    - run Gblocks
    Realign - run clustalo
# Output: aligned sequences in FASTA format

4. SMG_PCA_Phylo.R - make graphical figures for further analysis

# The input is the aligned sequences in FASTA format
Functions:
    PCA_analysis - runs principal component analysis on an aligned protein family; output in PDF format
    ML_phylo     - creates phylogenetic trees with 500 bootstraps; output in Newick and PDF format
Notes: the PCA is fast, while ML_phylo is slow.
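
The align, trim and realign sequence of step 3 could be sketched as below. This is not the content of SMG_Align_Trim.py: the function names, the output file naming, the ".fasta" extension and the exact command-line options are assumptions chosen for illustration. It only assumes that clustalo and Gblocks are installed and on the PATH.

```python
import subprocess
from pathlib import Path


def align_trim_commands(fasta, out_dir):
    """Return the align, trim and realign commands for one hfam FASTA file."""
    fasta = Path(fasta)
    out_dir = Path(out_dir)
    aligned = out_dir / (fasta.stem + ".aln.fasta")
    # Gblocks writes its result next to its input, with "-gb" appended to the name.
    trimmed = Path(str(aligned) + "-gb")
    realigned = out_dir / (fasta.stem + ".trim.aln.fasta")
    return [
        # Align: Clustal Omega, FASTA output, overwrite existing files.
        ["clustalo", "-i", str(fasta), "-o", str(aligned), "--outfmt=fasta", "--force"],
        # Trim: Gblocks in protein mode.
        ["Gblocks", str(aligned), "-t=p"],
        # Realign: strip the old gaps from the trimmed blocks and align again.
        ["clustalo", "-i", str(trimmed), "-o", str(realigned),
         "--outfmt=fasta", "--dealign", "--force"],
    ]


def run_directory(fasta_dir, out_dir):
    """Run the three steps for every *.fasta file in one directory."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(fasta_dir).glob("*.fasta")):
        align, trim, realign = align_trim_commands(fasta, out_dir)
        subprocess.run(align, check=True)
        # Gblocks is often reported to exit non-zero even on success,
        # so its return code is deliberately not checked here.
        subprocess.run(trim, check=False)
        subprocess.run(realign, check=True)
```

Building the commands separately from running them makes it possible to inspect or log the exact calls before starting a long batch.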