There is a newer version of the record available.

Published July 10, 2020 | Version 1
Dataset Open

Characteristics of human and viral RNA binding sites and site clusters recognized by SRSF1 and RNPS1

  • 1. Western University, Cytognomix Inc
  • 2. Western University
  • 3. Cytognomix Inc

Description

Section 1. Extended Data Tables

This archive contains the extended data tables for the research article "A proposed mechanism for molecular pathogenesis of severe RNA-viral pulmonary infections". These tables provide SRSF1, RNPS1 and hnRNP A1 binding site and information-dense cluster counts across various RNA viral genomes [including multiple SARS-CoV-2 and influenza strains] and the human transcriptome, the estimated SARS-CoV-2 doubling time necessary for viral genome SRSF1 binding site availability to exceed sites within the host transcriptome, and an analysis of influenza, dengue, and aplastic anemia patients misdiagnosed as irradiated by established radiation gene signatures. These tables are:

Section 1 - Table 1. RNPS1 and hnRNPA1 binding sites and Information-Dense Clusters for RNPS1 and hnRNPA1 in RNA Virus Genomes
Section 1 - Table 2A. Detailed Analysis of Information-Dense Clusters for SRSF1 (Replicate 1) in RNA Virus Genomes
Section 1 - Table 2B. Detailed Analysis of Information-Dense Clusters for SRSF1 (Replicate 2) in RNA Virus Genomes
Section 1 - Table 2C. Detailed Analysis of Information-Dense Clusters for RNPS1 in RNA Virus Genomes
Section 1 - Table 2D. Detailed Analysis of Information-Dense Clusters for hnRNP A1 in RNA Virus Genomes
Section 1 - Table 3. Binding Site Analysis of Multiple Coronavirus Strains (Both Strands)
Section 1 - Table 4A. Binding Site Analysis of Multiple Influenza A (H3N2) Strains (Negative Strand Only)
Section 1 - Table 4B. Binding Site Analysis of Multiple Influenza A (H3N2) Strains (Both Strands)
Section 1 - Table 5. SRSF1, RNPS1 and hnRNPA1 Binding Sites and Information-Dense Clusters by Gene
Section 1 - Table 6A. Transcriptome-Wide Information Dense Clusters Intersecting DRIP- and DRIPc-seq Intervals
Section 1 - Table 6B. Exome-Wide Information Dense Clusters within DRIP- and DRIPc-seq Intervals
Section 1 - Table 6C. Transcriptome-Wide Scan of Strong Binding Sites Intersecting DRIP- and DRIPc-seq Intervals
Section 1 - Table 6D. Exome-Wide Scan of Strong Binding Sites within DRIP- and DRIPc-seq Intervals
Section 1 - Table 7. Rate of False Positives for Influenza, Dengue Virus and Aplastic Anemia Using Radiation Signatures
Section 1 - Table 8. Radiation Model Genes Contributing to False Positives for Patients with Influenza A, Dengue Virus, and Aplastic Anemia
Section 1 - Table 9A. Doubling Time of SARS-CoV-2 Needed to Exceed Host Transcriptome SRSF1 Binding Sites (Positive-Strand Sites Only)
Section 1 - Table 9B. Doubling Time of SARS-CoV-2 Needed to Exceed Host Transcriptome SRSF1 Binding Sites (Both Strands Considered)

Section 2. All SRSF1, hnRNPA1 and RNPS1 binding site tracks for human and viral genomes

We provide bedGraph tracks giving the location and strength of binding sites (and binding site clusters) for SRSF1, RNPS1 and hnRNPA1 across the human transcriptome (GRCh37), the human exome (including +/-300nt surrounding each exon; non-intergenic only), and all viral genomes investigated in this study (Coronavirus, Dengue, HIV-1 [two strains] and Influenza [two strains]). If no clusters were found for a particular viral genome, no file for that genome is present in the Zenodo archive.

The folder “Cluster-to-DRIPseq-Intersection-Tracks” contains tracks indicating where binding site clusters intersect DRIP-seq and DRIPc-seq intervals, which mark regions with evidence of R-loop formation in the human genome. The DRIP-seq dataset (GSE68845) is not strand-specific. The DRIPc-seq dataset (GSE70189) is strand-specific, and this has been taken into account in the intersection (e.g. tracks only list positive-strand clusters found in positive-strand DRIPc-seq intervals).

Because of their size, the human transcriptome and exome tracks giving the locations of individual binding sites are each split into two files, one per strand. The custom tracks containing human binding site information are designed to be uploaded to the UCSC Genome Browser, but files containing transcriptome-wide binding site information may be too large to upload and may require further filtering (e.g. by chromosome).
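As one possible way to reduce a large track before upload, the sketch below keeps only the data lines for a single chromosome. It assumes the standard four-column bedGraph layout (chromosome, start, end, value) with optional "track", "browser" or comment header lines. The function name and the sample lines are hypothetical and are not taken from the archive.

```python
from typing import Iterable, Iterator


def filter_bedgraph_by_chrom(lines: Iterable[str], chrom: str) -> Iterator[str]:
    """Yield header lines plus the data lines whose first column equals `chrom`."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        # Header lines ("track ...", "browser ...", comments) are passed through.
        if stripped.startswith(("track", "browser", "#")):
            yield stripped
            continue
        fields = stripped.split()
        if fields[0] == chrom:
            yield stripped


# Hypothetical sample; a real file would be iterated with open(path).
sample = [
    'track type=bedGraph name="example"',
    "chr1\t100\t110\t5.2",
    "chr2\t200\t210\t7.9",
    "chr1\t300\t310\t4.8",
]
chr1_only = list(filter_bedgraph_by_chrom(sample, "chr1"))
```

Because the function works line by line, it can be applied to a file handle without loading the whole track into memory.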

To be classified as a cluster, binding sites on the same strand must have Ri values which sum to >50 bits, each binding site must have a neighboring site within 25nt, and all binding sites in the cluster must have Ri greater than a minimum bit threshold. For the human transcriptome and exome, this minimum was set to Rsequence; for viral binding sites, it was set to 0.1 * Rsequence. The information density-based clustering algorithm used in this work is described in Lu and Rogan 2018 (https://f1000research.com/articles/7-1933/v2) and archived source code is available through Zenodo (https://dx.doi.org/10.5281/zenodo.1892051).
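The three criteria above can be illustrated with a short sketch. This is not the Lu and Rogan algorithm; it is a simplified restatement of the stated thresholds. It assumes each site is reduced to a single position on one strand, that "within 25nt" means a gap of at most 25 between consecutive positions, and that the function name, parameter names and example sites are all hypothetical.

```python
from typing import List, Tuple

Site = Tuple[int, float]  # (position in nt, Ri in bits), all on one strand


def find_clusters(sites: List[Site], min_ri: float,
                  max_gap: int = 25, min_total: float = 50.0) -> List[List[Site]]:
    """Group same-strand sites into candidate clusters using the stated criteria."""
    # Every site in a cluster must have Ri above the minimum threshold.
    kept = sorted(s for s in sites if s[1] > min_ri)
    groups: List[List[Site]] = []
    for site in kept:
        # Chain a site to the current group if the previous site is within max_gap.
        if groups and site[0] - groups[-1][-1][0] <= max_gap:
            groups[-1].append(site)
        else:
            groups.append([site])
    # Summed Ri must exceed min_total; a lone site has no neighbor, so it is dropped.
    return [g for g in groups
            if len(g) > 1 and sum(ri for _, ri in g) > min_total]


# Hypothetical sites: (position, Ri)
sites = [(10, 20.0), (30, 18.0), (50, 15.0), (200, 30.0), (210, 2.0), (400, 60.0)]
clusters = find_clusters(sites, min_ri=5.0)
```

Under these assumptions, min_ri would be set to Rsequence for human data and to 0.1 * Rsequence for viral genomes.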

Section 3. Binding site clusters - lollipop plots

Lollipop plots present the genomic coordinates and information densities of clusters across the human transcriptome, human exome, and viral genomes (Coronavirus, Dengue, HIV-1 [two strains] and Influenza [one strain]). The height of each "lollipop" corresponds to the information density of a cluster. The label above each "lollipop" gives the start and end genomic coordinates (GRCh37) of the cluster, followed by the number of sites in the cluster enclosed in brackets. Each lollipop plot for the human transcriptome or exome covers a single gene. Influenza has 8 segments, and each segment requires its own plot; each of the other viral genomes examined is presented in a single plot.

File naming convention for human plots:

  • RBP_Gene.png
  • e.g. RNPS1_ADK.png

File naming convention for viral plots (elements in square brackets do not always appear):

  • Virus[.InfluenzaSegment].RiThreshold.Strand.RBP.png
  • e.g. Wuhan-Hu-1.complete-genome.4.2-bits.PosStrand.hnRNPA1.png

The specified Ri threshold indicates that all binding sites in a cluster have Ri greater than or equal to the threshold.

Section 4. Ri(b,l) matrices for all binding sites scanned

This section contains the information theory-based position weight matrices for the RNA binding proteins (RBPs) used in this study: SRSF1, hnRNPA1 and RNPS1. We investigated binding using two different RNPS1 binding models. While similar, these two models contain binding site information on opposing sides of the binding site motif, which is why we found it prudent to scan with both models.

Structure of each file:

Line #1: Start position, End position and Rsequence [average strength of sequences used to generate the model]

Subsequent lines describe the information on each position of the binding site:

  • First four columns: Ri contribution of nucleotide at this position of the matrix [A, C, G, T]
  • Column #5: Position of the matrix
  • Last four columns: Number of binding sites used to generate model with a particular nucleotide at this position of the matrix [A, C, G, T]

Example:

-2.965775           1.282153            0.034225            -4.906891           0            1              19          8            0

At position zero of the matrix (first nucleotide), a ‘C’ would contribute positively to binding site strength, a ‘G’ would be relatively neutral, and an ‘A’ or ‘T’ would contribute negatively.

Generation of Ri(b,l) matrices and computation of Ri values can be accomplished with the Delila package (https://alum.mit.edu/www/toms/delila/delilaprograms.html).
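For readers who only need to read the matrix files, the sketch below parses one nine-column matrix row as described above and sums per-position contributions to obtain the Ri of a candidate sequence. It is a minimal illustration and not a substitute for Delila. The class and function names are hypothetical, counts are assumed to be integers, and RNA input is assumed to map 'U' to the 'T' column.

```python
from typing import Dict, List, NamedTuple

BASES = "ACGT"


class MatrixRow(NamedTuple):
    position: int
    weights: Dict[str, float]  # Ri contribution per nucleotide
    counts: Dict[str, int]     # sites in the model with each nucleotide


def parse_matrix_row(line: str) -> MatrixRow:
    """Split one nine-column row into Ri weights, matrix position, and counts."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    weights = dict(zip(BASES, map(float, fields[:4])))
    counts = dict(zip(BASES, map(int, fields[5:])))
    return MatrixRow(int(fields[4]), weights, counts)


def score_sequence(rows: List[MatrixRow], seq: str) -> float:
    """Sum the weight of each base of seq against the rows, in row order."""
    dna = seq.upper().replace("U", "T")
    if len(dna) != len(rows):
        raise ValueError("sequence length must match the number of matrix rows")
    return sum(row.weights[base] for row, base in zip(rows, dna))


# The example row shown above.
row = parse_matrix_row("-2.965775  1.282153  0.034225  -4.906891  0  1  19  8  0")
```

A full matrix would be a list of such rows, one per position, read from every line after the header line of the file.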

Section 5. Ri and intersite distance - histograms

Two sets of histograms present Ri distribution and intersite distance distribution across the human transcriptome, human exome, and viral genomes (Coronavirus, Dengue, HIV-1 [two strains] and Influenza [one strain]). 

File naming convention for human plots (elements in square brackets do not always appear):

  • [IntersiteDistancesThreshold-]Human-[DRIPc]-AllChrs-RBP[-RiThreshold].png
  • e.g. IntersiteDistances500-Human-AllChrs-hnRNPA1-4.6-bits.png

File naming convention for viral plots (elements in square brackets do not always appear):

  • [IntersiteDistancesThreshold-]Strand-RBP-Virus[.InfluenzaSegment][-RiThreshold].png
  • e.g. IntersideDistances1000-PosStrandOnly-SRSF1-top50000sitesReplicate1-HIV-1-Strain-B.png

Intersite distance thresholds of 500 or 1000 were assigned for all intersite distance histograms. Any distances above the corresponding threshold were excluded from the plot. Plots presenting Ri distributions contain a dashed line indicating Rsequence if it is visible within the scope of the plot.
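The exclusion rule above can be sketched as follows. The sketch assumes that an intersite distance is the gap between consecutive binding site positions on the same strand, which this description does not define explicitly. The function name and sample positions are hypothetical.

```python
from typing import List


def intersite_distances(positions: List[int], max_distance: int) -> List[int]:
    """Return gaps between consecutive sites, excluding gaps above max_distance."""
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return [g for g in gaps if g <= max_distance]


# Hypothetical site positions on one strand.
positions = [100, 130, 900, 2500]
within_1000 = intersite_distances(positions, 1000)
within_500 = intersite_distances(positions, 500)
```

The returned list is what would be binned for a histogram at the given threshold.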

Files

Section 1. Extended Data Tables.zip