Published 2025 | Version v4
Dataset Open

Pangenome Mutation-Annotated Networks

  • 1. University of California, San Diego

Contributors

Researcher:

Description

Pangenomics is an emerging field that uses a collection of genomes of a species, instead of a single reference genome, to overcome reference bias and study within-species genetic diversity. Future pangenomics applications will require analyzing large and ever-growing collections of genomes. The choice of data representation is therefore a key determinant of the scope, as well as the computational and memory performance, of pangenomic analyses. Current pangenome formats, while capable of storing genetic variations across multiple genomes, fail to capture the shared evolutionary and mutational histories among them, which limits their applications. They are also inefficient for storage and therefore face significant scaling challenges. In this manuscript, we propose PanMAN, a novel data structure that is richer in information than all existing pangenome formats: in addition to representing the alignment and genetic variation in a collection of genomes, PanMAN represents the shared mutational and evolutionary histories inferred between those genomes. By using “evolutionary compression”, PanMAN achieves 5.2- to 680-fold compression over other variation-preserving pangenomic formats. PanMAN's relative performance generally improves with larger datasets, and it is compatible with any method for inferring phylogenies and ancestral nucleotide states. Using SARS-CoV-2 as a case study, we show that PanMAN offers a detailed and accurate portrayal of the pathogen's evolutionary and mutational history, facilitating the discovery of new biological insights. We also present panmanUtils, a software toolkit that supports common pangenomic analyses and makes PanMANs interoperable with existing tools and formats. PanMANs are poised to enhance the scale, speed, resolution, and overall scope of pangenomic analyses and data sharing.
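
The idea behind “evolutionary compression” can be illustrated with a toy mutation-annotated tree: store one root sequence plus the mutations on each branch, and rebuild any genome by replaying the mutations along its root-to-leaf path. The sketch below is only an illustration under simplifying assumptions. It handles substitutions on a single tree, and the `Node` class, the `reconstruct` function, and the sequences are hypothetical; it is not the PanMAN file format or the panmanUtils API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy tree node: stores only the substitutions on the edge from its parent."""
    name: str
    mutations: list = field(default_factory=list)  # (0-based position, new base)
    children: list = field(default_factory=list)


def reconstruct(root_seq, root):
    """Return {leaf name: sequence} by replaying mutations from root to each leaf."""
    sequences = {}

    def walk(node, seq):
        seq = list(seq)  # copy so sibling branches do not see each other's edits
        for pos, base in node.mutations:
            seq[pos] = base
        if not node.children:
            sequences[node.name] = "".join(seq)
        for child in node.children:
            walk(child, seq)

    walk(root, root_seq)
    return sequences


# Hypothetical data: one 10-base root sequence and three sampled genomes.
ROOT_SEQ = "ACGTACGTAC"
tree = Node("root", children=[
    Node("anc1", mutations=[(2, "T")], children=[  # shared by s1 and s2
        Node("s1"),
        Node("s2", mutations=[(7, "A")]),
    ]),
    Node("s3", mutations=[(0, "G")]),
])

genomes = reconstruct(ROOT_SEQ, tree)
for name, seq in sorted(genomes.items()):
    print(name, seq)
```

In this toy, three genomes are recovered from one stored sequence and three substitutions, and the substitution shared by `s1` and `s2` is stored once on their common ancestor, which also records where in the history it arose.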

Files

Files (45.4 GB)

MD5 checksum                            Size
md5:8747663e62e6a9a331e4efbcf88043e6    106.8 MB
md5:8b229e076f800a1dd0185f4fa4d04a3b    13.1 MB
md5:4b7ea6bcf96106fb74d184d7b2b355b3    210.6 MB
md5:e9e6ef589c2ae9b96814823d0a20ed55    725.7 kB
md5:64deffe72f01893d76b1cd256668f307    2.4 MB
md5:d6e6b163c6552b20581eda7fe762ee9d    44.7 GB
md5:9b91da25c589d0ec385bcc57bc14c6ae    382.7 MB
md5:9685b7f03a8038b175655e1330ace496    5.4 MB
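
After downloading a file, its integrity can be checked against the `md5:` value listed for it. The following is a minimal sketch using only Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listing` and the file path in the usage note are placeholders, not part of this record. The file is read in chunks so that the largest file (44.7 GB) does not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file's MD5 with a listed value such as 'md5:8747...' or a bare hex digest."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listing("downloaded_file", "md5:8747663e62e6a9a331e4efbcf88043e6")` should return `True` if `downloaded_file` (a placeholder path) is an intact copy of the first file in the listing.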

Additional details

Software

Repository URL
https://github.com/TurakhiaLab/panman
Programming language
C++
Development Status
Active