There is a newer version of the record available.

Published October 29, 2024 | Version 2024-Q3
Dataset Open

SDDF Energy Dataset

Description

This conformational energy dataset, developed as part of the Smart Distributed Data Factory (SDDF) project, contains over 2.17 million conformations of drug-like molecules sourced from the ENAMINE database. Energies were calculated with density functional theory (DFT) using the ωB97X functional and the 6-31G(d) basis set. Conformations were generated from SMILES in three ways: with RDKit, with RDKit followed by MMFF94 optimization, and with molecular dynamics (MD) simulations, giving a diverse set of molecular structures and energy states.

  • RDKit Conformations: 535,338
  • RDKit + MMFF94 Optimized: 1,151,936
  • MD-Generated: 483,279
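As a quick consistency check, the three counts above can be summed and compared against the "over 2.17 million" figure. This is only a small arithmetic sketch; the dictionary keys are shorthand labels chosen here, not identifiers used in the dataset files.

```python
# Conformation counts per generation method, copied from the list above.
# The keys are shorthand labels for this sketch, not names from the dataset.
counts = {
    "RDKit": 535_338,
    "RDKit + MMFF94": 1_151_936,
    "MD": 483_279,
}

total = sum(counts.values())
print(f"total conformations: {total:,}")  # total conformations: 2,170,553

# Share of each generation method in the full dataset.
for method, n in counts.items():
    print(f"{method:>15}: {n:>9,} ({n / total:.1%})")
```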

This dataset serves as a benchmark for energy prediction models. The training (638,617 examples), validation (134,732 examples), and test (24,890 examples) subsets were created with a strict scaffold-based split that ensures no overlap and less than 70% similarity between the training and test sets.
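The idea behind a scaffold-based split can be sketched as a group-wise assignment: all molecules sharing a scaffold go to the same subset. The sketch below is not the procedure used to build this benchmark. It assumes scaffold keys have already been computed elsewhere (for example as scaffold SMILES), the function name and the split fractions are hypothetical, and the additional 70% similarity criterion is not implemented.

```python
import random
from collections import defaultdict


def scaffold_split(scaffold_of, frac_train=0.8, frac_valid=0.1, seed=0):
    """Assign whole scaffold groups to train/valid/test.

    scaffold_of: dict mapping molecule ID -> scaffold key (hypothetical
    input; the keys are assumed to be precomputed elsewhere).
    The fractions are illustrative defaults, not the benchmark's values.
    """
    # Group molecule IDs by their scaffold key.
    groups = defaultdict(list)
    for mol_id, scaffold in scaffold_of.items():
        groups[scaffold].append(mol_id)

    # Shuffle the groups deterministically, then fill subsets group by group
    # so that a scaffold never ends up in more than one subset.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n = len(scaffold_of)
    split = {"train": [], "valid": [], "test": []}
    for key in keys:
        if len(split["train"]) < frac_train * n:
            target = "train"
        elif len(split["valid"]) < frac_valid * n:
            target = "valid"
        else:
            target = "test"
        split[target].extend(groups[key])
    return split


# Toy usage with made-up IDs and scaffold keys.
toy = {f"mol{i}": f"scaffold{i % 7}" for i in range(50)}
parts = scaffold_split(toy)
```

Because whole groups are assigned at once, the resulting subset sizes only approximate the requested fractions.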

Dataset contents:

  • data.tar.gz: contains the conformations in structure-data file (SDF) format, grouped into separate folders by molecule ID. Each conformation's energy label is stored in its SDF file as a property named "energy".
  • INDEX.smi: specifies the molecule IDs and their corresponding SMILES.
  • SOURCES.csv: specifies the conformation generation method for each conformation.
  • SDDF_train.tsv, SDDF_validation.tsv, and SDDF_test.tsv: specify the molecule IDs and conformations for each subset of the benchmark.
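Since each label is stored as an SDF property named "energy", it can be read without a cheminformatics toolkit by scanning for the data-item header. The following is a minimal sketch under two assumptions: the property value is a single number on the line after the header, and the SDF text has already been read into a string. The function name is hypothetical, and the sample record is a made-up stub, not data from this dataset.

```python
import re


def read_sdf_energies(sdf_text, name="energy"):
    """Return the value of data item `name` for each record in an SDF string.

    Assumes the value is a single number on the line following the
    header line (e.g. ">  <energy>"); float() raises ValueError otherwise.
    """
    header = re.compile(r"^>.*<" + re.escape(name) + r">")
    lines = sdf_text.splitlines()
    values = []
    for i, line in enumerate(lines):
        if header.match(line) and i + 1 < len(lines):
            values.append(float(lines[i + 1].strip()))
    return values


# Made-up stub of an SDF record tail; the value is not from the dataset.
sample = """\
M  END
>  <energy>
-1.5

$$$$
"""

print(read_sdf_energies(sample))  # [-1.5]
```

For full molecule parsing (atoms, coordinates, bonds), a toolkit such as RDKit would be the more natural choice; this sketch only extracts the label.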

A detailed description is provided in the accompanying paper.

Files

Files (1.7 GB)

Checksum                                Size
md5:8800e8bf2c916ce6ec0be39ad5279357    1.6 GB
md5:0d72ab801f37f4d8ca386a3fffb82ac5    28.4 MB
md5:cd06458b02fc78b0608e63bd3295ea08    260.0 kB
md5:592ed3f45ca8dffd87876387f838ff63    6.8 MB
md5:2ea7ab2a2b3f588761d574cba4d2a6e5    1.4 MB
md5:44fab60185217e0d36a486d3a1c5c644    60.6 MB
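The md5 checksums listed above can be used to verify a download. The sketch below uses only Python's standard hashlib module. The helper names are hypothetical, and because the listing does not pair file names with checksums, the caller has to supply the checksum that belongs to the file being checked.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum string such as "md5:8800e8bf...".

    `listed` is copied from the file listing; the "md5:" prefix is optional.
    """
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of(path) == expected
```

Reading in chunks keeps memory use constant, which matters for the 1.6 GB archive.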

Additional details

Additional titles

Alternative title
SDDF-Energy-2024Q3

Related works

Is published in
Preprint: 10.1101/2024.10.22.619651 (DOI)

Software