The QDπ dataset, training data for drug-like molecules and biopolymer fragments and their interactions

Scientific Data vol. 12  p. 693  DOI: 10.1038/s41597-025-04972-3  Published: 2025-04-25 


Jinzhe Zeng, Timothy J. Giese, Andreas W. Götz, Darrin M. York

Abstract

The development of universal machine learning potentials (MLPs) for small organic and drug-like molecules requires large, accurate datasets that span diverse chemical spaces. In this study, we introduce the QDπ dataset, which incorporates data taken from several source datasets. To construct it, we use a query-by-committee active learning strategy to extract data from large datasets, maximizing diversity and avoiding redundancy, both of which matter for neural network training. The QDπ dataset requires only 1.6 million structures to express the chemical diversity of 13 elements from the various source datasets at the ωB97M-D3(BJ)/def2-TZVPPD level of theory. The QDπ dataset enables the creation of flexible target loss functions for neural network training relevant to drug discovery, including information-dense datasets of relative conformational energies and barriers, intermolecular interactions, tautomers, and relative protonation energies of drug-like compounds and biomolecular fragments. We hope that the high chemical information density and diversity of the QDπ dataset will make it a valuable resource for the development of new universal MLPs for drug discovery.
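
The query-by-committee idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the selection criterion used for QDπ, so everything here is an assumption for illustration: the committee is a set of toy functions standing in for independently trained models, each "structure" is reduced to a single hypothetical descriptor value, disagreement is measured as the standard deviation of predicted energies, and the threshold is arbitrary. Structures on which the committee disagrees are the ones worth adding to the training set; structures on which it agrees are treated as redundant.

```python
from statistics import pstdev
from typing import Callable, List, Sequence

# Hypothetical stand-in for a trained model: maps a structure descriptor to an energy.
Model = Callable[[float], float]


def committee_disagreement(models: Sequence[Model], x: float) -> float:
    """Population standard deviation of the committee's predictions for x."""
    return pstdev([model(x) for model in models])


def select_for_labeling(
    models: Sequence[Model], candidates: Sequence[float], threshold: float
) -> List[float]:
    """Keep only candidates on which the committee disagrees by more than threshold."""
    return [x for x in candidates if committee_disagreement(models, x) > threshold]


# Toy committee (assumption): three models that agree near x = 0 and diverge as |x| grows.
committee: List[Model] = [
    lambda x: x**2,
    lambda x: x**2 + 0.1 * x,
    lambda x: x**2 - 0.1 * x,
]

# Hypothetical candidate pool and threshold, chosen only for illustration.
candidates = [0.0, 0.5, 1.0, 2.0, 4.0]
selected = select_for_labeling(committee, candidates, threshold=0.1)
print(selected)
```

In a real active learning loop, the selected structures would be labeled at the reference level of theory, added to the training set, and the committee retrained before the next selection round.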