MolGPT#
1. Methods#
The input of the model
Molecular SMILES
Scaffold
Properties:\(\left\{\begin{aligned}&\log{P}\quad \text{ ( The logarithm of the partition coefficient.)}\\ &SAS \quad \text{ (Synthetic Accessibility score) }\\&TPSA \quad \text{ (Topological Polar Surface Area)}\\&QED \quad \text{ (Quantitative Estimate of Drug-likeness )}\end{aligned}\right.\)
Dataset used
MOSES or the GuacaMol
**MOSES: **A data set composed of 1.9 million clean lead-like molecules from the Zinc data set46 with molecular weight ranging from 250 to 350 Da, number of rotatable bonds lower than 7, and XlogP below 3.5.
**GuacaMol: **A subset of the ChEMBL 24 database that contains 1.6 million molecules.
1.1. The architecture of Transformer#
Detailed tutorial can be seen in: Transformer.md
2. Training Procedure and Evaluation Metrics#
Model is trained for 10 epochs using the Adam optimizer with a learning rate of \(6 × 10^{−4}\).
Evaluation metrics :
Validity: The fraction of a generated molecules that are valid.
Uniqueness: The fraction of valid generated molecules that are unique.
**Novelty: **The fraction of valid unique generated molecules that are not in the training set.
Internal Diversity (\(\mathbf{IntDiV_p}\) ): Measures the diversity of the generated molecules.
**Freshet ChemNet Distance(FCD): **Calculated using the features of the generated molecules and the features of molecules in the data set.
KL Divergence: KL divergence between two distributions P and Q for any given property is a measure of how well Q approximates P
3. RESULTS AND DISCUSSION#
Nonconditioned Molecular Generation.
Generation-based on Single and Multiple Properties.
Generation Based on Scaffold.
Generation Based on Scaffold and Property.