Applying machine learning for solubility prediction: comparing different representations of molecular data

Abstract

Solubility is a crucial property of drug candidates and should be determined early in the drug development cycle. Algorithms based on artificial intelligence (AI) offer a faster alternative to far more computationally expensive methods that rely on energy calculations, quantum dynamics, and molecular dynamics. In this work, several AI-based algorithms that use different molecular data representations, namely convolutional neural networks, graph neural networks, and decision-tree-based gradient boosting, are applied to an open dataset of more than 70 000 compounds from a recently published Kaggle challenge on solubility. Model performance is evaluated on the test set provided by the Kaggle challenge and on a locally created independent dataset. The results show that the gradient boosting model trained on a tabular feature representation of the molecules performs best. The developed model is also shown to be competitive with other solutions posted on the leaderboards of the Kaggle challenge.
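As a rough illustration of what gradient boosting on a tabular representation means, the following is a minimal sketch in plain Python: each molecule is a row of numeric descriptors, and depth-1 regression trees are fitted in sequence to the residuals of the current prediction. It is not the model from the article. The descriptor columns, the numeric values, the hyperparameters, and all function names are assumptions made up for this example.

```python
# Toy gradient boosting regressor with depth-1 trees (stumps) and squared loss.
# All feature values and targets are synthetic placeholders, NOT data from the article.

def fit_stump(X, residuals):
    """Return the (feature, threshold, left value, right value) split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:
        raise ValueError("no feature has more than one distinct value")
    return best[1:]

def predict_stump(stump, row):
    j, t, lm, rm = stump
    return lm if row[j] <= t else rm

def fit_boosting(X, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean target, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row) for p, row in zip(preds, X)]
    return base, stumps, learning_rate

def predict(model, row):
    base, stumps, lr = model
    return base + lr * sum(predict_stump(s, row) for s in stumps)

def mse(model, X, y):
    return sum((predict(model, r) - t) ** 2 for r, t in zip(X, y)) / len(y)

# Hypothetical descriptor columns: molecular weight, a logP-like value, H-bond donor count.
X = [
    [180.2, 1.2, 1], [151.2, 0.5, 2], [206.3, 3.5, 1], [300.4, 4.8, 0],
    [46.1, -0.3, 1], [94.1, 1.5, 1], [228.3, 3.3, 2], [354.5, 6.9, 0],
]
# Hypothetical log-solubility targets, one per row of X.
y = [-1.7, -1.0, -3.9, -5.5, 1.1, 0.0, -3.2, -7.2]

model = fit_boosting(X, y)
baseline = (sum(y) / len(y), [], 0.1)  # mean-only predictor, no trees
print("baseline training MSE:", mse(baseline, X, y))
print("boosted training MSE:", mse(model, X, y))
```

The sketch only reports training error on eight made-up rows, so it says nothing about generalization. A real comparison would need descriptors computed from the molecules and a held-out test set, as described in the abstract.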

General Information

Keywords: machine learning, drug solubility, graph neural networks, gradient boosting, convolutional neural networks

Journal rubric: Data Analysis

Article type: scientific article

DOI: https://doi.org/10.17759/mda.2025150103

Received: 19.02.2025

Accepted:

For citation: Ereshchenko A.V. Applying machine learning for solubility prediction: comparing different representations of molecular data. Modelirovanie i analiz dannikh = Modelling and Data Analysis, 2025. Vol. 15, no. 1, pp. 35–50. DOI: 10.17759/mda.2025150103. (In Russ., abstr. in Engl.)

Information About the Authors

Alexey V. Ereshchenko, PhD student, FRC «Computer Science and Control» RAS, Moscow, Russian Federation, e-mail: ereshchenko.alexey@yandex.com
