Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration
Author affiliations: Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, P.R. China; Shangqiu Normal University, School of Information Technology, Shangqiu 476000, Henan, P.R. China; College of Computer, National University of Defense Technology, Changsha 410005, Hunan, P.R. China; Department of Computer Science, Hunan University, Changsha 410082, Hunan, P.R. China; Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR 999077, P.R. China; Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P.R. China
Publication: Research
Year, volume, issue: 2023, Vol. 2022, No. 3
Pages: 157-170
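Subject classification: 12 [Management]; 1201 [Management: Management Science and Engineering (degrees in Management or Engineering may be conferred)]; 081104 [Engineering: Pattern Recognition and Intelligent Systems]; 08 [Engineering]; 0835 [Engineering: Software Engineering]; 0811 [Engineering: Control Science and Engineering]; 0812 [Engineering: Computer Science and Technology (degrees in Engineering or Science may be conferred)]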
Funding: the National Key Research and Development Program of China (2021YFF1201400); the National Natural Science Foundation of China (U1811462 and 22173118); the Hunan Provincial Science Fund for Distinguished Young Scholars (2021J10068); the Science and Technology Innovation Program of Hunan Province (2021RC4011); the Project of Intelligent Management Software for Multimodal Medical Big Data for New Generation Information Technology, Ministry of Industry and Information Technology of People's Republic of China (TC210804V); the Changsha Municipal Natural Science Foundation (kq2014144); the Changsha Science and Technology Bureau project (kq2001034); the HKBU Strategic Development Fund project (SDF19-0402-P02)
Subject terms: simplified; generalization; tuning
Abstract: Accurate prediction of pharmacological properties of small molecules is becoming increasingly important in drug discovery. Traditional feature-engineering approaches heavily rely on handcrafted descriptors and/or fingerprints, which need extensive human expert knowledge. With the rapid progress of artificial intelligence technology, data-driven deep learning methods have shown unparalleled advantages over feature-engineering-based methods. However, existing deep learning methods usually suffer from the scarcity of labeled data and the inability to share information between different tasks when applied to predicting molecular properties, thus resulting in poor generalization capability. Here, we proposed a novel multitask learning BERT (Bidirectional Encoder Representations from Transformers) framework, named MTL-BERT, which leverages large-scale pretraining, multitask learning, and SMILES (simplified molecular input line entry specification) enumeration to alleviate the data scarcity problem. MTL-BERT first exploits a large amount of unlabeled data through self-supervised pretraining to mine the rich contextual information in SMILES strings and then fine-tunes the pretrained model for multiple downstream tasks simultaneously by leveraging their shared information. In addition, SMILES enumeration is used as a data augmentation strategy during the pretraining, fine-tuning, and test phases to substantially increase data diversity and help the model learn the key relevant patterns from complex SMILES strings. The experimental results showed that the pretrained MTL-BERT model, with little additional fine-tuning, can achieve much better performance than the state-of-the-art methods on most of the 60 practical molecular datasets. Furthermore, the MTL-BERT model leverages attention mechanisms to focus on the SMILES character features essential to target properties, which supports model interpretability.
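Illustrative sketch (not taken from the paper): the abstract describes self-supervised pretraining on unlabeled SMILES strings with a BERT model. BERT-style pretraining is normally a masked-token objective, so the step can be pictured roughly as below. The tokenizer pattern, the `[MASK]` placeholder, the 15% ratio, and the function names `tokenize` and `mask_tokens` are all assumptions made for illustration; the abstract does not specify them.

```python
import random
import re

# Hypothetical tokenizer: bracket atoms, two-letter halogens, two-digit ring
# closures, then any single character. Not the paper's actual vocabulary.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

MASK = "[MASK]"  # assumed placeholder token


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, ratio=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token a model would be trained to recover.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_masked = max(1, round(len(tokens) * ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets


tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
masked, targets = mask_tokens(tokens)
```

With SMILES enumeration, the same molecule would be presented under several equivalent SMILES spellings, and each spelling would be masked independently. Generating those spellings needs a cheminformatics toolkit and is left out of this sketch.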