Transfer synthetic over-sampling for class-imbalance learning with limited minority class data
Author affiliations: School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education, Nanjing 210096, China; Collaborative Innovation Center for Wireless Communications Technology, Nanjing 210096, China
Publication: Frontiers of Computer Science
Year/Volume/Issue: 2019, Vol. 13, No. 5
Pages: 996-1009
Subject classification: 12 [Management] 1201 [Management - Management Science and Engineering (Management or Engineering degree)] 08 [Engineering]
Funding: The authors wish to thank the associate editor and anonymous reviewers for their helpful comments and suggestions. This work was supported by the National Key R&D Program of China (2017YFB1002801), the National Natural Science Foundation of China (Grant Nos. 61473087, 61573104), and the Natural Science Foundation of Jiangsu Province (BK20141340), and was partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Keywords: machine learning; data mining; class imbalance; over-sampling; boosting; transfer learning
Abstract: The problem of limited minority class data arises in many class-imbalanced applications, but has received little attention. Synthetic over-sampling, a popular family of class-imbalance learning methods, can introduce considerable noise when the minority class has limited data, because the synthetic samples are not i.i.d. samples of the minority class. Most sophisticated synthetic sampling methods tackle this problem by denoising or by generating samples more consistent with the ground-truth data distribution, but their assumptions about the true noise or the ground-truth distribution may not hold. To adapt synthetic sampling to the problem of limited minority class data, the proposed Traso framework treats synthetic minority class samples as an additional data source and exploits transfer learning to transfer knowledge from them to the minority class. As an implementation, the TrasoBoost method first generates synthetic samples to balance the class sizes. Then, in each boosting iteration, the weights of synthetic samples decrease and the weights of original data increase when they are misclassified, and remain unchanged otherwise. Misclassified synthetic samples are potential noise and thus have less influence in subsequent iterations. In addition, the weights of minority class instances change more than those of majority class instances, making them more influential, and only the original data are used to estimate the error rate, so that the estimate is immune to noise. Finally, since the synthetic samples are highly related to the minority class, all of the weak learners are aggregated for prediction. Experimental results show that TrasoBoost outperforms many popular class-imbalance learning methods.
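The abstract describes the boosting weight-update scheme only verbally. Below is a minimal Python sketch of that scheme, assuming a TrAdaBoost-style shrink factor for misclassified synthetic (source) samples, a fixed factor of 2 for the stronger adjustment of minority-class weights, and pre-generated synthetic minority data passed in as X_syn/y_syn; the function name trasoboost_sketch and all parameter choices are illustrative assumptions, not the authors' implementation.

# Illustrative sketch of the TrasoBoost-style weight updates described in the
# abstract; the exact update factors are assumptions made for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def trasoboost_sketch(X_orig, y_orig, X_syn, y_syn, n_rounds=20, minority_label=1):
    """Boost weak learners on original + synthetic data (binary labels 0/1 assumed).

    Synthetic minority samples act as an auxiliary (source) domain: when
    misclassified, their weights shrink (likely noise); misclassified original
    samples gain weight, minority-class ones more strongly than majority-class ones.
    """
    X = np.vstack([X_orig, X_syn])
    y = np.concatenate([y_orig, y_syn])
    n_o, n_s = len(y_orig), len(y_syn)
    is_orig = np.arange(len(y)) < n_o
    is_min = (y == minority_label)

    w = np.ones(len(y)) / len(y)
    learners, alphas = [], []
    # Source-domain shrink factor in the spirit of TrAdaBoost (assumed form).
    beta_src = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_s) / n_rounds))

    for _ in range(n_rounds):
        clf = DecisionTreeClassifier(max_depth=1)
        clf.fit(X, y, sample_weight=w)
        miss = (clf.predict(X) != y)

        # Error rate estimated on the original data only, to stay immune
        # to noise in the synthetic samples.
        err = np.sum(w[is_orig & miss]) / np.sum(w[is_orig])
        err = np.clip(err, 1e-10, 0.499)
        alpha = 0.5 * np.log((1.0 - err) / err)

        # Misclassified original samples: weight increases, minority class
        # more strongly than majority class (factor 2 is an assumption).
        boost = np.where(is_min, 2.0 * alpha, alpha)
        w[is_orig & miss] *= np.exp(boost[is_orig & miss])
        # Misclassified synthetic samples are potential noise: weight decreases.
        w[~is_orig & miss] *= beta_src
        # Correctly classified samples keep their weights; renormalize.
        w /= w.sum()

        learners.append(clf)
        alphas.append(alpha)

    def predict(X_new):
        # All weak learners are aggregated, since the synthetic samples are
        # highly related to the minority class.
        votes = sum(a * np.where(c.predict(X_new) == minority_label, 1, -1)
                    for a, c in zip(alphas, learners))
        return np.where(votes >= 0, minority_label, 1 - minority_label)

    return predict

In this sketch the synthetic samples (e.g., produced by SMOTE-like over-sampling before boosting) never dominate the error estimate, which matches the abstract's claim that only original data are used to compute the error rate.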