Investigating the Relevance of Arabic Text Classification Datasets Based on Supervised Learning
Author affiliation: Computer Science Department, American University of Madaba, Madaba 2882
Publication: Journal of Electronic Science and Technology (电子科技学刊(英文版))
Year, volume, issue: 2022, Vol. 20, No. 2
Pages: 187-208
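Subject classification: 0502 [Literature - Foreign Languages and Literatures]; 12 [Management]; 050208 [Literature - Arabic Language and Literature]; 1201 [Management - Management Science and Engineering (degrees conferrable in Management or Engineering)]; 05 [Literature]; 081203 [Engineering - Computer Application Technology]; 081104 [Engineering - Pattern Recognition and Intelligent Systems]; 08 [Engineering]; 050210 [Literature - Asian and African Languages and Literatures]; 0835 [Engineering - Software Engineering]; 0811 [Engineering - Control Science and Engineering]; 0812 [Engineering - Computer Science and Technology (degrees conferrable in Engineering or Science)]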
Abstract: Training and testing different models in the field of text classification (TC) depend mainly on pre-classified text document datasets. Recently, seven datasets have emerged for Arabic text classification: Single-Label Arabic News Articles Dataset (SANAD), Khaleej, Arabiya, Akhbarona, KALIMAT, Waten2004, and Khaleej2004. This study investigates which of these datasets can provide significant training and fair evaluation for TC. In this investigation, well-known and accurate learning models are used, including naive Bayes (NB), random forest (RF), K-nearest neighbor (KNN), support vector machines (SVM), and logistic regression (LR) models. We present relevance and time measures of training the models with these datasets so that Arabic language researchers can select an appropriate dataset on a solid basis of comparison. The performances of the five learning models across the seven datasets are measured and compared with the performances of the same models trained on a well-known English language dataset. The analysis of the relevance and time scores shows that training the SVM model on Khaleej and Arabiya obtained the most significant results in the shortest amount of time, with an accuracy of 82%.
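
The comparison described in the abstract (several models trained on several datasets, each scored on relevance and on training time) can be pictured with a small harness. The following is only a minimal sketch under stated assumptions, not the authors' implementation: `TinyNaiveBayes`, `benchmark`, and the `toy` data are hypothetical names introduced here for illustration, the classifier is a simplified stand-in for the NB model, accuracy is used as the only relevance measure, and the toy documents are placeholders that do not come from any of the seven datasets.

```python
import math
import time
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, docs):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        predictions = []
        for doc in docs:
            best_label, best_score = None, -math.inf
            for label, n_docs in self.class_counts.items():
                counts = self.word_counts[label]
                total_words = sum(counts.values())
                # log prior + sum of smoothed log likelihoods
                score = math.log(n_docs / total_docs)
                for token in doc.split():
                    score += math.log((counts[token] + 1) / (total_words + vocab_size))
                if score > best_score:
                    best_label, best_score = label, score
            predictions.append(best_label)
        return predictions


def benchmark(models, datasets):
    """Train every model on every dataset; record accuracy and training time."""
    rows = []
    for ds_name, (train_x, train_y, test_x, test_y) in datasets.items():
        for model_name, make_model in models.items():
            model = make_model()
            start = time.perf_counter()
            model.fit(train_x, train_y)
            train_seconds = time.perf_counter() - start
            predicted = model.predict(test_x)
            correct = sum(p == t for p, t in zip(predicted, test_y))
            rows.append({
                "dataset": ds_name,
                "model": model_name,
                "accuracy": correct / len(test_y),
                "train_seconds": train_seconds,
            })
    return rows


# Toy placeholder data (assumption): NOT drawn from any of the seven datasets.
toy = (
    ["goal match team", "team win goal", "market stock price", "price bank market"],
    ["sports", "sports", "economy", "economy"],
    ["team goal", "stock market"],
    ["sports", "economy"],
)
results = benchmark({"NB": TinyNaiveBayes}, {"toy": toy})
```

Each row in `results` pairs one dataset with one model, so sorting the rows by accuracy and then by training time would give the kind of ranking the abstract refers to. Other classifiers could be added to the `models` mapping as long as they expose the same `fit` and `predict` methods.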