Application of Algorithm CARDBK in Document Clustering
Author affiliations: College of Economics and Management, Xi'an University of Posts and Telecommunications, Xi'an 710121, Shaanxi, China; Information Business Department, Puyang Technician College, Puyang 457000, Henan, China
Publication: Wuhan University Journal of Natural Sciences (武汉大学学报(自然科学英文版))
Year, volume, issue: 2018, Vol. 23, No. 6
Pages: 514-524
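Subject classification: 08 [Engineering]; 081203 [Engineering: Computer Application Technology]; 0812 [Engineering: Computer Science and Technology (engineering or science degrees may be conferred)]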
Funding: Supported by the Social Science Foundation of Shaanxi Province of China (2018P03) and the Humanities and Social Sciences Research Youth Fund Project of the Ministry of Education of China (13YJCZH251)
Keywords: algorithm design and analysis; clustering; document analysis; text processing
Abstract: In the K-means clustering algorithm, each data point is assigned to exactly one cluster. Clustering quality depends heavily on the initial cluster centroids: different initializations can yield different results, and local adjustment cannot rescue the result from a poor local optimum. An outlier in a cluster seriously distorts the cluster mean, and K-means is only suitable for clusters with convex shapes. We therefore propose a novel clustering algorithm, CARDBK. "CARD" stands for "centroid all rank distance", meaning that all centroids are sorted by their distance from a point, and "BK" stands for "batch K-means". In CARDBK, a point modifies not only the cluster centroid nearest to it but also multiple adjacent cluster centroids, and the degree of influence of a point on a cluster centroid depends on the distances between that point and the other, nearer cluster centroids. Experimental results showed that CARDBK outperformed other algorithms on a number of data sets according to the following performance indexes: entropy, purity, F1 value, Rand index, and normalized mutual information (NMI). The algorithm also proved to be more stable, linearly scalable, and faster.
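The abstract's central idea, that a point updates several nearby centroids rather than only the nearest one, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `rank_weighted_batch_step`, the parameter `m` (how many nearest centroids a point influences), and the normalized inverse-distance weighting are all assumptions chosen for illustration, since the abstract does not give the exact weighting formula.

```python
import math


def rank_weighted_batch_step(points, centroids, m=2, eps=1e-12):
    """One batch update in which each point influences its m nearest centroids.

    Assumption (not taken from the paper): a point's weight on each of its
    m nearest centroids is proportional to the inverse of its distance to
    that centroid, normalized so that the point's weights sum to 1.
    """
    k = len(centroids)
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]   # weighted coordinate sums
    totals = [0.0] * k                        # total weight per centroid

    for p in points:
        # Rank all centroids by their distance from this point.
        dists = [math.dist(p, c) for c in centroids]
        ranked = sorted(range(k), key=lambda j: dists[j])[:m]

        # Hypothetical weighting: normalized inverse distance.
        inv = [1.0 / (dists[j] + eps) for j in ranked]
        norm = sum(inv)
        for j, w in zip(ranked, inv):
            weight = w / norm
            totals[j] += weight
            for d in range(dim):
                sums[j][d] += weight * p[d]

    # A centroid that received no weight keeps its previous position.
    return [
        [s / totals[j] for s in sums[j]] if totals[j] > 0 else list(centroids[j])
        for j in range(k)
    ]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]
hard = rank_weighted_batch_step(points, centroids, m=1)
soft = rank_weighted_batch_step(points, centroids, m=2)
```

With `m=1` the step reduces to an ordinary batch K-means update, because each point contributes its full weight to its single nearest centroid. With `m=2` every point also pulls its second-nearest centroid slightly, so in this toy data the two centroids move a little toward each other.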