String similarity join with different similarity thresholds based on novel indexing techniques
String similarity join with different similarity thresholds based on novel indexing techniques作者机构:School of Computer Science & Software Engineering Tianjin Polytechnic University Tianjin 300387 China School of Mathematical & Natural Sciences Arizona State University Tempe AZ 85281 USA
出 版 物:《Frontiers of Computer Science》 (中国计算机科学前沿(英文版))
年 卷 期:2017年第11卷第2期
页 面:307-319页
核心收录:
学科分类:12[管理学] 1201[管理学-管理科学与工程(可授管理学、工学学位)] 08[工学] 0814[工学-土木工程] 082301[工学-道路与铁道工程] 0823[工学-交通运输工程]
主 题:similarity join similarity aware index similarity thresholds
摘 要:String similarity join is an essential operation of many applications that need to find all similar string pairs from two given collections. A quantitative way to determine whether two strings are similar is to compute their similarity based on a certain similarity function. The string pairs with similarity above a certain threshold are regarded as results. The current approach to solving the similarity join problem is to use a unique threshold value. There are, however, several scenarios that require the support of multiple thresholds, for instance, when the dataset includes strings of various lengths. In this scenario, longer string pairs typically tolerate much more typos than shorter ones. Therefore, we proposed a so- lution for string similarity joins that supports different simi- larity thresholds in a single operator. In order to support dif- ferent thresholds, we devised two novel indexing techniques: partition based indexing and similarity aware indexing. To utilize the new indices and improve the join performance, we proposed new filtering methods and index probing tech- niques. To the best of our knowledge, this is the first work that addresses this problem. Experimental results on real-world datasets show that our solution performs efficiently while pro- viding a more flexible threshold specification.