Crowd-Guided Entity Matching with Consolidated Textual Data

Authors: Zhi-Xu Li, Qiang Yang, An Liu, Guan-Feng Liu, Jia Zhu, Jia-Jie Xu, Kai Zheng, Min Zhang

Affiliations: School of Computer Science and Technology, Soochow University, Suzhou 215006, China; Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China; School of Computer Science and Technology, Soochow University, Suzhou 215006, China; School of Computer, South China Normal University, Guangzhou 510631, China; School of Computer Science and Technology, Soochow University, Suzhou 215006, China; Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing 100872, China

Publication: Journal of Computer Science & Technology (计算机科学技术学报, English edition)

Year, volume, issue: 2017, Vol. 32, No. 5

Pages: 858-876

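Subject classification: 12 [Management]; 081603 [Engineering: Cartography and Geographic Information Engineering]; 081802 [Engineering: Geo-exploration and Information Technology]; 0808 [Engineering: Electrical Engineering]; 07 [Science]; 08 [Engineering]; 070503 [Science: Cartography and Geographic Information Systems]; 1201 [Management: Management Science and Engineering (degrees in Management or Engineering may be awarded)]; 0818 [Engineering: Geological Resources and Geological Engineering]; 0705 [Science: Geography]; 0816 [Engineering: Surveying and Mapping Science and Technology]; 0835 [Engineering: Software Engineering]; 0701 [Science: Mathematics]; 0811 [Engineering: Control Science and Engineering]; 0812 [Engineering: Computer Science and Technology (degrees in Engineering or Science may be awarded)]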

Funding: the Open Foundation of Guangdong Key Laboratory of Big Data Analysis and Processing of China; the National Postdoctoral Funding of China; the National Natural Science Foundation of China; the Postdoctoral Scientific Research Funding of Jiangsu Province of China

Keywords: entity matching; consolidated textual data; crowdsourcing

Abstract: Entity matching (EM) identifies records referring to the same entity within or across databases. Existing methods that rely on structured attribute values (such as digital, date or short string values) may fail when the structured information is insufficient to reflect the matching relationships between records. More and more databases now have unstructured textual attributes containing extra consolidated textual information (CText) about each record, but little work has been done on using CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between pieces of CText, since each piece contains hundreds or thousands of words, and existing topic models do not work well either, since there are no obvious gaps between topics in CText. In this paper, we propose a novel co-occurrence-based topic model to identify various sub-topics from each piece of CText, and then measure the similarity between CText on the multiple sub-topic dimensions. To avoid ignoring hidden but important sub-topics, we let the crowd help decide the weights of different sub-topics in EM. Our empirical study on two real-world datasets, based on the Amazon Mechanical Turk crowdsourcing platform, shows that our method outperforms state-of-the-art EM methods and text understanding models.
