AS-Index: A Structure for String Search Using n-Grams and Algebraic Signatures
Author affiliations: LIP6 Laboratory, University Pierre et Marie Curie, Paris 75005, France; CEDRIC Laboratory, Conservatoire National des Arts et Metiers, Paris 75003, France; LAMSADE Laboratory, University Paris-Dauphine, Paris 75775, France; DICC Laboratory, Universidad Catolica del Uruguay, Montevideo 11200, Uruguay
Publication: Journal of Computer Science & Technology (计算机科学技术学报, English edition)
Year, volume, issue: 2016, Vol. 31, No. 1
Pages: 147-166
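Subject classification: 07 [Science]; 081203 [Engineering: Computer Application Technology]; 08 [Engineering]; 070104 [Science: Applied Mathematics]; 0835 [Engineering: Software Engineering]; 0701 [Science: Mathematics]; 0812 [Engineering: Computer Science and Technology (degrees in Engineering or Science may be conferred)]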
Funding: Supported by the Advanced European Research Council Grant Webdam
Keywords: full text indexing; large-scale indexing; algebraic signature
Abstract: We present the AS-Index, a new index structure for exact string search in disk-resident databases. The AS-Index relies on a classical inverted file structure; its main innovation is a probabilistic search based on the properties of algebraic signatures, which are used for both n-gram hashing and pattern search. Specifically, the properties of our signatures allow a search to be carried out by inspecting only two of the posting lists. The algorithm thus enjoys the unique feature of requiring a constant number of disk accesses, independent of both the pattern size and the database size. We conduct extensive experiments on large datasets to evaluate the behavior of our index. They confirm that it steadily provides a search performance proportional to the two disk accesses necessary to obtain the posting lists. This makes our structure a choice of interest for the class of applications that require very fast lookups in large textual databases. We describe the index structure, our use of algebraic signatures, and the search algorithm. We discuss the operational trade-offs based on the parameters that affect the behavior of our structure, and present the theoretical and experimental performance analysis. We then compare the AS-Index with state-of-the-art alternatives and show that 1) its construction time matches that of its competitors, due to the similarity of the structures, and 2) its search time consistently outperforms that of the standard approach, thanks to the economical access to data complemented by signature calculations, which is at the core of our search method.
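The abstract gives no implementation details, so the following is only an illustrative sketch of how a search over an n-gram inverted file could read just two posting lists and decide matches with algebraic signatures. Everything in it is an assumption made for illustration, not the authors' design: the field GF(2^8) with polynomial 0x11D, the one-byte signature, the in-memory dictionary keyed by the raw n-gram (the paper hashes n-grams with signatures), and the function names `gf_mul`, `signature`, `build_index`, and `search`. Each posting stores a position together with the cumulative signature of the text before that position, so the signature of any substring can be derived from two postings without reading the text.

```python
from collections import defaultdict

POLY = 0x11D   # x^8 + x^4 + x^3 + x^2 + 1 (assumed field polynomial)
ALPHA = 2      # assumed primitive element of GF(2^8)
ORDER = 255    # ALPHA**255 == 1 in GF(2^8)

def gf_mul(a, b):
    """Multiply two elements of GF(2^8) (shift-and-XOR with reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return result

# Precompute ALPHA**i for i in [0, ORDER).
ALPHA_POW = [1] * ORDER
for i in range(1, ORDER):
    ALPHA_POW[i] = gf_mul(ALPHA_POW[i - 1], ALPHA)

def signature(data):
    """Algebraic signature: XOR of data[i] * ALPHA**i over GF(2^8)."""
    sig = 0
    for i, byte in enumerate(data):
        sig ^= gf_mul(byte, ALPHA_POW[i % ORDER])
    return sig

def build_index(text, n):
    """Map each n-gram to postings (position, signature of text[:position])."""
    index = defaultdict(list)
    cumulative = 0  # signature of text[:pos] at the top of each iteration
    for pos in range(len(text)):
        if pos + n <= len(text):
            index[text[pos:pos + n]].append((pos, cumulative))
        cumulative ^= gf_mul(text[pos], ALPHA_POW[pos % ORDER])
    return index

def search(index, pattern, n):
    """Return candidate start positions, reading only two posting lists."""
    m = len(pattern)
    if m < n:
        raise ValueError("pattern must be at least n bytes long")
    first = index.get(pattern[:n], [])
    last = dict(index.get(pattern[m - n:], []))  # position -> cumulative sig
    middle_sig = signature(pattern[:m - n])
    hits = []
    for pos, cum_first in first:
        cum_last = last.get(pos + m - n)
        if cum_last is None:
            continue  # last n-gram does not occur at the required offset
        # C[pos + m - n] XOR C[pos] equals ALPHA**pos * sig(text[pos:pos + m - n])
        if cum_first ^ cum_last == gf_mul(middle_sig, ALPHA_POW[pos % ORDER]):
            hits.append(pos)
    return hits
```

With `index = build_index(b"abracadabra abracadabra", 3)`, a call such as `search(index, b"cadab", 3)` returns every true occurrence of the pattern. Because this sketch uses a one-byte signature, it may also return occasional false positives; a wider signature would make them rarer, which is the sense in which such a search is probabilistic.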