Robots Exclusion and Guidance Protocol
Robots Exclusion and Guidance Protocol作者机构:Department of Computer Science and Technology Tongji University
出 版 物:《Tsinghua Science and Technology》 (清华大学学报(自然科学版(英文版))
年 卷 期:2016年第21卷第6期
页 面:643-659页
核心收录:
学科分类:08[工学] 080402[工学-测试计量技术及仪器] 0804[工学-仪器科学与技术]
基 金:partially supported by the National Natural Science Foundation of China(Nos.61672381 and90818023)
主 题:deep web Ajax crawler protocol
摘 要:With the rapid development of the Internet, general-purpose web crawlers have increasingly become unable to meet people's individual needs as they are no longer efficient enough to fetch deep web pages. The presence of several deep web pages in the websites and the widespread use of Ajax make it difficult for generalpurpose web crawlers to fetch information quickly and efficiently. On the basis of the original Robots Exclusion Protocol(REP), a Robots Exclusion and Guidance Protocol(REGP) is proposed in this paper, by integrating the independent scattered expansions of the original Robots Protocol developed by major search engine *** protocol expands the file format and command set of the REP as well as two labels of the Sitemap *** our protocol, websites can express their aspects of requirements for restrictions and guidance to the visiting crawlers, and provide a general-purpose fast access of deep web pages and Ajax pages for the crawlers,and facilitates crawlers to easily obtain the open data on websites effectively with ease. Finally, this paper presents a specific application scenario, in which both a website and a crawler work with support from our protocol. A series of experiments are also conducted to demonstrate the efficiency of the proposed protocol.