PathMarker: protecting web contents against inside crawlers
Author Affiliations: Department of Computer Science, College of William and Mary, Williamsburg, VA 23187-8795, USA; Department of Information Sciences and Technology, George Mason University, Fairfax, VA 22030, USA
Publication: Cybersecurity
Year/Volume/Issue: 2019, Vol. 2, No. 1
Pages: 100-116
Subject Classification: 08 [Engineering]; 0812 [Engineering - Computer Science and Technology (degrees may be conferred in Engineering or Science)]
Keywords: Anti-crawler mechanism; Stealthy distributed inside crawler; Confidential website content protection
Abstract: Web crawlers have been misused for several malicious purposes such as downloading server data without permission from the website administrator. Moreover, armoured crawlers are evolving against new anti-crawler mechanisms in the arms race between crawler developers and crawler defenders. In this paper, based on the observation that normal users and malicious crawlers have different short-term and long-term download behaviours, we develop a new anti-crawler mechanism called PathMarker to detect and constrain persistent distributed crawlers. By adding a marker to each Uniform Resource Locator (URL), we can trace the page that leads to the access of this URL and the identity of the user who accesses this URL. With this supporting information, we can not only perform more accurate heuristic detection using path-related features, but also develop a Support Vector Machine based machine learning detection model to distinguish malicious crawlers from normal users by inspecting their different patterns of URL visiting paths and URL visiting timing. In addition to effectively detecting crawlers at the earliest stage, PathMarker can dramatically suppress the scraping efficiency of crawlers before they are detected. We deploy our approach on an online forum website, and the evaluation results show that PathMarker can quickly capture all 6 open-source and in-house crawlers, plus two external crawlers (i.e., Googlebots and Yahoo Slurp).
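The abstract only sketches the marker mechanism. As a rough, non-authoritative illustration of the idea (not the authors' implementation), the following Python snippet appends a server-signed marker carrying the parent page and the user identity to each outgoing URL, and verifies and decodes it when the URL is requested back. All names here (add_marker, read_marker, SECRET_KEY, the pm/sig query parameters) are hypothetical.

    # Illustrative sketch only: hypothetical helpers, not the paper's exact scheme.
    # The marker binds each link to the page it was embedded in and to the session
    # of the user it was served to, protected by a server-side HMAC.
    import base64
    import hashlib
    import hmac
    import json
    from urllib.parse import urlencode, urlparse, parse_qs

    SECRET_KEY = b"server-side-secret"   # assumption: kept private on the server

    def add_marker(url: str, parent_page: str, user_id: str) -> str:
        """Append a PathMarker-style marker to a URL before embedding it in a page."""
        payload = json.dumps({"parent": parent_page, "user": user_id}).encode()
        tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()[:16]
        marker = base64.urlsafe_b64encode(payload).decode()
        sep = "&" if urlparse(url).query else "?"
        return f"{url}{sep}{urlencode({'pm': marker, 'sig': tag})}"

    def read_marker(url: str):
        """Verify and decode the marker when the URL is requested back."""
        qs = parse_qs(urlparse(url).query)
        if "pm" not in qs or "sig" not in qs:
            return None                  # unmarked URL: suspicious by itself
        payload = base64.urlsafe_b64decode(qs["pm"][0])
        expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()[:16]
        if not hmac.compare_digest(expected, qs["sig"][0]):
            return None                  # forged or stripped marker
        return json.loads(payload)

    # Example: mark a link served on /forum/index to user u42, then recover the path info.
    marked = add_marker("https://example.com/forum/thread/123", "/forum/index", "u42")
    print(read_marker(marked))           # {'parent': '/forum/index', 'user': 'u42'}

In a deployment along these lines, the decoded (parent page, user) records, combined with server-side request timestamps, would reconstruct each visitor's URL visiting path and timing for the heuristic rules and the SVM-based classifier described in the abstract, while a missing or forged marker is itself a strong crawler signal.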