PPS Sampling of Web Graph Using Preferential Jumping Strategy
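Author affiliation: Web Mining Lab, Dept. of Media and Communication, City University of Hong Kong
Conference: 2010 IEEE 2nd Symposium on Web Society
Conference date: 2010
Subject classification: 08 [Engineering]; 080402 [Engineering: Measurement Technology and Instruments]; 0804 [Engineering: Instrument Science and Technology]
Keywords: Web sampling; Web sample; random walk; jumping strategy; PPS sampling; evaluation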
Abstract: Sampling is the most powerful tool for researchers to study important characteristics of the continuously growing *** Web page sampling problem, we collect a number of pages which are representative of the Web population. However, we believe Web sampling greatly differs from generic sampling *** of all, the randomness principle cannot be applied to Web sampling mechanically; secondly, randomness on the page level should not be the only goal of Web *** believe that there is still room to improve the randomness goal, and that, beyond pursuing randomness on the page level, new objectives should be set for the host and domain levels. In our work, we designed a new Web sampling method, called Probability Proportional to the Size of Websites (PPSW for short) *** certain preliminary experiments and analysis, we concluded that no former sampling methods took into account the host and domain levels of the *** we seek new Web sampling methods that can yield samples that are representative on the host and domain levels. With regard to the new objective, we redesigned the jumping strategy of the random walk while *** preferential jumping strategy markedly increased the validity of the random walk on the host and domain *** particularly, random-walk-based sampling methods have two configurations: whether the random walk has a random jump probability, and whether the random walk is conducted on an undirected Web graph with the help of search *** these two configurations, together with our newly designed preferential jumping strategy, we conducted four kinds of new sampling *** the four groups of experiments, the directed one with random jump showed great performance improvement. For evaluating our new PPSW sampling methods, we put forward new objectives, along with corresponding *** first two are coverage *** speaking, the number of domains is several orders of magnitude smaller than the number of Web pa
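
Illustrative sketch (not part of the original record): the abstract describes a random walk on the directed Web graph whose jump step is preferential rather than uniform, but it does not give the exact rule. The Python below is a minimal sketch under one assumed reading, namely that a jump picks a host with probability proportional to its size and then a page uniformly within that host. The function name `pps_random_walk`, the `jump_prob` default of 0.15, and the `graph`, `host_of`, and `host_size` inputs are all hypothetical and are not taken from the paper.

```python
import random


def pps_random_walk(graph, host_of, host_size, steps, jump_prob=0.15, seed=0):
    """Random walk on a directed graph with size-proportional jumps.

    graph:     dict mapping each page to a list of pages it links to
    host_of:   dict mapping each page to its host name
    host_size: dict mapping each host to its size (assumed known or estimated)
    Returns the list of visited pages, including the starting page.
    """
    rng = random.Random(seed)

    # Group the known pages by host so a jump can land inside a chosen host.
    pages_by_host = {}
    for page in sorted(graph):
        pages_by_host.setdefault(host_of[page], []).append(page)

    hosts = sorted(pages_by_host)
    weights = [host_size[h] for h in hosts]

    def jump():
        # Assumed preferential rule: host chosen proportionally to its size,
        # then a page chosen uniformly among that host's known pages.
        host = rng.choices(hosts, weights=weights, k=1)[0]
        return rng.choice(pages_by_host[host])

    current = jump()
    visited = [current]
    for _ in range(steps):
        out_links = graph[current]
        # Jump on dead ends, or with probability jump_prob; otherwise follow a link.
        if not out_links or rng.random() < jump_prob:
            current = jump()
        else:
            current = rng.choice(out_links)
        visited.append(current)
    return visited
```

Calling `pps_random_walk(graph, host_of, host_size, steps=1000)` on a small hand-built graph returns a list of `steps + 1` pages. Counting the hosts of the visited pages gives a rough view of how the assumed jump rule shifts the sample toward larger sites.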