Frequency and Similarity-Aware Partitioning for Cloud Storage Based on Space-Time Utility Maximization Model
Author affiliations: Department of Computer Science and Technology, University of Science and Technology Beijing; Department of Computer and Information Sciences, Temple University
Publication: Tsinghua Science and Technology (Journal of Tsinghua University, Natural Science Edition, English edition)
Year, volume, issue: 2015, Vol. 20, No. 3
Pages: 233-245
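Subject classification: 0601 [History - Archaeology] 060107 [History - Cultural Heritage and Museology] 06 [History] 060207 [History - Specialized History] 07 [Science] 08 [Engineering] 083304 [Engineering - History of Urban and Rural Development and Heritage Conservation Planning] 081201 [Engineering - Computer Architecture] 0712 [Science - History of Science and Technology (by discipline; degrees may be awarded in science, engineering, agriculture, or medicine)] 0602 [History - Chinese History] 0833 [Engineering - Urban and Rural Planning] 0812 [Engineering - Computer Science and Technology (degrees may be awarded in engineering or science)]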
Keywords: de-duplication; cloud storage; redundancy; frequency
Abstract: With the rise of cloud services, redundant data has become an increasingly prominent problem in cloud storage systems. Assigning a set of documents to a distributed file system so that storage space is reduced while access efficiency is preserved as far as possible is an urgent problem. Space efficiency relies mainly on data de-duplication technologies, while access efficiency requires gathering highly similar files on the same server. Building on a study of existing data de-duplication technologies, in particular the Similarity-Aware Partitioning (SAP) algorithm, this paper proposes the Frequency and Similarity-Aware Partitioning (FSAP) algorithm for cloud storage. FSAP is a more reasonable data partitioning algorithm than SAP. The paper also proposes the Space-Time Utility Maximization Model (STUMM), which helps balance space efficiency against access efficiency. Finally, tests on 100 web files downloaded from CNN show that, compared with the SAP-based algorithms (including SAP-Space-Delta and SAP-Space-Dedup), the STUMM-based FSAP algorithm achieves a higher compression ratio and a more balanced distribution of data blocks.
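The abstract does not spell out how FSAP or STUMM work, so the following is only a rough illustration of the trade-off it describes, not the paper's algorithm. It assumes fixed-size chunking, a greedy placement rule (put each file on the server that already holds the most of its chunks), and a per-server file limit as a crude stand-in for balance. All function names and parameters here are hypothetical.

```python
from __future__ import annotations
import hashlib


def chunk_hashes(data: bytes, size: int = 8) -> set[str]:
    """Split data into fixed-size chunks and return the set of chunk digests."""
    return {hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two chunk sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def greedy_partition(files: dict[str, set], n_servers: int, capacity: int) -> list[dict]:
    """Assumed heuristic: place each file on the non-full server that shares
    the most chunks with it, breaking ties in favor of the emptier server."""
    if n_servers * capacity < len(files):
        raise ValueError("not enough total capacity for all files")
    servers = [{"files": [], "chunks": set()} for _ in range(n_servers)]
    for name, chunks in files.items():
        open_servers = [s for s in servers if len(s["files"]) < capacity]
        best = max(open_servers,
                   key=lambda s: (len(s["chunks"] & chunks), -len(s["files"])))
        best["files"].append(name)
        best["chunks"] |= chunks  # de-duplication: shared chunks are stored once
    return servers


def compression_ratio(files: dict[str, set], servers: list[dict]) -> float:
    """Chunks referenced by all files divided by chunks actually stored."""
    logical = sum(len(c) for c in files.values())
    stored = sum(len(s["chunks"]) for s in servers)
    return logical / stored
```

In this sketch, similar files end up on the same server, so their shared chunks are stored once, while the capacity limit keeps any one server from absorbing everything. The sketch ignores access frequency entirely, which is the factor the paper's title says FSAP adds on top of similarity.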