Abstract
With the rapid development of the Internet, data is being generated and replicated at an astonishing rate, demanding ever more storage capacity, processing power, and network bandwidth. Stored data often contains a large amount of redundancy, which not only occupies considerable storage space but also reduces storage efficiency. Data deduplication technology addresses these problems: it optimizes storage and reduces the waste of physical storage space. However, deduplication also introduces additional overhead, degrading the performance of storing and reading data; moreover, as the data volume grows, fingerprint lookup slows down, which in turn slows file storage. Deduplication can be performed at different granularities: block-based, content-based, and sliding-window-based.
This project designs a Hadoop cluster and draws on related knowledge, including building a pseudo-distributed Hadoop platform, the Hadoop Distributed File System (HDFS), virtual disk mapping, MapReduce, the NoSQL database HBase, the distributed data warehouse Hive, and the in-memory computing framework Spark. At the core of the deduplication system are HDFS and MapReduce: using the MapReduce distributed computing framework together with HDFS, a suitable platform is built through repeated experiments and refinement, and different deduplication methods are tested to compare their strengths and weaknesses. On this basis, the system testing work is completed.
Keywords: Hadoop; data deduplication; MapReduce; distributed systems; HDFS
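To make the MapReduce-based approach summarized above concrete, the listing below is a minimal, illustrative sketch of fingerprint deduplication as a Hadoop job, assuming each input record is one data chunk (here, a text line) and SHA-1 as the fingerprint function; the class names DedupJob, FingerprintMapper, and DedupReducer are hypothetical and are not taken from this thesis. The mapper emits (fingerprint, chunk) pairs and the reducer writes one chunk per fingerprint, discarding the duplicates.

    // Illustrative sketch only: fingerprint-based deduplication on Hadoop MapReduce.
    // Assumes one chunk per input record (a text line) and SHA-1 fingerprints.
    import java.security.MessageDigest;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DedupJob {

        public static class FingerprintMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text chunk, Context ctx)
                    throws java.io.IOException, InterruptedException {
                try {
                    // The SHA-1 digest of the chunk serves as its fingerprint.
                    MessageDigest md = MessageDigest.getInstance("SHA-1");
                    byte[] digest = md.digest(chunk.copyBytes());
                    StringBuilder hex = new StringBuilder();
                    for (byte b : digest) hex.append(String.format("%02x", b));
                    ctx.write(new Text(hex.toString()), chunk);
                } catch (java.security.NoSuchAlgorithmException e) {
                    throw new java.io.IOException(e);
                }
            }
        }

        public static class DedupReducer
                extends Reducer<Text, Text, Text, NullWritable> {
            @Override
            protected void reduce(Text fingerprint, Iterable<Text> chunks, Context ctx)
                    throws java.io.IOException, InterruptedException {
                // Chunks sharing a fingerprint are identical (barring hash
                // collisions), so writing only the first one removes duplicates.
                ctx.write(chunks.iterator().next(), NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "dedup");
            job.setJarByClass(DedupJob.class);
            job.setMapperClass(FingerprintMapper.class);
            job.setReducerClass(DedupReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Because MapReduce groups all values with the same key at a single reducer, identical chunks from anywhere in the input meet in one place, which is what makes deduplication straightforward to express in this model.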
Contents
Abstract
1 Introduction
1.1 Background and Significance
1.1.1 Background
1.1.2 Significance
1.2 Research Status and Trends
1.3 Main Research Content
2 Architecture of the Deduplication Storage System
2.1 Basic Concepts of Data Deduplication
2.2 Deduplication System Structure and Basic Principles
2.3 Key Deduplication Techniques
2.3.1 Data Chunking Methods
2.3.2 I/O Optimization Techniques
2.3.3 High-Reliability Data Placement Strategies
2.3.4 System Scalability
2.4 Hadoop System Architecture
2.4.1 Hadoop Ecosystem and Overview
2.4.2 HDFS (Hadoop Distributed File System)
2.4.3 MapReduce (Distributed Computing Framework)
3 Implementation of the Deduplication System
3.1 Implementation of the MapReduce Computing Framework
3.2 Implementation of the Fingerprint Computation Module
3.2.1 Building the Fingerprint Index Table and Fingerprint Lookup
3.2.2 Implementation of the Bloom Filter Algorithm
3.3 Processing Flow Analysis
3.3.1 Read Flow Analysis
3.3.2 Write Flow Analysis
4 System Testing and Analysis
4.1 Test Environment
4.2 Test Results and Analysis
4.2.1 System Performance Testing and Comparative Analysis
4.2.2 Deduplication Ratio Testing
4.2.3 Effect of Lookup Filtering Performance Optimization
5 Conclusion and Outlook
5.1 Conclusion
5.2 Outlook
Acknowledgements
References