摘要
HBase是基于分布式文件系统Hadoop的开源数据库,能够支持海量数据的实时读写。为了支持快速的数据写入,HBase并没有提供对多维索引的支持。然而,许多HBase的应用场景中都需要使用多维索引,例如电商、物联网、交通等领域。因此,许多研究致力于为HBase多维索引提供解决方案。其中,基于线性化技术的全局索引方案首先将多维数据线性化为一维序列,然后在一维序列上建立索引结构,取得了不错的插入、查询效率。然而,这种方案在确保数据并发性和一致性的实现上较为复杂,且大多数方案的查询效率和使用的线性化技术仍有改进空间。
本文首先从理论上改进线性化技术。空间填充曲线是一种常见的线性化技术,局部性、聚集度是衡量空间填充曲线“好坏”的两类理论指标。在众多的空间填充曲线中,Z曲线由于编码简单、支持二分划分裁剪多维空间而得到广泛应用,然而其局部性、聚集度较差。本文使用Z曲线等支持二分划分的空间填充曲线混合理论性质优良的其他空间填充曲线,提出一种新的二分混合空间填充曲线,不仅能够支持二分划分裁剪多维数据空间,并且拥有更好的局部性、聚集度。在实验条件下,使用Z曲线混合Hilbert曲线的Blend_Z_Hilbert曲线在局部性指标上较Z曲线提升20%以上,在聚集度指标上提升50%以上。
接着,基于二分混合空间填充曲线降维后的一维数据,提出了一种基于流式批处理计算的HBase多维索引结构SFI-HBase。在数据插入方面,SFI-HBase使用Spark Streaming对插入全量索引层的数据提前进行聚合、存储,提供了对并发插入的支持,实现了良好的插入效率。在数据查询方面,SFI-HBase的全量索引层存储不同粒度的索引信息,能够根据范围、KNN查询条件选择索引粒度,进而提高查询效率。在实验对比方面,与多种HBase多维索引解决方案进行了插入和查询实验对比,结果表明本文提出的索引结构具有综合最优的时空效率。与线性化技术经典解决方案MD-HBase相比,SFI-HBase的范围查询效率提高10%左右,KNN查询效率提高5倍以上,插入效率提高十倍以上。
最后,为了验证SFI-HBase在真实场景中的可用性,基于开源网约车数据集实现了一个网约车离线分析平台。该平台能够自动的将插入到MySQL的数据通过采集、计算的方式自动同步到SFI-HBase的存储层和全量索引层,并能够基于时间、经纬度等多维条件实现对网约车订单数据的高效查询。
关键词:HBase,二分混合空间填充曲线,多维索引,流式计算
Research on HBase multi-dimensional Index construction and query optimization based on binary blend Space Filling Curve
Master Candidate: Hu Zhao
(Software Engineering)
Directed by Sun Qiao, Wang Xinyang and Sun YueRan
ABSTRACT
HBase is an open-source database based on the distributed file system Hadoop, capable of supporting real-time read and write operations for massive data. To facilitate rapid data insertion, HBase does not offer support for multi-dimensional indexing. However, many application scenarios of HBase require the use of multi-dimensional indexing, such as in the fields of e-commerce, IoT (Internet of Things), transportation, etc. Thus, numerous studies are dedicated to providing solutions for multi-dimensional indexing in HBase. Among them, the global indexing scheme based on linearization techniques first linearizes multi-dimensional data into a one-dimensional sequence, and then builds an indexing structure on this one-dimensional sequence, achieving commendable insertion and query efficiency. However, this approach tends to be complex in ensuring data concurrency and consistency, and the query efficiency and employed linearization techniques of most schemes still have room for improvement.
This paper first theoretically improves linearization techniques. Space-filling curves are a common type of linearization technique, where locality and clustering degree are two theoretical indicators to measure the "quality" of space-filling curves. Among various space-filling curves, the Z-curve, due to its simple encoding and support for binary partition trimming of multi-dimensional spaces, is widely applied. However, its locality and clustering degree are relatively poor. This paper mixes the Z-curve, which supports binary partitioning, with other space-filling curves with excellent theoretical properties, proposing a new binary-mixed space-filling curve that not only supports binary partition trimming of multi-dimensional data spaces but also possesses better locality and clustering degree. Under experimental conditions, the Blend_Z_Hilbert curve, which mixes the Z-curve with the Hilbert curve, improves over 20% in locality and over 50% in clustering degree compared to the Z-curve alone.
Following this, based on the one-dimensional data after dimensionality reduction using the binary-mixed space-filling curve, a multi-dimensional index structure for HBase named SFI-HBase, based on stream-batch processing computation, is proposed. For data insertion, SFI-HBase uses Spark Streaming to aggregate and store the data intended for the full index layer in advance, providing support for concurrent insertions and achieving high insertion efficiency. For data querying, the full index layer of SFI-HBase stores different granularities of index information, enabling the selection of index granularity based on range and KNN query conditions, thereby improving query efficiency. In experimental comparisons with various HBase multi-dimensional indexing solutions, the results demonstrate that the index structure proposed in this paper has the best spatial-temporal efficiency. Compared to the classic linearization solution MD-HBase, the range query efficiency of SFI-HBase improves by about 10%, and the KNN query efficiency increases by more than five times, with over ten times improvement in insertion efficiency.
Lastly, to validate the usability of SFI-HBase in real scenarios, an offline analysis platform for ride-hailing services based on an open-source dataset has been implemented. This platform can automatically synchronize the data inserted into MySQL to the storage layer and full index layer of SFI-HBase through collection and computation, and efficiently query the ride-hailing order data based on multi-dimensional conditions such as time and latitude/longitude.
Keywords: HBase, binary mixed space filling curve, multidimensional index, stream computing