Abstract

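With the rapid development of Internet technology, the number of Web pages on the Internet is growing exponentially. Web text classification has become a key technology for processing and managing this massive volume of data automatically and effectively, replacing inefficient and tedious manual management. Research in this area has made great progress and produced a series of classification methods; well-known examples include support vector machines (SVM), k-nearest neighbors (KNN), neural networks, and Bayesian algorithms. Of these, KNN is widely used because it is simple, effective, and nonparametric. However, KNN has a number of shortcomings, the two most critical being that it runs too slowly and that its classification accuracy is not high enough.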

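This work analyzes the causes of these defects in the KNN algorithm and improves on them in three ways. For feature extraction, it introduces an improved CHI-based method that makes feature selection more reasonable. For classifier speed, it borrows the idea of the Rocchio algorithm along with a few other simple techniques, which greatly increases classification speed. For classifier accuracy, it modifies the similarity computation with an improvement based on attribute entropy and an improvement based on class-weighted KNN, which substantially raises classification accuracy. Building on these improvements, an efficient and practical website text classification system was constructed.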
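For orientation, the textbook CHI (chi-square) statistic that CHI-based feature selection starts from can be sketched as below. The abstract does not describe the improved variant, so this shows only the standard formula; the function name `chi_square` and its count parameters are illustrative assumptions, not the system's actual code.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Textbook chi-square score of a term with respect to a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate contingency table carries no evidence either way.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom
```

Terms with the highest scores for a category would be kept as features; a term distributed evenly across categories scores zero.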

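This thesis implements the classifier system and evaluates its performance using the content of 3578 real websites as the test set. Analysis of the experimental results shows that, on this test set, the proposed KNN classifier achieves high-speed classification with accuracy far above that of the traditional methods. As a website text classifier, the proposed efficient KNN algorithm is faster than both the original KNN method and the weighted KNN method, and it is more accurate than both.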

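Keywords: efficient website text classification; improved feature extraction; fast classification; high-accuracy classification; attribute entropy analysis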

Abstract

With the rapid development of the Internet, the number of web pages is growing exponentially. Web text classification has become a key technology for organizing and processing this massive volume of data automatically and effectively, replacing manual management, which is inefficient and cumbersome. Research in this area has made great progress and produced a series of classification methods; well-known examples include support vector machines (SVM), K-nearest neighbor (KNN), neural networks, and Bayesian algorithms. Of these, the KNN method is widely used because it is simple, effective, and nonparametric. However, the traditional KNN method has two critical flaws: it runs too slowly, and its classification accuracy is not sufficiently high.

This paper analyzes the causes of these defects in the KNN method and makes major improvements. In the feature extraction module, we introduce an improved method based on CHI, which makes feature extraction more reasonable. To improve classifier speed, we adopt the idea of the Rocchio method along with some other simple techniques, which greatly increases classification speed. To improve classifier accuracy, we introduce an improved similarity calculation based on attribute entropy, together with ideas from the class-weighted KNN method, which significantly raises classification accuracy.
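One common way to combine the Rocchio idea with KNN is to use class centroids as a cheap prefilter, so that the expensive neighbor search only covers documents from the most promising classes. The sketch below illustrates that reading under stated assumptions: the abstract does not give the actual procedure, and the names `cosine`, `centroids`, `classify`, and the parameters `k` and `top_classes` are hypothetical. It uses plain cosine similarity and unweighted voting, not the entropy-based similarity or class weighting described above.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def centroids(train):
    """Map each label to the mean vector of its training documents.

    train: non-empty list of (vector, label) pairs.
    """
    groups = defaultdict(list)
    for vec, label in train:
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def classify(query, train, cents, k=3, top_classes=2):
    # Step 1 (Rocchio-style): keep only the classes whose centroid
    # is most similar to the query.
    ranked = sorted(cents, key=lambda c: cosine(query, cents[c]), reverse=True)
    keep = set(ranked[:top_classes])
    # Step 2 (KNN): majority vote among the k most similar training
    # documents that belong to the retained classes.
    cands = [(cosine(query, vec), label) for vec, label in train if label in keep]
    cands.sort(key=lambda t: t[0], reverse=True)
    votes = Counter(label for _, label in cands[:k])
    return votes.most_common(1)[0][0]


# Toy two-dimensional example with made-up vectors and labels.
train = [
    ([1.0, 0.0], "sports"), ([0.9, 0.1], "sports"),
    ([0.0, 1.0], "tech"), ([0.1, 0.9], "tech"),
    ([0.5, 0.5], "news"),
]
cents = centroids(train)
label = classify([1.0, 0.05], train, cents)
```

The speedup in this scheme comes from comparing the query against a handful of centroids first, instead of against every training document.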

Based on these improvements, we built an efficient and practical web classification system. We completed the implementation of the new classification system and used the content of 3578 real websites as a test set to evaluate its performance. Analysis of the experimental results shows that, on our test set, the improved classifier achieves high-speed classification and a much higher accuracy rate than the traditional KNN method. The web classifier based on the improved KNN method proposed in this paper is much faster and much more accurate than both the original KNN method and the weighted KNN method.
