朴素贝叶斯分类器对含暴力网络评论的甄别
【摘要】近年来,随着互联网技术的快速发展,互联网已经变得非常流行。大众可以通过互联网掌握即时资讯,对国家大事、热点新闻进行评论,发表自己的观点。然而,由于网民素质参差不齐,网络暴力事件时常发生,会给当事人带来负面影响。对这些评论进行鉴别能够有效的减少网络暴力,建立一个清净、安全的网络环境。一般的网络暴力甄别技术是基于暴力词所出现的频率,然而随着网络语言的流行以及网络评论数据的暴增,传统的网络暴力甄别技术需要花费大量的搜索时间,并且可达到的精度也不是很高,经常出现误判。由此,本文提出了一种基于朴素贝叶斯的网络暴力分类器,给出了建立模型的所有步骤,包括原始数据的收集,文本的预处理:标点符号的删除,表情符号的删除,分词处理,生成词汇表,转换成词向量的稀疏表示以及分类器的构建与预测。最后根据所建立的模型得出了用于分类的词向量,实验结果表明基于朴素贝叶斯的网络暴力分类器可以在大大缩短分类所需要的时间的同时具有令人信服的准确率。
【关键词】网络暴力,朴素贝叶斯,稀疏表示,准确率
Identifying Violent Online Comments with a Naive Bayes Classifier
【Abstract】With the rapid development of Internet technology in recent years, Internet use has become widespread. People can follow breaking news online, comment on national affairs and trending stories, and express their own views. However, because the conduct of Internet users varies widely, incidents of online violence occur frequently and harm the people they target. Identifying violent comments can effectively reduce online violence and help build a clean, safe online environment. Conventional detection techniques rely on the frequency with which violent words appear. With the spread of Internet slang and the explosive growth of comment data, these techniques require long search times, achieve limited accuracy, and often misclassify comments. This paper therefore proposes a Naive Bayes classifier for online violence and describes every step of building the model: collecting the raw data; preprocessing the text (removing punctuation, removing emoticons, and segmenting words); generating the vocabulary; converting comments into sparse word-vector representations; and constructing the classifier and making predictions. Finally, the word vectors used for classification are derived from the resulting model. Experimental results show that the Naive Bayes classifier greatly reduces the time required for classification while achieving convincing accuracy.
【Key Words】online violence, Naive Bayes, sparse representation, accuracy
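
The pipeline summarized in the abstract (preprocessing, vocabulary generation, sparse word vectors, classifier construction, and prediction) can be illustrated with a minimal sketch. It is not the paper's implementation. Several things are assumptions made only for illustration: the four toy English comments and their labels, whitespace splitting in place of a real Chinese word segmenter, and the choice of a multinomial Naive Bayes model with Laplace smoothing, since the abstract does not say which variant is used. The names `preprocess`, `build_vocabulary`, `to_sparse_vector`, and `NaiveBayes` are hypothetical.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Lowercase, replace punctuation and emoji with spaces, split on whitespace.

    Assumption: whitespace splitting stands in for Chinese word segmentation.
    """
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return cleaned.split()


def build_vocabulary(token_lists):
    """Assign each distinct token a stable integer index."""
    words = sorted({tok for toks in token_lists for tok in toks})
    return {tok: idx for idx, tok in enumerate(words)}


def to_sparse_vector(tokens, vocab):
    """Return {word_index: count}, storing only words that actually occur."""
    return dict(Counter(vocab[t] for t in tokens if t in vocab))


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, vectors, labels, vocab_size):
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            docs = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(docs) / len(labels))
            counts = Counter()
            for v in docs:
                counts.update(v)  # adds the per-word counts of each comment
            total = sum(counts.values()) + vocab_size
            self.log_likelihood[c] = [
                math.log((counts[i] + 1) / total) for i in range(vocab_size)
            ]
        return self

    def predict(self, vector):
        """Return the class with the highest log posterior score."""
        scores = {
            c: self.log_prior[c]
            + sum(n * self.log_likelihood[c][i] for i, n in vector.items())
            for c in self.classes
        }
        return max(scores, key=scores.get)


# Toy training set (invented for illustration): 1 = violent, 0 = non-violent.
train = [
    ("you are an idiot, shut up!!!", 1),
    ("I will hurt you 😡", 1),
    ("thanks for sharing this news", 0),
    ("great report, very helpful 👍", 0),
]
token_lists = [preprocess(text) for text, _ in train]
labels = [label for _, label in train]
vocab = build_vocabulary(token_lists)
vectors = [to_sparse_vector(toks, vocab) for toks in token_lists]
model = NaiveBayes().fit(vectors, labels, len(vocab))


def classify(text):
    """Run a raw comment through the full pipeline."""
    return model.predict(to_sparse_vector(preprocess(text), vocab))
```

Calling `classify("shut up you idiot")` scores the comment against both classes and returns the more probable label. Because each comment is stored as a dictionary of nonzero counts, scoring cost grows with the length of the comment and not with the size of the vocabulary, which is one plausible reading of the speed advantage the abstract claims.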