摘 要
随着经济社会的快速发展,电影作为精神文化产品,得到越来越多人的青睐,人们对电影的评价页也参差不齐,在海量的资源中如何尽快找到符合个人品味的电影,成为观众新的问题。基于Python的数据爬虫技术是目前使用最广泛的方法之一,它能够以最快捷的方式展示用户体验数据,帮助观众进行影片选择。中国影业是著名的电影网站,通过中国影业提供的开放接口大规模地获取电影相关数据。
本毕业设计用Python的Scrapy框架编写爬虫程序抓取了中国影业的影片榜单信息,爬取电影的短评、评分、评价数量等数据,并结合Python的多个库(Pandas、Numpy、Matplotlib),使用Numpy系统存储和处理大型数据,中文Jieba分词工具进行爬取数据的分词文本处理,wordcloud库处理数据关键词,最终通过词云图、网页动态图展示观众情感倾向和影片评分统计等信息。网络信息资源充盈的今天,网络信息的获取工作十分重要,该毕业设计的意义在于为用户观影提供决策支持。
关键词:Python;电影;数据;分析
Abstract
With rapid economic and social development, film, as a cultural product, is enjoyed by more and more people. Viewers' reviews of films vary widely in quality, and quickly finding films that match one's personal taste among a vast number of titles has become a new problem for audiences. Python-based web crawling is currently one of the most widely used approaches to collecting such data; it can present user feedback data quickly and help audiences choose films. Douban Film is a well-known film website, and film-related data can be obtained at scale through the open interface it provides.
This graduation project uses Python's Scrapy framework to write a crawler that scrapes the site's film ranking lists, collecting each film's short reviews, ratings, and number of ratings. It combines several Python libraries (Pandas, NumPy, Matplotlib): NumPy is used to store and process large data sets, the Chinese word segmentation tool Jieba segments the scraped review text, and the wordcloud library processes the resulting keywords. Finally, word clouds and dynamic web charts present audience sentiment and film rating statistics. With online information so abundant today, acquiring it effectively is very important; the significance of this project lies in providing decision support for viewers choosing what to watch.
Keywords: Python; film; data; analysis
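The analysis stage described above can be illustrated with a minimal sketch. It assumes the reviews have already been scraped and segmented into tokens (the project uses Scrapy and Jieba for those steps); the sample records, field names, and helper functions are hypothetical, and the standard library stands in for Pandas and NumPy so the example stays self-contained.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample records; in the project these would come from the
# crawler, with "tokens" produced by a word segmentation step.
reviews = [
    {"film": "Film A", "score": 9.0, "tokens": ["moving", "story", "great", "acting"]},
    {"film": "Film A", "score": 8.0, "tokens": ["great", "soundtrack"]},
    {"film": "Film B", "score": 6.0, "tokens": ["slow", "story"]},
]

# Assumed stop-word list; a real one would be much longer.
STOPWORDS = {"the", "a", "of"}


def average_scores(records):
    """Return the mean score for each film."""
    scores_by_film = {}
    for record in records:
        scores_by_film.setdefault(record["film"], []).append(record["score"])
    return {film: mean(scores) for film, scores in scores_by_film.items()}


def keyword_frequencies(records, top_n=3):
    """Count tokens across all reviews, skipping stop words."""
    counts = Counter(
        token
        for record in records
        for token in record["tokens"]
        if token not in STOPWORDS
    )
    return counts.most_common(top_n)


if __name__ == "__main__":
    print(average_scores(reviews))
    print(keyword_frequencies(reviews))
```

The keyword counts are the kind of input a word cloud is drawn from, and the per-film averages are the kind of figure a rating chart would plot.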