Environment
Python 2.7 + PyCharm, on Windows
The review data has already been scraped with Python
jieba word segmentation
jieba ("Jieba", literally "stutter") is one of the most widely used Chinese word segmentation components.
It supports three segmentation modes:
- Precise mode: tries to cut the sentence as accurately as possible; suitable for text analysis.
- Full mode: scans out every sequence in the sentence that can form a word; very fast, but it cannot resolve ambiguity.
- Search engine mode: on top of precise mode, long words are cut again to improve recall; suitable for search engine segmentation.
Since we are doing text analysis, I use the default precise mode:
words_ls = jieba.cut(content)
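For intuition, here is a minimal sketch comparing the three modes on a short sample sentence (the sentence is my own illustration, not part of the scraped review data):

# -*- coding: utf-8 -*-
import jieba

sentence = u"我来到北京清华大学"
print "/".join(jieba.cut(sentence))                # precise mode (the default)
print "/".join(jieba.cut(sentence, cut_all=True))  # full mode
print "/".join(jieba.cut_for_search(sentence))     # search engine mode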
Word frequency statistics
There are many ways to count word frequencies, for example jieba's extract_tags, textrank, collections, nltk, pandas, WordCloud, and so on.
Below is a demonstration of several of these methods.
Notes on the frequency methods used in the code:
- jieba extract_tags extracts keywords; each result is a (word, float) pair
- jieba textrank extracts keywords for a summary; each result is a (word, float) pair
- collections yields (word, int) pairs
- jieba & nltk yields (word, int) pairs
Code
# -*- coding: utf-8 -*-
import re
# stdout encoding configuration (Python 2 only)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import uniout  # makes unicode inside lists/dicts print readably
# NLP libraries
import jieba
import jieba.analyse
import nltk
import collections

# Custom dictionary: each line of jieba_dict.txt is "word [frequency] [POS tag]",
# so domain words such as 小布 are not split apart.
jieba.load_userdict("jieba_dict.txt")

def main():
    comments = ["好大一只!",
                "给母亲买的,希望关键时刻,小布可以给独居的老人可以和我们儿女联系上,所以大家都很期待,今天货一到,我立马送去母亲家安装。特把体验发给大家分享:外形和尺寸感觉不错,语音识别度不高,自己找不到回去充电的路径,拍五下后求救功能竟然没用,机器人显示求救信号发出,我手机没有收到任何信号,感觉非常失望",
                "外观很漂亮,京东配送也很好。只是小布语音识别率太低了,而且孩子学习,游戏内容太少,一句话感觉是买了个初级产品。已退货。",
                "很好的小布"]
    topK = 20
    FILE_NAME = 'analysis.txt'

    # dump the raw comments into one file for analysis
    analysis_file = open(FILE_NAME, 'w')
    for each in comments:
        analysis_file.write(each.strip())
    analysis_file.close()
    # analysis
    content = open(FILE_NAME, 'rb').read()
    content = content.decode("utf8")
    # strip Chinese and English punctuation from the text
    content = re.sub("[\s+\.\!\:\/_,$%^*(+\"\']+|[+——!,。??、~@#¥%……&*()]+".decode("utf8"), "".decode("utf8"), content)
    # 1. --- analysis with jieba extract_tags (TF-IDF keywords), results are (word, float)
    data = jieba.analyse.extract_tags(content, topK=topK, withWeight=True)
    with open("result_jieba_tags.txt", 'w') as fw:
        for k, v in data:
            fw.write("%s,%f\n" % (k.encode('utf-8'), v))

    # 2. --- analysis with jieba textrank, results are (word, float)
    data = jieba.analyse.textrank(content, topK=topK, withWeight=True, allowPOS=('ns', 'n', 'vn', 'v'))
    with open("result_jieba_rank.txt", 'w') as fw:
        for k, v in data:
            fw.write("%s,%f\n" % (k.encode('utf-8'), v))
    # 3. --- analysis with jieba & collections, results are (word, int)
    words_ls = jieba.cut(content)
    data = dict(collections.Counter(words_ls))
    sort_dict = sorted(data.iteritems(), key=lambda d: d[1], reverse=True)
    with open("result_collect.txt", 'w') as fw:
        for k, v in sort_dict:
            fw.write("%s,%d\n" % (k.encode('utf-8'), v))
    # 4. --- analysis with jieba & nltk, results are (word, int)
    words_ls = jieba.cut(content)
    fd = nltk.FreqDist(words_ls)
    sort_dict = sorted(fd.iteritems(), key=lambda d: d[1], reverse=True)
    with open("result_nltk.txt", 'w') as fw:
        for k, v in sort_dict:
            fw.write("%s,%d\n" % (k.encode('utf-8'), v))

if __name__ == '__main__':
    main()
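WordCloud was listed above among the possible methods but is not demonstrated in the code. As a rough sketch of how the same jieba output could drive one (assumptions: a version of the wordcloud package whose generate_from_frequencies accepts a dict, and simhei.ttf as a Chinese-capable font path on Windows):

# -*- coding: utf-8 -*-
import collections
import jieba
from wordcloud import WordCloud

content = open('analysis.txt', 'rb').read().decode('utf8')
freq = collections.Counter(jieba.cut(content))  # word -> count
# font_path must point to a font with Chinese glyphs, or the words render as boxes
wc = WordCloud(font_path='C:/Windows/Fonts/simhei.ttf', width=800, height=600)
wc.generate_from_frequencies(freq)
wc.to_file('result_wordcloud.png')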