How do you count high-frequency Chinese words with Python?

1. Environment Setup

Install Python:

Make sure a Python 3.x environment is installed.

Install the required libraries:

Install `jieba` (for word segmentation) and `matplotlib` (for visualization) via `pip`, as shown below.
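
For example, from a shell (these are the standard PyPI package names; `wordcloud` is only needed if you want the word cloud in section 5):

```bash
pip install jieba matplotlib wordcloud
```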

2. Basic Method: Counting with a Dictionary

Read the text:

Use the `open()` function to read the Chinese text file.

Segment the text:

Use `jieba.lcut()` for precise-mode segmentation.

Count word frequencies:

Record each word's count in a dictionary, then sort and print the results.

Example code

```python
import jieba

# Read the text file
with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Segment the text (precise mode)
words = jieba.lcut(text)

# Drop single-character tokens
words = [word for word in words if len(word) > 1]

# Count frequencies with a plain dictionary
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Sort by frequency and print the top 10 words
top10 = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:10]
for word, freq in top10:
    print(f"{word}: {freq}")
```

3. Advanced Method: Simplifying the Count with `Counter`

`Counter`, from Python's standard-library `collections` module, collapses the dictionary bookkeeping above into a single call.

Example code

```python
import jieba
from collections import Counter

with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Segment and drop single-character tokens
words = jieba.lcut(text)
filtered_words = [word for word in words if len(word) > 1]

# Counter replaces the manual dictionary loop
top_words = Counter(filtered_words).most_common(10)

for word, freq in top_words:
    print(f"{word}: {freq}")
```

4. Extension: Removing Stopwords

Stopwords (such as "的", "了", and "在") distort the counts; filter them out with a custom stopword list.

Example code

```python
import jieba
from collections import Counter

# Load a custom stopword list (one word per line)
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stop_words = set(f.read().split())

with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Keep multi-character tokens that are not stopwords
words = jieba.lcut(text)
filtered_words = [word for word in words
                  if len(word) > 1 and word not in stop_words]

top_words = Counter(filtered_words).most_common(10)
for word, freq in top_words:
    print(f"{word}: {freq}")
```
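
A minimal `stopwords.txt` could contain just the function words mentioned above, one per line (extend it to suit your corpus):

```
的
了
在
是
和
```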

5. Visualizing the Results (Optional)

Plot the frequencies with `matplotlib`, for example as a word cloud (which additionally requires the `wordcloud` library) or as a bar chart (see the sketch after the example).

Example code

```python
import jieba
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud

with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

words = jieba.lcut(text)
filtered_words = [word for word in words if len(word) > 1]
counter = Counter(filtered_words)

# Keep the 50 most frequent words for the cloud
top_words = dict(counter.most_common(50))

# Draw a word cloud (requires the wordcloud library). For Chinese text,
# also pass font_path= pointing at a font that covers Chinese glyphs
# (e.g. 'simhei.ttf'); otherwise the words render as empty boxes.
wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate_from_frequencies(top_words)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```
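
If a simple bar chart is enough, here is a minimal self-contained sketch using only `matplotlib`. The `SimHei` font setting is an assumption so that Chinese tick labels display; substitute any font installed on your system that covers Chinese glyphs:

```python
import jieba
from collections import Counter
import matplotlib.pyplot as plt

# Assumed font: SimHei (swap in any font covering Chinese characters)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

counter = Counter(w for w in jieba.lcut(text) if len(w) > 1)
top10 = counter.most_common(10)

plt.figure(figsize=(10, 5))
plt.bar([w for w, _ in top10], [f for _, f in top10])
plt.title('Top 10 high-frequency words')
plt.ylabel('Count')
plt.show()
```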

6. Notes

Encoding:

Make sure the text file is saved as `utf-8` to avoid garbled characters.

Segmentation mode:

`jieba` offers precise mode (`cut_all=False`, the default) and full mode (`cut_all=True`); choose the one that fits your task (see the comparison sketch after this list).

Performance:

For large files, process the text incrementally with a generator, or in parallel, to keep memory use and run time down (see the line-by-line sketch below).
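
A quick comparison of the two modes (the sample sentence is the one used in jieba's own documentation; the outputs shown are indicative):

```python
import jieba

sentence = '我来到北京清华大学'

# Precise mode (default): non-overlapping tokens, best for frequency counts
print(jieba.lcut(sentence))
# e.g. ['我', '来到', '北京', '清华大学']

# Full mode: every word the dictionary can find, overlaps included
print(jieba.lcut(sentence, cut_all=True))
# e.g. ['我', '来到', '北京', '清华', '清华大学', '华大', '大学']
```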
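
For the incremental approach, a minimal sketch that keeps memory bounded by counting one line at a time (same filtering rule as in the earlier examples):

```python
import jieba
from collections import Counter

# Update the counter line by line instead of reading the whole file
counter = Counter()
with open('text.txt', 'r', encoding='utf-8') as file:
    for line in file:
        counter.update(w for w in jieba.lcut(line) if len(w) > 1)

print(counter.most_common(10))
```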

With the methods above, you can count high-frequency Chinese words flexibly and extend or optimize the pipeline as your needs grow.