1. Environment Setup
Install Python:
Make sure a Python 3.x environment is available.
Install the required libraries:
Install `jieba` (for word segmentation) and `matplotlib` (for visualization) via `pip`.
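Assuming `pip` is on the PATH, both libraries can be installed in one command; the optional `wordcloud` library used in the visualization section later can be added the same way:

```shell
pip install jieba matplotlib
# Optional, only needed for the word-cloud example:
pip install wordcloud
```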
2. Basic Method: Counting with a Dictionary
Read the text:
Use the `open()` function to read a Chinese text file.
Segment the text:
Use `jieba.lcut()` for precise-mode segmentation.
Count word frequencies:
Record each word's occurrence count in a dictionary, then sort and print the results.
Example code:
```python
import jieba

# Read the text file
with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Segment the text (precise mode)
words = jieba.lcut(text)

# Drop single-character tokens
words = [word for word in words if len(word) > 1]

# Count frequencies with a plain dictionary
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Sort by frequency and print the top 10
for word, freq in sorted(counts.items(), key=lambda item: item[1], reverse=True)[:10]:
    print(f"{word}: {freq}")
```
3. Advanced Method: Simplifying the Count with `Counter`
`Counter`, from Python's standard-library `collections` module, handles counting and ranking in one step.
Example code:
```python
import jieba
from collections import Counter

with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

words = jieba.lcut(text)
filtered_words = [word for word in words if len(word) > 1]

# Counter replaces the manual dictionary loop from the previous section
top_words = Counter(filtered_words).most_common(10)
for word, freq in top_words:
    print(f"{word}: {freq}")
```
4. Extension: Removing Stop Words
Stop words (such as "的", "了", and "在") distort the results; filter them out with a custom stop-word list.
Example code:
```python
import jieba
from collections import Counter

# Load a custom stop-word list (one word per line)
with open('stopwords.txt', 'r', encoding='utf-8') as file:
    stop_words = set(file.read().split())

with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

words = jieba.lcut(text)
filtered_words = [word for word in words if len(word) > 1 and word not in stop_words]

top_words = Counter(filtered_words).most_common(10)
for word, freq in top_words:
    print(f"{word}: {freq}")
```
5. Visualizing the Results (Optional)
Render the most frequent words as a word cloud with the `wordcloud` library and display it with `matplotlib`.
Example code:
```python
import jieba
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # requires: pip install wordcloud

with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

words = jieba.lcut(text)
filtered_words = [word for word in words if len(word) > 1]
counter = Counter(filtered_words)

# Keep only the 50 most frequent words for a readable cloud
top_words = dict(counter.most_common(50))

# Note: for Chinese text, pass font_path pointing to a font that supports
# CJK characters, e.g. WordCloud(font_path='simhei.ttf', ...), otherwise
# the words render as empty boxes.
wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate_from_frequencies(top_words)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```
6. Notes
Encoding:
Make sure text files are saved as `utf-8` to avoid garbled characters.
Choosing a segmentation mode:
`jieba` supports precise mode (`cut_all=False`, the default) and full mode (`cut_all=True`); pick whichever fits your needs.
Performance:
For large files, process the text as a stream (e.g. with a generator) or in parallel to improve throughput.
With the methods above, you can flexibly count high-frequency Chinese words and extend or tune the pipeline as needed.