1. Environment Setup
Install Python:
Make sure a Python 3.x environment is available.
Install the required libraries:
Install `jieba` (for word segmentation) and `matplotlib` (for visualization) via `pip`.
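Assuming `pip` is on the PATH, both libraries can be installed in one command; the optional `wordcloud` library used in the visualization section later can be added the same way:

```shell
pip install jieba matplotlib
# Optional, only needed for the word-cloud example:
pip install wordcloud
```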
2. Basic Method: Counting with a Dictionary
Read the text:
Use the `open()` function to read a Chinese text file.
Segment the text:
Use `jieba.lcut()` for precise-mode segmentation.
Count word frequencies:
Record each word's occurrence count in a dictionary, then sort and print the results.
Example code:
```python
import jieba

# Read the text file
with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Segment the text (precise mode)
words = jieba.lcut(text)

# Drop single-character tokens
words = [word for word in words if len(word) > 1]

# Count frequencies with a plain dictionary
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Sort by frequency and print the top 10
for word, freq in sorted(counts.items(), key=lambda item: item[1], reverse=True)[:10]:
    print(f"{word}: {freq}")
```
3. Advanced Method: Simplifying the Count with `Counter`
`Counter`, from Python's standard-library `collections` module, handles counting and ranking in one step.
Example code:
```python
import jieba
from collections import Counter

with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

words = jieba.lcut(text)
filtered_words = [word for word in words if len(word) > 1]

# Counter replaces the manual dictionary loop from the previous section
top_words = Counter(filtered_words).most_common(10)
for word, freq in top_words:
    print(f"{word}: {freq}")
```
4. Extension: Removing Stop Words
Stop words (such as "的", "了", and "在") distort the results; filter them out with a custom stop-word list.
Example code:
```python
import jieba
from collections import Counter

# Load a custom stop-word list (one word per line)
with open('stopwords.txt', 'r', encoding='utf-8') as file:
    stop_words = set(file.read().split())

with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

words = jieba.lcut(text)
filtered_words = [word for word in words if len(word) > 1 and word not in stop_words]

top_words = Counter(filtered_words).most_common(10)
for word, freq in top_words:
    print(f"{word}: {freq}")
```
5. Visualizing the Results (Optional)
Render the most frequent words as a word cloud with the `wordcloud` library and display it with `matplotlib`.
Example code:
```python
import jieba
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud  # requires: pip install wordcloud

with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

words = jieba.lcut(text)
filtered_words = [word for word in words if len(word) > 1]
counter = Counter(filtered_words)

# Keep only the 50 most frequent words for a readable cloud
top_words = dict(counter.most_common(50))

# Note: for Chinese text, pass font_path pointing to a font that supports
# CJK characters, e.g. WordCloud(font_path='simhei.ttf', ...), otherwise
# the words render as empty boxes.
wordcloud = WordCloud(width=800, height=400,
                      background_color='white').generate_from_frequencies(top_words)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```
6. Notes
Encoding:
Make sure text files are saved as `utf-8` to avoid garbled characters.
Choosing a segmentation mode:
`jieba` supports precise mode (`cut_all=False`, the default) and full mode (`cut_all=True`); pick whichever fits your needs.
Performance:
For large files, process the text as a stream (e.g. with a generator) or in parallel to improve throughput.
With the methods above, you can flexibly count high-frequency Chinese words and extend or tune the pipeline as needed.