Computing tf-idf for GooSeeker Word Segmentation Results in Jupyter Notebook

2022-8-24 16:59 | Published by: Fuller

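[Note] The formula described below is a simplified calculation. Classic tf-idf values cannot be computed from the word frequency table alone. Starting with V6.0.0, the extension module of the GooSeeker word segmentation software provides global tf-idf values and a TF-IDF version of the word-selection matrix. For details, see 《GooSeeker分词软件的tf-idf算法和特征词选择》 (The tf-idf Algorithm and Feature Word Selection in the GooSeeker Word Segmentation Software).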

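1 Introduction

We previously published 《GooSeeker分词和情感分析结果excel表怎样计算tf-idf》 (How to Compute tf-idf from the Excel Tables Exported by GooSeeker Word Segmentation and Sentiment Analysis), which shows how to compute tf-idf with Excel formulas. This Jupyter notebook has the same goal but implements the calculation in Python.

That article explains in detail when tf-idf is needed, so we do not repeat it here. This notebook focuses on comparing how computing tf or tf-idf changes the original data.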

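The GooSeeker word segmentation and sentiment analysis software exports the following tables that contain term frequency and document frequency:

Word frequency table (词频表): the term frequency and document frequency of every word in a dataset
Chosen-word table (选词结果表): if the words to analyze were selected manually, the term frequency and document frequency of those words are stored in this table
Word-selection matrix (选词矩阵表): the term frequencies in the two tables above are global, that is, the number of times a word occurs across all documents, whereas this table shows how many times each word occurs in each document. This table therefore supports further statistical work, such as dimensionality reduction with PCA, document similarity, word-to-word distance, or analyzing topic composition with social network graphs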

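[Note] As explained in 《GooSeeker分词和情感分析结果excel表怎样计算tf-idf》, we use only the term frequency and document frequency from the word frequency table, without normalization. This is a relatively simple formula.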
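As a rough illustration of the simplified formula that the code in section 6 applies, the sketch below scores a single word. The values of num_word and num_doc are the counts reported in this notebook; the term frequency and document frequency are made-up numbers, and `simple_tf_idf` is a hypothetical helper name, not part of the notebook.

```python
import math

def simple_tf_idf(term_freq, doc_freq, num_word, num_doc, base=2):
    """Simplified score: (term_freq / num_word) * log_base(num_doc / (doc_freq + 1))."""
    tf = term_freq / num_word
    idf = math.log(num_doc / (doc_freq + 1), base)
    return tf * idf

# num_word and num_doc as reported in this notebook; term_freq and doc_freq are hypothetical
score = simple_tf_idf(term_freq=500, doc_freq=100, num_word=14217, num_doc=829)
print(score)
```

The `+ 1` in the denominator follows the notebook's code. It keeps the ratio finite for a document frequency of 0, and it also means a word that appears in every document receives a slightly negative idf.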

Below we compute tf or tf-idf for each of these tables.

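2 How to use

The steps for this analysis are:

Create a text analysis task in the GooSeeker word segmentation and text analysis software, import the Excel file containing the content to analyze, and after the analysis finishes export the word frequency table, the chosen-word table and the word-selection matrix
Put the exported Excel files in this notebook's data/raw folder
Run the cells of this notebook from top to bottom

Note: every notebook project published by GooSeeker follows a predefined directory layout; see Jupyter Notebook项目目录规划参考 (Jupyter Notebook project directory layout reference). To start a new analysis project, copy the whole template directory for the new project, then write the ipynb files under the notebook directory.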

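3 Revision history

2022-08-18: first version released

4 Copyright

This notebook was developed by the GooSeeker big data analysis team. The source data it analyzes was generated by the GooSeeker word segmentation and text analysis software. The code in this notebook may be freely shared and used, including forwarding, copying, modifying, and use in other projects.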

5 Preparing the runtime environment

5.1 Importing the required libraries

# -*- coding: utf-8 -*-

import os
import math
import numpy as np
import pandas as pd

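5.2 Constants and configuration

In the series of Jupyter Notebooks we publish, every notebook that processes files exported by the GooSeeker word segmentation software uses fixed variable names for the exported files. For convenience, you only need to put the exported files in the data/raw folder; the notebook finds them and assigns each one to the corresponding file name variable. The file name variables that may be used are:

file_all_word: word frequency table
file_chosen_word: chosen-word table
file_seg_effect: segmentation effect table
file_word_occurrence_matrix: word-selection matrix (occurrence or not)
file_word_frequency_matrix: document-word frequency matrix
file_word_document_match: word-document match table
file_co_word_matrix: co-word matrix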

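pd.set_option('display.width', 1000) # character width of the display
pd.set_option('display.max_rows', None) # no limit on the number of rows displayed
# np.set_printoptions(threshold=np.inf) # threshold: number of elements above which output is abbreviated with an ellipsis; np.inf means no limit

# directory that holds the raw data
raw_data_dir = os.path.join(os.getcwd(), '../../data/raw')
# directory that holds the processed data
processed_data_dir = os.path.join(os.getcwd(), '../../data/processed')
# keywords in exported file names: word frequency, segmentation effect, word-selection matrix, word match, chosen words, co-word matrix
filename_temp = pd.Series(['词频','分词效果','选词矩阵','选词匹配','选词结果','共词矩阵'])
file_all_word = ''
file_seg_effect = ''
file_word_occurrence_matrix = ''
file_word_frequency_matrix = ''  # initialized so the check in 5.3 cannot hit an undefined name
file_word_document_match = ''
file_chosen_word = ''
file_co_word_matrix = ''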
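The joined directory paths above keep their `../..` segments, which is why the path printed in 5.3 looks long. A minimal sketch of how such a path can be normalized follows. The notebook location is a made-up example, and `posixpath` is used here only so that the result is the same on every operating system; inside the notebook, `os.path.normpath` would be the natural counterpart.

```python
import posixpath

# Hypothetical notebook location, used only for illustration
notebook_dir = '/home/user/project/notebook/eda'

joined = posixpath.join(notebook_dir, '../../data/raw')
normalized = posixpath.normpath(joined)

print(joined)      # path still contains the '..' segments
print(normalized)  # '..' segments resolved lexically
```

Normalization is purely lexical: it does not check that the directory exists, so it is safe to call before the data folder has been created.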

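5.3 Checking whether data/raw contains the GooSeeker segmentation result tables to analyze

This notebook uses the word frequency table, the segmentation effect table, the chosen-word table and the word-selection matrix. The code below checks which exported tables are present in data/raw and prints the result. If a required table is reported as missing, the later cells that read it cannot run.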

# filename_temp index: 0 '词频' (word frequency), 1 '分词效果' (segmentation effect), 2 '选词矩阵' (word-selection matrix), 3 '选词匹配' (word match), 4 '选词结果' (chosen words), 5 '共词矩阵' (co-word matrix)
print(raw_data_dir + '\r\n')

for item_filename in os.listdir(raw_data_dir):
    if filename_temp[0] in item_filename:
        file_all_word = item_filename
        continue
    if filename_temp[1] in item_filename:
        file_seg_effect = item_filename
        continue
    if filename_temp[2] in item_filename:
        file_word_frequency_matrix = item_filename
        continue
    if filename_temp[4] in item_filename:
        file_chosen_word = item_filename
        continue
    if filename_temp[5] in item_filename:
        file_co_word_matrix = item_filename
        continue

if file_all_word:
    print("词频表：", "data/raw/", file_all_word)
else:
    print("词频表：不存在")

if file_seg_effect:
    print("分词效果表：", "data/raw/", file_seg_effect)
else:
    print("分词效果表：不存在")

if file_word_frequency_matrix:
    print("选词矩阵表：", "data/raw/", file_word_frequency_matrix)
else:
    print("选词矩阵表：不存在")

if file_chosen_word:
    print("选词结果表：", "data/raw/", file_chosen_word)
else:
    print("选词结果表：不存在")

if file_co_word_matrix:
    print("共词矩阵表：", "data/raw/", file_co_word_matrix)
else:
    print("共词矩阵表：不存在")

Sample output:

/Users/work/workspace/notebook/GooSeeker分词软件导出的选词矩阵和共词矩阵的关系-二舅/notebook/eda/../../data/raw

词频表： data/raw/ 词频表-知乎-二舅.xlsx

分词效果表： data/raw/ 分词效果-知乎-二舅.xlsx

选词矩阵表： data/raw/ 选词矩阵-知乎-二舅.xlsx

选词结果表： data/raw/ 选词结果-知乎-二舅.xlsx

共词矩阵表： data/raw/ 共词矩阵-知乎-二舅.xlsx
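The keyword matching in 5.3 can also be written with a dictionary, and the check can be made to stop early when a table is missing. The sketch below is one possible arrangement, not the notebook's own code: `EXPORT_KEYWORDS`, `detect_exports`, `require_exports` and the dictionary keys are hypothetical names, and the sample file names are taken from the output shown above.

```python
# Keyword found in the exported file name -> short key for the table (keys are hypothetical)
EXPORT_KEYWORDS = {
    '词频': 'all_word',
    '分词效果': 'seg_effect',
    '选词矩阵': 'word_frequency_matrix',
    '选词结果': 'chosen_word',
    '共词矩阵': 'co_word_matrix',
}

def detect_exports(filenames):
    """Map each known table to a file name containing its keyword (the last matching file wins)."""
    found = {}
    for name in filenames:
        for keyword, key in EXPORT_KEYWORDS.items():
            if keyword in name:
                found[key] = name
                break  # one table per file, like the `continue` statements in 5.3
    return found

def require_exports(found, required):
    """Raise early when a table needed by later cells is missing."""
    missing = [key for key in required if key not in found]
    if missing:
        raise FileNotFoundError('missing export tables: ' + ', '.join(missing))

# File names as in the sample output of this notebook
sample = ['词频表-知乎-二舅.xlsx', '选词矩阵-知乎-二舅.xlsx', '选词结果-知乎-二舅.xlsx']
found = detect_exports(sample)
require_exports(found, ['all_word', 'chosen_word'])
print(found)
```

In the notebook, `detect_exports(os.listdir(raw_data_dir))` would play the role of the loop in 5.3. Raising an exception is a design choice: it stops execution at the point where the problem is easiest to understand, instead of at a later `pd.read_excel` call.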

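6 Reading the data tables into matrices

6.1 Word frequency table

The word frequency table contains every word produced by automatic segmentation, but not words that were added manually.

6.1.1 Reading into a Pandas DataFrame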

df_all_word = pd.read_excel(os.path.join(raw_data_dir, file_all_word))
df_all_word.head()

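6.1.2 Counting the total number of words

According to "Get the number of rows in a Pandas DataFrame", the row count of a DataFrame can be obtained with shape[0] or with the len() function. Each row of the word frequency table is one word, so num_word is the number of distinct words.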

#num_word = df_all_word.shape[0]
num_word = len(df_all_word)
num_word

Output: 14217
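A small sketch of the two ways to count rows, using a toy table with made-up values. It also shows that the row count differs from the total number of word occurrences, which is worth keeping in mind because the notebook divides by the row count when it computes TF.

```python
import pandas as pd

# Toy stand-in for the word frequency table; the rows are made up
df_demo = pd.DataFrame({'词频': [5, 3, 1], '文档频率': [2, 2, 1]})

rows_by_len = len(df_demo)                      # number of rows
rows_by_shape = df_demo.shape[0]                # same number, first element of (rows, columns)
total_occurrences = int(df_demo['词频'].sum())  # total word occurrences, a different quantity

print(rows_by_len, rows_by_shape, total_occurrences)
```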

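6.1.3 Counting the total number of documents

The segmentation effect table contains all documents, so we use it to count them: read the table with Pandas, then take the number of rows.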

df_seg_effect = pd.read_excel(os.path.join(raw_data_dir, file_seg_effect))
df_seg_effect.head(2)

num_doc = df_seg_effect.shape[0]
num_doc

Output: 829

6.1.4 Adding the TF column

df_all_word['TF'] = df_all_word['词频'] / num_word
df_all_word.head(5)

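6.1.5 Adding the IDF_2, IDF and IDF_10 columns

These use logarithms with base 2, e and 10 respectively, for comparison.

# math.log2 cannot take the logarithm of a whole column element by element, so use the numpy functions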
#df_all_word['IDF_2'] = math.log2(num_doc / (df_all_word['文档频率'] + 1))
df_all_word['IDF_2'] = np.log2(num_doc / (df_all_word['文档频率'] + 1))
df_all_word['IDF'] = np.log(num_doc / (df_all_word['文档频率'] + 1))
df_all_word['IDF_10'] = np.log10(num_doc / (df_all_word['文档频率'] + 1))
df_all_word.head(5)
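The following sketch illustrates the comment in the cell above, and also how the three IDF columns relate to each other. The document frequencies are illustrative values and the `_demo` names are hypothetical; only the document count 829 comes from this notebook.

```python
import math
import numpy as np
import pandas as pd

num_doc_demo = 829                      # document count reported in this notebook
doc_freq = pd.Series([403, 293, 184])   # illustrative document frequencies

# math.log2 expects a single number, so passing a multi-element Series raises TypeError
try:
    math.log2(num_doc_demo / (doc_freq + 1))
    math_ok = True
except TypeError:
    math_ok = False

# numpy applies the logarithm element by element
idf_2 = np.log2(num_doc_demo / (doc_freq + 1))
idf_e = np.log(num_doc_demo / (doc_freq + 1))
idf_10 = np.log10(num_doc_demo / (doc_freq + 1))

# Change of base: the three columns differ only by a constant factor
print(math_ok)
print(np.allclose(idf_2, idf_e / np.log(2)))
print(np.allclose(idf_10, idf_e / np.log(10)))
```

Because the columns differ only by the constants ln 2 and ln 10, the choice of base rescales the IDF values without changing their relative sizes.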

6.1.6 Adding the TF-IDF_2, TF-IDF and TF-IDF_10 columns

tf-idf = tf * idf, so the calculation is simple.

df_all_word['TF-IDF_2'] = df_all_word['TF'] * df_all_word['IDF_2']
df_all_word['TF-IDF'] = df_all_word['TF'] * df_all_word['IDF']
df_all_word['TF-IDF_10'] = df_all_word['TF'] * df_all_word['IDF_10']
df_all_word.head()
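When comparing the three TF-IDF columns, it helps to check whether the base changes the ordering of the words. The sketch below does this on a toy table; all counts and the totals are made up, and only the column names follow the exported word frequency table.

```python
import numpy as np
import pandas as pd

# Toy table with made-up counts; column names follow the exported word frequency table
df_demo = pd.DataFrame({
    '词频': [120, 45, 45, 8],
    '文档频率': [60, 30, 5, 2],
})
num_word_demo, num_doc_demo = len(df_demo), 100   # hypothetical totals

tf = df_demo['词频'] / num_word_demo
ratio = num_doc_demo / (df_demo['文档频率'] + 1)
scores = pd.DataFrame({
    'TF-IDF_2': tf * np.log2(ratio),
    'TF-IDF': tf * np.log(ratio),
    'TF-IDF_10': tf * np.log10(ratio),
})

# Rank the words under each base (1 = highest score)
ranks = scores.rank(ascending=False)
print(ranks)
```

On this toy data the three rank columns are identical, which is what the change-of-base relation predicts: the base affects the magnitude of the scores, not the order of the words.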

6.2 Chosen-word table

6.2.1 Reading into a Pandas DataFrame

df_chosen_word = pd.read_excel(os.path.join(raw_data_dir, file_chosen_word))
df_chosen_word.head()

6.2.2 Counting the chosen words

num_chosen_word = len(df_chosen_word)
num_chosen_word

Output: 133

6.2.3 Adding the TF column

df_chosen_word['TF'] = df_chosen_word['词频'] / num_word
df_chosen_word.head(5)

6.2.4 Adding the IDF_2, IDF and IDF_10 columns

df_chosen_word['IDF_2'] = np.log2(num_doc / (df_chosen_word['文档频率'] + 1))
df_chosen_word['IDF'] = np.log(num_doc / (df_chosen_word['文档频率'] + 1))
df_chosen_word['IDF_10'] = np.log10(num_doc / (df_chosen_word['文档频率'] + 1))
df_chosen_word.head(5)

6.2.5 Adding the TF-IDF_2, TF-IDF and TF-IDF_10 columns

df_chosen_word['TF-IDF_2'] = df_chosen_word['TF'] * df_chosen_word['IDF_2']
df_chosen_word['TF-IDF'] = df_chosen_word['TF'] * df_chosen_word['IDF']
df_chosen_word['TF-IDF_10'] = df_chosen_word['TF'] * df_chosen_word['IDF_10']
df_chosen_word.head()

6.3 Word-selection matrix

6.3.1 Reading into a Pandas DataFrame

df_word_frequency_matrix = pd.read_excel(os.path.join(raw_data_dir, file_word_frequency_matrix))
df_word_frequency_matrix.head()

6.3.2 Counting the matched documents

num_chosen_doc = df_word_frequency_matrix.shape[0]
num_chosen_doc

Output: 729

6.3.3 Creating the TF-IDF_2 matrix

For how to copy from one DataFrame into another, see "Pandas Create New DataFrame By Selecting Specific Columns". The description of copy on the Pandas website is another reference.

df_word_tfidf_2 = df_word_frequency_matrix.copy()
df_word_tfidf_2.head()

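6.3.4 Doing Excel-style statistics in Pandas

See "SUMIF and COUNTIF in pandas", which explains the pandas counterparts of these Excel functions.

Counting how many documents each word appears in takes a long chain of operations. To make clear what that chain does, we break it into several steps.

6.3.4.1 Slicing the numeric part out of the word-selection matrix

The line below actually creates a new DataFrame; we simply do not assign it to a variable.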

df_word_frequency_matrix.iloc[:, 2:].head(10)

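6.3.4.2 Keeping only the cells that meet the condition

Building on the previous step, we wrap one more layer of processing to pick out the counts greater than 0; every other cell becomes NaN. The line below also creates a new DataFrame.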

df_word_frequency_matrix[df_word_frequency_matrix.iloc[:, 2:] > 0].head(10)

6.3.4.3 Counting the non-NaN values in each column

The result is stored in a new Series.

word_doc_count = df_word_frequency_matrix[df_word_frequency_matrix.iloc[:, 2:] > 0].iloc[:, 2:].count()
word_doc_count.head(20)
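The mask-then-count chain can be checked against a shorter formulation on a toy matrix. In the sketch below the two leading column names and all values are made up; it assumes, as the slicing with `iloc[:, 2:]` implies, that the word columns start at the third column.

```python
import pandas as pd

# Toy word-selection matrix: two leading columns, then one column per word.
# The leading column names and all values are made up for illustration.
df_demo = pd.DataFrame({
    '序号': [1, 2, 3],
    '正文': ['doc a', 'doc b', 'doc c'],
    '二舅': [2, 0, 1],
    '生活': [0, 0, 3],
})

counts = df_demo.iloc[:, 2:]

# Same idea as the notebook: mask non-positive cells to NaN, then count non-NaN cells
via_mask = counts[counts > 0].count()

# Equivalent shortcut: True counts as 1 when summing a boolean frame
via_sum = (counts > 0).sum()

print(via_sum)
```

Both forms return a Series indexed by word, so either can be used as the document count in the TF-IDF formula.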

We can try reading individual counts from the Series:

print(word_doc_count.iloc[0])
print(word_doc_count.iloc[1])
print(word_doc_count.iloc[2])
print('total ', len(word_doc_count), ' words. Same as ', num_chosen_word)

Output:

403

293

184

total 133 words. Same as 133

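6.3.4.4 Computing TF-IDF_2

Scheme 1: use the formula below, which combines the two-dimensional df_word_frequency_matrix with the one-dimensional word_doc_count. Pandas matches the index of the Series against the column labels of the DataFrame, which is why each word's column is paired with that word's document count. To verify the result, schemes 2 and 3 follow, each more elementary than the last. All three give identical results.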

df_word_tfidf_2.iloc[:, 2:] = (df_word_frequency_matrix.iloc[:, 2:] / num_word) * np.log2(num_doc / (word_doc_count.iloc[:] + 1))
df_word_tfidf_2.head()
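The reason scheme 1 pairs each column with the right document count is label alignment. A minimal sketch with made-up numbers follows, where the Series is deliberately stored in the opposite order to the DataFrame columns.

```python
import pandas as pd

# Made-up per-document counts for two words
freq = pd.DataFrame({'二舅': [2, 0, 1], '生活': [0, 4, 3]})

# Per-word weights stored in the opposite order on purpose
weights = pd.Series({'生活': 10.0, '二舅': 0.5})

# DataFrame * Series matches the Series index to the column labels, not to positions
by_label = freq * weights

# Multiplying by the raw array ignores labels and pairs values by position
by_position = freq * weights.to_numpy()

print(by_label)
print(by_position)
```

In the notebook, word_doc_count is derived from the same columns as the frequency matrix, so labels and positions agree there. The label-based behavior matters as soon as the two objects come from different sources or are reordered.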

Scheme 2: write a loop by hand over each chosen word. Note that the loop runs num_chosen_word times, not num_word times.

for idx in range(num_chosen_word):
    #print('idx: ', idx, '; value: ', df_word_frequency_matrix.iloc[0, idx + 2])
    df_word_tfidf_2.iloc[:, idx + 2] = (df_word_frequency_matrix.iloc[:, idx + 2] / num_word) * np.log2(num_doc / (word_doc_count.iloc[idx] + 1))
df_word_tfidf_2

Scheme 3: write a two-level loop by hand over the matched documents and the chosen words. Note that the outer loop uses num_chosen_doc, not num_doc, because many documents did not match any chosen word.

for i in range(num_chosen_doc):
    for j in range(num_chosen_word):
        df_word_tfidf_2.iloc[i, j + 2] = (df_word_frequency_matrix.iloc[i, j + 2] / num_word) * np.log2(num_doc / (word_doc_count.iloc[j] + 1))
df_word_tfidf_2.head()
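Instead of comparing the three results by eye, the agreement can be checked programmatically. The sketch below repeats the three schemes on a toy matrix with made-up values (the `_demo` totals and the names `s1`, `s2`, `s3` are hypothetical) and compares them with `np.allclose`.

```python
import numpy as np
import pandas as pd

# Toy inputs; every value here is made up
freq = pd.DataFrame({'w1': [2, 0, 1], 'w2': [0, 4, 3]})
num_word_demo, num_doc_demo = 10, 5
doc_count = (freq > 0).sum()
idf = np.log2(num_doc_demo / (doc_count + 1))

# Scheme 1: one vectorized expression
s1 = (freq / num_word_demo) * idf

# Scheme 2: one column at a time
s2 = pd.DataFrame(0.0, index=freq.index, columns=freq.columns)
for j in range(freq.shape[1]):
    s2.iloc[:, j] = (freq.iloc[:, j] / num_word_demo) * idf.iloc[j]

# Scheme 3: one cell at a time
s3 = pd.DataFrame(0.0, index=freq.index, columns=freq.columns)
for i in range(freq.shape[0]):
    for j in range(freq.shape[1]):
        s3.iloc[i, j] = (freq.iloc[i, j] / num_word_demo) * idf.iloc[j]

same = np.allclose(s1.to_numpy(), s2.to_numpy()) and np.allclose(s1.to_numpy(), s3.to_numpy())
print(same)
```

The vectorized form is the one to prefer on real data; the cell-by-cell loop is far slower on a matrix with hundreds of rows and columns and is useful mainly as a cross-check.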

6.3.5 Saving TF-IDF_2 to Excel

In later notebooks we will use this data to compute the relatedness between words and then run network analysis, exploring it with social network analysis methods. So we store the computed tf-idf word-selection matrix as an Excel file in the data/processed folder.

file_word_tdidf2_matrix = os.path.join(processed_data_dir, 'word_tdidf2_matrix.xlsx')
with pd.ExcelWriter(file_word_tdidf2_matrix, engine='xlsxwriter') as writer:
    df_word_tfidf_2.to_excel(writer, sheet_name='Sheet1')
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    # wrap text in columns A to Z
    cell_format = workbook.add_format({'text_wrap': True})
    worksheet.set_column('A:Z', cell_format=cell_format)

6.3.6 Computing TF-IDF with other logarithm bases

6.3.6.1 Computing TF-IDF (base e)

df_word_tfidf_e = df_word_frequency_matrix.copy()
df_word_tfidf_e.iloc[:, 2:] = (df_word_frequency_matrix.iloc[:, 2:] / num_word) * np.log(num_doc / (word_doc_count.iloc[:] + 1))
df_word_tfidf_e.head()

6.3.6.2 Saving TF-IDF to Excel

file_word_tdidf_matrix = os.path.join(processed_data_dir, 'word_tdidf_matrix.xlsx')
with pd.ExcelWriter(file_word_tdidf_matrix, engine='xlsxwriter') as writer:
    df_word_tfidf_e.to_excel(writer, sheet_name='Sheet1')
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    cell_format = workbook.add_format({'text_wrap': True})
    worksheet.set_column('A:Z', cell_format=cell_format)

6.3.6.3 Computing TF-IDF_10

df_word_tfidf_10 = df_word_frequency_matrix.copy()
df_word_tfidf_10.iloc[:, 2:] = (df_word_frequency_matrix.iloc[:, 2:] / num_word) * np.log10(num_doc / (word_doc_count.iloc[:] + 1))
df_word_tfidf_10.head()

6.3.6.4 Saving TF-IDF_10 to Excel

file_word_tdidf10_matrix = os.path.join(processed_data_dir, 'word_tdidf10_matrix.xlsx')
with pd.ExcelWriter(file_word_tdidf10_matrix, engine='xlsxwriter') as writer:
    df_word_tfidf_10.to_excel(writer, sheet_name='Sheet1')
    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    cell_format = workbook.add_format({'text_wrap': True})
    worksheet.set_column('A:Z', cell_format=cell_format)
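The three matrices differ only in the logarithm function, so they can also be produced in a single pass. The sketch below shows one way to do that on a toy matrix; every value is made up, `LOG_FUNCS` and the dictionary keys are hypothetical names, and writing the results to Excel is left out.

```python
import numpy as np
import pandas as pd

# Toy inputs; every value here is made up
freq = pd.DataFrame({'w1': [2, 0, 1], 'w2': [0, 4, 3]})
num_word_demo, num_doc_demo = 10, 5
doc_count = (freq > 0).sum()

# One entry per logarithm base used in this notebook: 2, e and 10
LOG_FUNCS = {'tfidf_2': np.log2, 'tfidf_e': np.log, 'tfidf_10': np.log10}

matrices = {
    name: (freq / num_word_demo) * log(num_doc_demo / (doc_count + 1))
    for name, log in LOG_FUNCS.items()
}

for name, matrix in matrices.items():
    print(name, matrix.shape)
```

Keeping the matrices in a dictionary makes it straightforward to loop over them again for the save step, using the same writer code for each one.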

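7 Summary

Above we computed TF-IDF for the word frequency table, the chosen-word table and the word-selection matrix, hoping to damp the scores of high-frequency words. In practice the effect is quite limited: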