I. Project Introduction
- Text summarization uses a model to automatically extract the key information and condense the core semantics of a text, producing a short result that expresses the same meaning as the original and conveys equivalent information.
- Summarizing the central idea of a passage in a middle-school Chinese class.
- Short commentaries on sports news on Sina Sports.
- The daily overview of important news on Toutiao.
- Multiple-choice questions in English exams that ask for the gist of a paragraph.
- From an NLP perspective, text summarization is dominated by two families of methods:
- Extractive summarization: Extraction-based
- Unsupervised extraction. Needs no parallel corpus, which saves the cost of manual labeling. The main approaches are:
- Lead
- Centroid
- ClusterCMRW
- TextRank (statistics-based: it maximizes how well the selected sentences represent the original document.)
- Supervised extraction. (The best-known supervised extraction method is BertSum, currently the most effective and most advanced supervised approach.)
- R2N2
- NeuralSum
- SummaRuNNer
- BertSum
- Abstractive summarization: Abstraction-based
- Uses a neural-network structure that connects an Encoder and a Decoder to freely generate a passage condensing the source document.
- Abstractive summarization relies on an accurate understanding of the source article.
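The extractive/abstractive distinction above can be made concrete with a toy example; the dialogue text and both summaries below are made up purely for illustration:

```python
# Toy illustration (hypothetical text): the two paradigms differ in whether the
# summary is restricted to sentences copied verbatim from the source.
source = "发动机有异响。技师检查后发现机油不足。添加机油后异响消失。"

# Extractive: select existing sentences verbatim
sentences = [s + "。" for s in source.split("。") if s]
extractive_summary = sentences[2]  # pick the most "central" sentence

# Abstractive: generate new wording that conveys the same meaning
abstractive_summary = "机油不足导致异响,补充机油解决。"

print(extractive_summary in source)   # True: copied from the source
print(abstractive_summary in source)  # False: newly generated text
```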
II. Model Architecture
1. TextRank-based architecture
- If a word appears after many other words, i.e. it is connected to many words, the word is considered important.
- If a word with a high TextRank value is followed by another word, the following word's TextRank value is raised accordingly.
- Pros and cons:
- Pros:
- The algorithm is simple and highly interpretable;
- Extracted sentences come straight from the source text, so they are fluent and well-formed;
- Cons:
- A good summary sometimes cannot be derived from word frequencies in the source text; it requires higher-level semantic abstraction;
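The iterative scoring idea above can be sketched as a PageRank-style update on a word graph. This is a minimal illustration of the principle, not the textrank4zh implementation, and the tiny graph below is hypothetical:

```python
# Minimal PageRank-style update on a toy word graph (hypothetical data).
# graph[w] lists the words that w points to (e.g. words following w in a window).
graph = {
    'engine': ['noise', 'oil'],
    'noise': ['engine'],
    'oil': ['engine', 'noise'],
}
d = 0.85  # damping factor, as in PageRank/TextRank

# Initialize every score to 1.0
scores = {w: 1.0 for w in graph}

# Iterate the update: a word inherits score from the words that link to it,
# each contributor splitting its own score over its outgoing links
for _ in range(30):
    new_scores = {}
    for w in graph:
        incoming = [u for u in graph if w in graph[u]]
        new_scores[w] = (1 - d) + d * sum(scores[u] / len(graph[u]) for u in incoming)
    scores = new_scores

# 'engine' is linked from both other words, so it converges to the highest score
print(max(scores, key=scores.get))  # → engine
```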
2. seq2seq-based abstractive architecture
- Pros and cons:
- Pros:
- The summary is not limited to the original wording and can be generated freely;
- The summary still surfaces the key information of the source text;
- Cons:
- The biggest pain point is the "repeated substring generation" problem, caused by the argmax step during decoding;
- The generated text is sometimes disfluent and hard to understand;
III. Tuning Steps
- TextRank-based tuning
- seq2seq-based tuning
- Fixing the "repeated substring generation" problem.
- Replace the argmax word choice at each decoding step with probabilistic sampling (drawing from the top-10 most probable words)
- Training with word2vec
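The sampling tweak above can be sketched in isolation. Below is a pure-Python, stdlib-only sketch of the idea (the project's own version in test_helper operates on PyTorch tensors); the 5-token vocabulary and scores are hypothetical:

```python
import math
import random

def sample_top_k(logits, k=10, rng=random):
    # Softmax: convert raw decoder scores to probabilities
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the k most probable token ids
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Sample one id among the top-k, proportionally to its probability
    r = rng.uniform(0, sum(probs[i] for i in top))
    cur = 0.0
    for i in top:
        cur += probs[i]
        if r <= cur:
            return i
    return top[-1]

# Toy scores over a 5-token vocabulary; id 2 dominates, but ids 4 and 3 can
# also be drawn occasionally, which breaks up repeated substrings
print(sample_top_k([0.1, 0.2, 5.0, 0.3, 0.4], k=3))
```

Because the choice is stochastic, repeated calls at the same decoding state no longer always return the same token, which is exactly what suppresses the repeated-substring loops that argmax produces.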
IV. Run & Result Screenshots
- TextRank
- seq2seq-based
- Before tuning
- After fixing the "repeated substring generation" problem
- With word2vec training
V. Code
1. TextRank
1.1 model.py
# Import the regular-expression package, used to strip data matching specific patterns
import re
# Import the relevant tools from textrank4zh
from textrank4zh import TextRank4Keyword, TextRank4Sentence
import pandas as pd


def clean_sentence(sentence):
    # Step 1
    # 1. Split the sentence on '|' and keep only the technician's utterances
    sub_jishi = []
    # Separate the owner/technician dialogue on the '|' character
    sub = sentence.split('|')
    # Iterate over each sub-sentence
    for i in range(len(sub)):
        # If it does not end with a full stop, append one
        if not sub[i].endswith('。'):
            sub[i] += '。'
        # Keep only the sentences spoken by the technician ('技师')
        if sub[i].startswith('技师'):
            sub_jishi.append(sub[i])
    # Join into a single string
    sentence = ''.join(sub_jishi)
    # Two cleanups added in step 2, using the re module
    # 2. Remove list markers such as 1. 2. 3.
    r = re.compile(r"\D(\d\.)\D")
    sentence = r.sub("", sentence)
    # 3. Remove irrelevant words and modal particles
    r = re.compile(r"车主说|技师说|语音|图片|呢|吧|哈|啊|啦")
    sentence = r.sub("", sentence)
    # Four cleanups added in step 3
    # 4. Remove the parenthesized 进口 (imported) and 海外 (overseas)
    r = re.compile(r"[((]进口[))]|\(海外\)")
    sentence = r.sub("", sentence)
    # 5. Remove every character except Chinese characters, digits, letters and ,!?。.-
    r = re.compile(r"[^,!?。\.\-\u4e00-\u9fa5_a-zA-Z0-9]")
    # 6. Convert half-width punctuation to full-width
    sentence = sentence.replace(",", ",")
    sentence = sentence.replace("!", "!")
    sentence = sentence.replace("?", "?")
    # 7. Convert question and exclamation marks to full stops
    sentence = sentence.replace("?", "。")
    sentence = sentence.replace("!", "。")
    sentence = r.sub("", sentence)
    # A cleanup added in step 4: remove specific characters at specific positions
    # 8. Remove a comma at the start of the sentence
    if sentence.startswith(','):
        sentence = sentence[1:]
    return sentence
if __name__ == '__main__':
    # Read the data, specifying 'utf-8' encoding
    df = pd.read_csv('../data/dev.csv', engine='python', encoding='utf-8')
    texts = df['Dialogue'].tolist()
    # Initialize the list that stores the results
    results = []
    # Initialize the textrank4zh object
    tr4s = TextRank4Sentence()
    # Loop over the whole test set; texts is the result of the preceding preprocessing
    for i in range(len(texts)):
        text = clean_sentence(texts[i])
        # Call the analysis function directly
        tr4s.analyze(text=text, lower=True, source='all_filters')
        result = ''
        # Fetch the key sentences directly
        # num=3: take the 3 most important sentences.
        # sentence_min_len=2: a sentence must have length at least 2.
        for item in tr4s.get_key_sentences(num=3, sentence_min_len=2):
            result += item.sentence
            result += '。'
        results.append(result)
        # Print a result every 100 iterations
        if (i + 1) % 100 == 0:
            print(i + 1, result)
    print('result length: ', len(results))
    # Save the results
    df['Prediction'] = results
    # Keep the QID, Report, and Prediction columns
    df = df[['QID', 'Report', 'Prediction']]
    # Save the results to a csv file
    df.to_csv('../data/textrank_result_.csv', index=None, sep=',')
    # Replace empty predictions with '随时联系。', saving the file as utf-8
    df = pd.read_csv('../data/textrank_result_.csv', engine='python', encoding='utf-8')
    df = df.fillna('随时联系。')
    # Save the processed file
    df.to_csv('../data/textrank_result_final_.csv', index=None, sep=',')
1.2 train_dev_split
import pandas as pd

# Split the training and validation sets
train_path = '../data/train.csv'
test_path = '../data/test.csv'
df_train = pd.read_csv(train_path, encoding='utf-8')
# df_train.info()
train_sum = 70000
df_train_new = df_train.iloc[:train_sum, :]
df_dev = df_train.iloc[train_sum:, :]
df_train_new.to_csv('../data/train_new.csv', columns=['QID', 'Brand', 'Model', 'Question', 'Dialogue', 'Report'], sep=',', index=None)
df_dev.to_csv('../data/dev.csv', columns=['QID', 'Brand', 'Model', 'Question', 'Dialogue', 'Report'], sep=',', index=None)
2. seq2seq
2.1 layers
# Import packages
import torch
import torch.nn as nn
import torch.nn.functional as F
import os
import sys

# Set the project root path to simplify later imports
root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)

# Import project code files
from utils.config import *
from utils.word2vec_utils import get_vocab_from_model, get_word_id_mappings, load_embedding_matrix_from_model


# Build the encoder class
class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_size):
        super(Encoder, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.enc_units = enc_units
        self.batch_size = batch_size
        # Layer 1: word embedding layer
        # self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Use word2vec pretrained word vectors
        # embedding_matrix: the word-vector matrix read from file, passed in directly.
        embedding_matrix = torch.tensor(load_embedding_matrix_from_model(word_vector_path)).float()
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix)
        # Layer 2: GRU layer
        self.gru = nn.GRU(input_size=embedding_dim, hidden_size=enc_units, num_layers=1, batch_first=True)

    def forward(self, x, h0):
        # x.shape: (batch_size, sequence_length)
        # h0.shape: (num_layers, batch_size, enc_units)
        x = self.embedding(x)
        output, hn = self.gru(x, h0)
        return output, hn.transpose(1, 0)

    def initialize_hidden_state(self):
        # hidden state shape: (num_layers, batch_size, enc_units)
        return torch.zeros(1, self.batch_size, self.enc_units)
class Attention(nn.Module):
    def __init__(self, enc_units, dec_units, attn_units):
        super(Attention, self).__init__()
        self.enc_units = enc_units
        self.dec_units = dec_units
        self.attn_units = attn_units
        # The three matrix multiplications of the attention computation map to 3 linear layers.
        self.w1 = nn.Linear(enc_units, attn_units)
        self.w2 = nn.Linear(dec_units, attn_units)
        self.v = nn.Linear(attn_units, 1)

    def forward(self, query, value):
        # query: the previous decoder hidden state, shape: (batch_size, dec_units)
        # value: the encoder output enc_output, shape: (batch_size, enc_seq_len, enc_units)
        # Before self.v is applied the tensor shape is (batch_size, enc_seq_len, attn_units)
        # score shape: (batch_size, enc_seq_len, 1)
        score = self.v(torch.tanh(self.w1(value) + self.w2(query)))
        # Attention weights: softmax over score along dim 1 (the enc_seq_len axis)
        attention_weights = F.softmax(score, dim=1)
        # (batch_size, enc_seq_len, 1) * (batch_size, enc_seq_len, enc_units)
        # Broadcasting multiplies every encoder position by its weight
        context_vector = attention_weights * value
        # Sum over the enc_seq_len dimension
        context_vector = torch.sum(context_vector, dim=1)
        # context_vector shape after summing: (batch_size, enc_units)
        return context_vector, attention_weights
class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_size):
        super(Decoder, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.dec_units = dec_units
        self.batch_size = batch_size
        # self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Use word2vec pretrained word vectors
        # embedding_matrix: the word-vector matrix read from file, passed in directly.
        embedding_matrix = torch.tensor(load_embedding_matrix_from_model(word_vector_path)).float()
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix)
        self.gru = nn.GRU(input_size=embedding_dim + dec_units,
                          hidden_size=dec_units,
                          num_layers=1,
                          batch_first=True)
        self.fc = nn.Linear(dec_units, vocab_size)

    def forward(self, x, context_vector):
        x = self.embedding(x)
        # x.shape after the embedding layer: (batch_size, 1, embedding_dim); the 1 means one word is decoded per step
        # Concatenate the context vector with the previous prediction as this step's GRU input
        x = torch.cat([torch.unsqueeze(context_vector, 1), x], dim=-1)
        output, hn = self.gru(x)
        output = output.squeeze(1)
        prediction = self.fc(output)
        return prediction, hn.transpose(1, 0)
if __name__ == '__main__':
    # word_to_id, id_to_word = get_word_id_mappings(vocab_path, reverse_vocab_path)
    word_to_id, id_to_word = get_vocab_from_model(word_vector_path)
    vocab_size = len(word_to_id)
    # Test parameters
    EXAMPLE_INPUT_SEQUENCE_LEN = 300
    BATCH_SIZE = 64
    EMBEDDING_DIM = 500
    GRU_UNITS = 512
    ATTENTION_UNITS = 20
    encoder = Encoder(vocab_size, EMBEDDING_DIM, GRU_UNITS, BATCH_SIZE)
    input0 = torch.ones((BATCH_SIZE, EXAMPLE_INPUT_SEQUENCE_LEN), dtype=torch.long)
    h0 = encoder.initialize_hidden_state()
    output, hn = encoder(input0, h0)
    print(output.shape)
    print(hn.shape)
    attention = Attention(GRU_UNITS, GRU_UNITS, ATTENTION_UNITS)
    context_vector, attention_weights = attention(hn, output)
    decoder = Decoder(vocab_size, EMBEDDING_DIM, GRU_UNITS, BATCH_SIZE)
    input1 = torch.ones((BATCH_SIZE, 1), dtype=torch.long)
    output1, hn = decoder(input1, context_vector)
    print(output1.shape)
    print(hn.shape)
2.2 model
import os
import sys

# Set the project root path to simplify later imports
root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)

# Import packages and project code files
import torch
import torch.nn as nn
from src.layers import Encoder, Attention, Decoder
from utils.config import *
from utils.word2vec_utils import get_vocab_from_model, get_word_id_mappings, load_embedding_matrix_from_model


# Build the full seq2seq model
class Seq2Seq(nn.Module):
    def __init__(self, params):
        super(Seq2Seq, self).__init__()
        embedding_matrix = load_embedding_matrix_from_model(word_vector_path)
        self.embedding_matrix = torch.from_numpy(embedding_matrix)
        self.params = params
        # Layer 1: encoder
        self.encoder = Encoder(params['vocab_size'], params['embed_size'],
                               params['enc_units'], params['batch_size'])
        # Layer 2: attention mechanism
        self.attention = Attention(params['enc_units'], params['dec_units'], params['attn_units'])
        # Layer 3: decoder
        self.decoder = Decoder(params['vocab_size'], params['embed_size'],
                               params['dec_units'], params['batch_size'])

    # forward essentially runs the decoder; attention is needed at every step, so it is
    # folded into forward. To run the encoder, call self.encoder() directly.
    def forward(self, dec_input, dec_hidden, enc_output, dec_target):
        # dec_input here is a (batch_size, 1) tensor of <START> tokens
        predictions = []
        # Compute attention from the encoder output and its final hidden state
        context_vector, attention_weights = self.attention(dec_hidden, enc_output)
        # Decoding loop
        for t in range(dec_target.shape[1]):
            # dec_input: (batch_size, 1); dec_hidden: (batch_size, hidden_units)
            pred, dec_hidden = self.decoder(dec_input, context_vector)
            context_vector, attention_weights = self.attention(dec_hidden, enc_output)
            # Teacher forcing: feed the ground-truth token, expanded to (batch_size, 1)
            dec_input = dec_target[:, t].unsqueeze(1)
            predictions.append(pred)
        return torch.stack(predictions, 1), dec_hidden
if __name__ == '__main__':
    # word_to_id, id_to_word = get_word_id_mappings(vocab_path, reverse_vocab_path)
    word_to_id, id_to_word = get_vocab_from_model(word_vector_path)
    vocab_size = len(word_to_id)
    batch_size = 64
    input_seq_len = 300
    # Mock test parameters
    params = {"vocab_size": vocab_size, "embed_size": 500, "enc_units": 512,
              "attn_units": 20, "dec_units": 512, "batch_size": batch_size}
    # Instantiate the model
    model = Seq2Seq(params)
    # Initialize test input data
    sample_input_batch = torch.ones((batch_size, input_seq_len), dtype=torch.long)
    sample_hidden = model.encoder.initialize_hidden_state()
    # Run the Encoder
    sample_output, sample_hidden = model.encoder(sample_input_batch, sample_hidden)
    # Print the output tensor shapes
    print('Encoder output shape: (batch_size, enc_seq_len, enc_units) {}'.format(sample_output.shape))
    print('Encoder Hidden state shape: (batch_size, enc_units) {}'.format(sample_hidden.shape))
    # Run Attention to get the context tensor
    context_vector, attention_weights = model.attention(sample_hidden, sample_output)
    print("Attention context_vector shape: (batch_size, enc_units) {}".format(context_vector.shape))
    print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))
    # Run the Decoder
    dec_input = torch.ones((batch_size, 1), dtype=torch.long)
    sample_decoder_output, _ = model.decoder(dec_input, context_vector)
    print('Decoder output shape: (batch_size, vocab_size) {}'.format(sample_decoder_output.shape))
    # Only a single step is tested here; dec_seq_len is not used
2.3 train
import os
import sys
root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)
import torch
import torch.nn as nn
import torch.optim as optim
from batcher import train_batch_generator
import time
from src.model import Seq2Seq
from src.train_helper import train_model
from utils.config import *
from utils.params_utils import get_params
from utils.word2vec_utils import get_vocab_from_model, get_word_id_mappings
def train(params):
    # Read the word_to_id mapping for training
    # word_to_id, _ = get_word_id_mappings(vocab_path, reverse_vocab_path)
    word_to_id, id_to_word = get_vocab_from_model(word_vector_path)
    # Dynamically add the vocabulary-size parameter
    params['vocab_size'] = len(word_to_id)
    # Build the model
    print("Building the model ...")
    model = Seq2Seq(params)
    # Train the model
    print('Start training the model')
    train_model(model, word_to_id, params)


if __name__ == '__main__':
    params = get_params()
    train(params)
2.4 train_helper
import os
import sys
root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)
# print(root_path)
import torch
import torch.nn as nn
import torch.optim as optim
from batcher import train_batch_generator
import time
def train_model(model, word_to_id, params):
    # Load parameters: the number of seq2seq training epochs and the batch size
    epochs = params['seq2seq_train_epochs']
    batch_size = params['batch_size']
    pad_index = word_to_id['<PAD>']
    unk_index = word_to_id['<UNK>']
    start_index = word_to_id['<START>']
    params['vocab_size'] = len(word_to_id)
    # Train on GPU when one is available
    device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    print('Model has been moved to the device...')
    # Use the Adam optimizer and the cross-entropy loss
    optimizer = optim.Adam(model.parameters(), lr=params['learning_rate'])
    criterion = nn.CrossEntropyLoss()
    # Define the loss function
    def loss_function(pred, real):
        # True where equal, False otherwise; the PAD/UNK positions naturally form the mask
        pad_mask = torch.eq(real, pad_index)
        unk_mask = torch.eq(real, unk_index)
        mask = torch.logical_not(torch.logical_or(pad_mask, unk_mask))
        pred = pred.transpose(2, 1)
        # Multiplying the labels by the mask keeps only the positions that really count toward the loss
        real = real * mask
        loss_ = criterion(pred, real)
        return torch.mean(loss_)

    def train_step(enc_input, dec_target):
        initial_hidden_state = model.encoder.initialize_hidden_state()
        initial_hidden_state = initial_hidden_state.to(device)
        # The usual three steps: 1. zero the gradients
        optimizer.zero_grad()
        enc_output, enc_hidden = model.encoder(enc_input, initial_hidden_state)
        # First decoder input: a (batch_size, 1) tensor of <START> labels
        dec_input = torch.tensor([start_index] * batch_size)
        dec_input = dec_input.unsqueeze(1)
        # First hidden-state input
        dec_hidden = enc_hidden
        # The model then predicts the sequence step by step in a for loop
        dec_input = dec_input.to(device)
        dec_hidden = dec_hidden.to(device)
        enc_output = enc_output.to(device)
        dec_target = dec_target.to(device)
        predictions, _ = model(dec_input, dec_hidden, enc_output, dec_target)
        # Compute the loss; both tensors have shape (batch_size, dec_seq_len)
        loss = loss_function(predictions, dec_target)
        # The usual three steps: 2. backward pass + 3. parameter update
        loss.backward()
        optimizer.step()
        return loss.item()
    # Load the data
    dataset, steps_per_epoch = train_batch_generator(batch_size)
    for epoch in range(epochs):
        start_time = time.time()
        total_loss = 0
        # Train batch by batch
        for batch, (inputs, targets) in enumerate(dataset):
            inputs = inputs.to(device)
            targets = targets.to(device)
            # Cast the target tensor to the same dtype as the input tensor
            targets = targets.type_as(inputs)
            batch_loss = train_step(inputs, targets)
            total_loss = total_loss + batch_loss
            # Print training info every 50 batches
            if (batch + 1) % 50 == 0:
                print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch + 1, batch_loss))
        if (epoch + 1) % 2 == 0:
            MODEL_PATH = root_path + '/src/saved_model/' + 'model_' + str(epoch) + '.pt'
            torch.save(model.state_dict(), MODEL_PATH)
            print('The model has been saved for epoch {}'.format(epoch + 1))
        print('Epoch {} Total Loss {:.4f}'.format(epoch + 1, total_loss))
        print('*************************************')
        # Print the time taken for one epoch
        print('Time taken for 1 epoch {} sec\n'.format(time.time() - start_time))
2.5 test
import os
import sys
root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)
import torch
import torch.nn as nn
import time
import pandas as pd
from src.model import Seq2Seq
from src.test_helper import greedy_decode
from utils.data_loader import load_test_dataset
from utils.config import *
from utils.params_utils import get_params
from utils.word2vec_utils import get_vocab_from_model, get_word_id_mappings
def test(params):
    print("Building the vocabulary")
    # word_to_id, id_to_word = get_word_id_mappings(vocab_path, reverse_vocab_path)
    word_to_id, id_to_word = get_vocab_from_model(word_vector_path)
    # Dynamically add the vocabulary-size parameter
    params['vocab_size'] = len(word_to_id)
    print("Building the model")
    model = Seq2Seq(params)
    MODEL_PATH = root_path + '/src/saved_model/' + 'model_19.pt'
    model.load_state_dict(torch.load(MODEL_PATH))
    print("Model loaded!")
    # Predict, save the results, and return them
    print("Creating the test-data iterator")
    test_x = load_test_dataset()
    print("Start decoding......")
    with torch.no_grad():
        results = greedy_decode(model, test_x, word_to_id, id_to_word, params)
    # Remove the spaces between predicted tokens
    print("Decoding finished, saving the results......")
    results = list(map(lambda x: x.replace(" ", ""), results))
    save_predict_result(results)
    return results


def save_predict_result(results):
    # Read the original file (use its QID column, attach the new Prediction column, then save)
    print("Reading the raw test data...")
    test_df = pd.read_csv(test_raw_data_path)
    # Fill in the results
    print("Building a new DataFrame and saving the file...")
    test_df['Prediction'] = results
    # Keep only the ID and Prediction columns
    test_df = test_df[['QID', 'Prediction']]
    # Save the results; a timestamped file name is generated automatically
    test_df.to_csv(get_result_filename(), index=None, sep=',')
    print("Test results saved!")


def get_result_filename():
    now_time = time.strftime('%Y_%m_%d_%H_%M_%S')
    filename = 'seq2seq_' + now_time + '.csv'
    result_path = os.path.join(result_save_path, filename)
    return result_path


if __name__ == '__main__':
    params = get_params()
    results = test(params)
    print(results[:10])
2.6 test_helper
import torch
import torch.nn as nn
import tqdm
import random
def greedy_decode(model, test_x, word_to_id, id_to_word, params):
    batch_size = params['batch_size']
    results = []
    total_test_num = len(test_x)
    # Number of batches: total // batch_size + 1 (ceil-style), because the last batch
    # may be smaller than batch_size but still needs to be processed
    step_epoch = total_test_num // batch_size + 1
    for i in range(step_epoch):
        batch_data = test_x[i * batch_size: (i + 1) * batch_size]
        results += batch_predict(model, batch_data, word_to_id, id_to_word, params)
        if (i + 1) % 10 == 0:
            print('i = ', i + 1)
    return results
def batch_predict(model, inputs, word_to_id, id_to_word, params):
    data_num = len(inputs)
    # Allocate the results list
    predicts = [''] * data_num
    # inputs is numpy data loaded from file and must be converted to a tensor
    inputs = torch.from_numpy(inputs)
    # Note: this batch size is not necessarily the one in config,
    # because the last batch may be smaller than 64; therefore initialize the
    # hidden state from data_num instead of calling the class method
    initial_hidden_state = torch.zeros(1, data_num, model.encoder.enc_units)
    # Step 1: run the encoder
    enc_output, enc_hidden = model.encoder(inputs, initial_hidden_state)
    # Prepare data for the attention and decoder layers
    dec_hidden = enc_hidden
    dec_input = torch.tensor([word_to_id['<START>']] * data_num)
    dec_input = dec_input.unsqueeze(1)
    # Step 2: run attention to obtain the context_vector
    context_vector, _ = model.attention(dec_hidden, enc_output)
    # Step 3: the decoder runs in the classic autoregressive mode, decoding max_dec_len steps in a for loop.
    for t in range(params['max_dec_len']):
        # Compute the context
        context_vector, attention_weights = model.attention(dec_hidden, enc_output)
        # Single-step prediction
        predictions, dec_hidden = model.decoder(dec_input, context_vector)
        # # Greedy decoding: pick the argmax id
        # predict_ids = torch.argmax(predictions, dim=1)
        predict_ids = []
        # Instead of greedy decoding, sample the generated word by probability
        for pre in predictions:
            # The decoder outputs raw logits; convert them to probabilities before sampling
            val, index = torch.softmax(pre, dim=-1).topk(10)
            rand = random.uniform(0, sum(val))
            cur = 0
            for i in range(len(val)):
                if cur < rand <= (cur + val[i]):
                    predict_ids.append(index[i])
                    break
                cur = cur + val[i]
        predict_ids = torch.tensor(predict_ids)
        # The inner for loop handles every example in the batch.
        for index, p_id in enumerate(predict_ids.numpy()):
            predicts[index] += id_to_word[str(p_id)] + ' '
        dec_input = predict_ids.unsqueeze(1)
    results = []
    for pred in predicts:
        # Strip leading/trailing spaces
        pred = pred.strip()
        # If <STOP> was generated before max_dec_len, truncate there
        if '<STOP>' in pred:
            pred = pred[:pred.index('<STOP>')]
        results.append(pred)
    return results
2.7 batcher
# Import packages
import os
import sys
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Set the project root path to simplify later imports
root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)

# Import project code files
from utils.data_loader import load_train_dataset, load_test_dataset


# Training batch generator
def train_batch_generator(batch_size, max_enc_len=300, max_dec_len=50, sample_num=None):
    # batch_size: batch size
    # max_enc_len: maximum sample length
    # max_dec_len: maximum label length
    # sample_num: optional limit on the number of samples
    # Load the preprocessed training data directly from file
    train_X, train_Y = load_train_dataset(max_enc_len, max_dec_len)
    # Truncate the data to the given number of samples
    if sample_num:
        train_X = train_X[:sample_num]
        train_Y = train_Y[:sample_num]
    # Convert numpy data to PyTorch tensors, since TensorDataset only accepts tensors
    x_data = torch.from_numpy(train_X)
    y_data = torch.from_numpy(train_Y)
    # Step 1: wrap the data
    dataset = TensorDataset(x_data, y_data)
    # Step 2: build an iterator over the dataset
    # If the machine has no GPU, use the commented line below
    # dataset = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)
    # With a GPU, the line below speeds up training
    dataset = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True,
                         num_workers=4, pin_memory=True)
    # Number of steps per epoch
    steps_per_epoch = len(train_X) // batch_size
    # Return the wrapped dataset and the step count
    return dataset, steps_per_epoch
# Test batch generator
def test_batch_generator(batch_size, max_enc_len=300):
    # batch_size: batch size
    # max_enc_len: maximum sample length
    # Load the preprocessed test data directly from file
    test_X = load_test_dataset(max_enc_len)
    # Convert numpy data to a PyTorch tensor, since TensorDataset only accepts tensors
    x_data = torch.from_numpy(test_X)
    # Step 1: wrap the data
    dataset = TensorDataset(x_data)
    # Step 2: build an iterator; at test time do not shuffle and do not drop the last batch
    dataset = DataLoader(dataset, batch_size=batch_size, shuffle=False, drop_last=False)
    # Number of steps per epoch
    steps_per_epoch = len(test_X) // batch_size
    # Return the wrapped dataset and the step count
    return dataset, steps_per_epoch
2.8 config
# Import the os package
import os

# Set the project root path, simplifying all later package imports
root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
print(root_path)

# Raw data file paths, built level by level from the project root
train_raw_data_path = os.path.join(root_path, 'data', 'train.csv')
test_raw_data_path = os.path.join(root_path, 'data', 'test.csv')
# Stop-word path and the jieba user-defined dictionary path
stop_words_path = os.path.join(root_path, 'data', 'stopwords.txt')
user_dict_path = os.path.join(root_path, 'data', 'user_dict.txt')
# Paths of the preprocessed and segmented train/test data
train_seg_path = os.path.join(root_path, 'data', 'train_seg_data.csv')
test_seg_path = os.path.join(root_path, 'data', 'test_seg_data.csv')
# Path of the file that merges the training and test data
merged_seg_path = os.path.join(root_path, 'data', 'merged_seg_data.csv')
# Paths of the padded data with samples and labels separated
train_x_pad_path = os.path.join(root_path, 'data', 'train_X_pad_data.csv')
train_y_pad_path = os.path.join(root_path, 'data', 'train_Y_pad_data.csv')
test_x_pad_path = os.path.join(root_path, 'data', 'test_X_pad_data.csv')
# Paths of the final numpy data converted to ids
train_x_path = os.path.join(root_path, 'data', 'train_X.npy')
train_y_path = os.path.join(root_path, 'data', 'train_Y.npy')
test_x_path = os.path.join(root_path, 'data', 'test_X.npy')
# Forward and reverse vocabulary paths
vocab_path = os.path.join(root_path, 'data', 'wv', 'vocab.txt')
reverse_vocab_path = os.path.join(root_path, 'data', 'wv', 'reverse_vocab.txt')
# Path for saving test-set results
result_save_path = os.path.join(root_path, 'data', 'result')
# Word-vector model path
word_vector_path = os.path.join(root_path, 'data', 'wv', 'word2vec.model')
2.9 data_loader
import numpy as np
import os
import sys
import re
import jieba
import pandas as pd

# The project root must be on sys.path before the utils imports below
root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)

# Parallel-processing module
from utils.multi_proc_utils import parallelize
# Parameters module
from utils.params_utils import get_params
# Config module
from utils.config import *
from utils.word2vec_utils import get_vocab_from_model, get_word_id_mappings


def get_max_len(data):
    # Compute a suitable maximum length (called by build_dataset)
    # data: the column to analyze, e.g. train_df['Question']
    # Sentence length = number of spaces + 1
    max_lens = data.apply(lambda x: x.count(' ') + 1)
    # mean + 2 standard deviations
    return int(np.mean(max_lens) + 2 * np.std(max_lens))
def transform_data(sentence, word_to_id):
    # Convert a sentence into an index sequence (called by build_dataset)
    # sentence: 'word1 word2 word3 ...' -> [index1, index2, index3 ...]
    # word_to_id: the mapping dictionary
    # Split the string into words
    words = sentence.split(' ')
    # Map through word_to_id; unknown words are filled with the <UNK> index
    ids = [word_to_id[w] if w in word_to_id else word_to_id['<UNK>'] for w in words]
    # Return the list of mapped ids
    return ids


def pad_proc(sentence, max_len, word_to_id):
    # Pad with <START> <STOP> <PAD> <UNK> according to max_len and the vocab
    # 0. Split into words on spaces
    words = sentence.strip().split(' ')
    # 1. Truncate to max_len words
    words = words[:max_len]
    # 2. Replace OOV words with <UNK>
    sentence = [w if w in word_to_id else '<UNK>' for w in words]
    # 3. Add <START> and <STOP>
    sentence = ['<START>'] + sentence + ['<STOP>']
    # 4. Pad with <PAD> up to max_len
    sentence = sentence + ['<PAD>'] * (max_len - len(words))
    # Join the list with spaces and return the string
    return ' '.join(sentence)
def load_stop_words(stop_word_path):
    # Load the stop words
    # stop_word_path: stop-word file path
    # Open the stop-word file
    f = open(stop_word_path, 'r', encoding='utf-8')
    # Read all lines
    stop_words = f.readlines()
    # Strip surrounding spaces and newlines from every stop word
    stop_words = [stop_word.strip() for stop_word in stop_words]
    return stop_words


# Load the stop words; stop_words_path is already configured in config.py
stop_words = load_stop_words(stop_words_path)
print('stop_words: ', stop_words[:20])
def clean_sentence(sentence):
    # Remove special symbols (called by sentence_proc)
    # sentence: the string to clean
    if isinstance(sentence, str):
        # Remove list markers such as 1. 2. 3.
        r = re.compile(r"\D(\d\.)\D")
        sentence = r.sub("", sentence)
        # Remove the parenthesized 进口 (imported) and 海外 (overseas)
        r = re.compile(r"[((]进口[))]|\(海外\)")
        sentence = r.sub("", sentence)
        # Remove every character except Chinese characters, digits, letters and ,!?。.-
        r = re.compile(r"[^,!?。\.\-\u4e00-\u9fa5_a-zA-Z0-9]")
        # Replace half-width ,!? with their full-width equivalents
        sentence = sentence.replace(",", ",")
        sentence = sentence.replace("!", "!")
        sentence = sentence.replace("?", "?")
        sentence = r.sub("", sentence)
        # Remove 车主说 技师说 语音 图片 你好 您好
        r = re.compile(r"车主说|技师说|语音|图片|你好|您好")
        sentence = r.sub("", sentence)
        return sentence
    else:
        return ''
def filter_stopwords(seg_list):
    # Filter stop words out of one segmented sentence (called by sentence_proc)
    # seg_list: list of segmented words [word1, word2, ...]
    # First drop empty strings
    words = [word for word in seg_list if word]
    # Then drop stop words
    return [word for word in words if word not in stop_words]


def sentence_proc(sentence):
    # Preprocess one sentence (called by sentences_proc)
    # sentence: the string to process
    # Step 1: clean the raw text
    sentence = clean_sentence(sentence)
    # Step 2: segment; jieba.cut defaults to accurate mode (full mode: cut_all=True)
    words = jieba.cut(sentence)
    # Step 3: feed the segmentation result into the stop-word filter
    words = filter_stopwords(words)
    # Return the filtered words joined with spaces as one string
    return ' '.join(words)


def sentences_proc(df):
    # Preprocess a list of sentences, applying sentence_proc to each one
    # df: the dataset
    # Batch-preprocess the training and test sets
    for col_name in ['Brand', 'Model', 'Question', 'Dialogue']:
        df[col_name] = df[col_name].apply(sentence_proc)
    # Preprocess the training-set Report column
    if 'Report' in df.columns:
        df['Report'] = df['Report'].apply(sentence_proc)
    # Return a pandas DataFrame
    return df
# Load the preprocessed training samples and labels from .npy files (available after build_dataset has run)
def load_train_dataset(max_enc_len=300, max_dec_len=50):
    # max_enc_len: maximum sample length; anything longer is truncated
    # max_dec_len: maximum label length; anything longer is truncated
    train_X = np.load(train_x_path)
    train_Y = np.load(train_y_path)
    train_X = train_X[:, :max_enc_len]
    train_Y = train_Y[:, :max_dec_len]
    return train_X, train_Y


# Load the preprocessed test samples from the .npy file (available after build_dataset has run)
def load_test_dataset(max_enc_len=300):
    # max_enc_len: maximum sample length; anything longer is truncated
    test_X = np.load(test_x_path)
    test_X = test_X[:, :max_enc_len]
    return test_X
# Deprecated!!! Use the word2vec-trained model below instead
def build_dataset_not_used(train_raw_data_path, test_raw_data_path):
    # 1. Load the raw data
    print('1. Load the raw data')
    print(train_raw_data_path)
    # The encoding must be set to utf-8
    train_df = pd.read_csv(train_raw_data_path, engine='python', encoding='utf-8')
    test_df = pd.read_csv(test_raw_data_path, engine='python', encoding='utf-8')
    # 82943, 20000
    print('Raw training rows {}, test rows {}'.format(len(train_df), len(test_df)))
    print('\n')
    # 2. Drop empty values (drop a row if any of these columns is empty)
    print('2. Drop empty values (drop a row if any of these columns is empty)')
    train_df.dropna(subset=['Question', 'Dialogue', 'Report'], how='any', inplace=True)
    test_df.dropna(subset=['Question', 'Dialogue'], how='any', inplace=True)
    print('Training rows after dropping empty values {}, test rows {}'.format(len(train_df), len(test_df)))
    print('\n')
    # 3. Multiprocess batch preprocessing (apply sentence_proc to every sentence: clean, segment, filter stop words, rejoin with spaces)
    print('3. Multiprocess batch preprocessing (apply sentence_proc to every sentence: clean, segment, filter stop words, rejoin with spaces)')
    train_df = parallelize(train_df, sentences_proc)
    test_df = parallelize(test_df, sentences_proc)
    print('\n')
    print('sentences_proc has done!')
    # 4. Merge the training and test sets to build the word_to_id mapping
    print('4. Merge the training and test sets to build the word_to_id mapping')
    # New column: row-wise join
    train_df['merged'] = train_df[['Question', 'Dialogue', 'Report']].apply(lambda x: ' '.join(x), axis=1)
    # New column: row-wise join
    test_df['merged'] = test_df[['Question', 'Dialogue']].apply(lambda x: ' '.join(x), axis=1)
    # The merged column joins three training columns and two test columns row-wise;
    # the two frames are then stacked to build the mapping dictionary
    merged_df = pd.concat([train_df[['merged']], test_df[['merged']]], axis=0)
    print('Training rows {}, test rows {}, merged rows {}'.format(len(train_df), len(test_df), len(merged_df)))
    print('\n')
    # 5. Save the segmented train_seg_data.csv and test_seg_data.csv
    print('5. Save the segmented train_seg_data.csv and test_seg_data.csv')
    # Drop the merged column; it is useless for the neural network
    train_df = train_df.drop(['merged'], axis=1)
    test_df = test_df.drop(['merged'], axis=1)
    # Persist the processed data
    train_df.to_csv(train_seg_path, index=None, header=True)
    test_df.to_csv(test_seg_path, index=None, header=True)
    print('The csv_file has saved!')
    print('\n')
    # 6. Save the merged merged_seg_data.csv, used to build word_to_id
    print('6. Save the merged merged_seg_data.csv, used to build word_to_id')
    merged_df.to_csv(merged_seg_path, index=None, header=False)
    print('The word_to_vector file has saved!')
    print('\n')
    # 7. Build the word_to_id and id_to_word dictionaries from the merged file saved in step 6.
    word_to_id = {}
    count = 0
    # Count word frequencies over the merged data
    with open(merged_seg_path, 'r', encoding='utf-8') as f1:
        for line in f1.readlines():
            line = line.strip().split(' ')
            for w in line:
                if w not in word_to_id:
                    word_to_id[w] = 1
                    count += 1
                else:
                    word_to_id[w] += 1
    print('Total word count =', count)
    print('\n')
    res_dict = {}
    number = 0
    # Keep only words that appear at least 5 times
    for w, i in word_to_id.items():
        if i >= 5:
            res_dict[w] = i
            number += 1
    print('Number of words entering the dictionary =', number)
    print('Dictionary over the merged data built; word_to_id size: ', len(res_dict))
    print('\n')
    word_to_id = {}
    count = 0
    for w, i in res_dict.items():
        if w not in word_to_id:
            word_to_id[w] = count
            count += 1
    print('Final dictionary built; word_to_id size =', len(word_to_id))
    print('count =', count)
    # 8. Join Question and Dialogue with a space to form the model input train_df['X']
    print("8. Join Question and Dialogue with a space to form the model input train_df['X']")
    train_df['X'] = train_df[['Question', 'Dialogue']].apply(lambda x: ' '.join(x), axis=1)
    test_df['X'] = test_df[['Question', 'Dialogue']].apply(lambda x: ' '.join(x), axis=1)
    print('\n')
    # 9. Pad with <START>, <STOP>, <UNK> and <PAD> so the data becomes equal length
    print('9. Pad with <START>, <STOP>, <UNK> and <PAD> so the data becomes equal length')
    # Compute suitable maximum lengths
    train_x_max_len = get_max_len(train_df['X'])
    test_x_max_len = get_max_len(test_df['X'])
    train_y_max_len = get_max_len(train_df['Report'])
    print('Maximum training-sample length before padding: ', train_x_max_len)
    print('Maximum test-sample length before padding: ', test_x_max_len)
    print('Maximum training-label length before padding: ', train_y_max_len)
    # Take the larger of the training and test values
    x_max_len = max(train_x_max_len, test_x_max_len)
    # Pad the training X
    # train_df['X'] = train_df['X'].apply(lambda x: pad_proc(x, x_max_len, vocab))
    print('Padding training X with PAD, START, STOP, UNK...')
    train_df['X'] = train_df['X'].apply(lambda x: pad_proc(x, x_max_len, word_to_id))
    # Pad the test X
    print('Padding test X with PAD, START, STOP, UNK...')
    test_df['X'] = test_df['X'].apply(lambda x: pad_proc(x, x_max_len, word_to_id))
    # Pad the training Y
    print('Padding training Y with PAD, START, STOP, UNK...')
    train_df['Y'] = train_df['Report'].apply(lambda x: pad_proc(x, train_y_max_len, word_to_id))
    print('\n')
    # 10. Save X and Y padded with <START>, <STOP>, <UNK> and <PAD>
    print('10. Save X and Y padded with <START>, <STOP>, <UNK> and <PAD>')
    train_df['X'].to_csv(train_x_pad_path, index=None, header=False)
    train_df['Y'].to_csv(train_y_pad_path, index=None, header=False)
    test_df['X'].to_csv(test_x_pad_path, index=None, header=False)
    print('The three padded files have been saved!')
    print('\n')
    # 11. Rebuild the word_to_id and id_to_word dictionaries from the 3 files saved in step 10.
    word_to_id = {}
    count = 0
    # Process the training data X
    with open(train_x_pad_path, 'r', encoding='utf-8') as f1:
        for line in f1.readlines():
            line = line.strip().split(' ')
            for w in line:
                if w not in word_to_id:
                    word_to_id[w] = count
                    count += 1
    print('Training-X dictionary built; word_to_id size: ', len(word_to_id))
    # Process the training data Y
    with open(train_y_pad_path, 'r', encoding='utf-8') as f2:
        for line in f2.readlines():
            line = line.strip().split(' ')
            for w in line:
                if w not in word_to_id:
                    word_to_id[w] = count
                    count += 1
    print('Training-Y dictionary built; word_to_id size: ', len(word_to_id))
    # Process the test data X
    with open(test_x_pad_path, 'r', encoding='utf-8') as f3:
        for line in f3.readlines():
            line = line.strip().split(' ')
            for w in line:
                if w not in word_to_id:
                    word_to_id[w] = count
                    count += 1
    print('Test-X dictionary built; word_to_id size: ', len(word_to_id))
    print('Total word count = ', count)
    # Build the reverse dictionary id_to_word
    id_to_word = {}
    for w, i in word_to_id.items():
        id_to_word[i] = w
    print('Reverse dictionary built; id_to_word size: ', len(id_to_word))
    print('\n')
    # 12. Update and save the vocab
    print('12. Update and save the vocab')
    save_vocab_as_txt(vocab_path, word_to_id)
    save_vocab_as_txt(reverse_vocab_path, id_to_word)
    print('The word_to_id and id_to_word mappings have been saved!')
    print('\n')
# 13. 数据集转换 将词转换成索引[<START> 方向机 重 ...] -> [32800, 403, 986, 246, 231]
print('13. 数据集转换 将词转换成索引[<START> 方向机 重 ...] -> [32800, 403, 986, 246, 231]')
print('训练集X执行transform_data中......')
train_ids_x = train_df['X'].apply(lambda x: transform_data(x, word_to_id))
print('训练集Y执行transform_data中......')
train_ids_y = train_df['Y'].apply(lambda x: transform_data(x, word_to_id))
print('测试集X执行transform_data中......')
test_ids_x = test_df['X'].apply(lambda x: transform_data(x, word_to_id))
print('\n')
# 14. 数据转换成numpy数组(需等长)
# 将索引列表转换成矩阵 [32800, 403, 986, 246, 231] --> array([[32800, 403, 986, 246, 231], ...])
print('14. 数据转换成numpy数组(需等长)')
train_X = np.array(train_ids_x.tolist())
train_Y = np.array(train_ids_y.tolist())
test_X = np.array(test_ids_x.tolist())
print('转换为numpy数组的形状如下: \ntrain_X的shape为: ', train_X.shape, '\ntrain_Y的shape为: ', train_Y.shape, '\ntest_X的shape为: ', test_X.shape)
print('\n')
# 15. 保存数据
print('15. 保存数据......')
np.save(train_x_path, train_X)
np.save(train_y_path, train_Y)
np.save(test_x_path, test_X)
print('\n')
print('数据集构造完毕, 存储于seq2seq/data/目录下.')
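上面的流程多次调用了 get_max_len, pad_proc 和 transform_data, 它们的定义在项目的其他模块中. 下面是根据调用方式推测的一个最小化示意实现 (注意: 具体实现, 尤其是 get_max_len 中的"均值 + 2倍标准差"启发式, 是假设的常见做法, 并非项目源码本身):

```python
import numpy as np
import pandas as pd

def get_max_len(data):
    # 常见启发式: 词数均值 + 2 * 标准差, 约覆盖95%的样本长度 (假设的实现)
    lens = data.apply(lambda x: len(x.split()))
    return int(np.mean(lens) + 2 * np.std(lens))

def pad_proc(sentence, max_len, vocab):
    # 截断到max_len, 未登录词替换为<UNK>, 两端加<START>/<STOP>,
    # 最后用<PAD>右侧补齐, 使所有样本等长 (假设的实现)
    words = sentence.strip().split(' ')[:max_len]
    words = [w if w in vocab else '<UNK>' for w in words]
    words = ['<START>'] + words + ['<STOP>']
    words += ['<PAD>'] * (max_len - (len(words) - 2))
    return ' '.join(words)

def transform_data(sentence, word_to_id):
    # 空格分隔的词串 -> 索引列表, 未登录词回退到<UNK>的索引 (假设的实现)
    return [word_to_id.get(w, word_to_id['<UNK>']) for w in sentence.split(' ')]

vocab = {'<START>': 0, '<STOP>': 1, '<PAD>': 2, '<UNK>': 3, '方向机': 4, '重': 5}
padded = pad_proc('方向机 重 异响', 4, vocab)
print(padded)                         # <START> 方向机 重 <UNK> <STOP> <PAD>
print(transform_data(padded, vocab))  # [0, 4, 5, 3, 1, 2]
```

示意中 '异响' 不在词典里, 先被替换为<UNK>再映射为索引3, 与第9步和第13步的处理顺序一致.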
# 数据预处理总函数, 用于数据加载 + 预处理 (注意: 只需执行一次)
# 使用word2vec进行词向量训练
def build_dataset(train_raw_data_path, test_raw_data_path):
# 1. 加载原始数据
print('1. 加载原始数据')
print(train_raw_data_path)
# 必须设定数据格式为utf-8
train_df = pd.read_csv(train_raw_data_path, engine='python', encoding='utf-8')
test_df = pd.read_csv(test_raw_data_path, engine='python', encoding='utf-8')
# 82943, 20000
print('原始训练集行数 {}, 测试集行数 {}'.format(len(train_df), len(test_df)))
print('\n')
# 2. 空值去除(对于一行数据, 任意列只要有空值就去掉该行)
print('2. 空值去除(对于一行数据,任意列只要有空值就去掉该行)')
train_df.dropna(subset=['Question', 'Dialogue', 'Report'], how='any', inplace=True)
test_df.dropna(subset=['Question', 'Dialogue'], how='any', inplace=True)
print('空值去除后训练集行数 {}, 测试集行数 {}'.format(len(train_df), len(test_df)))
print('\n')
# 3. 多线程, 批量数据预处理(对每个句子执行sentence_proc, 清除无用词, 分词, 过滤停用词, 再用空格拼接为一个字符串)
print('3. 多线程, 批量数据预处理(对每个句子执行sentence_proc, 清除无用词, 分词, 过滤停用词, 再用空格拼接为一个字符串)')
train_df = parallelize(train_df, sentences_proc)
test_df = parallelize(test_df, sentences_proc)
print('\n')
print('sentences_proc has done!')
# 4. 合并训练测试集, 用于构造映射字典word_to_id
print('4. 合并训练测试集, 用于构造映射字典word_to_id')
    # 新建一列merged, 将该行的多列文本用空格连接

train_df['merged'] = train_df[['Question', 'Dialogue', 'Report']].apply(lambda x: ' '.join(x), axis=1)
    # 新建一列merged, 将该行的多列文本用空格连接
test_df['merged'] = test_df[['Question', 'Dialogue']].apply(lambda x: ' '.join(x), axis=1)
    # merged列: 训练集三列与测试集两列分别在行内用空格连接后, 再沿axis=0纵向堆叠, 用于构造映射字典
merged_df = pd.concat([train_df[['merged']], test_df[['merged']]], axis=0)
print('训练集行数{}, 测试集行数{}, 合并数据集行数{}'.format(len(train_df), len(test_df), len(merged_df)))
print('\n')
    # 5. 保存分割处理好的train_seg_data.csv, test_seg_data.csv
    print('5. 保存分割处理好的train_seg_data.csv, test_seg_data.csv')
# 把建立的列merged去掉, 该列对于神经网络无用
train_df = train_df.drop(['merged'], axis=1)
test_df = test_df.drop(['merged'], axis=1)
# 将处理后的数据存入持久化文件
train_df.to_csv(train_seg_path, index=None, header=True)
test_df.to_csv(test_seg_path, index=None, header=True)
print('The csv_file has saved!')
print('\n')
# 6. 保存合并数据merged_seg_data.csv, 用于构造映射字典word_to_id
print('6. 保存合并数据merged_seg_data.csv, 用于构造映射字典word_to_id')
merged_df.to_csv(merged_seg_path, index=None, header=False)
print('The word_to_vector file has saved!')
print('\n')
from gensim.models.word2vec import LineSentence, Word2Vec
# 7. 训练词向量, LineSentence传入csv文件名
print('7. 训练词向量, LineSentence传入csv文件名.')
# gensim中的Word2Vec算法默认采用CBOW模式训练.
wv_model = Word2Vec(LineSentence(merged_seg_path), vector_size=params['embed_size'],
negative=5, workers=8, epochs=params['wv_train_epochs'],
window=3, min_count=5)
print('The wv_model has trained over!')
print('\n')
# 8. 将Question和Dialogue用空格连接作为模型输入形成train_df['X']
print("8. 将Question和Dialogue用空格连接作为模型输入形成train_df['X']")
train_df['X'] = train_df[['Question', 'Dialogue']].apply(lambda x: ' '.join(x), axis=1)
test_df['X'] = test_df[['Question', 'Dialogue']].apply(lambda x: ' '.join(x), axis=1)
print('\n')
# 9. 填充<START>, <STOP>, <UNK>和<PAD>, 使数据变为等长
print('9. 填充<START>, <STOP>, <UNK> 和 <PAD>, 使数据变为等长')
# 获取适当的最大长度
train_x_max_len = get_max_len(train_df['X'])
test_x_max_len = get_max_len(test_df['X'])
train_y_max_len = get_max_len(train_df['Report'])
print('填充前训练集样本的最大长度为: ', train_x_max_len)
print('填充前测试集样本的最大长度为: ', test_x_max_len)
print('填充前训练集标签的最大长度为: ', train_y_max_len)
# 选训练集和测试集中较大的值
x_max_len = max(train_x_max_len, test_x_max_len)
# 训练集X填充处理
# train_df['X'] = train_df['X'].apply(lambda x: pad_proc(x, x_max_len, vocab))
word_to_id, _ = get_word_id_mappings(vocab_path, reverse_vocab_path)
print('训练集X填充PAD, START, STOP, UNK处理中...')
train_df['X'] = train_df['X'].apply(lambda x: pad_proc(x, x_max_len, word_to_id))
# 测试集X填充处理
print('测试集X填充PAD, START, STOP, UNK处理中...')
test_df['X'] = test_df['X'].apply(lambda x: pad_proc(x, x_max_len, word_to_id))
# 训练集Y填充处理
print('训练集Y填充PAD, START, STOP, UNK处理中...')
train_df['Y'] = train_df['Report'].apply(lambda x: pad_proc(x, train_y_max_len, word_to_id))
print('\n')
# 10. 保存填充<START>, <STOP>, <UNK>和<PAD>后的X和Y
print('10. 保存填充<START>, <STOP>, <UNK>和<PAD>后的X和Y')
train_df['X'].to_csv(train_x_pad_path, index=None, header=False)
train_df['Y'].to_csv(train_y_pad_path, index=None, header=False)
test_df['X'].to_csv(test_x_pad_path, index=None, header=False)
print('填充后的三个文件保存完毕!')
print('\n')
# 11. 重新训练词向量,将<START>, <STOP>, <UNK>, <PAD>加入词典最后
print('11. 重新训练词向量, 将<START>, <STOP>, <UNK>, <PAD>加入词典最后')
wv_model.build_vocab(LineSentence(train_x_pad_path), update=True)
wv_model.train(LineSentence(train_x_pad_path),
epochs=params['wv_train_epochs'],
total_examples=wv_model.corpus_count)
print('1/3 train_x_pad_path')
wv_model.build_vocab(LineSentence(train_y_pad_path), update=True)
wv_model.train(LineSentence(train_y_pad_path),
epochs=params['wv_train_epochs'],
total_examples=wv_model.corpus_count)
print('2/3 train_y_pad_path')
wv_model.build_vocab(LineSentence(test_x_pad_path), update=True)
wv_model.train(LineSentence(test_x_pad_path),
epochs=params['wv_train_epochs'],
total_examples=wv_model.corpus_count)
print('3/3 test_x_pad_path')
# 保存词向量模型.model
wv_model.save(word_vector_path)
print('词向量训练完成')
    # print('最终词向量的词典大小为: ', len(wv_model.wv))
print('\n')
# 12. 更新vocab并保存
print('12. 更新vocab并保存')
word_to_id = {word: index for index, word in enumerate(wv_model.wv.index_to_key)}
id_to_word = {index: word for index, word in enumerate(wv_model.wv.index_to_key)}
save_vocab_as_txt(vocab_path, word_to_id)
save_vocab_as_txt(reverse_vocab_path, id_to_word)
print('更新后的word_to_id, id_to_word保存完毕!')
print('\n')
# 13. 数据集转换 将词转换成索引 [<START> 方向机 重 ...] -> [32800, 403, 986, 246, 231]
print('13. 数据集转换 将词转换成索引 [<START> 方向机 重 ...] -> [32800, 403, 986, 246, 231]')
print('训练集X执行transform_data中......')
train_ids_x = train_df['X'].apply(lambda x: transform_data(x, word_to_id))
print('训练集Y执行transform_data中......')
train_ids_y = train_df['Y'].apply(lambda x: transform_data(x, word_to_id))
print('测试集X执行transform_data中......')
test_ids_x = test_df['X'].apply(lambda x: transform_data(x, word_to_id))
print('\n')
# 14. 数据转换成numpy数组(需等长)
# 将索引列表转换成矩阵 [32800, 403, 986, 246, 231] --> array([[32800, 403, 986, 246, 231], ...])
print('14. 数据转换成numpy数组(需等长)')
train_X = np.array(train_ids_x.tolist())
train_Y = np.array(train_ids_y.tolist())
test_X = np.array(test_ids_x.tolist())
print('转换为numpy数组的形状如下: \ntrain_X的shape为: ', train_X.shape, '\ntrain_Y的shape为: ', train_Y.shape, '\ntest_X的shape为: ', test_X.shape)
print('\n')
# 15. 保存数据
print('15. 保存数据')
np.save(train_x_path, train_X)
np.save(train_y_path, train_Y)
np.save(test_x_path, test_X)
print('\n')
    print('数据集构造完毕, 存储于seq2seq/data/目录下.')
# 导入若干工具包
import re
import jieba
import pandas as pd
import numpy as np
import os
import sys
# 设定项目的root路径, 方便后续各个模块代码的导入
root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)
# 并行处理模块
from utils.multi_proc_utils import parallelize
# 配置模块
from utils.config import *
# 参数模块
from utils.params_utils import get_params
# 保存字典为txt
from utils.word2vec_utils import save_vocab_as_txt
# 载入词向量参数
params = get_params()
# jieba载入自定义切词表
jieba.load_userdict(user_dict_path)
if __name__ == '__main__':
build_dataset(train_raw_data_path, test_raw_data_path)
2.10 multi_proc_utils.py
import pandas as pd
import numpy as np
from multiprocessing import cpu_count, Pool
cores = cpu_count()
partitions = cores
print(cores)
def parallelize(df, func):
data_split = np.array_split(df, partitions)
    # 初始化进程池
pool = Pool(cores)
# 数据分发, 处理, 再合并
data = pd.concat(pool.map(func, data_split))
pool.close()
# 执行完close后不会有新的进程加入到pool, join函数等待所有子进程结束
pool.join()
# 返回处理后的数据
return data
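parallelize的"切分-映射-拼接"模式可以用内置map串行地验证 (double_col是sentences_proc的一个玩具替身, 仅用于演示):

```python
import numpy as np
import pandas as pd

def double_col(df):
    # sentences_proc的玩具替身: 处理DataFrame的一个分块
    df = df.copy()
    df['v'] = df['v'] * 2
    return df

df = pd.DataFrame({'v': [1, 2, 3, 4]})
parts = np.array_split(df, 2)             # 与parallelize相同的切分方式
out = pd.concat(map(double_col, parts))   # 串行map替代pool.map
print(out['v'].tolist())  # [2, 4, 6, 8]
```

串行结果与pool.map的并行结果一致, 因为各分块的处理互不依赖, 这也是该模式能安全并行化的前提.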
2.11 params_utils.py
import argparse
def get_params():
parser = argparse.ArgumentParser()
# 编码器和解码器的最大序列长度
parser.add_argument("--max_enc_len", default=300, help="Encoder input max sequence length", type=int)
parser.add_argument("--max_dec_len", default=50, help="Decoder input max sequence length", type=int)
# 一个训练批次的大小
parser.add_argument("--batch_size", default=64, help="Batch size", type=int)
# seq2seq训练轮数
parser.add_argument("--seq2seq_train_epochs", default=20, help="Seq2seq model training epochs", type=int)
# 词嵌入大小
parser.add_argument("--embed_size", default=500, help="Words embeddings dimension", type=int)
# word2vec模型训练轮数
parser.add_argument("--wv_train_epochs", default=10, help="Word2vec model training epochs", type=int)
# 编码器、解码器以及attention的隐含层单元数
parser.add_argument("--enc_units", default=512, help="Encoder GRU cell units number", type=int)
parser.add_argument("--dec_units", default=512, help="Decoder GRU cell units number", type=int)
parser.add_argument("--attn_units", default=20, help="Used to compute the attention weights", type=int)
# 学习率
parser.add_argument("--learning_rate", default=0.001, help="Learning rate", type=float)
args = parser.parse_args()
# param是一个字典类型的变量,键为参数名,值为参数值
params = vars(args)
return params
if __name__ == '__main__':
res = get_params()
print(res)
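get_params中的parse_args()默认读取sys.argv, 因此从其他脚本导入调用时会受该脚本命令行参数影响; 给parse_args显式传入列表可以绕开这一点, 示意如下:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", default=64, type=int)
parser.add_argument("--learning_rate", default=0.001, type=float)

# parse_args([])不读取sys.argv, 直接返回默认值;
# vars()将Namespace转为普通字典, 与get_params的做法一致
params = vars(parser.parse_args([]))
print(params)  # {'batch_size': 64, 'learning_rate': 0.001}

args = parser.parse_args(['--batch_size', '128'])
print(args.batch_size)  # 128 (type=int将字符串"128"转为整数)
```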
2.12 word2vec_utils.py
from gensim.models.word2vec import Word2Vec
def load_embedding_matrix_from_model(wv_model_path):
# 从word2vec模型中获取词向量矩阵
# wv_model_path: word2vec模型的路径
wv_model = Word2Vec.load(wv_model_path)
# wv_model.wv.vectors包含词向量矩阵
embedding_matrix = wv_model.wv.vectors
return embedding_matrix
def get_vocab_from_model(wv_model_path):
# 从word2vec模型中获取正向和反向词典
# wv_model_path: word2vec模型的路径
wv_model = Word2Vec.load(wv_model_path)
    id_to_word = {index: word for index, word in enumerate(wv_model.wv.index_to_key)}
    word_to_id = {word: index for index, word in enumerate(wv_model.wv.index_to_key)}
return word_to_id, id_to_word
def save_vocab_as_txt(filename, word_to_id):
# 保存字典
# filename: 目标txt文件路径
# word_to_id: 要保存的字典
with open(filename, 'w', encoding='utf-8') as f:
for k, v in word_to_id.items():
f.write("{}\t{}\n".format(k, v))
def get_word_id_mappings(vocab_path, reverse_vocab_path):
word_to_id = {}
id_to_word = {}
    with open(vocab_path, 'r', encoding='utf-8') as f:
        for line in f.readlines():
            val = line.split('\t')
            word_to_id[val[0].strip()] = int(val[1].strip())
    with open(reverse_vocab_path, 'r', encoding='utf-8') as f:
        for line in f.readlines():
            val = line.split('\t')
            id_to_word[int(val[0].strip())] = val[1].strip()
return word_to_id, id_to_word
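save_vocab_as_txt与get_word_id_mappings构成一对保存/加载的往返操作, 下面用一个临时目录做自包含验证 (load_vocab是get_word_id_mappings的简化替身, 只重建正向字典):

```python
import os
import tempfile

def save_vocab_as_txt(filename, word_to_id):
    # 与上文版本相同: 每行写入一个 "词\t索引" 对
    with open(filename, 'w', encoding='utf-8') as f:
        for k, v in word_to_id.items():
            f.write("{}\t{}\n".format(k, v))

def load_vocab(filename):
    # get_word_id_mappings的简化替身 (只恢复word_to_id)
    word_to_id = {}
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            word, idx = line.rstrip('\n').split('\t')
            word_to_id[word] = int(idx)
    return word_to_id

vocab = {'<PAD>': 0, '<UNK>': 1, '方向机': 2}
path = os.path.join(tempfile.mkdtemp(), 'vocab.txt')
save_vocab_as_txt(path, vocab)
print(load_vocab(path) == vocab)  # True
```

注意加载时要把索引从字符串转回int, 否则后续transform_data得到的将是字符串索引, 无法构成numpy整型矩阵.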