NLP

文本摘要(一)

TextRank, seq2seq, word2vec训练词向量

Posted by 新宇 on April 23, 2021

一、项目介绍

  • 文本摘要任务就是利用模型自动完成关键信息的抽取, 文本核心语义的概括,用一个简短的结果文本来表达和原文本同样的意思, 并传达等效的信息.
    • 中学语文课的中心思想概括.
    • 新浪体育上的体育新闻短评.
    • 今日头条上的每日重要新闻概览.
    • 英语考试中的概括某段落信息的选择题.
  • 从NLP的角度看待文本摘要任务, 主流的涵盖两大方法:
    • 抽取式摘要: Extraction-based
      • 无监督抽取. 不需要平行语料, 节省了人工标记的成本. 大体上有如下几种:
        • Lead
        • Centroid
        • ClusterCMRW
        • TextRank (基于统计层面的, 即最大化摘要句子对原始文档的表征能力.)
      • 有监督抽取. (最著名的有监督抽取方法就是BertSum算法. 也是目前有监督抽取中最有效, 最前沿的方法.)
        • R2N2
        • NeuralSum
        • SummaRuNNer
        • BertSum
    • 生成式摘要: Abstraction-based
      • 采用基于神经网络模型的结构, 通过Encoder + Decoder的连接方式, 自由生成一段概括源文档信息的文本.
      • 生成式摘要基于对篇章article的精确理解.

二、模型架构

1. 基于TextRank架构

  • 如果一个单词出现在很多单词的后面, 就是它和很多单词有关联, 那么说明这个单词比较重要.
  • 如果一个TextRank值很高的单词后面跟着另一个单词, 那么后面这个单词的TextRank值也会相应的被提高.
  • 优缺点分析:
    • 优点:
      • 算法简单,可解释性强;
      • 抽取式摘要语句是从原文抽取,语句通顺流畅;
    • 缺点:
      • 文章摘要有时并不能通过原文词频获取,而是更高层次的语义抽象;

2. 基于seq2seq的生成式架构

  • 优缺点分析:
    • 优点:
      • 可以不限于原文自由生成摘要文本;
      • 摘要中也展示了原文中的关键信息;
    • 缺点:
      • 最大痛点就是”重复子串生成”问题,原因是因为解码步骤中的argmax导致的;
      • 生成文本有时不通顺,不易理解;

三、调优步骤

  • 基于TextRank调优
  • 基于seq2seq调优
    • 解决”重复子串生成”问题.
      • 将解码步骤中的argmax生成词语,改为按概率生成(从概率排名top10中的词汇抽取)
  • 基于word2vec训练

四、运行截图&结果截图

  • TextRank
  • 基于seq2seq
    • 未调优
    • 解决”重复子串生成”问题
  • 基于word2vec训练

五、代码

1. TextRank

1.1 model.py

# 导入正则表达式工具包, 用来删除特定模式的数据

import re

# 导入textrank4zh的相关工具包

from textrank4zh import TextRank4Keyword, TextRank4Sentence
import pandas as pd

def clean_sentence(sentence):
    # 第一步要处理的代码

    # 1. 将sentence按照'|'分句,并只提取技师的话

    sub_jishi = []
    # 按照'|'字符将车主和用户的对话分离

    sub = sentence.split('|')

    # 遍历每个子句

    for i in range(len(sub)):
        # 如果不是以句号结尾, 增加一个句号

        if not sub[i].endswith('。'):
            sub[i] += '。'
        # 只使用技师说的句子

        if sub[i].startswith('技师'):
            sub_jishi.append(sub[i])

    # 拼接成字符串并返回

    sentence = ''.join(sub_jishi)

    # 第二步中添加的两个处理, 利用正则表达式re工具

    # 2. 删除1. 2. 3. 这些标题

    r = re.compile("\D(\d\.)\D")
    sentence = r.sub("", sentence)

    # 3. 删除一些无关紧要的词以及语气助词

    r = re.compile(r"车主说|技师说|语音|图片|呢|吧|哈|啊|啦")
    sentence = r.sub("", sentence)

    # 第三步中添加的4个处理

    # 4. 删除带括号的 进口 海外

    r = re.compile(r"[((]进口[))]|\(海外\)")
    sentence = r.sub("", sentence)

    # 5. 删除除了汉字数字字母和,!?。.- 以外的字符

    r = re.compile("[^,!?。\.\-\u4e00-\u9fa5_a-zA-Z0-9]")

    # 6. 半角变为全角

    sentence = sentence.replace(",", ",")
    sentence = sentence.replace("!", "!")
    sentence = sentence.replace("?", "?")

    # 7. 问号叹号变为句号

    sentence = sentence.replace("?", "。")
    sentence = sentence.replace("!", "。")
    sentence = r.sub("", sentence)

    # 第四步添加的删除特定位置的特定字符

    # 8. 删除句子开头的逗号

    if sentence.startswith(','):
        sentence = sentence[1:]

    return sentence

if __name__ == '__main__':

    # 读取数据, 并指定编码格式为'utf-8'

    df = pd.read_csv('../data/dev.csv', engine='python', encoding='utf-8')
    texts = df['Dialogue'].tolist()
    # 初始化结果存放的列表

    results = []
    # 初始化textrank4zh类对象

    tr4s = TextRank4Sentence()

    # 循环遍历整个测试集, texts是经历前面数据预处理后的结果列表

    for i in range(len(texts)):
        text = clean_sentence(texts[i])
        # 直接调用分析函数

        tr4s.analyze(text=text, lower=True, source='all_filters')
        result = ''

        # 直接调用函数获取关键语句

        # num=3: 获取重要性最高的3个句子.

        # sentence_min_len=2: 句子的长度最小等于2.

        for item in tr4s.get_key_sentences(num=3, sentence_min_len=2):
            result += item.sentence
            result += '。'

        results.append(result)

        # 间隔100次打印结果

        if (i + 1) % 100 == 0:
            print(i + 1, result)

    print('result length: ', len(results))

    # 保存结果

    df['Prediction'] = results

    # 提取ID, Report, 和预测结果这3列

    df = df[['QID', 'Report', 'Prediction']]

    # 保存结果,这里自动生成一个结果名

    df.to_csv('../data/textrank_result_.csv', index=None, sep=',')

    # 将空行置换为随时联系, 文件保存格式指定为utf-8

    df = pd.read_csv('../data/textrank_result_.csv', engine='python', encoding='utf-8')
    df = df.fillna('随时联系。')

    # 将处理后的文件保存起来

    df.to_csv('../data/textrank_result_final_.csv', index=None, sep=',')

1.2 train_dev_split

import pandas as pd

# 切分训练集与验证集

train_path = '../data/train.csv'
test_path = '../data/test.csv'
df_train = pd.read_csv(train_path, encoding='utf-8')
# df_train.info()

train_sum = 70000
df_train_new = df_train.iloc[1:train_sum, :]
df_dev = df_train.iloc[train_sum:, :]

df_train_new.to_csv('../data/train_new.csv', columns=['QID','Brand','Model','Question','Dialogue','Report'], sep=',', index=None)
df_dev.to_csv('../data/dev.csv', columns=['QID','Brand','Model','Question','Dialogue','Report'] ,sep=',', index=None)


2. seq2seq

2.1 layers

# 导入工具包

import torch
import torch.nn as nn
import torch.nn.functional as F
import os
import sys

# 设定项目的root路径, 方便后续的代码文件导入

root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)

# 导入项目相关的代码文件

from utils.config import *
from utils.word2vec_utils import get_vocab_from_model, get_word_id_mappings,load_embedding_matrix_from_model


# 构建编码器类

class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_size):
        super(Encoder, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.enc_units = enc_units
        self.batch_size = batch_size

        # 第一层: 词嵌入层

        # self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # 采用word2wec预训练词向量

        # 参数embedding_matrix: 从文件中读取的词向量矩阵, 直接作为参数传入即可.

        embedding_matrix = torch.tensor(load_embedding_matrix_from_model(word_vector_path)).float()
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix)

        # 第二层: GRU层

        self.gru = nn.GRU(input_size=embedding_dim, hidden_size=enc_units, num_layers=1, batch_first=True)

    def forward(self, x, h0):
        # x.shape: (batch_size, sequence_length)

        # h0.shape: (num_layers, batch_size, enc_units)

        x = self.embedding(x)
        output, hn = self.gru(x, h0)
        return output, hn.transpose(1, 0)

    def initialize_hidden_state(self):
        # hidden state张量形状: (num_layers, batch_size, enc_units)

        return torch.zeros(1, self.batch_size, self.enc_units)

class Attention(nn.Module):
    def __init__(self, enc_units, dec_units, attn_units):
        super(Attention, self).__init__()
        self.enc_units = enc_units
        self.dec_units = dec_units
        self.attn_units = attn_units

        # 计算注意力的三次矩阵乘法, 对应着3个全连接层.

        self.w1 = nn.Linear(enc_units, attn_units)
        self.w2 = nn.Linear(dec_units, attn_units)
        self.v = nn.Linear(attn_units, 1)

    def forward(self, query, value):
        # query为上次的decoder隐藏层,shape: (batch_size, dec_units)

        # values为编码器的编码结果enc_output,shape: (batch_size, enc_seq_len, enc_units)

        # 在应用self.V之前,张量的形状是(batch_size, enc_seq_len, attention_units)

        # 得到score的shape: (batch_size, seq_len, 1)

        score = self.v(torch.tanh(self.w1(value) + self.w2(query)))

        # 注意力权重,是score经过softmax,但是要作用在第一个轴上(seq_len的轴)

        attention_weights = F.softmax(score, dim=1)

        # (batch_size, enc_seq_len, 1) * (batch_size, enc_seq_len, enc_units)

        # 广播, encoder unit的每个位置都对应相乘

        context_vector = attention_weights * value
        # 在最大长度enc_seq_len这一维度上求和

        context_vector = torch.sum(context_vector, dim=1)
        # context_vector求和之后的shape: (batch_size, enc_units)

        return context_vector, attention_weights

class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_size):
        super(Decoder, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.dec_units = dec_units
        self.batch_size = batch_size

        # self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # 采用word2wec预训练词向量

        # 参数embedding_matrix: 从文件中读取的词向量矩阵, 直接作为参数传入即可.
        
        embedding_matrix = torch.tensor(load_embedding_matrix_from_model(word_vector_path)).float()
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix)

        self.gru = nn.GRU(input_size=embedding_dim + dec_units,
                          hidden_size=dec_units,
                          num_layers=1,
                          batch_first=True)

        self.fc = nn.Linear(dec_units, vocab_size)

    def forward(self, x, context_vector):
        x = self.embedding(x)
        # x.shape after passing through embedding: (batch_size, 1, embedding_dim),1指的是一次只解码一个单词

        # 将上一循环的预测结果跟注意力权重值结合在一起作为本次的GRU网络输入

        x = torch.cat([torch.unsqueeze(context_vector, 1), x], dim=-1)

        output, hn = self.gru(x)
        output = output.squeeze(1)
        prediction = self.fc(output)
        return prediction, hn.transpose(1, 0)

if __name__ == '__main__':
    # word_to_id, id_to_word = get_word_id_mappings(vocab_path, reverse_vocab_path)
    word_to_id, id_to_word = get_vocab_from_model(word_vector_path)
    vocab_size = len(word_to_id)

    # 测试用参数

    EXAMPLE_INPUT_SEQUENCE_LEN = 300
    BATCH_SIZE = 64
    EMBEDDING_DIM = 500
    GRU_UNITS = 512
    ATTENTION_UNITS = 20

    encoder = Encoder(vocab_size, EMBEDDING_DIM, GRU_UNITS, BATCH_SIZE)

    input0 = torch.ones((BATCH_SIZE, EXAMPLE_INPUT_SEQUENCE_LEN), dtype=torch.long)
    h0 = encoder.initialize_hidden_state()
    output, hn = encoder(input0, h0)

    attention = Attention(GRU_UNITS, GRU_UNITS, ATTENTION_UNITS)
    context_vector, attention_weights = attention(hn, output)

    decoder = Decoder(vocab_size, EMBEDDING_DIM, GRU_UNITS, BATCH_SIZE)
    input1 = torch.ones((BATCH_SIZE, 1), dtype=torch.long)
    output1, hn = decoder(input1, context_vector)
    print(output1.shape)
    print(hn.shape)

    encoder = Encoder(vocab_size, EMBEDDING_DIM, GRU_UNITS, BATCH_SIZE)

    input0 = torch.ones((BATCH_SIZE, EXAMPLE_INPUT_SEQUENCE_LEN), dtype=torch.long)
    h0 = encoder.initialize_hidden_state()
    output, hn = encoder(input0, h0)
    print(output.shape)
    print(hn.shape)

2.2 model

import os
import sys

# 设定项目的root路径, 方便后续代码文件的导入

root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)

# 导入工具包和项目相关的代码文件

import torch
import torch.nn as nn
from src.layers import Encoder, Attention, Decoder
from utils.config import *
from utils.word2vec_utils import get_vocab_from_model, get_word_id_mappings


# 构建完整的seq2seq模型

class Seq2Seq(nn.Module):
    def __init__(self, params):
        super(Seq2Seq, self).__init__()

        embedding_matrix = load_embedding_matrix_from_model(word_vector_path)
        self.embedding_matrix = torch.from_numpy(embedding_matrix)

        self.params = params

        # 第一层: 编码器层

        self.encoder = Encoder(params['vocab_size'], params['embed_size'],
                               params['enc_units'], params['batch_size'])

        # 第二层: 注意力机制层

        self.attention = Attention(params['enc_units'], params['dec_units'], params['attn_units'])

        # 第三层: 解码器层

        self.decoder = Decoder(params['vocab_size'], params['embed_size'],
                               params['dec_units'], params['batch_size'])

    # 实质上是在调用解码器,因为需要注意力机制,直接封装到forward中. 要调用编码器直接encoder()即可

    def forward(self, dec_input, dec_hidden, enc_output, dec_target):
        # 这里的dec_input实质是(batch_size, 1)大小的<START>

        predictions = []

        # 拿编码器的输出和最终隐含层向量来计算

        context_vector, attention_weights = self.attention(dec_hidden, enc_output)

        # 循环解码

        for t in range(dec_target.shape[1]):
            # dec_input (batch_size, 1); dec_hidden (batch_size, hidden_units)

            pred, dec_hidden = self.decoder(dec_input, context_vector)

            context_vector, attention_weights = self.attention(dec_hidden, enc_output)

            # 使用teacher forcing, 并扩展维度到三维张量

            dec_input = dec_target[:, t].unsqueeze(1)

            predictions.append(pred)

        return torch.stack(predictions, 1), dec_hidden

if __name__ == '__main__':
    # word_to_id, id_to_word = get_word_id_mappings(vocab_path, reverse_vocab_path)

    word_to_id, id_to_word = get_vocab_from_model(word_vector_path)
    vocab_size = len(word_to_id)
    batch_size = 64
    input_seq_len = 300

    # 模拟测试参数

    params = {"vocab_size": vocab_size, "embed_size": 500, "enc_units": 512,
              "attn_units": 20, "dec_units": 512,"batch_size": batch_size}

    # 实例化类对象

    model = Seq2Seq(params)

    # 初始化测试输入数据

    sample_input_batch = torch.ones((batch_size, input_seq_len), dtype=torch.long)
    sample_hidden = model.encoder.initialize_hidden_state()

    # 调用Encoder进行编码

    sample_output, sample_hidden = model.encoder(sample_input_batch, sample_hidden)

    # 打印输出张量维度

    print('Encoder output shape: (batch_size, enc_seq_len, enc_units) {}'.format(sample_output.shape))
    print('Encoder Hidden state shape: (batch_size, enc_units) {}'.format(sample_hidden.shape))

    # 调用Attention进行注意力张量

    context_vector, attention_weights = model.attention(sample_hidden, sample_output)

    print("Attention context_vector shape: (batch_size, enc_units) {}".format(context_vector.shape))
    print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

    # 调用Decoder进行解码

    dec_input = torch.ones((batch_size, 1), dtype=torch.long)
    sample_decoder_output, _, = model.decoder(dec_input, context_vector)

    print('Decoder output shape: (batch_size, vocab_size) {}'.format(sample_decoder_output.shape))
    # 这里仅测试一步,没有用到dec_seq_len

2.3 train

import os
import sys
root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)

import torch
import torch.nn as nn
import torch.optim as optim
from batcher import train_batch_generator
import time
from src.model import Seq2Seq
from src.train_helper import train_model
from utils.config import *
from utils.params_utils import get_params
from utils.word2vec_utils import get_vocab_from_model, get_word_id_mappings


def train(params):
    # 读取word_to_id训练

    # word_to_id, _ = get_word_id_mappings(vocab_path, reverse_vocab_path)

    word_to_id, id_to_word = get_vocab_from_model(word_vector_path)

    # 动态添加字典大小参数

    params['vocab_size'] = len(word_to_id)

    # 构建模型

    print("Building the model ...")
    model = Seq2Seq(params)

    # 训练模型

    print('开始训练模型')
    train_model(model, word_to_id, params)

if __name__ == '__main__':
    params = get_params()
    train(params)

2.4 train_helper

import os
import sys
root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)
# print(root_path)

import torch
import torch.nn as nn
import torch.optim as optim
from batcher import train_batch_generator
import time


def train_model(model, word_to_id, params):
    # 载入参数:seq2seq模型的训练轮次以及batch大小

    epochs = params['seq2seq_train_epochs']
    batch_size = params['batch_size']

    pad_index = word_to_id['<PAD>']
    unk_index = word_to_id['<UNK>']
    start_index = word_to_id['<START>']

    params['vocab_size'] = len(word_to_id)

    # 如果有GPU则将模型放在GPU上训练

    device = torch.device('cuda:1' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    print('Model has put to GPU...')

    # 选定Adam优化器, 和交叉熵损失函数

    optimizer = optim.Adam(model.parameters(), lr=params['learning_rate'])
    criterion = nn.CrossEntropyLoss()

     # 定义损失函数
    def loss_function(pred, real):
        # 相同为1, 不同为0. 则0的位置天然形成了掩码

        pad_mask = torch.eq(real, pad_index)
        unk_mask = torch.eq(real, unk_index)
        mask = torch.logical_not(torch.logical_or(pad_mask, unk_mask))
        pred = pred.transpose(2, 1)
        # 真实标签乘以掩码后, 表达的是真实参与损失计算的序列

        real = real * mask
        loss_ = criterion(pred, real)
        return torch.mean(loss_)

    def train_step(enc_input, dec_target):
        initial_hidden_state = model.encoder.initialize_hidden_state()
        initial_hidden_state = initial_hidden_state.to(device)

        # 老三样: 1.梯度归零

        optimizer.zero_grad()

        enc_output, enc_hidden = model.encoder(enc_input, initial_hidden_state)

        # 第一个decoder输入, 构造(batch_size, 1)的<START>标签作为起始

        dec_input = torch.tensor([start_index] * batch_size)
        dec_input = dec_input.unsqueeze(1)

        # 第一个隐藏层输入

        dec_hidden = enc_hidden

        # for循环逐个预测序列

        dec_input = dec_input.to(device)
        dec_hidden = dec_hidden.to(device)
        enc_output = enc_output.to(device)
        dec_target = dec_target.to(device)
        predictions, _ = model(dec_input, dec_hidden, enc_output, dec_target)

        # 计算损失, 两个张量形状均为(batch, dec_target的len-1)

        loss = loss_function(predictions, dec_target)

        # 老三样: 2.反向传播 + 3.梯度更新

        loss.backward()
        optimizer.step()
        return loss.item()


    # 读取数据

    dataset, steps_per_epoch = train_batch_generator(batch_size)

    for epoch in range(epochs):
        start_time = time.time()
        total_loss = 0

        # 按批次数据进行训练

        for batch, (inputs, targets) in enumerate(dataset):
            inputs = inputs.to(device)
            targets = targets.to(device)
            # 将标签张量的类型转变成和输入张量一致

            targets = targets.type_as(inputs)
            batch_loss = train_step(inputs, targets)
            total_loss = total_loss + batch_loss

            # 每50个batch打印一次训练信息

            if (batch + 1) % 50 == 0:
                print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1, batch + 1, batch_loss))

        if (epoch + 1) % 2 == 0:
            MODEL_PATH = root_path + '/src/saved_model/' + 'model_' + str(epoch) + '.pt'
            torch.save(model.state_dict(), MODEL_PATH)
            print('The model has saved for epoch {}'.format(epoch + 1))
            print('Epoch {} Total Loss {:.4f}'.format(epoch + 1, total_loss))
            print('*************************************')

        # 打印一个epoch所用时间

        print('Time taken for 1 epoch {} sec\n'.format(time.time() - start_time))

2.5 test

import os
import sys
root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)

import torch
import torch.nn as nn
import time
import pandas as pd
from src.model import Seq2Seq
from src.test_helper import greedy_decode
from utils.data_loader import load_test_dataset
from utils.config import *
from utils.params_utils import get_params
from utils.word2vec_utils import get_vocab_from_model, get_word_id_mappings

def test(params):
    print("创建字典")
    # word_to_id, id_to_word = get_word_id_mappings(word_vector_path)

    # word_to_id, id_to_word = get_word_id_mappings(vocab_path, reverse_vocab_path)

    word_to_id, id_to_word = get_vocab_from_model(word_vector_path)

    # 动态添加字典大小参数

    params['vocab_size'] = len(word_to_id)

    print("创建模型")
    model = Seq2Seq(params)

    MODEL_PATH = root_path + '/src/saved_model/' + 'model_19.pt'
    model.load_state_dict(torch.load(MODEL_PATH))
    print("模型加载完毕!")

    # 预测结果、保存结果并返回

    print("生成测试数据迭代器")
    test_x = load_test_dataset()
    print("开始解码......")
    with torch.no_grad():
        results = greedy_decode(model, test_x, word_to_id, id_to_word, params)

    # 去掉预测结果之间的空格

    print("解码完毕, 开始保存结果......")
    results = list(map(lambda x: x.replace(" ", ""), results))
    save_predict_result(results)
    return results


def save_predict_result(results):
    # 读取原始文件(使用其QID列,合并新增的Prediction后再保存)

    print("读取原始测试数据...")
    test_df = pd.read_csv(test_raw_data_path)

    # 填充结果

    print("构建新的DataFrame并保存文件...")
    test_df['Prediction'] = results

    # 提取ID和预测结果两列

    test_df = test_df[['QID', 'Prediction']]

    # 保存结果, 这里自动生成一个结果名

    test_df.to_csv(get_result_filename(), index=None, sep=',')
    print("保存测试结果完毕!")


def get_result_filename():
    now_time = time.strftime('%Y_%m_%d_%H_%M_%S')
    filename = 'seq2seq_' + now_time + '.csv'
    result_path = os.path.join(result_save_path, filename)
    return result_path

if __name__ == '__main__':
    params = get_params()
    results = test(params)
    print(results[:10])

2.6 test_helper

import torch
import torch.nn as nn
import tqdm
import random


def greedy_decode(model, test_x, word_to_id, id_to_word, params):
    batch_size = params['batch_size']
    results = []

    total_test_num = len(test_x)
    # batch操作轮数math.ceil向上取整+1, 因为最后一个batch可能不足一个batch size大小, 但是依然需要计算

    step_epoch = total_test_num // batch_size + 1

    for i in range(step_epoch):
        batch_data = test_x[i * batch_size: (i + 1) * batch_size]
        results += batch_predict(model, batch_data, word_to_id, id_to_word, params)
        if (i + 1) % 10 == 0:
            print('i = ', i + 1)
    return results


def batch_predict(model, inputs, word_to_id, id_to_word, params):
    data_num = len(inputs)
    # 开辟结果存储list

    predicts = [''] * data_num

    # 输入参数inputs是从文件中加载的numpy类型数据, 需要转换成tensor类型

    inputs = torch.from_numpy(inputs)

    # 注意这里的batch_size与config中的batch_size不一定一致

    # 原因是最后一个batch可能不是64, 因此应当按以下形式初始化隐藏层, 而不要直接调用类内函数

    initial_hidden_state = torch.zeros(1, data_num, model.encoder.enc_units)

    # 第一步: 首先经过编码器的处理

    enc_output, enc_hidden = model.encoder(inputs, initial_hidden_state)

    # 为注意力层和解码器层处理准备数据

    dec_hidden = enc_hidden
    dec_input = torch.tensor([word_to_id['<START>']] * data_num)
    dec_input = dec_input.unsqueeze(1)

    # 第二步: 经过注意力层的处理, 得到语义内容分布张量context_vector

    context_vector, _ = model.attention(dec_hidden, enc_output)

    # 第三步: 解码器的解码流程是经典的"自回归"模式, 以for循环连续解码max_dec_len次.

    for t in range(params['max_dec_len']):
        # 计算上下文

        context_vector, attention_weights = model.attention(dec_hidden, enc_output)
        # 单步预测

        predictions, dec_hidden = model.decoder(dec_input, context_vector)


        # # id转换成字符, 采用贪心解码

        # predict_ids = torch.argmax(predictions, dim=1)

        predict_ids = []
        # 这里不采用贪心解码,而使用概率获取生成单词

        # print(predictions)

        for pre in predictions:
            val, index = pre.topk(10)
            rand = random.uniform(0, sum(val))
            cur = 0
            for i in range(len(val)):
                if cur < rand <= (cur + val[i]):
                    predict_ids.append(index[i])
                    break
                cur = cur + val[i]
        predict_ids = torch.tensor(predict_ids)
        # print(predict_ids)

        # 内层for循环是为了处理batch中的每一条数据.

        for index, p_id in enumerate(predict_ids.numpy()):
            predicts[index] += id_to_word[str(p_id)] + ' '

        dec_input = predict_ids.unsqueeze(1)

    results = []
    for pred in predicts:
        # 去掉句子前后空格

        pred = pred.strip()
        # 句子小于max_len就结束, 直接截断

        if '<STOP>' in pred:
            pred = pred[:pred.index('<STOP>')]
        results.append(pred)

    return results

2.7 batcher

# 导入工具包

import os
import sys
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# 设定项目的rootL路径, 方便后续相关代码文件的导入

root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)

# 导入项目相关的代码文件

from utils.data_loader import load_train_dataset, load_test_dataset

# 训练批次数据生成器函数

def train_batch_generator(batch_size, max_enc_len=300, max_dec_len=50, sample_num=None):
    # batch_size: batch大小

    # max_enc_len: 样本最大长度

    # max_dec_len: 标签最大长度

    # sample_num: 限定样本个数大小

    # 直接从已经预处理好的数据文件中加载训练集数据

    train_X, train_Y = load_train_dataset(max_enc_len, max_dec_len)
    # 对数据进行限定长度的切分

    if sample_num:
        train_X = train_X[:sample_num]
        train_Y = train_Y[:sample_num]

    # 将numpy类型的数据转换为Pytorch下的tensor类型, 因为TensorDataset只接收tensor类型数据

    x_data = torch.from_numpy(train_X)
    y_data = torch.from_numpy(train_Y)

    # 第一步: 先对数据进行封装

    dataset = TensorDataset(x_data, y_data)

    # 第二步: 再对dataset进行迭代器的构建

    # 如果机器没有GPU, 请采用下面的注释行代码

    # dataset = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

    # 如果机器有GPU, 请采用下面的代码, 可以加速训练流程

    dataset = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True,
                         num_workers=4, pin_memory=True)

    # 计算每个epoch要循环多少次

    steps_per_epoch = len(train_X) // batch_size

    # 将封装好的数据集和次数返回

    return dataset, steps_per_epoch

# 测试批次数据生成器函数

def test_batch_generator(batch_size, max_enc_len=300):
    # batch_size: batch大小

    # max_enc_len: 样本最大长度


    # 直接从已经预处理好的数据文件中加载测试集数据

    test_X = load_test_dataset(max_enc_len)

    # 将numpy类型的数据转换为Pytorch下的tensor类型, 因为TensorDataset只接收tensor类型数据

    x_data = torch.from_numpy(test_X)

    # 第一步: 先对数据进行封装

    dataset = TensorDataset(x_data)

    # 第二步: 再对dataset进行迭代器的构建

    dataset = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

    # 计算每个epoch要循环多少次

    steps_per_epoch = len(test_X) // batch_size

    # 将封装好的数据集和次数返回

    return dataset, steps_per_epoch

2.8 config

# 导入os工具包

import os

# 设置项目代码库的root路径, 为后续所有的包导入提供便利

root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
print(root_path)

# 设置原始数据文件的路径, 通过以项目root路径为基础, 逐级添加到文件路径

train_raw_data_path = os.path.join(root_path, 'data', 'train.csv')
test_raw_data_path = os.path.join(root_path, 'data', 'test.csv')

# 停用词路径和jieba分词用户自定义字典路径

stop_words_path = os.path.join(root_path, 'data', 'stopwords.txt')
user_dict_path = os.path.join(root_path, 'data', 'user_dict.txt')

# 预处理+切分后的训练测试数据路径

train_seg_path = os.path.join(root_path, 'data', 'train_seg_data.csv')
test_seg_path = os.path.join(root_path, 'data', 'test_seg_data.csv')

# 将训练集和测试机数据混合后的文件路径

merged_seg_path = os.path.join(root_path, 'data', 'merged_seg_data.csv')

# 样本与标签分离,并经过pad处理后的数据路径

train_x_pad_path = os.path.join(root_path, 'data', 'train_X_pad_data.csv')
train_y_pad_path = os.path.join(root_path, 'data', 'train_Y_pad_data.csv')
test_x_pad_path = os.path.join(root_path, 'data', 'test_X_pad_data.csv')

# numpy转换为数字后最终使用的的数据路径

train_x_path = os.path.join(root_path, 'data', 'train_X.npy')
train_y_path = os.path.join(root_path, 'data', 'train_Y.npy')
test_x_path = os.path.join(root_path, 'data', 'test_X.npy')

# 正向词典和反向词典路径

vocab_path = os.path.join(root_path, 'data', 'wv', 'vocab.txt')
reverse_vocab_path = os.path.join(root_path, 'data', 'wv', 'reverse_vocab.txt')

# 测试集结果保存路径

result_save_path = os.path.join(root_path, 'data', 'result')

# 增加词向量模型路径

word_vector_path = os.path.join(root_path, 'data', 'wv', 'word2vec.model')

2.9 data_loader

import numpy as np
import os
import sys
import re
import jieba
import pandas as pd
import numpy as np
# 并行处理模块

from utils.multi_proc_utils import parallelize
# 参数模块

from utils.params_utils import get_params
# 配置模块

from utils.config import *
from utils.word2vec_utils import get_vocab_from_model, get_word_id_mappings

root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)



def get_max_len(data):
    # 获得合适的最大长度值(被build_dataset调用)

    # data: 待统计的数据train_df['Question']

    # 句子最大长度为空格数+1

    max_lens = data.apply(lambda x: x.count(' ') + 1)
    # 平均值+2倍方差的方式

    return int(np.mean(max_lens) + 2 * np.std(max_lens))

def transform_data(sentence, word_to_id):
    # 句子转换为index序列(被build_dataset调用)

    # sentence: 'word1 word2 word3 ...'  ->  [index1, index2, index3 ...]

    # word_to_id: 映射字典

    # 字符串切分成词

    words = sentence.split(' ')

    # 按照word_to_id的id进行转换, 到未知词就填充unk的索引

    ids = [word_to_id[w] if w in word_to_id else word_to_id['<UNK>'] for w in words]

    # 返回映射后的文本id值列表

    return ids

def pad_proc(sentence, max_len, word_to_id):
    # 根据max_len和vocab填充<START> <STOP> <PAD> <UNK>

    # 0. 按空格统计切分出词

    words = sentence.strip().split(' ')

    # 1. 截取规定长度的词数

    words = words[:max_len]

    # 2. 填充<UNK>

    sentence = [w if w in word_to_id else '<UNK>' for w in words]

    # 3. 填充<START> <END>

    sentence = ['<START>'] + sentence + ['<STOP>']

    # 4. 判断长度,填充<PAD>

    sentence = sentence + ['<PAD>'] * (max_len - len(words))

    # 以空格连接列表, 返回结果字符串

    return ' '.join(sentence)

def load_stop_words(stop_word_path):
    # 加载停用词(程序调用)

    # stop_word_path: 停用词路径


    # 打开停用词文件

    f = open(stop_word_path, 'r', encoding='utf-8')
    # 读取所有行

    stop_words = f.readlines()

    # 去除每一个停用词前后 空格 换行符

    stop_words = [stop_word.strip() for stop_word in stop_words]
    return stop_words

# 加载停用词, 这里面的stop_words_path是早已在config.py文件中配置好的

stop_words = load_stop_words(stop_words_path)
print('stop_words: ', stop_words[:20])

def clean_sentence(sentence):
    # 特殊符号去除(被sentence_proc调用)

    # sentence: 待处理的字符串

    if isinstance(sentence, str):
        # 删除1. 2. 3. 这些标题

        r = re.compile("\D(\d\.)\D")
        sentence = r.sub("", sentence)

        # 删除带括号的 进口 海外

        r = re.compile(r"[((]进口[))]|\(海外\)")
        sentence = r.sub("", sentence)
        # 删除除了汉字数字字母和,!?。.- 以外的字符

        r = re.compile("[^,!?。\.\-\u4e00-\u9fa5_a-zA-Z0-9]")
        # 用中文输入法下的,!?来替换英文输入法下的,!?

        sentence = sentence.replace(",", ",")
        sentence = sentence.replace("!", "!")
        sentence = sentence.replace("?", "?")
        sentence = r.sub("", sentence)

        # 删除 车主说 技师说 语音 图片 你好 您好

        r = re.compile(r"车主说|技师说|语音|图片|你好|您好")
        sentence = r.sub("", sentence)
        return sentence
    else:
        return ''

def filter_stopwords(seg_list):
    # 过滤一句切好词的话中的停用词(被sentence_proc调用)

    # seg_list: 切好词的列表 [word1 ,word2 .......]

    # 首先去掉多余空字符

    words = [word for word in seg_list if word]
    # 去掉停用词

    return [word for word in words if word not in stop_words]

def sentence_proc(sentence):
    # 预处理模块(处理一条句子, 被sentences_proc调用)

    # sentence: 待处理字符串

    # 第一步: 执行清洗原始文本的操作

    sentence = clean_sentence(sentence)

    # 第二步: 执行分词操作, 默认精确模式, 全模式cut参数cut_all=True

    words = jieba.cut(sentence)

    # 第三步: 将分词结果输入过滤停用词函数中

    words = filter_stopwords(words)

    # 返回字符串结果, 按空格分隔, 将过滤停用词后的列表拼接

    return ' '.join(words)

def sentences_proc(df):
    # 预处理模块(处理一个句子列表, 对每个句子调用sentence_proc操作)

    # df: 数据集

    # 批量预处理训练集和测试集

    for col_name in ['Brand', 'Model', 'Question', 'Dialogue']:
        df[col_name] = df[col_name].apply(sentence_proc)

    # 训练集Report预处理

    if 'Report' in df.columns:
        df['Report'] = df['Report'].apply(sentence_proc)

    # 以Pandas的DataFrame格式返回

    return df

import numpy as np

# 加载处理好的训练样本和训练标签.npy文件(执行完build_dataset后才能使用)

def load_train_dataset(max_enc_len=300, max_dec_len=50):
    # max_enc_len: 最长样本长度, 后面的截断

    # max_dec_len: 最长标签长度, 后面的截断

    train_X = np.load(train_x_path)
    train_Y = np.load(train_y_path)

    train_X = train_X[:, :max_enc_len]
    train_Y = train_Y[:, :max_dec_len]

    return train_X, train_Y

# 加载处理好的测试样本.npy文件(执行完build_dataset后才能使用)

def load_test_dataset(max_enc_len=300):
    # max_enc_len: 最长样本长度, 后面的截断

    test_X = np.load(test_x_path)
    test_X = test_X[:, :max_enc_len]
    return test_X

# 作废!!! 使用下方通过word2vec训练的model

def build_dataset_not_used(train_raw_data_path, test_raw_data_path):
    # 1. 加载原始数据

    print('1. 加载原始数据')
    print(train_raw_data_path)
    # 必须设定数据格式为utf-8

    train_df = pd.read_csv(train_raw_data_path, engine='python', encoding='utf-8')
    test_df = pd.read_csv(test_raw_data_path, engine='python', encoding='utf-8')

    # 82943, 20000

    print('原始训练集行数 {}, 测试集行数 {}'.format(len(train_df), len(test_df)))
    print('\n')

    # 2. 空值去除(对于一行数据, 任意列只要有空值就去掉该行)

    print('2. 空值去除(对于一行数据,任意列只要有空值就去掉该行)')
    train_df.dropna(subset=['Question', 'Dialogue', 'Report'], how='any', inplace=True)
    test_df.dropna(subset=['Question', 'Dialogue'], how='any', inplace=True)
    print('空值去除后训练集行数 {}, 测试集行数 {}'.format(len(train_df), len(test_df)))
    print('\n')

    # 3. 多线程, 批量数据预处理(对每个句子执行sentence_proc, 清除无用词, 分词, 过滤停用词, 再用空格拼接为一个字符串)

    print('3. 多线程, 批量数据预处理(对每个句子执行sentence_proc, 清除无用词, 分词, 过滤停用词, 再用空格拼接为一个字符串)')
    train_df = parallelize(train_df, sentences_proc)
    test_df = parallelize(test_df, sentences_proc)
    print('\n')
    print('sentences_proc has done!')

    # 4. 合并训练测试集, 用于构造映射字典word_to_id

    print('4. 合并训练测试集, 用于构造映射字典word_to_id')
    # 新建一列, 按行堆积

    train_df['merged'] = train_df[['Question', 'Dialogue', 'Report']].apply(lambda x: ' '.join(x), axis=1)
    # 新建一列, 按行堆积

    test_df['merged'] = test_df[['Question', 'Dialogue']].apply(lambda x: ' '.join(x), axis=1)
    # merged列是训练集三列和测试集两列按行连接在一起再按列堆积, 用于构造映射字典

    # 按列堆积, 用于构造映射字典

    merged_df = pd.concat([train_df[['merged']], test_df[['merged']]], axis=0)
    print('训练集行数{}, 测试集行数{}, 合并数据集行数{}'.format(len(train_df), len(test_df), len(merged_df)))
    print('\n')

    # 5. 保存分割处理好的train_seg_data.csv, test_set_data.csv

    print('5. 保存分割处理好的train_seg_data.csv, test_set_data.csv')
    # 把建立的列merged去掉, 该列对于神经网络无用

    train_df = train_df.drop(['merged'], axis=1)
    test_df = test_df.drop(['merged'], axis=1)
    # 将处理后的数据存入持久化文件

    train_df.to_csv(train_seg_path, index=None, header=True)
    test_df.to_csv(test_seg_path, index=None, header=True)
    print('The csv_file has saved!')
    print('\n')

    # 6. 保存合并数据merged_seg_data.csv, 用于构造映射字典word_to_id

    print('6. 保存合并数据merged_seg_data.csv, 用于构造映射字典word_to_id')
    merged_df.to_csv(merged_seg_path, index=None, header=False)
    print('The word_to_vector file has saved!')
    print('\n')

    # 7. 构建word_to_id字典和id_to_word字典, 根据第6步存储的合并文件数据来完成.

    word_to_id = {}
    count = 0

    # 对训练集数据X进行处理

    with open(merged_seg_path, 'r', encoding='utf-8') as f1:
        for line in f1.readlines():
            line = line.strip().split(' ')
            for w in line:
                if w not in word_to_id:
                    word_to_id[w] = 1
                    count += 1
                else:
                    word_to_id[w] += 1

    print('总体单词总数count=', count)
    print('\n')

    res_dict = {}
    number = 0
    for w, i in word_to_id.items():
        if i >= 5:
            res_dict[w] = i
            number += 1

    print('进入到字典中的单词总数number=', number)
    print('合并数据集的字典构造完毕, word_to_id容量: ', len(res_dict))
    print('\n')

    word_to_id = {}
    count = 0
    for w, i in res_dict.items():
        if w not in word_to_id:
            word_to_id[w] = count
            count += 1

    print('最终构造完毕字典, word_to_id容量=', len(word_to_id))
    print('count=', count)

    # 8. 将Question和Dialogue用空格连接作为模型输入形成train_df['X']

    print("8. 将Question和Dialogue用空格连接作为模型输入形成train_df['X']")
    train_df['X'] = train_df[['Question', 'Dialogue']].apply(lambda x: ' '.join(x), axis=1)
    test_df['X'] = test_df[['Question', 'Dialogue']].apply(lambda x: ' '.join(x), axis=1)
    print('\n')

    # 9. 填充<START>, <STOP>, <UNK>和<PAD>, 使数据变为等长

    print('9. 填充<START>, <STOP>, <UNK> 和 <PAD>, 使数据变为等长')

    # 获取适当的最大长度

    train_x_max_len = get_max_len(train_df['X'])
    test_x_max_len = get_max_len(test_df['X'])
    train_y_max_len = get_max_len(train_df['Report'])

    print('填充前训练集样本的最大长度为: ', train_x_max_len)
    print('填充前测试集样本的最大长度为: ', test_x_max_len)
    print('填充前训练集标签的最大长度为: ', train_y_max_len)


    # 选训练集和测试集中较大的值

    x_max_len = max(train_x_max_len, test_x_max_len)

    # 训练集X填充处理

    # train_df['X'] = train_df['X'].apply(lambda x: pad_proc(x, x_max_len, vocab))

    print('训练集X填充PAD, START, STOP, UNK处理中...')
    train_df['X'] = train_df['X'].apply(lambda x: pad_proc(x, x_max_len, word_to_id))
    # 测试集X填充处理

    print('测试集X填充PAD, START, STOP, UNK处理中...')
    test_df['X'] = test_df['X'].apply(lambda x: pad_proc(x, x_max_len, word_to_id))
    # 训练集Y填充处理

    print('训练集Y填充PAD, START, STOP, UNK处理中...')
    train_df['Y'] = train_df['Report'].apply(lambda x: pad_proc(x, train_y_max_len, word_to_id))
    print('\n')

    # 10. 保存填充<START>, <STOP>, <UNK>和<PAD>后的X和Y

    print('10. 保存填充<START>, <STOP>, <UNK> 和 <PAD>后的X和Y')
    train_df['X'].to_csv(train_x_pad_path, index=None, header=False)
    train_df['Y'].to_csv(train_y_pad_path, index=None, header=False)
    test_df['X'].to_csv(test_x_pad_path, index=None, header=False)
    print('填充后的三个文件保存完毕!')
    print('\n')

    # 11. 重新构建word_to_id字典和id_to_word字典, 根据第10步存储的3个文件数据来完成.

    word_to_id = {}
    count = 0

    # 对训练集数据X进行处理

    with open(train_x_pad_path, 'r', encoding='utf-8') as f1:
        for line in f1.readlines():
            line = line.strip().split(' ')
            for w in line:
                if w not in word_to_id:
                    word_to_id[w] = count
                    count += 1

    print('训练集X字典构造完毕, word_to_id容量: ', len(word_to_id))

    # 对训练集数据Y进行处理

    with open(train_y_pad_path, 'r', encoding='utf-8') as f2:
        for line in f2.readlines():
            line = line.strip().split(' ')
            for w in line:
                if w not in word_to_id:
                    word_to_id[w] = count
                    count += 1

    print('训练集Y字典构造完毕, word_to_id容量: ', len(word_to_id))

    # 对测试集数据X进行处理

    with open(test_x_pad_path, 'r', encoding='utf-8') as f3:
        for line in f3.readlines():
            line = line.strip().split(' ')
            for w in line:
                if w not in word_to_id:
                    word_to_id[w] = count
                    count += 1

    print('测试集X字典构造完毕, word_to_id容量: ', len(word_to_id))
    print('单词总数量count= ', count)

    # 构造逆向字典id_to_word

    id_to_word = {}
    for w, i in word_to_id.items():
        id_to_word[i] = w

    print('逆向字典构造完毕, id_to_word容量: ', len(id_to_word))
    print('\n')

    # 12. 更新vocab并保存

    print('12. 更新vocab并保存')
    save_vocab_as_txt(vocab_path, word_to_id)
    save_vocab_as_txt(reverse_vocab_path, id_to_word)
    print('字典映射器word_to_id, id_to_word保存完毕!')
    print('\n')


    # 13. 数据集转换 将词转换成索引[<START> 方向机 重 ...] -> [32800, 403, 986, 246, 231]

    print('13. 数据集转换 将词转换成索引[<START> 方向机 重 ...] -> [32800, 403, 986, 246, 231]')
    print('训练集X执行transform_data中......')
    train_ids_x = train_df['X'].apply(lambda x: transform_data(x, word_to_id))
    print('训练集Y执行transform_data中......')
    train_ids_y = train_df['Y'].apply(lambda x: transform_data(x, word_to_id))
    print('测试集X执行transform_data中......')
    test_ids_x = test_df['X'].apply(lambda x: transform_data(x, word_to_id))
    print('\n')

    # 14. 数据转换成numpy数组(需等长)

    # 将索引列表转换成矩阵 [32800, 403, 986, 246, 231] --> array([[32800, 403, 986, 246, 231], ...])

    print('14. 数据转换成numpy数组(需等长)')
    train_X = np.array(train_ids_x.tolist())
    train_Y = np.array(train_ids_y.tolist())
    test_X = np.array(test_ids_x.tolist())
    print('转换为numpy数组的形状如下: \ntrain_X的shape为: ', train_X.shape, '\ntrain_Y的shape为: ', train_Y.shape, '\ntest_X的shape为: ', test_X.shape)
    print('\n')

    # 15. 保存数据

    print('15. 保存数据......')
    np.save(train_x_path, train_X)
    np.save(train_y_path, train_Y)
    np.save(test_x_path, test_X)
    print('\n')
    print('数据集构造完毕, 存储于seq2seq/data/目录下.')

# 数据预处理总函数, 用于数据加载 + 预处理 (注意: 只需执行一次)

# 使用word2vec进行词向量训练

def build_dataset(train_raw_data_path, test_raw_data_path):
    # 1. 加载原始数据
    print('1. 加载原始数据')
    print(train_raw_data_path)
    # 必须设定数据格式为utf-8
    train_df = pd.read_csv(train_raw_data_path, engine='python', encoding='utf-8')
    test_df = pd.read_csv(test_raw_data_path, engine='python', encoding='utf-8')

    # 82943, 20000
    print('原始训练集行数 {}, 测试集行数 {}'.format(len(train_df), len(test_df)))
    print('\n')

    # 2. 空值去除(对于一行数据, 任意列只要有空值就去掉该行)
    print('2. 空值去除(对于一行数据,任意列只要有空值就去掉该行)')
    train_df.dropna(subset=['Question', 'Dialogue', 'Report'], how='any', inplace=True)
    test_df.dropna(subset=['Question', 'Dialogue'], how='any', inplace=True)
    print('空值去除后训练集行数 {}, 测试集行数 {}'.format(len(train_df), len(test_df)))
    print('\n')

    # 3. 多线程, 批量数据预处理(对每个句子执行sentence_proc, 清除无用词, 分词, 过滤停用词, 再用空格拼接为一个字符串)
    print('3. 多线程, 批量数据预处理(对每个句子执行sentence_proc, 清除无用词, 分词, 过滤停用词, 再用空格拼接为一个字符串)')
    train_df = parallelize(train_df, sentences_proc)
    test_df = parallelize(test_df, sentences_proc)
    print('\n')
    print('sentences_proc has done!')

    # 4. 合并训练测试集, 用于构造映射字典word_to_id
    print('4. 合并训练测试集, 用于构造映射字典word_to_id')
    # 新建一列, 按行堆积
    train_df['merged'] = train_df[['Question', 'Dialogue', 'Report']].apply(lambda x: ' '.join(x), axis=1)
    # 新建一列, 按行堆积
    test_df['merged'] = test_df[['Question', 'Dialogue']].apply(lambda x: ' '.join(x), axis=1)
    # merged列是训练集三列和测试集两列按行连接在一起再按列堆积, 用于构造映射字典
    # 按列堆积, 用于构造映射字典
    merged_df = pd.concat([train_df[['merged']], test_df[['merged']]], axis=0)
    print('训练集行数{}, 测试集行数{}, 合并数据集行数{}'.format(len(train_df), len(test_df), len(merged_df)))
    print('\n')

    # 5. 保存分割处理好的train_seg_data.csv, test_set_data.csv
    print('5. 保存分割处理好的train_seg_data.csv, test_set_data.csv')
    # 把建立的列merged去掉, 该列对于神经网络无用
    train_df = train_df.drop(['merged'], axis=1)
    test_df = test_df.drop(['merged'], axis=1)
    # 将处理后的数据存入持久化文件
    train_df.to_csv(train_seg_path, index=None, header=True)
    test_df.to_csv(test_seg_path, index=None, header=True)
    print('The csv_file has saved!')
    print('\n')

    # 6. 保存合并数据merged_seg_data.csv, 用于构造映射字典word_to_id
    print('6. 保存合并数据merged_seg_data.csv, 用于构造映射字典word_to_id')
    merged_df.to_csv(merged_seg_path, index=None, header=False)
    print('The word_to_vector file has saved!')
    print('\n')

    from gensim.models.word2vec import LineSentence, Word2Vec

    # 7. 训练词向量, LineSentence传入csv文件名
    print('7. 训练词向量, LineSentence传入csv文件名.')
    # gensim中的Word2Vec算法默认采用CBOW模式训练.
    wv_model = Word2Vec(LineSentence(merged_seg_path), vector_size=params['embed_size'],
                        negative=5, workers=8, epochs=params['wv_train_epochs'],
                        window=3, min_count=5)
    print('The wv_model has trained over!')
    print('\n')

    # 8. 将Question和Dialogue用空格连接作为模型输入形成train_df['X']
    print("8. 将Question和Dialogue用空格连接作为模型输入形成train_df['X']")
    train_df['X'] = train_df[['Question', 'Dialogue']].apply(lambda x: ' '.join(x), axis=1)
    test_df['X'] = test_df[['Question', 'Dialogue']].apply(lambda x: ' '.join(x), axis=1)
    print('\n')

    # 9. 填充<START>, <STOP>, <UNK>和<PAD>, 使数据变为等长
    print('9. 填充<START>, <STOP>, <UNK> 和 <PAD>, 使数据变为等长')

    # 获取适当的最大长度
    train_x_max_len = get_max_len(train_df['X'])
    test_x_max_len = get_max_len(test_df['X'])
    train_y_max_len = get_max_len(train_df['Report'])

    print('填充前训练集样本的最大长度为: ', train_x_max_len)
    print('填充前测试集样本的最大长度为: ', test_x_max_len)
    print('填充前训练集标签的最大长度为: ', train_y_max_len)

    # 选训练集和测试集中较大的值
    x_max_len = max(train_x_max_len, test_x_max_len)

    # 训练集X填充处理
    # train_df['X'] = train_df['X'].apply(lambda x: pad_proc(x, x_max_len, vocab))
    word_to_id, _ = get_word_id_mappings(vocab_path, reverse_vocab_path)
    print('训练集X填充PAD, START, STOP, UNK处理中...')
    train_df['X'] = train_df['X'].apply(lambda x: pad_proc(x, x_max_len, word_to_id))
    # 测试集X填充处理
    print('测试集X填充PAD, START, STOP, UNK处理中...')
    test_df['X'] = test_df['X'].apply(lambda x: pad_proc(x, x_max_len, word_to_id))
    # 训练集Y填充处理
    print('训练集Y填充PAD, START, STOP, UNK处理中...')
    train_df['Y'] = train_df['Report'].apply(lambda x: pad_proc(x, train_y_max_len, word_to_id))
    print('\n')

    # 10. 保存填充<START>, <STOP>, <UNK>和<PAD>后的X和Y
    print('10. 保存填充<START>, <STOP>, <UNK>和<PAD>后的X和Y')
    train_df['X'].to_csv(train_x_pad_path, index=None, header=False)
    train_df['Y'].to_csv(train_y_pad_path, index=None, header=False)
    test_df['X'].to_csv(test_x_pad_path, index=None, header=False)
    print('填充后的三个文件保存完毕!')
    print('\n')

    # 11. 重新训练词向量,将<START>, <STOP>, <UNK>, <PAD>加入词典最后
    print('11. 重新训练词向量, 将<START>, <STOP>, <UNK>, <PAD>加入词典最后')
    wv_model.build_vocab(LineSentence(train_x_pad_path), update=True)
    wv_model.train(LineSentence(train_x_pad_path),
                   epochs=params['wv_train_epochs'],
                   total_examples=wv_model.corpus_count)
    print('1/3 train_x_pad_path')
    wv_model.build_vocab(LineSentence(train_y_pad_path), update=True)
    wv_model.train(LineSentence(train_y_pad_path),
                   epochs=params['wv_train_epochs'],
                   total_examples=wv_model.corpus_count)
    print('2/3 train_y_pad_path')
    wv_model.build_vocab(LineSentence(test_x_pad_path), update=True)
    wv_model.train(LineSentence(test_x_pad_path),
                   epochs=params['wv_train_epochs'],
                   total_examples=wv_model.corpus_count)
    print('3/3 test_x_pad_path')

    # 保存词向量模型.model
    wv_model.save(word_vector_path)
    print('词向量训练完成')
    # print('最终词向量的词典大小为: ', len(wv_model.wv.vocab))
    print('\n')

    # 12. 更新vocab并保存
    print('12. 更新vocab并保存')
    word_to_id = {word: index for index, word in enumerate(wv_model.wv.index_to_key)}
    id_to_word = {index: word for index, word in enumerate(wv_model.wv.index_to_key)}
    save_vocab_as_txt(vocab_path, word_to_id)
    save_vocab_as_txt(reverse_vocab_path, id_to_word)
    print('更新后的word_to_id, id_to_word保存完毕!')
    print('\n')

    # 13. 数据集转换 将词转换成索引  [<START> 方向机 重 ...] -> [32800, 403, 986, 246, 231]
    print('13. 数据集转换 将词转换成索引  [<START> 方向机 重 ...] -> [32800, 403, 986, 246, 231]')
    print('训练集X执行transform_data中......')
    train_ids_x = train_df['X'].apply(lambda x: transform_data(x, word_to_id))
    print('训练集Y执行transform_data中......')
    train_ids_y = train_df['Y'].apply(lambda x: transform_data(x, word_to_id))
    print('测试集X执行transform_data中......')
    test_ids_x = test_df['X'].apply(lambda x: transform_data(x, word_to_id))
    print('\n')

    # 14. 数据转换成numpy数组(需等长)
    # 将索引列表转换成矩阵 [32800, 403, 986, 246, 231] --> array([[32800, 403, 986, 246, 231], ...])
    print('14. 数据转换成numpy数组(需等长)')
    train_X = np.array(train_ids_x.tolist())
    train_Y = np.array(train_ids_y.tolist())
    test_X = np.array(test_ids_x.tolist())
    print('转换为numpy数组的形状如下: \ntrain_X的shape为: ', train_X.shape, '\ntrain_Y的shape为: ', train_Y.shape, '\ntest_X的shape为: ', test_X.shape)
    print('\n')

    # 15. 保存数据
    print('15. 保存数据')
    np.save(train_x_path, train_X)
    np.save(train_y_path, train_Y)
    np.save(test_x_path, test_X)
    print('\n')
    print('数据集构造完毕,于seq2seq/data/目录下')

# 导入若干工具包

import re
import jieba
import pandas as pd
import numpy as np
import os
import sys

# 设定项目的root路径, 方便后续各个模块代码的导入

root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(root_path)

# 并行处理模块

from utils.multi_proc_utils import parallelize
# 配置模块

from utils.config import *
# 参数模块

from utils.params_utils import get_params
# 保存字典为txt

from utils.word2vec_utils import save_vocab_as_txt

# 载入词向量参数

params = get_params()
# jieba载入自定义切词表

jieba.load_userdict(user_dict_path)


if __name__ == '__main__':
    build_dataset(train_raw_data_path, test_raw_data_path)

2.10 multi_proc_utils

import pandas as pd
import numpy as np
from multiprocessing import cpu_count, Pool

cores = cpu_count()
partitions = cores
print(cores)

def parallelize(df, func):
    data_split = np.array_split(df, partitions)
    # 初始化线程池

    pool = Pool(cores)
    # 数据分发, 处理, 再合并

    data = pd.concat(pool.map(func, data_split))
    pool.close()
    # 执行完close后不会有新的进程加入到pool, join函数等待所有子进程结束

    pool.join()
    # 返回处理后的数据

    return data


2.11 params_utils


import argparse

def get_params():
    parser = argparse.ArgumentParser()
    # 编码器和解码器的最大序列长度

    parser.add_argument("--max_enc_len", default=300, help="Encoder input max sequence length", type=int)
    parser.add_argument("--max_dec_len", default=50, help="Decoder input max sequence length", type=int)
    # 一个训练批次的大小

    parser.add_argument("--batch_size", default=64, help="Batch size", type=int)
    # seq2seq训练轮数

    parser.add_argument("--seq2seq_train_epochs", default=20, help="Seq2seq model training epochs", type=int)
    # 词嵌入大小

    parser.add_argument("--embed_size", default=500, help="Words embeddings dimension", type=int)
    # word2vec模型训练轮数

    parser.add_argument("--wv_train_epochs", default=10, help="Word2vec model training epochs", type=int)
    # 编码器、解码器以及attention的隐含层单元数

    parser.add_argument("--enc_units", default=512, help="Encoder GRU cell units number", type=int)
    parser.add_argument("--dec_units", default=512, help="Decoder GRU cell units number", type=int)
    parser.add_argument("--attn_units", default=20, help="Used to compute the attention weights", type=int)
    # 学习率

    parser.add_argument("--learning_rate", default=0.001, help="Learning rate", type=float)
    args = parser.parse_args()

    # param是一个字典类型的变量,键为参数名,值为参数值

    params = vars(args)
    return params

if __name__ == '__main__':
    res = get_params()
    print(res)

2.22 word2vec_utils

from gensim.models.word2vec import Word2Vec


def load_embedding_matrix_from_model(wv_model_path):
    # 从word2vec模型中获取词向量矩阵

    # wv_model_path: word2vec模型的路径

    wv_model = Word2Vec.load(wv_model_path)
    # wv_model.wv.vectors包含词向量矩阵

    embedding_matrix = wv_model.wv.vectors
    return embedding_matrix

def get_vocab_from_model(wv_model_path):
    # 从word2vec模型中获取正向和反向词典

    # wv_model_path: word2vec模型的路径

    wv_model = Word2Vec.load(wv_model_path)
    id_to_word = {index: word for index, word in enumerate(wv_model.wv.index2word)}
    word_to_id = {word: index for index, word in enumerate(wv_model.wv.index2word)}
    return word_to_id, id_to_word


def save_vocab_as_txt(filename, word_to_id):
    # 保存字典

    # filename: 目标txt文件路径

    # word_to_id: 要保存的字典

    with open(filename, 'w', encoding='utf-8') as f:
        for k, v in word_to_id.items():
            f.write("{}\t{}\n".format(k, v))

def get_word_id_mappings(vocab_path, reverse_vocab_path):
    word_to_id = {}
    id_to_word = {}
    with open(vocab_path, 'r') as f:
        for line in f.readlines():
            val = line.split('\t')
            word_to_id[val[0].strip()] = int(val[1].strip())
    with open(reverse_vocab_path, 'r') as f:
        for line in f.readlines():
            val = line.split('\t')
            id_to_word[val[0].strip()] = val[1].strip()
    return word_to_id, id_to_word