Convolutional Sequence to Sequence Learning
Paper: Convolutional Sequence to Sequence Learning
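Introduction
Project 5 differs from the earlier projects: instead of a recurrent neural network, it uses a convolutional neural network, an architecture more commonly used for image processing.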
Preparing the Data
The first part is unchanged from the earlier projects.
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.datasets import Multi30k
# Field and BucketIterator live in torchtext.data up to torchtext 0.8;
# on torchtext 0.9-0.11 import them from torchtext.legacy.data instead.
from torchtext.data import Field, BucketIterator

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import spacy
import numpy as np

import random
import math
import time

# Fix all random seeds for reproducible results
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# German model for the source side, English model for the target side
spacy_de = spacy.load("de_core_news_sm")
spacy_en = spacy.load("en_core_web_sm")

def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]
```
```python
# batch_first = True because the convolutional layers expect the batch dimension first
SRC = Field(tokenize = tokenize_de,
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True,
            batch_first = True)

TRG = Field(tokenize = tokenize_en,
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True,
            batch_first = True)

train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'),
                                                    fields=(SRC, TRG))

# Tokens that appear fewer than 2 times are mapped to <unk>
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    device = device)
```
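Building the model
The next step is to build the model. As before, it consists of an encoder and a decoder. The encoder encodes the input sentence in the source language into context vectors, and the decoder decodes those context vectors to produce the output sentence in the target language.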
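Encoder
The earlier RNN-based encoders compress the entire input sentence into a single context vector $z$. The encoder of the convolutional sequence-to-sequence model is different: it produces two context vectors for every token in the input sentence. So if the input sentence has 6 tokens, we get 12 context vectors.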
The figure below shows the input sentence zwei menschen fechten passing through the encoder.
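First, each token passes through an embedding layer. Because this model has no recurrent connections, however, it knows nothing about the order of the tokens in the sequence. To fix this, we add a second embedding layer, the positional embedding layer, whose input is not the token itself but the token's position in the sequence.

The token and positional embeddings are then summed elementwise to give an embedding vector that carries information about both the token and its position. A linear layer follows, which transforms this vector into one with the required hidden size.

Next, the hidden vectors are passed through $N$ convolutional blocks. After the convolutional blocks, the vectors go through another linear layer that converts them from hid_dim back to the embedding size. The result is the conved vector.

Finally, a residual connection sums the conved vector and the embedding vector elementwise, giving a combined vector for each token. Every token in the input sequence therefore has its own combined vector.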
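The token-plus-position step can be sketched in plain Python, without PyTorch, so the arithmetic stays visible. The lookup tables `tok_table` and `pos_table`, their values, and the helper `embed` are all made up for illustration; a real model learns the tables.

```python
# Toy embedding tables: 3 token ids and 4 positions, embedding size 2.
# The numbers are arbitrary placeholders, not learned values.
tok_table = {0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6]}
pos_table = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [-1.0, 0.0], 3: [0.0, -1.0]}

def embed(token_ids):
    """Sum the token and position embeddings elementwise for each position."""
    out = []
    for pos, tok in enumerate(token_ids):
        out.append([t + p for t, p in zip(tok_table[tok], pos_table[pos])])
    return out

# Token id 1 appears at positions 0 and 2 and gets a different vector each time,
# which is how the model can tell the two occurrences apart.
vectors = embed([1, 2, 1])
```

Without the positional table, both occurrences of token 1 would map to the same vector and the order of the sentence would be invisible to the convolutional blocks.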
Convolutional Blocks
This section describes how the convolutional blocks work.
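First, the input sentence is padded. A convolutional layer shortens its input, and we want the sentence coming out of a convolutional block to be as long as the one going in. Without padding, the sequence leaving a convolutional layer is filter_size - 1 elements shorter than the one entering it. With a filter size of 3, for example, the sequence loses 2 elements, so we add one padding element to each side of the sentence. For odd filter sizes, the amount of padding per side is simply (filter_size - 1) / 2. Even filter sizes are not covered in this tutorial.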
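The length arithmetic above can be checked with a small sketch. The helper names `conv_out_len` and `same_padding` are hypothetical, and the formula assumes a stride-1 convolution with no dilation.

```python
def conv_out_len(seq_len, kernel_size, pad_per_side=0):
    """Output length of a stride-1, undilated 1D convolution."""
    return seq_len + 2 * pad_per_side - (kernel_size - 1)

def same_padding(kernel_size):
    """Padding per side that preserves the length for an odd kernel size."""
    assert kernel_size % 2 == 1, "kernel size must be odd"
    return (kernel_size - 1) // 2

seq_len, kernel_size = 6, 3
# Without padding the sequence shrinks by kernel_size - 1 elements.
unpadded = conv_out_len(seq_len, kernel_size)
# With (kernel_size - 1) // 2 elements on each side the length is preserved.
padded = conv_out_len(seq_len, kernel_size, same_padding(kernel_size))
```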
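The filters are designed so that their output hidden dimension is twice the input hidden dimension. In computer vision these hidden dimensions are called channels, but we will keep calling them hidden dimensions. Why double the hidden dimension coming out of the convolutional filters? Because we use a special activation function called the gated linear unit (GLU). GLU has a gating mechanism built into the activation (similar to LSTM and GRU) and halves the size of the hidden dimension, whereas activation functions usually leave it unchanged.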
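A minimal sketch of what GLU computes on a single vector, written in plain Python rather than with `F.glu`. The function name `glu` and the input values are chosen here for illustration: the first half of the vector carries the values and the second half, passed through a sigmoid, acts as the gate.

```python
import math

def glu(vector):
    """Gated linear unit: first half of the vector times sigmoid of the second half."""
    assert len(vector) % 2 == 0, "GLU needs an even-sized input"
    half = len(vector) // 2
    values, gates = vector[:half], vector[half:]
    return [v * (1.0 / (1.0 + math.exp(-g))) for v, g in zip(values, gates)]

# Four inputs go in, two come out: the hidden size is halved.
# A gate of 0 passes half of the value; a large gate passes almost all of it.
out = glu([1.0, -2.0, 0.0, 100.0])
```

This is why the convolution emits 2 * hid_dim channels: after the gate, the hidden size is back to hid_dim.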
Encoder Implementation
```python
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, kernel_size,
                 dropout, device, max_length = 100):
        super().__init__()

        assert kernel_size % 2 == 1, "Kernel size must be odd!"

        self.device = device

        # Scaling factor applied after every residual sum
        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)

        self.tok_embedding = nn.Embedding(input_dim, emb_dim)
        self.pos_embedding = nn.Embedding(max_length, emb_dim)

        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)

        # Output channels are doubled because GLU halves them again
        self.convs = nn.ModuleList([nn.Conv1d(in_channels = hid_dim,
                                              out_channels = 2 * hid_dim,
                                              kernel_size = kernel_size,
                                              padding = (kernel_size - 1) // 2)
                                    for _ in range(n_layers)])

        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src = [batch size, src len]
        batch_size = src.shape[0]
        src_len = src.shape[1]

        # pos = [batch size, src len], holding 0, 1, ..., src len - 1 in every row
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)

        # Embed tokens and positions, then sum them elementwise
        tok_embedded = self.tok_embedding(src)
        pos_embedded = self.pos_embedding(pos)
        embedded = self.dropout(tok_embedded + pos_embedded)
        # embedded = [batch size, src len, emb dim]

        # Project emb dim -> hid dim and move channels to dimension 1 for Conv1d
        conv_input = self.emb2hid(embedded)
        conv_input = conv_input.permute(0, 2, 1)
        # conv_input = [batch size, hid dim, src len]

        for i, conv in enumerate(self.convs):
            # conved = [batch size, 2 * hid dim, src len]
            conved = conv(self.dropout(conv_input))
            # GLU halves the channel dimension back to hid dim
            conved = F.glu(conved, dim = 1)
            # Residual connection around the block
            conved = (conved + conv_input) * self.scale
            conv_input = conved

        # Project hid dim -> emb dim
        conved = self.hid2emb(conved.permute(0, 2, 1))
        # conved = [batch size, src len, emb dim]

        # Residual connection with the embeddings
        combined = (conved + embedded) * self.scale
        # combined = [batch size, src len, emb dim]

        return conved, combined
```
Decoder
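The decoder differs from the encoder in a few ways.

First, there is no residual connection that adds the embeddings after the convolutional blocks and the linear transformation. Instead, the embeddings are fed into the convolutional blocks and used as the residual connection there.

Second, to make use of the encoder's information, the decoder uses the encoder's conved and combined vectors, again inside the convolutional blocks.

Finally, the output of the decoder is a linear layer from the embedding dimension to the output dimension, which predicts the next word of the translation.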
Decoder Convolutional Blocks
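First, the padding is different. Instead of padding equally on both sides so that the sentence length stays the same throughout, we pad only at the beginning of the sentence. Because we process all target tokens in parallel, we need a way to let the filter that transforms token $i$ see only the tokens before it. If the filter could see token $i+1$, the model could simply copy it instead of learning to translate.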
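The effect of padding only at the start can be sketched with a hand-written 1D convolution in plain Python. The helpers `conv1d` and `causal_conv1d` and the kernel weights are illustrative assumptions, not the model's code: changing a later token must leave every earlier output untouched.

```python
def conv1d(seq, kernel):
    """Stride-1 1D convolution (cross-correlation) without padding."""
    k = len(kernel)
    return [sum(w * x for w, x in zip(kernel, seq[i:i + k]))
            for i in range(len(seq) - k + 1)]

def causal_conv1d(seq, kernel, pad_value=0.0):
    """Pad kernel_size - 1 elements at the start only, then convolve."""
    padded = [pad_value] * (len(kernel) - 1) + list(seq)
    return conv1d(padded, kernel)

kernel = [0.5, 1.0, 2.0]
a = causal_conv1d([1.0, 2.0, 3.0, 4.0], kernel)
b = causal_conv1d([1.0, 2.0, 3.0, 99.0], kernel)  # only the last token differs

# The output keeps the input length, and output position i depends only on
# input positions up to i, so a and b agree everywhere except the last position.
```

With symmetric padding, the output at position 2 would also read the input at position 3, and the two results would already differ there.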
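Decoder Implementation
Because we pad on one side only, the decoder can use both odd and even kernel sizes. As in the encoder, scale is used to reduce the variance across the model, and the position embedding is initialized with a "vocabulary" of 100 positions.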
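The variance argument behind scale can be written out as follows. This is a sketch under an assumption not stated above: the two vectors being summed have independent entries $x$ and $y$ with the same variance $\sigma^2$.

```latex
\operatorname{Var}\!\left(\sqrt{0.5}\,(x + y)\right)
  = 0.5\,\bigl(\operatorname{Var}(x) + \operatorname{Var}(y)\bigr)
  = 0.5 \cdot 2\sigma^2
  = \sigma^2
```

Under that assumption, each unscaled residual sum would double the variance, and multiplying by $\sqrt{0.5}$ keeps it constant from block to block.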
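In its forward method the model receives the two encoder representations and passes both to the calculate_attention method, which computes and applies attention. The method also returns the actual attention values, but we do not use them for now.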
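The core of the attention step, for a single decoder position, can be sketched in plain Python. The sketch leaves out the linear projections and the scaling used in the model, and the names `softmax` and `attend` as well as the toy vectors are assumptions for illustration: the query is scored against the encoder's conved vectors, and the resulting weights average the encoder's combined vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_conved, encoder_combined):
    """Dot-product attention for one decoder position."""
    # One score per source position: dot product of query and conved vector
    energy = [sum(q * k for q, k in zip(query, key)) for key in encoder_conved]
    weights = softmax(energy)
    # Weighted sum of the combined vectors
    dim = len(encoder_combined[0])
    context = [sum(w * vec[d] for w, vec in zip(weights, encoder_combined))
               for d in range(dim)]
    return weights, context

encoder_conved = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_combined = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# A zero query scores every source position equally, so the weights are
# uniform and the context is the plain average of the combined vectors.
weights, context = attend([0.0, 0.0], encoder_conved, encoder_combined)
```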
```python
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, kernel_size,
                 dropout, trg_pad_idx, device, max_length = 100):
        super().__init__()

        self.kernel_size = kernel_size
        self.trg_pad_idx = trg_pad_idx
        self.device = device

        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)

        self.tok_embedding = nn.Embedding(output_dim, emb_dim)
        self.pos_embedding = nn.Embedding(max_length, emb_dim)

        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)

        self.attn_hid2emb = nn.Linear(hid_dim, emb_dim)
        self.attn_emb2hid = nn.Linear(emb_dim, hid_dim)

        self.fc_out = nn.Linear(emb_dim, output_dim)

        # No padding argument here: the input is padded by hand in forward()
        self.convs = nn.ModuleList([nn.Conv1d(in_channels = hid_dim,
                                              out_channels = 2 * hid_dim,
                                              kernel_size = kernel_size)
                                    for _ in range(n_layers)])

        self.dropout = nn.Dropout(dropout)

    def calculate_attention(self, embedded, conved, encoder_conved, encoder_combined):
        # embedded = [batch size, trg len, emb dim]
        # conved = [batch size, hid dim, trg len]
        # encoder_conved = encoder_combined = [batch size, src len, emb dim]

        # Project hid dim -> emb dim and add the embeddings (residual)
        conved_emb = self.attn_hid2emb(conved.permute(0, 2, 1))
        combined = (conved_emb + embedded) * self.scale
        # combined = [batch size, trg len, emb dim]

        energy = torch.matmul(combined, encoder_conved.permute(0, 2, 1))
        attention = F.softmax(energy, dim=2)
        # attention = [batch size, trg len, src len]

        attended_encoding = torch.matmul(attention, encoder_combined)
        # Project emb dim -> hid dim
        attended_encoding = self.attn_emb2hid(attended_encoding)
        # attended_encoding = [batch size, trg len, hid dim]

        # Residual connection
        attended_combined = (conved + attended_encoding.permute(0, 2, 1)) * self.scale
        # attended_combined = [batch size, hid dim, trg len]

        return attention, attended_combined

    def forward(self, trg, encoder_conved, encoder_combined):
        # trg = [batch size, trg len]
        batch_size = trg.shape[0]
        trg_len = trg.shape[1]

        pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)

        tok_embedded = self.tok_embedding(trg)
        pos_embedded = self.pos_embedding(pos)
        embedded = self.dropout(tok_embedded + pos_embedded)
        # embedded = [batch size, trg len, emb dim]

        conv_input = self.emb2hid(embedded)
        conv_input = conv_input.permute(0, 2, 1)
        # conv_input = [batch size, hid dim, trg len]

        batch_size = conv_input.shape[0]
        hid_dim = conv_input.shape[1]

        for i, conv in enumerate(self.convs):
            conv_input = self.dropout(conv_input)

            # Pad kernel_size - 1 elements at the start of the sequence only
            padding = torch.zeros(batch_size,
                                  hid_dim,
                                  self.kernel_size - 1).fill_(self.trg_pad_idx).to(self.device)
            padded_conv_input = torch.cat((padding, conv_input), dim = 2)
            # padded_conv_input = [batch size, hid dim, trg len + kernel size - 1]

            conved = conv(padded_conv_input)
            conved = F.glu(conved, dim = 1)
            # conved = [batch size, hid dim, trg len]

            attention, conved = self.calculate_attention(embedded,
                                                         conved,
                                                         encoder_conved,
                                                         encoder_combined)

            # Residual connection around the block
            conved = (conved + conv_input) * self.scale
            conv_input = conved

        conved = self.hid2emb(conved.permute(0, 2, 1))
        # conved = [batch size, trg len, emb dim]

        output = self.fc_out(self.dropout(conved))
        # output = [batch size, trg len, output dim]

        return output, attention
```
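Seq2seq
```python
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg):
        # src = [batch size, src len]
        # trg = [batch size, trg len - 1] (the <eos> token is sliced off)

        # encoder_conved is the output of the last encoder conv block;
        # encoder_combined is encoder_conved plus the source embeddings
        encoder_conved, encoder_combined = self.encoder(src)

        # output holds a prediction for every target position;
        # attention holds the attention scores over the source positions
        output, attention = self.decoder(trg, encoder_conved, encoder_combined)

        return output, attention
```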
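Training the seq2seq model
Training is similar to the earlier projects. The authors of the paper found that small kernels combined with a large number of layers work better.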
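One way to read this finding is through the receptive field, the number of input positions a single output position can see. The sketch below is an illustration added here, not a calculation from the paper; `receptive_field` is a hypothetical helper, and it assumes stride-1, undilated convolutions with one kernel size throughout.

```python
def receptive_field(n_layers, kernel_size):
    """Input positions visible to one output position after stacking
    n_layers stride-1 convolutions that all use the same kernel size."""
    return 1 + n_layers * (kernel_size - 1)

# Ten layers with kernel size 3 cover as many positions as
# two layers with kernel size 11, but apply ten gated nonlinearities instead of two.
deep_small = receptive_field(n_layers=10, kernel_size=3)
shallow_large = receptive_field(n_layers=2, kernel_size=11)
```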
```python
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
EMB_DIM = 256
HID_DIM = 512          # number of channels inside the conv blocks
ENC_LAYERS = 10        # number of conv blocks in the encoder
DEC_LAYERS = 10        # number of conv blocks in the decoder
ENC_KERNEL_SIZE = 3    # must be odd
DEC_KERNEL_SIZE = 3    # can be odd or even
ENC_DROPOUT = 0.25
DEC_DROPOUT = 0.25
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

enc = Encoder(INPUT_DIM, EMB_DIM, HID_DIM, ENC_LAYERS, ENC_KERNEL_SIZE, ENC_DROPOUT, device)
dec = Decoder(OUTPUT_DIM, EMB_DIM, HID_DIM, DEC_LAYERS, DEC_KERNEL_SIZE, DEC_DROPOUT, TRG_PAD_IDX, device)

model = Seq2Seq(enc, dec).to(device)
```
```python
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

optimizer = optim.Adam(model.parameters())

# Padding positions in the target do not contribute to the loss
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
```
```python
def train(model, iterator, optimizer, criterion, clip):

    model.train()

    epoch_loss = 0

    for i, batch in enumerate(iterator):

        src = batch.src
        trg = batch.trg

        optimizer.zero_grad()

        # Feed the target without its last token
        output, _ = model(src, trg[:,:-1])

        #output = [batch size, trg len - 1, output dim]
        #trg = [batch size, trg len]

        output_dim = output.shape[-1]

        # Flatten, and drop the first target token (<sos>) so that
        # each prediction lines up with the token that follows it
        output = output.contiguous().view(-1, output_dim)
        trg = trg[:,1:].contiguous().view(-1)

        #output = [batch size * trg len - 1, output dim]
        #trg = [batch size * trg len - 1]

        loss = criterion(output, trg)

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)
```
```python
def evaluate(model, iterator, criterion):

    model.eval()

    epoch_loss = 0

    # No gradients are needed for evaluation
    with torch.no_grad():

        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output, _ = model(src, trg[:,:-1])

            output_dim = output.shape[-1]

            output = output.contiguous().view(-1, output_dim)
            trg = trg[:,1:].contiguous().view(-1)

            loss = criterion(output, trg)

            epoch_loss += loss.item()

    return epoch_loss / len(iterator)
```
```python
def epoch_time(start_time, end_time):
    # Split the elapsed seconds into whole minutes and remaining seconds
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
```
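Although this model has almost twice as many parameters as the attention-based RNN model, it trains in about half the time of the standard version and in about the same time as the packed padded sequences version. This is because all computations are done in parallel by the convolutional filters rather than sequentially as in an RNN.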
```python
if __name__ == "__main__":
    print(f'The model has {count_parameters(model):,} trainable parameters')

    N_EPOCHS = 10
    CLIP = 0.1
    # Placeholder checkpoint filename (an assumption); use any writable path
    MODEL_PATH = 'model.pt'

    best_valid_loss = float('inf')

    for epoch in range(N_EPOCHS):

        start_time = time.time()

        train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
        valid_loss = evaluate(model, valid_iterator, criterion)

        end_time = time.time()

        epoch_mins, epoch_secs = epoch_time(start_time, end_time)

        # Keep the weights with the lowest validation loss so far
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), MODEL_PATH)

        print(f'Epoch: {epoch + 1:02} | Time: {epoch_mins}m {epoch_secs}s')
        print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
        print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')
```
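Inference
The steps are: tokenize the source sentence if it is a string, add the <sos> and <eos> tokens, convert the tokens to indices, run the encoder once, then repeatedly run the decoder on the tokens generated so far and append the most likely next token until <eos> is produced or max_len is reached.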
```python
def translate_sentence(sentence, src_field, trg_field, model, device, max_len = 50):

    model.eval()

    # Tokenize if a raw string was passed; otherwise assume a list of tokens
    if isinstance(sentence, str):
        nlp = spacy.load('de_core_news_sm')
        tokens = [token.text.lower() for token in nlp(sentence)]
    else:
        tokens = [token.lower() for token in sentence]

    tokens = [src_field.init_token] + tokens + [src_field.eos_token]

    src_indexes = [src_field.vocab.stoi[token] for token in tokens]

    # Add a batch dimension of size 1
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(0).to(device)

    # The encoder runs once per sentence
    with torch.no_grad():
        encoder_conved, encoder_combined = model.encoder(src_tensor)

    # Start the output with <sos>
    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]

    for i in range(max_len):

        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)

        with torch.no_grad():
            output, attention = model.decoder(trg_tensor, encoder_conved, encoder_combined)

        # Greedy decoding: take the most likely token at the last position
        pred_token = output.argmax(2)[:,-1].item()

        trg_indexes.append(pred_token)

        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break

    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]

    # Drop the leading <sos> token
    return trg_tokens[1:], attention
```