Machine Translation with Recurrent Neural Networks

Introduction

In this post, we're going to build a machine learning model to translate German sentences into English. It's not quite going to be Google Translate, but it will actually come surprisingly close!

We'll be using the same model architecture as Google, a neural sequence-to-sequence model. For the details on Google's setup, you can read their paper (Wu et al.). For our setup, we'll work primarily from this paper (Bahdanau et al.), a breakthrough paper that serves as the basis of Google's model.

We will build the model from scratch with PyTorch 0.4 and Python 3. In fact, everything here is an IPython notebook (published here) that you can run yourself. Additionally, everything is uploaded publicly to GitHub.

At the end of the day, you'll be able to produce translations like these:

German: Dieser Tag hat unsere Sicht nachhaltig verändert .
Professional translation: And that day really changed our perspective .
Our translation: This day, our view has changed .

German: So lernt man als Kind eine Sprache .
Professional translation: And this is what you learn when you learn a language as a child .
Our translation: This is how you learn a language as a child .

German: Man kann eine Kultur ohne Austausch pflegen .
Professional translation: You can have culture without exchange .
Our translation: You can exchange a culture without exchange .

This tutorial is ideal for someone with some experience with neural networks but unfamiliar with natural language processing or machine translation.

For those looking to take machine translation to the next level, try out the brilliant OpenNMT platform, also built in PyTorch.

Now, let's dive into translation.

Overview

Our goal is to convert a German source sentence into an English sentence. To do this, we will first encode each word of the German sentence and then decode an English sentence one word at a time. During decoding, we will use attention to look back at the encoded German words as we go along.

For example, when trying to translate the third sentence above, in the first step of decoding, we might attend to the encoding of the German word "Man" in the source sentence and then produce the English word "You". For this reason, we call our model an encoder-decoder sequence-to-sequence model. This type of model is now ubiquitous in natural language processing and other areas that deal with sequences (text, speech, etc.).

The Data

To train our translation model, we need a ton of German-English sentence pairs translated by professional translators. We'll use the IWSLT collection of German TED Talks translated into English. For training bigger models, researchers often use the proceedings of the European Parliament.

We preprocess the data with the spaCy library. In preprocessing, we split each sentence into tokens (words) and add special <s> and </s> tokens to mark the beginning and end of sentences. We also replace words that occur fewer than 5 times with an unknown token <unk> -- this helps us keep our vocabulary to a manageable 11560 English words and 13353 German words.

If you're running this code on your own computer, this preprocessing might take a little while.

## Uncomment these lines if you have not downloaded spacy and torchtext
# !pip install spacy
# !pip install torch torchvision torchtext
# !python -m spacy download en
# !python -m spacy download de
import itertools, os, time, datetime
import numpy as np
import spacy
import torch
import torch.nn as nn
from torchtext import data, datasets
from torchtext.vocab import Vectors, GloVe
use_gpu = torch.cuda.is_available()
def preprocess(vocab_size=0, batchsize=16, max_sent_len=20):
    '''Loads data from text files into iterators'''

    # Load text tokenizers
    spacy_de = spacy.load('de')
    spacy_en = spacy.load('en')

    def tokenize(text, lang='en'):
        if lang == 'de':
            return [tok.text for tok in spacy_de.tokenizer(text)]
        elif lang == 'en':
            return [tok.text for tok in spacy_en.tokenizer(text)]
        else:
            raise Exception('Invalid language')

    # Add beginning-of-sentence and end-of-sentence tokens 
    BOS_WORD = '<s>'
    EOS_WORD = '</s>'
    DE = data.Field(tokenize=lambda x: tokenize(x, 'de'))
    EN = data.Field(tokenize=tokenize, init_token=BOS_WORD, eos_token=EOS_WORD)

    # Create sentence pair dataset with max length 20
    train, val, test = datasets.IWSLT.splits(
        exts=('.de', '.en'), fields=(DE, EN),
        filter_pred=lambda x: max(len(vars(x)['src']), len(vars(x)['trg'])) <= max_sent_len)

    # Build vocabulary and convert text to indices
    # Convert words that appear fewer than 5 times to <unk>
    if vocab_size > 0:
        DE.build_vocab(train.src, min_freq=5, max_size=vocab_size)
        EN.build_vocab(train.trg, min_freq=5, max_size=vocab_size)
    else:
        DE.build_vocab(train.src, min_freq=5)
        EN.build_vocab(train.trg, min_freq=5)

    # Create iterators to process text in batches of approx. the same length
    train_iter = data.BucketIterator(train, batch_size=batchsize, device=-1, repeat=False, sort_key=lambda x: len(x.src))
    val_iter = data.BucketIterator(val, batch_size=1, device=-1, repeat=False, sort_key=lambda x: len(x.src))
    
    return DE, EN, train_iter, val_iter

# Test
timer = time.time()
SRC, TGT, train_iter, val_iter = preprocess()
print('''This is a test of our preprocessing function. It took {:.1f} seconds to load the data. 
Our German vocab has size {} and our English vocab has size {}.
Our training data has {} batches, each with {} sentences, and our validation data has {} batches.'''.format(
time.time() - timer, len(SRC.vocab), len(TGT.vocab), len(train_iter), train_iter.batch_size, len(val_iter)))
This is a test of our preprocessing function. It took 202.0 seconds to load the data. 
Our German vocab has size 13353 and our English vocab has size 11560.
Our training data has 7443 batches, each with 16 sentences, and our validation data has 570 batches.
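As a quick sanity check on the vocabularies (not part of the original notebook, and the exact indices may vary), we can peek at the special tokens and sizes:

print(TGT.vocab.itos[:4])               # e.g. ['<unk>', '<pad>', '<s>', '</s>']
print(TGT.vocab.stoi['<unk>'])          # index assigned to rare/unknown words
print(len(SRC.vocab), len(TGT.vocab))   # 13353 German words, 11560 English words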

The Model

[Model diagram: the encoder (red) reads the German source sentence, the decoder (blue and yellow) generates the English sentence, and an attention mechanism connects them.]

Word Embeddings

The first step of both the encoder and decoder is to convert the input words into vectors, a form our model can work with. We do so with word embeddings, which are mappings from each word in our vocab to a vector in some high-dimensional space (say, 300 dimensions).

Word embeddings are a subject for an entirely separate blog post, but the basic idea is that word vectors capture some semantic meaning: the vector for dog is closer to the vector for cat than the vector for asparagus.

You can get your own vectors from a source like GloVe or fastText, or you can use the ones I've generated and uploaded to GitHub (links below).

!wget https://github.com/lukemelas/Machine-Translation/raw/master/scripts/emb-13353-de.npy https://github.com/lukemelas/Machine-Translation/raw/master/scripts/emb-11560-en.npy
def load_embeddings(SRC, TGT, np_src_file, np_tgt_file):
    '''Load English and German embeddings from saved numpy files'''
    emb_tr_src = torch.from_numpy(np.load(np_src_file))
    emb_tr_tgt = torch.from_numpy(np.load(np_tgt_file))
    return emb_tr_src, emb_tr_tgt
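
As a hedged aside (not in the original notebook), you can also inspect pretrained GloVe vectors directly through torchtext; downloading the 6B archive takes a while, and GloVe only covers English, so the German side would still need vectors from another source such as fastText.

glove = GloVe(name='6B', dim=100)                   # downloads the GloVe 6B archive on first use
def vec(word): return glove.vectors[glove.stoi[word]]
cos = nn.CosineSimilarity(dim=0)
print(cos(vec('dog'), vec('cat')).item())           # noticeably higher ...
print(cos(vec('dog'), vec('asparagus')).item())     # ... than this
# Inside preprocess(), one could instead attach such vectors to the vocabulary,
# e.g. EN.build_vocab(train.trg, min_freq=5, vectors=GloVe(name='6B', dim=300)),
# after which EN.vocab.vectors would serve as the English embedding matrix.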

Encoder

Our encoder (red in the model diagram above) is a bidirectional recurrent neural network. Bidirectional simply means that we run the model both backwards and forwards along the sentence.

Our encoder outputs a vector for each word in the source sentence. All these vectors together are called a memory bank.

class EncoderLSTM(nn.Module):
    def __init__(self, embedding, h_dim, num_layers, dropout_p=0.0, bidirectional=True):
        super(EncoderLSTM, self).__init__()
        self.vocab_size, self.embedding_size = embedding.size()
        self.num_layers, self.h_dim, self.dropout_p, self.bidirectional = num_layers, h_dim, dropout_p, bidirectional 

        # Create embedding and LSTM
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_size)
        self.embedding.weight.data.copy_(embedding)
        self.lstm = nn.LSTM(self.embedding_size, self.h_dim, self.num_layers, dropout=self.dropout_p, bidirectional=bidirectional)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, x):
        '''Embed text, get initial LSTM hidden state, and encode with LSTM'''
        x = self.dropout(self.embedding(x)) # embedding
        h0 = self.init_hidden(x.size(1)) # initial state of LSTM
        memory_bank, h = self.lstm(x, h0) # encoding
        return memory_bank, h

    def init_hidden(self, batch_size):
        '''Create initial hidden state of zeros: 2-tuple of num_layers x batch size x hidden dim'''
        num_layers = self.num_layers * 2 if self.bidirectional else self.num_layers
        init = torch.zeros(num_layers, batch_size, self.h_dim)
        init = init.cuda() if use_gpu else init
        h0 = (init, init.clone())
        return h0
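
As a quick, hedged shape check (not in the original notebook), we can run the encoder on a toy batch with a random embedding matrix:

toy_emb = torch.randn(100, 300)                           # hypothetical 100-word vocab, 300-dim embeddings
enc = EncoderLSTM(toy_emb, h_dim=300, num_layers=2, bidirectional=True)
enc = enc.cuda() if use_gpu else enc
x = torch.randint(0, 100, (7, 4), dtype=torch.long)       # source length 7, batch size 4
x = x.cuda() if use_gpu else x
memory_bank, h = enc(x)
print(memory_bank.shape)                                  # torch.Size([7, 4, 600]): one 2x300-dim vector per source word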

Attention

Attention enables our decoder to look at our encoded source words while translating. We use dot-product attention, which means we take the dot product of our intermediate decoder output and our encoder output. We then take a weighted sum of our encoder vectors, using this dot product as the weight.

There are lots of other types of attention, described in more detail here, but we use dot-product attention because it is simple and works well.

class Attention(nn.Module):
    def __init__(self, pad_token=1, bidirectional=True, h_dim=300):
        super(Attention, self).__init__()
        self.bidirectional, self.h_dim, self.pad_token = bidirectional, h_dim, pad_token
        self.softmax = nn.Softmax(dim=1)

    def forward(self, in_e, out_e, out_d):
        '''Produces context with attention distribution'''

        # Deal with bidirectional encoder, move batches first
        if self.bidirectional: # sum hidden states for both directions
            out_e = out_e.contiguous().view(out_e.size(0), out_e.size(1), 2, -1).sum(2).view(out_e.size(0), out_e.size(1), -1)
            
        # Move batches first
        out_e = out_e.transpose(0,1) # b x sl x hd
        out_d = out_d.transpose(0,1) # b x tl x hd

        # Dot product attention, softmax, and reshape
        attn = out_e.bmm(out_d.transpose(1,2)) # (b x sl x hd) (b x hd x tl) --> (b x sl x tl)
        attn = self.softmax(attn).transpose(1,2) # --> b x tl x sl

        # Get attention distribution
        context = attn.bmm(out_e) # --> b x tl x hd
        context = context.transpose(0,1) # --> tl x b x hd
        return context
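
A quick shape check (again, just an illustrative sketch with random tensors) shows how the attention layer maps encoder and decoder outputs to a context of the decoder's shape:

attn_layer = Attention(bidirectional=True, h_dim=300)
out_e = torch.randn(7, 4, 600)              # source length 7, batch 4, two directions x 300
out_d = torch.randn(5, 4, 300)              # target length 5, batch 4
context = attn_layer(None, out_e, out_d)    # in_e is not used inside forward, so None is fine here
print(context.shape)                        # torch.Size([5, 4, 300]): one context vector per target position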

Decoder

Our decoder (blue and yellow in the diagram) is a recurrent neural network. Within our decoder, we have an attention layer, which looks at the memory bank from the encoder.

We begin by feeding in the start token <s>. Our decoder tries to predict the next word by outputting a distribution over all words in the vocabulary. During training, we know the ground truth sentence, so we feed it into the decoder word-by-word at each step. We penalize the model's predictions using a cross-entropy loss function. During testing, we do not know the ground truth, so we use a prediction of the model as input to the next time step. We'll discuss this process in more detail below.

class DecoderLSTM(nn.Module):
    def __init__(self, embedding, h_dim, num_layers, dropout_p=0.0):
        super(DecoderLSTM, self).__init__()
        self.vocab_size, self.embedding_size = embedding.size()
        self.num_layers, self.h_dim, self.dropout_p = num_layers, h_dim, dropout_p
        
        # Create embedding and LSTM
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_size)
        self.embedding.weight.data.copy_(embedding) 
        self.lstm = nn.LSTM(self.embedding_size, self.h_dim, self.num_layers, dropout=self.dropout_p)
        self.dropout = nn.Dropout(self.dropout_p)
    
    def forward(self, x, h0):
        '''Embed text and pass through LSTM'''
        x = self.embedding(x)
        x = self.dropout(x)
        out, h = self.lstm(x, h0)
        return out, h

At test time, we need to use the output of our decoder as the input to the model at the next time step. We could do this by taking the most likely word each time, a strategy known as greedy search. Here, we'll use a fancier method known as beam search, which keeps around a list of likely partial sentences during decoding.

Beam search is a bit complicated, but I've decided to include it because few other tutorials do, and because it really does improve translation quality. To make it simpler, I've only implemented it for batch size = 1. Since beam search will be a method of our final model class, the code is in the section below.
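
To make the idea concrete before the full implementation, here is a minimal, self-contained sketch of beam search over a hypothetical step_fn that returns next-token probabilities; it is for illustration only and is not used by the model below.

import math

def toy_beam_search(step_fn, bos, eos, beam_size=3, max_len=5):
    '''step_fn(prefix) -> {token: probability} for the next token (hypothetical).'''
    beam = [(0.0, [bos])]                       # (log-probability, partial sentence)
    for _ in range(max_len):
        candidates = []
        for lprob, sent in beam:
            if sent[-1] == eos:                 # finished hypotheses are carried over unchanged
                candidates.append((lprob, sent))
                continue
            for tok, p in step_fn(sent).items():
                candidates.append((lprob + math.log(p), sent + [tok]))
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beam                                 # beam[0] is the most likely hypothesis

# e.g. toy_beam_search(lambda sent: {'the': 0.6, '</s>': 0.4}, '<s>', '</s>')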

Final model

Our final model combines the encoder, attention, decoder, and beam search. We call it Seq2seq.

class Seq2seq(nn.Module):
    def __init__(self, embedding_src, embedding_tgt, h_dim, num_layers, dropout_p, bi, tokens_bos_eos_pad_unk=[0,1,2,3]):
        super(Seq2seq, self).__init__()
        # Store hyperparameters
        self.h_dim = h_dim
        self.vocab_size_tgt, self.emb_dim_tgt = embedding_tgt.size()
        self.bos_token, self.eos_token, self.pad_token, self.unk_token = tokens_bos_eos_pad_unk

        # Create encoder, decoder, attention
        self.encoder = EncoderLSTM(embedding_src, h_dim, num_layers, dropout_p=dropout_p, bidirectional=bi)
        self.decoder = DecoderLSTM(embedding_tgt, h_dim, num_layers * 2 if bi else num_layers, dropout_p=dropout_p)
        self.attention = Attention(pad_token=self.pad_token, bidirectional=bi, h_dim=self.h_dim)

        # Create linear layers to combine context and hidden state
        self.linear1 = nn.Linear(2 * self.h_dim, self.emb_dim_tgt)
        self.tanh = nn.Tanh()
        self.dropout = nn.Dropout(dropout_p)
        self.linear2 = nn.Linear(self.emb_dim_tgt, self.vocab_size_tgt)
        
        # Share weights between decoder embedding and output 
        if self.decoder.embedding.weight.size() == self.linear2.weight.size():
            self.linear2.weight = self.decoder.embedding.weight

    def forward(self, src, tgt):
        if use_gpu: src = src.cuda()
        
        # Encode
        out_e, final_e = self.encoder(src)
        
        # Decode
        out_d, final_d = self.decoder(tgt, final_e)
        
        # Attend
        context = self.attention(src, out_e, out_d)
        out_cat = torch.cat((out_d, context), dim=2) 
        
        # Predict (returns probabilities)
        x = self.linear1(out_cat)
        x = self.dropout(self.tanh(x))
        x = self.linear2(x)
        return x

    def predict(self, src, beam_size=1): 
        '''Predict top 1 sentence using beam search. Note that beam_size=1 is greedy search.'''
        beam_outputs = self.beam_search(src, beam_size, max_len=30) # returns top beam_size options (as list of tuples)
        top1 = beam_outputs[0][1] # a list of word indices (as ints)
        return top1

    def beam_search(self, src, beam_size, max_len, remove_tokens=[]):
        '''Returns top beam_size sentences using beam search. Works only when src has batch size 1.'''
        if use_gpu: src = src.cuda()
        
        # Encode
        outputs_e, states = self.encoder(src) # batch size = 1
        
        # Start with '<s>'
        init_lprob = -1e10
        init_sent = [self.bos_token]
        best_options = [(init_lprob, init_sent, states)] # beam
        
        # Beam search
        k = beam_size # store best k options
        for length in range(max_len): # maximum target length
            options = [] # candidates 
            for lprob, sentence, current_state in best_options:
                # Prepare last word
                last_word = sentence[-1]
                if last_word != self.eos_token:
                    last_word_input = torch.LongTensor([last_word]).view(1,1)
                    if use_gpu: last_word_input = last_word_input.cuda()
                    # Decode
                    outputs_d, new_state = self.decoder(last_word_input, current_state)
                    # Attend
                    context = self.attention(src, outputs_e, outputs_d)
                    out_cat = torch.cat((outputs_d, context), dim=2)
                    x = self.linear1(out_cat)
                    x = self.dropout(self.tanh(x))
                    x = self.linear2(x)
                    x = x.squeeze().data.clone()
                    # Block predictions of tokens in remove_tokens
                    for t in remove_tokens: x[t] = -10e10
                    lprobs = torch.log(x.exp() / x.exp().sum()) # log softmax
                    # Add top k candidates to options list for next word
                    for index in torch.topk(lprobs, k)[1]: 
                        option = (float(lprobs[index]) + lprob, sentence + [index], new_state) 
                        options.append(option)
                else: # keep sentences ending in '</s>' as candidates
                    options.append((lprob, sentence, current_state))
            options.sort(key = lambda x: x[0], reverse=True) # sort by lprob
            best_options = options[:k] # place top candidates in beam
        best_options.sort(key = lambda x: x[0], reverse=True)
        return best_options
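
Before training, a hedged smoke test (not in the original notebook) with random embeddings confirms the output shape: one score per target-vocabulary word at every target position.

toy_src_emb = torch.randn(50, 300)                      # hypothetical 50-word source vocab
toy_tgt_emb = torch.randn(60, 300)                      # hypothetical 60-word target vocab
toy_model = Seq2seq(toy_src_emb, toy_tgt_emb, h_dim=300, num_layers=2, dropout_p=0.0, bi=True)
toy_model = toy_model.cuda() if use_gpu else toy_model
src = torch.randint(0, 50, (7, 4), dtype=torch.long)    # source length 7, batch 4
tgt = torch.randint(0, 60, (5, 4), dtype=torch.long)    # target length 5, batch 4
src = src.cuda() if use_gpu else src
tgt = tgt.cuda() if use_gpu else tgt
print(toy_model(src, tgt).shape)                        # torch.Size([5, 4, 60])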

Training

The section above was a bit complicated, but we're almost done. We just have to set up our training and validation loops.

class AverageMeter(object):
  '''A handy class for moving averages''' 
  def __init__(self):
    self.reset()
  def reset(self):
    self.val, self.avg, self.sum, self.count = 0, 0, 0, 0
  def update(self, val, n=1):
    self.val = val
    self.sum += val * n
    self.count += n
    self.avg = self.sum / self.count

At each iteration of training, we update our model weights with gradient descent.

def train(train_iter, val_iter, model, criterion, optimizer, num_epochs):  
    for epoch in range(num_epochs):
      
        # Validate model
        with torch.no_grad():
          val_loss = validate(val_iter, model, criterion) 
          print('Validating Epoch [{e}/{num_e}]\t Average loss: {l:.3f}\t Perplexity: {p:.3f}'.format(
            e=epoch, num_e=num_epochs, l=val_loss, p=torch.FloatTensor([val_loss]).exp().item()))

        # Train model
        model.train()
        losses = AverageMeter()
        for i, batch in enumerate(train_iter): 
            src = batch.src.cuda() if use_gpu else batch.src
            tgt = batch.trg.cuda() if use_gpu else batch.trg
            
            # Forward, backprop, optimizer
            model.zero_grad()
            scores = model(src, tgt)

            # Remove <s> from target and </s> from scores (output)
            scores = scores[:-1]
            tgt = tgt[1:]           

            # Reshape for loss function
            scores = scores.view(scores.size(0) * scores.size(1), scores.size(2))
            tgt = tgt.view(scores.size(0))

            # Pass through loss function
            loss = criterion(scores, tgt) 
            loss.backward()
            losses.update(loss.item())

            # Clip gradient norms and step optimizer
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            # Log within epoch
            if i % 1000 == 10:
                print('''Epoch [{e}/{num_e}]\t Batch [{b}/{num_b}]\t Loss: {l:.3f}'''.format(e=epoch+1, num_e=num_epochs, b=i, num_b=len(train_iter), l=losses.avg))

        # Log after each epoch
        print('''Epoch [{e}/{num_e}] complete. Loss: {l:.3f}'''.format(e=epoch+1, num_e=num_epochs, l=losses.avg))

During training, we simply calculate the log likelihood of the ground truth sentence under our model. The exponential of this value is known as perplexity. If you are interested, there are more details on Wikipedia.
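
As a quick check on this relationship (just arithmetic, not part of the original notebook), the perplexities printed by the training loop are simply the exponentials of the average losses:

import math
print(math.exp(3.439))   # ~31.2, the perplexity printed before the first training epoch below
print(math.exp(2.274))   # ~9.72, the validation perplexity reported after the first epoch below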

def validate(val_iter, model, criterion):
    '''Calculate losses by teacher forcing on the validation set'''
    model.eval()
    losses = AverageMeter()
    for i, batch in enumerate(val_iter):
        src = batch.src.cuda() if use_gpu else batch.src
        tgt = batch.trg.cuda() if use_gpu else batch.trg
        
        # Forward 
        scores = model(src, tgt)
        scores = scores[:-1]
        tgt = tgt[1:]           
        
        # Reshape for loss function
        scores = scores.view(scores.size(0) * scores.size(1), scores.size(2))
        tgt = tgt.view(scores.size(0))
        num_words = (tgt != 0).float().sum()
        
        # Calculate loss
        loss = criterion(scores, tgt) 
        losses.update(loss.item())
    
    return losses.avg

Finally, we need a predict function so we can see our translations!

def predict_from_text(model, input_sentence, SRC, TGT):
    sent_german = input_sentence.split(' ') # sentence --> list of words
    sent_indices = [SRC.vocab.stoi[word] if word in SRC.vocab.stoi else SRC.vocab.stoi['<unk>'] for word in sent_german]
    sent = torch.LongTensor([sent_indices])
    if use_gpu: sent = sent.cuda()
    sent = sent.view(-1,1) # reshape to sl x bs
    print('German: ' + ' '.join([SRC.vocab.itos[index] for index in sent_indices]))
    # Predict the best sentence with beam search (beam size 5)
    pred = model.predict(sent, beam_size=5) # returns the top hypothesis as a list of word indices
    out = ' '.join([TGT.vocab.itos[index] for index in pred[1:-1]]) # strip <s> and </s>
    print('English: ' + out)

Translation

It's time to run our model. First we load our embeddings.

# Load embeddings
embedding_src, embedding_tgt = load_embeddings(SRC, TGT, 'emb-13353-de.npy', 'emb-11560-en.npy')

Then we create our model and move it onto the GPU.

# Create model 
tokens = [TGT.vocab.stoi[x] for x in ['<s>', '</s>', '<pad>', '<unk>']]
model = Seq2seq(embedding_src, embedding_tgt, 300, 2, 0.3, True, tokens_bos_eos_pad_unk=tokens)
model = model.cuda() if use_gpu else model

Next, we make our cross-entropy loss function (criterion) and optimizer. For our optimizer, we'll use Adam, but you can definitely use other optimizers (SGD, RMSprop, Adamax, etc.) as well.

# Create weight to mask padding tokens for loss function
weight = torch.ones(len(TGT.vocab))
weight[TGT.vocab.stoi['<pad>']] = 0
weight = weight.cuda() if use_gpu else weight
# Create loss function and optimizer
criterion = nn.CrossEntropyLoss(weight=weight)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3) 

Finally, we can train our model!

train(train_iter, val_iter, model, criterion, optimizer, 50)
Validating Epoch [0/50]	 Average loss: 3.439	 Perplexity: 31.152
Epoch [1/50]	 Batch [10/7443]	 Loss: 3.980
Epoch [1/50]	 Batch [1010/7443]	 Loss: 3.609
Epoch [1/50]	 Batch [2010/7443]	 Loss: 3.446
Epoch [1/50]	 Batch [3010/7443]	 Loss: 3.336
Epoch [1/50]	 Batch [4010/7443]	 Loss: 3.252
Epoch [1/50]	 Batch [5010/7443]	 Loss: 3.182
Epoch [1/50]	 Batch [6010/7443]	 Loss: 3.122
Epoch [1/50]	 Batch [7010/7443]	 Loss: 3.073
Epoch [1/50] complete. Loss: 3.052
Validating Epoch [1/50]	 Average loss: 2.274	 Perplexity: 9.716
...

Training takes quite a while, so I've included a pretrained model at the link below.

!wget https://www.dropbox.com/s/qu18vt3jisplchd/model.pkl
model.load_state_dict(torch.load('model.pkl'))
model = model.cuda() if use_gpu else model

We can verify that the pretrained model works:

with torch.no_grad():
  val_loss = validate(val_iter, model, criterion) 
  print('Average loss: {l:.3f}\t Perplexity: {p:.3f}'.format(l=val_loss, p=torch.FloatTensor([val_loss]).exp().item()))
Average loss: 1.865	 Perplexity: 6.459

Finally, we can translate some German sentences! Let's try a few from the German newspaper Süddeutsche Zeitung.

input = "Ich kenne nur Berge, ich bleibe in den Bergen und ich liebe die Berge ."
predict_from_text(model, input, SRC, TGT)
German: Ich kenne nur <unk> ich bleibe in den Bergen und ich liebe die Berge .
English: I only know I 'm staying in the hills , and I love the mountains .
input = "Ihre Bergung erwies sich als komplizierter als gedacht ." 
predict_from_text(model, input, SRC, TGT)
German: Ihre <unk> erwies sich als komplizierter als gedacht .
English: Her <unk> turned out to be more complicated than that .

Conclusion

In this post, we built a neural machine translation system in PyTorch. With just a few more additions, a library full of training data, and a warehouse full of computing power, you'll have your very own Google Translate!

If you found these results interesting, there are lots of exciting extensions:

  • Unsupervised machine translation -- translating without paired training data: Lample et al.
  • Multilingual machine translation -- translating between lots of languages: Johnson et al.
  • The details of Google's system: Wu et al.

 

