Attention-based Sequence-to-Sequence in Keras

All data and code in this article are available on Github.

Last month, I wrote about translating English words into Katakana using Sequence-to-Sequence learning in Keras. In this article, I describe how to improve that Katakana Sequence-to-Sequence model with an Attention Mechanism.

The Attention Mechanism is an extension to the Sequence-to-Sequence model

Previous Sequence-to-Sequence Model and its weak point

In the previous article, I implemented a plain Sequence-to-Sequence model as described in the paper.

[Diagram: plain Sequence-to-Sequence model]

In this model, the encoder part reads the input sequence and produces a vector that summarizes the input (highlighted in red on the diagram). That vector is then fed into the decoder part, which generates the output sequence from the information in the vector. Because the vector is the only information the decoder receives, it has to carry enough information for the decoder to generate the whole output. This makes the vector an information bottleneck when we want to generate long names or words.

Bahdanau et al. describe the problem in their paper: “A potential issue with this encoder-decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector.”

There is also an analogy to how the human brain works on sequential input and output. Suppose we want to translate a sentence into another language or rewrite a piece of text: we read in a sequence of words and write out a sequence of words (and our brain is indeed a neural network). It is much more difficult to read the whole sentence, remember everything in a single “thought”, and then write the translated sentence without peeking back. The longer the sentence, the harder it is to remember all the details of the original. Even if you have a good memory, imagine having to translate a whole paragraph or a whole page in a single read.

The idea of the attention mechanism is to have the decoder “look back” at the encoder’s information on every input and use that information when making each output decision.

Luong et al., 2015’s Attention Mechanism

There are multiple designs for the attention mechanism. For example, Bahdanau et al., 2015’s attention models are pretty common. Zafarali Ahmed has written a very comprehensive article about that mechanism and how to implement it in Keras.

However, I have found Luong et al.’s paper the easiest to understand and implement in Keras. As I will show, we need only out-of-the-box Keras components and don’t need to define any custom layer.

Following the paper and comparing its equations with the code is highly advised.

Encoder / Decoder

  • Let E = encoder be the encoder's output tensor of shape (-1, INPUT_LENGTH, 64), and let E[i] (encoder[:,i,:]) be the encoder's output vector (of shape (-1, 64)) for the i-th English character. This is the source-side information (H-s in the paper).

  • Let D = decoder be the decoder's output tensor of shape (-1, OUTPUT_LENGTH, 64), and let D[j] (decoder[:,j,:]) be the decoder's output vector (of shape (-1, 64)) for the j-th Katakana character. This is the target-side information (H-t in the paper).

  • Let O = output be the final Katakana output of shape (-1, OUTPUT_LENGTH, output_dict_size), and let O[j] (output[:,j,:]) be the output (a probability distribution of shape (-1, output_dict_size)) for the j-th Katakana character.
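
A note before the code: encoder_input and decoder_input below are the Input layers from the data setup in the previous article. For completeness, here is a minimal sketch of that setup (the sequence lengths are illustrative assumptions, not values you must use):

from keras.layers import Input, Embedding, LSTM

# Illustrative sequence lengths; in practice these come from the data preparation step
INPUT_LENGTH = 20
OUTPUT_LENGTH = 20

encoder_input = Input(shape=(INPUT_LENGTH,))   # sequence of English character codes
decoder_input = Input(shape=(OUTPUT_LENGTH,))  # sequence of Katakana character codes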

encoder = Embedding(input_dict_size, 64, input_length=INPUT_LENGTH, mask_zero=True)(encoder_input)
encoder = LSTM(64, return_sequences=True, unroll=True)(encoder)
encoder_last = encoder[:,-1,:]

decoder = Embedding(output_dict_size, 64, input_length=OUTPUT_LENGTH, mask_zero=True)(decoder_input)
decoder = LSTM(64, return_sequences=True, unroll=True)(decoder, initial_state=[encoder_last, encoder_last])

# For the plain Sequence-to-Sequence model, we produced the output directly from the decoder:
# output = TimeDistributed(Dense(output_dict_size, activation="softmax"))(decoder)

In the plain Sequence-to-Sequence model, O[j] was computed from D[j] by a Dense layer with a Softmax activation (output = TimeDistributed(Dense(output_dict_size, activation="softmax"))(decoder)).

Context and Attention

In addition to using the decoder's output, we also use “context” information. This “context” will be computed by “looking back” into the source side, i.e. the encoder.

Let C = context be that context information and let C[j] be the context vector for the j-th Katakana character. O[j] will be computed from D[j] concatenated with C[j].

Each context vector C[j] is the result of combining every encoder output E[i], each with a different weight or ratio A[j][i].

C[j] = sum( [A[j][i] * E[i] for i in range(0, INPUT_LENGTH)] )

It is this A[j][i] weight that is the model’s “attention” value. It represents how much the i-th input affects the j-th output relative to the other inputs, i.e. when the network is writing the j-th output, how much attention it pays to the i-th input compared to the other inputs.

There are multiple designs for how this attention value, or attention score, can be computed. The paper proposes a few different scores (Section 3.1). In this example, we use the dot-based score:

A[j][i] = softmax( D[j] * E[i] )  # softmax by row, i.e. over i for each j
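
To make the two equations concrete, here is a small NumPy sketch of the same computation for a single example; the sizes and the random tensors are purely illustrative (in the actual model, E and D are Keras tensors produced by the encoder and decoder):

import numpy as np

INPUT_LENGTH, OUTPUT_LENGTH, HIDDEN = 18, 20, 64   # illustrative sizes

# Illustrative stand-ins for one example's encoder and decoder outputs
E = np.random.rand(INPUT_LENGTH, HIDDEN)    # E[i]: encoder output for the i-th input character
D = np.random.rand(OUTPUT_LENGTH, HIDDEN)   # D[j]: decoder output for the j-th output character

scores = D @ E.T                                                 # scores[j][i] = D[j] . E[i]
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax over i, row by row
C = A @ E                                                        # C[j] = sum_i A[j][i] * E[i]

print(A.shape, C.shape)   # (OUTPUT_LENGTH, INPUT_LENGTH), (OUTPUT_LENGTH, HIDDEN)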

Putting them together in Keras

from keras.layers import Activation, Dense, TimeDistributed, concatenate, dot

encoder = ...
decoder = ...

# Equation (7) with the 'dot' score from Section 3.1 in the paper.
# Note that we reuse Keras's Softmax activation layer instead of writing the tensor calculation ourselves.
attention = dot([decoder, encoder], axes=[2, 2])
attention = Activation('softmax', name='attention')(attention)

context = dot([attention, encoder], axes=[2, 1])
decoder_combined_context = concatenate([context, decoder])

# Another weight + tanh layer, as described in equation (5) of the paper
output = TimeDistributed(Dense(64, activation="tanh"))(decoder_combined_context)  # equation (5) of the paper
output = TimeDistributed(Dense(output_dict_size, activation="softmax"))(output)   # equation (6) of the paper
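
As a quick, optional sanity check, we can wrap the graph up to the attention tensor in a temporary Model and confirm its shape is (batch, OUTPUT_LENGTH, INPUT_LENGTH), i.e. one weight per output-input character pair. This is only a sketch for inspection, assuming encoder_input and decoder_input are defined as above:

from keras.models import Model

# Temporary model just for inspecting the attention tensor's shape
attention_check = Model(inputs=[encoder_input, decoder_input], outputs=[attention])
print(attention_check.output_shape)  # expected: (None, OUTPUT_LENGTH, INPUT_LENGTH)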

Training the model

To train the model in Keras, we create a Model object to wrap the defined layers. The model's inputs are the encoder_input and decoder_input layers, and its output is the output layer.

Then, we can train the model on the transformed English-Katakana pairs. The data transformation is similar to the one in the previous article.

from keras.models import Model
import numpy as np

model = Model(inputs=[encoder_input, decoder_input], outputs=[output])
model.compile(optimizer='adam', loss='binary_crossentropy')

...
training_encoder_input = encoded_english_words
training_decoder_output = encoded_japanese_words

# We are predicting the next character.
# Thus, the decoder's input is the expected output shifted right by one position,
# with the START character in the first position.
training_decoder_input = np.zeros_like(training_decoder_output)
training_decoder_input[:, 1:] = training_decoder_output[:, :-1]
training_decoder_input[:, 0] = encoding.CHAR_CODE_START

# One-hot encode the expected output
training_decoder_output = np.eye(output_dict_size)[training_decoder_output.astype('int')]

...
model.fit(x=[training_encoder_input, training_decoder_input], y=[training_decoder_output])
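
For reference, a fuller call to model.fit might look like the sketch below; the validation split, batch size, and number of epochs are illustrative choices, not values from the original training setup:

model.fit(
    x=[training_encoder_input, training_decoder_input],
    y=[training_decoder_output],
    validation_split=0.1,   # illustrative: hold out 10% of the pairs for validation
    batch_size=64,          # illustrative hyperparameter
    epochs=30)              # illustrative hyperparameter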

After training the model, we use the “greedy” decoding technique (also described in the previous article), which feeds the output back into the model one character at a time, to generate the unknown output.

def generate(text):
    encoder_input = encoding.transform(input_encoding, [text.lower()], INPUT_LENGTH)
    decoder_input = np.zeros(shape=(len(encoder_input), OUTPUT_LENGTH))
    decoder_input[:,0] = encoding.CHAR_CODE_START
    for i in range(1, OUTPUT_LENGTH):
        output = model.predict([encoder_input, decoder_input]).argmax(axis=2)
        decoder_input[:,i] = output[:,i]
    return decoder_input[:,1:]

def decode(decoding, sequence):
    text = ''
    for i in sequence:
        if i == 0:
            break
        text += decoding[i]
    return text

def to_katakana(text):
    decoder_output = generate(text)
    return decode(output_decoding, decoder_output[0])

...
to_katakana('Banana')           # バナナ
to_katakana('Peter Parker')     # ピーター・パーカー
to_katakana('Jon Snow')         # ジョン・スノー

Visualize Attention

As described previously, the model's attention, A[j][i], represents the relationship between the i-th input and the j-th output. If we inspect the attention values, we should see the alignment between English and Katakana characters, e.g. the Katakana “ピ” should correspond to the English “Pe..”.

We can do that by extracting the attention layer from the model (the 7th layer in this case) and creating a new model that takes exactly the same inputs but also returns the attention layer's output in addition to the original outputs.

attention_layer = model.layers[7]  # or model.get_layer('attention'), since we gave the Activation layer that name
attention_model = Model(inputs=model.inputs, outputs=model.outputs + [attention_layer.output])

We also need to update the generate function to keep track of the attention values as well as the output. Finally, we use matplotlib and seaborn’s heatmap function to display the result.

%matplotlib inline  
import matplotlib
import matplotlib.pyplot as plt
import seaborn

seaborn.set(font=['Osaka'], font_scale=3)

def attent_and_generate(text):
    encoder_input = encoding.transform(input_encoding, [text.lower()], INPUT_LENGTH)
    decoder_input = np.zeros(shape=(len(encoder_input), OUTPUT_LENGTH))
    decoder_input[:,0] = encoding.CHAR_CODE_START

    for i in range(1, OUTPUT_LENGTH):
        output, attention = attention_model.predict([encoder_input, decoder_input])
        decoder_input[:,i] = output.argmax(axis=2)[:,i]
        attention_density = attention[0]
        decoded_output = decode(output_decoding, decoder_input[0][1:])

    return attention_density, decoded_output


def visualize(text):
    attention_density, katakana = attent_and_generate(text)

    plt.clf()
    plt.figure(figsize=(28,12))

    ax = seaborn.heatmap(attention_density[:len(katakana) + 2, : len(text) + 2],
        xticklabels=[w for w in text],
        yticklabels=[w for w in katakana])

    ax.invert_yaxis()
    plt.show()

The Results

[Attention heatmap for “Jon Snow”]

As expected, the model can learn to align the generated output with its corresponding input.


It is also interesting to see how the model “thinks”. For example, when translating “Jon Snow”:

  • It decided to write the first character “ジ” after seeing “J” followed by “O”, then it wrote “ョ” immediately after without checking the next character (“JO-” should indeed be written as “ジョ-”).

  • It decided to write “ン” to end the word after seeing that the character after “N” was a whitespace. That was also the correct decision, because if the word were not “Jon” but “Jonny”, the N-sound should be “ニ”, not “ン”. In the next word, it also decided to write “ノ” for the N-sound only after it saw “NO”.

Conclusion

I hope this article has shown how the attention mechanism helps improve the Sequence-to-Sequence model and how to implement it in Keras. I also hope the English-to-Katakana example is easy to understand and follow. Please feel free to reach out via a comment or an issue on Github.