Continuing from the previous post, let’s now dive into the second half of the transformer – the decoder. If you recall, the decoder takes the contextualized embeddings that the encoder produced from the input sequence, and generates the output sequence one token at a time. Let’s break down how this process unfolds, step by step.

Shifted Output Embeddings

The decoder starts with the output tokens generated so far. These are embedded into vectors, just like in the encoder. However, there’s a twist: the output sequence is shifted one position to the right. This means that, at each position, the decoder only “sees” the tokens that have already been generated, never the future ones. So at the very first step, the decoder’s input is just a special <start> token embedding.
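Here is a minimal sketch of what “shift right” looks like during training: the decoder input is the target sequence with a <start> token prepended and the last token dropped, so the model at position t only ever sees tokens before t. The token IDs and the START_ID value are made up for illustration.

```python
import torch

START_ID = 1                                     # hypothetical <start> token id
target = torch.tensor([[5, 27, 9, 4]])           # the sequence we want the model to produce

# Prepend <start> and drop the last token: the input at position t is target[t-1]
decoder_input = torch.cat(
    [torch.full((target.size(0), 1), START_ID), target[:, :-1]], dim=1
)
print(decoder_input)                             # tensor([[ 1,  5, 27,  9]])
```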

Positional Encoding

Just as in the encoder, these embeddings are combined with positional encoding. This ensures that the decoder is aware of the position of each token in the output sequence, which is crucial for generating coherent and grammatically correct sentences.
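As a refresher, here is a sketch of the sinusoidal positional encoding from the original paper, which gets added to the (shifted) output embeddings. The d_model and max_len values are assumptions, and d_model is taken to be even.

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    pos = torch.arange(max_len).unsqueeze(1)                          # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                                # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)                                # odd dimensions
    return pe

# embeddings: (batch, seq_len, d_model)
# embeddings = embeddings + positional_encoding(seq_len, d_model)
```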

Masked Multi-Head Attention

The first major block in the decoder is masked multi-head attention. Here, each output token can attend to all previous tokens in the sequence, but not to any future tokens. This is enforced by a “mask” that blocks attention to subsequent positions. The result: when generating the next word, the model can only use information from what it has already generated, never “peeking” ahead. This is essential for tasks like text generation, where each word must be predicted step by step.
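A small sketch of how the mask works, under assumed shapes and with a single head for clarity: positions above the diagonal of the attention score matrix are set to negative infinity before the softmax, so each token can only attend to itself and earlier tokens.

```python
import torch
import torch.nn.functional as F

seq_len, d_k = 4, 8
q = k = v = torch.randn(seq_len, d_k)                     # self-attention: Q, K, V from the same sequence

scores = q @ k.transpose(-2, -1) / d_k ** 0.5             # (seq_len, seq_len) attention scores
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))   # block attention to future positions
weights = F.softmax(scores, dim=-1)                       # each row sums to 1 over past + current tokens
output = weights @ v
```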

Multi-Head Cross Attention

Also called multi-head encoder-decoder attention, this is where the decoder looks back at the encoder’s output. Each token in the decoder can attend to any position in the encoder’s final output, allowing it to align generated words with relevant parts of the input sequence. This mechanism is what enables the transformer to generate translations that are faithful to the source sentence, or summaries that actually reflect the input.
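The key difference from self-attention is where Q, K, and V come from: queries come from the decoder states, while keys and values come from the encoder’s final output. The sketch below shows a single head with made-up dimensions; no causal mask is needed here, since the decoder is allowed to look at every position of the input.

```python
import torch
import torch.nn.functional as F

d_model, src_len, tgt_len = 8, 6, 4
decoder_states = torch.randn(tgt_len, d_model)      # current decoder representations
encoder_output = torch.randn(src_len, d_model)      # contextualized embeddings from the encoder

q = decoder_states                                  # queries from the decoder
k = v = encoder_output                              # keys and values from the encoder

scores = q @ k.transpose(-2, -1) / d_model ** 0.5   # (tgt_len, src_len): how much each output token
weights = F.softmax(scores, dim=-1)                 # attends to each input token
context = weights @ v                               # (tgt_len, d_model)
```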

Feedforward Layer

Next comes another feedforward network, or FFN. It plays the same role as in the encoder, letting the model learn more complex and abstract relationships and concepts.
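Structurally, the FFN is just two linear layers with a non-linearity in between, applied independently at every position. The sizes below (512 and 2048) and the ReLU activation follow the original paper; other models swap in different dimensions and activations.

```python
import torch.nn as nn

ffn = nn.Sequential(
    nn.Linear(512, 2048),   # expand to the larger inner dimension
    nn.ReLU(),
    nn.Linear(2048, 512),   # project back down to d_model
)
# x: (batch, seq_len, 512) -> ffn(x): (batch, seq_len, 512)
```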

Linear and Softmax

After the Nx repeating blocks of the decoder, the final embeddings are trained to encode everything the model needs to predict the next token.

The linear layer simply projects the decoder output into a vector the size of the model’s vocabulary, giving one score per possible token.

The softmax function (which we’ll explain in a later post) converts that vocabulary-sized vector of scores into a same-sized vector of probabilities, one for each possible next output token.
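Putting those last two steps together, here is a sketch of the final projection and softmax. The hidden size and vocabulary size are illustrative, and the greedy argmax at the end is just one way to pick the next token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 32000
to_vocab = nn.Linear(d_model, vocab_size)     # the "linear" layer

last_hidden = torch.randn(d_model)            # decoder output at the current position
logits = to_vocab(last_hidden)                # (vocab_size,) one score per token
probs = F.softmax(logits, dim=-1)             # (vocab_size,) probabilities summing to 1
next_token = torch.argmax(probs)              # e.g. greedy decoding picks the most likely token
```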
