Continuing from the previous post, let’s dive into the first section of the transformer – **the encoder**. As we discussed, the encoder embeds the input tokens, uses positional encoding and attention to imbue the token embeddings with relevant meaning, and passes the modified embeddings to the decoder. We’ve already covered how token embeddings work, so let’s jump to…
Positional Encoding
Since each token is converted to its embedding vector independently, the embeddings carry no information about surrounding tokens or the order in which they occur. Word order is, of course, essential to how language is structured, so we need a way to add this information to the embeddings!
Positional encoding does this by adding a vector to each token embedding. This vector is derived from sinusoidal functions and depends only on the position of the token in the sequence. The positional meaning of this added vector is “baked” into the latent space of the encoder – the internal abstract space of representations that the model learns.
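To make this concrete, here’s a minimal NumPy sketch of the classic sinusoidal encoding from the original transformer paper – even positions of the vector use sine, odd positions use cosine, each at a different frequency. The function name and dimensions are just illustrative choices:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, np.newaxis]              # (seq_len, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)  # even indices
    pe[:, 1::2] = np.cos(positions / div_terms)  # odd indices
    return pe

# The encoding is simply added element-wise to the token embeddings:
embeddings = np.random.randn(10, 64)   # 10 tokens, 64-dim embeddings
encoded = embeddings + positional_encoding(10, 64)
```

Because each position gets a unique pattern of sinusoid values, the model can learn to read position straight out of the summed vector.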
Multi-Head Attention
As we discussed in the attention post, this section lets each embedding attend to every other embedding, allowing the token embeddings to incorporate relevant meaning from other tokens.
The **multi-head** prefix refers to the fact that attention here is carried out by multiple units, or “heads”, each with its own learned weights. In practice, each head ends up specializing slightly: one head might attend to nearby words, while another might track subject-object relations. Taken together, the heads produce a comprehensive picture of how the tokens relate.
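Here’s a minimal NumPy sketch of multi-head self-attention. The weight matrices `w_q`, `w_k`, `w_v`, and `w_o` are stand-ins for learned parameters – in a real model they come from training, so the random values here are purely illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention, split across num_heads heads."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project inputs to queries, keys, values, then split into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    q = (x @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Each head computes its own attention pattern over all token pairs
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    # Weighted sum of values, then concatenate heads and project out
    out = (scores @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ w_o

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 64))                       # 10 tokens, d_model=64
w_q, w_k, w_v, w_o = (rng.standard_normal((64, 64)) * 0.1 for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=8)
```

Note that the heads don’t cost extra computation overall: the model dimension is simply split into `num_heads` smaller chunks, and each head attends within its own chunk.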
Feedforward Layer
The feedforward layer is a small neural network applied independently to each embedding after the attention step. While the attention layer lets each embedding gather information from surrounding ones, the feedforward layer lets the model “think more deeply” about each embedding on its own. Because it includes a non-linearity, stacking these layers lets the transformer learn complex, non-linear relations during training.
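A minimal sketch of the position-wise feedforward layer: it expands each embedding to a larger hidden dimension, applies a ReLU, and projects back down. The dimensions and the random weights are illustrative placeholders for learned parameters:

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: expand, apply ReLU non-linearity, project back."""
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 64, 256                      # hidden layer is wider than d_model
x = rng.standard_normal((10, d_model))       # 10 token embeddings
w1 = rng.standard_normal((d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
w2 = rng.standard_normal((d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)
out = feed_forward(x, w1, b1, w2, b2)        # same shape as the input
```

Because the same weights are applied to every position, the layer treats each embedding identically – all the cross-token mixing happens in attention.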
Add & Norm
Standing for “add and normalize”, this unit adds the input of the previous unit to its output (a residual connection) and then normalizes the result. Normalization stabilizes training, and the residual connection improves gradient flow, which makes deep stacks of layers trainable. Combined, this smoothly enables the next step…
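A minimal sketch of the add & norm step – a residual sum followed by layer normalization over each embedding. (Real implementations also learn a scale and shift per dimension; those are omitted here for brevity.)

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual connection followed by layer normalization."""
    y = x + sublayer_out                       # "add": residual connection
    mean = y.mean(axis=-1, keepdims=True)      # per-token statistics
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)            # "norm": zero mean, unit variance

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 64))
sublayer_out = rng.standard_normal((10, 64))   # e.g. output of attention
normed = add_and_norm(x, sublayer_out)
```

After this step, every token embedding has roughly zero mean and unit variance, keeping values in a well-behaved range for the next layer.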
Nx
You’ll notice a little “Nx” next to a box encapsulating the attention and feedforward layers. This simply means that this mini-sequence is repeated some N times within the architecture, similar to how a neural network has multiple layers. This enables the transformer to learn more abstract and complex concepts.
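In code, the Nx repetition is just a loop over N identical layers, each feeding its output to the next. The stand-in layers below are placeholders for full encoder layers (attention, feedforward, add & norm):

```python
def encoder(x, layers):
    """Apply a stack of N encoder layers in sequence (the "Nx" in the diagram)."""
    for layer in layers:
        x = layer(x)  # each layer's output becomes the next layer's input
    return x

# Stand-in layers for illustration; real layers would be full encoder blocks.
layers = [lambda x: x * 2 for _ in range(3)]  # N = 3
result = encoder(1.0, layers)
```

Each pass through the stack lets the model build representations on top of the previous layer’s output, which is how the more abstract concepts emerge.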