When we previously discussed embeddings, we talked about imbuing an embedding with relevant semantic meaning from the surrounding context. For example, in the sentence “I spoke to Mark but he …”, an LLM needs to work out that “he” refers to Mark so that the embedding for “he” can carry that information. The mechanism that makes this possible is called attention. Let’s take a high-level overview of what it’s about.
How attention works (at a high level)
At the core of each transformer block within a transformer (the revolutionary architecture that made modern LLMs possible) lies the attention layer. An attention head is a component that carries out the attention mechanism, and several of them run in parallel within an attention layer. Essentially, an attention head assigns relevancy scores to the preceding token embeddings in the context window with respect to the current token. In lingo, we say that the current token is attending to the other tokens.
These relevancy scores are then used to blend information from the attended tokens back into the current token embedding, enriching it with relevant semantic meaning from the surrounding text. This flavor of attention is called self-attention and lies at the heart of GPT models. I’ll dive deeper into how attention works in a later post.
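To make the two steps above a little more concrete, here is a minimal sketch of a single attention head in NumPy. Everything in it is illustrative: the toy sentence, the tiny embedding size, and the random matrices standing in for the learned query/key/value projections are my assumptions, not the internals of any particular GPT model.

```python
# Minimal sketch of single-head causal self-attention (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

tokens = ["I", "spoke", "to", "Mark", "but", "he"]
d_model = 8                                    # toy embedding size
x = rng.normal(size=(len(tokens), d_model))    # stand-in token embeddings

# Learned projections (random here) map embeddings to queries, keys, values.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Step 1: relevancy scores — how strongly each token attends to each other token.
scores = Q @ K.T / np.sqrt(d_model)

# Causal mask: a token may only attend to itself and the tokens before it.
future = np.triu(np.ones_like(scores, dtype=bool), k=1)
scores[future] = -np.inf

# Softmax turns each row of scores into weights that sum to 1.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Step 2: each output embedding is a weighted mix of the value vectors,
# enriching e.g. "he" with information drawn from "Mark".
enriched = weights @ V

print(np.round(weights[-1], 2))  # how much "he" attends to each earlier token
```

With random weights the printed numbers are meaningless, of course; the point is only the shape of the computation: score, mask, normalize, then mix. In a real model, training shapes the projections so that the row for “he” puts most of its weight on “Mark”.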