Introduced in the seminal paper “Attention Is All You Need,” the transformer revolutionized the world of natural language processing (NLP) and supercharged the progress of LLMs today. Let’s take a look at how it works.
Let’s consider transformers used for causal language modelling. Causal refers to the property of depending only on prior and current inputs, never on future ones. Causal language modelling, then, is the task of predicting the text that comes next after a given input.
For example, given the input “I am 30 years-”, we would like to predict the next word, which is likely to be “old.” Since we’re talking about LLMs, the goal is really to predict the next token rather than the next word.
This is how the transformer-based LLM works: Given a prompt (input), we try to predict the most suitable token to come next. Then we append it to the initial input and keep repeating the process.
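To make that loop concrete, here is a minimal sketch in Python. The `model` and `tokenizer` objects are hypothetical stand-ins for a real LLM and its tokenizer; the point is only the append-and-repeat structure.

```python
def generate(model, tokenizer, prompt, max_new_tokens=20):
    """Sketch of autoregressive generation: predict, append, repeat."""
    tokens = tokenizer.encode(prompt)            # prompt -> list of token ids
    for _ in range(max_new_tokens):
        probs = model(tokens)                    # probabilities for the next token
        next_token = max(range(len(probs)), key=lambda i: probs[i])  # greedy pick
        tokens.append(next_token)                # append and feed back in
    return tokenizer.decode(tokens)
```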
Encoders and Decoders
The transformer, as described in “Attention Is All You Need,” uses an encoder-decoder architecture.
The encoder takes the input passed to the model; for an LLM, that would be the prompt. It transforms this input into context-rich embeddings. These aren’t just word embeddings: they also include positional encoding, which indicates where each token sits within the input sequence, and through self-attention each token attends to every other token in the input.
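The “each token attends to every other token” step is scaled dot-product attention from the paper. A minimal NumPy sketch is below; the projection matrices `W_q`, `W_k`, `W_v` would be learned parameters in a real model, and the input `X` is assumed to already have positional encoding added.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of embeddings.

    X: (seq_len, d_model) token embeddings with positional encoding added.
    W_q, W_k, W_v: (d_model, d_k) projection matrices (learned in practice).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights @ V                                  # context-rich outputs
```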
The decoder generates the output tokens. At each step, it takes the tokens generated so far as input and uses masked self-attention to ensure that each token can only attend to the tokens that come before it; this matters for training, and we’ll discuss it later. The decoder also uses cross-attention over the encoder’s output, which is how it refers back to the input.
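Masking is commonly implemented by adding negative infinity to the attention scores for future positions before the softmax, so their weights become zero. A small sketch, continuing the NumPy example above:

```python
import numpy as np

def causal_mask(seq_len):
    """Mask that blocks attention to future positions (upper triangle)."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Added to the attention scores before the softmax, the -inf entries
# drive the weights for future tokens to zero, so token i only attends
# to tokens at positions <= i, e.g.:
#   scores = Q @ K.T / np.sqrt(d_k) + causal_mask(seq_len)
```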
Finally, the decoder produces a probability distribution over the vocabulary: every possible next token along with the model’s confidence in it. From this distribution, the next token is selected, and the decoder step repeats.
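Two common ways to make that selection are taking the most likely token (greedy decoding) or sampling according to the probabilities. A small illustrative sketch, assuming `probs` is the distribution over the vocabulary:

```python
import numpy as np

def select_next_token(probs, greedy=True):
    """Pick the next token id from the decoder's probability distribution."""
    if greedy:
        return int(np.argmax(probs))                    # most likely token
    return int(np.random.choice(len(probs), p=probs))   # sample by probability
```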